From: Manni Wood <[email protected]>
To: Andrew Dunstan <[email protected]>
Cc: Nathan Bossart <[email protected]>
Cc: Nazir Bilal Yavuz <[email protected]>
Cc: KAZAR Ayoub <[email protected]>
Cc: Shinya Kato <[email protected]>
Cc: [email protected]
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
Date: Tue, 11 Nov 2025 16:23:20 -0600
Message-ID: <CAKWEB6qdyhN3EoUNAK23etXX-kXH-_79NNbTsKqtF1g1WkuaBQ@mail.gmail.com> (raw)
In-Reply-To: <[email protected]>
References: <CAN55FZ1J+6eM=F5GreWEBMJcNV_gifYyYY1b6xpYzun=nWPhMQ@mail.gmail.com>
<CAN55FZ109W90Ux_EBEqkkU2TyNqBNhdhN_1XPRGo3iiZ2L9b=A@mail.gmail.com>
<[email protected]>
<CAN55FZ1KF7XNpm2XyG=M-sFUODai=6Z8a11xE3s4YRBeBKY3tA@mail.gmail.com>
<[email protected]>
<aPZrg6lxb5bgy_px@nathan>
<[email protected]>
<CAN55FZ2GonAeSJHn-c2nJgUO-v6sDMOQzn97evVdZbcHeu3ihw@mail.gmail.com>
<aPfTiX0HwV42R6Od@nathan>
<CAN55FZ0AYP4ZEczBJ5ur-=9QuEhMysH9Yfrq5srr0ZakK1M0FA@mail.gmail.com>
<aPkvi5P7kpA8oQKc@nathan>
<[email protected]>
On Wed, Oct 29, 2025 at 5:23 PM Andrew Dunstan <[email protected]> wrote:
>
> On 2025-10-22 We 3:24 PM, Nathan Bossart wrote:
> > On Wed, Oct 22, 2025 at 03:33:37PM +0300, Nazir Bilal Yavuz wrote:
> >> On Tue, 21 Oct 2025 at 21:40, Nathan Bossart <[email protected]> wrote:
> >>> I wonder if we could mitigate the regression further by spacing out the
> >>> checks a bit more. It could be worth comparing a variety of values to
> >>> identify what works best with the test data.
> >> Do you mean that instead of doubling the SIMD sleep, we should
> >> multiply it by 3 (or another factor)? Or are you referring to
> >> increasing the maximum sleep from 1024? Or possibly both?
> > I'm not sure of the precise details, but the main thrust of my suggestion
> > is to assume that whatever sampling you do to determine whether to use SIMD
> > is good for a larger chunk of data. That is, if you are sampling 1K lines
> > and then using the result to choose whether to use SIMD for the next 100K
> > lines, we could instead bump the latter number to 1M lines (or something).
> > That way we minimize the regression for relatively uniform data sets while
> > retaining some ability to adapt in case things change halfway through a
> > large table.
> >
>
>
> I'd be ok with numbers like this, although I suspect the numbers of
> cases where we see shape shifts like this in the middle of a data set
> would be vanishingly small.
>
>
> cheers
>
>
> andrew
>
>
> --
> Andrew Dunstan
> EDB: https://www.enterprisedb.com
>
>
>
>
Hello!
I wanted to reproduce the results using the files attached by Shinya Kato
and Ayoub Kazar. I installed one postgres compiled from master, and a
second postgres built from master with Nazir Bilal Yavuz's v3 patches
applied.
The master+v3patches postgres naturally performed better when copying into
the database: anywhere from 11% better for the t.csv file produced by
Shinya's test.sql, to 35% better for the t_4096_none.csv file
created by Ayoub Kazar's simd-copy-from-bench.sql.
But here's where it gets weird. The two files created by Ayoub Kazar's
simd-copy-from-bench.sql that are supposed to be slower with the patches,
t_4096_escape.txt and t_4096_quote.csv, actually ran faster on my machine,
by 11% and 5% respectively.
This seems impossible.
A few things I should note:
I timed the commands using the Unix time command, like so:
time psql -X -U mwood -h localhost -d postgres -c '\copy t from /tmp/t_4096_escape.txt'
For each file, I timed the copy 6 times and took the average.
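The procedure above (run the same command several times, then average the wall-clock times) could be scripted roughly as follows. This is only a sketch: the helper names time_command and average_seconds are made up for illustration, and it measures elapsed time around the whole psql process, much like the "real" figure reported by time.

```python
import statistics
import subprocess
import time


def time_command(argv, runs=6):
    """Run argv `runs` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the command exits non-zero,
        # so a failed copy is never silently counted as a timing.
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


def average_seconds(argv, runs=6):
    """Mean wall-clock time over `runs` executions of argv."""
    return statistics.mean(time_command(argv, runs))


# The psql invocation from this message, split into an argument list.
# It is defined here but not executed; pass it to average_seconds() to run it.
COPY_CMD = [
    "psql", "-X", "-U", "mwood", "-h", "localhost", "-d", "postgres",
    "-c", r"\copy t from /tmp/t_4096_escape.txt",
]
```

Calling average_seconds(COPY_CMD) would repeat the copy six times against whichever postgres is listening on localhost and return the mean in seconds.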
This was done on my work Linux machine while also running Chrome and an
Open Office spreadsheet; not a dedicated machine only running postgres.
All of the copies took between 2 seconds (Ayoub Kazar's t_4096_none.csv
copied into postgres compiled from master plus Nazir's v3 patches) and
4.5 seconds (Shinya's t.csv copied into postgres compiled from master).
Perhaps I need to fiddle with the provided SQL to produce larger files to
get longer run times? Maybe sub-second differences won't tell as
interesting a story as minutes-long copy commands?
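One way to judge whether differences this small mean anything is to look at the run-to-run spread alongside the averages. The sketch below is a crude check and not a proper statistical test. It assumes "N% better" means the relative reduction in mean run time, which is only one possible definition, and the helper names are made up for illustration.

```python
import statistics


def percent_faster(baseline_mean, patched_mean):
    """Relative reduction in run time; positive means the patched build is faster."""
    return (baseline_mean - patched_mean) / baseline_mean * 100.0


def difference_exceeds_noise(baseline, patched):
    """Crude check: is the gap between the means larger than the combined spread?

    `baseline` and `patched` are lists of per-run timings in seconds
    (at least two runs each, since the sample standard deviation needs that).
    """
    gap = abs(statistics.mean(baseline) - statistics.mean(patched))
    noise = statistics.stdev(baseline) + statistics.stdev(patched)
    return gap > noise
```

If difference_exceeds_noise returns False for the six timings from each build, then larger input files (and longer runs) are probably needed before the percentages say much.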
Thanks for reading this.
--
-- Manni Wood EDB: https://www.enterprisedb.com