public inbox for [email protected]
help / color / mirror / Atom feedFrom: Nathan Bossart <[email protected]>
To: Nazir Bilal Yavuz <[email protected]>
Cc: Andrew Dunstan <[email protected]>
Cc: Shinya Kato <[email protected]>
Cc: Manni Wood <[email protected]>
Cc: KAZAR Ayoub <[email protected]>
Cc: PostgreSQL-development <[email protected]>
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
Date: Wed, 19 Nov 2025 15:01:03 -0600
Message-ID: <aR4wDwNdLc5TmcQq@nathan> (raw)
In-Reply-To: <CAN55FZ1XF=R7F7B__gq04rp2nQnJqs1yfExEXo4riWc68+Pe0w@mail.gmail.com>
References: <aPkvi5P7kpA8oQKc@nathan>
<[email protected]>
<CAKWEB6qdyhN3EoUNAK23etXX-kXH-_79NNbTsKqtF1g1WkuaBQ@mail.gmail.com>
<CA+K2RumMC+avYGSX-AWNeod3w+XOGHrVPz8HiqkvJj7AZ5tZXA@mail.gmail.com>
<CAKWEB6pev=pNVi4qDYWS50N=YFrKRbjH1h=5F1bXpnK7WR5CYg@mail.gmail.com>
<aRue0D4QQkUf2B_N@nathan>
<CAOzEurTHCGL-Txqf5rxMsPgTF=dTCOsr=uhJdXebqjEJy-0L7g@mail.gmail.com>
<CAN55FZ0+JZvKYVCnJqLhHaWF9eBGmTaF1BCEpttxw1aT3G_+Qw@mail.gmail.com>
<[email protected]>
<CAN55FZ1XF=R7F7B__gq04rp2nQnJqs1yfExEXo4riWc68+Pe0w@mail.gmail.com>
On Tue, Nov 18, 2025 at 05:20:05PM +0300, Nazir Bilal Yavuz wrote:
> Thanks, done.
I took a look at the v3 patches. Here are my high-level thoughts:
+ /*
+ * Parse data and transfer into line_buf. To get benefit from inlining,
+ * call CopyReadLineText() with the constant boolean variables.
+ */
+ if (cstate->simd_continue)
+ result = CopyReadLineText(cstate, is_csv, true);
+ else
+ result = CopyReadLineText(cstate, is_csv, false);
I'm curious whether this actually generates different code, and if it does,
if it's actually faster. We're already branching on cstate->simd_continue
here.
+ /* Load a chunk of data into a vector register */
+ vector8_load(&chunk, (const uint8 *) ©_input_buf[input_buf_ptr]);
In other places, processing 2 or 4 vectors of data at a time has proven
faster. Have you tried that here?
+ /* \n and \r are not special inside quotes */
+ if (!in_quote)
+ match = vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, cr));
+
+ if (is_csv)
+ {
+ match = vector8_or(match, vector8_eq(chunk, quote));
+ if (escapec != '\0')
+ match = vector8_or(match, vector8_eq(chunk, escape));
+ }
+ else
+ match = vector8_or(match, vector8_eq(chunk, bs));
The amount of branching here catches my eye. Some branching might be
unavoidable, but in general we want to keep these SIMD paths as branch-free
as possible.
+ /*
+ * Found a special character. Advance up to that point and let
+ * the scalar code handle it.
+ */
+ int advance = pg_rightmost_one_pos32(mask);
+
+ input_buf_ptr += advance;
+ simd_total_advance += advance;
Do we actually need to advance here? Or could we just fall through to the
scalar path? My suspicion is that this extra code doesn't gain us much.
+ if (simd_last_sleep_cycle == 0)
+ simd_last_sleep_cycle = 1;
+ else if (simd_last_sleep_cycle >= SIMD_SLEEP_MAX / 2)
+ simd_last_sleep_cycle = SIMD_SLEEP_MAX;
+ else
+ simd_last_sleep_cycle <<= 1;
+ cstate->simd_current_sleep_cycle = simd_last_sleep_cycle;
+ cstate->simd_last_sleep_cycle = simd_last_sleep_cycle;
IMHO we should be looking for ways to simplify this should-we-use-SIMD
code. For example, perhaps we could just disable the SIMD path for 10K or
100K lines any time a special character is found. I'm dubious that a lot
of complexity is warranted.
--
nathan
view thread (99+ messages) latest in thread
reply
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Reply to all the recipients using the --to and --cc options:
reply via email
To: [email protected]
Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
In-Reply-To: <aR4wDwNdLc5TmcQq@nathan>
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox