Re: Speed up COPY FROM text/CSV parsing using SIMD

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Nazir Bilal Yavuz <[email protected]>
To: Nathan Bossart <[email protected]>
Cc: Manni Wood <[email protected]>
Cc: KAZAR Ayoub <[email protected]>
Cc: Neil Conway <[email protected]>
Cc: Andrew Dunstan <[email protected]>
Cc: Shinya Kato <[email protected]>
Cc: PostgreSQL-development <[email protected]>
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
Date: Wed, 11 Mar 2026 21:49:22 +0300
Message-ID: <CAN55FZ3gdK8dGrEo0M6KFW97OaF8TUbjO_dFoxQKi63davE-jA@mail.gmail.com> (raw)
In-Reply-To: <abGv0ScUWVa6eogw@nathan>
References: <CAKWEB6qzsZEQ4Czo9QBFiMXqdXVJknHUJwg6wjRwNzLn4+Jw0g@mail.gmail.com>
	<CAN55FZ2O2Ls==sdpROHqxWRx-PMBZ0riJ6eVKoHj8=vssTavxw@mail.gmail.com>
	<aZ3kYQnF9_u6sUQp@nathan>
	<CAN55FZ3+NYF1TkKyNtpRQuLiaauSYk9G5tA+fpruOA4-14Y_ZA@mail.gmail.com>
	<aaXrGSyq4u2d9qEC@nathan>
	<CAN55FZ2DNaKCK3Kf_kHizb2pAbQvULeDYtzaiz97B_xz7YbrkQ@mail.gmail.com>
	<aa8QlTVEDhG1JU0Z@nathan>
	<CAN55FZ08kqmA+B9pzPDy-QstxAd=cK-RqjbR3cWBjPF_8-FXAw@mail.gmail.com>
	<abBQbdFa6OsG8TGu@nathan>
	<CAN55FZ3jXs7XDsP_-v_jUBquRu4uAdheN3xcmW=WhAyKwFLSjg@mail.gmail.com>
	<abGv0ScUWVa6eogw@nathan>

Hi,

On Wed, 11 Mar 2026 at 21:09, Nathan Bossart <[email protected]> wrote:
>
> On Wed, Mar 11, 2026 at 02:36:46PM +0300, Nazir Bilal Yavuz wrote:
> > 0002 has an attempt to remove some branches from SIMD code but since
> > it is kind of functional change, I wanted to attach that as another
> > patch. I think we can apply some parts of this, if not all.
>
> Could you describe what this is doing and what the performance impact is?

SIMD code check these characters:

csv mode: nl, cr, quote and possibly escape.

text mode: nl, cr and bs.

v12 checks them like that:

            if (is_csv)
            {
                match = vector8_or(vector8_eq(chunk, nl),
vector8_eq(chunk, cr));
                match = vector8_or(match, vector8_eq(chunk, quote));
                if (unique_escapec)
                    match = vector8_or(match, vector8_eq(chunk, escape));
            }
            else
            {
                match = vector8_or(vector8_eq(chunk, nl),
vector8_eq(chunk, cr));
                match = vector8_or(match, vector8_eq(chunk, bs));
            }

But actually we know that we will definitely check nl, cr and one of
the quote or bs characters in the code. So, we can introduce a new
variable named bs_or_quote, it will be equal to bs if the mode is text
and it will be equal to quote if the mode is csv. Then, we can remove
the 'if (is_csv)' check and only check for escape ('if
(unique_escapec)'). Now code will look like that:

            match = vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, cr));
            match = vector8_or(match, vector8_eq(chunk, bs_or_quote));
            if (unique_escapec)
                match = vector8_or(match, vector8_eq(chunk, escape));

That is what v13-0002 does. I saw 1%-2% speedups with this change and
there was no regression.

Regardless of introducing the bs_or_quote variable, we can move 'match
= vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, cr));' outside
of the if checks, though.

--
Regards,
Nazir Bilal Yavuz
Microsoft

view thread (114+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
  Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
  In-Reply-To: <CAN55FZ3gdK8dGrEo0M6KFW97OaF8TUbjO_dFoxQKi63davE-jA@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox