Re: Speed up COPY FROM text/CSV parsing using SIMD


From: Manni Wood <[email protected]>
To: Nathan Bossart <[email protected]>
Cc: Nazir Bilal Yavuz <[email protected]>
Cc: Andrew Dunstan <[email protected]>
Cc: Shinya Kato <[email protected]>
Cc: KAZAR Ayoub <[email protected]>
Cc: PostgreSQL-development <[email protected]>
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
Date: Tue, 25 Nov 2025 18:09:42 -0600
Message-ID: <CAKWEB6qx9mEd8a-QqDe1xqqyuoR=NzUPwJvyc59sUbLc18RHUQ@mail.gmail.com> (raw)
In-Reply-To: <aSTVOe6BIe5f1l3i@nathan>
References: <CAKWEB6qdyhN3EoUNAK23etXX-kXH-_79NNbTsKqtF1g1WkuaBQ@mail.gmail.com>
	<CA+K2RumMC+avYGSX-AWNeod3w+XOGHrVPz8HiqkvJj7AZ5tZXA@mail.gmail.com>
	<CAKWEB6pev=pNVi4qDYWS50N=YFrKRbjH1h=5F1bXpnK7WR5CYg@mail.gmail.com>
	<aRue0D4QQkUf2B_N@nathan>
	<CAOzEurTHCGL-Txqf5rxMsPgTF=dTCOsr=uhJdXebqjEJy-0L7g@mail.gmail.com>
	<CAN55FZ0+JZvKYVCnJqLhHaWF9eBGmTaF1BCEpttxw1aT3G_+Qw@mail.gmail.com>
	<[email protected]>
	<CAN55FZ1XF=R7F7B__gq04rp2nQnJqs1yfExEXo4riWc68+Pe0w@mail.gmail.com>
	<aR4wDwNdLc5TmcQq@nathan>
	<CAN55FZ0e_L_O2O5W4E39vap1rz=OJjVqT7w--7gYeHpHK0a2aQ@mail.gmail.com>
	<aSTVOe6BIe5f1l3i@nathan>

Hello.

I tried Ayoub Kazar's test files again, using Nazir Bilal Yavuz's v3
patches, but with one difference since my last attempt: this time, I used 5
million lines per file. For each 5 million line file, I ran the import 5
times and averaged the results.

(I found that even 1 million lines could sometimes produce surprising
speedups in cases where the newer algorithm should be at least slightly
slower than the non-SIMD version.)

The text file with no special characters is 30% faster. The CSV file with
no special characters is 39% faster. The text file with roughly one-third
special characters is 0.5% slower. The CSV file with roughly one-third
special characters is 2.7% slower.
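
For anyone repeating these measurements, here is a minimal sketch of the
arithmetic behind "ran the import 5 times and averaged the results". It
assumes "N% faster" means the percent reduction in mean elapsed time
relative to the unpatched run; the message does not spell out the
definition, so treat that as an assumption. The names mean_ms() and
percent_faster() are hypothetical helpers, not part of any patch.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n timing samples, in milliseconds. */
static double
mean_ms(const double *runs, size_t n)
{
    double sum = 0.0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        sum += runs[i];
    return sum / (double) n;
}

/*
 * Percent reduction in elapsed time relative to the baseline.
 * Positive means the patched run was faster, negative means slower.
 * (Assumed definition; the original message does not state one.)
 */
static double
percent_faster(double baseline_ms, double patched_ms)
{
    assert(baseline_ms > 0.0);
    return 100.0 * (baseline_ms - patched_ms) / baseline_ms;
}
```

Under this definition a result such as "0.5% slower" would show up as a
small negative value.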

I also tried files that alternated lines with no special characters and
lines with one-third special characters, thinking I could force the
algorithm to continually re-check whether it should use SIMD, and so add
overhead in the try-SIMD/don't-try-SIMD housekeeping code. The text
file was still 50% faster. The CSV file was still 13% faster.
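
A rough sketch of how such an alternating file could be generated, one
line at a time. This is not the generator used for the test files; the
function names, the filler byte 'a', and the "every third byte" layout are
assumptions made for illustration. The output is not guaranteed to be
valid COPY input for any particular table or format.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fill buf with len payload bytes followed by '\n' and a terminating NUL.
 * If special is nonzero, every third payload byte is set to special;
 * all other payload bytes are 'a'. buf must hold at least len + 2 bytes.
 */
static void
fill_line(char *buf, size_t len, char special)
{
    assert(buf != NULL);
    for (size_t i = 0; i < len; i++)
        buf[i] = (special != 0 && i % 3 == 2) ? special : 'a';
    buf[len] = '\n';
    buf[len + 1] = '\0';
}

/*
 * Even-numbered lines are plain; odd-numbered lines carry the special
 * byte in roughly one-third of their positions.
 */
static void
fill_alternating_line(char *buf, size_t len, size_t lineno, char special)
{
    fill_line(buf, len, (lineno % 2 == 1) ? special : 0);
}
```

Calling fill_alternating_line() for lineno = 0, 1, 2, ... and writing each
buffer out produces the plain/special/plain/special pattern described
above.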



On Mon, Nov 24, 2025 at 3:59 PM Nathan Bossart <[email protected]>
wrote:

> On Thu, Nov 20, 2025 at 03:55:43PM +0300, Nazir Bilal Yavuz wrote:
> > On Thu, 20 Nov 2025 at 00:01, Nathan Bossart <[email protected]> wrote:
> >> +            /* Load a chunk of data into a vector register */
> >> +            vector8_load(&chunk, (const uint8 *) &copy_input_buf[input_buf_ptr]);
> >>
> >> In other places, processing 2 or 4 vectors of data at a time has proven
> >> faster.  Have you tried that here?
> >
> > Sorry, I could not find the related code piece. I only saw the
> > vector8_load() inside of hex_decode_safe() function and its comment
> > says:
> >
> > /*
> >  * We must process 2 vectors at a time since the output will be half the
> >  * length of the input.
> >  */
> >
> > But this does not mention any speedup from using 2 vectors at a time.
> > Could you please show the related code?
>
> See pg_lfind32().
>
> --
> nathan
>
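
To make the "2 or 4 vectors at a time" suggestion concrete, here is a
portable sketch of the loop shape: test four chunks per iteration, OR the
results together, and branch only once. It is not the patch's code and not
pg_lfind32() itself. CHUNK, chunk_has_byte(), and find_byte_4x() are
hypothetical names, and the scalar chunk_has_byte() merely stands in for a
real vector load and compare.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK 16                /* stand-in for the vector register width */

/* Nonzero if any of the CHUNK bytes at p equals c (scalar stand-in). */
static int
chunk_has_byte(const uint8_t *p, uint8_t c)
{
    int found = 0;

    for (int i = 0; i < CHUNK; i++)
        found |= (p[i] == c);
    return found;
}

/* Index of the first occurrence of c in buf[0..len), or len if absent. */
static size_t
find_byte_4x(const uint8_t *buf, size_t len, uint8_t c)
{
    size_t i = 0;

    assert(buf != NULL || len == 0);

    /* Main loop: four chunks per iteration, one combined branch. */
    while (len - i >= 4 * CHUNK)
    {
        int hit = chunk_has_byte(buf + i, c)
                | chunk_has_byte(buf + i + CHUNK, c)
                | chunk_has_byte(buf + i + 2 * CHUNK, c)
                | chunk_has_byte(buf + i + 3 * CHUNK, c);

        if (hit)
            break;              /* the scalar loop below pinpoints it */
        i += 4 * CHUNK;
    }

    /* Scalar tail; also locates a hit reported by the main loop. */
    for (; i < len; i++)
        if (buf[i] == c)
            return i;
    return len;
}
```

Whether this shape helps COPY FROM parsing is exactly the open question in
the quoted exchange; the sketch only shows the structure being asked about.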


-- 
-- Manni Wood EDB: https://www.enterprisedb.com
