From: Manni Wood <[email protected]>
To: Nathan Bossart <[email protected]>
Cc: Nazir Bilal Yavuz <[email protected]>
Cc: KAZAR Ayoub <[email protected]>
Cc: Neil Conway <[email protected]>
Cc: Andrew Dunstan <[email protected]>
Cc: Shinya Kato <[email protected]>
Cc: PostgreSQL-development <[email protected]>
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
Date: Fri, 13 Feb 2026 21:34:13 -0600
Message-ID: <CAKWEB6p-Y54yWA5kq6OXEYV=ABdHenJ559i0MshOoYkP4i=o5A@mail.gmail.com>
In-Reply-To: <aY-vJe_ENCB-fux9@nathan>
References: <CAOW5sYZEx=fPw2wp7y2nK_-ifXFeYW4CTmFx_OQeoHFjG7rbHw@mail.gmail.com>
	<CA+K2Ru=C_woAnd-3-pGHoNSTR8FOf=7eeSWE1xaLt9ojVWndVg@mail.gmail.com>
	<CAN55FZ0FRB2OD6-oEESLvgUT4bLZQVD72pAqUqzdw7Rx5cN0ig@mail.gmail.com>
	<CA+K2Run1VdLnmp-5_Qv2Fax0KgT7LLJMH-uzjaaf-NZD1oU-=w@mail.gmail.com>
	<aYZdKSTw6N3khsVE@nathan>
	<CAN55FZ2DOeLjSXE2Jos99bgHG-Zeo3KjStrSgoA8Rf=2Mu+hFA@mail.gmail.com>
	<aYZvdsXPElQvwWOA@nathan>
	<CAN55FZ1=O6TjeZM2CUT7T2tu66uJT+w3G9FiRXVs+gt_ousFxQ@mail.gmail.com>
	<aY0FL4rXUl6ykn-a@nathan>
	<CAN55FZ3g6QaiC8G4GMjdJ24egvgc-HG_xpoOztxnM_wnQNn5aw@mail.gmail.com>
	<aY-vJe_ENCB-fux9@nathan>

Hello!

I ran some COPY FROM tests using master and then Nazir's v7-0001 and
v7-0002 patches applied to master.

x86 master
TXT :                 29222.524250 ms
CSV :                 36162.588500 ms
TXT with 1/3 escapes: 32922.649750 ms
CSV with 1/3 quotes:  47631.423750 ms

x86 v7-0001
TXT :                 23247.834250 ms  20.445496% improvement
CSV :                 23162.711750 ms  35.948413% improvement
TXT with 1/3 escapes: 31786.386000 ms  3.451313% improvement
CSV with 1/3 quotes:  43330.475500 ms  9.029645% improvement

x86 v7-0002
TXT :                 22394.812500 ms  23.364552% improvement
CSV :                 22374.645750 ms  38.127643% improvement
TXT with 1/3 escapes: 32378.929750 ms  1.651507% improvement
CSV with 1/3 quotes:  47139.171750 ms  1.033461% improvement

arm master
TXT :                 9448.900500 ms
CSV :                 11135.871500 ms
TXT with 1/3 escapes: 10786.418750 ms
CSV with 1/3 quotes:  14115.335500 ms

arm v7-0001
TXT :                 7271.170500 ms  23.047443% improvement
CSV :                 7259.866750 ms  34.806479% improvement
TXT with 1/3 escapes: 10894.445500 ms  1.001507% regression
CSV with 1/3 quotes:  13398.444000 ms  5.078813% improvement

arm v7-0002
TXT :                 7165.707250 ms  24.163587% improvement
CSV :                 7140.497250 ms  35.878416% improvement
TXT with 1/3 escapes: 10308.782250 ms  4.428129% improvement
CSV with 1/3 quotes:  12576.179500 ms  10.904140% improvement

v7-0001 + v7-0002 applied to master certainly seems promising: nice to see
speed improvements across the board on both x86 and arm!
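
The percentages above are consistent with (master - patched) / master * 100
for each row. A minimal sketch of that arithmetic, assuming that is how they
were derived; percent_improvement() is a hypothetical helper name, not part of
any test script in this thread:

```c
#include <assert.h>

/*
 * Percentage by which patched_ms improves on master_ms.
 * Positive means the patched build was faster, negative means a regression.
 * (Hypothetical helper, for illustration only.)
 */
static double
percent_improvement(double master_ms, double patched_ms)
{
    assert(master_ms > 0.0);
    return (master_ms - patched_ms) / master_ms * 100.0;
}
```

For example, the x86 TXT row (29222.524250 ms on master, 23247.834250 ms with
v7-0001) comes out at roughly 20.4%, and the arm "TXT with 1/3 escapes" row
for v7-0001 comes out slightly negative.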

On Fri, Feb 13, 2026 at 5:09 PM Nathan Bossart <[email protected]> wrote:

> On Fri, Feb 13, 2026 at 02:45:30PM +0300, Nazir Bilal Yavuz wrote:
> > Also, if I change this code to:
> >
> >     if (cstate->simd_enabled)
> >     {
> >         if (is_csv)
> >             result = CopyReadLineText(cstate, true, true);
> >         else
> >             result = CopyReadLineText(cstate, false, true);
> >     }
> >     else
> >     {
> >         if (is_csv)
> >             result = CopyReadLineText(cstate, true, false);
> >         else
> >             result = CopyReadLineText(cstate, false, false);
> >     }
> >
> > then I see a ~5% performance improvement in the scalar path compared to master.
>
> Hm.  What difference do you see if you just do
>
>         if (is_csv)
>                 result = CopyReadLineText(cstate, true);
>         else
>                 result = CopyReadLineText(cstate, false);
>
> both with and without the SIMD stuff?  IIUC this is allowing the compiler
> to remove several branches in CopyReadLineText(), which might be a nice
> improvement on its own.  That being said, I'm less convinced that adding a
> simd_enabled parameter to CopyReadLineText() helps, because 1) it's
> involved in fewer branches and 2) we change it within the function, so the
> compiler can't remove the branches, anyway.  But perhaps I'm missing
> something.
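
A rough sketch of the branch-removal idea described above. This is a toy
scanner, not CopyReadLineText(); find_eol() and find_eol_dispatch() are
hypothetical names. The point is only that each call site passes a literal, so
an inlined copy of the callee can have its is_csv tests folded away. Whether
that actually happens depends on the callee being inlined; in PostgreSQL that
would presumably mean marking it pg_attribute_always_inline.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy stand-in for a line scanner (not the real CopyReadLineText()).
 * Returns the offset of the first line terminator, or len if there is none.
 * In CSV mode, terminators inside double quotes are skipped.
 */
static inline size_t
find_eol(const char *buf, size_t len, bool is_csv)
{
    bool        in_quote = false;

    assert(buf != NULL);
    for (size_t i = 0; i < len; i++)
    {
        char        c = buf[i];

        if (is_csv && c == '"')
            in_quote = !in_quote;
        else if ((c == '\n' || c == '\r') && !(is_csv && in_quote))
            return i;
    }
    return len;
}

/*
 * Each call passes a compile-time constant, so once find_eol() is inlined
 * the compiler can drop the is_csv branches from each copy.
 */
static size_t
find_eol_dispatch(const char *buf, size_t len, bool is_csv)
{
    if (is_csv)
        return find_eol(buf, len, true);
    else
        return find_eol(buf, len, false);
}
```

A flag that the function itself modifies while running (as simd_enabled is
said to be above) does not get the same benefit, since it is no longer a
constant inside the loop.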
>
> Some other random thoughts:
>
> +                    match = vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, cr));
>
> +                match = vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, cr));
>
> Since \n and \r are well below "normal" ASCII values, I wonder if we could
> simplify these to something like
>
>         match = vector8_gt(... vector with all lanes set to \r + 1 ..., chunk);
>
> +            /* Check if we found any special characters */
> +            mask = vector8_highbit_mask(match);
> +            if (mask != 0)
>
> vector8_highbit_mask() is somewhat expensive on AArch64, so I wonder if
> waiting until we enter the "if" block to calculate it has any benefit.
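
A portable scalar emulation of that ordering, as a sketch only: the 16-lane
"match" vector is modeled as a byte array, and any_lane_set(), highbit_mask()
and first_match() are hypothetical names. The assumption is that a cheap
"is any high bit set?" test (something along the lines of
vector8_is_highbit_set() in simd.h) guards the more expensive mask
construction, so the mask is only built for chunks that actually contain a
match.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LANES 16

/* Cheap check: is the high bit set in any lane?  Two 64-bit ORs, no mask. */
static bool
any_lane_set(const uint8_t match[LANES])
{
    uint64_t    lo;
    uint64_t    hi;

    memcpy(&lo, match, sizeof(lo));
    memcpy(&hi, match + 8, sizeof(hi));
    return ((lo | hi) & UINT64_C(0x8080808080808080)) != 0;
}

/* Costlier step: gather the high bit of every lane into a 16-bit mask. */
static uint32_t
highbit_mask(const uint8_t match[LANES])
{
    uint32_t    mask = 0;

    for (int i = 0; i < LANES; i++)
        mask |= (uint32_t) (match[i] >> 7) << i;
    return mask;
}

/* Index of the first matching lane, or -1.  The mask is built only on a hit. */
static int
first_match(const uint8_t match[LANES])
{
    uint32_t    mask;
    int         idx = 0;

    assert(match != NULL);
    if (!any_lane_set(match))
        return -1;              /* common case: mask never computed */

    mask = highbit_mask(match);
    while ((mask & 1) == 0)
    {
        mask >>= 1;
        idx++;
    }
    return idx;
}
```

Whether this ordering is a measurable win on AArch64 is exactly the open
question above; the sketch only shows the control flow.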
>
> +                simd_hit_eol = (c1 == '\r' || c1 == '\n') && (!is_csv || !in_quote);
>
> If (is_csv && in_quote), we shouldn't have picked up \r or \n in the first
> place, right?
>
> +                simd_hit_eof = c1 == '\\' && c2 == '.' && !is_csv;
> +
> +                /*
> +                 * Do not disable SIMD when we hit EOL or EOF characters. In
> +                 * practice, it does not matter for EOF because parsing ends
> +                 * there, but we keep the behavior consistent.
> +                 */
> +                if (!(simd_hit_eof || simd_hit_eol))
>
> I'd think that doing less unnecessary work would outweigh the benefits of
> consistency for the EOF case.
>
> --
> nathan
>


-- 
-- Manni Wood EDB: https://www.enterprisedb.com

