Re: Speed up COPY FROM text/CSV parsing using SIMD

From: Nazir Bilal Yavuz <[email protected]>
To: KAZAR Ayoub <[email protected]>
Cc: Manni Wood <[email protected]>
Cc: Mark Wong <[email protected]>
Cc: Nathan Bossart <[email protected]>
Cc: Andrew Dunstan <[email protected]>
Cc: Shinya Kato <[email protected]>
Cc: PostgreSQL-development <[email protected]>
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
Date: Wed, 31 Dec 2025 16:04:15 +0300
Message-ID: <CAN55FZ3fWSk0h09Yfbb2eO4COfQDSL7Ofk7xF3q_Wg4ags3kPw@mail.gmail.com>
In-Reply-To: <CA+K2RumOaH-daBGN6uTo6+_0XSg7HQ10Na8OzScCV5j6eKkFgA@mail.gmail.com>
References: <CAOzEurTHCGL-Txqf5rxMsPgTF=dTCOsr=uhJdXebqjEJy-0L7g@mail.gmail.com>
	<CAN55FZ0+JZvKYVCnJqLhHaWF9eBGmTaF1BCEpttxw1aT3G_+Qw@mail.gmail.com>
	<[email protected]>
	<CAN55FZ1XF=R7F7B__gq04rp2nQnJqs1yfExEXo4riWc68+Pe0w@mail.gmail.com>
	<aR4wDwNdLc5TmcQq@nathan>
	<CA+K2Rump8NoMRZRZ2r4jHXUJwByasy_c3_b0oaO+TLkSbMD-jw@mail.gmail.com>
	<CAKWEB6rLxPVtN4ffZ3CMTL518zhk_BWzzBt6ZE2oUSaErdphxA@mail.gmail.com>
	<CAKWEB6oO4gQd+UJBrU=uuUTE8Hv7GMznjMouvn0Lskr52UqjhQ@mail.gmail.com>
	<CAN55FZ0Nd9FL=aDSjOTJTeFAn8VNrZgWG+WbcHR+R7GkDMvUyw@mail.gmail.com>
	<CAN55FZ1fwKgGo2wEie1w2M2jzJko6cMi1NWD05Xm47_L9a3D+g@mail.gmail.com>
	<aTx-LDyiHV-7wfOP@ltdrgnflg2>
	<CAKWEB6r=axZsG-s7zyWURZ-s9-s1dTV9ohkZXO0ynfLEU5ha3Q@mail.gmail.com>
	<CAN55FZ2DE2XSrFUhsOqbpBo+BtzTwsJWOD0MffvdGnHtbsPRuw@mail.gmail.com>
	<CA+K2RumOaH-daBGN6uTo6+_0XSg7HQ10Na8OzScCV5j6eKkFgA@mail.gmail.com>

Hi,

On Wed, 24 Dec 2025 at 18:08, KAZAR Ayoub <[email protected]> wrote:
>
> Hello,
> Following the same path of optimizing COPY FROM using SIMD, I found that COPY TO can also benefit from this.
>
> I attached a small patch that uses SIMD to skip data and advance until the first special character is found, then falls back to scalar processing for that character and re-enters the SIMD path...
> There are two ways to do this:
> 1) Essentially we do SIMD until we find a special character, then continue on the scalar path without re-entering SIMD.
> - This gives speedups of 10% to 30%, depending on the proportion of special characters in the attribute. We don't lose anything here, since it advances with SIMD until it can't (using the previous scripts: 1/3 and 2/3 special characters).
>
> 2) Use the SIMD path, fall back to the scalar path when we hit a special character, and keep re-entering the SIMD path each time.
> - This is equivalent to the COPY FROM story: we'll need to find the same heuristic for both COPY FROM and COPY TO to reduce the regressions (the same regressions, around 20% to 30% with 1/3 and 2/3 special characters).
>
> Something else to note is that the scalar path for COPY TO isn't as heavy as the state machine in COPY FROM.
>
> So if we find the sweet spot for the heuristic, doing the same for COPY TO will be trivial and always beneficial.
> Attached is 0004 which is option 1 (SIMD without re-entering), 0005 is the second one.

Patches look correct to me. I think we could move these SIMD code
portions into a shared function to remove duplication, although that
might have a performance impact. I have not benchmarked these patches
yet.

Another consideration is that these patches might need their own
thread, though I am not completely sure about this yet.

One question: what do you think about having a 0004-style approach for
COPY FROM? What I have in mind is running SIMD for each line & column,
stopping SIMD once it can no longer skip an entire chunk, and then
continuing with the next line & column.
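
To make the idea concrete, here is a rough sketch of what I mean. It is
not taken from any of the attached patches. The names scan_field,
chunk_has_byte and CHUNK_SIZE are made up for illustration, and it uses
plain 8-byte words (SWAR) as a portable stand-in for real vector
registers. It also assumes text-format rules only: backslash escapes,
with the delimiter or a newline ending the field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical chunk width: one 64-bit word stands in for a vector. */
#define CHUNK_SIZE sizeof(uint64_t)

/* Nonzero if any byte of v equals c (SWAR "has zero byte" test on v ^ c). */
static int
chunk_has_byte(uint64_t v, unsigned char c)
{
    uint64_t x = v ^ (UINT64_C(0x0101010101010101) * c);

    return ((x - UINT64_C(0x0101010101010101)) & ~x &
            UINT64_C(0x8080808080808080)) != 0;
}

/*
 * Scan one field starting at buf. Returns the offset of the unescaped
 * delimiter or newline that ends the field, or len if there is none.
 * *chunks_skipped reports how many whole chunks the fast path skipped.
 */
static size_t
scan_field(const char *buf, size_t len, char delim, size_t *chunks_skipped)
{
    size_t i = 0;

    assert(buf != NULL && chunks_skipped != NULL);
    *chunks_skipped = 0;

    /* Fast path: skip whole chunks that contain no special byte. */
    while (len - i >= CHUNK_SIZE)
    {
        uint64_t v;

        memcpy(&v, buf + i, CHUNK_SIZE);
        if (chunk_has_byte(v, (unsigned char) delim) ||
            chunk_has_byte(v, '\\') ||
            chunk_has_byte(v, '\n'))
            break;              /* cannot skip this chunk: leave fast path */
        i += CHUNK_SIZE;
        (*chunks_skipped)++;
    }

    /* Scalar path: finish this field without re-entering the fast path. */
    while (i < len)
    {
        char c = buf[i];

        if (c == '\\' && i + 1 < len)
        {
            i += 2;             /* skip the escaped byte */
            continue;
        }
        if (c == delim || c == '\n')
            break;
        i++;
    }
    return i;
}
```

The caller would invoke scan_field() again for the next column, so the
fast path gets one fresh attempt per field and is never re-entered
within a field. Whether that avoids the regressions seen with the
re-entering variant would still need to be benchmarked.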

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
