Re: Speed up COPY FROM text/CSV parsing using SIMD


From: Nazir Bilal Yavuz <[email protected]>
To: Nathan Bossart <[email protected]>
Cc: KAZAR Ayoub <[email protected]>
Cc: Neil Conway <[email protected]>
Cc: Manni Wood <[email protected]>
Cc: Andrew Dunstan <[email protected]>
Cc: Shinya Kato <[email protected]>
Cc: PostgreSQL-development <[email protected]>
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
Date: Sat, 7 Feb 2026 01:19:16 +0300
Message-ID: <CAN55FZ2DOeLjSXE2Jos99bgHG-Zeo3KjStrSgoA8Rf=2Mu+hFA@mail.gmail.com>
In-Reply-To: <aYZdKSTw6N3khsVE@nathan>
References: <CAKWEB6rLxPVtN4ffZ3CMTL518zhk_BWzzBt6ZE2oUSaErdphxA@mail.gmail.com>
	<CAKWEB6oO4gQd+UJBrU=uuUTE8Hv7GMznjMouvn0Lskr52UqjhQ@mail.gmail.com>
	<CAN55FZ0Nd9FL=aDSjOTJTeFAn8VNrZgWG+WbcHR+R7GkDMvUyw@mail.gmail.com>
	<CAN55FZ1fwKgGo2wEie1w2M2jzJko6cMi1NWD05Xm47_L9a3D+g@mail.gmail.com>
	<CAKWEB6oZdQhhBV3ojHLBwjQgKzfDw0fkqncurt9oi7vNsq41ww@mail.gmail.com>
	<CAN55FZ1p5UyUdTRO7iWR_ukjhJDOnpOR2rYNOq=+hcC45OuahQ@mail.gmail.com>
	<CAOW5sYZEx=fPw2wp7y2nK_-ifXFeYW4CTmFx_OQeoHFjG7rbHw@mail.gmail.com>
	<CA+K2Ru=C_woAnd-3-pGHoNSTR8FOf=7eeSWE1xaLt9ojVWndVg@mail.gmail.com>
	<CAN55FZ0FRB2OD6-oEESLvgUT4bLZQVD72pAqUqzdw7Rx5cN0ig@mail.gmail.com>
	<CA+K2Run1VdLnmp-5_Qv2Fax0KgT7LLJMH-uzjaaf-NZD1oU-=w@mail.gmail.com>
	<aYZdKSTw6N3khsVE@nathan>

Hi,

Thank you for sharing your thoughts!

On Sat, 7 Feb 2026 at 00:29, Nathan Bossart <[email protected]> wrote:
>
> It looks like a lot of energy has been put into benchmarking and refining
> the heuristic for deciding when to use the SIMD path so that we avoid large
> regressions when there are special characters.  I think this is all
> valuable work, but I'm a bit concerned that we are putting the cart before
> the horse.  IMHO it would be better to first get the SIMD code committed
> with the absolute simplest heuristic we can think of (e.g., as soon as we
> see a special character, switch to the scalar path for the remainder of
> COPY FROM).  My hope is that would be far easier to reason about from a
> performance angle.  If we immediately fall back to the existing code path,
> we don't need to worry about how many special characters there are and
> whether they are sparse or clustered or whatever.  We just need to measure
> the overhead of the new branches and ensure they don't produce meaningful
> regressions.  Assuming that all looks good, we can then focus on the SIMD
> code itself and make sure that is correct and optimal.  And once we get
> that portion committed, we could then consider more sophisticated
> heuristics.

I have three possible approaches in mind; they are quite similar to
each other.

1- After encountering a special character, disable SIMD for the rest
of the current line and also for the rest of the data.

2- A mix of the current heuristic and #1. After encountering a special
character, skip SIMD for the current line (let's say line 1) and for
the next line (line 2). Then try SIMD again on the line after that
(line 3). If it has no special character, keep using SIMD; if it has
one, skip SIMD for two lines this time. And so on: every time a
special character is encountered during a SIMD run, the number of
lines to skip is doubled.

3- This version is a bit different from #2. Instead of calculating the
number of lines to skip dynamically, skip a constant number of lines
(N) and then try SIMD again after those lines. N could be something
like 100, 1000, or 10000. Actually, you and Andrew suggested this
approach before [1].
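
To make the difference between the three approaches concrete, here is a
rough Python simulation. It is only a sketch, not the patch's C code:
the function names are made up for illustration, each input line is
modelled as a single boolean ("does this line contain a special
character?"), and each function returns, per line, whether SIMD would
be attempted at the start of that line. For #2 it assumes the skip
count starts at one line and is never reset after a clean line, which
the description above does not specify.

```python
from typing import Iterable, List


def policy_disable_forever(lines: Iterable[bool]) -> List[bool]:
    """Approach #1: after the first special character, never use SIMD again."""
    simd_enabled = True
    attempted = []
    for has_special in lines:
        attempted.append(simd_enabled)
        if simd_enabled and has_special:
            simd_enabled = False
    return attempted


def policy_doubling_skip(lines: Iterable[bool], initial_skip: int = 1) -> List[bool]:
    """Approach #2: skip a growing number of lines after each failed SIMD attempt.

    Assumption: the skip count doubles on every failure and is never reset.
    """
    next_skip = initial_skip  # lines to skip after the next failure
    remaining = 0             # lines still to be parsed without SIMD
    attempted = []
    for has_special in lines:
        if remaining > 0:
            attempted.append(False)
            remaining -= 1
            continue
        attempted.append(True)
        if has_special:
            remaining = next_skip
            next_skip *= 2
    return attempted


def policy_constant_skip(lines: Iterable[bool], n: int) -> List[bool]:
    """Approach #3: after a failed SIMD attempt, skip exactly n lines."""
    remaining = 0
    attempted = []
    for has_special in lines:
        if remaining > 0:
            attempted.append(False)
            remaining -= 1
            continue
        attempted.append(True)
        if has_special:
            remaining = n
    return attempted
```

Feeding the same list of booleans to all three functions shows how
often each approach would retry SIMD on a given input, for example
data where every line contains a special character versus data where
only a few lines do.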

I think what you suggested is closer to #1 or #3. I just wanted to
hear your opinions, and whether you think any of these approaches are
good to implement / work on.

[1] https://postgr.es/m/aR4wDwNdLc5TmcQq%40nathan

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
