Re: Speed up COPY FROM text/CSV parsing using SIMD

From: Nazir Bilal Yavuz <[email protected]>
To: Nathan Bossart <[email protected]>
Cc: Andrew Dunstan <[email protected]>
Cc: KAZAR Ayoub <[email protected]>
Cc: Shinya Kato <[email protected]>
Cc: [email protected]
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
Date: Wed, 22 Oct 2025 15:33:37 +0300
Message-ID: <CAN55FZ0AYP4ZEczBJ5ur-=9QuEhMysH9Yfrq5srr0ZakK1M0FA@mail.gmail.com>
In-Reply-To: <aPfTiX0HwV42R6Od@nathan>
References: <CAN55FZ0houfWHn8_MEEefhprZvc33jr07GrBYo+Bp2yw=TVnKA@mail.gmail.com>
	<CA+K2Ru=jHuz_Wpgar4Sobtxeb33qxx=o59ToOhZ=vpmkMqErnA@mail.gmail.com>
	<CAN55FZ1J+6eM=F5GreWEBMJcNV_gifYyYY1b6xpYzun=nWPhMQ@mail.gmail.com>
	<CAN55FZ109W90Ux_EBEqkkU2TyNqBNhdhN_1XPRGo3iiZ2L9b=A@mail.gmail.com>
	<[email protected]>
	<CAN55FZ1KF7XNpm2XyG=M-sFUODai=6Z8a11xE3s4YRBeBKY3tA@mail.gmail.com>
	<[email protected]>
	<aPZrg6lxb5bgy_px@nathan>
	<[email protected]>
	<CAN55FZ2GonAeSJHn-c2nJgUO-v6sDMOQzn97evVdZbcHeu3ihw@mail.gmail.com>
	<aPfTiX0HwV42R6Od@nathan>

Hi,

On Tue, 21 Oct 2025 at 21:40, Nathan Bossart <[email protected]> wrote:
>
> On Tue, Oct 21, 2025 at 12:09:27AM +0300, Nazir Bilal Yavuz wrote:
> > I think the problem is deciding how many lines to process before
> > deciding for the rest. 1000 lines could work for the small sized data
> > but it might not work for the big sized data. Also, it might cause
> > worse regressions for the small sized data.
>
> IMHO we have some leeway with smaller amounts of data.  If COPY FROM for
> 1000 rows takes 19 milliseconds as opposed to 11 milliseconds, it seems
> unlikely users would be inconvenienced all that much.  (Those numbers are
> completely made up in order to illustrate my point.)
>
> > For this reason, I
> > tried to implement a heuristic that will work regardless of the size
> > of the data. The last heuristic I suggested will run SIMD for
> > approximately (number_of_lines / 1024) lines if all characters in
> > the data are special characters, where 1024 is the maximum number of
> > lines to sleep before running SIMD again.
>
> I wonder if we could mitigate the regression further by spacing out the
> checks a bit more.  It could be worth comparing a variety of values to
> identify what works best with the test data.

Do you mean that instead of doubling the SIMD sleep, we should
multiply it by 3 (or another factor)? Or are you referring to
increasing the maximum sleep from 1024? Or possibly both?
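
To make the two knobs concrete, here is a minimal sketch of a backoff of
this kind. It is not the code from the patch: the struct, the function
names, the starting sleep of 1, and the reset on a successful SIMD attempt
are all assumptions made for illustration. Only the doubling and the cap of
1024 come from the discussion above. "factor" is the growth multiplier and
"max_sleep" is the cap, so both options in the question can be compared.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical backoff state; names are illustrative, not from the patch. */
typedef struct SimdBackoff
{
	unsigned	sleep;		/* lines to skip after the next failed attempt */
	unsigned	remaining;	/* lines still to skip before retrying SIMD */
	unsigned	factor;		/* growth multiplier (2 means doubling) */
	unsigned	max_sleep;	/* upper bound on sleep (1024 in the thread) */
} SimdBackoff;

static void
simd_backoff_init(SimdBackoff *b, unsigned factor, unsigned max_sleep)
{
	b->sleep = 1;			/* assumed starting value */
	b->remaining = 0;
	b->factor = (factor < 2) ? 2 : factor;	/* guard against 0 and 1 */
	b->max_sleep = max_sleep;
}

/* Returns true if the SIMD path should be tried for the next line. */
static bool
simd_backoff_should_try(SimdBackoff *b)
{
	if (b->remaining > 0)
	{
		b->remaining--;
		return false;
	}
	return true;
}

/* Record whether the SIMD attempt paid off (the criterion is assumed). */
static void
simd_backoff_report(SimdBackoff *b, bool simd_helped)
{
	if (simd_helped)
	{
		/* Assumed behavior: a useful attempt resets the backoff. */
		b->sleep = 1;
		b->remaining = 0;
		return;
	}

	/* Skip the next "sleep" lines, then grow the sleep up to the cap. */
	b->remaining = b->sleep;
	if (b->sleep > b->max_sleep / b->factor)
		b->sleep = b->max_sleep;
	else
		b->sleep *= b->factor;
}

/* Worst case: count SIMD attempts over n_lines when SIMD never helps. */
static size_t
count_simd_attempts(size_t n_lines, unsigned factor, unsigned max_sleep)
{
	SimdBackoff b;
	size_t		attempts = 0;

	simd_backoff_init(&b, factor, max_sleep);
	for (size_t i = 0; i < n_lines; i++)
	{
		if (simd_backoff_should_try(&b))
		{
			attempts++;
			simd_backoff_report(&b, false);
		}
	}
	return attempts;
}
```

In this sketch, once the sleep reaches the cap, the worst case settles at
one attempt per (max_sleep + 1) lines, which is roughly the
number_of_lines / 1024 figure quoted above. A larger factor only shortens
the ramp-up to the cap, while a larger max_sleep lowers the steady-state
number of attempts, so count_simd_attempts() can be called with different
(factor, max_sleep) pairs to compare the two.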

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
