Re: Speed up COPY FROM text/CSV parsing using SIMD

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Nathan Bossart <[email protected]>
To: Nazir Bilal Yavuz <[email protected]>
Cc: Andrew Dunstan <[email protected]>
Cc: KAZAR Ayoub <[email protected]>
Cc: Shinya Kato <[email protected]>
Cc: [email protected]
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
Date: Tue, 21 Oct 2025 13:40:09 -0500
Message-ID: <aPfTiX0HwV42R6Od@nathan> (raw)
In-Reply-To: <CAN55FZ2GonAeSJHn-c2nJgUO-v6sDMOQzn97evVdZbcHeu3ihw@mail.gmail.com>
References: <CAN55FZ0houfWHn8_MEEefhprZvc33jr07GrBYo+Bp2yw=TVnKA@mail.gmail.com>
	<CA+K2Ru=jHuz_Wpgar4Sobtxeb33qxx=o59ToOhZ=vpmkMqErnA@mail.gmail.com>
	<CAN55FZ1J+6eM=F5GreWEBMJcNV_gifYyYY1b6xpYzun=nWPhMQ@mail.gmail.com>
	<CAN55FZ109W90Ux_EBEqkkU2TyNqBNhdhN_1XPRGo3iiZ2L9b=A@mail.gmail.com>
	<[email protected]>
	<CAN55FZ1KF7XNpm2XyG=M-sFUODai=6Z8a11xE3s4YRBeBKY3tA@mail.gmail.com>
	<[email protected]>
	<aPZrg6lxb5bgy_px@nathan>
	<[email protected]>
	<CAN55FZ2GonAeSJHn-c2nJgUO-v6sDMOQzn97evVdZbcHeu3ihw@mail.gmail.com>

On Tue, Oct 21, 2025 at 12:09:27AM +0300, Nazir Bilal Yavuz wrote:
> I think the problem is deciding how many lines to process before
> deciding for the rest. 1000 lines could work for the small sized data
> but it might not work for the big sized data. Also, it might cause a
> worse regressions for the small sized data.

IMHO we have some leeway with smaller amounts of data.  If COPY FROM for
1000 rows takes 19 milliseconds as opposed to 11 milliseconds, it seems
unlikely users would be inconvenienced all that much.  (Those numbers are
completely made up in order to illustrate my point.)

> Because of this reason, I
> tried to implement a heuristic that will work regardless of the size
> of the data. The last heuristic I suggested will run SIMD for
> approximately (#number_of_lines / 1024 [1024 is the max number of
> lines to sleep before running SIMD again]) lines if all characters in
> the data are special characters.

I wonder if we could mitigate the regression further by spacing out the
checks a bit more.  It could be worth comparing a variety of values to
identify what works best with the test data.

-- 
nathan

view thread (99+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected], [email protected]
  Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
  In-Reply-To: <aPfTiX0HwV42R6Od@nathan>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox