Re: Speed up COPY FROM text/CSV parsing using SIMD


From: Andrew Dunstan <[email protected]>
To: Nazir Bilal Yavuz <[email protected]>
Cc: KAZAR Ayoub <[email protected]>
Cc: Shinya Kato <[email protected]>
Cc: [email protected]
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
Date: Mon, 20 Oct 2025 10:02:23 -0400
Message-ID: <[email protected]> (raw)
In-Reply-To: <CAN55FZ1KF7XNpm2XyG=M-sFUODai=6Z8a11xE3s4YRBeBKY3tA@mail.gmail.com>
References: <CAOzEurSW8cNr6TPKsjrstnPfhf4QyQqB4tnPXGGe8N4e_v7Jig@mail.gmail.com>
	<CAN55FZ247JdiT8Sd1SRiyOJxk3Ei=pDCL4kpdP=HqLRjOhKf1Q@mail.gmail.com>
	<CAN55FZ2AxiwSah7TiQoMB==r=JKT0bOtooCB7ov4xRrGkVmJ1A@mail.gmail.com>
	<CAOzEurR5nFt=-SijfU7y0BHVcrT6RG9ovvdVfKt_uBZfEQew9w@mail.gmail.com>
	<CAOzEurSqgA69er9SzhPnXwmsVpO7-piUOuOy3dXcHOi__nSQcg@mail.gmail.com>
	<CA+K2RumC79NwWxBdofHOYo8SCSs0YCJic05Du=xOszRmoPf9FA@mail.gmail.com>
	<CAN55FZ0houfWHn8_MEEefhprZvc33jr07GrBYo+Bp2yw=TVnKA@mail.gmail.com>
	<CA+K2Ru=jHuz_Wpgar4Sobtxeb33qxx=o59ToOhZ=vpmkMqErnA@mail.gmail.com>
	<CAN55FZ1J+6eM=F5GreWEBMJcNV_gifYyYY1b6xpYzun=nWPhMQ@mail.gmail.com>
	<CAN55FZ109W90Ux_EBEqkkU2TyNqBNhdhN_1XPRGo3iiZ2L9b=A@mail.gmail.com>
	<[email protected]>
	<CAN55FZ1KF7XNpm2XyG=M-sFUODai=6Z8a11xE3s4YRBeBKY3tA@mail.gmail.com>


On 2025-10-16 Th 10:29 AM, Nazir Bilal Yavuz wrote:
> Hi,
>
> On Thu, 21 Aug 2025 at 18:47, Andrew Dunstan <[email protected]> wrote:
>>
>> On 2025-08-19 Tu 10:14 AM, Nazir Bilal Yavuz wrote:
>>> Hi,
>>>
>>> On Tue, 19 Aug 2025 at 15:33, Nazir Bilal Yavuz <[email protected]> wrote:
>>>> I am able to reproduce the regression you mentioned, but both
>>>> regressions are 20% on my end. I found (by experimenting) that SIMD
>>>> causes a regression if it advances fewer than 5 characters.
>>>>
>>>> So, I implemented a small heuristic. It works like this:
>>>>
>>>> - If advance < 5 -> insert a sleep penalty (n cycles).
>>> 'sleep' might be a poor word choice here. I meant skipping SIMD for n
>>> number of times.
>>>
>> I was thinking a bit about that this morning. I wonder if, instead of
>> having a constantly applied heuristic like this, it might be better to
>> do a little extra accounting in the first, say, 1000 lines of an input
>> file, and if less than some portion of the input is found to be special
>> characters then switch to the SIMD code. What that portion should be
>> would need to be determined by some experimentation with a variety of
>> typical workloads, but given your findings 20% seems like a good
>> starting point.
> I implemented a heuristic similar to this. It is a mix of the
> previous heuristic and your idea, and it works like this:
>
> The overall logic is that we will not run SIMD for the entire line, and
> we decide whether it is worth running SIMD for the next lines.
>
> 1 - We will try SIMD and decide if it is worth it to run SIMD.
> 1.1 - If it is worth it, we will continue to run SIMD and we will
> halve the simd_last_sleep_cycle variable.
> 1.2 - If it is not worth it, we will double the simd_last_sleep_cycle
> and we will not run SIMD for that many lines.
> 1.3 - After skipping simd_last_sleep_cycle lines, we will go back to step 1.
> Note: simd_last_sleep_cycle cannot exceed 1024, so we will try SIMD at
> least once every 1024 lines.
>
> With this heuristic the regression is limited to 2% in the worst case.
>
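
To make the quoted steps concrete, here is a minimal sketch of the back-off logic in C. It is not taken from the patch: only simd_last_sleep_cycle and the 1024 cap come from the description above. The struct, the function names, the lines_to_skip counter, and the choice that doubling from 0 starts at 1 are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Cap from the description above: the skip count never exceeds 1024 lines. */
#define SIMD_MAX_SLEEP_CYCLE 1024

/* Hypothetical state holder; only simd_last_sleep_cycle is named in the thread. */
typedef struct SimdBackoff
{
    int simd_last_sleep_cycle;  /* current back-off length, in lines */
    int lines_to_skip;          /* lines left before SIMD is tried again */
} SimdBackoff;

/* Step 1.3: returns true when the next line should be tried with SIMD. */
bool
simd_backoff_should_try(SimdBackoff *b)
{
    if (b->lines_to_skip > 0)
    {
        b->lines_to_skip--;
        return false;
    }
    return true;
}

/* Steps 1.1 and 1.2: record whether the SIMD attempt on a line paid off. */
void
simd_backoff_report(SimdBackoff *b, bool worth_it)
{
    if (worth_it)
        b->simd_last_sleep_cycle /= 2;
    else
    {
        /* Assumption: doubling from 0 starts at 1; the thread does not say. */
        if (b->simd_last_sleep_cycle == 0)
            b->simd_last_sleep_cycle = 1;
        else
            b->simd_last_sleep_cycle *= 2;

        if (b->simd_last_sleep_cycle > SIMD_MAX_SLEEP_CYCLE)
            b->simd_last_sleep_cycle = SIMD_MAX_SLEEP_CYCLE;

        b->lines_to_skip = b->simd_last_sleep_cycle;
    }
    assert(b->simd_last_sleep_cycle >= 0 &&
           b->simd_last_sleep_cycle <= SIMD_MAX_SLEEP_CYCLE);
}

/* Count SIMD attempts over nlines when every attempt has the same outcome. */
int
simd_backoff_count_attempts(int nlines, bool worth_it)
{
    SimdBackoff b = {0, 0};
    int attempts = 0;

    for (int i = 0; i < nlines; i++)
    {
        if (simd_backoff_should_try(&b))
        {
            attempts++;
            simd_backoff_report(&b, worth_it);
        }
    }
    return attempts;
}
```

In this sketch, when SIMD never pays off, the gap between attempts doubles until it reaches the cap, after which one line in every 1025 is still tried with SIMD. Those residual attempts are where a worst-case cost would come from.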

My worry is that the worst case is actually quite common. Sparse data 
sets dominated by a lot of null values (and hence lots of special 
characters) are very common. Are people prepared to accept a 2% 
regression on load times for such data sets?
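
For comparison, the up-front accounting idea quoted earlier (sample the first 1000 lines, then pick a code path once) might look roughly like the sketch below. Everything in it is hypothetical: the names, the set of characters counted as special, and the treatment of the 20% figure as a fixed threshold are assumptions for illustration, not anything from the patch or from PostgreSQL's COPY code.

```c
#include <stdbool.h>
#include <stddef.h>

#define SAMPLE_LINES 1000          /* "the first, say, 1000 lines" */
#define SPECIAL_PERCENT_LIMIT 20   /* suggested starting point, to be tuned */

typedef enum { PARSE_UNDECIDED, PARSE_SIMD, PARSE_SCALAR } ParseMode;

/* Hypothetical running totals gathered while parsing the sample. */
typedef struct CopySample
{
    size_t lines_seen;
    size_t total_chars;
    size_t special_chars;
} CopySample;

/* Assumption: delimiter, quote, backslash, CR and LF count as special. */
bool
copy_char_is_special(char c, char delim, char quote)
{
    return c == delim || c == quote || c == '\\' || c == '\n' || c == '\r';
}

/* Account for one input line; lines after the sample window are ignored. */
void
copy_sample_add_line(CopySample *s, const char *line, size_t len,
                     char delim, char quote)
{
    if (s->lines_seen >= SAMPLE_LINES)
        return;

    for (size_t i = 0; i < len; i++)
    {
        if (copy_char_is_special(line[i], delim, quote))
            s->special_chars++;
    }
    s->total_chars += len;
    s->lines_seen++;
}

/* Decide once the sample is complete; undecided input stays on the scalar path. */
ParseMode
copy_sample_decide(const CopySample *s)
{
    if (s->lines_seen < SAMPLE_LINES)
        return PARSE_UNDECIDED;

    /* Integer form of: special_chars / total_chars < 20% */
    if (s->special_chars * 100 < s->total_chars * SPECIAL_PERCENT_LIMIT)
        return PARSE_SIMD;

    return PARSE_SCALAR;
}
```

With a scheme like this, a sparse input made mostly of delimiters would settle on the scalar path after the sample and pay no further SIMD cost. The trade-off is that it cannot adapt if the character mix changes later in the file.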


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
