Content-Type: multipart/alternative;
 boundary="------------rIPA3bHtfK7RQfVKhiCu2jrV"
Message-ID: <673d92f7-2489-475f-a208-9414ea35d4d8@dunslane.net>
Date: Mon, 20 Oct 2025 10:02:23 -0400
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
To: Nazir Bilal Yavuz <byavuz81@gmail.com>
Cc: KAZAR Ayoub <ma_kazar@esi.dz>, Shinya Kato <shinya11.kato@gmail.com>,
 pgsql-hackers@postgresql.org
References: 
 <CAOzEurSW8cNr6TPKsjrstnPfhf4QyQqB4tnPXGGe8N4e_v7Jig@mail.gmail.com>
 <CAN55FZ247JdiT8Sd1SRiyOJxk3Ei=pDCL4kpdP=HqLRjOhKf1Q@mail.gmail.com>
 <CAN55FZ2AxiwSah7TiQoMB==r=JKT0bOtooCB7ov4xRrGkVmJ1A@mail.gmail.com>
 <CAOzEurR5nFt=-SijfU7y0BHVcrT6RG9ovvdVfKt_uBZfEQew9w@mail.gmail.com>
 <CAOzEurSqgA69er9SzhPnXwmsVpO7-piUOuOy3dXcHOi__nSQcg@mail.gmail.com>
 <CA+K2RumC79NwWxBdofHOYo8SCSs0YCJic05Du=xOszRmoPf9FA@mail.gmail.com>
 <CAN55FZ0houfWHn8_MEEefhprZvc33jr07GrBYo+Bp2yw=TVnKA@mail.gmail.com>
 <CA+K2Ru=jHuz_Wpgar4Sobtxeb33qxx=o59ToOhZ=vpmkMqErnA@mail.gmail.com>
 <CAN55FZ1J+6eM=F5GreWEBMJcNV_gifYyYY1b6xpYzun=nWPhMQ@mail.gmail.com>
 <CAN55FZ109W90Ux_EBEqkkU2TyNqBNhdhN_1XPRGo3iiZ2L9b=A@mail.gmail.com>
 <8615c983-1662-43b4-b0c9-49d194ac33aa@dunslane.net>
 <CAN55FZ1KF7XNpm2XyG=M-sFUODai=6Z8a11xE3s4YRBeBKY3tA@mail.gmail.com>
From: Andrew Dunstan <andrew@dunslane.net>
Content-Language: en-US
Autocrypt: addr=andrew@dunslane.net; keydata=
 xsBNBE7KWFkBCAClridxur2AIc7eW2AR7izbfp3EnNefie2HbLF0izW5Ik5UjX2HBXBx4syI
 gY6b0ugohXrr274+baoAlvSbq6cAoQuEVrk5IZFzt20b1Xkx65FwGSEj526yiKLocqkJceSq
 Xr9xcA5SGY+FZv441chh5SU92v4q6z+6LPpoHOh97ptAVXZYNTtU0LevyvD5lja0TzbvJm6C
 eFXitJfnm1pLEr0DGJCR/iUOl/N62Kh4855zZC7NHIjQHPOvV5Stz/l5ilDhvGVk+xkXFPys
 SjZoUr1rXhYLpiyi5sR0X9FHXT0KnGuz1F5ERO7ZTLSSQ6fJwPj6gOk9K+vvoKvoeql5ABEB
 AAHNJEFuZHJldyBEdW5zdGFuIDxhbmRyZXdAZHVuc2xhbmUubmV0PsLAlwQTAQgAQQIbAwIX
 gAIZAQULCQgHAwUVCgkICwUWAgMBAAIeBRYhBOQ+WEYd/Hy/RGkVpZn6f8tZ/DuBBQJoGNGd
 BQkdEO8nAAoJEJn6f8tZ/DuBq74H/jkTR4Zi3stbw+xC7v2u3QozssK7MYPL2AsVfh7OealS
 h182fiWXpfvmmAB7WUHbhk9GC2RAOnHI/2d2jgKaMLAHsGYOT0YopTVIwRY43fCw/mK67yxc
 wmDcX+zyKfLaivNbf5A7QPLNwda98bEAMSJ8Sn652Uc6cA8t3uKGsVzbRBQOoYzjgvBCfSrE
 9ql3PDNg0l4BfAqabd2f70ZUm9VAMEPrgv/v2xI7M2XiL4g5BVmqLCOwxLM8RMCotCuoweUr
 VO43DeBCIDwLxotMJKvGWDjBzQYlU1NPUAtNcz/gN9ITUe1VUGjyvGj4u1lxBOcQQUw7l1+T
 5moZ4iZxXzvOwE0ETspYWQEIANGc4zQULOxhbqO2dyD51YhqCNRmm9oKWaqf+wmW4tpDe/VV
 cxAnNizd4LWCHfzpb5cHAtGkOPePMfzWVf6nvdF7d3eglbtf59+zG7O7llV0xSSoFiieQBsr
 GvqDInXYX/4mRRXMtyhM353/tixC9RWLs1oofyYmCPPXXY7h9R7en3B8BoVrRFcdzlIY/NFN
 hFGW/9dkEiGjgna2Rk6e15kln4ZvFBWUg23p93w/pqXcxY6+k/8TEk+C4R+M6w7o2PLGOjdZ
 +kPiUcw5H85zf/yZJwQXzisXaNduwWB6Vads9YC9dj6kPR1c4VGRqAaYL++LAEOqrlvm2Tvq
 QqZRtnEAEQEAAcLAfAQYAQgAJgIbDBYhBOQ+WEYd/Hy/RGkVpZn6f8tZ/DuBBQJoGNI2BQkd
 EODdAAoJEJn6f8tZ/DuBfw0IAKTsfD40teP/pp+bsLLMSxPXUYrrprTj7WFB5v61p6dkpSr/
 qXmMlyahdxQFaPmfVgVirB1Vk/kHiWNnnGjfUV9nB2Zg9LI0Xb9/ts3LsUiRWXzG3tkMY6XL
 vsVOxW4XFRND9l2q+WW93aZ1DZl+fqWfYgMvsusFRhmGFOKTRfKPta2Pkv+AhA24N4+PrR5p
 bU4k2MO8PAGiK8eaYKGFG1bHKuAvoDoF7WXJ3FHxuWqLnKEt4dfOLm5pAe3zq1Lt6q8azT9i
 QWGpSAK5vQUWQHBHpiDjdPeqKZ6HiAXIIKfSmb+jrvXBqoP+D6/K7rUjG2aXiRtTIAXms9sm
 VRu7cmw=
In-Reply-To: 
 <CAN55FZ1KF7XNpm2XyG=M-sFUODai=6Z8a11xE3s4YRBeBKY3tA@mail.gmail.com>
Archived-At: 
 <https://www.postgresql.org/message-id/673d92f7-2489-475f-a208-9414ea35d4d8%40dunslane.net>
Precedence: bulk

This is a multi-part message in MIME format.
--------------rIPA3bHtfK7RQfVKhiCu2jrV
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit


On 2025-10-16 Th 10:29 AM, Nazir Bilal Yavuz wrote:
> Hi,
>
> On Thu, 21 Aug 2025 at 18:47, Andrew Dunstan<andrew@dunslane.net> wrote:
>>
>> On 2025-08-19 Tu 10:14 AM, Nazir Bilal Yavuz wrote:
>>> Hi,
>>>
>>> On Tue, 19 Aug 2025 at 15:33, Nazir Bilal Yavuz<byavuz81@gmail.com> wrote:
>>>> I am able to reproduce the regression you mentioned, but both
>>>> regressions are 20% on my end. I found (by experimenting) that SIMD
>>>> causes a regression if it advances fewer than 5 characters.
>>>>
>>>> So, I implemented a small heuristic. It works like this:
>>>>
>>>> - If advance < 5 -> insert a sleep penalty (n cycles).
>>> 'sleep' might be a poor word choice here. I meant skipping SIMD
>>> n times.
>>>
>> I was thinking a bit about that this morning. I wonder if, instead of
>> having a constantly applied heuristic like this, it might be better
>> to do a little extra accounting in the first, say, 1000 lines of an
>> input file, and if less than some portion of the input is found to be
>> special characters then switch to the SIMD code. What that portion
>> should be would need to be determined by some experimentation with a
>> variety of typical workloads, but given your findings 20% seems like
>> a good starting point.
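
For reference, the sampling idea quoted above could look roughly like
the sketch below. It is only an illustration, not code from any patch.
The 1000-line window and the 20% threshold come from the quoted text;
the struct, the function names, and the set of characters counted as
"special" (delimiter, backslash, CR, LF) are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>

#define SAMPLE_LINES      1000	/* "the first, say, 1000 lines" */
#define SPECIAL_PCT_LIMIT 20	/* suggested starting threshold, in percent */

/* Hypothetical accounting state, kept per COPY FROM input. */
typedef struct CopySample
{
	size_t		lines_seen;
	size_t		total_chars;
	size_t		special_chars;
} CopySample;

/* Assumed definition of "special"; the real parser's set may differ. */
static bool
is_special(char c, char delim)
{
	return c == delim || c == '\\' || c == '\n' || c == '\r';
}

/* Account for one input line while still inside the sampling window. */
static void
sample_line(CopySample *s, const char *line, size_t len, char delim)
{
	if (s->lines_seen >= SAMPLE_LINES)
		return;
	for (size_t i = 0; i < len; i++)
	{
		if (is_special(line[i], delim))
			s->special_chars++;
	}
	s->total_chars += len;
	s->lines_seen++;
}

/* True if the sampled share of special characters is below the limit. */
static bool
sample_prefers_simd(const CopySample *s)
{
	if (s->total_chars == 0)
		return false;
	return s->special_chars * 100 < s->total_chars * SPECIAL_PCT_LIMIT;
}
```

The decision would be taken once, after the sampling window closes, so
no per-line bookkeeping remains for the rest of the file.
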
> I implemented a heuristic similar to this. It is a mix of the
> previous heuristic and your idea, and it works like this:
>
> The overall logic is that we do not run SIMD for the entire line;
> instead, we decide whether it is worth running SIMD for the next lines.
>
> 1 - We try SIMD and decide whether it is worth running.
> 1.1 - If it is worth it, we continue to run SIMD and halve the
> simd_last_sleep_cycle variable.
> 1.2 - If it is not worth it, we double simd_last_sleep_cycle and do
> not run SIMD for that many lines.
> 1.3 - After skipping simd_last_sleep_cycle lines, we go back to #1.
> Note: simd_last_sleep_cycle cannot exceed 1024, so in the worst case
> we run SIMD once every 1024 lines.
>
> With this heuristic the regression is limited to 2% in the worst case.
>
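
A minimal sketch of the back-off scheme quoted above, for readers
following along; it is not the patch's actual code. Only
simd_last_sleep_cycle and the 1024 cap come from the description. The
struct, the function names, the lines_to_skip counter, and starting the
doubling from 1 are assumptions, and how "worth it" is measured is left
to the caller.

```c
#include <stdbool.h>

#define SIMD_SLEEP_MAX 1024		/* cap stated in the quoted description */

/* Hypothetical per-COPY state for the heuristic. */
typedef struct SimdBackoff
{
	int			simd_last_sleep_cycle;	/* current back-off, in lines */
	int			lines_to_skip;	/* lines left before SIMD is retried */
} SimdBackoff;

/* Step 1 / 1.3: should the next line be attempted with SIMD? */
static bool
simd_should_try(SimdBackoff *st)
{
	if (st->lines_to_skip > 0)
	{
		st->lines_to_skip--;
		return false;
	}
	return true;
}

/* Steps 1.1 and 1.2: record whether the last SIMD attempt paid off. */
static void
simd_report(SimdBackoff *st, bool worth_it)
{
	if (worth_it)
	{
		/* 1.1: keep using SIMD and halve the back-off */
		st->simd_last_sleep_cycle /= 2;
		return;
	}

	/* 1.2: double the back-off (assumed to start at 1), capped */
	if (st->simd_last_sleep_cycle == 0)
		st->simd_last_sleep_cycle = 1;
	else
		st->simd_last_sleep_cycle *= 2;
	if (st->simd_last_sleep_cycle > SIMD_SLEEP_MAX)
		st->simd_last_sleep_cycle = SIMD_SLEEP_MAX;

	st->lines_to_skip = st->simd_last_sleep_cycle;
}
```

On input where SIMD never pays off, this settles at one SIMD attempt
followed by 1024 skipped lines, which is the worst case discussed below.
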

My worry is that the worst case is actually quite common. Sparse data 
sets dominated by a lot of null values (and hence lots of special 
characters) are very common. Are people prepared to accept a 2% 
regression on load times for such data sets?


cheers


andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com

--------------rIPA3bHtfK7RQfVKhiCu2jrV--