Content-Type: multipart/alternative;
 boundary="------------6M6F2h30AbqHmkN007Qrdguo"
Message-ID: <8e045899-2023-48b1-bd91-f8cdffeb511d@dunslane.net>
Date: Mon, 20 Oct 2025 16:31:58 -0400
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
To: Nathan Bossart <nathandbossart@gmail.com>
Cc: Nazir Bilal Yavuz <byavuz81@gmail.com>, KAZAR Ayoub <ma_kazar@esi.dz>,
 Shinya Kato <shinya11.kato@gmail.com>, pgsql-hackers@postgresql.org
References: 
 <CAOzEurR5nFt=-SijfU7y0BHVcrT6RG9ovvdVfKt_uBZfEQew9w@mail.gmail.com>
 <CAOzEurSqgA69er9SzhPnXwmsVpO7-piUOuOy3dXcHOi__nSQcg@mail.gmail.com>
 <CA+K2RumC79NwWxBdofHOYo8SCSs0YCJic05Du=xOszRmoPf9FA@mail.gmail.com>
 <CAN55FZ0houfWHn8_MEEefhprZvc33jr07GrBYo+Bp2yw=TVnKA@mail.gmail.com>
 <CA+K2Ru=jHuz_Wpgar4Sobtxeb33qxx=o59ToOhZ=vpmkMqErnA@mail.gmail.com>
 <CAN55FZ1J+6eM=F5GreWEBMJcNV_gifYyYY1b6xpYzun=nWPhMQ@mail.gmail.com>
 <CAN55FZ109W90Ux_EBEqkkU2TyNqBNhdhN_1XPRGo3iiZ2L9b=A@mail.gmail.com>
 <8615c983-1662-43b4-b0c9-49d194ac33aa@dunslane.net>
 <CAN55FZ1KF7XNpm2XyG=M-sFUODai=6Z8a11xE3s4YRBeBKY3tA@mail.gmail.com>
 <673d92f7-2489-475f-a208-9414ea35d4d8@dunslane.net> <aPZrg6lxb5bgy_px@nathan>
From: Andrew Dunstan <andrew@dunslane.net>
Content-Language: en-US
Autocrypt: addr=andrew@dunslane.net; keydata=
 xsBNBE7KWFkBCAClridxur2AIc7eW2AR7izbfp3EnNefie2HbLF0izW5Ik5UjX2HBXBx4syI
 gY6b0ugohXrr274+baoAlvSbq6cAoQuEVrk5IZFzt20b1Xkx65FwGSEj526yiKLocqkJceSq
 Xr9xcA5SGY+FZv441chh5SU92v4q6z+6LPpoHOh97ptAVXZYNTtU0LevyvD5lja0TzbvJm6C
 eFXitJfnm1pLEr0DGJCR/iUOl/N62Kh4855zZC7NHIjQHPOvV5Stz/l5ilDhvGVk+xkXFPys
 SjZoUr1rXhYLpiyi5sR0X9FHXT0KnGuz1F5ERO7ZTLSSQ6fJwPj6gOk9K+vvoKvoeql5ABEB
 AAHNJEFuZHJldyBEdW5zdGFuIDxhbmRyZXdAZHVuc2xhbmUubmV0PsLAlwQTAQgAQQIbAwIX
 gAIZAQULCQgHAwUVCgkICwUWAgMBAAIeBRYhBOQ+WEYd/Hy/RGkVpZn6f8tZ/DuBBQJoGNGd
 BQkdEO8nAAoJEJn6f8tZ/DuBq74H/jkTR4Zi3stbw+xC7v2u3QozssK7MYPL2AsVfh7OealS
 h182fiWXpfvmmAB7WUHbhk9GC2RAOnHI/2d2jgKaMLAHsGYOT0YopTVIwRY43fCw/mK67yxc
 wmDcX+zyKfLaivNbf5A7QPLNwda98bEAMSJ8Sn652Uc6cA8t3uKGsVzbRBQOoYzjgvBCfSrE
 9ql3PDNg0l4BfAqabd2f70ZUm9VAMEPrgv/v2xI7M2XiL4g5BVmqLCOwxLM8RMCotCuoweUr
 VO43DeBCIDwLxotMJKvGWDjBzQYlU1NPUAtNcz/gN9ITUe1VUGjyvGj4u1lxBOcQQUw7l1+T
 5moZ4iZxXzvOwE0ETspYWQEIANGc4zQULOxhbqO2dyD51YhqCNRmm9oKWaqf+wmW4tpDe/VV
 cxAnNizd4LWCHfzpb5cHAtGkOPePMfzWVf6nvdF7d3eglbtf59+zG7O7llV0xSSoFiieQBsr
 GvqDInXYX/4mRRXMtyhM353/tixC9RWLs1oofyYmCPPXXY7h9R7en3B8BoVrRFcdzlIY/NFN
 hFGW/9dkEiGjgna2Rk6e15kln4ZvFBWUg23p93w/pqXcxY6+k/8TEk+C4R+M6w7o2PLGOjdZ
 +kPiUcw5H85zf/yZJwQXzisXaNduwWB6Vads9YC9dj6kPR1c4VGRqAaYL++LAEOqrlvm2Tvq
 QqZRtnEAEQEAAcLAfAQYAQgAJgIbDBYhBOQ+WEYd/Hy/RGkVpZn6f8tZ/DuBBQJoGNI2BQkd
 EODdAAoJEJn6f8tZ/DuBfw0IAKTsfD40teP/pp+bsLLMSxPXUYrrprTj7WFB5v61p6dkpSr/
 qXmMlyahdxQFaPmfVgVirB1Vk/kHiWNnnGjfUV9nB2Zg9LI0Xb9/ts3LsUiRWXzG3tkMY6XL
 vsVOxW4XFRND9l2q+WW93aZ1DZl+fqWfYgMvsusFRhmGFOKTRfKPta2Pkv+AhA24N4+PrR5p
 bU4k2MO8PAGiK8eaYKGFG1bHKuAvoDoF7WXJ3FHxuWqLnKEt4dfOLm5pAe3zq1Lt6q8azT9i
 QWGpSAK5vQUWQHBHpiDjdPeqKZ6HiAXIIKfSmb+jrvXBqoP+D6/K7rUjG2aXiRtTIAXms9sm
 VRu7cmw=
In-Reply-To: <aPZrg6lxb5bgy_px@nathan>
Archived-At: 
 <https://www.postgresql.org/message-id/8e045899-2023-48b1-bd91-f8cdffeb511d%40dunslane.net>
Precedence: bulk

This is a multi-part message in MIME format.
--------------6M6F2h30AbqHmkN007Qrdguo
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit


On 2025-10-20 Mo 1:04 PM, Nathan Bossart wrote:
> On Mon, Oct 20, 2025 at 10:02:23AM -0400, Andrew Dunstan wrote:
>> On 2025-10-16 Th 10:29 AM, Nazir Bilal Yavuz wrote:
>>> With this heuristic the regression is limited by %2 in the worst case.
>> My worry is that the worst case is actually quite common. Sparse data sets
>> dominated by a lot of null values (and hence lots of special characters) are
>> very common. Are people prepared to accept a 2% regression on load times for
>> such data sets?
> Without knowing how common it is, I think it's difficult to judge whether
> 2% is a reasonable trade-off.  If <5% of workloads might see a small
> regression while the other >95% see double-digit percentage improvements,
> then I might argue that it's fine.  But I'm not sure we have any way to
> know those sorts of details at the moment.


I guess what I don't understand is why we actually need to do the test 
continuously, even using an adaptive algorithm. Data files in my 
experience usually have lines with fairly similar shapes. It's highly 
unlikely that you will get the first 1000 (say) lines of a file that 
are rich in special characters and then some later significant section 
that isn't, or vice versa. Therefore, doing the test once should yield 
the correct answer that can be applied to the rest of the file. That 
should reduce the worst case regression to ~0% without sacrificing any 
of the performance gains. I appreciate the elegance of what Bilal has 
done here, but it does seem like overkill.
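
To make the "test once" idea concrete, here is a minimal sketch in C. It 
is not taken from Bilal's patch: the names (ScanChooser, 
chooser_observe_line), the set of special characters, the 1000-line 
sample and the 5% threshold are all assumptions for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tuning knobs; neither value comes from the patch. */
#define SAMPLE_LINES        1000
#define SPECIAL_RATIO_LIMIT 0.05

typedef enum { PATH_UNDECIDED, PATH_SIMD, PATH_SCALAR } ScanPath;

typedef struct
{
    ScanPath path;          /* frozen once the sample is complete */
    size_t   lines_seen;
    size_t   bytes_seen;
    size_t   specials_seen;
} ScanChooser;

/* Assumed set of "special" characters: escape, quote and delimiter. */
static bool
is_special(char c)
{
    return c == '\\' || c == '"' || c == ',';
}

static void
chooser_init(ScanChooser *ch)
{
    ch->path = PATH_UNDECIDED;
    ch->lines_seen = 0;
    ch->bytes_seen = 0;
    ch->specials_seen = 0;
}

/*
 * Feed one input line to the chooser.  While sampling, count special
 * characters; after SAMPLE_LINES lines, pick a path once and never
 * revisit the decision.
 */
static ScanPath
chooser_observe_line(ScanChooser *ch, const char *line, size_t len)
{
    if (ch->path != PATH_UNDECIDED)
        return ch->path;    /* decided: no further per-line counting */

    for (size_t i = 0; i < len; i++)
    {
        if (is_special(line[i]))
            ch->specials_seen++;
    }
    ch->bytes_seen += len;
    ch->lines_seen++;

    if (ch->lines_seen >= SAMPLE_LINES)
    {
        double ratio = 0.0;

        if (ch->bytes_seen > 0)
            ratio = (double) ch->specials_seen / (double) ch->bytes_seen;

        /* Many specials: SIMD skipping rarely pays off, stay scalar. */
        ch->path = (ratio > SPECIAL_RATIO_LIMIT) ? PATH_SCALAR : PATH_SIMD;
    }
    return ch->path;
}
```

Once the sample is complete, the only remaining per-line cost is a 
single comparison, which is where the ~0% worst case would come from. 
The trade-off is the one described above: a file whose shape changes 
after the sample stays on whichever path was chosen.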

> I'm also at least a little skeptical about the 2% number.  IME that's
> generally within the noise range and can vary greatly between machines and
> test runs.
>

Fair point.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

--------------6M6F2h30AbqHmkN007Qrdguo
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: 7bit

<!DOCTYPE html>
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <p><br>
    </p>
    <div class="moz-cite-prefix">On 2025-10-20 Mo 1:04 PM, Nathan
      Bossart wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:aPZrg6lxb5bgy_px@nathan">
      <pre wrap="" class="moz-quote-pre">On Mon, Oct 20, 2025 at 10:02:23AM -0400, Andrew Dunstan wrote:
</pre>
      <blockquote type="cite">
        <pre wrap="" class="moz-quote-pre">On 2025-10-16 Th 10:29 AM, Nazir Bilal Yavuz wrote:
</pre>
        <blockquote type="cite">
          <pre wrap="" class="moz-quote-pre">With this heuristic the regression is limited by %2 in the worst case.
</pre>
        </blockquote>
        <pre wrap="" class="moz-quote-pre">
My worry is that the worst case is actually quite common. Sparse data sets
dominated by a lot of null values (and hence lots of special characters) are
very common. Are people prepared to accept a 2% regression on load times for
such data sets?
</pre>
      </blockquote>
      <pre wrap="" class="moz-quote-pre">
Without knowing how common it is, I think it's difficult to judge whether
2% is a reasonable trade-off.  If &lt;5% of workloads might see a small
regression while the other &gt;95% see double-digit percentage improvements,
then I might argue that it's fine.  But I'm not sure we have any way to
know those sorts of details at the moment.</pre>
    </blockquote>
    <p><br>
    </p>
    <p>I guess what I don't understand is why we actually need to do the
      test continuously, even using an adaptive algorithm. Data files in
      my experience usually have lines with fairly similar shapes. It's
      highly unlikely that you will get the first 1000 (say) lines
      of a file that are rich in special characters and then some later
      significant section that isn't, or vice versa. Therefore, doing
      the test once should yield the correct answer that can be applied
      to the rest of the file. That should reduce the worst case
      regression to ~0% without sacrificing any of the performance
      gains. I appreciate the elegance of what Bilal has done here, but
      it does seem like overkill.<br>
    </p>
    <p><span style="white-space: pre-wrap">
</span></p>
    <blockquote type="cite" cite="mid:aPZrg6lxb5bgy_px@nathan">
      <pre wrap="" class="moz-quote-pre">
I'm also at least a little skeptical about the 2% number.  IME that's
generally within the noise range and can vary greatly between machines and
test runs.

</pre>
    </blockquote>
    <p><br>
    </p>
    <p>Fair point.</p>
    <p><br>
    </p>
    <p>cheers</p>
    <p><br>
    </p>
    <p>andrew<br>
    </p>
    <pre class="moz-signature"
    signature-switch-id="d0437855-2267-4610-80b3-83167ec45b0b" cols="72">--
Andrew Dunstan
EDB: <a class="moz-txt-link-freetext" href="https://www.enterprisedb.com">https://www.enterprisedb.com</a>
</pre>
  </body>
</html>

--------------6M6F2h30AbqHmkN007Qrdguo--