Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1vAqU6-006WRK-GM for pgsql-hackers@arkaria.postgresql.org; Mon, 20 Oct 2025 14:03:38 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1vAqU5-000esG-FG for pgsql-hackers@arkaria.postgresql.org; Mon, 20 Oct 2025 14:03:36 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1vAqT1-000ahQ-KY for pgsql-hackers@lists.postgresql.org; Mon, 20 Oct 2025 14:02:30 +0000 Received: from mail-qv1-xf36.google.com ([2607:f8b0:4864:20::f36]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1vAqSy-002naI-0R for pgsql-hackers@postgresql.org; Mon, 20 Oct 2025 14:02:29 +0000 Received: by mail-qv1-xf36.google.com with SMTP id 6a1803df08f44-87c21c5d3eeso61857436d6.2 for ; Mon, 20 Oct 2025 07:02:26 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=dunslane-net.20230601.gappssmtp.com; s=20230601; t=1760968946; x=1761573746; darn=postgresql.org; h=in-reply-to:autocrypt:content-language:from:references:cc:to :subject:user-agent:mime-version:date:message-id:from:to:cc:subject :date:message-id:reply-to; bh=H3Cwe2KqLISyha1ytwLR88OHBf3bzpwmyjni2U+P2QE=; b=g+qFzhxBne39aEWKNoe1QcASHCtC4/24IWLELbe0Y2CoV/FJk0Bby2dYv0mY4x9q1W PEsm0tNTriSHpkKPjEI3AVG88ty/9Ul8kWjlLIw68C2IgP8PoqDJxia1JaD1htgRcIkJ h03e8bmPJ+niA4TJfBJ5kCEj/HneA4mS57X7h3LiWGly1D6UDrXSLozrHkGQATFTM5mD XjBvSzevzUcu8/wx80U7+l/puf3dnpJR8ei0xG5L44l8dZSw1BguQ/sevYz2pjSlOubb e026UQVbRFl3CZrIeukH1ZR+ujKRKXZzVtC4gfo/rYunDwNoR6MYQbLsIpFpSQUgYOa3 QFwg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1760968946; x=1761573746; h=in-reply-to:autocrypt:content-language:from:references:cc:to :subject:user-agent:mime-version:date:message-id:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=H3Cwe2KqLISyha1ytwLR88OHBf3bzpwmyjni2U+P2QE=; b=AFHzP0552P9sho2vIVhMLfo4iiEQdr/7msSYxFouWovSQ/G/Lxw6uHVEm0of5edbWZ 8IrRAofuwH11A7L18QGnbRs74YQrtF/7ED1MBe8JiL82Vf6eT6KjcBnLGHO5lJ+LFFdV U06/WXfAvpoqJtsYyZG9PxJDzO216WVLkRINCpuaEp2bOIzQH3xhOPwxnyeWBHrc2naH p47UjIdb/glWRjNp3k0uFjkT5zi9U2Ku37vHncLv05NGmb8pxDKjhiBMAH/+On97Y6pV TNvBmYm4EMoKA9beJbIF+/D1hG9AUvrNheqiaFht3lk80YGa66tggur9INCKT4/jUUnE CBpQ== X-Forwarded-Encrypted: i=1; AJvYcCWpTbA90GvnKUg21Pth5Or8M6Bkbh9Rsm1ZW50XOktcvfMyriEn+0oGx/AinUoUYpTjPRzAKfqkvT7rIUdZ@postgresql.org X-Gm-Message-State: AOJu0YyU+JNEU6UX+e4ZpdlSWKRM0vrX+/CCcRFPvuv1tQKqrui5imkM /F30w9ZNivC+wgn0hrV/MmlNu0WXGpGeIRE1kt0GdIJoXW7rfjbaPSWRKd6WpU2hdvk= X-Gm-Gg: ASbGncvje57ADkG06kaZuv1vVZDr8beltL9Lc4nvOwxiD6iXS65sXZ4fFOf+2w4yC9T EqCljasu1S8p0FWXs3mpNAKSyg8S30KaCF4yUC5+uf77WiWTA5TCGMiq/CFrzbC+mZuMk4LBFxW A1iYxd0uSlDD4HArzvsn2YTbV5tGZVJD23VxY/NISf6Dfq/a6BxXI8No8ZWVlltUYS3lbJhGKe9 40yXt7gwQjshZxfXmEcGe1l4wGXpvV179fBYyPm/iSa6Bmh3pp1oodz4lnLfdc2RNzXwszBZ3sw H90UQ7yvG80qlXG6mTs9M9rejo/BLRXEBCc1YV0DXaeeRU52e6thJAPsbi1odLVR5abx9ieVFVO JlN2G84vZvXjsT9sYipVtTYqe4hkDRExcgWMecTbFaSnkLeJ/WTy3t6cc/YH+a23wEiOnGK0gJs T6BhlqAOI4+MS2lujj5Q== X-Google-Smtp-Source: AGHT+IGhw3AkXDTiPddp0vyzsrtrxUNPHP8YWDlXX61DmxgEkgFbohL3agpw4YpHoGbP6avNtALlig== X-Received: by 2002:a05:6214:2aa6:b0:81c:ded8:7099 with SMTP id 6a1803df08f44-87c2081e90bmr191786646d6.58.1760968945629; Mon, 20 Oct 2025 07:02:25 -0700 (PDT) Received: from ?IPV6:2605:a601:a6b0:500::1cb? ([2605:a601:a6b0:500::1cb]) by smtp.googlemail.com with ESMTPSA id 6a1803df08f44-87d02d8e5a2sm51290266d6.60.2025.10.20.07.02.24 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Mon, 20 Oct 2025 07:02:24 -0700 (PDT) Content-Type: multipart/alternative; boundary="------------rIPA3bHtfK7RQfVKhiCu2jrV" Message-ID: <673d92f7-2489-475f-a208-9414ea35d4d8@dunslane.net> Date: Mon, 20 Oct 2025 10:02:23 -0400 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD To: Nazir Bilal Yavuz Cc: KAZAR Ayoub , Shinya Kato , pgsql-hackers@postgresql.org References: <8615c983-1662-43b4-b0c9-49d194ac33aa@dunslane.net> From: Andrew Dunstan Content-Language: en-US Autocrypt: addr=andrew@dunslane.net; keydata= xsBNBE7KWFkBCAClridxur2AIc7eW2AR7izbfp3EnNefie2HbLF0izW5Ik5UjX2HBXBx4syI gY6b0ugohXrr274+baoAlvSbq6cAoQuEVrk5IZFzt20b1Xkx65FwGSEj526yiKLocqkJceSq Xr9xcA5SGY+FZv441chh5SU92v4q6z+6LPpoHOh97ptAVXZYNTtU0LevyvD5lja0TzbvJm6C eFXitJfnm1pLEr0DGJCR/iUOl/N62Kh4855zZC7NHIjQHPOvV5Stz/l5ilDhvGVk+xkXFPys SjZoUr1rXhYLpiyi5sR0X9FHXT0KnGuz1F5ERO7ZTLSSQ6fJwPj6gOk9K+vvoKvoeql5ABEB AAHNJEFuZHJldyBEdW5zdGFuIDxhbmRyZXdAZHVuc2xhbmUubmV0PsLAlwQTAQgAQQIbAwIX gAIZAQULCQgHAwUVCgkICwUWAgMBAAIeBRYhBOQ+WEYd/Hy/RGkVpZn6f8tZ/DuBBQJoGNGd BQkdEO8nAAoJEJn6f8tZ/DuBq74H/jkTR4Zi3stbw+xC7v2u3QozssK7MYPL2AsVfh7OealS h182fiWXpfvmmAB7WUHbhk9GC2RAOnHI/2d2jgKaMLAHsGYOT0YopTVIwRY43fCw/mK67yxc wmDcX+zyKfLaivNbf5A7QPLNwda98bEAMSJ8Sn652Uc6cA8t3uKGsVzbRBQOoYzjgvBCfSrE 9ql3PDNg0l4BfAqabd2f70ZUm9VAMEPrgv/v2xI7M2XiL4g5BVmqLCOwxLM8RMCotCuoweUr VO43DeBCIDwLxotMJKvGWDjBzQYlU1NPUAtNcz/gN9ITUe1VUGjyvGj4u1lxBOcQQUw7l1+T 5moZ4iZxXzvOwE0ETspYWQEIANGc4zQULOxhbqO2dyD51YhqCNRmm9oKWaqf+wmW4tpDe/VV cxAnNizd4LWCHfzpb5cHAtGkOPePMfzWVf6nvdF7d3eglbtf59+zG7O7llV0xSSoFiieQBsr GvqDInXYX/4mRRXMtyhM353/tixC9RWLs1oofyYmCPPXXY7h9R7en3B8BoVrRFcdzlIY/NFN hFGW/9dkEiGjgna2Rk6e15kln4ZvFBWUg23p93w/pqXcxY6+k/8TEk+C4R+M6w7o2PLGOjdZ +kPiUcw5H85zf/yZJwQXzisXaNduwWB6Vads9YC9dj6kPR1c4VGRqAaYL++LAEOqrlvm2Tvq QqZRtnEAEQEAAcLAfAQYAQgAJgIbDBYhBOQ+WEYd/Hy/RGkVpZn6f8tZ/DuBBQJoGNI2BQkd EODdAAoJEJn6f8tZ/DuBfw0IAKTsfD40teP/pp+bsLLMSxPXUYrrprTj7WFB5v61p6dkpSr/ qXmMlyahdxQFaPmfVgVirB1Vk/kHiWNnnGjfUV9nB2Zg9LI0Xb9/ts3LsUiRWXzG3tkMY6XL vsVOxW4XFRND9l2q+WW93aZ1DZl+fqWfYgMvsusFRhmGFOKTRfKPta2Pkv+AhA24N4+PrR5p bU4k2MO8PAGiK8eaYKGFG1bHKuAvoDoF7WXJ3FHxuWqLnKEt4dfOLm5pAe3zq1Lt6q8azT9i QWGpSAK5vQUWQHBHpiDjdPeqKZ6HiAXIIKfSmb+jrvXBqoP+D6/K7rUjG2aXiRtTIAXms9sm VRu7cmw= In-Reply-To: List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk This is a multi-part message in MIME format. --------------rIPA3bHtfK7RQfVKhiCu2jrV Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit On 2025-10-16 Th 10:29 AM, Nazir Bilal Yavuz wrote: > Hi, > > On Thu, 21 Aug 2025 at 18:47, Andrew Dunstan wrote: >> >> On 2025-08-19 Tu 10:14 AM, Nazir Bilal Yavuz wrote: >>> Hi, >>> >>> On Tue, 19 Aug 2025 at 15:33, Nazir Bilal Yavuz wrote: >>>> I am able to reproduce the regression you mentioned but both >>>> regressions are %20 on my end. I found that (by experimenting) SIMD >>>> causes a regression if it advances less than 5 characters. >>>> >>>> So, I implemented a small heuristic. It works like that: >>>> >>>> - If advance < 5 -> insert a sleep penalty (n cycles). >>> 'sleep' might be a poor word choice here. I meant skipping SIMD for n >>> number of times. >>> >> I was thinking a bit about that this morning. I wonder if it might be >> better instead of having a constantly applied heuristic like this, it >> might be better to do a little extra accounting in the first, say, 1000 >> lines of an input file, and if less than some portion of the input is >> found to be special characters then switch to the SIMD code. What that >> portion should be would need to be determined by some experimentation >> with a variety of typical workloads, but given your findings 20% seems >> like a good starting point. > I implemented a heuristic something similar to this. It is a mix of > previous heuristic and your idea, it works like that: > > Overall logic is that we will not run SIMD for the entire line and we > decide if it is worth it to run SIMD for the next lines. > > 1 - We will try SIMD and decide if it is worth it to run SIMD. > 1.1 - If it is worth it, we will continue to run SIMD and we will > halve the simd_last_sleep_cycle variable. > 1.2 - If it is not worth it, we will double the simd_last_sleep_cycle > and we will not run SIMD for these many lines. > 1.3 - After skipping simd_last_sleep_cycle lines, we will go back to the #1. > Note: simd_last_sleep_cycle can not pass 1024, so we will run SIMD for > each 1024 lines at max. > > With this heuristic the regression is limited by %2 in the worst case. > My worry is that the worst case is actually quite common. Sparse data sets dominated by a lot of null values (and hence lots of special characters) are very common. Are people prepared to accept a 2% regression on load times for such data sets? cheers andrew -- Andrew Dunstan EDB:https://www.enterprisedb.com --------------rIPA3bHtfK7RQfVKhiCu2jrV Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: 8bit


On 2025-10-16 Th 10:29 AM, Nazir Bilal Yavuz wrote:
Hi,

On Thu, 21 Aug 2025 at 18:47, Andrew Dunstan <andrew@dunslane.net> wrote:

On 2025-08-19 Tu 10:14 AM, Nazir Bilal Yavuz wrote:
Hi,

On Tue, 19 Aug 2025 at 15:33, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
I am able to reproduce the regression you mentioned but both
regressions are %20 on my end. I found that (by experimenting) SIMD
causes a regression if it advances less than 5 characters.

So, I implemented a small heuristic. It works like that:

- If advance < 5 -> insert a sleep penalty (n cycles).
'sleep' might be a poor word choice here. I meant skipping SIMD for n
number of times.

I was thinking a bit about that this morning. I wonder if it might be
better instead of having a constantly applied heuristic like this, it
might be better to do a little extra accounting in the first, say, 1000
lines of an input file, and if less than some portion of the input is
found to be special characters then switch to the SIMD code. What that
portion should be would need to be determined by some experimentation
with a variety of typical workloads, but given your findings 20% seems
like a good starting point.
I implemented a heuristic something similar to this. It is a mix of
previous heuristic and your idea, it works like that:

Overall logic is that we will not run SIMD for the entire line and we
decide if it is worth it to run SIMD for the next lines.

1 - We will try SIMD and decide if it is worth it to run SIMD.
1.1 - If it is worth it, we will continue to run SIMD and we will
halve the simd_last_sleep_cycle variable.
1.2 - If it is not worth it, we will double the simd_last_sleep_cycle
and we will not run SIMD for these many lines.
1.3 - After skipping simd_last_sleep_cycle lines, we will go back to the #1.
Note: simd_last_sleep_cycle can not pass 1024, so we will run SIMD for
each 1024 lines at max.

With this heuristic the regression is limited by %2 in the worst case.


My worry is that the worst case is actually quite common. Sparse data sets dominated by a lot of null values (and hence lots of  special characters) are very common. Are people prepared to accept a 2% regression on load times for such data sets?


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
--------------rIPA3bHtfK7RQfVKhiCu2jrV--