Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1vAwY5-0083Rh-4h for pgsql-hackers@arkaria.postgresql.org; Mon, 20 Oct 2025 20:32:08 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1vAwY3-003sHg-Ra for pgsql-hackers@arkaria.postgresql.org; Mon, 20 Oct 2025 20:32:06 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1vAwY3-003sHX-Bd for pgsql-hackers@lists.postgresql.org; Mon, 20 Oct 2025 20:32:06 +0000 Received: from mail-qk1-x733.google.com ([2607:f8b0:4864:20::733]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1vAwXz-003IKS-0Y for pgsql-hackers@postgresql.org; Mon, 20 Oct 2025 20:32:05 +0000 Received: by mail-qk1-x733.google.com with SMTP id af79cd13be357-8909f01bd00so531950885a.0 for ; Mon, 20 Oct 2025 13:32:02 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=dunslane-net.20230601.gappssmtp.com; s=20230601; t=1760992321; x=1761597121; darn=postgresql.org; h=in-reply-to:autocrypt:content-language:from:references:cc:to :subject:user-agent:mime-version:date:message-id:from:to:cc:subject :date:message-id:reply-to; bh=Jp3Vs6ArnfNDD73GpepBWURmmm4d8m9oOZGXRsbXDIs=; b=cxlHD2YA1l/4f4XBnem5SwqxBKcIMTtaJWduxUHvy0mct09/Si0eQqnmUBRhKuzKvi dW/LeJw6LTi/MQhku8VPXcnUhztIdhTbamUDyYMIW06c8lkMHrzEAjUEqkkRZIqTywAu 1lxIUw2dLYC08SP/Clf8Ymxy4vp4nAdD7luAtot0QvhH2hSHJYbvZ4jWcLolwgvE00dG fbcDLvLBSf+6HeDFenBI74jSvXVYUqq4TN3ih6R7AmIzNq1zHONx26BPx9p4cw5f0pzG 8CIaQq+r4bkFQudEwsc6hKmQGMiYTPUNbdMdZDeYyQ4OSYh2kVqZxuPQlMZzDiTvs5rz RVyQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1760992321; x=1761597121; h=in-reply-to:autocrypt:content-language:from:references:cc:to :subject:user-agent:mime-version:date:message-id:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=Jp3Vs6ArnfNDD73GpepBWURmmm4d8m9oOZGXRsbXDIs=; b=YAy1BSimsj1tYa0jpRdbmv54AV2vcFBa7eqbVyXDa4+bY41Ofx6URflYw7s7qT4rYC DXuzgGLYoPb51tr5dm2GVhIKOVbdT3S1MbBgkH4T9Y0oBLrasDbAnnZUGEWa7njvnc/n 45iJYHat/nZ+WQqc8k9npHUzB0MPUtEF4cz0vb99UNkdgGI2yQvWtXqtRnI4WAuHD2Di au5Mvl1wVRyQK2BUGxNAw+XIVrMDjRok+yO/C8frEcxCTJY9o6ecnedU71RI3fPP3iGn 2doOGEKg/SfpnYSTCNKUrw03DDyauhLhzejX0s14GC7UPriYEJqgC/JaqFzODXzqHQs6 8dvw== X-Forwarded-Encrypted: i=1; AJvYcCXCokTj5YGxCsVqk/K+1paH3kimLvTvypcGWYowzmTBliUDeHpiUqFq9fA6txkQVNdJEQ86tUNHdIIVswyX@postgresql.org X-Gm-Message-State: AOJu0Yyv8I4De56rN5zYZVkJT3eNjb1pstbGQ7g0xkx+AbT1CPX7TIOK resrXEw8JitIitt0MzHI+sSTW/AS26ZOsrDTVvKSf28C/2R3lB5zMdblGe/qo5grMbg= X-Gm-Gg: ASbGncs80DowZv8wW7FY52mQCFyX+bOlbS77C0ZeL2Gs3tFa9rEHkHXx+WNLH+oVvcb H5Gw34TMKvFrGBkXtfY4e75h3Ady4g2QviRDFiCoEiQ3FvOH4/OukvFu4lQdXra06+M8DNo1u/t IZAm2wiqP6uPiFQ4XlENmuqZfwIgtlsvuPCJyZ7AmNf8QbQeJkG2bGwIj7TIZJZeAaaEWwDLwA6 XiaUJJXUqz51+rV9QdaN0P1QpTX3bhnqSxqmRNnzkau8k45s3wwEWaC/ou2ykMyklO8VjgdDkrt qqK0WpktQyIP5Td4cEoIVIMA+qWqiZAANvQnwSrRzhhnmBXR1lfLr430HkrIc++YS1aBHPycQfn 0SQJrUNWBdN5iFAptUwWSTG9gvjxx7TBrc4GrTnLBYo1kiYIJurJ8ZPTT+Q4JVZ7fuMGqfIpP8e IC5YVYQK8nJ9zRpTgAWA== X-Google-Smtp-Source: AGHT+IFhQoeOkOwqbV1lBXMgXkR4vzm6z4UUZkhqP+i0rbx5iBj28IpU8ZnNWnVIaOPZX64ZeOtIVg== X-Received: by 2002:a05:622a:8604:b0:4e8:a2dd:34b1 with SMTP id d75a77b69052e-4e8a2dd3974mr99582621cf.65.1760992320426; Mon, 20 Oct 2025 13:32:00 -0700 (PDT) Received: from ?IPV6:2605:a601:a6b0:500::1cb? ([2605:a601:a6b0:500::1cb]) by smtp.googlemail.com with ESMTPSA id d75a77b69052e-4e8b678c47bsm46857021cf.16.2025.10.20.13.31.59 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Mon, 20 Oct 2025 13:31:59 -0700 (PDT) Content-Type: multipart/alternative; boundary="------------6M6F2h30AbqHmkN007Qrdguo" Message-ID: <8e045899-2023-48b1-bd91-f8cdffeb511d@dunslane.net> Date: Mon, 20 Oct 2025 16:31:58 -0400 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD To: Nathan Bossart Cc: Nazir Bilal Yavuz , KAZAR Ayoub , Shinya Kato , pgsql-hackers@postgresql.org References: <8615c983-1662-43b4-b0c9-49d194ac33aa@dunslane.net> <673d92f7-2489-475f-a208-9414ea35d4d8@dunslane.net> From: Andrew Dunstan Content-Language: en-US Autocrypt: addr=andrew@dunslane.net; keydata= xsBNBE7KWFkBCAClridxur2AIc7eW2AR7izbfp3EnNefie2HbLF0izW5Ik5UjX2HBXBx4syI gY6b0ugohXrr274+baoAlvSbq6cAoQuEVrk5IZFzt20b1Xkx65FwGSEj526yiKLocqkJceSq Xr9xcA5SGY+FZv441chh5SU92v4q6z+6LPpoHOh97ptAVXZYNTtU0LevyvD5lja0TzbvJm6C eFXitJfnm1pLEr0DGJCR/iUOl/N62Kh4855zZC7NHIjQHPOvV5Stz/l5ilDhvGVk+xkXFPys SjZoUr1rXhYLpiyi5sR0X9FHXT0KnGuz1F5ERO7ZTLSSQ6fJwPj6gOk9K+vvoKvoeql5ABEB AAHNJEFuZHJldyBEdW5zdGFuIDxhbmRyZXdAZHVuc2xhbmUubmV0PsLAlwQTAQgAQQIbAwIX gAIZAQULCQgHAwUVCgkICwUWAgMBAAIeBRYhBOQ+WEYd/Hy/RGkVpZn6f8tZ/DuBBQJoGNGd BQkdEO8nAAoJEJn6f8tZ/DuBq74H/jkTR4Zi3stbw+xC7v2u3QozssK7MYPL2AsVfh7OealS h182fiWXpfvmmAB7WUHbhk9GC2RAOnHI/2d2jgKaMLAHsGYOT0YopTVIwRY43fCw/mK67yxc wmDcX+zyKfLaivNbf5A7QPLNwda98bEAMSJ8Sn652Uc6cA8t3uKGsVzbRBQOoYzjgvBCfSrE 9ql3PDNg0l4BfAqabd2f70ZUm9VAMEPrgv/v2xI7M2XiL4g5BVmqLCOwxLM8RMCotCuoweUr VO43DeBCIDwLxotMJKvGWDjBzQYlU1NPUAtNcz/gN9ITUe1VUGjyvGj4u1lxBOcQQUw7l1+T 5moZ4iZxXzvOwE0ETspYWQEIANGc4zQULOxhbqO2dyD51YhqCNRmm9oKWaqf+wmW4tpDe/VV cxAnNizd4LWCHfzpb5cHAtGkOPePMfzWVf6nvdF7d3eglbtf59+zG7O7llV0xSSoFiieQBsr GvqDInXYX/4mRRXMtyhM353/tixC9RWLs1oofyYmCPPXXY7h9R7en3B8BoVrRFcdzlIY/NFN hFGW/9dkEiGjgna2Rk6e15kln4ZvFBWUg23p93w/pqXcxY6+k/8TEk+C4R+M6w7o2PLGOjdZ +kPiUcw5H85zf/yZJwQXzisXaNduwWB6Vads9YC9dj6kPR1c4VGRqAaYL++LAEOqrlvm2Tvq QqZRtnEAEQEAAcLAfAQYAQgAJgIbDBYhBOQ+WEYd/Hy/RGkVpZn6f8tZ/DuBBQJoGNI2BQkd EODdAAoJEJn6f8tZ/DuBfw0IAKTsfD40teP/pp+bsLLMSxPXUYrrprTj7WFB5v61p6dkpSr/ qXmMlyahdxQFaPmfVgVirB1Vk/kHiWNnnGjfUV9nB2Zg9LI0Xb9/ts3LsUiRWXzG3tkMY6XL vsVOxW4XFRND9l2q+WW93aZ1DZl+fqWfYgMvsusFRhmGFOKTRfKPta2Pkv+AhA24N4+PrR5p bU4k2MO8PAGiK8eaYKGFG1bHKuAvoDoF7WXJ3FHxuWqLnKEt4dfOLm5pAe3zq1Lt6q8azT9i QWGpSAK5vQUWQHBHpiDjdPeqKZ6HiAXIIKfSmb+jrvXBqoP+D6/K7rUjG2aXiRtTIAXms9sm VRu7cmw= In-Reply-To: List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk This is a multi-part message in MIME format. --------------6M6F2h30AbqHmkN007Qrdguo Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit On 2025-10-20 Mo 1:04 PM, Nathan Bossart wrote: > On Mon, Oct 20, 2025 at 10:02:23AM -0400, Andrew Dunstan wrote: >> On 2025-10-16 Th 10:29 AM, Nazir Bilal Yavuz wrote: >>> With this heuristic the regression is limited by %2 in the worst case. >> My worry is that the worst case is actually quite common. Sparse data sets >> dominated by a lot of null values (and hence lots of special characters) are >> very common. Are people prepared to accept a 2% regression on load times for >> such data sets? > Without knowing how common it is, I think it's difficult to judge whether > 2% is a reasonable trade-off. If <5% of workloads might see a small > regression while the other >95% see double-digit percentage improvements, > then I might argue that it's fine. But I'm not sure we have any way to > know those sorts of details at the moment. I guess what I don't understand is why we actually need to do the test continuously, even using an adaptive algorithm. Data files in my experience usually have lines with fairly similar shapes. It's highly unlikely that you will get the the first 1000 (say) lines of a file that are rich in special characters and then some later significant section that isn't, or vice versa. Therefore, doing the test once should yield the correct answer that can be applied to the rest of the file. That should reduce the worst case regression to ~0% without sacrificing any of the performance gains. I appreciate the elegance of what Bilal has done here, but it does seem like overkill. > I'm also at least a little skeptical about the 2% number. IME that's > generally within the noise range and can vary greatly between machines and > test runs. > Fair point. cheers andrew -- Andrew Dunstan EDB:https://www.enterprisedb.com --------------6M6F2h30AbqHmkN007Qrdguo Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: 7bit


On 2025-10-20 Mo 1:04 PM, Nathan Bossart wrote:
On Mon, Oct 20, 2025 at 10:02:23AM -0400, Andrew Dunstan wrote:
On 2025-10-16 Th 10:29 AM, Nazir Bilal Yavuz wrote:
With this heuristic the regression is limited by %2 in the worst case.
My worry is that the worst case is actually quite common. Sparse data sets
dominated by a lot of null values (and hence lots of special characters) are
very common. Are people prepared to accept a 2% regression on load times for
such data sets?
Without knowing how common it is, I think it's difficult to judge whether
2% is a reasonable trade-off.  If <5% of workloads might see a small
regression while the other >95% see double-digit percentage improvements,
then I might argue that it's fine.  But I'm not sure we have any way to
know those sorts of details at the moment.


I guess what I don't understand is why we actually need to do the test continuously, even using an adaptive algorithm. Data files in my experience usually have lines with fairly similar shapes. It's highly unlikely that you will get the the first 1000 (say) lines of a file that are rich in special characters and then some later significant section that isn't, or vice versa. Therefore, doing the test once should yield the correct answer that can be applied to the rest of the file. That should reduce the worst case regression to ~0% without sacrificing any of the performance gains. I appreciate the elegance of what Bilal has done here, but it does seem like overkill.

I'm also at least a little skeptical about the 2% number.  IME that's
generally within the noise range and can vary greatly between machines and
test runs.


Fair point.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
--------------6M6F2h30AbqHmkN007Qrdguo--