Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vT5xR-00Cjm2-2Q for pgsql-hackers@arkaria.postgresql.org; Tue, 09 Dec 2025 22:13:22 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vT5xQ-007bgU-0a for pgsql-hackers@arkaria.postgresql.org; Tue, 09 Dec 2025 22:13:20 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vT5xP-007bfw-2a for pgsql-hackers@lists.postgresql.org; Tue, 09 Dec 2025 22:13:20 +0000 Received: from mail-pg1-x529.google.com ([2607:f8b0:4864:20::529]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1vT5xM-0047fc-1R for pgsql-hackers@postgresql.org; Tue, 09 Dec 2025 22:13:19 +0000 Received: by mail-pg1-x529.google.com with SMTP id 41be03b00d2f7-c03ec27c42eso1272487a12.1 for ; Tue, 09 Dec 2025 14:13:15 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=enterprisedb.com; s=google; t=1765318393; x=1765923193; darn=postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=Hi3qOUSL/v3AwvhqwZJoKsRdPka8R8zwTw4ZgKykZTU=; b=bsahJXyAWLlBnQX8Xn7+4okmyGbpd/ZEALU4pZ8SXUdyxABmDMa7bmRB9LQZr1Bz6y GAPIJlmde+kERoTWF9j4MY+HL7332hXb4KyWwKX3dm7UX9xzlwsUYcvZB8cTt506sBAB LiRj3p0TzZWx6wiee4FYRsxBN2qDr+DIZzpmhMjDAjRLB2+X7EmdpA42F/OvvBw+4zdx a7IZtSIS1stkIHcaTkBCYlvCIXuMojai6ECICv3Vk8jDEudTg+uV4U7oMAzyjuF9F+s5 jmm/CWqiTpLT8As8nxf+zz+kjLikiWF1xJVSFnHosj8SkbkjkIOSEF5JISVU0MvMfvGd zjbw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1765318393; x=1765923193; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-gg:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=Hi3qOUSL/v3AwvhqwZJoKsRdPka8R8zwTw4ZgKykZTU=; b=pMR5/Q4qK8xBjRV6DWwzV+w3MP60SHtyo4QIcKb8Q7qrCL1bhq53M0tdyghwQ32Elb 52pf5bTttxJxpYsm4GZUkt55Zxfeu/LeId6SRHCtIZwIXHHMYYlvelgtc6ECAMPfs1Ct bhZrm4OF/1X0eUlm+z9gDzMHOQG0sICRbqWYdDVDdQFMsBmuBerwbZRZwN6hxAk871ny A/w5kHeFRbJYYsbioMvR7bL4rzLU/a4XgcNrGF+MPI0YKTqIqJe2OINRuQKr7FkA3Ld+ hmWD5qgw5gHkobSvw9xlhWaVvlS0Uptgnq79jPfxInrlLMjO1hESN5wrNdAJvOyvtXii mryg== X-Forwarded-Encrypted: i=1; AJvYcCX6gh8C451dFhEDKWh4GVUHueayCy449lC8jC9u0N/hiysAc6oiotATP/lg64MsGcmO33YN0akNZTwDBpFZ@postgresql.org X-Gm-Message-State: AOJu0YwZhI3WSXDhBYXXN6VDnR9xhhnLe6p1VG7Uc/WJ1lSvYHK35QFQ yjRM0vdP5R2l7onnlB329a5knDowgheKqG+aPQjiu/97LKrtnmkU0o+TCNHRuKrstcouHrolYI2 oE9Te8MaE3382S0Zu4R75ES4AMPXbCowxV69+T1fn X-Gm-Gg: AY/fxX6oRpMe4vlODbUFvPtB+4gLz3C4CPtITL3kJOQFTk9LENKSCnsWepeI4CNhvbI TXbAOe9HCWBqyYkPsVrZoMxWfT+e717TzblTE1OS0czzB/Gb/TtM1C/VolaDL/Kt5LPCmYyToP6 FaOwTCt4TKsemIyo+ELAsKGM+RtnKtbbeAtPz4EXMNS0fVtSTPU5o/E9fHeLBLs8S4VoZoUnLfk w2NcuaFI7gJseE01xXSyMeumFO/uvl/WghYpcH635VMtqHXdBh70S5sCBq8wUKxw6ROUbX/ X-Google-Smtp-Source: AGHT+IHGJvozrMIX7T+cqWoEayWE0AzVxZwcRj9J83BKuogDmRh0/XAGHu0dxmAOxIvEoyNzOFKf4sztjiti6esVJ90= X-Received: by 2002:a05:7300:ac83:b0:2a4:3593:4684 with SMTP id 5a478bee46e88-2ac0569c99emr463823eec.32.1765318393469; Tue, 09 Dec 2025 14:13:13 -0800 (PST) MIME-Version: 1.0 References: <5d81fbbb-7609-4445-9bc4-8af211fb7674@dunslane.net> <8e226753-57af-489a-bfbe-caa23dd71286@dunslane.net> In-Reply-To: From: Manni Wood Date: Tue, 9 Dec 2025 16:13:02 -0600 X-Gm-Features: AQt7F2pLzoy48djoaZQiUL26DHFnlxQnOvZUDkXh9Z1lHDqQM3UrJb5E4FwwkT4 Message-ID: Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD To: Bilal Yavuz Cc: KAZAR Ayoub , Nathan Bossart , Andrew Dunstan , Shinya Kato , PostgreSQL-development Content-Type: multipart/alternative; boundary="000000000000c22d6806458c37b5" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000c22d6806458c37b5 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Tue, Dec 9, 2025 at 7:40=E2=80=AFAM Bilal Yavuz wro= te: > Hi, > > On Sat, 6 Dec 2025 at 10:55, Bilal Yavuz wrote: > > > > Hi, > > > > On Sat, 6 Dec 2025 at 04:40, Manni Wood > wrote: > > > Hello, all. > > > > > > Andrew, I tried your suggestion of just reading the first chunk of th= e > copy file to determine if SIMD is worth using. Attached are v4 versions o= f > the patches showing a first attempt at doing that. > > > > Thank you for doing this! > > > > > I attached test.sh.txt to show how I've been testing, with 5 million > lines of the various copy file variations introduced by Ayub Kazar. > > > > > > The text copy with no special chars is 30% faster. The CSV copy with > no special chars is 48% faster. The text with 1/3rd escapes is 3% slower. > The CSV with 1/3rd quotes is 0.27% slower. > > > > > > This set of patches follows the simplest suggestion of just testing > the first N lines (actually first N bytes) of the file and then deciding > whether or not to enable SIMD. This set of patches does not follow Andrew= 's > later suggestion of maybe checking again every million lines or so. > > > > My input-generation script is not ready to share yet, but the inputs > > follow this format: text_${n}.input, where n represents the number of > > normal characters before the delimiter. For example: > > > > n =3D 0 -> "\n\n\n\n\n..." (no normal characters) > > n =3D 1 -> "a\n..." (1 normal character before the delimiter) > > ... > > n =3D 5 -> "aaaaa\n..." > > =E2=80=A6 continuing up to n =3D 32. > > > > Each line has 4096 chars and there are a total of 100000 lines in each > > input file. > > > > I only benchmarked the text format. I compared the latest heuristic I > > shared [1] with the current method. The benchmarks show roughly a ~16% > > regression at the worst case (n =3D 2), with regressions up to n =3D 5. > > For the remaining values, performance was similar. > > I tried to improve the v4 patchset. My changes are: > > 1 - I changed CopyReadLineText() to an inline function and sent the > use_simd variable as an argument to get help from inlining. > > 2 - A main for loop in the CopyReadLineText() function is called many > times, so I moved the use_simd check to the CopyReadLine() function. > > 3 - Instead of 'bytes_processed', I used 'chars_processed' because > cstate->bytes_processed is increased before we process them and this > can cause wrong results. > > 4 - Because of #2 and #3, instead of having > 'SPECIAL_CHAR_SIMD_THRESHOLD', I used the ratio of 'chars_processed / > special_chars_encountered' to determine whether we want to use SIMD. > > 5 - cstate->special_chars_encountered is incremented wrongly for the > CSV case. It is not incremented for the quote and escape delimiters. I > moved all increments of cstate->special_chars_encountered to the > central place and tried to optimize it but it still causes a > regression as it creates one more branching. > > With these changes, I am able to decrease the regression to %10 from > %16. Regression decreases to %7 if I modify #5 for the only text input > but I did not do that. > > My changes are in the 0003. > > -- > Regards, > Nazir Bilal Yavuz > Microsoft > Bilal Yavuz (Nazir Bilal Yavuz?), I did not get a chance to do any work on this today, but wanted to thank you for finding my logic errors in counting special chars for CSV, and hacking on my naive solution to make it faster. By attempting Andrew Dunstan's suggestion, I got a better feel for the reality that the "housekeeping" code produces a significant amount of overhead. --=20 -- Manni Wood EDB: https://www.enterprisedb.com --000000000000c22d6806458c37b5 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable


On Tue, Dec 9, = 2025 at 7:40=E2=80=AFAM Bilal Yavuz <byavuz81@gmail.com> wrote:
Hi,

On Sat, 6 Dec 2025 at 10:55, Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Hi,
>
> On Sat, 6 Dec 2025 at 04:40, Manni Wood <manni.wood@enterprisedb.com> = wrote:
> > Hello, all.
> >
> > Andrew, I tried your suggestion of just reading the first chunk o= f the copy file to determine if SIMD is worth using. Attached are v4 versio= ns of the patches showing a first attempt at doing that.
>
> Thank you for doing this!
>
> > I attached test.sh.txt to show how I've been testing, with 5 = million lines of the various copy file variations introduced by Ayub Kazar.=
> >
> > The text copy with no special chars is 30% faster. The CSV copy w= ith no special chars is 48% faster. The text with 1/3rd escapes is 3% slowe= r. The CSV with 1/3rd quotes is 0.27% slower.
> >
> > This set of patches follows the simplest suggestion of just testi= ng the first N lines (actually first N bytes) of the file and then deciding= whether or not to enable SIMD. This set of patches does not follow Andrew&= #39;s later suggestion of maybe checking again every million lines or so. >
> My input-generation script is not ready to share yet, but the inputs > follow this format: text_${n}.input, where n represents the number of<= br> > normal characters before the delimiter. For example:
>
> n =3D 0 -> "\n\n\n\n\n..." (no normal characters)
> n =3D 1 -> "a\n..." (1 normal character before the delimi= ter)
> ...
> n =3D 5 -> "aaaaa\n..."
> =E2=80=A6 continuing up to n =3D 32.
>
> Each line has 4096 chars and there are a total of 100000 lines in each=
> input file.
>
> I only benchmarked the text format. I compared the latest heuristic I<= br> > shared [1] with the current method. The benchmarks show roughly a ~16%=
> regression at the worst case (n =3D 2), with regressions up to n =3D 5= .
> For the remaining values, performance was similar.

I tried to improve the v4 patchset. My changes are:

1 - I changed CopyReadLineText() to an inline function and sent the
use_simd variable as an argument to get help from inlining.

2 - A main for loop in the CopyReadLineText() function is called many
times, so I moved the use_simd check to the CopyReadLine() function.

3 - Instead of 'bytes_processed', I used 'chars_processed' = because
cstate->bytes_processed is increased before we process them and this
can cause wrong results.

4 - Because of #2 and #3, instead of having
'SPECIAL_CHAR_SIMD_THRESHOLD', I used the ratio of 'chars_proce= ssed /
special_chars_encountered' to determine whether we want to use SIMD.
5 - cstate->special_chars_encountered is incremented wrongly for the
CSV case. It is not incremented for the quote and escape delimiters. I
moved all increments of cstate->special_chars_encountered to the
central place and tried to optimize it but it still causes a
regression as it creates one more branching.

With these changes, I am able to decrease the regression to %10 from
%16. Regression decreases to %7 if I modify #5 for the only text input
but I did not do that.

My changes are in the 0003.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Bilal Yavuz (Nazir Bil= al Yavuz?), I did not get a chance to do any work on this today, but wanted= to thank you for finding my logic errors in counting special chars for CSV= , and hacking on my naive solution to make it faster. By attempting Andrew = Dunstan's suggestion, I got a better feel for the reality that the &quo= t;housekeeping" code produces a significant amount of overhead.
<= span class=3D"gmail_signature_prefix">--
-- Manni Wood EDB: https://www.enterprisedb.com
--000000000000c22d6806458c37b5--