Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1umNeQ-00Eg3r-FH for pgsql-hackers@arkaria.postgresql.org; Thu, 14 Aug 2025 02:25:10 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1umNeO-002BeA-VO for pgsql-hackers@arkaria.postgresql.org; Thu, 14 Aug 2025 02:25:09 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1umNeO-002Bdf-GK for pgsql-hackers@lists.postgresql.org; Thu, 14 Aug 2025 02:25:08 +0000 Received: from mail-ed1-x52c.google.com ([2a00:1450:4864:20::52c]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1umNeL-000Sml-0R for pgsql-hackers@postgresql.org; Thu, 14 Aug 2025 02:25:07 +0000 Received: by mail-ed1-x52c.google.com with SMTP id 4fb4d7f45d1cf-6188b65a2f0so883964a12.1 for ; Wed, 13 Aug 2025 19:25:05 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=esi.dz; s=google; t=1755138302; x=1755743102; darn=postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=kOEiHb2b2eaGcBeVTr/8mu8tUXSZGYeVPfshQocb1HQ=; b=CG9NdgJS3QQYicozjWdNXxezSQ0j3gR+y39ehaL9es8+LMGavnCCe9GakEXKgN1D38 Ew7XLC6OE3aQSIYk3sIjGFfUyiXVeCqhX2Kqq3tW5197kJfVKFCowrLKzAWQgt7G2jOb KW44inqQsKitxyJDkj0l8iYOus4JrMoO5yJLbQZDETepBjKofZw+19EZ/GklsFeKcRSp 0ryvFovYBmI3AhJ3dbzQjQKnJDEjZrflevn3vP23b8a+Z77+unp4njrevIlPTQXnZXAq 3CqVp3sDrNUr32wileHNLQ6gCwSnXK5Bu914MwHvEqTNTfumB3kc7Rb7Q1fSYHRt7AvB oqkA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1755138302; x=1755743102; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=kOEiHb2b2eaGcBeVTr/8mu8tUXSZGYeVPfshQocb1HQ=; b=C8tAs1ZX8egCnvKMuewvDjN5Ki9x2dl1CgUyCQ071Yj88soBwcUNSZgM/SL/sXKo0i g+hXeiFGdU2HU5UNf35ku4ku7jjWeiJN8NiJenbnqUUtz+ivOtP9nlUUqnStZUNqWxrm lYkgbLeIIlS2eZGgK2pqnqP6BaramEPGgzKM9GY2/gcb/CfMy0O8fGFtREjzdm7ldNNd FIlIm4mLyTk5S01pUJubmeVRoQcLnm24kBEzP4qb95E4hXnuJSCNXTIW8gtp3L0DXmqU kGDECiKviXgIIXKKfLRsbyFzbBw98e8GUVRXJnieAmy8Z317XAGBT5nMX18fEJtZYmRF yxLQ== X-Forwarded-Encrypted: i=1; AJvYcCXy5fy5mAPxNLLEb48/vvEqONiVPLX9mdpuqu5EDnywk4nvHvIi3MXi2Tw4/2SpjtMefgKQYa30gL3go6Ve@postgresql.org X-Gm-Message-State: AOJu0Yx9vv9a4uu4L6RlXjTPl60BE6kPvnWVcHxu2c5d4qyNQFv+6gAd 4SzFAA0axhCO3pjL6BkoMg37umWPPogp3cW8Ui0SvzG3Vm3bwrIy2wqjWP0KV627Lu/16h8SYEm EtTuwyN4sLk9ZkPcpJoHebU/XcNtZhn5fp3z8EIFT X-Gm-Gg: ASbGncsEpht4kn9MtL3rq6elniy4DYWeSaQlxRPKlzFeGz634fu2joH+nhnB/gDY+eB cP83MpauKnBmmI+zxjfenI4p1HqKHDKYjji1Xj8twrOl99F2HIJ1NfOFmUMqWKL2mwk2HimQagb d9EYyGGTR5L0bwKxEcS0ZEwe4w/wBfCmsByBN1lIdl+iRlXeqvUpebfd+o7Faw8+GJsOKrIMJfq yXq8/WK X-Google-Smtp-Source: AGHT+IHyHyhwIF28YP25wV9lHdWVRgjlaT5MqTgeh6roUaBPq1gB9ZEBdrcT0GFvhv+gTcVvQOnyFdmokPrkoPVnkQs= X-Received: by 2002:a05:6402:2751:b0:606:a26c:6f46 with SMTP id 4fb4d7f45d1cf-618919b54bbmr649141a12.19.1755138301972; Wed, 13 Aug 2025 19:25:01 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: KAZAR Ayoub Date: Thu, 14 Aug 2025 03:24:50 +0100 X-Gm-Features: Ac12FXxryrQwIrT-W-GfEiktjQYVwgjlWaOYvdp1R2q5fJwGtTY6zsHxfIFHCqs Message-ID: Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD To: Shinya Kato Cc: Nazir Bilal Yavuz , pgsql-hackers@postgresql.org Content-Type: multipart/alternative; boundary="000000000000056b3b063c49fbbb" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000056b3b063c49fbbb Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Following Nazir's findings about 4096 bytes being the performant line length, I did more benchmarks from my side on both TEXT and CSV formats with two different cases of normal data (no special characters) and data with many special characters. Results are con good as expected and similar to previous benchmarks ~30.9% faster copy in TEXT format ~32.4% faster copy in CSV format 20%-30% reduces cycles per instructions In the case of doing a lot of special characters in the lines (e.g., tables with large numbers of columns maybe), we obviously expect regressions here because of the overhead of many fallbacks to scalar processing. Results for a 1/3 of line length of special characters: ~43.9% slower copy in TEXT format ~16.7% slower copy in CSV format So for even less occurrences of special characters or wider distance between there might still be some regressions in this case, a non-significant case maybe, but can be treated in other patches if we consider to not use SIMD path sometimes. I hope this helps more and confirms the patch. Regards, Ayoub Kazar Le jeu. 14 ao=C3=BBt 2025 =C3=A0 01:55, Shinya Kato a =C3=A9crit : > On Tue, Aug 12, 2025 at 4:25=E2=80=AFPM Shinya Kato > wrote: > > > > + * However, SIMD optimization cannot be applied in the > following cases: > > > + * - Inside quoted fields, where escape sequences and closin= g > quotes > > > + * require sequential processing to handle correctly. > > > > > > I think you can continue SIMD inside quoted fields. Only important > > > thing is you need to set last_was_esc to false when SIMD skipped the > > > chunk. > > > > That's a clever point that last_was_esc should be reset to false when > > a SIMD chunk is skipped. You're right about that specific case. > > > > However, the core challenge is not what happens when we skip a chunk, > > but what happens when a chunk contains special characters like quotes > > or escapes. The main reason we avoid SIMD inside quoted fields is that > > the parsing logic becomes fundamentally sequential and > > context-dependent. > > > > To correctly parse a "" as a single literal quote, we must perform a > > lookahead to check the next character. This is an inherently > > sequential operation that doesn't map well to SIMD's parallel nature. > > > > Trying to handle this stateful logic with SIMD would lead to > > significant implementation complexity, especially with edge cases like > > an escape character falling on the last byte of a chunk. > > Ah, you're right. My apologies, I misunderstood the implementation. It > appears that SIMD can be used even within quoted strings. > > I think it would be better not to use the SIMD path when last_was_esc > is true. The next character is likely to be a special character, and > handling this case outside the SIMD loop would also improve > readability by consolidating the last_was_esc toggle logic in one > place. > > Furthermore, when inside a quote (in_quote) in CSV mode, the detection > of \n and \r can be disabled. > > + last_was_esc =3D false; > > Regarding the implementation, I believe we must set last_was_esc to > false when advancing input_buf_ptr, as shown in the code below. For > this reason, I think it=E2=80=99s best to keep the current logic for togg= ling > last_was_esc. > > + int advance =3D pg_rightmost_one_pos32(mask); > + input_buf_ptr +=3D advance; > > I've attached a new patch that includes these changes. Further > modifications are still in progress. > > -- > Best regards, > Shinya Kato > NTT OSS Center > --000000000000056b3b063c49fbbb Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Following Nazir's findings ab= out 4096 bytes being the performant line length, I did more benchmarks from= my side on both TEXT and CSV formats with two different cases of=C2=A0norm= al data (no special characters) and data with many special characters.
<= br>
Results are con good as expected and similar to previous benchmark= s
=C2=A0~30.9% faster copy in TEXT format
=C2=A0~32.4% faster= copy in CSV format
20%-30% reduces cycles per instructions
In the case of doing a lot of special characters in the lines (= e.g., tables with large numbers of columns maybe), we obviously expect regr= essions here because of the overhead=C2=A0of many fallbacks to scalar proce= ssing.
Results for a 1/3 of line length of special characters:
~43.9%= slower copy in TEXT format
~16.7% slower copy in CSV format
So for even less occurrences=C2=A0of special characters or wider distanc= e between there might still be some regressions in this case, a non-signifi= cant case maybe, but can be treated in other patches if we consider to not = use SIMD path sometimes.

I hope this helps more and confi= rms the patch.

Regards,
Ayoub Kazar

Le=C2=A0jeu. 14 ao=C3=BBt 2025 =C3=A0=C2=A001:55, Shiny= a Kato <shinya11.kato@gmail.c= om> a =C3=A9crit=C2=A0:
On Tue, Aug 12, 2025 at 4:25=E2=80=AFPM Shinya Kato <shinya11.kato@gmai= l.com> wrote:

> > +=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0* However, SIMD optimization c= annot be applied in the following cases:
> > +=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0* - Inside quoted fields, wher= e escape sequences and closing quotes
> > +=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0*=C2=A0 =C2=A0require sequenti= al processing to handle correctly.
> >
> > I think you can continue SIMD inside quoted fields. Only importan= t
> > thing is you need to set last_was_esc to false when SIMD skipped = the
> > chunk.
>
> That's a clever point that last_was_esc should be reset to false w= hen
> a SIMD chunk is skipped. You're right about that specific case. >
> However, the core challenge is not what happens when we skip a chunk,<= br> > but what happens when a chunk contains special characters like quotes<= br> > or escapes. The main reason we avoid SIMD inside quoted fields is that=
> the parsing logic becomes fundamentally sequential and
> context-dependent.
>
> To correctly parse a "" as a single literal quote, we must p= erform a
> lookahead to check the next character. This is an inherently
> sequential operation that doesn't map well to SIMD's parallel = nature.
>
> Trying to handle this stateful logic with SIMD would lead to
> significant implementation complexity, especially with edge cases like=
> an escape character falling on the last byte of a chunk.

Ah, you're right. My apologies, I misunderstood the implementation. It<= br> appears that SIMD can be used even within quoted strings.

I think it would be better not to use the SIMD path when last_was_esc
is true. The next character is likely to be a special character, and
handling this case outside the SIMD loop would also improve
readability by consolidating the last_was_esc toggle logic in one
place.

Furthermore, when inside a quote (in_quote) in CSV mode, the detection
of \n and \r can be disabled.

+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0last_was_esc =3D fa= lse;

Regarding the implementation, I believe we must set last_was_esc to
false when advancing input_buf_ptr, as shown in the code below. For
this reason, I think it=E2=80=99s best to keep the current logic for toggli= ng
last_was_esc.

+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0int advance =3D pg_= rightmost_one_pos32(mask);
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0input_buf_ptr +=3D = advance;

I've attached a new patch that includes these changes. Further
modifications are still in progress.

--
Best regards,
Shinya Kato
NTT OSS Center
--000000000000056b3b063c49fbbb--