From: Nazir Bilal Yavuz <[email protected]>
To: KAZAR Ayoub <[email protected]>
Cc: Shinya Kato <[email protected]>
Cc: [email protected]
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
Date: Tue, 19 Aug 2025 15:33:38 +0300
Message-ID: <CAN55FZ1J+6eM=F5GreWEBMJcNV_gifYyYY1b6xpYzun=nWPhMQ@mail.gmail.com> (raw)
In-Reply-To: <CA+K2Ru=jHuz_Wpgar4Sobtxeb33qxx=o59ToOhZ=vpmkMqErnA@mail.gmail.com>
References: <CAOzEurSW8cNr6TPKsjrstnPfhf4QyQqB4tnPXGGe8N4e_v7Jig@mail.gmail.com>
	<CAN55FZ247JdiT8Sd1SRiyOJxk3Ei=pDCL4kpdP=HqLRjOhKf1Q@mail.gmail.com>
	<CAN55FZ2AxiwSah7TiQoMB==r=JKT0bOtooCB7ov4xRrGkVmJ1A@mail.gmail.com>
	<CAOzEurR5nFt=-SijfU7y0BHVcrT6RG9ovvdVfKt_uBZfEQew9w@mail.gmail.com>
	<CAOzEurSqgA69er9SzhPnXwmsVpO7-piUOuOy3dXcHOi__nSQcg@mail.gmail.com>
	<CA+K2RumC79NwWxBdofHOYo8SCSs0YCJic05Du=xOszRmoPf9FA@mail.gmail.com>
	<CAN55FZ0houfWHn8_MEEefhprZvc33jr07GrBYo+Bp2yw=TVnKA@mail.gmail.com>
	<CA+K2Ru=jHuz_Wpgar4Sobtxeb33qxx=o59ToOhZ=vpmkMqErnA@mail.gmail.com>

Hi,

On Thu, 14 Aug 2025 at 18:00, KAZAR Ayoub <[email protected]> wrote:
>> Thanks for running that benchmark! Would you mind sharing a reproducer
>> for the regression you observed?
>
> Of course, I attached the sql to generate the text and csv test files.
> If having a 1/3 of line length of special characters can be an exaggeration, something lower might still reproduce some regressions of course for the same idea.

Thank you so much!

I was able to reproduce the regressions you mentioned, but both
regressions are 20% on my end. By experimenting, I found that SIMD
causes a regression when it advances fewer than 5 characters.
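
For readers following along, "advance" here is the offset of the first
special byte inside one vector-sized chunk. Below is a minimal scalar
sketch of that computation. It assumes a 16-byte chunk, and the names
special_mask, rightmost_one_pos32 and chunk_advance are hypothetical
stand-ins, not the Vector8 helpers or pg_rightmost_one_pos32() used by
the patch. The set of special bytes is passed in rather than assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed chunk width for illustration; the real width is sizeof(Vector8). */
#define CHUNK_SIZE 16

/* Build a bitmask with bit i set when chunk[i] is one of the special bytes. */
static uint32_t
special_mask(const unsigned char *chunk,
			 const unsigned char *specials, size_t nspecials)
{
	uint32_t	mask = 0;

	for (int i = 0; i < CHUNK_SIZE; i++)
	{
		for (size_t j = 0; j < nspecials; j++)
		{
			if (chunk[i] == specials[j])
				mask |= (uint32_t) 1 << i;
		}
	}
	return mask;
}

/* Position of the least significant set bit; word must not be 0. */
static int
rightmost_one_pos32(uint32_t word)
{
	int			pos = 0;

	assert(word != 0);
	while ((word & 1) == 0)
	{
		word >>= 1;
		pos++;
	}
	return pos;
}

/*
 * Number of bytes that can be skipped in this chunk: the offset of the
 * first special byte, or the whole chunk when none is present.
 */
static int
chunk_advance(const unsigned char *chunk,
			  const unsigned char *specials, size_t nspecials)
{
	uint32_t	mask = special_mask(chunk, specials, nspecials);

	return mask != 0 ? rightmost_one_pos32(mask) : CHUNK_SIZE;
}
```

When special bytes are dense, this offset is small, so each vector
load and compare pays for only a few skipped bytes.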

So, I implemented a small heuristic. It works like this:

- If advance < 5 -> insert a sleep penalty (skip SIMD for the next n loop iterations).
- Each time advance < 5, n is doubled.
- Each time advance ≥ 5, n is halved.
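
The three rules above can be sketched as a small standalone state
machine. This is an illustration only: the struct and function names
are hypothetical, and unlike the patch, which decrements its counter
unconditionally, this sketch stops counting down at zero.

```c
#include <assert.h>
#include <stdint.h>

#define MIN_USEFUL_ADVANCE	5			/* threshold found by experiment */
#define MAX_PENALTY			INT16_MAX	/* cap on the penalty length */

/* Hypothetical container for the two counters the heuristic needs. */
typedef struct SimdBackoff
{
	int			skip_left;	/* scalar iterations left before retrying SIMD */
	int			penalty;	/* current penalty length n, always >= 1 */
} SimdBackoff;

static void
backoff_init(SimdBackoff *b)
{
	b->skip_left = 0;
	b->penalty = 1;
}

/* May the next loop iteration try the SIMD path? */
static int
backoff_simd_allowed(const SimdBackoff *b)
{
	return b->skip_left <= 0;
}

/* Record how many bytes the last SIMD attempt advanced. */
static void
backoff_record(SimdBackoff *b, int advance)
{
	assert(advance >= 0);

	if (advance < MIN_USEFUL_ADVANCE)
	{
		/* Short advance: double n (saturating) and start a penalty. */
		if (b->penalty >= MAX_PENALTY / 2)
			b->penalty = MAX_PENALTY;
		else
			b->penalty <<= 1;
		b->skip_left = b->penalty;
	}
	else
	{
		/* Useful advance: halve n, but never go below 1. */
		b->penalty >>= 1;
		if (b->penalty < 1)
			b->penalty = 1;
		b->skip_left = 0;
	}
}

/* Call once per character handled by the scalar path. */
static void
backoff_tick(SimdBackoff *b)
{
	if (b->skip_left > 0)
		b->skip_left--;
}
```

A caller would check backoff_simd_allowed() before each SIMD attempt,
call backoff_record() with the resulting advance, and call
backoff_tick() for every character processed the scalar way.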

I am sharing a POC patch to show the heuristic; it can be applied on
top of v1-0001. The heuristic version has the same performance
improvements as v1-0001, but the regression is 5% instead of 20%
compared to master.

--
Regards,
Nazir Bilal Yavuz
Microsoft

From aa55843b0c64bed9f72cf8cd7854df9df7ef989b Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <[email protected]>
Date: Tue, 19 Aug 2025 15:16:02 +0300
Subject: [PATCH v1] COPY SIMD: add heuristic to avoid regression on small
 advances
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When SIMD advances fewer than 5 characters, performance regresses.
To mitigate this, introduce a heuristic:

- If advance < 5 -> insert a sleep penalty (n cycles).
- Each time advance < 5, n is doubled.
- Each time advance ≥ 5, n is halved.
---
 src/backend/commands/copyfromparse.c | 42 ++++++++++++++++++++++++++--
 1 file changed, 40 insertions(+), 2 deletions(-)

diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index 5aba0fa6cb7..e58d7d4e353 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -1263,6 +1263,9 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
 	Vector8		bs = vector8_broadcast('\\');
 	Vector8		quote;
 	Vector8		escape;
+
+	int			sleep_cyle = 0;
+	int			last_sleep_cyle = 1;
 #endif
 
 	if (is_csv)
@@ -1359,7 +1362,7 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
 		 *   vector register, as SIMD operations require processing data in
 		 *   fixed-size chunks.
 		 */
-		if (!in_quote && copy_buf_len - input_buf_ptr >= sizeof(Vector8))
+		if (sleep_cyle <= 0 && !in_quote && copy_buf_len - input_buf_ptr >= sizeof(Vector8))
 		{
 			Vector8		chunk;
 			Vector8		match;
@@ -1390,14 +1393,49 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
 				 */
 				int advance = pg_rightmost_one_pos32(mask);
 				input_buf_ptr += advance;
+
+				/*
+				 * If we advance fewer than 5 characters, we cause a regression.
+				 * Sleep a bit then try again. Sleep time increases
+				 * exponentially.
+				 */
+				if (advance < 5)
+				{
+					if (last_sleep_cyle >= PG_INT16_MAX / 2)
+						last_sleep_cyle = PG_INT16_MAX;
+					else
+						last_sleep_cyle = last_sleep_cyle << 1;
+
+					sleep_cyle = last_sleep_cyle;
+				}
+
+				/*
+				 * If we advance more than 4 characters this means we have
+				 * performance improvement. Halve sleep time for next sleep.
+				 */
+				else
+				{
+					last_sleep_cyle = Max(last_sleep_cyle >> 1, 1);
+					sleep_cyle = 0;
+				}
 			}
 			else
 			{
-				/* No special characters found, so skip the entire chunk */
+				/*
+				 * No special characters found, so skip the entire chunk and
+				 * halve sleep time for next sleep.
+				 */
 				input_buf_ptr += sizeof(Vector8);
+				last_sleep_cyle = Max(last_sleep_cyle >> 1, 1);
 				continue;
 			}
 		}
+
+		/*
+		 * Vulnerable to overflow if we are in quote for more than INT16_MAX
+		 * characters.
+		 */
+		sleep_cyle--;
 #endif
 
 		/* OK to fetch a character */
-- 
2.50.1


