Re: Speed up COPY FROM text/CSV parsing using SIMD

From: Nazir Bilal Yavuz <[email protected]>
To: Shinya Kato <[email protected]>
Cc: [email protected]
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
Date: Mon, 11 Aug 2025 11:52:25 +0300
Message-ID: <CAN55FZ2AxiwSah7TiQoMB==r=JKT0bOtooCB7ov4xRrGkVmJ1A@mail.gmail.com> (raw)
In-Reply-To: <CAN55FZ247JdiT8Sd1SRiyOJxk3Ei=pDCL4kpdP=HqLRjOhKf1Q@mail.gmail.com>
References: <CAOzEurSW8cNr6TPKsjrstnPfhf4QyQqB4tnPXGGe8N4e_v7Jig@mail.gmail.com>
	<CAN55FZ247JdiT8Sd1SRiyOJxk3Ei=pDCL4kpdP=HqLRjOhKf1Q@mail.gmail.com>

Hi,

On Thu, 7 Aug 2025 at 14:15, Nazir Bilal Yavuz <[email protected]> wrote:
>
> On Thu, 7 Aug 2025 at 04:49, Shinya Kato <[email protected]> wrote:
> >
> > I have implemented SIMD optimization for the COPY FROM (FORMAT {csv,
> > text}) command and observed approximately a 5% performance
> > improvement. Please see the detailed test results below.
>
> Also, I did a benchmark on text format. I created a benchmark with
> line lengths in a table ranging from 1 byte to 1 megabyte. The peak
> improvement is at a line length of 4096 and is more than 20% [1]. I
> saw no regression on your patch.

I did the same benchmark for the CSV format. The peak improvement is
at a line length of 4096 and is more than 25% [1]. I saw a 5%
regression on the 1 byte benchmark; there are no other regressions.

> What do you think about adding SIMD to CopyReadAttributesText() and
> CopyReadAttributesCSV() functions? When I add your SIMD approach to
> CopyReadAttributesText() function, the improvement on the 4096 byte
> line length input [1] goes from 20% to 30%.

I tried using SIMD in CopyReadAttributesCSV() as well. The
improvement on the 4096 byte line length input [1] goes from 25% to
35%; the regression on the 1 byte input is unchanged.

CopyReadAttributesCSV() changes are attached as feedback v2.
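
For readers without the tree at hand, the chunked scan used in the
"not in quote" loop can be sketched outside PostgreSQL roughly as
below. This is only an illustration, not the patch's code: it uses raw
SSE2 intrinsics instead of the Vector8 helpers, relies on the GCC/Clang
builtin __builtin_ctz in place of pg_rightmost_one_pos32(), and the
function name copy_until_special() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Hypothetical helper (not part of the patch): copy bytes from src to dst
 * until the first quote or delimiter character, or until len bytes have
 * been consumed.  Returns the number of bytes copied; the special
 * character itself is not copied.
 */
static size_t
copy_until_special(char *dst, const char *src, size_t len,
				   char quotec, char delimc)
{
	size_t		i = 0;

	assert(dst != NULL && src != NULL);

#if defined(__SSE2__)
	{
		const __m128i quote = _mm_set1_epi8(quotec);
		const __m128i delim = _mm_set1_epi8(delimc);

		/* Only use the vector path while a full 16-byte chunk remains */
		while (len - i >= 16)
		{
			__m128i		chunk = _mm_loadu_si128((const __m128i *) (src + i));
			__m128i		match = _mm_or_si128(_mm_cmpeq_epi8(chunk, quote),
											 _mm_cmpeq_epi8(chunk, delim));
			uint32_t	mask = (uint32_t) _mm_movemask_epi8(match);

			if (mask != 0)
			{
				/* Lowest set bit is the offset of the first special byte */
				size_t		advance = (size_t) __builtin_ctz(mask);

				memcpy(dst + i, src + i, advance);
				return i + advance;
			}

			/* No special characters, so copy the whole chunk */
			memcpy(dst + i, src + i, 16);
			i += 16;
		}
	}
#endif

	/* Scalar tail, also the only path when SSE2 is unavailable */
	while (i < len && src[i] != quotec && src[i] != delimc)
	{
		dst[i] = src[i];
		i++;
	}
	return i;
}
```

In the patch itself the vector step only advances cur_ptr and
output_ptr, and the existing scalar code then interprets the special
character; in this sketch the caller would inspect src[n] for the
returned n instead.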

--
Regards,
Nazir Bilal Yavuz
Microsoft

From 203d648c4cf64c6d629f2abc719a371dd0393e22 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <[email protected]>
Date: Thu, 7 Aug 2025 13:27:34 +0300
Subject: [PATCH v2] Feedback

---
 src/backend/commands/copyfromparse.c | 176 ++++++++++++++++++++++++---
 1 file changed, 160 insertions(+), 16 deletions(-)

diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index 5aba0fa6cb7..7b83e64e23b 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -670,8 +670,12 @@ CopyLoadInputBuf(CopyFromState cstate)
 		/* If we now have some unconverted data, try to convert it */
 		CopyConvertBuf(cstate);
 
-		/* If we now have some more input bytes ready, return them */
-		if (INPUT_BUF_BYTES(cstate) > nbytes)
+		/*
+		 * If we now have at least sizeof(Vector8) input bytes ready, return
+		 * them. This is beneficial for SIMD processing in the
+		 * CopyReadLineText() function.
+		 */
+		if (INPUT_BUF_BYTES(cstate) > nbytes + sizeof(Vector8))
 			return;
 
 		/*
@@ -1322,7 +1326,7 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
 		 * unsafe with the old v2 COPY protocol, but we don't support that
 		 * anymore.
 		 */
-		if (input_buf_ptr >= copy_buf_len || need_data)
+		if (input_buf_ptr + sizeof(Vector8) >= copy_buf_len || need_data)
 		{
 			REFILL_LINEBUF;
 
@@ -1345,21 +1349,22 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
 		}
 
 #ifndef USE_NO_SIMD
+
 		/*
-		 * SIMD instructions are used here to efficiently scan the input buffer
-		 * for special characters (e.g., newline, carriage return, quotes, or
-		 * escape characters). This approach significantly improves performance
-		 * compared to byte-by-byte iteration, especially for large input
-		 * buffers.
+		 * SIMD instructions are used here to efficiently scan the input
+		 * buffer for special characters (e.g., newline, carriage return,
+		 * quotes, or escape characters). This approach significantly improves
+		 * performance compared to byte-by-byte iteration, especially for
+		 * large input buffers.
 		 *
-		 * However, SIMD optimization cannot be applied in the following cases:
-		 * - Inside quoted fields, where escape sequences and closing quotes
-		 *   require sequential processing to handle correctly.
-		 * - When the remaining buffer size is smaller than the size of a SIMD
-		 *   vector register, as SIMD operations require processing data in
-		 *   fixed-size chunks.
+		 * However, SIMD optimization cannot be applied in the following
+		 * cases: - Inside quoted fields, where escape sequences and closing
+		 * quotes require sequential processing to handle correctly. - When
+		 * the remaining buffer size is smaller than the size of a SIMD vector
+		 * register, as SIMD operations require processing data in fixed-size
+		 * chunks.
 		 */
-		if (!in_quote && copy_buf_len - input_buf_ptr >= sizeof(Vector8))
+		if (copy_buf_len - input_buf_ptr >= sizeof(Vector8))
 		{
 			Vector8		chunk;
 			Vector8		match;
@@ -1388,13 +1393,15 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
 				 * Found a special character. Advance up to that point and let
 				 * the scalar code handle it.
 				 */
-				int advance = pg_rightmost_one_pos32(mask);
+				int			advance = pg_rightmost_one_pos32(mask);
+
 				input_buf_ptr += advance;
 			}
 			else
 			{
 				/* No special characters found, so skip the entire chunk */
 				input_buf_ptr += sizeof(Vector8);
+				last_was_esc = false;
 				continue;
 			}
 		}
@@ -1650,6 +1657,11 @@ CopyReadAttributesText(CopyFromState cstate)
 	char	   *cur_ptr;
 	char	   *line_end_ptr;
 
+#ifndef USE_NO_SIMD
+	Vector8		bs = vector8_broadcast('\\');
+	Vector8		delim = vector8_broadcast(delimc);
+#endif
+
 	/*
 	 * We need a special case for zero-column tables: check that the input
 	 * line is empty, and return.
@@ -1717,6 +1729,44 @@ CopyReadAttributesText(CopyFromState cstate)
 		{
 			char		c;
 
+#ifndef USE_NO_SIMD
+			if (line_end_ptr - cur_ptr >= sizeof(Vector8))
+			{
+				Vector8		chunk;
+				Vector8		match;
+				uint32		mask;
+
+				/* Load a chunk of data into a vector register */
+				vector8_load(&chunk, (const uint8 *) cur_ptr);
+
+				/* Create a mask of all special characters we need to stop at */
+				match = vector8_or(vector8_eq(chunk, bs), vector8_eq(chunk, delim));
+
+				/* Check if we found any special characters */
+				mask = vector8_highbit_mask(match);
+				if (mask != 0)
+				{
+					/*
+					 * Found a special character. Advance up to that point and
+					 * let the scalar code handle it.
+					 */
+					int			advance = pg_rightmost_one_pos32(mask);
+
+					memcpy(output_ptr, cur_ptr, advance);
+					output_ptr += advance;
+					cur_ptr += advance;
+				}
+				else
+				{
+					/* No special characters found, so skip the entire chunk */
+					memcpy(output_ptr, cur_ptr, sizeof(Vector8));
+					output_ptr += sizeof(Vector8);
+					cur_ptr += sizeof(Vector8);
+					continue;
+				}
+			}
+#endif
+
 			end_ptr = cur_ptr;
 			if (cur_ptr >= line_end_ptr)
 				break;
@@ -1906,6 +1956,12 @@ CopyReadAttributesCSV(CopyFromState cstate)
 	char	   *cur_ptr;
 	char	   *line_end_ptr;
 
+#ifndef USE_NO_SIMD
+	Vector8		quote = vector8_broadcast(quotec);
+	Vector8		delim = vector8_broadcast(delimc);
+	Vector8		escape = vector8_broadcast(escapec);
+#endif
+
 	/*
 	 * We need a special case for zero-column tables: check that the input
 	 * line is empty, and return.
@@ -1972,6 +2028,50 @@ CopyReadAttributesCSV(CopyFromState cstate)
 			/* Not in quote */
 			for (;;)
 			{
+#ifndef USE_NO_SIMD
+				if (line_end_ptr - cur_ptr >= sizeof(Vector8))
+				{
+					Vector8		chunk;
+					Vector8		match;
+					uint32		mask;
+
+					/* Load a chunk of data into a vector register */
+					vector8_load(&chunk, (const uint8 *) cur_ptr);
+
+					/*
+					 * Create a mask of all special characters we need to stop
+					 * at
+					 */
+					match = vector8_or(vector8_eq(chunk, quote), vector8_eq(chunk, delim));
+
+					/* Check if we found any special characters */
+					mask = vector8_highbit_mask(match);
+					if (mask != 0)
+					{
+						/*
+						 * Found a special character. Advance up to that point
+						 * and let the scalar code handle it.
+						 */
+						int			advance = pg_rightmost_one_pos32(mask);
+
+						memcpy(output_ptr, cur_ptr, advance);
+						output_ptr += advance;
+						cur_ptr += advance;
+					}
+					else
+					{
+						/*
+						 * No special characters found, so skip the entire
+						 * chunk
+						 */
+						memcpy(output_ptr, cur_ptr, sizeof(Vector8));
+						output_ptr += sizeof(Vector8);
+						cur_ptr += sizeof(Vector8);
+						continue;
+					}
+				}
+#endif
+
 				end_ptr = cur_ptr;
 				if (cur_ptr >= line_end_ptr)
 					goto endfield;
@@ -1995,6 +2095,50 @@ CopyReadAttributesCSV(CopyFromState cstate)
 			/* In quote */
 			for (;;)
 			{
+#ifndef USE_NO_SIMD
+				if (line_end_ptr - cur_ptr >= sizeof(Vector8))
+				{
+					Vector8		chunk;
+					Vector8		match;
+					uint32		mask;
+
+					/* Load a chunk of data into a vector register */
+					vector8_load(&chunk, (const uint8 *) cur_ptr);
+
+					/*
+					 * Create a mask of all special characters we need to stop
+					 * at
+					 */
+					match = vector8_or(vector8_eq(chunk, quote), vector8_eq(chunk, escape));
+
+					/* Check if we found any special characters */
+					mask = vector8_highbit_mask(match);
+					if (mask != 0)
+					{
+						/*
+						 * Found a special character. Advance up to that point
+						 * and let the scalar code handle it.
+						 */
+						int			advance = pg_rightmost_one_pos32(mask);
+
+						memcpy(output_ptr, cur_ptr, advance);
+						output_ptr += advance;
+						cur_ptr += advance;
+					}
+					else
+					{
+						/*
+						 * No special characters found, so skip the entire
+						 * chunk
+						 */
+						memcpy(output_ptr, cur_ptr, sizeof(Vector8));
+						output_ptr += sizeof(Vector8);
+						cur_ptr += sizeof(Vector8);
+						continue;
+					}
+				}
+#endif
+
 				end_ptr = cur_ptr;
 				if (cur_ptr >= line_end_ptr)
 					ereport(ERROR,
-- 
2.50.1



Attachments:

  [text/plain] v2-0001-Feedback.txt (7.9K, 2-v2-0001-Feedback.txt)
