Re: Speed up COPY FROM text/CSV parsing using SIMD

From: Shinya Kato <[email protected]>
To: Nazir Bilal Yavuz <[email protected]>
Cc: [email protected]
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
Date: Wed, 13 Aug 2025 15:21:06 +0900
Message-ID: <CAOzEurSqgA69er9SzhPnXwmsVpO7-piUOuOy3dXcHOi__nSQcg@mail.gmail.com>
In-Reply-To: <CAOzEurR5nFt=-SijfU7y0BHVcrT6RG9ovvdVfKt_uBZfEQew9w@mail.gmail.com>
References: <CAOzEurSW8cNr6TPKsjrstnPfhf4QyQqB4tnPXGGe8N4e_v7Jig@mail.gmail.com>
	<CAN55FZ247JdiT8Sd1SRiyOJxk3Ei=pDCL4kpdP=HqLRjOhKf1Q@mail.gmail.com>
	<CAN55FZ2AxiwSah7TiQoMB==r=JKT0bOtooCB7ov4xRrGkVmJ1A@mail.gmail.com>
	<CAOzEurR5nFt=-SijfU7y0BHVcrT6RG9ovvdVfKt_uBZfEQew9w@mail.gmail.com>

On Tue, Aug 12, 2025 at 4:25 PM Shinya Kato <[email protected]> wrote:

> > +         * However, SIMD optimization cannot be applied in the following cases:
> > +         * - Inside quoted fields, where escape sequences and closing quotes
> > +         *   require sequential processing to handle correctly.
> >
> > I think you can continue SIMD inside quoted fields. Only important
> > thing is you need to set last_was_esc to false when SIMD skipped the
> > chunk.
>
> That's a clever point that last_was_esc should be reset to false when
> a SIMD chunk is skipped. You're right about that specific case.
>
> However, the core challenge is not what happens when we skip a chunk,
> but what happens when a chunk contains special characters like quotes
> or escapes. The main reason we avoid SIMD inside quoted fields is that
> the parsing logic becomes fundamentally sequential and
> context-dependent.
>
> To correctly parse a "" as a single literal quote, we must perform a
> lookahead to check the next character. This is an inherently
> sequential operation that doesn't map well to SIMD's parallel nature.
>
> Trying to handle this stateful logic with SIMD would lead to
> significant implementation complexity, especially with edge cases like
> an escape character falling on the last byte of a chunk.

Ah, you're right. My apologies, I misunderstood the implementation. It
appears that SIMD can be used even within quoted strings.

I think it would be better not to use the SIMD path when last_was_esc
is true. The next character is likely to be a special character, and
handling this case outside the SIMD loop would also improve
readability by consolidating the last_was_esc toggle logic in one
place.

Furthermore, in CSV mode the SIMD scan does not need to detect \n and
\r while inside a quoted field (in_quote), because they are ordinary
data there rather than line terminators.
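
To illustrate the two points above, here is a minimal standalone sketch.
It is not the patch code: CHUNK, chunk_has_special() and find_line_end()
are hypothetical names, the "vector compare" is emulated with a scalar
loop so it builds anywhere, and it assumes the escape character differs
from the quote character.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CHUNK 16				/* stand-in for sizeof(Vector8) */

/* Scalar stand-in for the vector compare: must this chunk take the slow path? */
static bool
chunk_has_special(const char *p, char quotec, char escapec, bool in_quote)
{
	for (size_t i = 0; i < CHUNK; i++)
	{
		char		c = p[i];

		if (c == quotec || c == escapec)
			return true;
		/* \n and \r only matter outside quotes */
		if (!in_quote && (c == '\n' || c == '\r'))
			return true;
	}
	return false;
}

/* Return the offset of the first \n or \r outside quotes, or len if none. */
static size_t
find_line_end(const char *buf, size_t len, char quotec, char escapec)
{
	bool		in_quote = false;
	bool		last_was_esc = false;
	size_t		i = 0;

	assert(buf != NULL);
	while (i < len)
	{
		char		c;

		/* Fast path: only when no escape is pending and a full chunk remains */
		if (!last_was_esc && len - i >= CHUNK &&
			!chunk_has_special(buf + i, quotec, escapec, in_quote))
		{
			i += CHUNK;
			continue;
		}

		/* Slow path: the only place where last_was_esc is toggled */
		c = buf[i];
		if (in_quote && c == escapec)
			last_was_esc = !last_was_esc;
		if (c == quotec && !last_was_esc)
			in_quote = !in_quote;
		if (c != escapec)
			last_was_esc = false;
		if (!in_quote && (c == '\n' || c == '\r'))
			return i;
		i++;
	}
	return len;
}
```

For example, find_line_end(buf, len, '"', '\\') skips whole chunks of a
long quoted field even when they contain \n. Unlike the patch, this
sketch re-tests a chunk after every slow-path byte instead of jumping
straight to the first special character; that keeps it short, not fast.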

+               last_was_esc = false;

Regarding the implementation, I believe we must set last_was_esc to
false when advancing input_buf_ptr, as shown in the code below. For
this reason, I think it’s best to keep the current logic for toggling
last_was_esc.

+               int advance = pg_rightmost_one_pos32(mask);
+               input_buf_ptr += advance;
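
For readers unfamiliar with the mask-and-advance step, here is a small
portable sketch of the idea. It assumes that bit i of the mask
corresponds to byte i of the chunk. special_mask(),
rightmost_one_pos32() and chunk_advance() are hypothetical scalar
stand-ins written for this mail, not the real vector8_highbit_mask()
or pg_rightmost_one_pos32().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK 16				/* stand-in for sizeof(Vector8) */

/* Scalar stand-in for compare + highbit mask: bit i is set iff byte i matches. */
static uint32_t
special_mask(const unsigned char *p, unsigned char a, unsigned char b)
{
	uint32_t	mask = 0;

	for (int i = 0; i < CHUNK; i++)
	{
		if (p[i] == a || p[i] == b)
			mask |= (uint32_t) 1 << i;
	}
	return mask;
}

/* Zero-based position of the lowest set bit; word must be nonzero. */
static int
rightmost_one_pos32(uint32_t word)
{
	int			pos = 0;

	assert(word != 0);
	while ((word & 1) == 0)
	{
		word >>= 1;
		pos++;
	}
	return pos;
}

/* Bytes that can be skipped before the first special byte (CHUNK if none). */
static size_t
chunk_advance(const unsigned char *p, unsigned char a, unsigned char b)
{
	uint32_t	mask = special_mask(p, a, b);

	return (mask != 0) ? (size_t) rightmost_one_pos32(mask) : CHUNK;
}
```

The skipped bytes contain no escape character, so a byte-by-byte loop
would have left last_was_esc false after each of them; the flag only
needs attention again once the scalar code reaches the special byte.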

I've attached a new patch that includes these changes. Further
modifications are still in progress.

-- 
Best regards,
Shinya Kato
NTT OSS Center


Attachments:

  [application/octet-stream] v2-0001-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patch (3.4K, 2-v2-0001-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patch)
From 69e16f8c7a52d967385a1dc9b1602bbd4472df60 Mon Sep 17 00:00:00 2001
From: Shinya Kato <[email protected]>
Date: Mon, 28 Jul 2025 22:08:20 +0900
Subject: [PATCH v2] Speed up COPY FROM text/CSV parsing using SIMD

---
 src/backend/commands/copyfromparse.c | 71 ++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)

diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index b1ae97b833d..f1a6ea81dd1 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -71,7 +71,9 @@
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "port/pg_bitutils.h"
 #include "port/pg_bswap.h"
+#include "port/simd.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
 
@@ -1255,6 +1257,14 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
 	char		quotec = '\0';
 	char		escapec = '\0';
 
+#ifndef USE_NO_SIMD
+	Vector8		nl = vector8_broadcast('\n');
+	Vector8		cr = vector8_broadcast('\r');
+	Vector8		bs = vector8_broadcast('\\');
+	Vector8		quote;
+	Vector8		escape;
+#endif
+
 	if (is_csv)
 	{
 		quotec = cstate->opts.quote[0];
@@ -1262,6 +1272,12 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
 		/* ignore special escape processing if it's the same as quotec */
 		if (quotec == escapec)
 			escapec = '\0';
+
+#ifndef USE_NO_SIMD
+		quote = vector8_broadcast(quotec);
+		if (quotec != escapec)
+			escape = vector8_broadcast(escapec);
+#endif
 	}
 
 	/*
@@ -1328,6 +1344,61 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
 			need_data = false;
 		}
 
+#ifndef USE_NO_SIMD
+		/*
+		 * Use SIMD instructions to efficiently scan the input buffer for
+		 * special characters (e.g., newline, carriage return, quote, and
+		 * escape). This is faster than byte-by-byte iteration, especially on
+		 * large buffers.
+		 *
+		 * We do not apply the SIMD fast path in either of the following cases:
+		 * - When the previously processed character was an escape character
+		 *   (last_was_esc), since the next byte must be examined sequentially.
+		 * - When the remaining buffer is smaller than one vector width
+		 *   (sizeof(Vector8)); SIMD operates on fixed-size chunks.
+		 */
+		if (!last_was_esc && copy_buf_len - input_buf_ptr >= sizeof(Vector8))
+		{
+			Vector8		chunk;
+			Vector8		match = vector8_broadcast(0);
+			uint32		mask;
+
+			/* Load a chunk of data into a vector register */
+			vector8_load(&chunk, (const uint8 *) &copy_input_buf[input_buf_ptr]);
+
+			/* \n and \r are not special inside quotes */
+			if (!in_quote)
+				match = vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, cr));
+
+			if (is_csv)
+			{
+				match = vector8_or(match, vector8_eq(chunk, quote));
+				if (escapec != '\0')
+					match = vector8_or(match, vector8_eq(chunk, escape));
+			}
+			else
+				match = vector8_or(match, vector8_eq(chunk, bs));
+
+			/* Check if we found any special characters */
+			mask = vector8_highbit_mask(match);
+			if (mask != 0)
+			{
+				/*
+				 * Found a special character. Advance up to that point and let
+				 * the scalar code handle it.
+				 */
+				int advance = pg_rightmost_one_pos32(mask);
+				input_buf_ptr += advance;
+			}
+			else
+			{
+				/* No special characters found, so skip the entire chunk */
+				input_buf_ptr += sizeof(Vector8);
+				continue;
+			}
+		}
+#endif
+
 		/* OK to fetch a character */
 		prev_raw_ptr = input_buf_ptr;
 		c = copy_input_buf[input_buf_ptr++];
-- 
2.47.3
