From: Neil Conway
Date: Wed, 21 Jan 2026 15:49:59 -0500
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
To: Nazir Bilal Yavuz
Cc: Manni Wood, KAZAR Ayoub, Nathan Bossart, Andrew Dunstan, Shinya Kato, PostgreSQL-development
References: <5d81fbbb-7609-4445-9bc4-8af211fb7674@dunslane.net> <8e226753-57af-489a-bfbe-caa23dd71286@dunslane.net>

A few suggestions:

* I'm curious if we'll see better performance on large inputs if we flush to `line_buf` periodically (e.g., at least every few thousand bytes or so). Otherwise we might see poor data cache behavior if large inputs with no control characters get evicted before we've copied them over. See the approach taken in escape_json_with_len() in utils/adt/json.c.

* Did you compare the approach taken in the patch with a simpler approach that just does:

if (!(vector8_has(chunk, '\\') ||
      vector8_has(chunk, '\r') ||
      vector8_has(chunk, '\n') /* and so on, accounting for CSV / escapec / quotec stuff */))
{
    /* skip chunk */
}

That's roughly what we do elsewhere (e.g., escape_json_with_len). It has the advantage of being more readable, along with potentially having fewer data dependencies.
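
To make the second suggestion a bit more concrete, here is a rough standalone sketch of the "check each chunk, skip it if it is clean" loop. It is not taken from the patch: `chunk_has()` is a scalar stand-in for `vector8_has()`, `CHUNK_SIZE` stands in for `sizeof(Vector8)`, and `skip_plain_chunks()` is a hypothetical name. It assumes CSV mode with a single-byte quote and escape character.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE 16			/* assumption: stand-in for sizeof(Vector8) */

/* Scalar stand-in for vector8_has(): does this chunk contain byte c? */
static bool
chunk_has(const char *chunk, char c)
{
	return memchr(chunk, c, CHUNK_SIZE) != NULL;
}

/*
 * Hypothetical helper: return how many leading bytes of buf can be skipped
 * in whole chunks because they contain no byte the parser must inspect.
 * The caller handles the remaining bytes one at a time.
 */
static size_t
skip_plain_chunks(const char *buf, size_t len, char quotec, char escapec)
{
	size_t		pos = 0;

	assert(buf != NULL || len == 0);

	while (len - pos >= CHUNK_SIZE)
	{
		const char *chunk = buf + pos;

		if (chunk_has(chunk, '\\') ||
			chunk_has(chunk, '\r') ||
			chunk_has(chunk, '\n') ||
			chunk_has(chunk, quotec) ||
			chunk_has(chunk, escapec))
			break;				/* special byte found: stop skipping */

		pos += CHUNK_SIZE;		/* skip chunk */
	}

	return pos;
}
```

Each `chunk_has()` call is independent of the others, which is where the "fewer data dependencies" hope comes from. Whether that actually beats the patch's approach would need benchmarking.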
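
For the first suggestion (periodic flushing), a similarly rough sketch follows. The names `copy_plain_prefix()` and `chunk_is_plain()` are hypothetical, and the `FLUSH_AFTER` value is a guess based on "every few thousand bytes", not a measured threshold. The `nflushes` counter exists only to make the behavior visible. In the real code the destination would be `line_buf` rather than a plain char array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE	16			/* assumption: stand-in for sizeof(Vector8) */
#define FLUSH_AFTER 4096		/* assumption: threshold needs benchmarking */

/* Scalar stand-in for the SIMD check: true if the chunk has no special byte. */
static bool
chunk_is_plain(const char *chunk)
{
	return memchr(chunk, '\\', CHUNK_SIZE) == NULL &&
		memchr(chunk, '\r', CHUNK_SIZE) == NULL &&
		memchr(chunk, '\n', CHUNK_SIZE) == NULL;
}

/*
 * Copy the leading run of special-free chunks from src to dst, flushing
 * whenever at least FLUSH_AFTER scanned bytes are pending, so the copy
 * happens while the source bytes are likely still in cache.
 * Returns the number of bytes copied; dst must have room for len bytes.
 */
static size_t
copy_plain_prefix(char *dst, const char *src, size_t len, int *nflushes)
{
	size_t		copied = 0;		/* bytes already written to dst */
	size_t		scanned = 0;	/* bytes checked so far */

	assert(nflushes != NULL);
	*nflushes = 0;

	while (len - scanned >= CHUNK_SIZE && chunk_is_plain(src + scanned))
	{
		scanned += CHUNK_SIZE;

		if (scanned - copied >= FLUSH_AFTER)
		{
			memcpy(dst + copied, src + copied, scanned - copied);
			copied = scanned;
			(*nflushes)++;
		}
	}

	/* Flush whatever is left once scanning stops. */
	if (scanned > copied)
	{
		memcpy(dst + copied, src + copied, scanned - copied);
		copied = scanned;
		(*nflushes)++;
	}

	return copied;
}
```

With a 10000-byte input whose first newline is at offset 9000, this does three copies (two of 4096 bytes and one of 800) instead of one 8992-byte copy at the end.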

Neil

On Wed, Dec 10, 2025 at 7:00 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> Hi,
>
> On Wed, 10 Dec 2025 at 01:13, Manni Wood <manni.wood@enterprisedb.com> wrote:
> >
> > Bilal Yavuz (Nazir Bilal Yavuz?),
>
> It is Nazir Bilal Yavuz, I changed some settings on my phone and it
> seems that it affected my mail account, hopefully it should be fixed
> now.
>
> > I did not get a chance to do any work on this today, but wanted to thank you for finding my logic errors in counting special chars for CSV, and hacking on my naive solution to make it faster. By attempting Andrew Dunstan's suggestion, I got a better feel for the reality that the "housekeeping" code produces a significant amount of overhead.
>
> You are welcome! v4.1 has some problems with in_quote case in SIMD
> handling code and counting cstate->chars_processed variable. I fixed
> them in v4.2.
>
> --
> Regards,
> Nazir Bilal Yavuz
> Microsoft