From: Bilal Yavuz
Date: Sat, 6 Dec 2025 10:55:50 +0300
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
To: Manni Wood
Cc: KAZAR Ayoub, Nathan Bossart, Andrew Dunstan, Shinya Kato, PostgreSQL-development

Hi,

On Sat, 6 Dec 2025 at 04:40, Manni Wood wrote:
>
> Hello, all.
>
> Andrew, I tried your suggestion of just reading the first chunk of the copy file to determine if SIMD is worth using. Attached are v4 versions of the patches showing a first attempt at doing that.

Thank you for doing this!

> I attached test.sh.txt to show how I've been testing, with 5 million lines of the various copy file variations introduced by Ayub Kazar.
>
> The text copy with no special chars is 30% faster. The CSV copy with no special chars is 48% faster. The text with 1/3rd escapes is 3% slower. The CSV with 1/3rd quotes is 0.27% slower.
>
> This set of patches follows the simplest suggestion of just testing the first N lines (actually first N bytes) of the file and then deciding whether or not to enable SIMD. This set of patches does not follow Andrew's later suggestion of maybe checking again every million lines or so.

My input-generation script is not ready to share yet, but the inputs follow this format: text_${n}.input, where n represents the number of normal characters before the delimiter. For example:

n = 0 -> "\n\n\n\n\n..." (no normal characters)
n = 1 -> "a\n..." (1 normal character before the delimiter)
...
n = 5 -> "aaaaa\n..."
… continuing up to n = 32.

Each line has 4096 chars and there are a total of 100000 lines in each input file.

I only benchmarked the text format. I compared the latest heuristic I shared [1] with the current method. The benchmarks show a regression of roughly 16% in the worst case (n = 2), with regressions up to n = 5. For the remaining values, performance was similar.
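Since the generation script itself has not been shared, here is only a rough sketch of how inputs with the layout described above could be produced. Everything in it is an assumption: that "\n" in the examples means the two-character escape sequence (backslash followed by n) rather than a real line break, that each line is built from whole repetitions of the pattern (so a line is at most 4096 characters and never ends in a dangling backslash), and the helper names make_line and write_input, which are hypothetical.

```python
LINE_CHARS = 4096      # characters per line, as stated in the message
LINE_COUNT = 100_000   # lines per input file, as stated in the message


def make_line(n, line_chars=LINE_CHARS):
    """Build one line: n normal characters, then the escape, repeated.

    Assumption: the escape is the two characters backslash + 'n'.
    Only whole repetitions are used, so the result can be slightly
    shorter than line_chars when the pattern does not divide it evenly.
    """
    unit = "a" * n + "\\n"
    return unit * (line_chars // len(unit))


def write_input(n, path=None, line_count=LINE_COUNT):
    """Write text_<n>.input (hypothetical helper, not the author's script)."""
    path = path or f"text_{n}.input"
    line = make_line(n) + "\n"  # a real newline terminates each row
    with open(path, "w", encoding="ascii", newline="") as f:
        for _ in range(line_count):
            f.write(line)
    return path
```

Calling write_input(n) for n from 0 to 32 would produce one file per value; note that each file is roughly 400 MB with these sizes.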
Actual comparison of timings (in ms):

current method / heuristic

n = 0 -> 3252.7253 / 2856.2753 (12%)
n = 1 -> 2910.321 / 2520.7717 (13%)
n = 2 -> 2865.008 / 2403.2017 (16%)
n = 3 -> 2608.649 / 2353.1477 (9%)
n = 4 -> 2460.74 / 2300.1783 (6%)
n = 5 -> 2451.696 / 2362.1573 (3%)

No difference for the rest.

Side note: Sorry for the delay in responding, I will continue working on this next week.

[1] https://postgr.es/m/CAN55FZ1KF7XNpm2XyG%3DM-sFUODai%3D6Z8a11xE3s4YRBeBKY3tA%40mail.gmail.com

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
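For anyone checking the arithmetic in the timing comparison above, the small sketch below recomputes the percentages from the listed timings. It assumes each percentage is the gap between the two columns relative to the first column, truncated to a whole number; that reading is an assumption, although it does reproduce all six listed figures. The name percent_gap is hypothetical.

```python
# Timings in ms copied from the comparison: n -> (first column, second column)
timings_ms = {
    0: (3252.7253, 2856.2753),
    1: (2910.321, 2520.7717),
    2: (2865.008, 2403.2017),
    3: (2608.649, 2353.1477),
    4: (2460.74, 2300.1783),
    5: (2451.696, 2362.1573),
}


def percent_gap(first, second):
    """Gap between the columns as a percentage of the first, truncated."""
    return int(100 * (first - second) / first)


gaps = {n: percent_gap(a, b) for n, (a, b) in timings_ms.items()}
print(gaps)  # {0: 12, 1: 13, 2: 16, 3: 9, 4: 6, 5: 3}
```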