From: Nazir Bilal Yavuz
Date: Wed, 31 Dec 2025 16:04:15 +0300
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
To: KAZAR Ayoub
Cc: Manni Wood, Mark Wong, Nathan Bossart, Andrew Dunstan, Shinya Kato, PostgreSQL-development
References: <8e226753-57af-489a-bfbe-caa23dd71286@dunslane.net>

Hi,

On Wed, 24 Dec 2025 at 18:08, KAZAR Ayoub
wrote:
>
> Hello,
> Following the same path of optimizing COPY FROM using SIMD, I found that COPY TO can also benefit from this.
>
> I attached a small patch that uses SIMD to skip data and advance until the first special character is found, then falls back to scalar processing for that character and re-enters the SIMD path again...
> There are two ways to do this:
> 1) Essentially we do SIMD until we find a special character, then continue on the scalar path without re-entering SIMD again.
> - This gives 10% to 30% speedups depending on the weight of special characters in the attribute; we don't lose anything here since it advances with SIMD until it can't (using the previous scripts: 1/3, 2/3 special chars).
>
> 2) Do the SIMD path, then use the scalar path when we hit a special character, and keep re-entering the SIMD path each time.
> - This is equivalent to the COPY FROM story; we'll need to find the same heuristic to use for both COPY FROM/TO to reduce the regressions (same regressions: around 20% to 30% with 1/3, 2/3 special chars).
>
> Something else to note is that the scalar path for COPY TO isn't as heavy as the state machine in COPY FROM.
>
> So if we find the sweet spot for the heuristic, doing the same for COPY TO will be trivial and always beneficial.
> Attached is 0004, which is option 1 (SIMD without re-entering); 0005 is the second one.

Patches look correct to me. I think we could move these SIMD code portions into a shared function to remove duplication, although that might have a performance impact. I have not benchmarked these patches yet.

Another consideration is that these patches might need their own thread, though I am not completely sure about this yet.

One question: what do you think about having a 0004-style approach for COPY FROM? What I have in mind is running SIMD for each line & column, stopping SIMD once it can no longer skip an entire chunk, and then continuing with the next line & column.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
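
To make the difference between option 1 (0004) and option 2 (0005) concrete, here is a rough sketch in plain C. It is not the patch code: `CHUNK`, `is_special`, `escape_letter`, `chunk_has_special`, and `escape_attr` are hypothetical names, the chunk check is an ordinary byte loop standing in for a vector comparison, and the escape set is a simplified subset chosen for illustration. The `reenter` flag selects between the two behaviours, and `fast_bytes` only exists to show how much data went through the chunk path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 16                /* assumed vector width, for illustration */

/* Hypothetical, simplified set of bytes that need escaping on output. */
static bool
is_special(unsigned char c)
{
    return c == '\\' || c == '\t' || c == '\n' || c == '\r';
}

/* Character written after the backslash for a special byte. */
static char
escape_letter(unsigned char c)
{
    switch (c)
    {
        case '\t': return 't';
        case '\n': return 'n';
        case '\r': return 'r';
        default:   return (char) c;     /* backslash escapes as itself */
    }
}

/* Stand-in for one vector comparison: does this chunk hold a special byte? */
static bool
chunk_has_special(const char *p)
{
    bool found = false;

    for (int i = 0; i < CHUNK; i++)
        found |= is_special((unsigned char) p[i]);
    return found;
}

/*
 * Escape len bytes of src into dst (needs room for 2 * len + 1 bytes).
 * reenter = false: leave the chunk path for good at the first special chunk.
 * reenter = true:  resume the chunk path after each special byte is handled.
 * Returns the output length; *fast_bytes counts bytes copied chunk-wise.
 */
size_t
escape_attr(const char *src, size_t len, char *dst, bool reenter,
            size_t *fast_bytes)
{
    size_t in = 0;
    size_t out = 0;
    bool   use_chunks = true;

    assert(src != NULL && dst != NULL && fast_bytes != NULL);
    *fast_bytes = 0;

    while (in < len)
    {
        if (use_chunks && len - in >= CHUNK)
        {
            if (!chunk_has_special(src + in))
            {
                /* Whole chunk is clean: copy it in one go. */
                memcpy(dst + out, src + in, CHUNK);
                in += CHUNK;
                out += CHUNK;
                *fast_bytes += CHUNK;
                continue;
            }
            use_chunks = false;         /* a special byte is ahead */
        }

        /* Scalar path: handle exactly one byte. */
        unsigned char c = (unsigned char) src[in++];

        if (is_special(c))
        {
            dst[out++] = '\\';
            dst[out++] = escape_letter(c);
            if (reenter)
                use_chunks = true;      /* option 2: try chunks again */
        }
        else
            dst[out++] = (char) c;
    }
    dst[out] = '\0';
    return out;
}
```

For a 65-byte attribute made of 32 plain bytes, one tab, and 32 more plain bytes, `reenter = false` sends 32 bytes through the chunk path and `reenter = true` sends 64, while both calls produce identical output. The sketch says nothing about real timings; it only shows where each option spends its bytes.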