From: Nazir Bilal Yavuz
Date: Sat, 7 Feb 2026 01:19:16 +0300
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
To: Nathan Bossart
Cc: KAZAR Ayoub, Neil Conway, Manni Wood, Andrew Dunstan, Shinya Kato, PostgreSQL-development

Hi,

Thank you for sharing your thoughts!

On Sat, 7 Feb 2026 at 00:29, Nathan Bossart wrote:
>
> It looks like a lot of energy has been put into benchmarking and refining
> the heuristic for deciding when to use the SIMD path so that we avoid large
> regressions when there are special characters. I think this is all
> valuable work, but I'm a bit concerned that we are putting the cart before
> the horse. IMHO it would be better to first get the SIMD code committed
> with the absolute simplest heuristic we can think of (e.g., as soon as we
> see a special character, switch to the scalar path for the remainder of
> COPY FROM). My hope is that would be far easier to reason about from a
> performance angle. If we immediately fall back to the existing code path,
> we don't need to worry about how many special characters there are and
> whether they are sparse or clustered or whatever. We just need to measure
> the overhead of the new branches and ensure they don't produce meaningful
> regressions. Assuming that all looks good, we can then focus on the SIMD
> code itself and make sure that is correct and optimal. And once we get
> that portion committed, we could then consider more sophisticated
> heuristics.

I have three possible approaches in mind; they are quite similar to
each other.

1- After encountering a special character, disable SIMD for the rest
of the current line and also for the rest of the data.

2- A mix of the current heuristic and #1. After encountering a special
character, skip SIMD for the rest of the current line (let's say
line 1) and for the next line (line 2). Then try SIMD again on the
following line (line 3). If there is no special character, continue to
use SIMD; if there is one, skip SIMD for two lines this time. And so
on: every time a special character is encountered during a SIMD run,
the number of skipped lines is doubled.

3- A bit different from #2. Instead of calculating the number of lines
to skip dynamically, skip a constant number of lines N and then try
SIMD again. N could be something like 100, 1000, or 10000. Actually,
you and Andrew suggested this approach before [1].

I think what you suggested is closer to #1 or #3. I just wanted to
hear your opinions on whether any of these approaches is worth
implementing.

[1] https://postgr.es/m/aR4wDwNdLc5TmcQq%40nathan

--
Regards,
Nazir Bilal Yavuz
Microsoft
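
A minimal sketch of how the three approaches above could share one small
piece of per-COPY state. Every name here (SimdPolicy, SimdHeuristic,
simd_heuristic_*) is hypothetical and does not come from the patch or from
the PostgreSQL source. It also assumes that in approach #2 the penalty is
never reset after a clean SIMD line, which the description above leaves
open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical policy selector; one value per approach in the list. */
typedef enum SimdPolicy
{
	SIMD_DISABLE_FOREVER,		/* #1: first special character ends SIMD */
	SIMD_BACKOFF_DOUBLING,		/* #2: skip 1, 2, 4, ... lines after each hit */
	SIMD_SKIP_CONSTANT			/* #3: skip a fixed N lines after each hit */
} SimdPolicy;

/* Hypothetical per-COPY state. */
typedef struct SimdHeuristic
{
	SimdPolicy	policy;
	bool		disabled;		/* #1: never try SIMD again */
	uint64_t	lines_to_skip;	/* whole lines still to parse with scalar code */
	uint64_t	next_skip;		/* #2: penalty applied on the next hit */
	uint64_t	constant_skip;	/* #3: the constant N */
} SimdHeuristic;

void
simd_heuristic_init(SimdHeuristic *h, SimdPolicy policy, uint64_t constant_skip)
{
	assert(h != NULL);
	h->policy = policy;
	h->disabled = false;
	h->lines_to_skip = 0;
	h->next_skip = 1;
	h->constant_skip = constant_skip;
}

/* Call at the start of each line; returns true if SIMD should be tried. */
bool
simd_heuristic_begin_line(SimdHeuristic *h)
{
	if (h->disabled)
		return false;
	if (h->lines_to_skip > 0)
	{
		h->lines_to_skip--;
		return false;
	}
	return true;
}

/*
 * Call when the SIMD path sees a special character. The caller is expected
 * to finish the current line with the scalar code; this only decides what
 * happens on the following lines.
 */
void
simd_heuristic_special_char(SimdHeuristic *h)
{
	switch (h->policy)
	{
		case SIMD_DISABLE_FOREVER:
			h->disabled = true;
			break;
		case SIMD_BACKOFF_DOUBLING:
			h->lines_to_skip = h->next_skip;
			if (h->next_skip <= UINT64_MAX / 2)
				h->next_skip *= 2;	/* saturate instead of overflowing */
			break;
		case SIMD_SKIP_CONSTANT:
			h->lines_to_skip = h->constant_skip;
			break;
	}
}
```

With this shape, #1 costs one branch per line once disabled, while #2 and
#3 add a counter decrement on skipped lines, which is the overhead the
scalar-only case would need to be benchmarked against.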