From: KAZAR Ayoub
Date: Fri, 6 Feb 2026 23:36:13 +0100
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
To: Nazir Bilal Yavuz
Cc: Nathan Bossart, Neil Conway, Manni Wood, Andrew Dunstan, Shinya Kato, PostgreSQL-development

Hello,

On Fri, Feb 6, 2026 at 11:19 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

> Hi,
>
> Thank you for sharing your thoughts!
>
> On Sat, 7 Feb 2026 at 00:29, Nathan Bossart <nathandbossart@gmail.com> wrote:
> >
> > It looks like a lot of energy has been put into benchmarking and refining
> > the heuristic for deciding when to use the SIMD path so that we avoid large
> > regressions when there are special characters.  I think this is all
> > valuable work, but I'm a bit concerned that we are putting the cart before
> > the horse.  IMHO it would be better to first get the SIMD code committed
> > with the absolute simplest heuristic we can think of (e.g., as soon as we
> > see a special character, switch to the scalar path for the remainder of
> > COPY FROM).  My hope is that would be far easier to reason about from a
> > performance angle.  If we immediately fall back to the existing code path,
> > we don't need to worry about how many special characters there are and
> > whether they are sparse or clustered or whatever.  We just need to measure
> > the overhead of the new branches and ensure they don't produce meaningful
> > regressions.  Assuming that all looks good, we can then focus on the SIMD
> > code itself and make sure that is correct and optimal.  And once we get
> > that portion committed, we could then consider more sophisticated
> > heuristics.

I agree with this too, especially regarding the line_buf refilling idea:
finding a good threshold value for it needs a bit more time than the work
on the heuristic.

> I have three possible approaches in mind; they are actually similar
> to each other.
>
> 1- After encountering a special character, disable SIMD for the rest
> of the current line and also for the rest of the data.
>
> 2- It is a mixed version of the current heuristic and #1. After
> encountering a special character, skip SIMD for the current line (let's
> say line 1) and for the next line (line 2). Then try running SIMD for
> the next line (line 3); if there is no special character, continue to
> run SIMD, but if there is a special character, then skip running SIMD
> for two lines this time. And it goes on like that: every time a special
> character is encountered in the SIMD run, the number of skipped SIMD
> lines is doubled.
>
> 3- This version is a bit different from #2. Instead of calculating the
> number of lines to skip dynamically, skip a constant number N of
> lines and then try to run SIMD again after these lines. N could be
> something like 100, 1000, or 10000, etc. Actually, you and Andrew
> suggested this approach before [1].
>
> I think what you suggested is closer to #1 or #3. I just wanted to
> hear your opinions, and whether you think any of these approaches are
> good to implement / work on.

For v19, #1 seems like "wasted potential". #3 sounds more relaxed than
v4.2, so it has good potential; I can fully benchmark it against v3 as
soon as you send a patch for it.
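
For comparison, here is a minimal sketch of how the three approaches quoted
above could be expressed as a per-line decision. It is not taken from any
version of the patch: every name (SimdHeuristic, simd_should_try,
simd_saw_special, and so on) is hypothetical. It assumes the caller finishes
the current line on the scalar path after a hit, that #2 starts by skipping
one following line and then doubles, and that the doubling is never reset by
clean lines (the description above does not say either way).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical policy names; they map to approaches #1, #2 and #3. */
typedef enum
{
    SIMD_OFF_FOREVER,       /* #1: first special character disables SIMD */
    SIMD_BACKOFF_DOUBLING,  /* #2: skip 1, 2, 4, ... lines after each hit */
    SIMD_BACKOFF_CONSTANT   /* #3: skip a fixed N lines after each hit */
} SimdPolicy;

typedef struct
{
    SimdPolicy  policy;
    bool        disabled;        /* #1: SIMD permanently off */
    uint64_t    skip_remaining;  /* upcoming lines to parse with scalar code */
    uint64_t    next_skip;       /* #2: lines to skip on the next hit */
    uint64_t    constant_skip;   /* #3: the constant N */
} SimdHeuristic;

static void
simd_heuristic_init(SimdHeuristic *h, SimdPolicy policy, uint64_t n)
{
    assert(h != NULL);
    h->policy = policy;
    h->disabled = false;
    h->skip_remaining = 0;
    h->next_skip = 1;
    h->constant_skip = n;
}

/* Call once at the start of each line: should this line try SIMD? */
static bool
simd_should_try(SimdHeuristic *h)
{
    if (h->disabled)
        return false;
    if (h->skip_remaining > 0)
    {
        h->skip_remaining--;
        return false;
    }
    return true;
}

/*
 * Call when the SIMD pass meets a special character on the current line.
 * The caller is assumed to finish that line with the scalar path.
 */
static void
simd_saw_special(SimdHeuristic *h)
{
    switch (h->policy)
    {
        case SIMD_OFF_FOREVER:
            h->disabled = true;
            break;
        case SIMD_BACKOFF_DOUBLING:
            h->skip_remaining = h->next_skip;
            if (h->next_skip <= UINT64_MAX / 2)  /* avoid overflow */
                h->next_skip *= 2;
            break;
        case SIMD_BACKOFF_CONSTANT:
            h->skip_remaining = h->constant_skip;
            break;
    }
}
```

With this shape the three variants differ only in simd_saw_special(), so
benchmarking #3 against #1 would mostly come down to choosing the policy
and N.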


Regards,
Ayoub