Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vIwm2-0041G0-1y for pgsql-hackers@arkaria.postgresql.org; Tue, 11 Nov 2025 22:23:37 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vIwm0-0085Zm-0P for pgsql-hackers@arkaria.postgresql.org; Tue, 11 Nov 2025 22:23:36 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vIwlz-0085Zd-2A for pgsql-hackers@lists.postgresql.org; Tue, 11 Nov 2025 22:23:35 +0000 Received: from mail-pg1-x535.google.com ([2607:f8b0:4864:20::535]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1vIwlw-006cIN-2j for pgsql-hackers@postgresql.org; Tue, 11 Nov 2025 22:23:33 +0000 Received: by mail-pg1-x535.google.com with SMTP id 41be03b00d2f7-b6cf07258e3so117901a12.0 for ; Tue, 11 Nov 2025 14:23:32 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=enterprisedb.com; s=google; t=1762899812; x=1763504612; darn=postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=GauPHhjL+5FU7Tz2KqxSXcbWhFaCOQ3Xo0caA8xczq4=; b=W1CD51WhewdoI5lbJtVBgRu/v6QEimcMrE2nTYjYJWbzQt4T5hGxtNKt1aO7s1WEPV UsFkqOQzX4x476aP9MTA2n0ZECFE2GBfGudb5IGWIZQajtfoip+uKCj0uJkwXuBnPil6 5ww+VrJCbus/cfLF/eo5vE8rTQEkS9i5yxkXCVq3JDAGBMpHLTW0tfsRc5S4HPh8lLfw ZecP1UnrCNFlK8BSi+GnTDIKIpahjg8SfytyfOFPKNY4YabOPDQPYM3tZDHDpYtUfvLC lkSh3NYiC2CEPxaoaSkml2tWV9C8jgs0cesZ5wK0ruSiGnfgNZo1SJDDj7RMKsb3wYKb 0XSg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1762899812; x=1763504612; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-gg:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=GauPHhjL+5FU7Tz2KqxSXcbWhFaCOQ3Xo0caA8xczq4=; b=VOxof6b4IGMBFMonimUZXTrbIikjgwJInEgOCrOyBE3PU1iS74aQSwjmivtLfzvaCn fztiLoe9E0MUm3/Lm27Xd1RNOu8RUyd6ev/Pxjktx04UUMpiqKLlu72uIXZj26D03FRu ivgUjON6Sn3UU7OEy7hIL1OkC3D8yahPnQNtBZjgaQQZlGpKp7h8AN+AqSqyhMBqvY1U QhyBrhKiXX4CdBIegoXo7dEpWKlbuyb4ES++FEodq0Tb3L/LtPwZHjr/iuX1I/U8Ggm9 XsEmqyKSjHqAV+cy1tj8JWVUt96ckjQUnsJBXQHe19jljrRDQiQe/vfMw9GRK9AsTg+w ZemQ== X-Forwarded-Encrypted: i=1; AJvYcCUmoCDOsLjx1yr5p3/RPGUUa79n6CPq+jfxbcC4qDcarO9MD6wwTOiutjzDw5DDtLQOxBIOv9n+xG3GUJ3i@postgresql.org X-Gm-Message-State: AOJu0YwYKAUqtqDkWni9fLkAvn1QtER5WjQjsbdNbz4zHSKGiOufz8YE t/FW0py3iZJfX2TonpKcCegOGOIgRQCLrbux/mPjepYfNaS5V0qjcRPTHPH64wClxdC53A5nyAa fCB2DG8HJaXUxV4XkZHug9pbIewv1vkByrTvo4Yrs X-Gm-Gg: ASbGncuK7AE26mL/I5YmyBgLlw/nbHFblrJxqB3bewuRk9hFTe6UxRVOqPZ2//7e5eE zcJrTgibtOzVqI8bclyaBgSN9687n1oY6PJll2EMHUrtueOIBcuZNJ4MKoHeHFIYRIOatD0Y9E0 mJAFbfkAkHnz6gDiU/tK7mOm/53ULK2x/7jI8L4UMPCJlSCRVN8SqzNrjAIfsiXavxn/q1Iy2bl F2e+4sT3s8qKacDLbk3G9Cln59ZD5AMkGlvHuGg4DUggU4JYK9ygbQk22nyPXJurckfAz8= X-Google-Smtp-Source: AGHT+IE3PhemEc6yXADdA3Nvf+S9DCVnEsXqlznUdks5uzF3zWPPsrWAy26EEanhDR9bjibgtDXAbXk2QBTzcjjrDUk= X-Received: by 2002:a17:903:2f4b:b0:295:a1a5:baee with SMTP id d9443c01a7336-2984ed79975mr8915095ad.4.1762899811688; Tue, 11 Nov 2025 14:23:31 -0800 (PST) MIME-Version: 1.0 References: <8615c983-1662-43b4-b0c9-49d194ac33aa@dunslane.net> <673d92f7-2489-475f-a208-9414ea35d4d8@dunslane.net> <8e045899-2023-48b1-bd91-f8cdffeb511d@dunslane.net> <5d81fbbb-7609-4445-9bc4-8af211fb7674@dunslane.net> In-Reply-To: <5d81fbbb-7609-4445-9bc4-8af211fb7674@dunslane.net> From: Manni Wood Date: Tue, 11 Nov 2025 16:23:20 -0600 X-Gm-Features: AWmQ_bm9QYUKl5YEgQL8orniXy9axjUJkCQVbgWt_xPnwe9VUe4Ogy84rq5w1WY Message-ID: Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD To: Andrew Dunstan Cc: Nathan Bossart , Nazir Bilal Yavuz , KAZAR Ayoub , Shinya Kato , pgsql-hackers@postgresql.org Content-Type: multipart/alternative; boundary="0000000000000cf6d3064359193e" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --0000000000000cf6d3064359193e Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Wed, Oct 29, 2025 at 5:23=E2=80=AFPM Andrew Dunstan wrote: > > On 2025-10-22 We 3:24 PM, Nathan Bossart wrote: > > On Wed, Oct 22, 2025 at 03:33:37PM +0300, Nazir Bilal Yavuz wrote: > >> On Tue, 21 Oct 2025 at 21:40, Nathan Bossart > wrote: > >>> I wonder if we could mitigate the regression further by spacing out t= he > >>> checks a bit more. It could be worth comparing a variety of values t= o > >>> identify what works best with the test data. > >> Do you mean that instead of doubling the SIMD sleep, we should > >> multiply it by 3 (or another factor)? Or are you referring to > >> increasing the maximum sleep from 1024? Or possibly both? > > I'm not sure of the precise details, but the main thrust of my suggesti= on > > is to assume that whatever sampling you do to determine whether to use > SIMD > > is good for a larger chunk of data. That is, if you are sampling 1K > lines > > and then using the result to choose whether to use SIMD for the next 10= 0K > > lines, we could instead bump the latter number to 1M lines (or > something). > > That way we minimize the regression for relatively uniform data sets > while > > retaining some ability to adapt in case things change halfway through a > > large table. > > > > > I'd be ok with numbers like this, although I suspect the numbers of > cases where we see shape shifts like this in the middle of a data set > would be vanishingly small. > > > cheers > > > andrew > > > -- > Andrew Dunstan > EDB: https://www.enterprisedb.com > > > > Hello! I wanted reproduce the results using files attached by Shinya Kato and Ayoub Kazar. I installed a postgres compiled from master, and then I installed a postgres built from master plus Nazir Bilal Yavuz's v3 patches applied. The master+v3patches postgres naturally performed better on copying into the database: anywhere from 11% better for the t.csv file produced by Shinyo's test.sql, to 35% better copying in the t_4096_none.csv file created by Ayoub Kazar's simd-copy-from-bench.sql. But here's where it gets weird. The two files created by Ayoub Kazar's simd-copy-from-bench.sql that are supposed to be slower, t_4096_escape.txt, and t_4096_quote.csv, actually ran faster on my machine, by 11% and 5% respectively. This seems impossible. A few things I should note: I timed the commands using the Unix time command, like so: time psql -X -U mwood -h localhost -d postgres -c '\copy t from /tmp/t_4096_escape.txt' For each file, I timed the copy 6 times and took the average. This was done on my work Linux machine while also running Chrome and an Open Office spreadsheet; not a dedicated machine only running postgres. All of the copy results took between 4.5 seconds (Shinyo's t.csv copied into postgres compiled from master) to 2 seconds (Ayoub Kazar's t_4096_none.csv copied into postgres compiled from master plus Nazir's v3 patches). Perhaps I need to fiddle with the provided SQL to produce larger files to get longer run times? Maybe sub-second differences won't tell as interesting a story as minutes-long copy commands? Thanks for reading this. --=20 -- Manni Wood EDB: https://www.enterprisedb.com --0000000000000cf6d3064359193e Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable


On Wed, Oct 29,= 2025 at 5:23=E2=80=AFPM Andrew Dunstan <andrew@dunslane.net> wrote:

On 2025-10-22 We 3:24 PM, Nathan Bossart wrote:
> On Wed, Oct 22, 2025 at 03:33:37PM +0300, Nazir Bilal Yavuz wrote:
>> On Tue, 21 Oct 2025 at 21:40, Nathan Bossart <nathandbossart@gmail.com&g= t; wrote:
>>> I wonder if we could mitigate the regression further by spacin= g out the
>>> checks a bit more.=C2=A0 It could be worth comparing a variety= of values to
>>> identify what works best with the test data.
>> Do you mean that instead of doubling the SIMD sleep, we should
>> multiply it by 3 (or another factor)? Or are you referring to
>> increasing the maximum sleep from 1024? Or possibly both?
> I'm not sure of the precise details, but the main thrust of my sug= gestion
> is to assume that whatever sampling you do to determine whether to use= SIMD
> is good for a larger chunk of data.=C2=A0 That is, if you are sampling= 1K lines
> and then using the result to choose whether to use SIMD for the next 1= 00K
> lines, we could instead bump the latter number to 1M lines (or somethi= ng).
> That way we minimize the regression for relatively uniform data sets w= hile
> retaining some ability to adapt in case things change halfway through = a
> large table.
>


I'd be ok with numbers like this, although I suspect the numbers of cases where we see shape shifts like this in the middle of a data set
would be vanishingly small.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Hello!

<= /div>
I wanted reproduce the results using files attached by Shinya Kat= o and Ayoub Kazar. I installed a postgres compiled from master, and then I = installed a postgres built from master plus Nazir Bilal Yavuz's v3 patc= hes applied.

The master+v3patches postgres natural= ly performed better on copying into the database: anywhere from 11% better = for the t.csv file produced by Shinyo's test.sql, to 35% better copying= in the t_4096_none.csv file created by=C2=A0Ayoub Kazar's simd-copy-fr= om-bench.sql.

But here's where it gets weird. = The two files created by Ayoub Kazar's simd-copy-from-bench.sql that ar= e supposed to be slower,=C2=A0t_4096_escape.txt, and=C2=A0t_4096_quote.csv,= actually ran faster on my machine, by 11% and 5% respectively.
<= br>
This seems impossible.

A few things = I should note:

I timed the commands using the Unix= time command, like so:

time psql -X -U mwood -h l= ocalhost -d postgres -c '\copy t from /tmp/t_4096_escape.txt'
=

For each file, I timed the copy 6 times and took the av= erage.

This was done on my work Linux machine whil= e also running Chrome and an Open Office spreadsheet; not a dedicated machi= ne only running postgres.

All of the copy results = took between 4.5 seconds (Shinyo's t.csv copied into postgres compiled = from master) to 2 seconds (Ayoub Kazar's=C2=A0t_4096_none.csv copied in= to postgres compiled from master plus Nazir's v3 patches).
Perhaps I need to fiddle with the provided SQL to produce larg= er files to get longer run times? Maybe sub-second differences won't te= ll as interesting a story as minutes-long copy commands?

Thanks for reading this.
--
= -- Manni Wood EDB: https://www.enterprisedb.com
--0000000000000cf6d3064359193e--