Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vJNGZ-003bed-27 for pgsql-hackers@arkaria.postgresql.org; Thu, 13 Nov 2025 02:40:54 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vJNGX-00FdhF-1K for pgsql-hackers@arkaria.postgresql.org; Thu, 13 Nov 2025 02:40:53 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vJNGW-00Fdh7-39 for pgsql-hackers@lists.postgresql.org; Thu, 13 Nov 2025 02:40:53 +0000 Received: from mail-pg1-x535.google.com ([2607:f8b0:4864:20::535]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1vJNGU-006pjT-0J for pgsql-hackers@postgresql.org; Thu, 13 Nov 2025 02:40:51 +0000 Received: by mail-pg1-x535.google.com with SMTP id 41be03b00d2f7-bc169af247eso249623a12.2 for ; Wed, 12 Nov 2025 18:40:49 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=enterprisedb.com; s=google; t=1763001648; x=1763606448; darn=postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=OzG9rRXSIw4pkYscQm9Vhve3xQ5gFa91Tg4mJ2LjOMA=; b=KY1+MuBwcllld5qmmoI+TXt9C5LPWm7FLPaLVT5OVpdLdk1Sf+5KYaf1WOx/MTprx3 nJ/5j4+UNeshLVPTP3A/nHxPd+iuq7mRKfO8IcdtZlJjj+5ui4IkzdJuY9uVjxdG/cJ0 4Uz1Id9A0XKyVDWKp8BEkF3xBAJsrvpUDXZykUjrAsYZTviyXuk/eL46PiNrwaONUxLv ABzi87GyffvZqi9TGsjfxCU4ze6MLkhvk8WPG37I2OaQTWwaDv2WOsKWpRLxTo4YIKsc poy+hbBk/IMQPZhaR0UxogtoOfaK3Izopaxhq/zFfx+5X4UbQSZJFbpJO7xRfIpqldNA o97Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1763001648; x=1763606448; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-gg:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=OzG9rRXSIw4pkYscQm9Vhve3xQ5gFa91Tg4mJ2LjOMA=; b=HdtkqhsaNWBQ4oeSaQhhiwfYOEErMmBQ6KdDuYNfCqiYd0bjvJ4cgzFIufihtiRjLO 3JJKwu9ZccEkQe6YR2Ai+8fd0Sh5z8B3FGqbgNW1NiL6WLPsOimWXGkzyrG4bT69umWw DJ2aUE5Hdd4Sz8ss0lwB0muLubYCki70wG7KmvOI1zUOAblpIKAFHDYVrM0ddCBbOGuy YXNaeA7crQkdL1FpFzz3eJx9Qge/shq1sNVS7cwk4x+hmUALpeXOGrhPAfdpsGY+1Szv nwxa3dhzbuX6V4u+vRLyXl6qW2/b3q39Mv8NaJEYZKX/Uc4W6a69e/nd5QPyoBuQXwLe Kmig== X-Forwarded-Encrypted: i=1; AJvYcCUYuBC4eS+OpUn6GTIL8jTafu9HnMoYsa000VfG4c+bW156S2eo/Ewiw+lGduu64aFs4V8HpINJM94PvQZS@postgresql.org X-Gm-Message-State: AOJu0YyGFRqJ1tEAfewb+WyYMcxVaniAVCrnHnDqHrULOqiD5p2zMJs1 rDtuwuLwQarDEfXaDoBYZv6MpdcDtwed2QWK7rMIrG2u7C/qI8CrmCpaFvmSg7t3jiIukL3oEOJ eguot0Dpiavv3rD0s2bnk/rQtov8Y8fQkr1FVkBMD X-Gm-Gg: ASbGnctcZbNg0zIAgyYZAffak+8WQbd5uiTiCx6nxOlzQafiVYENMT0cC3/Aacx3x3t cg0tjh0K2jvEGR7hKjHlHVlT9QTLBacIAgPaPB9LA/lWmkd7VjD1hXFHWvEASm4aeEatWBld1hB eezYJ2NBqa49pHcDWTh5UkwRY3PMmvyoxm4VHmKS216UMqrUyC+exrtGYVV/AG+RZnQ+giDQz/G 2PBQEe+zUVvGsEt8hZSDbOque+zRm8sSVdEVE2y5TNUx4J/2DnRhVslyLYx X-Google-Smtp-Source: AGHT+IHPf9RUjh4RvurDRR3b7zNbTKsCX91C8RVE6dyPSGDKvVrVkucyUYN1eF/hceVHmDgI10rFuUQS5n6scWTsF2o= X-Received: by 2002:a17:903:191:b0:295:195:23b6 with SMTP id d9443c01a7336-2984edec111mr68408575ad.55.1763001648316; Wed, 12 Nov 2025 18:40:48 -0800 (PST) MIME-Version: 1.0 References: <8615c983-1662-43b4-b0c9-49d194ac33aa@dunslane.net> <673d92f7-2489-475f-a208-9414ea35d4d8@dunslane.net> <8e045899-2023-48b1-bd91-f8cdffeb511d@dunslane.net> <5d81fbbb-7609-4445-9bc4-8af211fb7674@dunslane.net> In-Reply-To: From: Manni Wood Date: Wed, 12 Nov 2025 20:40:35 -0600 X-Gm-Features: AWmQ_bnbE1rNFdhFhtSwZq2AEHFHzD5mjEbUKWZzM2DeVP9uzTLKWX3cE31d0HQ Message-ID: Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD To: KAZAR Ayoub Cc: Andrew Dunstan , Nathan Bossart , Nazir Bilal Yavuz , Shinya Kato , pgsql-hackers@postgresql.org Content-Type: multipart/alternative; boundary="000000000000fc9557064370ceed" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000fc9557064370ceed Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Wed, Nov 12, 2025 at 8:44=E2=80=AFAM KAZAR Ayoub wrote= : > On Tue, Nov 11, 2025 at 11:23=E2=80=AFPM Manni Wood > wrote: > >> Hello! >> >> I wanted reproduce the results using files attached by Shinya Kato and >> Ayoub Kazar. I installed a postgres compiled from master, and then I >> installed a postgres built from master plus Nazir Bilal Yavuz's v3 patch= es >> applied. >> >> The master+v3patches postgres naturally performed better on copying into >> the database: anywhere from 11% better for the t.csv file produced by >> Shinyo's test.sql, to 35% better copying in the t_4096_none.csv file >> created by Ayoub Kazar's simd-copy-from-bench.sql. >> >> But here's where it gets weird. The two files created by Ayoub Kazar's >> simd-copy-from-bench.sql that are supposed to be slower, t_4096_escape.t= xt, >> and t_4096_quote.csv, actually ran faster on my machine, by 11% and 5% >> respectively. >> >> This seems impossible. >> >> A few things I should note: >> >> I timed the commands using the Unix time command, like so: >> >> time psql -X -U mwood -h localhost -d postgres -c '\copy t from >> /tmp/t_4096_escape.txt' >> >> For each file, I timed the copy 6 times and took the average. >> >> This was done on my work Linux machine while also running Chrome and an >> Open Office spreadsheet; not a dedicated machine only running postgres. >> > Hello, > I think if you do a perf benchmark (if it still reproduces) it would > probably be possible to explain why it's performing like that looking at > the CPI and other metrics and compare it to my findings. > What i also suggest is to make the data close even closer to the worst > case i.e: more special characters where it hurts the switching between SI= MD > and scalar processing (in simd-copy-from-bench.sql file), if still does a > good job then there's something to look at. > >> >> > >> All of the copy results took between 4.5 seconds (Shinyo's t.csv copied >> into postgres compiled from master) to 2 seconds (Ayoub >> Kazar's t_4096_none.csv copied into postgres compiled from master plus >> Nazir's v3 patches). >> >> Perhaps I need to fiddle with the provided SQL to produce larger files t= o >> get longer run times? Maybe sub-second differences won't tell as >> interesting a story as minutes-long copy commands? >> > I did try it on some GBs (around 2-5GB only), the differences were not > that much, but if you can run this on more GBs (at least 10GB) it would b= e > good to look at, although i don't suspect anything interesting since the > shape of data is the same for the totality of the COPY. > >> >> Thanks for reading this. >> -- >> -- Manni Wood EDB: https://www.enterprisedb.com >> > Thanks for the info. > > > Regards, > Ayoub Kazar. > Hello again! It looks like using 10 times the data removed the apparent speedup in the simd code when the simd code has to deal with t_4096_escape.txt and t_4096_quote.csv. When both files contain 1,000,000 lines each, postgres master+v3patch imports 0.63% slower and 0.54% slower respectively. For 1,000,000 lines of t_4096_none.txt, the v3 patch yields a 30% speedup. For 1,000,000 lines of t_4096_none.csv, the v3 patch yields a 33% speedup. I got these numbers just via simple timing, though this time I used psql's \timing feature. I left psql running rather than launching it each time as I did when I used the unix "time" command. I ran the copy command 5 times for each file and averaged the results. Again, this happened on a Linux machine that also happened to be running Chrome and Open Office's spreadsheet. I should probably try to construct some .txt or .csv files that would trip up the simd on/off heuristic in the v3 patch. If data "in the wild" tend to be roughly the same "shape" from row to row, as Andrew's experience has shown, I imagine these million row results bode well for the v3 patch... --=20 -- Manni Wood EDB: https://www.enterprisedb.com --000000000000fc9557064370ceed Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable


On Wed, Nov 12,= 2025 at 8:44=E2=80=AFAM KAZAR Ayoub <ma_kazar@esi.dz> wrote:
On Tue, Nov 11, 2025 at 1= 1:23=E2=80=AFPM Manni Wood <manni.wood@enterprisedb.com> wrote:
=
Hello!

I wanted = reproduce the results using files attached by Shinya Kato and Ayoub Kazar. = I installed a postgres compiled from master, and then I installed a postgre= s built from master plus Nazir Bilal Yavuz's v3 patches applied.
<= div>
The master+v3patches postgres naturally performed better= on copying into the database: anywhere from 11% better for the t.csv file = produced by Shinyo's test.sql, to 35% better copying in the t_4096_none= .csv file created by=C2=A0Ayoub Kazar's simd-copy-from-bench.sql.
=

But here's where it gets weird. The two files creat= ed by Ayoub Kazar's simd-copy-from-bench.sql that are supposed to be sl= ower,=C2=A0t_4096_escape.txt, and=C2=A0t_4096_quote.csv, actually ran faste= r on my machine, by 11% and 5% respectively.

This = seems impossible.

A few things I should note:

I timed the commands using the Unix time command, like= so:

time psql -X -U mwood -h localhost -d postgre= s -c '\copy t from /tmp/t_4096_escape.txt'

For each file, I timed the copy 6 times and took the average.
This was done on my work Linux machine while also running Chro= me and an Open Office spreadsheet; not a dedicated machine only running pos= tgres.
Hello,
I think if you do = a perf benchmark (if it still reproduces) it would probably be possible to = explain why it's performing like that looking at the CPI and other metr= ics and compare it to my findings.
What i also suggest is to make= the data close even closer to the worst case i.e: more special characters = where it hurts the switching between SIMD and scalar processing (in simd-co= py-from-bench.sql file), if still does a good job then there's somethin= g to look at.
=C2=A0

All of the copy = results took between 4.5 seconds (Shinyo's t.csv copied into postgres c= ompiled from master) to 2 seconds (Ayoub Kazar's=C2=A0t_4096_none.csv c= opied into postgres compiled from master plus Nazir's v3 patches).

Perhaps I need to fiddle with the provided SQL to prod= uce larger files to get longer run times? Maybe sub-second differences won&= #39;t tell as interesting a story as minutes-long copy commands?
I did try it on some GBs (around 2-5GB only), the differ= ences were not that much, but if you can run this on more GBs (at least 10G= B) it would be good to look at, although i don't suspect anything inter= esting since the shape of data is the same for the totality of the COPY.
=
Thanks for reading this.
--
-- Manni Wood EDB: https://www.enterprisedb.com
Th= anks for the info.


Regards,
Ayoub Kazar.=C2=A0

Hello again!

It looks like using 10 times the data removed the apparent = speedup in the simd code when the simd code has to deal with=C2=A0t_4096_es= cape.txt and=C2=A0t_4096_quote.csv. When both files contain 1,000,000 lines= each, postgres master+v3patch imports 0.63% slower and 0.54% slower respec= tively. For 1,000,000 lines of t_4096_none.txt, the v3 patch yields a 30% s= peedup. For 1,000,000 lines of=C2=A0t_4096_none.csv, the v3 patch yields a = 33% speedup.

I got these numbers just via simple t= iming, though this time I used psql's \timing feature. I left psql runn= ing rather than launching it each time as I did when I used the unix "= time" command. I ran the copy command 5 times for each file and averag= ed the results. Again, this happened on a Linux machine that also happened = to be running Chrome and Open Office's spreadsheet.

I should probably try to construct some .txt or .csv files that would= trip up the simd on/off heuristic in the v3 patch.

If data "in the wild" tend to be roughly the same "shape&q= uot; from row to row, as Andrew's experience has shown, I imagine these= million row results bode well for the v3 patch...
--
-- Manni Wood EDB: https://www.enterprisedb.com
--000000000000fc9557064370ceed--