Re: Speed up COPY FROM text/CSV parsing using SIMD

From: Nazir Bilal Yavuz <[email protected]>
To: KAZAR Ayoub <[email protected]>
Cc: Neil Conway <[email protected]>
Cc: Manni Wood <[email protected]>
Cc: Nathan Bossart <[email protected]>
Cc: Andrew Dunstan <[email protected]>
Cc: Shinya Kato <[email protected]>
Cc: PostgreSQL-development <[email protected]>
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
Date: Fri, 6 Feb 2026 16:51:17 +0300
Message-ID: <CAN55FZ0FRB2OD6-oEESLvgUT4bLZQVD72pAqUqzdw7Rx5cN0ig@mail.gmail.com> (raw)
In-Reply-To: <CA+K2Ru=C_woAnd-3-pGHoNSTR8FOf=7eeSWE1xaLt9ojVWndVg@mail.gmail.com>
References: <aPkvi5P7kpA8oQKc@nathan>
	<[email protected]>
	<CAKWEB6qdyhN3EoUNAK23etXX-kXH-_79NNbTsKqtF1g1WkuaBQ@mail.gmail.com>
	<CA+K2RumMC+avYGSX-AWNeod3w+XOGHrVPz8HiqkvJj7AZ5tZXA@mail.gmail.com>
	<CAKWEB6pev=pNVi4qDYWS50N=YFrKRbjH1h=5F1bXpnK7WR5CYg@mail.gmail.com>
	<aRue0D4QQkUf2B_N@nathan>
	<CAOzEurTHCGL-Txqf5rxMsPgTF=dTCOsr=uhJdXebqjEJy-0L7g@mail.gmail.com>
	<CAN55FZ0+JZvKYVCnJqLhHaWF9eBGmTaF1BCEpttxw1aT3G_+Qw@mail.gmail.com>
	<[email protected]>
	<CAN55FZ1XF=R7F7B__gq04rp2nQnJqs1yfExEXo4riWc68+Pe0w@mail.gmail.com>
	<aR4wDwNdLc5TmcQq@nathan>
	<CA+K2Rump8NoMRZRZ2r4jHXUJwByasy_c3_b0oaO+TLkSbMD-jw@mail.gmail.com>
	<CAKWEB6rLxPVtN4ffZ3CMTL518zhk_BWzzBt6ZE2oUSaErdphxA@mail.gmail.com>
	<CAKWEB6oO4gQd+UJBrU=uuUTE8Hv7GMznjMouvn0Lskr52UqjhQ@mail.gmail.com>
	<CAN55FZ0Nd9FL=aDSjOTJTeFAn8VNrZgWG+WbcHR+R7GkDMvUyw@mail.gmail.com>
	<CAN55FZ1fwKgGo2wEie1w2M2jzJko6cMi1NWD05Xm47_L9a3D+g@mail.gmail.com>
	<CAKWEB6oZdQhhBV3ojHLBwjQgKzfDw0fkqncurt9oi7vNsq41ww@mail.gmail.com>
	<CAN55FZ1p5UyUdTRO7iWR_ukjhJDOnpOR2rYNOq=+hcC45OuahQ@mail.gmail.com>
	<CAOW5sYZEx=fPw2wp7y2nK_-ifXFeYW4CTmFx_OQeoHFjG7rbHw@mail.gmail.com>
	<CA+K2Ru=C_woAnd-3-pGHoNSTR8FOf=7eeSWE1xaLt9ojVWndVg@mail.gmail.com>

Hi,

On Sat, 31 Jan 2026 at 19:21, KAZAR Ayoub <[email protected]> wrote:
>
> On Wed, Jan 21, 2026 at 9:50 PM Neil Conway <[email protected]> wrote:
>>
>> * I'm curious if we'll see better performance on large inputs if we flush to `line_buf` periodically (e.g., at least every few thousand bytes or so). Otherwise we might see poor data cache behavior if large inputs with no control characters get evicted before we've copied them over. See the approach taken in escape_json_with_len() in utils/adt/json.c
>>
> So I gave this a try. Attached is a small patch with v3 plus the suggestion added. Here are the results with different thresholds for the line_buf refill:
>
> Execution time compared to master:
> Workload   v3      v3.1 (2k)  v3.1 (4k)  v3.1 (8k)  v3.1 (16k)  v3.1 (20k)  v3.1 (28k)
> text/none  -16.5%  -17.4%     -14.3%     -12.6%     -13.6%      -10.5%      -16.3%
> text/esc   +5.6%   +11.1%     +3.1%      +7.6%      +3.0%       +4.9%       +4.2%
> csv/none   -31.0%  -29.9%     -26.7%     -30.1%     -27.9%      -30.2%      -29.6%
> csv/quote  +0.2%   -0.6%      -0.4%      -1.0%      +0.1%       +2.5%       -1.0%
>
> L1d cache miss rates:
> Workload   Master  v3     v3.1 (2k)  v3.1 (4k)  v3.1 (8k)  v3.1 (16k)  v3.1 (20k)  v3.1 (28k)
> text/none  0.20%   0.23%  0.21%      0.22%      0.21%      0.21%       0.21%       0.22%
> text/esc   0.21%   0.22%  0.22%      0.22%      0.22%      0.21%       0.22%       0.22%
> csv/none   0.17%   0.22%  0.21%      0.22%      0.21%      0.21%       0.22%       0.22%
> csv/quote  0.18%   0.22%  0.19%      0.20%      0.20%      0.19%       0.20%       0.20%
>
> On my laptop I have 32KB L1 cache per core.
> The results are very close. It is hard to see a difference in the cache-miss numbers, but the execution times suggest otherwise, so periodically filling line_buf seems worthwhile.
> If Manni can rerun the benchmarks on these too, it would be nice to confirm this.

I looked at this change and had a couple of points.

We already have REFILL_LINEBUF at the start of the for loop in
CopyReadLineText() (let’s call this refill #1). It runs whenever the
input_buf_ptr >= copy_buf_len check is true. On my end, copy_buf_len
stays at 8191 until the end of the input, where it becomes the
remaining amount. So when I set LINE_BUF_FLUSH_AFTER to 8192, the
REFILL_LINEBUF you added shouldn’t be reached; refill #1 should
trigger first instead.

I verified this manually by adding some logging, and the results seem
to confirm this behavior. Based on that, there shouldn’t be a
performance difference when LINE_BUF_FLUSH_AFTER >= 8k.
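
To make the argument concrete, here is a rough standalone model of
the two flush points. It is only a sketch, not the CopyReadLineText()
code: simulate_flushes(), FlushCounts and the counter names are made
up for illustration, the input is assumed to be one long run with no
line ends or control characters, and the real ordering of the two
checks inside the loop may differ.

```c
#include <stddef.h>

/* Hypothetical counters; none of these names exist in PostgreSQL. */
typedef struct FlushCounts
{
	size_t		boundary_flushes;	/* flushes at a chunk boundary ("refill #1") */
	size_t		threshold_flushes;	/* flushes caused by the periodic threshold */
} FlushCounts;

/*
 * Model: the input arrives in chunks of at most chunk_len bytes (8191
 * in the observation above). Pending bytes are flushed either when
 * flush_after bytes have accumulated, or when the chunk is exhausted.
 */
static FlushCounts
simulate_flushes(size_t total_bytes, size_t chunk_len, size_t flush_after)
{
	FlushCounts counts = {0, 0};
	size_t		remaining = total_bytes;

	while (remaining > 0)
	{
		size_t		this_chunk = remaining < chunk_len ? remaining : chunk_len;
		size_t		pending = 0;	/* bytes scanned but not yet flushed */

		for (size_t i = 0; i < this_chunk; i++)
		{
			pending++;
			if (pending >= flush_after)
			{
				/* periodic flush, as in the proposed patch */
				counts.threshold_flushes++;
				pending = 0;
			}
		}

		/* chunk exhausted: whatever is left goes out at the boundary */
		if (pending > 0)
			counts.boundary_flushes++;

		remaining -= this_chunk;
	}

	return counts;
}
```

Under those assumptions, simulate_flushes(81910, 8191, 8192) reports
zero threshold flushes and ten boundary flushes, while a threshold of
4096 adds one threshold flush per chunk. Any threshold larger than the
chunk length behaves the same as no threshold at all in this model.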

Could you please take a look and confirm whether you see the same behavior?

Also, I noticed that json.c uses ESCAPE_JSON_FLUSH_AFTER set to 512,
so it might be worth trying smaller values here as well.
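
For reference, the general flush-after pattern being discussed looks
roughly like the sketch below. This is not the json.c or copy code:
copy_until_stop() and its parameters are hypothetical, it copies into
a plain caller-provided buffer rather than a StringInfo, and it scans
byte by byte instead of using SIMD.

```c
#include <stddef.h>
#include <string.h>

/*
 * Scan src for the first occurrence of stop (or the end of the input),
 * copying the clean run into dst. Instead of one big copy at the end,
 * pending bytes are copied out every flush_after bytes so that they
 * are still likely to be in cache when copied.
 *
 * dst must have room for len bytes. Returns the number of bytes copied
 * and stores the number of memcpy() calls in *nflushes.
 */
static size_t
copy_until_stop(const char *src, size_t len, char stop,
				char *dst, size_t flush_after, size_t *nflushes)
{
	size_t		copied = 0;		/* bytes already written to dst */
	size_t		i = 0;			/* scan position in src */

	*nflushes = 0;

	while (i < len && src[i] != stop)
	{
		i++;
		if (i - copied >= flush_after)
		{
			/* periodic flush of the pending clean run */
			memcpy(dst + copied, src + copied, i - copied);
			copied = i;
			(*nflushes)++;
		}
	}

	/* final flush of whatever is left before the stop byte */
	if (i > copied)
	{
		memcpy(dst + copied, src + copied, i - copied);
		copied = i;
		(*nflushes)++;
	}

	return copied;
}
```

With a 2000-byte clean run followed by a newline, a flush_after of 512
results in four copies (three periodic, one final), whereas 4096
results in a single copy at the end.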

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
