From: Neil Conway <[email protected]>
To: Nazir Bilal Yavuz <[email protected]>
Cc: Manni Wood <[email protected]>
Cc: KAZAR Ayoub <[email protected]>
Cc: Nathan Bossart <[email protected]>
Cc: Andrew Dunstan <[email protected]>
Cc: Shinya Kato <[email protected]>
Cc: PostgreSQL-development <[email protected]>
Subject: Re: Speed up COPY FROM text/CSV parsing using SIMD
Date: Wed, 21 Jan 2026 15:49:59 -0500
Message-ID: <CAOW5sYZEx=fPw2wp7y2nK_-ifXFeYW4CTmFx_OQeoHFjG7rbHw@mail.gmail.com>
In-Reply-To: <CAN55FZ1p5UyUdTRO7iWR_ukjhJDOnpOR2rYNOq=+hcC45OuahQ@mail.gmail.com>
References: <aPkvi5P7kpA8oQKc@nathan>
<[email protected]>
<CAKWEB6qdyhN3EoUNAK23etXX-kXH-_79NNbTsKqtF1g1WkuaBQ@mail.gmail.com>
<CA+K2RumMC+avYGSX-AWNeod3w+XOGHrVPz8HiqkvJj7AZ5tZXA@mail.gmail.com>
<CAKWEB6pev=pNVi4qDYWS50N=YFrKRbjH1h=5F1bXpnK7WR5CYg@mail.gmail.com>
<aRue0D4QQkUf2B_N@nathan>
<CAOzEurTHCGL-Txqf5rxMsPgTF=dTCOsr=uhJdXebqjEJy-0L7g@mail.gmail.com>
<CAN55FZ0+JZvKYVCnJqLhHaWF9eBGmTaF1BCEpttxw1aT3G_+Qw@mail.gmail.com>
<[email protected]>
<CAN55FZ1XF=R7F7B__gq04rp2nQnJqs1yfExEXo4riWc68+Pe0w@mail.gmail.com>
<aR4wDwNdLc5TmcQq@nathan>
<CA+K2Rump8NoMRZRZ2r4jHXUJwByasy_c3_b0oaO+TLkSbMD-jw@mail.gmail.com>
<CAKWEB6rLxPVtN4ffZ3CMTL518zhk_BWzzBt6ZE2oUSaErdphxA@mail.gmail.com>
<CAKWEB6oO4gQd+UJBrU=uuUTE8Hv7GMznjMouvn0Lskr52UqjhQ@mail.gmail.com>
<CAN55FZ0Nd9FL=aDSjOTJTeFAn8VNrZgWG+WbcHR+R7GkDMvUyw@mail.gmail.com>
<CAN55FZ1fwKgGo2wEie1w2M2jzJko6cMi1NWD05Xm47_L9a3D+g@mail.gmail.com>
<CAKWEB6oZdQhhBV3ojHLBwjQgKzfDw0fkqncurt9oi7vNsq41ww@mail.gmail.com>
<CAN55FZ1p5UyUdTRO7iWR_ukjhJDOnpOR2rYNOq=+hcC45OuahQ@mail.gmail.com>
A few suggestions:
* I'm curious if we'll see better performance on large inputs if we flush
to `line_buf` periodically (e.g., at least every few thousand bytes or so).
Otherwise we might see poor data cache behavior if large inputs with no
control characters get evicted before we've copied them over. See the
approach taken in escape_json_with_len() in utils/adt/json.c.
* Did you compare the approach taken in the patch with a simpler approach
that just does:

    if (!(vector8_has(chunk, '\\') ||
          vector8_has(chunk, '\r') ||
          vector8_has(chunk, '\n')
          /* and so on, accounting for CSV / escapec / quotec stuff */))
    {
        /* skip chunk */
    }

That's roughly what we do elsewhere (e.g., escape_json_with_len). It has
the advantage of being more readable, along with potentially having fewer
data dependencies.
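
The two suggestions above can be sketched together in a small standalone C example. This is only an illustration under stated assumptions, not the patch and not PostgreSQL code: `chunk_has()` is a scalar stand-in for `vector8_has()`, a plain `char` buffer stands in for `line_buf`, and `CHUNK_SIZE`, `FLUSH_THRESHOLD`, and `copy_plain_prefix()` are hypothetical names chosen for this sketch. Only the text-format special bytes are checked; CSV quote and escape handling is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE      16      /* assumed stand-in for sizeof(Vector8) */
#define FLUSH_THRESHOLD 1024    /* hypothetical flush interval, in bytes */

/* Scalar stand-in for vector8_has(): does any byte of the chunk equal c? */
static bool
chunk_has(const char *chunk, char c)
{
    return memchr(chunk, c, CHUNK_SIZE) != NULL;
}

/*
 * Copy the leading run of src that contains no '\\', '\r' or '\n' into dst.
 * Returns the number of bytes copied, i.e. the offset of the first special
 * byte, or len if there is none. The caller must make sure dst has room for
 * len bytes. *nflushes reports how many separate copies were issued.
 */
static size_t
copy_plain_prefix(const char *src, size_t len, char *dst, size_t *nflushes)
{
    size_t      copied = 0;     /* bytes already written to dst */
    size_t      pos = 0;        /* bytes scanned so far */

    assert(src != NULL && dst != NULL && nflushes != NULL);
    *nflushes = 0;

    /* Whole chunks: three independent tests, skip the chunk if all fail. */
    while (len - pos >= CHUNK_SIZE)
    {
        const char *chunk = src + pos;

        if (chunk_has(chunk, '\\') ||
            chunk_has(chunk, '\r') ||
            chunk_has(chunk, '\n'))
            break;              /* leave this chunk to the scalar loop */

        pos += CHUNK_SIZE;

        /* Flush periodically so the scanned bytes are copied while hot. */
        if (pos - copied >= FLUSH_THRESHOLD)
        {
            memcpy(dst + copied, src + copied, pos - copied);
            copied = pos;
            (*nflushes)++;
        }
    }

    /* Scalar tail: finish a partial chunk or find the exact special byte. */
    while (pos < len &&
           src[pos] != '\\' && src[pos] != '\r' && src[pos] != '\n')
        pos++;

    /* Final flush of whatever has been scanned but not yet copied. */
    if (pos > copied)
    {
        memcpy(dst + copied, src + copied, pos - copied);
        (*nflushes)++;
    }

    return pos;
}
```

With a 3000-byte input whose only newline sits at offset 2500, this issues two threshold-triggered flushes plus one final flush and returns 2500. Whether a threshold of this size actually helps cache behavior in COPY FROM is exactly the open question raised above and would need benchmarking.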
Neil
On Wed, Dec 10, 2025 at 7:00 AM Nazir Bilal Yavuz <[email protected]>
wrote:
> Hi,
>
> On Wed, 10 Dec 2025 at 01:13, Manni Wood <[email protected]>
> wrote:
> >
> > Bilal Yavuz (Nazir Bilal Yavuz?),
>
> It is Nazir Bilal Yavuz, I changed some settings on my phone and it
> seems that it affected my mail account, hopefully it should be fixed
> now.
>
> > I did not get a chance to do any work on this today, but wanted to thank
> > you for finding my logic errors in counting special chars for CSV, and
> > hacking on my naive solution to make it faster. By attempting Andrew
> > Dunstan's suggestion, I got a better feel for the reality that the
> > "housekeeping" code produces a significant amount of overhead.
>
> You are welcome! v4.1 has some problems with in_quote case in SIMD
> handling code and counting cstate->chars_processed variable. I fixed
> them in v4.2.
>
> --
> Regards,
> Nazir Bilal Yavuz
> Microsoft
>