From: KAZAR Ayoub <[email protected]>
To: Nathan Bossart <[email protected]>
Cc: Andres Freund <[email protected]>
Cc: Pg Hackers <[email protected]>
Cc: Neil Conway <[email protected]>
Cc: Manni Wood <[email protected]>
Cc: Andrew Dunstan <[email protected]>
Cc: Shinya Kato <[email protected]>
Cc: Mark Wong <[email protected]>
Cc: Nazir Bilal Yavuz <[email protected]>
Subject: Re: Speed up COPY TO text/CSV parsing using SIMD
Date: Wed, 18 Mar 2026 03:29:32 +0100
Message-ID: <CA+K2Ru=PdZuXQbcfvqKysTkebTyXNd9j7dp+mTFQEYpLdGw1eA@mail.gmail.com> (raw)
In-Reply-To: <CA+K2Rum-TB_iNzDWoXOJspf=jq0gd-wees8+9tBTJNyhy9cK5g@mail.gmail.com>
References: <CA+K2Runi_H2CBL0yMm3De2KqcR9RMA0HK5cLJjEhoNszC7myeg@mail.gmail.com>
	<[email protected]>
	<CA+K2Rum_QTZqTUrdMOL5hr-OOpCwGR_9Nj1z15BFObjktMOY6A@mail.gmail.com>
	<abBuKalOno33MQFw@nathan>
	<CA+K2Rum7+Jm2rm65K5msxaiAM8QTkhSNAYarPBP9O7nBXYo12Q@mail.gmail.com>
	<abmiNPQOqBrRlf_m@nathan>
	<CA+K2Rum-TB_iNzDWoXOJspf=jq0gd-wees8+9tBTJNyhy9cK5g@mail.gmail.com>

On Wed, Mar 18, 2026 at 12:02 AM KAZAR Ayoub <[email protected]> wrote:

> On Tue, Mar 17, 2026 at 7:49 PM Nathan Bossart <[email protected]>
> wrote:
>
>> On Sat, Mar 14, 2026 at 11:43:38PM +0100, KAZAR Ayoub wrote:
>> > Just a small concern about where some varlenas have a larger binary size
>> > than its text representation ex:
>> > SELECT pg_column_size(to_tsvector('SIMD is GOOD'));
>> >  pg_column_size
>> > ----------------
>> >              32
>> >
>> > its text representation is less than sizeof(Vector8) so currently v3 would
>> > enter SIMD path and exit out just from the beginning (two extra branches)
>> > because it does this:
>> > + if (TupleDescAttr(tup_desc, attnum - 1)->attlen == -1 &&
>> > + VARSIZE_ANY_EXHDR(DatumGetPointer(value)) > sizeof(Vector8))
>> >
>> > I thought maybe we could do * 2 or * 4 its binary size, depends on the type
>> > really but this is just a proposition if this case is something concerning.
>>
>> Can we measure the impact of this?  How likely is this case?
>>
> I'll respond to this separately in a different email.
>
My earlier example was incorrect (the text representation of a tsvector is
lexemes and positions, not the original text as-is; it's lossy), but the
point still holds.
If we have a json(b) column like {"key1":"val1","key2":"val2"}, then in CSV
format it exits the SIMD path immediately because of the quote character;
for json(b) this will always be the case.
I measured the overhead of exiting the SIMD path this often (8 million times
in one COPY TO command), and I only found a 3% regression for this case,
sometimes 2%.
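
To make the early exit concrete, here is a minimal sketch of the behaviour I
am describing. It is not the patch's code: a plain byte loop stands in for the
vector compare, CHUNK_SIZE is assumed to be 16 in place of sizeof(Vector8),
and chunk_has_special() and simd_prefix_len() are hypothetical names used only
for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE 16			/* assumed stand-in for sizeof(Vector8) */

/*
 * Hypothetical helper: true if the chunk holds a byte that CSV output has to
 * treat specially.  A real SIMD path would do this with vector compares; the
 * scalar loop here only models the result.
 */
static bool
chunk_has_special(const char *p, char delim, char quote)
{
	for (size_t i = 0; i < CHUNK_SIZE; i++)
	{
		char		c = p[i];

		if (c == delim || c == quote || c == '\n' || c == '\r')
			return true;
	}
	return false;
}

/*
 * Returns how many leading bytes were cleared chunk-wise before the scan had
 * to fall back to the byte-at-a-time path.  A result of 0 means the chunked
 * path was entered (or skipped) without doing any useful work.
 */
static size_t
simd_prefix_len(const char *s, size_t len, char delim, char quote)
{
	size_t		off = 0;

	while (len - off >= CHUNK_SIZE &&
		   !chunk_has_special(s + off, delim, quote))
		off += CHUNK_SIZE;

	return off;
}
```

With the JSON value above, the quote character sits in the first chunk, so
simd_prefix_len() returns 0 and every row pays the entry and exit branches
for nothing. A 36-byte value without special characters would instead clear
32 bytes chunk-wise and leave 4 for the scalar tail.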

For cases where we falsely commit to the SIMD path because the binary size we
read is >= sizeof(Vector8), which I also found to be very niche, the cost of
short-circuiting to the scalar path each time is even more negligible (the
CSV JSON case above is the absolute worst case).
So I don't think any of this should be a concern.


Regards,
Ayoub

