Avoid multiple calls to memcpy (src/backend/access/index/genam.c)

public inbox for [email protected]  
help / color / mirror / Atom feed

Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
17+ messages / 4 participants
[nested] [flat]

* Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
@ 2026-03-09 13:16 Ranier Vilela <[email protected]>
  2026-03-09 14:05 ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-25 04:10 ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) David Rowley <[email protected]>
  0 siblings, 2 replies; 17+ messages in thread

From: Ranier Vilela @ 2026-03-09 13:16 UTC (permalink / raw)
  To: pgsql-hackers

Hi.

In the functions *systable_beginscan* and *systable_beginscan_ordered*,
is possible a small optimization.
The array *idxkey* can be constructed in one go with a single call to
mempcy.
The excess might not make much of a difference, but I think it's worth the
effort.

patch attached.

best regards,
Ranier Vilela


Attachments:

  [application/octet-stream] avoid-multiple-calls-to-memcpy-genam.patch (1.1K, 3-avoid-multiple-calls-to-memcpy-genam.patch)
  download | inline diff:
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 5e89b86a62..affaed2c48 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -435,13 +435,13 @@ systable_beginscan(Relation heapRelation,
 
 		idxkey = palloc_array(ScanKeyData, nkeys);
 
+		memcpy(idxkey, key, nkeys * sizeof(ScanKeyData));
+
 		/* Convert attribute numbers to be index column numbers. */
 		for (i = 0; i < nkeys; i++)
 		{
 			int			j;
 
-			memcpy(&idxkey[i], &key[i], sizeof(ScanKeyData));
-
 			for (j = 0; j < IndexRelationGetNumberOfAttributes(irel); j++)
 			{
 				if (key[i].sk_attno == irel->rd_index->indkey.values[j])
@@ -688,13 +688,13 @@ systable_beginscan_ordered(Relation heapRelation,
 
 	idxkey = palloc_array(ScanKeyData, nkeys);
 
+	memcpy(idxkey, key, nkeys * sizeof(ScanKeyData));
+
 	/* Convert attribute numbers to be index column numbers. */
 	for (i = 0; i < nkeys; i++)
 	{
 		int			j;
 
-		memcpy(&idxkey[i], &key[i], sizeof(ScanKeyData));
-
 		for (j = 0; j < IndexRelationGetNumberOfAttributes(indexRelation); j++)
 		{
 			if (key[i].sk_attno == indexRelation->rd_index->indkey.values[j])


^ permalink  raw  reply  [nested|flat] 17+ messages in thread

* Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
  2026-03-09 13:16 Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
@ 2026-03-09 14:05 ` Ranier Vilela <[email protected]>
  2026-03-09 14:45   ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
  1 sibling, 1 reply; 17+ messages in thread

From: Ranier Vilela @ 2026-03-09 14:05 UTC (permalink / raw)
  To: pgsql-hackers

Em seg., 9 de mar. de 2026 às 10:16, Ranier Vilela <[email protected]>
escreveu:

> Hi.
>
> In the functions *systable_beginscan* and *systable_beginscan_ordered*,
> is possible a small optimization.
> The array *idxkey* can be constructed in one go with a single call to
> mempcy.
> The excess might not make much of a difference, but I think it's worth the
> effort.
>
> patch attached.
>
Someone asked me if O2 does not do the work.
Apparently not.

https://godbolt.org/z/h5dndz33x

best regards,
Ranier Vilela


^ permalink  raw  reply  [nested|flat] 17+ messages in thread

* Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
  2026-03-09 13:16 Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 14:05 ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
@ 2026-03-09 14:45   ` Bryan Green <[email protected]>
  2026-03-09 16:25     ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  0 siblings, 1 reply; 17+ messages in thread

From: Bryan Green @ 2026-03-09 14:45 UTC (permalink / raw)
  To: Ranier Vilela <[email protected]>; +Cc: pgsql-hackers

I created an example that is a little bit closer to the actual code and
changed the compiler from C++ to C.

It is interesting the optimization that the compiler has chosen for version
1 versus version 2.  One calls
memcpy and one doesn't.  There is a good chance the inlining of memcpy as
SSE+scalar per iteration
will be faster for syscache scans-- which I believe are usually small (1-4
keys?).

Probably the only reason to do this patch would be if N is normally large
or if this is considered an
improvement in code clarity without a detrimental impact on small N
syscache scans.
I realize you only said "possible small optimization".  It might be
worthwhile to benchmark the code for
different values of n to determine if there is a tipping point either way?

https://godbolt.org/z/dM18cGfE6

-- bg

On Mon, Mar 9, 2026 at 8:05 AM Ranier Vilela <[email protected]> wrote:

>
> Em seg., 9 de mar. de 2026 às 10:16, Ranier Vilela <[email protected]>
> escreveu:
>
>> Hi.
>>
>> In the functions *systable_beginscan* and *systable_beginscan_ordered*,
>> is possible a small optimization.
>> The array *idxkey* can be constructed in one go with a single call to
>> mempcy.
>> The excess might not make much of a difference, but I think it's worth
>> the effort.
>>
>> patch attached.
>>
> Someone asked me if O2 does not do the work.
> Apparently not.
>
> https://godbolt.org/z/h5dndz33x
>
> best regards,
> Ranier Vilela
>

^ permalink  raw  reply  [nested|flat] 17+ messages in thread

* Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
  2026-03-09 13:16 Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 14:05 ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 14:45   ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
@ 2026-03-09 16:25     ` Ranier Vilela <[email protected]>
  2026-03-09 17:02       ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
  0 siblings, 1 reply; 17+ messages in thread

From: Ranier Vilela @ 2026-03-09 16:25 UTC (permalink / raw)
  To: Bryan Green <[email protected]>; +Cc: pgsql-hackers

Em seg., 9 de mar. de 2026 às 11:47, Bryan Green <[email protected]>
escreveu:

> I created an example that is a little bit closer to the actual code and
> changed the compiler from C++ to C.
>
> It is interesting the optimization that the compiler has chosen for
> version 1 versus version 2.  One calls
> memcpy and one doesn't.  There is a good chance the inlining of memcpy as
> SSE+scalar per iteration
> will be faster for syscache scans-- which I believe are usually small (1-4
> keys?).
>
I doubt the inline version is better.
Clang is supported too and the code generated is much better with memcpy
one call outside of the loop.


>
> Probably the only reason to do this patch would be if N is normally large
> or if this is considered an
> improvement in code clarity without a detrimental impact on small N
> syscache scans.
> I realize you only said "possible small optimization".  It might be
> worthwhile to benchmark the code for
> different values of n to determine if there is a tipping point either way?
>
 In your opinion, shouldn't this be considered an optimization, even a
small one?

best regards,
Ranier Vilela


^ permalink  raw  reply  [nested|flat] 17+ messages in thread

* Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
  2026-03-09 13:16 Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 14:05 ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 14:45   ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
  2026-03-09 16:25     ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
@ 2026-03-09 17:02       ` Bryan Green <[email protected]>
  2026-03-09 17:06         ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
  2026-03-12 14:35         ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  0 siblings, 2 replies; 17+ messages in thread

From: Bryan Green @ 2026-03-09 17:02 UTC (permalink / raw)
  To: Ranier Vilela <[email protected]>; +Cc: pgsql-hackers

I performed a micro-benchmark on my dual epyc (zen 2) server and version 1
wins for small values of n.

20 runs:

n       version       min  median    mean     max  stddev  noise%
-----------------------------------------------------------------------
n=1     version1     2.440   2.440   2.450   2.550   0.024    4.5%
n=1     version2     4.260   4.280   4.277   4.290   0.007    0.7%

n=2     version1     2.740   2.750   2.757   2.880   0.029    5.1%
n=2     version2     3.970   3.980   3.980   4.020   0.010    1.3%

n=4     version1     4.580   4.595   4.649   4.910   0.094    7.2%
n=4     version2     5.780   5.815   5.809   5.820   0.013    0.7%

But, micro-benchmarks always make me nervous, so I looked at the actual
instruction cost for my
platform given the version 1 and version 2 code.

If we count cpu cycles using the AMD Zen 2 instruction latency/throughput
tables:  version 1 (loop body)
has a critical path of ~5-6 cycles per iteration.  version 2 (loop body)
has ~3-4 cycles per iteration.

The problem for version 2 is that the call to memcpy is ~24-30 cycles due
to the stub + function call + return
and branch predictor pressure on first call.  This probably results in ~2.5
ns per iteration cost for version 2.

So, no I wouldn't call it an optimization.  But, it will be interesting to
hear other opinions on this.

--bg


On Mon, Mar 9, 2026 at 10:25 AM Ranier Vilela <[email protected]> wrote:

>
>
> Em seg., 9 de mar. de 2026 às 11:47, Bryan Green <[email protected]>
> escreveu:
>
>> I created an example that is a little bit closer to the actual code and
>> changed the compiler from C++ to C.
>>
>> It is interesting the optimization that the compiler has chosen for
>> version 1 versus version 2.  One calls
>> memcpy and one doesn't.  There is a good chance the inlining of memcpy as
>> SSE+scalar per iteration
>> will be faster for syscache scans-- which I believe are usually small
>> (1-4 keys?).
>>
> I doubt the inline version is better.
> Clang is supported too and the code generated is much better with memcpy
> one call outside of the loop.
>
>
>>
>> Probably the only reason to do this patch would be if N is normally large
>> or if this is considered an
>> improvement in code clarity without a detrimental impact on small N
>> syscache scans.
>> I realize you only said "possible small optimization".  It might be
>> worthwhile to benchmark the code for
>> different values of n to determine if there is a tipping point either way?
>>
>  In your opinion, shouldn't this be considered an optimization, even a
> small one?
>
> best regards,
> Ranier Vilela
>


^ permalink  raw  reply  [nested|flat] 17+ messages in thread

* Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
  2026-03-09 13:16 Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 14:05 ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 14:45   ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
  2026-03-09 16:25     ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 17:02       ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
@ 2026-03-09 17:06         ` Bryan Green <[email protected]>
  2026-03-20 09:54           ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) lakshmi <[email protected]>
  1 sibling, 1 reply; 17+ messages in thread

From: Bryan Green @ 2026-03-09 17:06 UTC (permalink / raw)
  To: Ranier Vilela <[email protected]>; +Cc: pgsql-hackers

I meant to add that the micro-benchmark went through each loop 100,000,000
iterations for each run...
for 20 runs to come up with those numbers.

--bg

On Mon, Mar 9, 2026 at 11:02 AM Bryan Green <[email protected]> wrote:

> I performed a micro-benchmark on my dual epyc (zen 2) server and version 1
> wins for small values of n.
>
> 20 runs:
>
> n       version       min  median    mean     max  stddev  noise%
> -----------------------------------------------------------------------
> n=1     version1     2.440   2.440   2.450   2.550   0.024    4.5%
> n=1     version2     4.260   4.280   4.277   4.290   0.007    0.7%
>
> n=2     version1     2.740   2.750   2.757   2.880   0.029    5.1%
> n=2     version2     3.970   3.980   3.980   4.020   0.010    1.3%
>
> n=4     version1     4.580   4.595   4.649   4.910   0.094    7.2%
> n=4     version2     5.780   5.815   5.809   5.820   0.013    0.7%
>
> But, micro-benchmarks always make me nervous, so I looked at the actual
> instruction cost for my
> platform given the version 1 and version 2 code.
>
> If we count cpu cycles using the AMD Zen 2 instruction latency/throughput
> tables:  version 1 (loop body)
> has a critical path of ~5-6 cycles per iteration.  version 2 (loop body)
> has ~3-4 cycles per iteration.
>
> The problem for version 2 is that the call to memcpy is ~24-30 cycles due
> to the stub + function call + return
> and branch predictor pressure on first call.  This probably results in
> ~2.5 ns per iteration cost for version 2.
>
> So, no I wouldn't call it an optimization.  But, it will be interesting to
> hear other opinions on this.
>
> --bg
>
>
> On Mon, Mar 9, 2026 at 10:25 AM Ranier Vilela <[email protected]> wrote:
>
>>
>>
>> Em seg., 9 de mar. de 2026 às 11:47, Bryan Green <[email protected]>
>> escreveu:
>>
>>> I created an example that is a little bit closer to the actual code and
>>> changed the compiler from C++ to C.
>>>
>>> It is interesting the optimization that the compiler has chosen for
>>> version 1 versus version 2.  One calls
>>> memcpy and one doesn't.  There is a good chance the inlining of memcpy
>>> as SSE+scalar per iteration
>>> will be faster for syscache scans-- which I believe are usually small
>>> (1-4 keys?).
>>>
>> I doubt the inline version is better.
>> Clang is supported too and the code generated is much better with memcpy
>> one call outside of the loop.
>>
>>
>>>
>>> Probably the only reason to do this patch would be if N is normally
>>> large or if this is considered an
>>> improvement in code clarity without a detrimental impact on small N
>>> syscache scans.
>>> I realize you only said "possible small optimization".  It might be
>>> worthwhile to benchmark the code for
>>> different values of n to determine if there is a tipping point either
>>> way?
>>>
>>  In your opinion, shouldn't this be considered an optimization, even a
>> small one?
>>
>> best regards,
>> Ranier Vilela
>>
>


^ permalink  raw  reply  [nested|flat] 17+ messages in thread

* Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
  2026-03-09 13:16 Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 14:05 ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 14:45   ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
  2026-03-09 16:25     ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 17:02       ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
  2026-03-09 17:06         ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
@ 2026-03-20 09:54           ` lakshmi <[email protected]>
  2026-03-20 10:58             ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  0 siblings, 1 reply; 17+ messages in thread

From: lakshmi @ 2026-03-20 09:54 UTC (permalink / raw)
  To: Bryan Green <[email protected]>; +Cc: Ranier Vilela <[email protected]>; pgsql-hackers

--0000000000002304d0064d719e44
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi,

I tried this change on PostgreSQL (19devel) and ran a few simple tests
using pgbench to see how it behaves in practice.

I used 10 clients and 4 threads and ran each test for 60 seconds.



^ permalink  raw  reply  [nested|flat] 17+ messages in thread

* Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
  2026-03-09 13:16 Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 14:05 ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 14:45   ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
  2026-03-09 16:25     ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 17:02       ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
  2026-03-09 17:06         ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
  2026-03-20 09:54           ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) lakshmi <[email protected]>
@ 2026-03-20 10:58             ` Ranier Vilela <[email protected]>
  2026-03-23 04:45               ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) lakshmi <[email protected]>
  0 siblings, 1 reply; 17+ messages in thread

From: Ranier Vilela @ 2026-03-20 10:58 UTC (permalink / raw)
  To: lakshmi <[email protected]>; +Cc: Bryan Green <[email protected]>; pgsql-hackers

Hi.

Em sex., 20 de mar. de 2026 às 06:50, lakshmi <[email protected]>
escreveu:

> Hi,
>
> I tried this change on PostgreSQL (19devel) and ran a few simple tests
> using pgbench to see how it behaves in practice.
>
> I used 10 clients and 4 threads and ran each test for 60 seconds.
>
> From my runs, the original version was giving around ~663 TPS with ~15.07
> ms latency.
> With the patched version, I observed TPS in the range of ~638–657,
> averaging around ~648 TPS, with latency slightly higher (~15.2–15.6 ms).
>
Thanks for the benchmark.
Could you share the tests and the environment?
compiler
OS
etc.

best regards,
Ranier Vilela


^ permalink  raw  reply  [nested|flat] 17+ messages in thread

* Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
  2026-03-09 13:16 Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 14:05 ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 14:45   ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
  2026-03-09 16:25     ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 17:02       ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
  2026-03-09 17:06         ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
  2026-03-20 09:54           ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) lakshmi <[email protected]>
  2026-03-20 10:58             ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
@ 2026-03-23 04:45               ` lakshmi <[email protected]>
  2026-03-23 11:12                 ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  0 siblings, 1 reply; 17+ messages in thread

From: lakshmi @ 2026-03-23 04:45 UTC (permalink / raw)
  To: Ranier Vilela <[email protected]>; +Cc: Bryan Green <[email protected]>; pgsql-hackers

Hi,

Thanks for the reply.

Here are the details of my test setup.

I ran the tests on Debian Linux (6.1.0-10-amd64, x86_64) on an AMD Ryzen 5
7535U (6 cores / 12 threads).

The PostgreSQL version was 19devel, built from source using gcc 12.2.0 with
enable-debug.

For benchmarking, I used pgbench with the following command:
pgbench -p 55432 -d postgres -c 10 -j 4 -T 60

The database was initialized with scale factor 1 (pgbench -i). I ran
multiple iterations for both the original and patched versions and compared
TPS and latency across runs.

All tests were done on the same system under similar conditions.

Please let me know if you’d like me to try any other scenarios or provide
more details.

Regards,
Lakshmi G

On Fri, Mar 20, 2026 at 4:28 PM Ranier Vilela <[email protected]> wrote:

> Hi.
>
> Em sex., 20 de mar. de 2026 às 06:50, lakshmi <[email protected]>
> escreveu:
>
>> Hi,
>>
>> I tried this change on PostgreSQL (19devel) and ran a few simple tests
>> using pgbench to see how it behaves in practice.
>>
>> I used 10 clients and 4 threads and ran each test for 60 seconds.
>>
>> From my runs, the original version was giving around ~663 TPS with ~15.07
>> ms latency.
>> With the patched version, I observed TPS in the range of ~638–657,
>> averaging around ~648 TPS, with latency slightly higher (~15.2–15.6 ms).
>>
> Thanks for the benchmark.
> Could you share the tests and the environment?
> compiler
> OS
> etc.
>
> best regards,
> Ranier Vilela
>


^ permalink  raw  reply  [nested|flat] 17+ messages in thread

* Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
  2026-03-09 13:16 Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 14:05 ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 14:45   ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
  2026-03-09 16:25     ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 17:02       ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
  2026-03-09 17:06         ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
  2026-03-20 09:54           ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) lakshmi <[email protected]>
  2026-03-20 10:58             ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-23 04:45               ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) lakshmi <[email protected]>
@ 2026-03-23 11:12                 ` Ranier Vilela <[email protected]>
  2026-03-25 11:29                   ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) lakshmi <[email protected]>
  0 siblings, 1 reply; 17+ messages in thread

From: Ranier Vilela @ 2026-03-23 11:12 UTC (permalink / raw)
  To: lakshmi <[email protected]>; +Cc: Bryan Green <[email protected]>; pgsql-hackers

Hi.

Em seg., 23 de mar. de 2026 às 01:42, lakshmi <[email protected]>
escreveu:

> Hi,
>
> Thanks for the reply.
>
> Here are the details of my test setup.
>
> I ran the tests on Debian Linux (6.1.0-10-amd64, x86_64) on an AMD Ryzen 5
> 7535U (6 cores / 12 threads).
>
Thanks for the tests.

>
> The PostgreSQL version was 19devel, built from source using gcc 12.2.0
> with enable-debug.
>
> For benchmarking, I used pgbench with the following command:
> pgbench -p 55432 -d postgres -c 10 -j 4 -T 60
>
> The database was initialized with scale factor 1 (pgbench -i). I ran
> multiple iterations for both the original and patched versions and compared
> TPS and latency across runs.
>
> All tests were done on the same system under similar conditions.
>
> Please let me know if you’d like me to try any other scenarios or provide
> more details.
>
Could you repeat the tests without enable-debug?

best regards,
Ranier Vilela


^ permalink  raw  reply  [nested|flat] 17+ messages in thread

* Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
  2026-03-09 13:16 Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 14:05 ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 14:45   ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
  2026-03-09 16:25     ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 17:02       ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
  2026-03-09 17:06         ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
  2026-03-20 09:54           ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) lakshmi <[email protected]>
  2026-03-20 10:58             ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-23 04:45               ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) lakshmi <[email protected]>
  2026-03-23 11:12                 ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
@ 2026-03-25 11:29                   ` lakshmi <[email protected]>
  2026-03-25 11:38                     ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  0 siblings, 1 reply; 17+ messages in thread

From: lakshmi @ 2026-03-25 11:29 UTC (permalink / raw)
  To: Ranier Vilela <[email protected]>; +Cc: Bryan Green <[email protected]>; pgsql-hackers

> Hi,
>>
>> Thanks for the suggestion and for the earlier discussion.
>>
> Could you repeat the tests without enable-debug?
>
I repeated the tests using a non-debug (optimized) build of PostgreSQL
(19devel), built without --enable-debug.

Test setup:

   -

   Debian Linux (x86_64)
   -

   gcc 12.2.0
   -

   AMD Ryzen 5 7535U (6 cores / 12 threads)
   -

   pgbench scale factor 1
   -

   command: pgbench -p 55432 -d postgres -c 10 -j 4 -T 60 I ran multiple
   iterations for both the original and patched versions and considered the
   stable runs (excluding those affected by checkpoints).

   Results:

   Original:
   TPS: ~1047
   Latency: ~9.5 ms

   Patched:
   TPS: ~1040
   Latency: ~9.6 ms

   From these runs, the results are quite close, and I didn’t observe a
   consistent performance improvement with the patch under this workload. The
   differences appear to be within normal run-to-run variation.

   It may be that this change has limited impact in typical cases where the
   number of keys is small. It would be interesting to see how it behaves with
   workloads involving larger numbers of keys or different access patterns.

   Please let me know if you’d like me to try any additional scenarios.

   Regards,
   Lakshmi


^ permalink  raw  reply  [nested|flat] 17+ messages in thread

* Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
  2026-03-09 13:16 Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 14:05 ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 14:45   ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
  2026-03-09 16:25     ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 17:02       ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
  2026-03-09 17:06         ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
  2026-03-20 09:54           ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) lakshmi <[email protected]>
  2026-03-20 10:58             ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-23 04:45               ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) lakshmi <[email protected]>
  2026-03-23 11:12                 ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-25 11:29                   ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) lakshmi <[email protected]>
@ 2026-03-25 11:38                     ` Ranier Vilela <[email protected]>
  0 siblings, 0 replies; 17+ messages in thread

From: Ranier Vilela @ 2026-03-25 11:38 UTC (permalink / raw)
  To: lakshmi <[email protected]>; +Cc: Bryan Green <[email protected]>; pgsql-hackers

Hi.

Em qua., 25 de mar. de 2026 às 08:26, lakshmi <[email protected]>
escreveu:

>
> Hi,
>>>
>>> Thanks for the suggestion and for the earlier discussion.
>>>
>> Could you repeat the tests without enable-debug?
>>
> I repeated the tests using a non-debug (optimized) build of PostgreSQL
> (19devel), built without --enable-debug.
>
> Test setup:
>
>    -
>
>    Debian Linux (x86_64)
>    -
>
>    gcc 12.2.0
>    -
>
>    AMD Ryzen 5 7535U (6 cores / 12 threads)
>    -
>
>    pgbench scale factor 1
>    -
>
>    command: pgbench -p 55432 -d postgres -c 10 -j 4 -T 60 I ran multiple
>    iterations for both the original and patched versions and considered the
>    stable runs (excluding those affected by checkpoints).
>
> Thanks.


>
>    -
>
>    Results:
>
>    Original:
>    TPS: ~1047
>    Latency: ~9.5 ms
>
>    Patched:
>    TPS: ~1040
>    Latency: ~9.6 ms
>
>    From these runs, the results are quite close, and I didn’t observe a
>    consistent performance improvement with the patch under this workload. The
>    differences appear to be within normal run-to-run variation.
>
>  I agree.


>    -
>
>    It may be that this change has limited impact in typical cases where
>    the number of keys is small. It would be interesting to see how it behaves
>    with workloads involving larger numbers of keys or different access
>    patterns.
>
> I believe it's unnecessary; I think it's been clearly demonstrated that
there has been no improvement, so it's not worth it.


>    -
>
>    Please let me know if you’d like me to try any additional scenarios.
>
> Thank you for testing.

For me, the lesson here is that this is the first time I've seen that
unloop doesn't work any better.

best regards,
Ranier Vilela


^ permalink  raw  reply  [nested|flat] 17+ messages in thread

* Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
  2026-03-09 13:16 Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 14:05 ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 14:45   ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
  2026-03-09 16:25     ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 17:02       ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
@ 2026-03-12 14:35         ` Ranier Vilela <[email protected]>
  2026-03-12 17:48           ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
  1 sibling, 1 reply; 17+ messages in thread

From: Ranier Vilela @ 2026-03-12 14:35 UTC (permalink / raw)
  To: Bryan Green <[email protected]>; +Cc: pgsql-hackers

Hi.

Em seg., 9 de mar. de 2026 às 14:02, Bryan Green <[email protected]>
escreveu:

> I performed a micro-benchmark on my dual epyc (zen 2) server and version 1
> wins for small values of n.
>
> 20 runs:
>
> n       version       min  median    mean     max  stddev  noise%
> -----------------------------------------------------------------------
> n=1     version1     2.440   2.440   2.450   2.550   0.024    4.5%
> n=1     version2     4.260   4.280   4.277   4.290   0.007    0.7%
>
> n=2     version1     2.740   2.750   2.757   2.880   0.029    5.1%
> n=2     version2     3.970   3.980   3.980   4.020   0.010    1.3%
>
> n=4     version1     4.580   4.595   4.649   4.910   0.094    7.2%
> n=4     version2     5.780   5.815   5.809   5.820   0.013    0.7%
>
> But, micro-benchmarks always make me nervous, so I looked at the actual
> instruction cost for my
> platform given the version 1 and version 2 code.
>
> If we count cpu cycles using the AMD Zen 2 instruction latency/throughput
> tables:  version 1 (loop body)
> has a critical path of ~5-6 cycles per iteration.  version 2 (loop body)
> has ~3-4 cycles per iteration.
>
> The problem for version 2 is that the call to memcpy is ~24-30 cycles due
> to the stub + function call + return
> and branch predictor pressure on first call.  This probably results in
> ~2.5 ns per iteration cost for version 2.
>
> So, no I wouldn't call it an optimization.  But, it will be interesting to
> hear other opinions on this.
>
I made dirty and quick tests with two versions:
gcc 15.2.0
gcc -O2 memcpy1.c -o memcpy1

The first test was with keys 10000000 and 10000000 loops:
version1: on memcpy call
done in 1873 nanoseconds

version2: inlined memcpy
not finish

The second test was with keys 4 and 10000000 loops:
version1: one memcpy call
version2: inlined memcpy call

version1: done in 1519 nanoseconds
version2: done in 104981851 nanoseconds
(1.44692e-05 times faster)

version1: done in 1979 nanoseconds
version2: done in 110568901 nanoseconds
(1.78983e-05 times faster)

version1: done in 1814 nanoseconds
version2: done in 108555484 nanoseconds
(1.67103e-05 times faster)

version1: done in 1631 nanoseconds
version2: done in 109867919 nanoseconds
(1.48451e-05 times faster)

version1: done in 1269 nanoseconds
version2: done in 111639106 nanoseconds
(1.1367e-05 times faster)

Unless I'm doing something wrong, one call memcpy wins!
memcpy1.c attached.

best regards,
Ranier Vilela


Attachments:

  [text/x-csrc] memcpy1.c (2.6K, 3-memcpy1.c)
  download | inline:
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>
#include <immintrin.h>


/* Closer approximation of ScanKeyData - has function pointer and Datum */
typedef void (*RegProcedure)(void);
typedef uintptr_t Datum;

typedef struct ScanKeyData
{
    int         sk_flags;
    int         sk_attno;
    RegProcedure sk_func;
    Datum       sk_argument;
} ScanKeyData;

/*  */
const ScanKeyData * version1(int n, const ScanKeyData  * key)
{
    ScanKeyData *idxkey = (ScanKeyData *) malloc(n * sizeof(ScanKeyData));

    memcpy(&idxkey, &key, n * sizeof(ScanKeyData));
    for (int i = 0; i < n; i++)
    {
        idxkey[i].sk_attno = i + 1;
    }

    return idxkey;
}

/* */
const ScanKeyData * version2(int n, const ScanKeyData *key)
{
    ScanKeyData *idxkey = (ScanKeyData *) malloc(n * sizeof(ScanKeyData));

    for (int i = 0; i < n; i++)
    {
        memcpy(&idxkey[i], &key[i], sizeof(ScanKeyData));
        idxkey[i].sk_attno = i + 1;
    }

    return idxkey;
}




#define NANOSEC_PER_SEC 1000000000

// Returns difference in nanoseconds
int64_t
get_clock_diff(struct timespec *t1, struct timespec *t2)
{
	int64_t nanosec = (t1->tv_sec - t2->tv_sec) * NANOSEC_PER_SEC;
	nanosec += (t1->tv_nsec - t2->tv_nsec);

	return nanosec;
}




//#define NKEYS 10000000 version2 does not finish
#define NKEYS 4
#define LOOPS 10000000

void test1(int n)
{
	ScanKeyData *keys;
	ScanKeyData *idx;

        keys = (ScanKeyData *) malloc(NKEYS * sizeof(ScanKeyData));
	memset(keys, 0, NKEYS * sizeof(ScanKeyData));

	for(int i = 0; i < n; i++)
	{
		idx = version1(NKEYS, keys);
		free(idx);
	}
	free(keys);
}

void test2(int n)
{
	ScanKeyData *keys;
	ScanKeyData *idx;

        keys = (ScanKeyData *) malloc(NKEYS * sizeof(ScanKeyData));
	memset(keys, 0, NKEYS * sizeof(ScanKeyData));

	for(int i = 0; i < n; i++)
	{
		idx = version2(NKEYS, keys);
		free(idx);
	}
	free(keys);
}


int main(void)
{
	struct timespec start,end;
	int64_t version1_time, version2_time;

	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
	test1(LOOPS);
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
	version1_time = get_clock_diff(&end, &start);
	printf("version1: done in %lld nanoseconds\n", version1_time);	

	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
	test2(LOOPS);
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
	version2_time = get_clock_diff(&end, &start);
	printf("version2: done in %lld nanoseconds\n", version2_time);	

	printf("(%g times faster)\n", (double) version1_time / version2_time);

	return 0;
}

^ permalink  raw  reply  [nested|flat] 17+ messages in thread

* Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
  2026-03-09 13:16 Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 14:05 ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 14:45   ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
  2026-03-09 16:25     ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 17:02       ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
  2026-03-12 14:35         ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
@ 2026-03-12 17:48           ` Bryan Green <[email protected]>
  2026-03-12 19:21             ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
  0 siblings, 1 reply; 17+ messages in thread

From: Bryan Green @ 2026-03-12 17:48 UTC (permalink / raw)
  To: Ranier Vilela <[email protected]>; +Cc: pgsql-hackers

I don't think your version 1 memcpy is doing what you think it is doing.

On Thu, Mar 12, 2026 at 12:35 PM Ranier Vilela <[email protected]> wrote:

> Hi.
>
> Em seg., 9 de mar. de 2026 às 14:02, Bryan Green <[email protected]>
> escreveu:
>
>> I performed a micro-benchmark on my dual epyc (zen 2) server and version
>> 1 wins for small values of n.
>>
>> 20 runs:
>>
>> n       version       min  median    mean     max  stddev  noise%
>> -----------------------------------------------------------------------
>> n=1     version1     2.440   2.440   2.450   2.550   0.024    4.5%
>> n=1     version2     4.260   4.280   4.277   4.290   0.007    0.7%
>>
>> n=2     version1     2.740   2.750   2.757   2.880   0.029    5.1%
>> n=2     version2     3.970   3.980   3.980   4.020   0.010    1.3%
>>
>> n=4     version1     4.580   4.595   4.649   4.910   0.094    7.2%
>> n=4     version2     5.780   5.815   5.809   5.820   0.013    0.7%
>>
>> But, micro-benchmarks always make me nervous, so I looked at the actual
>> instruction cost for my
>> platform given the version 1 and version 2 code.
>>
>> If we count cpu cycles using the AMD Zen 2 instruction latency/throughput
>> tables:  version 1 (loop body)
>> has a critical path of ~5-6 cycles per iteration.  version 2 (loop body)
>> has ~3-4 cycles per iteration.
>>
>> The problem for version 2 is that the call to memcpy is ~24-30 cycles due
>> to the stub + function call + return
>> and branch predictor pressure on first call.  This probably results in
>> ~2.5 ns per iteration cost for version 2.
>>
>> So, no I wouldn't call it an optimization.  But, it will be interesting
>> to hear other opinions on this.
>>
> I made dirty and quick tests with two versions:
> gcc 15.2.0
> gcc -O2 memcpy1.c -o memcpy1
>
> The first test was with keys 10000000 and 10000000 loops:
> version1: on memcpy call
> done in 1873 nanoseconds
>
> version2: inlined memcpy
> not finish
>
> The second test was with keys 4 and 10000000 loops:
> version1: one memcpy call
> version2: inlined memcpy call
>
> version1: done in 1519 nanoseconds
> version2: done in 104981851 nanoseconds
> (1.44692e-05 times faster)
>
> version1: done in 1979 nanoseconds
> version2: done in 110568901 nanoseconds
> (1.78983e-05 times faster)
>
> version1: done in 1814 nanoseconds
> version2: done in 108555484 nanoseconds
> (1.67103e-05 times faster)
>
> version1: done in 1631 nanoseconds
> version2: done in 109867919 nanoseconds
> (1.48451e-05 times faster)
>
> version1: done in 1269 nanoseconds
> version2: done in 111639106 nanoseconds
> (1.1367e-05 times faster)
>
> Unless I'm doing something wrong, one call memcpy wins!
> memcpy1.c attached.
>
> best regards,
> Ranier Vilela
>


^ permalink  raw  reply  [nested|flat] 17+ messages in thread

* Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
  2026-03-09 13:16 Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 14:05 ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 14:45   ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
  2026-03-09 16:25     ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 17:02       ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
  2026-03-12 14:35         ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-12 17:48           ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
@ 2026-03-12 19:21             ` Bryan Green <[email protected]>
  2026-03-12 19:27               ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  0 siblings, 1 reply; 17+ messages in thread

From: Bryan Green @ 2026-03-12 19:21 UTC (permalink / raw)
  To: Ranier Vilela <[email protected]>; +Cc: pgsql-hackers

I modified your memcpy1.c program to not inline the version functions.  I
changed the memcpy function
call in version 1, added volatile to keep some DCE opportunities from
happening and added a range
of N values to keep the compiler from specializing the code for N = 4.
Before it did DCE and the test1
function was just a ret.

The interesting issue is the use of malloc versus the stack.  The use of
malloc will probably track closer
with PG's use of palloc so I would say in that case this is an
optimization.  It might be fun to compile PG
with and without the patch (in debug mode) and actually see what gets
generated for this function.

Here are the results I got using your modified benchmark:
--- stack allocated ---
stack  n=1  v1(patch): 49721599 ns  v2(original): 21477302 ns  ratio: 2.315
 original wins
stack  n=2  v1(patch): 52065462 ns  v2(original): 28765199 ns  ratio: 1.810
 original wins
stack  n=3  v1(patch): 58914958 ns  v2(original): 39726110 ns  ratio: 1.483
 original wins
stack  n=4  v1(patch): 64585275 ns  v2(original): 47046397 ns  ratio: 1.373
 original wins
stack  n=5  v1(patch): 73929844 ns  v2(original): 58588698 ns  ratio: 1.262
 original wins
stack  n=6  v1(patch): 95465376 ns  v2(original): 67807817 ns  ratio: 1.408
 original wins
stack  n=7  v1(patch): 86910226 ns  v2(original): 76999488 ns  ratio: 1.129
 original wins
stack  n=8  v1(patch): 107765417 ns  v2(original): 86046016 ns  ratio:
1.252  original wins

--- malloc allocated ---
malloc n=1  v1(patch): 133283824 ns  v2(original): 141361091 ns  ratio:
0.943  patch wins
malloc n=2  v1(patch): 145625895 ns  v2(original): 180912711 ns  ratio:
0.805  patch wins
malloc n=3  v1(patch): 153975594 ns  v2(original): 228459879 ns  ratio:
0.674  patch wins
malloc n=4  v1(patch): 154483094 ns  v2(original): 248157408 ns  ratio:
0.623  patch wins
malloc n=5  v1(patch): 157710598 ns  v2(original): 298795018 ns  ratio:
0.528  patch wins
malloc n=6  v1(patch): 165196636 ns  v2(original): 332940132 ns  ratio:
0.496  patch wins
malloc n=7  v1(patch): 169576370 ns  v2(original): 358438778 ns  ratio:
0.473  patch wins
malloc n=8  v1(patch): 184463815 ns  v2(original): 403721513 ns  ratio:
0.457  patch wins


The modified program:

#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>

typedef void (*RegProcedure)(void);
typedef uintptr_t Datum;

typedef struct ScanKeyData
{
    int         sk_flags;
    int         sk_attno;
    RegProcedure sk_func;
    Datum       sk_argument;
} ScanKeyData;

/* version1: bulk memcpy + fixup (the patch) */
static __attribute__((noinline))
void version1_stack(int n, const ScanKeyData *key, ScanKeyData *idxkey)
{
    memcpy(idxkey, key, n * sizeof(ScanKeyData));
    for (int i = 0; i < n; i++)
        idxkey[i].sk_attno = i + 1;
}

/* version2: per-element memcpy + fixup (the original) */
static __attribute__((noinline))
void version2_stack(int n, const ScanKeyData *key, ScanKeyData *idxkey)
{
    for (int i = 0; i < n; i++)
    {
        memcpy(&idxkey[i], &key[i], sizeof(ScanKeyData));
        idxkey[i].sk_attno = i + 1;
    }
}

/* version1: bulk memcpy + fixup (the patch) */
static __attribute__((noinline))
ScanKeyData *version1_malloc(int n, const ScanKeyData *key)
{
    ScanKeyData *idxkey = (ScanKeyData *) malloc(n * sizeof(ScanKeyData));

    memcpy(idxkey, key, n * sizeof(ScanKeyData));
    for (int i = 0; i < n; i++)
        idxkey[i].sk_attno = i + 1;

    return idxkey;
}

/* version2: per-element memcpy + fixup (the original) */
static __attribute__((noinline))
ScanKeyData *version2_malloc(int n, const ScanKeyData *key)
{
    ScanKeyData *idxkey = (ScanKeyData *) malloc(n * sizeof(ScanKeyData));

    for (int i = 0; i < n; i++)
    {
        memcpy(&idxkey[i], &key[i], sizeof(ScanKeyData));
        idxkey[i].sk_attno = i + 1;
    }

    return idxkey;
}

#define NANOSEC_PER_SEC 1000000000

int64_t
get_clock_diff(struct timespec *t1, struct timespec *t2)
{
    int64_t nanosec = (t1->tv_sec - t2->tv_sec) * NANOSEC_PER_SEC;
    nanosec += (t1->tv_nsec - t2->tv_nsec);
    return nanosec;
}

#define MAX_KEYS 8
#define LOOPS 10000000

void test_stack(int n)
{
    ScanKeyData  keys[MAX_KEYS];
    ScanKeyData  idxkey[MAX_KEYS];
    struct timespec start, end;
    int64_t      version1_time, version2_time;

    memset(keys, 0, sizeof(keys));

    /* warmup */
    for (int i = 0; i < 1000; i++)
    {
        version1_stack(n, keys, idxkey);
        volatile int sink = idxkey[n-1].sk_attno;
        (void) sink;
    }

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for (int i = 0; i < LOOPS; i++)
    {
        version1_stack(n, keys, idxkey);
        volatile int sink = idxkey[n-1].sk_attno;
        (void) sink;
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
    version1_time = get_clock_diff(&end, &start);

    /* warmup */
    for (int i = 0; i < 1000; i++)
    {
        version2_stack(n, keys, idxkey);
        volatile int sink = idxkey[n-1].sk_attno;
        (void) sink;
    }

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for (int i = 0; i < LOOPS; i++)
    {
        version2_stack(n, keys, idxkey);
        volatile int sink = idxkey[n-1].sk_attno;
        (void) sink;
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
    version2_time = get_clock_diff(&end, &start);

    printf("stack  n=%d  v1(patch): %ld ns  v2(original): %ld ns  ratio:
%.3f  %s\n",
           n, version1_time, version2_time,
           (double) version1_time / version2_time,
           version1_time < version2_time ? "patch wins" : "original wins");
}

void test_malloc(int n)
{
    ScanKeyData  keys[MAX_KEYS];
    ScanKeyData  *idxkey;
    struct timespec start, end;
    int64_t      version1_time, version2_time;

    memset(keys, 0, sizeof(keys));

    /* warmup */
    for (int i = 0; i < 1000; i++)
    {
        idxkey = version1_malloc(n, keys);
        volatile int sink = idxkey[n-1].sk_attno;
        (void) sink;
        free(idxkey);
    }

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for (int i = 0; i < LOOPS; i++)
    {
        idxkey = version1_malloc(n, keys);
        volatile int sink = idxkey[n-1].sk_attno;
        (void) sink;
        free(idxkey);
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
    version1_time = get_clock_diff(&end, &start);

    /* warmup */
    for (int i = 0; i < 1000; i++)
    {
        idxkey = version2_malloc(n, keys);
        volatile int sink = idxkey[n-1].sk_attno;
        (void) sink;
        free(idxkey);
    }

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for (int i = 0; i < LOOPS; i++)
    {
        idxkey = version2_malloc(n, keys);
        volatile int sink = idxkey[n-1].sk_attno;
        (void) sink;
        free(idxkey);
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
    version2_time = get_clock_diff(&end, &start);

    printf("malloc n=%d  v1(patch): %ld ns  v2(original): %ld ns  ratio:
%.3f  %s\n",
           n, version1_time, version2_time,
           (double) version1_time / version2_time,
           version1_time < version2_time ? "patch wins" : "original wins");
}

int main(void)
{
    printf("--- stack allocated ---\n");
    for (int n = 1; n <= MAX_KEYS; n++)
        test_stack(n);

    printf("\n--- malloc allocated ---\n");
    for (int n = 1; n <= MAX_KEYS; n++)
        test_malloc(n);

    return 0;
}


-- bg




On Thu, Mar 12, 2026 at 12:48 PM Bryan Green <[email protected]> wrote:

> I don't think your version 1 memcpy is doing what you think it is doing.
>
> On Thu, Mar 12, 2026 at 12:35 PM Ranier Vilela <[email protected]>
> wrote:
>
>> Hi.
>>
>> Em seg., 9 de mar. de 2026 às 14:02, Bryan Green <[email protected]>
>> escreveu:
>>
>>> I performed a micro-benchmark on my dual epyc (zen 2) server and version
>>> 1 wins for small values of n.
>>>
>>> 20 runs:
>>>
>>> n       version       min  median    mean     max  stddev  noise%
>>> -----------------------------------------------------------------------
>>> n=1     version1     2.440   2.440   2.450   2.550   0.024    4.5%
>>> n=1     version2     4.260   4.280   4.277   4.290   0.007    0.7%
>>>
>>> n=2     version1     2.740   2.750   2.757   2.880   0.029    5.1%
>>> n=2     version2     3.970   3.980   3.980   4.020   0.010    1.3%
>>>
>>> n=4     version1     4.580   4.595   4.649   4.910   0.094    7.2%
>>> n=4     version2     5.780   5.815   5.809   5.820   0.013    0.7%
>>>
>>> But, micro-benchmarks always make me nervous, so I looked at the actual
>>> instruction cost for my
>>> platform given the version 1 and version 2 code.
>>>
>>> If we count cpu cycles using the AMD Zen 2 instruction
>>> latency/throughput tables:  version 1 (loop body)
>>> has a critical path of ~5-6 cycles per iteration.  version 2 (loop body)
>>> has ~3-4 cycles per iteration.
>>>
>>> The problem for version 2 is that the call to memcpy is ~24-30 cycles
>>> due to the stub + function call + return
>>> and branch predictor pressure on first call.  This probably results in
>>> ~2.5 ns per iteration cost for version 2.
>>>
>>> So, no I wouldn't call it an optimization.  But, it will be interesting
>>> to hear other opinions on this.
>>>
>> I made dirty and quick tests with two versions:
>> gcc 15.2.0
>> gcc -O2 memcpy1.c -o memcpy1
>>
>> The first test was with keys 10000000 and 10000000 loops:
>> version1: on memcpy call
>> done in 1873 nanoseconds
>>
>> version2: inlined memcpy
>> not finish
>>
>> The second test was with keys 4 and 10000000 loops:
>> version1: one memcpy call
>> version2: inlined memcpy call
>>
>> version1: done in 1519 nanoseconds
>> version2: done in 104981851 nanoseconds
>> (1.44692e-05 times faster)
>>
>> version1: done in 1979 nanoseconds
>> version2: done in 110568901 nanoseconds
>> (1.78983e-05 times faster)
>>
>> version1: done in 1814 nanoseconds
>> version2: done in 108555484 nanoseconds
>> (1.67103e-05 times faster)
>>
>> version1: done in 1631 nanoseconds
>> version2: done in 109867919 nanoseconds
>> (1.48451e-05 times faster)
>>
>> version1: done in 1269 nanoseconds
>> version2: done in 111639106 nanoseconds
>> (1.1367e-05 times faster)
>>
>> Unless I'm doing something wrong, one call memcpy wins!
>> memcpy1.c attached.
>>
>> best regards,
>> Ranier Vilela
>>
>


^ permalink  raw  reply  [nested|flat] 17+ messages in thread

* Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
  2026-03-09 13:16 Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 14:05 ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 14:45   ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
  2026-03-09 16:25     ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-09 17:02       ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
  2026-03-12 14:35         ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
  2026-03-12 17:48           ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
  2026-03-12 19:21             ` Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Bryan Green <[email protected]>
@ 2026-03-12 19:27               ` Ranier Vilela <[email protected]>
  0 siblings, 0 replies; 17+ messages in thread

From: Ranier Vilela @ 2026-03-12 19:27 UTC (permalink / raw)
  To: Bryan Green <[email protected]>; +Cc: pgsql-hackers

Em qui., 12 de mar. de 2026 às 16:21, Bryan Green <[email protected]>
escreveu:

> I modified your memcpy1.c program to not inline the version functions.  I
> changed the memcpy function
> call in version 1, added volatile to keep some DCE opportunities from
> happening and added a range
> of N values to keep the compiler from specializing the code for N = 4.
> Before it did DCE and the test1
> function was just a ret.
>
> The interesting issue is the use of malloc versus the stack.  The use of
> malloc will probably track closer
> with PG's use of palloc so I would say in that case this is an
> optimization.  It might be fun to compile PG
> with and without the patch (in debug mode) and actually see what gets
> generated for this function.
>
> Here are the results I got using your modified benchmark:
> --- stack allocated ---
> stack  n=1  v1(patch): 49721599 ns  v2(original): 21477302 ns  ratio:
> 2.315  original wins
> stack  n=2  v1(patch): 52065462 ns  v2(original): 28765199 ns  ratio:
> 1.810  original wins
> stack  n=3  v1(patch): 58914958 ns  v2(original): 39726110 ns  ratio:
> 1.483  original wins
> stack  n=4  v1(patch): 64585275 ns  v2(original): 47046397 ns  ratio:
> 1.373  original wins
> stack  n=5  v1(patch): 73929844 ns  v2(original): 58588698 ns  ratio:
> 1.262  original wins
> stack  n=6  v1(patch): 95465376 ns  v2(original): 67807817 ns  ratio:
> 1.408  original wins
> stack  n=7  v1(patch): 86910226 ns  v2(original): 76999488 ns  ratio:
> 1.129  original wins
> stack  n=8  v1(patch): 107765417 ns  v2(original): 86046016 ns  ratio:
> 1.252  original wins
>
> --- malloc allocated ---
> malloc n=1  v1(patch): 133283824 ns  v2(original): 141361091 ns  ratio:
> 0.943  patch wins
> malloc n=2  v1(patch): 145625895 ns  v2(original): 180912711 ns  ratio:
> 0.805  patch wins
> malloc n=3  v1(patch): 153975594 ns  v2(original): 228459879 ns  ratio:
> 0.674  patch wins
> malloc n=4  v1(patch): 154483094 ns  v2(original): 248157408 ns  ratio:
> 0.623  patch wins
> malloc n=5  v1(patch): 157710598 ns  v2(original): 298795018 ns  ratio:
> 0.528  patch wins
> malloc n=6  v1(patch): 165196636 ns  v2(original): 332940132 ns  ratio:
> 0.496  patch wins
> malloc n=7  v1(patch): 169576370 ns  v2(original): 358438778 ns  ratio:
> 0.473  patch wins
> malloc n=8  v1(patch): 184463815 ns  v2(original): 403721513 ns  ratio:
> 0.457  patch wins
>
Thanks for your attention and tests.

I think that patch can continue then.

best regards,
Ranier Vilela


^ permalink  raw  reply  [nested|flat] 17+ messages in thread

* Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
  2026-03-09 13:16 Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
@ 2026-03-25 04:10 ` David Rowley <[email protected]>
  1 sibling, 0 replies; 17+ messages in thread

From: David Rowley @ 2026-03-25 04:10 UTC (permalink / raw)
  To: Ranier Vilela <[email protected]>; +Cc: pgsql-hackers

On Tue, 10 Mar 2026 at 02:16, Ranier Vilela <[email protected]> wrote:
> In the functions *systable_beginscan* and *systable_beginscan_ordered*,
> is possible a small optimization.
> The array *idxkey* can be constructed in one go with a single call to mempcy.
> The excess might not make much of a difference, but I think it's worth the effort.

This whole thread reeks of you running a static analyser to find
optimisation opportunities. If you want to find these, then please be
aware that static analysis of code is pretty much useless for this
purpose. It has no idea how hot the code in question is.

If you want to find places that actually matter, try using perf. Even
perf top --pid=N will allow you to see which functions matter enough
to warrant further analysis. By all means, if you find something
interesting and see a way to make a meaningful improvement, then post
about it and show a benchmark script and results which demonstrate the
performance gain.

For the patch and the analysis that's been done on it so far, please
be aware that the struct size that you use will be critical to both
the code emitted by the compiler and to the performance. On my
machine, ScanKeyData is 72 bytes. None of the godbolt.org links I
looked at got this right. One had 8 bytes, and another had 24 bytes.
With the default -march flags, you're probably only going to get SSE
instructions which will move 16 bytes at a time. Moving 72 bytes is
going to need 4 of those SSE moves and a MOV for the 8 byte remainder.
That's 5 instructions. 8 bytes is only 1 instruction, and 24 is 2
instructions (16+8).  I've forgotten the exact thresholds, but both
gcc and clang will inline fixed-sized memcpys below something like
256-512 bytes. I assume the compiler authors did a reasonable amount
of analysis to discover those thresholds. 4 ScanKeyData is 288 bytes,
which is pretty close to that threshold. Flicking through our code, I
see most systable_beginscan() calls are only using 1-2 scan keys. I
did see some using 3. I didn't look at all of them, but the ones I saw
are not even near the threshold for not inlining the memcpy. But, by
all means, prove me wrong and do something like add logging in
systable_beginscan and run the regression tests and use the logs to
build a histogram of the calls by nkeys.

For me, I think the unmodified code is better as the memcpy is a fixed
size and can be inlined by the compiler. I might feel different about
it if the outer loop could disappear, but that's not possible since
sk_attno needs to be set. I also doubt that it matters very much as
most of the time scans to catalogue tables will be cached in
sys/rel/cat caches. A few tables don't, like pg_inherits, possibly
others. But by all means, prove me wrong with perf results and the
query that shows systable_beginscan using a meaningful amount of CPU.

On a final note, please please stop thinking your static analyser is
going to help you improve the performance of PostgreSQL. Go and learn
a new tool that's actually useful and fit for the purpose. There's
likely years of work doing small things like this in places that
actually matter. You seem to have lots of energy for PostgreSQL, so I
highly recommend that you take the time and go and learn to use a tool
that finds those for you. You can then focus on the places it
highlights for you. Otherwise, you're just wasting your time and all
of ours.

David

^ permalink  raw  reply  [nested|flat] 17+ messages in thread

end of thread, other threads:[~2026-03-25 11:38 UTC | newest]

Thread overview: 17+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-03-09 13:16 Avoid multiple calls to memcpy (src/backend/access/index/genam.c) Ranier Vilela <[email protected]>
2026-03-09 14:05 ` Ranier Vilela <[email protected]>
2026-03-09 14:45   ` Bryan Green <[email protected]>
2026-03-09 16:25     ` Ranier Vilela <[email protected]>
2026-03-09 17:02       ` Bryan Green <[email protected]>
2026-03-09 17:06         ` Bryan Green <[email protected]>
2026-03-20 09:54           ` lakshmi <[email protected]>
2026-03-20 10:58             ` Ranier Vilela <[email protected]>
2026-03-23 04:45               ` lakshmi <[email protected]>
2026-03-23 11:12                 ` Ranier Vilela <[email protected]>
2026-03-25 11:29                   ` lakshmi <[email protected]>
2026-03-25 11:38                     ` Ranier Vilela <[email protected]>
2026-03-12 14:35         ` Ranier Vilela <[email protected]>
2026-03-12 17:48           ` Bryan Green <[email protected]>
2026-03-12 19:21             ` Bryan Green <[email protected]>
2026-03-12 19:27               ` Ranier Vilela <[email protected]>
2026-03-25 04:10 ` David Rowley <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox