From: Bryan Green
Date: Mon, 9 Mar 2026 11:06:28 -0600
Subject: Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
To: Ranier Vilela
Cc: Pg Hackers

I meant to add that the micro-benchmark ran each loop for 100,000,000
iterations per run, over 20 runs, to come up with those numbers.
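
For readers who want to reproduce this kind of measurement, a minimal harness sketch follows. It is not the benchmark used for the numbers in this thread. `DemoKey`, `copy_keys_v1`, `copy_keys_v2`, `time_ns_per_call` and `run_benchmark` are hypothetical names, and the mapping of "version 1" to a per-iteration memcpy and "version 2" to a single memcpy before the loop is an assumption drawn from the thread subject and the remark that only one version calls memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

#define MAX_KEYS 8

/* Hypothetical stand-in for a scan key; NOT PostgreSQL's ScanKeyData. */
typedef struct DemoKey
{
    int   flags;
    short attno;
    short strategy;
    char  payload[56];
} DemoKey;

/* Version 1 (assumed): copy one key per iteration, then adjust attno. */
static void copy_keys_v1(DemoKey *dst, const DemoKey *src, int n)
{
    for (int i = 0; i < n; i++)
    {
        memcpy(&dst[i], &src[i], sizeof(DemoKey));
        dst[i].attno = (short) (i + 1);
    }
}

/* Version 2 (assumed): one memcpy for the whole array, then adjust attno. */
static void copy_keys_v2(DemoKey *dst, const DemoKey *src, int n)
{
    memcpy(dst, src, (size_t) n * sizeof(DemoKey));
    for (int i = 0; i < n; i++)
        dst[i].attno = (short) (i + 1);
}

typedef void (*copy_fn) (DemoKey *, const DemoKey *, int);

/* Volatile sink so the compiler cannot discard the copied data entirely. */
static volatile long sink;

/* Average CPU time per call, in nanoseconds, over `iters` calls. */
static double time_ns_per_call(copy_fn fn, int n, long iters)
{
    DemoKey src[MAX_KEYS];
    DemoKey dst[MAX_KEYS];
    clock_t start, end;

    assert(n >= 1 && n <= MAX_KEYS && iters > 0);
    memset(src, 0, sizeof src);

    start = clock();
    for (long i = 0; i < iters; i++)
    {
        fn(dst, src, n);
        sink += dst[n - 1].attno;
    }
    end = clock();

    return (double) (end - start) / CLOCKS_PER_SEC * 1e9 / (double) iters;
}

/* Repeat the measurement `runs` times, storing each result in out[]. */
static void run_benchmark(copy_fn fn, int n, long iters, int runs, double *out)
{
    for (int r = 0; r < runs; r++)
        out[r] = time_ns_per_call(fn, n, iters);
}
```

Calling `run_benchmark(copy_keys_v2, 4, 100000000L, 20, results)` would mirror the 20 runs of 100,000,000 iterations described above. The volatile sink is only a weak optimization barrier, so the generated assembly should be inspected before trusting any result.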

--bg

On Mon, Mar 9, 2026 at 11:02 AM Bryan Green <dbryan.green@gmail.com> wrote:
> I performed a micro-benchmark on my dual epyc (zen 2) server and version 1
> wins for small values of n.
>
> 20 runs:
>
> n     version      min  median    mean     max  stddev  noise%
> --------------------------------------------------------------
> n=1   version1   2.440   2.440   2.450   2.550   0.024    4.5%
> n=1   version2   4.260   4.280   4.277   4.290   0.007    0.7%
>
> n=2   version1   2.740   2.750   2.757   2.880   0.029    5.1%
> n=2   version2   3.970   3.980   3.980   4.020   0.010    1.3%
>
> n=4   version1   4.580   4.595   4.649   4.910   0.094    7.2%
> n=4   version2   5.780   5.815   5.809   5.820   0.013    0.7%
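
The message does not define the noise% column. The reported values appear consistent with (max - min) / min expressed as a percentage, which is an inference from the table rather than something the author states. A small sketch of that check, plus the median ratio between the two versions, with the hypothetical helper names `noise_percent` and `slowdown`:

```c
#include <assert.h>

/* Relative spread between the fastest and slowest run, in percent.
 * Assumed definition of "noise%"; inferred from the table, not stated. */
static double noise_percent(double min, double max)
{
    assert(min > 0.0 && max >= min);
    return (max - min) / min * 100.0;
}

/* Ratio of version 2's median time to version 1's median time. */
static double slowdown(double median_v1, double median_v2)
{
    assert(median_v1 > 0.0);
    return median_v2 / median_v1;
}
```

For example, `noise_percent(2.440, 2.550)` gives about 4.5 and `noise_percent(5.780, 5.820)` gives about 0.7, matching the first and last rows. On medians, `slowdown` gives roughly 1.75 for n=1, 1.45 for n=2 and 1.27 for n=4, so the gap narrows as n grows.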

> But micro-benchmarks always make me nervous, so I looked at the actual
> instruction cost for my platform given the version 1 and version 2 code.
>
> If we count CPU cycles using the AMD Zen 2 instruction latency/throughput
> tables, the version 1 loop body has a critical path of ~5-6 cycles per
> iteration, while the version 2 loop body has ~3-4 cycles per iteration.
>
> The problem for version 2 is that the call to memcpy costs ~24-30 cycles,
> due to the stub + function call + return and branch predictor pressure on
> the first call. This probably results in a ~2.5 ns per-iteration cost for
> version 2.
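
One way to read the cycle estimates above is as a simple linear cost model. That model is an assumption made here for illustration, not something the message states: version 1 costs `n * per_iter_v1` cycles, and version 2 costs a fixed `call_overhead` plus `n * per_iter_v2`. It ignores out-of-order overlap, cache effects and the cost of the copy itself inside memcpy, so treat it as back-of-the-envelope only. The function names are hypothetical.

```c
#include <assert.h>

/* Assumed model: version 1 pays only a per-iteration cost. */
static double cost_v1_cycles(int n, double per_iter_v1)
{
    return n * per_iter_v1;
}

/* Assumed model: version 2 pays a fixed call overhead plus a smaller
 * per-iteration cost. */
static double cost_v2_cycles(int n, double per_iter_v2, double call_overhead)
{
    return call_overhead + n * per_iter_v2;
}

/* Smallest n in [1, max_n] at which version 2 is no more expensive than
 * version 1 under the model; returns -1 if there is none. */
static int break_even_n(double per_iter_v1, double per_iter_v2,
                        double call_overhead, int max_n)
{
    assert(max_n >= 1);
    for (int n = 1; n <= max_n; n++)
    {
        if (cost_v2_cycles(n, per_iter_v2, call_overhead) <=
            cost_v1_cycles(n, per_iter_v1))
            return n;
    }
    return -1;
}
```

With the midpoints of the quoted ranges (5.5 and 3.5 cycles per iteration, 27 cycles of call overhead), `break_even_n(5.5, 3.5, 27.0, 1000)` returns 14. The extremes of the ranges give 8 (6, 3, 24) and 30 (5, 4, 30). Under these assumptions the single-call version would only catch up well beyond the 1-4 keys the thread expects for syscache scans.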

> So no, I wouldn't call it an optimization. But it will be interesting to
> hear other opinions on this.
>
> --bg


> On Mon, Mar 9, 2026 at 10:25 AM Ranier Vilela <ranier.vf@gmail.com> wrote:
>
>> On Mon, Mar 9, 2026 at 11:47, Bryan Green <dbryan.green@gmail.com> wrote:
>>
>>> I created an example that is a little bit closer to the actual code and
>>> changed the compiler from C++ to C.
>>>
>>> The optimization the compiler has chosen for version 1 versus version 2
>>> is interesting: one calls memcpy and one doesn't. There is a good chance
>>> that the inlining of memcpy as SSE+scalar per iteration will be faster
>>> for syscache scans, which I believe are usually small (1-4 keys?).
>>
>> I doubt the inline version is better.
>> Clang is supported too, and the code it generates is much better with a
>> single memcpy call outside of the loop.

>>> Probably the only reason to do this patch would be if N is normally
>>> large, or if this is considered an improvement in code clarity without
>>> a detrimental impact on small-N syscache scans.
>>> I realize you only said "possible small optimization". It might be
>>> worthwhile to benchmark the code for different values of n to determine
>>> whether there is a tipping point either way?
>>
>> In your opinion, shouldn't this be considered an optimization, even a
>> small one?
>>
>> best regards,
>> Ranier Vilela