MIME-Version: 1.0
References: 
 <CAHJZqBBLHFNs6it-fcJ6LEUXeC5t73soR3h50zUSFpg7894qfQ@mail.gmail.com>
 <CAOBaU_ZHtRhG+MvoT6=HpbMoK=JnsJHBMmxQ-XLSVoh6fFqJHQ@mail.gmail.com>
 <CABUevExXvoPvLN70CznmQfbjwxnrdXo9gXxZwGpBoUhjtFi3Ng@mail.gmail.com>
 <1428485.1623266137@sss.pgh.pa.us>
 <CABUevEzn7fZmZ-8m4ph0wBa1bqz32ghWh-vjhf+Vh0OjW9EdQw@mail.gmail.com>
 <1429225.1623266925@sss.pgh.pa.us>
 <CABUevEzwS3zG3X6dpSZTy-u1LoJ8kRS_A493mFdMii3_AhX-Ew@mail.gmail.com>
 <CADrzpjFQ8awR62Y0GC7K=ohtnBeAL06jkuMHqh6neCF3H89jMw@mail.gmail.com>
 <CAHJZqBATodSGXZ2vD_4efKdmAdaN0ucP=m93KL7Xmf5jqNzvYw@mail.gmail.com>
 <20210611002333.GK16435@telsasoft.com>
In-Reply-To: <20210611002333.GK16435@telsasoft.com>
From: Don Seiler <don@seiler.us>
Date: Mon, 14 Jun 2021 09:16:39 -0500
Message-ID: 
 <CAHJZqBAZ+SYR4jZ-Jy5nHYwUP3vYF+UjPGKwCR+gZm0z8vyoag@mail.gmail.com>
Subject: Re: Estimating HugePages Requirements?
To: Justin Pryzby <pryzby@telsasoft.com>
Cc: P C <puravc@gmail.com>, Magnus Hagander <magnus@hagander.net>,
	Julien Rouhaud <rjuju123@gmail.com>, Tom Lane <tgl@sss.pgh.pa.us>,
	pgsql-admin <pgsql-admin@postgresql.org>
Content-Type: multipart/alternative; boundary="0000000000002ad25105c4ba82d0"
Archived-At: 
 <https://www.postgresql.org/message-id/CAHJZqBAZ%2BSYR4jZ-Jy5nHYwUP3vYF%2BUjPGKwCR%2BgZm0z8vyoag%40mail.gmail.com>
Precedence: bulk

--0000000000002ad25105c4ba82d0
Content-Type: text/plain; charset="UTF-8"

On Thu, Jun 10, 2021 at 7:23 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

> On Wed, Jun 09, 2021 at 10:55:08PM -0500, Don Seiler wrote:
> > On Wed, Jun 9, 2021, 21:03 P C <puravc@gmail.com> wrote:
> >
> > > I agree, it's confusing for many, and that confusion arises from the fact
> > > that you usually talk of shared_buffers in MB or GB whereas hugepages have
> > > to be configured in units of 2MB. But once they understand, they realize
> > > it's pretty simple.
> > >
> > > Don, we have experienced the same not just with Postgres but also with
> > > Oracle. I haven't been able to get to the root of it, but what we usually
> > > do is add another 100-200 pages, and that works for us. If the SGA or
> > > shared_buffers is high, e.g. 96GB, then we add 250-500 pages. Those few
> > > hundred MBs may be wasted (because the moment you configure hugepages, the
> > > operating system considers them used and does not use them any more), but
> > > nowadays servers easily have 64 or 128GB of RAM, and wasting that 500MB to
> > > 1GB does not really hurt.
> >
> > I don't have a problem with the math, just wanted to know if it was
> > possible to better estimate what the actual requirements would be at
> > deployment time. My fallback will probably be to do what you did and just
> > pad with an extra 512MB by default.
>
> It's because the huge allocation isn't just shared_buffers, but also
> wal_buffers:
>
> | The amount of shared memory used for WAL data that has not yet been
> | written to disk.
> | The default setting of -1 selects a size equal to 1/32nd (about 3%) of
> | shared_buffers, ...
>
> .. and other stuff:
>
> src/backend/storage/ipc/ipci.c
>          * Size of the Postgres shared-memory block is estimated via
>          * moderately-accurate estimates for the big hogs, plus 100K for the
>          * stuff that's too small to bother with estimating.
>          *
>          * We take some care during this phase to ensure that the total size
>          * request doesn't overflow size_t.  If this gets through, we don't
>          * need to be so careful during the actual allocation phase.
>          */
>         size = 100000;
>         size = add_size(size, PGSemaphoreShmemSize(numSemas));
>         size = add_size(size, SpinlockSemaSize());
>         size = add_size(size, hash_estimate_size(SHMEM_INDEX_SIZE,
>                                                  sizeof(ShmemIndexEnt)));
>         size = add_size(size, dsm_estimate_size());
>         size = add_size(size, BufferShmemSize());
>         size = add_size(size, LockShmemSize());
>         size = add_size(size, PredicateLockShmemSize());
>         size = add_size(size, ProcGlobalShmemSize());
>         size = add_size(size, XLOGShmemSize());
>         size = add_size(size, CLOGShmemSize());
>         size = add_size(size, CommitTsShmemSize());
>         size = add_size(size, SUBTRANSShmemSize());
>         size = add_size(size, TwoPhaseShmemSize());
>         size = add_size(size, BackgroundWorkerShmemSize());
>         size = add_size(size, MultiXactShmemSize());
>         size = add_size(size, LWLockShmemSize());
>         size = add_size(size, ProcArrayShmemSize());
>         size = add_size(size, BackendStatusShmemSize());
>         size = add_size(size, SInvalShmemSize());
>         size = add_size(size, PMSignalShmemSize());
>         size = add_size(size, ProcSignalShmemSize());
>         size = add_size(size, CheckpointerShmemSize());
>         size = add_size(size, AutoVacuumShmemSize());
>         size = add_size(size, ReplicationSlotsShmemSize());
>         size = add_size(size, ReplicationOriginShmemSize());
>         size = add_size(size, WalSndShmemSize());
>         size = add_size(size, WalRcvShmemSize());
>         size = add_size(size, PgArchShmemSize());
>         size = add_size(size, ApplyLauncherShmemSize());
>         size = add_size(size, SnapMgrShmemSize());
>         size = add_size(size, BTreeShmemSize());
>         size = add_size(size, SyncScanShmemSize());
>         size = add_size(size, AsyncShmemSize());
> #ifdef EXEC_BACKEND
>         size = add_size(size, ShmemBackendArraySize());
> #endif
>
>         /* freeze the addin request size and include it */
>         addin_request_allowed = false;
>         size = add_size(size, total_addin_request);
>
>         /* might as well round it off to a multiple of a typical page size */
>         size = add_size(size, 8192 - (size % 8192));
>
> BTW, I think it'd be nice if this were a NOTICE:
> | elog(DEBUG1, "mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m", allocsize);
>

Great detail. I did some trial and error around just a few variables
(shared_buffers, wal_buffers, max_connections) and came up with a formula
that seems to be "good enough" for at least a rough default estimate.

The pseudo-code is basically:

ceiling((shared_buffers + 200 + (25 * shared_buffers/1024) +
10*(max_connections-100)/200 + wal_buffers-16)/2)

This assumes that all values are in MB and that wal_buffers is set
explicitly rather than left at the default of -1. I decided to default
wal_buffers to 16MB in our environments, since that's what -1 should resolve
to, based on the description in the documentation, for an instance with
shared_buffers of the sizes in our deployments.
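
For reference, a minimal sketch of how the -1 default resolves, going by the
documentation's description: 1/32nd of shared_buffers, with a floor of 64kB
and a cap of one WAL segment (typically 16MB). The function name and the
wal_segment_mb parameter are just illustrative, and this is not the server's
actual code:

```python
def auto_wal_buffers_mb(shared_buffers_mb: float, wal_segment_mb: float = 16.0) -> float:
    """Approximate the wal_buffers value (in MB) chosen when the setting is -1.

    Assumption: follows the documented rule of shared_buffers / 32, clamped
    between 64kB and the size of one WAL segment (16MB unless changed at initdb).
    """
    floor_mb = 64 / 1024  # documented minimum of 64kB, expressed in MB
    return min(max(shared_buffers_mb / 32, floor_mb), wal_segment_mb)


# Any shared_buffers of 512MB or more hits the 16MB cap.
print(auto_wal_buffers_mb(2048))
print(auto_wal_buffers_mb(128))
```

So with the default segment size, hard-coding 16MB matches the automatic value
for any instance with shared_buffers of 512MB or more.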

This formula did come up a little short (by 2MB, i.e. one huge page) when I
had a low shared_buffers value of 2GB. Raising that starting 200 value to
something like 250 would take care of that. Otherwise, it held up in the
limited testing I did with the different values we see across our production
deployments. Please let me know what you folks think. I know I'm ignoring a
lot of other factors, especially given what Justin recently shared.

The remaining trick for me now is to calculate this in Chef, since our
shared_buffers and wal_buffers attributes are strings that include the unit
("MB") rather than plain numerical values. I'm thinking of changing those
attributes to plain numbers and assuming/requiring MB to make the
calculations easier.
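
The alternative would be to keep the unit strings and convert them before
doing the math. A small sketch of that idea, shown in Python rather than the
Ruby a Chef recipe would actually use; the helper name is hypothetical, and it
only accepts an integer followed by one of kB, MB, GB or TB (unit-less values
are rejected on purpose, since their meaning depends on the setting):

```python
import re

# Multipliers to convert each accepted unit into MB.
_UNIT_TO_MB = {"kB": 1 / 1024, "MB": 1, "GB": 1024, "TB": 1024 * 1024}

_SETTING_RE = re.compile(r"\s*(\d+)\s*(kB|MB|GB|TB)\s*")


def setting_to_mb(value: str) -> float:
    """Convert a string such as '8GB' or '16MB' into a number of MB."""
    match = _SETTING_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"expected an integer followed by kB, MB, GB or TB: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_TO_MB[unit]


print(setting_to_mb("8GB"))
print(setting_to_mb("16MB"))
```

Units are matched case-sensitively here, so a value like "8gb" raises an
error instead of being silently guessed.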

-- 
Don Seiler
www.seiler.us

--0000000000002ad25105c4ba82d0--