From: Don Seiler
Date: Mon, 14 Jun 2021 09:16:39 -0500
Subject: Re: Estimating HugePages Requirements?
To: Justin Pryzby
Cc: P C, Magnus Hagander, Julien Rouhaud, Tom Lane, pgsql-admin
In-Reply-To: <20210611002333.GK16435@telsasoft.com>

On Thu, Jun 10, 2021 at 7:23 PM Justin Pryzby <pryzby@telsasoft.com> wrote:

> On Wed, Jun 09, 2021 at 10:55:08PM -0500, Don Seiler wrote:
> > On Wed, Jun 9, 2021, 21:03 P C <puravc@gmail.com> wrote:
> >
> > > I agree, its confusing for many and that confusion arises from the fact
> > > that you usually talk of shared_buffers in MB or GB whereas hugepages have
> > > to be configured in units of 2mb. But once they understand they realize its
> > > pretty simple.
> > >
> > > Don, we have experienced the same not just with postgres but also with
> > > oracle. I havent been able to get to the root of it, but what we usually do
> > > is, we add another 100-200 pages and that works for us. If the SGA or
> > > shared_buffers is high eg 96gb, then we add 250-500 pages. Those few
> > > hundred MBs may be wasted (because the moment you configure hugepages, the
> > > operating system considers it as used and does not use it any more) but
> > > nowadays, servers have 64 or 128 gb RAM easily and wasting that 500mb to
> > > 1gb does not hurt really.
> >
> > I don't have a problem with the math, just wanted to know if it was
> > possible to better estimate what the actual requirements would be at
> > deployment time. My fallback will probably be you did and just pad with an
> > extra 512MB by default.
>
> It's because the huge allocation isn't just shared_buffers, but also
> wal_buffers:
>
> | The amount of shared memory used for WAL data that has not yet been written to disk.
> | The default setting of -1 selects a size equal to 1/32nd (about 3%) of shared_buffers, ...
>
> .. and other stuff:
>
> src/backend/storage/ipc/ipci.c
>          * Size of the Postgres shared-memory block is estimated via
>          * moderately-accurate estimates for the big hogs, plus 100K for the
>          * stuff that's too small to bother with estimating.
>          *
>          * We take some care during this phase to ensure that the total size
>          * request doesn't overflow size_t.  If this gets through, we don't
>          * need to be so careful during the actual allocation phase.
>          */
>         size = 100000;
>         size = add_size(size, PGSemaphoreShmemSize(numSemas));
>         size = add_size(size, SpinlockSemaSize());
>         size = add_size(size, hash_estimate_size(SHMEM_INDEX_SIZE,
>                                                  sizeof(ShmemIndexEnt)));
>         size = add_size(size, dsm_estimate_size());
>         size = add_size(size, BufferShmemSize());
>         size = add_size(size, LockShmemSize());
>         size = add_size(size, PredicateLockShmemSize());
>         size = add_size(size, ProcGlobalShmemSize());
>         size = add_size(size, XLOGShmemSize());
>         size = add_size(size, CLOGShmemSize());
>         size = add_size(size, CommitTsShmemSize());
>         size = add_size(size, SUBTRANSShmemSize());
>         size = add_size(size, TwoPhaseShmemSize());
>         size = add_size(size, BackgroundWorkerShmemSize());
>         size = add_size(size, MultiXactShmemSize());
>         size = add_size(size, LWLockShmemSize());
>         size = add_size(size, ProcArrayShmemSize());
>         size = add_size(size, BackendStatusShmemSize());
>         size = add_size(size, SInvalShmemSize());
>         size = add_size(size, PMSignalShmemSize());
>         size = add_size(size, ProcSignalShmemSize());
>         size = add_size(size, CheckpointerShmemSize());
>         size = add_size(size, AutoVacuumShmemSize());
>         size = add_size(size, ReplicationSlotsShmemSize());
>         size = add_size(size, ReplicationOriginShmemSize());
>         size = add_size(size, WalSndShmemSize());
>         size = add_size(size, WalRcvShmemSize());
>         size = add_size(size, PgArchShmemSize());
>         size = add_size(size, ApplyLauncherShmemSize());
>         size = add_size(size, SnapMgrShmemSize());
>         size = add_size(size, BTreeShmemSize());
>         size = add_size(size, SyncScanShmemSize());
>         size = add_size(size, AsyncShmemSize());
> #ifdef EXEC_BACKEND
>         size = add_size(size, ShmemBackendArraySize());
> #endif
>
>         /* freeze the addin request size and include it */
>         addin_request_allowed = false;
>         size = add_size(size, total_addin_request);
>
>         /* might as well round it off to a multiple of a typical page size */
>         size = add_size(size, 8192 - (size % 8192));
>
> BTW, I think it'd be nice if this were a NOTICE:
> | elog(DEBUG1, "mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m", allocsize);

Great detail. I did some trial and error around just a few variables (shared_buffers, wal_buffers, max_connections) and came up with a formula that seems to be "good enough" for at least a rough default estimate.

The pseudo-code is basically:

ceiling((shared_buffers + 200 + (25 * shared_buffers/1024) + 10*(max_connections-100)/200 + wal_buffers-16)/2)

This assumes that all values are in MB and that wal_buffers is set explicitly rather than left at the default of -1. The final division by 2 converts the total in MB into 2MB huge pages. I decided to default wal_buffers to 16MB in our environments, since that's what -1 should resolve to, based on the description in the documentation, for an instance with shared_buffers of the sizes in our deployments.
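
For illustration, here is a minimal Python sketch of that arithmetic. The function and parameter names are placeholders of my own, not anything from PostgreSQL. The wal_buffers helper encodes my reading of the documented -1 behaviour (1/32 of shared_buffers, no less than 64kB and no more than one WAL segment, typically 16MB), so treat it as an assumption rather than the server's exact logic. The 2MB huge page size is likewise assumed.

```python
import math


def resolve_wal_buffers_mb(shared_buffers_mb, wal_buffers_mb=-1, wal_segment_mb=16):
    """Return wal_buffers in MB, resolving the -1 default.

    Assumption: -1 means 1/32 of shared_buffers, clamped between 64kB and
    one WAL segment (16MB unless changed at initdb time).
    """
    if wal_buffers_mb != -1:
        return wal_buffers_mb
    return max(64 / 1024, min(shared_buffers_mb / 32, wal_segment_mb))


def estimate_huge_pages(shared_buffers_mb, wal_buffers_mb=16, max_connections=100,
                        base_overhead_mb=200, huge_page_mb=2):
    """Rough huge page count from the empirical formula above (inputs in MB)."""
    total_mb = (
        shared_buffers_mb
        + base_overhead_mb                      # the "starting" 200 in the formula
        + 25 * shared_buffers_mb / 1024         # 25MB per GB of shared_buffers
        + 10 * (max_connections - 100) / 200    # 10MB per 200 connections over 100
        + (wal_buffers_mb - 16)                 # only the part of wal_buffers beyond 16MB
    )
    return math.ceil(total_mb / huge_page_mb)


if __name__ == "__main__":
    shared = 8192
    wal = resolve_wal_buffers_mb(shared)
    print(estimate_huge_pages(shared, wal, max_connections=300))
```

With shared_buffers at 2048MB and the other arguments left at their defaults, this returns 1149 pages; passing base_overhead_mb=250 returns 1174.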

This formula did come up a little short (by 2MB) when I had a low shared_buffers value of 2GB. Raising the starting value of 200 to something like 250 would take care of that. Otherwise, it held up in the limited testing I did with the different values we see across our production deployments. Please let me know what you folks think. I know I'm ignoring a lot of other factors, especially given what Justin recently shared.

The remaining trick for me now is to calculate this in Chef, since our shared_buffers and wal_buffers attributes are strings that include the unit ("MB") rather than plain numeric values. I'm thinking of changing those attributes to plain numbers and assuming/requiring MB to make the calculations easier.
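
An alternative to changing the attribute format would be to normalise the strings before doing the math. A minimal sketch, in Python for brevity (the same logic would need porting to Ruby for a Chef recipe): `to_mb` is a made-up helper name, and it assumes the attributes only ever carry MB, GB, or TB suffixes in PostgreSQL's case-sensitive spelling. Bare numbers and kB values are rejected rather than guessed at.

```python
import re

# MB per unit for the suffixes this sketch accepts (assumption: MB/GB/TB only).
_MB_PER_UNIT = {"MB": 1, "GB": 1024, "TB": 1024 * 1024}


def to_mb(setting):
    """Convert a string such as '8GB' or '16MB' to an integer number of MB."""
    match = re.fullmatch(r"\s*(\d+)\s*(MB|GB|TB)\s*", setting)
    if match is None:
        # Covers bare numbers, kB, lowercase units, and anything else unexpected.
        raise ValueError(f"unsupported memory setting: {setting!r}")
    number, unit = match.groups()
    return int(number) * _MB_PER_UNIT[unit]
```

Failing loudly on anything unexpected seems safer here than silently assuming MB, since a misparsed value would end up under- or over-allocating huge pages.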

--
Don Seiler
www.seiler.us