References: <5a37c2e3-619d-4816-84d7-0b27e3e6797f@iki.fi>
From: Ashutosh Bapat
Date: Wed, 18 Feb 2026 21:20:59 +0530
Subject: Re: Better shared data structure management and resizable shared data structures
To: Andres Freund, Heikki Linnakangas
Cc: pgsql-hackers, chaturvedipalak1911@gmail.com

On Wed, Feb 18, 2026 at 9:17 PM Ashutosh Bapat wrote:
>
> On Tue, Feb 17, 2026 at 5:06 PM Ashutosh Bapat wrote:
> >
> > On Mon, Feb 16, 2026 at 11:26 PM Andres Freund wrote:
> > >
> > > I think we *do* want the MADV_POPULATE_WRITE, at least when using huge pages,
> > > because otherwise you'll get a SIGBUS when accessing the memory if there is no
> > > huge page available anymore.
> > >
> >
> > Ok.
> >
> > Jakub's experiments [1] showed that fallocate()ing shared memory would
> > slow down postmaster start on a slow machine. I suppose the same thing
> > applies to MADV_POPULATE_WRITE. And we don't do that today even in the
> > case of huge pages; so we already have that problem.
> >
> > If we perform MADV_POPULATE_WRITE, do we want it only for resizable
> > shared memory structures or all the structures in the shared memory?
>
> In the attached patches, I have used MADV_POPULATE_WRITE during
> resizing, which is run time operation. When the structures are
> allocated when server starts, they are usually initialised, so we end
> up allocating memory for the same. So we don't need
> MADV_POPULATE_WRITE at that time, and thus avoid affecting startup
> slowness, if any. Buffer blocks are not initialised at the time of
> starting the server, so their memory is allocated as they are
> accessed. But that's how it works today, so no change there.
>
> > >
> > > >
> > > > 4. the address and length passed to madvise needs to be page aligned,
> > > > but that passed to fallocate() needn't be. `man fallocate` says
> > > > "Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux
> > > > 2.6.38) in mode deallocates space (i.e., creates a hole) in the byte
> > > > range starting at offset and continuing for len bytes. Within the
> > > > specified range, partial filesystem blocks are zeroed, and whole
> > > > filesystem blocks are removed from the file.". It seems to be
> > > > automatically taking care of the page size. So using fallocate()
> > > > simplifies logic. Further `man madvise` says "but since Linux 3.5, any
> > > > filesystem which supports the fallocate(2) FALLOC_FL_PUNCH_HOLE mode
> > > > also supports MADV_REMOVE." fallocate with FALLOC_FL_PUNCH_HOLE is
> > > > guaranteed to be available on a system which supports MADV_REMOVE.
> > >
> > > I think it makes no sense to support resizing below page size
> > > granularity. What's the point of doing that?
> > >
> >
> > No point really. But we can not control the extensions which want to
> > specify a maximum size smaller than a page size. They wouldn't know
> > what page size the underlying machine will have, especially with huge
> > pages which have a wide range of sizes. Even in the case of shared
> > buffers, a value of max_shared_buffers may cause buffer blocks to span
> > pages but other structures may fit a page.
> >
> > In the attached patches, if a resizable structure is such that its
> > max_size is smaller than a page size, it is treated as a fixed
> > structure with size = max_size. Any request to resize such structures
> > will simply update the metadata without actual madvise operation. Only
> > the structures whose max_size > page_size would be treated as truly
> > resizable and will use madvise. You bring another interesting point.
> > If a resizable structure has a maximum size higher than the page size,
> > but it is allocated such that the initial part of it is on a partially
> > allocated page and the last part of it is on another partially
> > allocated page, those pages are never freed because of adjoining
> > structures. Per the logic in the attached patches, all the fixed (or
> > pseudo-resizable structures) are packed together. The resizable
> > structures start on a page boundary and their max_sizes are adjusted
> > to be page aligned. That way we can release pages when the structure
> > shrinks more than a page.
> >
> > > >
> > > > Using fallocate() (or madvise()) to free memory, we don't need
> > > > multiple segments. So much less code churn compared to the multiple
> > > > mappings approach. However, there is one drawback. In the multiple
> > > > mapping approach access beyond the current size of the structure would
> > > > result in segfault or bus error. But in the fallocate/madvise approach
> > > > such an access does not cause a crash. A write beyond the pages that
> > > > fit the current size of the structure causes more memory to be
> > > > allocated silently. A read returns 0s. So, there's a possibility that
> > > > bugs in size calculations might go unnoticed. I think that's how it
> > > > works even today, access in the yet un-allocated part of the shared
> > > > memory will simply go unnoticed.
> > >
> > > If that's something you care about, you can mprotect(PROT_NONE) the relevant
> > > regions.
> >
> > I am fine, if we let go of this protection while getting rid of
> > multiple segments, if we all agree to do so.
> >
> > I could be wrong, but mprotect needs to be executed in every backend
> > where the memory is mapped and then a new backend needs to inherit it
> > from the postmaster. Makes resizing complex since it has to touch
> > every backend. So avoiding mprotect is better.
> >
>

Sent too soon.

I have also reworked the test into a TAP test, which looks more stable
than the earlier version. I haven't had any failures on my laptop.

> If the general approach in the attached patches looks good, we can
> work on improving the 0001 + 0002 to be committable and then work on
> 0003.

The resizable memory patch works only on Linux, where MADV_POPULATE_WRITE
and MADV_REMOVE are supported on anonymous shared memory. On other
platforms, and on Linux systems where that support doesn't exist, we will
need to disable the feature for now. That work remains. Also, the TODOs
need to be addressed.

-- 
Best Wishes,
Ashutosh Bapat
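
For readers following the layout rule quoted above (fixed and pseudo-resizable structures packed together, truly resizable ones starting on a page boundary with a page-aligned max_size), here is a rough sketch of the arithmetic only. It is not taken from the patches, which are C; it is written in Python for brevity. The names `Struct`, `lay_out`, `releasable_range` and `align_up` are made up for illustration, the treatment of max_size exactly equal to the page size is an assumption, and any other alignment the real allocator applies is ignored.

```python
from dataclasses import dataclass


def align_up(n: int, page_size: int) -> int:
    """Round n up to the next multiple of page_size."""
    return -(-n // page_size) * page_size


@dataclass
class Struct:
    # Hypothetical descriptor, not a type from the patches.
    name: str
    size: int         # current size in bytes
    max_size: int     # maximum size in bytes (== size for fixed structures)
    resizable: bool   # whether the caller asked for a resizable structure


def lay_out(structs, page_size):
    """Return {name: (offset, reserved_bytes, truly_resizable)}.

    Fixed structures and resizable structures whose max_size does not exceed
    a page are packed back to back, reserving max_size bytes each. Truly
    resizable structures start on a page boundary and reserve max_size
    rounded up to a whole number of pages.
    """
    def truly_resizable(s):
        # Assumption: max_size == page_size is treated as fixed.
        return s.resizable and s.max_size > page_size

    layout = {}
    offset = 0
    for s in (s for s in structs if not truly_resizable(s)):
        layout[s.name] = (offset, s.max_size, False)
        offset += s.max_size
    for s in (s for s in structs if truly_resizable(s)):
        offset = align_up(offset, page_size)
        reserved = align_up(s.max_size, page_size)
        layout[s.name] = (offset, reserved, True)
        offset += reserved
    return layout


def releasable_range(start, old_size, new_size, page_size):
    """Byte range [lo, hi) of whole pages freed by shrinking, or None.

    start must be page aligned, which lay_out() guarantees for truly
    resizable structures.
    """
    lo = start + align_up(new_size, page_size)
    hi = start + align_up(old_size, page_size)
    return (lo, hi) if hi > lo else None
```

With 4 kB pages, a resizable structure with max_size 10000 placed after 2100 bytes of packed structures would start at offset 4096 and reserve 12288 bytes. Shrinking it from 10000 to 5000 bytes leaves exactly one whole page that could be handed to MADV_REMOVE, while shrinking it from 10000 to 9000 bytes frees nothing.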