Date: Mon, 16 Feb 2026 12:56:03 -0500
From: Andres Freund
To: Ashutosh Bapat
Cc: Heikki Linnakangas, pgsql-hackers,
 chaturvedipalak1911@gmail.com
Subject: Re: Better shared data structure management and resizable shared data structures
References: <5a37c2e3-619d-4816-84d7-0b27e3e6797f@iki.fi>

Hi,

On 2026-02-16 20:22:51 +0530, Ashutosh Bapat wrote:
> On Fri, Feb 13, 2026 at 5:33 PM Heikki Linnakangas wrote:
> >
> > On 13/02/2026 13:47, Ashutosh Bapat wrote:
> > > `man madvise` has this
> > >    MADV_REMOVE (since Linux 2.6.16)
> > >        Free up a given range of pages and its associated
> > >        backing store. This is equivalent to punching a
> > >        hole in the corresponding byte range of the backing
> > >        store (see fallocate(2)). Subsequent accesses
> > >        in the specified address range will see bytes containing zero.
> > >
> > >        The specified address range must be mapped shared
> > >        and writable. This flag cannot be applied to
> > >        locked pages, Huge TLB pages, or VM_PFNMAP pages.
> > >
> > >        In the initial implementation, only tmpfs(5) was
> > >        supported MADV_REMOVE; but since Linux 3.5, any
> > >        filesystem which supports the fallocate(2)
> > >        FALLOC_FL_PUNCH_HOLE mode also supports MADV_REMOVE.
> > >        Hugetlbfs fails with the error EINVAL and other
> > >        filesystems fail with the error EOPNOTSUPP.
> > >
> > > It says the flag can not be applied to Huge TLB pages. We won't be
> > > able to make resizable shared memory structures allocated with huge
> > > pages. That seems like a serious restriction.
> >
> > Per https://man7.org/linux/man-pages/man2/madvise.2.html:
> >
> >    MADV_REMOVE (since Linux 2.6.16)
> >    ...
> >
> >    Support for the Huge TLB filesystem was added in Linux
> >    v4.3.
> >
> > > I may be misunderstanding something, but it seems like this is useful
> > > to free already allocated memory, not necessarily allocate more
> > > memory. I don't understand how a user would start with a larger
> > > reserved address space with only small portions of that space being
> > > backed by memory.
> >
> > Hmm, I guess you'll need to use MAP_NORESERVE in the first mmap() call.
> > to reserve address space for the maximum size, and then
> > madvise(MADV_POPULATE_WRITE) using the initial size. Later,
> > madvise(MADV_REMOVE) to shrink, and madvise(MADV_POPULATE_WRITE) to grow
> > again.
>
> Thank you for the hint. Also thanks to Andres's idea, the resizable
> structure patch is quite small now. Actually, after experimenting with
> madvise, memfd_create and ftruncate(), I see that MADV_POPULATE_WRITE
> is not required at all. We don't have to do anything to expand a
> structure. Memory will be allocated as and when the program writes to
> it.

I think we *do* want the MADV_POPULATE_WRITE, at least when using huge pages,
because otherwise you'll get a SIGBUS when accessing the memory if there is no
huge page available anymore.

> I also discovered things that I didn't know about.
> 1. ftruncate() sets the size of the file but it doesn't allocate the
> memory pages.

Right.

> 2. to use madvise() the address needs to be backed by a file, so
> memfd_create is a must.

I am quite sure that that is not true. I hacked this up with today's postgres,
and the madvise works with the mmap() backed allocation from sysv_shmem.c,
which is anonymous.

What made you conclude that that is the case?

> 4. the address and length passed to madvise needs to be page aligned,
> but that passed to fallocate() needn't be. `man fallocate` says
> "Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux
> 2.6.38) in mode deallocates space (i.e., creates a hole) in the byte
> range starting at offset and continuing for len bytes.
> Within the specified range, partial filesystem blocks are zeroed, and
> whole filesystem blocks are removed from the file.". It seems to be
> automatically taking care of the page size. So using fallocate()
> simplifies logic. Further `man madvise` says "but since Linux 3.5, any
> filesystem which supports the fallocate(2) FALLOC_FL_PUNCH_HOLE mode
> also supports MADV_REMOVE." fallocate with FALLOC_FL_PUNCH_HOLE is
> guaranteed to be available on a system which supports MADV_REMOVE.

I think it makes no sense to support resizing below page size granularity.
What's the point of doing that?

> Using fallocate() (or madvise()) to free memory, we don't need
> multiple segments. So much less code churn compared to the multiple
> mappings approach. However, there is one drawback. In the multiple
> mapping approach access beyond the current size of the structure would
> result in segfault or bus error. But in the fallocate/madvise approach
> such an access does not cause a crash. A write beyond the pages that
> fit the current size of the structure causes more memory to be
> allocated silently. A read returns 0s. So, there's a possibility that
> bugs in size calculations might go unnoticed. I think that's how it
> works even today, access in the yet un-allocated part of the shared
> memory will simply go unnoticed.

If that's something you care about, you can mprotect(PROT_NONE) the relevant
regions.

Greetings,

Andres Freund
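
For reference, the sequence discussed in this thread (reserve the maximum
size with MAP_NORESERVE, populate on grow, MADV_REMOVE on shrink, and
mprotect(PROT_NONE) to catch out-of-bounds access) could be sketched roughly
as below. This is only an illustration, not code from the patch or from
sysv_shmem.c: the names reserve_region(), grow_region() and shrink_region()
are hypothetical, all sizes are assumed to be multiples of the page size, and
huge pages (MAP_HUGETLB, huge-page-aligned sizes) are left out. The memset()
fallback is an assumption for kernels or headers without MADV_POPULATE_WRITE;
it would not protect against SIGBUS with huge pages.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Reserve address space for max_size bytes. Nothing is accessible or
 * backed by memory yet: the mapping starts out PROT_NONE.
 */
static void *
reserve_region(size_t max_size)
{
    void *p = mmap(NULL, max_size, PROT_NONE,
                   MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}

/* Grow from old_size to new_size: open up the range and fault it in now. */
static int
grow_region(void *base, size_t old_size, size_t new_size)
{
    char   *start = (char *) base + old_size;
    size_t  len = new_size - old_size;

    if (mprotect(start, len, PROT_READ | PROT_WRITE) != 0)
        return -1;
#ifdef MADV_POPULATE_WRITE
    if (madvise(start, len, MADV_POPULATE_WRITE) == 0)
        return 0;
    if (errno != EINVAL)
        return -1;              /* e.g. ENOMEM: report instead of SIGBUS later */
#endif
    /* Fallback (assumption): touch the pages by hand; the range is all zeroes. */
    memset(start, 0, len);
    return 0;
}

/* Shrink from old_size to new_size: free the pages, then block access. */
static int
shrink_region(void *base, size_t old_size, size_t new_size)
{
    char   *start = (char *) base + new_size;
    size_t  len = old_size - new_size;

    /* MADV_REMOVE needs a shared, writable mapping, so do it before mprotect. */
    if (madvise(start, len, MADV_REMOVE) != 0)
        return -1;
    return mprotect(start, len, PROT_NONE);
}
```

With this arrangement an access beyond the current size faults instead of
silently allocating memory, and a range that is shrunk and grown again reads
back as zeroes.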