Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w5Yn7-003NFE-3A for pgsql-hackers@arkaria.postgresql.org; Thu, 26 Mar 2026 00:41:42 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1w5Yn6-000DLT-0Z for pgsql-hackers@arkaria.postgresql.org; Thu, 26 Mar 2026 00:41:40 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w5Yn5-000DLL-2m for pgsql-hackers@lists.postgresql.org; Thu, 26 Mar 2026 00:41:40 +0000 Received: from mail-qv1-xf2b.google.com ([2607:f8b0:4864:20::f2b]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1w5Yn3-00000001BKm-25yd for pgsql-hackers@lists.postgresql.org; Thu, 26 Mar 2026 00:41:39 +0000 Received: by mail-qv1-xf2b.google.com with SMTP id 6a1803df08f44-89c52db6231so4135536d6.3 for ; Wed, 25 Mar 2026 17:41:37 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1774485696; cv=none; d=google.com; s=arc-20240605; b=ZPub5vXkepzfDT0aFuzkyVxvPZk7olYVEGyfRjc6chl/O6oSV97gOOlFjmIrq0FVF4 1npEm2gEnvaKp5WA6XDgUcNnw59qqlEH+FHjTjSwyR8pzJidwzCbDsRnLDf5TOctKtgq QnzT6jg37mUrYG4q11Xd96ZPJBk/Bk+RuuyBvKmDyfAMh7+Qsqr8J/M5FU+IzmKwQw+h TAwx+OjCOQG5hBmqjDoLu4fi08U6nCZUzKxi/cGSuyQ8eHpk6svfI4pNl55TjHUm9Lcl 2viQYnvIZLCL7oJRY5WV2xwjJ3BXsVKSS27NtKapoy6jz7QRKFf3mQd7hHlRUXpob1qg ywzw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=Fu71l/b7R4wxJg8SmHWU3BL4r/cTMPQ/HU8jpGkHOqA=; fh=Sa1dAg279NkXoHkLVdJsyy9GSeYVoYBhpukaUfXaBuM=; b=F6cuQfd+p1VD5f7vu4jQ8zoYWys+gang8or4cbvmOOfEnMIr01zdcnC1NghWRHKYcX MsycgdqOhNGmxV1aqCj8rh+sBtzvNE+Hl1Pi2m5vkRd1cn4I+4IrS3JqMttI70Xr/GKd 6Db+Fa4b8Nif01dtSyv/wXpfApuB9eXolCLZwg9XYlgu0l1rW4Jw2dyK1sP/VYj3rzxT cOYpxIVBM+8WV4HarlH7sia4JFHWjfYauw9uQ029Y78aSt37u2nhQuGj0QfAOEdDoMio WwSGXij5JSFh1Moyrys1PUhLpyb3ebX/ttoVk25UNLXEGTzm7ko2j411mStCak0joyGz O/Tg==; darn=lists.postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fittl.com; s=google; t=1774485696; x=1775090496; darn=lists.postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=Fu71l/b7R4wxJg8SmHWU3BL4r/cTMPQ/HU8jpGkHOqA=; b=T5GFEqGTLgEbl/q22IZ5cIAeA98vN+EWJrjRE0u0mIBF4R6c6KAJJ6P9QuocYTfNpX w0fehAfBM5o3OxBWezyd0+s4KvJATKpKxIaDNiVQGidTeYyZJ4aaOEDWzgvVah5494Gg eOexQDAbmSpGNi2LhYAOIdpbu0ATBh/86SOnA= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1774485696; x=1775090496; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=Fu71l/b7R4wxJg8SmHWU3BL4r/cTMPQ/HU8jpGkHOqA=; b=GTC35NIAMLjxU64n/hpGNHsZrjE0JAcHPspl0kcjfXAv8wMzrFP996JsW5tcX5jqv3 /zoWr73yK1w6JKTGaFn3Cz5qlE/rmn7RacCGtds63t/cmgN1+M/qawBiq/nmdmO88Jqm 8Kq/QBj5IUxGLkRep2ZY7mERUiaqJN+R0dVypoziYDQI8d/Al6/ezZwrlIkc7mD4SyQE FhkK5oJPM/mLyCEG05Z2ipdBa0Xu0Jbg7A1QYzHVYCwcGaNRWvDsMDWs3vtOrpp/AYfF qRreDRDcfPJ6+BbRsWag003Ibvl/Aq6nwrALYZRWN55F+EhI5tM/SosGHpeaBvVroAxN 6HeQ== X-Forwarded-Encrypted: i=1; AJvYcCUOlNLYLzb+z3tXPX6dqyLuPo47Rzz6/yrxpW7U3zMCwhW6ssxq4szMsib7rHHCG+RBAeYbDH1S5HkOvZ6F@lists.postgresql.org X-Gm-Message-State: AOJu0Yy2Em6plcD9SjJT9lRsjP0tr4Z0oCvYvOBB1GVhbhY6XnrsBUzP 6eluKCK3b/mb7qobylGPODYCHVhj7uKMBYERG+kquXoOM0GRQ/on7QI+NAGSoR+yqIH+kjDptOW +UVb4RVSSJQL1CtOoMjPsEvSJs/Njl/oorNr4ttiv X-Gm-Gg: ATEYQzySl0KjXXXhaJc3hsncYgFXhNJuv2dB8aRvwZzSLZL+27cMJQYp2fnfZ4CfbR2 0x3yZIrNXzdnxkfL1tKzSTfAcoAUtYpLUbWJA/UXZVY8D5iWma0J5wpAIicwjqmZ4DhALzEiiGD SgEBZM7ymST989th258ZRqOwsrJZi/c7R8NIK9uujWxZEtvNlctBpcIljUpeydUDn3zRJyHBqqE f15erEOLE3Gty9XWYat0/rRGnxaWlyxJJsMBgGUQydgmqTNXozrLMnwvWFd5eCeD5XG0rnoy5JE OG3J7NryqNNWrH+muIdXf7xE6Z9oIr+2NFswh2+otnAAulwmLw== X-Received: by 2002:ad4:5c49:0:b0:89a:61bb:7314 with SMTP id 6a1803df08f44-89cc4aadd11mr92141486d6.42.1774485696239; Wed, 25 Mar 2026 17:41:36 -0700 (PDT) MIME-Version: 1.0 References: <06086cb4-881b-4f5a-96af-f275220ff52d@vondra.me> In-Reply-To: From: Lukas Fittl Date: Wed, 25 Mar 2026 17:41:00 -0700 X-Gm-Features: AaiRm53szL_V88BoYezFzNf32PNTob6lezvpFrpbJh1r2dH2XU7jHdR5AQAUK7U Message-ID: Subject: Re: Stack-based tracking of per-node WAL/buffer usage To: Heikki Linnakangas Cc: Zsolt Parragi , Andres Freund , Tomas Vondra , PostgreSQL Hackers , Peter Smith Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hi Heikki, On Wed, Mar 25, 2026 at 3:47=E2=80=AFAM Heikki Linnakangas wrote: > > On 24/03/2026 08:03, Lukas Fittl wrote: > > Instead I've tried introducing a memory context for instrumentation > > managed as a resource owner, and I am now (for now) convinced that > > this is the right trade-off for the problem at hand. > > Yes, that seems better. Thanks for reviewing! > This patch could use an overview README file, I'm struggling to > understand how the this all works. Here's my understanding so far, > please correct me if I'm wrong: Sure, happy to put this together - I wonder where would place that best - probably src/backend/executor/README.instrument ? > There are *two* data structures tracking the Instrumentation nodes. The > patch only talks about a stack, but I think there's also implicitly a > tree in there. > > Tree > ---- > > All Instrumentation nodes are part of a tree. For example, if you have > two portals open, the tree might look like this: > > Session - Query A - NestLoop - Seq Scan A > - Seq Scan B > > - Query B - Seq Scan C > > When a node is "finalized", its counters are added to its parent. > > This tree is a somewhat implicit in the patch. Each QueryInstrumentation > has a list of child nodes, but only unfinalized ones. Don't we need that > at the session level too? When a Query is released on abort, its > counters need to be added to the parent too. If I understand correctly, > the patch tries to use the stack for that, but it's confusing. If I follow you correctly, we're talking about the work that InstrStopFinalize is doing (both in regular flow, and in abort): void InstrStopFinalize(Instrumentation *instr) { ... InstrAccumStack(instr_stack.current, instr); } The "instr_stack.current" global referenced here is effectively the instrumentation that was active before InstrStart was called, and would be a parent in the tree in that sense. Its worth noting that on abort we don't care about the tree structure below the aborted activity, i.e. each QueryInstrumentation acts as a finalization point of sorts, with any tree structure below (e.g. that of executor nodes) not being finalized to their respective parents, but instead just getting added to the QueryInstrumentation they were attached to (which then gets finalized into "instr_stack.current"). > I think it would make the patch more clear to talk explicitly about the > tree, and represent it explicitly in the Instrumentation nodes. I.e. add > a "parent" pointer, or a "children" list, or both to the Instrumentation > struct. I'm happy to clarify the mechanism, but I'm hesitant to add more pointers to Instrumentation, since its a base struct that gets re-used in different places, and also gets copied to parallel workers (so any pointer requires extra scrutiny to avoid mis-use) - and I don't think we actually need to track the parent pointer, since in practice it will always be the current stack entry. > > > Stack > ----- > > At all times, there's a stack that tracks what is the Instrumentation in > the tree that is *currently* executing. For example, while executing the > Seq Scan B, the stack would look like this: > > 0: Session > 1: Query A > 2: NestLoop > 3: Seq Scan B > > And when the code is sending a result row back to the client, while the > query is being executed, the stack would be just: > > 0: Session > > > In the patch, the stack is represented by an array. It could also be > implemented with a CurrentInstrumentation global variable, similar to > CurrentMemoryContext and CurrentResourceOwner. It could be, and in fact earlier iterations were closer to that, but I modified that to the current version based on Andres' feedback - The array structure is a lot easier to work with during abort when things execute out-of-order (as you also note later). > > > Abort handling > -------------- > > On abort, two things need to happen: > > 1. Reset the stack to the appropriate level. This ensures that any we > don't later try to update the counters on an Instrumentation nodes that > is going away with the abort. In the above example, the stack would be > reset to the 0: Session level. Correct, but just to clarify, the main problem we deal with in terms of reset is making sure that we cancel out any InstrPushStack that was done, but will now no longer have a matching InstrPopStack getting called. We need to make sure that we get to the stack entry that was active before the aborted activity started. > 2. Finalize all the Instrumentation nodes as part of the ResourceOwner > cleanup. All Instrumentation nodes that are released roll up their > counters to their parents. > > > Questions: > > Is the stack always a path from the root of the tree, down to some node? > Or could you have e.g. recursion like A -> B -> C -> A? (I don't know if > it makes a difference, just wondering) I don't think this happens in practice - I think for the stack itself that'd probably be fine (since you're just putting the entry back on that was on before, in a sense), but it'd e.g. result in timer instrumentation behaving incorrectly. We could probably add an assert to explicitly prevent that if we're worried about it, but the existing Start/Stop instrumentation calls haven't seen this issue I think, and they'd already have had a problem with that. > What happens if you release e.g. the NestLoop before its children? All > the Instrumentation nodes belonging to a query would usually be part of > the same ResourceOwner and there's no guarantee on what order the > resources are released. Correct, and in fact prior versions of the patch struggled with that exact problem. Its both an issue for resource owner managed cleanup, and when you have PG_FINALLY in the picture (e.g. pg_stat_statements). But that's exactly why its not really a full tree - in the abort case we do not care about the relationship of child instrumentations underneath the QueryInstrumentation - we just make sure that the stack is reset to the entry that was active before the QueryInstrumentation started, and that all activity that occurred is added to the QueryInstrumentation. If you had a situation that had two QueryInstrumentations active (i.e. both registered as resource owner), we go up to whichever one of the two is higher up in the stack, per logic in InstrStopFinalize. Thanks for thinking this through & hopefully this clarifies things a bit? Thanks, Lukas --=20 Lukas Fittl