Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vzkJ3-001HBx-1g for pgsql-hackers@arkaria.postgresql.org; Mon, 09 Mar 2026 23:46:37 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vzkJ1-001AWq-22 for pgsql-hackers@arkaria.postgresql.org; Mon, 09 Mar 2026 23:46:36 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vzkJ1-001AWh-0t for pgsql-hackers@lists.postgresql.org; Mon, 09 Mar 2026 23:46:35 +0000 Received: from mail-qk1-x736.google.com ([2607:f8b0:4864:20::736]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1vzkIz-00000001spv-0XL4 for pgsql-hackers@lists.postgresql.org; Mon, 09 Mar 2026 23:46:35 +0000 Received: by mail-qk1-x736.google.com with SMTP id af79cd13be357-8cd90401034so99500485a.0 for ; Mon, 09 Mar 2026 16:46:32 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1773099991; cv=none; d=google.com; s=arc-20240605; b=dZwpDCNG/1MNundix7g8XtcHJPUv5gEUr2ZUUP1XELT61kYOHHZGxAzklPjfoplBZw 49SA7p3jles6uQqBdYO0DHtKJ8Uk9zwHN6eifX9Ati4aCp58lrailVwOrcqldYj9u4l2 K0LcAeY5pKyMOs2S1cwyacPTMZ0bhp6yHlOLVRIKEWPNHS7WIBiZVeaBeCwYECyxOdGK eBEwSC4CrQOkw1Uuk2HGIhWYTy+SPrq2qWIEkVZ538X/KD2UGnDR/Vr10FW7u2tBc/zO Wj9k17WFGPBdXIqo5/GAlmlw03n13a8LCNJgdSNB17A5Hkui51ggdGBe/UJCFrfwYaSc N4cw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=NS7y63LcHdcUs7uVGToLTXyMogfHI29z2fnoszFuosU=; fh=X+xgt3RWQqVa/ShYML1qDQtbaw1CyOMs+FXxjUFyFtE=; b=MdlTcKE7VI7lCexvA0+LcqdAsJdqCrT92VmX6nsUSldoDPxrNY29ohVR1VMhQe3yWj TCfMPSHf6Q46d5tQPQIqqrhs/R0RTEhrE5ytEnUNQrzKVTdQk1LMVCk09rHbOXvUFAgN Pqie7si/C9K9XwPiBNOrokHX8czHhJ3Evl5IbwWg4EEAZOkZriWROVj55sinxiQA5eZY lHGPgUjmlQpBxd7qfb2E5cfbg2h48p3+dDOGHWvr2ZmG4mIn0KRGoakbTHhKegVx5U86 uCYnWsUZHJrsyeAGHSoSUcoLCBxyJ8FSIoR1Ww32CRCFJgz90x0kWFi5cvwGwfewuMHM n9ZA==; darn=lists.postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fittl.com; s=google; t=1773099991; x=1773704791; darn=lists.postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=NS7y63LcHdcUs7uVGToLTXyMogfHI29z2fnoszFuosU=; b=W0bnab4sFvsLNiHYaWAdwKJzth7MAp9/LhDrIrObOlhf0sR2mm1oPCx7xhHRRgASMe 4o1Zm2DCYGJ5Nt8vCwoqnwA3UfZMuiFx6z/SbGZGrcG3kbg60iZsYB2r7uPznljPzeWL HM9zOCQKGN1C4PRk0cO0ThiFr7uhJLlbNj8KQ= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1773099991; x=1773704791; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=NS7y63LcHdcUs7uVGToLTXyMogfHI29z2fnoszFuosU=; b=X/CYHOdSkOtjKsEdPBBS5++kRoKazqwtYR3v7kZ/+V4db01/wYMFJWvu9E4TGOXHNZ f327iKowqD29WMOmQarPveRK4FBzDot1+URx3YWzwxmjsCGOEOU6yGw3A+2lubuI7Ngs miJtYG14axWDRQC1jPyoxM0gMCjvVWKKKg2KRdJ8uIa2V+/QArfx5/eRoADdJ62KDa8J 8RPsClEnpGZdmi/6TfNWISeOP5VmOAd8aBQzuSp9IaOPHgSCNr4nh258xhWY5t03QrVZ 8bx6Grobpt6VvvtF1d6t8RhAWsN5bjgMKVfQmEPiavkVskTtirnxxm9l45ZZzzzN9WSh dp/Q== X-Gm-Message-State: AOJu0YywTDEjJh2h+uG658tUBxqFsZcFlNUsp5LvQzkT56gJ15FecV71 l+3iH4y6bykYFTTosdFVBqwFDKgKhXLeRCkGtaiGr2skVCbOP4ZWFOlZ1yq/c94m8ZOR07vXcsL qhwaY/J9yis6WPfRln28gglVR7gESBiI+MlfoRBN6 X-Gm-Gg: ATEYQzz7qWybmLlnKK8GZqMWcb+2YVxAVKmlrbgL6NZIHnmb4/5+lMtb2To+vWeFMHR hmGTa8/ApslGmvxhL7lp0DHjqDvFUMFbXtuMora0+A/CihjxUcsDKdDsHqAVBSVLR09OSU4meV7 6Y3Tj0M4QxkFDL6Ev8rT40MTSJiboxOF0VlELeWewW1PXnNi+qIeRYWD5CcRoANnhJshrLFsmIh SEu69JjyLZBQowO9yRbksPmOA9rO2x0YnY3epXpl63kIKRpFUFt8NnxNW+bnx0Yv0JacZ2dE0Yb kb4l0DPJquaIBZpoM6d7LMQh8MQgQvDIPXClxjFd6LGuEzwkUTrnbPwSkyrs6OwzhPS7xnoo X-Received: by 2002:a05:620a:4146:b0:8cd:827a:2ab9 with SMTP id af79cd13be357-8cd827a3712mr891872785a.21.1773099991222; Mon, 09 Mar 2026 16:46:31 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Lukas Fittl Date: Mon, 9 Mar 2026 16:45:55 -0700 X-Gm-Features: AaiRm53k7IX4huGrxiF34Y3L7WmhmJbdeyZrGljuQ7nE3Cds7m22-kx11pYsJho Message-ID: Subject: Re: Stack-based tracking of per-node WAL/buffer usage To: Zsolt Parragi Cc: PostgreSQL Hackers , Andres Freund , Peter Smith Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hi Zsolt, Thanks for reviewing! On Mon, Mar 9, 2026 at 2:55=E2=80=AFPM Zsolt Parragi wrote: > + if (queryDesc->totaltime && estate->es_instrument && !IsParallelWorker(= )) > + { > + ExecFinalizeNodeInstrumentation(queryDesc->planstate); > + > + ExecFinalizeTriggerInstrumentation(estate); > + } > + > if (queryDesc->totaltime) > - InstrStop(queryDesc->totaltime); > + queryDesc->totaltime =3D InstrQueryStopFinalize(queryDesc->totaltime); > > In ExecFinalizeNodeInstrumentation InstrFinalizeNode pfrees the > original instrumentation, but doesn't remove it from the > unfinalized_children list. In normal execution in > InstrQueryStopFinalize ResourceOwnerForgetInstrumentation handles > this, but what about the error path, if something happens between the > two? > Won't we end up in ResOwnerReleaseInstrumentation and do use after > free reads and then a double free on the now invalid pointer? ResourceOwnerForgetInstrumentation directly follows the call to ExecFinalizeNodeInstrumentation in standard_ExecutorFinish, so I'm not sure which error case you're thinking of? We could explicitly zap the unfinalized_children list at the end of ExecFinalizeNodeInstrumentation (and have it take a QueryInstrumentation argument) to protect against this, but I don't think that makes much of a difference with the code as currently written. Maybe I'm misunderstanding which situation you're thinking of? > + if (myState->es->timing || myState->es->buffers) > + instr =3D InstrQueryStopFinalize(instr); > + > > Is it okay to leak 1 instrumentation copy per tuple in the query > context? This freshly palloced object will be out of scope a few lines > after this. I don't think that's a permanent leak, since it would be in the memory context of the caller, i.e. the per-query memory context, but yeah, this doesn't seem ideal, since we're keeping this memory around for each tuple processed. First of all, we could just do a pfree in serializeAnalyzeReceive after we added the stats, but that said, this extra instrumentation for EXPLAIN (SERIALIZE) is a bit of a curious edge case I suppose, since serializeAnalyzeReceive gets called once per tuple - and so doing the ResOwner dance for each tuple is probably not ideal. I wonder if maybe we shouldn't treat this as a case of NodeInstrumentation instead, and attach to the query's totaltime instrumentation for abort safety. The only thing that's a bit inconvenient about that is that calling InstrQueryRememberNode/InstrFinalizeNode requires passing in that parent QueryInstrumentation, and the SerializeDestReceiver doesn't currently have easy access to that. If we brute-force solve that we could just add an extra method to set it after CreateQueryDesc is called, but: Stepping back, from an overall API design perspective, we could also track the current QueryInstrumentation in a global variable - that'd make it easier to attach/finalize extra instrumentation on it even if we don't have direct access to querydesc->totaltime, or we're in a non-executor context that uses QueryInstrumentation (like some utility commands) for future use cases of extra sub-query-level instrumentation. For example, that could also come in handy if we wanted VACUUM VERBOSE break out the buffer access it does on indexes vs tables. > @@ -128,8 +130,22 @@ IndexNext(IndexScanState *node) > > IndexNextWithReorder doesn't need the same handling? Yes - good point, that's an unintentional omission. For context, I hadn't spent substantial amounts of time on this part of the series (rather on getting the design of the stack mechanism right), but will fix this in the next revision to also handle IndexNextWithReorder. I've also been thinking whether we should teach Index-only Scans to break out the per-table buffer stats for the (rare) heap fetches. I guess we should? (it'd be fairly easy to instrument that index_fetch_heap there now) > + if (scandesc->xs_heap_continue) > + elog(ERROR, "non-MVCC snapshots are not supported in index-only scans")= ; > > Shouldn't this say index scans? (there's another preexisting > indexonylscan mention in this file, but that also seems wrong) Hmm. Looking at this again, I think the handling of xs_heap_continue isn't right altogether. Basically it should be this instead, I think, so we correctly call the table AM's table_index_fetch_tuple again if call_again gets set: while ((tid =3D index_getnext_tid(scandesc, direction)) !=3D NULL) { if (node->iss_InstrumentTable) InstrPushStack(&node->iss_InstrumentTable->instr); for (;;) { found =3D index_fetch_heap(scandesc, slot); if (found || !scandesc->xs_heap_continue) break; } if (node->iss_InstrumentTable) InstrPopStack(&node->iss_InstrumentTable->instr); if (unlikely(!found)) continue; Which matches the loop in index_getnext_slot (i.e keep calling index_fetch_heap if xs_heap_continue=3Dtrue). Based on that, we wouldn't need the elog at all. > +#if HAVE_INSTR_STACK > + usage =3D &instr_top.bufusage; > +#else > + usage =3D &pgBufferUsage; > +#endif > > I don't see pgBufferUsage anywhere else in the current code, it was > probably removed between rebases? That's intentional to allow testing pg_session_buffer_usage both on master (HAVE_INSTR_STACK=3D0) and with the patched version (HAVE_INSTR_STACK=3D1) - mainly to ensure we're matching behavior for anyone interested in the top-line buffer or WAL numbers. I don't think we should commit pg_session_buffer_usage (at least not as-is), its just there for testing the earlier changes. > + char *prefix =3D title ? psprintf("%s ", title) : pstrdup(""); > + > + ExplainPropertyInteger(psprintf("%sShared Hit Blocks", prefix), NULL, > usage->shared_blks_hit, es); > > (And many similar ExplainPropery after this) > > title is NULL most of the time, and this results in 16 allocations for > that common case - isn't there a better solution like using > ExplainOpenGroup or something? Yeah, I guess that's fair - I don't know if the extra allocations really matter, but I can see your point. If we used ExplainOpenGroup that would make it a nested structure in the JSON/etc representation, so it'd look like this: "Shared Read Blocks:" 123.0, ... "Temp I/O Write Time:" 42.0, "Table": { "Shared Read Blocks:" 123.0, ... "Temp I/O Write Time:" 42.0, } compared to text representation (simplified): Buffers: shared read=3D123.0 I/O Timings: temp write=3D42.0 Table Buffers: shared read=3D123.0 Table I/O Timings: temp write=3D42.0 I think that can work, and we have precedence for using groups in a node like this, e.g. with "Workers". If we go with this, I do wonder if we should clarify the JSON/etc group key as "Table Access" (one group) or "Table Buffers" / "Table I/O" (two groups), instead of just "Table"? > + * Callers must ensure that no intermediate stack entries are skipped, t= o > + * handle aborts correctly. If you're thinking of calling this in a PG_F= INALLY > + * block, instead call InstrPopAndFinalizeStack which can skip intermedi= ate > + * stack entries, or instead use InstrStart/InstrStop. > > InstrPopAndFinalizeStack doesn't exists in the latest patch version Ah, yeah, that should say InstrStopFinalize (basically direct use of InstrPushStack/InstrPopStack is discouraged over use of the Instrumentation functions), I'll fix it in the next revision. Thanks, Lukas -- Lukas Fittl