Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w9gEY-001dzf-2k for pgsql-hackers@arkaria.postgresql.org; Mon, 06 Apr 2026 09:27:03 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1w9gEX-007Ugd-0t for pgsql-hackers@arkaria.postgresql.org; Mon, 06 Apr 2026 09:27:01 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w9gEW-007Uej-2b for pgsql-hackers@lists.postgresql.org; Mon, 06 Apr 2026 09:27:01 +0000 Received: from mail-qt1-x834.google.com ([2607:f8b0:4864:20::834]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1w9gEU-00000000pUn-3qX7 for pgsql-hackers@lists.postgresql.org; Mon, 06 Apr 2026 09:27:00 +0000 Received: by mail-qt1-x834.google.com with SMTP id d75a77b69052e-506a747448dso27789391cf.0 for ; Mon, 06 Apr 2026 02:26:58 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1775467618; cv=none; d=google.com; s=arc-20240605; b=AQevbakYUWfV/OtdvLmk8mruZ5rs12pVC59NPnU1s+o+dX7T0XL/S7bgXil1q6Gmeq yWfp2HsbdQbULRld4MgCFMrhfcgtHK7fJBT8uceV5ju4M1HJI1BmZr6zMk5iQsVFdJDj /RaKF77KU95H5qR4Z5R9U3SPtj8o/6maln2QWMM0uzgYJCHcHdpJaAbg1JXYEDDbyVHI Vo+WokQI8mbjEGNtOr7c2C7XhNSn6vlm1c6o5wU9LVY8kDLHRTBjgCk3GMWTk9gSP2W3 GfUuwvym0mwgZgJEf0FdajZpQlNIX8Wf/8LRboVJV1CKMSyP6X2LZXj1ZfcNQMrzoiOO 7pVA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=q9dVYJnh37NjCMOkGoDJlc7fGotSka1SuAKfAeZxhPw=; fh=cPvjZI5Hoabvyl6hd2S3cdrfzl+MivxxO4gHoYHHbno=; b=Ifb613FyMwoRztEfCXnQDBQa/BIc9bILw8knXNE1A+DaiCLsk3L7XcwBi8mSKbIKm0 uqOnUlDMVyhw/otZW0PMTU5Z1GJene1Lo82k1bahqxoR6wAbmOc91eQ2HX7RE73Qq0MW FrMik8YNW6k0U1XPlmN8TwTJH3fXRde9qF4Pc5+LztDPkwL8d2Exp3nVXSGvCyrjxNVP teHrnZ82VPRteakVF0nhrBxHdY845Kzpw52ksVZRrPROCWtafkjibRdYHxS/Zl5jkx/8 gNdWyFy71lDJaFgs0nylGoRVRoRB5/nVNzHClNjTMj1vbr17nP/i4B640Za7KT4fHGP7 6rXA==; darn=lists.postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fittl.com; s=google; t=1775467618; x=1776072418; darn=lists.postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=q9dVYJnh37NjCMOkGoDJlc7fGotSka1SuAKfAeZxhPw=; b=Jadh4PjtwK6hL0hUHOKJlOsO5h+Z6k/PIuRvkFUHeQbSLZBsWCyWcqHwtqiyT+H4/Z Pu79SLZNgE2G4zyaBrzfLIsB/LxZDfhou7/Hk/LDYOvIW/AS8EdEhhwLDDNNJeJyzH8f bkyT0doqAqgAPF85wzzCSs4Vcoj5WF+ha8mE8= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1775467618; x=1776072418; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=q9dVYJnh37NjCMOkGoDJlc7fGotSka1SuAKfAeZxhPw=; b=FKKWAbRLlFmX5/XwKa9UzEudk96S/viHBVJDYyWAlo7k1keTDa3q18nglBc7cBJAqW M6to+KamDZdxbFbJkCyW+awN3OCjQ22Ds5zqElJ6AMg0M5SPiVWNh2z+Nt65fuHIjKy8 sg6v+cXfpiZODco1mFIk7Tlg/sFrIhSf/+8dPOEKyZ4Hjl/nrb+CJgZcoxdmGNBmcWCe +l09x1BxTHZXRsfoR34At1xv16lFGukUoNxrR1BxlpZgPUhrd+fZFdc3qlBCXpNW9lsE 0CWLCKso8XAtZE/pIC7ngG8CTcpIhu1chxHhhDfDfzOLYabRQpW2SGoqPw4/JLEvkAoV uPoQ== X-Gm-Message-State: AOJu0Yx+GI3Y/TPjLkZGtkDcBI2T4ou/OUWsvnAdO2htmWDWMmIhtbIP r3tNn6DxlD6rweCKLdZeYe8UvMsIt/NwS8o9mzDcEKEE8g0qw/b5bb2lHOleVJqFKKv1w6EoFc3 QSTdPzUoy38bRqiAypaMSSK8yTjR93IBjZzXrSSIx X-Gm-Gg: AeBDieupubYUsf3QAQKqmB6DoA1LCXcsqPqOoQGwiB2ToTU6wxG1vAHTJjhVI6w+Lba euNafFAUcb58qTdj/1QC+W1axldEZx5FZaPg0jxzgw/kJi2wwM3qfP7FMsAvu/3vdi+LIj3eRkm bI80awJtLyZzad5Wy5StruS2/XCkFGFEnpZKUOp0UdunnN8sNnfjKqHyke+RP4EPvETYBOWv1mb SoScxnmH5uUykygX5bk/z9iN/iQ/tGxLhc8k8RNfU9csn29M7q5MPOKGOZh26dbro8XmpxEv+3r lSDWnchpNHSlQutD8XTCq+HVJsMI2oZ88+GmTP6zPSG8xNqYADI= X-Received: by 2002:a05:622a:589:b0:50d:6907:56cd with SMTP id d75a77b69052e-50d69075d17mr168251511cf.56.1775467618059; Mon, 06 Apr 2026 02:26:58 -0700 (PDT) MIME-Version: 1.0 References: <57biou6l65r7gr4nunoe6lignz2x6m3w45gihoypaez4pc46di@txj3bakhj66l> In-Reply-To: From: Lukas Fittl Date: Mon, 6 Apr 2026 02:26:20 -0700 X-Gm-Features: AQROBzCTrAHfSfRoqh00GqvxfNa0JQoHMsdy6DZsPWw_t4gMr3H-rFHS8LrmBtQ Message-ID: Subject: Re: Stack-based tracking of per-node WAL/buffer usage To: Andres Freund , Heikki Linnakangas Cc: PostgreSQL Hackers , Zsolt Parragi , Tomas Vondra , Peter Smith Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Sun, Apr 5, 2026 at 5:31=E2=80=AFAM Lukas Fittl wrote: > > On Sat, Apr 4, 2026 at 12:39=E2=80=AFPM Andres Freund wrote: > > > > > @@ -247,9 +248,19 @@ standard_ExecutorStart(QueryDesc *queryDesc, int= eflags) > > > estate->es_snapshot =3D RegisterSnapshot(queryDesc->snapshot); > > > estate->es_crosscheck_snapshot =3D RegisterSnapshot(queryDesc->= crosscheck_snapshot); > > > estate->es_top_eflags =3D eflags; > > > - estate->es_instrument =3D queryDesc->instrument_options; > > > estate->es_jit_flags =3D queryDesc->plannedstmt->jitFlags; > > > > > > + /* > > > + * Set up query-level instrumentation if needed. We do this bef= ore > > > + * InitPlan so that node and trigger instrumentation can be all= ocated > > > + * within the query's dedicated instrumentation memory context. > > > + */ > > > + if (!queryDesc->totaltime && queryDesc->instrument_options) > > > + { > > > + queryDesc->totaltime =3D InstrQueryAlloc(queryDesc->ins= trument_options); > > > + estate->es_instrument =3D queryDesc->totaltime; > > > + } > > > + > > > /* > > > * Set up an AFTER-trigger statement context, unless told not t= o, or > > > * unless it's EXPLAIN-only mode (when ExecutorFinish won't be = called). > > > > It seems pretty weird to still have queryDesc->totaltime *sometimes* cr= eated > > by pgss etc, but also create it in standard_ExecutorStart if not alread= y > > created. What if the explain options aren't compatible? Sure > > pgss/auto_explain use ALL, but that's not a given. > > Yeah, I think in practice all use cases I've ever seen pass > INSTRUMENT_ALL (and in fact it won't behave sane if this differs > between extensions), but you're right there is no guarantee. > > Overall, there are two aspects to this: > > 1) Query instrumentation as the parent for node instrumentation, > driven by use of EXPLAIN or auto_explain setting > queryDesc->instrument_options > > 2) Instrumentation as a mechanism to measure the activity of a query, > as used by pg_stat_statements or auto_explain (to get the runtime / > aggregate buffer usage) > > I could see two solutions: > > A) Keep two separate QueryInstrumentations (EXPLAIN/auto_explain get > es_instrument, any extensions measuring aggregate activity get > query->totaltime) > > B) Have one internal QueryInstrumentation (that's responsible to be > the abort "parent" to both node instrumentation, and query->totaltime) > > I was initially thinking we could maybe combine them creatively (i.e. > expand on what we've done so far), but I'm not sure there is a > reasonable design that isn't convoluted. We could also have a way for > extensions to "request" a certain level of instrumentation (instead of > directly allocating it), but it seems the current hooks are > insufficient for that. > > I've gone with solution (A) for now, with es_instrument being > allocated when per-node instrumentation is needed. Obviously that gets > us two ResOwner cleanups instead of one when e.g. auto_explain is > active, but I think that's still acceptable. It also shows how its > easy to do an extra level of nesting with the stack-based > instrumentation, without too much expense. > > With this in place, I do wonder if we should avoid the full memory > context setup in InstrQueryAlloc (i.e. instead just make a direct > allocation), unless we know that children are going to be attached. > The downside of that would be that we can't just re-assign the > instr_cxt in InstrQueryStopFinalize (we'd have to go back to the > previous logic of doing a memcpy into the callers context, for the > no-children case), but it might make a notable performance difference? I've done a stress test of the logic I had added here in v13 (two separate QueryInstrumentations to not mess with query->totaltime), specifically "pgbench -n -j 32 -c 32 -f select1.sql -T 60 postgres" with auto_explain enabled with all log_* settings (so its both exercising query->totaltime and instrument_options), and unfortunately that showed about a 1 to 2% impact. So I don't think this was the right direction. I'll go back to what I had before, but fix the specific issue you pointed out when instrumentation options differ. Specifically, I'll add a preparatory patch to stop extensions from allocating queryDesc->totaltime themselves, and add queryDesc->totaltime_options that they use to request which level of totaltime instrumentation they need. If they request less than INSTRUMENT_ALL, they might still get more instrumentation actually collected, when the query in question is an EXPLAIN (ANALYZE). But since they don't have to read those fields from query->totaltime, I think that's acceptable. > > > > > > > > @@ -1284,8 +1325,8 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo, > > > palloc0_array(FmgrInfo, n); > > > resultRelInfo->ri_TrigWhenExprs =3D (ExprState **) > > > palloc0_array(ExprState *, n); > > > - if (instrument_options) > > > - resultRelInfo->ri_TrigInstrument =3D InstrAlloc= Trigger(n, instrument_options); > > > + if (qinstr) > > > + resultRelInfo->ri_TrigInstrument =3D InstrAlloc= Trigger(qinstr, n); > > > > Hm. Why do we not need to pass down the instrument_options anymore? I g= uess > > the assumption is that we always are going to use the flags from qinstr= ? > > > > Is that right? Because right now pgss/auto_explain use _ALL, even when = an > > EXPLAIN ANALYZE doesn't. > > > > With the solution mentioned earlier, where es_instrument is a separate > allocation, this problem now goes away without any extra changes > needed. > > Overall, I think its reasonable to make node/trigger instrumentation > be attached to a query instrumentation that has the instrumentation > options set that should be applied. That way we don't have think about > edge cases like a query instrumentation that doesn't need a stack, but > children that do. Because we now again have a query->totaltime that may have more instrumentation_options than the per-node/per-trigger instrumentation need, we need to explicitly pass the instrumentation options that were requested. To support that I'll bring back "es_instrument" with its prior meaning, and instead add "es_query_instr" to pass down the query instrumentation to the trigger instrumentation calls. > > > > > From 16e44d5508f91dd23da780901f3ec0126965628d Mon Sep 17 00:00:00 200= 1 > > > From: Lukas Fittl > > > Date: Sat, 7 Mar 2026 17:52:24 -0800 > > > Subject: [PATCH v12 7/9] instrumentation: Optimize ExecProcNodeInstr > > > instructions by inlining > > > > > > For most queries, the bulk of the overhead of EXPLAIN ANALYZE happens= in > > > ExecProcNodeInstr when starting/stopping instrumentation for that nod= e. > > > > > > Previously each ExecProcNodeInstr would check which instrumentation > > > options are active in the InstrStartNode/InstrStopNode calls, and do = the > > > corresponding work (timers, instrumentation stack, etc.). These > > > conditionals being checked for each tuple being emitted add up, and c= ause > > > non-optimal set of instructions to be generated by the compiler. > > > > > > Because we already have an existing mechanism to specify a function > > > pointer when instrumentation is enabled, we can instead create specia= lized > > > functions that are tailored to the instrumentation options enabled, a= nd > > > avoid conditionals on subsequent ExecProcNodeInstr calls. This result= s in > > > the overhead for EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) for a str= ess > > > test with a large COUNT(*) that does many ExecProcNode calls from ~ 2= 0% on > > > top of actual runtime to ~ 3%. When using BUFFERS ON the same query g= oes > > > from ~ 20% to ~ 10% on top of actual runtime. > > > > I assume this is to a significant degree due to to allowing for inlinin= g. Have > > you checked how much of the effort you get by just putting ExecProcNode= Instr() > > into instrument.c? > > Worth a try - I haven't tested that yet - I'll come back to this > separately and verify how much that buys us, vs spelling out the > different variants. I've run a test of just putting ExecProcNodeInstr into instrument.c (and adding an inline keyword to the functions it calls), and it does help over not doing it at all, but its not the full experience: CREATE TABLE lotsarows(key int not null); INSERT INTO lotsarows SELECT generate_series(1, 50000000); VACUUM FREEZE lotsarows; EXPLAIN (ANALYZE, ...) SELECT count(*) FROM lotsarows; Below measurements are best out of three, for these three versions: (1) with stack only (2) with stack + move ExecProcNodeInstr with no changes (your idea) (3) with stack + move + avoid branches (current patch set) BUFFERS OFF, TIMING OFF: (1): 309ms (2): 292ms (3): 283ms BUFFERS ON, TIMING OFF: (1): 322ms (2): 314ms (3): 294ms BUFFERS ON, TIMING ON: (1): 829ms (2): 814ms (3): 803ms I suspect the discrepancy for BUFFERS in particular is because the commit has an optimized form of the stack popping (InstrPopStackTo), but I have not taken a close look at the assembly differences here. For now I'll keep this as-is, but that can be changed quickly. Thanks, Lukas --=20 Lukas Fittl