public inbox for [email protected]  
help / color / mirror / Atom feed
Re: Refactor query normalization into core query jumbling
14+ messages / 3 participants
[nested] [flat]

* Re: Refactor query normalization into core query jumbling
@ 2026-01-06 20:20  Sami Imseih <[email protected]>
  0 siblings, 1 reply; 14+ messages in thread

From: Sami Imseih @ 2026-01-06 20:20 UTC (permalink / raw)
  To: Lukas Fittl <[email protected]>; +Cc: Michael Paquier <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>

> >> backwards compatible. Basically, we keep JumbleState a non-constant,
> >> but provide core APIs for any operation, such as
> >> generate_normalized_query,
> >> that needs to modify the state. So, my approach was not about enforcing a
> >> read-only JumbleState, but about providing the API to dissuade an author
> >> from modifying a JumbleState.
>
> > Given the lack of public APIs to modify JumbleState today, I don't see how
> > an extension would do
> > modifications in a meaningful way, short of copying the code. I think we
> > should be a bit bolder here in
> > enforcing a convention, either clearly making it read-only or dropping the
> > argument again.
>
> Based on the discussion so far I am leaning towards making JumbleState
> read-only as described here [0]. I don't see how we can drop JumbleState
> completely from hooks, since normalization needs to occur on-demand
> by the extension.

Here is v4.

0001 - addresses the comments made by Bertrand.
0002 - makes JumbleState a constant when passed to post_parse_analyze
and updates all downstream code that consume the JumbleState. This
means we now need to copy the locations from JState into a local/temporary
array when generating the normalized string.

--
Sami Imseih
Amazon Web Services (AWS)


Attachments:

  [application/octet-stream] v4-0002-Make-JumbleState-const-in-post_parse_analyze-hook.patch (8.5K, 2-v4-0002-Make-JumbleState-const-in-post_parse_analyze-hook.patch)
  download | inline diff:
From 12575c2db5ccfc070b4024d644e4c72e32497486 Mon Sep 17 00:00:00 2001
From: Ubuntu <[email protected]>
Date: Tue, 6 Jan 2026 19:29:01 +0000
Subject: [PATCH v4 2/2] Make JumbleState const in post_parse_analyze hook

Change the post_parse_analyze_hook_type signature to take a const
JumbleState parameter, preventing hooks from modifying the jumble
state during query analysis. This improves API safety by making it
clear that hooks should only read from the jumble state, not modify
it.

Update pg_stat_statements and related functions to match the new const
signature. Refactor SetConstantLengths() to return a newly allocated
LocationLen array instead of modifying the JumbleState, and update
GenerateNormalizedQuery() to work with the const JumbleState and
manage the returned array properly.

Discussion: https://postgr.es/m/CAA5RZ0tZp5qU0ikZEEqJnxvdSNGh1DWv80sb-k4QAUmiMoOp_Q@mail.gmail.com
---
 .../pg_stat_statements/pg_stat_statements.c   |  8 ++--
 src/backend/nodes/queryjumblefuncs.c          | 39 +++++++++++++------
 src/include/nodes/queryjumble.h               |  2 +-
 src/include/parser/analyze.h                  |  2 +-
 4 files changed, 33 insertions(+), 18 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index b90891859d9..b796d839cec 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -332,7 +332,7 @@ static void pgss_shmem_request(void);
 static void pgss_shmem_startup(void);
 static void pgss_shmem_shutdown(int code, Datum arg);
 static void pgss_post_parse_analyze(ParseState *pstate, Query *query,
-									JumbleState *jstate);
+									const JumbleState *jstate);
 static PlannedStmt *pgss_planner(Query *parse,
 								 const char *query_string,
 								 int cursorOptions,
@@ -356,7 +356,7 @@ static void pgss_store(const char *query, int64 queryId,
 					   const BufferUsage *bufusage,
 					   const WalUsage *walusage,
 					   const struct JitInstrumentation *jitusage,
-					   JumbleState *jstate,
+					   const JumbleState *jstate,
 					   int parallel_workers_to_launch,
 					   int parallel_workers_launched,
 					   PlannedStmtOrigin planOrigin);
@@ -832,7 +832,7 @@ error:
  * Post-parse-analysis hook: mark query with a queryId
  */
 static void
-pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
+pgss_post_parse_analyze(ParseState *pstate, Query *query, const JumbleState *jstate)
 {
 	if (prev_post_parse_analyze_hook)
 		prev_post_parse_analyze_hook(pstate, query, jstate);
@@ -1289,7 +1289,7 @@ pgss_store(const char *query, int64 queryId,
 		   const BufferUsage *bufusage,
 		   const WalUsage *walusage,
 		   const struct JitInstrumentation *jitusage,
-		   JumbleState *jstate,
+		   const JumbleState *jstate,
 		   int parallel_workers_to_launch,
 		   int parallel_workers_launched,
 		   PlannedStmtOrigin planOrigin)
diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c
index dd53c82b2a8..4dfe9087547 100644
--- a/src/backend/nodes/queryjumblefuncs.c
+++ b/src/backend/nodes/queryjumblefuncs.c
@@ -789,8 +789,10 @@ CompLocation(const void *a, const void *b)
 }
 
 /*
- * Given a valid SQL string and an array of constant-location records,
- * fill in the textual lengths of those constants in JumbleState.
+ * Given a valid SQL string and an array of constant-location records, return
+ * the textual lengths of those constants in a newly allocated LocationLen
+ * array, or NULL if there are no constants. It is the caller's responsibility
+ * to pfree the result, if necessary.
  *
  * The constants may use any allowed constant syntax, such as float literals,
  * bit-strings, single-quoted strings and dollar-quoted strings.  This is
@@ -803,8 +805,8 @@ CompLocation(const void *a, const void *b)
  * a negative numeric constant.  This precludes there ever being another
  * reason for a constant to start with a '-'.
  */
-static void
-SetConstantLengths(JumbleState *jstate, const char *query,
+static LocationLen *
+SetConstantLengths(const JumbleState *jstate, const char *query,
 				   int query_loc)
 {
 	LocationLen *locs;
@@ -813,14 +815,20 @@ SetConstantLengths(JumbleState *jstate, const char *query,
 	core_YYSTYPE yylval;
 	YYLTYPE		yylloc;
 
+	if (jstate->clocations_count == 0)
+		return NULL;
+
+	/* Copy constant locations to avoid modifying jstate */
+	locs = palloc(jstate->clocations_count * sizeof(LocationLen));
+	memcpy(locs, jstate->clocations, jstate->clocations_count * sizeof(LocationLen));
+
 	/*
 	 * Sort the records by location so that we can process them in order while
 	 * scanning the query text.
 	 */
 	if (jstate->clocations_count > 1)
-		qsort(jstate->clocations, jstate->clocations_count,
+		qsort(locs, jstate->clocations_count,
 			  sizeof(LocationLen), CompLocation);
-	locs = jstate->clocations;
 
 	/* initialize the flex scanner --- should match raw_parser() */
 	yyscanner = scanner_init(query,
@@ -908,6 +916,8 @@ SetConstantLengths(JumbleState *jstate, const char *query,
 	}
 
 	scanner_finish(yyscanner);
+
+	return locs;
 }
 
 /*
@@ -926,7 +936,7 @@ SetConstantLengths(JumbleState *jstate, const char *query,
  * Returns a palloc'd string.
  */
 char *
-GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+GenerateNormalizedQuery(const JumbleState *jstate, const char *query,
 						int query_loc, int *query_len_p)
 {
 	char	   *norm_query;
@@ -938,12 +948,13 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
 				last_off = 0,	/* Offset from start for previous tok */
 				last_tok_len = 0;	/* Length (in bytes) of that tok */
 	int			num_constants_replaced = 0;
+	LocationLen *locs = NULL;
 
 	/*
 	 * Set constants' lengths in JumbleState, as only locations are set during
 	 * DoJumble(). Note this also ensures the items are sorted by location.
 	 */
-	SetConstantLengths(jstate, query, query_loc);
+	locs = SetConstantLengths(jstate, query, query_loc);
 
 	/*
 	 * Allow for $n symbols to be longer than the constants they replace.
@@ -969,15 +980,15 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
 		 * the parameter in the next iteration (or after the loop is done),
 		 * which is a bit odd but seems to work okay in most cases.
 		 */
-		if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
+		if (locs[i].extern_param && !jstate->has_squashed_lists)
 			continue;
 
-		off = jstate->clocations[i].location;
+		off = locs[i].location;
 
 		/* Adjust the constant's location (see SetConstantLengths) */
 		off -= query_loc;
 
-		tok_len = jstate->clocations[i].length;
+		tok_len = locs[i].length;
 
 		if (tok_len < 0)
 			continue;			/* ignore any duplicates */
@@ -996,7 +1007,7 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
 		 */
 		n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
 							  num_constants_replaced + 1 + jstate->highest_extern_param_id,
-							  jstate->clocations[i].squashed ? " /*, ... */" : "");
+							  locs[i].squashed ? " /*, ... */" : "");
 		num_constants_replaced++;
 
 		/* move forward */
@@ -1005,6 +1016,10 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
 		last_tok_len = tok_len;
 	}
 
+	/* Clean up temporary location array before any assertions */
+	if (locs)
+		pfree(locs);
+
 	/*
 	 * We've copied up until the last ignorable constant.  Copy over the
 	 * remaining bytes of the original query string.
diff --git a/src/include/nodes/queryjumble.h b/src/include/nodes/queryjumble.h
index cb8741b1d05..bc3dc4d846f 100644
--- a/src/include/nodes/queryjumble.h
+++ b/src/include/nodes/queryjumble.h
@@ -91,7 +91,7 @@ extern PGDLLIMPORT int compute_query_id;
 
 
 extern const char *CleanQuerytext(const char *query, int *location, int *len);
-extern char *GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+extern char *GenerateNormalizedQuery(const JumbleState *jstate, const char *query,
 									 int query_loc, int *query_len_p);
 extern JumbleState *JumbleQuery(Query *query);
 extern void EnableQueryId(void);
diff --git a/src/include/parser/analyze.h b/src/include/parser/analyze.h
index f29ed03b476..f80da72933c 100644
--- a/src/include/parser/analyze.h
+++ b/src/include/parser/analyze.h
@@ -21,7 +21,7 @@
 /* Hook for plugins to get control at end of parse analysis */
 typedef void (*post_parse_analyze_hook_type) (ParseState *pstate,
 											  Query *query,
-											  JumbleState *jstate);
+											  const JumbleState *jstate);
 extern PGDLLIMPORT post_parse_analyze_hook_type post_parse_analyze_hook;
 
 
-- 
2.43.0



  [application/octet-stream] v4-0001-pg_stat_statements-Move-query-normalization-to-co.patch (20.8K, 3-v4-0001-pg_stat_statements-Move-query-normalization-to-co.patch)
  download | inline diff:
From 63741aec2ca489d066ed4cdc8f50ec36b7894ffd Mon Sep 17 00:00:00 2001
From: Ubuntu <[email protected]>
Date: Thu, 18 Dec 2025 18:59:31 +0000
Subject: [PATCH v4 1/2] pg_stat_statements: Move query normalization to core

pg_stat_statements was modifying core generated JumbleState instances
by calling fill_in_constant_lengths() via generate_normalized_query()
to populate constant lengths. This creates problems since extensions
should not modify shared core data structures that may be used by
other extensions, and this could be considered a layering violation.

Move query normalization functions to core as a global function to
fix this. Extensions now call GenerateNormalizedQuery() instead of
directly modifying JumbleState.

This change also addresses code duplication, as several extensions
have been copying generate_normalized_query() to implement their own
normalization. The new core API provides a single, consistent
implementation for all extensions to use.

Functions are renamed to match core naming conventions and comments
are updated for clarity.

Discussion: https://postgr.es/m/CAA5RZ0tZp5qU0ikZEEqJnxvdSNGh1DWv80sb-k4QAUmiMoOp_Q@mail.gmail.com
---
 .../pg_stat_statements/pg_stat_statements.c   | 276 +-----------------
 src/backend/nodes/queryjumblefuncs.c          | 248 ++++++++++++++++
 src/include/nodes/queryjumble.h               |   2 +
 3 files changed, 260 insertions(+), 266 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 39208f80b5b..b90891859d9 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -50,7 +50,6 @@
 #include "access/htup_details.h"
 #include "access/parallel.h"
 #include "catalog/pg_authid.h"
-#include "common/int.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
 #include "jit/jit.h"
@@ -59,7 +58,6 @@
 #include "nodes/queryjumble.h"
 #include "optimizer/planner.h"
 #include "parser/analyze.h"
-#include "parser/scanner.h"
 #include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
@@ -377,11 +375,6 @@ static char *qtext_fetch(Size query_offset, int query_len,
 static bool need_gc_qtexts(void);
 static void gc_qtexts(void);
 static TimestampTz entry_reset(Oid userid, Oid dbid, int64 queryid, bool minmax_only);
-static char *generate_normalized_query(JumbleState *jstate, const char *query,
-									   int query_loc, int *query_len_p);
-static void fill_in_constant_lengths(JumbleState *jstate, const char *query,
-									 int query_loc);
-static int	comp_location(const void *a, const void *b);
 
 
 /*
@@ -1359,9 +1352,16 @@ pgss_store(const char *query, int64 queryId,
 		if (jstate)
 		{
 			LWLockRelease(pgss->lock);
-			norm_query = generate_normalized_query(jstate, query,
-												   query_location,
-												   &query_len);
+
+			/*
+			 * generate the normalized query. Note that the normalized
+			 * representation may well vary depending on just which
+			 * "equivalent" query is used to create the hashtable entry. We
+			 * assume this is OK.
+			 */
+			norm_query = GenerateNormalizedQuery(jstate, query,
+												 query_location,
+												 &query_len);
 			LWLockAcquire(pgss->lock, LW_SHARED);
 		}
 
@@ -2823,259 +2823,3 @@ release_lock:
 
 	return stats_reset;
 }
-
-/*
- * Generate a normalized version of the query string that will be used to
- * represent all similar queries.
- *
- * Note that the normalized representation may well vary depending on
- * just which "equivalent" query is used to create the hashtable entry.
- * We assume this is OK.
- *
- * If query_loc > 0, then "query" has been advanced by that much compared to
- * the original string start, so we need to translate the provided locations
- * to compensate.  (This lets us avoid re-scanning statements before the one
- * of interest, so it's worth doing.)
- *
- * *query_len_p contains the input string length, and is updated with
- * the result string length on exit.  The resulting string might be longer
- * or shorter depending on what happens with replacement of constants.
- *
- * Returns a palloc'd string.
- */
-static char *
-generate_normalized_query(JumbleState *jstate, const char *query,
-						  int query_loc, int *query_len_p)
-{
-	char	   *norm_query;
-	int			query_len = *query_len_p;
-	int			norm_query_buflen,	/* Space allowed for norm_query */
-				len_to_wrt,		/* Length (in bytes) to write */
-				quer_loc = 0,	/* Source query byte location */
-				n_quer_loc = 0, /* Normalized query byte location */
-				last_off = 0,	/* Offset from start for previous tok */
-				last_tok_len = 0;	/* Length (in bytes) of that tok */
-	int			num_constants_replaced = 0;
-
-	/*
-	 * Get constants' lengths (core system only gives us locations).  Note
-	 * this also ensures the items are sorted by location.
-	 */
-	fill_in_constant_lengths(jstate, query, query_loc);
-
-	/*
-	 * Allow for $n symbols to be longer than the constants they replace.
-	 * Constants must take at least one byte in text form, while a $n symbol
-	 * certainly isn't more than 11 bytes, even if n reaches INT_MAX.  We
-	 * could refine that limit based on the max value of n for the current
-	 * query, but it hardly seems worth any extra effort to do so.
-	 */
-	norm_query_buflen = query_len + jstate->clocations_count * 10;
-
-	/* Allocate result buffer */
-	norm_query = palloc(norm_query_buflen + 1);
-
-	for (int i = 0; i < jstate->clocations_count; i++)
-	{
-		int			off,		/* Offset from start for cur tok */
-					tok_len;	/* Length (in bytes) of that tok */
-
-		/*
-		 * If we have an external param at this location, but no lists are
-		 * being squashed across the query, then we skip here; this will make
-		 * us print the characters found in the original query that represent
-		 * the parameter in the next iteration (or after the loop is done),
-		 * which is a bit odd but seems to work okay in most cases.
-		 */
-		if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
-			continue;
-
-		off = jstate->clocations[i].location;
-
-		/* Adjust recorded location if we're dealing with partial string */
-		off -= query_loc;
-
-		tok_len = jstate->clocations[i].length;
-
-		if (tok_len < 0)
-			continue;			/* ignore any duplicates */
-
-		/* Copy next chunk (what precedes the next constant) */
-		len_to_wrt = off - last_off;
-		len_to_wrt -= last_tok_len;
-		Assert(len_to_wrt >= 0);
-		memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
-		n_quer_loc += len_to_wrt;
-
-		/*
-		 * And insert a param symbol in place of the constant token; and, if
-		 * we have a squashable list, insert a placeholder comment starting
-		 * from the list's second value.
-		 */
-		n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
-							  num_constants_replaced + 1 + jstate->highest_extern_param_id,
-							  jstate->clocations[i].squashed ? " /*, ... */" : "");
-		num_constants_replaced++;
-
-		/* move forward */
-		quer_loc = off + tok_len;
-		last_off = off;
-		last_tok_len = tok_len;
-	}
-
-	/*
-	 * We've copied up until the last ignorable constant.  Copy over the
-	 * remaining bytes of the original query string.
-	 */
-	len_to_wrt = query_len - quer_loc;
-
-	Assert(len_to_wrt >= 0);
-	memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
-	n_quer_loc += len_to_wrt;
-
-	Assert(n_quer_loc <= norm_query_buflen);
-	norm_query[n_quer_loc] = '\0';
-
-	*query_len_p = n_quer_loc;
-	return norm_query;
-}
-
-/*
- * Given a valid SQL string and an array of constant-location records,
- * fill in the textual lengths of those constants.
- *
- * The constants may use any allowed constant syntax, such as float literals,
- * bit-strings, single-quoted strings and dollar-quoted strings.  This is
- * accomplished by using the public API for the core scanner.
- *
- * It is the caller's job to ensure that the string is a valid SQL statement
- * with constants at the indicated locations.  Since in practice the string
- * has already been parsed, and the locations that the caller provides will
- * have originated from within the authoritative parser, this should not be
- * a problem.
- *
- * Multiple constants can have the same location.  We reset lengths of those
- * past the first to -1 so that they can later be ignored.
- *
- * If query_loc > 0, then "query" has been advanced by that much compared to
- * the original string start, so we need to translate the provided locations
- * to compensate.  (This lets us avoid re-scanning statements before the one
- * of interest, so it's worth doing.)
- *
- * N.B. There is an assumption that a '-' character at a Const location begins
- * a negative numeric constant.  This precludes there ever being another
- * reason for a constant to start with a '-'.
- */
-static void
-fill_in_constant_lengths(JumbleState *jstate, const char *query,
-						 int query_loc)
-{
-	LocationLen *locs;
-	core_yyscan_t yyscanner;
-	core_yy_extra_type yyextra;
-	core_YYSTYPE yylval;
-	YYLTYPE		yylloc;
-
-	/*
-	 * Sort the records by location so that we can process them in order while
-	 * scanning the query text.
-	 */
-	if (jstate->clocations_count > 1)
-		qsort(jstate->clocations, jstate->clocations_count,
-			  sizeof(LocationLen), comp_location);
-	locs = jstate->clocations;
-
-	/* initialize the flex scanner --- should match raw_parser() */
-	yyscanner = scanner_init(query,
-							 &yyextra,
-							 &ScanKeywords,
-							 ScanKeywordTokens);
-
-	/* we don't want to re-emit any escape string warnings */
-	yyextra.escape_string_warning = false;
-
-	/* Search for each constant, in sequence */
-	for (int i = 0; i < jstate->clocations_count; i++)
-	{
-		int			loc;
-		int			tok;
-
-		/* Ignore constants after the first one in the same location */
-		if (i > 0 && locs[i].location == locs[i - 1].location)
-		{
-			locs[i].length = -1;
-			continue;
-		}
-
-		if (locs[i].squashed)
-			continue;			/* squashable list, ignore */
-
-		/* Adjust recorded location if we're dealing with partial string */
-		loc = locs[i].location - query_loc;
-		Assert(loc >= 0);
-
-		/*
-		 * We have a valid location for a constant that's not a dupe. Lex
-		 * tokens until we find the desired constant.
-		 */
-		for (;;)
-		{
-			tok = core_yylex(&yylval, &yylloc, yyscanner);
-
-			/* We should not hit end-of-string, but if we do, behave sanely */
-			if (tok == 0)
-				break;			/* out of inner for-loop */
-
-			/*
-			 * We should find the token position exactly, but if we somehow
-			 * run past it, work with that.
-			 */
-			if (yylloc >= loc)
-			{
-				if (query[loc] == '-')
-				{
-					/*
-					 * It's a negative value - this is the one and only case
-					 * where we replace more than a single token.
-					 *
-					 * Do not compensate for the core system's special-case
-					 * adjustment of location to that of the leading '-'
-					 * operator in the event of a negative constant.  It is
-					 * also useful for our purposes to start from the minus
-					 * symbol.  In this way, queries like "select * from foo
-					 * where bar = 1" and "select * from foo where bar = -2"
-					 * will have identical normalized query strings.
-					 */
-					tok = core_yylex(&yylval, &yylloc, yyscanner);
-					if (tok == 0)
-						break;	/* out of inner for-loop */
-				}
-
-				/*
-				 * We now rely on the assumption that flex has placed a zero
-				 * byte after the text of the current token in scanbuf.
-				 */
-				locs[i].length = strlen(yyextra.scanbuf + loc);
-				break;			/* out of inner for-loop */
-			}
-		}
-
-		/* If we hit end-of-string, give up, leaving remaining lengths -1 */
-		if (tok == 0)
-			break;
-	}
-
-	scanner_finish(yyscanner);
-}
-
-/*
- * comp_location: comparator for qsorting LocationLen structs by location
- */
-static int
-comp_location(const void *a, const void *b)
-{
-	int			l = ((const LocationLen *) a)->location;
-	int			r = ((const LocationLen *) b)->location;
-
-	return pg_cmp_s32(l, r);
-}
diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c
index ffc230af427..dd53c82b2a8 100644
--- a/src/backend/nodes/queryjumblefuncs.c
+++ b/src/backend/nodes/queryjumblefuncs.c
@@ -40,10 +40,12 @@
 #include "access/transam.h"
 #include "catalog/pg_proc.h"
 #include "common/hashfn.h"
+#include "common/int.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
 #include "nodes/queryjumble.h"
 #include "utils/lsyscache.h"
+#include "parser/scanner.h"
 #include "parser/scansup.h"
 
 #define JUMBLE_SIZE				1024	/* query serialization buffer size */
@@ -773,3 +775,249 @@ _jumbleRangeTblEntry_eref(JumbleState *jstate,
 	 */
 	JUMBLE_STRING(aliasname);
 }
+
+/*
+ * CompLocation: comparator for qsorting LocationLen structs by location
+ */
+static int
+CompLocation(const void *a, const void *b)
+{
+	int			l = ((const LocationLen *) a)->location;
+	int			r = ((const LocationLen *) b)->location;
+
+	return pg_cmp_s32(l, r);
+}
+
+/*
+ * Given a valid SQL string and an array of constant-location records,
+ * fill in the textual lengths of those constants in JumbleState.
+ *
+ * The constants may use any allowed constant syntax, such as float literals,
+ * bit-strings, single-quoted strings and dollar-quoted strings.  This is
+ * accomplished by using the public API for the core scanner.
+ *
+ * Multiple constants can have the same location.  We reset lengths of those
+ * past the first to -1 so that they can later be ignored.
+ *
+ * N.B. There is an assumption that a '-' character at a Const location begins
+ * a negative numeric constant.  This precludes there ever being another
+ * reason for a constant to start with a '-'.
+ */
+static void
+SetConstantLengths(JumbleState *jstate, const char *query,
+				   int query_loc)
+{
+	LocationLen *locs;
+	core_yyscan_t yyscanner;
+	core_yy_extra_type yyextra;
+	core_YYSTYPE yylval;
+	YYLTYPE		yylloc;
+
+	/*
+	 * Sort the records by location so that we can process them in order while
+	 * scanning the query text.
+	 */
+	if (jstate->clocations_count > 1)
+		qsort(jstate->clocations, jstate->clocations_count,
+			  sizeof(LocationLen), CompLocation);
+	locs = jstate->clocations;
+
+	/* initialize the flex scanner --- should match raw_parser() */
+	yyscanner = scanner_init(query,
+							 &yyextra,
+							 &ScanKeywords,
+							 ScanKeywordTokens);
+
+	/* we don't want to re-emit any escape string warnings */
+	yyextra.escape_string_warning = false;
+
+	/* Search for each constant, in sequence */
+	for (int i = 0; i < jstate->clocations_count; i++)
+	{
+		int			loc;
+		int			tok;
+
+		/* Ignore constants after the first one in the same location */
+		if (i > 0 && locs[i].location == locs[i - 1].location)
+		{
+			locs[i].length = -1;
+			continue;
+		}
+
+		if (locs[i].squashed)
+			continue;			/* squashable list, ignore */
+
+		/*
+		 * Adjust the constant's location using the provided starting location
+		 * of the current statement. This allows us to avoid scanning a
+		 * multi-statement string from the beginning.
+		 */
+		loc = locs[i].location - query_loc;
+		Assert(loc >= 0);
+
+		/*
+		 * We have a valid location for a constant that's not a dupe. Lex
+		 * tokens until we find the desired constant.
+		 */
+		for (;;)
+		{
+			tok = core_yylex(&yylval, &yylloc, yyscanner);
+
+			/* We should not hit end-of-string, but if we do, behave sanely */
+			if (tok == 0)
+				break;			/* out of inner for-loop */
+
+			/*
+			 * We should find the token position exactly, but if we somehow
+			 * run past it, work with that.
+			 */
+			if (yylloc >= loc)
+			{
+				if (query[loc] == '-')
+				{
+					/*
+					 * It's a negative value - this is the one and only case
+					 * where we replace more than a single token.
+					 *
+					 * Do not compensate for the special-case adjustment of
+					 * location to that of the leading '-' operator in the
+					 * event of a negative constant (see doNegate() in
+					 * gram.y).  It is also useful for our purposes to start
+					 * from the minus symbol.  In this way, queries like
+					 * "select * from foo where bar = 1" and "select * from
+					 * foo where bar = -2" will have identical normalized
+					 * query strings.
+					 */
+					tok = core_yylex(&yylval, &yylloc, yyscanner);
+					if (tok == 0)
+						break;	/* out of inner for-loop */
+				}
+
+				/*
+				 * We now rely on the assumption that flex has placed a zero
+				 * byte after the text of the current token in scanbuf.
+				 */
+				locs[i].length = strlen(yyextra.scanbuf + loc);
+				break;			/* out of inner for-loop */
+			}
+		}
+
+		/* If we hit end-of-string, give up, leaving remaining lengths -1 */
+		if (tok == 0)
+			break;
+	}
+
+	scanner_finish(yyscanner);
+}
+
+/*
+ * Callback to generate a normalized version of the query string that will be used to
+ * represent all similar queries.
+ *
+ * It is the caller's job to ensure that the string is a valid SQL statement
+ * with the correct constant locations in jstate->clocations. Since in
+ * practice the string has already been parsed using the authoritative parser
+ * and the locations are set by core query jumbling, this should not be a problem.
+ *
+ * *query_len_p contains the input string length, and is updated with
+ * the result string length on exit.  The resulting string might be longer
+ * or shorter depending on what happens with replacement of constants.
+ *
+ * Returns a palloc'd string.
+ */
+char *
+GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+						int query_loc, int *query_len_p)
+{
+	char	   *norm_query;
+	int			query_len = *query_len_p;
+	int			norm_query_buflen,	/* Space allowed for norm_query */
+				len_to_wrt,		/* Length (in bytes) to write */
+				quer_loc = 0,	/* Source query byte location */
+				n_quer_loc = 0, /* Normalized query byte location */
+				last_off = 0,	/* Offset from start for previous tok */
+				last_tok_len = 0;	/* Length (in bytes) of that tok */
+	int			num_constants_replaced = 0;
+
+	/*
+	 * Set constants' lengths in JumbleState, as only locations are set during
+	 * DoJumble(). Note this also ensures the items are sorted by location.
+	 */
+	SetConstantLengths(jstate, query, query_loc);
+
+	/*
+	 * Allow for $n symbols to be longer than the constants they replace.
+	 * Constants must take at least one byte in text form, while a $n symbol
+	 * certainly isn't more than 11 bytes, even if n reaches INT_MAX.  We
+	 * could refine that limit based on the max value of n for the current
+	 * query, but it hardly seems worth any extra effort to do so.
+	 */
+	norm_query_buflen = query_len + jstate->clocations_count * 10;
+
+	/* Allocate result buffer */
+	norm_query = palloc(norm_query_buflen + 1);
+
+	for (int i = 0; i < jstate->clocations_count; i++)
+	{
+		int			off,		/* Offset from start for cur tok */
+					tok_len;	/* Length (in bytes) of that tok */
+
+		/*
+		 * If we have an external param at this location, but no lists are
+		 * being squashed across the query, then we skip here; this will make
+		 * us print the characters found in the original query that represent
+		 * the parameter in the next iteration (or after the loop is done),
+		 * which is a bit odd but seems to work okay in most cases.
+		 */
+		if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
+			continue;
+
+		off = jstate->clocations[i].location;
+
+		/* Adjust the constant's location (see SetConstantLengths) */
+		off -= query_loc;
+
+		tok_len = jstate->clocations[i].length;
+
+		if (tok_len < 0)
+			continue;			/* ignore any duplicates */
+
+		/* Copy next chunk (what precedes the next constant) */
+		len_to_wrt = off - last_off;
+		len_to_wrt -= last_tok_len;
+		Assert(len_to_wrt >= 0);
+		memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
+		n_quer_loc += len_to_wrt;
+
+		/*
+		 * And insert a param symbol in place of the constant token; and, if
+		 * we have a squashable list, insert a placeholder comment starting
+		 * from the list's second value.
+		 */
+		n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
+							  num_constants_replaced + 1 + jstate->highest_extern_param_id,
+							  jstate->clocations[i].squashed ? " /*, ... */" : "");
+		num_constants_replaced++;
+
+		/* move forward */
+		quer_loc = off + tok_len;
+		last_off = off;
+		last_tok_len = tok_len;
+	}
+
+	/*
+	 * We've copied up until the last ignorable constant.  Copy over the
+	 * remaining bytes of the original query string.
+	 */
+	len_to_wrt = query_len - quer_loc;
+
+	Assert(len_to_wrt >= 0);
+	memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
+	n_quer_loc += len_to_wrt;
+
+	Assert(n_quer_loc <= norm_query_buflen);
+	norm_query[n_quer_loc] = '\0';
+
+	*query_len_p = n_quer_loc;
+	return norm_query;
+}
diff --git a/src/include/nodes/queryjumble.h b/src/include/nodes/queryjumble.h
index dcb36dcb44f..cb8741b1d05 100644
--- a/src/include/nodes/queryjumble.h
+++ b/src/include/nodes/queryjumble.h
@@ -91,6 +91,8 @@ extern PGDLLIMPORT int compute_query_id;
 
 
 extern const char *CleanQuerytext(const char *query, int *location, int *len);
+extern char *GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+									 int query_loc, int *query_len_p);
 extern JumbleState *JumbleQuery(Query *query);
 extern void EnableQueryId(void);
 
-- 
2.43.0



^ permalink  raw  reply  [nested|flat] 14+ messages in thread

* Re: Refactor query normalization into core query jumbling
@ 2026-03-17 02:12  Sami Imseih <[email protected]>
  parent: Sami Imseih <[email protected]>
  0 siblings, 1 reply; 14+ messages in thread

From: Sami Imseih @ 2026-03-17 02:12 UTC (permalink / raw)
  To: Lukas Fittl <[email protected]>; +Cc: Michael Paquier <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>

> Here is v4.
>
> 0001 - addresses the comments made by Bertrand.
> 0002 - makes JumbleState a constant when passed to post_parse_analyze
> and updates all downstream code that consume the JumbleState. This
> means we now need to copy the locations from JState into a local/temporary
> array when generating the normalized string.

PFA a mandatory rebase.

--
Sami Imseih
Amazon Web Services (AWS)


Attachments:

  [application/octet-stream] v5-0002-Make-JumbleState-const-in-post_parse_analyze-hook.patch (8.5K, 2-v5-0002-Make-JumbleState-const-in-post_parse_analyze-hook.patch)
  download | inline diff:
From 348377e0473d2b28cc50db0ba5e950f9e28bbe87 Mon Sep 17 00:00:00 2001
From: Ubuntu <[email protected]>
Date: Tue, 6 Jan 2026 19:29:01 +0000
Subject: [PATCH v5 2/2] Make JumbleState const in post_parse_analyze hook

Change the post_parse_analyze_hook_type signature to take a const
JumbleState parameter, preventing hooks from modifying the jumble
state during query analysis. This improves API safety by making it
clear that hooks should only read from the jumble state, not modify
it.

Update pg_stat_statements and related functions to match the new const
signature. Refactor SetConstantLengths() to return a newly allocated
LocationLen array instead of modifying the JumbleState, and update
GenerateNormalizedQuery() to work with the const JumbleState and
manage the returned array properly.

Discussion: https://postgr.es/m/CAA5RZ0tZp5qU0ikZEEqJnxvdSNGh1DWv80sb-k4QAUmiMoOp_Q@mail.gmail.com
---
 .../pg_stat_statements/pg_stat_statements.c   |  8 ++--
 src/backend/nodes/queryjumblefuncs.c          | 39 +++++++++++++------
 src/include/nodes/queryjumble.h               |  2 +-
 src/include/parser/analyze.h                  |  2 +-
 4 files changed, 33 insertions(+), 18 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index c9338442d96..386a1419232 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -333,7 +333,7 @@ static void pgss_shmem_request(void);
 static void pgss_shmem_startup(void);
 static void pgss_shmem_shutdown(int code, Datum arg);
 static void pgss_post_parse_analyze(ParseState *pstate, Query *query,
-									JumbleState *jstate);
+									const JumbleState *jstate);
 static PlannedStmt *pgss_planner(Query *parse,
 								 const char *query_string,
 								 int cursorOptions,
@@ -357,7 +357,7 @@ static void pgss_store(const char *query, int64 queryId,
 					   const BufferUsage *bufusage,
 					   const WalUsage *walusage,
 					   const struct JitInstrumentation *jitusage,
-					   JumbleState *jstate,
+					   const JumbleState *jstate,
 					   int parallel_workers_to_launch,
 					   int parallel_workers_launched,
 					   PlannedStmtOrigin planOrigin);
@@ -833,7 +833,7 @@ error:
  * Post-parse-analysis hook: mark query with a queryId
  */
 static void
-pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
+pgss_post_parse_analyze(ParseState *pstate, Query *query, const JumbleState *jstate)
 {
 	if (prev_post_parse_analyze_hook)
 		prev_post_parse_analyze_hook(pstate, query, jstate);
@@ -1290,7 +1290,7 @@ pgss_store(const char *query, int64 queryId,
 		   const BufferUsage *bufusage,
 		   const WalUsage *walusage,
 		   const struct JitInstrumentation *jitusage,
-		   JumbleState *jstate,
+		   const JumbleState *jstate,
 		   int parallel_workers_to_launch,
 		   int parallel_workers_launched,
 		   PlannedStmtOrigin planOrigin)
diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c
index d4b26202c47..fe8f0ccd278 100644
--- a/src/backend/nodes/queryjumblefuncs.c
+++ b/src/backend/nodes/queryjumblefuncs.c
@@ -789,8 +789,10 @@ CompLocation(const void *a, const void *b)
 }
 
 /*
- * Given a valid SQL string and an array of constant-location records,
- * fill in the textual lengths of those constants in JumbleState.
+ * Given a valid SQL string and an array of constant-location records, return
+ * the textual lengths of those constants in a newly allocated LocationLen
+ * array, or NULL if there are no constants. It is the caller's responsibility
+ * to pfree the result, if necessary.
  *
  * The constants may use any allowed constant syntax, such as float literals,
  * bit-strings, single-quoted strings and dollar-quoted strings.  This is
@@ -803,8 +805,8 @@ CompLocation(const void *a, const void *b)
  * a negative numeric constant.  This precludes there ever being another
  * reason for a constant to start with a '-'.
  */
-static void
-SetConstantLengths(JumbleState *jstate, const char *query,
+static LocationLen *
+SetConstantLengths(const JumbleState *jstate, const char *query,
 				   int query_loc)
 {
 	LocationLen *locs;
@@ -813,14 +815,20 @@ SetConstantLengths(JumbleState *jstate, const char *query,
 	core_YYSTYPE yylval;
 	YYLTYPE		yylloc;
 
+	if (jstate->clocations_count == 0)
+		return NULL;
+
+	/* Copy constant locations to avoid modifying jstate */
+	locs = palloc(jstate->clocations_count * sizeof(LocationLen));
+	memcpy(locs, jstate->clocations, jstate->clocations_count * sizeof(LocationLen));
+
 	/*
 	 * Sort the records by location so that we can process them in order while
 	 * scanning the query text.
 	 */
 	if (jstate->clocations_count > 1)
-		qsort(jstate->clocations, jstate->clocations_count,
+		qsort(locs, jstate->clocations_count,
 			  sizeof(LocationLen), CompLocation);
-	locs = jstate->clocations;
 
 	/* initialize the flex scanner --- should match raw_parser() */
 	yyscanner = scanner_init(query,
@@ -908,6 +916,8 @@ SetConstantLengths(JumbleState *jstate, const char *query,
 	}
 
 	scanner_finish(yyscanner);
+
+	return locs;
 }
 
 /*
@@ -926,7 +936,7 @@ SetConstantLengths(JumbleState *jstate, const char *query,
  * Returns a palloc'd string.
  */
 char *
-GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+GenerateNormalizedQuery(const JumbleState *jstate, const char *query,
 						int query_loc, int *query_len_p)
 {
 	char	   *norm_query;
@@ -938,12 +948,13 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
 				last_off = 0,	/* Offset from start for previous tok */
 				last_tok_len = 0;	/* Length (in bytes) of that tok */
 	int			num_constants_replaced = 0;
+	LocationLen *locs = NULL;
 
 	/*
 	 * Set constants' lengths in JumbleState, as only locations are set during
 	 * DoJumble(). Note this also ensures the items are sorted by location.
 	 */
-	SetConstantLengths(jstate, query, query_loc);
+	locs = SetConstantLengths(jstate, query, query_loc);
 
 	/*
 	 * Allow for $n symbols to be longer than the constants they replace.
@@ -969,15 +980,15 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
 		 * the parameter in the next iteration (or after the loop is done),
 		 * which is a bit odd but seems to work okay in most cases.
 		 */
-		if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
+		if (locs[i].extern_param && !jstate->has_squashed_lists)
 			continue;
 
-		off = jstate->clocations[i].location;
+		off = locs[i].location;
 
 		/* Adjust the constant's location (see SetConstantLengths) */
 		off -= query_loc;
 
-		tok_len = jstate->clocations[i].length;
+		tok_len = locs[i].length;
 
 		if (tok_len < 0)
 			continue;			/* ignore any duplicates */
@@ -996,7 +1007,7 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
 		 */
 		n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
 							  num_constants_replaced + 1 + jstate->highest_extern_param_id,
-							  jstate->clocations[i].squashed ? " /*, ... */" : "");
+							  locs[i].squashed ? " /*, ... */" : "");
 		num_constants_replaced++;
 
 		/* move forward */
@@ -1005,6 +1016,10 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
 		last_tok_len = tok_len;
 	}
 
+	/* Clean up temporary location array before any assertions */
+	if (locs)
+		pfree(locs);
+
 	/*
 	 * We've copied up until the last ignorable constant.  Copy over the
 	 * remaining bytes of the original query string.
diff --git a/src/include/nodes/queryjumble.h b/src/include/nodes/queryjumble.h
index e354f857c37..b709033ce27 100644
--- a/src/include/nodes/queryjumble.h
+++ b/src/include/nodes/queryjumble.h
@@ -91,7 +91,7 @@ extern PGDLLIMPORT int compute_query_id;
 
 
 extern const char *CleanQuerytext(const char *query, int *location, int *len);
-extern char *GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+extern char *GenerateNormalizedQuery(const JumbleState *jstate, const char *query,
 									 int query_loc, int *query_len_p);
 extern JumbleState *JumbleQuery(Query *query);
 extern void EnableQueryId(void);
diff --git a/src/include/parser/analyze.h b/src/include/parser/analyze.h
index e10270ff0ff..b2db6fa4e8c 100644
--- a/src/include/parser/analyze.h
+++ b/src/include/parser/analyze.h
@@ -21,7 +21,7 @@
 /* Hook for plugins to get control at end of parse analysis */
 typedef void (*post_parse_analyze_hook_type) (ParseState *pstate,
 											  Query *query,
-											  JumbleState *jstate);
+											  const JumbleState *jstate);
 extern PGDLLIMPORT post_parse_analyze_hook_type post_parse_analyze_hook;
 
 
-- 
2.47.3



  [application/octet-stream] v5-0001-pg_stat_statements-Move-query-normalization-to-co.patch (20.7K, 3-v5-0001-pg_stat_statements-Move-query-normalization-to-co.patch)
  download | inline diff:
From b7a1957e00b1d0f743295e7a19ec37ddf4da1f03 Mon Sep 17 00:00:00 2001
From: Ubuntu <[email protected]>
Date: Thu, 18 Dec 2025 18:59:31 +0000
Subject: [PATCH v5 1/2] pg_stat_statements: Move query normalization to core

pg_stat_statements was modifying core generated JumbleState instances
by calling fill_in_constant_lengths() via generate_normalized_query()
to populate constant lengths. This creates problems since extensions
should not modify shared core data structures that may be used by
other extensions, and this could be considered a layering violation.

Move query normalization functions to core as a global function to
fix this. Extensions now call GenerateNormalizedQuery() instead of
directly modifying JumbleState.

This change also addresses code duplication, as several extensions
have been copying generate_normalized_query() to implement their own
normalization. The new core API provides a single, consistent
implementation for all extensions to use.

Functions are renamed to match core naming conventions and comments
are updated for clarity.

Discussion: https://postgr.es/m/CAA5RZ0tZp5qU0ikZEEqJnxvdSNGh1DWv80sb-k4QAUmiMoOp_Q@mail.gmail.com
---
 .../pg_stat_statements/pg_stat_statements.c   | 273 +-----------------
 src/backend/nodes/queryjumblefuncs.c          | 248 ++++++++++++++++
 src/include/nodes/queryjumble.h               |   2 +
 3 files changed, 260 insertions(+), 263 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 6cb14824ec3..c9338442d96 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -50,7 +50,6 @@
 #include "access/htup_details.h"
 #include "access/parallel.h"
 #include "catalog/pg_authid.h"
-#include "common/int.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
 #include "jit/jit.h"
@@ -59,7 +58,6 @@
 #include "nodes/queryjumble.h"
 #include "optimizer/planner.h"
 #include "parser/analyze.h"
-#include "parser/scanner.h"
 #include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
@@ -378,11 +376,6 @@ static char *qtext_fetch(Size query_offset, int query_len,
 static bool need_gc_qtexts(void);
 static void gc_qtexts(void);
 static TimestampTz entry_reset(Oid userid, Oid dbid, int64 queryid, bool minmax_only);
-static char *generate_normalized_query(JumbleState *jstate, const char *query,
-									   int query_loc, int *query_len_p);
-static void fill_in_constant_lengths(JumbleState *jstate, const char *query,
-									 int query_loc);
-static int	comp_location(const void *a, const void *b);
 
 
 /*
@@ -1360,9 +1353,16 @@ pgss_store(const char *query, int64 queryId,
 		if (jstate)
 		{
 			LWLockRelease(pgss->lock);
-			norm_query = generate_normalized_query(jstate, query,
-												   query_location,
-												   &query_len);
+
+			/*
+			 * generate the normalized query. Note that the normalized
+			 * representation may well vary depending on just which
+			 * "equivalent" query is used to create the hashtable entry. We
+			 * assume this is OK.
+			 */
+			norm_query = GenerateNormalizedQuery(jstate, query,
+												 query_location,
+												 &query_len);
 			LWLockAcquire(pgss->lock, LW_SHARED);
 		}
 
@@ -2824,256 +2824,3 @@ release_lock:
 
 	return stats_reset;
 }
-
-/*
- * Generate a normalized version of the query string that will be used to
- * represent all similar queries.
- *
- * Note that the normalized representation may well vary depending on
- * just which "equivalent" query is used to create the hashtable entry.
- * We assume this is OK.
- *
- * If query_loc > 0, then "query" has been advanced by that much compared to
- * the original string start, so we need to translate the provided locations
- * to compensate.  (This lets us avoid re-scanning statements before the one
- * of interest, so it's worth doing.)
- *
- * *query_len_p contains the input string length, and is updated with
- * the result string length on exit.  The resulting string might be longer
- * or shorter depending on what happens with replacement of constants.
- *
- * Returns a palloc'd string.
- */
-static char *
-generate_normalized_query(JumbleState *jstate, const char *query,
-						  int query_loc, int *query_len_p)
-{
-	char	   *norm_query;
-	int			query_len = *query_len_p;
-	int			norm_query_buflen,	/* Space allowed for norm_query */
-				len_to_wrt,		/* Length (in bytes) to write */
-				quer_loc = 0,	/* Source query byte location */
-				n_quer_loc = 0, /* Normalized query byte location */
-				last_off = 0,	/* Offset from start for previous tok */
-				last_tok_len = 0;	/* Length (in bytes) of that tok */
-	int			num_constants_replaced = 0;
-
-	/*
-	 * Get constants' lengths (core system only gives us locations).  Note
-	 * this also ensures the items are sorted by location.
-	 */
-	fill_in_constant_lengths(jstate, query, query_loc);
-
-	/*
-	 * Allow for $n symbols to be longer than the constants they replace.
-	 * Constants must take at least one byte in text form, while a $n symbol
-	 * certainly isn't more than 11 bytes, even if n reaches INT_MAX.  We
-	 * could refine that limit based on the max value of n for the current
-	 * query, but it hardly seems worth any extra effort to do so.
-	 */
-	norm_query_buflen = query_len + jstate->clocations_count * 10;
-
-	/* Allocate result buffer */
-	norm_query = palloc(norm_query_buflen + 1);
-
-	for (int i = 0; i < jstate->clocations_count; i++)
-	{
-		int			off,		/* Offset from start for cur tok */
-					tok_len;	/* Length (in bytes) of that tok */
-
-		/*
-		 * If we have an external param at this location, but no lists are
-		 * being squashed across the query, then we skip here; this will make
-		 * us print the characters found in the original query that represent
-		 * the parameter in the next iteration (or after the loop is done),
-		 * which is a bit odd but seems to work okay in most cases.
-		 */
-		if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
-			continue;
-
-		off = jstate->clocations[i].location;
-
-		/* Adjust recorded location if we're dealing with partial string */
-		off -= query_loc;
-
-		tok_len = jstate->clocations[i].length;
-
-		if (tok_len < 0)
-			continue;			/* ignore any duplicates */
-
-		/* Copy next chunk (what precedes the next constant) */
-		len_to_wrt = off - last_off;
-		len_to_wrt -= last_tok_len;
-		Assert(len_to_wrt >= 0);
-		memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
-		n_quer_loc += len_to_wrt;
-
-		/*
-		 * And insert a param symbol in place of the constant token; and, if
-		 * we have a squashable list, insert a placeholder comment starting
-		 * from the list's second value.
-		 */
-		n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
-							  num_constants_replaced + 1 + jstate->highest_extern_param_id,
-							  jstate->clocations[i].squashed ? " /*, ... */" : "");
-		num_constants_replaced++;
-
-		/* move forward */
-		quer_loc = off + tok_len;
-		last_off = off;
-		last_tok_len = tok_len;
-	}
-
-	/*
-	 * We've copied up until the last ignorable constant.  Copy over the
-	 * remaining bytes of the original query string.
-	 */
-	len_to_wrt = query_len - quer_loc;
-
-	Assert(len_to_wrt >= 0);
-	memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
-	n_quer_loc += len_to_wrt;
-
-	Assert(n_quer_loc <= norm_query_buflen);
-	norm_query[n_quer_loc] = '\0';
-
-	*query_len_p = n_quer_loc;
-	return norm_query;
-}
-
-/*
- * Given a valid SQL string and an array of constant-location records,
- * fill in the textual lengths of those constants.
- *
- * The constants may use any allowed constant syntax, such as float literals,
- * bit-strings, single-quoted strings and dollar-quoted strings.  This is
- * accomplished by using the public API for the core scanner.
- *
- * It is the caller's job to ensure that the string is a valid SQL statement
- * with constants at the indicated locations.  Since in practice the string
- * has already been parsed, and the locations that the caller provides will
- * have originated from within the authoritative parser, this should not be
- * a problem.
- *
- * Multiple constants can have the same location.  We reset lengths of those
- * past the first to -1 so that they can later be ignored.
- *
- * If query_loc > 0, then "query" has been advanced by that much compared to
- * the original string start, so we need to translate the provided locations
- * to compensate.  (This lets us avoid re-scanning statements before the one
- * of interest, so it's worth doing.)
- *
- * N.B. There is an assumption that a '-' character at a Const location begins
- * a negative numeric constant.  This precludes there ever being another
- * reason for a constant to start with a '-'.
- */
-static void
-fill_in_constant_lengths(JumbleState *jstate, const char *query,
-						 int query_loc)
-{
-	LocationLen *locs;
-	core_yyscan_t yyscanner;
-	core_yy_extra_type yyextra;
-	core_YYSTYPE yylval;
-	YYLTYPE		yylloc;
-
-	/*
-	 * Sort the records by location so that we can process them in order while
-	 * scanning the query text.
-	 */
-	if (jstate->clocations_count > 1)
-		qsort(jstate->clocations, jstate->clocations_count,
-			  sizeof(LocationLen), comp_location);
-	locs = jstate->clocations;
-
-	/* initialize the flex scanner --- should match raw_parser() */
-	yyscanner = scanner_init(query,
-							 &yyextra,
-							 &ScanKeywords,
-							 ScanKeywordTokens);
-
-	/* Search for each constant, in sequence */
-	for (int i = 0; i < jstate->clocations_count; i++)
-	{
-		int			loc;
-		int			tok;
-
-		/* Ignore constants after the first one in the same location */
-		if (i > 0 && locs[i].location == locs[i - 1].location)
-		{
-			locs[i].length = -1;
-			continue;
-		}
-
-		if (locs[i].squashed)
-			continue;			/* squashable list, ignore */
-
-		/* Adjust recorded location if we're dealing with partial string */
-		loc = locs[i].location - query_loc;
-		Assert(loc >= 0);
-
-		/*
-		 * We have a valid location for a constant that's not a dupe. Lex
-		 * tokens until we find the desired constant.
-		 */
-		for (;;)
-		{
-			tok = core_yylex(&yylval, &yylloc, yyscanner);
-
-			/* We should not hit end-of-string, but if we do, behave sanely */
-			if (tok == 0)
-				break;			/* out of inner for-loop */
-
-			/*
-			 * We should find the token position exactly, but if we somehow
-			 * run past it, work with that.
-			 */
-			if (yylloc >= loc)
-			{
-				if (query[loc] == '-')
-				{
-					/*
-					 * It's a negative value - this is the one and only case
-					 * where we replace more than a single token.
-					 *
-					 * Do not compensate for the core system's special-case
-					 * adjustment of location to that of the leading '-'
-					 * operator in the event of a negative constant.  It is
-					 * also useful for our purposes to start from the minus
-					 * symbol.  In this way, queries like "select * from foo
-					 * where bar = 1" and "select * from foo where bar = -2"
-					 * will have identical normalized query strings.
-					 */
-					tok = core_yylex(&yylval, &yylloc, yyscanner);
-					if (tok == 0)
-						break;	/* out of inner for-loop */
-				}
-
-				/*
-				 * We now rely on the assumption that flex has placed a zero
-				 * byte after the text of the current token in scanbuf.
-				 */
-				locs[i].length = strlen(yyextra.scanbuf + loc);
-				break;			/* out of inner for-loop */
-			}
-		}
-
-		/* If we hit end-of-string, give up, leaving remaining lengths -1 */
-		if (tok == 0)
-			break;
-	}
-
-	scanner_finish(yyscanner);
-}
-
-/*
- * comp_location: comparator for qsorting LocationLen structs by location
- */
-static int
-comp_location(const void *a, const void *b)
-{
-	int			l = ((const LocationLen *) a)->location;
-	int			r = ((const LocationLen *) b)->location;
-
-	return pg_cmp_s32(l, r);
-}
diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c
index 87db8dc1a32..d4b26202c47 100644
--- a/src/backend/nodes/queryjumblefuncs.c
+++ b/src/backend/nodes/queryjumblefuncs.c
@@ -40,10 +40,12 @@
 #include "access/transam.h"
 #include "catalog/pg_proc.h"
 #include "common/hashfn.h"
+#include "common/int.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
 #include "nodes/queryjumble.h"
 #include "utils/lsyscache.h"
+#include "parser/scanner.h"
 #include "parser/scansup.h"
 
 #define JUMBLE_SIZE				1024	/* query serialization buffer size */
@@ -773,3 +775,249 @@ _jumbleRangeTblEntry_eref(JumbleState *jstate,
 	 */
 	JUMBLE_STRING(aliasname);
 }
+
+/*
+ * CompLocation: comparator for qsorting LocationLen structs by location
+ */
+static int
+CompLocation(const void *a, const void *b)
+{
+	int			l = ((const LocationLen *) a)->location;
+	int			r = ((const LocationLen *) b)->location;
+
+	return pg_cmp_s32(l, r);
+}
+
+/*
+ * Given a valid SQL string and an array of constant-location records,
+ * fill in the textual lengths of those constants in JumbleState.
+ *
+ * The constants may use any allowed constant syntax, such as float literals,
+ * bit-strings, single-quoted strings and dollar-quoted strings.  This is
+ * accomplished by using the public API for the core scanner.
+ *
+ * Multiple constants can have the same location.  We reset lengths of those
+ * past the first to -1 so that they can later be ignored.
+ *
+ * N.B. There is an assumption that a '-' character at a Const location begins
+ * a negative numeric constant.  This precludes there ever being another
+ * reason for a constant to start with a '-'.
+ */
+static void
+SetConstantLengths(JumbleState *jstate, const char *query,
+				   int query_loc)
+{
+	LocationLen *locs;
+	core_yyscan_t yyscanner;
+	core_yy_extra_type yyextra;
+	core_YYSTYPE yylval;
+	YYLTYPE		yylloc;
+
+	/*
+	 * Sort the records by location so that we can process them in order while
+	 * scanning the query text.
+	 */
+	if (jstate->clocations_count > 1)
+		qsort(jstate->clocations, jstate->clocations_count,
+			  sizeof(LocationLen), CompLocation);
+	locs = jstate->clocations;
+
+	/* initialize the flex scanner --- should match raw_parser() */
+	yyscanner = scanner_init(query,
+							 &yyextra,
+							 &ScanKeywords,
+							 ScanKeywordTokens);
+
+	/* we don't want to re-emit any escape string warnings */
+	yyextra.escape_string_warning = false;
+
+	/* Search for each constant, in sequence */
+	for (int i = 0; i < jstate->clocations_count; i++)
+	{
+		int			loc;
+		int			tok;
+
+		/* Ignore constants after the first one in the same location */
+		if (i > 0 && locs[i].location == locs[i - 1].location)
+		{
+			locs[i].length = -1;
+			continue;
+		}
+
+		if (locs[i].squashed)
+			continue;			/* squashable list, ignore */
+
+		/*
+		 * Adjust the constant's location using the provided starting location
+		 * of the current statement. This allows us to avoid scanning a
+		 * multi-statement string from the beginning.
+		 */
+		loc = locs[i].location - query_loc;
+		Assert(loc >= 0);
+
+		/*
+		 * We have a valid location for a constant that's not a dupe. Lex
+		 * tokens until we find the desired constant.
+		 */
+		for (;;)
+		{
+			tok = core_yylex(&yylval, &yylloc, yyscanner);
+
+			/* We should not hit end-of-string, but if we do, behave sanely */
+			if (tok == 0)
+				break;			/* out of inner for-loop */
+
+			/*
+			 * We should find the token position exactly, but if we somehow
+			 * run past it, work with that.
+			 */
+			if (yylloc >= loc)
+			{
+				if (query[loc] == '-')
+				{
+					/*
+					 * It's a negative value - this is the one and only case
+					 * where we replace more than a single token.
+					 *
+					 * Do not compensate for the special-case adjustment of
+					 * location to that of the leading '-' operator in the
+					 * event of a negative constant (see doNegate() in
+					 * gram.y).  It is also useful for our purposes to start
+					 * from the minus symbol.  In this way, queries like
+					 * "select * from foo where bar = 1" and "select * from
+					 * foo where bar = -2" will have identical normalized
+					 * query strings.
+					 */
+					tok = core_yylex(&yylval, &yylloc, yyscanner);
+					if (tok == 0)
+						break;	/* out of inner for-loop */
+				}
+
+				/*
+				 * We now rely on the assumption that flex has placed a zero
+				 * byte after the text of the current token in scanbuf.
+				 */
+				locs[i].length = strlen(yyextra.scanbuf + loc);
+				break;			/* out of inner for-loop */
+			}
+		}
+
+		/* If we hit end-of-string, give up, leaving remaining lengths -1 */
+		if (tok == 0)
+			break;
+	}
+
+	scanner_finish(yyscanner);
+}
+
+/*
+ * Callback to generate a normalized version of the query string that will be used to
+ * represent all similar queries.
+ *
+ * It is the caller's job to ensure that the string is a valid SQL statement
+ * with the correct constant locations in jstate->clocations. Since in
+ * practice the string has already been parsed using the authoritative parser
+ * and the locations are set by core query jumbling, this should not be a problem.
+ *
+ * *query_len_p contains the input string length, and is updated with
+ * the result string length on exit.  The resulting string might be longer
+ * or shorter depending on what happens with replacement of constants.
+ *
+ * Returns a palloc'd string.
+ */
+char *
+GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+						int query_loc, int *query_len_p)
+{
+	char	   *norm_query;
+	int			query_len = *query_len_p;
+	int			norm_query_buflen,	/* Space allowed for norm_query */
+				len_to_wrt,		/* Length (in bytes) to write */
+				quer_loc = 0,	/* Source query byte location */
+				n_quer_loc = 0, /* Normalized query byte location */
+				last_off = 0,	/* Offset from start for previous tok */
+				last_tok_len = 0;	/* Length (in bytes) of that tok */
+	int			num_constants_replaced = 0;
+
+	/*
+	 * Set constants' lengths in JumbleState, as only locations are set during
+	 * DoJumble(). Note this also ensures the items are sorted by location.
+	 */
+	SetConstantLengths(jstate, query, query_loc);
+
+	/*
+	 * Allow for $n symbols to be longer than the constants they replace.
+	 * Constants must take at least one byte in text form, while a $n symbol
+	 * certainly isn't more than 11 bytes, even if n reaches INT_MAX.  We
+	 * could refine that limit based on the max value of n for the current
+	 * query, but it hardly seems worth any extra effort to do so.
+	 */
+	norm_query_buflen = query_len + jstate->clocations_count * 10;
+
+	/* Allocate result buffer */
+	norm_query = palloc(norm_query_buflen + 1);
+
+	for (int i = 0; i < jstate->clocations_count; i++)
+	{
+		int			off,		/* Offset from start for cur tok */
+					tok_len;	/* Length (in bytes) of that tok */
+
+		/*
+		 * If we have an external param at this location, but no lists are
+		 * being squashed across the query, then we skip here; this will make
+		 * us print the characters found in the original query that represent
+		 * the parameter in the next iteration (or after the loop is done),
+		 * which is a bit odd but seems to work okay in most cases.
+		 */
+		if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
+			continue;
+
+		off = jstate->clocations[i].location;
+
+		/* Adjust the constant's location (see SetConstantLengths) */
+		off -= query_loc;
+
+		tok_len = jstate->clocations[i].length;
+
+		if (tok_len < 0)
+			continue;			/* ignore any duplicates */
+
+		/* Copy next chunk (what precedes the next constant) */
+		len_to_wrt = off - last_off;
+		len_to_wrt -= last_tok_len;
+		Assert(len_to_wrt >= 0);
+		memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
+		n_quer_loc += len_to_wrt;
+
+		/*
+		 * And insert a param symbol in place of the constant token; and, if
+		 * we have a squashable list, insert a placeholder comment starting
+		 * from the list's second value.
+		 */
+		n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
+							  num_constants_replaced + 1 + jstate->highest_extern_param_id,
+							  jstate->clocations[i].squashed ? " /*, ... */" : "");
+		num_constants_replaced++;
+
+		/* move forward */
+		quer_loc = off + tok_len;
+		last_off = off;
+		last_tok_len = tok_len;
+	}
+
+	/*
+	 * We've copied up until the last ignorable constant.  Copy over the
+	 * remaining bytes of the original query string.
+	 */
+	len_to_wrt = query_len - quer_loc;
+
+	Assert(len_to_wrt >= 0);
+	memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
+	n_quer_loc += len_to_wrt;
+
+	Assert(n_quer_loc <= norm_query_buflen);
+	norm_query[n_quer_loc] = '\0';
+
+	*query_len_p = n_quer_loc;
+	return norm_query;
+}
diff --git a/src/include/nodes/queryjumble.h b/src/include/nodes/queryjumble.h
index 9f81893003c..e354f857c37 100644
--- a/src/include/nodes/queryjumble.h
+++ b/src/include/nodes/queryjumble.h
@@ -91,6 +91,8 @@ extern PGDLLIMPORT int compute_query_id;
 
 
 extern const char *CleanQuerytext(const char *query, int *location, int *len);
+extern char *GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+									 int query_loc, int *query_len_p);
 extern JumbleState *JumbleQuery(Query *query);
 extern void EnableQueryId(void);
 
-- 
2.47.3



^ permalink  raw  reply  [nested|flat] 14+ messages in thread

* Re: Refactor query normalization into core query jumbling
@ 2026-03-26 01:34  Lukas Fittl <[email protected]>
  parent: Sami Imseih <[email protected]>
  0 siblings, 1 reply; 14+ messages in thread

From: Lukas Fittl @ 2026-03-26 01:34 UTC (permalink / raw)
  To: Sami Imseih <[email protected]>; +Cc: Michael Paquier <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>

Hi Sami,

On Mon, Mar 16, 2026 at 7:12 PM Sami Imseih <[email protected]> wrote:
>
> > Here is v4.
> >
> > 0001 - addresses the comments made by Bertrand.
> > 0002 - makes JumbleState a constant when passed to post_parse_analyze
> > and updates all downstream code that consume the JumbleState. This
> > means we now need to copy the locations from JState into a local/temporary
> > array when generating the normalized string.

In 0001:

> diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c
> > index 87db8dc1a32..d4b26202c47 100644
> --- a/src/backend/nodes/queryjumblefuncs.c
> +++ b/src/backend/nodes/queryjumblefuncs.c
> ...
> @@ -773,3 +775,249 @@ _jumbleRangeTblEntry_eref(JumbleState *jstate,
> +
> +    /* we don't want to re-emit any escape string warnings */
> +    yyextra.escape_string_warning = false;
> +

I don't think this is needed anymore, as of
45762084545ec14dbbe66ace1d69d7e89f8978ac.

> +/*
> + * Callback to generate a normalized version of the query string that will be used to
> + * represent all similar queries.
> + *

I don't think the term "Callback" makes sense here - I think you could
just keep the original wording.

In 0002:

> diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c
> index d4b26202c47..fe8f0ccd278 100644
> --- a/src/backend/nodes/queryjumblefuncs.c
> +++ b/src/backend/nodes/queryjumblefuncs.c
> ...
> @@ -813,14 +815,20 @@ SetConstantLengths(JumbleState *jstate, const char *query,
>     core_YYSTYPE yylval;
>     YYLTYPE        yylloc;
>
> +    if (jstate->clocations_count == 0)
> +        return NULL;
> +
> +    /* Copy constant locations to avoid modifying jstate */
> +    locs = palloc(jstate->clocations_count * sizeof(LocationLen));
> +    memcpy(locs, jstate->clocations, jstate->clocations_count * sizeof(LocationLen));
> +

You could use palloc_array for locs here.

> @@ -938,12 +948,13 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
>                 last_off = 0,    /* Offset from start for previous tok */
>                 last_tok_len = 0;    /* Length (in bytes) of that tok */
>     int            num_constants_replaced = 0;
> +    LocationLen *locs = NULL;
>
>     /*
>      * Set constants' lengths in JumbleState, as only locations are set during
>      * DoJumble(). Note this also ensures the items are sorted by location.
>      */
> -    SetConstantLengths(jstate, query, query_loc);
> +    locs = SetConstantLengths(jstate, query, query_loc);

I think we should update the comment here to reflect the fact that
we're no longer modifying JumbleState.

Otherwise these patches look good - it'd be nice to still get this
into 19 so we have less code duplication across the different
extensions working with normalized query text.

Thanks,
Lukas

-- 
Lukas Fittl





^ permalink  raw  reply  [nested|flat] 14+ messages in thread

* Re: Refactor query normalization into core query jumbling
@ 2026-03-26 03:31  Sami Imseih <[email protected]>
  parent: Lukas Fittl <[email protected]>
  0 siblings, 1 reply; 14+ messages in thread

From: Sami Imseih @ 2026-03-26 03:31 UTC (permalink / raw)
  To: Lukas Fittl <[email protected]>; +Cc: Michael Paquier <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>

Thanks for looking!

> I don't think this is needed anymore, as of
> 45762084545ec14dbbe66ace1d69d7e89f8978ac.

Correct. That showed up after my last rebase.

> > +/*
> > + * Callback to generate a normalized version of the query string that will be used to
> > + * represent all similar queries.
> > + *
>
> I don't think the term "Callback" makes sense here - I think you could
> just keep the original wording.

This was a remnant of my earlier experimentation. I removed.

A few notes on the comments:

- * Note that the normalized representation may well vary depending on
- * just which "equivalent" query is used to create the hashtable entry.
- * We assume this is OK.

This was in the original generate_normalize_query and since it mentions
hashtable entry, I moved the comment (in-spirit) to where pg_stat_statements
calls GenerateNormailzeQuery

* If query_loc > 0, then "query" has been advanced by that much compared to
* the original string start, as is the case with multi-statement strings, so
* we need to translate the provided locations to compensate. (This lets us
* avoid re-scanning statements before the one of interest, so it's worth doing.)
*
This comment was originally duplicated in both SetConstantLengths, so I
just kept it as-is in SetConstantLengths and added a shorter reference in
GenerateNormalizeQuery

Also, this comment "It is the caller's job to ensure that the string
is a valid SQL statement..."
made more sense in GenerateNormalizeQuery rather than SetConstantLengths, since
GenerateNormalizeQuery is the public function.


> In 0002:
> You could use palloc_array for locs here.

done.

> I think we should update the comment here to reflect the fact that
> we're no longer modifying JumbleState.

done.

> Otherwise these patches look good - it'd be nice to still get this
> into 19 so we have less code duplication across the different
> extensions working with normalized query text.

I agree!

v6 addresses the comments.

--
Sami Imseih
Amazon Web Services (AWS)


Attachments:

  [application/octet-stream] v6-0002-Make-JumbleState-const-in-post_parse_analyze-hook.patch (8.6K, 2-v6-0002-Make-JumbleState-const-in-post_parse_analyze-hook.patch)
  download | inline diff:
From 528f16102ecb47fb447e472b43b906e3f37b09fc Mon Sep 17 00:00:00 2001
From: Ubuntu <[email protected]>
Date: Tue, 6 Jan 2026 19:29:01 +0000
Subject: [PATCH v6 2/2] Make JumbleState const in post_parse_analyze hook

Change the post_parse_analyze_hook_type signature to take a const
JumbleState parameter, preventing hooks from modifying the jumble
state during query analysis. This improves API safety by making it
clear that hooks should only read from the jumble state, not modify
it.

Update pg_stat_statements and related functions to match the new const
signature. Refactor SetConstantLengths() to return a newly allocated
LocationLen array instead of modifying the JumbleState, and update
GenerateNormalizedQuery() to work with the const JumbleState and
manage the returned array properly.

Discussion: https://postgr.es/m/CAA5RZ0tZp5qU0ikZEEqJnxvdSNGh1DWv80sb-k4QAUmiMoOp_Q@mail.gmail.com
---
 .../pg_stat_statements/pg_stat_statements.c   |  8 ++--
 src/backend/nodes/queryjumblefuncs.c          | 44 +++++++++++++------
 src/include/nodes/queryjumble.h               |  2 +-
 src/include/parser/analyze.h                  |  2 +-
 4 files changed, 36 insertions(+), 20 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index c9338442d96..386a1419232 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -333,7 +333,7 @@ static void pgss_shmem_request(void);
 static void pgss_shmem_startup(void);
 static void pgss_shmem_shutdown(int code, Datum arg);
 static void pgss_post_parse_analyze(ParseState *pstate, Query *query,
-									JumbleState *jstate);
+									const JumbleState *jstate);
 static PlannedStmt *pgss_planner(Query *parse,
 								 const char *query_string,
 								 int cursorOptions,
@@ -357,7 +357,7 @@ static void pgss_store(const char *query, int64 queryId,
 					   const BufferUsage *bufusage,
 					   const WalUsage *walusage,
 					   const struct JitInstrumentation *jitusage,
-					   JumbleState *jstate,
+					   const JumbleState *jstate,
 					   int parallel_workers_to_launch,
 					   int parallel_workers_launched,
 					   PlannedStmtOrigin planOrigin);
@@ -833,7 +833,7 @@ error:
  * Post-parse-analysis hook: mark query with a queryId
  */
 static void
-pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
+pgss_post_parse_analyze(ParseState *pstate, Query *query, const JumbleState *jstate)
 {
 	if (prev_post_parse_analyze_hook)
 		prev_post_parse_analyze_hook(pstate, query, jstate);
@@ -1290,7 +1290,7 @@ pgss_store(const char *query, int64 queryId,
 		   const BufferUsage *bufusage,
 		   const WalUsage *walusage,
 		   const struct JitInstrumentation *jitusage,
-		   JumbleState *jstate,
+		   const JumbleState *jstate,
 		   int parallel_workers_to_launch,
 		   int parallel_workers_launched,
 		   PlannedStmtOrigin planOrigin)
diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c
index c997a1ef609..0a40bc89e0a 100644
--- a/src/backend/nodes/queryjumblefuncs.c
+++ b/src/backend/nodes/queryjumblefuncs.c
@@ -789,8 +789,10 @@ CompLocation(const void *a, const void *b)
 }
 
 /*
- * Given a valid SQL string and an array of constant-location records,
- * fill in the textual lengths of those constants in JumbleState.
+ * Given a valid SQL string and an array of constant-location records, return
+ * the textual lengths of those constants in a newly allocated LocationLen
+ * array, or NULL if there are no constants. It is the caller's responsibility
+ * to pfree the result, if necessary.
  *
  * The constants may use any allowed constant syntax, such as float literals,
  * bit-strings, single-quoted strings and dollar-quoted strings.  This is
@@ -808,8 +810,8 @@ CompLocation(const void *a, const void *b)
  * a negative numeric constant.  This precludes there ever being another
  * reason for a constant to start with a '-'.
  */
-static void
-SetConstantLengths(JumbleState *jstate, const char *query,
+static LocationLen *
+SetConstantLengths(const JumbleState *jstate, const char *query,
 				   int query_loc)
 {
 	LocationLen *locs;
@@ -818,14 +820,20 @@ SetConstantLengths(JumbleState *jstate, const char *query,
 	core_YYSTYPE yylval;
 	YYLTYPE		yylloc;
 
+	if (jstate->clocations_count == 0)
+		return NULL;
+
+	/* Copy constant locations to avoid modifying jstate */
+	locs = palloc_array(LocationLen, jstate->clocations_count);
+	memcpy(locs, jstate->clocations, jstate->clocations_count * sizeof(LocationLen));
+
 	/*
 	 * Sort the records by location so that we can process them in order while
 	 * scanning the query text.
 	 */
 	if (jstate->clocations_count > 1)
-		qsort(jstate->clocations, jstate->clocations_count,
+		qsort(locs, jstate->clocations_count,
 			  sizeof(LocationLen), CompLocation);
-	locs = jstate->clocations;
 
 	/* initialize the flex scanner --- should match raw_parser() */
 	yyscanner = scanner_init(query,
@@ -910,6 +918,8 @@ SetConstantLengths(JumbleState *jstate, const char *query,
 	}
 
 	scanner_finish(yyscanner);
+
+	return locs;
 }
 
 /*
@@ -931,7 +941,7 @@ SetConstantLengths(JumbleState *jstate, const char *query,
  * Returns a palloc'd string.
  */
 char *
-GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+GenerateNormalizedQuery(const JumbleState *jstate, const char *query,
 						int query_loc, int *query_len_p)
 {
 	char	   *norm_query;
@@ -943,12 +953,14 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
 				last_off = 0,	/* Offset from start for previous tok */
 				last_tok_len = 0;	/* Length (in bytes) of that tok */
 	int			num_constants_replaced = 0;
+	LocationLen *locs = NULL;
 
 	/*
-	 * Set constants' lengths in JumbleState, as only locations are set during
-	 * DoJumble(). Note this also ensures the items are sorted by location.
+	 * Determine constants' lengths, which are not set during DoJumble(), and
+	 * return a sorted copy of jstate's LocationLen data with lengths filled
+	 * in.
 	 */
-	SetConstantLengths(jstate, query, query_loc);
+	locs = SetConstantLengths(jstate, query, query_loc);
 
 	/*
 	 * Allow for $n symbols to be longer than the constants they replace.
@@ -974,15 +986,15 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
 		 * the parameter in the next iteration (or after the loop is done),
 		 * which is a bit odd but seems to work okay in most cases.
 		 */
-		if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
+		if (locs[i].extern_param && !jstate->has_squashed_lists)
 			continue;
 
-		off = jstate->clocations[i].location;
+		off = locs[i].location;
 
 		/* Adjust the constant's location (see SetConstantLengths) */
 		off -= query_loc;
 
-		tok_len = jstate->clocations[i].length;
+		tok_len = locs[i].length;
 
 		if (tok_len < 0)
 			continue;			/* ignore any duplicates */
@@ -1001,7 +1013,7 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
 		 */
 		n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
 							  num_constants_replaced + 1 + jstate->highest_extern_param_id,
-							  jstate->clocations[i].squashed ? " /*, ... */" : "");
+							  locs[i].squashed ? " /*, ... */" : "");
 		num_constants_replaced++;
 
 		/* move forward */
@@ -1010,6 +1022,10 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
 		last_tok_len = tok_len;
 	}
 
+	/* Clean up temporary location array before any assertions */
+	if (locs)
+		pfree(locs);
+
 	/*
 	 * We've copied up until the last ignorable constant.  Copy over the
 	 * remaining bytes of the original query string.
diff --git a/src/include/nodes/queryjumble.h b/src/include/nodes/queryjumble.h
index e354f857c37..b709033ce27 100644
--- a/src/include/nodes/queryjumble.h
+++ b/src/include/nodes/queryjumble.h
@@ -91,7 +91,7 @@ extern PGDLLIMPORT int compute_query_id;
 
 
 extern const char *CleanQuerytext(const char *query, int *location, int *len);
-extern char *GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+extern char *GenerateNormalizedQuery(const JumbleState *jstate, const char *query,
 									 int query_loc, int *query_len_p);
 extern JumbleState *JumbleQuery(Query *query);
 extern void EnableQueryId(void);
diff --git a/src/include/parser/analyze.h b/src/include/parser/analyze.h
index e10270ff0ff..b2db6fa4e8c 100644
--- a/src/include/parser/analyze.h
+++ b/src/include/parser/analyze.h
@@ -21,7 +21,7 @@
 /* Hook for plugins to get control at end of parse analysis */
 typedef void (*post_parse_analyze_hook_type) (ParseState *pstate,
 											  Query *query,
-											  JumbleState *jstate);
+											  const JumbleState *jstate);
 extern PGDLLIMPORT post_parse_analyze_hook_type post_parse_analyze_hook;
 
 
-- 
2.47.3



  [application/octet-stream] v6-0001-pg_stat_statements-Move-query-normalization-to-co.patch (21.1K, 3-v6-0001-pg_stat_statements-Move-query-normalization-to-co.patch)
  download | inline diff:
From 79e94343d46ad1e8f16f30449492ebe01e4419a0 Mon Sep 17 00:00:00 2001
From: Ubuntu <[email protected]>
Date: Thu, 18 Dec 2025 18:59:31 +0000
Subject: [PATCH v6 1/2] pg_stat_statements: Move query normalization to core

pg_stat_statements was modifying core generated JumbleState instances
by calling fill_in_constant_lengths() via generate_normalized_query()
to populate constant lengths. This creates problems since extensions
should not modify shared core data structures that may be used by
other extensions, and this could be considered a layering violation.

Move query normalization functions to core as a global function to
fix this. Extensions now call GenerateNormalizedQuery() instead of
directly modifying JumbleState.

This change also addresses code duplication, as several extensions
have been copying generate_normalized_query() to implement their own
normalization. The new core API provides a single, consistent
implementation for all extensions to use.

Functions are renamed to match core naming conventions and comments
are updated for clarity.

Discussion: https://postgr.es/m/CAA5RZ0tZp5qU0ikZEEqJnxvdSNGh1DWv80sb-k4QAUmiMoOp_Q@mail.gmail.com
---
 .../pg_stat_statements/pg_stat_statements.c   | 273 +-----------------
 src/backend/nodes/queryjumblefuncs.c          | 253 ++++++++++++++++
 src/include/nodes/queryjumble.h               |   2 +
 3 files changed, 265 insertions(+), 263 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 6cb14824ec3..c9338442d96 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -50,7 +50,6 @@
 #include "access/htup_details.h"
 #include "access/parallel.h"
 #include "catalog/pg_authid.h"
-#include "common/int.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
 #include "jit/jit.h"
@@ -59,7 +58,6 @@
 #include "nodes/queryjumble.h"
 #include "optimizer/planner.h"
 #include "parser/analyze.h"
-#include "parser/scanner.h"
 #include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
@@ -378,11 +376,6 @@ static char *qtext_fetch(Size query_offset, int query_len,
 static bool need_gc_qtexts(void);
 static void gc_qtexts(void);
 static TimestampTz entry_reset(Oid userid, Oid dbid, int64 queryid, bool minmax_only);
-static char *generate_normalized_query(JumbleState *jstate, const char *query,
-									   int query_loc, int *query_len_p);
-static void fill_in_constant_lengths(JumbleState *jstate, const char *query,
-									 int query_loc);
-static int	comp_location(const void *a, const void *b);
 
 
 /*
@@ -1360,9 +1353,16 @@ pgss_store(const char *query, int64 queryId,
 		if (jstate)
 		{
 			LWLockRelease(pgss->lock);
-			norm_query = generate_normalized_query(jstate, query,
-												   query_location,
-												   &query_len);
+
+			/*
+			 * generate the normalized query. Note that the normalized
+			 * representation may well vary depending on just which
+			 * "equivalent" query is used to create the hashtable entry. We
+			 * assume this is OK.
+			 */
+			norm_query = GenerateNormalizedQuery(jstate, query,
+												 query_location,
+												 &query_len);
 			LWLockAcquire(pgss->lock, LW_SHARED);
 		}
 
@@ -2824,256 +2824,3 @@ release_lock:
 
 	return stats_reset;
 }
-
-/*
- * Generate a normalized version of the query string that will be used to
- * represent all similar queries.
- *
- * Note that the normalized representation may well vary depending on
- * just which "equivalent" query is used to create the hashtable entry.
- * We assume this is OK.
- *
- * If query_loc > 0, then "query" has been advanced by that much compared to
- * the original string start, so we need to translate the provided locations
- * to compensate.  (This lets us avoid re-scanning statements before the one
- * of interest, so it's worth doing.)
- *
- * *query_len_p contains the input string length, and is updated with
- * the result string length on exit.  The resulting string might be longer
- * or shorter depending on what happens with replacement of constants.
- *
- * Returns a palloc'd string.
- */
-static char *
-generate_normalized_query(JumbleState *jstate, const char *query,
-						  int query_loc, int *query_len_p)
-{
-	char	   *norm_query;
-	int			query_len = *query_len_p;
-	int			norm_query_buflen,	/* Space allowed for norm_query */
-				len_to_wrt,		/* Length (in bytes) to write */
-				quer_loc = 0,	/* Source query byte location */
-				n_quer_loc = 0, /* Normalized query byte location */
-				last_off = 0,	/* Offset from start for previous tok */
-				last_tok_len = 0;	/* Length (in bytes) of that tok */
-	int			num_constants_replaced = 0;
-
-	/*
-	 * Get constants' lengths (core system only gives us locations).  Note
-	 * this also ensures the items are sorted by location.
-	 */
-	fill_in_constant_lengths(jstate, query, query_loc);
-
-	/*
-	 * Allow for $n symbols to be longer than the constants they replace.
-	 * Constants must take at least one byte in text form, while a $n symbol
-	 * certainly isn't more than 11 bytes, even if n reaches INT_MAX.  We
-	 * could refine that limit based on the max value of n for the current
-	 * query, but it hardly seems worth any extra effort to do so.
-	 */
-	norm_query_buflen = query_len + jstate->clocations_count * 10;
-
-	/* Allocate result buffer */
-	norm_query = palloc(norm_query_buflen + 1);
-
-	for (int i = 0; i < jstate->clocations_count; i++)
-	{
-		int			off,		/* Offset from start for cur tok */
-					tok_len;	/* Length (in bytes) of that tok */
-
-		/*
-		 * If we have an external param at this location, but no lists are
-		 * being squashed across the query, then we skip here; this will make
-		 * us print the characters found in the original query that represent
-		 * the parameter in the next iteration (or after the loop is done),
-		 * which is a bit odd but seems to work okay in most cases.
-		 */
-		if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
-			continue;
-
-		off = jstate->clocations[i].location;
-
-		/* Adjust recorded location if we're dealing with partial string */
-		off -= query_loc;
-
-		tok_len = jstate->clocations[i].length;
-
-		if (tok_len < 0)
-			continue;			/* ignore any duplicates */
-
-		/* Copy next chunk (what precedes the next constant) */
-		len_to_wrt = off - last_off;
-		len_to_wrt -= last_tok_len;
-		Assert(len_to_wrt >= 0);
-		memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
-		n_quer_loc += len_to_wrt;
-
-		/*
-		 * And insert a param symbol in place of the constant token; and, if
-		 * we have a squashable list, insert a placeholder comment starting
-		 * from the list's second value.
-		 */
-		n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
-							  num_constants_replaced + 1 + jstate->highest_extern_param_id,
-							  jstate->clocations[i].squashed ? " /*, ... */" : "");
-		num_constants_replaced++;
-
-		/* move forward */
-		quer_loc = off + tok_len;
-		last_off = off;
-		last_tok_len = tok_len;
-	}
-
-	/*
-	 * We've copied up until the last ignorable constant.  Copy over the
-	 * remaining bytes of the original query string.
-	 */
-	len_to_wrt = query_len - quer_loc;
-
-	Assert(len_to_wrt >= 0);
-	memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
-	n_quer_loc += len_to_wrt;
-
-	Assert(n_quer_loc <= norm_query_buflen);
-	norm_query[n_quer_loc] = '\0';
-
-	*query_len_p = n_quer_loc;
-	return norm_query;
-}
-
-/*
- * Given a valid SQL string and an array of constant-location records,
- * fill in the textual lengths of those constants.
- *
- * The constants may use any allowed constant syntax, such as float literals,
- * bit-strings, single-quoted strings and dollar-quoted strings.  This is
- * accomplished by using the public API for the core scanner.
- *
- * It is the caller's job to ensure that the string is a valid SQL statement
- * with constants at the indicated locations.  Since in practice the string
- * has already been parsed, and the locations that the caller provides will
- * have originated from within the authoritative parser, this should not be
- * a problem.
- *
- * Multiple constants can have the same location.  We reset lengths of those
- * past the first to -1 so that they can later be ignored.
- *
- * If query_loc > 0, then "query" has been advanced by that much compared to
- * the original string start, so we need to translate the provided locations
- * to compensate.  (This lets us avoid re-scanning statements before the one
- * of interest, so it's worth doing.)
- *
- * N.B. There is an assumption that a '-' character at a Const location begins
- * a negative numeric constant.  This precludes there ever being another
- * reason for a constant to start with a '-'.
- */
-static void
-fill_in_constant_lengths(JumbleState *jstate, const char *query,
-						 int query_loc)
-{
-	LocationLen *locs;
-	core_yyscan_t yyscanner;
-	core_yy_extra_type yyextra;
-	core_YYSTYPE yylval;
-	YYLTYPE		yylloc;
-
-	/*
-	 * Sort the records by location so that we can process them in order while
-	 * scanning the query text.
-	 */
-	if (jstate->clocations_count > 1)
-		qsort(jstate->clocations, jstate->clocations_count,
-			  sizeof(LocationLen), comp_location);
-	locs = jstate->clocations;
-
-	/* initialize the flex scanner --- should match raw_parser() */
-	yyscanner = scanner_init(query,
-							 &yyextra,
-							 &ScanKeywords,
-							 ScanKeywordTokens);
-
-	/* Search for each constant, in sequence */
-	for (int i = 0; i < jstate->clocations_count; i++)
-	{
-		int			loc;
-		int			tok;
-
-		/* Ignore constants after the first one in the same location */
-		if (i > 0 && locs[i].location == locs[i - 1].location)
-		{
-			locs[i].length = -1;
-			continue;
-		}
-
-		if (locs[i].squashed)
-			continue;			/* squashable list, ignore */
-
-		/* Adjust recorded location if we're dealing with partial string */
-		loc = locs[i].location - query_loc;
-		Assert(loc >= 0);
-
-		/*
-		 * We have a valid location for a constant that's not a dupe. Lex
-		 * tokens until we find the desired constant.
-		 */
-		for (;;)
-		{
-			tok = core_yylex(&yylval, &yylloc, yyscanner);
-
-			/* We should not hit end-of-string, but if we do, behave sanely */
-			if (tok == 0)
-				break;			/* out of inner for-loop */
-
-			/*
-			 * We should find the token position exactly, but if we somehow
-			 * run past it, work with that.
-			 */
-			if (yylloc >= loc)
-			{
-				if (query[loc] == '-')
-				{
-					/*
-					 * It's a negative value - this is the one and only case
-					 * where we replace more than a single token.
-					 *
-					 * Do not compensate for the core system's special-case
-					 * adjustment of location to that of the leading '-'
-					 * operator in the event of a negative constant.  It is
-					 * also useful for our purposes to start from the minus
-					 * symbol.  In this way, queries like "select * from foo
-					 * where bar = 1" and "select * from foo where bar = -2"
-					 * will have identical normalized query strings.
-					 */
-					tok = core_yylex(&yylval, &yylloc, yyscanner);
-					if (tok == 0)
-						break;	/* out of inner for-loop */
-				}
-
-				/*
-				 * We now rely on the assumption that flex has placed a zero
-				 * byte after the text of the current token in scanbuf.
-				 */
-				locs[i].length = strlen(yyextra.scanbuf + loc);
-				break;			/* out of inner for-loop */
-			}
-		}
-
-		/* If we hit end-of-string, give up, leaving remaining lengths -1 */
-		if (tok == 0)
-			break;
-	}
-
-	scanner_finish(yyscanner);
-}
-
-/*
- * comp_location: comparator for qsorting LocationLen structs by location
- */
-static int
-comp_location(const void *a, const void *b)
-{
-	int			l = ((const LocationLen *) a)->location;
-	int			r = ((const LocationLen *) b)->location;
-
-	return pg_cmp_s32(l, r);
-}
diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c
index 87db8dc1a32..c997a1ef609 100644
--- a/src/backend/nodes/queryjumblefuncs.c
+++ b/src/backend/nodes/queryjumblefuncs.c
@@ -40,10 +40,12 @@
 #include "access/transam.h"
 #include "catalog/pg_proc.h"
 #include "common/hashfn.h"
+#include "common/int.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
 #include "nodes/queryjumble.h"
 #include "utils/lsyscache.h"
+#include "parser/scanner.h"
 #include "parser/scansup.h"
 
 #define JUMBLE_SIZE				1024	/* query serialization buffer size */
@@ -773,3 +775,254 @@ _jumbleRangeTblEntry_eref(JumbleState *jstate,
 	 */
 	JUMBLE_STRING(aliasname);
 }
+
+/*
+ * CompLocation: comparator for qsorting LocationLen structs by location
+ */
+static int
+CompLocation(const void *a, const void *b)
+{
+	int			l = ((const LocationLen *) a)->location;
+	int			r = ((const LocationLen *) b)->location;
+
+	return pg_cmp_s32(l, r);
+}
+
+/*
+ * Given a valid SQL string and an array of constant-location records,
+ * fill in the textual lengths of those constants in JumbleState.
+ *
+ * The constants may use any allowed constant syntax, such as float literals,
+ * bit-strings, single-quoted strings and dollar-quoted strings.  This is
+ * accomplished by using the public API for the core scanner.
+ *
+ * Multiple constants can have the same location.  We reset lengths of those
+ * past the first to -1 so that they can later be ignored.
+ *
+ * If query_loc > 0, then "query" has been advanced by that much compared to
+ * the original string start, as is the case with multi-statement strings, so
+ * we need to translate the provided locations to compensate.  (This lets us
+ * avoid re-scanning statements before the one of interest, so it's worth doing.)
+ *
+ * N.B. There is an assumption that a '-' character at a Const location begins
+ * a negative numeric constant.  This precludes there ever being another
+ * reason for a constant to start with a '-'.
+ */
+static void
+SetConstantLengths(JumbleState *jstate, const char *query,
+				   int query_loc)
+{
+	LocationLen *locs;
+	core_yyscan_t yyscanner;
+	core_yy_extra_type yyextra;
+	core_YYSTYPE yylval;
+	YYLTYPE		yylloc;
+
+	/*
+	 * Sort the records by location so that we can process them in order while
+	 * scanning the query text.
+	 */
+	if (jstate->clocations_count > 1)
+		qsort(jstate->clocations, jstate->clocations_count,
+			  sizeof(LocationLen), CompLocation);
+	locs = jstate->clocations;
+
+	/* initialize the flex scanner --- should match raw_parser() */
+	yyscanner = scanner_init(query,
+							 &yyextra,
+							 &ScanKeywords,
+							 ScanKeywordTokens);
+
+	/* Search for each constant, in sequence */
+	for (int i = 0; i < jstate->clocations_count; i++)
+	{
+		int			loc;
+		int			tok;
+
+		/* Ignore constants after the first one in the same location */
+		if (i > 0 && locs[i].location == locs[i - 1].location)
+		{
+			locs[i].length = -1;
+			continue;
+		}
+
+		if (locs[i].squashed)
+			continue;			/* squashable list, ignore */
+
+		/*
+		 * Adjust the constant's location using the provided starting location
+		 * of the current statement. This allows us to avoid scanning a
+		 * multi-statement string from the beginning.
+		 */
+		loc = locs[i].location - query_loc;
+		Assert(loc >= 0);
+
+		/*
+		 * We have a valid location for a constant that's not a dupe. Lex
+		 * tokens until we find the desired constant.
+		 */
+		for (;;)
+		{
+			tok = core_yylex(&yylval, &yylloc, yyscanner);
+
+			/* We should not hit end-of-string, but if we do, behave sanely */
+			if (tok == 0)
+				break;			/* out of inner for-loop */
+
+			/*
+			 * We should find the token position exactly, but if we somehow
+			 * run past it, work with that.
+			 */
+			if (yylloc >= loc)
+			{
+				if (query[loc] == '-')
+				{
+					/*
+					 * It's a negative value - this is the one and only case
+					 * where we replace more than a single token.
+					 *
+					 * Do not compensate for the special-case adjustment of
+					 * location to that of the leading '-' operator in the
+					 * event of a negative constant (see doNegate() in
+					 * gram.y).  It is also useful for our purposes to start
+					 * from the minus symbol.  In this way, queries like
+					 * "select * from foo where bar = 1" and "select * from
+					 * foo where bar = -2" will have identical normalized
+					 * query strings.
+					 */
+					tok = core_yylex(&yylval, &yylloc, yyscanner);
+					if (tok == 0)
+						break;	/* out of inner for-loop */
+				}
+
+				/*
+				 * We now rely on the assumption that flex has placed a zero
+				 * byte after the text of the current token in scanbuf.
+				 */
+				locs[i].length = strlen(yyextra.scanbuf + loc);
+				break;			/* out of inner for-loop */
+			}
+		}
+
+		/* If we hit end-of-string, give up, leaving remaining lengths -1 */
+		if (tok == 0)
+			break;
+	}
+
+	scanner_finish(yyscanner);
+}
+
+/*
+ * Generate a normalized version of the query string that will be used to
+ * represent all similar queries.
+ *
+ * *query_len_p contains the input string length, and is updated with
+ * the result string length on exit.  The resulting string might be longer
+ * or shorter depending on what happens with replacement of constants.
+ *
+ * Similar to SetConstantLengths, we must also translate the provided location
+ * to compensate for multi-statement strings.
+ *
+ * It is the caller's job to ensure that the string is a valid SQL statement
+ * with the correct constant locations in jstate->clocations. Since in
+ * practice the string has already been parsed using the authoritative parser
+ * and the locations are set by core query jumbling, this should not be a problem.
+ *
+ * Returns a palloc'd string.
+ */
+char *
+GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+						int query_loc, int *query_len_p)
+{
+	char	   *norm_query;
+	int			query_len = *query_len_p;
+	int			norm_query_buflen,	/* Space allowed for norm_query */
+				len_to_wrt,		/* Length (in bytes) to write */
+				quer_loc = 0,	/* Source query byte location */
+				n_quer_loc = 0, /* Normalized query byte location */
+				last_off = 0,	/* Offset from start for previous tok */
+				last_tok_len = 0;	/* Length (in bytes) of that tok */
+	int			num_constants_replaced = 0;
+
+	/*
+	 * Set constants' lengths in JumbleState, as only locations are set during
+	 * DoJumble(). Note this also ensures the items are sorted by location.
+	 */
+	SetConstantLengths(jstate, query, query_loc);
+
+	/*
+	 * Allow for $n symbols to be longer than the constants they replace.
+	 * Constants must take at least one byte in text form, while a $n symbol
+	 * certainly isn't more than 11 bytes, even if n reaches INT_MAX.  We
+	 * could refine that limit based on the max value of n for the current
+	 * query, but it hardly seems worth any extra effort to do so.
+	 */
+	norm_query_buflen = query_len + jstate->clocations_count * 10;
+
+	/* Allocate result buffer */
+	norm_query = palloc(norm_query_buflen + 1);
+
+	for (int i = 0; i < jstate->clocations_count; i++)
+	{
+		int			off,		/* Offset from start for cur tok */
+					tok_len;	/* Length (in bytes) of that tok */
+
+		/*
+		 * If we have an external param at this location, but no lists are
+		 * being squashed across the query, then we skip here; this will make
+		 * us print the characters found in the original query that represent
+		 * the parameter in the next iteration (or after the loop is done),
+		 * which is a bit odd but seems to work okay in most cases.
+		 */
+		if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
+			continue;
+
+		off = jstate->clocations[i].location;
+
+		/* Adjust the constant's location (see SetConstantLengths) */
+		off -= query_loc;
+
+		tok_len = jstate->clocations[i].length;
+
+		if (tok_len < 0)
+			continue;			/* ignore any duplicates */
+
+		/* Copy next chunk (what precedes the next constant) */
+		len_to_wrt = off - last_off;
+		len_to_wrt -= last_tok_len;
+		Assert(len_to_wrt >= 0);
+		memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
+		n_quer_loc += len_to_wrt;
+
+		/*
+		 * And insert a param symbol in place of the constant token; and, if
+		 * we have a squashable list, insert a placeholder comment starting
+		 * from the list's second value.
+		 */
+		n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
+							  num_constants_replaced + 1 + jstate->highest_extern_param_id,
+							  jstate->clocations[i].squashed ? " /*, ... */" : "");
+		num_constants_replaced++;
+
+		/* move forward */
+		quer_loc = off + tok_len;
+		last_off = off;
+		last_tok_len = tok_len;
+	}
+
+	/*
+	 * We've copied up until the last ignorable constant.  Copy over the
+	 * remaining bytes of the original query string.
+	 */
+	len_to_wrt = query_len - quer_loc;
+
+	Assert(len_to_wrt >= 0);
+	memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
+	n_quer_loc += len_to_wrt;
+
+	Assert(n_quer_loc <= norm_query_buflen);
+	norm_query[n_quer_loc] = '\0';
+
+	*query_len_p = n_quer_loc;
+	return norm_query;
+}
diff --git a/src/include/nodes/queryjumble.h b/src/include/nodes/queryjumble.h
index 9f81893003c..e354f857c37 100644
--- a/src/include/nodes/queryjumble.h
+++ b/src/include/nodes/queryjumble.h
@@ -91,6 +91,8 @@ extern PGDLLIMPORT int compute_query_id;
 
 
 extern const char *CleanQuerytext(const char *query, int *location, int *len);
+extern char *GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+									 int query_loc, int *query_len_p);
 extern JumbleState *JumbleQuery(Query *query);
 extern void EnableQueryId(void);
 
-- 
2.47.3



^ permalink  raw  reply  [nested|flat] 14+ messages in thread

* Re: Refactor query normalization into core query jumbling
@ 2026-03-26 16:35  Lukas Fittl <[email protected]>
  parent: Sami Imseih <[email protected]>
  0 siblings, 1 reply; 14+ messages in thread

From: Lukas Fittl @ 2026-03-26 16:35 UTC (permalink / raw)
  To: Sami Imseih <[email protected]>; +Cc: Michael Paquier <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>

Hi Sami,

On Wed, Mar 25, 2026 at 8:31 PM Sami Imseih <[email protected]> wrote:
>
> v6 addresses the comments.
>

Great - I've tested that again and changes look good.

Two more thoughts:

1) I think "SetConstantLengths" should be renamed to
"GetConstantLengths", or "CalculateConstantLengths" in 0002, since
we're no longer modifying JumbleState

2) I wonder if we should export this function in 0002 - that would
specifically help pg_tracing, which also wants to extract inline
parameter values in addition to replacing them with a $n parameter
reference marker - I could also see that being useful for any other
extension that's interested in pulling out parameter values

I've also done some testing with two previously mentioned extensions,
pg_stat_monitor [0] and pg_tracing [1]. Both look like they can be
adapted to the new method with their regression tests succeeding.

Thanks,
Lukas

[0]: https://github.com/lfittl/pg_stat_monitor/commit/d3521de9f6736c1bb745bc31dae736540efe1781
[1]: https://github.com/lfittl/pg_tracing/commit/ecff95a87a1c60e8a5fe47662cc6a2148bd3632f

-- 
Lukas Fittl





^ permalink  raw  reply  [nested|flat] 14+ messages in thread

* Re: Refactor query normalization into core query jumbling
@ 2026-03-26 17:18  Sami Imseih <[email protected]>
  parent: Lukas Fittl <[email protected]>
  0 siblings, 1 reply; 14+ messages in thread

From: Sami Imseih @ 2026-03-26 17:18 UTC (permalink / raw)
  To: Lukas Fittl <[email protected]>; +Cc: Michael Paquier <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>

> 1) I think "SetConstantLengths" should be renamed to
> "GetConstantLengths", or "CalculateConstantLengths" in 0002, since
> we're no longer modifying JumbleState

Makes sense. I renamed it to ComputeConstantLengths; as this will reflect
that it's more than a getter function, and it's doing actual work. Compute
also seems more consistent with other function names we have out there.

> 2) I wonder if we should export this function in 0002 - that would
> specifically help pg_tracing, which also wants to extract inline
> parameter values in addition to replacing them with a $n parameter
> reference marker - I could also see that being useful for any other
> extension that's interested in pulling out parameter values

> I've also done some testing with two previously mentioned extensions,
> pg_stat_monitor [0] and pg_tracing [1]. Both look like they can be
> adapted to the new method with their regression tests succeeding.

That did cross my mind. I agree with this.

Both of those changes belong in 0002. See the attached v7.

--
Sami Imseih
Amazon Web Services (AWS)


Attachments:

  [application/octet-stream] v7-0001-pg_stat_statements-Move-query-normalization-to-co.patch (21.1K, 2-v7-0001-pg_stat_statements-Move-query-normalization-to-co.patch)
  download | inline diff:
From e5d441c4e4932b7453293c689c2867c1d9d1d53f Mon Sep 17 00:00:00 2001
From: Ubuntu <[email protected]>
Date: Thu, 18 Dec 2025 18:59:31 +0000
Subject: [PATCH v7 1/2] pg_stat_statements: Move query normalization to core

pg_stat_statements was modifying core generated JumbleState instances
by calling fill_in_constant_lengths() via generate_normalized_query()
to populate constant lengths. This creates problems since extensions
should not modify shared core data structures that may be used by
other extensions, and this could be considered a layering violation.

Move query normalization functions to core as a global function to
fix this. Extensions now call GenerateNormalizedQuery() instead of
directly modifying JumbleState.

This change also addresses code duplication, as several extensions
have been copying generate_normalized_query() to implement their own
normalization. The new core API provides a single, consistent
implementation for all extensions to use.

Functions are renamed to match core naming conventions and comments
are updated for clarity.

Discussion: https://postgr.es/m/CAA5RZ0tZp5qU0ikZEEqJnxvdSNGh1DWv80sb-k4QAUmiMoOp_Q@mail.gmail.com
---
 .../pg_stat_statements/pg_stat_statements.c   | 273 +-----------------
 src/backend/nodes/queryjumblefuncs.c          | 253 ++++++++++++++++
 src/include/nodes/queryjumble.h               |   2 +
 3 files changed, 265 insertions(+), 263 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 6cb14824ec3..c9338442d96 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -50,7 +50,6 @@
 #include "access/htup_details.h"
 #include "access/parallel.h"
 #include "catalog/pg_authid.h"
-#include "common/int.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
 #include "jit/jit.h"
@@ -59,7 +58,6 @@
 #include "nodes/queryjumble.h"
 #include "optimizer/planner.h"
 #include "parser/analyze.h"
-#include "parser/scanner.h"
 #include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
@@ -378,11 +376,6 @@ static char *qtext_fetch(Size query_offset, int query_len,
 static bool need_gc_qtexts(void);
 static void gc_qtexts(void);
 static TimestampTz entry_reset(Oid userid, Oid dbid, int64 queryid, bool minmax_only);
-static char *generate_normalized_query(JumbleState *jstate, const char *query,
-									   int query_loc, int *query_len_p);
-static void fill_in_constant_lengths(JumbleState *jstate, const char *query,
-									 int query_loc);
-static int	comp_location(const void *a, const void *b);
 
 
 /*
@@ -1360,9 +1353,16 @@ pgss_store(const char *query, int64 queryId,
 		if (jstate)
 		{
 			LWLockRelease(pgss->lock);
-			norm_query = generate_normalized_query(jstate, query,
-												   query_location,
-												   &query_len);
+
+			/*
+			 * generate the normalized query. Note that the normalized
+			 * representation may well vary depending on just which
+			 * "equivalent" query is used to create the hashtable entry. We
+			 * assume this is OK.
+			 */
+			norm_query = GenerateNormalizedQuery(jstate, query,
+												 query_location,
+												 &query_len);
 			LWLockAcquire(pgss->lock, LW_SHARED);
 		}
 
@@ -2824,256 +2824,3 @@ release_lock:
 
 	return stats_reset;
 }
-
-/*
- * Generate a normalized version of the query string that will be used to
- * represent all similar queries.
- *
- * Note that the normalized representation may well vary depending on
- * just which "equivalent" query is used to create the hashtable entry.
- * We assume this is OK.
- *
- * If query_loc > 0, then "query" has been advanced by that much compared to
- * the original string start, so we need to translate the provided locations
- * to compensate.  (This lets us avoid re-scanning statements before the one
- * of interest, so it's worth doing.)
- *
- * *query_len_p contains the input string length, and is updated with
- * the result string length on exit.  The resulting string might be longer
- * or shorter depending on what happens with replacement of constants.
- *
- * Returns a palloc'd string.
- */
-static char *
-generate_normalized_query(JumbleState *jstate, const char *query,
-						  int query_loc, int *query_len_p)
-{
-	char	   *norm_query;
-	int			query_len = *query_len_p;
-	int			norm_query_buflen,	/* Space allowed for norm_query */
-				len_to_wrt,		/* Length (in bytes) to write */
-				quer_loc = 0,	/* Source query byte location */
-				n_quer_loc = 0, /* Normalized query byte location */
-				last_off = 0,	/* Offset from start for previous tok */
-				last_tok_len = 0;	/* Length (in bytes) of that tok */
-	int			num_constants_replaced = 0;
-
-	/*
-	 * Get constants' lengths (core system only gives us locations).  Note
-	 * this also ensures the items are sorted by location.
-	 */
-	fill_in_constant_lengths(jstate, query, query_loc);
-
-	/*
-	 * Allow for $n symbols to be longer than the constants they replace.
-	 * Constants must take at least one byte in text form, while a $n symbol
-	 * certainly isn't more than 11 bytes, even if n reaches INT_MAX.  We
-	 * could refine that limit based on the max value of n for the current
-	 * query, but it hardly seems worth any extra effort to do so.
-	 */
-	norm_query_buflen = query_len + jstate->clocations_count * 10;
-
-	/* Allocate result buffer */
-	norm_query = palloc(norm_query_buflen + 1);
-
-	for (int i = 0; i < jstate->clocations_count; i++)
-	{
-		int			off,		/* Offset from start for cur tok */
-					tok_len;	/* Length (in bytes) of that tok */
-
-		/*
-		 * If we have an external param at this location, but no lists are
-		 * being squashed across the query, then we skip here; this will make
-		 * us print the characters found in the original query that represent
-		 * the parameter in the next iteration (or after the loop is done),
-		 * which is a bit odd but seems to work okay in most cases.
-		 */
-		if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
-			continue;
-
-		off = jstate->clocations[i].location;
-
-		/* Adjust recorded location if we're dealing with partial string */
-		off -= query_loc;
-
-		tok_len = jstate->clocations[i].length;
-
-		if (tok_len < 0)
-			continue;			/* ignore any duplicates */
-
-		/* Copy next chunk (what precedes the next constant) */
-		len_to_wrt = off - last_off;
-		len_to_wrt -= last_tok_len;
-		Assert(len_to_wrt >= 0);
-		memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
-		n_quer_loc += len_to_wrt;
-
-		/*
-		 * And insert a param symbol in place of the constant token; and, if
-		 * we have a squashable list, insert a placeholder comment starting
-		 * from the list's second value.
-		 */
-		n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
-							  num_constants_replaced + 1 + jstate->highest_extern_param_id,
-							  jstate->clocations[i].squashed ? " /*, ... */" : "");
-		num_constants_replaced++;
-
-		/* move forward */
-		quer_loc = off + tok_len;
-		last_off = off;
-		last_tok_len = tok_len;
-	}
-
-	/*
-	 * We've copied up until the last ignorable constant.  Copy over the
-	 * remaining bytes of the original query string.
-	 */
-	len_to_wrt = query_len - quer_loc;
-
-	Assert(len_to_wrt >= 0);
-	memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
-	n_quer_loc += len_to_wrt;
-
-	Assert(n_quer_loc <= norm_query_buflen);
-	norm_query[n_quer_loc] = '\0';
-
-	*query_len_p = n_quer_loc;
-	return norm_query;
-}
-
-/*
- * Given a valid SQL string and an array of constant-location records,
- * fill in the textual lengths of those constants.
- *
- * The constants may use any allowed constant syntax, such as float literals,
- * bit-strings, single-quoted strings and dollar-quoted strings.  This is
- * accomplished by using the public API for the core scanner.
- *
- * It is the caller's job to ensure that the string is a valid SQL statement
- * with constants at the indicated locations.  Since in practice the string
- * has already been parsed, and the locations that the caller provides will
- * have originated from within the authoritative parser, this should not be
- * a problem.
- *
- * Multiple constants can have the same location.  We reset lengths of those
- * past the first to -1 so that they can later be ignored.
- *
- * If query_loc > 0, then "query" has been advanced by that much compared to
- * the original string start, so we need to translate the provided locations
- * to compensate.  (This lets us avoid re-scanning statements before the one
- * of interest, so it's worth doing.)
- *
- * N.B. There is an assumption that a '-' character at a Const location begins
- * a negative numeric constant.  This precludes there ever being another
- * reason for a constant to start with a '-'.
- */
-static void
-fill_in_constant_lengths(JumbleState *jstate, const char *query,
-						 int query_loc)
-{
-	LocationLen *locs;
-	core_yyscan_t yyscanner;
-	core_yy_extra_type yyextra;
-	core_YYSTYPE yylval;
-	YYLTYPE		yylloc;
-
-	/*
-	 * Sort the records by location so that we can process them in order while
-	 * scanning the query text.
-	 */
-	if (jstate->clocations_count > 1)
-		qsort(jstate->clocations, jstate->clocations_count,
-			  sizeof(LocationLen), comp_location);
-	locs = jstate->clocations;
-
-	/* initialize the flex scanner --- should match raw_parser() */
-	yyscanner = scanner_init(query,
-							 &yyextra,
-							 &ScanKeywords,
-							 ScanKeywordTokens);
-
-	/* Search for each constant, in sequence */
-	for (int i = 0; i < jstate->clocations_count; i++)
-	{
-		int			loc;
-		int			tok;
-
-		/* Ignore constants after the first one in the same location */
-		if (i > 0 && locs[i].location == locs[i - 1].location)
-		{
-			locs[i].length = -1;
-			continue;
-		}
-
-		if (locs[i].squashed)
-			continue;			/* squashable list, ignore */
-
-		/* Adjust recorded location if we're dealing with partial string */
-		loc = locs[i].location - query_loc;
-		Assert(loc >= 0);
-
-		/*
-		 * We have a valid location for a constant that's not a dupe. Lex
-		 * tokens until we find the desired constant.
-		 */
-		for (;;)
-		{
-			tok = core_yylex(&yylval, &yylloc, yyscanner);
-
-			/* We should not hit end-of-string, but if we do, behave sanely */
-			if (tok == 0)
-				break;			/* out of inner for-loop */
-
-			/*
-			 * We should find the token position exactly, but if we somehow
-			 * run past it, work with that.
-			 */
-			if (yylloc >= loc)
-			{
-				if (query[loc] == '-')
-				{
-					/*
-					 * It's a negative value - this is the one and only case
-					 * where we replace more than a single token.
-					 *
-					 * Do not compensate for the core system's special-case
-					 * adjustment of location to that of the leading '-'
-					 * operator in the event of a negative constant.  It is
-					 * also useful for our purposes to start from the minus
-					 * symbol.  In this way, queries like "select * from foo
-					 * where bar = 1" and "select * from foo where bar = -2"
-					 * will have identical normalized query strings.
-					 */
-					tok = core_yylex(&yylval, &yylloc, yyscanner);
-					if (tok == 0)
-						break;	/* out of inner for-loop */
-				}
-
-				/*
-				 * We now rely on the assumption that flex has placed a zero
-				 * byte after the text of the current token in scanbuf.
-				 */
-				locs[i].length = strlen(yyextra.scanbuf + loc);
-				break;			/* out of inner for-loop */
-			}
-		}
-
-		/* If we hit end-of-string, give up, leaving remaining lengths -1 */
-		if (tok == 0)
-			break;
-	}
-
-	scanner_finish(yyscanner);
-}
-
-/*
- * comp_location: comparator for qsorting LocationLen structs by location
- */
-static int
-comp_location(const void *a, const void *b)
-{
-	int			l = ((const LocationLen *) a)->location;
-	int			r = ((const LocationLen *) b)->location;
-
-	return pg_cmp_s32(l, r);
-}
diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c
index 87db8dc1a32..c997a1ef609 100644
--- a/src/backend/nodes/queryjumblefuncs.c
+++ b/src/backend/nodes/queryjumblefuncs.c
@@ -40,10 +40,12 @@
 #include "access/transam.h"
 #include "catalog/pg_proc.h"
 #include "common/hashfn.h"
+#include "common/int.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
 #include "nodes/queryjumble.h"
 #include "utils/lsyscache.h"
+#include "parser/scanner.h"
 #include "parser/scansup.h"
 
 #define JUMBLE_SIZE				1024	/* query serialization buffer size */
@@ -773,3 +775,254 @@ _jumbleRangeTblEntry_eref(JumbleState *jstate,
 	 */
 	JUMBLE_STRING(aliasname);
 }
+
+/*
+ * CompLocation: comparator for qsorting LocationLen structs by location
+ */
+static int
+CompLocation(const void *a, const void *b)
+{
+	int			l = ((const LocationLen *) a)->location;
+	int			r = ((const LocationLen *) b)->location;
+
+	return pg_cmp_s32(l, r);
+}
+
+/*
+ * Given a valid SQL string and an array of constant-location records,
+ * fill in the textual lengths of those constants in JumbleState.
+ *
+ * The constants may use any allowed constant syntax, such as float literals,
+ * bit-strings, single-quoted strings and dollar-quoted strings.  This is
+ * accomplished by using the public API for the core scanner.
+ *
+ * Multiple constants can have the same location.  We reset lengths of those
+ * past the first to -1 so that they can later be ignored.
+ *
+ * If query_loc > 0, then "query" has been advanced by that much compared to
+ * the original string start, as is the case with multi-statement strings, so
+ * we need to translate the provided locations to compensate.  (This lets us
+ * avoid re-scanning statements before the one of interest, so it's worth doing.)
+ *
+ * N.B. There is an assumption that a '-' character at a Const location begins
+ * a negative numeric constant.  This precludes there ever being another
+ * reason for a constant to start with a '-'.
+ */
+static void
+SetConstantLengths(JumbleState *jstate, const char *query,
+				   int query_loc)
+{
+	LocationLen *locs;
+	core_yyscan_t yyscanner;
+	core_yy_extra_type yyextra;
+	core_YYSTYPE yylval;
+	YYLTYPE		yylloc;
+
+	/*
+	 * Sort the records by location so that we can process them in order while
+	 * scanning the query text.
+	 */
+	if (jstate->clocations_count > 1)
+		qsort(jstate->clocations, jstate->clocations_count,
+			  sizeof(LocationLen), CompLocation);
+	locs = jstate->clocations;
+
+	/* initialize the flex scanner --- should match raw_parser() */
+	yyscanner = scanner_init(query,
+							 &yyextra,
+							 &ScanKeywords,
+							 ScanKeywordTokens);
+
+	/* Search for each constant, in sequence */
+	for (int i = 0; i < jstate->clocations_count; i++)
+	{
+		int			loc;
+		int			tok;
+
+		/* Ignore constants after the first one in the same location */
+		if (i > 0 && locs[i].location == locs[i - 1].location)
+		{
+			locs[i].length = -1;
+			continue;
+		}
+
+		if (locs[i].squashed)
+			continue;			/* squashable list, ignore */
+
+		/*
+		 * Adjust the constant's location using the provided starting location
+		 * of the current statement. This allows us to avoid scanning a
+		 * multi-statement string from the beginning.
+		 */
+		loc = locs[i].location - query_loc;
+		Assert(loc >= 0);
+
+		/*
+		 * We have a valid location for a constant that's not a dupe. Lex
+		 * tokens until we find the desired constant.
+		 */
+		for (;;)
+		{
+			tok = core_yylex(&yylval, &yylloc, yyscanner);
+
+			/* We should not hit end-of-string, but if we do, behave sanely */
+			if (tok == 0)
+				break;			/* out of inner for-loop */
+
+			/*
+			 * We should find the token position exactly, but if we somehow
+			 * run past it, work with that.
+			 */
+			if (yylloc >= loc)
+			{
+				if (query[loc] == '-')
+				{
+					/*
+					 * It's a negative value - this is the one and only case
+					 * where we replace more than a single token.
+					 *
+					 * Do not compensate for the special-case adjustment of
+					 * location to that of the leading '-' operator in the
+					 * event of a negative constant (see doNegate() in
+					 * gram.y).  It is also useful for our purposes to start
+					 * from the minus symbol.  In this way, queries like
+					 * "select * from foo where bar = 1" and "select * from
+					 * foo where bar = -2" will have identical normalized
+					 * query strings.
+					 */
+					tok = core_yylex(&yylval, &yylloc, yyscanner);
+					if (tok == 0)
+						break;	/* out of inner for-loop */
+				}
+
+				/*
+				 * We now rely on the assumption that flex has placed a zero
+				 * byte after the text of the current token in scanbuf.
+				 */
+				locs[i].length = strlen(yyextra.scanbuf + loc);
+				break;			/* out of inner for-loop */
+			}
+		}
+
+		/* If we hit end-of-string, give up, leaving remaining lengths -1 */
+		if (tok == 0)
+			break;
+	}
+
+	scanner_finish(yyscanner);
+}
+
+/*
+ * Generate a normalized version of the query string that will be used to
+ * represent all similar queries.
+ *
+ * *query_len_p contains the input string length, and is updated with
+ * the result string length on exit.  The resulting string might be longer
+ * or shorter depending on what happens with replacement of constants.
+ *
+ * Similar to SetConstantLengths, we must also translate the provided location
+ * to compensate for multi-statement strings.
+ *
+ * It is the caller's job to ensure that the string is a valid SQL statement
+ * with the correct constant locations in jstate->clocations. Since in
+ * practice the string has already been parsed using the authoritative parser
+ * and the locations are set by core query jumbling, this should not be a problem.
+ *
+ * Returns a palloc'd string.
+ */
+char *
+GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+						int query_loc, int *query_len_p)
+{
+	char	   *norm_query;
+	int			query_len = *query_len_p;
+	int			norm_query_buflen,	/* Space allowed for norm_query */
+				len_to_wrt,		/* Length (in bytes) to write */
+				quer_loc = 0,	/* Source query byte location */
+				n_quer_loc = 0, /* Normalized query byte location */
+				last_off = 0,	/* Offset from start for previous tok */
+				last_tok_len = 0;	/* Length (in bytes) of that tok */
+	int			num_constants_replaced = 0;
+
+	/*
+	 * Set constants' lengths in JumbleState, as only locations are set during
+	 * DoJumble(). Note this also ensures the items are sorted by location.
+	 */
+	SetConstantLengths(jstate, query, query_loc);
+
+	/*
+	 * Allow for $n symbols to be longer than the constants they replace.
+	 * Constants must take at least one byte in text form, while a $n symbol
+	 * certainly isn't more than 11 bytes, even if n reaches INT_MAX.  We
+	 * could refine that limit based on the max value of n for the current
+	 * query, but it hardly seems worth any extra effort to do so.
+	 */
+	norm_query_buflen = query_len + jstate->clocations_count * 10;
+
+	/* Allocate result buffer */
+	norm_query = palloc(norm_query_buflen + 1);
+
+	for (int i = 0; i < jstate->clocations_count; i++)
+	{
+		int			off,		/* Offset from start for cur tok */
+					tok_len;	/* Length (in bytes) of that tok */
+
+		/*
+		 * If we have an external param at this location, but no lists are
+		 * being squashed across the query, then we skip here; this will make
+		 * us print the characters found in the original query that represent
+		 * the parameter in the next iteration (or after the loop is done),
+		 * which is a bit odd but seems to work okay in most cases.
+		 */
+		if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
+			continue;
+
+		off = jstate->clocations[i].location;
+
+		/* Adjust the constant's location (see SetConstantLengths) */
+		off -= query_loc;
+
+		tok_len = jstate->clocations[i].length;
+
+		if (tok_len < 0)
+			continue;			/* ignore any duplicates */
+
+		/* Copy next chunk (what precedes the next constant) */
+		len_to_wrt = off - last_off;
+		len_to_wrt -= last_tok_len;
+		Assert(len_to_wrt >= 0);
+		memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
+		n_quer_loc += len_to_wrt;
+
+		/*
+		 * And insert a param symbol in place of the constant token; and, if
+		 * we have a squashable list, insert a placeholder comment starting
+		 * from the list's second value.
+		 */
+		n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
+							  num_constants_replaced + 1 + jstate->highest_extern_param_id,
+							  jstate->clocations[i].squashed ? " /*, ... */" : "");
+		num_constants_replaced++;
+
+		/* move forward */
+		quer_loc = off + tok_len;
+		last_off = off;
+		last_tok_len = tok_len;
+	}
+
+	/*
+	 * We've copied up until the last ignorable constant.  Copy over the
+	 * remaining bytes of the original query string.
+	 */
+	len_to_wrt = query_len - quer_loc;
+
+	Assert(len_to_wrt >= 0);
+	memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
+	n_quer_loc += len_to_wrt;
+
+	Assert(n_quer_loc <= norm_query_buflen);
+	norm_query[n_quer_loc] = '\0';
+
+	*query_len_p = n_quer_loc;
+	return norm_query;
+}
diff --git a/src/include/nodes/queryjumble.h b/src/include/nodes/queryjumble.h
index 9f81893003c..e354f857c37 100644
--- a/src/include/nodes/queryjumble.h
+++ b/src/include/nodes/queryjumble.h
@@ -91,6 +91,8 @@ extern PGDLLIMPORT int compute_query_id;
 
 
 extern const char *CleanQuerytext(const char *query, int *location, int *len);
+extern char *GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+									 int query_loc, int *query_len_p);
 extern JumbleState *JumbleQuery(Query *query);
 extern void EnableQueryId(void);
 
-- 
2.47.3



  [application/octet-stream] v7-0002-Make-JumbleState-const-in-post_parse_analyze-hook.patch (9.7K, 3-v7-0002-Make-JumbleState-const-in-post_parse_analyze-hook.patch)
  download | inline diff:
From 42360a306bfed6b899bd03e60c69b31db59e2ec0 Mon Sep 17 00:00:00 2001
From: Ubuntu <[email protected]>
Date: Tue, 6 Jan 2026 19:29:01 +0000
Subject: [PATCH v7 2/2] Make JumbleState const in post_parse_analyze hook

Change the post_parse_analyze_hook_type signature to take a const
JumbleState parameter, preventing hooks from modifying the jumble
state during query analysis. This improves API safety by making it
clear that hooks should only read from the jumble state, not modify
it.

Update pg_stat_statements and related functions to match the new const
signature. Refactor SetConstantLengths() to return a newly allocated
LocationLen array instead of modifying the JumbleState, and update
GenerateNormalizedQuery() to work with the const JumbleState and
manage the returned array properly.

Also, rename SetConstantLengths() to ComputeConstantLengths() as it is
no longer setting constant lengths, and it's also computing lengths
based on traversing through the LocationLen data.

Furthermore, ComputeConstantLengths is also exported to allow plug-ins
that wish to perform custom processing of literals from a query.

Discussion: https://postgr.es/m/CAA5RZ0tZp5qU0ikZEEqJnxvdSNGh1DWv80sb-k4QAUmiMoOp_Q@mail.gmail.com
---
 .../pg_stat_statements/pg_stat_statements.c   |  8 +--
 src/backend/nodes/queryjumblefuncs.c          | 50 ++++++++++++-------
 src/include/nodes/queryjumble.h               |  4 +-
 src/include/parser/analyze.h                  |  2 +-
 4 files changed, 41 insertions(+), 23 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index c9338442d96..386a1419232 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -333,7 +333,7 @@ static void pgss_shmem_request(void);
 static void pgss_shmem_startup(void);
 static void pgss_shmem_shutdown(int code, Datum arg);
 static void pgss_post_parse_analyze(ParseState *pstate, Query *query,
-									JumbleState *jstate);
+									const JumbleState *jstate);
 static PlannedStmt *pgss_planner(Query *parse,
 								 const char *query_string,
 								 int cursorOptions,
@@ -357,7 +357,7 @@ static void pgss_store(const char *query, int64 queryId,
 					   const BufferUsage *bufusage,
 					   const WalUsage *walusage,
 					   const struct JitInstrumentation *jitusage,
-					   JumbleState *jstate,
+					   const JumbleState *jstate,
 					   int parallel_workers_to_launch,
 					   int parallel_workers_launched,
 					   PlannedStmtOrigin planOrigin);
@@ -833,7 +833,7 @@ error:
  * Post-parse-analysis hook: mark query with a queryId
  */
 static void
-pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
+pgss_post_parse_analyze(ParseState *pstate, Query *query, const JumbleState *jstate)
 {
 	if (prev_post_parse_analyze_hook)
 		prev_post_parse_analyze_hook(pstate, query, jstate);
@@ -1290,7 +1290,7 @@ pgss_store(const char *query, int64 queryId,
 		   const BufferUsage *bufusage,
 		   const WalUsage *walusage,
 		   const struct JitInstrumentation *jitusage,
-		   JumbleState *jstate,
+		   const JumbleState *jstate,
 		   int parallel_workers_to_launch,
 		   int parallel_workers_launched,
 		   PlannedStmtOrigin planOrigin)
diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c
index c997a1ef609..af9417797a3 100644
--- a/src/backend/nodes/queryjumblefuncs.c
+++ b/src/backend/nodes/queryjumblefuncs.c
@@ -789,8 +789,10 @@ CompLocation(const void *a, const void *b)
 }
 
 /*
- * Given a valid SQL string and an array of constant-location records,
- * fill in the textual lengths of those constants in JumbleState.
+ * Given a valid SQL string and an array of constant-location records, return
+ * the textual lengths of those constants in a newly allocated LocationLen
+ * array, or NULL if there are no constants. It is the caller's responsibility
+ * to pfree the result, if necessary.
  *
  * The constants may use any allowed constant syntax, such as float literals,
  * bit-strings, single-quoted strings and dollar-quoted strings.  This is
@@ -808,9 +810,9 @@ CompLocation(const void *a, const void *b)
  * a negative numeric constant.  This precludes there ever being another
  * reason for a constant to start with a '-'.
  */
-static void
-SetConstantLengths(JumbleState *jstate, const char *query,
-				   int query_loc)
+LocationLen *
+ComputeConstantLengths(const JumbleState *jstate, const char *query,
+					   int query_loc)
 {
 	LocationLen *locs;
 	core_yyscan_t yyscanner;
@@ -818,14 +820,20 @@ SetConstantLengths(JumbleState *jstate, const char *query,
 	core_YYSTYPE yylval;
 	YYLTYPE		yylloc;
 
+	if (jstate->clocations_count == 0)
+		return NULL;
+
+	/* Copy constant locations to avoid modifying jstate */
+	locs = palloc_array(LocationLen, jstate->clocations_count);
+	memcpy(locs, jstate->clocations, jstate->clocations_count * sizeof(LocationLen));
+
 	/*
 	 * Sort the records by location so that we can process them in order while
 	 * scanning the query text.
 	 */
 	if (jstate->clocations_count > 1)
-		qsort(jstate->clocations, jstate->clocations_count,
+		qsort(locs, jstate->clocations_count,
 			  sizeof(LocationLen), CompLocation);
-	locs = jstate->clocations;
 
 	/* initialize the flex scanner --- should match raw_parser() */
 	yyscanner = scanner_init(query,
@@ -910,6 +918,8 @@ SetConstantLengths(JumbleState *jstate, const char *query,
 	}
 
 	scanner_finish(yyscanner);
+
+	return locs;
 }
 
 /*
@@ -920,7 +930,7 @@ SetConstantLengths(JumbleState *jstate, const char *query,
  * the result string length on exit.  The resulting string might be longer
  * or shorter depending on what happens with replacement of constants.
  *
- * Similar to SetConstantLengths, we must also translate the provided location
+ * Similar to ComputeConstantLengths, we must also translate the provided location
  * to compensate for multi-statement strings.
  *
  * It is the caller's job to ensure that the string is a valid SQL statement
@@ -931,7 +941,7 @@ SetConstantLengths(JumbleState *jstate, const char *query,
  * Returns a palloc'd string.
  */
 char *
-GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+GenerateNormalizedQuery(const JumbleState *jstate, const char *query,
 						int query_loc, int *query_len_p)
 {
 	char	   *norm_query;
@@ -943,12 +953,14 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
 				last_off = 0,	/* Offset from start for previous tok */
 				last_tok_len = 0;	/* Length (in bytes) of that tok */
 	int			num_constants_replaced = 0;
+	LocationLen *locs = NULL;
 
 	/*
-	 * Set constants' lengths in JumbleState, as only locations are set during
-	 * DoJumble(). Note this also ensures the items are sorted by location.
+	 * Determine constants' lengths, which are not set during DoJumble(), and
+	 * return a sorted copy of jstate's LocationLen data with lengths filled
+	 * in.
 	 */
-	SetConstantLengths(jstate, query, query_loc);
+	locs = ComputeConstantLengths(jstate, query, query_loc);
 
 	/*
 	 * Allow for $n symbols to be longer than the constants they replace.
@@ -974,15 +986,15 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
 		 * the parameter in the next iteration (or after the loop is done),
 		 * which is a bit odd but seems to work okay in most cases.
 		 */
-		if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
+		if (locs[i].extern_param && !jstate->has_squashed_lists)
 			continue;
 
-		off = jstate->clocations[i].location;
+		off = locs[i].location;
 
-		/* Adjust the constant's location (see SetConstantLengths) */
+		/* Adjust the constant's location (see ComputeConstantLengths) */
 		off -= query_loc;
 
-		tok_len = jstate->clocations[i].length;
+		tok_len = locs[i].length;
 
 		if (tok_len < 0)
 			continue;			/* ignore any duplicates */
@@ -1001,7 +1013,7 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
 		 */
 		n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
 							  num_constants_replaced + 1 + jstate->highest_extern_param_id,
-							  jstate->clocations[i].squashed ? " /*, ... */" : "");
+							  locs[i].squashed ? " /*, ... */" : "");
 		num_constants_replaced++;
 
 		/* move forward */
@@ -1010,6 +1022,10 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
 		last_tok_len = tok_len;
 	}
 
+	/* Clean up temporary location array before any assertions */
+	if (locs)
+		pfree(locs);
+
 	/*
 	 * We've copied up until the last ignorable constant.  Copy over the
 	 * remaining bytes of the original query string.
diff --git a/src/include/nodes/queryjumble.h b/src/include/nodes/queryjumble.h
index e354f857c37..8364d58e4cb 100644
--- a/src/include/nodes/queryjumble.h
+++ b/src/include/nodes/queryjumble.h
@@ -91,8 +91,10 @@ extern PGDLLIMPORT int compute_query_id;
 
 
 extern const char *CleanQuerytext(const char *query, int *location, int *len);
-extern char *GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+extern char *GenerateNormalizedQuery(const JumbleState *jstate, const char *query,
 									 int query_loc, int *query_len_p);
+extern LocationLen *ComputeConstantLengths(const JumbleState *jstate, const char *query,
+										   int query_loc);
 extern JumbleState *JumbleQuery(Query *query);
 extern void EnableQueryId(void);
 
diff --git a/src/include/parser/analyze.h b/src/include/parser/analyze.h
index e10270ff0ff..b2db6fa4e8c 100644
--- a/src/include/parser/analyze.h
+++ b/src/include/parser/analyze.h
@@ -21,7 +21,7 @@
 /* Hook for plugins to get control at end of parse analysis */
 typedef void (*post_parse_analyze_hook_type) (ParseState *pstate,
 											  Query *query,
-											  JumbleState *jstate);
+											  const JumbleState *jstate);
 extern PGDLLIMPORT post_parse_analyze_hook_type post_parse_analyze_hook;
 
 
-- 
2.47.3



^ permalink  raw  reply  [nested|flat] 14+ messages in thread

* Re: Refactor query normalization into core query jumbling
@ 2026-03-27 01:50  Lukas Fittl <[email protected]>
  parent: Sami Imseih <[email protected]>
  0 siblings, 1 reply; 14+ messages in thread

From: Lukas Fittl @ 2026-03-27 01:50 UTC (permalink / raw)
  To: Sami Imseih <[email protected]>; +Cc: Michael Paquier <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>

On Thu, Mar 26, 2026 at 10:19 AM Sami Imseih <[email protected]> wrote:
> Both of those changes belong in 0002. See the attached v7.

Looks good!

I've also done a quick follow-up test with pg_tracing and that
simplifies its logic as expected [0] to be able to extract inline
parameter values.

Thanks,
Lukas

[0]: https://github.com/lfittl/pg_tracing/commit/ae937fdee4aa57206b31b5746f36dd8e55cf43d1#diff-41cf816684...

-- 
Lukas Fittl





^ permalink  raw  reply  [nested|flat] 14+ messages in thread

* Re: Refactor query normalization into core query jumbling
@ 2026-03-27 05:17  Michael Paquier <[email protected]>
  parent: Lukas Fittl <[email protected]>
  0 siblings, 1 reply; 14+ messages in thread

From: Michael Paquier @ 2026-03-27 05:17 UTC (permalink / raw)
  To: Lukas Fittl <[email protected]>; +Cc: Sami Imseih <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>

On Thu, Mar 26, 2026 at 06:50:20PM -0700, Lukas Fittl wrote:
> On Thu, Mar 26, 2026 at 10:19 AM Sami Imseih <[email protected]> wrote:
> > Both of those changes belong in 0002. See the attached v7.
> 
> Looks good!
> 
> I've also done a quick follow-up test with pg_tracing and that
> simplifies its logic as expected [0] to be able to extract inline
> parameter values.

I have been looking at what you have here.  As mentioned upthread, I
am on board with aiming for JumbleState to be a const so as we enforce
as policy that no extension is allowed to touch what the core code has
computed.

However, I am not convinced by most of the patch.  The logic to
recompute the lengths of the constants is a concept proper to PGSS.
Perhaps we could reconsider moving that into core at some point, but
I am honestly not sure to see the range of benefits we'd gain from
that.

This line of arguments is stronger for the normalization of the query
string.  Why should the core code decide what a normalized string
should look like when it comes to the detection of the constants, if
any?  Instead of a dollar-quoted number, we could enforce a bunch of
things, like a '?' or a '$woozah$' at these locations.

Saying that, the key point of the patch is to take a copy of the
JumbleState locations, and recompute it for the needs of PGSS for the
sake of the normalized query representation.  Hence, why don't we just
do that at the end?  That should be enough to enforce the const marker
for the JumbleState across all the loaded extensions that want to look
at it.  This leads me to the simpler patch attached.

Comments or tomatoes?  That's simpler, and I'd be OK with just that
for v19.  That would be much better than the current state of affairs,
where PGSS decides to enforce its own ideas in the JumbleState, ideas 
fed to anything looping into the post-parse-analyze hook after PGSS.
--
Michael

From 336e6aa38d6dc3901c3fa1f746796a0b8fc87cd2 Mon Sep 17 00:00:00 2001
From: Michael Paquier <[email protected]>
Date: Fri, 27 Mar 2026 14:13:00 +0900
Subject: [PATCH v8] Make JumbleState const in post_parse_analyze hook

Change the post_parse_analyze_hook_type signature to take a const
JumbleState parameter, preventing hooks from modifying the jumble
state during query analysis. This improves API safety by making it
clear that hooks should only read from the jumble state, not modify
it.

Update pg_stat_statements and related functions to match the new const
signature.  Refactor fill_in_constant_lengths() to return a newly
allocated LocationLen array instead of modifying the JumbleState, so as
PGSS does not touch the now-const JumbleState.  The routine is renamed
to compute_constant_lengths().

Discussion: https://postgr.es/m/CAA5RZ0tZp5qU0ikZEEqJnxvdSNGh1DWv80sb-k4QAUmiMoOp_Q@mail.gmail.com
---
 src/include/parser/analyze.h                  |  2 +-
 .../pg_stat_statements/pg_stat_statements.c   | 75 ++++++++++++-------
 2 files changed, 50 insertions(+), 27 deletions(-)

diff --git a/src/include/parser/analyze.h b/src/include/parser/analyze.h
index e10270ff0ffd..b2db6fa4e8c4 100644
--- a/src/include/parser/analyze.h
+++ b/src/include/parser/analyze.h
@@ -21,7 +21,7 @@
 /* Hook for plugins to get control at end of parse analysis */
 typedef void (*post_parse_analyze_hook_type) (ParseState *pstate,
 											  Query *query,
-											  JumbleState *jstate);
+											  const JumbleState *jstate);
 extern PGDLLIMPORT post_parse_analyze_hook_type post_parse_analyze_hook;
 
 
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 6cb14824ec3b..0b6be5e255a4 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -335,7 +335,7 @@ static void pgss_shmem_request(void);
 static void pgss_shmem_startup(void);
 static void pgss_shmem_shutdown(int code, Datum arg);
 static void pgss_post_parse_analyze(ParseState *pstate, Query *query,
-									JumbleState *jstate);
+									const JumbleState *jstate);
 static PlannedStmt *pgss_planner(Query *parse,
 								 const char *query_string,
 								 int cursorOptions,
@@ -359,7 +359,7 @@ static void pgss_store(const char *query, int64 queryId,
 					   const BufferUsage *bufusage,
 					   const WalUsage *walusage,
 					   const struct JitInstrumentation *jitusage,
-					   JumbleState *jstate,
+					   const JumbleState *jstate,
 					   int parallel_workers_to_launch,
 					   int parallel_workers_launched,
 					   PlannedStmtOrigin planOrigin);
@@ -378,13 +378,14 @@ static char *qtext_fetch(Size query_offset, int query_len,
 static bool need_gc_qtexts(void);
 static void gc_qtexts(void);
 static TimestampTz entry_reset(Oid userid, Oid dbid, int64 queryid, bool minmax_only);
-static char *generate_normalized_query(JumbleState *jstate, const char *query,
+static char *generate_normalized_query(const JumbleState *jstate,
+									   const char *query,
 									   int query_loc, int *query_len_p);
-static void fill_in_constant_lengths(JumbleState *jstate, const char *query,
-									 int query_loc);
+static LocationLen *compute_constant_lengths(const JumbleState *jstate,
+											 const char *query,
+											 int query_loc);
 static int	comp_location(const void *a, const void *b);
 
-
 /*
  * Module load callback
  */
@@ -840,7 +841,7 @@ error:
  * Post-parse-analysis hook: mark query with a queryId
  */
 static void
-pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
+pgss_post_parse_analyze(ParseState *pstate, Query *query, const JumbleState *jstate)
 {
 	if (prev_post_parse_analyze_hook)
 		prev_post_parse_analyze_hook(pstate, query, jstate);
@@ -1297,7 +1298,7 @@ pgss_store(const char *query, int64 queryId,
 		   const BufferUsage *bufusage,
 		   const WalUsage *walusage,
 		   const struct JitInstrumentation *jitusage,
-		   JumbleState *jstate,
+		   const JumbleState *jstate,
 		   int parallel_workers_to_launch,
 		   int parallel_workers_launched,
 		   PlannedStmtOrigin planOrigin)
@@ -2845,7 +2846,7 @@ release_lock:
  * Returns a palloc'd string.
  */
 static char *
-generate_normalized_query(JumbleState *jstate, const char *query,
+generate_normalized_query(const JumbleState *jstate, const char *query,
 						  int query_loc, int *query_len_p)
 {
 	char	   *norm_query;
@@ -2857,12 +2858,14 @@ generate_normalized_query(JumbleState *jstate, const char *query,
 				last_off = 0,	/* Offset from start for previous tok */
 				last_tok_len = 0;	/* Length (in bytes) of that tok */
 	int			num_constants_replaced = 0;
+	LocationLen *locs = NULL;
 
 	/*
-	 * Get constants' lengths (core system only gives us locations).  Note
-	 * this also ensures the items are sorted by location.
+	 * Determine constants' lengths (core system only gives us locations),
+	 * and return a sorted copy of jstate's LocationLen data with lengths
+	 * filled in.
 	 */
-	fill_in_constant_lengths(jstate, query, query_loc);
+	locs = compute_constant_lengths(jstate, query, query_loc);
 
 	/*
 	 * Allow for $n symbols to be longer than the constants they replace.
@@ -2888,15 +2891,15 @@ generate_normalized_query(JumbleState *jstate, const char *query,
 		 * the parameter in the next iteration (or after the loop is done),
 		 * which is a bit odd but seems to work okay in most cases.
 		 */
-		if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
+		if (locs[i].extern_param && !jstate->has_squashed_lists)
 			continue;
 
-		off = jstate->clocations[i].location;
+		off = locs[i].location;
 
 		/* Adjust recorded location if we're dealing with partial string */
 		off -= query_loc;
 
-		tok_len = jstate->clocations[i].length;
+		tok_len = locs[i].length;
 
 		if (tok_len < 0)
 			continue;			/* ignore any duplicates */
@@ -2915,7 +2918,7 @@ generate_normalized_query(JumbleState *jstate, const char *query,
 		 */
 		n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
 							  num_constants_replaced + 1 + jstate->highest_extern_param_id,
-							  jstate->clocations[i].squashed ? " /*, ... */" : "");
+							  locs[i].squashed ? " /*, ... */" : "");
 		num_constants_replaced++;
 
 		/* move forward */
@@ -2924,6 +2927,10 @@ generate_normalized_query(JumbleState *jstate, const char *query,
 		last_tok_len = tok_len;
 	}
 
+	/* Clean up, if needed */
+	if (locs)
+		pfree(locs);
+
 	/*
 	 * We've copied up until the last ignorable constant.  Copy over the
 	 * remaining bytes of the original query string.
@@ -2942,8 +2949,9 @@ generate_normalized_query(JumbleState *jstate, const char *query,
 }
 
 /*
- * Given a valid SQL string and an array of constant-location records,
- * fill in the textual lengths of those constants.
+ * Given a valid SQL string and an array of constant-location records, return
+ * the textual lengths of those constants in a newly allocated LocationLen
+ * array, or NULL if there are no constants.
  *
  * The constants may use any allowed constant syntax, such as float literals,
  * bit-strings, single-quoted strings and dollar-quoted strings.  This is
@@ -2959,16 +2967,19 @@ generate_normalized_query(JumbleState *jstate, const char *query,
  * past the first to -1 so that they can later be ignored.
  *
  * If query_loc > 0, then "query" has been advanced by that much compared to
- * the original string start, so we need to translate the provided locations
- * to compensate.  (This lets us avoid re-scanning statements before the one
- * of interest, so it's worth doing.)
+ * the original string start, as is the case with multi-statement strings, so
+ * we need to translate the provided locations to compensate.  (This lets us
+ * avoid re-scanning statements before the one of interest, so it's worth
+ * doing.)
  *
  * N.B. There is an assumption that a '-' character at a Const location begins
  * a negative numeric constant.  This precludes there ever being another
  * reason for a constant to start with a '-'.
+ *
+ * It is the caller's responsibility to free the result, if necessary.
  */
-static void
-fill_in_constant_lengths(JumbleState *jstate, const char *query,
+static LocationLen *
+compute_constant_lengths(const JumbleState *jstate, const char *query,
 						 int query_loc)
 {
 	LocationLen *locs;
@@ -2977,14 +2988,20 @@ fill_in_constant_lengths(JumbleState *jstate, const char *query,
 	core_YYSTYPE yylval;
 	YYLTYPE		yylloc;
 
+	if (jstate->clocations_count == 0)
+		return NULL;
+
+	/* Copy constant locations to avoid modifying jstate */
+	locs = palloc_array(LocationLen, jstate->clocations_count);
+	memcpy(locs, jstate->clocations, jstate->clocations_count * sizeof(LocationLen));
+
 	/*
 	 * Sort the records by location so that we can process them in order while
 	 * scanning the query text.
 	 */
 	if (jstate->clocations_count > 1)
-		qsort(jstate->clocations, jstate->clocations_count,
+		qsort(locs, jstate->clocations_count,
 			  sizeof(LocationLen), comp_location);
-	locs = jstate->clocations;
 
 	/* initialize the flex scanner --- should match raw_parser() */
 	yyscanner = scanner_init(query,
@@ -3008,7 +3025,11 @@ fill_in_constant_lengths(JumbleState *jstate, const char *query,
 		if (locs[i].squashed)
 			continue;			/* squashable list, ignore */
 
-		/* Adjust recorded location if we're dealing with partial string */
+		/*
+		 * Adjust the constant's location using the provided starting location
+		 * of the current statement.  This allows us to avoid scanning a
+		 * multi-statement string from the beginning.
+		 */
 		loc = locs[i].location - query_loc;
 		Assert(loc >= 0);
 
@@ -3064,6 +3085,8 @@ fill_in_constant_lengths(JumbleState *jstate, const char *query,
 	}
 
 	scanner_finish(yyscanner);
+
+	return locs;
 }
 
 /*
-- 
2.53.0



Attachments:

  [text/plain] v8-0001-Make-JumbleState-const-in-post_parse_analyze-hook.patch (9.6K, 2-v8-0001-Make-JumbleState-const-in-post_parse_analyze-hook.patch)
  download | inline diff:
From 336e6aa38d6dc3901c3fa1f746796a0b8fc87cd2 Mon Sep 17 00:00:00 2001
From: Michael Paquier <[email protected]>
Date: Fri, 27 Mar 2026 14:13:00 +0900
Subject: [PATCH v8] Make JumbleState const in post_parse_analyze hook

Change the post_parse_analyze_hook_type signature to take a const
JumbleState parameter, preventing hooks from modifying the jumble
state during query analysis. This improves API safety by making it
clear that hooks should only read from the jumble state, not modify
it.

Update pg_stat_statements and related functions to match the new const
signature.  Refactor fill_in_constant_lengths() to return a newly
allocated LocationLen array instead of modifying the JumbleState, so as
PGSS does not touch the now-const JumbleState.  The routine is renamed
to compute_constant_lengths().

Discussion: https://postgr.es/m/CAA5RZ0tZp5qU0ikZEEqJnxvdSNGh1DWv80sb-k4QAUmiMoOp_Q@mail.gmail.com
---
 src/include/parser/analyze.h                  |  2 +-
 .../pg_stat_statements/pg_stat_statements.c   | 75 ++++++++++++-------
 2 files changed, 50 insertions(+), 27 deletions(-)

diff --git a/src/include/parser/analyze.h b/src/include/parser/analyze.h
index e10270ff0ffd..b2db6fa4e8c4 100644
--- a/src/include/parser/analyze.h
+++ b/src/include/parser/analyze.h
@@ -21,7 +21,7 @@
 /* Hook for plugins to get control at end of parse analysis */
 typedef void (*post_parse_analyze_hook_type) (ParseState *pstate,
 											  Query *query,
-											  JumbleState *jstate);
+											  const JumbleState *jstate);
 extern PGDLLIMPORT post_parse_analyze_hook_type post_parse_analyze_hook;
 
 
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 6cb14824ec3b..0b6be5e255a4 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -335,7 +335,7 @@ static void pgss_shmem_request(void);
 static void pgss_shmem_startup(void);
 static void pgss_shmem_shutdown(int code, Datum arg);
 static void pgss_post_parse_analyze(ParseState *pstate, Query *query,
-									JumbleState *jstate);
+									const JumbleState *jstate);
 static PlannedStmt *pgss_planner(Query *parse,
 								 const char *query_string,
 								 int cursorOptions,
@@ -359,7 +359,7 @@ static void pgss_store(const char *query, int64 queryId,
 					   const BufferUsage *bufusage,
 					   const WalUsage *walusage,
 					   const struct JitInstrumentation *jitusage,
-					   JumbleState *jstate,
+					   const JumbleState *jstate,
 					   int parallel_workers_to_launch,
 					   int parallel_workers_launched,
 					   PlannedStmtOrigin planOrigin);
@@ -378,13 +378,14 @@ static char *qtext_fetch(Size query_offset, int query_len,
 static bool need_gc_qtexts(void);
 static void gc_qtexts(void);
 static TimestampTz entry_reset(Oid userid, Oid dbid, int64 queryid, bool minmax_only);
-static char *generate_normalized_query(JumbleState *jstate, const char *query,
+static char *generate_normalized_query(const JumbleState *jstate,
+									   const char *query,
 									   int query_loc, int *query_len_p);
-static void fill_in_constant_lengths(JumbleState *jstate, const char *query,
-									 int query_loc);
+static LocationLen *compute_constant_lengths(const JumbleState *jstate,
+											 const char *query,
+											 int query_loc);
 static int	comp_location(const void *a, const void *b);
 
-
 /*
  * Module load callback
  */
@@ -840,7 +841,7 @@ error:
  * Post-parse-analysis hook: mark query with a queryId
  */
 static void
-pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
+pgss_post_parse_analyze(ParseState *pstate, Query *query, const JumbleState *jstate)
 {
 	if (prev_post_parse_analyze_hook)
 		prev_post_parse_analyze_hook(pstate, query, jstate);
@@ -1297,7 +1298,7 @@ pgss_store(const char *query, int64 queryId,
 		   const BufferUsage *bufusage,
 		   const WalUsage *walusage,
 		   const struct JitInstrumentation *jitusage,
-		   JumbleState *jstate,
+		   const JumbleState *jstate,
 		   int parallel_workers_to_launch,
 		   int parallel_workers_launched,
 		   PlannedStmtOrigin planOrigin)
@@ -2845,7 +2846,7 @@ release_lock:
  * Returns a palloc'd string.
  */
 static char *
-generate_normalized_query(JumbleState *jstate, const char *query,
+generate_normalized_query(const JumbleState *jstate, const char *query,
 						  int query_loc, int *query_len_p)
 {
 	char	   *norm_query;
@@ -2857,12 +2858,14 @@ generate_normalized_query(JumbleState *jstate, const char *query,
 				last_off = 0,	/* Offset from start for previous tok */
 				last_tok_len = 0;	/* Length (in bytes) of that tok */
 	int			num_constants_replaced = 0;
+	LocationLen *locs = NULL;
 
 	/*
-	 * Get constants' lengths (core system only gives us locations).  Note
-	 * this also ensures the items are sorted by location.
+	 * Determine constants' lengths (core system only gives us locations),
+	 * and return a sorted copy of jstate's LocationLen data with lengths
+	 * filled in.
 	 */
-	fill_in_constant_lengths(jstate, query, query_loc);
+	locs = compute_constant_lengths(jstate, query, query_loc);
 
 	/*
 	 * Allow for $n symbols to be longer than the constants they replace.
@@ -2888,15 +2891,15 @@ generate_normalized_query(JumbleState *jstate, const char *query,
 		 * the parameter in the next iteration (or after the loop is done),
 		 * which is a bit odd but seems to work okay in most cases.
 		 */
-		if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
+		if (locs[i].extern_param && !jstate->has_squashed_lists)
 			continue;
 
-		off = jstate->clocations[i].location;
+		off = locs[i].location;
 
 		/* Adjust recorded location if we're dealing with partial string */
 		off -= query_loc;
 
-		tok_len = jstate->clocations[i].length;
+		tok_len = locs[i].length;
 
 		if (tok_len < 0)
 			continue;			/* ignore any duplicates */
@@ -2915,7 +2918,7 @@ generate_normalized_query(JumbleState *jstate, const char *query,
 		 */
 		n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
 							  num_constants_replaced + 1 + jstate->highest_extern_param_id,
-							  jstate->clocations[i].squashed ? " /*, ... */" : "");
+							  locs[i].squashed ? " /*, ... */" : "");
 		num_constants_replaced++;
 
 		/* move forward */
@@ -2924,6 +2927,10 @@ generate_normalized_query(JumbleState *jstate, const char *query,
 		last_tok_len = tok_len;
 	}
 
+	/* Clean up, if needed */
+	if (locs)
+		pfree(locs);
+
 	/*
 	 * We've copied up until the last ignorable constant.  Copy over the
 	 * remaining bytes of the original query string.
@@ -2942,8 +2949,9 @@ generate_normalized_query(JumbleState *jstate, const char *query,
 }
 
 /*
- * Given a valid SQL string and an array of constant-location records,
- * fill in the textual lengths of those constants.
+ * Given a valid SQL string and an array of constant-location records, return
+ * the textual lengths of those constants in a newly allocated LocationLen
+ * array, or NULL if there are no constants.
  *
  * The constants may use any allowed constant syntax, such as float literals,
  * bit-strings, single-quoted strings and dollar-quoted strings.  This is
@@ -2959,16 +2967,19 @@ generate_normalized_query(JumbleState *jstate, const char *query,
  * past the first to -1 so that they can later be ignored.
  *
  * If query_loc > 0, then "query" has been advanced by that much compared to
- * the original string start, so we need to translate the provided locations
- * to compensate.  (This lets us avoid re-scanning statements before the one
- * of interest, so it's worth doing.)
+ * the original string start, as is the case with multi-statement strings, so
+ * we need to translate the provided locations to compensate.  (This lets us
+ * avoid re-scanning statements before the one of interest, so it's worth
+ * doing.)
  *
  * N.B. There is an assumption that a '-' character at a Const location begins
  * a negative numeric constant.  This precludes there ever being another
  * reason for a constant to start with a '-'.
+ *
+ * It is the caller's responsibility to free the result, if necessary.
  */
-static void
-fill_in_constant_lengths(JumbleState *jstate, const char *query,
+static LocationLen *
+compute_constant_lengths(const JumbleState *jstate, const char *query,
 						 int query_loc)
 {
 	LocationLen *locs;
@@ -2977,14 +2988,20 @@ fill_in_constant_lengths(JumbleState *jstate, const char *query,
 	core_YYSTYPE yylval;
 	YYLTYPE		yylloc;
 
+	if (jstate->clocations_count == 0)
+		return NULL;
+
+	/* Copy constant locations to avoid modifying jstate */
+	locs = palloc_array(LocationLen, jstate->clocations_count);
+	memcpy(locs, jstate->clocations, jstate->clocations_count * sizeof(LocationLen));
+
 	/*
 	 * Sort the records by location so that we can process them in order while
 	 * scanning the query text.
 	 */
 	if (jstate->clocations_count > 1)
-		qsort(jstate->clocations, jstate->clocations_count,
+		qsort(locs, jstate->clocations_count,
 			  sizeof(LocationLen), comp_location);
-	locs = jstate->clocations;
 
 	/* initialize the flex scanner --- should match raw_parser() */
 	yyscanner = scanner_init(query,
@@ -3008,7 +3025,11 @@ fill_in_constant_lengths(JumbleState *jstate, const char *query,
 		if (locs[i].squashed)
 			continue;			/* squashable list, ignore */
 
-		/* Adjust recorded location if we're dealing with partial string */
+		/*
+		 * Adjust the constant's location using the provided starting location
+		 * of the current statement.  This allows us to avoid scanning a
+		 * multi-statement string from the beginning.
+		 */
 		loc = locs[i].location - query_loc;
 		Assert(loc >= 0);
 
@@ -3064,6 +3085,8 @@ fill_in_constant_lengths(JumbleState *jstate, const char *query,
 	}
 
 	scanner_finish(yyscanner);
+
+	return locs;
 }
 
 /*
-- 
2.53.0



  [application/pgp-signature] signature.asc (833B, 3-signature.asc)
  download

^ permalink  raw  reply  [nested|flat] 14+ messages in thread

* Re: Refactor query normalization into core query jumbling
@ 2026-03-27 07:19  Lukas Fittl <[email protected]>
  parent: Michael Paquier <[email protected]>
  0 siblings, 1 reply; 14+ messages in thread

From: Lukas Fittl @ 2026-03-27 07:19 UTC (permalink / raw)
  To: Michael Paquier <[email protected]>; +Cc: Sami Imseih <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>

On Thu, Mar 26, 2026 at 10:18 PM Michael Paquier <[email protected]> wrote:
> This line of arguments is stronger for the normalization of the query
> string.  Why should the core code decide what a normalized string
> should look like when it comes to the detection of the constants, if
> any?  Instead of a dollar-quoted number, we could enforce a bunch of
> things, like a '?' or a '$woozah$' at these locations.

Fair enough, though I haven't seen any extensions that do that in
practice - its reasonable to have normalization result in a query
string that's parsable again and can be passed to EXPLAIN
(GENERIC_PLAN).

> Saying that, the key point of the patch is to take a copy of the
> JumbleState locations, and recompute it for the needs of PGSS for the
> sake of the normalized query representation.  Hence, why don't we just
> do that at the end?  That should be enough to enforce the const marker
> for the JumbleState across all the loaded extensions that want to look
> at it.  This leads me to the simpler patch attached.
>
> Comments or tomatoes?  That's simpler, and I'd be OK with just that
> for v19.  That would be much better than the current state of affairs,
> where PGSS decides to enforce its own ideas in the JumbleState, ideas
> fed to anything looping into the post-parse-analyze hook after PGSS.

I'm not convinced that making the const change alone is a good idea,
without also providing some helpers to reduce the repeated code in
extensions.

What if we only put the ComputeConstantLengths (as Sami had it in v7)
in core, together with making JumbleState const?

Thanks,
Lukas

-- 
Lukas Fittl





^ permalink  raw  reply  [nested|flat] 14+ messages in thread

* Re: Refactor query normalization into core query jumbling
@ 2026-03-27 17:09  Sami Imseih <[email protected]>
  parent: Lukas Fittl <[email protected]>
  0 siblings, 1 reply; 14+ messages in thread

From: Sami Imseih @ 2026-03-27 17:09 UTC (permalink / raw)
  To: Lukas Fittl <[email protected]>; +Cc: Michael Paquier <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>

> On Thu, Mar 26, 2026 at 10:18 PM Michael Paquier <[email protected]> wrote:
> > This line of arguments is stronger for the normalization of the query
> > string.  Why should the core code decide what a normalized string
> > should look like when it comes to the detection of the constants, if
> > any?  Instead of a dollar-quoted number, we could enforce a bunch of
> > things, like a '?' or a '$woozah$' at these locations.
>
> Fair enough, though I haven't seen any extensions that do that in
> practice - its reasonable to have normalization result in a query
> string that's parsable again and can be passed to EXPLAIN
> (GENERIC_PLAN).

with regards to generate_normalized_query, AFAICT, the most common
case is extensions are using it for is dollar quoted number, but I agree
this one is a gray area.

> What if we only put the ComputeConstantLengths (as Sami had it in v7)
> in core, together with making JumbleState const?

I agree that ComputeConstantLengths should be in core. This one is
not a gray area IMO. The query jumble already records constant locations,
but leaves the lengths unset. ComputeConstantLengths is just the
completion of that  work. There could be no other interpretation,
unlike generate_normalized_query, of what the lengths should be.

--
Sami Imseih
Amazon Web Services (AWS)





^ permalink  raw  reply  [nested|flat] 14+ messages in thread

* Re: Refactor query normalization into core query jumbling
@ 2026-03-28 01:42  Michael Paquier <[email protected]>
  parent: Sami Imseih <[email protected]>
  0 siblings, 1 reply; 14+ messages in thread

From: Michael Paquier @ 2026-03-28 01:42 UTC (permalink / raw)
  To: Sami Imseih <[email protected]>; +Cc: Lukas Fittl <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>



> On Mar 28, 2026, at 2:09, Sami Imseih <[email protected]> wrote:
> I agree that ComputeConstantLengths should be in core. This one is
> not a gray area IMO. The query jumble already records constant locations,
> but leaves the lengths unset. ComputeConstantLengths is just the
> completion of that  work. There could be no other interpretation,
> unlike generate_normalized_query, of what the lengths should be.

This is an interesting remark. One problem with any patches presented yet is that we don’t attach the lengths as part of the in-core query jumbling procedure: we plug the lengths later using the same structure as the query jumbling. It seems to me that this is half-baked overall. I think that we don’t want to pay the extra length computation in the core query jumbling at the end, then why does it make sense to include the lengths in the JumbleState structure at all? Shouldn’t we use a different structure filled inside PGSS for this purpose rather than reuse the same thing for PGSS and the in-core part (don’t have the code in front of me now).
--
Michael







^ permalink  raw  reply  [nested|flat] 14+ messages in thread

* Re: Refactor query normalization into core query jumbling
@ 2026-03-30 05:21  Lukas Fittl <[email protected]>
  parent: Michael Paquier <[email protected]>
  0 siblings, 1 reply; 14+ messages in thread

From: Lukas Fittl @ 2026-03-30 05:21 UTC (permalink / raw)
  To: Michael Paquier <[email protected]>; +Cc: Sami Imseih <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>

On Fri, Mar 27, 2026 at 6:42 PM Michael Paquier <[email protected]> wrote:
>
> > On Mar 28, 2026, at 2:09, Sami Imseih <[email protected]> wrote:
> > I agree that ComputeConstantLengths should be in core. This one is
> > not a gray area IMO. The query jumble already records constant locations,
> > but leaves the lengths unset. ComputeConstantLengths is just the
> > completion of that  work. There could be no other interpretation,
> > unlike generate_normalized_query, of what the lengths should be.
>
> This is an interesting remark. One problem with any patches presented yet is that we don’t attach the lengths as part of the in-core query jumbling procedure: we plug the lengths later using the same structure as the query jumbling. It seems to me that this is half-baked overall. I think that we don’t want to pay the extra length computation in the core query jumbling at the end, then why does it make sense to include the lengths in the JumbleState structure at all? Shouldn’t we use a different structure filled inside PGSS for this purpose rather than reuse the same thing for PGSS and the in-core part (don’t have the code in front of me now).

I see where you're coming from on that, but I don't think we can
remove anything here in practice:

typedef struct LocationLen
{
    int            location; /* Required */
    int            length; /* Set by _jumbleElements */
    bool        squashed; /* Set by RecordConstLocation called from
_jumbleElements */
    bool        extern_param; /* Set by RecordConstLocation called
from  _jumbleParam */
} LocationLen;

typedef struct JumbleState
{
    unsigned char *jumble; /* Required */
    Size        jumble_len; /* Required */
    LocationLen *clocations; /* Required to track constant locations */
    int            clocations_buf_size; /* Required to track constant
locations */
    int            clocations_count; /* Required to track constant locations */
    int            highest_extern_param_id; /* Set by _jumbleParam */
    bool        has_squashed_lists; /* Set by _jumbleElements */
    unsigned int pending_nulls; /* Required */
    Size        total_jumble_len; /* Required */
} JumbleState;

The only refactoring I could think of is to write out the
_jumbleElements information separately, then we could actually drop
length, and maybe squashed, from LocationLen. But I'm not sure that
really buys us much? It would be more clear I guess, because the mixed
locations of where length gets set is indeed a bit odd.

I still think it'd be reasonable for us to include
ComputeConstantLengths in core to complete the picture of what we're
doing with _jumbleElements and the length field already anyway. Its
basically a way to fully hydrate the partially filled out JumbleState
from the initial jumble.

Thanks,
Lukas

-- 
Lukas Fittl





^ permalink  raw  reply  [nested|flat] 14+ messages in thread

* Re: Refactor query normalization into core query jumbling
@ 2026-03-30 19:20  Sami Imseih <[email protected]>
  parent: Lukas Fittl <[email protected]>
  0 siblings, 1 reply; 14+ messages in thread

From: Sami Imseih @ 2026-03-30 19:20 UTC (permalink / raw)
  To: Lukas Fittl <[email protected]>; +Cc: Michael Paquier <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>

> > > I agree that ComputeConstantLengths should be in core. This one is
> > > not a gray area IMO. The query jumble already records constant locations,
> > > but leaves the lengths unset. ComputeConstantLengths is just the
> > > completion of that  work. There could be no other interpretation,
> > > unlike generate_normalized_query, of what the lengths should be.
> >
> > This is an interesting remark. One problem with any patches presented yet is that we don’t attach the lengths as part of the in-core query jumbling procedure: we plug the lengths later using the same structure as the query jumbling. It seems to me that this is half-baked overall. I think that we don’t want to pay the extra length computation in the core query jumbling at the end, then why does it make sense to include the lengths in the JumbleState structure at all? Shouldn’t we use a different structure filled inside PGSS for this purpose rather than reuse the same thing for PGSS and the in-core part (don’t have the code in front of me now).
>
> I see where you're coming from on that, but I don't think we can
> remove anything here in practice:

Yes. Not unless we want to rely on the parser to track the lengths in
Const, which could be invasive, but I have not looked into it.

Paying the length computation at the end of every jumble is also
going to impact performance, so that will not be a good idea.

> I still think it'd be reasonable for us to include
> ComputeConstantLengths in core to complete the picture of what we're
> doing with _jumbleElements and the length field already anyway. Its
> basically a way to fully hydrate the partially filled out JumbleState
> from the initial jumble.

I fully agree, ComputeConstantLengths is an optional post-jumble-query step
for a consumer that wishes to calculate the lengths. The length calculation
is not unique to a plug-in, so in my mind the work it's doing is core
jumbling functionality.

--
Sami Imseih
Amazon Web Services (AWS)





^ permalink  raw  reply  [nested|flat] 14+ messages in thread

* Re: Refactor query normalization into core query jumbling
@ 2026-03-30 23:52  Michael Paquier <[email protected]>
  parent: Sami Imseih <[email protected]>
  0 siblings, 0 replies; 14+ messages in thread

From: Michael Paquier @ 2026-03-30 23:52 UTC (permalink / raw)
  To: Sami Imseih <[email protected]>; +Cc: Lukas Fittl <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>

On Mon, Mar 30, 2026 at 02:20:47PM -0500, Sami Imseih wrote:
>> I see where you're coming from on that, but I don't think we can
>> remove anything here in practice:
> 
> Yes. Not unless we want to rely on the parser to track the lengths in
> Const, which could be invasive, but I have not looked into it.

Hmm.  We may not want to get down to that, still that would be
cheaper than reprocessing the parsing of the query string twice. 

>> I still think it'd be reasonable for us to include
>> ComputeConstantLengths in core to complete the picture of what we're
>> doing with _jumbleElements and the length field already anyway. Its
>> basically a way to fully hydrate the partially filled out JumbleState
>> from the initial jumble.
> 
> I fully agree, ComputeConstantLengths is an optional post-jumble-query step
> for a consumer that wishes to calculate the lengths. The length calculation
> is not unique to a plug-in, so in my mind the work it's doing is core
> jumbling functionality.

Okay.  I could fall into that for this release.  Marking the
JumbleState as a const is the most important piece here.  I'm +-0
regarding this routine, but I can also see your point about how it's
useful to give at least the option to extensions to have a
recomputation of the Const lengths, the same way as PGSS.  What are
the extensions that would use that?
--
Michael


Attachments:

  [application/pgp-signature] signature.asc (833B, 2-signature.asc)
  download

^ permalink  raw  reply  [nested|flat] 14+ messages in thread


end of thread, other threads:[~2026-03-30 23:52 UTC | newest]

Thread overview: 14+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-01-06 20:20 Re: Refactor query normalization into core query jumbling Sami Imseih <[email protected]>
2026-03-17 02:12 ` Sami Imseih <[email protected]>
2026-03-26 01:34   ` Lukas Fittl <[email protected]>
2026-03-26 03:31     ` Sami Imseih <[email protected]>
2026-03-26 16:35       ` Lukas Fittl <[email protected]>
2026-03-26 17:18         ` Sami Imseih <[email protected]>
2026-03-27 01:50           ` Lukas Fittl <[email protected]>
2026-03-27 05:17             ` Michael Paquier <[email protected]>
2026-03-27 07:19               ` Lukas Fittl <[email protected]>
2026-03-27 17:09                 ` Sami Imseih <[email protected]>
2026-03-28 01:42                   ` Michael Paquier <[email protected]>
2026-03-30 05:21                     ` Lukas Fittl <[email protected]>
2026-03-30 19:20                       ` Sami Imseih <[email protected]>
2026-03-30 23:52                         ` Michael Paquier <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox