public inbox for [email protected]
help / color / mirror / Atom feedRefactor query normalization into core query jumbling
35+ messages / 5 participants
[nested] [flat]
* Refactor query normalization into core query jumbling
@ 2025-12-19 00:44 Sami Imseih <[email protected]>
0 siblings, 1 reply; 35+ messages in thread
From: Sami Imseih @ 2025-12-19 00:44 UTC (permalink / raw)
To: pgsql-hackers
Hi,
A while back, there was a discussion about moving the query
normalization code out of pg_stat_statements and into core
query jumbling [0]. This conversation also led to hardening
the hooks (currently only post_parse_analyze) that receive
JumbleState [1].
For the first point, `generate_normalized_query` takes the query
string and the jumbleState with constant locations from core
query jumbling. There is nothing uniquely special about
pg_stat_statements in this code, and it can be used by other extensions.
Extensions could even pass their own query string and JumbleState
if they choose. Patch 0001 moves this function and its helpers to
queryjumblefuncs.c.
A Github search also shows that there are quite a few extensions
that may be copying this code [2] [3], which means they will lose
out on potential improvements and fixes.
As part of this patch, the moved code itself did not change,
but I did improve a comment:
```
* If query_loc > 0, then "query" has been advanced by that much compared to
* the original string start, so we need to translate the provided locations
* to compensate. (This lets us avoid re-scanning statements before the one
* of interest, so it's worth doing.)
```
This comment appeared at the top of generate_normalized_query and
fill_in_constant_lengths. I simplified it and moved a shortened
version inside the functions.
Also, I did not think `fill_in_constant_lengths` should be a global
function, as I cannot think of a good reason it would be used
on its own, though someone may have a different opinion there.
For the second point, since JumbleState can be shared by multiple
extensions, hooks should receive it as a const pointer. This
signals read-only intent and prevents extensions from
accidentally modifying it through the hook. This change is in
0002. This does mean that extensions will need to update their
hooks, but I do not see that as an issue for a major version.
Thoughts?
[0] https://postgr.es/m/aQA9v9nLu5qsX8IE%40paquier.xyz
[1] https://postgr.es/m/202510281023.4u5aszccvsct%40alvherre.pgsql
[2] https://github.com/search?q=fill_in_constant_lengths&type=code
[3] https://github.com/search?q=generate_normalized_query&type=code
--
Sami Imseih
Amazon Web Services (AWS)
Attachments:
[application/octet-stream] v1-0002-Make-JumbleState-a-const-pointer-for-plug-ins.patch (4.2K, 2-v1-0002-Make-JumbleState-a-const-pointer-for-plug-ins.patch)
download | inline diff:
From 38494be7d8dd33089b27af2e8b9458a62f1c9df4 Mon Sep 17 00:00:00 2001
From: Ubuntu <[email protected]>
Date: Thu, 18 Dec 2025 19:14:10 +0000
Subject: [PATCH v1 2/2] Make JumbleState a const pointer for plug-ins
post_parse_analyze_hook currently exposes JumbleState without
protecting it against modifications by extensions. Since JumbleState
can be shared by multiple hooks, it should be passed as a const pointer.
This signals read-only intent and ensures extensions cannot accidentally
modify it through the hook.
This change updates pg_stat_statements functions and the relevant headers
to take `const JumbleState *` where appropriate.
Discussion: https://postgr.es/m/202510281023.4u5aszccvsct%40alvherre.pgsql
---
contrib/pg_stat_statements/pg_stat_statements.c | 8 ++++----
src/include/nodes/queryjumble.h | 7 ++++++-
src/include/parser/analyze.h | 2 +-
3 files changed, 11 insertions(+), 6 deletions(-)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 3eb29dfbd47..508627833e2 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -332,7 +332,7 @@ static void pgss_shmem_request(void);
static void pgss_shmem_startup(void);
static void pgss_shmem_shutdown(int code, Datum arg);
static void pgss_post_parse_analyze(ParseState *pstate, Query *query,
- JumbleState *jstate);
+ const JumbleState *jstate);
static PlannedStmt *pgss_planner(Query *parse,
const char *query_string,
int cursorOptions,
@@ -356,7 +356,7 @@ static void pgss_store(const char *query, int64 queryId,
const BufferUsage *bufusage,
const WalUsage *walusage,
const struct JitInstrumentation *jitusage,
- JumbleState *jstate,
+ const JumbleState *jstate,
int parallel_workers_to_launch,
int parallel_workers_launched,
PlannedStmtOrigin planOrigin);
@@ -832,7 +832,7 @@ error:
* Post-parse-analysis hook: mark query with a queryId
*/
static void
-pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
+pgss_post_parse_analyze(ParseState *pstate, Query *query, const JumbleState *jstate)
{
if (prev_post_parse_analyze_hook)
prev_post_parse_analyze_hook(pstate, query, jstate);
@@ -1289,7 +1289,7 @@ pgss_store(const char *query, int64 queryId,
const BufferUsage *bufusage,
const WalUsage *walusage,
const struct JitInstrumentation *jitusage,
- JumbleState *jstate,
+ const JumbleState *jstate,
int parallel_workers_to_launch,
int parallel_workers_launched,
PlannedStmtOrigin planOrigin)
diff --git a/src/include/nodes/queryjumble.h b/src/include/nodes/queryjumble.h
index 8207227e79e..5116896ec64 100644
--- a/src/include/nodes/queryjumble.h
+++ b/src/include/nodes/queryjumble.h
@@ -34,6 +34,11 @@ typedef struct LocationLen
/*
* Working state for computing a query jumble and producing a normalized
* query string
+ *
+ * When passed to hooks (e.g., post_parse_analyze_hook), this state is
+ * intended to be read-only. Extensions should not modify the contents;
+ * hooks must receive it as a 'const JumbleState *' to enforce this
+ * intent.
*/
typedef struct JumbleState
{
@@ -91,7 +96,7 @@ extern PGDLLIMPORT int compute_query_id;
extern const char *CleanQuerytext(const char *query, int *location, int *len);
-extern char *generate_normalized_query(JumbleState *jstate, const char *query,
+extern char *generate_normalized_query(const JumbleState *jstate, const char *query,
int query_loc, int *query_len_p);
extern JumbleState *JumbleQuery(Query *query);
extern void EnableQueryId(void);
diff --git a/src/include/parser/analyze.h b/src/include/parser/analyze.h
index f29ed03b476..f80da72933c 100644
--- a/src/include/parser/analyze.h
+++ b/src/include/parser/analyze.h
@@ -21,7 +21,7 @@
/* Hook for plugins to get control at end of parse analysis */
typedef void (*post_parse_analyze_hook_type) (ParseState *pstate,
Query *query,
- JumbleState *jstate);
+ const JumbleState *jstate);
extern PGDLLIMPORT post_parse_analyze_hook_type post_parse_analyze_hook;
--
2.43.0
[application/octet-stream] v1-0001-pg_stat_statements-Move-jumbling-related-function.patch (20.6K, 3-v1-0001-pg_stat_statements-Move-jumbling-related-function.patch)
download | inline diff:
From 4eb9fef9a0e83d6a66d9253387c311c07bac995d Mon Sep 17 00:00:00 2001
From: Ubuntu <[email protected]>
Date: Thu, 18 Dec 2025 18:59:31 +0000
Subject: [PATCH v1 1/2] pg_stat_statements: Move jumbling related functions
A normalized query string generated by generate_normalized_query can be
used by other extensions, as all will use the same query string and
constant locations generated by core query jumbling. There is nothing
specific to pg_stat_statements about this function, so it should be moved
to core as a global function. This also moves the related helper functions
fill_in_constant_length and comp_location.
Discussion: https://postgr.es/m/aQA9v9nLu5qsX8IE%40paquier.xyz
---
.../pg_stat_statements/pg_stat_statements.c | 270 +-----------------
src/backend/nodes/queryjumblefuncs.c | 252 ++++++++++++++++
src/include/nodes/queryjumble.h | 2 +
3 files changed, 261 insertions(+), 263 deletions(-)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 39208f80b5b..3eb29dfbd47 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -50,7 +50,6 @@
#include "access/htup_details.h"
#include "access/parallel.h"
#include "catalog/pg_authid.h"
-#include "common/int.h"
#include "executor/instrument.h"
#include "funcapi.h"
#include "jit/jit.h"
@@ -59,7 +58,6 @@
#include "nodes/queryjumble.h"
#include "optimizer/planner.h"
#include "parser/analyze.h"
-#include "parser/scanner.h"
#include "pgstat.h"
#include "storage/fd.h"
#include "storage/ipc.h"
@@ -377,11 +375,6 @@ static char *qtext_fetch(Size query_offset, int query_len,
static bool need_gc_qtexts(void);
static void gc_qtexts(void);
static TimestampTz entry_reset(Oid userid, Oid dbid, int64 queryid, bool minmax_only);
-static char *generate_normalized_query(JumbleState *jstate, const char *query,
- int query_loc, int *query_len_p);
-static void fill_in_constant_lengths(JumbleState *jstate, const char *query,
- int query_loc);
-static int comp_location(const void *a, const void *b);
/*
@@ -1359,6 +1352,13 @@ pgss_store(const char *query, int64 queryId,
if (jstate)
{
LWLockRelease(pgss->lock);
+
+ /*
+ * generate the normalized query. Note that the normalized
+ * representation may well vary depending on just which
+ * "equivalent" query is used to create the hashtable entry. We
+ * assume this is OK.
+ */
norm_query = generate_normalized_query(jstate, query,
query_location,
&query_len);
@@ -2823,259 +2823,3 @@ release_lock:
return stats_reset;
}
-
-/*
- * Generate a normalized version of the query string that will be used to
- * represent all similar queries.
- *
- * Note that the normalized representation may well vary depending on
- * just which "equivalent" query is used to create the hashtable entry.
- * We assume this is OK.
- *
- * If query_loc > 0, then "query" has been advanced by that much compared to
- * the original string start, so we need to translate the provided locations
- * to compensate. (This lets us avoid re-scanning statements before the one
- * of interest, so it's worth doing.)
- *
- * *query_len_p contains the input string length, and is updated with
- * the result string length on exit. The resulting string might be longer
- * or shorter depending on what happens with replacement of constants.
- *
- * Returns a palloc'd string.
- */
-static char *
-generate_normalized_query(JumbleState *jstate, const char *query,
- int query_loc, int *query_len_p)
-{
- char *norm_query;
- int query_len = *query_len_p;
- int norm_query_buflen, /* Space allowed for norm_query */
- len_to_wrt, /* Length (in bytes) to write */
- quer_loc = 0, /* Source query byte location */
- n_quer_loc = 0, /* Normalized query byte location */
- last_off = 0, /* Offset from start for previous tok */
- last_tok_len = 0; /* Length (in bytes) of that tok */
- int num_constants_replaced = 0;
-
- /*
- * Get constants' lengths (core system only gives us locations). Note
- * this also ensures the items are sorted by location.
- */
- fill_in_constant_lengths(jstate, query, query_loc);
-
- /*
- * Allow for $n symbols to be longer than the constants they replace.
- * Constants must take at least one byte in text form, while a $n symbol
- * certainly isn't more than 11 bytes, even if n reaches INT_MAX. We
- * could refine that limit based on the max value of n for the current
- * query, but it hardly seems worth any extra effort to do so.
- */
- norm_query_buflen = query_len + jstate->clocations_count * 10;
-
- /* Allocate result buffer */
- norm_query = palloc(norm_query_buflen + 1);
-
- for (int i = 0; i < jstate->clocations_count; i++)
- {
- int off, /* Offset from start for cur tok */
- tok_len; /* Length (in bytes) of that tok */
-
- /*
- * If we have an external param at this location, but no lists are
- * being squashed across the query, then we skip here; this will make
- * us print the characters found in the original query that represent
- * the parameter in the next iteration (or after the loop is done),
- * which is a bit odd but seems to work okay in most cases.
- */
- if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
- continue;
-
- off = jstate->clocations[i].location;
-
- /* Adjust recorded location if we're dealing with partial string */
- off -= query_loc;
-
- tok_len = jstate->clocations[i].length;
-
- if (tok_len < 0)
- continue; /* ignore any duplicates */
-
- /* Copy next chunk (what precedes the next constant) */
- len_to_wrt = off - last_off;
- len_to_wrt -= last_tok_len;
- Assert(len_to_wrt >= 0);
- memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
- n_quer_loc += len_to_wrt;
-
- /*
- * And insert a param symbol in place of the constant token; and, if
- * we have a squashable list, insert a placeholder comment starting
- * from the list's second value.
- */
- n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
- num_constants_replaced + 1 + jstate->highest_extern_param_id,
- jstate->clocations[i].squashed ? " /*, ... */" : "");
- num_constants_replaced++;
-
- /* move forward */
- quer_loc = off + tok_len;
- last_off = off;
- last_tok_len = tok_len;
- }
-
- /*
- * We've copied up until the last ignorable constant. Copy over the
- * remaining bytes of the original query string.
- */
- len_to_wrt = query_len - quer_loc;
-
- Assert(len_to_wrt >= 0);
- memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
- n_quer_loc += len_to_wrt;
-
- Assert(n_quer_loc <= norm_query_buflen);
- norm_query[n_quer_loc] = '\0';
-
- *query_len_p = n_quer_loc;
- return norm_query;
-}
-
-/*
- * Given a valid SQL string and an array of constant-location records,
- * fill in the textual lengths of those constants.
- *
- * The constants may use any allowed constant syntax, such as float literals,
- * bit-strings, single-quoted strings and dollar-quoted strings. This is
- * accomplished by using the public API for the core scanner.
- *
- * It is the caller's job to ensure that the string is a valid SQL statement
- * with constants at the indicated locations. Since in practice the string
- * has already been parsed, and the locations that the caller provides will
- * have originated from within the authoritative parser, this should not be
- * a problem.
- *
- * Multiple constants can have the same location. We reset lengths of those
- * past the first to -1 so that they can later be ignored.
- *
- * If query_loc > 0, then "query" has been advanced by that much compared to
- * the original string start, so we need to translate the provided locations
- * to compensate. (This lets us avoid re-scanning statements before the one
- * of interest, so it's worth doing.)
- *
- * N.B. There is an assumption that a '-' character at a Const location begins
- * a negative numeric constant. This precludes there ever being another
- * reason for a constant to start with a '-'.
- */
-static void
-fill_in_constant_lengths(JumbleState *jstate, const char *query,
- int query_loc)
-{
- LocationLen *locs;
- core_yyscan_t yyscanner;
- core_yy_extra_type yyextra;
- core_YYSTYPE yylval;
- YYLTYPE yylloc;
-
- /*
- * Sort the records by location so that we can process them in order while
- * scanning the query text.
- */
- if (jstate->clocations_count > 1)
- qsort(jstate->clocations, jstate->clocations_count,
- sizeof(LocationLen), comp_location);
- locs = jstate->clocations;
-
- /* initialize the flex scanner --- should match raw_parser() */
- yyscanner = scanner_init(query,
- &yyextra,
- &ScanKeywords,
- ScanKeywordTokens);
-
- /* we don't want to re-emit any escape string warnings */
- yyextra.escape_string_warning = false;
-
- /* Search for each constant, in sequence */
- for (int i = 0; i < jstate->clocations_count; i++)
- {
- int loc;
- int tok;
-
- /* Ignore constants after the first one in the same location */
- if (i > 0 && locs[i].location == locs[i - 1].location)
- {
- locs[i].length = -1;
- continue;
- }
-
- if (locs[i].squashed)
- continue; /* squashable list, ignore */
-
- /* Adjust recorded location if we're dealing with partial string */
- loc = locs[i].location - query_loc;
- Assert(loc >= 0);
-
- /*
- * We have a valid location for a constant that's not a dupe. Lex
- * tokens until we find the desired constant.
- */
- for (;;)
- {
- tok = core_yylex(&yylval, &yylloc, yyscanner);
-
- /* We should not hit end-of-string, but if we do, behave sanely */
- if (tok == 0)
- break; /* out of inner for-loop */
-
- /*
- * We should find the token position exactly, but if we somehow
- * run past it, work with that.
- */
- if (yylloc >= loc)
- {
- if (query[loc] == '-')
- {
- /*
- * It's a negative value - this is the one and only case
- * where we replace more than a single token.
- *
- * Do not compensate for the core system's special-case
- * adjustment of location to that of the leading '-'
- * operator in the event of a negative constant. It is
- * also useful for our purposes to start from the minus
- * symbol. In this way, queries like "select * from foo
- * where bar = 1" and "select * from foo where bar = -2"
- * will have identical normalized query strings.
- */
- tok = core_yylex(&yylval, &yylloc, yyscanner);
- if (tok == 0)
- break; /* out of inner for-loop */
- }
-
- /*
- * We now rely on the assumption that flex has placed a zero
- * byte after the text of the current token in scanbuf.
- */
- locs[i].length = strlen(yyextra.scanbuf + loc);
- break; /* out of inner for-loop */
- }
- }
-
- /* If we hit end-of-string, give up, leaving remaining lengths -1 */
- if (tok == 0)
- break;
- }
-
- scanner_finish(yyscanner);
-}
-
-/*
- * comp_location: comparator for qsorting LocationLen structs by location
- */
-static int
-comp_location(const void *a, const void *b)
-{
- int l = ((const LocationLen *) a)->location;
- int r = ((const LocationLen *) b)->location;
-
- return pg_cmp_s32(l, r);
-}
diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c
index ffc230af427..07511496779 100644
--- a/src/backend/nodes/queryjumblefuncs.c
+++ b/src/backend/nodes/queryjumblefuncs.c
@@ -40,10 +40,12 @@
#include "access/transam.h"
#include "catalog/pg_proc.h"
#include "common/hashfn.h"
+#include "common/int.h"
#include "miscadmin.h"
#include "nodes/nodeFuncs.h"
#include "nodes/queryjumble.h"
#include "utils/lsyscache.h"
+#include "parser/scanner.h"
#include "parser/scansup.h"
#define JUMBLE_SIZE 1024 /* query serialization buffer size */
@@ -633,6 +635,18 @@ _jumbleList(JumbleState *jstate, Node *node)
}
}
+/*
+ * comp_location: comparator for qsorting LocationLen structs by location
+ */
+static int
+comp_location(const void *a, const void *b)
+{
+ int l = ((const LocationLen *) a)->location;
+ int r = ((const LocationLen *) b)->location;
+
+ return pg_cmp_s32(l, r);
+}
+
/*
* We try to jumble lists of expressions as one individual item regardless
* of how many elements are in the list. This is know as squashing, which
@@ -773,3 +787,241 @@ _jumbleRangeTblEntry_eref(JumbleState *jstate,
*/
JUMBLE_STRING(aliasname);
}
+
+/*
+ * Given a valid SQL string and an array of constant-location records,
+ * fill in the textual lengths of those constants.
+ *
+ * The constants may use any allowed constant syntax, such as float literals,
+ * bit-strings, single-quoted strings and dollar-quoted strings. This is
+ * accomplished by using the public API for the core scanner.
+ *
+ * It is the caller's job to ensure that the string is a valid SQL statement
+ * with constants at the indicated locations. Since in practice the string
+ * has already been parsed, and the locations that the caller provides will
+ * have originated from within the authoritative parser, this should not be
+ * a problem.
+ *
+ * Multiple constants can have the same location. We reset lengths of those
+ * past the first to -1 so that they can later be ignored.
+ *
+ * N.B. There is an assumption that a '-' character at a Const location begins
+ * a negative numeric constant. This precludes there ever being another
+ * reason for a constant to start with a '-'.
+ */
+static void
+fill_in_constant_lengths(const JumbleState *jstate, const char *query,
+ int query_loc)
+{
+ LocationLen *locs;
+ core_yyscan_t yyscanner;
+ core_yy_extra_type yyextra;
+ core_YYSTYPE yylval;
+ YYLTYPE yylloc;
+
+ /*
+ * Sort the records by location so that we can process them in order while
+ * scanning the query text.
+ */
+ if (jstate->clocations_count > 1)
+ qsort(jstate->clocations, jstate->clocations_count,
+ sizeof(LocationLen), comp_location);
+ locs = jstate->clocations;
+
+ /* initialize the flex scanner --- should match raw_parser() */
+ yyscanner = scanner_init(query,
+ &yyextra,
+ &ScanKeywords,
+ ScanKeywordTokens);
+
+ /* we don't want to re-emit any escape string warnings */
+ yyextra.escape_string_warning = false;
+
+ /* Search for each constant, in sequence */
+ for (int i = 0; i < jstate->clocations_count; i++)
+ {
+ int loc;
+ int tok;
+
+ /* Ignore constants after the first one in the same location */
+ if (i > 0 && locs[i].location == locs[i - 1].location)
+ {
+ locs[i].length = -1;
+ continue;
+ }
+
+ if (locs[i].squashed)
+ continue; /* squashable list, ignore */
+
+ /*
+ * Adjust the constant's location using the provided starting location
+ * of the current statement. This allows us to avoid scanning a
+ * multi-statement string from the beginning.
+ */
+ loc = locs[i].location - query_loc;
+ Assert(loc >= 0);
+
+ /*
+ * We have a valid location for a constant that's not a dupe. Lex
+ * tokens until we find the desired constant.
+ */
+ for (;;)
+ {
+ tok = core_yylex(&yylval, &yylloc, yyscanner);
+
+ /* We should not hit end-of-string, but if we do, behave sanely */
+ if (tok == 0)
+ break; /* out of inner for-loop */
+
+ /*
+ * We should find the token position exactly, but if we somehow
+ * run past it, work with that.
+ */
+ if (yylloc >= loc)
+ {
+ if (query[loc] == '-')
+ {
+ /*
+ * It's a negative value - this is the one and only case
+ * where we replace more than a single token.
+ *
+ * Do not compensate for the core system's special-case
+ * adjustment of location to that of the leading '-'
+ * operator in the event of a negative constant. It is
+ * also useful for our purposes to start from the minus
+ * symbol. In this way, queries like "select * from foo
+ * where bar = 1" and "select * from foo where bar = -2"
+ * will have identical normalized query strings.
+ */
+ tok = core_yylex(&yylval, &yylloc, yyscanner);
+ if (tok == 0)
+ break; /* out of inner for-loop */
+ }
+
+ /*
+ * We now rely on the assumption that flex has placed a zero
+ * byte after the text of the current token in scanbuf.
+ */
+ locs[i].length = strlen(yyextra.scanbuf + loc);
+ break; /* out of inner for-loop */
+ }
+ }
+
+ /* If we hit end-of-string, give up, leaving remaining lengths -1 */
+ if (tok == 0)
+ break;
+ }
+
+ scanner_finish(yyscanner);
+}
+
+/*
+ * Generate a normalized version of the query string that will be used to
+ * represent all similar queries.
+ *
+ * Note that the normalized representation may well vary depending on
+ * just which "equivalent" query is used to create the hashtable entry.
+ * We assume this is OK.
+ *
+ * *query_len_p contains the input string length, and is updated with
+ * the result string length on exit. The resulting string might be longer
+ * or shorter depending on what happens with replacement of constants.
+ *
+ * Returns a palloc'd string that the caller must pfree.
+ */
+char *
+generate_normalized_query(const JumbleState *jstate, const char *query,
+ int query_loc, int *query_len_p)
+{
+ char *norm_query;
+ int query_len = *query_len_p;
+ int norm_query_buflen, /* Space allowed for norm_query */
+ len_to_wrt, /* Length (in bytes) to write */
+ quer_loc = 0, /* Source query byte location */
+ n_quer_loc = 0, /* Normalized query byte location */
+ last_off = 0, /* Offset from start for previous tok */
+ last_tok_len = 0; /* Length (in bytes) of that tok */
+ int num_constants_replaced = 0;
+
+ /*
+ * Get constants' lengths (core system only gives us locations). Note
+ * this also ensures the items are sorted by location.
+ */
+ fill_in_constant_lengths(jstate, query, query_loc);
+
+ /*
+ * Allow for $n symbols to be longer than the constants they replace.
+ * Constants must take at least one byte in text form, while a $n symbol
+ * certainly isn't more than 11 bytes, even if n reaches INT_MAX. We
+ * could refine that limit based on the max value of n for the current
+ * query, but it hardly seems worth any extra effort to do so.
+ */
+ norm_query_buflen = query_len + jstate->clocations_count * 10;
+
+ /* Allocate result buffer */
+ norm_query = palloc(norm_query_buflen + 1);
+
+ for (int i = 0; i < jstate->clocations_count; i++)
+ {
+ int off, /* Offset from start for cur tok */
+ tok_len; /* Length (in bytes) of that tok */
+
+ /*
+ * If we have an external param at this location, but no lists are
+ * being squashed across the query, then we skip here; this will make
+ * us print the characters found in the original query that represent
+ * the parameter in the next iteration (or after the loop is done),
+ * which is a bit odd but seems to work okay in most cases.
+ */
+ if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
+ continue;
+
+ off = jstate->clocations[i].location;
+
+ /* Adjust the constant's location (see fill_in_constant_lengths) */
+ off -= query_loc;
+
+ tok_len = jstate->clocations[i].length;
+
+ if (tok_len < 0)
+ continue; /* ignore any duplicates */
+
+ /* Copy next chunk (what precedes the next constant) */
+ len_to_wrt = off - last_off;
+ len_to_wrt -= last_tok_len;
+ Assert(len_to_wrt >= 0);
+ memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
+ n_quer_loc += len_to_wrt;
+
+ /*
+ * And insert a param symbol in place of the constant token; and, if
+ * we have a squashable list, insert a placeholder comment starting
+ * from the list's second value.
+ */
+ n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
+ num_constants_replaced + 1 + jstate->highest_extern_param_id,
+ jstate->clocations[i].squashed ? " /*, ... */" : "");
+ num_constants_replaced++;
+
+ /* move forward */
+ quer_loc = off + tok_len;
+ last_off = off;
+ last_tok_len = tok_len;
+ }
+
+ /*
+ * We've copied up until the last ignorable constant. Copy over the
+ * remaining bytes of the original query string.
+ */
+ len_to_wrt = query_len - quer_loc;
+
+ Assert(len_to_wrt >= 0);
+ memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
+ n_quer_loc += len_to_wrt;
+
+ Assert(n_quer_loc <= norm_query_buflen);
+ norm_query[n_quer_loc] = '\0';
+
+ *query_len_p = n_quer_loc;
+ return norm_query;
+}
diff --git a/src/include/nodes/queryjumble.h b/src/include/nodes/queryjumble.h
index dcb36dcb44f..8207227e79e 100644
--- a/src/include/nodes/queryjumble.h
+++ b/src/include/nodes/queryjumble.h
@@ -91,6 +91,8 @@ extern PGDLLIMPORT int compute_query_id;
extern const char *CleanQuerytext(const char *query, int *location, int *len);
+extern char *generate_normalized_query(JumbleState *jstate, const char *query,
+ int query_loc, int *query_len_p);
extern JumbleState *JumbleQuery(Query *query);
extern void EnableQueryId(void);
--
2.43.0
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2025-12-19 06:49 Bertrand Drouvot <[email protected]>
parent: Sami Imseih <[email protected]>
0 siblings, 1 reply; 35+ messages in thread
From: Bertrand Drouvot @ 2025-12-19 06:49 UTC (permalink / raw)
To: Sami Imseih <[email protected]>; +Cc: pgsql-hackers
Hi,
On Thu, Dec 18, 2025 at 06:44:17PM -0600, Sami Imseih wrote:
> For the second point, since JumbleState can be shared by multiple
> extensions, hooks should receive it as a const pointer. This
> signals read-only intent and prevents extensions from
> accidentally modifying it through the hook. This change is in
> 0002. This does mean that extensions will need to update their
> hooks, but I do not see that as an issue for a major version.
I can see that 0001 is doing:
-static void fill_in_constant_lengths(JumbleState *jstate, const char *query,
and then:
+static void
+fill_in_constant_lengths(const JumbleState *jstate, const char *query,
Should the const addition be in 0002? But...
While looking at fill_in_constant_lengths(), I can see that it is doing:
locs = jstate->clocations;
and then things like:
locs[i].length = -1
or
locs[i].length = strlen(yyextra.scanbuf + loc)
While this is technically correct so the compiler does not complain (because
clocations is a non const pointer in JumbleState and the added const does not
apply to what clocations points to), I think that adding const here is misleading.
Indeed, the function clearly modifies data accessible through the parameter.
Thoughts?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2025-12-19 21:36 Sami Imseih <[email protected]>
parent: Bertrand Drouvot <[email protected]>
0 siblings, 2 replies; 35+ messages in thread
From: Sami Imseih @ 2025-12-19 21:36 UTC (permalink / raw)
To: Bertrand Drouvot <[email protected]>; +Cc: pgsql-hackers
> While this is technically correct so the compiler does not complain (because
> clocations is a non const pointer in JumbleState and the added const does not
> apply to what clocations points to), I think that adding const here is misleading.
Yes, I am not happy with this. I initially thought it would be OK to
modify the JumbleState as long as it is done in a core function, but
after much thought, this is neither a good idea nor safe. If a pointer
is marked as a Constant, we should not modify it.
So, I went back to think about this, and the core problem as I see it
is that multiple hooks on the chain can modify the constant lengths. For
example, pg_stat_statements can now modify the constant lengths in
JumbleState via fill_in_constant_lengths, and the same JumbleState can
have its constant locations modified in a different way.
At this time, constant lengths are the only part of the JumbleState that
is not set by core. So, I don't think post_parse_analyze receiving
JumbleState as a constant is the right approach, nor do we need it.
I think we should allow JumbleState to define a normalization callback,
which defaults to a core normalization function rather than an
extension specific one:
```
jstate->normalize_query = GenerateNormalizedQuery;
```
This way, any extension that wishes to return a normalized string from
the same JumbleState can invoke this callback and get consistent results.
pg_stat_statements and other extensions with a need to normalize a query
string based on the locations of a JumbleState do not need to care about the
internals of normalization, they simply invoke the callback and
receive the final
string.
So v2-0001 implements this callback and moves the normalization logic
into core. Both changes must be done in the same patch. The comments
are also updated where they are no longer applicable or could be improved.
One additional improvement that this patch did not include is a bool in
JumbleState that tracks whether a normalized string has already been
generated. This way, repeated calls to the callback would not need to
regenerate the string; only the first call would perform the work,
while subsequent calls could simply return the previously computed
normalized string.
I do like the simplicity of this approach and it removes pg_stat_statements
from having to own the normalization code and how well different extensions
sharing the same JumbleState can play together. Not yet sure if this is the
correct direction, and I am open to other ideas.
--
Sami Imseih
Amazon Web Services (AWS)
Attachments:
[application/octet-stream] v2-0001-pg_stat_statements-enforce-single-query-normaliza.patch (22.0K, 2-v2-0001-pg_stat_statements-enforce-single-query-normaliza.patch)
download | inline diff:
From 2fb151b850f87a3190e49ecb77e95cbc71ca9f07 Mon Sep 17 00:00:00 2001
From: Ubuntu <[email protected]>
Date: Thu, 18 Dec 2025 18:59:31 +0000
Subject: [PATCH v2 1/1] pg_stat_statements: enforce single query normalization
per JumbleState
A JumbleState instance can be used by more than one extension, and all
will use the same query string and constant locations produced by core
query jumbling. However, generate_normalized_query in
pg_stat_statements currently modifies the constant lengths stored in
JumbleState, which could be overwritten by another extension implementing
its own normalization logic. While this has not been reported in practice,
it is an undesirable interaction and should be fixed.
To better enforce this invariant, expose a normalization callback in
JumbleState initialized to generate_normalized_query as part of moving all
related normalization code from pg_stat_statements into core query jumbling.
Normalization is invoked through this callback, ensuring that all hooks
on the call chain accessing the same JumbleState use the same normalization
function.
This refactor renames the functions to match the naming conventions used
by the core query jumbling code in queryjumblefuncs.c, and updates code
comments that no longer apply or that benefit from clarification.
Discussion: https://postgr.es/m/aQA9v9nLu5qsX8IE%40paquier.xyz
---
.../pg_stat_statements/pg_stat_statements.c | 277 +-----------------
src/backend/nodes/queryjumblefuncs.c | 251 ++++++++++++++++
src/include/nodes/queryjumble.h | 9 +
3 files changed, 271 insertions(+), 266 deletions(-)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 39208f80b5b..5c11939b656 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -50,7 +50,6 @@
#include "access/htup_details.h"
#include "access/parallel.h"
#include "catalog/pg_authid.h"
-#include "common/int.h"
#include "executor/instrument.h"
#include "funcapi.h"
#include "jit/jit.h"
@@ -59,7 +58,6 @@
#include "nodes/queryjumble.h"
#include "optimizer/planner.h"
#include "parser/analyze.h"
-#include "parser/scanner.h"
#include "pgstat.h"
#include "storage/fd.h"
#include "storage/ipc.h"
@@ -377,11 +375,6 @@ static char *qtext_fetch(Size query_offset, int query_len,
static bool need_gc_qtexts(void);
static void gc_qtexts(void);
static TimestampTz entry_reset(Oid userid, Oid dbid, int64 queryid, bool minmax_only);
-static char *generate_normalized_query(JumbleState *jstate, const char *query,
- int query_loc, int *query_len_p);
-static void fill_in_constant_lengths(JumbleState *jstate, const char *query,
- int query_loc);
-static int comp_location(const void *a, const void *b);
/*
@@ -1359,9 +1352,17 @@ pgss_store(const char *query, int64 queryId,
if (jstate)
{
LWLockRelease(pgss->lock);
- norm_query = generate_normalized_query(jstate, query,
- query_location,
- &query_len);
+
+ /*
+ * generate the normalized query. Note that the normalized
+ * representation may well vary depending on just which
+ * "equivalent" query is used to create the hashtable entry. We
+ * assume this is OK.
+ */
+ Assert(jstate->normalize_query != NULL);
+ norm_query = jstate->normalize_query(jstate, query,
+ query_location,
+ &query_len);
LWLockAcquire(pgss->lock, LW_SHARED);
}
@@ -2823,259 +2824,3 @@ release_lock:
return stats_reset;
}
-
-/*
- * Generate a normalized version of the query string that will be used to
- * represent all similar queries.
- *
- * Note that the normalized representation may well vary depending on
- * just which "equivalent" query is used to create the hashtable entry.
- * We assume this is OK.
- *
- * If query_loc > 0, then "query" has been advanced by that much compared to
- * the original string start, so we need to translate the provided locations
- * to compensate. (This lets us avoid re-scanning statements before the one
- * of interest, so it's worth doing.)
- *
- * *query_len_p contains the input string length, and is updated with
- * the result string length on exit. The resulting string might be longer
- * or shorter depending on what happens with replacement of constants.
- *
- * Returns a palloc'd string.
- */
-static char *
-generate_normalized_query(JumbleState *jstate, const char *query,
- int query_loc, int *query_len_p)
-{
- char *norm_query;
- int query_len = *query_len_p;
- int norm_query_buflen, /* Space allowed for norm_query */
- len_to_wrt, /* Length (in bytes) to write */
- quer_loc = 0, /* Source query byte location */
- n_quer_loc = 0, /* Normalized query byte location */
- last_off = 0, /* Offset from start for previous tok */
- last_tok_len = 0; /* Length (in bytes) of that tok */
- int num_constants_replaced = 0;
-
- /*
- * Get constants' lengths (core system only gives us locations). Note
- * this also ensures the items are sorted by location.
- */
- fill_in_constant_lengths(jstate, query, query_loc);
-
- /*
- * Allow for $n symbols to be longer than the constants they replace.
- * Constants must take at least one byte in text form, while a $n symbol
- * certainly isn't more than 11 bytes, even if n reaches INT_MAX. We
- * could refine that limit based on the max value of n for the current
- * query, but it hardly seems worth any extra effort to do so.
- */
- norm_query_buflen = query_len + jstate->clocations_count * 10;
-
- /* Allocate result buffer */
- norm_query = palloc(norm_query_buflen + 1);
-
- for (int i = 0; i < jstate->clocations_count; i++)
- {
- int off, /* Offset from start for cur tok */
- tok_len; /* Length (in bytes) of that tok */
-
- /*
- * If we have an external param at this location, but no lists are
- * being squashed across the query, then we skip here; this will make
- * us print the characters found in the original query that represent
- * the parameter in the next iteration (or after the loop is done),
- * which is a bit odd but seems to work okay in most cases.
- */
- if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
- continue;
-
- off = jstate->clocations[i].location;
-
- /* Adjust recorded location if we're dealing with partial string */
- off -= query_loc;
-
- tok_len = jstate->clocations[i].length;
-
- if (tok_len < 0)
- continue; /* ignore any duplicates */
-
- /* Copy next chunk (what precedes the next constant) */
- len_to_wrt = off - last_off;
- len_to_wrt -= last_tok_len;
- Assert(len_to_wrt >= 0);
- memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
- n_quer_loc += len_to_wrt;
-
- /*
- * And insert a param symbol in place of the constant token; and, if
- * we have a squashable list, insert a placeholder comment starting
- * from the list's second value.
- */
- n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
- num_constants_replaced + 1 + jstate->highest_extern_param_id,
- jstate->clocations[i].squashed ? " /*, ... */" : "");
- num_constants_replaced++;
-
- /* move forward */
- quer_loc = off + tok_len;
- last_off = off;
- last_tok_len = tok_len;
- }
-
- /*
- * We've copied up until the last ignorable constant. Copy over the
- * remaining bytes of the original query string.
- */
- len_to_wrt = query_len - quer_loc;
-
- Assert(len_to_wrt >= 0);
- memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
- n_quer_loc += len_to_wrt;
-
- Assert(n_quer_loc <= norm_query_buflen);
- norm_query[n_quer_loc] = '\0';
-
- *query_len_p = n_quer_loc;
- return norm_query;
-}
-
-/*
- * Given a valid SQL string and an array of constant-location records,
- * fill in the textual lengths of those constants.
- *
- * The constants may use any allowed constant syntax, such as float literals,
- * bit-strings, single-quoted strings and dollar-quoted strings. This is
- * accomplished by using the public API for the core scanner.
- *
- * It is the caller's job to ensure that the string is a valid SQL statement
- * with constants at the indicated locations. Since in practice the string
- * has already been parsed, and the locations that the caller provides will
- * have originated from within the authoritative parser, this should not be
- * a problem.
- *
- * Multiple constants can have the same location. We reset lengths of those
- * past the first to -1 so that they can later be ignored.
- *
- * If query_loc > 0, then "query" has been advanced by that much compared to
- * the original string start, so we need to translate the provided locations
- * to compensate. (This lets us avoid re-scanning statements before the one
- * of interest, so it's worth doing.)
- *
- * N.B. There is an assumption that a '-' character at a Const location begins
- * a negative numeric constant. This precludes there ever being another
- * reason for a constant to start with a '-'.
- */
-static void
-fill_in_constant_lengths(JumbleState *jstate, const char *query,
- int query_loc)
-{
- LocationLen *locs;
- core_yyscan_t yyscanner;
- core_yy_extra_type yyextra;
- core_YYSTYPE yylval;
- YYLTYPE yylloc;
-
- /*
- * Sort the records by location so that we can process them in order while
- * scanning the query text.
- */
- if (jstate->clocations_count > 1)
- qsort(jstate->clocations, jstate->clocations_count,
- sizeof(LocationLen), comp_location);
- locs = jstate->clocations;
-
- /* initialize the flex scanner --- should match raw_parser() */
- yyscanner = scanner_init(query,
- &yyextra,
- &ScanKeywords,
- ScanKeywordTokens);
-
- /* we don't want to re-emit any escape string warnings */
- yyextra.escape_string_warning = false;
-
- /* Search for each constant, in sequence */
- for (int i = 0; i < jstate->clocations_count; i++)
- {
- int loc;
- int tok;
-
- /* Ignore constants after the first one in the same location */
- if (i > 0 && locs[i].location == locs[i - 1].location)
- {
- locs[i].length = -1;
- continue;
- }
-
- if (locs[i].squashed)
- continue; /* squashable list, ignore */
-
- /* Adjust recorded location if we're dealing with partial string */
- loc = locs[i].location - query_loc;
- Assert(loc >= 0);
-
- /*
- * We have a valid location for a constant that's not a dupe. Lex
- * tokens until we find the desired constant.
- */
- for (;;)
- {
- tok = core_yylex(&yylval, &yylloc, yyscanner);
-
- /* We should not hit end-of-string, but if we do, behave sanely */
- if (tok == 0)
- break; /* out of inner for-loop */
-
- /*
- * We should find the token position exactly, but if we somehow
- * run past it, work with that.
- */
- if (yylloc >= loc)
- {
- if (query[loc] == '-')
- {
- /*
- * It's a negative value - this is the one and only case
- * where we replace more than a single token.
- *
- * Do not compensate for the core system's special-case
- * adjustment of location to that of the leading '-'
- * operator in the event of a negative constant. It is
- * also useful for our purposes to start from the minus
- * symbol. In this way, queries like "select * from foo
- * where bar = 1" and "select * from foo where bar = -2"
- * will have identical normalized query strings.
- */
- tok = core_yylex(&yylval, &yylloc, yyscanner);
- if (tok == 0)
- break; /* out of inner for-loop */
- }
-
- /*
- * We now rely on the assumption that flex has placed a zero
- * byte after the text of the current token in scanbuf.
- */
- locs[i].length = strlen(yyextra.scanbuf + loc);
- break; /* out of inner for-loop */
- }
- }
-
- /* If we hit end-of-string, give up, leaving remaining lengths -1 */
- if (tok == 0)
- break;
- }
-
- scanner_finish(yyscanner);
-}
-
-/*
- * comp_location: comparator for qsorting LocationLen structs by location
- */
-static int
-comp_location(const void *a, const void *b)
-{
- int l = ((const LocationLen *) a)->location;
- int r = ((const LocationLen *) b)->location;
-
- return pg_cmp_s32(l, r);
-}
diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c
index ffc230af427..9dd43a7456f 100644
--- a/src/backend/nodes/queryjumblefuncs.c
+++ b/src/backend/nodes/queryjumblefuncs.c
@@ -40,10 +40,12 @@
#include "access/transam.h"
#include "catalog/pg_proc.h"
#include "common/hashfn.h"
+#include "common/int.h"
#include "miscadmin.h"
#include "nodes/nodeFuncs.h"
#include "nodes/queryjumble.h"
#include "utils/lsyscache.h"
+#include "parser/scanner.h"
#include "parser/scansup.h"
#define JUMBLE_SIZE 1024 /* query serialization buffer size */
@@ -68,6 +70,8 @@ static void FlushPendingNulls(JumbleState *jstate);
static void RecordConstLocation(JumbleState *jstate,
bool extern_param,
int location, int len);
+static char *GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+ int query_loc, int *query_len_p);
static void _jumbleNode(JumbleState *jstate, Node *node);
static void _jumbleList(JumbleState *jstate, Node *node);
static void _jumbleElements(JumbleState *jstate, List *elements, Node *node);
@@ -196,6 +200,7 @@ InitJumble(void)
#ifdef USE_ASSERT_CHECKING
jstate->total_jumble_len = 0;
#endif
+ jstate->normalize_query = GenerateNormalizedQuery;
return jstate;
}
@@ -773,3 +778,249 @@ _jumbleRangeTblEntry_eref(JumbleState *jstate,
*/
JUMBLE_STRING(aliasname);
}
+
+/*
+ * CompLocation: comparator for qsorting LocationLen structs by location
+ */
+static int
+CompLocation(const void *a, const void *b)
+{
+ int l = ((const LocationLen *) a)->location;
+ int r = ((const LocationLen *) b)->location;
+
+ return pg_cmp_s32(l, r);
+}
+
+/*
+ * Given a valid SQL string and an array of constant-location records,
+ * fill in the textual lengths of those constants in JumbleState.
+ *
+ * The constants may use any allowed constant syntax, such as float literals,
+ * bit-strings, single-quoted strings and dollar-quoted strings. This is
+ * accomplished by using the public API for the core scanner.
+ *
+ * Multiple constants can have the same location. We reset lengths of those
+ * past the first to -1 so that they can later be ignored.
+ *
+ * N.B. There is an assumption that a '-' character at a Const location begins
+ * a negative numeric constant. This precludes there ever being another
+ * reason for a constant to start with a '-'.
+ */
+static void
+SetConstantLengths(JumbleState *jstate, const char *query,
+ int query_loc)
+{
+ LocationLen *locs;
+ core_yyscan_t yyscanner;
+ core_yy_extra_type yyextra;
+ core_YYSTYPE yylval;
+ YYLTYPE yylloc;
+
+ /*
+ * Sort the records by location so that we can process them in order while
+ * scanning the query text.
+ */
+ if (jstate->clocations_count > 1)
+ qsort(jstate->clocations, jstate->clocations_count,
+ sizeof(LocationLen), CompLocation);
+ locs = jstate->clocations;
+
+ /* initialize the flex scanner --- should match raw_parser() */
+ yyscanner = scanner_init(query,
+ &yyextra,
+ &ScanKeywords,
+ ScanKeywordTokens);
+
+ /* we don't want to re-emit any escape string warnings */
+ yyextra.escape_string_warning = false;
+
+ /* Search for each constant, in sequence */
+ for (int i = 0; i < jstate->clocations_count; i++)
+ {
+ int loc;
+ int tok;
+
+ /* Ignore constants after the first one in the same location */
+ if (i > 0 && locs[i].location == locs[i - 1].location)
+ {
+ locs[i].length = -1;
+ continue;
+ }
+
+ if (locs[i].squashed)
+ continue; /* squashable list, ignore */
+
+ /*
+ * Adjust the constant's location using the provided starting location
+ * of the current statement. This allows us to avoid scanning a
+ * multi-statement string from the beginning.
+ */
+ loc = locs[i].location - query_loc;
+ Assert(loc >= 0);
+
+ /*
+ * We have a valid location for a constant that's not a dupe. Lex
+ * tokens until we find the desired constant.
+ */
+ for (;;)
+ {
+ tok = core_yylex(&yylval, &yylloc, yyscanner);
+
+ /* We should not hit end-of-string, but if we do, behave sanely */
+ if (tok == 0)
+ break; /* out of inner for-loop */
+
+ /*
+ * We should find the token position exactly, but if we somehow
+ * run past it, work with that.
+ */
+ if (yylloc >= loc)
+ {
+ if (query[loc] == '-')
+ {
+ /*
+ * It's a negative value - this is the one and only case
+ * where we replace more than a single token.
+ *
+ * Do not compensate for the special-case adjustment of
+ * location to that of the leading '-' operator in the
+ * event of a negative constant (see doNegate() in
+ * gram.y). It is also useful for our purposes to start
+ * from the minus symbol. In this way, queries like
+ * "select * from foo where bar = 1" and "select * from
+ * foo where bar = -2" will have identical normalized
+ * query strings.
+ */
+ tok = core_yylex(&yylval, &yylloc, yyscanner);
+ if (tok == 0)
+ break; /* out of inner for-loop */
+ }
+
+ /*
+ * We now rely on the assumption that flex has placed a zero
+ * byte after the text of the current token in scanbuf.
+ */
+ locs[i].length = strlen(yyextra.scanbuf + loc);
+ break; /* out of inner for-loop */
+ }
+ }
+
+ /* If we hit end-of-string, give up, leaving remaining lengths -1 */
+ if (tok == 0)
+ break;
+ }
+
+ scanner_finish(yyscanner);
+}
+
+/*
+ * Callback to generate a normalized version of the query string that will be used to
+ * represent all similar queries.
+ *
+ * It is the caller's job to ensure that the string is a valid SQL statement
+ * with the correct constant locations in jstate->clocations. Since in
+ * practice the string has already been parsed using the authoritative parser
+ * and the locations are set by core query jumbling, this should not be a problem.
+ *
+ * *query_len_p contains the input string length, and is updated with
+ * the result string length on exit. The resulting string might be longer
+ * or shorter depending on what happens with replacement of constants.
+ *
+ * Returns a palloc'd string.
+ */
+static char *
+GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+ int query_loc, int *query_len_p)
+{
+ char *norm_query;
+ int query_len = *query_len_p;
+ int norm_query_buflen, /* Space allowed for norm_query */
+ len_to_wrt, /* Length (in bytes) to write */
+ quer_loc = 0, /* Source query byte location */
+ n_quer_loc = 0, /* Normalized query byte location */
+ last_off = 0, /* Offset from start for previous tok */
+ last_tok_len = 0; /* Length (in bytes) of that tok */
+ int num_constants_replaced = 0;
+
+ /*
+ * Set constants' lengths in JumbleState, as only locations are set during
+ * DoJumble(). Note this also ensures the items are sorted by location.
+ */
+ SetConstantLengths((JumbleState *) jstate, query, query_loc);
+
+ /*
+ * Allow for $n symbols to be longer than the constants they replace.
+ * Constants must take at least one byte in text form, while a $n symbol
+ * certainly isn't more than 11 bytes, even if n reaches INT_MAX. We
+ * could refine that limit based on the max value of n for the current
+ * query, but it hardly seems worth any extra effort to do so.
+ */
+ norm_query_buflen = query_len + jstate->clocations_count * 10;
+
+ /* Allocate result buffer */
+ norm_query = palloc(norm_query_buflen + 1);
+
+ for (int i = 0; i < jstate->clocations_count; i++)
+ {
+ int off, /* Offset from start for cur tok */
+ tok_len; /* Length (in bytes) of that tok */
+
+ /*
+ * If we have an external param at this location, but no lists are
+ * being squashed across the query, then we skip here; this will make
+ * us print the characters found in the original query that represent
+ * the parameter in the next iteration (or after the loop is done),
+ * which is a bit odd but seems to work okay in most cases.
+ */
+ if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
+ continue;
+
+ off = jstate->clocations[i].location;
+
+ /* Adjust the constant's location (see SetConstantLengths) */
+ off -= query_loc;
+
+ tok_len = jstate->clocations[i].length;
+
+ if (tok_len < 0)
+ continue; /* ignore any duplicates */
+
+ /* Copy next chunk (what precedes the next constant) */
+ len_to_wrt = off - last_off;
+ len_to_wrt -= last_tok_len;
+ Assert(len_to_wrt >= 0);
+ memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
+ n_quer_loc += len_to_wrt;
+
+ /*
+ * And insert a param symbol in place of the constant token; and, if
+ * we have a squashable list, insert a placeholder comment starting
+ * from the list's second value.
+ */
+ n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
+ num_constants_replaced + 1 + jstate->highest_extern_param_id,
+ jstate->clocations[i].squashed ? " /*, ... */" : "");
+ num_constants_replaced++;
+
+ /* move forward */
+ quer_loc = off + tok_len;
+ last_off = off;
+ last_tok_len = tok_len;
+ }
+
+ /*
+ * We've copied up until the last ignorable constant. Copy over the
+ * remaining bytes of the original query string.
+ */
+ len_to_wrt = query_len - quer_loc;
+
+ Assert(len_to_wrt >= 0);
+ memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
+ n_quer_loc += len_to_wrt;
+
+ Assert(n_quer_loc <= norm_query_buflen);
+ norm_query[n_quer_loc] = '\0';
+
+ *query_len_p = n_quer_loc;
+ return norm_query;
+}
diff --git a/src/include/nodes/queryjumble.h b/src/include/nodes/queryjumble.h
index dcb36dcb44f..a756d9b2446 100644
--- a/src/include/nodes/queryjumble.h
+++ b/src/include/nodes/queryjumble.h
@@ -75,6 +75,15 @@ typedef struct JumbleState
/* The total number of bytes added to the jumble buffer */
Size total_jumble_len;
#endif
+
+ /*
+ * Internal callback for query normalization. The callback may modify
+ * internal state (e.g., filling in constant lengths). This is set to
+ * GenerateNormalizedQuery during InitJumble(), but third-party plugins
+ * that want to provide their own normalization may modify it.
+ */
+ char *(*normalize_query) (struct JumbleState *jstate,
+ const char *query, int query_loc, int *query_len_p);
} JumbleState;
/* Values for the compute_query_id GUC */
--
2.43.0
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2025-12-19 23:22 Michael Paquier <[email protected]>
parent: Sami Imseih <[email protected]>
1 sibling, 1 reply; 35+ messages in thread
From: Michael Paquier @ 2025-12-19 23:22 UTC (permalink / raw)
To: Sami Imseih <[email protected]>; +Cc: Bertrand Drouvot <[email protected]>; pgsql-hackers
On Fri, Dec 19, 2025 at 03:36:16PM -0600, Sami Imseih wrote:
> This way, any extension that wishes to return a normalized string from
> the same JumbleState can invoke this callback and get consistent results.
> pg_stat_statements and other extensions with a need to normalize a query
> string based on the locations of a JumbleState do not need to care about the
> internals of normalization, they simply invoke the callback and
> receive the final
> string.
Hmm. I did not wrap completely my head with your problem, but,
assuming that what you are proposing goes in the right direction, I am
wondering if we should not expose a bit more the jumble query APIs so
as the normal default callback can be reused by out-of-core rather
than hide it entirely. This would mean exposing
GenerateNormalizedQuery(), which also giving a way for callers of
JumbleQuery() to pass down a custom callback? This would imply
thinking harder about the initialization state we expect in the
structure, but I think that we should try to design things so as
extensions do not need to copy-paste more code from the core tree at
the end, just less of it.
Of course, this sentence is written with the same line of thoughts as
previously mentioned in the other thread we have discussed: extensions
should not be allowed to update a JumbleState after it's been set by
the backend code, so as once the same JumbleState pointer is passed
down across multiple extensions they don't get confused. If an
extension wants to use their own policy within the JumbleState, they
had better recreate a new independent one if they are unhappy about
has been generated previously.
--
Michael
Attachments:
[application/pgp-signature] signature.asc (833B, 2-signature.asc)
download
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2025-12-22 23:40 Sami Imseih <[email protected]>
parent: Michael Paquier <[email protected]>
0 siblings, 0 replies; 35+ messages in thread
From: Sami Imseih @ 2025-12-22 23:40 UTC (permalink / raw)
To: Michael Paquier <[email protected]>; +Cc: Bertrand Drouvot <[email protected]>; pgsql-hackers
> > This way, any extension that wishes to return a normalized string from
> > the same JumbleState can invoke this callback and get consistent results.
> > pg_stat_statements and other extensions with a need to normalize a query
> > string based on the locations of a JumbleState do not need to care about the
> > internals of normalization, they simply invoke the callback and
> > receive the final
> > string.
>
> Hmm. I did not wrap completely my head with your problem, but,
> assuming that what you are proposing goes in the right direction,
The first goal is to move all query-normalization-related infrastructure
that pg_stat_statements (and other extensions) rely on into core, so
extensions no longer need to copy or reimplement normalization logic and
can all depend on a single, shared implementation.
In addition, query normalization necessarily modifies JumbleState (to
record constant locations and lengths). This responsibility should not
fall to extensions and should instead be delegated to core. I will argue
that the current design, in which extensions handle this directly, is a
layering violation.
As a first step, we can move generate_normalized_query to core as a global
function, allowing extensions to simply call it.
> I am wondering if we should not expose a bit more the jumble query APIs so
> as the normal default callback can be reused by out-of-core rather
> than hide it entirely. This would mean exposing
> GenerateNormalizedQuery(), which also giving a way for callers of
> JumbleQuery() to pass down a custom callback? This would imply
> thinking harder about the initialization state we expect in the
> structure, but I think that we should try to design things so as
> extensions do not need to copy-paste more code from the core tree at
> the end, just less of it.
... and this will be taking the next step which is providing callbacks
and making
more jumbling utilities global. This will require more discussion, but I
would think we would expose InitJumble() and it will do the bare minimum
to initialize a JumbleState, and some fields that can define callbacks after
the fact. There will be a callback for a normalization function and a
callback function that will allow the user to implement jumbling functions
for nodes that are currently not included in queryjumblefuncs.switch.c, or
perhaps they can override the existing logic in this generated file.
> Of course, this sentence is written with the same line of thoughts as
> previously mentioned in the other thread we have discussed: extensions
> should not be allowed to update a JumbleState after it's been set by
> the backend code, so as once the same JumbleState pointer is passed
> down across multiple extensions they don't get confused. If an
> extension wants to use their own policy within the JumbleState, they
> had better recreate a new independent one if they are unhappy about
> has been generated previously.
Yes, correct. If we provide the interface to create an additional JumbleState,
they can create an independent state.
For this thread, I would like to focus on the first goal.
What do you think?
--
Sami Imseih
Amazon Web Services (AWS)
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2025-12-23 03:39 =?ISO-8859-1?B?emVuZ21hbg==?= <[email protected]>
parent: Sami Imseih <[email protected]>
1 sibling, 1 reply; 35+ messages in thread
From: =?ISO-8859-1?B?emVuZ21hbg==?= @ 2025-12-23 03:39 UTC (permalink / raw)
To: =?ISO-8859-1?B?U2FtaSBJbXNlaWg=?= <[email protected]>; +Cc: pgsql-hackers
Hi,
Though this may be tangential to the current topic, I have long been wanting to revise the two instances of `Assert(len_to_wrt >= 0);`
in the code to the implementation below. Would you kindly advise if this modification is worthwhile?
```
if (len_to_wrt > 0)
{
memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
n_quer_loc += len_to_wrt;
}
```
--
Regards,
Man Zeng
www.openhalo.org
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2025-12-23 16:35 Sami Imseih <[email protected]>
parent: =?ISO-8859-1?B?emVuZ21hbg==?= <[email protected]>
0 siblings, 3 replies; 35+ messages in thread
From: Sami Imseih @ 2025-12-23 16:35 UTC (permalink / raw)
To: zengman <[email protected]>; +Cc: pgsql-hackers
> Though this may be tangential to the current topic, I have long been wanting to revise the two instances of `Assert(len_to_wrt >= 0);`
> in the code to the implementation below. Would you kindly advise if this modification is worthwhile?
>
> ```
> if (len_to_wrt > 0)
> {
> memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
> n_quer_loc += len_to_wrt;
> }
> ```
I am not sure I like this idea. The len_to_wrt being less than 0
indicates a bug which will be hidden behind such a loop. I
think it's better to keep the Assert as-is.
> > > This way, any extension that wishes to return a normalized string from
> > > the same JumbleState can invoke this callback and get consistent results.
> > > pg_stat_statements and other extensions with a need to normalize a query
> > > string based on the locations of a JumbleState do not need to care about the
> > > internals of normalization, they simply invoke the callback and
> > > receive the final
> > > string.
> >
> > Hmm. I did not wrap completely my head with your problem, but,
> > assuming that what you are proposing goes in the right direction,
>
> The first goal is to move all query-normalization-related infrastructure
> that pg_stat_statements (and other extensions) rely on into core, so
> extensions no longer need to copy or reimplement normalization logic and
> can all depend on a single, shared implementation.
>
> In addition, query normalization necessarily modifies JumbleState (to
> record constant locations and lengths). This responsibility should not
> fall to extensions and should instead be delegated to core. I will argue
> that the current design, in which extensions handle this directly, is a
> layering violation.
>
> As a first step, we can move generate_normalized_query to core as a global
> function, allowing extensions to simply call it.
v3 implements this approach without a callback. This establishes a clear
boundary: core owns JumbleState modifications, extensions consume the
results through the API.
I kept the post_parse_analyze hook signature unchanged since
GenerateNormalizedQuery still needs to modify JumbleState
(to populate constant lengths). Making it const would be misleading.
For the second part of making more jumbling utilities global, this can
be taken up in another thread. I am now thinking we would be better
off actually taking all the generic jumbling routines and separating
them into their own C file. So we can have a new file like primjumble.c
which will have all the "primitive" jumbling utilities, and queryjumblefuncs.c
can worry about its core business of jumbling a Query tree. Anyhow,
these are just some high level thoughts on this part for now.
--
Sami Imseih
Amazon Web Services (AWS)
Attachments:
[application/octet-stream] v3-0001-pg_stat_statements-Move-query-normalization-to-co.patch (20.8K, 2-v3-0001-pg_stat_statements-Move-query-normalization-to-co.patch)
download | inline diff:
From ffe6963fd217f64230d5d638ab9a16570a895172 Mon Sep 17 00:00:00 2001
From: Ubuntu <[email protected]>
Date: Thu, 18 Dec 2025 18:59:31 +0000
Subject: [PATCH v3] pg_stat_statements: Move query normalization to core
pg_stat_statements was modifying core generated JumbleState instances
by calling fill_in_constant_lengths() via generate_normalized_query()
to populate constant lengths. This creates problems since extensions
should not modify shared core data structures that may be used by
other extensions, and this could be considered a layering violation.
Move query normalization functions to core as a global function to
fix this. Extensions now call GenerateNormalizedQuery() instead of
directly modifying JumbleState.
This change also addresses code duplication, as several extensions
have been copying generate_normalized_query() to implement their own
normalization. The new core API provides a single, consistent
implementation for all extensions to use.
Functions are renamed to match core naming conventions and comments
are updated for clarity.
Discussion: https://postgr.es/m/CAA5RZ0tZp5qU0ikZEEqJnxvdSNGh1DWv80sb-k4QAUmiMoOp_Q@mail.gmail.com
---
.../pg_stat_statements/pg_stat_statements.c | 276 +-----------------
src/backend/nodes/queryjumblefuncs.c | 248 ++++++++++++++++
src/include/nodes/queryjumble.h | 2 +
3 files changed, 260 insertions(+), 266 deletions(-)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 39208f80b5b..b90891859d9 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -50,7 +50,6 @@
#include "access/htup_details.h"
#include "access/parallel.h"
#include "catalog/pg_authid.h"
-#include "common/int.h"
#include "executor/instrument.h"
#include "funcapi.h"
#include "jit/jit.h"
@@ -59,7 +58,6 @@
#include "nodes/queryjumble.h"
#include "optimizer/planner.h"
#include "parser/analyze.h"
-#include "parser/scanner.h"
#include "pgstat.h"
#include "storage/fd.h"
#include "storage/ipc.h"
@@ -377,11 +375,6 @@ static char *qtext_fetch(Size query_offset, int query_len,
static bool need_gc_qtexts(void);
static void gc_qtexts(void);
static TimestampTz entry_reset(Oid userid, Oid dbid, int64 queryid, bool minmax_only);
-static char *generate_normalized_query(JumbleState *jstate, const char *query,
- int query_loc, int *query_len_p);
-static void fill_in_constant_lengths(JumbleState *jstate, const char *query,
- int query_loc);
-static int comp_location(const void *a, const void *b);
/*
@@ -1359,9 +1352,16 @@ pgss_store(const char *query, int64 queryId,
if (jstate)
{
LWLockRelease(pgss->lock);
- norm_query = generate_normalized_query(jstate, query,
- query_location,
- &query_len);
+
+ /*
+ * generate the normalized query. Note that the normalized
+ * representation may well vary depending on just which
+ * "equivalent" query is used to create the hashtable entry. We
+ * assume this is OK.
+ */
+ norm_query = GenerateNormalizedQuery(jstate, query,
+ query_location,
+ &query_len);
LWLockAcquire(pgss->lock, LW_SHARED);
}
@@ -2823,259 +2823,3 @@ release_lock:
return stats_reset;
}
-
-/*
- * Generate a normalized version of the query string that will be used to
- * represent all similar queries.
- *
- * Note that the normalized representation may well vary depending on
- * just which "equivalent" query is used to create the hashtable entry.
- * We assume this is OK.
- *
- * If query_loc > 0, then "query" has been advanced by that much compared to
- * the original string start, so we need to translate the provided locations
- * to compensate. (This lets us avoid re-scanning statements before the one
- * of interest, so it's worth doing.)
- *
- * *query_len_p contains the input string length, and is updated with
- * the result string length on exit. The resulting string might be longer
- * or shorter depending on what happens with replacement of constants.
- *
- * Returns a palloc'd string.
- */
-static char *
-generate_normalized_query(JumbleState *jstate, const char *query,
- int query_loc, int *query_len_p)
-{
- char *norm_query;
- int query_len = *query_len_p;
- int norm_query_buflen, /* Space allowed for norm_query */
- len_to_wrt, /* Length (in bytes) to write */
- quer_loc = 0, /* Source query byte location */
- n_quer_loc = 0, /* Normalized query byte location */
- last_off = 0, /* Offset from start for previous tok */
- last_tok_len = 0; /* Length (in bytes) of that tok */
- int num_constants_replaced = 0;
-
- /*
- * Get constants' lengths (core system only gives us locations). Note
- * this also ensures the items are sorted by location.
- */
- fill_in_constant_lengths(jstate, query, query_loc);
-
- /*
- * Allow for $n symbols to be longer than the constants they replace.
- * Constants must take at least one byte in text form, while a $n symbol
- * certainly isn't more than 11 bytes, even if n reaches INT_MAX. We
- * could refine that limit based on the max value of n for the current
- * query, but it hardly seems worth any extra effort to do so.
- */
- norm_query_buflen = query_len + jstate->clocations_count * 10;
-
- /* Allocate result buffer */
- norm_query = palloc(norm_query_buflen + 1);
-
- for (int i = 0; i < jstate->clocations_count; i++)
- {
- int off, /* Offset from start for cur tok */
- tok_len; /* Length (in bytes) of that tok */
-
- /*
- * If we have an external param at this location, but no lists are
- * being squashed across the query, then we skip here; this will make
- * us print the characters found in the original query that represent
- * the parameter in the next iteration (or after the loop is done),
- * which is a bit odd but seems to work okay in most cases.
- */
- if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
- continue;
-
- off = jstate->clocations[i].location;
-
- /* Adjust recorded location if we're dealing with partial string */
- off -= query_loc;
-
- tok_len = jstate->clocations[i].length;
-
- if (tok_len < 0)
- continue; /* ignore any duplicates */
-
- /* Copy next chunk (what precedes the next constant) */
- len_to_wrt = off - last_off;
- len_to_wrt -= last_tok_len;
- Assert(len_to_wrt >= 0);
- memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
- n_quer_loc += len_to_wrt;
-
- /*
- * And insert a param symbol in place of the constant token; and, if
- * we have a squashable list, insert a placeholder comment starting
- * from the list's second value.
- */
- n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
- num_constants_replaced + 1 + jstate->highest_extern_param_id,
- jstate->clocations[i].squashed ? " /*, ... */" : "");
- num_constants_replaced++;
-
- /* move forward */
- quer_loc = off + tok_len;
- last_off = off;
- last_tok_len = tok_len;
- }
-
- /*
- * We've copied up until the last ignorable constant. Copy over the
- * remaining bytes of the original query string.
- */
- len_to_wrt = query_len - quer_loc;
-
- Assert(len_to_wrt >= 0);
- memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
- n_quer_loc += len_to_wrt;
-
- Assert(n_quer_loc <= norm_query_buflen);
- norm_query[n_quer_loc] = '\0';
-
- *query_len_p = n_quer_loc;
- return norm_query;
-}
-
-/*
- * Given a valid SQL string and an array of constant-location records,
- * fill in the textual lengths of those constants.
- *
- * The constants may use any allowed constant syntax, such as float literals,
- * bit-strings, single-quoted strings and dollar-quoted strings. This is
- * accomplished by using the public API for the core scanner.
- *
- * It is the caller's job to ensure that the string is a valid SQL statement
- * with constants at the indicated locations. Since in practice the string
- * has already been parsed, and the locations that the caller provides will
- * have originated from within the authoritative parser, this should not be
- * a problem.
- *
- * Multiple constants can have the same location. We reset lengths of those
- * past the first to -1 so that they can later be ignored.
- *
- * If query_loc > 0, then "query" has been advanced by that much compared to
- * the original string start, so we need to translate the provided locations
- * to compensate. (This lets us avoid re-scanning statements before the one
- * of interest, so it's worth doing.)
- *
- * N.B. There is an assumption that a '-' character at a Const location begins
- * a negative numeric constant. This precludes there ever being another
- * reason for a constant to start with a '-'.
- */
-static void
-fill_in_constant_lengths(JumbleState *jstate, const char *query,
- int query_loc)
-{
- LocationLen *locs;
- core_yyscan_t yyscanner;
- core_yy_extra_type yyextra;
- core_YYSTYPE yylval;
- YYLTYPE yylloc;
-
- /*
- * Sort the records by location so that we can process them in order while
- * scanning the query text.
- */
- if (jstate->clocations_count > 1)
- qsort(jstate->clocations, jstate->clocations_count,
- sizeof(LocationLen), comp_location);
- locs = jstate->clocations;
-
- /* initialize the flex scanner --- should match raw_parser() */
- yyscanner = scanner_init(query,
- &yyextra,
- &ScanKeywords,
- ScanKeywordTokens);
-
- /* we don't want to re-emit any escape string warnings */
- yyextra.escape_string_warning = false;
-
- /* Search for each constant, in sequence */
- for (int i = 0; i < jstate->clocations_count; i++)
- {
- int loc;
- int tok;
-
- /* Ignore constants after the first one in the same location */
- if (i > 0 && locs[i].location == locs[i - 1].location)
- {
- locs[i].length = -1;
- continue;
- }
-
- if (locs[i].squashed)
- continue; /* squashable list, ignore */
-
- /* Adjust recorded location if we're dealing with partial string */
- loc = locs[i].location - query_loc;
- Assert(loc >= 0);
-
- /*
- * We have a valid location for a constant that's not a dupe. Lex
- * tokens until we find the desired constant.
- */
- for (;;)
- {
- tok = core_yylex(&yylval, &yylloc, yyscanner);
-
- /* We should not hit end-of-string, but if we do, behave sanely */
- if (tok == 0)
- break; /* out of inner for-loop */
-
- /*
- * We should find the token position exactly, but if we somehow
- * run past it, work with that.
- */
- if (yylloc >= loc)
- {
- if (query[loc] == '-')
- {
- /*
- * It's a negative value - this is the one and only case
- * where we replace more than a single token.
- *
- * Do not compensate for the core system's special-case
- * adjustment of location to that of the leading '-'
- * operator in the event of a negative constant. It is
- * also useful for our purposes to start from the minus
- * symbol. In this way, queries like "select * from foo
- * where bar = 1" and "select * from foo where bar = -2"
- * will have identical normalized query strings.
- */
- tok = core_yylex(&yylval, &yylloc, yyscanner);
- if (tok == 0)
- break; /* out of inner for-loop */
- }
-
- /*
- * We now rely on the assumption that flex has placed a zero
- * byte after the text of the current token in scanbuf.
- */
- locs[i].length = strlen(yyextra.scanbuf + loc);
- break; /* out of inner for-loop */
- }
- }
-
- /* If we hit end-of-string, give up, leaving remaining lengths -1 */
- if (tok == 0)
- break;
- }
-
- scanner_finish(yyscanner);
-}
-
-/*
- * comp_location: comparator for qsorting LocationLen structs by location
- */
-static int
-comp_location(const void *a, const void *b)
-{
- int l = ((const LocationLen *) a)->location;
- int r = ((const LocationLen *) b)->location;
-
- return pg_cmp_s32(l, r);
-}
diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c
index ffc230af427..9e04dd6fbc6 100644
--- a/src/backend/nodes/queryjumblefuncs.c
+++ b/src/backend/nodes/queryjumblefuncs.c
@@ -40,10 +40,12 @@
#include "access/transam.h"
#include "catalog/pg_proc.h"
#include "common/hashfn.h"
+#include "common/int.h"
#include "miscadmin.h"
#include "nodes/nodeFuncs.h"
#include "nodes/queryjumble.h"
#include "utils/lsyscache.h"
+#include "parser/scanner.h"
#include "parser/scansup.h"
#define JUMBLE_SIZE 1024 /* query serialization buffer size */
@@ -773,3 +775,249 @@ _jumbleRangeTblEntry_eref(JumbleState *jstate,
*/
JUMBLE_STRING(aliasname);
}
+
+/*
+ * CompLocation: comparator for qsorting LocationLen structs by location
+ */
+static int
+CompLocation(const void *a, const void *b)
+{
+ int l = ((const LocationLen *) a)->location;
+ int r = ((const LocationLen *) b)->location;
+
+ return pg_cmp_s32(l, r);
+}
+
+/*
+ * Given a valid SQL string and an array of constant-location records,
+ * fill in the textual lengths of those constants in JumbleState.
+ *
+ * The constants may use any allowed constant syntax, such as float literals,
+ * bit-strings, single-quoted strings and dollar-quoted strings. This is
+ * accomplished by using the public API for the core scanner.
+ *
+ * Multiple constants can have the same location. We reset lengths of those
+ * past the first to -1 so that they can later be ignored.
+ *
+ * N.B. There is an assumption that a '-' character at a Const location begins
+ * a negative numeric constant. This precludes there ever being another
+ * reason for a constant to start with a '-'.
+ */
+static void
+SetConstantLengths(JumbleState *jstate, const char *query,
+ int query_loc)
+{
+ LocationLen *locs;
+ core_yyscan_t yyscanner;
+ core_yy_extra_type yyextra;
+ core_YYSTYPE yylval;
+ YYLTYPE yylloc;
+
+ /*
+ * Sort the records by location so that we can process them in order while
+ * scanning the query text.
+ */
+ if (jstate->clocations_count > 1)
+ qsort(jstate->clocations, jstate->clocations_count,
+ sizeof(LocationLen), CompLocation);
+ locs = jstate->clocations;
+
+ /* initialize the flex scanner --- should match raw_parser() */
+ yyscanner = scanner_init(query,
+ &yyextra,
+ &ScanKeywords,
+ ScanKeywordTokens);
+
+ /* we don't want to re-emit any escape string warnings */
+ yyextra.escape_string_warning = false;
+
+ /* Search for each constant, in sequence */
+ for (int i = 0; i < jstate->clocations_count; i++)
+ {
+ int loc;
+ int tok;
+
+ /* Ignore constants after the first one in the same location */
+ if (i > 0 && locs[i].location == locs[i - 1].location)
+ {
+ locs[i].length = -1;
+ continue;
+ }
+
+ if (locs[i].squashed)
+ continue; /* squashable list, ignore */
+
+ /*
+ * Adjust the constant's location using the provided starting location
+ * of the current statement. This allows us to avoid scanning a
+ * multi-statement string from the beginning.
+ */
+ loc = locs[i].location - query_loc;
+ Assert(loc >= 0);
+
+ /*
+ * We have a valid location for a constant that's not a dupe. Lex
+ * tokens until we find the desired constant.
+ */
+ for (;;)
+ {
+ tok = core_yylex(&yylval, &yylloc, yyscanner);
+
+ /* We should not hit end-of-string, but if we do, behave sanely */
+ if (tok == 0)
+ break; /* out of inner for-loop */
+
+ /*
+ * We should find the token position exactly, but if we somehow
+ * run past it, work with that.
+ */
+ if (yylloc >= loc)
+ {
+ if (query[loc] == '-')
+ {
+ /*
+ * It's a negative value - this is the one and only case
+ * where we replace more than a single token.
+ *
+ * Do not compensate for the special-case adjustment of
+ * location to that of the leading '-' operator in the
+ * event of a negative constant (see doNegate() in
+ * gram.y). It is also useful for our purposes to start
+ * from the minus symbol. In this way, queries like
+ * "select * from foo where bar = 1" and "select * from
+ * foo where bar = -2" will have identical normalized
+ * query strings.
+ */
+ tok = core_yylex(&yylval, &yylloc, yyscanner);
+ if (tok == 0)
+ break; /* out of inner for-loop */
+ }
+
+ /*
+ * We now rely on the assumption that flex has placed a zero
+ * byte after the text of the current token in scanbuf.
+ */
+ locs[i].length = strlen(yyextra.scanbuf + loc);
+ break; /* out of inner for-loop */
+ }
+ }
+
+ /* If we hit end-of-string, give up, leaving remaining lengths -1 */
+ if (tok == 0)
+ break;
+ }
+
+ scanner_finish(yyscanner);
+}
+
+/*
+ * Callback to generate a normalized version of the query string that will be used to
+ * represent all similar queries.
+ *
+ * It is the caller's job to ensure that the string is a valid SQL statement
+ * with the correct constant locations in jstate->clocations. Since in
+ * practice the string has already been parsed using the authoritative parser
+ * and the locations are set by core query jumbling, this should not be a problem.
+ *
+ * *query_len_p contains the input string length, and is updated with
+ * the result string length on exit. The resulting string might be longer
+ * or shorter depending on what happens with replacement of constants.
+ *
+ * Returns a palloc'd string.
+ */
+char *
+GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+ int query_loc, int *query_len_p)
+{
+ char *norm_query;
+ int query_len = *query_len_p;
+ int norm_query_buflen, /* Space allowed for norm_query */
+ len_to_wrt, /* Length (in bytes) to write */
+ quer_loc = 0, /* Source query byte location */
+ n_quer_loc = 0, /* Normalized query byte location */
+ last_off = 0, /* Offset from start for previous tok */
+ last_tok_len = 0; /* Length (in bytes) of that tok */
+ int num_constants_replaced = 0;
+
+ /*
+ * Set constants' lengths in JumbleState, as only locations are set during
+ * DoJumble(). Note this also ensures the items are sorted by location.
+ */
+ SetConstantLengths((JumbleState *) jstate, query, query_loc);
+
+ /*
+ * Allow for $n symbols to be longer than the constants they replace.
+ * Constants must take at least one byte in text form, while a $n symbol
+ * certainly isn't more than 11 bytes, even if n reaches INT_MAX. We
+ * could refine that limit based on the max value of n for the current
+ * query, but it hardly seems worth any extra effort to do so.
+ */
+ norm_query_buflen = query_len + jstate->clocations_count * 10;
+
+ /* Allocate result buffer */
+ norm_query = palloc(norm_query_buflen + 1);
+
+ for (int i = 0; i < jstate->clocations_count; i++)
+ {
+ int off, /* Offset from start for cur tok */
+ tok_len; /* Length (in bytes) of that tok */
+
+ /*
+ * If we have an external param at this location, but no lists are
+ * being squashed across the query, then we skip here; this will make
+ * us print the characters found in the original query that represent
+ * the parameter in the next iteration (or after the loop is done),
+ * which is a bit odd but seems to work okay in most cases.
+ */
+ if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
+ continue;
+
+ off = jstate->clocations[i].location;
+
+ /* Adjust the constant's location (see SetConstantLengths) */
+ off -= query_loc;
+
+ tok_len = jstate->clocations[i].length;
+
+ if (tok_len < 0)
+ continue; /* ignore any duplicates */
+
+ /* Copy next chunk (what precedes the next constant) */
+ len_to_wrt = off - last_off;
+ len_to_wrt -= last_tok_len;
+ Assert(len_to_wrt >= 0);
+ memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
+ n_quer_loc += len_to_wrt;
+
+ /*
+ * And insert a param symbol in place of the constant token; and, if
+ * we have a squashable list, insert a placeholder comment starting
+ * from the list's second value.
+ */
+ n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
+ num_constants_replaced + 1 + jstate->highest_extern_param_id,
+ jstate->clocations[i].squashed ? " /*, ... */" : "");
+ num_constants_replaced++;
+
+ /* move forward */
+ quer_loc = off + tok_len;
+ last_off = off;
+ last_tok_len = tok_len;
+ }
+
+ /*
+ * We've copied up until the last ignorable constant. Copy over the
+ * remaining bytes of the original query string.
+ */
+ len_to_wrt = query_len - quer_loc;
+
+ Assert(len_to_wrt >= 0);
+ memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
+ n_quer_loc += len_to_wrt;
+
+ Assert(n_quer_loc <= norm_query_buflen);
+ norm_query[n_quer_loc] = '\0';
+
+ *query_len_p = n_quer_loc;
+ return norm_query;
+}
diff --git a/src/include/nodes/queryjumble.h b/src/include/nodes/queryjumble.h
index dcb36dcb44f..cb8741b1d05 100644
--- a/src/include/nodes/queryjumble.h
+++ b/src/include/nodes/queryjumble.h
@@ -91,6 +91,8 @@ extern PGDLLIMPORT int compute_query_id;
extern const char *CleanQuerytext(const char *query, int *location, int *len);
+extern char *GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+ int query_loc, int *query_len_p);
extern JumbleState *JumbleQuery(Query *query);
extern void EnableQueryId(void);
--
2.43.0
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2025-12-24 02:50 =?ISO-8859-1?B?emVuZ21hbg==?= <[email protected]>
parent: Sami Imseih <[email protected]>
2 siblings, 1 reply; 35+ messages in thread
From: =?ISO-8859-1?B?emVuZ21hbg==?= @ 2025-12-24 02:50 UTC (permalink / raw)
To: =?ISO-8859-1?B?U2FtaSBJbXNlaWg=?= <[email protected]>; +Cc: pgsql-hackers
> I am not sure I like this idea. The len_to_wrt being less than 0
> indicates a bug which will be hidden behind such a loop. I
> think it's better to keep the Assert as-is.
Well, my main concern is that assertions are unlikely to be enabled in production environments.
Also, the new patch looks really good. Besides, it seems you accidentally forgot to cc Michael.
--
Regards,
Man Zeng
www.openhalo.org
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2025-12-24 02:57 Michael Paquier <[email protected]>
parent: =?ISO-8859-1?B?emVuZ21hbg==?= <[email protected]>
0 siblings, 0 replies; 35+ messages in thread
From: Michael Paquier @ 2025-12-24 02:57 UTC (permalink / raw)
To: zengman <[email protected]>; +Cc: Sami Imseih <[email protected]>; pgsql-hackers
On Wed, Dec 24, 2025 at 10:50:16AM +0800, zengman wrote:
> Besides, it seems you accidentally forgot to cc Michael.
This is not a problem. I still get the messages by being registered
to this list :D
--
Michael
Attachments:
[application/pgp-signature] signature.asc (833B, 2-signature.asc)
download
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2025-12-24 04:06 Michael Paquier <[email protected]>
parent: Sami Imseih <[email protected]>
2 siblings, 1 reply; 35+ messages in thread
From: Michael Paquier @ 2025-12-24 04:06 UTC (permalink / raw)
To: Sami Imseih <[email protected]>; +Cc: zengman <[email protected]>; pgsql-hackers
On Tue, Dec 23, 2025 at 10:35:18AM -0600, Sami Imseih wrote:
> > Though this may be tangential to the current topic, I have long been wanting to revise the two instances of `Assert(len_to_wrt >= 0);`
> > in the code to the implementation below. Would you kindly advise if this modification is worthwhile?
> >
> > ```
> > if (len_to_wrt > 0)
> > {
> > memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
> > n_quer_loc += len_to_wrt;
> > }
> > ```
>
> I am not sure I like this idea. The len_to_wrt being less than 0
> indicates a bug which will be hidden behind such a loop. I
> think it's better to keep the Assert as-is.
This set of asserts are critical to keep. An incorrect computation is
a synonym of an out-of-bound write in this case, which would classify
any problem as something that could be worth a CVE.
> v3 implements this approach without a callback. This establishes a clear
> boundary: core owns JumbleState modifications, extensions consume the
> results through the API.
>
> I kept the post_parse_analyze hook signature unchanged since
> GenerateNormalizedQuery still needs to modify JumbleState
> (to populate constant lengths). Making it const would be misleading.
>
> For the second part of making more jumbling utilities global, this can
> be taken up in another thread. I am now thinking we would be better
> off actually taking all the generic jumbling routines and separating
> them into their own C file. So we can have a new file like primjumble.c
> which will have all the "primitive" jumbling utilities, and queryjumblefuncs.c
> can worry about its core business of jumbling a Query tree. Anyhow,
> these are just some high level thoughts on this part for now.
Hmm. Moving from pgss to the core backup the code that updates the
clocations of an existing JumbleState does not solve the issue that
we should not modify the internals of the JumbleState computed. So
this does not move the ball at all.
The updates of clocations depend on GenerateNormalizedQuery(), which
depends on the pre-work done in CleanQuerytext() to get a query string
and its location for a multi-query string case. If we really want to
make a JumbleState not touchable at all, we could either push more
work of CleanQuerytext() and GenerateNormalizedQuery() into core.
Or a second thing that we can do, which would be actually much
simpler, is to rework entirely fill_in_constant_lengths() so as we
generate a *copy* of the location data PGSS wants rather than force it
into a JumbleState that would be pushed across more stacks of the
post-parse-analyze hook. Doing that would allow us to pull the plug
into making JumbleState and the original copies of clocations a set of
consts when given to extensions.
--
Michael
Attachments:
[application/pgp-signature] signature.asc (833B, 2-signature.asc)
download
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2025-12-24 17:32 Sami Imseih <[email protected]>
parent: Michael Paquier <[email protected]>
0 siblings, 1 reply; 35+ messages in thread
From: Sami Imseih @ 2025-12-24 17:32 UTC (permalink / raw)
To: Michael Paquier <[email protected]>; +Cc: zengman <[email protected]>; pgsql-hackers
> > For the second part of making more jumbling utilities global, this can
> > be taken up in another thread. I am now thinking we would be better
> > off actually taking all the generic jumbling routines and separating
> > them into their own C file. So we can have a new file like primjumble.c
> > which will have all the "primitive" jumbling utilities, and queryjumblefuncs.c
> > can worry about its core business of jumbling a Query tree. Anyhow,
> > these are just some high level thoughts on this part for now.
>
> Hmm. Moving from pgss to the core backup the code that updates the
> clocations of an existing JumbleState does not solve the issue that
> we should not modify the internals of the JumbleState computed. So
> this does not move the ball at all.
This is for the point "I am wondering if we should not expose a bit more
the jumble query APIs" [0], which as I mention is a separate topic and is
not about extensions being able to modify JumbleState.
> The updates of clocations depend on GenerateNormalizedQuery(), which
> depends on the pre-work done in CleanQuerytext() to get a query string
> and its location for a multi-query string case. If we really want to
> make a JumbleState not touchable at all, we could either push more
> work of CleanQuerytext() and GenerateNormalizedQuery() into core.
When you say “push more work into core,” I understand that to mean this work
should occur during jumbling. If so, there are several complications.
1/ CleanQueryText() requires access to the query text, which we do not have
during core jumbling as of 2ecbb0a4935.
2/ GenerateNormalizedQuery() should be invoked on demand by the extension
that needs the normalized string, for example when pg_stat_statements needs
to store a query string.
Both operations are expensive, especially GenerateNormalizedQuery(). Doing
this work on every JumbleQuery() would introduce unnecessary overhead.
> Or a second thing that we can do, which would be actually much
> simpler, is to rework entirely fill_in_constant_lengths() so as we
> generate a *copy* of the location data PGSS wants rather than force it
> into a JumbleState
yes, that's an idea that did cross my mind. I have hesitation of copying
clocations, but that may not be such a big deal, since it will only be
occurring on demand when the extension requests a normalized string.
> that would be pushed across more stacks of the
> post-parse-analyze hook. Doing that would allow us to pull the plug
> into making JumbleState and the original copies of clocations a set of
> consts when given to extensions.
Yes correct. The hook signature will turn into, so all extensions now
just have a constant pointer to the jumblestate. They can of course
work hard to cast away constants, but at that point, they are doing
things the wrong way.
```
typedef void (*post_parse_analyze_hook_type) (ParseState *pstate,
Query *query,
const JumbleState *jstate);
```
This of course will be a big breaking change to all the extensions out
there using this hook. This is not just about a signature change of
the hook, but an author now has to figure out how to deal with a
constant JumbleState in their code, which may not be trivial.
My proposal in v3 was a bit more softer, and keeps the hook
backwards compatible. Basically, we keep JumbleState a non-constant,
but provide core APIs for any operation, such as generate_normalized_query,
that needs to modify the state. So, my approach was not about enforcing a
read-only JumbleState, but about providing the API to dissuade an author
from modifying a JumbleState.
If we need a stronger API contract, then we can go with the hook receving
JumbleState as a constant and implement the copy of clocations to return
back to extensions that need to normalize a query. We also have to keep
in mind that if there are future requirements where the state has to be
modified by an extension, we will need to implement more copy functions
for those field.
[0] https://www.postgresql.org/message-id/aUXeTEpSymo6n7lF%40paquier.xyz
--
Sami Imseih
Amazon Web Services (AWS)
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2025-12-30 02:34 Lukas Fittl <[email protected]>
parent: Sami Imseih <[email protected]>
0 siblings, 2 replies; 35+ messages in thread
From: Lukas Fittl @ 2025-12-30 02:34 UTC (permalink / raw)
To: Sami Imseih <[email protected]>; +Cc: Michael Paquier <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>
On Wed, Dec 24, 2025 at 9:33 AM Sami Imseih <[email protected]> wrote:
> > that would be pushed across more stacks of the
> > post-parse-analyze hook. Doing that would allow us to pull the plug
> > into making JumbleState and the original copies of clocations a set of
> > consts when given to extensions.
>
> Yes correct. The hook signature will turn into, so all extensions now
> just have a constant pointer to the jumblestate. They can of course
> work hard to cast away constants, but at that point, they are doing
> things the wrong way.
>
> ```
> typedef void (*post_parse_analyze_hook_type) (ParseState *pstate,
>
> Query *query,
>
> const JumbleState *jstate);
> ```
>
> This of course will be a big breaking change to all the extensions out
> there using this hook. This is not just about a signature change of
> the hook, but an author now has to figure out how to deal with a
> constant JumbleState in their code, which may not be trivial.
>
I wonder if there is a single extension out there that actually utilizes
JumbleState in that hook -
it was added in 5fd9dfa5f50e4906c35133a414ebec5b6d518493 presumably because
pg_stat_statements
needed it, but even Julien at the time was critical of adding it, mainly
considering it for pgss needs if I read
the archive correctly [0]. CCing Julien to correct me if I misinterpret the
historic record here.
Reading through the discussion in this thread here I do wonder a bit if we
shouldn't just take that out of the
hook again, and instead provide separate functions for getting the
normalized query string (generating it the
first time its called).
My proposal in v3 was a bit more softer, and keeps the hook
> backwards compatible. Basically, we keep JumbleState a non-constant,
> but provide core APIs for any operation, such as
> generate_normalized_query,
> that needs to modify the state. So, my approach was not about enforcing a
> read-only JumbleState, but about providing the API to dissuade an author
> from modifying a JumbleState.
Given the lack of public APIs to modify JumbleState today, I don't see how
an extension would do
modifications in a meaningful way, short of copying the code. I think we
should be a bit bolder here in
enforcing a convention, either clearly making it read-only or dropping the
argument again.
Thinking through a couple of theoretical use cases:
1) An extension that wants to display normalized query strings
This seems to be the biggest kind of what I can find with code search.
Extensions like pg_stat_monitor [1], that
want to do something like pg_stat_statements, and have to copy a bunch of
normalization code today that is 1:1 what
Postgres does. Such extensions don't need the JumbleState argument if they
can get the normalized text directly.
2) An extension that wants to capture parameter values
Some extensions may want to remember additional context for normalized
queries, like pg_tracing's logic for
optionally passing parameter values (post normalization) in the trace
context [2]. If we kept passing a read-only
JumbleState then such extensions could presumably still get this, but I
wonder if it wouldn't be better for core to
have a helper for this?
3) An extension that modifies the JumbleState to cause different
normalization to happen
I'm not aware of any extensions doing this, and I couldn't find any
either. And since such theoretical extensions
can directly modify query->queryId in the same hook, why not do that?
4) Extensions using the presence of JumbleState to determine whether query
IDs are available (?)
I noticed pg_hint_plan checks the JumbleState argument to determine whether
or not to run an early check
of the hint logic. I'm a little confused by this (and if this is about
query IDs being available, couldn't it just check the
Query having a non-zero queryID?), but FWIW: [3]
[0]:
https://www.postgresql.org/message-id/flat/CAOBaU_ZbeQrUESCuLGk3sRZWxXFGaBBO39CxSsFxLeZGicUrJw%40mai...
[1]:
https://github.com/percona/pg_stat_monitor/blob/main/pg_stat_monitor.c#L476
[2]:
https://github.com/DataDog/pg_tracing/blob/main/src/pg_tracing_query_process.c#L428
[3]:
https://github.com/ossc-db/pg_hint_plan/blob/master/pg_hint_plan.c#L3111
Thanks,
Lukas
--
Lukas Fittl
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2025-12-30 04:31 Michael Paquier <[email protected]>
parent: Lukas Fittl <[email protected]>
1 sibling, 0 replies; 35+ messages in thread
From: Michael Paquier @ 2025-12-30 04:31 UTC (permalink / raw)
To: Lukas Fittl <[email protected]>; +Cc: Sami Imseih <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>
On Mon, Dec 29, 2025 at 06:34:18PM -0800, Lukas Fittl wrote:
> I noticed pg_hint_plan checks the JumbleState argument to determine whether
> or not to run an early check of the hint logic. I'm a little
> confused by this (and if this is about query IDs being available,
> couldn't it just check the Query having a non-zero queryID?), but
> FWIW: [3]
See also the code before 32ced2e13e75. We wanted a JumbleState in
this case to be able to generate a normalized query for the hint
table. Since this commit we only rely on the query ID in the hint
table, so it would not matter for pg_hint_plan if the JumbleState is
removed from this portions of the hooks.
Saying that, yes I think that you are right, we should be able to
remove this dependency on JumbleState in
pg_hint_plan_post_parse_analyze() and rely on the query ID. I'm
writing down a note about cleaning that on the HEAD branch of the
module..
--
Michael
Attachments:
[application/pgp-signature] signature.asc (833B, 2-signature.asc)
download
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2025-12-30 08:13 Bertrand Drouvot <[email protected]>
parent: Sami Imseih <[email protected]>
2 siblings, 0 replies; 35+ messages in thread
From: Bertrand Drouvot @ 2025-12-30 08:13 UTC (permalink / raw)
To: Sami Imseih <[email protected]>; +Cc: zengman <[email protected]>; pgsql-hackers
Hi,
On Tue, Dec 23, 2025 at 10:35:18AM -0600, Sami Imseih wrote:
> v3 implements this approach without a callback. This establishes a clear
> boundary: core owns JumbleState modifications, extensions consume the
> results through the API.
>
Thanks for the new patch version.
Some random comments:
=== 1
+ SetConstantLengths((JumbleState *) jstate, query, query_loc);
This cast seems unnecessary.
=== 2
+CompLocation(const void *a, const void *b)
In the commit message I can see "Functions are renamed to match core naming
conventions" but wasn't comp_location() better?
=== 3
+ /*
+ * generate the normalized query. Note that the normalized
+ * representation may well vary depending on just which
+ * "equivalent" query is used to create the hashtable entry. We
+ * assume this is OK.
+ */
+ norm_query = GenerateNormalizedQuery(jstate, query,
Should part of this comment be on top of the GenerateNormalizedQuery()
definition instead?
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2026-01-06 20:20 Sami Imseih <[email protected]>
parent: Lukas Fittl <[email protected]>
1 sibling, 1 reply; 35+ messages in thread
From: Sami Imseih @ 2026-01-06 20:20 UTC (permalink / raw)
To: Lukas Fittl <[email protected]>; +Cc: Michael Paquier <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>
> >> backwards compatible. Basically, we keep JumbleState a non-constant,
> >> but provide core APIs for any operation, such as
> >> generate_normalized_query,
> >> that needs to modify the state. So, my approach was not about enforcing a
> >> read-only JumbleState, but about providing the API to dissuade an author
> >> from modifying a JumbleState.
>
> > Given the lack of public APIs to modify JumbleState today, I don't see how
> > an extension would do
> > modifications in a meaningful way, short of copying the code. I think we
> > should be a bit bolder here in
> > enforcing a convention, either clearly making it read-only or dropping the
> > argument again.
>
> Based on the discussion so far I am leaning towards making JumbleState
> read-only as described here [0]. I don't see how we can drop JumbleState
> completely from hooks, since normalization needs to occur on-demand
> by the extension.
Here is v4.
0001 - addresses the comments made by Bertrand.
0002 - makes JumbleState a constant when passed to post_parse_analyze
and updates all downstream code that consume the JumbleState. This
means we now need to copy the locations from JState into a local/temporary
array when generating the normalized string.
--
Sami Imseih
Amazon Web Services (AWS)
Attachments:
[application/octet-stream] v4-0002-Make-JumbleState-const-in-post_parse_analyze-hook.patch (8.5K, 2-v4-0002-Make-JumbleState-const-in-post_parse_analyze-hook.patch)
download | inline diff:
From 12575c2db5ccfc070b4024d644e4c72e32497486 Mon Sep 17 00:00:00 2001
From: Ubuntu <[email protected]>
Date: Tue, 6 Jan 2026 19:29:01 +0000
Subject: [PATCH v4 2/2] Make JumbleState const in post_parse_analyze hook
Change the post_parse_analyze_hook_type signature to take a const
JumbleState parameter, preventing hooks from modifying the jumble
state during query analysis. This improves API safety by making it
clear that hooks should only read from the jumble state, not modify
it.
Update pg_stat_statements and related functions to match the new const
signature. Refactor SetConstantLengths() to return a newly allocated
LocationLen array instead of modifying the JumbleState, and update
GenerateNormalizedQuery() to work with the const JumbleState and
manage the returned array properly.
Discussion: https://postgr.es/m/CAA5RZ0tZp5qU0ikZEEqJnxvdSNGh1DWv80sb-k4QAUmiMoOp_Q@mail.gmail.com
---
.../pg_stat_statements/pg_stat_statements.c | 8 ++--
src/backend/nodes/queryjumblefuncs.c | 39 +++++++++++++------
src/include/nodes/queryjumble.h | 2 +-
src/include/parser/analyze.h | 2 +-
4 files changed, 33 insertions(+), 18 deletions(-)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index b90891859d9..b796d839cec 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -332,7 +332,7 @@ static void pgss_shmem_request(void);
static void pgss_shmem_startup(void);
static void pgss_shmem_shutdown(int code, Datum arg);
static void pgss_post_parse_analyze(ParseState *pstate, Query *query,
- JumbleState *jstate);
+ const JumbleState *jstate);
static PlannedStmt *pgss_planner(Query *parse,
const char *query_string,
int cursorOptions,
@@ -356,7 +356,7 @@ static void pgss_store(const char *query, int64 queryId,
const BufferUsage *bufusage,
const WalUsage *walusage,
const struct JitInstrumentation *jitusage,
- JumbleState *jstate,
+ const JumbleState *jstate,
int parallel_workers_to_launch,
int parallel_workers_launched,
PlannedStmtOrigin planOrigin);
@@ -832,7 +832,7 @@ error:
* Post-parse-analysis hook: mark query with a queryId
*/
static void
-pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
+pgss_post_parse_analyze(ParseState *pstate, Query *query, const JumbleState *jstate)
{
if (prev_post_parse_analyze_hook)
prev_post_parse_analyze_hook(pstate, query, jstate);
@@ -1289,7 +1289,7 @@ pgss_store(const char *query, int64 queryId,
const BufferUsage *bufusage,
const WalUsage *walusage,
const struct JitInstrumentation *jitusage,
- JumbleState *jstate,
+ const JumbleState *jstate,
int parallel_workers_to_launch,
int parallel_workers_launched,
PlannedStmtOrigin planOrigin)
diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c
index dd53c82b2a8..4dfe9087547 100644
--- a/src/backend/nodes/queryjumblefuncs.c
+++ b/src/backend/nodes/queryjumblefuncs.c
@@ -789,8 +789,10 @@ CompLocation(const void *a, const void *b)
}
/*
- * Given a valid SQL string and an array of constant-location records,
- * fill in the textual lengths of those constants in JumbleState.
+ * Given a valid SQL string and an array of constant-location records, return
+ * the textual lengths of those constants in a newly allocated LocationLen
+ * array, or NULL if there are no constants. It is the caller's responsibility
+ * to pfree the result, if necessary.
*
* The constants may use any allowed constant syntax, such as float literals,
* bit-strings, single-quoted strings and dollar-quoted strings. This is
@@ -803,8 +805,8 @@ CompLocation(const void *a, const void *b)
* a negative numeric constant. This precludes there ever being another
* reason for a constant to start with a '-'.
*/
-static void
-SetConstantLengths(JumbleState *jstate, const char *query,
+static LocationLen *
+SetConstantLengths(const JumbleState *jstate, const char *query,
int query_loc)
{
LocationLen *locs;
@@ -813,14 +815,20 @@ SetConstantLengths(JumbleState *jstate, const char *query,
core_YYSTYPE yylval;
YYLTYPE yylloc;
+ if (jstate->clocations_count == 0)
+ return NULL;
+
+ /* Copy constant locations to avoid modifying jstate */
+ locs = palloc(jstate->clocations_count * sizeof(LocationLen));
+ memcpy(locs, jstate->clocations, jstate->clocations_count * sizeof(LocationLen));
+
/*
* Sort the records by location so that we can process them in order while
* scanning the query text.
*/
if (jstate->clocations_count > 1)
- qsort(jstate->clocations, jstate->clocations_count,
+ qsort(locs, jstate->clocations_count,
sizeof(LocationLen), CompLocation);
- locs = jstate->clocations;
/* initialize the flex scanner --- should match raw_parser() */
yyscanner = scanner_init(query,
@@ -908,6 +916,8 @@ SetConstantLengths(JumbleState *jstate, const char *query,
}
scanner_finish(yyscanner);
+
+ return locs;
}
/*
@@ -926,7 +936,7 @@ SetConstantLengths(JumbleState *jstate, const char *query,
* Returns a palloc'd string.
*/
char *
-GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+GenerateNormalizedQuery(const JumbleState *jstate, const char *query,
int query_loc, int *query_len_p)
{
char *norm_query;
@@ -938,12 +948,13 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
last_off = 0, /* Offset from start for previous tok */
last_tok_len = 0; /* Length (in bytes) of that tok */
int num_constants_replaced = 0;
+ LocationLen *locs = NULL;
/*
* Set constants' lengths in JumbleState, as only locations are set during
* DoJumble(). Note this also ensures the items are sorted by location.
*/
- SetConstantLengths(jstate, query, query_loc);
+ locs = SetConstantLengths(jstate, query, query_loc);
/*
* Allow for $n symbols to be longer than the constants they replace.
@@ -969,15 +980,15 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
* the parameter in the next iteration (or after the loop is done),
* which is a bit odd but seems to work okay in most cases.
*/
- if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
+ if (locs[i].extern_param && !jstate->has_squashed_lists)
continue;
- off = jstate->clocations[i].location;
+ off = locs[i].location;
/* Adjust the constant's location (see SetConstantLengths) */
off -= query_loc;
- tok_len = jstate->clocations[i].length;
+ tok_len = locs[i].length;
if (tok_len < 0)
continue; /* ignore any duplicates */
@@ -996,7 +1007,7 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
*/
n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
num_constants_replaced + 1 + jstate->highest_extern_param_id,
- jstate->clocations[i].squashed ? " /*, ... */" : "");
+ locs[i].squashed ? " /*, ... */" : "");
num_constants_replaced++;
/* move forward */
@@ -1005,6 +1016,10 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
last_tok_len = tok_len;
}
+ /* Clean up temporary location array before any assertions */
+ if (locs)
+ pfree(locs);
+
/*
* We've copied up until the last ignorable constant. Copy over the
* remaining bytes of the original query string.
diff --git a/src/include/nodes/queryjumble.h b/src/include/nodes/queryjumble.h
index cb8741b1d05..bc3dc4d846f 100644
--- a/src/include/nodes/queryjumble.h
+++ b/src/include/nodes/queryjumble.h
@@ -91,7 +91,7 @@ extern PGDLLIMPORT int compute_query_id;
extern const char *CleanQuerytext(const char *query, int *location, int *len);
-extern char *GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+extern char *GenerateNormalizedQuery(const JumbleState *jstate, const char *query,
int query_loc, int *query_len_p);
extern JumbleState *JumbleQuery(Query *query);
extern void EnableQueryId(void);
diff --git a/src/include/parser/analyze.h b/src/include/parser/analyze.h
index f29ed03b476..f80da72933c 100644
--- a/src/include/parser/analyze.h
+++ b/src/include/parser/analyze.h
@@ -21,7 +21,7 @@
/* Hook for plugins to get control at end of parse analysis */
typedef void (*post_parse_analyze_hook_type) (ParseState *pstate,
Query *query,
- JumbleState *jstate);
+ const JumbleState *jstate);
extern PGDLLIMPORT post_parse_analyze_hook_type post_parse_analyze_hook;
--
2.43.0
[application/octet-stream] v4-0001-pg_stat_statements-Move-query-normalization-to-co.patch (20.8K, 3-v4-0001-pg_stat_statements-Move-query-normalization-to-co.patch)
download | inline diff:
From 63741aec2ca489d066ed4cdc8f50ec36b7894ffd Mon Sep 17 00:00:00 2001
From: Ubuntu <[email protected]>
Date: Thu, 18 Dec 2025 18:59:31 +0000
Subject: [PATCH v4 1/2] pg_stat_statements: Move query normalization to core
pg_stat_statements was modifying core generated JumbleState instances
by calling fill_in_constant_lengths() via generate_normalized_query()
to populate constant lengths. This creates problems since extensions
should not modify shared core data structures that may be used by
other extensions, and this could be considered a layering violation.
Move query normalization functions to core as a global function to
fix this. Extensions now call GenerateNormalizedQuery() instead of
directly modifying JumbleState.
This change also addresses code duplication, as several extensions
have been copying generate_normalized_query() to implement their own
normalization. The new core API provides a single, consistent
implementation for all extensions to use.
Functions are renamed to match core naming conventions and comments
are updated for clarity.
Discussion: https://postgr.es/m/CAA5RZ0tZp5qU0ikZEEqJnxvdSNGh1DWv80sb-k4QAUmiMoOp_Q@mail.gmail.com
---
.../pg_stat_statements/pg_stat_statements.c | 276 +-----------------
src/backend/nodes/queryjumblefuncs.c | 248 ++++++++++++++++
src/include/nodes/queryjumble.h | 2 +
3 files changed, 260 insertions(+), 266 deletions(-)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 39208f80b5b..b90891859d9 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -50,7 +50,6 @@
#include "access/htup_details.h"
#include "access/parallel.h"
#include "catalog/pg_authid.h"
-#include "common/int.h"
#include "executor/instrument.h"
#include "funcapi.h"
#include "jit/jit.h"
@@ -59,7 +58,6 @@
#include "nodes/queryjumble.h"
#include "optimizer/planner.h"
#include "parser/analyze.h"
-#include "parser/scanner.h"
#include "pgstat.h"
#include "storage/fd.h"
#include "storage/ipc.h"
@@ -377,11 +375,6 @@ static char *qtext_fetch(Size query_offset, int query_len,
static bool need_gc_qtexts(void);
static void gc_qtexts(void);
static TimestampTz entry_reset(Oid userid, Oid dbid, int64 queryid, bool minmax_only);
-static char *generate_normalized_query(JumbleState *jstate, const char *query,
- int query_loc, int *query_len_p);
-static void fill_in_constant_lengths(JumbleState *jstate, const char *query,
- int query_loc);
-static int comp_location(const void *a, const void *b);
/*
@@ -1359,9 +1352,16 @@ pgss_store(const char *query, int64 queryId,
if (jstate)
{
LWLockRelease(pgss->lock);
- norm_query = generate_normalized_query(jstate, query,
- query_location,
- &query_len);
+
+ /*
+ * generate the normalized query. Note that the normalized
+ * representation may well vary depending on just which
+ * "equivalent" query is used to create the hashtable entry. We
+ * assume this is OK.
+ */
+ norm_query = GenerateNormalizedQuery(jstate, query,
+ query_location,
+ &query_len);
LWLockAcquire(pgss->lock, LW_SHARED);
}
@@ -2823,259 +2823,3 @@ release_lock:
return stats_reset;
}
-
-/*
- * Generate a normalized version of the query string that will be used to
- * represent all similar queries.
- *
- * Note that the normalized representation may well vary depending on
- * just which "equivalent" query is used to create the hashtable entry.
- * We assume this is OK.
- *
- * If query_loc > 0, then "query" has been advanced by that much compared to
- * the original string start, so we need to translate the provided locations
- * to compensate. (This lets us avoid re-scanning statements before the one
- * of interest, so it's worth doing.)
- *
- * *query_len_p contains the input string length, and is updated with
- * the result string length on exit. The resulting string might be longer
- * or shorter depending on what happens with replacement of constants.
- *
- * Returns a palloc'd string.
- */
-static char *
-generate_normalized_query(JumbleState *jstate, const char *query,
- int query_loc, int *query_len_p)
-{
- char *norm_query;
- int query_len = *query_len_p;
- int norm_query_buflen, /* Space allowed for norm_query */
- len_to_wrt, /* Length (in bytes) to write */
- quer_loc = 0, /* Source query byte location */
- n_quer_loc = 0, /* Normalized query byte location */
- last_off = 0, /* Offset from start for previous tok */
- last_tok_len = 0; /* Length (in bytes) of that tok */
- int num_constants_replaced = 0;
-
- /*
- * Get constants' lengths (core system only gives us locations). Note
- * this also ensures the items are sorted by location.
- */
- fill_in_constant_lengths(jstate, query, query_loc);
-
- /*
- * Allow for $n symbols to be longer than the constants they replace.
- * Constants must take at least one byte in text form, while a $n symbol
- * certainly isn't more than 11 bytes, even if n reaches INT_MAX. We
- * could refine that limit based on the max value of n for the current
- * query, but it hardly seems worth any extra effort to do so.
- */
- norm_query_buflen = query_len + jstate->clocations_count * 10;
-
- /* Allocate result buffer */
- norm_query = palloc(norm_query_buflen + 1);
-
- for (int i = 0; i < jstate->clocations_count; i++)
- {
- int off, /* Offset from start for cur tok */
- tok_len; /* Length (in bytes) of that tok */
-
- /*
- * If we have an external param at this location, but no lists are
- * being squashed across the query, then we skip here; this will make
- * us print the characters found in the original query that represent
- * the parameter in the next iteration (or after the loop is done),
- * which is a bit odd but seems to work okay in most cases.
- */
- if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
- continue;
-
- off = jstate->clocations[i].location;
-
- /* Adjust recorded location if we're dealing with partial string */
- off -= query_loc;
-
- tok_len = jstate->clocations[i].length;
-
- if (tok_len < 0)
- continue; /* ignore any duplicates */
-
- /* Copy next chunk (what precedes the next constant) */
- len_to_wrt = off - last_off;
- len_to_wrt -= last_tok_len;
- Assert(len_to_wrt >= 0);
- memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
- n_quer_loc += len_to_wrt;
-
- /*
- * And insert a param symbol in place of the constant token; and, if
- * we have a squashable list, insert a placeholder comment starting
- * from the list's second value.
- */
- n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
- num_constants_replaced + 1 + jstate->highest_extern_param_id,
- jstate->clocations[i].squashed ? " /*, ... */" : "");
- num_constants_replaced++;
-
- /* move forward */
- quer_loc = off + tok_len;
- last_off = off;
- last_tok_len = tok_len;
- }
-
- /*
- * We've copied up until the last ignorable constant. Copy over the
- * remaining bytes of the original query string.
- */
- len_to_wrt = query_len - quer_loc;
-
- Assert(len_to_wrt >= 0);
- memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
- n_quer_loc += len_to_wrt;
-
- Assert(n_quer_loc <= norm_query_buflen);
- norm_query[n_quer_loc] = '\0';
-
- *query_len_p = n_quer_loc;
- return norm_query;
-}
-
-/*
- * Given a valid SQL string and an array of constant-location records,
- * fill in the textual lengths of those constants.
- *
- * The constants may use any allowed constant syntax, such as float literals,
- * bit-strings, single-quoted strings and dollar-quoted strings. This is
- * accomplished by using the public API for the core scanner.
- *
- * It is the caller's job to ensure that the string is a valid SQL statement
- * with constants at the indicated locations. Since in practice the string
- * has already been parsed, and the locations that the caller provides will
- * have originated from within the authoritative parser, this should not be
- * a problem.
- *
- * Multiple constants can have the same location. We reset lengths of those
- * past the first to -1 so that they can later be ignored.
- *
- * If query_loc > 0, then "query" has been advanced by that much compared to
- * the original string start, so we need to translate the provided locations
- * to compensate. (This lets us avoid re-scanning statements before the one
- * of interest, so it's worth doing.)
- *
- * N.B. There is an assumption that a '-' character at a Const location begins
- * a negative numeric constant. This precludes there ever being another
- * reason for a constant to start with a '-'.
- */
-static void
-fill_in_constant_lengths(JumbleState *jstate, const char *query,
- int query_loc)
-{
- LocationLen *locs;
- core_yyscan_t yyscanner;
- core_yy_extra_type yyextra;
- core_YYSTYPE yylval;
- YYLTYPE yylloc;
-
- /*
- * Sort the records by location so that we can process them in order while
- * scanning the query text.
- */
- if (jstate->clocations_count > 1)
- qsort(jstate->clocations, jstate->clocations_count,
- sizeof(LocationLen), comp_location);
- locs = jstate->clocations;
-
- /* initialize the flex scanner --- should match raw_parser() */
- yyscanner = scanner_init(query,
- &yyextra,
- &ScanKeywords,
- ScanKeywordTokens);
-
- /* we don't want to re-emit any escape string warnings */
- yyextra.escape_string_warning = false;
-
- /* Search for each constant, in sequence */
- for (int i = 0; i < jstate->clocations_count; i++)
- {
- int loc;
- int tok;
-
- /* Ignore constants after the first one in the same location */
- if (i > 0 && locs[i].location == locs[i - 1].location)
- {
- locs[i].length = -1;
- continue;
- }
-
- if (locs[i].squashed)
- continue; /* squashable list, ignore */
-
- /* Adjust recorded location if we're dealing with partial string */
- loc = locs[i].location - query_loc;
- Assert(loc >= 0);
-
- /*
- * We have a valid location for a constant that's not a dupe. Lex
- * tokens until we find the desired constant.
- */
- for (;;)
- {
- tok = core_yylex(&yylval, &yylloc, yyscanner);
-
- /* We should not hit end-of-string, but if we do, behave sanely */
- if (tok == 0)
- break; /* out of inner for-loop */
-
- /*
- * We should find the token position exactly, but if we somehow
- * run past it, work with that.
- */
- if (yylloc >= loc)
- {
- if (query[loc] == '-')
- {
- /*
- * It's a negative value - this is the one and only case
- * where we replace more than a single token.
- *
- * Do not compensate for the core system's special-case
- * adjustment of location to that of the leading '-'
- * operator in the event of a negative constant. It is
- * also useful for our purposes to start from the minus
- * symbol. In this way, queries like "select * from foo
- * where bar = 1" and "select * from foo where bar = -2"
- * will have identical normalized query strings.
- */
- tok = core_yylex(&yylval, &yylloc, yyscanner);
- if (tok == 0)
- break; /* out of inner for-loop */
- }
-
- /*
- * We now rely on the assumption that flex has placed a zero
- * byte after the text of the current token in scanbuf.
- */
- locs[i].length = strlen(yyextra.scanbuf + loc);
- break; /* out of inner for-loop */
- }
- }
-
- /* If we hit end-of-string, give up, leaving remaining lengths -1 */
- if (tok == 0)
- break;
- }
-
- scanner_finish(yyscanner);
-}
-
-/*
- * comp_location: comparator for qsorting LocationLen structs by location
- */
-static int
-comp_location(const void *a, const void *b)
-{
- int l = ((const LocationLen *) a)->location;
- int r = ((const LocationLen *) b)->location;
-
- return pg_cmp_s32(l, r);
-}
diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c
index ffc230af427..dd53c82b2a8 100644
--- a/src/backend/nodes/queryjumblefuncs.c
+++ b/src/backend/nodes/queryjumblefuncs.c
@@ -40,10 +40,12 @@
#include "access/transam.h"
#include "catalog/pg_proc.h"
#include "common/hashfn.h"
+#include "common/int.h"
#include "miscadmin.h"
#include "nodes/nodeFuncs.h"
#include "nodes/queryjumble.h"
#include "utils/lsyscache.h"
+#include "parser/scanner.h"
#include "parser/scansup.h"
#define JUMBLE_SIZE 1024 /* query serialization buffer size */
@@ -773,3 +775,249 @@ _jumbleRangeTblEntry_eref(JumbleState *jstate,
*/
JUMBLE_STRING(aliasname);
}
+
+/*
+ * CompLocation: comparator for qsorting LocationLen structs by location
+ */
+static int
+CompLocation(const void *a, const void *b)
+{
+ int l = ((const LocationLen *) a)->location;
+ int r = ((const LocationLen *) b)->location;
+
+ return pg_cmp_s32(l, r);
+}
+
+/*
+ * Given a valid SQL string and an array of constant-location records,
+ * fill in the textual lengths of those constants in JumbleState.
+ *
+ * The constants may use any allowed constant syntax, such as float literals,
+ * bit-strings, single-quoted strings and dollar-quoted strings. This is
+ * accomplished by using the public API for the core scanner.
+ *
+ * Multiple constants can have the same location. We reset lengths of those
+ * past the first to -1 so that they can later be ignored.
+ *
+ * N.B. There is an assumption that a '-' character at a Const location begins
+ * a negative numeric constant. This precludes there ever being another
+ * reason for a constant to start with a '-'.
+ */
+static void
+SetConstantLengths(JumbleState *jstate, const char *query,
+ int query_loc)
+{
+ LocationLen *locs;
+ core_yyscan_t yyscanner;
+ core_yy_extra_type yyextra;
+ core_YYSTYPE yylval;
+ YYLTYPE yylloc;
+
+ /*
+ * Sort the records by location so that we can process them in order while
+ * scanning the query text.
+ */
+ if (jstate->clocations_count > 1)
+ qsort(jstate->clocations, jstate->clocations_count,
+ sizeof(LocationLen), CompLocation);
+ locs = jstate->clocations;
+
+ /* initialize the flex scanner --- should match raw_parser() */
+ yyscanner = scanner_init(query,
+ &yyextra,
+ &ScanKeywords,
+ ScanKeywordTokens);
+
+ /* we don't want to re-emit any escape string warnings */
+ yyextra.escape_string_warning = false;
+
+ /* Search for each constant, in sequence */
+ for (int i = 0; i < jstate->clocations_count; i++)
+ {
+ int loc;
+ int tok;
+
+ /* Ignore constants after the first one in the same location */
+ if (i > 0 && locs[i].location == locs[i - 1].location)
+ {
+ locs[i].length = -1;
+ continue;
+ }
+
+ if (locs[i].squashed)
+ continue; /* squashable list, ignore */
+
+ /*
+ * Adjust the constant's location using the provided starting location
+ * of the current statement. This allows us to avoid scanning a
+ * multi-statement string from the beginning.
+ */
+ loc = locs[i].location - query_loc;
+ Assert(loc >= 0);
+
+ /*
+ * We have a valid location for a constant that's not a dupe. Lex
+ * tokens until we find the desired constant.
+ */
+ for (;;)
+ {
+ tok = core_yylex(&yylval, &yylloc, yyscanner);
+
+ /* We should not hit end-of-string, but if we do, behave sanely */
+ if (tok == 0)
+ break; /* out of inner for-loop */
+
+ /*
+ * We should find the token position exactly, but if we somehow
+ * run past it, work with that.
+ */
+ if (yylloc >= loc)
+ {
+ if (query[loc] == '-')
+ {
+ /*
+ * It's a negative value - this is the one and only case
+ * where we replace more than a single token.
+ *
+ * Do not compensate for the special-case adjustment of
+ * location to that of the leading '-' operator in the
+ * event of a negative constant (see doNegate() in
+ * gram.y). It is also useful for our purposes to start
+ * from the minus symbol. In this way, queries like
+ * "select * from foo where bar = 1" and "select * from
+ * foo where bar = -2" will have identical normalized
+ * query strings.
+ */
+ tok = core_yylex(&yylval, &yylloc, yyscanner);
+ if (tok == 0)
+ break; /* out of inner for-loop */
+ }
+
+ /*
+ * We now rely on the assumption that flex has placed a zero
+ * byte after the text of the current token in scanbuf.
+ */
+ locs[i].length = strlen(yyextra.scanbuf + loc);
+ break; /* out of inner for-loop */
+ }
+ }
+
+ /* If we hit end-of-string, give up, leaving remaining lengths -1 */
+ if (tok == 0)
+ break;
+ }
+
+ scanner_finish(yyscanner);
+}
+
+/*
+ * Callback to generate a normalized version of the query string that will be used to
+ * represent all similar queries.
+ *
+ * It is the caller's job to ensure that the string is a valid SQL statement
+ * with the correct constant locations in jstate->clocations. Since in
+ * practice the string has already been parsed using the authoritative parser
+ * and the locations are set by core query jumbling, this should not be a problem.
+ *
+ * *query_len_p contains the input string length, and is updated with
+ * the result string length on exit. The resulting string might be longer
+ * or shorter depending on what happens with replacement of constants.
+ *
+ * Returns a palloc'd string.
+ */
+char *
+GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+ int query_loc, int *query_len_p)
+{
+ char *norm_query;
+ int query_len = *query_len_p;
+ int norm_query_buflen, /* Space allowed for norm_query */
+ len_to_wrt, /* Length (in bytes) to write */
+ quer_loc = 0, /* Source query byte location */
+ n_quer_loc = 0, /* Normalized query byte location */
+ last_off = 0, /* Offset from start for previous tok */
+ last_tok_len = 0; /* Length (in bytes) of that tok */
+ int num_constants_replaced = 0;
+
+ /*
+ * Set constants' lengths in JumbleState, as only locations are set during
+ * DoJumble(). Note this also ensures the items are sorted by location.
+ */
+ SetConstantLengths(jstate, query, query_loc);
+
+ /*
+ * Allow for $n symbols to be longer than the constants they replace.
+ * Constants must take at least one byte in text form, while a $n symbol
+ * certainly isn't more than 11 bytes, even if n reaches INT_MAX. We
+ * could refine that limit based on the max value of n for the current
+ * query, but it hardly seems worth any extra effort to do so.
+ */
+ norm_query_buflen = query_len + jstate->clocations_count * 10;
+
+ /* Allocate result buffer */
+ norm_query = palloc(norm_query_buflen + 1);
+
+ for (int i = 0; i < jstate->clocations_count; i++)
+ {
+ int off, /* Offset from start for cur tok */
+ tok_len; /* Length (in bytes) of that tok */
+
+ /*
+ * If we have an external param at this location, but no lists are
+ * being squashed across the query, then we skip here; this will make
+ * us print the characters found in the original query that represent
+ * the parameter in the next iteration (or after the loop is done),
+ * which is a bit odd but seems to work okay in most cases.
+ */
+ if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
+ continue;
+
+ off = jstate->clocations[i].location;
+
+ /* Adjust the constant's location (see SetConstantLengths) */
+ off -= query_loc;
+
+ tok_len = jstate->clocations[i].length;
+
+ if (tok_len < 0)
+ continue; /* ignore any duplicates */
+
+ /* Copy next chunk (what precedes the next constant) */
+ len_to_wrt = off - last_off;
+ len_to_wrt -= last_tok_len;
+ Assert(len_to_wrt >= 0);
+ memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
+ n_quer_loc += len_to_wrt;
+
+ /*
+ * And insert a param symbol in place of the constant token; and, if
+ * we have a squashable list, insert a placeholder comment starting
+ * from the list's second value.
+ */
+ n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
+ num_constants_replaced + 1 + jstate->highest_extern_param_id,
+ jstate->clocations[i].squashed ? " /*, ... */" : "");
+ num_constants_replaced++;
+
+ /* move forward */
+ quer_loc = off + tok_len;
+ last_off = off;
+ last_tok_len = tok_len;
+ }
+
+ /*
+ * We've copied up until the last ignorable constant. Copy over the
+ * remaining bytes of the original query string.
+ */
+ len_to_wrt = query_len - quer_loc;
+
+ Assert(len_to_wrt >= 0);
+ memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
+ n_quer_loc += len_to_wrt;
+
+ Assert(n_quer_loc <= norm_query_buflen);
+ norm_query[n_quer_loc] = '\0';
+
+ *query_len_p = n_quer_loc;
+ return norm_query;
+}
diff --git a/src/include/nodes/queryjumble.h b/src/include/nodes/queryjumble.h
index dcb36dcb44f..cb8741b1d05 100644
--- a/src/include/nodes/queryjumble.h
+++ b/src/include/nodes/queryjumble.h
@@ -91,6 +91,8 @@ extern PGDLLIMPORT int compute_query_id;
extern const char *CleanQuerytext(const char *query, int *location, int *len);
+extern char *GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+ int query_loc, int *query_len_p);
extern JumbleState *JumbleQuery(Query *query);
extern void EnableQueryId(void);
--
2.43.0
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2026-03-17 02:12 Sami Imseih <[email protected]>
parent: Sami Imseih <[email protected]>
0 siblings, 1 reply; 35+ messages in thread
From: Sami Imseih @ 2026-03-17 02:12 UTC (permalink / raw)
To: Lukas Fittl <[email protected]>; +Cc: Michael Paquier <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>
> Here is v4.
>
> 0001 - addresses the comments made by Bertrand.
> 0002 - makes JumbleState a constant when passed to post_parse_analyze
> and updates all downstream code that consume the JumbleState. This
> means we now need to copy the locations from JState into a local/temporary
> array when generating the normalized string.
PFA a mandatory rebase.
--
Sami Imseih
Amazon Web Services (AWS)
Attachments:
[application/octet-stream] v5-0002-Make-JumbleState-const-in-post_parse_analyze-hook.patch (8.5K, 2-v5-0002-Make-JumbleState-const-in-post_parse_analyze-hook.patch)
download | inline diff:
From 348377e0473d2b28cc50db0ba5e950f9e28bbe87 Mon Sep 17 00:00:00 2001
From: Ubuntu <[email protected]>
Date: Tue, 6 Jan 2026 19:29:01 +0000
Subject: [PATCH v5 2/2] Make JumbleState const in post_parse_analyze hook
Change the post_parse_analyze_hook_type signature to take a const
JumbleState parameter, preventing hooks from modifying the jumble
state during query analysis. This improves API safety by making it
clear that hooks should only read from the jumble state, not modify
it.
Update pg_stat_statements and related functions to match the new const
signature. Refactor SetConstantLengths() to return a newly allocated
LocationLen array instead of modifying the JumbleState, and update
GenerateNormalizedQuery() to work with the const JumbleState and
manage the returned array properly.
Discussion: https://postgr.es/m/CAA5RZ0tZp5qU0ikZEEqJnxvdSNGh1DWv80sb-k4QAUmiMoOp_Q@mail.gmail.com
---
.../pg_stat_statements/pg_stat_statements.c | 8 ++--
src/backend/nodes/queryjumblefuncs.c | 39 +++++++++++++------
src/include/nodes/queryjumble.h | 2 +-
src/include/parser/analyze.h | 2 +-
4 files changed, 33 insertions(+), 18 deletions(-)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index c9338442d96..386a1419232 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -333,7 +333,7 @@ static void pgss_shmem_request(void);
static void pgss_shmem_startup(void);
static void pgss_shmem_shutdown(int code, Datum arg);
static void pgss_post_parse_analyze(ParseState *pstate, Query *query,
- JumbleState *jstate);
+ const JumbleState *jstate);
static PlannedStmt *pgss_planner(Query *parse,
const char *query_string,
int cursorOptions,
@@ -357,7 +357,7 @@ static void pgss_store(const char *query, int64 queryId,
const BufferUsage *bufusage,
const WalUsage *walusage,
const struct JitInstrumentation *jitusage,
- JumbleState *jstate,
+ const JumbleState *jstate,
int parallel_workers_to_launch,
int parallel_workers_launched,
PlannedStmtOrigin planOrigin);
@@ -833,7 +833,7 @@ error:
* Post-parse-analysis hook: mark query with a queryId
*/
static void
-pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
+pgss_post_parse_analyze(ParseState *pstate, Query *query, const JumbleState *jstate)
{
if (prev_post_parse_analyze_hook)
prev_post_parse_analyze_hook(pstate, query, jstate);
@@ -1290,7 +1290,7 @@ pgss_store(const char *query, int64 queryId,
const BufferUsage *bufusage,
const WalUsage *walusage,
const struct JitInstrumentation *jitusage,
- JumbleState *jstate,
+ const JumbleState *jstate,
int parallel_workers_to_launch,
int parallel_workers_launched,
PlannedStmtOrigin planOrigin)
diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c
index d4b26202c47..fe8f0ccd278 100644
--- a/src/backend/nodes/queryjumblefuncs.c
+++ b/src/backend/nodes/queryjumblefuncs.c
@@ -789,8 +789,10 @@ CompLocation(const void *a, const void *b)
}
/*
- * Given a valid SQL string and an array of constant-location records,
- * fill in the textual lengths of those constants in JumbleState.
+ * Given a valid SQL string and an array of constant-location records, return
+ * the textual lengths of those constants in a newly allocated LocationLen
+ * array, or NULL if there are no constants. It is the caller's responsibility
+ * to pfree the result, if necessary.
*
* The constants may use any allowed constant syntax, such as float literals,
* bit-strings, single-quoted strings and dollar-quoted strings. This is
@@ -803,8 +805,8 @@ CompLocation(const void *a, const void *b)
* a negative numeric constant. This precludes there ever being another
* reason for a constant to start with a '-'.
*/
-static void
-SetConstantLengths(JumbleState *jstate, const char *query,
+static LocationLen *
+SetConstantLengths(const JumbleState *jstate, const char *query,
int query_loc)
{
LocationLen *locs;
@@ -813,14 +815,20 @@ SetConstantLengths(JumbleState *jstate, const char *query,
core_YYSTYPE yylval;
YYLTYPE yylloc;
+ if (jstate->clocations_count == 0)
+ return NULL;
+
+ /* Copy constant locations to avoid modifying jstate */
+ locs = palloc(jstate->clocations_count * sizeof(LocationLen));
+ memcpy(locs, jstate->clocations, jstate->clocations_count * sizeof(LocationLen));
+
/*
* Sort the records by location so that we can process them in order while
* scanning the query text.
*/
if (jstate->clocations_count > 1)
- qsort(jstate->clocations, jstate->clocations_count,
+ qsort(locs, jstate->clocations_count,
sizeof(LocationLen), CompLocation);
- locs = jstate->clocations;
/* initialize the flex scanner --- should match raw_parser() */
yyscanner = scanner_init(query,
@@ -908,6 +916,8 @@ SetConstantLengths(JumbleState *jstate, const char *query,
}
scanner_finish(yyscanner);
+
+ return locs;
}
/*
@@ -926,7 +936,7 @@ SetConstantLengths(JumbleState *jstate, const char *query,
* Returns a palloc'd string.
*/
char *
-GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+GenerateNormalizedQuery(const JumbleState *jstate, const char *query,
int query_loc, int *query_len_p)
{
char *norm_query;
@@ -938,12 +948,13 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
last_off = 0, /* Offset from start for previous tok */
last_tok_len = 0; /* Length (in bytes) of that tok */
int num_constants_replaced = 0;
+ LocationLen *locs = NULL;
/*
* Set constants' lengths in JumbleState, as only locations are set during
* DoJumble(). Note this also ensures the items are sorted by location.
*/
- SetConstantLengths(jstate, query, query_loc);
+ locs = SetConstantLengths(jstate, query, query_loc);
/*
* Allow for $n symbols to be longer than the constants they replace.
@@ -969,15 +980,15 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
* the parameter in the next iteration (or after the loop is done),
* which is a bit odd but seems to work okay in most cases.
*/
- if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
+ if (locs[i].extern_param && !jstate->has_squashed_lists)
continue;
- off = jstate->clocations[i].location;
+ off = locs[i].location;
/* Adjust the constant's location (see SetConstantLengths) */
off -= query_loc;
- tok_len = jstate->clocations[i].length;
+ tok_len = locs[i].length;
if (tok_len < 0)
continue; /* ignore any duplicates */
@@ -996,7 +1007,7 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
*/
n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
num_constants_replaced + 1 + jstate->highest_extern_param_id,
- jstate->clocations[i].squashed ? " /*, ... */" : "");
+ locs[i].squashed ? " /*, ... */" : "");
num_constants_replaced++;
/* move forward */
@@ -1005,6 +1016,10 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
last_tok_len = tok_len;
}
+ /* Clean up temporary location array before any assertions */
+ if (locs)
+ pfree(locs);
+
/*
* We've copied up until the last ignorable constant. Copy over the
* remaining bytes of the original query string.
diff --git a/src/include/nodes/queryjumble.h b/src/include/nodes/queryjumble.h
index e354f857c37..b709033ce27 100644
--- a/src/include/nodes/queryjumble.h
+++ b/src/include/nodes/queryjumble.h
@@ -91,7 +91,7 @@ extern PGDLLIMPORT int compute_query_id;
extern const char *CleanQuerytext(const char *query, int *location, int *len);
-extern char *GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+extern char *GenerateNormalizedQuery(const JumbleState *jstate, const char *query,
int query_loc, int *query_len_p);
extern JumbleState *JumbleQuery(Query *query);
extern void EnableQueryId(void);
diff --git a/src/include/parser/analyze.h b/src/include/parser/analyze.h
index e10270ff0ff..b2db6fa4e8c 100644
--- a/src/include/parser/analyze.h
+++ b/src/include/parser/analyze.h
@@ -21,7 +21,7 @@
/* Hook for plugins to get control at end of parse analysis */
typedef void (*post_parse_analyze_hook_type) (ParseState *pstate,
Query *query,
- JumbleState *jstate);
+ const JumbleState *jstate);
extern PGDLLIMPORT post_parse_analyze_hook_type post_parse_analyze_hook;
--
2.47.3
[application/octet-stream] v5-0001-pg_stat_statements-Move-query-normalization-to-co.patch (20.7K, 3-v5-0001-pg_stat_statements-Move-query-normalization-to-co.patch)
download | inline diff:
From b7a1957e00b1d0f743295e7a19ec37ddf4da1f03 Mon Sep 17 00:00:00 2001
From: Ubuntu <[email protected]>
Date: Thu, 18 Dec 2025 18:59:31 +0000
Subject: [PATCH v5 1/2] pg_stat_statements: Move query normalization to core
pg_stat_statements was modifying core generated JumbleState instances
by calling fill_in_constant_lengths() via generate_normalized_query()
to populate constant lengths. This creates problems since extensions
should not modify shared core data structures that may be used by
other extensions, and this could be considered a layering violation.
Move query normalization functions to core as a global function to
fix this. Extensions now call GenerateNormalizedQuery() instead of
directly modifying JumbleState.
This change also addresses code duplication, as several extensions
have been copying generate_normalized_query() to implement their own
normalization. The new core API provides a single, consistent
implementation for all extensions to use.
Functions are renamed to match core naming conventions and comments
are updated for clarity.
Discussion: https://postgr.es/m/CAA5RZ0tZp5qU0ikZEEqJnxvdSNGh1DWv80sb-k4QAUmiMoOp_Q@mail.gmail.com
---
.../pg_stat_statements/pg_stat_statements.c | 273 +-----------------
src/backend/nodes/queryjumblefuncs.c | 248 ++++++++++++++++
src/include/nodes/queryjumble.h | 2 +
3 files changed, 260 insertions(+), 263 deletions(-)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 6cb14824ec3..c9338442d96 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -50,7 +50,6 @@
#include "access/htup_details.h"
#include "access/parallel.h"
#include "catalog/pg_authid.h"
-#include "common/int.h"
#include "executor/instrument.h"
#include "funcapi.h"
#include "jit/jit.h"
@@ -59,7 +58,6 @@
#include "nodes/queryjumble.h"
#include "optimizer/planner.h"
#include "parser/analyze.h"
-#include "parser/scanner.h"
#include "pgstat.h"
#include "storage/fd.h"
#include "storage/ipc.h"
@@ -378,11 +376,6 @@ static char *qtext_fetch(Size query_offset, int query_len,
static bool need_gc_qtexts(void);
static void gc_qtexts(void);
static TimestampTz entry_reset(Oid userid, Oid dbid, int64 queryid, bool minmax_only);
-static char *generate_normalized_query(JumbleState *jstate, const char *query,
- int query_loc, int *query_len_p);
-static void fill_in_constant_lengths(JumbleState *jstate, const char *query,
- int query_loc);
-static int comp_location(const void *a, const void *b);
/*
@@ -1360,9 +1353,16 @@ pgss_store(const char *query, int64 queryId,
if (jstate)
{
LWLockRelease(pgss->lock);
- norm_query = generate_normalized_query(jstate, query,
- query_location,
- &query_len);
+
+ /*
+ * generate the normalized query. Note that the normalized
+ * representation may well vary depending on just which
+ * "equivalent" query is used to create the hashtable entry. We
+ * assume this is OK.
+ */
+ norm_query = GenerateNormalizedQuery(jstate, query,
+ query_location,
+ &query_len);
LWLockAcquire(pgss->lock, LW_SHARED);
}
@@ -2824,256 +2824,3 @@ release_lock:
return stats_reset;
}
-
-/*
- * Generate a normalized version of the query string that will be used to
- * represent all similar queries.
- *
- * Note that the normalized representation may well vary depending on
- * just which "equivalent" query is used to create the hashtable entry.
- * We assume this is OK.
- *
- * If query_loc > 0, then "query" has been advanced by that much compared to
- * the original string start, so we need to translate the provided locations
- * to compensate. (This lets us avoid re-scanning statements before the one
- * of interest, so it's worth doing.)
- *
- * *query_len_p contains the input string length, and is updated with
- * the result string length on exit. The resulting string might be longer
- * or shorter depending on what happens with replacement of constants.
- *
- * Returns a palloc'd string.
- */
-static char *
-generate_normalized_query(JumbleState *jstate, const char *query,
- int query_loc, int *query_len_p)
-{
- char *norm_query;
- int query_len = *query_len_p;
- int norm_query_buflen, /* Space allowed for norm_query */
- len_to_wrt, /* Length (in bytes) to write */
- quer_loc = 0, /* Source query byte location */
- n_quer_loc = 0, /* Normalized query byte location */
- last_off = 0, /* Offset from start for previous tok */
- last_tok_len = 0; /* Length (in bytes) of that tok */
- int num_constants_replaced = 0;
-
- /*
- * Get constants' lengths (core system only gives us locations). Note
- * this also ensures the items are sorted by location.
- */
- fill_in_constant_lengths(jstate, query, query_loc);
-
- /*
- * Allow for $n symbols to be longer than the constants they replace.
- * Constants must take at least one byte in text form, while a $n symbol
- * certainly isn't more than 11 bytes, even if n reaches INT_MAX. We
- * could refine that limit based on the max value of n for the current
- * query, but it hardly seems worth any extra effort to do so.
- */
- norm_query_buflen = query_len + jstate->clocations_count * 10;
-
- /* Allocate result buffer */
- norm_query = palloc(norm_query_buflen + 1);
-
- for (int i = 0; i < jstate->clocations_count; i++)
- {
- int off, /* Offset from start for cur tok */
- tok_len; /* Length (in bytes) of that tok */
-
- /*
- * If we have an external param at this location, but no lists are
- * being squashed across the query, then we skip here; this will make
- * us print the characters found in the original query that represent
- * the parameter in the next iteration (or after the loop is done),
- * which is a bit odd but seems to work okay in most cases.
- */
- if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
- continue;
-
- off = jstate->clocations[i].location;
-
- /* Adjust recorded location if we're dealing with partial string */
- off -= query_loc;
-
- tok_len = jstate->clocations[i].length;
-
- if (tok_len < 0)
- continue; /* ignore any duplicates */
-
- /* Copy next chunk (what precedes the next constant) */
- len_to_wrt = off - last_off;
- len_to_wrt -= last_tok_len;
- Assert(len_to_wrt >= 0);
- memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
- n_quer_loc += len_to_wrt;
-
- /*
- * And insert a param symbol in place of the constant token; and, if
- * we have a squashable list, insert a placeholder comment starting
- * from the list's second value.
- */
- n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
- num_constants_replaced + 1 + jstate->highest_extern_param_id,
- jstate->clocations[i].squashed ? " /*, ... */" : "");
- num_constants_replaced++;
-
- /* move forward */
- quer_loc = off + tok_len;
- last_off = off;
- last_tok_len = tok_len;
- }
-
- /*
- * We've copied up until the last ignorable constant. Copy over the
- * remaining bytes of the original query string.
- */
- len_to_wrt = query_len - quer_loc;
-
- Assert(len_to_wrt >= 0);
- memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
- n_quer_loc += len_to_wrt;
-
- Assert(n_quer_loc <= norm_query_buflen);
- norm_query[n_quer_loc] = '\0';
-
- *query_len_p = n_quer_loc;
- return norm_query;
-}
-
-/*
- * Given a valid SQL string and an array of constant-location records,
- * fill in the textual lengths of those constants.
- *
- * The constants may use any allowed constant syntax, such as float literals,
- * bit-strings, single-quoted strings and dollar-quoted strings. This is
- * accomplished by using the public API for the core scanner.
- *
- * It is the caller's job to ensure that the string is a valid SQL statement
- * with constants at the indicated locations. Since in practice the string
- * has already been parsed, and the locations that the caller provides will
- * have originated from within the authoritative parser, this should not be
- * a problem.
- *
- * Multiple constants can have the same location. We reset lengths of those
- * past the first to -1 so that they can later be ignored.
- *
- * If query_loc > 0, then "query" has been advanced by that much compared to
- * the original string start, so we need to translate the provided locations
- * to compensate. (This lets us avoid re-scanning statements before the one
- * of interest, so it's worth doing.)
- *
- * N.B. There is an assumption that a '-' character at a Const location begins
- * a negative numeric constant. This precludes there ever being another
- * reason for a constant to start with a '-'.
- */
-static void
-fill_in_constant_lengths(JumbleState *jstate, const char *query,
- int query_loc)
-{
- LocationLen *locs;
- core_yyscan_t yyscanner;
- core_yy_extra_type yyextra;
- core_YYSTYPE yylval;
- YYLTYPE yylloc;
-
- /*
- * Sort the records by location so that we can process them in order while
- * scanning the query text.
- */
- if (jstate->clocations_count > 1)
- qsort(jstate->clocations, jstate->clocations_count,
- sizeof(LocationLen), comp_location);
- locs = jstate->clocations;
-
- /* initialize the flex scanner --- should match raw_parser() */
- yyscanner = scanner_init(query,
- &yyextra,
- &ScanKeywords,
- ScanKeywordTokens);
-
- /* Search for each constant, in sequence */
- for (int i = 0; i < jstate->clocations_count; i++)
- {
- int loc;
- int tok;
-
- /* Ignore constants after the first one in the same location */
- if (i > 0 && locs[i].location == locs[i - 1].location)
- {
- locs[i].length = -1;
- continue;
- }
-
- if (locs[i].squashed)
- continue; /* squashable list, ignore */
-
- /* Adjust recorded location if we're dealing with partial string */
- loc = locs[i].location - query_loc;
- Assert(loc >= 0);
-
- /*
- * We have a valid location for a constant that's not a dupe. Lex
- * tokens until we find the desired constant.
- */
- for (;;)
- {
- tok = core_yylex(&yylval, &yylloc, yyscanner);
-
- /* We should not hit end-of-string, but if we do, behave sanely */
- if (tok == 0)
- break; /* out of inner for-loop */
-
- /*
- * We should find the token position exactly, but if we somehow
- * run past it, work with that.
- */
- if (yylloc >= loc)
- {
- if (query[loc] == '-')
- {
- /*
- * It's a negative value - this is the one and only case
- * where we replace more than a single token.
- *
- * Do not compensate for the core system's special-case
- * adjustment of location to that of the leading '-'
- * operator in the event of a negative constant. It is
- * also useful for our purposes to start from the minus
- * symbol. In this way, queries like "select * from foo
- * where bar = 1" and "select * from foo where bar = -2"
- * will have identical normalized query strings.
- */
- tok = core_yylex(&yylval, &yylloc, yyscanner);
- if (tok == 0)
- break; /* out of inner for-loop */
- }
-
- /*
- * We now rely on the assumption that flex has placed a zero
- * byte after the text of the current token in scanbuf.
- */
- locs[i].length = strlen(yyextra.scanbuf + loc);
- break; /* out of inner for-loop */
- }
- }
-
- /* If we hit end-of-string, give up, leaving remaining lengths -1 */
- if (tok == 0)
- break;
- }
-
- scanner_finish(yyscanner);
-}
-
-/*
- * comp_location: comparator for qsorting LocationLen structs by location
- */
-static int
-comp_location(const void *a, const void *b)
-{
- int l = ((const LocationLen *) a)->location;
- int r = ((const LocationLen *) b)->location;
-
- return pg_cmp_s32(l, r);
-}
diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c
index 87db8dc1a32..d4b26202c47 100644
--- a/src/backend/nodes/queryjumblefuncs.c
+++ b/src/backend/nodes/queryjumblefuncs.c
@@ -40,10 +40,12 @@
#include "access/transam.h"
#include "catalog/pg_proc.h"
#include "common/hashfn.h"
+#include "common/int.h"
#include "miscadmin.h"
#include "nodes/nodeFuncs.h"
#include "nodes/queryjumble.h"
#include "utils/lsyscache.h"
+#include "parser/scanner.h"
#include "parser/scansup.h"
#define JUMBLE_SIZE 1024 /* query serialization buffer size */
@@ -773,3 +775,249 @@ _jumbleRangeTblEntry_eref(JumbleState *jstate,
*/
JUMBLE_STRING(aliasname);
}
+
+/*
+ * CompLocation: comparator for qsorting LocationLen structs by location
+ */
+static int
+CompLocation(const void *a, const void *b)
+{
+ int l = ((const LocationLen *) a)->location;
+ int r = ((const LocationLen *) b)->location;
+
+ return pg_cmp_s32(l, r);
+}
+
+/*
+ * Given a valid SQL string and an array of constant-location records,
+ * fill in the textual lengths of those constants in JumbleState.
+ *
+ * The constants may use any allowed constant syntax, such as float literals,
+ * bit-strings, single-quoted strings and dollar-quoted strings. This is
+ * accomplished by using the public API for the core scanner.
+ *
+ * Multiple constants can have the same location. We reset lengths of those
+ * past the first to -1 so that they can later be ignored.
+ *
+ * N.B. There is an assumption that a '-' character at a Const location begins
+ * a negative numeric constant. This precludes there ever being another
+ * reason for a constant to start with a '-'.
+ */
+static void
+SetConstantLengths(JumbleState *jstate, const char *query,
+ int query_loc)
+{
+ LocationLen *locs;
+ core_yyscan_t yyscanner;
+ core_yy_extra_type yyextra;
+ core_YYSTYPE yylval;
+ YYLTYPE yylloc;
+
+ /*
+ * Sort the records by location so that we can process them in order while
+ * scanning the query text.
+ */
+ if (jstate->clocations_count > 1)
+ qsort(jstate->clocations, jstate->clocations_count,
+ sizeof(LocationLen), CompLocation);
+ locs = jstate->clocations;
+
+ /* initialize the flex scanner --- should match raw_parser() */
+ yyscanner = scanner_init(query,
+ &yyextra,
+ &ScanKeywords,
+ ScanKeywordTokens);
+
+ /* we don't want to re-emit any escape string warnings */
+ yyextra.escape_string_warning = false;
+
+ /* Search for each constant, in sequence */
+ for (int i = 0; i < jstate->clocations_count; i++)
+ {
+ int loc;
+ int tok;
+
+ /* Ignore constants after the first one in the same location */
+ if (i > 0 && locs[i].location == locs[i - 1].location)
+ {
+ locs[i].length = -1;
+ continue;
+ }
+
+ if (locs[i].squashed)
+ continue; /* squashable list, ignore */
+
+ /*
+ * Adjust the constant's location using the provided starting location
+ * of the current statement. This allows us to avoid scanning a
+ * multi-statement string from the beginning.
+ */
+ loc = locs[i].location - query_loc;
+ Assert(loc >= 0);
+
+ /*
+ * We have a valid location for a constant that's not a dupe. Lex
+ * tokens until we find the desired constant.
+ */
+ for (;;)
+ {
+ tok = core_yylex(&yylval, &yylloc, yyscanner);
+
+ /* We should not hit end-of-string, but if we do, behave sanely */
+ if (tok == 0)
+ break; /* out of inner for-loop */
+
+ /*
+ * We should find the token position exactly, but if we somehow
+ * run past it, work with that.
+ */
+ if (yylloc >= loc)
+ {
+ if (query[loc] == '-')
+ {
+ /*
+ * It's a negative value - this is the one and only case
+ * where we replace more than a single token.
+ *
+ * Do not compensate for the special-case adjustment of
+ * location to that of the leading '-' operator in the
+ * event of a negative constant (see doNegate() in
+ * gram.y). It is also useful for our purposes to start
+ * from the minus symbol. In this way, queries like
+ * "select * from foo where bar = 1" and "select * from
+ * foo where bar = -2" will have identical normalized
+ * query strings.
+ */
+ tok = core_yylex(&yylval, &yylloc, yyscanner);
+ if (tok == 0)
+ break; /* out of inner for-loop */
+ }
+
+ /*
+ * We now rely on the assumption that flex has placed a zero
+ * byte after the text of the current token in scanbuf.
+ */
+ locs[i].length = strlen(yyextra.scanbuf + loc);
+ break; /* out of inner for-loop */
+ }
+ }
+
+ /* If we hit end-of-string, give up, leaving remaining lengths -1 */
+ if (tok == 0)
+ break;
+ }
+
+ scanner_finish(yyscanner);
+}
+
+/*
+ * Callback to generate a normalized version of the query string that will be used to
+ * represent all similar queries.
+ *
+ * It is the caller's job to ensure that the string is a valid SQL statement
+ * with the correct constant locations in jstate->clocations. Since in
+ * practice the string has already been parsed using the authoritative parser
+ * and the locations are set by core query jumbling, this should not be a problem.
+ *
+ * *query_len_p contains the input string length, and is updated with
+ * the result string length on exit. The resulting string might be longer
+ * or shorter depending on what happens with replacement of constants.
+ *
+ * Returns a palloc'd string.
+ */
+char *
+GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+ int query_loc, int *query_len_p)
+{
+ char *norm_query;
+ int query_len = *query_len_p;
+ int norm_query_buflen, /* Space allowed for norm_query */
+ len_to_wrt, /* Length (in bytes) to write */
+ quer_loc = 0, /* Source query byte location */
+ n_quer_loc = 0, /* Normalized query byte location */
+ last_off = 0, /* Offset from start for previous tok */
+ last_tok_len = 0; /* Length (in bytes) of that tok */
+ int num_constants_replaced = 0;
+
+ /*
+ * Set constants' lengths in JumbleState, as only locations are set during
+ * DoJumble(). Note this also ensures the items are sorted by location.
+ */
+ SetConstantLengths(jstate, query, query_loc);
+
+ /*
+ * Allow for $n symbols to be longer than the constants they replace.
+ * Constants must take at least one byte in text form, while a $n symbol
+ * certainly isn't more than 11 bytes, even if n reaches INT_MAX. We
+ * could refine that limit based on the max value of n for the current
+ * query, but it hardly seems worth any extra effort to do so.
+ */
+ norm_query_buflen = query_len + jstate->clocations_count * 10;
+
+ /* Allocate result buffer */
+ norm_query = palloc(norm_query_buflen + 1);
+
+ for (int i = 0; i < jstate->clocations_count; i++)
+ {
+ int off, /* Offset from start for cur tok */
+ tok_len; /* Length (in bytes) of that tok */
+
+ /*
+ * If we have an external param at this location, but no lists are
+ * being squashed across the query, then we skip here; this will make
+ * us print the characters found in the original query that represent
+ * the parameter in the next iteration (or after the loop is done),
+ * which is a bit odd but seems to work okay in most cases.
+ */
+ if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
+ continue;
+
+ off = jstate->clocations[i].location;
+
+ /* Adjust the constant's location (see SetConstantLengths) */
+ off -= query_loc;
+
+ tok_len = jstate->clocations[i].length;
+
+ if (tok_len < 0)
+ continue; /* ignore any duplicates */
+
+ /* Copy next chunk (what precedes the next constant) */
+ len_to_wrt = off - last_off;
+ len_to_wrt -= last_tok_len;
+ Assert(len_to_wrt >= 0);
+ memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
+ n_quer_loc += len_to_wrt;
+
+ /*
+ * And insert a param symbol in place of the constant token; and, if
+ * we have a squashable list, insert a placeholder comment starting
+ * from the list's second value.
+ */
+ n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
+ num_constants_replaced + 1 + jstate->highest_extern_param_id,
+ jstate->clocations[i].squashed ? " /*, ... */" : "");
+ num_constants_replaced++;
+
+ /* move forward */
+ quer_loc = off + tok_len;
+ last_off = off;
+ last_tok_len = tok_len;
+ }
+
+ /*
+ * We've copied up until the last ignorable constant. Copy over the
+ * remaining bytes of the original query string.
+ */
+ len_to_wrt = query_len - quer_loc;
+
+ Assert(len_to_wrt >= 0);
+ memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
+ n_quer_loc += len_to_wrt;
+
+ Assert(n_quer_loc <= norm_query_buflen);
+ norm_query[n_quer_loc] = '\0';
+
+ *query_len_p = n_quer_loc;
+ return norm_query;
+}
diff --git a/src/include/nodes/queryjumble.h b/src/include/nodes/queryjumble.h
index 9f81893003c..e354f857c37 100644
--- a/src/include/nodes/queryjumble.h
+++ b/src/include/nodes/queryjumble.h
@@ -91,6 +91,8 @@ extern PGDLLIMPORT int compute_query_id;
extern const char *CleanQuerytext(const char *query, int *location, int *len);
+extern char *GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+ int query_loc, int *query_len_p);
extern JumbleState *JumbleQuery(Query *query);
extern void EnableQueryId(void);
--
2.47.3
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2026-03-26 01:34 Lukas Fittl <[email protected]>
parent: Sami Imseih <[email protected]>
0 siblings, 1 reply; 35+ messages in thread
From: Lukas Fittl @ 2026-03-26 01:34 UTC (permalink / raw)
To: Sami Imseih <[email protected]>; +Cc: Michael Paquier <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>
Hi Sami,
On Mon, Mar 16, 2026 at 7:12 PM Sami Imseih <[email protected]> wrote:
>
> > Here is v4.
> >
> > 0001 - addresses the comments made by Bertrand.
> > 0002 - makes JumbleState a constant when passed to post_parse_analyze
> > and updates all downstream code that consume the JumbleState. This
> > means we now need to copy the locations from JState into a local/temporary
> > array when generating the normalized string.
In 0001:
> diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c
> > index 87db8dc1a32..d4b26202c47 100644
> --- a/src/backend/nodes/queryjumblefuncs.c
> +++ b/src/backend/nodes/queryjumblefuncs.c
> ...
> @@ -773,3 +775,249 @@ _jumbleRangeTblEntry_eref(JumbleState *jstate,
> +
> + /* we don't want to re-emit any escape string warnings */
> + yyextra.escape_string_warning = false;
> +
I don't think this is needed anymore, as of
45762084545ec14dbbe66ace1d69d7e89f8978ac.
> +/*
> + * Callback to generate a normalized version of the query string that will be used to
> + * represent all similar queries.
> + *
I don't think the term "Callback" makes sense here - I think you could
just keep the original wording.
In 0002:
> diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c
> index d4b26202c47..fe8f0ccd278 100644
> --- a/src/backend/nodes/queryjumblefuncs.c
> +++ b/src/backend/nodes/queryjumblefuncs.c
> ...
> @@ -813,14 +815,20 @@ SetConstantLengths(JumbleState *jstate, const char *query,
> core_YYSTYPE yylval;
> YYLTYPE yylloc;
>
> + if (jstate->clocations_count == 0)
> + return NULL;
> +
> + /* Copy constant locations to avoid modifying jstate */
> + locs = palloc(jstate->clocations_count * sizeof(LocationLen));
> + memcpy(locs, jstate->clocations, jstate->clocations_count * sizeof(LocationLen));
> +
You could use palloc_array for locs here.
> @@ -938,12 +948,13 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
> last_off = 0, /* Offset from start for previous tok */
> last_tok_len = 0; /* Length (in bytes) of that tok */
> int num_constants_replaced = 0;
> + LocationLen *locs = NULL;
>
> /*
> * Set constants' lengths in JumbleState, as only locations are set during
> * DoJumble(). Note this also ensures the items are sorted by location.
> */
> - SetConstantLengths(jstate, query, query_loc);
> + locs = SetConstantLengths(jstate, query, query_loc);
I think we should update the comment here to reflect the fact that
we're no longer modifying JumbleState.
Otherwise these patches look good - it'd be nice to still get this
into 19 so we have less code duplication across the different
extensions working with normalized query text.
Thanks,
Lukas
--
Lukas Fittl
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2026-03-26 03:31 Sami Imseih <[email protected]>
parent: Lukas Fittl <[email protected]>
0 siblings, 1 reply; 35+ messages in thread
From: Sami Imseih @ 2026-03-26 03:31 UTC (permalink / raw)
To: Lukas Fittl <[email protected]>; +Cc: Michael Paquier <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>
Thanks for looking!
> I don't think this is needed anymore, as of
> 45762084545ec14dbbe66ace1d69d7e89f8978ac.
Correct. That showed up after my last rebase.
> > +/*
> > + * Callback to generate a normalized version of the query string that will be used to
> > + * represent all similar queries.
> > + *
>
> I don't think the term "Callback" makes sense here - I think you could
> just keep the original wording.
This was a remnant of my earlier experimentation. I removed.
A few notes on the comments:
- * Note that the normalized representation may well vary depending on
- * just which "equivalent" query is used to create the hashtable entry.
- * We assume this is OK.
This was in the original generate_normalize_query and since it mentions
hashtable entry, I moved the comment (in-spirit) to where pg_stat_statements
calls GenerateNormailzeQuery
* If query_loc > 0, then "query" has been advanced by that much compared to
* the original string start, as is the case with multi-statement strings, so
* we need to translate the provided locations to compensate. (This lets us
* avoid re-scanning statements before the one of interest, so it's worth doing.)
*
This comment was originally duplicated in both SetConstantLengths, so I
just kept it as-is in SetConstantLengths and added a shorter reference in
GenerateNormalizeQuery
Also, this comment "It is the caller's job to ensure that the string
is a valid SQL statement..."
made more sense in GenerateNormalizeQuery rather than SetConstantLengths, since
GenerateNormalizeQuery is the public function.
> In 0002:
> You could use palloc_array for locs here.
done.
> I think we should update the comment here to reflect the fact that
> we're no longer modifying JumbleState.
done.
> Otherwise these patches look good - it'd be nice to still get this
> into 19 so we have less code duplication across the different
> extensions working with normalized query text.
I agree!
v6 addresses the comments.
--
Sami Imseih
Amazon Web Services (AWS)
Attachments:
[application/octet-stream] v6-0002-Make-JumbleState-const-in-post_parse_analyze-hook.patch (8.6K, 2-v6-0002-Make-JumbleState-const-in-post_parse_analyze-hook.patch)
download | inline diff:
From 528f16102ecb47fb447e472b43b906e3f37b09fc Mon Sep 17 00:00:00 2001
From: Ubuntu <[email protected]>
Date: Tue, 6 Jan 2026 19:29:01 +0000
Subject: [PATCH v6 2/2] Make JumbleState const in post_parse_analyze hook
Change the post_parse_analyze_hook_type signature to take a const
JumbleState parameter, preventing hooks from modifying the jumble
state during query analysis. This improves API safety by making it
clear that hooks should only read from the jumble state, not modify
it.
Update pg_stat_statements and related functions to match the new const
signature. Refactor SetConstantLengths() to return a newly allocated
LocationLen array instead of modifying the JumbleState, and update
GenerateNormalizedQuery() to work with the const JumbleState and
manage the returned array properly.
Discussion: https://postgr.es/m/CAA5RZ0tZp5qU0ikZEEqJnxvdSNGh1DWv80sb-k4QAUmiMoOp_Q@mail.gmail.com
---
.../pg_stat_statements/pg_stat_statements.c | 8 ++--
src/backend/nodes/queryjumblefuncs.c | 44 +++++++++++++------
src/include/nodes/queryjumble.h | 2 +-
src/include/parser/analyze.h | 2 +-
4 files changed, 36 insertions(+), 20 deletions(-)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index c9338442d96..386a1419232 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -333,7 +333,7 @@ static void pgss_shmem_request(void);
static void pgss_shmem_startup(void);
static void pgss_shmem_shutdown(int code, Datum arg);
static void pgss_post_parse_analyze(ParseState *pstate, Query *query,
- JumbleState *jstate);
+ const JumbleState *jstate);
static PlannedStmt *pgss_planner(Query *parse,
const char *query_string,
int cursorOptions,
@@ -357,7 +357,7 @@ static void pgss_store(const char *query, int64 queryId,
const BufferUsage *bufusage,
const WalUsage *walusage,
const struct JitInstrumentation *jitusage,
- JumbleState *jstate,
+ const JumbleState *jstate,
int parallel_workers_to_launch,
int parallel_workers_launched,
PlannedStmtOrigin planOrigin);
@@ -833,7 +833,7 @@ error:
* Post-parse-analysis hook: mark query with a queryId
*/
static void
-pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
+pgss_post_parse_analyze(ParseState *pstate, Query *query, const JumbleState *jstate)
{
if (prev_post_parse_analyze_hook)
prev_post_parse_analyze_hook(pstate, query, jstate);
@@ -1290,7 +1290,7 @@ pgss_store(const char *query, int64 queryId,
const BufferUsage *bufusage,
const WalUsage *walusage,
const struct JitInstrumentation *jitusage,
- JumbleState *jstate,
+ const JumbleState *jstate,
int parallel_workers_to_launch,
int parallel_workers_launched,
PlannedStmtOrigin planOrigin)
diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c
index c997a1ef609..0a40bc89e0a 100644
--- a/src/backend/nodes/queryjumblefuncs.c
+++ b/src/backend/nodes/queryjumblefuncs.c
@@ -789,8 +789,10 @@ CompLocation(const void *a, const void *b)
}
/*
- * Given a valid SQL string and an array of constant-location records,
- * fill in the textual lengths of those constants in JumbleState.
+ * Given a valid SQL string and an array of constant-location records, return
+ * the textual lengths of those constants in a newly allocated LocationLen
+ * array, or NULL if there are no constants. It is the caller's responsibility
+ * to pfree the result, if necessary.
*
* The constants may use any allowed constant syntax, such as float literals,
* bit-strings, single-quoted strings and dollar-quoted strings. This is
@@ -808,8 +810,8 @@ CompLocation(const void *a, const void *b)
* a negative numeric constant. This precludes there ever being another
* reason for a constant to start with a '-'.
*/
-static void
-SetConstantLengths(JumbleState *jstate, const char *query,
+static LocationLen *
+SetConstantLengths(const JumbleState *jstate, const char *query,
int query_loc)
{
LocationLen *locs;
@@ -818,14 +820,20 @@ SetConstantLengths(JumbleState *jstate, const char *query,
core_YYSTYPE yylval;
YYLTYPE yylloc;
+ if (jstate->clocations_count == 0)
+ return NULL;
+
+ /* Copy constant locations to avoid modifying jstate */
+ locs = palloc_array(LocationLen, jstate->clocations_count);
+ memcpy(locs, jstate->clocations, jstate->clocations_count * sizeof(LocationLen));
+
/*
* Sort the records by location so that we can process them in order while
* scanning the query text.
*/
if (jstate->clocations_count > 1)
- qsort(jstate->clocations, jstate->clocations_count,
+ qsort(locs, jstate->clocations_count,
sizeof(LocationLen), CompLocation);
- locs = jstate->clocations;
/* initialize the flex scanner --- should match raw_parser() */
yyscanner = scanner_init(query,
@@ -910,6 +918,8 @@ SetConstantLengths(JumbleState *jstate, const char *query,
}
scanner_finish(yyscanner);
+
+ return locs;
}
/*
@@ -931,7 +941,7 @@ SetConstantLengths(JumbleState *jstate, const char *query,
* Returns a palloc'd string.
*/
char *
-GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+GenerateNormalizedQuery(const JumbleState *jstate, const char *query,
int query_loc, int *query_len_p)
{
char *norm_query;
@@ -943,12 +953,14 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
last_off = 0, /* Offset from start for previous tok */
last_tok_len = 0; /* Length (in bytes) of that tok */
int num_constants_replaced = 0;
+ LocationLen *locs = NULL;
/*
- * Set constants' lengths in JumbleState, as only locations are set during
- * DoJumble(). Note this also ensures the items are sorted by location.
+ * Determine constants' lengths, which are not set during DoJumble(), and
+ * return a sorted copy of jstate's LocationLen data with lengths filled
+ * in.
*/
- SetConstantLengths(jstate, query, query_loc);
+ locs = SetConstantLengths(jstate, query, query_loc);
/*
* Allow for $n symbols to be longer than the constants they replace.
@@ -974,15 +986,15 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
* the parameter in the next iteration (or after the loop is done),
* which is a bit odd but seems to work okay in most cases.
*/
- if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
+ if (locs[i].extern_param && !jstate->has_squashed_lists)
continue;
- off = jstate->clocations[i].location;
+ off = locs[i].location;
/* Adjust the constant's location (see SetConstantLengths) */
off -= query_loc;
- tok_len = jstate->clocations[i].length;
+ tok_len = locs[i].length;
if (tok_len < 0)
continue; /* ignore any duplicates */
@@ -1001,7 +1013,7 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
*/
n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
num_constants_replaced + 1 + jstate->highest_extern_param_id,
- jstate->clocations[i].squashed ? " /*, ... */" : "");
+ locs[i].squashed ? " /*, ... */" : "");
num_constants_replaced++;
/* move forward */
@@ -1010,6 +1022,10 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
last_tok_len = tok_len;
}
+ /* Clean up temporary location array before any assertions */
+ if (locs)
+ pfree(locs);
+
/*
* We've copied up until the last ignorable constant. Copy over the
* remaining bytes of the original query string.
diff --git a/src/include/nodes/queryjumble.h b/src/include/nodes/queryjumble.h
index e354f857c37..b709033ce27 100644
--- a/src/include/nodes/queryjumble.h
+++ b/src/include/nodes/queryjumble.h
@@ -91,7 +91,7 @@ extern PGDLLIMPORT int compute_query_id;
extern const char *CleanQuerytext(const char *query, int *location, int *len);
-extern char *GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+extern char *GenerateNormalizedQuery(const JumbleState *jstate, const char *query,
int query_loc, int *query_len_p);
extern JumbleState *JumbleQuery(Query *query);
extern void EnableQueryId(void);
diff --git a/src/include/parser/analyze.h b/src/include/parser/analyze.h
index e10270ff0ff..b2db6fa4e8c 100644
--- a/src/include/parser/analyze.h
+++ b/src/include/parser/analyze.h
@@ -21,7 +21,7 @@
/* Hook for plugins to get control at end of parse analysis */
typedef void (*post_parse_analyze_hook_type) (ParseState *pstate,
Query *query,
- JumbleState *jstate);
+ const JumbleState *jstate);
extern PGDLLIMPORT post_parse_analyze_hook_type post_parse_analyze_hook;
--
2.47.3
[application/octet-stream] v6-0001-pg_stat_statements-Move-query-normalization-to-co.patch (21.1K, 3-v6-0001-pg_stat_statements-Move-query-normalization-to-co.patch)
download | inline diff:
From 79e94343d46ad1e8f16f30449492ebe01e4419a0 Mon Sep 17 00:00:00 2001
From: Ubuntu <[email protected]>
Date: Thu, 18 Dec 2025 18:59:31 +0000
Subject: [PATCH v6 1/2] pg_stat_statements: Move query normalization to core
pg_stat_statements was modifying core generated JumbleState instances
by calling fill_in_constant_lengths() via generate_normalized_query()
to populate constant lengths. This creates problems since extensions
should not modify shared core data structures that may be used by
other extensions, and this could be considered a layering violation.
Move query normalization functions to core as a global function to
fix this. Extensions now call GenerateNormalizedQuery() instead of
directly modifying JumbleState.
This change also addresses code duplication, as several extensions
have been copying generate_normalized_query() to implement their own
normalization. The new core API provides a single, consistent
implementation for all extensions to use.
Functions are renamed to match core naming conventions and comments
are updated for clarity.
Discussion: https://postgr.es/m/CAA5RZ0tZp5qU0ikZEEqJnxvdSNGh1DWv80sb-k4QAUmiMoOp_Q@mail.gmail.com
---
.../pg_stat_statements/pg_stat_statements.c | 273 +-----------------
src/backend/nodes/queryjumblefuncs.c | 253 ++++++++++++++++
src/include/nodes/queryjumble.h | 2 +
3 files changed, 265 insertions(+), 263 deletions(-)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 6cb14824ec3..c9338442d96 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -50,7 +50,6 @@
#include "access/htup_details.h"
#include "access/parallel.h"
#include "catalog/pg_authid.h"
-#include "common/int.h"
#include "executor/instrument.h"
#include "funcapi.h"
#include "jit/jit.h"
@@ -59,7 +58,6 @@
#include "nodes/queryjumble.h"
#include "optimizer/planner.h"
#include "parser/analyze.h"
-#include "parser/scanner.h"
#include "pgstat.h"
#include "storage/fd.h"
#include "storage/ipc.h"
@@ -378,11 +376,6 @@ static char *qtext_fetch(Size query_offset, int query_len,
static bool need_gc_qtexts(void);
static void gc_qtexts(void);
static TimestampTz entry_reset(Oid userid, Oid dbid, int64 queryid, bool minmax_only);
-static char *generate_normalized_query(JumbleState *jstate, const char *query,
- int query_loc, int *query_len_p);
-static void fill_in_constant_lengths(JumbleState *jstate, const char *query,
- int query_loc);
-static int comp_location(const void *a, const void *b);
/*
@@ -1360,9 +1353,16 @@ pgss_store(const char *query, int64 queryId,
if (jstate)
{
LWLockRelease(pgss->lock);
- norm_query = generate_normalized_query(jstate, query,
- query_location,
- &query_len);
+
+ /*
+ * generate the normalized query. Note that the normalized
+ * representation may well vary depending on just which
+ * "equivalent" query is used to create the hashtable entry. We
+ * assume this is OK.
+ */
+ norm_query = GenerateNormalizedQuery(jstate, query,
+ query_location,
+ &query_len);
LWLockAcquire(pgss->lock, LW_SHARED);
}
@@ -2824,256 +2824,3 @@ release_lock:
return stats_reset;
}
-
-/*
- * Generate a normalized version of the query string that will be used to
- * represent all similar queries.
- *
- * Note that the normalized representation may well vary depending on
- * just which "equivalent" query is used to create the hashtable entry.
- * We assume this is OK.
- *
- * If query_loc > 0, then "query" has been advanced by that much compared to
- * the original string start, so we need to translate the provided locations
- * to compensate. (This lets us avoid re-scanning statements before the one
- * of interest, so it's worth doing.)
- *
- * *query_len_p contains the input string length, and is updated with
- * the result string length on exit. The resulting string might be longer
- * or shorter depending on what happens with replacement of constants.
- *
- * Returns a palloc'd string.
- */
-static char *
-generate_normalized_query(JumbleState *jstate, const char *query,
- int query_loc, int *query_len_p)
-{
- char *norm_query;
- int query_len = *query_len_p;
- int norm_query_buflen, /* Space allowed for norm_query */
- len_to_wrt, /* Length (in bytes) to write */
- quer_loc = 0, /* Source query byte location */
- n_quer_loc = 0, /* Normalized query byte location */
- last_off = 0, /* Offset from start for previous tok */
- last_tok_len = 0; /* Length (in bytes) of that tok */
- int num_constants_replaced = 0;
-
- /*
- * Get constants' lengths (core system only gives us locations). Note
- * this also ensures the items are sorted by location.
- */
- fill_in_constant_lengths(jstate, query, query_loc);
-
- /*
- * Allow for $n symbols to be longer than the constants they replace.
- * Constants must take at least one byte in text form, while a $n symbol
- * certainly isn't more than 11 bytes, even if n reaches INT_MAX. We
- * could refine that limit based on the max value of n for the current
- * query, but it hardly seems worth any extra effort to do so.
- */
- norm_query_buflen = query_len + jstate->clocations_count * 10;
-
- /* Allocate result buffer */
- norm_query = palloc(norm_query_buflen + 1);
-
- for (int i = 0; i < jstate->clocations_count; i++)
- {
- int off, /* Offset from start for cur tok */
- tok_len; /* Length (in bytes) of that tok */
-
- /*
- * If we have an external param at this location, but no lists are
- * being squashed across the query, then we skip here; this will make
- * us print the characters found in the original query that represent
- * the parameter in the next iteration (or after the loop is done),
- * which is a bit odd but seems to work okay in most cases.
- */
- if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
- continue;
-
- off = jstate->clocations[i].location;
-
- /* Adjust recorded location if we're dealing with partial string */
- off -= query_loc;
-
- tok_len = jstate->clocations[i].length;
-
- if (tok_len < 0)
- continue; /* ignore any duplicates */
-
- /* Copy next chunk (what precedes the next constant) */
- len_to_wrt = off - last_off;
- len_to_wrt -= last_tok_len;
- Assert(len_to_wrt >= 0);
- memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
- n_quer_loc += len_to_wrt;
-
- /*
- * And insert a param symbol in place of the constant token; and, if
- * we have a squashable list, insert a placeholder comment starting
- * from the list's second value.
- */
- n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
- num_constants_replaced + 1 + jstate->highest_extern_param_id,
- jstate->clocations[i].squashed ? " /*, ... */" : "");
- num_constants_replaced++;
-
- /* move forward */
- quer_loc = off + tok_len;
- last_off = off;
- last_tok_len = tok_len;
- }
-
- /*
- * We've copied up until the last ignorable constant. Copy over the
- * remaining bytes of the original query string.
- */
- len_to_wrt = query_len - quer_loc;
-
- Assert(len_to_wrt >= 0);
- memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
- n_quer_loc += len_to_wrt;
-
- Assert(n_quer_loc <= norm_query_buflen);
- norm_query[n_quer_loc] = '\0';
-
- *query_len_p = n_quer_loc;
- return norm_query;
-}
-
-/*
- * Given a valid SQL string and an array of constant-location records,
- * fill in the textual lengths of those constants.
- *
- * The constants may use any allowed constant syntax, such as float literals,
- * bit-strings, single-quoted strings and dollar-quoted strings. This is
- * accomplished by using the public API for the core scanner.
- *
- * It is the caller's job to ensure that the string is a valid SQL statement
- * with constants at the indicated locations. Since in practice the string
- * has already been parsed, and the locations that the caller provides will
- * have originated from within the authoritative parser, this should not be
- * a problem.
- *
- * Multiple constants can have the same location. We reset lengths of those
- * past the first to -1 so that they can later be ignored.
- *
- * If query_loc > 0, then "query" has been advanced by that much compared to
- * the original string start, so we need to translate the provided locations
- * to compensate. (This lets us avoid re-scanning statements before the one
- * of interest, so it's worth doing.)
- *
- * N.B. There is an assumption that a '-' character at a Const location begins
- * a negative numeric constant. This precludes there ever being another
- * reason for a constant to start with a '-'.
- */
-static void
-fill_in_constant_lengths(JumbleState *jstate, const char *query,
- int query_loc)
-{
- LocationLen *locs;
- core_yyscan_t yyscanner;
- core_yy_extra_type yyextra;
- core_YYSTYPE yylval;
- YYLTYPE yylloc;
-
- /*
- * Sort the records by location so that we can process them in order while
- * scanning the query text.
- */
- if (jstate->clocations_count > 1)
- qsort(jstate->clocations, jstate->clocations_count,
- sizeof(LocationLen), comp_location);
- locs = jstate->clocations;
-
- /* initialize the flex scanner --- should match raw_parser() */
- yyscanner = scanner_init(query,
- &yyextra,
- &ScanKeywords,
- ScanKeywordTokens);
-
- /* Search for each constant, in sequence */
- for (int i = 0; i < jstate->clocations_count; i++)
- {
- int loc;
- int tok;
-
- /* Ignore constants after the first one in the same location */
- if (i > 0 && locs[i].location == locs[i - 1].location)
- {
- locs[i].length = -1;
- continue;
- }
-
- if (locs[i].squashed)
- continue; /* squashable list, ignore */
-
- /* Adjust recorded location if we're dealing with partial string */
- loc = locs[i].location - query_loc;
- Assert(loc >= 0);
-
- /*
- * We have a valid location for a constant that's not a dupe. Lex
- * tokens until we find the desired constant.
- */
- for (;;)
- {
- tok = core_yylex(&yylval, &yylloc, yyscanner);
-
- /* We should not hit end-of-string, but if we do, behave sanely */
- if (tok == 0)
- break; /* out of inner for-loop */
-
- /*
- * We should find the token position exactly, but if we somehow
- * run past it, work with that.
- */
- if (yylloc >= loc)
- {
- if (query[loc] == '-')
- {
- /*
- * It's a negative value - this is the one and only case
- * where we replace more than a single token.
- *
- * Do not compensate for the core system's special-case
- * adjustment of location to that of the leading '-'
- * operator in the event of a negative constant. It is
- * also useful for our purposes to start from the minus
- * symbol. In this way, queries like "select * from foo
- * where bar = 1" and "select * from foo where bar = -2"
- * will have identical normalized query strings.
- */
- tok = core_yylex(&yylval, &yylloc, yyscanner);
- if (tok == 0)
- break; /* out of inner for-loop */
- }
-
- /*
- * We now rely on the assumption that flex has placed a zero
- * byte after the text of the current token in scanbuf.
- */
- locs[i].length = strlen(yyextra.scanbuf + loc);
- break; /* out of inner for-loop */
- }
- }
-
- /* If we hit end-of-string, give up, leaving remaining lengths -1 */
- if (tok == 0)
- break;
- }
-
- scanner_finish(yyscanner);
-}
-
-/*
- * comp_location: comparator for qsorting LocationLen structs by location
- */
-static int
-comp_location(const void *a, const void *b)
-{
- int l = ((const LocationLen *) a)->location;
- int r = ((const LocationLen *) b)->location;
-
- return pg_cmp_s32(l, r);
-}
diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c
index 87db8dc1a32..c997a1ef609 100644
--- a/src/backend/nodes/queryjumblefuncs.c
+++ b/src/backend/nodes/queryjumblefuncs.c
@@ -40,10 +40,12 @@
#include "access/transam.h"
#include "catalog/pg_proc.h"
#include "common/hashfn.h"
+#include "common/int.h"
#include "miscadmin.h"
#include "nodes/nodeFuncs.h"
#include "nodes/queryjumble.h"
#include "utils/lsyscache.h"
+#include "parser/scanner.h"
#include "parser/scansup.h"
#define JUMBLE_SIZE 1024 /* query serialization buffer size */
@@ -773,3 +775,254 @@ _jumbleRangeTblEntry_eref(JumbleState *jstate,
*/
JUMBLE_STRING(aliasname);
}
+
+/*
+ * CompLocation: comparator for qsorting LocationLen structs by location
+ */
+static int
+CompLocation(const void *a, const void *b)
+{
+ int l = ((const LocationLen *) a)->location;
+ int r = ((const LocationLen *) b)->location;
+
+ return pg_cmp_s32(l, r);
+}
+
+/*
+ * Given a valid SQL string and an array of constant-location records,
+ * fill in the textual lengths of those constants in JumbleState.
+ *
+ * The constants may use any allowed constant syntax, such as float literals,
+ * bit-strings, single-quoted strings and dollar-quoted strings. This is
+ * accomplished by using the public API for the core scanner.
+ *
+ * Multiple constants can have the same location. We reset lengths of those
+ * past the first to -1 so that they can later be ignored.
+ *
+ * If query_loc > 0, then "query" has been advanced by that much compared to
+ * the original string start, as is the case with multi-statement strings, so
+ * we need to translate the provided locations to compensate. (This lets us
+ * avoid re-scanning statements before the one of interest, so it's worth doing.)
+ *
+ * N.B. There is an assumption that a '-' character at a Const location begins
+ * a negative numeric constant. This precludes there ever being another
+ * reason for a constant to start with a '-'.
+ */
+static void
+SetConstantLengths(JumbleState *jstate, const char *query,
+ int query_loc)
+{
+ LocationLen *locs;
+ core_yyscan_t yyscanner;
+ core_yy_extra_type yyextra;
+ core_YYSTYPE yylval;
+ YYLTYPE yylloc;
+
+ /*
+ * Sort the records by location so that we can process them in order while
+ * scanning the query text.
+ */
+ if (jstate->clocations_count > 1)
+ qsort(jstate->clocations, jstate->clocations_count,
+ sizeof(LocationLen), CompLocation);
+ locs = jstate->clocations;
+
+ /* initialize the flex scanner --- should match raw_parser() */
+ yyscanner = scanner_init(query,
+ &yyextra,
+ &ScanKeywords,
+ ScanKeywordTokens);
+
+ /* Search for each constant, in sequence */
+ for (int i = 0; i < jstate->clocations_count; i++)
+ {
+ int loc;
+ int tok;
+
+ /* Ignore constants after the first one in the same location */
+ if (i > 0 && locs[i].location == locs[i - 1].location)
+ {
+ locs[i].length = -1;
+ continue;
+ }
+
+ if (locs[i].squashed)
+ continue; /* squashable list, ignore */
+
+ /*
+ * Adjust the constant's location using the provided starting location
+ * of the current statement. This allows us to avoid scanning a
+ * multi-statement string from the beginning.
+ */
+ loc = locs[i].location - query_loc;
+ Assert(loc >= 0);
+
+ /*
+ * We have a valid location for a constant that's not a dupe. Lex
+ * tokens until we find the desired constant.
+ */
+ for (;;)
+ {
+ tok = core_yylex(&yylval, &yylloc, yyscanner);
+
+ /* We should not hit end-of-string, but if we do, behave sanely */
+ if (tok == 0)
+ break; /* out of inner for-loop */
+
+ /*
+ * We should find the token position exactly, but if we somehow
+ * run past it, work with that.
+ */
+ if (yylloc >= loc)
+ {
+ if (query[loc] == '-')
+ {
+ /*
+ * It's a negative value - this is the one and only case
+ * where we replace more than a single token.
+ *
+ * Do not compensate for the special-case adjustment of
+ * location to that of the leading '-' operator in the
+ * event of a negative constant (see doNegate() in
+ * gram.y). It is also useful for our purposes to start
+ * from the minus symbol. In this way, queries like
+ * "select * from foo where bar = 1" and "select * from
+ * foo where bar = -2" will have identical normalized
+ * query strings.
+ */
+ tok = core_yylex(&yylval, &yylloc, yyscanner);
+ if (tok == 0)
+ break; /* out of inner for-loop */
+ }
+
+ /*
+ * We now rely on the assumption that flex has placed a zero
+ * byte after the text of the current token in scanbuf.
+ */
+ locs[i].length = strlen(yyextra.scanbuf + loc);
+ break; /* out of inner for-loop */
+ }
+ }
+
+ /* If we hit end-of-string, give up, leaving remaining lengths -1 */
+ if (tok == 0)
+ break;
+ }
+
+ scanner_finish(yyscanner);
+}
+
+/*
+ * Generate a normalized version of the query string that will be used to
+ * represent all similar queries.
+ *
+ * *query_len_p contains the input string length, and is updated with
+ * the result string length on exit. The resulting string might be longer
+ * or shorter depending on what happens with replacement of constants.
+ *
+ * Similar to SetConstantLengths, we must also translate the provided location
+ * to compensate for multi-statement strings.
+ *
+ * It is the caller's job to ensure that the string is a valid SQL statement
+ * with the correct constant locations in jstate->clocations. Since in
+ * practice the string has already been parsed using the authoritative parser
+ * and the locations are set by core query jumbling, this should not be a problem.
+ *
+ * Returns a palloc'd string.
+ */
+char *
+GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+ int query_loc, int *query_len_p)
+{
+ char *norm_query;
+ int query_len = *query_len_p;
+ int norm_query_buflen, /* Space allowed for norm_query */
+ len_to_wrt, /* Length (in bytes) to write */
+ quer_loc = 0, /* Source query byte location */
+ n_quer_loc = 0, /* Normalized query byte location */
+ last_off = 0, /* Offset from start for previous tok */
+ last_tok_len = 0; /* Length (in bytes) of that tok */
+ int num_constants_replaced = 0;
+
+ /*
+ * Set constants' lengths in JumbleState, as only locations are set during
+ * DoJumble(). Note this also ensures the items are sorted by location.
+ */
+ SetConstantLengths(jstate, query, query_loc);
+
+ /*
+ * Allow for $n symbols to be longer than the constants they replace.
+ * Constants must take at least one byte in text form, while a $n symbol
+ * certainly isn't more than 11 bytes, even if n reaches INT_MAX. We
+ * could refine that limit based on the max value of n for the current
+ * query, but it hardly seems worth any extra effort to do so.
+ */
+ norm_query_buflen = query_len + jstate->clocations_count * 10;
+
+ /* Allocate result buffer */
+ norm_query = palloc(norm_query_buflen + 1);
+
+ for (int i = 0; i < jstate->clocations_count; i++)
+ {
+ int off, /* Offset from start for cur tok */
+ tok_len; /* Length (in bytes) of that tok */
+
+ /*
+ * If we have an external param at this location, but no lists are
+ * being squashed across the query, then we skip here; this will make
+ * us print the characters found in the original query that represent
+ * the parameter in the next iteration (or after the loop is done),
+ * which is a bit odd but seems to work okay in most cases.
+ */
+ if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
+ continue;
+
+ off = jstate->clocations[i].location;
+
+ /* Adjust the constant's location (see SetConstantLengths) */
+ off -= query_loc;
+
+ tok_len = jstate->clocations[i].length;
+
+ if (tok_len < 0)
+ continue; /* ignore any duplicates */
+
+ /* Copy next chunk (what precedes the next constant) */
+ len_to_wrt = off - last_off;
+ len_to_wrt -= last_tok_len;
+ Assert(len_to_wrt >= 0);
+ memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
+ n_quer_loc += len_to_wrt;
+
+ /*
+ * And insert a param symbol in place of the constant token; and, if
+ * we have a squashable list, insert a placeholder comment starting
+ * from the list's second value.
+ */
+ n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
+ num_constants_replaced + 1 + jstate->highest_extern_param_id,
+ jstate->clocations[i].squashed ? " /*, ... */" : "");
+ num_constants_replaced++;
+
+ /* move forward */
+ quer_loc = off + tok_len;
+ last_off = off;
+ last_tok_len = tok_len;
+ }
+
+ /*
+ * We've copied up until the last ignorable constant. Copy over the
+ * remaining bytes of the original query string.
+ */
+ len_to_wrt = query_len - quer_loc;
+
+ Assert(len_to_wrt >= 0);
+ memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
+ n_quer_loc += len_to_wrt;
+
+ Assert(n_quer_loc <= norm_query_buflen);
+ norm_query[n_quer_loc] = '\0';
+
+ *query_len_p = n_quer_loc;
+ return norm_query;
+}
diff --git a/src/include/nodes/queryjumble.h b/src/include/nodes/queryjumble.h
index 9f81893003c..e354f857c37 100644
--- a/src/include/nodes/queryjumble.h
+++ b/src/include/nodes/queryjumble.h
@@ -91,6 +91,8 @@ extern PGDLLIMPORT int compute_query_id;
extern const char *CleanQuerytext(const char *query, int *location, int *len);
+extern char *GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+ int query_loc, int *query_len_p);
extern JumbleState *JumbleQuery(Query *query);
extern void EnableQueryId(void);
--
2.47.3
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2026-03-26 16:35 Lukas Fittl <[email protected]>
parent: Sami Imseih <[email protected]>
0 siblings, 1 reply; 35+ messages in thread
From: Lukas Fittl @ 2026-03-26 16:35 UTC (permalink / raw)
To: Sami Imseih <[email protected]>; +Cc: Michael Paquier <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>
Hi Sami,
On Wed, Mar 25, 2026 at 8:31 PM Sami Imseih <[email protected]> wrote:
>
> v6 addresses the comments.
>
Great - I've tested that again and changes look good.
Two more thoughts:
1) I think "SetConstantLengths" should be renamed to
"GetConstantLengths", or "CalculateConstantLengths" in 0002, since
we're no longer modifying JumbleState
2) I wonder if we should export this function in 0002 - that would
specifically help pg_tracing, which also wants to extract inline
parameter values in addition to replacing them with a $n parameter
reference marker - I could also see that being useful for any other
extension that's interested in pulling out parameter values
I've also done some testing with two previously mentioned extensions,
pg_stat_monitor [0] and pg_tracing [1]. Both look like they can be
adapted to the new method with their regression tests succeeding.
Thanks,
Lukas
[0]: https://github.com/lfittl/pg_stat_monitor/commit/d3521de9f6736c1bb745bc31dae736540efe1781
[1]: https://github.com/lfittl/pg_tracing/commit/ecff95a87a1c60e8a5fe47662cc6a2148bd3632f
--
Lukas Fittl
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2026-03-26 17:18 Sami Imseih <[email protected]>
parent: Lukas Fittl <[email protected]>
0 siblings, 1 reply; 35+ messages in thread
From: Sami Imseih @ 2026-03-26 17:18 UTC (permalink / raw)
To: Lukas Fittl <[email protected]>; +Cc: Michael Paquier <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>
> 1) I think "SetConstantLengths" should be renamed to
> "GetConstantLengths", or "CalculateConstantLengths" in 0002, since
> we're no longer modifying JumbleState
Makes sense. I renamed it to ComputeConstantLengths; as this will reflect
that it's more than a getter function, and it's doing actual work. Compute
also seems more consistent with other function names we have out there.
> 2) I wonder if we should export this function in 0002 - that would
> specifically help pg_tracing, which also wants to extract inline
> parameter values in addition to replacing them with a $n parameter
> reference marker - I could also see that being useful for any other
> extension that's interested in pulling out parameter values
> I've also done some testing with two previously mentioned extensions,
> pg_stat_monitor [0] and pg_tracing [1]. Both look like they can be
> adapted to the new method with their regression tests succeeding.
That did cross my mind. I agree with this.
Both of those changes belong in 0002. See the attached v7.
--
Sami Imseih
Amazon Web Services (AWS)
Attachments:
[application/octet-stream] v7-0001-pg_stat_statements-Move-query-normalization-to-co.patch (21.1K, 2-v7-0001-pg_stat_statements-Move-query-normalization-to-co.patch)
download | inline diff:
From e5d441c4e4932b7453293c689c2867c1d9d1d53f Mon Sep 17 00:00:00 2001
From: Ubuntu <[email protected]>
Date: Thu, 18 Dec 2025 18:59:31 +0000
Subject: [PATCH v7 1/2] pg_stat_statements: Move query normalization to core
pg_stat_statements was modifying core generated JumbleState instances
by calling fill_in_constant_lengths() via generate_normalized_query()
to populate constant lengths. This creates problems since extensions
should not modify shared core data structures that may be used by
other extensions, and this could be considered a layering violation.
Move query normalization functions to core as a global function to
fix this. Extensions now call GenerateNormalizedQuery() instead of
directly modifying JumbleState.
This change also addresses code duplication, as several extensions
have been copying generate_normalized_query() to implement their own
normalization. The new core API provides a single, consistent
implementation for all extensions to use.
Functions are renamed to match core naming conventions and comments
are updated for clarity.
Discussion: https://postgr.es/m/CAA5RZ0tZp5qU0ikZEEqJnxvdSNGh1DWv80sb-k4QAUmiMoOp_Q@mail.gmail.com
---
.../pg_stat_statements/pg_stat_statements.c | 273 +-----------------
src/backend/nodes/queryjumblefuncs.c | 253 ++++++++++++++++
src/include/nodes/queryjumble.h | 2 +
3 files changed, 265 insertions(+), 263 deletions(-)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 6cb14824ec3..c9338442d96 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -50,7 +50,6 @@
#include "access/htup_details.h"
#include "access/parallel.h"
#include "catalog/pg_authid.h"
-#include "common/int.h"
#include "executor/instrument.h"
#include "funcapi.h"
#include "jit/jit.h"
@@ -59,7 +58,6 @@
#include "nodes/queryjumble.h"
#include "optimizer/planner.h"
#include "parser/analyze.h"
-#include "parser/scanner.h"
#include "pgstat.h"
#include "storage/fd.h"
#include "storage/ipc.h"
@@ -378,11 +376,6 @@ static char *qtext_fetch(Size query_offset, int query_len,
static bool need_gc_qtexts(void);
static void gc_qtexts(void);
static TimestampTz entry_reset(Oid userid, Oid dbid, int64 queryid, bool minmax_only);
-static char *generate_normalized_query(JumbleState *jstate, const char *query,
- int query_loc, int *query_len_p);
-static void fill_in_constant_lengths(JumbleState *jstate, const char *query,
- int query_loc);
-static int comp_location(const void *a, const void *b);
/*
@@ -1360,9 +1353,16 @@ pgss_store(const char *query, int64 queryId,
if (jstate)
{
LWLockRelease(pgss->lock);
- norm_query = generate_normalized_query(jstate, query,
- query_location,
- &query_len);
+
+ /*
+ * generate the normalized query. Note that the normalized
+ * representation may well vary depending on just which
+ * "equivalent" query is used to create the hashtable entry. We
+ * assume this is OK.
+ */
+ norm_query = GenerateNormalizedQuery(jstate, query,
+ query_location,
+ &query_len);
LWLockAcquire(pgss->lock, LW_SHARED);
}
@@ -2824,256 +2824,3 @@ release_lock:
return stats_reset;
}
-
-/*
- * Generate a normalized version of the query string that will be used to
- * represent all similar queries.
- *
- * Note that the normalized representation may well vary depending on
- * just which "equivalent" query is used to create the hashtable entry.
- * We assume this is OK.
- *
- * If query_loc > 0, then "query" has been advanced by that much compared to
- * the original string start, so we need to translate the provided locations
- * to compensate. (This lets us avoid re-scanning statements before the one
- * of interest, so it's worth doing.)
- *
- * *query_len_p contains the input string length, and is updated with
- * the result string length on exit. The resulting string might be longer
- * or shorter depending on what happens with replacement of constants.
- *
- * Returns a palloc'd string.
- */
-static char *
-generate_normalized_query(JumbleState *jstate, const char *query,
- int query_loc, int *query_len_p)
-{
- char *norm_query;
- int query_len = *query_len_p;
- int norm_query_buflen, /* Space allowed for norm_query */
- len_to_wrt, /* Length (in bytes) to write */
- quer_loc = 0, /* Source query byte location */
- n_quer_loc = 0, /* Normalized query byte location */
- last_off = 0, /* Offset from start for previous tok */
- last_tok_len = 0; /* Length (in bytes) of that tok */
- int num_constants_replaced = 0;
-
- /*
- * Get constants' lengths (core system only gives us locations). Note
- * this also ensures the items are sorted by location.
- */
- fill_in_constant_lengths(jstate, query, query_loc);
-
- /*
- * Allow for $n symbols to be longer than the constants they replace.
- * Constants must take at least one byte in text form, while a $n symbol
- * certainly isn't more than 11 bytes, even if n reaches INT_MAX. We
- * could refine that limit based on the max value of n for the current
- * query, but it hardly seems worth any extra effort to do so.
- */
- norm_query_buflen = query_len + jstate->clocations_count * 10;
-
- /* Allocate result buffer */
- norm_query = palloc(norm_query_buflen + 1);
-
- for (int i = 0; i < jstate->clocations_count; i++)
- {
- int off, /* Offset from start for cur tok */
- tok_len; /* Length (in bytes) of that tok */
-
- /*
- * If we have an external param at this location, but no lists are
- * being squashed across the query, then we skip here; this will make
- * us print the characters found in the original query that represent
- * the parameter in the next iteration (or after the loop is done),
- * which is a bit odd but seems to work okay in most cases.
- */
- if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
- continue;
-
- off = jstate->clocations[i].location;
-
- /* Adjust recorded location if we're dealing with partial string */
- off -= query_loc;
-
- tok_len = jstate->clocations[i].length;
-
- if (tok_len < 0)
- continue; /* ignore any duplicates */
-
- /* Copy next chunk (what precedes the next constant) */
- len_to_wrt = off - last_off;
- len_to_wrt -= last_tok_len;
- Assert(len_to_wrt >= 0);
- memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
- n_quer_loc += len_to_wrt;
-
- /*
- * And insert a param symbol in place of the constant token; and, if
- * we have a squashable list, insert a placeholder comment starting
- * from the list's second value.
- */
- n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
- num_constants_replaced + 1 + jstate->highest_extern_param_id,
- jstate->clocations[i].squashed ? " /*, ... */" : "");
- num_constants_replaced++;
-
- /* move forward */
- quer_loc = off + tok_len;
- last_off = off;
- last_tok_len = tok_len;
- }
-
- /*
- * We've copied up until the last ignorable constant. Copy over the
- * remaining bytes of the original query string.
- */
- len_to_wrt = query_len - quer_loc;
-
- Assert(len_to_wrt >= 0);
- memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
- n_quer_loc += len_to_wrt;
-
- Assert(n_quer_loc <= norm_query_buflen);
- norm_query[n_quer_loc] = '\0';
-
- *query_len_p = n_quer_loc;
- return norm_query;
-}
-
-/*
- * Given a valid SQL string and an array of constant-location records,
- * fill in the textual lengths of those constants.
- *
- * The constants may use any allowed constant syntax, such as float literals,
- * bit-strings, single-quoted strings and dollar-quoted strings. This is
- * accomplished by using the public API for the core scanner.
- *
- * It is the caller's job to ensure that the string is a valid SQL statement
- * with constants at the indicated locations. Since in practice the string
- * has already been parsed, and the locations that the caller provides will
- * have originated from within the authoritative parser, this should not be
- * a problem.
- *
- * Multiple constants can have the same location. We reset lengths of those
- * past the first to -1 so that they can later be ignored.
- *
- * If query_loc > 0, then "query" has been advanced by that much compared to
- * the original string start, so we need to translate the provided locations
- * to compensate. (This lets us avoid re-scanning statements before the one
- * of interest, so it's worth doing.)
- *
- * N.B. There is an assumption that a '-' character at a Const location begins
- * a negative numeric constant. This precludes there ever being another
- * reason for a constant to start with a '-'.
- */
-static void
-fill_in_constant_lengths(JumbleState *jstate, const char *query,
- int query_loc)
-{
- LocationLen *locs;
- core_yyscan_t yyscanner;
- core_yy_extra_type yyextra;
- core_YYSTYPE yylval;
- YYLTYPE yylloc;
-
- /*
- * Sort the records by location so that we can process them in order while
- * scanning the query text.
- */
- if (jstate->clocations_count > 1)
- qsort(jstate->clocations, jstate->clocations_count,
- sizeof(LocationLen), comp_location);
- locs = jstate->clocations;
-
- /* initialize the flex scanner --- should match raw_parser() */
- yyscanner = scanner_init(query,
- &yyextra,
- &ScanKeywords,
- ScanKeywordTokens);
-
- /* Search for each constant, in sequence */
- for (int i = 0; i < jstate->clocations_count; i++)
- {
- int loc;
- int tok;
-
- /* Ignore constants after the first one in the same location */
- if (i > 0 && locs[i].location == locs[i - 1].location)
- {
- locs[i].length = -1;
- continue;
- }
-
- if (locs[i].squashed)
- continue; /* squashable list, ignore */
-
- /* Adjust recorded location if we're dealing with partial string */
- loc = locs[i].location - query_loc;
- Assert(loc >= 0);
-
- /*
- * We have a valid location for a constant that's not a dupe. Lex
- * tokens until we find the desired constant.
- */
- for (;;)
- {
- tok = core_yylex(&yylval, &yylloc, yyscanner);
-
- /* We should not hit end-of-string, but if we do, behave sanely */
- if (tok == 0)
- break; /* out of inner for-loop */
-
- /*
- * We should find the token position exactly, but if we somehow
- * run past it, work with that.
- */
- if (yylloc >= loc)
- {
- if (query[loc] == '-')
- {
- /*
- * It's a negative value - this is the one and only case
- * where we replace more than a single token.
- *
- * Do not compensate for the core system's special-case
- * adjustment of location to that of the leading '-'
- * operator in the event of a negative constant. It is
- * also useful for our purposes to start from the minus
- * symbol. In this way, queries like "select * from foo
- * where bar = 1" and "select * from foo where bar = -2"
- * will have identical normalized query strings.
- */
- tok = core_yylex(&yylval, &yylloc, yyscanner);
- if (tok == 0)
- break; /* out of inner for-loop */
- }
-
- /*
- * We now rely on the assumption that flex has placed a zero
- * byte after the text of the current token in scanbuf.
- */
- locs[i].length = strlen(yyextra.scanbuf + loc);
- break; /* out of inner for-loop */
- }
- }
-
- /* If we hit end-of-string, give up, leaving remaining lengths -1 */
- if (tok == 0)
- break;
- }
-
- scanner_finish(yyscanner);
-}
-
-/*
- * comp_location: comparator for qsorting LocationLen structs by location
- */
-static int
-comp_location(const void *a, const void *b)
-{
- int l = ((const LocationLen *) a)->location;
- int r = ((const LocationLen *) b)->location;
-
- return pg_cmp_s32(l, r);
-}
diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c
index 87db8dc1a32..c997a1ef609 100644
--- a/src/backend/nodes/queryjumblefuncs.c
+++ b/src/backend/nodes/queryjumblefuncs.c
@@ -40,10 +40,12 @@
#include "access/transam.h"
#include "catalog/pg_proc.h"
#include "common/hashfn.h"
+#include "common/int.h"
#include "miscadmin.h"
#include "nodes/nodeFuncs.h"
#include "nodes/queryjumble.h"
#include "utils/lsyscache.h"
+#include "parser/scanner.h"
#include "parser/scansup.h"
#define JUMBLE_SIZE 1024 /* query serialization buffer size */
@@ -773,3 +775,254 @@ _jumbleRangeTblEntry_eref(JumbleState *jstate,
*/
JUMBLE_STRING(aliasname);
}
+
+/*
+ * CompLocation: comparator for qsorting LocationLen structs by location
+ */
+static int
+CompLocation(const void *a, const void *b)
+{
+ int l = ((const LocationLen *) a)->location;
+ int r = ((const LocationLen *) b)->location;
+
+ return pg_cmp_s32(l, r);
+}
+
+/*
+ * Given a valid SQL string and an array of constant-location records,
+ * fill in the textual lengths of those constants in JumbleState.
+ *
+ * The constants may use any allowed constant syntax, such as float literals,
+ * bit-strings, single-quoted strings and dollar-quoted strings. This is
+ * accomplished by using the public API for the core scanner.
+ *
+ * Multiple constants can have the same location. We reset lengths of those
+ * past the first to -1 so that they can later be ignored.
+ *
+ * If query_loc > 0, then "query" has been advanced by that much compared to
+ * the original string start, as is the case with multi-statement strings, so
+ * we need to translate the provided locations to compensate. (This lets us
+ * avoid re-scanning statements before the one of interest, so it's worth doing.)
+ *
+ * N.B. There is an assumption that a '-' character at a Const location begins
+ * a negative numeric constant. This precludes there ever being another
+ * reason for a constant to start with a '-'.
+ */
+static void
+SetConstantLengths(JumbleState *jstate, const char *query,
+ int query_loc)
+{
+ LocationLen *locs;
+ core_yyscan_t yyscanner;
+ core_yy_extra_type yyextra;
+ core_YYSTYPE yylval;
+ YYLTYPE yylloc;
+
+ /*
+ * Sort the records by location so that we can process them in order while
+ * scanning the query text.
+ */
+ if (jstate->clocations_count > 1)
+ qsort(jstate->clocations, jstate->clocations_count,
+ sizeof(LocationLen), CompLocation);
+ locs = jstate->clocations;
+
+ /* initialize the flex scanner --- should match raw_parser() */
+ yyscanner = scanner_init(query,
+ &yyextra,
+ &ScanKeywords,
+ ScanKeywordTokens);
+
+ /* Search for each constant, in sequence */
+ for (int i = 0; i < jstate->clocations_count; i++)
+ {
+ int loc;
+ int tok;
+
+ /* Ignore constants after the first one in the same location */
+ if (i > 0 && locs[i].location == locs[i - 1].location)
+ {
+ locs[i].length = -1;
+ continue;
+ }
+
+ if (locs[i].squashed)
+ continue; /* squashable list, ignore */
+
+ /*
+ * Adjust the constant's location using the provided starting location
+ * of the current statement. This allows us to avoid scanning a
+ * multi-statement string from the beginning.
+ */
+ loc = locs[i].location - query_loc;
+ Assert(loc >= 0);
+
+ /*
+ * We have a valid location for a constant that's not a dupe. Lex
+ * tokens until we find the desired constant.
+ */
+ for (;;)
+ {
+ tok = core_yylex(&yylval, &yylloc, yyscanner);
+
+ /* We should not hit end-of-string, but if we do, behave sanely */
+ if (tok == 0)
+ break; /* out of inner for-loop */
+
+ /*
+ * We should find the token position exactly, but if we somehow
+ * run past it, work with that.
+ */
+ if (yylloc >= loc)
+ {
+ if (query[loc] == '-')
+ {
+ /*
+ * It's a negative value - this is the one and only case
+ * where we replace more than a single token.
+ *
+ * Do not compensate for the special-case adjustment of
+ * location to that of the leading '-' operator in the
+ * event of a negative constant (see doNegate() in
+ * gram.y). It is also useful for our purposes to start
+ * from the minus symbol. In this way, queries like
+ * "select * from foo where bar = 1" and "select * from
+ * foo where bar = -2" will have identical normalized
+ * query strings.
+ */
+ tok = core_yylex(&yylval, &yylloc, yyscanner);
+ if (tok == 0)
+ break; /* out of inner for-loop */
+ }
+
+ /*
+ * We now rely on the assumption that flex has placed a zero
+ * byte after the text of the current token in scanbuf.
+ */
+ locs[i].length = strlen(yyextra.scanbuf + loc);
+ break; /* out of inner for-loop */
+ }
+ }
+
+ /* If we hit end-of-string, give up, leaving remaining lengths -1 */
+ if (tok == 0)
+ break;
+ }
+
+ scanner_finish(yyscanner);
+}
+
+/*
+ * Generate a normalized version of the query string that will be used to
+ * represent all similar queries.
+ *
+ * *query_len_p contains the input string length, and is updated with
+ * the result string length on exit. The resulting string might be longer
+ * or shorter depending on what happens with replacement of constants.
+ *
+ * Similar to SetConstantLengths, we must also translate the provided location
+ * to compensate for multi-statement strings.
+ *
+ * It is the caller's job to ensure that the string is a valid SQL statement
+ * with the correct constant locations in jstate->clocations. Since in
+ * practice the string has already been parsed using the authoritative parser
+ * and the locations are set by core query jumbling, this should not be a problem.
+ *
+ * Returns a palloc'd string.
+ */
+char *
+GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+ int query_loc, int *query_len_p)
+{
+ char *norm_query;
+ int query_len = *query_len_p;
+ int norm_query_buflen, /* Space allowed for norm_query */
+ len_to_wrt, /* Length (in bytes) to write */
+ quer_loc = 0, /* Source query byte location */
+ n_quer_loc = 0, /* Normalized query byte location */
+ last_off = 0, /* Offset from start for previous tok */
+ last_tok_len = 0; /* Length (in bytes) of that tok */
+ int num_constants_replaced = 0;
+
+ /*
+ * Set constants' lengths in JumbleState, as only locations are set during
+ * DoJumble(). Note this also ensures the items are sorted by location.
+ */
+ SetConstantLengths(jstate, query, query_loc);
+
+ /*
+ * Allow for $n symbols to be longer than the constants they replace.
+ * Constants must take at least one byte in text form, while a $n symbol
+ * certainly isn't more than 11 bytes, even if n reaches INT_MAX. We
+ * could refine that limit based on the max value of n for the current
+ * query, but it hardly seems worth any extra effort to do so.
+ */
+ norm_query_buflen = query_len + jstate->clocations_count * 10;
+
+ /* Allocate result buffer */
+ norm_query = palloc(norm_query_buflen + 1);
+
+ for (int i = 0; i < jstate->clocations_count; i++)
+ {
+ int off, /* Offset from start for cur tok */
+ tok_len; /* Length (in bytes) of that tok */
+
+ /*
+ * If we have an external param at this location, but no lists are
+ * being squashed across the query, then we skip here; this will make
+ * us print the characters found in the original query that represent
+ * the parameter in the next iteration (or after the loop is done),
+ * which is a bit odd but seems to work okay in most cases.
+ */
+ if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
+ continue;
+
+ off = jstate->clocations[i].location;
+
+ /* Adjust the constant's location (see SetConstantLengths) */
+ off -= query_loc;
+
+ tok_len = jstate->clocations[i].length;
+
+ if (tok_len < 0)
+ continue; /* ignore any duplicates */
+
+ /* Copy next chunk (what precedes the next constant) */
+ len_to_wrt = off - last_off;
+ len_to_wrt -= last_tok_len;
+ Assert(len_to_wrt >= 0);
+ memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
+ n_quer_loc += len_to_wrt;
+
+ /*
+ * And insert a param symbol in place of the constant token; and, if
+ * we have a squashable list, insert a placeholder comment starting
+ * from the list's second value.
+ */
+ n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
+ num_constants_replaced + 1 + jstate->highest_extern_param_id,
+ jstate->clocations[i].squashed ? " /*, ... */" : "");
+ num_constants_replaced++;
+
+ /* move forward */
+ quer_loc = off + tok_len;
+ last_off = off;
+ last_tok_len = tok_len;
+ }
+
+ /*
+ * We've copied up until the last ignorable constant. Copy over the
+ * remaining bytes of the original query string.
+ */
+ len_to_wrt = query_len - quer_loc;
+
+ Assert(len_to_wrt >= 0);
+ memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
+ n_quer_loc += len_to_wrt;
+
+ Assert(n_quer_loc <= norm_query_buflen);
+ norm_query[n_quer_loc] = '\0';
+
+ *query_len_p = n_quer_loc;
+ return norm_query;
+}
diff --git a/src/include/nodes/queryjumble.h b/src/include/nodes/queryjumble.h
index 9f81893003c..e354f857c37 100644
--- a/src/include/nodes/queryjumble.h
+++ b/src/include/nodes/queryjumble.h
@@ -91,6 +91,8 @@ extern PGDLLIMPORT int compute_query_id;
extern const char *CleanQuerytext(const char *query, int *location, int *len);
+extern char *GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+ int query_loc, int *query_len_p);
extern JumbleState *JumbleQuery(Query *query);
extern void EnableQueryId(void);
--
2.47.3
[application/octet-stream] v7-0002-Make-JumbleState-const-in-post_parse_analyze-hook.patch (9.7K, 3-v7-0002-Make-JumbleState-const-in-post_parse_analyze-hook.patch)
download | inline diff:
From 42360a306bfed6b899bd03e60c69b31db59e2ec0 Mon Sep 17 00:00:00 2001
From: Ubuntu <[email protected]>
Date: Tue, 6 Jan 2026 19:29:01 +0000
Subject: [PATCH v7 2/2] Make JumbleState const in post_parse_analyze hook
Change the post_parse_analyze_hook_type signature to take a const
JumbleState parameter, preventing hooks from modifying the jumble
state during query analysis. This improves API safety by making it
clear that hooks should only read from the jumble state, not modify
it.
Update pg_stat_statements and related functions to match the new const
signature. Refactor SetConstantLengths() to return a newly allocated
LocationLen array instead of modifying the JumbleState, and update
GenerateNormalizedQuery() to work with the const JumbleState and
manage the returned array properly.
Also, rename SetConstantLengths() to ComputeConstantLengths() as it is
no longer setting constant lengths, and it's also computing lengths
based on traversing through the LocationLen data.
Furthermore, ComputeConstantLengths is also exported to allow plug-ins
that wish to perform custom processing of literals from a query.
Discussion: https://postgr.es/m/CAA5RZ0tZp5qU0ikZEEqJnxvdSNGh1DWv80sb-k4QAUmiMoOp_Q@mail.gmail.com
---
.../pg_stat_statements/pg_stat_statements.c | 8 +--
src/backend/nodes/queryjumblefuncs.c | 50 ++++++++++++-------
src/include/nodes/queryjumble.h | 4 +-
src/include/parser/analyze.h | 2 +-
4 files changed, 41 insertions(+), 23 deletions(-)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index c9338442d96..386a1419232 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -333,7 +333,7 @@ static void pgss_shmem_request(void);
static void pgss_shmem_startup(void);
static void pgss_shmem_shutdown(int code, Datum arg);
static void pgss_post_parse_analyze(ParseState *pstate, Query *query,
- JumbleState *jstate);
+ const JumbleState *jstate);
static PlannedStmt *pgss_planner(Query *parse,
const char *query_string,
int cursorOptions,
@@ -357,7 +357,7 @@ static void pgss_store(const char *query, int64 queryId,
const BufferUsage *bufusage,
const WalUsage *walusage,
const struct JitInstrumentation *jitusage,
- JumbleState *jstate,
+ const JumbleState *jstate,
int parallel_workers_to_launch,
int parallel_workers_launched,
PlannedStmtOrigin planOrigin);
@@ -833,7 +833,7 @@ error:
* Post-parse-analysis hook: mark query with a queryId
*/
static void
-pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
+pgss_post_parse_analyze(ParseState *pstate, Query *query, const JumbleState *jstate)
{
if (prev_post_parse_analyze_hook)
prev_post_parse_analyze_hook(pstate, query, jstate);
@@ -1290,7 +1290,7 @@ pgss_store(const char *query, int64 queryId,
const BufferUsage *bufusage,
const WalUsage *walusage,
const struct JitInstrumentation *jitusage,
- JumbleState *jstate,
+ const JumbleState *jstate,
int parallel_workers_to_launch,
int parallel_workers_launched,
PlannedStmtOrigin planOrigin)
diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c
index c997a1ef609..af9417797a3 100644
--- a/src/backend/nodes/queryjumblefuncs.c
+++ b/src/backend/nodes/queryjumblefuncs.c
@@ -789,8 +789,10 @@ CompLocation(const void *a, const void *b)
}
/*
- * Given a valid SQL string and an array of constant-location records,
- * fill in the textual lengths of those constants in JumbleState.
+ * Given a valid SQL string and an array of constant-location records, return
+ * the textual lengths of those constants in a newly allocated LocationLen
+ * array, or NULL if there are no constants. It is the caller's responsibility
+ * to pfree the result, if necessary.
*
* The constants may use any allowed constant syntax, such as float literals,
* bit-strings, single-quoted strings and dollar-quoted strings. This is
@@ -808,9 +810,9 @@ CompLocation(const void *a, const void *b)
* a negative numeric constant. This precludes there ever being another
* reason for a constant to start with a '-'.
*/
-static void
-SetConstantLengths(JumbleState *jstate, const char *query,
- int query_loc)
+LocationLen *
+ComputeConstantLengths(const JumbleState *jstate, const char *query,
+ int query_loc)
{
LocationLen *locs;
core_yyscan_t yyscanner;
@@ -818,14 +820,20 @@ SetConstantLengths(JumbleState *jstate, const char *query,
core_YYSTYPE yylval;
YYLTYPE yylloc;
+ if (jstate->clocations_count == 0)
+ return NULL;
+
+ /* Copy constant locations to avoid modifying jstate */
+ locs = palloc_array(LocationLen, jstate->clocations_count);
+ memcpy(locs, jstate->clocations, jstate->clocations_count * sizeof(LocationLen));
+
/*
* Sort the records by location so that we can process them in order while
* scanning the query text.
*/
if (jstate->clocations_count > 1)
- qsort(jstate->clocations, jstate->clocations_count,
+ qsort(locs, jstate->clocations_count,
sizeof(LocationLen), CompLocation);
- locs = jstate->clocations;
/* initialize the flex scanner --- should match raw_parser() */
yyscanner = scanner_init(query,
@@ -910,6 +918,8 @@ SetConstantLengths(JumbleState *jstate, const char *query,
}
scanner_finish(yyscanner);
+
+ return locs;
}
/*
@@ -920,7 +930,7 @@ SetConstantLengths(JumbleState *jstate, const char *query,
* the result string length on exit. The resulting string might be longer
* or shorter depending on what happens with replacement of constants.
*
- * Similar to SetConstantLengths, we must also translate the provided location
+ * Similar to ComputeConstantLengths, we must also translate the provided location
* to compensate for multi-statement strings.
*
* It is the caller's job to ensure that the string is a valid SQL statement
@@ -931,7 +941,7 @@ SetConstantLengths(JumbleState *jstate, const char *query,
* Returns a palloc'd string.
*/
char *
-GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+GenerateNormalizedQuery(const JumbleState *jstate, const char *query,
int query_loc, int *query_len_p)
{
char *norm_query;
@@ -943,12 +953,14 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
last_off = 0, /* Offset from start for previous tok */
last_tok_len = 0; /* Length (in bytes) of that tok */
int num_constants_replaced = 0;
+ LocationLen *locs = NULL;
/*
- * Set constants' lengths in JumbleState, as only locations are set during
- * DoJumble(). Note this also ensures the items are sorted by location.
+ * Determine constants' lengths, which are not set during DoJumble(), and
+ * return a sorted copy of jstate's LocationLen data with lengths filled
+ * in.
*/
- SetConstantLengths(jstate, query, query_loc);
+ locs = ComputeConstantLengths(jstate, query, query_loc);
/*
* Allow for $n symbols to be longer than the constants they replace.
@@ -974,15 +986,15 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
* the parameter in the next iteration (or after the loop is done),
* which is a bit odd but seems to work okay in most cases.
*/
- if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
+ if (locs[i].extern_param && !jstate->has_squashed_lists)
continue;
- off = jstate->clocations[i].location;
+ off = locs[i].location;
- /* Adjust the constant's location (see SetConstantLengths) */
+ /* Adjust the constant's location (see ComputeConstantLengths) */
off -= query_loc;
- tok_len = jstate->clocations[i].length;
+ tok_len = locs[i].length;
if (tok_len < 0)
continue; /* ignore any duplicates */
@@ -1001,7 +1013,7 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
*/
n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
num_constants_replaced + 1 + jstate->highest_extern_param_id,
- jstate->clocations[i].squashed ? " /*, ... */" : "");
+ locs[i].squashed ? " /*, ... */" : "");
num_constants_replaced++;
/* move forward */
@@ -1010,6 +1022,10 @@ GenerateNormalizedQuery(JumbleState *jstate, const char *query,
last_tok_len = tok_len;
}
+ /* Clean up temporary location array before any assertions */
+ if (locs)
+ pfree(locs);
+
/*
* We've copied up until the last ignorable constant. Copy over the
* remaining bytes of the original query string.
diff --git a/src/include/nodes/queryjumble.h b/src/include/nodes/queryjumble.h
index e354f857c37..8364d58e4cb 100644
--- a/src/include/nodes/queryjumble.h
+++ b/src/include/nodes/queryjumble.h
@@ -91,8 +91,10 @@ extern PGDLLIMPORT int compute_query_id;
extern const char *CleanQuerytext(const char *query, int *location, int *len);
-extern char *GenerateNormalizedQuery(JumbleState *jstate, const char *query,
+extern char *GenerateNormalizedQuery(const JumbleState *jstate, const char *query,
int query_loc, int *query_len_p);
+extern LocationLen *ComputeConstantLengths(const JumbleState *jstate, const char *query,
+ int query_loc);
extern JumbleState *JumbleQuery(Query *query);
extern void EnableQueryId(void);
diff --git a/src/include/parser/analyze.h b/src/include/parser/analyze.h
index e10270ff0ff..b2db6fa4e8c 100644
--- a/src/include/parser/analyze.h
+++ b/src/include/parser/analyze.h
@@ -21,7 +21,7 @@
/* Hook for plugins to get control at end of parse analysis */
typedef void (*post_parse_analyze_hook_type) (ParseState *pstate,
Query *query,
- JumbleState *jstate);
+ const JumbleState *jstate);
extern PGDLLIMPORT post_parse_analyze_hook_type post_parse_analyze_hook;
--
2.47.3
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2026-03-27 01:50 Lukas Fittl <[email protected]>
parent: Sami Imseih <[email protected]>
0 siblings, 1 reply; 35+ messages in thread
From: Lukas Fittl @ 2026-03-27 01:50 UTC (permalink / raw)
To: Sami Imseih <[email protected]>; +Cc: Michael Paquier <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>
On Thu, Mar 26, 2026 at 10:19 AM Sami Imseih <[email protected]> wrote:
> Both of those changes belong in 0002. See the attached v7.
Looks good!
I've also done a quick follow-up test with pg_tracing and that
simplifies its logic as expected [0] to be able to extract inline
parameter values.
Thanks,
Lukas
[0]: https://github.com/lfittl/pg_tracing/commit/ae937fdee4aa57206b31b5746f36dd8e55cf43d1#diff-41cf816684...
--
Lukas Fittl
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2026-03-27 05:17 Michael Paquier <[email protected]>
parent: Lukas Fittl <[email protected]>
0 siblings, 1 reply; 35+ messages in thread
From: Michael Paquier @ 2026-03-27 05:17 UTC (permalink / raw)
To: Lukas Fittl <[email protected]>; +Cc: Sami Imseih <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>
On Thu, Mar 26, 2026 at 06:50:20PM -0700, Lukas Fittl wrote:
> On Thu, Mar 26, 2026 at 10:19 AM Sami Imseih <[email protected]> wrote:
> > Both of those changes belong in 0002. See the attached v7.
>
> Looks good!
>
> I've also done a quick follow-up test with pg_tracing and that
> simplifies its logic as expected [0] to be able to extract inline
> parameter values.
I have been looking at what you have here. As mentioned upthread, I
am on board with aiming for JumbleState to be a const so as we enforce
as policy that no extension is allowed to touch what the core code has
computed.
However, I am not convinced by most of the patch. The logic to
recompute the lengths of the constants is a concept proper to PGSS.
Perhaps we could reconsider moving that into core at some point, but
I am honestly not sure to see the range of benefits we'd gain from
that.
This line of arguments is stronger for the normalization of the query
string. Why should the core code decide what a normalized string
should look like when it comes to the detection of the constants, if
any? Instead of a dollar-quoted number, we could enforce a bunch of
things, like a '?' or a '$woozah$' at these locations.
Saying that, the key point of the patch is to take a copy of the
JumbleState locations, and recompute it for the needs of PGSS for the
sake of the normalized query representation. Hence, why don't we just
do that at the end? That should be enough to enforce the const marker
for the JumbleState across all the loaded extensions that want to look
at it. This leads me to the simpler patch attached.
Comments or tomatoes? That's simpler, and I'd be OK with just that
for v19. That would be much better than the current state of affairs,
where PGSS decides to enforce its own ideas in the JumbleState, ideas
fed to anything looping into the post-parse-analyze hook after PGSS.
--
Michael
From 336e6aa38d6dc3901c3fa1f746796a0b8fc87cd2 Mon Sep 17 00:00:00 2001
From: Michael Paquier <[email protected]>
Date: Fri, 27 Mar 2026 14:13:00 +0900
Subject: [PATCH v8] Make JumbleState const in post_parse_analyze hook
Change the post_parse_analyze_hook_type signature to take a const
JumbleState parameter, preventing hooks from modifying the jumble
state during query analysis. This improves API safety by making it
clear that hooks should only read from the jumble state, not modify
it.
Update pg_stat_statements and related functions to match the new const
signature. Refactor fill_in_constant_lengths() to return a newly
allocated LocationLen array instead of modifying the JumbleState, so as
PGSS does not touch the now-const JumbleState. The routine is renamed
to compute_constant_lengths().
Discussion: https://postgr.es/m/CAA5RZ0tZp5qU0ikZEEqJnxvdSNGh1DWv80sb-k4QAUmiMoOp_Q@mail.gmail.com
---
src/include/parser/analyze.h | 2 +-
.../pg_stat_statements/pg_stat_statements.c | 75 ++++++++++++-------
2 files changed, 50 insertions(+), 27 deletions(-)
diff --git a/src/include/parser/analyze.h b/src/include/parser/analyze.h
index e10270ff0ffd..b2db6fa4e8c4 100644
--- a/src/include/parser/analyze.h
+++ b/src/include/parser/analyze.h
@@ -21,7 +21,7 @@
/* Hook for plugins to get control at end of parse analysis */
typedef void (*post_parse_analyze_hook_type) (ParseState *pstate,
Query *query,
- JumbleState *jstate);
+ const JumbleState *jstate);
extern PGDLLIMPORT post_parse_analyze_hook_type post_parse_analyze_hook;
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 6cb14824ec3b..0b6be5e255a4 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -335,7 +335,7 @@ static void pgss_shmem_request(void);
static void pgss_shmem_startup(void);
static void pgss_shmem_shutdown(int code, Datum arg);
static void pgss_post_parse_analyze(ParseState *pstate, Query *query,
- JumbleState *jstate);
+ const JumbleState *jstate);
static PlannedStmt *pgss_planner(Query *parse,
const char *query_string,
int cursorOptions,
@@ -359,7 +359,7 @@ static void pgss_store(const char *query, int64 queryId,
const BufferUsage *bufusage,
const WalUsage *walusage,
const struct JitInstrumentation *jitusage,
- JumbleState *jstate,
+ const JumbleState *jstate,
int parallel_workers_to_launch,
int parallel_workers_launched,
PlannedStmtOrigin planOrigin);
@@ -378,13 +378,14 @@ static char *qtext_fetch(Size query_offset, int query_len,
static bool need_gc_qtexts(void);
static void gc_qtexts(void);
static TimestampTz entry_reset(Oid userid, Oid dbid, int64 queryid, bool minmax_only);
-static char *generate_normalized_query(JumbleState *jstate, const char *query,
+static char *generate_normalized_query(const JumbleState *jstate,
+ const char *query,
int query_loc, int *query_len_p);
-static void fill_in_constant_lengths(JumbleState *jstate, const char *query,
- int query_loc);
+static LocationLen *compute_constant_lengths(const JumbleState *jstate,
+ const char *query,
+ int query_loc);
static int comp_location(const void *a, const void *b);
-
/*
* Module load callback
*/
@@ -840,7 +841,7 @@ error:
* Post-parse-analysis hook: mark query with a queryId
*/
static void
-pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
+pgss_post_parse_analyze(ParseState *pstate, Query *query, const JumbleState *jstate)
{
if (prev_post_parse_analyze_hook)
prev_post_parse_analyze_hook(pstate, query, jstate);
@@ -1297,7 +1298,7 @@ pgss_store(const char *query, int64 queryId,
const BufferUsage *bufusage,
const WalUsage *walusage,
const struct JitInstrumentation *jitusage,
- JumbleState *jstate,
+ const JumbleState *jstate,
int parallel_workers_to_launch,
int parallel_workers_launched,
PlannedStmtOrigin planOrigin)
@@ -2845,7 +2846,7 @@ release_lock:
* Returns a palloc'd string.
*/
static char *
-generate_normalized_query(JumbleState *jstate, const char *query,
+generate_normalized_query(const JumbleState *jstate, const char *query,
int query_loc, int *query_len_p)
{
char *norm_query;
@@ -2857,12 +2858,14 @@ generate_normalized_query(JumbleState *jstate, const char *query,
last_off = 0, /* Offset from start for previous tok */
last_tok_len = 0; /* Length (in bytes) of that tok */
int num_constants_replaced = 0;
+ LocationLen *locs = NULL;
/*
- * Get constants' lengths (core system only gives us locations). Note
- * this also ensures the items are sorted by location.
+ * Determine constants' lengths (core system only gives us locations),
+ * and return a sorted copy of jstate's LocationLen data with lengths
+ * filled in.
*/
- fill_in_constant_lengths(jstate, query, query_loc);
+ locs = compute_constant_lengths(jstate, query, query_loc);
/*
* Allow for $n symbols to be longer than the constants they replace.
@@ -2888,15 +2891,15 @@ generate_normalized_query(JumbleState *jstate, const char *query,
* the parameter in the next iteration (or after the loop is done),
* which is a bit odd but seems to work okay in most cases.
*/
- if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
+ if (locs[i].extern_param && !jstate->has_squashed_lists)
continue;
- off = jstate->clocations[i].location;
+ off = locs[i].location;
/* Adjust recorded location if we're dealing with partial string */
off -= query_loc;
- tok_len = jstate->clocations[i].length;
+ tok_len = locs[i].length;
if (tok_len < 0)
continue; /* ignore any duplicates */
@@ -2915,7 +2918,7 @@ generate_normalized_query(JumbleState *jstate, const char *query,
*/
n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
num_constants_replaced + 1 + jstate->highest_extern_param_id,
- jstate->clocations[i].squashed ? " /*, ... */" : "");
+ locs[i].squashed ? " /*, ... */" : "");
num_constants_replaced++;
/* move forward */
@@ -2924,6 +2927,10 @@ generate_normalized_query(JumbleState *jstate, const char *query,
last_tok_len = tok_len;
}
+ /* Clean up, if needed */
+ if (locs)
+ pfree(locs);
+
/*
* We've copied up until the last ignorable constant. Copy over the
* remaining bytes of the original query string.
@@ -2942,8 +2949,9 @@ generate_normalized_query(JumbleState *jstate, const char *query,
}
/*
- * Given a valid SQL string and an array of constant-location records,
- * fill in the textual lengths of those constants.
+ * Given a valid SQL string and an array of constant-location records, return
+ * the textual lengths of those constants in a newly allocated LocationLen
+ * array, or NULL if there are no constants.
*
* The constants may use any allowed constant syntax, such as float literals,
* bit-strings, single-quoted strings and dollar-quoted strings. This is
@@ -2959,16 +2967,19 @@ generate_normalized_query(JumbleState *jstate, const char *query,
* past the first to -1 so that they can later be ignored.
*
* If query_loc > 0, then "query" has been advanced by that much compared to
- * the original string start, so we need to translate the provided locations
- * to compensate. (This lets us avoid re-scanning statements before the one
- * of interest, so it's worth doing.)
+ * the original string start, as is the case with multi-statement strings, so
+ * we need to translate the provided locations to compensate. (This lets us
+ * avoid re-scanning statements before the one of interest, so it's worth
+ * doing.)
*
* N.B. There is an assumption that a '-' character at a Const location begins
* a negative numeric constant. This precludes there ever being another
* reason for a constant to start with a '-'.
+ *
+ * It is the caller's responsibility to free the result, if necessary.
*/
-static void
-fill_in_constant_lengths(JumbleState *jstate, const char *query,
+static LocationLen *
+compute_constant_lengths(const JumbleState *jstate, const char *query,
int query_loc)
{
LocationLen *locs;
@@ -2977,14 +2988,20 @@ fill_in_constant_lengths(JumbleState *jstate, const char *query,
core_YYSTYPE yylval;
YYLTYPE yylloc;
+ if (jstate->clocations_count == 0)
+ return NULL;
+
+ /* Copy constant locations to avoid modifying jstate */
+ locs = palloc_array(LocationLen, jstate->clocations_count);
+ memcpy(locs, jstate->clocations, jstate->clocations_count * sizeof(LocationLen));
+
/*
* Sort the records by location so that we can process them in order while
* scanning the query text.
*/
if (jstate->clocations_count > 1)
- qsort(jstate->clocations, jstate->clocations_count,
+ qsort(locs, jstate->clocations_count,
sizeof(LocationLen), comp_location);
- locs = jstate->clocations;
/* initialize the flex scanner --- should match raw_parser() */
yyscanner = scanner_init(query,
@@ -3008,7 +3025,11 @@ fill_in_constant_lengths(JumbleState *jstate, const char *query,
if (locs[i].squashed)
continue; /* squashable list, ignore */
- /* Adjust recorded location if we're dealing with partial string */
+ /*
+ * Adjust the constant's location using the provided starting location
+ * of the current statement. This allows us to avoid scanning a
+ * multi-statement string from the beginning.
+ */
loc = locs[i].location - query_loc;
Assert(loc >= 0);
@@ -3064,6 +3085,8 @@ fill_in_constant_lengths(JumbleState *jstate, const char *query,
}
scanner_finish(yyscanner);
+
+ return locs;
}
/*
--
2.53.0
Attachments:
[text/plain] v8-0001-Make-JumbleState-const-in-post_parse_analyze-hook.patch (9.6K, 2-v8-0001-Make-JumbleState-const-in-post_parse_analyze-hook.patch)
download | inline diff:
From 336e6aa38d6dc3901c3fa1f746796a0b8fc87cd2 Mon Sep 17 00:00:00 2001
From: Michael Paquier <[email protected]>
Date: Fri, 27 Mar 2026 14:13:00 +0900
Subject: [PATCH v8] Make JumbleState const in post_parse_analyze hook
Change the post_parse_analyze_hook_type signature to take a const
JumbleState parameter, preventing hooks from modifying the jumble
state during query analysis. This improves API safety by making it
clear that hooks should only read from the jumble state, not modify
it.
Update pg_stat_statements and related functions to match the new const
signature. Refactor fill_in_constant_lengths() to return a newly
allocated LocationLen array instead of modifying the JumbleState, so as
PGSS does not touch the now-const JumbleState. The routine is renamed
to compute_constant_lengths().
Discussion: https://postgr.es/m/CAA5RZ0tZp5qU0ikZEEqJnxvdSNGh1DWv80sb-k4QAUmiMoOp_Q@mail.gmail.com
---
src/include/parser/analyze.h | 2 +-
.../pg_stat_statements/pg_stat_statements.c | 75 ++++++++++++-------
2 files changed, 50 insertions(+), 27 deletions(-)
diff --git a/src/include/parser/analyze.h b/src/include/parser/analyze.h
index e10270ff0ffd..b2db6fa4e8c4 100644
--- a/src/include/parser/analyze.h
+++ b/src/include/parser/analyze.h
@@ -21,7 +21,7 @@
/* Hook for plugins to get control at end of parse analysis */
typedef void (*post_parse_analyze_hook_type) (ParseState *pstate,
Query *query,
- JumbleState *jstate);
+ const JumbleState *jstate);
extern PGDLLIMPORT post_parse_analyze_hook_type post_parse_analyze_hook;
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 6cb14824ec3b..0b6be5e255a4 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -335,7 +335,7 @@ static void pgss_shmem_request(void);
static void pgss_shmem_startup(void);
static void pgss_shmem_shutdown(int code, Datum arg);
static void pgss_post_parse_analyze(ParseState *pstate, Query *query,
- JumbleState *jstate);
+ const JumbleState *jstate);
static PlannedStmt *pgss_planner(Query *parse,
const char *query_string,
int cursorOptions,
@@ -359,7 +359,7 @@ static void pgss_store(const char *query, int64 queryId,
const BufferUsage *bufusage,
const WalUsage *walusage,
const struct JitInstrumentation *jitusage,
- JumbleState *jstate,
+ const JumbleState *jstate,
int parallel_workers_to_launch,
int parallel_workers_launched,
PlannedStmtOrigin planOrigin);
@@ -378,13 +378,14 @@ static char *qtext_fetch(Size query_offset, int query_len,
static bool need_gc_qtexts(void);
static void gc_qtexts(void);
static TimestampTz entry_reset(Oid userid, Oid dbid, int64 queryid, bool minmax_only);
-static char *generate_normalized_query(JumbleState *jstate, const char *query,
+static char *generate_normalized_query(const JumbleState *jstate,
+ const char *query,
int query_loc, int *query_len_p);
-static void fill_in_constant_lengths(JumbleState *jstate, const char *query,
- int query_loc);
+static LocationLen *compute_constant_lengths(const JumbleState *jstate,
+ const char *query,
+ int query_loc);
static int comp_location(const void *a, const void *b);
-
/*
* Module load callback
*/
@@ -840,7 +841,7 @@ error:
* Post-parse-analysis hook: mark query with a queryId
*/
static void
-pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
+pgss_post_parse_analyze(ParseState *pstate, Query *query, const JumbleState *jstate)
{
if (prev_post_parse_analyze_hook)
prev_post_parse_analyze_hook(pstate, query, jstate);
@@ -1297,7 +1298,7 @@ pgss_store(const char *query, int64 queryId,
const BufferUsage *bufusage,
const WalUsage *walusage,
const struct JitInstrumentation *jitusage,
- JumbleState *jstate,
+ const JumbleState *jstate,
int parallel_workers_to_launch,
int parallel_workers_launched,
PlannedStmtOrigin planOrigin)
@@ -2845,7 +2846,7 @@ release_lock:
* Returns a palloc'd string.
*/
static char *
-generate_normalized_query(JumbleState *jstate, const char *query,
+generate_normalized_query(const JumbleState *jstate, const char *query,
int query_loc, int *query_len_p)
{
char *norm_query;
@@ -2857,12 +2858,14 @@ generate_normalized_query(JumbleState *jstate, const char *query,
last_off = 0, /* Offset from start for previous tok */
last_tok_len = 0; /* Length (in bytes) of that tok */
int num_constants_replaced = 0;
+ LocationLen *locs = NULL;
/*
- * Get constants' lengths (core system only gives us locations). Note
- * this also ensures the items are sorted by location.
+ * Determine constants' lengths (core system only gives us locations),
+ * and return a sorted copy of jstate's LocationLen data with lengths
+ * filled in.
*/
- fill_in_constant_lengths(jstate, query, query_loc);
+ locs = compute_constant_lengths(jstate, query, query_loc);
/*
* Allow for $n symbols to be longer than the constants they replace.
@@ -2888,15 +2891,15 @@ generate_normalized_query(JumbleState *jstate, const char *query,
* the parameter in the next iteration (or after the loop is done),
* which is a bit odd but seems to work okay in most cases.
*/
- if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
+ if (locs[i].extern_param && !jstate->has_squashed_lists)
continue;
- off = jstate->clocations[i].location;
+ off = locs[i].location;
/* Adjust recorded location if we're dealing with partial string */
off -= query_loc;
- tok_len = jstate->clocations[i].length;
+ tok_len = locs[i].length;
if (tok_len < 0)
continue; /* ignore any duplicates */
@@ -2915,7 +2918,7 @@ generate_normalized_query(JumbleState *jstate, const char *query,
*/
n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
num_constants_replaced + 1 + jstate->highest_extern_param_id,
- jstate->clocations[i].squashed ? " /*, ... */" : "");
+ locs[i].squashed ? " /*, ... */" : "");
num_constants_replaced++;
/* move forward */
@@ -2924,6 +2927,10 @@ generate_normalized_query(JumbleState *jstate, const char *query,
last_tok_len = tok_len;
}
+ /* Clean up, if needed */
+ if (locs)
+ pfree(locs);
+
/*
* We've copied up until the last ignorable constant. Copy over the
* remaining bytes of the original query string.
@@ -2942,8 +2949,9 @@ generate_normalized_query(JumbleState *jstate, const char *query,
}
/*
- * Given a valid SQL string and an array of constant-location records,
- * fill in the textual lengths of those constants.
+ * Given a valid SQL string and an array of constant-location records, return
+ * the textual lengths of those constants in a newly allocated LocationLen
+ * array, or NULL if there are no constants.
*
* The constants may use any allowed constant syntax, such as float literals,
* bit-strings, single-quoted strings and dollar-quoted strings. This is
@@ -2959,16 +2967,19 @@ generate_normalized_query(JumbleState *jstate, const char *query,
* past the first to -1 so that they can later be ignored.
*
* If query_loc > 0, then "query" has been advanced by that much compared to
- * the original string start, so we need to translate the provided locations
- * to compensate. (This lets us avoid re-scanning statements before the one
- * of interest, so it's worth doing.)
+ * the original string start, as is the case with multi-statement strings, so
+ * we need to translate the provided locations to compensate. (This lets us
+ * avoid re-scanning statements before the one of interest, so it's worth
+ * doing.)
*
* N.B. There is an assumption that a '-' character at a Const location begins
* a negative numeric constant. This precludes there ever being another
* reason for a constant to start with a '-'.
+ *
+ * It is the caller's responsibility to free the result, if necessary.
*/
-static void
-fill_in_constant_lengths(JumbleState *jstate, const char *query,
+static LocationLen *
+compute_constant_lengths(const JumbleState *jstate, const char *query,
int query_loc)
{
LocationLen *locs;
@@ -2977,14 +2988,20 @@ fill_in_constant_lengths(JumbleState *jstate, const char *query,
core_YYSTYPE yylval;
YYLTYPE yylloc;
+ if (jstate->clocations_count == 0)
+ return NULL;
+
+ /* Copy constant locations to avoid modifying jstate */
+ locs = palloc_array(LocationLen, jstate->clocations_count);
+ memcpy(locs, jstate->clocations, jstate->clocations_count * sizeof(LocationLen));
+
/*
* Sort the records by location so that we can process them in order while
* scanning the query text.
*/
if (jstate->clocations_count > 1)
- qsort(jstate->clocations, jstate->clocations_count,
+ qsort(locs, jstate->clocations_count,
sizeof(LocationLen), comp_location);
- locs = jstate->clocations;
/* initialize the flex scanner --- should match raw_parser() */
yyscanner = scanner_init(query,
@@ -3008,7 +3025,11 @@ fill_in_constant_lengths(JumbleState *jstate, const char *query,
if (locs[i].squashed)
continue; /* squashable list, ignore */
- /* Adjust recorded location if we're dealing with partial string */
+ /*
+ * Adjust the constant's location using the provided starting location
+ * of the current statement. This allows us to avoid scanning a
+ * multi-statement string from the beginning.
+ */
loc = locs[i].location - query_loc;
Assert(loc >= 0);
@@ -3064,6 +3085,8 @@ fill_in_constant_lengths(JumbleState *jstate, const char *query,
}
scanner_finish(yyscanner);
+
+ return locs;
}
/*
--
2.53.0
[application/pgp-signature] signature.asc (833B, 3-signature.asc)
download
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2026-03-27 07:19 Lukas Fittl <[email protected]>
parent: Michael Paquier <[email protected]>
0 siblings, 1 reply; 35+ messages in thread
From: Lukas Fittl @ 2026-03-27 07:19 UTC (permalink / raw)
To: Michael Paquier <[email protected]>; +Cc: Sami Imseih <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>
On Thu, Mar 26, 2026 at 10:18 PM Michael Paquier <[email protected]> wrote:
> This line of arguments is stronger for the normalization of the query
> string. Why should the core code decide what a normalized string
> should look like when it comes to the detection of the constants, if
> any? Instead of a dollar-quoted number, we could enforce a bunch of
> things, like a '?' or a '$woozah$' at these locations.
Fair enough, though I haven't seen any extensions that do that in
practice - its reasonable to have normalization result in a query
string that's parsable again and can be passed to EXPLAIN
(GENERIC_PLAN).
> Saying that, the key point of the patch is to take a copy of the
> JumbleState locations, and recompute it for the needs of PGSS for the
> sake of the normalized query representation. Hence, why don't we just
> do that at the end? That should be enough to enforce the const marker
> for the JumbleState across all the loaded extensions that want to look
> at it. This leads me to the simpler patch attached.
>
> Comments or tomatoes? That's simpler, and I'd be OK with just that
> for v19. That would be much better than the current state of affairs,
> where PGSS decides to enforce its own ideas in the JumbleState, ideas
> fed to anything looping into the post-parse-analyze hook after PGSS.
I'm not convinced that making the const change alone is a good idea,
without also providing some helpers to reduce the repeated code in
extensions.
What if we only put the ComputeConstantLengths (as Sami had it in v7)
in core, together with making JumbleState const?
Thanks,
Lukas
--
Lukas Fittl
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2026-03-27 17:09 Sami Imseih <[email protected]>
parent: Lukas Fittl <[email protected]>
0 siblings, 1 reply; 35+ messages in thread
From: Sami Imseih @ 2026-03-27 17:09 UTC (permalink / raw)
To: Lukas Fittl <[email protected]>; +Cc: Michael Paquier <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>
> On Thu, Mar 26, 2026 at 10:18 PM Michael Paquier <[email protected]> wrote:
> > This line of arguments is stronger for the normalization of the query
> > string. Why should the core code decide what a normalized string
> > should look like when it comes to the detection of the constants, if
> > any? Instead of a dollar-quoted number, we could enforce a bunch of
> > things, like a '?' or a '$woozah$' at these locations.
>
> Fair enough, though I haven't seen any extensions that do that in
> practice - its reasonable to have normalization result in a query
> string that's parsable again and can be passed to EXPLAIN
> (GENERIC_PLAN).
with regards to generate_normalized_query, AFAICT, the most common
case is extensions are using it for is dollar quoted number, but I agree
this one is a gray area.
> What if we only put the ComputeConstantLengths (as Sami had it in v7)
> in core, together with making JumbleState const?
I agree that ComputeConstantLengths should be in core. This one is
not a gray area IMO. The query jumble already records constant locations,
but leaves the lengths unset. ComputeConstantLengths is just the
completion of that work. There could be no other interpretation,
unlike generate_normalized_query, of what the lengths should be.
--
Sami Imseih
Amazon Web Services (AWS)
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2026-03-28 01:42 Michael Paquier <[email protected]>
parent: Sami Imseih <[email protected]>
0 siblings, 1 reply; 35+ messages in thread
From: Michael Paquier @ 2026-03-28 01:42 UTC (permalink / raw)
To: Sami Imseih <[email protected]>; +Cc: Lukas Fittl <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>
> On Mar 28, 2026, at 2:09, Sami Imseih <[email protected]> wrote:
> I agree that ComputeConstantLengths should be in core. This one is
> not a gray area IMO. The query jumble already records constant locations,
> but leaves the lengths unset. ComputeConstantLengths is just the
> completion of that work. There could be no other interpretation,
> unlike generate_normalized_query, of what the lengths should be.
This is an interesting remark. One problem with any patches presented yet is that we don’t attach the lengths as part of the in-core query jumbling procedure: we plug the lengths later using the same structure as the query jumbling. It seems to me that this is half-baked overall. I think that we don’t want to pay the extra length computation in the core query jumbling at the end, then why does it make sense to include the lengths in the JumbleState structure at all? Shouldn’t we use a different structure filled inside PGSS for this purpose rather than reuse the same thing for PGSS and the in-core part (don’t have the code in front of me now).
--
Michael
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2026-03-30 05:21 Lukas Fittl <[email protected]>
parent: Michael Paquier <[email protected]>
0 siblings, 1 reply; 35+ messages in thread
From: Lukas Fittl @ 2026-03-30 05:21 UTC (permalink / raw)
To: Michael Paquier <[email protected]>; +Cc: Sami Imseih <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>
On Fri, Mar 27, 2026 at 6:42 PM Michael Paquier <[email protected]> wrote:
>
> > On Mar 28, 2026, at 2:09, Sami Imseih <[email protected]> wrote:
> > I agree that ComputeConstantLengths should be in core. This one is
> > not a gray area IMO. The query jumble already records constant locations,
> > but leaves the lengths unset. ComputeConstantLengths is just the
> > completion of that work. There could be no other interpretation,
> > unlike generate_normalized_query, of what the lengths should be.
>
> This is an interesting remark. One problem with any patches presented yet is that we don’t attach the lengths as part of the in-core query jumbling procedure: we plug the lengths later using the same structure as the query jumbling. It seems to me that this is half-baked overall. I think that we don’t want to pay the extra length computation in the core query jumbling at the end, then why does it make sense to include the lengths in the JumbleState structure at all? Shouldn’t we use a different structure filled inside PGSS for this purpose rather than reuse the same thing for PGSS and the in-core part (don’t have the code in front of me now).
I see where you're coming from on that, but I don't think we can
remove anything here in practice:
typedef struct LocationLen
{
int location; /* Required */
int length; /* Set by _jumbleElements */
bool squashed; /* Set by RecordConstLocation called from
_jumbleElements */
bool extern_param; /* Set by RecordConstLocation called
from _jumbleParam */
} LocationLen;
typedef struct JumbleState
{
unsigned char *jumble; /* Required */
Size jumble_len; /* Required */
LocationLen *clocations; /* Required to track constant locations */
int clocations_buf_size; /* Required to track constant
locations */
int clocations_count; /* Required to track constant locations */
int highest_extern_param_id; /* Set by _jumbleParam */
bool has_squashed_lists; /* Set by _jumbleElements */
unsigned int pending_nulls; /* Required */
Size total_jumble_len; /* Required */
} JumbleState;
The only refactoring I could think of is to write out the
_jumbleElements information separately, then we could actually drop
length, and maybe squashed, from LocationLen. But I'm not sure that
really buys us much? It would be more clear I guess, because the mixed
locations of where length gets set is indeed a bit odd.
I still think it'd be reasonable for us to include
ComputeConstantLengths in core to complete the picture of what we're
doing with _jumbleElements and the length field already anyway. Its
basically a way to fully hydrate the partially filled out JumbleState
from the initial jumble.
Thanks,
Lukas
--
Lukas Fittl
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2026-03-30 19:20 Sami Imseih <[email protected]>
parent: Lukas Fittl <[email protected]>
0 siblings, 1 reply; 35+ messages in thread
From: Sami Imseih @ 2026-03-30 19:20 UTC (permalink / raw)
To: Lukas Fittl <[email protected]>; +Cc: Michael Paquier <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>
> > > I agree that ComputeConstantLengths should be in core. This one is
> > > not a gray area IMO. The query jumble already records constant locations,
> > > but leaves the lengths unset. ComputeConstantLengths is just the
> > > completion of that work. There could be no other interpretation,
> > > unlike generate_normalized_query, of what the lengths should be.
> >
> > This is an interesting remark. One problem with any patches presented yet is that we don’t attach the lengths as part of the in-core query jumbling procedure: we plug the lengths later using the same structure as the query jumbling. It seems to me that this is half-baked overall. I think that we don’t want to pay the extra length computation in the core query jumbling at the end, then why does it make sense to include the lengths in the JumbleState structure at all? Shouldn’t we use a different structure filled inside PGSS for this purpose rather than reuse the same thing for PGSS and the in-core part (don’t have the code in front of me now).
>
> I see where you're coming from on that, but I don't think we can
> remove anything here in practice:
Yes. Not unless we want to rely on the parser to track the lengths in
Const, which could be invasive, but I have not looked into it.
Paying the length computation at the end of every jumble is also
going to impact performance, so that will not be a good idea.
> I still think it'd be reasonable for us to include
> ComputeConstantLengths in core to complete the picture of what we're
> doing with _jumbleElements and the length field already anyway. Its
> basically a way to fully hydrate the partially filled out JumbleState
> from the initial jumble.
I fully agree, ComputeConstantLengths is an optional post-jumble-query step
for a consumer that wishes to calculate the lengths. The length calculation
is not unique to a plug-in, so in my mind the work it's doing is core
jumbling functionality.
--
Sami Imseih
Amazon Web Services (AWS)
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2026-03-30 23:52 Michael Paquier <[email protected]>
parent: Sami Imseih <[email protected]>
0 siblings, 1 reply; 35+ messages in thread
From: Michael Paquier @ 2026-03-30 23:52 UTC (permalink / raw)
To: Sami Imseih <[email protected]>; +Cc: Lukas Fittl <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>
On Mon, Mar 30, 2026 at 02:20:47PM -0500, Sami Imseih wrote:
>> I see where you're coming from on that, but I don't think we can
>> remove anything here in practice:
>
> Yes. Not unless we want to rely on the parser to track the lengths in
> Const, which could be invasive, but I have not looked into it.
Hmm. We may not want to get down to that, still that would be
cheaper than reprocessing the parsing of the query string twice.
>> I still think it'd be reasonable for us to include
>> ComputeConstantLengths in core to complete the picture of what we're
>> doing with _jumbleElements and the length field already anyway. Its
>> basically a way to fully hydrate the partially filled out JumbleState
>> from the initial jumble.
>
> I fully agree, ComputeConstantLengths is an optional post-jumble-query step
> for a consumer that wishes to calculate the lengths. The length calculation
> is not unique to a plug-in, so in my mind the work it's doing is core
> jumbling functionality.
Okay. I could fall into that for this release. Marking the
JumbleState as a const is the most important piece here. I'm +-0
regarding this routine, but I can also see your point about how it's
useful to give at least the option to extensions to have a
recomputation of the Const lengths, the same way as PGSS. What are
the extensions that would use that?
--
Michael
Attachments:
[application/pgp-signature] signature.asc (833B, 2-signature.asc)
download
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2026-04-01 02:53 Sami Imseih <[email protected]>
parent: Michael Paquier <[email protected]>
0 siblings, 1 reply; 35+ messages in thread
From: Sami Imseih @ 2026-04-01 02:53 UTC (permalink / raw)
To: Michael Paquier <[email protected]>; +Cc: Lukas Fittl <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>
> >> I see where you're coming from on that, but I don't think we can
> >> remove anything here in practice:
> >
> > Yes. Not unless we want to rely on the parser to track the lengths in
> > Const, which could be invasive, but I have not looked into it.
>
> Hmm. We may not want to get down to that, still that would be
> cheaper than reprocessing the parsing of the query string twice.
Although unless an extension, at least pg_stat_statements, should
not be doing this double work often, and if it is it is due to heavy
churn on entries.
> >> I still think it'd be reasonable for us to include
> >> ComputeConstantLengths in core to complete the picture of what we're
> >> doing with _jumbleElements and the length field already anyway. Its
> >> basically a way to fully hydrate the partially filled out JumbleState
> >> from the initial jumble.
> >
> > I fully agree, ComputeConstantLengths is an optional post-jumble-query step
> > for a consumer that wishes to calculate the lengths. The length calculation
> > is not unique to a plug-in, so in my mind the work it's doing is core
> > jumbling functionality.
>
> Okay. I could fall into that for this release. Marking the
> JumbleState as a const is the most important piece here.
I do agree.
> I'm +-0 regarding this routine, but I can also see your point about how it's
> useful to give at least the option to extensions to have a
> recomputation of the Const lengths, the same way as PGSS. What are
> the extensions that would use that?
https://github.com/search?q=fill_in_constant_lengths&type=code
A few well-known extensions/tools out there based on a Github search.
--
Sami
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2026-04-01 07:55 Lukas Fittl <[email protected]>
parent: Sami Imseih <[email protected]>
0 siblings, 1 reply; 35+ messages in thread
From: Lukas Fittl @ 2026-04-01 07:55 UTC (permalink / raw)
To: Sami Imseih <[email protected]>; +Cc: Michael Paquier <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>
On Tue, Mar 31, 2026 at 7:54 PM Sami Imseih <[email protected]> wrote:
>
> > I'm +-0 regarding this routine, but I can also see your point about how it's
> > useful to give at least the option to extensions to have a
> > recomputation of the Const lengths, the same way as PGSS. What are
> > the extensions that would use that?
>
> https://github.com/search?q=fill_in_constant_lengths&type=code
>
> A few well-known extensions/tools out there based on a Github search.
See attached a v9 that extracts ComputeConstantLengths from Sami's v7
for easier discussion.
I've done an updated test with pg_stat_monitor [0] and pg_tracing [1]
with that, and that has both extensions passing regressions tests.
I've also looked but not tested a third extension (sql_firewall) and
that looks like it should just work as well.
Its worth noting that pg_tracing had a secondary use case for its
fill_in_constant_lengths function, which is to find the first
non-comment token in a query (to skip over leading comments). Not
having that causes a small regression test difference. The maintainers
will have to decide whether its worth keeping their own function, but
I think that doesn't invalidate the utility of having this in core to
me.
Thanks,
Lukas
[0]: https://github.com/lfittl/pg_stat_monitor/commit/054f347344a380f7a59cb444685db71301669c61
[1]: https://github.com/lfittl/pg_tracing/commit/d48a36c9a2c8fc284cf63ec1161558b90d8b44ef
--
Lukas Fittl
Attachments:
[application/octet-stream] v9-0001-Make-JumbleState-const-in-post_parse_analyze-hook.patch (17.0K, 2-v9-0001-Make-JumbleState-const-in-post_parse_analyze-hook.patch)
download | inline diff:
From 95f75de814621ecf5ad5b5c7d1a61440dc27257e Mon Sep 17 00:00:00 2001
From: Michael Paquier <[email protected]>
Date: Fri, 27 Mar 2026 14:13:00 +0900
Subject: [PATCH v9] Make JumbleState const in post_parse_analyze hook
Change the post_parse_analyze_hook_type signature to take a const
JumbleState parameter, preventing hooks from modifying the jumble
state during query analysis. This improves API safety by making it
clear that hooks should only read from the jumble state, not modify
it.
To aid extensions to use the API correctly, introduce a new helper
function ComputeConstantLengths() that takes the const JumbleState as
input, and returns a newly allocated LocationLen array with constant
lengths filled in. Replace the fill_in_constant_lengths function in
pg_stat_statements with it.
Discussion: https://postgr.es/m/CAA5RZ0tZp5qU0ikZEEqJnxvdSNGh1DWv80sb-k4QAUmiMoOp_Q@mail.gmail.com
---
.../pg_stat_statements/pg_stat_statements.c | 176 ++----------------
src/backend/nodes/queryjumblefuncs.c | 148 +++++++++++++++
src/include/nodes/queryjumble.h | 3 +
src/include/parser/analyze.h | 2 +-
4 files changed, 172 insertions(+), 157 deletions(-)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 7975476b890..03a801d9114 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -50,7 +50,6 @@
#include "access/htup_details.h"
#include "access/parallel.h"
#include "catalog/pg_authid.h"
-#include "common/int.h"
#include "executor/instrument.h"
#include "funcapi.h"
#include "jit/jit.h"
@@ -59,7 +58,6 @@
#include "nodes/queryjumble.h"
#include "optimizer/planner.h"
#include "parser/analyze.h"
-#include "parser/scanner.h"
#include "pgstat.h"
#include "storage/fd.h"
#include "storage/ipc.h"
@@ -335,7 +333,7 @@ static void pgss_shmem_request(void);
static void pgss_shmem_startup(void);
static void pgss_shmem_shutdown(int code, Datum arg);
static void pgss_post_parse_analyze(ParseState *pstate, Query *query,
- JumbleState *jstate);
+ const JumbleState *jstate);
static PlannedStmt *pgss_planner(Query *parse,
const char *query_string,
int cursorOptions,
@@ -359,7 +357,7 @@ static void pgss_store(const char *query, int64 queryId,
const BufferUsage *bufusage,
const WalUsage *walusage,
const struct JitInstrumentation *jitusage,
- JumbleState *jstate,
+ const JumbleState *jstate,
int parallel_workers_to_launch,
int parallel_workers_launched,
PlannedStmtOrigin planOrigin);
@@ -378,12 +376,9 @@ static char *qtext_fetch(Size query_offset, int query_len,
static bool need_gc_qtexts(void);
static void gc_qtexts(void);
static TimestampTz entry_reset(Oid userid, Oid dbid, int64 queryid, bool minmax_only);
-static char *generate_normalized_query(JumbleState *jstate, const char *query,
+static char *generate_normalized_query(const JumbleState *jstate,
+ const char *query,
int query_loc, int *query_len_p);
-static void fill_in_constant_lengths(JumbleState *jstate, const char *query,
- int query_loc);
-static int comp_location(const void *a, const void *b);
-
/*
* Module load callback
@@ -841,7 +836,7 @@ error:
* Post-parse-analysis hook: mark query with a queryId
*/
static void
-pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
+pgss_post_parse_analyze(ParseState *pstate, Query *query, const JumbleState *jstate)
{
if (prev_post_parse_analyze_hook)
prev_post_parse_analyze_hook(pstate, query, jstate);
@@ -1298,7 +1293,7 @@ pgss_store(const char *query, int64 queryId,
const BufferUsage *bufusage,
const WalUsage *walusage,
const struct JitInstrumentation *jitusage,
- JumbleState *jstate,
+ const JumbleState *jstate,
int parallel_workers_to_launch,
int parallel_workers_launched,
PlannedStmtOrigin planOrigin)
@@ -2849,7 +2844,7 @@ release_lock:
* Returns a palloc'd string.
*/
static char *
-generate_normalized_query(JumbleState *jstate, const char *query,
+generate_normalized_query(const JumbleState *jstate, const char *query,
int query_loc, int *query_len_p)
{
char *norm_query;
@@ -2861,12 +2856,14 @@ generate_normalized_query(JumbleState *jstate, const char *query,
last_off = 0, /* Offset from start for previous tok */
last_tok_len = 0; /* Length (in bytes) of that tok */
int num_constants_replaced = 0;
+ LocationLen *locs = NULL;
/*
- * Get constants' lengths (core system only gives us locations). Note
- * this also ensures the items are sorted by location.
+ * Determine constants' lengths (core system only gives us locations), and
+ * return a sorted copy of jstate's LocationLen data with lengths filled
+ * in.
*/
- fill_in_constant_lengths(jstate, query, query_loc);
+ locs = ComputeConstantLengths(jstate, query, query_loc);
/*
* Allow for $n symbols to be longer than the constants they replace.
@@ -2892,15 +2889,15 @@ generate_normalized_query(JumbleState *jstate, const char *query,
* the parameter in the next iteration (or after the loop is done),
* which is a bit odd but seems to work okay in most cases.
*/
- if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists)
+ if (locs[i].extern_param && !jstate->has_squashed_lists)
continue;
- off = jstate->clocations[i].location;
+ off = locs[i].location;
/* Adjust recorded location if we're dealing with partial string */
off -= query_loc;
- tok_len = jstate->clocations[i].length;
+ tok_len = locs[i].length;
if (tok_len < 0)
continue; /* ignore any duplicates */
@@ -2919,7 +2916,7 @@ generate_normalized_query(JumbleState *jstate, const char *query,
*/
n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
num_constants_replaced + 1 + jstate->highest_extern_param_id,
- jstate->clocations[i].squashed ? " /*, ... */" : "");
+ locs[i].squashed ? " /*, ... */" : "");
num_constants_replaced++;
/* move forward */
@@ -2928,6 +2925,10 @@ generate_normalized_query(JumbleState *jstate, const char *query,
last_tok_len = tok_len;
}
+ /* Clean up, if needed */
+ if (locs)
+ pfree(locs);
+
/*
* We've copied up until the last ignorable constant. Copy over the
* remaining bytes of the original query string.
@@ -2944,140 +2945,3 @@ generate_normalized_query(JumbleState *jstate, const char *query,
*query_len_p = n_quer_loc;
return norm_query;
}
-
-/*
- * Given a valid SQL string and an array of constant-location records,
- * fill in the textual lengths of those constants.
- *
- * The constants may use any allowed constant syntax, such as float literals,
- * bit-strings, single-quoted strings and dollar-quoted strings. This is
- * accomplished by using the public API for the core scanner.
- *
- * It is the caller's job to ensure that the string is a valid SQL statement
- * with constants at the indicated locations. Since in practice the string
- * has already been parsed, and the locations that the caller provides will
- * have originated from within the authoritative parser, this should not be
- * a problem.
- *
- * Multiple constants can have the same location. We reset lengths of those
- * past the first to -1 so that they can later be ignored.
- *
- * If query_loc > 0, then "query" has been advanced by that much compared to
- * the original string start, so we need to translate the provided locations
- * to compensate. (This lets us avoid re-scanning statements before the one
- * of interest, so it's worth doing.)
- *
- * N.B. There is an assumption that a '-' character at a Const location begins
- * a negative numeric constant. This precludes there ever being another
- * reason for a constant to start with a '-'.
- */
-static void
-fill_in_constant_lengths(JumbleState *jstate, const char *query,
- int query_loc)
-{
- LocationLen *locs;
- core_yyscan_t yyscanner;
- core_yy_extra_type yyextra;
- core_YYSTYPE yylval;
- YYLTYPE yylloc;
-
- /*
- * Sort the records by location so that we can process them in order while
- * scanning the query text.
- */
- if (jstate->clocations_count > 1)
- qsort(jstate->clocations, jstate->clocations_count,
- sizeof(LocationLen), comp_location);
- locs = jstate->clocations;
-
- /* initialize the flex scanner --- should match raw_parser() */
- yyscanner = scanner_init(query,
- &yyextra,
- &ScanKeywords,
- ScanKeywordTokens);
-
- /* Search for each constant, in sequence */
- for (int i = 0; i < jstate->clocations_count; i++)
- {
- int loc;
- int tok;
-
- /* Ignore constants after the first one in the same location */
- if (i > 0 && locs[i].location == locs[i - 1].location)
- {
- locs[i].length = -1;
- continue;
- }
-
- if (locs[i].squashed)
- continue; /* squashable list, ignore */
-
- /* Adjust recorded location if we're dealing with partial string */
- loc = locs[i].location - query_loc;
- Assert(loc >= 0);
-
- /*
- * We have a valid location for a constant that's not a dupe. Lex
- * tokens until we find the desired constant.
- */
- for (;;)
- {
- tok = core_yylex(&yylval, &yylloc, yyscanner);
-
- /* We should not hit end-of-string, but if we do, behave sanely */
- if (tok == 0)
- break; /* out of inner for-loop */
-
- /*
- * We should find the token position exactly, but if we somehow
- * run past it, work with that.
- */
- if (yylloc >= loc)
- {
- if (query[loc] == '-')
- {
- /*
- * It's a negative value - this is the one and only case
- * where we replace more than a single token.
- *
- * Do not compensate for the core system's special-case
- * adjustment of location to that of the leading '-'
- * operator in the event of a negative constant. It is
- * also useful for our purposes to start from the minus
- * symbol. In this way, queries like "select * from foo
- * where bar = 1" and "select * from foo where bar = -2"
- * will have identical normalized query strings.
- */
- tok = core_yylex(&yylval, &yylloc, yyscanner);
- if (tok == 0)
- break; /* out of inner for-loop */
- }
-
- /*
- * We now rely on the assumption that flex has placed a zero
- * byte after the text of the current token in scanbuf.
- */
- locs[i].length = strlen(yyextra.scanbuf + loc);
- break; /* out of inner for-loop */
- }
- }
-
- /* If we hit end-of-string, give up, leaving remaining lengths -1 */
- if (tok == 0)
- break;
- }
-
- scanner_finish(yyscanner);
-}
-
-/*
- * comp_location: comparator for qsorting LocationLen structs by location
- */
-static int
-comp_location(const void *a, const void *b)
-{
- int l = ((const LocationLen *) a)->location;
- int r = ((const LocationLen *) b)->location;
-
- return pg_cmp_s32(l, r);
-}
diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c
index 87db8dc1a32..ecfbd5bd035 100644
--- a/src/backend/nodes/queryjumblefuncs.c
+++ b/src/backend/nodes/queryjumblefuncs.c
@@ -40,10 +40,12 @@
#include "access/transam.h"
#include "catalog/pg_proc.h"
#include "common/hashfn.h"
+#include "common/int.h"
#include "miscadmin.h"
#include "nodes/nodeFuncs.h"
#include "nodes/queryjumble.h"
#include "utils/lsyscache.h"
+#include "parser/scanner.h"
#include "parser/scansup.h"
#define JUMBLE_SIZE 1024 /* query serialization buffer size */
@@ -773,3 +775,149 @@ _jumbleRangeTblEntry_eref(JumbleState *jstate,
*/
JUMBLE_STRING(aliasname);
}
+
+/*
+ * CompLocation: comparator for qsorting LocationLen structs by location
+ */
+static int
+CompLocation(const void *a, const void *b)
+{
+ int l = ((const LocationLen *) a)->location;
+ int r = ((const LocationLen *) b)->location;
+
+ return pg_cmp_s32(l, r);
+}
+
+/*
+ * Given a valid SQL string and an array of constant-location records, return
+ * the textual lengths of those constants in a newly allocated LocationLen
+ * array, or NULL if there are no constants. It is the caller's responsibility
+ * to pfree the result, if necessary.
+ *
+ * The constants may use any allowed constant syntax, such as float literals,
+ * bit-strings, single-quoted strings and dollar-quoted strings. This is
+ * accomplished by using the public API for the core scanner.
+ *
+ * Multiple constants can have the same location. We reset lengths of those
+ * past the first to -1 so that they can later be ignored.
+ *
+ * If query_loc > 0, then "query" has been advanced by that much compared to
+ * the original string start, as is the case with multi-statement strings, so
+ * we need to translate the provided locations to compensate. (This lets us
+ * avoid re-scanning statements before the one of interest, so it's worth doing.)
+ *
+ * N.B. There is an assumption that a '-' character at a Const location begins
+ * a negative numeric constant. This precludes there ever being another
+ * reason for a constant to start with a '-'.
+ */
+LocationLen *
+ComputeConstantLengths(const JumbleState *jstate, const char *query,
+ int query_loc)
+{
+ LocationLen *locs;
+ core_yyscan_t yyscanner;
+ core_yy_extra_type yyextra;
+ core_YYSTYPE yylval;
+ YYLTYPE yylloc;
+
+ if (jstate->clocations_count == 0)
+ return NULL;
+
+ /* Copy constant locations to avoid modifying jstate */
+ locs = palloc_array(LocationLen, jstate->clocations_count);
+ memcpy(locs, jstate->clocations, jstate->clocations_count * sizeof(LocationLen));
+
+ /*
+ * Sort the records by location so that we can process them in order while
+ * scanning the query text.
+ */
+ if (jstate->clocations_count > 1)
+ qsort(locs, jstate->clocations_count,
+ sizeof(LocationLen), CompLocation);
+
+ /* initialize the flex scanner --- should match raw_parser() */
+ yyscanner = scanner_init(query,
+ &yyextra,
+ &ScanKeywords,
+ ScanKeywordTokens);
+
+ /* Search for each constant, in sequence */
+ for (int i = 0; i < jstate->clocations_count; i++)
+ {
+ int loc;
+ int tok;
+
+ /* Ignore constants after the first one in the same location */
+ if (i > 0 && locs[i].location == locs[i - 1].location)
+ {
+ locs[i].length = -1;
+ continue;
+ }
+
+ if (locs[i].squashed)
+ continue; /* squashable list, ignore */
+
+ /*
+ * Adjust the constant's location using the provided starting location
+ * of the current statement. This allows us to avoid scanning a
+ * multi-statement string from the beginning.
+ */
+ loc = locs[i].location - query_loc;
+ Assert(loc >= 0);
+
+ /*
+ * We have a valid location for a constant that's not a dupe. Lex
+ * tokens until we find the desired constant.
+ */
+ for (;;)
+ {
+ tok = core_yylex(&yylval, &yylloc, yyscanner);
+
+ /* We should not hit end-of-string, but if we do, behave sanely */
+ if (tok == 0)
+ break; /* out of inner for-loop */
+
+ /*
+ * We should find the token position exactly, but if we somehow
+ * run past it, work with that.
+ */
+ if (yylloc >= loc)
+ {
+ if (query[loc] == '-')
+ {
+ /*
+ * It's a negative value - this is the one and only case
+ * where we replace more than a single token.
+ *
+ * Do not compensate for the special-case adjustment of
+ * location to that of the leading '-' operator in the
+ * event of a negative constant (see doNegate() in
+ * gram.y). It is also useful for our purposes to start
+ * from the minus symbol. In this way, queries like
+ * "select * from foo where bar = 1" and "select * from
+ * foo where bar = -2" will have identical normalized
+ * query strings.
+ */
+ tok = core_yylex(&yylval, &yylloc, yyscanner);
+ if (tok == 0)
+ break; /* out of inner for-loop */
+ }
+
+ /*
+ * We now rely on the assumption that flex has placed a zero
+ * byte after the text of the current token in scanbuf.
+ */
+ locs[i].length = strlen(yyextra.scanbuf + loc);
+ break; /* out of inner for-loop */
+ }
+ }
+
+ /* If we hit end-of-string, give up, leaving remaining lengths -1 */
+ if (tok == 0)
+ break;
+ }
+
+ scanner_finish(yyscanner);
+
+ return locs;
+}
diff --git a/src/include/nodes/queryjumble.h b/src/include/nodes/queryjumble.h
index 9f81893003c..f331449ba78 100644
--- a/src/include/nodes/queryjumble.h
+++ b/src/include/nodes/queryjumble.h
@@ -91,6 +91,9 @@ extern PGDLLIMPORT int compute_query_id;
extern const char *CleanQuerytext(const char *query, int *location, int *len);
+extern LocationLen *ComputeConstantLengths(const JumbleState *jstate,
+ const char *query,
+ int query_loc);
extern JumbleState *JumbleQuery(Query *query);
extern void EnableQueryId(void);
diff --git a/src/include/parser/analyze.h b/src/include/parser/analyze.h
index e10270ff0ff..b2db6fa4e8c 100644
--- a/src/include/parser/analyze.h
+++ b/src/include/parser/analyze.h
@@ -21,7 +21,7 @@
/* Hook for plugins to get control at end of parse analysis */
typedef void (*post_parse_analyze_hook_type) (ParseState *pstate,
Query *query,
- JumbleState *jstate);
+ const JumbleState *jstate);
extern PGDLLIMPORT post_parse_analyze_hook_type post_parse_analyze_hook;
--
2.47.1
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2026-04-05 12:04 Michael Paquier <[email protected]>
parent: Lukas Fittl <[email protected]>
0 siblings, 1 reply; 35+ messages in thread
From: Michael Paquier @ 2026-04-05 12:04 UTC (permalink / raw)
To: Lukas Fittl <[email protected]>; +Cc: Sami Imseih <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>
On Wed, Apr 01, 2026 at 12:55:28AM -0700, Lukas Fittl wrote:
> On Tue, Mar 31, 2026 at 7:54 PM Sami Imseih <[email protected]> wrote:
>>
>> > I'm +-0 regarding this routine, but I can also see your point about how it's
>> > useful to give at least the option to extensions to have a
>> > recomputation of the Const lengths, the same way as PGSS. What are
>> > the extensions that would use that?
>>
>> https://github.com/search?q=fill_in_constant_lengths&type=code
>>
>> A few well-known extensions/tools out there based on a Github search.
FWIW, this search points to a lot of forks
> See attached a v9 that extracts ComputeConstantLengths from Sami's v7
> for easier discussion.
>
> I've done an updated test with pg_stat_monitor [0] and pg_tracing [1]
> with that, and that has both extensions passing regressions tests.
> I've also looked but not tested a third extension (sql_firewall) and
> that looks like it should just work as well.
Still I can see the gains here, so I guess that this version works
here. I'm happy enough to see the const with JumbleState.
--
Michael
Attachments:
[application/pgp-signature] signature.asc (833B, 2-signature.asc)
download
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2026-04-05 22:13 Sami Imseih <[email protected]>
parent: Michael Paquier <[email protected]>
0 siblings, 1 reply; 35+ messages in thread
From: Sami Imseih @ 2026-04-05 22:13 UTC (permalink / raw)
To: Michael Paquier <[email protected]>; +Cc: Lukas Fittl <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>
> >> > I'm +-0 regarding this routine, but I can also see your point about how it's
> >> > useful to give at least the option to extensions to have a
> >> > recomputation of the Const lengths, the same way as PGSS. What are
> >> > the extensions that would use that?
> >>
> >> https://github.com/search?q=fill_in_constant_lengths&type=code
> >>
> >> A few well-known extensions/tools out there based on a Github search.
>
> FWIW, this search points to a lot of forks
>
> > See attached a v9 that extracts ComputeConstantLengths from Sami's v7
> > for easier discussion.
> >
> > I've done an updated test with pg_stat_monitor [0] and pg_tracing [1]
> > with that, and that has both extensions passing regressions tests.
> > I've also looked but not tested a third extension (sql_firewall) and
> > that looks like it should just work as well.
>
> Still I can see the gains here, so I guess that this version works
> here. I'm happy enough to see the const with JumbleState.
I took a look at v9 and it LGTM.
--
Sami
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2026-04-07 06:27 Michael Paquier <[email protected]>
parent: Sami Imseih <[email protected]>
0 siblings, 1 reply; 35+ messages in thread
From: Michael Paquier @ 2026-04-07 06:27 UTC (permalink / raw)
To: Sami Imseih <[email protected]>; +Cc: Lukas Fittl <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>
On Sun, Apr 05, 2026 at 05:13:40PM -0500, Sami Imseih wrote:
> I took a look at v9 and it LGTM.
I can also see that v9 had the idea to discard quite a few of the
edits I did previously. Restored that, reworded one more place that
was refering to query normalization in ComputeConstantLengths(),
applied the result. We're in time at the end.
--
Michael
Attachments:
[application/pgp-signature] signature.asc (833B, 2-signature.asc)
download
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2026-04-07 06:51 Lukas Fittl <[email protected]>
parent: Michael Paquier <[email protected]>
0 siblings, 1 reply; 35+ messages in thread
From: Lukas Fittl @ 2026-04-07 06:51 UTC (permalink / raw)
To: Michael Paquier <[email protected]>; +Cc: Sami Imseih <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>
On Mon, Apr 6, 2026 at 11:28 PM Michael Paquier <[email protected]> wrote:
>
> On Sun, Apr 05, 2026 at 05:13:40PM -0500, Sami Imseih wrote:
> > I took a look at v9 and it LGTM.
>
> I can also see that v9 had the idea to discard quite a few of the
> edits I did previously. Restored that, reworded one more place that
> was refering to query normalization in ComputeConstantLengths(),
> applied the result. We're in time at the end.
Thanks for getting this in!
(and apologies that I dropped your edits, I was looking at Sami's v7
version when I put that together)
Thanks,
Lukas
--
Lukas Fittl
^ permalink raw reply [nested|flat] 35+ messages in thread
* Re: Refactor query normalization into core query jumbling
@ 2026-04-07 16:35 Sami Imseih <[email protected]>
parent: Lukas Fittl <[email protected]>
0 siblings, 0 replies; 35+ messages in thread
From: Sami Imseih @ 2026-04-07 16:35 UTC (permalink / raw)
To: Lukas Fittl <[email protected]>; +Cc: Michael Paquier <[email protected]>; zengman <[email protected]>; pgsql-hackers; Julien Rouhaud <[email protected]>
> On Mon, Apr 6, 2026 at 11:28 PM Michael Paquier <[email protected]> wrote:
> >
> > On Sun, Apr 05, 2026 at 05:13:40PM -0500, Sami Imseih wrote:
> > > I took a look at v9 and it LGTM.
> >
> > I can also see that v9 had the idea to discard quite a few of the
> > edits I did previously. Restored that, reworded one more place that
> > was refering to query normalization in ComputeConstantLengths(),
> > applied the result. We're in time at the end.
>
> Thanks for getting this in!
Thanks Michael and Lukas!
--
Sami
^ permalink raw reply [nested|flat] 35+ messages in thread
end of thread, other threads:[~2026-04-07 16:35 UTC | newest]
Thread overview: 35+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2025-12-19 00:44 Refactor query normalization into core query jumbling Sami Imseih <[email protected]>
2025-12-19 06:49 ` Bertrand Drouvot <[email protected]>
2025-12-19 21:36 ` Sami Imseih <[email protected]>
2025-12-19 23:22 ` Michael Paquier <[email protected]>
2025-12-22 23:40 ` Sami Imseih <[email protected]>
2025-12-23 03:39 ` =?ISO-8859-1?B?emVuZ21hbg==?= <[email protected]>
2025-12-23 16:35 ` Sami Imseih <[email protected]>
2025-12-24 02:50 ` =?ISO-8859-1?B?emVuZ21hbg==?= <[email protected]>
2025-12-24 02:57 ` Michael Paquier <[email protected]>
2025-12-24 04:06 ` Michael Paquier <[email protected]>
2025-12-24 17:32 ` Sami Imseih <[email protected]>
2025-12-30 02:34 ` Lukas Fittl <[email protected]>
2025-12-30 04:31 ` Michael Paquier <[email protected]>
2026-01-06 20:20 ` Sami Imseih <[email protected]>
2026-03-17 02:12 ` Sami Imseih <[email protected]>
2026-03-26 01:34 ` Lukas Fittl <[email protected]>
2026-03-26 03:31 ` Sami Imseih <[email protected]>
2026-03-26 16:35 ` Lukas Fittl <[email protected]>
2026-03-26 17:18 ` Sami Imseih <[email protected]>
2026-03-27 01:50 ` Lukas Fittl <[email protected]>
2026-03-27 05:17 ` Michael Paquier <[email protected]>
2026-03-27 07:19 ` Lukas Fittl <[email protected]>
2026-03-27 17:09 ` Sami Imseih <[email protected]>
2026-03-28 01:42 ` Michael Paquier <[email protected]>
2026-03-30 05:21 ` Lukas Fittl <[email protected]>
2026-03-30 19:20 ` Sami Imseih <[email protected]>
2026-03-30 23:52 ` Michael Paquier <[email protected]>
2026-04-01 02:53 ` Sami Imseih <[email protected]>
2026-04-01 07:55 ` Lukas Fittl <[email protected]>
2026-04-05 12:04 ` Michael Paquier <[email protected]>
2026-04-05 22:13 ` Sami Imseih <[email protected]>
2026-04-07 06:27 ` Michael Paquier <[email protected]>
2026-04-07 06:51 ` Lukas Fittl <[email protected]>
2026-04-07 16:35 ` Sami Imseih <[email protected]>
2025-12-30 08:13 ` Bertrand Drouvot <[email protected]>
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox