public inbox for [email protected]
Subject: Proposal: Recent mutated table tracking in memory
From: Nadav Shatz @ 2026-01-06 11:25 UTC
To: [email protected]
Hello,
As initially proposed in "Proposal: recent access based routing for
primary-replica setups" and later split into separate tasks, I am
attaching a patch that tracks the most recently mutated tables and uses
the replication lag as the basis for deciding where to route queries
when load balancing and query parsing are enabled.
Details, as described in the patch:
Feature: add in-memory table tracking to prevent stale reads from replicas
Implement "memory map" feature that tracks recently-written database
tables in shared memory to prevent stale reads during replication lag.
When a write (INSERT/UPDATE/DELETE) occurs on a table, that table is
marked as "dirty" for a configurable TTL period. Any SELECT on a dirty
table within the TTL window is routed to primary instead of replica.
Key features:
- Shared memory hash table for tracking table mutations with TTL
- Query parse cache with LRU eviction for performance
- Cold start protection (routes all queries to primary initially)
- Automatic TTL calculation: replication_delay × configurable factor
- Per-table staleness tracking with microsecond precision
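The routing rule described above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not code from the patch: DirtyTable, compute_ttl_us, mark_written and route_to_primary are hypothetical names, timestamps are passed in explicitly as microseconds, and a single struct stands in for the shared-memory hash table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one tracked table; the patch keeps these
 * entries in a shared-memory hash table instead. */
typedef struct DirtyTable
{
	uint64_t	last_write_us;	/* time of the most recent write */
	bool		written;		/* has any write been recorded yet? */
} DirtyTable;

/* TTL = replication delay x configurable factor. The assert mirrors
 * the documented range of memory_map_ttl_factor (1.0-100.0). */
uint64_t
compute_ttl_us(uint64_t replication_delay_us, double ttl_factor)
{
	assert(ttl_factor >= 1.0 && ttl_factor <= 100.0);
	return (uint64_t) ((double) replication_delay_us * ttl_factor);
}

/* Record a completed INSERT/UPDATE/DELETE on the table. */
void
mark_written(DirtyTable *t, uint64_t now_us)
{
	t->last_write_us = now_us;
	t->written = true;
}

/* Returns true when a SELECT on this table should go to the primary:
 * either the child is still in its cold start window, or the table
 * was written less than one TTL ago. */
bool
route_to_primary(const DirtyTable *t, uint64_t now_us,
				 uint64_t child_start_us, uint64_t cold_start_us,
				 uint64_t ttl_us)
{
	if (now_us - child_start_us < cold_start_us)
		return true;			/* cold start: no write history yet */
	if (t->written && now_us - t->last_write_us < ttl_us)
		return true;			/* table is still "dirty" */
	return false;				/* safe to load balance */
}
```

With a measured delay of 20 ms and the default factor of 5.0, the TTL in this sketch is 100 ms, so a SELECT arriving 50 ms after a write goes to the primary while one arriving 100 ms or later can be load balanced.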
New configuration parameters:
- memory_map_enabled: Enable/disable the feature (default: off)
- memory_map_ttl_factor: TTL multiplier for replication delay (default: 5.0)
- memory_map_cold_start_duration: Cold start period in ms (default: 2000)
- memory_map_table_buckets: Hash buckets for table map (default: 1024)
- memory_map_table_size: Max tracked tables (default: 2048)
- memory_map_query_buckets: Hash buckets for query cache (default: 2048)
- memory_map_query_cache_size: Max cached queries (default: 10000)
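For sizing, the two *_size parameters dominate shared memory use. The sketch below is a rough estimate only: it assumes the approximate per-entry figures quoted in the patch's documentation (160 bytes per tracked table, 640 bytes per cached query) and ignores bucket arrays and headers. The function name estimate_memory_map_bytes is hypothetical and does not appear in the patch.

```c
#include <stddef.h>

/* Approximate per-entry sizes as quoted in the patch documentation;
 * the real struct sizes depend on the build. */
#define APPROX_TABLE_ENTRY_BYTES	160
#define APPROX_QUERY_ENTRY_BYTES	640

/* Rough shared-memory estimate for the table map plus the query parse
 * cache. Bucket arrays and headers are deliberately left out. */
size_t
estimate_memory_map_bytes(size_t table_size, size_t query_cache_size)
{
	return table_size * APPROX_TABLE_ENTRY_BYTES
		+ query_cache_size * APPROX_QUERY_ENTRY_BYTES;
}
```

Calling estimate_memory_map_bytes(2048, 10000) gives the estimate for the default settings; the query cache term is by far the larger of the two, so memory_map_query_cache_size is the parameter to watch when shared memory is tight.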
Patch applies properly and tests pass.
Open to all feedback - thank you!
--
Nadav Shatz
Tailor Brands | CTO
Attachments:
[application/octet-stream] mutated_table.patch (67.1K, 3-mutated_table.patch)
From 47551f4c1eae9b6275904d4ead9b24d9a83fda4b Mon Sep 17 00:00:00 2001
From: Nadav Shatz <[email protected]>
Date: Tue, 6 Jan 2026 12:41:50 +0200
Subject: [PATCH] Feature: add in-memory table tracking to prevent stale reads
from replicas
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Implement "memory map" feature that tracks recently-written database
tables in shared memory to prevent stale reads during replication lag.
When a write (INSERT/UPDATE/DELETE) occurs on a table, that table is
marked as "dirty" for a configurable TTL period. Any SELECT on a dirty
table within the TTL window is routed to primary instead of replica.
Key features:
- Shared memory hash table for tracking table mutations with TTL
- Query parse cache with LRU eviction for performance
- Cold start protection (routes all queries to primary initially)
- Automatic TTL calculation: replication_delay × configurable factor
- Per-table staleness tracking with microsecond precision
New configuration parameters:
- memory_map_enabled: Enable/disable the feature (default: off)
- memory_map_ttl_factor: TTL multiplier for replication delay (default: 5.0)
- memory_map_cold_start_duration: Cold start period in ms (default: 2000)
- memory_map_table_buckets: Hash buckets for table map (default: 1024)
- memory_map_table_size: Max tracked tables (default: 2048)
- memory_map_query_buckets: Hash buckets for query cache (default: 2048)
- memory_map_query_cache_size: Max cached queries (default: 10000)
diff --git a/doc/src/sgml/loadbalance.sgml b/doc/src/sgml/loadbalance.sgml
index ee19fabebab2210cd4abe59a711a036ac0ac8943..bdc929ee55b94899ffdd90880a741cfbac051aa4 100644
--- a/doc/src/sgml/loadbalance.sgml
+++ b/doc/src/sgml/loadbalance.sgml
@@ -1193,4 +1193,210 @@ dml_adaptive_object_relationship_list = 'table_1:table_2'
</variablelist>
</sect2>
+
+ <sect2 id="runtime-config-memory-map">
+ <title>Memory Map Configuration (Lagless Replica Reads)</title>
+
+ <para>
+ These parameters configure the memory map feature, which tracks recently written tables
+ to prevent stale reads from replica nodes during replication lag. This implements the
+ "lagless" architecture pattern for distributed systems with read replicas.
+ </para>
+
+ <para>
+ When a table is modified (INSERT/UPDATE/DELETE), it is marked as "stale" for a TTL period
+ (<literal>replication_delay * memory_map_ttl_factor</literal>). Any SELECT queries on stale tables are routed
+ to the primary node instead of replicas, ensuring read-after-write consistency.
+ </para>
+
+ <para>
+ This feature requires <xref linkend="guc-replication-delay-source-cmd"> to be configured
+ for monitoring replication delay from replicas.
+ </para>
+
+ <warning>
+ <para>
+ Enabling the memory map feature increases shared memory consumption. With default settings,
+ the feature requires approximately 6.6 MB of shared memory (0.3 MB for table tracking + 6.3 MB for query cache).
+ Memory usage scales with configuration parameters:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ Table tracking: <literal>memory_map_table_size * 160 bytes</literal> (default: 2048 * 160 = ~320 KB)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Query cache: <literal>memory_map_query_cache_size * 640 bytes</literal> (default: 10000 * 640 = ~6.3 MB)
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ For high-traffic systems with large cache sizes (e.g., <literal>memory_map_query_cache_size = 100000</literal>),
+ memory usage can reach 64 MB or more. Consider your system's available shared memory when enabling this feature.
+ </para>
+ </warning>
+
+ <variablelist>
+
+ <varlistentry id="guc-memory-map-enabled" xreflabel="memory_map_enabled">
+ <term><varname>memory_map_enabled</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>memory_map_enabled</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables in-memory tracking of recently written tables. When enabled, tables are marked
+ as stale after write operations, and reads are routed to primary until the TTL expires.
+ </para>
+ <para>
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ Default is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-memory-map-ttl-factor" xreflabel="memory_map_ttl_factor">
+ <term><varname>memory_map_ttl_factor</varname> (<type>floating point</type>)
+ <indexterm>
+ <primary><varname>memory_map_ttl_factor</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Multiplier for calculating the TTL: <literal>TTL = replication_delay * memory_map_ttl_factor</literal>.
+ Higher values provide more safety margin but may reduce read replica utilization.
+ </para>
+ <para>
+ Valid range: 1.0-100.0. Default is <literal>5.0</literal>.
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-memory-map-cold-start-duration" xreflabel="memory_map_cold_start_duration">
+ <term><varname>memory_map_cold_start_duration</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>memory_map_cold_start_duration</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Duration in milliseconds to route all queries to primary after a child process starts.
+ This prevents stale reads when a new connection is established before the memory map
+ is populated with recent write history.
+ </para>
+ <para>
+ Valid range: 0-60000 ms. Default is <literal>2000</literal> (2 seconds).
+ Set to 0 to disable cold start behavior.
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-memory-map-table-buckets" xreflabel="memory_map_table_buckets">
+ <term><varname>memory_map_table_buckets</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>memory_map_table_buckets</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Number of hash buckets for the table mutation tracking hash table.
+ Higher values reduce hash collisions and improve lookup performance.
+ </para>
+ <para>
+ Valid range: 64-65536. Default is <literal>1024</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-memory-map-table-size" xreflabel="memory_map_table_size">
+ <term><varname>memory_map_table_size</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>memory_map_table_size</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Maximum number of tables that can be tracked simultaneously in the memory map.
+ When full, oldest entries are evicted using a simple eviction strategy.
+ </para>
+ <para>
+ Valid range: 128-131072. Default is <literal>2048</literal>.
+ Memory usage: approximately 160 bytes per entry.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-memory-map-query-buckets" xreflabel="memory_map_query_buckets">
+ <term><varname>memory_map_query_buckets</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>memory_map_query_buckets</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Number of hash buckets for the query parse cache. The cache stores normalized
+ query strings mapped to their table dependencies to avoid repeated parsing.
+ </para>
+ <para>
+ Valid range: 64-65536. Default is <literal>2048</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-memory-map-query-cache-size" xreflabel="memory_map_query_cache_size">
+ <term><varname>memory_map_query_cache_size</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>memory_map_query_cache_size</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Maximum number of query parse results to cache. Uses LRU eviction when full.
+ Larger caches reduce parsing overhead but consume more shared memory.
+ </para>
+ <para>
+ Valid range: 100-1000000. Default is <literal>10000</literal>.
+ Memory usage: approximately 640 bytes per entry (~6.3 MB for default, ~64 MB for 100000 entries).
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ <sect3 id="runtime-config-memory-map-example">
+ <title>Memory Map Configuration Example</title>
+ <para>
+ To enable memory map with replication delay monitoring:
+ </para>
+ <programlisting>
+# Enable memory map feature
+memory_map_enabled = on
+memory_map_ttl_factor = 5.0
+memory_map_cold_start_duration = 2000
+
+# Configure external replication delay monitoring
+replication_delay_source_cmd = '/path/to/get-replication-delay.sh'
+replication_delay_source_timeout = 10
+
+# Adjust cache sizes based on workload (increases memory usage)
+memory_map_table_size = 4096 # Track up to 4096 tables (~640 KB)
+memory_map_query_cache_size = 50000 # Cache 50k queries (~31 MB)
+ </programlisting>
+ <para>
+ Total shared memory required for above configuration: approximately 32 MB (31 MB query cache + 0.6 MB table map + overhead).
+ Default configuration (10000 query cache entries, 2048 tables) requires approximately 6.6 MB.
+ </para>
+ </sect3>
+
+ </sect2>
+
</sect1>
diff --git a/src/Makefile.am b/src/Makefile.am
index 4678ab53055e828a37b6477801640aff17ff84a7..51896ae07771fc00382ab965eaf3807c8b5f3d94 100644
--- a/src/Makefile.am
+++ b/src/Makefile.am
@@ -35,6 +35,7 @@ pgpool_SOURCES = main/main.c \
rewrite/pool_timestamp.c \
rewrite/pool_lobj.c \
utils/pool_select_walker.c \
+ utils/pool_memory_map.c \
utils/strlcpy.c \
utils/psprintf.c \
utils/pool_params.c \
diff --git a/src/config/pool_config_variables.c b/src/config/pool_config_variables.c
index 68abb7f41cb96d856c824a148842748bfb7a4d12..d9a28e7ec3369ff799cb37c37c0cd05075327606 100644
--- a/src/config/pool_config_variables.c
+++ b/src/config/pool_config_variables.c
@@ -783,6 +783,16 @@ static struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"memory_map_enabled", CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "Enable in-memory tracking of recently written tables to avoid stale reads from replicas",
+ CONFIG_VAR_TYPE_BOOL, false, 0
+ },
+ &g_pool_config.memory_map_enabled,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"auto_failback", CFGCXT_RELOAD, FAILOVER_CONFIG,
"Enables nodes automatically reattach, when detached node continue streaming replication.",
@@ -1757,6 +1767,17 @@ static struct config_int_array ConfigureNamesIntArray[] =
static struct config_double ConfigureNamesDouble[] =
{
+ {
+ {"memory_map_ttl_factor", CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "TTL multiplier for memory map (TTL = replication_delay * factor)",
+ CONFIG_VAR_TYPE_DOUBLE, false, 0
+ },
+ &g_pool_config.memory_map_ttl_factor,
+ 5.0, /* boot value: 5x replication delay */
+ 1.0, 100.0, /* min, max */
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
EMPTY_CONFIG_DOUBLE
};
@@ -2355,6 +2376,61 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"memory_map_cold_start_duration", CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "Duration in milliseconds to force queries to primary after child process starts.",
+ CONFIG_VAR_TYPE_INT, false, GUC_UNIT_MS
+ },
+ &g_pool_config.memory_map_cold_start_duration,
+ 2000, /* 2 seconds */
+ 0, 60000, /* 0 to 60 seconds */
+ NULL, NULL, NULL
+ },
+
+ {
+ {"memory_map_table_buckets", CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Number of hash buckets for table mutation map.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.memory_map_table_buckets,
+ 1024,
+ 64, 65536,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"memory_map_table_size", CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Maximum number of entries in table mutation map.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.memory_map_table_size,
+ 2048,
+ 128, 131072,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"memory_map_query_buckets", CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Number of hash buckets for query parse cache.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.memory_map_query_buckets,
+ 2048,
+ 64, 65536,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"memory_map_query_cache_size", CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Maximum number of entries in query parse cache.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.memory_map_query_cache_size,
+ 10000,
+ 100, 1000000,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
EMPTY_CONFIG_INT
};
diff --git a/src/context/pool_query_context.c b/src/context/pool_query_context.c
index 1a13168c6e8d3f0064dfce4ee6e4661eee69304e..47e5f2796f809dcf3208edd7d0a2bcf8dda83260 100644
--- a/src/context/pool_query_context.c
+++ b/src/context/pool_query_context.c
@@ -29,6 +29,7 @@
#include "utils/statistics.h"
#include "utils/pool_select_walker.h"
#include "utils/pool_stream.h"
+#include "utils/pool_memory_map.h"
#include "context/pool_session_context.h"
#include "context/pool_query_context.h"
#include "parser/nodes.h"
@@ -2135,6 +2136,92 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
{
pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
}
+ /*
+ * Check memory map for recently written tables.
+ * If in cold start or any table was recently written,
+ * route to primary to avoid stale reads.
+ */
+ else if (pool_config->memory_map_enabled)
+ {
+ bool force_primary = false;
+
+ /* During cold start, route everything to primary */
+ if (pool_memory_map_in_cold_start())
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load balance because of memory map cold start"),
+ errdetail("destination = PRIMARY for query= \"%s\"", query)));
+ force_primary = true;
+ }
+ else
+ {
+ /* Extract table names and check if any are stale */
+ SelectContext ctx;
+ int num_oids;
+ int i;
+
+ memset(&ctx, 0, sizeof(ctx));
+ num_oids = pool_extract_table_oids_from_select_stmt(node, &ctx);
+
+ for (i = 0; i < num_oids; i++)
+ {
+ if (pool_memory_map_table_is_stale(ctx.table_names[i]))
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load balance because table \"%s\" was recently written",
+ ctx.table_names[i]),
+ errdetail("destination = PRIMARY for query= \"%s\"", query)));
+ force_primary = true;
+ break;
+ }
+ }
+ }
+
+ if (force_primary)
+ {
+ pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
+ }
+ else
+ {
+ /* Proceed with load balancing */
+ if (pool_config->statement_level_load_balance)
+ {
+ session_context->load_balance_node_id = select_load_balancing_node();
+ }
+
+ /*
+ * As streaming replication delay is too much, if
+ * prefer_lower_delay_standby is true then elect new load
+ * balance node which is lowest delayed, false then send
+ * to the primary.
+ */
+ if (STREAM && check_replication_delay(session_context->load_balance_node_id))
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load balance because of too much replication delay"),
+ errdetail("destination = %d for query= \"%s\"", dest, query)));
+
+ if (pool_config->prefer_lower_delay_standby)
+ {
+ int new_load_balancing_node = select_load_balancing_node();
+
+ session_context->load_balance_node_id = new_load_balancing_node;
+ session_context->query_context->load_balance_node_id = session_context->load_balance_node_id;
+ pool_set_node_to_be_sent(query_context, session_context->query_context->load_balance_node_id);
+ }
+ else
+ {
+ pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
+ }
+ }
+ else
+ {
+ session_context->query_context->load_balance_node_id = session_context->load_balance_node_id;
+ pool_set_node_to_be_sent(query_context,
+ session_context->query_context->load_balance_node_id);
+ }
+ }
+ }
else
{
if (pool_config->statement_level_load_balance)
diff --git a/src/include/pool_config.h b/src/include/pool_config.h
index 741de6cc5fc3368f813d6b6efa68eb7f8a79506b..9675c1b65d9bae83c6412c1f1f3399364932221f 100644
--- a/src/include/pool_config.h
+++ b/src/include/pool_config.h
@@ -365,6 +365,16 @@ typedef struct
* replication check */
char *replication_delay_source_cmd; /* external command for replication delay */
int replication_delay_source_timeout; /* timeout for external command in seconds */
+
+ /* Memory map configuration for tracking recently written tables */
+ bool memory_map_enabled; /* Enable in-memory table tracking */
+ double memory_map_ttl_factor; /* TTL multiplier for replication delay */
+ int memory_map_cold_start_duration; /* Cold start duration in ms */
+ int memory_map_table_buckets; /* Number of hash buckets for table map */
+ int memory_map_table_size; /* Max entries in table map */
+ int memory_map_query_buckets; /* Number of hash buckets for query cache */
+ int memory_map_query_cache_size; /* Max entries in query cache */
+
char *failover_command; /* execute command when failover happens */
char *follow_primary_command; /* execute command when failover is
* ended */
diff --git a/src/include/utils/pool_memory_map.h b/src/include/utils/pool_memory_map.h
new file mode 100644
index 0000000000000000000000000000000000000000..511d7a45e7dbd417b1e49b9211fb994f29af1a08
--- /dev/null
+++ b/src/include/utils/pool_memory_map.h
@@ -0,0 +1,236 @@
+/* -*-pgsql-c-*- */
+/*
+ * pgpool: a language independent connection pool server for PostgreSQL
+ * written by Tatsuo Ishii
+ *
+ * Copyright (c) 2003-2026 PgPool Global Development Group
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose and without fee is hereby
+ * granted, provided that the above copyright notice appear in all
+ * copies and that both that copyright notice and this permission
+ * notice appear in supporting documentation, and that the name of the
+ * author not be used in advertising or publicity pertaining to
+ * distribution of the software without specific, written prior
+ * permission. The author makes no representations about the
+ * suitability of this software for any purpose. It is provided "as
+ * is" without express or implied warranty.
+ *
+ * pool_memory_map.h: In-memory tracking of recently written tables
+ * to avoid stale reads from replicas during replication lag
+ */
+
+#ifndef POOL_MEMORY_MAP_H
+#define POOL_MEMORY_MAP_H
+
+#include "pool.h"
+#include <sys/time.h>
+
+/*
+ * Maximum table name length including schema: "schema"."table"
+ * Using NAMEDATALEN * 2 + 4 for quotes and dot
+ */
+#define MEMORY_MAP_TABLE_NAME_LEN (NAMEDATALEN * 2 + 4)
+
+/*
+ * Maximum number of tables we track per query
+ */
+#define MEMORY_MAP_MAX_TABLES_PER_QUERY 8
+
+/*
+ * Invalid index marker for linked lists
+ */
+#define MEMORY_MAP_INVALID_INDEX (-1)
+
+/*
+ * Default TTL in microseconds (100ms) used when replication delay is unknown
+ */
+#define MEMORY_MAP_DEFAULT_TTL_US (100 * 1000)
+
+/*
+ * Entry in the table mutation hash table
+ */
+typedef struct TableMutationEntry
+{
+ char table_name[MEMORY_MAP_TABLE_NAME_LEN]; /* "schema"."table" */
+ struct timeval last_write_time; /* When the table was last written */
+ uint32 hash; /* Pre-computed hash value */
+ int next; /* Next entry in collision chain (-1 if none) */
+ bool in_use; /* Is this entry in use? */
+} TableMutationEntry;
+
+/*
+ * Header for the table mutation hash table in shared memory
+ */
+typedef struct TableMutationHashTable
+{
+ int num_buckets; /* Number of hash buckets */
+ int max_entries; /* Maximum entries allowed */
+ int num_entries; /* Current number of entries */
+ int free_list_head; /* Head of free entry list */
+ volatile uint32 lock; /* Spinlock guarding access from multiple processes */
+ /* Flexible array members follow in shared memory:
+ * int buckets[num_buckets];
+ * TableMutationEntry entries[max_entries];
+ */
+} TableMutationHashTable;
+
+/*
+ * Entry in the query parse cache
+ */
+typedef struct QueryParseEntry
+{
+ uint64 query_hash; /* Hash of normalized query */
+ bool is_write; /* True if INSERT/UPDATE/DELETE */
+ int num_tables; /* Number of tables in query */
+ char table_names[MEMORY_MAP_MAX_TABLES_PER_QUERY][MEMORY_MAP_TABLE_NAME_LEN];
+ int next; /* Next entry in collision chain */
+ int lru_prev; /* Previous in LRU list */
+ int lru_next; /* Next in LRU list */
+ bool in_use; /* Is this entry in use? */
+} QueryParseEntry;
+
+/*
+ * Header for the query parse cache in shared memory
+ */
+typedef struct QueryParseCache
+{
+ int num_buckets; /* Number of hash buckets */
+ int max_entries; /* Maximum entries allowed */
+ int num_entries; /* Current number of entries */
+ int free_list_head; /* Head of free entry list */
+ int lru_head; /* Most recently used */
+ int lru_tail; /* Least recently used */
+ volatile uint32 lock; /* Spinlock guarding access from multiple processes */
+ /* Flexible array members follow in shared memory:
+ * int buckets[num_buckets];
+ * QueryParseEntry entries[max_entries];
+ */
+} QueryParseCache;
+
+/*
+ * Global state for memory map feature
+ */
+typedef struct MemoryMapState
+{
+ bool initialized; /* Has shared memory been initialized? */
+ uint64 current_ttl_us; /* Current TTL in microseconds */
+ struct timeval ttl_last_updated; /* When TTL was last updated */
+ volatile uint32 stats_queries_checked; /* Number of queries checked */
+ volatile uint32 stats_forced_primary; /* Queries forced to primary */
+ volatile uint32 stats_allowed_replica; /* Queries allowed to replica */
+} MemoryMapState;
+
+/*
+ * Main shared memory structure containing all components
+ */
+typedef struct MemoryMapShmem
+{
+ MemoryMapState state;
+ TableMutationHashTable *table_map;
+ QueryParseCache *query_cache;
+} MemoryMapShmem;
+
+/* ----------------
+ * Public API functions
+ * ----------------
+ */
+
+/*
+ * Initialize shared memory structures for memory map.
+ * Called from pgpool_main.c after pool_init_pool_info().
+ */
+extern void pool_memory_map_init(void);
+
+/*
+ * Initialize per-child process state for memory map.
+ * Called from child.c when a new child process starts.
+ * Sets up cold start tracking.
+ */
+extern void pool_memory_map_child_init(void);
+
+/*
+ * Check if the child process is in cold start period.
+ * During cold start, all queries are routed to primary.
+ * Returns true if in cold start, false otherwise.
+ */
+extern bool pool_memory_map_in_cold_start(void);
+
+/*
+ * Check if a table was recently written to (is "stale").
+ * If stale, reads from this table should go to primary.
+ * Returns true if table is stale (recently written), false otherwise.
+ */
+extern bool pool_memory_map_table_is_stale(const char *table_name);
+
+/*
+ * Mark tables as recently written.
+ * Called after INSERT/UPDATE/DELETE queries complete.
+ * table_names: array of table names
+ * num_tables: number of tables in array
+ */
+extern void pool_memory_map_mark_tables_written(const char **table_names, int num_tables);
+
+/*
+ * Convenience function to mark a single table as written.
+ * table_name: fully qualified table name
+ */
+extern void pool_memory_map_mark_table_written(const char *table_name);
+
+/*
+ * Update the TTL based on current replication delay.
+ * Called from pool_worker_child.c when replication delay is updated.
+ * delay_us: replication delay in microseconds
+ */
+extern void pool_memory_map_update_ttl(uint64 delay_us);
+
+/*
+ * Look up cached parse result for a query.
+ * hash: hash of normalized query
+ * is_write: output - true if query is a write
+ * table_names: output - array to fill with table names
+ * num_tables: output - number of tables found
+ * Returns true if found in cache, false otherwise.
+ */
+extern bool pool_memory_map_get_cached_parse(uint64 hash, bool *is_write,
+ char table_names[][MEMORY_MAP_TABLE_NAME_LEN],
+ int *num_tables);
+
+/*
+ * Cache a parse result for a query.
+ * hash: hash of normalized query
+ * is_write: true if query is a write
+ * table_names: array of table names
+ * num_tables: number of tables
+ */
+extern void pool_memory_map_cache_parse(uint64 hash, bool is_write,
+ const char table_names[][MEMORY_MAP_TABLE_NAME_LEN],
+ int num_tables);
+
+/*
+ * Normalize a query and compute its hash.
+ * Strips comments, normalizes whitespace and literals.
+ * query: input SQL query string
+ * Returns: 64-bit hash of normalized query
+ */
+extern uint64 pool_memory_map_normalize_and_hash(const char *query);
+
+/*
+ * Get the current TTL in microseconds.
+ */
+extern uint64 pool_memory_map_get_ttl(void);
+
+/*
+ * Calculate required shared memory size for memory map.
+ */
+extern Size pool_memory_map_shmem_size(void);
+
+/*
+ * Get memory map statistics for monitoring.
+ */
+extern void pool_memory_map_get_stats(uint32 *queries_checked,
+ uint32 *forced_primary,
+ uint32 *allowed_replica,
+ uint64 *current_ttl_us);
+
+#endif /* POOL_MEMORY_MAP_H */
diff --git a/src/main/pgpool_main.c b/src/main/pgpool_main.c
index 4d88c5815ea253471167dfe7e5bf39f0270323ec..f4a14c84db99100fb761168c14e77b2f2b9eff4b 100644
--- a/src/main/pgpool_main.c
+++ b/src/main/pgpool_main.c
@@ -57,6 +57,7 @@
#include "auth/pool_passwd.h"
#include "auth/pool_hba.h"
#include "query_cache/pool_memqcache.h"
+#include "utils/pool_memory_map.h"
#include "watchdog/wd_internal_commands.h"
#include "watchdog/wd_lifecheck.h"
#include "watchdog/watchdog.h"
@@ -3065,6 +3066,12 @@ initialize_shared_mem_objects(bool clear_memcache_oidmaps)
elog(DEBUG1, "watchdog: %zu bytes requested for shared memory", MAXALIGN(wd_ipc_get_shared_mem_size()));
}
+ if (pool_config->memory_map_enabled)
+ {
+ size += MAXALIGN(pool_memory_map_shmem_size());
+ elog(DEBUG1, "memory_map: %zu bytes requested for shared memory", MAXALIGN(pool_memory_map_shmem_size()));
+ }
+
initialize_shared_memory_main_segment(size);
/* Move the backend descriptors to shared memory */
@@ -3181,6 +3188,12 @@ initialize_shared_mem_objects(bool clear_memcache_oidmaps)
wd_ipc_initialize_data();
}
+ /* Initialize memory map for tracking recently written tables */
+ if (pool_config->memory_map_enabled)
+ {
+ pool_memory_map_init();
+ }
+
}
/*
diff --git a/src/protocol/CommandComplete.c b/src/protocol/CommandComplete.c
index a3b8f0ea194ffecc79e58566be80562a46eb75ab..9b0681ca46ac2602d3f541ad3119770d422fb0c3 100644
--- a/src/protocol/CommandComplete.c
+++ b/src/protocol/CommandComplete.c
@@ -38,6 +38,8 @@
#include "utils/palloc.h"
#include "utils/memutils.h"
#include "utils/pool_stream.h"
+#include "utils/pool_memory_map.h"
+#include "utils/pool_select_walker.h"
static int extract_ntuples(char *message);
static POOL_STATUS handle_mismatch_tuples(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend, char *packet, int packetlen, bool command_complete);
@@ -304,6 +306,38 @@ handle_query_context(POOL_CONNECTION_POOL *backend)
node = session_context->query_context->parse_tree;
+ /*
+ * Track table writes for memory map feature.
+ * Mark tables as written when INSERT/UPDATE/DELETE completes.
+ */
+ if (pool_config->memory_map_enabled)
+ {
+ char *table_name = NULL;
+
+ if (IsA(node, InsertStmt))
+ {
+ InsertStmt *stmt = (InsertStmt *) node;
+ table_name = make_table_name_from_rangevar(stmt->relation);
+ }
+ else if (IsA(node, UpdateStmt))
+ {
+ UpdateStmt *stmt = (UpdateStmt *) node;
+ table_name = make_table_name_from_rangevar(stmt->relation);
+ }
+ else if (IsA(node, DeleteStmt))
+ {
+ DeleteStmt *stmt = (DeleteStmt *) node;
+ table_name = make_table_name_from_rangevar(stmt->relation);
+ }
+
+ if (table_name != NULL)
+ {
+ pool_memory_map_mark_table_written(table_name);
+ ereport(DEBUG1,
+ (errmsg("memory map: marked table \"%s\" as written", table_name)));
+ }
+ }
+
if (IsA(node, PrepareStmt))
{
if (session_context->uncompleted_message)
diff --git a/src/protocol/child.c b/src/protocol/child.c
index c34f057281be62feaf39db1bb605062f56dc398c..07ee58c6a48dcd3ef6d79970e08a6f77b8924e1d 100644
--- a/src/protocol/child.c
+++ b/src/protocol/child.c
@@ -57,6 +57,7 @@
#include "utils/elog.h"
#include "utils/ps_status.h"
#include "utils/timestamp.h"
+#include "utils/pool_memory_map.h"
#include "context/pool_process_context.h"
#include "context/pool_session_context.h"
@@ -213,6 +214,12 @@ do_child(int *fds)
/* Initialize per process context */
pool_init_process_context();
+ /* Initialize memory map child state for cold start tracking */
+ if (pool_config->memory_map_enabled)
+ {
+ pool_memory_map_child_init();
+ }
+
/* initialize connection pool */
if (pool_init_cp())
{
diff --git a/src/sample/pgpool.conf.sample-stream b/src/sample/pgpool.conf.sample-stream
index 454fdb9e5d1fd65437b6a67f12ab62658ea08f49..a245d58bf3339913602143da1b83b964fe5dcaeb 100644
--- a/src/sample/pgpool.conf.sample-stream
+++ b/src/sample/pgpool.conf.sample-stream
@@ -499,6 +499,51 @@ backend_clustering_mode = streaming_replication
#statement_level_load_balance = off
# Enables statement level load balancing
+# - Memory Map (Lagless Read Replica) -
+ # WARNING: Enabling this feature increases shared memory usage
+ # Default settings require ~6.6 MB shared memory
+ # (0.3 MB table tracking + 6.3 MB query cache)
+
+#memory_map_enabled = off
+ # Enable in-memory tracking of recently written tables
+ # to prevent stale reads from replicas during replication lag
+ # (change requires reload)
+
+#memory_map_ttl_factor = 5.0
+ # TTL multiplier: TTL = replication_delay * factor
+ # Higher values provide more safety margin
+ # Range: 1.0-100.0 (default: 5.0)
+ # (change requires reload)
+
+#memory_map_cold_start_duration = 2000
+ # Duration in milliseconds to route all queries to primary
+ # after child process starts (cold start period)
+ # Range: 0-60000 ms (default: 2000 ms = 2 seconds)
+ # Set to 0 to disable cold start behavior
+ # (change requires reload)
+
+#memory_map_table_buckets = 1024
+ # Number of hash buckets for table mutation tracking
+ # Higher values reduce hash collisions
+ # Range: 64-65536 (default: 1024)
+ # (change requires restart)
+
+#memory_map_table_size = 2048
+ # Maximum number of tables to track simultaneously
+ # Range: 128-131072 (default: 2048)
+ # (change requires restart)
+
+#memory_map_query_buckets = 2048
+ # Number of hash buckets for query parse cache
+ # Range: 64-65536 (default: 2048)
+ # (change requires restart)
+
+#memory_map_query_cache_size = 10000
+ # Maximum number of query parse results to cache
+ # Range: 100-1000000 (default: 10000)
+ # Memory usage: ~640 bytes per entry (~6.3 MB default, ~64 MB for 100000)
+ # (change requires restart)
+
#------------------------------------------------------------------------------
# STREAMING REPLICATION MODE
#------------------------------------------------------------------------------
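To make the memory figures in the sample comments easier to reason about, the following sketch adds up the four arrays that the size parameters control. It is an estimate only: estimate_memory_map_bytes() is a hypothetical helper, the entry sizes are passed in as parameters because the real struct sizes depend on the build, and the fixed-size headers are ignored.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical estimate of the shared memory used by the table map and
 * the query parse cache: one int per bucket plus one struct per entry.
 */
static size_t
estimate_memory_map_bytes(size_t table_buckets, size_t table_entries,
						  size_t table_entry_bytes,
						  size_t query_buckets, size_t query_entries,
						  size_t query_entry_bytes)
{
	size_t		total = 0;

	assert(table_entry_bytes > 0 && query_entry_bytes > 0);

	total += table_buckets * sizeof(int);		/* table bucket array */
	total += table_entries * table_entry_bytes; /* table entry array */
	total += query_buckets * sizeof(int);		/* query bucket array */
	total += query_entries * query_entry_bytes; /* query entry array */

	return total;
}
```

With the sample's figure of about 640 bytes per cached query, the default 10000 entries alone account for 6,400,000 bytes, so memory_map_query_cache_size is the parameter that dominates the total.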
diff --git a/src/streaming_replication/pool_worker_child.c b/src/streaming_replication/pool_worker_child.c
index 7026f0b1f0de7b9018ac912fac850f91d1c2978b..7dfce4946e268e120471db760440155787f84515 100644
--- a/src/streaming_replication/pool_worker_child.c
+++ b/src/streaming_replication/pool_worker_child.c
@@ -58,6 +58,7 @@
#include "utils/pool_ip.h"
#include "utils/ps_status.h"
#include "utils/pool_stream.h"
+#include "utils/pool_memory_map.h"
#include "context/pool_process_context.h"
#include "context/pool_session_context.h"
@@ -696,6 +697,7 @@ check_replication_time_lag_with_cmd(void)
double delay_ms;
uint64 delay;
uint64 delay_threshold_by_time;
+ uint64 max_delay_us = 0; /* Track maximum delay for memory map */
int token_count = 0;
int primary_node_id;
int save_errno;
@@ -1032,6 +1034,10 @@ check_replication_time_lag_with_cmd(void)
bkinfo->standby_delay = delay;
bkinfo->standby_delay_by_time = true;
+ /* Track maximum delay for memory map TTL calculation */
+ if (delay > max_delay_us)
+ max_delay_us = delay;
+
/* Log delay if necessary */
delay_threshold_by_time = pool_config->delay_threshold_by_time * 1000; /* threshold is in
* milliseconds, convert
@@ -1049,6 +1055,10 @@ check_replication_time_lag_with_cmd(void)
token = strtok_r(NULL, " \t\n", &saveptr);
}
+ /* Update memory map TTL based on maximum observed delay */
+ if (pool_config->memory_map_enabled && max_delay_us > 0)
+ pool_memory_map_update_ttl(max_delay_us);
+
}
PG_CATCH();
{
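The two hunks above feed the largest observed standby delay into the TTL calculation. A minimal sketch of that arithmetic follows, assuming the rule TTL = delay * factor with a lower bound and a one hour cap. The names max_delay_us() and compute_ttl_us() are hypothetical, and the lower bound is a parameter here because the value of MEMORY_MAP_DEFAULT_TTL_US is not shown in this part of the patch.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TTL_US (3600ULL * 1000000ULL)	/* one hour cap */

/* Largest delay among n standbys, or 0 when n is 0. */
static uint64_t
max_delay_us(const uint64_t *delays_us, int n)
{
	uint64_t	max = 0;
	int			i;

	for (i = 0; i < n; i++)
	{
		if (delays_us[i] > max)
			max = delays_us[i];
	}
	return max;
}

/* TTL = delay * factor, clamped to [min_ttl_us, MAX_TTL_US]. */
static uint64_t
compute_ttl_us(uint64_t delay_us, double factor, uint64_t min_ttl_us)
{
	double		scaled;
	uint64_t	ttl;

	assert(factor >= 1.0);

	/* Clamp in floating point first so the cast below cannot overflow. */
	scaled = (double) delay_us * factor;
	if (scaled >= (double) MAX_TTL_US)
		return MAX_TTL_US;

	ttl = (uint64_t) scaled;
	if (ttl < min_ttl_us)
		ttl = min_ttl_us;
	if (ttl > MAX_TTL_US)
		ttl = MAX_TTL_US;
	return ttl;
}
```

Using the maximum rather than the average delay is the conservative choice: a read may be load balanced to any standby, so the slowest one decides how long a written table has to stay on the primary.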
diff --git a/src/test/regression/tests/045.memory_map/test.sh b/src/test/regression/tests/045.memory_map/test.sh
new file mode 100755
index 0000000000000000000000000000000000000000..ce05418262664e5133e2ffd478c7ae622b062cc7
--- /dev/null
+++ b/src/test/regression/tests/045.memory_map/test.sh
@@ -0,0 +1,196 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for memory map feature (in-memory table tracking).
+# Tests routing of queries based on recently written tables.
+#
+source $TESTLIBS
+TESTDIR=testdir
+PSQL=$PGBIN/psql
+PSQLOPTS="-a -q -X"
+PGPOOLBIN=$PGPOOL_INSTALL_DIR/bin
+export PGDATABASE=test
+
+# Only run in streaming replication mode since that's the target use case
+for mode in s
+do
+ rm -fr $TESTDIR
+ mkdir $TESTDIR
+ cd $TESTDIR
+
+ # Create test environment with 2 nodes
+ echo -n "creating test environment..."
+ $PGPOOL_SETUP -m $mode -n 2 || exit 1
+ echo "done."
+
+ source ./bashrc.ports
+
+ # Configure memory map feature
+ echo "memory_map_enabled = on" >> etc/pgpool.conf
+ echo "memory_map_ttl_factor = 5.0" >> etc/pgpool.conf
+ echo "memory_map_cold_start_duration = 2000" >> etc/pgpool.conf
+
+ # Configure weights so we can distinguish routing
+ # Backend 0 (primary) weight=0, Backend 1 (standby) weight=1
+ # This means load balanced queries go to node 1 by default
+ echo "backend_weight0 = 0" >> etc/pgpool.conf
+ echo "backend_weight1 = 1" >> etc/pgpool.conf
+
+ # Enable debug logging to see routing decisions
+ echo "log_min_messages = debug1" >> etc/pgpool.conf
+
+ ./startall
+
+ export PGPORT=$PGPOOL_PORT
+
+ wait_for_pgpool_startup
+
+ # Create test tables
+ $PSQL test <<EOF
+CREATE TABLE t1(i INTEGER);
+CREATE TABLE t2(i INTEGER);
+CREATE TABLE t3(i INTEGER);
+EOF
+
+ echo "=== Test 1: Cold Start Routing ==="
+ # During cold start, all queries should go to primary
+ # Restart pgpool to trigger cold start
+ ./shutdownall
+ ./startall
+ wait_for_pgpool_startup
+
+ # Immediately query - should go to primary due to cold start
+ $PSQL test -c "SELECT 'cold_start_test' as marker, * FROM t1;" > /dev/null 2>&1
+
+ # Check log for cold start message
+ if grep -q "could not load balance because of memory map cold start" log/pgpool.log; then
+ echo "Test 1 PASSED: Cold start routing works"
+ else
+ echo "Test 1 FAILED: Cold start routing not detected"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 2: Wait for cold start to end ==="
+ # Wait for cold start period to end (2 seconds)
+ sleep 3
+
+ # Clear the log
+ > log/pgpool.log
+
+ # Now a clean table query should load balance (go to node 1)
+ $PSQL test -c "SELECT 'after_cold_start' as marker, * FROM t3;" > /dev/null 2>&1
+
+ # After cold start, queries to clean tables should load balance
+ # Check that it did NOT get forced to primary due to memory map
+ if grep -q "could not load balance because of memory map cold start" log/pgpool.log; then
+ echo "Test 2 FAILED: Still in cold start after waiting"
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 2 PASSED: Cold start ended correctly"
+
+ echo "=== Test 3: Write-then-Read Routing ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Write to t1
+ $PSQL test -c "INSERT INTO t1 VALUES (1);" > /dev/null 2>&1
+
+ # Immediately read from t1 - should go to primary due to recent write
+ $PSQL test -c "SELECT 'write_read_test' as marker, * FROM t1;" > /dev/null 2>&1
+
+ # Check log for table staleness message
+ if grep -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 3 PASSED: Write-then-read routing works"
+ else
+ echo "Test 3 FAILED: Table staleness not detected after write"
+ # Show relevant log entries for debugging
+ grep -i "memory" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 4: Clean Table Still Load Balances ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Read from t2 (never written to) - should load balance
+ $PSQL test -c "SELECT 'clean_table_test' as marker, * FROM t2;" > /dev/null 2>&1
+
+ # Should NOT see memory map blocking message for t2
+ if grep -q "could not load balance because table.*t2.*was recently written" log/pgpool.log; then
+ echo "Test 4 FAILED: Clean table incorrectly marked as stale"
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 4 PASSED: Clean tables still load balance"
+
+ echo "=== Test 5: UPDATE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Update t2
+ $PSQL test -c "UPDATE t2 SET i = 999 WHERE i = 0;" > /dev/null 2>&1
+
+ # Immediately read from t2 - should go to primary
+ $PSQL test -c "SELECT 'update_test' as marker, * FROM t2;" > /dev/null 2>&1
+
+ if grep -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 5 PASSED: UPDATE marks table as stale"
+ else
+ echo "Test 5 FAILED: UPDATE did not mark table as stale"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 6: DELETE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Delete from t3
+ $PSQL test -c "DELETE FROM t3 WHERE i = 0;" > /dev/null 2>&1
+
+ # Immediately read from t3 - should go to primary
+ $PSQL test -c "SELECT 'delete_test' as marker, * FROM t3;" > /dev/null 2>&1
+
+ if grep -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 6 PASSED: DELETE marks table as stale"
+ else
+ echo "Test 6 FAILED: DELETE did not mark table as stale"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 7: Multi-Table Query with One Stale Table ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create a new clean table
+ $PSQL test -c "CREATE TABLE t4(i INTEGER);" > /dev/null 2>&1
+
+ # Wait a bit for TTL to expire on other tables if factor is low
+ sleep 1
+
+ # Write to t1 only
+ $PSQL test -c "INSERT INTO t1 VALUES (100);" > /dev/null 2>&1
+
+ # Query joining t1 and t4 - should go to primary because t1 is stale
+ $PSQL test -c "SELECT 'multi_table_test' as marker FROM t1, t4;" > /dev/null 2>&1
+
+ if grep -q "could not load balance because table.*t1.*was recently written" log/pgpool.log; then
+ echo "Test 7 PASSED: Multi-table query routes to primary when one table is stale"
+ else
+ echo "Test 7 FAILED: Multi-table staleness not detected"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo ""
+ echo "=== All Memory Map Tests PASSED ==="
+
+ ./shutdownall
+
+ cd ..
+done
+
+exit 0
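The routing rule that Tests 1 through 7 exercise can be summarized in a few lines of C. This is a sketch under stated assumptions, not the patch's routing code: must_route_to_primary(), table_dirty_fn and name_in_list() are hypothetical names, and the dirty check is abstracted behind a callback so the example stays self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical predicate: is this table inside its TTL window? */
typedef bool (*table_dirty_fn) (const char *table_name, void *ctx);

/* Example predicate: ctx is a NULL-terminated list of dirty table names. */
static bool
name_in_list(const char *table_name, void *ctx)
{
	const char *const *list = ctx;

	for (; *list != NULL; list++)
	{
		if (strcmp(*list, table_name) == 0)
			return true;
	}
	return false;
}

/*
 * A SELECT goes to the primary during cold start, or when at least one
 * of the tables it references was written recently.
 */
static bool
must_route_to_primary(bool in_cold_start,
					  const char *const *tables, int num_tables,
					  table_dirty_fn is_dirty, void *ctx)
{
	int			i;

	assert(is_dirty != NULL);

	if (in_cold_start)
		return true;

	for (i = 0; i < num_tables; i++)
	{
		if (tables[i] != NULL && is_dirty(tables[i], ctx))
			return true;
	}
	return false;
}
```

The multi-table case in Test 7 corresponds to the loop: a join of t1 and t4 is sent to the primary as soon as t1 is found dirty, even though t4 is clean.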
diff --git a/src/utils/pool_memory_map.c b/src/utils/pool_memory_map.c
new file mode 100644
index 0000000000000000000000000000000000000000..3f00ec1e2afef6518532804391633175fd351811
--- /dev/null
+++ b/src/utils/pool_memory_map.c
@@ -0,0 +1,1076 @@
+/* -*-pgsql-c-*- */
+/*
+ * pgpool: a language independent connection pool server for PostgreSQL
+ * written by Tatsuo Ishii
+ *
+ * Copyright (c) 2003-2026 PgPool Global Development Group
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose and without fee is hereby
+ * granted, provided that the above copyright notice appear in all
+ * copies and that both that copyright notice and this permission
+ * notice appear in supporting documentation, and that the name of the
+ * author not be used in advertising or publicity pertaining to
+ * distribution of the software without specific, written prior
+ * permission. The author makes no representations about the
+ * suitability of this software for any purpose. It is provided "as
+ * is" without express or implied warranty.
+ *
+ * pool_memory_map.c: In-memory tracking of recently written tables
+ * to avoid stale reads from replicas during replication lag
+ *
+ * Based on the "lagless" architecture from Tailor Brands:
+ * https://medium.com/tailor-tech/using-database-read-replicas-in-distributed-systems-d80eaf6bbf8a
+ */
+
+#include <string.h>
+#include <stdlib.h>
+#include <sys/time.h>
+#include <unistd.h>
+
+#include "pool.h"
+#include "pool_config.h"
+#include "utils/pool_memory_map.h"
+#include "utils/elog.h"
+#include "utils/palloc.h"
+
+/* ----------------
+ * Local variables
+ * ----------------
+ */
+
+/* Pointer to shared memory structure */
+static MemoryMapShmem *memory_map_shmem = NULL;
+
+/* Per-process cold start tracking (not in shared memory) */
+static struct timeval process_start_time;
+static bool cold_start_initialized = false;
+
+/* ----------------
+ * Helper macros for accessing flexible arrays in shared memory
+ * ----------------
+ */
+
+/* Get pointer to bucket array in table map */
+#define TABLE_MAP_BUCKETS(map) \
+ ((int *)((char *)(map) + sizeof(TableMutationHashTable)))
+
+/* Get pointer to entry array in table map */
+#define TABLE_MAP_ENTRIES(map) \
+ ((TableMutationEntry *)((char *)(map) + sizeof(TableMutationHashTable) + \
+ (map)->num_buckets * sizeof(int)))
+
+/* Get pointer to bucket array in query cache */
+#define QUERY_CACHE_BUCKETS(cache) \
+ ((int *)((char *)(cache) + sizeof(QueryParseCache)))
+
+/* Get pointer to entry array in query cache */
+#define QUERY_CACHE_ENTRIES(cache) \
+ ((QueryParseEntry *)((char *)(cache) + sizeof(QueryParseCache) + \
+ (cache)->num_buckets * sizeof(int)))
+
+/* ----------------
+ * Spinlock operations using atomic test-and-set
+ * ----------------
+ */
+
+static inline void
+spin_lock(volatile uint32 *lock)
+{
+ while (__sync_lock_test_and_set(lock, 1))
+ {
+ /* Spin until we acquire the lock */
+ while (*lock)
+ ;
+ }
+}
+
+static inline void
+spin_unlock(volatile uint32 *lock)
+{
+ __sync_lock_release(lock);
+}
+
+/* ----------------
+ * Hash functions
+ * ----------------
+ */
+
+/*
+ * FNV-1a hash for strings
+ */
+static uint32
+fnv1a_hash_string(const char *str)
+{
+ uint32 hash = 2166136261u; /* FNV offset basis */
+
+ while (*str)
+ {
+ hash ^= (uint8)*str++;
+ hash *= 16777619u; /* FNV prime */
+ }
+
+ return hash;
+}
+
+/*
+ * 64-bit FNV-1a hash of a byte buffer
+ */
+static uint64
+fnv1a_hash_64(const char *str, size_t len)
+{
+ uint64 hash = 14695981039346656037ULL; /* FNV offset basis for 64-bit */
+ size_t i;
+
+ for (i = 0; i < len; i++)
+ {
+ hash ^= (uint8)str[i];
+ hash *= 1099511628211ULL; /* FNV prime for 64-bit */
+ }
+
+ return hash;
+}
+
+/* ----------------
+ * Time utilities
+ * ----------------
+ */
+
+/*
+ * Get elapsed time in microseconds between two timevals
+ */
+static int64
+elapsed_us(struct timeval *start, struct timeval *end)
+{
+ return ((int64)(end->tv_sec - start->tv_sec) * 1000000) +
+ (end->tv_usec - start->tv_usec);
+}
+
+/*
+ * Get current time
+ */
+static void
+get_current_time(struct timeval *tv)
+{
+ gettimeofday(tv, NULL);
+}
+
+/* ----------------
+ * Table mutation hash table operations
+ * ----------------
+ */
+
+/*
+ * Initialize table mutation hash table
+ */
+static void
+table_map_init(TableMutationHashTable *map, int num_buckets, int max_entries)
+{
+ int *buckets;
+ TableMutationEntry *entries;
+ int i;
+
+ map->num_buckets = num_buckets;
+ map->max_entries = max_entries;
+ map->num_entries = 0;
+ map->free_list_head = 0;
+ map->lock = 0;
+
+ buckets = TABLE_MAP_BUCKETS(map);
+ entries = TABLE_MAP_ENTRIES(map);
+
+ /* Initialize all buckets to empty */
+ for (i = 0; i < num_buckets; i++)
+ buckets[i] = MEMORY_MAP_INVALID_INDEX;
+
+ /* Initialize free list - chain all entries */
+ for (i = 0; i < max_entries; i++)
+ {
+ entries[i].in_use = false;
+ entries[i].next = (i < max_entries - 1) ? i + 1 : MEMORY_MAP_INVALID_INDEX;
+ }
+
+ ereport(DEBUG1,
+ (errmsg("memory_map: initialized table map with %d buckets, %d max entries",
+ num_buckets, max_entries)));
+}
+
+/*
+ * Allocate an entry from the free list
+ */
+static int
+table_map_alloc_entry(TableMutationHashTable *map)
+{
+ TableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ int idx;
+
+ if (map->free_list_head == MEMORY_MAP_INVALID_INDEX)
+ return MEMORY_MAP_INVALID_INDEX;
+
+ idx = map->free_list_head;
+ map->free_list_head = entries[idx].next;
+ entries[idx].in_use = true;
+ entries[idx].next = MEMORY_MAP_INVALID_INDEX;
+ map->num_entries++;
+
+ return idx;
+}
+
+/*
+ * Free an entry back to the free list
+ */
+static void
+table_map_free_entry(TableMutationHashTable *map, int idx)
+{
+ TableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+
+ entries[idx].in_use = false;
+ entries[idx].next = map->free_list_head;
+ map->free_list_head = idx;
+ map->num_entries--;
+}
+
+/*
+ * Look up a table in the hash table
+ * Returns entry index or MEMORY_MAP_INVALID_INDEX if not found
+ * Must be called with lock held
+ */
+static int
+table_map_lookup(TableMutationHashTable *map, const char *table_name, uint32 hash)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ int bucket = hash % map->num_buckets;
+ int idx = buckets[bucket];
+
+ while (idx != MEMORY_MAP_INVALID_INDEX)
+ {
+ if (entries[idx].hash == hash &&
+ strcmp(entries[idx].table_name, table_name) == 0)
+ {
+ return idx;
+ }
+ idx = entries[idx].next;
+ }
+
+ return MEMORY_MAP_INVALID_INDEX;
+}
+
+/*
+ * Insert or update a table entry
+ * Must be called with lock held
+ */
+static void
+table_map_insert(TableMutationHashTable *map, const char *table_name,
+ uint32 hash, struct timeval *write_time)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ int bucket = hash % map->num_buckets;
+ int idx;
+
+ /* Check if entry already exists */
+ idx = table_map_lookup(map, table_name, hash);
+ if (idx != MEMORY_MAP_INVALID_INDEX)
+ {
+ /* Update existing entry */
+ entries[idx].last_write_time = *write_time;
+ return;
+ }
+
+ /* Allocate new entry */
+ idx = table_map_alloc_entry(map);
+ if (idx == MEMORY_MAP_INVALID_INDEX)
+ {
+ int b;
+ /* Table is full - evict an entry. This is not the oldest entry: */
+ /* for simplicity, take the head of the first non-empty bucket */
+ for (b = 0; b < map->num_buckets; b++)
+ {
+ if (buckets[b] != MEMORY_MAP_INVALID_INDEX)
+ {
+ int victim = buckets[b];
+ buckets[b] = entries[victim].next;
+ table_map_free_entry(map, victim);
+ idx = table_map_alloc_entry(map);
+ break;
+ }
+ }
+
+ if (idx == MEMORY_MAP_INVALID_INDEX)
+ {
+ ereport(WARNING,
+ (errmsg("memory_map: failed to allocate entry for table %s", table_name)));
+ return;
+ }
+ }
+
+ /* Initialize new entry */
+ strlcpy(entries[idx].table_name, table_name, MEMORY_MAP_TABLE_NAME_LEN);
+ entries[idx].hash = hash;
+ entries[idx].last_write_time = *write_time;
+
+ /* Insert at head of bucket chain */
+ entries[idx].next = buckets[bucket];
+ buckets[bucket] = idx;
+
+ ereport(DEBUG2,
+ (errmsg("memory_map: marked table '%s' as written", table_name)));
+}
+
+/*
+ * Remove expired entries from the table map
+ * Must be called with lock held
+ */
+static void
+table_map_cleanup_expired(TableMutationHashTable *map, uint64 ttl_us)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ struct timeval now;
+ int removed = 0;
+ int b;
+
+ get_current_time(&now);
+
+ for (b = 0; b < map->num_buckets; b++)
+ {
+ int *prev_ptr = &buckets[b];
+ int idx = buckets[b];
+
+ while (idx != MEMORY_MAP_INVALID_INDEX)
+ {
+ int64 elapsed = elapsed_us(&entries[idx].last_write_time, &now);
+
+ if (elapsed > (int64)ttl_us)
+ {
+ /* Entry has expired - remove it */
+ int next = entries[idx].next;
+ *prev_ptr = next;
+ table_map_free_entry(map, idx);
+ idx = next;
+ removed++;
+ }
+ else
+ {
+ prev_ptr = &entries[idx].next;
+ idx = entries[idx].next;
+ }
+ }
+ }
+
+ if (removed > 0)
+ {
+ ereport(DEBUG1,
+ (errmsg("memory_map: cleaned up %d expired table entries", removed)));
+ }
+}
+
+/* ----------------
+ * Query parse cache operations
+ * ----------------
+ */
+
+/*
+ * Initialize query parse cache
+ */
+static void
+query_cache_init(QueryParseCache *cache, int num_buckets, int max_entries)
+{
+ int *buckets;
+ QueryParseEntry *entries;
+ int i;
+
+ cache->num_buckets = num_buckets;
+ cache->max_entries = max_entries;
+ cache->num_entries = 0;
+ cache->free_list_head = 0;
+ cache->lru_head = MEMORY_MAP_INVALID_INDEX;
+ cache->lru_tail = MEMORY_MAP_INVALID_INDEX;
+ cache->lock = 0;
+
+ buckets = QUERY_CACHE_BUCKETS(cache);
+ entries = QUERY_CACHE_ENTRIES(cache);
+
+ /* Initialize all buckets to empty */
+ for (i = 0; i < num_buckets; i++)
+ buckets[i] = MEMORY_MAP_INVALID_INDEX;
+
+ /* Initialize free list */
+ for (i = 0; i < max_entries; i++)
+ {
+ entries[i].in_use = false;
+ entries[i].next = (i < max_entries - 1) ? i + 1 : MEMORY_MAP_INVALID_INDEX;
+ entries[i].lru_prev = MEMORY_MAP_INVALID_INDEX;
+ entries[i].lru_next = MEMORY_MAP_INVALID_INDEX;
+ }
+
+ ereport(DEBUG1,
+ (errmsg("memory_map: initialized query cache with %d buckets, %d max entries",
+ num_buckets, max_entries)));
+}
+
+/*
+ * Move entry to front of LRU list (most recently used)
+ */
+static void
+query_cache_lru_touch(QueryParseCache *cache, int idx)
+{
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+
+ /* Already at head? */
+ if (cache->lru_head == idx)
+ return;
+
+ /* Remove from current position */
+ if (entries[idx].lru_prev != MEMORY_MAP_INVALID_INDEX)
+ entries[entries[idx].lru_prev].lru_next = entries[idx].lru_next;
+ if (entries[idx].lru_next != MEMORY_MAP_INVALID_INDEX)
+ entries[entries[idx].lru_next].lru_prev = entries[idx].lru_prev;
+ if (cache->lru_tail == idx)
+ cache->lru_tail = entries[idx].lru_prev;
+
+ /* Insert at head */
+ entries[idx].lru_prev = MEMORY_MAP_INVALID_INDEX;
+ entries[idx].lru_next = cache->lru_head;
+ if (cache->lru_head != MEMORY_MAP_INVALID_INDEX)
+ entries[cache->lru_head].lru_prev = idx;
+ cache->lru_head = idx;
+ if (cache->lru_tail == MEMORY_MAP_INVALID_INDEX)
+ cache->lru_tail = idx;
+}
+
+/*
+ * Add entry to LRU list (at head)
+ */
+static void
+query_cache_lru_add(QueryParseCache *cache, int idx)
+{
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+
+ entries[idx].lru_prev = MEMORY_MAP_INVALID_INDEX;
+ entries[idx].lru_next = cache->lru_head;
+
+ if (cache->lru_head != MEMORY_MAP_INVALID_INDEX)
+ entries[cache->lru_head].lru_prev = idx;
+
+ cache->lru_head = idx;
+
+ if (cache->lru_tail == MEMORY_MAP_INVALID_INDEX)
+ cache->lru_tail = idx;
+}
+
+/*
+ * Remove entry from LRU list
+ */
+static void
+query_cache_lru_remove(QueryParseCache *cache, int idx)
+{
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+
+ if (entries[idx].lru_prev != MEMORY_MAP_INVALID_INDEX)
+ entries[entries[idx].lru_prev].lru_next = entries[idx].lru_next;
+ else
+ cache->lru_head = entries[idx].lru_next;
+
+ if (entries[idx].lru_next != MEMORY_MAP_INVALID_INDEX)
+ entries[entries[idx].lru_next].lru_prev = entries[idx].lru_prev;
+ else
+ cache->lru_tail = entries[idx].lru_prev;
+
+ entries[idx].lru_prev = MEMORY_MAP_INVALID_INDEX;
+ entries[idx].lru_next = MEMORY_MAP_INVALID_INDEX;
+}
+
+/*
+ * Allocate entry from free list, evicting LRU if necessary
+ */
+static int
+query_cache_alloc_entry(QueryParseCache *cache)
+{
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+ int *buckets = QUERY_CACHE_BUCKETS(cache);
+ int idx;
+
+ if (cache->free_list_head != MEMORY_MAP_INVALID_INDEX)
+ {
+ idx = cache->free_list_head;
+ cache->free_list_head = entries[idx].next;
+ entries[idx].in_use = true;
+ entries[idx].next = MEMORY_MAP_INVALID_INDEX;
+ cache->num_entries++;
+ return idx;
+ }
+
+ /* No free entries - evict LRU */
+ if (cache->lru_tail == MEMORY_MAP_INVALID_INDEX)
+ return MEMORY_MAP_INVALID_INDEX;
+
+ idx = cache->lru_tail;
+
+ /* Remove from hash bucket */
+ int bucket = entries[idx].query_hash % cache->num_buckets;
+ int *prev_ptr = &buckets[bucket];
+ int curr = buckets[bucket];
+
+ while (curr != MEMORY_MAP_INVALID_INDEX)
+ {
+ if (curr == idx)
+ {
+ *prev_ptr = entries[curr].next;
+ break;
+ }
+ prev_ptr = &entries[curr].next;
+ curr = entries[curr].next;
+ }
+
+ /* Remove from LRU list */
+ query_cache_lru_remove(cache, idx);
+
+ /* Reinitialize entry */
+ entries[idx].in_use = true;
+ entries[idx].next = MEMORY_MAP_INVALID_INDEX;
+
+ return idx;
+}
+
+/*
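+ * Look up a query in the cache by its 64-bit hash. Only the hash is
+ * compared, so two queries whose hashes collide share one entry.
+ * Must be called with lock held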
+ */
+static int
+query_cache_lookup(QueryParseCache *cache, uint64 hash)
+{
+ int *buckets = QUERY_CACHE_BUCKETS(cache);
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+ int bucket = hash % cache->num_buckets;
+ int idx = buckets[bucket];
+
+ while (idx != MEMORY_MAP_INVALID_INDEX)
+ {
+ if (entries[idx].query_hash == hash)
+ return idx;
+ idx = entries[idx].next;
+ }
+
+ return MEMORY_MAP_INVALID_INDEX;
+}
+
+/* ----------------
+ * Query normalization
+ * ----------------
+ */
+
+/*
+ * Simple query normalization:
+ * - Strip comments (-- style and C-style block comments)
+ * - Collapse whitespace
+ * - Convert to lowercase (except inside strings)
+ * - Replace literal values with placeholders
+ *
+ * This is a simplified version - Pgpool-II already does this elsewhere,
+ * but we need a standalone version for the memory map feature.
+ */
+static size_t
+normalize_query(const char *query, char *output, size_t output_size)
+{
+ const char *src = query;
+ char *dst = output;
+ char *dst_end = output + output_size - 1;
+ bool in_string = false;
+ char string_char = 0;
+ bool last_was_space = true; /* Start true to skip leading space */
+
+ while (*src && dst < dst_end)
+ {
+ /* Handle string literals */
+ if (in_string)
+ {
+ if (*src == string_char)
+ {
+ if (*(src + 1) == string_char)
+ {
+ /* Escaped quote */
+ src += 2;
+ continue;
+ }
+ in_string = false;
+ *dst++ = '$'; /* Replace string content with placeholder */
+ }
+ src++;
+ continue;
+ }
+
+ /* Check for string start */
+ if (*src == '\'' || *src == '"')
+ {
+ in_string = true;
+ string_char = *src;
+ src++;
+ continue;
+ }
+
+ /* Handle single-line comments */
+ if (*src == '-' && *(src + 1) == '-')
+ {
+ while (*src && *src != '\n')
+ src++;
+ continue;
+ }
+
+ /* Handle multi-line comments */
+ if (*src == '/' && *(src + 1) == '*')
+ {
+ src += 2;
+ while (*src && !(*src == '*' && *(src + 1) == '/'))
+ src++;
+ if (*src)
+ src += 2;
+ continue;
+ }
+
+ /* Handle whitespace */
+ if (*src == ' ' || *src == '\t' || *src == '\n' || *src == '\r')
+ {
+ if (!last_was_space)
+ {
+ *dst++ = ' ';
+ last_was_space = true;
+ }
+ src++;
+ continue;
+ }
+
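+ /*
+ * Handle numeric literals - replace with placeholder. Digits that
+ * continue an identifier (e.g. the "1" in "t1") are not literals and
+ * fall through to the regular character path below, so that queries
+ * on t1 and t2 do not normalize to the same text.
+ */
+ if (((*src >= '0' && *src <= '9') ||
+ (*src == '.' && *(src + 1) >= '0' && *(src + 1) <= '9')) &&
+ (dst == output ||
+ strchr("abcdefghijklmnopqrstuvwxyz0123456789_", *(dst - 1)) == NULL))
+ {
+ while (*src && ((*src >= '0' && *src <= '9') || *src == '.'))
+ src++;
+ if (!last_was_space && dst > output && *(dst - 1) != '$')
+ *dst++ = '$';
+ last_was_space = false;
+ continue;
+ }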
+
+ /* Regular character - convert to lowercase */
+ if (*src >= 'A' && *src <= 'Z')
+ *dst++ = *src + 32;
+ else
+ *dst++ = *src;
+
+ last_was_space = false;
+ src++;
+ }
+
+ /* Remove trailing space */
+ if (dst > output && *(dst - 1) == ' ')
+ dst--;
+
+ *dst = '\0';
+ return dst - output;
+}
+
+/* ----------------
+ * Public API implementation
+ * ----------------
+ */
+
+Size
+pool_memory_map_shmem_size(void)
+{
+ Size size = 0;
+ int table_buckets = pool_config->memory_map_table_buckets;
+ int table_size = pool_config->memory_map_table_size;
+ int query_buckets = pool_config->memory_map_query_buckets;
+ int query_cache_size = pool_config->memory_map_query_cache_size;
+
+ /* Main structure */
+ size += sizeof(MemoryMapShmem);
+
+ /* Table mutation hash table */
+ size += sizeof(TableMutationHashTable);
+ size += table_buckets * sizeof(int); /* buckets array */
+ size += table_size * sizeof(TableMutationEntry); /* entries array */
+
+ /* Query parse cache */
+ size += sizeof(QueryParseCache);
+ size += query_buckets * sizeof(int); /* buckets array */
+ size += query_cache_size * sizeof(QueryParseEntry); /* entries array */
+
+ return size;
+}
+
+void
+pool_memory_map_init(void)
+{
+#ifndef POOL_PRIVATE
+ Size shmem_size;
+ char *shmem_ptr;
+
+ if (!pool_config->memory_map_enabled)
+ {
+ ereport(DEBUG1,
+ (errmsg("memory_map: feature disabled")));
+ return;
+ }
+
+ shmem_size = pool_memory_map_shmem_size();
+
+ /*
+ * Allocate from the main shared memory segment.
+ * Memory is already zeroed by initialize_shared_memory_main_segment().
+ */
+ shmem_ptr = pool_shared_memory_segment_get_chunk(shmem_size);
+ if (shmem_ptr == NULL)
+ {
+ ereport(ERROR,
+ (errmsg("memory_map: failed to allocate %zu bytes of shared memory",
+ shmem_size)));
+ return;
+ }
+
+ /* Set up pointers to structures within shared memory */
+ memory_map_shmem = (MemoryMapShmem *)shmem_ptr;
+ shmem_ptr += sizeof(MemoryMapShmem);
+
+ memory_map_shmem->table_map = (TableMutationHashTable *)shmem_ptr;
+ shmem_ptr += sizeof(TableMutationHashTable);
+ shmem_ptr += pool_config->memory_map_table_buckets * sizeof(int);
+ shmem_ptr += pool_config->memory_map_table_size * sizeof(TableMutationEntry);
+
+ memory_map_shmem->query_cache = (QueryParseCache *)shmem_ptr;
+
+ /* Initialize structures */
+ table_map_init(memory_map_shmem->table_map,
+ pool_config->memory_map_table_buckets,
+ pool_config->memory_map_table_size);
+
+ query_cache_init(memory_map_shmem->query_cache,
+ pool_config->memory_map_query_buckets,
+ pool_config->memory_map_query_cache_size);
+
+ /* Initialize global state */
+ memory_map_shmem->state.initialized = true;
+ memory_map_shmem->state.current_ttl_us = MEMORY_MAP_DEFAULT_TTL_US;
+ get_current_time(&memory_map_shmem->state.ttl_last_updated);
+ memory_map_shmem->state.stats_queries_checked = 0;
+ memory_map_shmem->state.stats_forced_primary = 0;
+ memory_map_shmem->state.stats_allowed_replica = 0;
+
+ ereport(LOG,
+ (errmsg("memory_map: initialized with %zu bytes shared memory",
+ shmem_size)));
+#endif
+}
+
+void
+pool_memory_map_child_init(void)
+{
+ if (!pool_config->memory_map_enabled || memory_map_shmem == NULL)
+ return;
+
+ get_current_time(&process_start_time);
+ cold_start_initialized = true;
+
+ ereport(DEBUG1,
+ (errmsg("memory_map: child initialized, cold start period %d ms",
+ pool_config->memory_map_cold_start_duration)));
+}
+
+bool
+pool_memory_map_in_cold_start(void)
+{
+ struct timeval now;
+ int64 elapsed_ms;
+
+ if (!pool_config->memory_map_enabled || !cold_start_initialized)
+ return false;
+
+ if (pool_config->memory_map_cold_start_duration <= 0)
+ return false;
+
+ get_current_time(&now);
+ elapsed_ms = elapsed_us(&process_start_time, &now) / 1000;
+
+ if (elapsed_ms < pool_config->memory_map_cold_start_duration)
+ {
+ ereport(DEBUG2,
+ (errmsg("memory_map: in cold start (%ld/%d ms)",
+ (long)elapsed_ms, pool_config->memory_map_cold_start_duration)));
+ return true;
+ }
+
+ return false;
+}
+
+bool
+pool_memory_map_table_is_stale(const char *table_name)
+{
+ TableMutationHashTable *map;
+ struct timeval now;
+ uint64 ttl_us;
+ uint32 hash;
+ int idx;
+ bool is_stale = false;
+
+ if (!pool_config->memory_map_enabled || memory_map_shmem == NULL)
+ return false;
+
+ map = memory_map_shmem->table_map;
+ hash = fnv1a_hash_string(table_name);
+
+ spin_lock(&map->lock);
+
+ idx = table_map_lookup(map, table_name, hash);
+ if (idx != MEMORY_MAP_INVALID_INDEX)
+ {
+ TableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ get_current_time(&now);
+ ttl_us = memory_map_shmem->state.current_ttl_us;
+
+ int64 elapsed = elapsed_us(&entries[idx].last_write_time, &now);
+ is_stale = (elapsed < (int64)ttl_us);
+
+ ereport(DEBUG2,
+ (errmsg("memory_map: table '%s' elapsed=%ld us, ttl=%lu us, stale=%d",
+ table_name, (long)elapsed, (unsigned long)ttl_us, is_stale)));
+ }
+
+ spin_unlock(&map->lock);
+
+ /* Update statistics */
+ __sync_fetch_and_add(&memory_map_shmem->state.stats_queries_checked, 1);
+ if (is_stale)
+ __sync_fetch_and_add(&memory_map_shmem->state.stats_forced_primary, 1);
+ else
+ __sync_fetch_and_add(&memory_map_shmem->state.stats_allowed_replica, 1);
+
+ return is_stale;
+}
+
+void
+pool_memory_map_mark_tables_written(const char **table_names, int num_tables)
+{
+ TableMutationHashTable *map;
+ struct timeval now;
+ int i;
+
+ if (!pool_config->memory_map_enabled || memory_map_shmem == NULL)
+ return;
+
+ if (num_tables <= 0 || table_names == NULL)
+ return;
+
+ map = memory_map_shmem->table_map;
+ get_current_time(&now);
+
+ spin_lock(&map->lock);
+
+ /* Clean up expired entries once the table is more than 3/4 full */
+ if (map->num_entries > map->max_entries * 3 / 4)
+ {
+ table_map_cleanup_expired(map, memory_map_shmem->state.current_ttl_us);
+ }
+
+ for (i = 0; i < num_tables; i++)
+ {
+ uint32 hash;
+
+ if (table_names[i] != NULL && table_names[i][0] != '\0')
+ {
+ hash = fnv1a_hash_string(table_names[i]);
+ table_map_insert(map, table_names[i], hash, &now);
+ }
+ }
+
+ spin_unlock(&map->lock);
+}
+
+/*
+ * Convenience function to mark a single table as written
+ */
+void
+pool_memory_map_mark_table_written(const char *table_name)
+{
+ if (table_name != NULL)
+ {
+ const char *tables[1] = { table_name };
+ pool_memory_map_mark_tables_written(tables, 1);
+ }
+}
+
+void
+pool_memory_map_update_ttl(uint64 delay_us)
+{
+ uint64 new_ttl;
+
+ if (!pool_config->memory_map_enabled || memory_map_shmem == NULL)
+ return;
+
+ /* Calculate new TTL: delay * factor, with minimum of default TTL */
+ new_ttl = (uint64)(delay_us * pool_config->memory_map_ttl_factor);
+ if (new_ttl < MEMORY_MAP_DEFAULT_TTL_US)
+ new_ttl = MEMORY_MAP_DEFAULT_TTL_US;
+
+ /* Maximum TTL of 1 hour */
+ if (new_ttl > 3600ULL * 1000000ULL)
+ new_ttl = 3600ULL * 1000000ULL;
+
+ memory_map_shmem->state.current_ttl_us = new_ttl;
+ get_current_time(&memory_map_shmem->state.ttl_last_updated);
+
+ ereport(DEBUG1,
+ (errmsg("memory_map: updated TTL to %lu us (delay=%lu us, factor=%.1f)",
+ (unsigned long)new_ttl, (unsigned long)delay_us,
+ pool_config->memory_map_ttl_factor)));
+}
+
+bool
+pool_memory_map_get_cached_parse(uint64 hash, bool *is_write,
+ char table_names[][MEMORY_MAP_TABLE_NAME_LEN],
+ int *num_tables)
+{
+ QueryParseCache *cache;
+ int idx;
+ bool found = false;
+
+ if (!pool_config->memory_map_enabled || memory_map_shmem == NULL)
+ return false;
+
+ cache = memory_map_shmem->query_cache;
+
+ spin_lock(&cache->lock);
+
+ idx = query_cache_lookup(cache, hash);
+ if (idx != MEMORY_MAP_INVALID_INDEX)
+ {
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+ int i;
+
+ *is_write = entries[idx].is_write;
+ *num_tables = entries[idx].num_tables;
+
+ for (i = 0; i < entries[idx].num_tables && i < MEMORY_MAP_MAX_TABLES_PER_QUERY; i++)
+ {
+ strlcpy(table_names[i], entries[idx].table_names[i], MEMORY_MAP_TABLE_NAME_LEN);
+ }
+
+ /* Move to front of LRU */
+ query_cache_lru_touch(cache, idx);
+ found = true;
+ }
+
+ spin_unlock(&cache->lock);
+
+ return found;
+}
+
+void
+pool_memory_map_cache_parse(uint64 hash, bool is_write,
+ const char table_names[][MEMORY_MAP_TABLE_NAME_LEN],
+ int num_tables)
+{
+ QueryParseCache *cache;
+ int *buckets;
+ QueryParseEntry *entries;
+ int idx;
+ int bucket;
+
+ if (!pool_config->memory_map_enabled || memory_map_shmem == NULL)
+ return;
+
+ cache = memory_map_shmem->query_cache;
+
+ spin_lock(&cache->lock);
+
+ /* Check if already exists */
+ idx = query_cache_lookup(cache, hash);
+ if (idx != MEMORY_MAP_INVALID_INDEX)
+ {
+ spin_unlock(&cache->lock);
+ return;
+ }
+
+ /* Allocate new entry (may evict LRU) */
+ idx = query_cache_alloc_entry(cache);
+ if (idx == MEMORY_MAP_INVALID_INDEX)
+ {
+ spin_unlock(&cache->lock);
+ ereport(WARNING,
+ (errmsg("memory_map: failed to allocate query cache entry")));
+ return;
+ }
+
+ entries = QUERY_CACHE_ENTRIES(cache);
+ buckets = QUERY_CACHE_BUCKETS(cache);
+
+ /* Fill in entry */
+ entries[idx].query_hash = hash;
+ entries[idx].is_write = is_write;
+ entries[idx].num_tables = (num_tables > MEMORY_MAP_MAX_TABLES_PER_QUERY) ?
+ MEMORY_MAP_MAX_TABLES_PER_QUERY : num_tables;
+
+ {
+ int i;
+ for (i = 0; i < entries[idx].num_tables; i++)
+ {
+ strlcpy(entries[idx].table_names[i], table_names[i], MEMORY_MAP_TABLE_NAME_LEN);
+ }
+ }
+
+ /* Insert into hash bucket */
+ bucket = hash % cache->num_buckets;
+ entries[idx].next = buckets[bucket];
+ buckets[bucket] = idx;
+
+ /* Add to LRU list */
+ query_cache_lru_add(cache, idx);
+
+ spin_unlock(&cache->lock);
+}
+
+uint64
+pool_memory_map_normalize_and_hash(const char *query)
+{
+ char normalized[8192];
+ size_t len;
+
+ if (query == NULL || query[0] == '\0')
+ return 0;
+
+ len = normalize_query(query, normalized, sizeof(normalized));
+ if (len == 0)
+ return 0;
+
+ return fnv1a_hash_64(normalized, len);
+}
+
+uint64
+pool_memory_map_get_ttl(void)
+{
+ if (!pool_config->memory_map_enabled || memory_map_shmem == NULL)
+ return MEMORY_MAP_DEFAULT_TTL_US;
+
+ return memory_map_shmem->state.current_ttl_us;
+}
+
+void
+pool_memory_map_get_stats(uint32 *queries_checked,
+ uint32 *forced_primary,
+ uint32 *allowed_replica,
+ uint64 *current_ttl_us)
+{
+ if (!pool_config->memory_map_enabled || memory_map_shmem == NULL)
+ {
+ *queries_checked = 0;
+ *forced_primary = 0;
+ *allowed_replica = 0;
+ *current_ttl_us = 0;
+ return;
+ }
+
+ *queries_checked = memory_map_shmem->state.stats_queries_checked;
+ *forced_primary = memory_map_shmem->state.stats_forced_primary;
+ *allowed_replica = memory_map_shmem->state.stats_allowed_replica;
+ *current_ttl_us = memory_map_shmem->state.current_ttl_us;
+}
--
2.52.0
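
The TTL clamping done by pool_memory_map_update_ttl() above can be illustrated in isolation. The following is a minimal sketch and not part of the patch. The names sketch_compute_ttl_us, SKETCH_DEFAULT_TTL_US and SKETCH_MAX_TTL_US are hypothetical, and the 1 second minimum is an assumption: the real MEMORY_MAP_DEFAULT_TTL_US is defined elsewhere in the patch and its value is not shown in this excerpt. Unlike the patch, the sketch clamps in floating point before converting back to an integer, so a very large delay cannot overflow the cast.

```c
#include <stdint.h>

/* Assumed value: the real MEMORY_MAP_DEFAULT_TTL_US is not shown here. */
#define SKETCH_DEFAULT_TTL_US 1000000ULL            /* assumed: 1 second */
/* Upper bound of 1 hour, matching the limit used in the patch. */
#define SKETCH_MAX_TTL_US     (3600ULL * 1000000ULL)

/*
 * Compute TTL = delay * factor, clamped to
 * [SKETCH_DEFAULT_TTL_US, SKETCH_MAX_TTL_US].
 */
static uint64_t
sketch_compute_ttl_us(uint64_t delay_us, double factor)
{
	double		scaled = (double) delay_us * factor;

	/* Clamp while still in floating point so the cast below is safe. */
	if (scaled < (double) SKETCH_DEFAULT_TTL_US)
		return SKETCH_DEFAULT_TTL_US;
	if (scaled > (double) SKETCH_MAX_TTL_US)
		return SKETCH_MAX_TTL_US;

	return (uint64_t) scaled;
}
```

Under these assumptions, a measured delay of 400 ms with the default factor of 5.0 gives a TTL of 2 seconds, a delay of 100 ms is raised to the assumed 1 second minimum, and any product above one hour is capped at one hour.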
^ permalink raw reply [nested|flat] 44+ messages in thread
* Re: Proposal: Recent mutated table tracking in memory
2026-01-06 11:25 Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
@ 2026-01-14 08:55 ` Nadav Shatz <[email protected]>
2026-01-26 11:02 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-28 05:08 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
0 siblings, 2 replies; 44+ messages in thread
From: Nadav Shatz @ 2026-01-14 08:55 UTC (permalink / raw)
To: [email protected]
Hi all,
Any comments or concerns? Can we merge it if not?
On Tue, Jan 6, 2026 at 1:25 PM Nadav Shatz <[email protected]> wrote:
> Hello,
>
> As initially proposed under "Proposal: recent access based routing for
> primary-replica setups" and then broken into separate tasks - i am adding
> here a patch to implement tracking of latest mutated table, and then using
> the replication lag as a base - deciding where to point queries when query
> load balancing and parsing is enabled.
>
> More details as in the patch:
> Feature: add in-memory table tracking to prevent stale reads from replicas
>
> Implement "memory map" feature that tracks recently-written database
> tables in shared memory to prevent stale reads during replication lag.
> When a write (INSERT/UPDATE/DELETE) occurs on a table, that table is
> marked as "dirty" for a configurable TTL period. Any SELECT on a dirty
> table within the TTL window is routed to primary instead of replica.
>
> Key features:
> - Shared memory hash table for tracking table mutations with TTL
> - Query parse cache with LRU eviction for performance
> - Cold start protection (routes all queries to primary initially)
> - Automatic TTL calculation: replication_delay × configurable factor
> - Per-table staleness tracking with microsecond precision
>
> New configuration parameters:
> - memory_map_enabled: Enable/disable the feature (default: off)
> - memory_map_ttl_factor: TTL multiplier for replication delay (default:
> 5.0)
> - memory_map_cold_start_duration: Cold start period in ms (default: 2000)
> - memory_map_table_buckets: Hash buckets for table map (default: 1024)
> - memory_map_table_size: Max tracked tables (default: 2048)
> - memory_map_query_buckets: Hash buckets for query cache (default: 2048)
> - memory_map_query_cache_size: Max cached queries (default: 10000)
>
> Patch applies properly and tests pass.
>
> Open to all feedback - thank you!
>
> --
> Nadav Shatz
> Tailor Brands | CTO
>
--
Nadav Shatz
Tailor Brands | CTO
* Re: Proposal: Recent mutated table tracking in memory
2026-01-06 11:25 Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-14 08:55 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
@ 2026-01-26 11:02 ` Nadav Shatz <[email protected]>
1 sibling, 0 replies; 44+ messages in thread
From: Nadav Shatz @ 2026-01-26 11:02 UTC (permalink / raw)
To: [email protected]
Doing another ping, sorry for that; it's been 20 days.
I would appreciate any feedback. Thank you.
On Wed, Jan 14, 2026 at 10:55 AM Nadav Shatz <[email protected]> wrote:
> Hi all,
>
> Any comments or concerns? Can we merge it if not?
>
> On Tue, Jan 6, 2026 at 1:25 PM Nadav Shatz <[email protected]> wrote:
>
>> Hello,
>>
>> As initially proposed under "Proposal: recent access based routing for
>> primary-replica setups" and then broken into separate tasks - i am adding
>> here a patch to implement tracking of latest mutated table, and then using
>> the replication lag as a base - deciding where to point queries when query
>> load balancing and parsing is enabled.
>>
>> More details as in the patch:
>> Feature: add in-memory table tracking to prevent stale reads from replicas
>>
>> Implement "memory map" feature that tracks recently-written database
>> tables in shared memory to prevent stale reads during replication lag.
>> When a write (INSERT/UPDATE/DELETE) occurs on a table, that table is
>> marked as "dirty" for a configurable TTL period. Any SELECT on a dirty
>> table within the TTL window is routed to primary instead of replica.
>>
>> Key features:
>> - Shared memory hash table for tracking table mutations with TTL
>> - Query parse cache with LRU eviction for performance
>> - Cold start protection (routes all queries to primary initially)
>> - Automatic TTL calculation: replication_delay × configurable factor
>> - Per-table staleness tracking with microsecond precision
>>
>> New configuration parameters:
>> - memory_map_enabled: Enable/disable the feature (default: off)
>> - memory_map_ttl_factor: TTL multiplier for replication delay (default:
>> 5.0)
>> - memory_map_cold_start_duration: Cold start period in ms (default: 2000)
>> - memory_map_table_buckets: Hash buckets for table map (default: 1024)
>> - memory_map_table_size: Max tracked tables (default: 2048)
>> - memory_map_query_buckets: Hash buckets for query cache (default: 2048)
>> - memory_map_query_cache_size: Max cached queries (default: 10000)
>>
>> Patch applies properly and tests pass.
>>
>> Open to all feedback - thank you!
>>
>> --
>> Nadav Shatz
>> Tailor Brands | CTO
>>
>
>
> --
> Nadav Shatz
> Tailor Brands | CTO
>
--
Nadav Shatz
Tailor Brands | CTO
* Re: Proposal: Recent mutated table tracking in memory
2026-01-06 11:25 Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-14 08:55 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
@ 2026-01-28 05:08 ` Tatsuo Ishii <[email protected]>
2026-01-28 05:37 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-29 08:28 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
1 sibling, 2 replies; 44+ messages in thread
From: Tatsuo Ishii @ 2026-01-28 05:08 UTC (permalink / raw)
To: [email protected]; +Cc: [email protected]
Hi Nadav,
Sorry for the late reply. I just saw your email now. Will check and reply
back soon.
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
> Hi all,
>
> Any comments or concerns? Can we merge it if not?
>
> On Tue, Jan 6, 2026 at 1:25 PM Nadav Shatz <[email protected]> wrote:
>
>> Hello,
>>
>> As initially proposed under "Proposal: recent access based routing for
>> primary-replica setups" and then broken into separate tasks - i am adding
>> here a patch to implement tracking of latest mutated table, and then using
>> the replication lag as a base - deciding where to point queries when query
>> load balancing and parsing is enabled.
>>
>> More details as in the patch:
>> Feature: add in-memory table tracking to prevent stale reads from replicas
>>
>> Implement "memory map" feature that tracks recently-written database
>> tables in shared memory to prevent stale reads during replication lag.
>> When a write (INSERT/UPDATE/DELETE) occurs on a table, that table is
>> marked as "dirty" for a configurable TTL period. Any SELECT on a dirty
>> table within the TTL window is routed to primary instead of replica.
>>
>> Key features:
>> - Shared memory hash table for tracking table mutations with TTL
>> - Query parse cache with LRU eviction for performance
>> - Cold start protection (routes all queries to primary initially)
>> - Automatic TTL calculation: replication_delay × configurable factor
>> - Per-table staleness tracking with microsecond precision
>>
>> New configuration parameters:
>> - memory_map_enabled: Enable/disable the feature (default: off)
>> - memory_map_ttl_factor: TTL multiplier for replication delay (default:
>> 5.0)
>> - memory_map_cold_start_duration: Cold start period in ms (default: 2000)
>> - memory_map_table_buckets: Hash buckets for table map (default: 1024)
>> - memory_map_table_size: Max tracked tables (default: 2048)
>> - memory_map_query_buckets: Hash buckets for query cache (default: 2048)
>> - memory_map_query_cache_size: Max cached queries (default: 10000)
>>
>> Patch applies properly and tests pass.
>>
>> Open to all feedback - thank you!
>>
>> --
>> Nadav Shatz
>> Tailor Brands | CTO
>>
>
>
> --
> Nadav Shatz
> Tailor Brands | CTO
* Re: Proposal: Recent mutated table tracking in memory
2026-01-06 11:25 Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-14 08:55 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-28 05:08 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
@ 2026-01-28 05:37 ` Nadav Shatz <[email protected]>
1 sibling, 0 replies; 44+ messages in thread
From: Nadav Shatz @ 2026-01-28 05:37 UTC (permalink / raw)
To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]
Thank you Tatsuo!
Nadav Shatz
Tailor Brands | CTO
On Wed, Jan 28, 2026 at 7:08 AM Tatsuo Ishii <[email protected]> wrote:
> Hi Nadav,
>
> Sorry for the late reply. I just saw your email now. Will check and reply
> back soon.
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>
> > Hi all,
> >
> > Any comments or concerns? Can we merge it if not?
> >
> > On Tue, Jan 6, 2026 at 1:25 PM Nadav Shatz <[email protected]>
> wrote:
> >
> >> Hello,
> >>
> >> As initially proposed under "Proposal: recent access based routing for
> >> primary-replica setups" and then broken into separate tasks - i am
> adding
> >> here a patch to implement tracking of latest mutated table, and then
> using
> >> the replication lag as a base - deciding where to point queries when
> query
> >> load balancing and parsing is enabled.
> >>
> >> More details as in the patch:
> >> Feature: add in-memory table tracking to prevent stale reads from
> replicas
> >>
> >> Implement "memory map" feature that tracks recently-written database
> >> tables in shared memory to prevent stale reads during replication lag.
> >> When a write (INSERT/UPDATE/DELETE) occurs on a table, that table is
> >> marked as "dirty" for a configurable TTL period. Any SELECT on a dirty
> >> table within the TTL window is routed to primary instead of replica.
> >>
> >> Key features:
> >> - Shared memory hash table for tracking table mutations with TTL
> >> - Query parse cache with LRU eviction for performance
> >> - Cold start protection (routes all queries to primary initially)
> >> - Automatic TTL calculation: replication_delay × configurable factor
> >> - Per-table staleness tracking with microsecond precision
> >>
> >> New configuration parameters:
> >> - memory_map_enabled: Enable/disable the feature (default: off)
> >> - memory_map_ttl_factor: TTL multiplier for replication delay (default:
> >> 5.0)
> >> - memory_map_cold_start_duration: Cold start period in ms (default:
> 2000)
> >> - memory_map_table_buckets: Hash buckets for table map (default: 1024)
> >> - memory_map_table_size: Max tracked tables (default: 2048)
> >> - memory_map_query_buckets: Hash buckets for query cache (default: 2048)
> >> - memory_map_query_cache_size: Max cached queries (default: 10000)
> >>
> >> Patch applies properly and tests pass.
> >>
> >> Open to all feedback - thank you!
> >>
> >> --
> >> Nadav Shatz
> >> Tailor Brands | CTO
> >>
> >
> >
> > --
> > Nadav Shatz
> > Tailor Brands | CTO
>
* Re: Proposal: Recent mutated table tracking in memory
2026-01-06 11:25 Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-14 08:55 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-28 05:08 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
@ 2026-01-29 08:28 ` Tatsuo Ishii <[email protected]>
2026-01-29 08:54 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
1 sibling, 1 reply; 44+ messages in thread
From: Tatsuo Ishii @ 2026-01-29 08:28 UTC (permalink / raw)
To: [email protected]; +Cc: [email protected]
> Hi Nadav,
>
> Sorry for the late reply. I just saw your email now. Will check and reply
> back soon.
Unfortunately, your patch failed to apply to the current master branch.
$ git apply ~/mutated_table.patch
error: patch failed: src/streaming_replication/pool_worker_child.c:1032
error: src/streaming_replication/pool_worker_child.c: patch does not apply
I tried with the patch command and it seems to have failed here:
--- src/streaming_replication/pool_worker_child.c
+++ src/streaming_replication/pool_worker_child.c
@@ -1034,6 +1036,10 @@ check_replication_time_lag_with_cmd(void)
bkinfo->standby_delay = delay;
bkinfo->standby_delay_by_time = true;
+ /* Track maximum delay for memory map TTL calculation */
+ if (delay > max_delay_us)
+ max_delay_us = delay;
+
/* Log delay if necessary */
delay_threshold_by_time = pool_config->delay_threshold_by_time * 1000; /* threshold is in
* milliseconds, convert
Need rebase?
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
* Re: Proposal: Recent mutated table tracking in memory
2026-01-06 11:25 Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-14 08:55 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-28 05:08 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-29 08:28 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
@ 2026-01-29 08:54 ` Nadav Shatz <[email protected]>
2026-01-30 08:09 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
0 siblings, 1 reply; 44+ messages in thread
From: Nadav Shatz @ 2026-01-29 08:54 UTC (permalink / raw)
To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]
Yes indeed, please find attached.
On Thu, Jan 29, 2026 at 10:28 AM Tatsuo Ishii <[email protected]> wrote:
> > Hi Nadav,
> >
> > Sorry for the late reply. I just saw your email now. Will check and reply
> > back soon.
>
> Unfortunately, your patch failed to apply to the current master branch.
>
> $ git apply ~/mutated_table.patch
> error: patch failed: src/streaming_replication/pool_worker_child.c:1032
> error: src/streaming_replication/pool_worker_child.c: patch does not apply
>
> I tried with the patch command and it seems to have failed here:
>
> --- src/streaming_replication/pool_worker_child.c
> +++ src/streaming_replication/pool_worker_child.c
> @@ -1034,6 +1036,10 @@ check_replication_time_lag_with_cmd(void)
> bkinfo->standby_delay = delay;
> bkinfo->standby_delay_by_time = true;
>
> + /* Track maximum delay for memory map TTL
> calculation */
> + if (delay > max_delay_us)
> + max_delay_us = delay;
> +
> /* Log delay if necessary */
> delay_threshold_by_time =
> pool_config->delay_threshold_by_time * 1000; /* threshold is in
>
>
> * milliseconds, convert
>
> Need rebase?
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>
--
Nadav Shatz
Tailor Brands | CTO
Attachments:
[application/octet-stream] mutated_table.patch (67.1K, 3-mutated_table.patch)
From a316059ed23761dfbebe0e4a775611485f3d5d8d Mon Sep 17 00:00:00 2001
From: Nadav Shatz <[email protected]>
Date: Tue, 6 Jan 2026 12:41:50 +0200
Subject: [PATCH] Feature: add in-memory table tracking to prevent stale reads
from replicas
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Implement "memory map" feature that tracks recently-written database
tables in shared memory to prevent stale reads during replication lag.
When a write (INSERT/UPDATE/DELETE) occurs on a table, that table is
marked as "dirty" for a configurable TTL period. Any SELECT on a dirty
table within the TTL window is routed to primary instead of replica.
Key features:
- Shared memory hash table for tracking table mutations with TTL
- Query parse cache with LRU eviction for performance
- Cold start protection (routes all queries to primary initially)
- Automatic TTL calculation: replication_delay × configurable factor
- Per-table staleness tracking with microsecond precision
New configuration parameters:
- memory_map_enabled: Enable/disable the feature (default: off)
- memory_map_ttl_factor: TTL multiplier for replication delay (default: 5.0)
- memory_map_cold_start_duration: Cold start period in ms (default: 2000)
- memory_map_table_buckets: Hash buckets for table map (default: 1024)
- memory_map_table_size: Max tracked tables (default: 2048)
- memory_map_query_buckets: Hash buckets for query cache (default: 2048)
- memory_map_query_cache_size: Max cached queries (default: 10000)
diff --git a/doc/src/sgml/loadbalance.sgml b/doc/src/sgml/loadbalance.sgml
index ee19fabebab2210cd4abe59a711a036ac0ac8943..bdc929ee55b94899ffdd90880a741cfbac051aa4 100644
--- a/doc/src/sgml/loadbalance.sgml
+++ b/doc/src/sgml/loadbalance.sgml
@@ -1193,4 +1193,210 @@ dml_adaptive_object_relationship_list = 'table_1:table_2'
</variablelist>
</sect2>
+
+ <sect2 id="runtime-config-memory-map">
+ <title>Memory Map Configuration (Lagless Replica Reads)</title>
+
+ <para>
+ These parameters configure the memory map feature, which tracks recently written tables
+ to prevent stale reads from replica nodes during replication lag. This implements the
+ "lagless" architecture pattern for distributed systems with read replicas.
+ </para>
+
+ <para>
+ When a table is modified (INSERT/UPDATE/DELETE), it is marked as "stale" for a TTL period
+ (<literal>replication_delay * memory_map_ttl_factor</literal>). Any SELECT queries on stale tables are routed
+ to the primary node instead of replicas, ensuring read-after-write consistency.
+ </para>
+
+ <para>
+ This feature requires <xref linkend="guc-replication-delay-source-cmd"> to be configured
+ for monitoring replication delay from replicas.
+ </para>
+
+ <warning>
+ <para>
+ Enabling the memory map feature increases shared memory consumption. With default settings,
+ the feature requires approximately 6.7 MB of shared memory (0.3 MB for table tracking + 6.4 MB for query cache).
+ Memory usage scales with configuration parameters:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ Table tracking: <literal>memory_map_table_size * 160 bytes</literal> (default: 2048 * 160 = ~320 KB)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Query cache: <literal>memory_map_query_cache_size * 640 bytes</literal> (default: 10000 * 640 = ~6.4 MB)
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ For high-traffic systems with large cache sizes (e.g., <literal>memory_map_query_cache_size = 100000</literal>),
+ memory usage can reach 64 MB or more. Consider your system's available shared memory when enabling this feature.
+ </para>
+ </warning>
+
+ <variablelist>
+
+ <varlistentry id="guc-memory-map-enabled" xreflabel="memory_map_enabled">
+ <term><varname>memory_map_enabled</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>memory_map_enabled</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables in-memory tracking of recently written tables. When enabled, tables are marked
+ as stale after write operations, and reads are routed to primary until the TTL expires.
+ </para>
+ <para>
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ Default is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-memory-map-ttl-factor" xreflabel="memory_map_ttl_factor">
+ <term><varname>memory_map_ttl_factor</varname> (<type>floating point</type>)
+ <indexterm>
+ <primary><varname>memory_map_ttl_factor</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Multiplier for calculating the TTL: <literal>TTL = replication_delay * memory_map_ttl_factor</literal>.
+ Higher values provide more safety margin but may reduce read replica utilization.
+ </para>
+ <para>
+ Valid range: 1.0-100.0. Default is <literal>5.0</literal>.
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-memory-map-cold-start-duration" xreflabel="memory_map_cold_start_duration">
+ <term><varname>memory_map_cold_start_duration</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>memory_map_cold_start_duration</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Duration in milliseconds to route all queries to primary after a child process starts.
+ This prevents stale reads when a new connection is established before the memory map
+ is populated with recent write history.
+ </para>
+ <para>
+ Valid range: 0-60000 ms. Default is <literal>2000</literal> (2 seconds).
+ Set to 0 to disable cold start behavior.
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-memory-map-table-buckets" xreflabel="memory_map_table_buckets">
+ <term><varname>memory_map_table_buckets</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>memory_map_table_buckets</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Number of hash buckets for the table mutation tracking hash table.
+ Higher values reduce hash collisions and improve lookup performance.
+ </para>
+ <para>
+ Valid range: 64-65536. Default is <literal>1024</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-memory-map-table-size" xreflabel="memory_map_table_size">
+ <term><varname>memory_map_table_size</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>memory_map_table_size</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Maximum number of tables that can be tracked simultaneously in the memory map.
+ When full, oldest entries are evicted using a simple eviction strategy.
+ </para>
+ <para>
+ Valid range: 128-131072. Default is <literal>2048</literal>.
+ Memory usage: approximately 160 bytes per entry.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-memory-map-query-buckets" xreflabel="memory_map_query_buckets">
+ <term><varname>memory_map_query_buckets</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>memory_map_query_buckets</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Number of hash buckets for the query parse cache. The cache stores normalized
+ query strings mapped to their table dependencies to avoid repeated parsing.
+ </para>
+ <para>
+ Valid range: 64-65536. Default is <literal>2048</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-memory-map-query-cache-size" xreflabel="memory_map_query_cache_size">
+ <term><varname>memory_map_query_cache_size</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>memory_map_query_cache_size</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Maximum number of query parse results to cache. Uses LRU eviction when full.
+ Larger caches reduce parsing overhead but consume more shared memory.
+ </para>
+ <para>
+ Valid range: 100-1000000. Default is <literal>10000</literal>.
+ Memory usage: approximately 640 bytes per entry (~6.4 MB for default, ~64 MB for 100000 entries).
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ <sect3 id="runtime-config-memory-map-example">
+ <title>Memory Map Configuration Example</title>
+ <para>
+ To enable memory map with replication delay monitoring:
+ </para>
+ <programlisting>
+# Enable memory map feature
+memory_map_enabled = on
+memory_map_ttl_factor = 5.0
+memory_map_cold_start_duration = 2000
+
+# Configure external replication delay monitoring
+replication_delay_source_cmd = '/path/to/get-replication-delay.sh'
+replication_delay_source_timeout = 10
+
+# Adjust cache sizes based on workload (increases memory usage)
+memory_map_table_size = 4096 # Track up to 4096 tables (~640 KB)
+memory_map_query_cache_size = 50000 # Cache 50k queries (~32 MB)
+ </programlisting>
+ <para>
+ Total shared memory required for the above configuration: approximately 33 MB (32 MB query cache + 0.6 MB table map + overhead).
+ The default configuration (10000 query cache entries, 2048 tables) requires approximately 6.7 MB.
+ </para>
+ </sect3>
+
+ </sect2>
+
</sect1>
diff --git a/src/Makefile.am b/src/Makefile.am
index 4678ab53055e828a37b6477801640aff17ff84a7..51896ae07771fc00382ab965eaf3807c8b5f3d94 100644
--- a/src/Makefile.am
+++ b/src/Makefile.am
@@ -35,6 +35,7 @@ pgpool_SOURCES = main/main.c \
rewrite/pool_timestamp.c \
rewrite/pool_lobj.c \
utils/pool_select_walker.c \
+ utils/pool_memory_map.c \
utils/strlcpy.c \
utils/psprintf.c \
utils/pool_params.c \
diff --git a/src/config/pool_config_variables.c b/src/config/pool_config_variables.c
index 68abb7f41cb96d856c824a148842748bfb7a4d12..d9a28e7ec3369ff799cb37c37c0cd05075327606 100644
--- a/src/config/pool_config_variables.c
+++ b/src/config/pool_config_variables.c
@@ -783,6 +783,16 @@ static struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"memory_map_enabled", CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "Enable in-memory tracking of recently written tables to avoid stale reads from replicas",
+ CONFIG_VAR_TYPE_BOOL, false, 0
+ },
+ &g_pool_config.memory_map_enabled,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"auto_failback", CFGCXT_RELOAD, FAILOVER_CONFIG,
"Enables nodes automatically reattach, when detached node continue streaming replication.",
@@ -1757,6 +1767,17 @@ static struct config_int_array ConfigureNamesIntArray[] =
static struct config_double ConfigureNamesDouble[] =
{
+ {
+ {"memory_map_ttl_factor", CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "TTL multiplier for memory map (TTL = replication_delay * factor)",
+ CONFIG_VAR_TYPE_DOUBLE, false, 0
+ },
+ &g_pool_config.memory_map_ttl_factor,
+ 5.0, /* boot value: 5x replication delay */
+ 1.0, 100.0, /* min, max */
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
EMPTY_CONFIG_DOUBLE
};
@@ -2355,6 +2376,61 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"memory_map_cold_start_duration", CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "Duration in milliseconds to force queries to primary after child process starts.",
+ CONFIG_VAR_TYPE_INT, false, GUC_UNIT_MS
+ },
+ &g_pool_config.memory_map_cold_start_duration,
+ 2000, /* 2 seconds */
+ 0, 60000, /* 0 to 60 seconds */
+ NULL, NULL, NULL
+ },
+
+ {
+ {"memory_map_table_buckets", CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Number of hash buckets for table mutation map.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.memory_map_table_buckets,
+ 1024,
+ 64, 65536,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"memory_map_table_size", CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Maximum number of entries in table mutation map.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.memory_map_table_size,
+ 2048,
+ 128, 131072,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"memory_map_query_buckets", CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Number of hash buckets for query parse cache.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.memory_map_query_buckets,
+ 2048,
+ 64, 65536,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"memory_map_query_cache_size", CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Maximum number of entries in query parse cache.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.memory_map_query_cache_size,
+ 10000,
+ 100, 1000000,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
EMPTY_CONFIG_INT
};
diff --git a/src/context/pool_query_context.c b/src/context/pool_query_context.c
index 7cf9813eb7d58678bc86a0aaa38bd3c6445b6687..dfa620decdec7d83e5d0198cd711884d315e02af 100644
--- a/src/context/pool_query_context.c
+++ b/src/context/pool_query_context.c
@@ -29,6 +29,7 @@
#include "utils/statistics.h"
#include "utils/pool_select_walker.h"
#include "utils/pool_stream.h"
+#include "utils/pool_memory_map.h"
#include "context/pool_session_context.h"
#include "context/pool_query_context.h"
#include "parser/nodes.h"
@@ -2139,6 +2140,92 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
{
pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
}
+ /*
+ * Check memory map for recently written tables.
+ * If in cold start or any table was recently written,
+ * route to primary to avoid stale reads.
+ */
+ else if (pool_config->memory_map_enabled)
+ {
+ bool force_primary = false;
+
+ /* During cold start, route everything to primary */
+ if (pool_memory_map_in_cold_start())
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load balance because of memory map cold start"),
+ errdetail("destination = PRIMARY for query= \"%s\"", query)));
+ force_primary = true;
+ }
+ else
+ {
+ /* Extract table names and check if any are stale */
+ SelectContext ctx;
+ int num_oids;
+ int i;
+
+ memset(&ctx, 0, sizeof(ctx));
+ num_oids = pool_extract_table_oids_from_select_stmt(node, &ctx);
+
+ for (i = 0; i < num_oids; i++)
+ {
+ if (pool_memory_map_table_is_stale(ctx.table_names[i]))
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load balance because table \"%s\" was recently written",
+ ctx.table_names[i]),
+ errdetail("destination = PRIMARY for query= \"%s\"", query)));
+ force_primary = true;
+ break;
+ }
+ }
+ }
+
+ if (force_primary)
+ {
+ pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
+ }
+ else
+ {
+ /* Proceed with load balancing */
+ if (pool_config->statement_level_load_balance)
+ {
+ session_context->load_balance_node_id = select_load_balancing_node();
+ }
+
+ /*
+ * If streaming replication delay is too large and
+ * prefer_lower_delay_standby is true, elect the least
+ * delayed standby as the new load balance node;
+ * otherwise send the query to the primary.
+ */
+ if (STREAM && check_replication_delay(session_context->load_balance_node_id))
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load balance because of too much replication delay"),
+ errdetail("destination = %d for query= \"%s\"", dest, query)));
+
+ if (pool_config->prefer_lower_delay_standby)
+ {
+ int new_load_balancing_node = select_load_balancing_node();
+
+ session_context->load_balance_node_id = new_load_balancing_node;
+ session_context->query_context->load_balance_node_id = session_context->load_balance_node_id;
+ pool_set_node_to_be_sent(query_context, session_context->query_context->load_balance_node_id);
+ }
+ else
+ {
+ pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
+ }
+ }
+ else
+ {
+ session_context->query_context->load_balance_node_id = session_context->load_balance_node_id;
+ pool_set_node_to_be_sent(query_context,
+ session_context->query_context->load_balance_node_id);
+ }
+ }
+ }
else
{
if (pool_config->statement_level_load_balance)
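The branch added to where_to_send_main_replica() above amounts to a three-way decision: cold start goes to primary, any recently written table goes to primary, everything else is load balanced. Below is a minimal standalone sketch of that rule, not the patch's code. The names route_select(), Route, StaleCheck and only_t1_is_stale() are hypothetical and exist only for illustration; the patch itself calls pool_memory_map_in_cold_start() and pool_memory_map_table_is_stale().

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical routing outcome; the patch calls pool_set_node_to_be_sent(). */
typedef enum { ROUTE_PRIMARY, ROUTE_LOAD_BALANCE } Route;

/* Hypothetical predicate standing in for pool_memory_map_table_is_stale(). */
typedef bool (*StaleCheck)(const char *table_name);

/*
 * Decide where a SELECT goes:
 *   1. cold start          -> primary
 *   2. any table is stale  -> primary
 *   3. otherwise           -> normal load balancing
 */
static Route
route_select(bool in_cold_start, const char *const *tables, size_t num_tables,
             StaleCheck is_stale)
{
    size_t i;

    if (in_cold_start)
        return ROUTE_PRIMARY;

    for (i = 0; i < num_tables; i++)
    {
        if (is_stale(tables[i]))
            return ROUTE_PRIMARY;   /* one stale table is enough */
    }
    return ROUTE_LOAD_BALANCE;
}

/* Example predicate: pretend only "t1" was just written. */
static bool
only_t1_is_stale(const char *table_name)
{
    return strcmp(table_name, "t1") == 0;
}
```

With this predicate, a join of t1 and t4 is routed to the primary while a query on t2 alone is load balanced, which is the behavior the regression test in this patch checks through the log messages.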
diff --git a/src/include/pool_config.h b/src/include/pool_config.h
index 741de6cc5fc3368f813d6b6efa68eb7f8a79506b..9675c1b65d9bae83c6412c1f1f3399364932221f 100644
--- a/src/include/pool_config.h
+++ b/src/include/pool_config.h
@@ -365,6 +365,16 @@ typedef struct
* replication check */
char *replication_delay_source_cmd; /* external command for replication delay */
int replication_delay_source_timeout; /* timeout for external command in seconds */
+
+ /* Memory map configuration for tracking recently written tables */
+ bool memory_map_enabled; /* Enable in-memory table tracking */
+ double memory_map_ttl_factor; /* TTL multiplier for replication delay */
+ int memory_map_cold_start_duration; /* Cold start duration in ms */
+ int memory_map_table_buckets; /* Number of hash buckets for table map */
+ int memory_map_table_size; /* Max entries in table map */
+ int memory_map_query_buckets; /* Number of hash buckets for query cache */
+ int memory_map_query_cache_size; /* Max entries in query cache */
+
char *failover_command; /* execute command when failover happens */
char *follow_primary_command; /* execute command when failover is
* ended */
diff --git a/src/include/utils/pool_memory_map.h b/src/include/utils/pool_memory_map.h
new file mode 100644
index 0000000000000000000000000000000000000000..511d7a45e7dbd417b1e49b9211fb994f29af1a08
--- /dev/null
+++ b/src/include/utils/pool_memory_map.h
@@ -0,0 +1,236 @@
+/* -*-pgsql-c-*- */
+/*
+ * pgpool: a language independent connection pool server for PostgreSQL
+ * written by Tatsuo Ishii
+ *
+ * Copyright (c) 2003-2026 PgPool Global Development Group
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose and without fee is hereby
+ * granted, provided that the above copyright notice appear in all
+ * copies and that both that copyright notice and this permission
+ * notice appear in supporting documentation, and that the name of the
+ * author not be used in advertising or publicity pertaining to
+ * distribution of the software without specific, written prior
+ * permission. The author makes no representations about the
+ * suitability of this software for any purpose. It is provided "as
+ * is" without express or implied warranty.
+ *
+ * pool_memory_map.h: In-memory tracking of recently written tables
+ * to avoid stale reads from replicas during replication lag
+ */
+
+#ifndef POOL_MEMORY_MAP_H
+#define POOL_MEMORY_MAP_H
+
+#include "pool.h"
+#include <sys/time.h>
+
+/*
+ * Maximum table name length including schema: "schema"."table".
+ * NAMEDATALEN * 2 + 4 covers both names, four quotes, the dot and the NUL.
+ */
+#define MEMORY_MAP_TABLE_NAME_LEN (NAMEDATALEN * 2 + 4)
+
+/*
+ * Maximum number of tables we track per query
+ */
+#define MEMORY_MAP_MAX_TABLES_PER_QUERY 8
+
+/*
+ * Invalid index marker for linked lists
+ */
+#define MEMORY_MAP_INVALID_INDEX (-1)
+
+/*
+ * Default TTL in microseconds (100ms) used when replication delay is unknown
+ */
+#define MEMORY_MAP_DEFAULT_TTL_US (100 * 1000)
+
+/*
+ * Entry in the table mutation hash table
+ */
+typedef struct TableMutationEntry
+{
+ char table_name[MEMORY_MAP_TABLE_NAME_LEN]; /* "schema"."table" */
+ struct timeval last_write_time; /* When the table was last written */
+ uint32 hash; /* Pre-computed hash value */
+ int next; /* Next entry in collision chain (-1 if none) */
+ bool in_use; /* Is this entry in use? */
+} TableMutationEntry;
+
+/*
+ * Header for the table mutation hash table in shared memory
+ */
+typedef struct TableMutationHashTable
+{
+ int num_buckets; /* Number of hash buckets */
+ int max_entries; /* Maximum entries allowed */
+ int num_entries; /* Current number of entries */
+ int free_list_head; /* Head of free entry list */
+ volatile uint32 lock; /* Spinlock serializing access across processes */
+ /* Flexible array members follow in shared memory:
+ * int buckets[num_buckets];
+ * TableMutationEntry entries[max_entries];
+ */
+} TableMutationHashTable;
+
+/*
+ * Entry in the query parse cache
+ */
+typedef struct QueryParseEntry
+{
+ uint64 query_hash; /* Hash of normalized query */
+ bool is_write; /* True if INSERT/UPDATE/DELETE */
+ int num_tables; /* Number of tables in query */
+ char table_names[MEMORY_MAP_MAX_TABLES_PER_QUERY][MEMORY_MAP_TABLE_NAME_LEN];
+ int next; /* Next entry in collision chain */
+ int lru_prev; /* Previous in LRU list */
+ int lru_next; /* Next in LRU list */
+ bool in_use; /* Is this entry in use? */
+} QueryParseEntry;
+
+/*
+ * Header for the query parse cache in shared memory
+ */
+typedef struct QueryParseCache
+{
+ int num_buckets; /* Number of hash buckets */
+ int max_entries; /* Maximum entries allowed */
+ int num_entries; /* Current number of entries */
+ int free_list_head; /* Head of free entry list */
+ int lru_head; /* Most recently used */
+ int lru_tail; /* Least recently used */
+ volatile uint32 lock; /* Spinlock serializing access across processes */
+ /* Flexible array members follow in shared memory:
+ * int buckets[num_buckets];
+ * QueryParseEntry entries[max_entries];
+ */
+} QueryParseCache;
+
+/*
+ * Global state for memory map feature
+ */
+typedef struct MemoryMapState
+{
+ bool initialized; /* Has shared memory been initialized? */
+ uint64 current_ttl_us; /* Current TTL in microseconds */
+ struct timeval ttl_last_updated; /* When TTL was last updated */
+ volatile uint32 stats_queries_checked; /* Number of queries checked */
+ volatile uint32 stats_forced_primary; /* Queries forced to primary */
+ volatile uint32 stats_allowed_replica; /* Queries allowed to replica */
+} MemoryMapState;
+
+/*
+ * Main shared memory structure containing all components
+ */
+typedef struct MemoryMapShmem
+{
+ MemoryMapState state;
+ TableMutationHashTable *table_map;
+ QueryParseCache *query_cache;
+} MemoryMapShmem;
+
+/* ----------------
+ * Public API functions
+ * ----------------
+ */
+
+/*
+ * Initialize shared memory structures for memory map.
+ * Called from pgpool_main.c after pool_init_pool_info().
+ */
+extern void pool_memory_map_init(void);
+
+/*
+ * Initialize per-child process state for memory map.
+ * Called from child.c when a new child process starts.
+ * Sets up cold start tracking.
+ */
+extern void pool_memory_map_child_init(void);
+
+/*
+ * Check if the child process is in cold start period.
+ * During cold start, all queries are routed to primary.
+ * Returns true if in cold start, false otherwise.
+ */
+extern bool pool_memory_map_in_cold_start(void);
+
+/*
+ * Check if a table was recently written to (is "stale").
+ * If stale, reads from this table should go to primary.
+ * Returns true if table is stale (recently written), false otherwise.
+ */
+extern bool pool_memory_map_table_is_stale(const char *table_name);
+
+/*
+ * Mark tables as recently written.
+ * Called after INSERT/UPDATE/DELETE queries complete.
+ * table_names: array of table names
+ * num_tables: number of tables in array
+ */
+extern void pool_memory_map_mark_tables_written(const char **table_names, int num_tables);
+
+/*
+ * Convenience function to mark a single table as written.
+ * table_name: fully qualified table name
+ */
+extern void pool_memory_map_mark_table_written(const char *table_name);
+
+/*
+ * Update the TTL based on current replication delay.
+ * Called from pool_worker_child.c when replication delay is updated.
+ * delay_us: replication delay in microseconds
+ */
+extern void pool_memory_map_update_ttl(uint64 delay_us);
+
+/*
+ * Look up cached parse result for a query.
+ * hash: hash of normalized query
+ * is_write: output - true if query is a write
+ * table_names: output - array to fill with table names
+ * num_tables: output - number of tables found
+ * Returns true if found in cache, false otherwise.
+ */
+extern bool pool_memory_map_get_cached_parse(uint64 hash, bool *is_write,
+ char table_names[][MEMORY_MAP_TABLE_NAME_LEN],
+ int *num_tables);
+
+/*
+ * Cache a parse result for a query.
+ * hash: hash of normalized query
+ * is_write: true if query is a write
+ * table_names: array of table names
+ * num_tables: number of tables
+ */
+extern void pool_memory_map_cache_parse(uint64 hash, bool is_write,
+ const char table_names[][MEMORY_MAP_TABLE_NAME_LEN],
+ int num_tables);
+
+/*
+ * Normalize a query and compute its hash.
+ * Strips comments, normalizes whitespace and literals.
+ * query: input SQL query string
+ * Returns: 64-bit hash of normalized query
+ */
+extern uint64 pool_memory_map_normalize_and_hash(const char *query);
+
+/*
+ * Get the current TTL in microseconds.
+ */
+extern uint64 pool_memory_map_get_ttl(void);
+
+/*
+ * Calculate required shared memory size for memory map.
+ */
+extern Size pool_memory_map_shmem_size(void);
+
+/*
+ * Get memory map statistics for monitoring.
+ */
+extern void pool_memory_map_get_stats(uint32 *queries_checked,
+ uint32 *forced_primary,
+ uint32 *allowed_replica,
+ uint64 *current_ttl_us);
+
+#endif /* POOL_MEMORY_MAP_H */
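The header describes a table as "stale" while it has been written within the current TTL. A minimal sketch of that comparison follows, assuming the check is a plain age-versus-TTL test on the entry's last_write_time. The helper name written_within_ttl() is hypothetical, and treating a backwards clock step as stale is an assumption made here for safety, not something the header states.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/time.h>

/* Microseconds elapsed from start to end (negative if the clock stepped back). */
static int64_t
elapsed_us(const struct timeval *start, const struct timeval *end)
{
    return ((int64_t) (end->tv_sec - start->tv_sec) * 1000000) +
           (int64_t) (end->tv_usec - start->tv_usec);
}

/*
 * Hypothetical helper: a table counts as "stale" (recently written) while
 * no more than ttl_us microseconds have passed since its last recorded write.
 */
static bool
written_within_ttl(const struct timeval *last_write,
                   const struct timeval *now, uint64_t ttl_us)
{
    int64_t age = elapsed_us(last_write, now);

    /* Assumption: a negative age (clock stepped back) is treated as stale. */
    if (age < 0)
        return true;
    return (uint64_t) age <= ttl_us;
}
```

Because gettimeofday() reports wall-clock time, the same timestamps are comparable across pgpool child processes, at the cost of being sensitive to clock adjustments.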
diff --git a/src/main/pgpool_main.c b/src/main/pgpool_main.c
index fa05e15e7ac435e072298063f918c70aa4e5680c..5dded3fe3dd1d8d91edf2e0f901ff6cbd01fca04 100644
--- a/src/main/pgpool_main.c
+++ b/src/main/pgpool_main.c
@@ -57,6 +57,7 @@
#include "auth/pool_passwd.h"
#include "auth/pool_hba.h"
#include "query_cache/pool_memqcache.h"
+#include "utils/pool_memory_map.h"
#include "watchdog/wd_internal_commands.h"
#include "watchdog/wd_lifecheck.h"
#include "watchdog/watchdog.h"
@@ -3068,6 +3069,12 @@ initialize_shared_mem_objects(bool clear_memcache_oidmaps)
elog(DEBUG1, "watchdog: %zu bytes requested for shared memory", MAXALIGN(wd_ipc_get_shared_mem_size()));
}
+ if (pool_config->memory_map_enabled)
+ {
+ size += MAXALIGN(pool_memory_map_shmem_size());
+ elog(DEBUG1, "memory_map: %zu bytes requested for shared memory", MAXALIGN(pool_memory_map_shmem_size()));
+ }
+
initialize_shared_memory_main_segment(size);
/* Move the backend descriptors to shared memory */
@@ -3184,6 +3191,12 @@ initialize_shared_mem_objects(bool clear_memcache_oidmaps)
wd_ipc_initialize_data();
}
+ /* Initialize memory map for tracking recently written tables */
+ if (pool_config->memory_map_enabled)
+ {
+ pool_memory_map_init();
+ }
+
}
/*
diff --git a/src/protocol/CommandComplete.c b/src/protocol/CommandComplete.c
index a3b8f0ea194ffecc79e58566be80562a46eb75ab..9b0681ca46ac2602d3f541ad3119770d422fb0c3 100644
--- a/src/protocol/CommandComplete.c
+++ b/src/protocol/CommandComplete.c
@@ -38,6 +38,8 @@
#include "utils/palloc.h"
#include "utils/memutils.h"
#include "utils/pool_stream.h"
+#include "utils/pool_memory_map.h"
+#include "utils/pool_select_walker.h"
static int extract_ntuples(char *message);
static POOL_STATUS handle_mismatch_tuples(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend, char *packet, int packetlen, bool command_complete);
@@ -304,6 +306,38 @@ handle_query_context(POOL_CONNECTION_POOL *backend)
node = session_context->query_context->parse_tree;
+ /*
+ * Track table writes for memory map feature.
+ * Mark tables as written when INSERT/UPDATE/DELETE completes.
+ */
+ if (pool_config->memory_map_enabled)
+ {
+ char *table_name = NULL;
+
+ if (IsA(node, InsertStmt))
+ {
+ InsertStmt *stmt = (InsertStmt *) node;
+ table_name = make_table_name_from_rangevar(stmt->relation);
+ }
+ else if (IsA(node, UpdateStmt))
+ {
+ UpdateStmt *stmt = (UpdateStmt *) node;
+ table_name = make_table_name_from_rangevar(stmt->relation);
+ }
+ else if (IsA(node, DeleteStmt))
+ {
+ DeleteStmt *stmt = (DeleteStmt *) node;
+ table_name = make_table_name_from_rangevar(stmt->relation);
+ }
+
+ if (table_name != NULL)
+ {
+ pool_memory_map_mark_table_written(table_name);
+ ereport(DEBUG1,
+ (errmsg("memory map: marked table \"%s\" as written", table_name)));
+ }
+ }
+
if (IsA(node, PrepareStmt))
{
if (session_context->uncompleted_message)
diff --git a/src/protocol/child.c b/src/protocol/child.c
index c34f057281be62feaf39db1bb605062f56dc398c..07ee58c6a48dcd3ef6d79970e08a6f77b8924e1d 100644
--- a/src/protocol/child.c
+++ b/src/protocol/child.c
@@ -57,6 +57,7 @@
#include "utils/elog.h"
#include "utils/ps_status.h"
#include "utils/timestamp.h"
+#include "utils/pool_memory_map.h"
#include "context/pool_process_context.h"
#include "context/pool_session_context.h"
@@ -213,6 +214,12 @@ do_child(int *fds)
/* Initialize per process context */
pool_init_process_context();
+ /* Initialize memory map child state for cold start tracking */
+ if (pool_config->memory_map_enabled)
+ {
+ pool_memory_map_child_init();
+ }
+
/* initialize connection pool */
if (pool_init_cp())
{
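pool_memory_map_child_init() records when the child process started so that later queries can be tested against memory_map_cold_start_duration. A minimal sketch of such a test is shown below. The helper in_cold_start() and its parameters are hypothetical; only the millisecond unit and the "0 disables cold start" rule are taken from the patch's configuration comments.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/time.h>

/*
 * Hypothetical helper: true while fewer than duration_ms milliseconds have
 * passed since the child process started. A duration of 0 disables the
 * cold start window entirely.
 */
static bool
in_cold_start(const struct timeval *process_start,
              const struct timeval *now, int duration_ms)
{
    int64_t elapsed_us;

    if (duration_ms <= 0)
        return false;

    elapsed_us = (int64_t) (now->tv_sec - process_start->tv_sec) * 1000000 +
                 (int64_t) (now->tv_usec - process_start->tv_usec);
    return elapsed_us < (int64_t) duration_ms * 1000;
}
```

Since the start time is kept per process rather than in shared memory, every newly forked child goes through its own cold start window.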
diff --git a/src/sample/pgpool.conf.sample-stream b/src/sample/pgpool.conf.sample-stream
index 454fdb9e5d1fd65437b6a67f12ab62658ea08f49..a245d58bf3339913602143da1b83b964fe5dcaeb 100644
--- a/src/sample/pgpool.conf.sample-stream
+++ b/src/sample/pgpool.conf.sample-stream
@@ -499,6 +499,51 @@ backend_clustering_mode = streaming_replication
#statement_level_load_balance = off
# Enables statement level load balancing
+# - Memory Map (Lagless Read Replica) -
+ # WARNING: Enabling this feature increases shared memory usage
+ # Default settings require ~11 MB shared memory
+ # (0.3 MB table tracking + ~10.9 MB query cache)
+
+#memory_map_enabled = off
+ # Enable in-memory tracking of recently written tables
+ # to prevent stale reads from replicas during replication lag
+ # (change requires reload)
+
+#memory_map_ttl_factor = 5.0
+ # TTL multiplier: TTL = replication_delay * factor
+ # Higher values provide more safety margin
+ # Range: 1.0-100.0 (default: 5.0)
+ # (change requires reload)
+
+#memory_map_cold_start_duration = 2000
+ # Duration in milliseconds to route all queries to primary
+ # after child process starts (cold start period)
+ # Range: 0-60000 ms (default: 2000 ms = 2 seconds)
+ # Set to 0 to disable cold start behavior
+ # (change requires reload)
+
+#memory_map_table_buckets = 1024
+ # Number of hash buckets for table mutation tracking
+ # Higher values reduce hash collisions
+ # Range: 64-65536 (default: 1024)
+ # (change requires restart)
+
+#memory_map_table_size = 2048
+ # Maximum number of tables to track simultaneously
+ # Range: 128-131072 (default: 2048)
+ # (change requires restart)
+
+#memory_map_query_buckets = 2048
+ # Number of hash buckets for query parse cache
+ # Range: 64-65536 (default: 2048)
+ # (change requires restart)
+
+#memory_map_query_cache_size = 10000
+ # Maximum number of query parse results to cache
+ # Range: 100-1000000 (default: 10000)
+ # Memory usage: ~1.1 kB per entry with NAMEDATALEN = 64 (~10.9 MB default, ~109 MB for 100000)
+ # (change requires restart)
+
#------------------------------------------------------------------------------
# STREAMING REPLICATION MODE
#------------------------------------------------------------------------------
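The shared memory cost of the query parse cache follows from the entry layout in pool_memory_map.h: a bucket array of ints followed by a fixed array of entries. The sketch below estimates it under stated assumptions. NAMEDATALEN is assumed to be 64, QueryParseEntrySketch is a field-for-field copy of QueryParseEntry written here for illustration, and query_cache_bytes() is a hypothetical helper, not the patch's pool_memory_map_shmem_size().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NAMEDATALEN 64                        /* assumption: the usual value */
#define TABLE_NAME_LEN (NAMEDATALEN * 2 + 4)  /* mirrors MEMORY_MAP_TABLE_NAME_LEN */
#define MAX_TABLES_PER_QUERY 8                /* mirrors MEMORY_MAP_MAX_TABLES_PER_QUERY */

/* Field-for-field copy of QueryParseEntry, for size estimation only. */
typedef struct
{
    uint64_t query_hash;
    bool     is_write;
    int      num_tables;
    char     table_names[MAX_TABLES_PER_QUERY][TABLE_NAME_LEN];
    int      next;
    int      lru_prev;
    int      lru_next;
    bool     in_use;
} QueryParseEntrySketch;

/*
 * Hypothetical estimate: bucket array plus entry array. The fixed-size
 * header is ignored because it is negligible next to the entries.
 */
static size_t
query_cache_bytes(size_t num_buckets, size_t max_entries)
{
    return num_buckets * sizeof(int) +
           max_entries * sizeof(QueryParseEntrySketch);
}
```

Under these assumptions the table_names array alone takes 8 * 132 = 1056 bytes per entry, so memory_map_query_cache_size is the setting that dominates the footprint.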
diff --git a/src/streaming_replication/pool_worker_child.c b/src/streaming_replication/pool_worker_child.c
index 311b638658e66ebb56162ad9fa4392315b2df64e..ca0c468b4f56a715a0ae773fdefe5104440e3860 100644
--- a/src/streaming_replication/pool_worker_child.c
+++ b/src/streaming_replication/pool_worker_child.c
@@ -58,6 +58,7 @@
#include "utils/pool_ip.h"
#include "utils/ps_status.h"
#include "utils/pool_stream.h"
+#include "utils/pool_memory_map.h"
#include "context/pool_process_context.h"
#include "context/pool_session_context.h"
@@ -695,6 +696,7 @@ check_replication_time_lag_with_cmd(void)
double delay_ms;
uint64 delay;
uint64 delay_threshold_by_time;
+ uint64 max_delay_us = 0; /* Track maximum delay for memory map */
int token_count = 0;
int primary_node_id;
int save_errno;
@@ -1003,6 +1005,10 @@ check_replication_time_lag_with_cmd(void)
bkinfo->standby_delay = delay;
bkinfo->standby_delay_by_time = true;
+ /* Track maximum delay for memory map TTL calculation */
+ if (delay > max_delay_us)
+ max_delay_us = delay;
+
/*
* Log delay if necessary. threshold is in milliseconds, convert
* to microseconds.
@@ -1021,6 +1027,10 @@ check_replication_time_lag_with_cmd(void)
token = strtok_r(NULL, " \t\n", &saveptr);
}
+ /* Update memory map TTL based on maximum observed delay */
+ if (pool_config->memory_map_enabled && max_delay_us > 0)
+ pool_memory_map_update_ttl(max_delay_us);
+
}
PG_CATCH();
{
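The worker hunks above track the largest standby delay seen in one check and pass it to pool_memory_map_update_ttl(). A minimal sketch of the resulting TTL arithmetic (TTL = replication delay * memory_map_ttl_factor) is given below. The helper ttl_from_delays() is hypothetical, and returning the 100 ms default when every delay is zero is an assumption of this sketch; the patch simply skips the update in that case.

```c
#include <stdint.h>

#define DEFAULT_TTL_US (100 * 1000)   /* mirrors MEMORY_MAP_DEFAULT_TTL_US */

/*
 * Hypothetical helper: TTL = largest observed standby delay * factor.
 * Assumption: fall back to the default TTL when no delay was observed.
 */
static uint64_t
ttl_from_delays(const uint64_t *delays_us, int num_standbys, double factor)
{
    uint64_t max_delay_us = 0;
    int      i;

    for (i = 0; i < num_standbys; i++)
    {
        if (delays_us[i] > max_delay_us)
            max_delay_us = delays_us[i];
    }

    if (max_delay_us == 0)
        return DEFAULT_TTL_US;

    return (uint64_t) ((double) max_delay_us * factor);
}
```

For example, standbys lagging by 20 ms, 50 ms and 10 ms with the default factor of 5.0 give a TTL of 250 ms, so the slowest standby determines how long a written table stays pinned to the primary.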
diff --git a/src/test/regression/tests/045.memory_map/test.sh b/src/test/regression/tests/045.memory_map/test.sh
new file mode 100755
index 0000000000000000000000000000000000000000..ce05418262664e5133e2ffd478c7ae622b062cc7
--- /dev/null
+++ b/src/test/regression/tests/045.memory_map/test.sh
@@ -0,0 +1,196 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for memory map feature (in-memory table tracking).
+# Tests routing of queries based on recently written tables.
+#
+source $TESTLIBS
+TESTDIR=testdir
+PSQL=$PGBIN/psql
+PSQLOPTS="-a -q -X"
+PGPOOLBIN=$PGPOOL_INSTALL_DIR/bin
+export PGDATABASE=test
+
+# Only run in streaming replication mode since that's the target use case
+for mode in s
+do
+ rm -fr $TESTDIR
+ mkdir $TESTDIR
+ cd $TESTDIR
+
+ # Create test environment with 2 nodes
+ echo -n "creating test environment..."
+ $PGPOOL_SETUP -m $mode -n 2 || exit 1
+ echo "done."
+
+ source ./bashrc.ports
+
+ # Configure memory map feature
+ echo "memory_map_enabled = on" >> etc/pgpool.conf
+ echo "memory_map_ttl_factor = 5.0" >> etc/pgpool.conf
+ echo "memory_map_cold_start_duration = 2000" >> etc/pgpool.conf
+
+ # Configure weights so we can distinguish routing
+ # Backend 0 (primary) weight=0, Backend 1 (standby) weight=1
+ # This means load balanced queries go to node 1 by default
+ echo "backend_weight0 = 0" >> etc/pgpool.conf
+ echo "backend_weight1 = 1" >> etc/pgpool.conf
+
+ # Enable debug logging to see routing decisions
+ echo "log_min_messages = debug1" >> etc/pgpool.conf
+
+ ./startall
+
+ export PGPORT=$PGPOOL_PORT
+
+ wait_for_pgpool_startup
+
+ # Create test tables
+ $PSQL test <<EOF
+CREATE TABLE t1(i INTEGER);
+CREATE TABLE t2(i INTEGER);
+CREATE TABLE t3(i INTEGER);
+EOF
+
+ echo "=== Test 1: Cold Start Routing ==="
+ # During cold start, all queries should go to primary
+ # Restart pgpool to trigger cold start
+ ./shutdownall
+ ./startall
+ wait_for_pgpool_startup
+
+ # Immediately query - should go to primary due to cold start
+ $PSQL test -c "SELECT 'cold_start_test' as marker, * FROM t1;" > /dev/null 2>&1
+
+ # Check log for cold start message
+ if grep -q "could not load balance because of memory map cold start" log/pgpool.log; then
+ echo "Test 1 PASSED: Cold start routing works"
+ else
+ echo "Test 1 FAILED: Cold start routing not detected"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 2: Wait for cold start to end ==="
+ # Wait for cold start period to end (2 seconds)
+ sleep 3
+
+ # Clear the log
+ > log/pgpool.log
+
+ # Now a clean table query should load balance (go to node 1)
+ $PSQL test -c "SELECT 'after_cold_start' as marker, * FROM t3;" > /dev/null 2>&1
+
+ # After cold start, queries to clean tables should load balance
+ # Check that it did NOT get forced to primary due to memory map
+ if grep -q "could not load balance because of memory map cold start" log/pgpool.log; then
+ echo "Test 2 FAILED: Still in cold start after waiting"
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 2 PASSED: Cold start ended correctly"
+
+ echo "=== Test 3: Write-then-Read Routing ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Write to t1
+ $PSQL test -c "INSERT INTO t1 VALUES (1);" > /dev/null 2>&1
+
+ # Immediately read from t1 - should go to primary due to recent write
+ $PSQL test -c "SELECT 'write_read_test' as marker, * FROM t1;" > /dev/null 2>&1
+
+ # Check log for table staleness message
+ if grep -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 3 PASSED: Write-then-read routing works"
+ else
+ echo "Test 3 FAILED: Table staleness not detected after write"
+ # Show relevant log entries for debugging
+ grep -i "memory" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 4: Clean Table Still Load Balances ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Read from t2 (never written to) - should load balance
+ $PSQL test -c "SELECT 'clean_table_test' as marker, * FROM t2;" > /dev/null 2>&1
+
+ # Should NOT see memory map blocking message for t2
+ if grep -q "could not load balance because table.*t2.*was recently written" log/pgpool.log; then
+ echo "Test 4 FAILED: Clean table incorrectly marked as stale"
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 4 PASSED: Clean tables still load balance"
+
+ echo "=== Test 5: UPDATE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Update t2
+ $PSQL test -c "UPDATE t2 SET i = 999 WHERE i = 0;" > /dev/null 2>&1
+
+ # Immediately read from t2 - should go to primary
+ $PSQL test -c "SELECT 'update_test' as marker, * FROM t2;" > /dev/null 2>&1
+
+ if grep -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 5 PASSED: UPDATE marks table as stale"
+ else
+ echo "Test 5 FAILED: UPDATE did not mark table as stale"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 6: DELETE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Delete from t3
+ $PSQL test -c "DELETE FROM t3 WHERE i = 0;" > /dev/null 2>&1
+
+ # Immediately read from t3 - should go to primary
+ $PSQL test -c "SELECT 'delete_test' as marker, * FROM t3;" > /dev/null 2>&1
+
+ if grep -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 6 PASSED: DELETE marks table as stale"
+ else
+ echo "Test 6 FAILED: DELETE did not mark table as stale"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 7: Multi-Table Query with One Stale Table ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create a new clean table
+ $PSQL test -c "CREATE TABLE t4(i INTEGER);" > /dev/null 2>&1
+
+ # Wait a bit for TTL to expire on other tables if factor is low
+ sleep 1
+
+ # Write to t1 only
+ $PSQL test -c "INSERT INTO t1 VALUES (100);" > /dev/null 2>&1
+
+ # Query joining t1 and t4 - should go to primary because t1 is stale
+ $PSQL test -c "SELECT 'multi_table_test' as marker FROM t1, t4;" > /dev/null 2>&1
+
+ if grep -q "could not load balance because table.*t1.*was recently written" log/pgpool.log; then
+ echo "Test 7 PASSED: Multi-table query routes to primary when one table is stale"
+ else
+ echo "Test 7 FAILED: Multi-table staleness not detected"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo ""
+ echo "=== All Memory Map Tests PASSED ==="
+
+ ./shutdownall
+
+ cd ..
+done
+
+exit 0
diff --git a/src/utils/pool_memory_map.c b/src/utils/pool_memory_map.c
new file mode 100644
index 0000000000000000000000000000000000000000..3f00ec1e2afef6518532804391633175fd351811
--- /dev/null
+++ b/src/utils/pool_memory_map.c
@@ -0,0 +1,1076 @@
+/* -*-pgsql-c-*- */
+/*
+ * pgpool: a language independent connection pool server for PostgreSQL
+ * written by Tatsuo Ishii
+ *
+ * Copyright (c) 2003-2026 PgPool Global Development Group
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose and without fee is hereby
+ * granted, provided that the above copyright notice appear in all
+ * copies and that both that copyright notice and this permission
+ * notice appear in supporting documentation, and that the name of the
+ * author not be used in advertising or publicity pertaining to
+ * distribution of the software without specific, written prior
+ * permission. The author makes no representations about the
+ * suitability of this software for any purpose. It is provided "as
+ * is" without express or implied warranty.
+ *
+ * pool_memory_map.c: In-memory tracking of recently written tables
+ * to avoid stale reads from replicas during replication lag
+ *
+ * Based on the "lagless" architecture from Tailor Brands:
+ * https://medium.com/tailor-tech/using-database-read-replicas-in-distributed-systems-d80eaf6bbf8a
+ */
+
+#include <string.h>
+#include <stdlib.h>
+#include <sys/time.h>
+#include <unistd.h>
+
+#include "pool.h"
+#include "pool_config.h"
+#include "utils/pool_memory_map.h"
+#include "utils/elog.h"
+#include "utils/palloc.h"
+
+/* ----------------
+ * Local variables
+ * ----------------
+ */
+
+/* Pointer to shared memory structure */
+static MemoryMapShmem *memory_map_shmem = NULL;
+
+/* Per-process cold start tracking (not in shared memory) */
+static struct timeval process_start_time;
+static bool cold_start_initialized = false;
+
+/* ----------------
+ * Helper macros for accessing flexible arrays in shared memory
+ * ----------------
+ */
+
+/* Get pointer to bucket array in table map */
+#define TABLE_MAP_BUCKETS(map) \
+ ((int *)((char *)(map) + sizeof(TableMutationHashTable)))
+
+/* Get pointer to entry array in table map */
+#define TABLE_MAP_ENTRIES(map) \
+ ((TableMutationEntry *)((char *)(map) + sizeof(TableMutationHashTable) + \
+ (map)->num_buckets * sizeof(int)))
+
+/* Get pointer to bucket array in query cache */
+#define QUERY_CACHE_BUCKETS(cache) \
+ ((int *)((char *)(cache) + sizeof(QueryParseCache)))
+
+/* Get pointer to entry array in query cache */
+#define QUERY_CACHE_ENTRIES(cache) \
+ ((QueryParseEntry *)((char *)(cache) + sizeof(QueryParseCache) + \
+ (cache)->num_buckets * sizeof(int)))
+
+/* ----------------
+ * Spinlock operations using atomic test-and-set (GCC __sync builtins)
+ * ----------------
+ */
+
+static inline void
+spin_lock(volatile uint32 *lock)
+{
+ while (__sync_lock_test_and_set(lock, 1))
+ {
+ /* Busy-wait on plain reads until the lock looks free, then retry */
+ while (*lock)
+ ;
+ }
+}
+
+static inline void
+spin_unlock(volatile uint32 *lock)
+{
+ __sync_lock_release(lock);
+}
+
+/* ----------------
+ * Hash functions
+ * ----------------
+ */
+
+/*
+ * FNV-1a hash for strings
+ */
+static uint32
+fnv1a_hash_string(const char *str)
+{
+ uint32 hash = 2166136261u; /* FNV offset basis */
+
+ while (*str)
+ {
+ hash ^= (uint8)*str++;
+ hash *= 16777619u; /* FNV prime */
+ }
+
+ return hash;
+}
+
+/*
+ * 64-bit FNV-1a hash of a byte buffer
+ */
+static uint64
+fnv1a_hash_64(const char *str, size_t len)
+{
+ uint64 hash = 14695981039346656037ULL; /* FNV offset basis for 64-bit */
+ size_t i;
+
+ for (i = 0; i < len; i++)
+ {
+ hash ^= (uint8)str[i];
+ hash *= 1099511628211ULL; /* FNV prime for 64-bit */
+ }
+
+ return hash;
+}
+
+/* ----------------
+ * Time utilities
+ * ----------------
+ */
+
+/*
+ * Get elapsed time in microseconds between two timevals
+ */
+static int64
+elapsed_us(struct timeval *start, struct timeval *end)
+{
+ return ((int64)(end->tv_sec - start->tv_sec) * 1000000) +
+ (end->tv_usec - start->tv_usec);
+}
+
+/*
+ * Get current time
+ */
+static void
+get_current_time(struct timeval *tv)
+{
+ gettimeofday(tv, NULL);
+}
+
+/* ----------------
+ * Table mutation hash table operations
+ * ----------------
+ */
+
+/*
+ * Initialize table mutation hash table
+ */
+static void
+table_map_init(TableMutationHashTable *map, int num_buckets, int max_entries)
+{
+ int *buckets;
+ TableMutationEntry *entries;
+ int i;
+
+ map->num_buckets = num_buckets;
+ map->max_entries = max_entries;
+ map->num_entries = 0;
+ map->free_list_head = 0;
+ map->lock = 0;
+
+ buckets = TABLE_MAP_BUCKETS(map);
+ entries = TABLE_MAP_ENTRIES(map);
+
+ /* Initialize all buckets to empty */
+ for (i = 0; i < num_buckets; i++)
+ buckets[i] = MEMORY_MAP_INVALID_INDEX;
+
+ /* Initialize free list - chain all entries */
+ for (i = 0; i < max_entries; i++)
+ {
+ entries[i].in_use = false;
+ entries[i].next = (i < max_entries - 1) ? i + 1 : MEMORY_MAP_INVALID_INDEX;
+ }
+
+ ereport(DEBUG1,
+ (errmsg("memory_map: initialized table map with %d buckets, %d max entries",
+ num_buckets, max_entries)));
+}
+
+/*
+ * Allocate an entry from the free list
+ */
+static int
+table_map_alloc_entry(TableMutationHashTable *map)
+{
+ TableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ int idx;
+
+ if (map->free_list_head == MEMORY_MAP_INVALID_INDEX)
+ return MEMORY_MAP_INVALID_INDEX;
+
+ idx = map->free_list_head;
+ map->free_list_head = entries[idx].next;
+ entries[idx].in_use = true;
+ entries[idx].next = MEMORY_MAP_INVALID_INDEX;
+ map->num_entries++;
+
+ return idx;
+}
+
+/*
+ * Free an entry back to the free list
+ */
+static void
+table_map_free_entry(TableMutationHashTable *map, int idx)
+{
+ TableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+
+ entries[idx].in_use = false;
+ entries[idx].next = map->free_list_head;
+ map->free_list_head = idx;
+ map->num_entries--;
+}
+
+/*
+ * Look up a table in the hash table
+ * Returns entry index or MEMORY_MAP_INVALID_INDEX if not found
+ * Must be called with lock held
+ */
+static int
+table_map_lookup(TableMutationHashTable *map, const char *table_name, uint32 hash)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ int bucket = hash % map->num_buckets;
+ int idx = buckets[bucket];
+
+ while (idx != MEMORY_MAP_INVALID_INDEX)
+ {
+ if (entries[idx].hash == hash &&
+ strcmp(entries[idx].table_name, table_name) == 0)
+ {
+ return idx;
+ }
+ idx = entries[idx].next;
+ }
+
+ return MEMORY_MAP_INVALID_INDEX;
+}
+
+/*
+ * Insert or update a table entry
+ * Must be called with lock held
+ */
+static void
+table_map_insert(TableMutationHashTable *map, const char *table_name,
+ uint32 hash, struct timeval *write_time)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ int bucket = hash % map->num_buckets;
+ int idx;
+
+ /* Check if entry already exists */
+ idx = table_map_lookup(map, table_name, hash);
+ if (idx != MEMORY_MAP_INVALID_INDEX)
+ {
+ /* Update existing entry */
+ entries[idx].last_write_time = *write_time;
+ return;
+ }
+
+ /* Allocate new entry */
+ idx = table_map_alloc_entry(map);
+ if (idx == MEMORY_MAP_INVALID_INDEX)
+ {
+ int b;
+ /* Table is full - evict an entry to make room */
+ /* For simplicity, take the head of the first non-empty bucket (not necessarily the oldest) */
+ for (b = 0; b < map->num_buckets; b++)
+ {
+ if (buckets[b] != MEMORY_MAP_INVALID_INDEX)
+ {
+ int victim = buckets[b];
+ buckets[b] = entries[victim].next;
+ table_map_free_entry(map, victim);
+ idx = table_map_alloc_entry(map);
+ break;
+ }
+ }
+
+ if (idx == MEMORY_MAP_INVALID_INDEX)
+ {
+ ereport(WARNING,
+ (errmsg("memory_map: failed to allocate entry for table %s", table_name)));
+ return;
+ }
+ }
+
+ /* Initialize new entry */
+ strlcpy(entries[idx].table_name, table_name, MEMORY_MAP_TABLE_NAME_LEN);
+ entries[idx].hash = hash;
+ entries[idx].last_write_time = *write_time;
+
+ /* Insert at head of bucket chain */
+ entries[idx].next = buckets[bucket];
+ buckets[bucket] = idx;
+
+ ereport(DEBUG2,
+ (errmsg("memory_map: marked table '%s' as written", table_name)));
+}
+
+/*
+ * Remove expired entries from the table map
+ * Must be called with lock held
+ */
+static void
+table_map_cleanup_expired(TableMutationHashTable *map, uint64 ttl_us)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ struct timeval now;
+ int removed = 0;
+ int b;
+
+ get_current_time(&now);
+
+ for (b = 0; b < map->num_buckets; b++)
+ {
+ int *prev_ptr = &buckets[b];
+ int idx = buckets[b];
+
+ while (idx != MEMORY_MAP_INVALID_INDEX)
+ {
+ int64 elapsed = elapsed_us(&entries[idx].last_write_time, &now);
+
+ if (elapsed > (int64)ttl_us)
+ {
+ /* Entry has expired - remove it */
+ int next = entries[idx].next;
+ *prev_ptr = next;
+ table_map_free_entry(map, idx);
+ idx = next;
+ removed++;
+ }
+ else
+ {
+ prev_ptr = &entries[idx].next;
+ idx = entries[idx].next;
+ }
+ }
+ }
+
+ if (removed > 0)
+ {
+ ereport(DEBUG1,
+ (errmsg("memory_map: cleaned up %d expired table entries", removed)));
+ }
+}
+
+/* ----------------
+ * Query parse cache operations
+ * ----------------
+ */
+
+/*
+ * Initialize query parse cache
+ */
+static void
+query_cache_init(QueryParseCache *cache, int num_buckets, int max_entries)
+{
+ int *buckets;
+ QueryParseEntry *entries;
+ int i;
+
+ cache->num_buckets = num_buckets;
+ cache->max_entries = max_entries;
+ cache->num_entries = 0;
+ cache->free_list_head = 0;
+ cache->lru_head = MEMORY_MAP_INVALID_INDEX;
+ cache->lru_tail = MEMORY_MAP_INVALID_INDEX;
+ cache->lock = 0;
+
+ buckets = QUERY_CACHE_BUCKETS(cache);
+ entries = QUERY_CACHE_ENTRIES(cache);
+
+ /* Initialize all buckets to empty */
+ for (i = 0; i < num_buckets; i++)
+ buckets[i] = MEMORY_MAP_INVALID_INDEX;
+
+ /* Initialize free list */
+ for (i = 0; i < max_entries; i++)
+ {
+ entries[i].in_use = false;
+ entries[i].next = (i < max_entries - 1) ? i + 1 : MEMORY_MAP_INVALID_INDEX;
+ entries[i].lru_prev = MEMORY_MAP_INVALID_INDEX;
+ entries[i].lru_next = MEMORY_MAP_INVALID_INDEX;
+ }
+
+ ereport(DEBUG1,
+ (errmsg("memory_map: initialized query cache with %d buckets, %d max entries",
+ num_buckets, max_entries)));
+}
+
+/*
+ * Move entry to front of LRU list (most recently used)
+ */
+static void
+query_cache_lru_touch(QueryParseCache *cache, int idx)
+{
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+
+ /* Already at head? */
+ if (cache->lru_head == idx)
+ return;
+
+ /* Remove from current position */
+ if (entries[idx].lru_prev != MEMORY_MAP_INVALID_INDEX)
+ entries[entries[idx].lru_prev].lru_next = entries[idx].lru_next;
+ if (entries[idx].lru_next != MEMORY_MAP_INVALID_INDEX)
+ entries[entries[idx].lru_next].lru_prev = entries[idx].lru_prev;
+ if (cache->lru_tail == idx)
+ cache->lru_tail = entries[idx].lru_prev;
+
+ /* Insert at head */
+ entries[idx].lru_prev = MEMORY_MAP_INVALID_INDEX;
+ entries[idx].lru_next = cache->lru_head;
+ if (cache->lru_head != MEMORY_MAP_INVALID_INDEX)
+ entries[cache->lru_head].lru_prev = idx;
+ cache->lru_head = idx;
+ if (cache->lru_tail == MEMORY_MAP_INVALID_INDEX)
+ cache->lru_tail = idx;
+}
+
+/*
+ * Add entry to LRU list (at head)
+ */
+static void
+query_cache_lru_add(QueryParseCache *cache, int idx)
+{
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+
+ entries[idx].lru_prev = MEMORY_MAP_INVALID_INDEX;
+ entries[idx].lru_next = cache->lru_head;
+
+ if (cache->lru_head != MEMORY_MAP_INVALID_INDEX)
+ entries[cache->lru_head].lru_prev = idx;
+
+ cache->lru_head = idx;
+
+ if (cache->lru_tail == MEMORY_MAP_INVALID_INDEX)
+ cache->lru_tail = idx;
+}
+
+/*
+ * Remove entry from LRU list
+ */
+static void
+query_cache_lru_remove(QueryParseCache *cache, int idx)
+{
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+
+ if (entries[idx].lru_prev != MEMORY_MAP_INVALID_INDEX)
+ entries[entries[idx].lru_prev].lru_next = entries[idx].lru_next;
+ else
+ cache->lru_head = entries[idx].lru_next;
+
+ if (entries[idx].lru_next != MEMORY_MAP_INVALID_INDEX)
+ entries[entries[idx].lru_next].lru_prev = entries[idx].lru_prev;
+ else
+ cache->lru_tail = entries[idx].lru_prev;
+
+ entries[idx].lru_prev = MEMORY_MAP_INVALID_INDEX;
+ entries[idx].lru_next = MEMORY_MAP_INVALID_INDEX;
+}
+
+/*
+ * Allocate entry from free list, evicting LRU if necessary
+ */
+static int
+query_cache_alloc_entry(QueryParseCache *cache)
+{
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+ int *buckets = QUERY_CACHE_BUCKETS(cache);
+ int idx;
+
+ if (cache->free_list_head != MEMORY_MAP_INVALID_INDEX)
+ {
+ idx = cache->free_list_head;
+ cache->free_list_head = entries[idx].next;
+ entries[idx].in_use = true;
+ entries[idx].next = MEMORY_MAP_INVALID_INDEX;
+ cache->num_entries++;
+ return idx;
+ }
+
+ /* No free entries - evict LRU */
+ if (cache->lru_tail == MEMORY_MAP_INVALID_INDEX)
+ return MEMORY_MAP_INVALID_INDEX;
+
+ idx = cache->lru_tail;
+
+ /* Remove from hash bucket */
+ int bucket = entries[idx].query_hash % cache->num_buckets;
+ int *prev_ptr = &buckets[bucket];
+ int curr = buckets[bucket];
+
+ while (curr != MEMORY_MAP_INVALID_INDEX)
+ {
+ if (curr == idx)
+ {
+ *prev_ptr = entries[curr].next;
+ break;
+ }
+ prev_ptr = &entries[curr].next;
+ curr = entries[curr].next;
+ }
+
+ /* Remove from LRU list */
+ query_cache_lru_remove(cache, idx);
+
+ /* Reinitialize entry */
+ entries[idx].in_use = true;
+ entries[idx].next = MEMORY_MAP_INVALID_INDEX;
+
+ return idx;
+}
+
+/*
+ * Look up a query in the cache
+ */
+static int
+query_cache_lookup(QueryParseCache *cache, uint64 hash)
+{
+ int *buckets = QUERY_CACHE_BUCKETS(cache);
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+ int bucket = hash % cache->num_buckets;
+ int idx = buckets[bucket];
+
+ while (idx != MEMORY_MAP_INVALID_INDEX)
+ {
+ if (entries[idx].query_hash == hash)
+ return idx;
+ idx = entries[idx].next;
+ }
+
+ return MEMORY_MAP_INVALID_INDEX;
+}
+
+/* ----------------
+ * Query normalization
+ * ----------------
+ */
+
+/*
+ * Simple query normalization:
+ * - Strip comments (-- style and C-style block comments)
+ * - Collapse whitespace
+ * - Convert to lowercase (except inside strings)
+ * - Replace literal values with placeholders
+ *
+ * This is a simplified version - pgpool2 already does this elsewhere,
+ * but we need a standalone version for the memory map feature.
+ */
+static size_t
+normalize_query(const char *query, char *output, size_t output_size)
+{
+ const char *src = query;
+ char *dst = output;
+ char *dst_end = output + output_size - 1;
+ bool in_string = false;
+ char string_char = 0;
+ bool last_was_space = true; /* Start true to skip leading space */
+
+ while (*src && dst < dst_end)
+ {
+ /* Handle string literals */
+ if (in_string)
+ {
+ if (*src == string_char)
+ {
+ if (*(src + 1) == string_char)
+ {
+ /* Escaped quote */
+ src += 2;
+ continue;
+ }
+ in_string = false;
+ *dst++ = '$'; /* Replace string content with placeholder */
+ }
+ src++;
+ continue;
+ }
+
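+ /*
+ * Check for string start. Only single quotes start a string literal;
+ * double quotes delimit identifiers, which must stay in the normalized
+ * text so that queries on different quoted tables hash differently.
+ */
+ if (*src == '\'')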
+ {
+ in_string = true;
+ string_char = *src;
+ src++;
+ continue;
+ }
+
+ /* Handle single-line comments */
+ if (*src == '-' && *(src + 1) == '-')
+ {
+ while (*src && *src != '\n')
+ src++;
+ continue;
+ }
+
+ /* Handle multi-line comments */
+ if (*src == '/' && *(src + 1) == '*')
+ {
+ src += 2;
+ while (*src && !(*src == '*' && *(src + 1) == '/'))
+ src++;
+ if (*src)
+ src += 2;
+ continue;
+ }
+
+ /* Handle whitespace */
+ if (*src == ' ' || *src == '\t' || *src == '\n' || *src == '\r')
+ {
+ if (!last_was_space)
+ {
+ *dst++ = ' ';
+ last_was_space = true;
+ }
+ src++;
+ continue;
+ }
+
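+ /*
+ * Handle numeric literals - replace with placeholder. A digit that
+ * continues an identifier (e.g. "t1") is kept as a regular character,
+ * otherwise queries on "t1" and "t2" would normalize to the same text.
+ */
+ if (((*src >= '0' && *src <= '9') ||
+ (*src == '.' && *(src + 1) >= '0' && *(src + 1) <= '9')) &&
+ !(dst > output &&
+ ((*(dst - 1) >= 'a' && *(dst - 1) <= 'z') ||
+ (*(dst - 1) >= '0' && *(dst - 1) <= '9') ||
+ *(dst - 1) == '_')))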
+ {
+ while (*src && ((*src >= '0' && *src <= '9') || *src == '.'))
+ src++;
+ if (!last_was_space && dst > output && *(dst - 1) != '$')
+ *dst++ = '$';
+ last_was_space = false;
+ continue;
+ }
+
+ /* Regular character - convert to lowercase */
+ if (*src >= 'A' && *src <= 'Z')
+ *dst++ = *src + 32;
+ else
+ *dst++ = *src;
+
+ last_was_space = false;
+ src++;
+ }
+
+ /* Remove trailing space */
+ if (dst > output && *(dst - 1) == ' ')
+ dst--;
+
+ *dst = '\0';
+ return dst - output;
+}
+
+/* ----------------
+ * Public API implementation
+ * ----------------
+ */
+
+Size
+pool_memory_map_shmem_size(void)
+{
+ Size size = 0;
+ int table_buckets = pool_config->memory_map_table_buckets;
+ int table_size = pool_config->memory_map_table_size;
+ int query_buckets = pool_config->memory_map_query_buckets;
+ int query_cache_size = pool_config->memory_map_query_cache_size;
+
+ /* Main structure */
+ size += sizeof(MemoryMapShmem);
+
+ /* Table mutation hash table */
+ size += sizeof(TableMutationHashTable);
+ size += table_buckets * sizeof(int); /* buckets array */
+ size += table_size * sizeof(TableMutationEntry); /* entries array */
+
+ /* Query parse cache */
+ size += sizeof(QueryParseCache);
+ size += query_buckets * sizeof(int); /* buckets array */
+ size += query_cache_size * sizeof(QueryParseEntry); /* entries array */
+
+ return size;
+}
+
+void
+pool_memory_map_init(void)
+{
+#ifndef POOL_PRIVATE
+ Size shmem_size;
+ char *shmem_ptr;
+
+ if (!pool_config->memory_map_enabled)
+ {
+ ereport(DEBUG1,
+ (errmsg("memory_map: feature disabled")));
+ return;
+ }
+
+ shmem_size = pool_memory_map_shmem_size();
+
+ /*
+ * Allocate from the main shared memory segment.
+ * Memory is already zeroed by initialize_shared_memory_main_segment().
+ */
+ shmem_ptr = pool_shared_memory_segment_get_chunk(shmem_size);
+ if (shmem_ptr == NULL)
+ {
+ ereport(ERROR,
+ (errmsg("memory_map: failed to allocate %zu bytes of shared memory",
+ shmem_size)));
+ return;
+ }
+
+ /* Set up pointers to structures within shared memory */
+ memory_map_shmem = (MemoryMapShmem *)shmem_ptr;
+ shmem_ptr += sizeof(MemoryMapShmem);
+
+ memory_map_shmem->table_map = (TableMutationHashTable *)shmem_ptr;
+ shmem_ptr += sizeof(TableMutationHashTable);
+ shmem_ptr += pool_config->memory_map_table_buckets * sizeof(int);
+ shmem_ptr += pool_config->memory_map_table_size * sizeof(TableMutationEntry);
+
+ memory_map_shmem->query_cache = (QueryParseCache *)shmem_ptr;
+
+ /* Initialize structures */
+ table_map_init(memory_map_shmem->table_map,
+ pool_config->memory_map_table_buckets,
+ pool_config->memory_map_table_size);
+
+ query_cache_init(memory_map_shmem->query_cache,
+ pool_config->memory_map_query_buckets,
+ pool_config->memory_map_query_cache_size);
+
+ /* Initialize global state */
+ memory_map_shmem->state.initialized = true;
+ memory_map_shmem->state.current_ttl_us = MEMORY_MAP_DEFAULT_TTL_US;
+ get_current_time(&memory_map_shmem->state.ttl_last_updated);
+ memory_map_shmem->state.stats_queries_checked = 0;
+ memory_map_shmem->state.stats_forced_primary = 0;
+ memory_map_shmem->state.stats_allowed_replica = 0;
+
+ ereport(LOG,
+ (errmsg("memory_map: initialized with %zu bytes shared memory",
+ shmem_size)));
+#endif
+}
+
+void
+pool_memory_map_child_init(void)
+{
+ if (!pool_config->memory_map_enabled || memory_map_shmem == NULL)
+ return;
+
+ get_current_time(&process_start_time);
+ cold_start_initialized = true;
+
+ ereport(DEBUG1,
+ (errmsg("memory_map: child initialized, cold start period %d ms",
+ pool_config->memory_map_cold_start_duration)));
+}
+
+bool
+pool_memory_map_in_cold_start(void)
+{
+ struct timeval now;
+ int64 elapsed_ms;
+
+ if (!pool_config->memory_map_enabled || !cold_start_initialized)
+ return false;
+
+ if (pool_config->memory_map_cold_start_duration <= 0)
+ return false;
+
+ get_current_time(&now);
+ elapsed_ms = elapsed_us(&process_start_time, &now) / 1000;
+
+ if (elapsed_ms < pool_config->memory_map_cold_start_duration)
+ {
+ ereport(DEBUG2,
+ (errmsg("memory_map: in cold start (%ld/%d ms)",
+ (long)elapsed_ms, pool_config->memory_map_cold_start_duration)));
+ return true;
+ }
+
+ return false;
+}
+
+bool
+pool_memory_map_table_is_stale(const char *table_name)
+{
+ TableMutationHashTable *map;
+ struct timeval now;
+ uint64 ttl_us;
+ uint32 hash;
+ int idx;
+ bool is_stale = false;
+
+ if (!pool_config->memory_map_enabled || memory_map_shmem == NULL)
+ return false;
+
+ map = memory_map_shmem->table_map;
+ hash = fnv1a_hash_string(table_name);
+
+ spin_lock(&map->lock);
+
+ idx = table_map_lookup(map, table_name, hash);
+ if (idx != MEMORY_MAP_INVALID_INDEX)
+ {
+ TableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ get_current_time(&now);
+ ttl_us = memory_map_shmem->state.current_ttl_us;
+
+ int64 elapsed = elapsed_us(&entries[idx].last_write_time, &now);
+ is_stale = (elapsed < (int64)ttl_us);
+
+ ereport(DEBUG2,
+ (errmsg("memory_map: table '%s' elapsed=%ld us, ttl=%lu us, stale=%d",
+ table_name, (long)elapsed, (unsigned long)ttl_us, is_stale)));
+ }
+
+ spin_unlock(&map->lock);
+
+ /* Update statistics */
+ __sync_fetch_and_add(&memory_map_shmem->state.stats_queries_checked, 1);
+ if (is_stale)
+ __sync_fetch_and_add(&memory_map_shmem->state.stats_forced_primary, 1);
+ else
+ __sync_fetch_and_add(&memory_map_shmem->state.stats_allowed_replica, 1);
+
+ return is_stale;
+}
+
+void
+pool_memory_map_mark_tables_written(const char **table_names, int num_tables)
+{
+ TableMutationHashTable *map;
+ struct timeval now;
+ int i;
+
+ if (!pool_config->memory_map_enabled || memory_map_shmem == NULL)
+ return;
+
+ if (num_tables <= 0 || table_names == NULL)
+ return;
+
+ map = memory_map_shmem->table_map;
+ get_current_time(&now);
+
+ spin_lock(&map->lock);
+
+ /* Clean up expired entries once the map is more than 3/4 full */
+ if (map->num_entries > map->max_entries * 3 / 4)
+ {
+ table_map_cleanup_expired(map, memory_map_shmem->state.current_ttl_us);
+ }
+
+ for (i = 0; i < num_tables; i++)
+ {
+ uint32 hash;
+
+ if (table_names[i] != NULL && table_names[i][0] != '\0')
+ {
+ hash = fnv1a_hash_string(table_names[i]);
+ table_map_insert(map, table_names[i], hash, &now);
+ }
+ }
+
+ spin_unlock(&map->lock);
+}
+
+/*
+ * Convenience function to mark a single table as written
+ */
+void
+pool_memory_map_mark_table_written(const char *table_name)
+{
+ if (table_name != NULL)
+ {
+ const char *tables[1] = { table_name };
+ pool_memory_map_mark_tables_written(tables, 1);
+ }
+}
+
+void
+pool_memory_map_update_ttl(uint64 delay_us)
+{
+ uint64 new_ttl;
+
+ if (!pool_config->memory_map_enabled || memory_map_shmem == NULL)
+ return;
+
+ /* Calculate new TTL: delay * factor, with minimum of default TTL */
+ new_ttl = (uint64)(delay_us * pool_config->memory_map_ttl_factor);
+ if (new_ttl < MEMORY_MAP_DEFAULT_TTL_US)
+ new_ttl = MEMORY_MAP_DEFAULT_TTL_US;
+
+ /* Maximum TTL of 1 hour */
+ if (new_ttl > 3600ULL * 1000000ULL)
+ new_ttl = 3600ULL * 1000000ULL;
+
+ memory_map_shmem->state.current_ttl_us = new_ttl;
+ get_current_time(&memory_map_shmem->state.ttl_last_updated);
+
+ ereport(DEBUG1,
+ (errmsg("memory_map: updated TTL to %lu us (delay=%lu us, factor=%.1f)",
+ (unsigned long)new_ttl, (unsigned long)delay_us,
+ pool_config->memory_map_ttl_factor)));
+}
+
+bool
+pool_memory_map_get_cached_parse(uint64 hash, bool *is_write,
+ char table_names[][MEMORY_MAP_TABLE_NAME_LEN],
+ int *num_tables)
+{
+ QueryParseCache *cache;
+ int idx;
+ bool found = false;
+
+ if (!pool_config->memory_map_enabled || memory_map_shmem == NULL)
+ return false;
+
+ cache = memory_map_shmem->query_cache;
+
+ spin_lock(&cache->lock);
+
+ idx = query_cache_lookup(cache, hash);
+ if (idx != MEMORY_MAP_INVALID_INDEX)
+ {
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+ int i;
+
+ *is_write = entries[idx].is_write;
+ *num_tables = entries[idx].num_tables;
+
+ for (i = 0; i < entries[idx].num_tables && i < MEMORY_MAP_MAX_TABLES_PER_QUERY; i++)
+ {
+ strlcpy(table_names[i], entries[idx].table_names[i], MEMORY_MAP_TABLE_NAME_LEN);
+ }
+
+ /* Move to front of LRU */
+ query_cache_lru_touch(cache, idx);
+ found = true;
+ }
+
+ spin_unlock(&cache->lock);
+
+ return found;
+}
+
+void
+pool_memory_map_cache_parse(uint64 hash, bool is_write,
+ const char table_names[][MEMORY_MAP_TABLE_NAME_LEN],
+ int num_tables)
+{
+ QueryParseCache *cache;
+ int *buckets;
+ QueryParseEntry *entries;
+ int idx;
+ int bucket;
+
+ if (!pool_config->memory_map_enabled || memory_map_shmem == NULL)
+ return;
+
+ cache = memory_map_shmem->query_cache;
+
+ spin_lock(&cache->lock);
+
+ /* Check if already exists */
+ idx = query_cache_lookup(cache, hash);
+ if (idx != MEMORY_MAP_INVALID_INDEX)
+ {
+ spin_unlock(&cache->lock);
+ return;
+ }
+
+ /* Allocate new entry (may evict LRU) */
+ idx = query_cache_alloc_entry(cache);
+ if (idx == MEMORY_MAP_INVALID_INDEX)
+ {
+ spin_unlock(&cache->lock);
+ ereport(WARNING,
+ (errmsg("memory_map: failed to allocate query cache entry")));
+ return;
+ }
+
+ entries = QUERY_CACHE_ENTRIES(cache);
+ buckets = QUERY_CACHE_BUCKETS(cache);
+
+ /* Fill in entry */
+ entries[idx].query_hash = hash;
+ entries[idx].is_write = is_write;
+ entries[idx].num_tables = (num_tables > MEMORY_MAP_MAX_TABLES_PER_QUERY) ?
+ MEMORY_MAP_MAX_TABLES_PER_QUERY : num_tables;
+
+ {
+ int i;
+ for (i = 0; i < entries[idx].num_tables; i++)
+ {
+ strlcpy(entries[idx].table_names[i], table_names[i], MEMORY_MAP_TABLE_NAME_LEN);
+ }
+ }
+
+ /* Insert into hash bucket */
+ bucket = hash % cache->num_buckets;
+ entries[idx].next = buckets[bucket];
+ buckets[bucket] = idx;
+
+ /* Add to LRU list */
+ query_cache_lru_add(cache, idx);
+
+ spin_unlock(&cache->lock);
+}
+
+uint64
+pool_memory_map_normalize_and_hash(const char *query)
+{
+ char normalized[8192];
+ size_t len;
+
+ if (query == NULL || query[0] == '\0')
+ return 0;
+
+ len = normalize_query(query, normalized, sizeof(normalized));
+ if (len == 0)
+ return 0;
+
+ return fnv1a_hash_64(normalized, len);
+}
+
+uint64
+pool_memory_map_get_ttl(void)
+{
+ if (!pool_config->memory_map_enabled || memory_map_shmem == NULL)
+ return MEMORY_MAP_DEFAULT_TTL_US;
+
+ return memory_map_shmem->state.current_ttl_us;
+}
+
+void
+pool_memory_map_get_stats(uint32 *queries_checked,
+ uint32 *forced_primary,
+ uint32 *allowed_replica,
+ uint64 *current_ttl_us)
+{
+ if (!pool_config->memory_map_enabled || memory_map_shmem == NULL)
+ {
+ *queries_checked = 0;
+ *forced_primary = 0;
+ *allowed_replica = 0;
+ *current_ttl_us = 0;
+ return;
+ }
+
+ *queries_checked = memory_map_shmem->state.stats_queries_checked;
+ *forced_primary = memory_map_shmem->state.stats_forced_primary;
+ *allowed_replica = memory_map_shmem->state.stats_allowed_replica;
+ *current_ttl_us = memory_map_shmem->state.current_ttl_us;
+}
--
2.52.0
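The read-routing rule in pool_memory_map_table_is_stale() above can be
sketched in isolation as follows. This is only an illustration: the
sketch_* names are hypothetical, and sketch_elapsed_us() is a stand-in
for the patch's elapsed_us() helper, whose definition is not shown in
this part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* Microseconds from 'start' to 'end' (stand-in for the patch's elapsed_us). */
static int64_t
sketch_elapsed_us(const struct timeval *start, const struct timeval *end)
{
	assert(start != NULL && end != NULL);
	return ((int64_t) end->tv_sec - (int64_t) start->tv_sec) * 1000000
		+ ((int64_t) end->tv_usec - (int64_t) start->tv_usec);
}

/*
 * A table counts as "stale" on replicas while its last write is younger
 * than the TTL; such reads would be routed to the primary.
 */
static bool
sketch_table_is_stale(const struct timeval *last_write,
					  const struct timeval *now, uint64_t ttl_us)
{
	return sketch_elapsed_us(last_write, now) < (int64_t) ttl_us;
}
```

With a TTL of one second, a read 0.5 s after the write would go to the
primary, and a read 2 s after the write could go to a replica.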
* Re: Proposal: Recent mutated table tracking in memory
2026-01-06 11:25 Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-14 08:55 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-28 05:08 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-29 08:28 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-29 08:54 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
@ 2026-01-30 08:09 ` Tatsuo Ishii <[email protected]>
2026-01-31 17:11 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
0 siblings, 1 reply; 44+ messages in thread
From: Tatsuo Ishii @ 2026-01-30 08:09 UTC
To: [email protected]; +Cc: [email protected]
> yes indeed, please find attached.
Thanks. Here are some comments on the patch:
- It seems you use a table name (and schema) as the key to identify the
TableMutationEntry and other objects. I think you should use table
oids for the key because the same table name could exist in
different schemas. Moreover, if the database is different from the
database when the map entry was created, a map lookup could return
an incorrect result. In summary, the key should be the table oid and
the database oid (which is already done by the query cache subsystem).
- The patch introduces spin lock primitives. Why can't we use a
semaphore instead? A spin lock uses a busy loop, which could increase
the system load if the duration of locking becomes longer.
- What would happen if the leader watchdog fails and another watchdog
node takes over the leader role?
- pool_memory_map_get_ttl() and pool_memory_map_get_stats() are
defined but are not used anywhere. Why do you have them?
- I think "memory_map" is too generic a name. Can we use a more specific
name for the feature?
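
For the first comment above, a minimal sketch of what a (database oid,
table oid) key might look like. The struct and function names are
hypothetical, and the sketch does not show how the oids would be
obtained; it only illustrates that two tables with the same name in
different databases or schemas would no longer share an entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical composite key: database oid plus table oid. */
typedef struct SketchTableKey
{
	uint32_t	dboid;		/* oid of the database */
	uint32_t	reloid;		/* oid of the table within that database */
} SketchTableKey;

static bool
sketch_key_equal(SketchTableKey a, SketchTableKey b)
{
	return a.dboid == b.dboid && a.reloid == b.reloid;
}

/* 32-bit FNV-1a over the eight key bytes, fed in a fixed byte order. */
static uint32_t
sketch_key_hash(SketchTableKey key)
{
	uint64_t	v = ((uint64_t) key.dboid << 32) | key.reloid;
	uint32_t	h = 2166136261u;	/* FNV offset basis */
	int			i;

	for (i = 0; i < 8; i++)
	{
		h ^= (uint32_t) ((v >> (8 * i)) & 0xff);
		h *= 16777619u;				/* FNV prime */
	}
	return h;
}

/* Map a key to a bucket index in [0, num_buckets). */
static int
sketch_bucket(SketchTableKey key, int num_buckets)
{
	assert(num_buckets > 0);
	return (int) (sketch_key_hash(key) % (uint32_t) num_buckets);
}
```

A fixed-size key like this would also remove the strcmp() on table names
from the lookup path, since equality becomes two integer comparisons.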
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
* Re: Proposal: Recent mutated table tracking in memory
2026-01-06 11:25 Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-14 08:55 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-28 05:08 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-29 08:28 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-29 08:54 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-30 08:09 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
@ 2026-01-31 17:11 ` Nadav Shatz <[email protected]>
2026-02-03 07:43 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
0 siblings, 1 reply; 44+ messages in thread
From: Nadav Shatz @ 2026-01-31 17:11 UTC
To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]
Thank you for the comments!
I agree with all of them. Let me know what you think of the changes and the new
naming.
Please find attached another version of the patch.
On Fri, Jan 30, 2026 at 10:10 AM Tatsuo Ishii <[email protected]> wrote:
> > yes indeed, please find attached.
>
> Thanks. Here are some comments on the patch:
>
> - It seems you use a table name (and schema) as the key to identify the
> TableMutationEntry and other objects. I think you should use table
> oids for the key because the same table name could exist in
> different schemas. Moreover, if the database is different from the
> database when the map entry was created, a map lookup could return
> an incorrect result. In summary, the key should be the table oid and
> the database oid (which is already done by the query cache subsystem).
>
> - The patch introduces spin lock primitives. Why can't we use a
> semaphore instead? A spin lock uses a busy loop, which could increase
> the system load if the duration of locking becomes longer.
>
> - What would happen if the leader watchdog fails and another watchdog
> node takes over the leader role?
>
> - pool_memory_map_get_ttl() and pool_memory_map_get_stats() are
> defined but are not used anywhere. Why do you have them?
>
> - I think "memory_map" is too generic a name. Can we use a more specific
> name for the feature?
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>
--
Nadav Shatz
Tailor Brands | CTO
Attachments:
[application/octet-stream] mutated_table.patch (76.5K, 3-mutated_table.patch)
From 403ed46f0d2b33858c05a25d74be2b027db7d21b Mon Sep 17 00:00:00 2001
From: Nadav Shatz <[email protected]>
Date: Tue, 6 Jan 2026 12:41:50 +0200
Subject: [PATCH] Feature: add in-memory table tracking to prevent stale reads
from replicas
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Implement "table mutation map" feature that tracks recently-written database
tables in shared memory to prevent stale reads during replication lag.
When a write (INSERT/UPDATE/DELETE) occurs on a table, that table is
marked as "dirty" for a configurable TTL period. Any SELECT on a dirty
table within the TTL window is routed to primary instead of replica.
Key features:
- Shared memory hash table for tracking table mutations with TTL
- Query parse cache with LRU eviction for performance
- Cold start protection (routes all queries to primary initially)
- Automatic TTL calculation: replication_delay × configurable factor
- Per-table staleness tracking with microsecond precision
New configuration parameters:
- table_mutation_map_enabled: Enable/disable the feature (default: off)
- table_mutation_map_ttl_factor: TTL multiplier for replication delay (default: 5.0)
- table_mutation_map_cold_start_duration: Cold start period in ms (default: 2000)
- table_mutation_map_table_buckets: Hash buckets for table map (default: 1024)
- table_mutation_map_table_size: Max tracked tables (default: 2048)
- table_mutation_map_query_buckets: Hash buckets for query cache (default: 2048)
- table_mutation_map_query_cache_size: Max cached queries (default: 10000)
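
As a rough illustration of the automatic TTL calculation listed above
(replication delay times the TTL factor), here is a minimal sketch. The
one-hour cap mirrors pool_memory_map_update_ttl() in the earlier version
of the patch; the floor value is an assumed placeholder, since the
default TTL constant is not shown here, and the sketch_* names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_TTL_FLOOR_US 1000000ULL				/* assumed floor, not taken from the patch */
#define SKETCH_TTL_CAP_US   (3600ULL * 1000000ULL)	/* 1 hour cap */

/* TTL in microseconds: delay * factor, clamped to [floor, cap]. */
static uint64_t
sketch_compute_ttl_us(uint64_t delay_us, double factor)
{
	uint64_t	ttl;

	assert(factor >= 1.0);
	ttl = (uint64_t) ((double) delay_us * factor);
	if (ttl < SKETCH_TTL_FLOOR_US)
		ttl = SKETCH_TTL_FLOOR_US;
	if (ttl > SKETCH_TTL_CAP_US)
		ttl = SKETCH_TTL_CAP_US;
	return ttl;
}
```

With the default factor of 5.0, a measured replication delay of 400 ms
gives a TTL of 2 s: reads of a table are sent to the primary for 2 s
after its last write.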
diff --git a/doc/src/sgml/loadbalance.sgml b/doc/src/sgml/loadbalance.sgml
index ee19fabebab2210cd4abe59a711a036ac0ac8943..dce8dec199371e3a24d92baaad6647757b7edf5f 100644
--- a/doc/src/sgml/loadbalance.sgml
+++ b/doc/src/sgml/loadbalance.sgml
@@ -1193,4 +1193,214 @@ dml_adaptive_object_relationship_list = 'table_1:table_2'
</variablelist>
</sect2>
+
+ <sect2 id="runtime-config-table-mutation-map">
+ <title>Table Mutation Map Configuration (Lagless Replica Reads)</title>
+
+ <para>
+ These parameters configure the table mutation map feature, which tracks recently written tables
+ to prevent stale reads from replica nodes during replication lag. This implements the
+ "lagless" architecture pattern for distributed systems with read replicas.
+ </para>
+
+ <para>
+ When a table is modified (INSERT/UPDATE/DELETE), it is marked as "stale" for a TTL period
+ (<literal>replication_delay * table_mutation_map_ttl_factor</literal>). Any SELECT queries on stale tables are routed
+ to the primary node instead of replicas, ensuring read-after-write consistency.
+ </para>
+
+ <para>
+ This feature requires <xref linkend="guc-replication-delay-source-cmd"> to be configured
+ for monitoring replication delay from replicas.
+ </para>
+
+ <warning>
+ <para>
+ Enabling the table mutation map feature increases shared memory consumption. With default settings,
+ the feature requires approximately 6.5 MB of shared memory (0.1 MB for table tracking + 6.4 MB for query cache).
+ Memory usage scales with configuration parameters:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ Table tracking: <literal>table_mutation_map_table_size * 40 bytes</literal> (default: 2048 * 40 = ~80 KB)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Query cache: <literal>table_mutation_map_query_cache_size * 640 bytes</literal> (default: 10000 * 640 = ~6.4 MB)
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ For high-traffic systems with large cache sizes (e.g., <literal>table_mutation_map_query_cache_size = 100000</literal>),
+ memory usage can reach 64 MB or more. Consider your system's available shared memory when enabling this feature.
+ </para>
+ </warning>
+
+ <variablelist>
+
+ <varlistentry id="guc-table-mutation-map-enabled" xreflabel="table_mutation_map_enabled">
+ <term><varname>table_mutation_map_enabled</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>table_mutation_map_enabled</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables in-memory tracking of recently written tables. When enabled, tables are marked
+ as stale after write operations, and reads are routed to primary until the TTL expires.
+ </para>
+ <para>
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ Default is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-table-mutation-map-ttl-factor" xreflabel="table_mutation_map_ttl_factor">
+ <term><varname>table_mutation_map_ttl_factor</varname> (<type>floating point</type>)
+ <indexterm>
+ <primary><varname>table_mutation_map_ttl_factor</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Multiplier for calculating the TTL: <literal>TTL = replication_delay * table_mutation_map_ttl_factor</literal>.
+ Higher values provide more safety margin but may reduce read replica utilization.
+ </para>
+ <para>
+ Valid range: 1.0-100.0. Default is <literal>5.0</literal>.
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-table-mutation-map-cold-start-duration" xreflabel="table_mutation_map_cold_start_duration">
+ <term><varname>table_mutation_map_cold_start_duration</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>table_mutation_map_cold_start_duration</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Duration in milliseconds to route all queries to primary after a child process starts.
+ This prevents stale reads when a new connection is established before the table mutation map
+ is populated with recent write history.
+ </para>
+ <para>
+ When watchdog is enabled and the local node becomes the leader, Pgpool-II also triggers a
+ global cold start for this duration to avoid stale reads after leadership changes.
+ </para>
+ <para>
+ Valid range: 0-60000 ms. Default is <literal>2000</literal> (2 seconds).
+ Set to 0 to disable cold start behavior.
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-table-mutation-map-table-buckets" xreflabel="table_mutation_map_table_buckets">
+ <term><varname>table_mutation_map_table_buckets</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>table_mutation_map_table_buckets</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Number of hash buckets for the table mutation tracking hash table.
+ Higher values reduce hash collisions and improve lookup performance.
+ </para>
+ <para>
+ Valid range: 64-65536. Default is <literal>1024</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-table-mutation-map-table-size" xreflabel="table_mutation_map_table_size">
+ <term><varname>table_mutation_map_table_size</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>table_mutation_map_table_size</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Maximum number of tables that can be tracked simultaneously in the table mutation map.
+ When full, oldest entries are evicted using a simple eviction strategy.
+ </para>
+ <para>
+ Valid range: 128-131072. Default is <literal>2048</literal>.
+ Memory usage: approximately 40 bytes per entry.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-table-mutation-map-query-buckets" xreflabel="table_mutation_map_query_buckets">
+ <term><varname>table_mutation_map_query_buckets</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>table_mutation_map_query_buckets</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Number of hash buckets for the query parse cache. The cache stores normalized
+ query strings mapped to their table dependencies to avoid repeated parsing.
+ </para>
+ <para>
+ Valid range: 64-65536. Default is <literal>2048</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-table-mutation-map-query-cache-size" xreflabel="table_mutation_map_query_cache_size">
+ <term><varname>table_mutation_map_query_cache_size</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>table_mutation_map_query_cache_size</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Maximum number of query parse results to cache. Uses LRU eviction when full.
+ Larger caches reduce parsing overhead but consume more shared memory.
+ </para>
+ <para>
+ Valid range: 100-1000000. Default is <literal>10000</literal>.
+ Memory usage: approximately 640 bytes per entry (~6.3 MB for default, ~64 MB for 100000 entries).
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ <sect3 id="runtime-config-table-mutation-map-example">
+ <title>Table Mutation Map Configuration Example</title>
+ <para>
+ To enable the table mutation map with replication delay monitoring:
+ </para>
+ <programlisting>
+# Enable table mutation map feature
+table_mutation_map_enabled = on
+table_mutation_map_ttl_factor = 5.0
+table_mutation_map_cold_start_duration = 2000
+
+# Configure external replication delay monitoring
+replication_delay_source_cmd = '/path/to/get-replication-delay.sh'
+replication_delay_source_timeout = 10
+
+# Adjust cache sizes based on workload (increases memory usage)
+table_mutation_map_table_size = 4096 # Track up to 4096 tables (~160 KB)
+table_mutation_map_query_cache_size = 50000 # Cache 50k queries (~31 MB)
+ </programlisting>
+ <para>
+ Total shared memory required for the above configuration: approximately 31.2 MB (31 MB query cache + 0.2 MB table map + overhead).
+ Default configuration (10000 query cache entries, 2048 tables) requires approximately 6.4 MB.
+ </para>
+ </sect3>
+
+ </sect2>
+
</sect1>
diff --git a/src/Makefile.am b/src/Makefile.am
index 4678ab53055e828a37b6477801640aff17ff84a7..fc69bb98c8907d23855837cefaad0a972b4e2171 100644
--- a/src/Makefile.am
+++ b/src/Makefile.am
@@ -35,6 +35,7 @@ pgpool_SOURCES = main/main.c \
rewrite/pool_timestamp.c \
rewrite/pool_lobj.c \
utils/pool_select_walker.c \
+ utils/pool_table_mutation_map.c \
utils/strlcpy.c \
utils/psprintf.c \
utils/pool_params.c \
diff --git a/src/config/pool_config_variables.c b/src/config/pool_config_variables.c
index 68abb7f41cb96d856c824a148842748bfb7a4d12..099191af7629c0ca145628e9a9e9ac92c4bb2f6e 100644
--- a/src/config/pool_config_variables.c
+++ b/src/config/pool_config_variables.c
@@ -783,6 +783,16 @@ static struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"table_mutation_map_enabled", CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "Enable in-memory tracking of recently written tables to avoid stale reads from replicas",
+ CONFIG_VAR_TYPE_BOOL, false, 0
+ },
+ &g_pool_config.table_mutation_map_enabled,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"auto_failback", CFGCXT_RELOAD, FAILOVER_CONFIG,
"Enables nodes automatically reattach, when detached node continue streaming replication.",
@@ -1757,6 +1767,17 @@ static struct config_int_array ConfigureNamesIntArray[] =
static struct config_double ConfigureNamesDouble[] =
{
+ {
+ {"table_mutation_map_ttl_factor", CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "TTL multiplier for table mutation map (TTL = replication_delay * factor)",
+ CONFIG_VAR_TYPE_DOUBLE, false, 0
+ },
+ &g_pool_config.table_mutation_map_ttl_factor,
+ 5.0, /* boot value: 5x replication delay */
+ 1.0, 100.0, /* min, max */
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
EMPTY_CONFIG_DOUBLE
};
@@ -2355,6 +2376,61 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"table_mutation_map_cold_start_duration", CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "Duration in milliseconds to force queries to primary after child process starts.",
+ CONFIG_VAR_TYPE_INT, false, GUC_UNIT_MS
+ },
+ &g_pool_config.table_mutation_map_cold_start_duration,
+ 2000, /* 2 seconds */
+ 0, 60000, /* 0 to 60 seconds */
+ NULL, NULL, NULL
+ },
+
+ {
+ {"table_mutation_map_table_buckets", CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Number of hash buckets for table mutation map.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.table_mutation_map_table_buckets,
+ 1024,
+ 64, 65536,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"table_mutation_map_table_size", CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Maximum number of entries in table mutation map.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.table_mutation_map_table_size,
+ 2048,
+ 128, 131072,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"table_mutation_map_query_buckets", CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Number of hash buckets for query parse cache.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.table_mutation_map_query_buckets,
+ 2048,
+ 64, 65536,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"table_mutation_map_query_cache_size", CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Maximum number of entries in query parse cache.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.table_mutation_map_query_cache_size,
+ 10000,
+ 100, 1000000,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
EMPTY_CONFIG_INT
};
diff --git a/src/context/pool_query_context.c b/src/context/pool_query_context.c
index 7cf9813eb7d58678bc86a0aaa38bd3c6445b6687..2dbbee8abce8daff6a98bf8f202bdc10bf324006 100644
--- a/src/context/pool_query_context.c
+++ b/src/context/pool_query_context.c
@@ -29,6 +29,7 @@
#include "utils/statistics.h"
#include "utils/pool_select_walker.h"
#include "utils/pool_stream.h"
+#include "utils/pool_table_mutation_map.h"
#include "context/pool_session_context.h"
#include "context/pool_query_context.h"
#include "parser/nodes.h"
@@ -2010,6 +2011,19 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
if (dest == POOL_PRIMARY)
{
pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
+
+ /*
+ * Resolve table and database OIDs now to populate relcache.
+ * This avoids potential hangs in CommandComplete where we shouldn't
+ * be running new queries against the backend.
+ */
+ if (pool_config->table_mutation_map_enabled &&
+ (IsA(node, InsertStmt) || IsA(node, UpdateStmt) || IsA(node, DeleteStmt)))
+ {
+ int *oids;
+ pool_extract_table_oids(node, &oids);
+ pool_table_mutation_map_get_database_oid();
+ }
}
/* Should be sent to both primary and standby? */
else if (dest == POOL_BOTH)
@@ -2139,6 +2153,107 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
{
pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
}
+ /*
+ * Check table mutation map for recently written tables.
+ * If in cold start or any table was recently written,
+ * route to primary to avoid stale reads.
+ */
+ else if (pool_config->table_mutation_map_enabled)
+ {
+ bool force_primary = false;
+
+ /* During cold start, route everything to primary */
+ if (pool_table_mutation_map_in_cold_start())
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load balance because of table mutation map cold start"),
+ errdetail("destination = PRIMARY for query= \"%s\"", query)));
+ force_primary = true;
+ }
+ else
+ {
+ /* Extract table oids and check if any are stale */
+ SelectContext ctx;
+ int dboid;
+ int num_oids;
+ int i;
+
+ memset(&ctx, 0, sizeof(ctx));
+ num_oids = pool_extract_table_oids_from_select_stmt(node, &ctx);
+ if (num_oids > 0)
+ {
+ dboid = pool_table_mutation_map_get_database_oid();
+
+ if (dboid <= 0)
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load balance because database oid was unavailable"),
+ errdetail("destination = PRIMARY for query= \"%s\"", query)));
+ force_primary = true;
+ }
+ else
+ {
+ for (i = 0; i < num_oids; i++)
+ {
+ if (pool_table_mutation_map_table_is_stale(ctx.table_oids[i], dboid))
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load balance because table \"%s\" was recently written",
+ ctx.table_names[i]),
+ errdetail("destination = PRIMARY for query= \"%s\"", query)));
+ force_primary = true;
+ break;
+ }
+ }
+ }
+ }
+ }
+
+ if (force_primary)
+ {
+ pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
+ }
+ else
+ {
+ /* Proceed with load balancing */
+ if (pool_config->statement_level_load_balance)
+ {
+ session_context->load_balance_node_id = select_load_balancing_node();
+ }
+
+ /*
+ * The streaming replication delay of the chosen node is too
+ * high. If prefer_lower_delay_standby is true, elect the
+ * least-delayed standby as the new load balance node;
+ * otherwise send the query to the primary.
+ */
+ if (STREAM && check_replication_delay(session_context->load_balance_node_id))
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load balance because of too much replication delay"),
+ errdetail("destination = %d for query= \"%s\"", dest, query)));
+
+ if (pool_config->prefer_lower_delay_standby)
+ {
+ int new_load_balancing_node = select_load_balancing_node();
+
+ session_context->load_balance_node_id = new_load_balancing_node;
+ session_context->query_context->load_balance_node_id = session_context->load_balance_node_id;
+ pool_set_node_to_be_sent(query_context, session_context->query_context->load_balance_node_id);
+ }
+ else
+ {
+ pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
+ }
+ }
+ else
+ {
+ session_context->query_context->load_balance_node_id = session_context->load_balance_node_id;
+ pool_set_node_to_be_sent(query_context,
+ session_context->query_context->load_balance_node_id);
+ }
+ }
+ }
else
{
if (pool_config->statement_level_load_balance)
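The routing ladder added in the hunk above can be read as a small pure function. The following is only a sketch of that decision order, assuming the staleness of each referenced table has already been looked up. The names SketchRoute and sketch_route_select are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result type; the patch itself calls pool_set_node_to_be_sent(). */
typedef enum
{
	SKETCH_ROUTE_PRIMARY,
	SKETCH_ROUTE_LOAD_BALANCE
} SketchRoute;

/*
 * Mirror the order of checks in where_to_send_main_replica():
 *   1. cold start            -> primary
 *   2. tables referenced but database oid unavailable -> primary
 *   3. any referenced table recently written          -> primary
 *   4. otherwise             -> normal load balancing
 */
static SketchRoute
sketch_route_select(bool in_cold_start, int dboid,
					const bool *table_stale, int num_tables)
{
	int			i;

	assert(num_tables >= 0);

	if (in_cold_start)
		return SKETCH_ROUTE_PRIMARY;

	if (num_tables > 0)
	{
		if (dboid <= 0)
			return SKETCH_ROUTE_PRIMARY;

		for (i = 0; i < num_tables; i++)
			if (table_stale[i])
				return SKETCH_ROUTE_PRIMARY;
	}

	return SKETCH_ROUTE_LOAD_BALANCE;
}
```

A SELECT that references no tables (num_tables == 0) falls through to load balancing without consulting the database oid, as it does in the patch.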
diff --git a/src/include/pool.h b/src/include/pool.h
index ea6f87e120af866b8ed3a15790d9d8a8e009fe91..c0d8170a8d6c23afdaa1e30dcf7c4bd4d88e7edd 100644
--- a/src/include/pool.h
+++ b/src/include/pool.h
@@ -424,7 +424,7 @@ typedef enum
#define Min(x, y) ((x) < (y) ? (x) : (y))
-#define MAX_NUM_SEMAPHORES 8
+#define MAX_NUM_SEMAPHORES 10
#define CONN_COUNTER_SEM 0
#define REQUEST_INFO_SEM 1
#define QUERY_CACHE_STATS_SEM 2
@@ -434,6 +434,8 @@ typedef enum
#define FOLLOW_PRIMARY_SEM 6
#define MAIN_EXIT_HANDLER_SEM 7 /* used in exit_hander in pgpool main
* process */
+#define TABLE_MUTATION_MAP_TABLE_SEM 8
+#define TABLE_MUTATION_MAP_QUERY_SEM 9
#define MAX_REQUEST_QUEUE_SIZE 10
#define MAX_SEC_WAIT_FOR_CLUSTER_TRANSACTION 10 /* time in seconds to keep
diff --git a/src/include/pool_config.h b/src/include/pool_config.h
index 741de6cc5fc3368f813d6b6efa68eb7f8a79506b..e6c727823dedeedd0225420b66be8382f5bb83fe 100644
--- a/src/include/pool_config.h
+++ b/src/include/pool_config.h
@@ -365,6 +365,16 @@ typedef struct
* replication check */
char *replication_delay_source_cmd; /* external command for replication delay */
int replication_delay_source_timeout; /* timeout for external command in seconds */
+
+ /* Table mutation map configuration for tracking recently written tables */
+ bool table_mutation_map_enabled; /* Enable in-memory table tracking */
+ double table_mutation_map_ttl_factor; /* TTL multiplier for replication delay */
+ int table_mutation_map_cold_start_duration; /* Cold start duration in ms */
+ int table_mutation_map_table_buckets; /* Number of hash buckets for table map */
+ int table_mutation_map_table_size; /* Max entries in table map */
+ int table_mutation_map_query_buckets; /* Number of hash buckets for query cache */
+ int table_mutation_map_query_cache_size; /* Max entries in query cache */
+
char *failover_command; /* execute command when failover happens */
char *follow_primary_command; /* execute command when failover is
* ended */
diff --git a/src/include/utils/pool_table_mutation_map.h b/src/include/utils/pool_table_mutation_map.h
new file mode 100644
index 0000000000000000000000000000000000000000..4c96cb9085107a2682be2948b83f83835fe555c8
--- /dev/null
+++ b/src/include/utils/pool_table_mutation_map.h
@@ -0,0 +1,237 @@
+/* -*-pgsql-c-*- */
+/*
+ * pgpool: a language independent connection pool server for PostgreSQL
+ * written by Tatsuo Ishii
+ *
+ * Copyright (c) 2003-2026 PgPool Global Development Group
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose and without fee is hereby
+ * granted, provided that the above copyright notice appear in all
+ * copies and that both that copyright notice and this permission
+ * notice appear in supporting documentation, and that the name of the
+ * author not be used in advertising or publicity pertaining to
+ * distribution of the software without specific, written prior
+ * permission. The author makes no representations about the
+ * suitability of this software for any purpose. It is provided "as
+ * is" without express or implied warranty.
+ *
+ * pool_table_mutation_map.h: In-memory tracking of recently written tables
+ * to avoid stale reads from replicas during replication lag
+ */
+
+#ifndef POOL_TABLE_MUTATION_MAP_H
+#define POOL_TABLE_MUTATION_MAP_H
+
+#include "pool.h"
+#include <sys/time.h>
+
+/*
+ * Maximum table name length including schema: "schema"."table"
+ * NAMEDATALEN * 2 + 4: two names of at most NAMEDATALEN - 1 chars, four quotes, a dot, one NUL
+ */
+#define TABLE_MUTATION_MAP_TABLE_NAME_LEN (NAMEDATALEN * 2 + 4)
+
+/*
+ * Maximum number of tables we track per query
+ */
+#define TABLE_MUTATION_MAP_MAX_TABLES_PER_QUERY 8
+
+/*
+ * Invalid index marker for linked lists
+ */
+#define TABLE_MUTATION_MAP_INVALID_INDEX (-1)
+
+/*
+ * Default TTL in microseconds (100ms) used when replication delay is unknown
+ */
+#define TABLE_MUTATION_MAP_DEFAULT_TTL_US (100 * 1000)
+
+/*
+ * Entry in the table mutation hash table (keyed by table/database oids)
+ */
+typedef struct TableMutationEntry
+{
+ int table_oid; /* Table oid */
+ int dboid; /* Database oid */
+ struct timeval last_write_time; /* When the table was last written */
+ uint32 hash; /* Pre-computed hash value */
+ int next; /* Next entry in collision chain (-1 if none) */
+ bool in_use; /* Is this entry in use? */
+} TableMutationEntry;
+
+/*
+ * Header for the table mutation hash table in shared memory
+ */
+typedef struct TableMutationHashTable
+{
+ int num_buckets; /* Number of hash buckets */
+ int max_entries; /* Maximum entries allowed */
+ int num_entries; /* Current number of entries */
+ int free_list_head; /* Head of free entry list */
+ /* Flexible array members follow in shared memory:
+ * int buckets[num_buckets];
+ * TableMutationEntry entries[max_entries];
+ */
+} TableMutationHashTable;
+
+/*
+ * Entry in the query parse cache
+ */
+typedef struct QueryParseEntry
+{
+ uint64 query_hash; /* Hash of normalized query */
+ bool is_write; /* True if INSERT/UPDATE/DELETE */
+ int num_tables; /* Number of tables in query */
+ char table_names[TABLE_MUTATION_MAP_MAX_TABLES_PER_QUERY][TABLE_MUTATION_MAP_TABLE_NAME_LEN];
+ int next; /* Next entry in collision chain */
+ int lru_prev; /* Previous in LRU list */
+ int lru_next; /* Next in LRU list */
+ bool in_use; /* Is this entry in use? */
+} QueryParseEntry;
+
+/*
+ * Header for the query parse cache in shared memory
+ */
+typedef struct QueryParseCache
+{
+ int num_buckets; /* Number of hash buckets */
+ int max_entries; /* Maximum entries allowed */
+ int num_entries; /* Current number of entries */
+ int free_list_head; /* Head of free entry list */
+ int lru_head; /* Most recently used */
+ int lru_tail; /* Least recently used */
+ /* Flexible array members follow in shared memory:
+ * int buckets[num_buckets];
+ * QueryParseEntry entries[max_entries];
+ */
+} QueryParseCache;
+
+/*
+ * Global state for table mutation map feature
+ */
+typedef struct TableMutationMapState
+{
+ bool initialized; /* Has shared memory been initialized? */
+ uint64 current_ttl_us; /* Current TTL in microseconds */
+ struct timeval ttl_last_updated; /* When TTL was last updated */
+ struct timeval last_cleanup_time; /* When last expired cleanup ran */
+ struct timeval global_cold_start_until; /* Global cold start end time */
+ volatile uint32 stats_queries_checked; /* Number of queries checked */
+ volatile uint32 stats_forced_primary; /* Queries forced to primary */
+ volatile uint32 stats_allowed_replica; /* Queries allowed to replica */
+} TableMutationMapState;
+
+/*
+ * Main shared memory structure containing all components
+ */
+typedef struct TableMutationMapShmem
+{
+ TableMutationMapState state;
+ TableMutationHashTable *table_map;
+ QueryParseCache *query_cache;
+} TableMutationMapShmem;
+
+/* ----------------
+ * Public API functions
+ * ----------------
+ */
+
+/*
+ * Initialize shared memory structures for table mutation map.
+ * Called from pgpool_main.c after pool_init_pool_info().
+ */
+extern void pool_table_mutation_map_init(void);
+
+/*
+ * Initialize per-child process state for table mutation map.
+ * Called from child.c when a new child process starts.
+ * Sets up cold start tracking.
+ */
+extern void pool_table_mutation_map_child_init(void);
+
+/*
+ * Check if the child process is in cold start period.
+ * During cold start, all queries are routed to primary.
+ * Returns true if in cold start, false otherwise.
+ */
+extern bool pool_table_mutation_map_in_cold_start(void);
+
+/*
+ * Trigger a global cold start period for all processes.
+ * Used after watchdog leader change to avoid stale reads.
+ */
+extern void pool_table_mutation_map_trigger_global_cold_start(void);
+
+/*
+ * Get oid of current database.
+ */
+extern int pool_table_mutation_map_get_database_oid(void);
+
+/*
+ * Check if a table was recently written to (is "stale").
+ * If stale, reads from this table should go to primary.
+ * Returns true if table is stale (recently written), false otherwise.
+ */
+extern bool pool_table_mutation_map_table_is_stale(int table_oid, int dboid);
+
+/*
+ * Mark tables as recently written.
+ * Called after INSERT/UPDATE/DELETE queries complete.
+ * table_oids: array of table oids
+ * num_tables: number of tables in array
+ * dboid: database oid
+ */
+extern void pool_table_mutation_map_mark_tables_written(const int *table_oids, int num_tables, int dboid);
+
+/*
+ * Convenience function to mark a single table as written.
+ * table_oid: table oid
+ * dboid: database oid
+ */
+extern void pool_table_mutation_map_mark_table_written(int table_oid, int dboid);
+
+/*
+ * Update the TTL based on current replication delay.
+ * Called from pool_worker_child.c when replication delay is updated.
+ * delay_us: replication delay in microseconds
+ */
+extern void pool_table_mutation_map_update_ttl(uint64 delay_us);
+
+/*
+ * Look up cached parse result for a query.
+ * hash: hash of normalized query
+ * is_write: output - true if query is a write
+ * table_names: output - array to fill with table names
+ * num_tables: output - number of tables found
+ * Returns true if found in cache, false otherwise.
+ */
+extern bool pool_table_mutation_map_get_cached_parse(uint64 hash, bool *is_write,
+ char table_names[][TABLE_MUTATION_MAP_TABLE_NAME_LEN],
+ int *num_tables);
+
+/*
+ * Cache a parse result for a query.
+ * hash: hash of normalized query
+ * is_write: true if query is a write
+ * table_names: array of table names
+ * num_tables: number of tables
+ */
+extern void pool_table_mutation_map_cache_parse(uint64 hash, bool is_write,
+ const char table_names[][TABLE_MUTATION_MAP_TABLE_NAME_LEN],
+ int num_tables);
+
+/*
+ * Normalize a query and compute its hash.
+ * Strips comments, normalizes whitespace and literals.
+ * query: input SQL query string
+ * Returns: 64-bit hash of normalized query
+ */
+extern uint64 pool_table_mutation_map_normalize_and_hash(const char *query);
+
+/*
+ * Calculate required shared memory size for table mutation map.
+ */
+extern Size pool_table_mutation_map_shmem_size(void);
+
+#endif /* POOL_TABLE_MUTATION_MAP_H */
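The header above describes a hash table whose collision chains are linked by array index rather than by pointer. The following is a minimal sketch of how a staleness lookup over such a layout could work. It is not the code from pool_table_mutation_map.c, which is not shown in this part of the patch. SketchEntry is a trimmed-down stand-in for TableMutationEntry, and the hash function, the helper names, and passing the current time and TTL as arguments are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/time.h>

#define SKETCH_INVALID_INDEX (-1)

/* Trimmed-down stand-in for TableMutationEntry. */
typedef struct SketchEntry
{
	int			table_oid;
	int			dboid;
	struct timeval last_write_time;
	int			next;			/* next entry in the chain, or -1 */
	bool		in_use;
} SketchEntry;

/* Hypothetical hash mixing both oids; the patch's real hash may differ. */
static uint32_t
sketch_hash(int table_oid, int dboid)
{
	return ((uint32_t) table_oid * 2654435761u) ^ (uint32_t) dboid;
}

/* Microseconds elapsed from 'then' to 'now'; 0 if 'now' is not later. */
static uint64_t
sketch_elapsed_us(const struct timeval *then, const struct timeval *now)
{
	int64_t		us = (int64_t) (now->tv_sec - then->tv_sec) * 1000000
		+ (int64_t) (now->tv_usec - then->tv_usec);

	return us > 0 ? (uint64_t) us : 0;
}

/*
 * Walk the bucket chain for (table_oid, dboid). The table counts as
 * "stale" only if an entry exists and its last write is younger than
 * the TTL.
 */
static bool
sketch_table_is_stale(const int *buckets, int num_buckets,
					  const SketchEntry *entries,
					  int table_oid, int dboid,
					  const struct timeval *now, uint64_t ttl_us)
{
	int			idx;

	assert(num_buckets > 0);
	idx = buckets[sketch_hash(table_oid, dboid) % (uint32_t) num_buckets];

	while (idx != SKETCH_INVALID_INDEX)
	{
		const SketchEntry *e = &entries[idx];

		if (e->in_use && e->table_oid == table_oid && e->dboid == dboid)
			return sketch_elapsed_us(&e->last_write_time, now) < ttl_us;
		idx = e->next;
	}
	return false;				/* never written, or already evicted */
}
```

Index-based links keep the structure position-independent, which is one plausible reason to prefer them for data that lives in a shared memory segment.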
diff --git a/src/main/pgpool_main.c b/src/main/pgpool_main.c
index fa05e15e7ac435e072298063f918c70aa4e5680c..87dc2c4f09a62e1cd680b8020975e3ecf0813ec0 100644
--- a/src/main/pgpool_main.c
+++ b/src/main/pgpool_main.c
@@ -57,6 +57,7 @@
#include "auth/pool_passwd.h"
#include "auth/pool_hba.h"
#include "query_cache/pool_memqcache.h"
+#include "utils/pool_table_mutation_map.h"
#include "watchdog/wd_internal_commands.h"
#include "watchdog/wd_lifecheck.h"
#include "watchdog/watchdog.h"
@@ -1485,11 +1486,14 @@ sigusr1_interrupt_processor(void)
if (user1SignalSlot->signalFlags[SIG_WATCHDOG_STATE_CHANGED])
{
+ WD_STATES wd_state;
+
ereport(LOG,
(errmsg("Pgpool-II parent process received watchdog state change signal from watchdog")));
user1SignalSlot->signalFlags[SIG_WATCHDOG_STATE_CHANGED] = false;
- if (wd_internal_get_watchdog_local_node_state() == WD_STANDBY)
+ wd_state = wd_internal_get_watchdog_local_node_state();
+ if (wd_state == WD_STANDBY)
{
ereport(LOG,
(errmsg("we have joined the watchdog cluster as STANDBY node"),
@@ -1503,6 +1507,10 @@ sigusr1_interrupt_processor(void)
*/
pool_release_follow_primary_lock(true);
}
+ else if (wd_state == WD_COORDINATOR && pool_config->table_mutation_map_enabled)
+ {
+ pool_table_mutation_map_trigger_global_cold_start();
+ }
}
if (user1SignalSlot->signalFlags[SIG_FAILOVER_INTERRUPT])
{
@@ -3068,6 +3076,12 @@ initialize_shared_mem_objects(bool clear_memcache_oidmaps)
elog(DEBUG1, "watchdog: %zu bytes requested for shared memory", MAXALIGN(wd_ipc_get_shared_mem_size()));
}
+ if (pool_config->table_mutation_map_enabled)
+ {
+ size += MAXALIGN(pool_table_mutation_map_shmem_size());
+ elog(DEBUG1, "table_mutation_map: %zu bytes requested for shared memory", MAXALIGN(pool_table_mutation_map_shmem_size()));
+ }
+
initialize_shared_memory_main_segment(size);
/* Move the backend descriptors to shared memory */
@@ -3184,6 +3198,12 @@ initialize_shared_mem_objects(bool clear_memcache_oidmaps)
wd_ipc_initialize_data();
}
+ /* Initialize table mutation map for tracking recently written tables */
+ if (pool_config->table_mutation_map_enabled)
+ {
+ pool_table_mutation_map_init();
+ }
+
}
/*
diff --git a/src/protocol/CommandComplete.c b/src/protocol/CommandComplete.c
index a3b8f0ea194ffecc79e58566be80562a46eb75ab..2de467496194dd219437eb3721ba9d8c8f999bb6 100644
--- a/src/protocol/CommandComplete.c
+++ b/src/protocol/CommandComplete.c
@@ -38,6 +38,8 @@
#include "utils/palloc.h"
#include "utils/memutils.h"
#include "utils/pool_stream.h"
+#include "utils/pool_table_mutation_map.h"
+#include "utils/pool_select_walker.h"
static int extract_ntuples(char *message);
static POOL_STATUS handle_mismatch_tuples(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend, char *packet, int packetlen, bool command_complete);
@@ -304,6 +306,45 @@ handle_query_context(POOL_CONNECTION_POOL *backend)
node = session_context->query_context->parse_tree;
+ /*
+ * Track table writes for table mutation map feature.
+ * Mark tables as written when INSERT/UPDATE/DELETE completes.
+ */
+ if (pool_config->table_mutation_map_enabled)
+ {
+ char *table_name = NULL;
+ int table_oid = 0;
+ int dboid = 0;
+
+ if (IsA(node, InsertStmt))
+ {
+ InsertStmt *stmt = (InsertStmt *) node;
+ table_name = make_table_name_from_rangevar(stmt->relation);
+ }
+ else if (IsA(node, UpdateStmt))
+ {
+ UpdateStmt *stmt = (UpdateStmt *) node;
+ table_name = make_table_name_from_rangevar(stmt->relation);
+ }
+ else if (IsA(node, DeleteStmt))
+ {
+ DeleteStmt *stmt = (DeleteStmt *) node;
+ table_name = make_table_name_from_rangevar(stmt->relation);
+ }
+
+ if (table_name != NULL)
+ {
+ table_oid = pool_table_name_to_oid(table_name);
+ dboid = pool_table_mutation_map_get_database_oid();
+ if (table_oid > 0 && dboid > 0)
+ {
+ pool_table_mutation_map_mark_table_written(table_oid, dboid);
+ ereport(DEBUG1,
+ (errmsg("table_mutation_map: marked table \"%s\" as written", table_name)));
+ }
+ }
+ }
+
if (IsA(node, PrepareStmt))
{
if (session_context->uncompleted_message)
diff --git a/src/protocol/child.c b/src/protocol/child.c
index c34f057281be62feaf39db1bb605062f56dc398c..26d7cf1d1a6768c109850a43b57373141f9f7eaf 100644
--- a/src/protocol/child.c
+++ b/src/protocol/child.c
@@ -57,6 +57,7 @@
#include "utils/elog.h"
#include "utils/ps_status.h"
#include "utils/timestamp.h"
+#include "utils/pool_table_mutation_map.h"
#include "context/pool_process_context.h"
#include "context/pool_session_context.h"
@@ -213,6 +214,12 @@ do_child(int *fds)
/* Initialize per process context */
pool_init_process_context();
+ /* Initialize table mutation map child state for cold start tracking */
+ if (pool_config->table_mutation_map_enabled)
+ {
+ pool_table_mutation_map_child_init();
+ }
+
/* initialize connection pool */
if (pool_init_cp())
{
diff --git a/src/sample/pgpool.conf.sample-stream b/src/sample/pgpool.conf.sample-stream
index 454fdb9e5d1fd65437b6a67f12ab62658ea08f49..46052bad37bbd1f4affec8e08e5cecd3d4903976 100644
--- a/src/sample/pgpool.conf.sample-stream
+++ b/src/sample/pgpool.conf.sample-stream
@@ -499,6 +499,51 @@ backend_clustering_mode = streaming_replication
#statement_level_load_balance = off
# Enables statement level load balancing
+# - Table Mutation Map (Lagless Read Replica) -
+ # WARNING: Enabling this feature increases shared memory usage
+ # Default settings require ~6.4 MB shared memory
+ # (0.1 MB table tracking + 6.3 MB query cache)
+
+#table_mutation_map_enabled = off
+ # Enable in-memory tracking of recently written tables
+ # to prevent stale reads from replicas during replication lag
+ # (change requires reload)
+
+#table_mutation_map_ttl_factor = 5.0
+ # TTL multiplier: TTL = replication_delay * factor
+ # Higher values provide more safety margin
+ # Range: 1.0-100.0 (default: 5.0)
+ # (change requires reload)
+
+#table_mutation_map_cold_start_duration = 2000
+ # Duration in milliseconds to route all queries to primary
+ # after child process starts (cold start period)
+ # Range: 0-60000 ms (default: 2000 ms = 2 seconds)
+ # Set to 0 to disable cold start behavior
+ # (change requires reload)
+
+#table_mutation_map_table_buckets = 1024
+ # Number of hash buckets for table mutation tracking
+ # Higher values reduce hash collisions
+ # Range: 64-65536 (default: 1024)
+ # (change requires restart)
+
+#table_mutation_map_table_size = 2048
+ # Maximum number of tables to track simultaneously
+ # Range: 128-131072 (default: 2048)
+ # (change requires restart)
+
+#table_mutation_map_query_buckets = 2048
+ # Number of hash buckets for query parse cache
+ # Range: 64-65536 (default: 2048)
+ # (change requires restart)
+
+#table_mutation_map_query_cache_size = 10000
+ # Maximum number of query parse results to cache
+ # Range: 100-1000000 (default: 10000)
+ # Memory usage: ~640 bytes per entry (~6.3 MB default, ~64 MB for 100000)
+ # (change requires restart)
+
#------------------------------------------------------------------------------
# STREAMING REPLICATION MODE
#------------------------------------------------------------------------------
diff --git a/src/streaming_replication/pool_worker_child.c b/src/streaming_replication/pool_worker_child.c
index 311b638658e66ebb56162ad9fa4392315b2df64e..38bd217be1972af57f80c26c8d726aad704d56bd 100644
--- a/src/streaming_replication/pool_worker_child.c
+++ b/src/streaming_replication/pool_worker_child.c
@@ -58,6 +58,7 @@
#include "utils/pool_ip.h"
#include "utils/ps_status.h"
#include "utils/pool_stream.h"
+#include "utils/pool_table_mutation_map.h"
#include "context/pool_process_context.h"
#include "context/pool_session_context.h"
@@ -695,6 +696,7 @@ check_replication_time_lag_with_cmd(void)
double delay_ms;
uint64 delay;
uint64 delay_threshold_by_time;
+ uint64 max_delay_us = 0; /* Track maximum delay for table mutation map */
int token_count = 0;
int primary_node_id;
int save_errno;
@@ -1003,6 +1005,10 @@ check_replication_time_lag_with_cmd(void)
bkinfo->standby_delay = delay;
bkinfo->standby_delay_by_time = true;
+ /* Track maximum delay for table mutation map TTL calculation */
+ if (delay > max_delay_us)
+ max_delay_us = delay;
+
/*
* Log delay if necessary. threshold is in milliseconds, convert
* to microseconds.
@@ -1021,6 +1027,10 @@ check_replication_time_lag_with_cmd(void)
token = strtok_r(NULL, " \t\n", &saveptr);
}
+ /* Update table mutation map TTL based on maximum observed delay */
+ if (pool_config->table_mutation_map_enabled && max_delay_us > 0)
+ pool_table_mutation_map_update_ttl(max_delay_us);
+
}
PG_CATCH();
{
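The worker hunk above tracks the largest standby delay and passes it to pool_table_mutation_map_update_ttl(). As a rough sketch of the arithmetic the documentation describes (TTL = replication_delay * factor), the helper below derives a TTL from one monitoring pass. The name sketch_compute_ttl_us is hypothetical. Falling back to the 100 ms default when no positive delay was observed is an assumption here, since the patch simply skips the update in that case.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors TABLE_MUTATION_MAP_DEFAULT_TTL_US (100 ms) from the header. */
#define SKETCH_DEFAULT_TTL_US (100 * 1000)

/*
 * Derive the TTL, in microseconds, from the delays observed for all
 * standbys in one monitoring pass. The slowest standby decides, so a
 * read is kept on the primary until every replica has likely caught up.
 */
static uint64_t
sketch_compute_ttl_us(const uint64_t *delays_us, int num_standbys,
					  double ttl_factor)
{
	uint64_t	max_delay_us = 0;
	int			i;

	/* documented range of table_mutation_map_ttl_factor */
	assert(ttl_factor >= 1.0 && ttl_factor <= 100.0);

	for (i = 0; i < num_standbys; i++)
		if (delays_us[i] > max_delay_us)
			max_delay_us = delays_us[i];

	/* Assumption: use the default when the delay is unknown or zero. */
	if (max_delay_us == 0)
		return SKETCH_DEFAULT_TTL_US;

	return (uint64_t) ((double) max_delay_us * ttl_factor);
}
```

With standby delays of 20 ms and 50 ms and the default factor of 5.0, the sketch yields a TTL of 250 ms.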
diff --git a/src/test/regression/tests/045.table_mutation_map/test.sh b/src/test/regression/tests/045.table_mutation_map/test.sh
new file mode 100755
index 0000000000000000000000000000000000000000..e0f229ee88a70e4643df1745a6e6992b867354ae
--- /dev/null
+++ b/src/test/regression/tests/045.table_mutation_map/test.sh
@@ -0,0 +1,228 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for table mutation map feature (in-memory table tracking).
+# Tests routing of queries based on recently written tables.
+#
+source $TESTLIBS
+TESTDIR=testdir
+PSQL=$PGBIN/psql
+PSQLOPTS="-a -q -X"
+PGPOOLBIN=$PGPOOL_INSTALL_DIR/bin
+export PGDATABASE=test
+
+# Only run in streaming replication mode since that's the target use case
+for mode in s
+do
+ rm -fr $TESTDIR
+ mkdir $TESTDIR
+ cd $TESTDIR
+
+ # Create test environment with 2 nodes
+ echo -n "creating test environment..."
+ $PGPOOL_SETUP -m $mode -n 2 || exit 1
+ echo "done."
+
+ source ./bashrc.ports
+
+ # Configure table mutation map feature
+ echo "table_mutation_map_enabled = on" >> etc/pgpool.conf
+ echo "table_mutation_map_ttl_factor = 5.0" >> etc/pgpool.conf
+ echo "table_mutation_map_cold_start_duration = 2000" >> etc/pgpool.conf
+
+ # Configure weights so we can distinguish routing
+ # Backend 0 (primary) weight=0, Backend 1 (standby) weight=1
+ # This means load balanced queries go to node 1 by default
+ echo "backend_weight0 = 0" >> etc/pgpool.conf
+ echo "backend_weight1 = 1" >> etc/pgpool.conf
+
+ # Enable debug logging to see routing decisions
+ echo "log_min_messages = debug1" >> etc/pgpool.conf
+
+ ./startall
+
+ export PGPORT=$PGPOOL_PORT
+
+ wait_for_pgpool_startup
+
+ # Create test tables
+ $PSQL test <<EOF
+CREATE TABLE t1(i INTEGER);
+CREATE TABLE t2(i INTEGER);
+CREATE TABLE t3(i INTEGER);
+EOF
+
+ echo "=== Test 1: Cold Start Routing ==="
+ # During cold start, all queries should go to primary
+ # Restart pgpool to trigger cold start
+ ./shutdownall
+ ./startall
+ wait_for_pgpool_startup
+
+ # Immediately query - should go to primary due to cold start
+ $PSQL test -c "SELECT 'cold_start_test' as marker, * FROM t1;" > /dev/null 2>&1
+
+ # Check log for cold start message
+ if grep -q "could not load balance because of table mutation map cold start" log/pgpool.log; then
+ echo "Test 1 PASSED: Cold start routing works"
+ else
+ echo "Test 1 FAILED: Cold start routing not detected"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 2: Wait for cold start to end ==="
+ # Wait for cold start period to end (2 seconds)
+ sleep 3
+
+ # Clear the log
+ > log/pgpool.log
+
+ # Now a clean table query should load balance (go to node 1)
+ $PSQL test -c "SELECT 'after_cold_start' as marker, * FROM t3;" > /dev/null 2>&1
+
+ # After cold start, queries to clean tables should load balance
+ # Check that it did NOT get forced to primary due to table mutation map
+ if grep -q "could not load balance because of table mutation map cold start" log/pgpool.log; then
+ echo "Test 2 FAILED: Still in cold start after waiting"
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 2 PASSED: Cold start ended correctly"
+
+ echo "=== Test 3: Write-then-Read Routing ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Write to t1
+ $PSQL test -c "INSERT INTO t1 VALUES (1);" > /dev/null 2>&1
+
+ # Immediately read from t1 - should go to primary due to recent write
+ $PSQL test -c "SELECT 'write_read_test' as marker, * FROM t1;" > /dev/null 2>&1
+
+ # Check log for table staleness message
+ if grep -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 3 PASSED: Write-then-read routing works"
+ else
+ echo "Test 3 FAILED: Table staleness not detected after write"
+ # Show relevant log entries for debugging
+ grep -i "table_mutation_map" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 4: Clean Table Still Load Balances ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Read from t2 (never written to) - should load balance
+ $PSQL test -c "SELECT 'clean_table_test' as marker, * FROM t2;" > /dev/null 2>&1
+
+ # Should NOT see table mutation map blocking message for t2
+ if grep -q "could not load balance because table.*t2.*was recently written" log/pgpool.log; then
+ echo "Test 4 FAILED: Clean table incorrectly marked as stale"
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 4 PASSED: Clean tables still load balance"
+
+ echo "=== Test 5: UPDATE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Update t2
+ $PSQL test -c "UPDATE t2 SET i = 999 WHERE i = 0;" > /dev/null 2>&1
+
+ # Immediately read from t2 - should go to primary
+ $PSQL test -c "SELECT 'update_test' as marker, * FROM t2;" > /dev/null 2>&1
+
+ if grep -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 5 PASSED: UPDATE marks table as stale"
+ else
+ echo "Test 5 FAILED: UPDATE did not mark table as stale"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 6: DELETE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Delete from t3
+ $PSQL test -c "DELETE FROM t3 WHERE i = 0;" > /dev/null 2>&1
+
+ # Immediately read from t3 - should go to primary
+ $PSQL test -c "SELECT 'delete_test' as marker, * FROM t3;" > /dev/null 2>&1
+
+ if grep -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 6 PASSED: DELETE marks table as stale"
+ else
+ echo "Test 6 FAILED: DELETE did not mark table as stale"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 7: Multi-Table Query with One Stale Table ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create a new clean table
+ $PSQL test -c "CREATE TABLE t4(i INTEGER);" > /dev/null 2>&1
+
+ # Wait a bit for TTL to expire on other tables if factor is low
+ sleep 1
+
+ # Write to t1 only
+ $PSQL test -c "INSERT INTO t1 VALUES (100);" > /dev/null 2>&1
+
+ # Query joining t1 and t4 - should go to primary because t1 is stale
+ $PSQL test -c "SELECT 'multi_table_test' as marker FROM t1, t4;" > /dev/null 2>&1
+
+ if grep -q "could not load balance because table.*t1.*was recently written" log/pgpool.log; then
+ echo "Test 7 PASSED: Multi-table query routes to primary when one table is stale"
+ else
+ echo "Test 7 FAILED: Multi-table staleness not detected"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 8: Different Databases with Same Table Name ==="
+ # Create another database and a table with the same name
+ $PSQL test -c "CREATE DATABASE test2;" > /dev/null 2>&1
+ $PSQL test2 -c "CREATE TABLE t1(i INTEGER);" > /dev/null 2>&1
+
+ # Wait for TTL to expire
+ sleep 1
+
+ # Write to t1 in 'test' database
+ $PSQL test -c "INSERT INTO t1 VALUES (500);" > /dev/null 2>&1
+
+ # Read from t1 in 'test2' database - should load balance (Node 1)
+ # because it's a different database, even if table name is same
+ > log/pgpool.log
+ $PSQL test2 -c "SELECT 'diff_db_test' as marker, * FROM t1;" > /dev/null 2>&1
+
+ if grep -q "could not load balance because table.*t1.*was recently written" log/pgpool.log; then
+ echo "Test 8 FAILED: Table marked as stale in wrong database"
+ ./shutdownall
+ exit 1
+ fi
+
+ # Read from t1 in 'test' database - should go to primary
+ $PSQL test -c "SELECT 'same_db_test' as marker, * FROM t1;" > /dev/null 2>&1
+ if grep -q "could not load balance because table.*t1.*was recently written" log/pgpool.log; then
+ echo "Test 8 PASSED: Correctly distinguishes between databases"
+ else
+ echo "Test 8 FAILED: Table staleness not detected in correct database"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo ""
+ echo "=== All Table Mutation Map Tests PASSED ==="
+
+ ./shutdownall
+
+ cd ..
+done
+
+exit 0
diff --git a/src/utils/pool_table_mutation_map.c b/src/utils/pool_table_mutation_map.c
new file mode 100644
index 0000000000000000000000000000000000000000..300c230ad18aa2204a09974d13ecf8e8958ff36f
--- /dev/null
+++ b/src/utils/pool_table_mutation_map.c
@@ -0,0 +1,1166 @@
+/* -*-pgsql-c-*- */
+/*
+ * pgpool: a language independent connection pool server for PostgreSQL
+ * written by Tatsuo Ishii
+ *
+ * Copyright (c) 2003-2026 PgPool Global Development Group
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose and without fee is hereby
+ * granted, provided that the above copyright notice appear in all
+ * copies and that both that copyright notice and this permission
+ * notice appear in supporting documentation, and that the name of the
+ * author not be used in advertising or publicity pertaining to
+ * distribution of the software without specific, written prior
+ * permission. The author makes no representations about the
+ * suitability of this software for any purpose. It is provided "as
+ * is" without express or implied warranty.
+ *
+ * pool_table_mutation_map.c: In-memory tracking of recently written tables
+ * to avoid stale reads from replicas during replication lag
+ *
+ * Based on the "lagless" architecture from Tailor Brands:
+ * https://medium.com/tailor-tech/using-database-read-replicas-in-distributed-systems-d80eaf6bbf8a
+ */
+
+#include <string.h>
+#include <stdlib.h>
+#include <sys/time.h>
+#include <unistd.h>
+
+#include "pool.h"
+#include "pool_config.h"
+#include "context/pool_session_context.h"
+#include "utils/pool_table_mutation_map.h"
+#include "utils/elog.h"
+#include "utils/pool_ipc.h"
+#include "utils/palloc.h"
+#include "utils/pool_relcache.h"
+
+#define DATABASE_TO_OID_QUERY "SELECT oid FROM pg_catalog.pg_database WHERE datname = '%s'"
+
+/* ----------------
+ * Local variables
+ * ----------------
+ */
+
+/* Pointer to shared memory structure */
+static TableMutationMapShmem *table_mutation_map_shmem = NULL;
+
+/* Per-process cold start tracking (not in shared memory) */
+static struct timeval process_start_time;
+static bool cold_start_initialized = false;
+
+/* ----------------
+ * Helper macros for accessing flexible arrays in shared memory
+ * ----------------
+ */
+
+/* Get pointer to bucket array in table map */
+#define TABLE_MAP_BUCKETS(map) \
+ ((int *)((char *)(map) + sizeof(TableMutationHashTable)))
+
+/* Get pointer to entry array in table map */
+#define TABLE_MAP_ENTRIES(map) \
+ ((TableMutationEntry *)((char *)(map) + sizeof(TableMutationHashTable) + \
+ (map)->num_buckets * sizeof(int)))
+
+/* Get pointer to bucket array in query cache */
+#define QUERY_CACHE_BUCKETS(cache) \
+ ((int *)((char *)(cache) + sizeof(QueryParseCache)))
+
+/* Get pointer to entry array in query cache */
+#define QUERY_CACHE_ENTRIES(cache) \
+ ((QueryParseEntry *)((char *)(cache) + sizeof(QueryParseCache) + \
+ (cache)->num_buckets * sizeof(int)))
+
+/* ----------------
+ * Semaphore lock helpers
+ * ----------------
+ */
+
+static inline void
+table_map_lock(void)
+{
+ pool_semaphore_lock(TABLE_MUTATION_MAP_TABLE_SEM);
+}
+
+static inline void
+table_map_unlock(void)
+{
+ pool_semaphore_unlock(TABLE_MUTATION_MAP_TABLE_SEM);
+}
+
+static inline void
+query_cache_lock(void)
+{
+ pool_semaphore_lock(TABLE_MUTATION_MAP_QUERY_SEM);
+}
+
+static inline void
+query_cache_unlock(void)
+{
+ pool_semaphore_unlock(TABLE_MUTATION_MAP_QUERY_SEM);
+}
+
+/* ----------------
+ * Hash functions
+ * ----------------
+ */
+
+/*
+ * FNV-1a hash for table/database oid pair
+ */
+static uint32
+fnv1a_hash_table_key(int table_oid, int dboid)
+{
+ uint32 hash = 2166136261u; /* FNV offset basis */
+ uint32 data[2];
+ const unsigned char *bytes;
+ size_t i;
+
+ data[0] = (uint32) table_oid;
+ data[1] = (uint32) dboid;
+ bytes = (const unsigned char *) data;
+
+ for (i = 0; i < sizeof(data); i++)
+ {
+ hash ^= bytes[i];
+ hash *= 16777619u; /* FNV prime */
+ }
+
+ return hash;
+}
+
+/*
+ * 64-bit FNV-1a hash of a byte string
+ */
+static uint64
+fnv1a_hash_64(const char *str, size_t len)
+{
+ uint64 hash = 14695981039346656037ULL; /* FNV offset basis for 64-bit */
+ size_t i;
+
+ for (i = 0; i < len; i++)
+ {
+ hash ^= (uint8)str[i];
+ hash *= 1099511628211ULL; /* FNV prime for 64-bit */
+ }
+
+ return hash;
+}
+
+/* ----------------
+ * Time utilities
+ * ----------------
+ */
+
+/*
+ * Get elapsed time in microseconds between two timevals
+ */
+static int64
+elapsed_us(struct timeval *start, struct timeval *end)
+{
+ return ((int64)(end->tv_sec - start->tv_sec) * 1000000) +
+ (end->tv_usec - start->tv_usec);
+}
+
+/*
+ * Get current time
+ */
+static void
+get_current_time(struct timeval *tv)
+{
+ gettimeofday(tv, NULL);
+}
+
+/* ----------------
+ * Database oid lookup
+ * ----------------
+ */
+
+static int
+table_mutation_map_get_database_oid_internal(void)
+{
+ int oid = 0;
+ static POOL_RELCACHE *relcache;
+ POOL_CONNECTION_POOL *backend;
+
+ backend = pool_get_session_context(false)->backend;
+ if (backend == NULL || MAIN_CONNECTION(backend) == NULL || MAIN_CONNECTION(backend)->sp == NULL)
+ return oid;
+
+ if (!relcache)
+ {
+ relcache = pool_create_relcache(pool_config->relcache_size,
+ DATABASE_TO_OID_QUERY,
+ int_register_func,
+ int_unregister_func,
+ false);
+ if (relcache == NULL)
+ {
+ ereport(LOG,
+ (errmsg("table_mutation_map: error creating relcache while getting database OID")));
+ return oid;
+ }
+ }
+
+ oid = (int) (intptr_t) pool_search_relcache(relcache, backend,
+ MAIN_CONNECTION(backend)->sp->database);
+ return oid;
+}
+
+int
+pool_table_mutation_map_get_database_oid(void)
+{
+ return table_mutation_map_get_database_oid_internal();
+}
+
+/* ----------------
+ * Table mutation hash table operations
+ * ----------------
+ */
+
+/*
+ * Initialize table mutation hash table
+ */
+static void
+table_map_init(TableMutationHashTable *map, int num_buckets, int max_entries)
+{
+ int *buckets;
+ TableMutationEntry *entries;
+ int i;
+
+ map->num_buckets = num_buckets;
+ map->max_entries = max_entries;
+ map->num_entries = 0;
+ map->free_list_head = 0;
+
+ buckets = TABLE_MAP_BUCKETS(map);
+ entries = TABLE_MAP_ENTRIES(map);
+
+ /* Initialize all buckets to empty */
+ for (i = 0; i < num_buckets; i++)
+ buckets[i] = TABLE_MUTATION_MAP_INVALID_INDEX;
+
+ /* Initialize free list - chain all entries */
+ for (i = 0; i < max_entries; i++)
+ {
+ entries[i].in_use = false;
+ entries[i].next = (i < max_entries - 1) ? i + 1 : TABLE_MUTATION_MAP_INVALID_INDEX;
+ }
+
+ ereport(DEBUG1,
+ (errmsg("table_mutation_map: initialized table map with %d buckets, %d max entries",
+ num_buckets, max_entries)));
+}
+
+/*
+ * Allocate an entry from the free list
+ */
+static int
+table_map_alloc_entry(TableMutationHashTable *map)
+{
+ TableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ int idx;
+
+ if (map->free_list_head == TABLE_MUTATION_MAP_INVALID_INDEX)
+ return TABLE_MUTATION_MAP_INVALID_INDEX;
+
+ idx = map->free_list_head;
+ map->free_list_head = entries[idx].next;
+ entries[idx].in_use = true;
+ entries[idx].next = TABLE_MUTATION_MAP_INVALID_INDEX;
+ map->num_entries++;
+
+ return idx;
+}
+
+/*
+ * Free an entry back to the free list
+ */
+static void
+table_map_free_entry(TableMutationHashTable *map, int idx)
+{
+ TableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+
+ entries[idx].in_use = false;
+ entries[idx].next = map->free_list_head;
+ map->free_list_head = idx;
+ map->num_entries--;
+}
+
+/*
+ * Look up a table in the hash table
+ * Returns entry index or TABLE_MUTATION_MAP_INVALID_INDEX if not found
+ * Must be called with lock held
+ */
+static int
+table_map_lookup(TableMutationHashTable *map, int table_oid, int dboid, uint32 hash)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ int bucket = hash % map->num_buckets;
+ int idx = buckets[bucket];
+
+ while (idx != TABLE_MUTATION_MAP_INVALID_INDEX)
+ {
+ if (entries[idx].hash == hash &&
+ entries[idx].table_oid == table_oid &&
+ entries[idx].dboid == dboid)
+ {
+ return idx;
+ }
+ idx = entries[idx].next;
+ }
+
+ return TABLE_MUTATION_MAP_INVALID_INDEX;
+}
+
+/*
+ * Insert or update a table entry
+ * Must be called with lock held
+ */
+static void
+table_map_insert(TableMutationHashTable *map, int table_oid, int dboid,
+ uint32 hash, struct timeval *write_time)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ int bucket = hash % map->num_buckets;
+ int idx;
+
+ /* Check if entry already exists */
+ idx = table_map_lookup(map, table_oid, dboid, hash);
+ if (idx != TABLE_MUTATION_MAP_INVALID_INDEX)
+ {
+ /* Update existing entry */
+ entries[idx].last_write_time = *write_time;
+ return;
+ }
+
+ /* Allocate new entry */
+ idx = table_map_alloc_entry(map);
+ if (idx == TABLE_MUTATION_MAP_INVALID_INDEX)
+ {
+ int b;
+ /* Table is full - evict the head of the first non-empty bucket. */
+ /* NOTE: the victim may not have expired yet, so reads of that table can reach a replica early. */
+ for (b = 0; b < map->num_buckets; b++)
+ {
+ if (buckets[b] != TABLE_MUTATION_MAP_INVALID_INDEX)
+ {
+ int victim = buckets[b];
+ buckets[b] = entries[victim].next;
+ table_map_free_entry(map, victim);
+ idx = table_map_alloc_entry(map);
+ break;
+ }
+ }
+
+ if (idx == TABLE_MUTATION_MAP_INVALID_INDEX)
+ {
+ ereport(WARNING,
+ (errmsg("table_mutation_map: failed to allocate entry for table oid %d (dboid %d)",
+ table_oid, dboid)));
+ return;
+ }
+ }
+
+ /* Initialize new entry */
+ entries[idx].table_oid = table_oid;
+ entries[idx].dboid = dboid;
+ entries[idx].hash = hash;
+ entries[idx].last_write_time = *write_time;
+
+ /* Insert at head of bucket chain */
+ entries[idx].next = buckets[bucket];
+ buckets[bucket] = idx;
+
+ ereport(DEBUG2,
+ (errmsg("table_mutation_map: marked table oid %d (dboid %d) as written",
+ table_oid, dboid)));
+}
+
+/*
+ * Remove expired entries from the table map
+ * Must be called with lock held
+ */
+static void
+table_map_cleanup_expired(TableMutationHashTable *map, uint64 ttl_us)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ struct timeval now;
+ int removed = 0;
+ int b;
+
+ get_current_time(&now);
+
+ for (b = 0; b < map->num_buckets; b++)
+ {
+ int *prev_ptr = &buckets[b];
+ int idx = buckets[b];
+
+ while (idx != TABLE_MUTATION_MAP_INVALID_INDEX)
+ {
+ int64 elapsed = elapsed_us(&entries[idx].last_write_time, &now);
+
+ if (elapsed > (int64)ttl_us)
+ {
+ /* Entry has expired - remove it */
+ int next = entries[idx].next;
+ *prev_ptr = next;
+ table_map_free_entry(map, idx);
+ idx = next;
+ removed++;
+ }
+ else
+ {
+ prev_ptr = &entries[idx].next;
+ idx = entries[idx].next;
+ }
+ }
+ }
+
+ if (removed > 0)
+ {
+ ereport(DEBUG1,
+ (errmsg("table_mutation_map: cleaned up %d expired table entries", removed)));
+ }
+}
+
+/* ----------------
+ * Query parse cache operations
+ * ----------------
+ */
+
+/*
+ * Initialize query parse cache
+ */
+static void
+query_cache_init(QueryParseCache *cache, int num_buckets, int max_entries)
+{
+ int *buckets;
+ QueryParseEntry *entries;
+ int i;
+
+ cache->num_buckets = num_buckets;
+ cache->max_entries = max_entries;
+ cache->num_entries = 0;
+ cache->free_list_head = 0;
+ cache->lru_head = TABLE_MUTATION_MAP_INVALID_INDEX;
+ cache->lru_tail = TABLE_MUTATION_MAP_INVALID_INDEX;
+
+ buckets = QUERY_CACHE_BUCKETS(cache);
+ entries = QUERY_CACHE_ENTRIES(cache);
+
+ /* Initialize all buckets to empty */
+ for (i = 0; i < num_buckets; i++)
+ buckets[i] = TABLE_MUTATION_MAP_INVALID_INDEX;
+
+ /* Initialize free list */
+ for (i = 0; i < max_entries; i++)
+ {
+ entries[i].in_use = false;
+ entries[i].next = (i < max_entries - 1) ? i + 1 : TABLE_MUTATION_MAP_INVALID_INDEX;
+ entries[i].lru_prev = TABLE_MUTATION_MAP_INVALID_INDEX;
+ entries[i].lru_next = TABLE_MUTATION_MAP_INVALID_INDEX;
+ }
+
+ ereport(DEBUG1,
+ (errmsg("table_mutation_map: initialized query cache with %d buckets, %d max entries",
+ num_buckets, max_entries)));
+}
+
+/*
+ * Move entry to front of LRU list (most recently used)
+ */
+static void
+query_cache_lru_touch(QueryParseCache *cache, int idx)
+{
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+
+ /* Already at head? */
+ if (cache->lru_head == idx)
+ return;
+
+ /* Remove from current position */
+ if (entries[idx].lru_prev != TABLE_MUTATION_MAP_INVALID_INDEX)
+ entries[entries[idx].lru_prev].lru_next = entries[idx].lru_next;
+ if (entries[idx].lru_next != TABLE_MUTATION_MAP_INVALID_INDEX)
+ entries[entries[idx].lru_next].lru_prev = entries[idx].lru_prev;
+ if (cache->lru_tail == idx)
+ cache->lru_tail = entries[idx].lru_prev;
+
+ /* Insert at head */
+ entries[idx].lru_prev = TABLE_MUTATION_MAP_INVALID_INDEX;
+ entries[idx].lru_next = cache->lru_head;
+ if (cache->lru_head != TABLE_MUTATION_MAP_INVALID_INDEX)
+ entries[cache->lru_head].lru_prev = idx;
+ cache->lru_head = idx;
+ if (cache->lru_tail == TABLE_MUTATION_MAP_INVALID_INDEX)
+ cache->lru_tail = idx;
+}
+
+/*
+ * Add entry to LRU list (at head)
+ */
+static void
+query_cache_lru_add(QueryParseCache *cache, int idx)
+{
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+
+ entries[idx].lru_prev = TABLE_MUTATION_MAP_INVALID_INDEX;
+ entries[idx].lru_next = cache->lru_head;
+
+ if (cache->lru_head != TABLE_MUTATION_MAP_INVALID_INDEX)
+ entries[cache->lru_head].lru_prev = idx;
+
+ cache->lru_head = idx;
+
+ if (cache->lru_tail == TABLE_MUTATION_MAP_INVALID_INDEX)
+ cache->lru_tail = idx;
+}
+
+/*
+ * Remove entry from LRU list
+ */
+static void
+query_cache_lru_remove(QueryParseCache *cache, int idx)
+{
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+
+ if (entries[idx].lru_prev != TABLE_MUTATION_MAP_INVALID_INDEX)
+ entries[entries[idx].lru_prev].lru_next = entries[idx].lru_next;
+ else
+ cache->lru_head = entries[idx].lru_next;
+
+ if (entries[idx].lru_next != TABLE_MUTATION_MAP_INVALID_INDEX)
+ entries[entries[idx].lru_next].lru_prev = entries[idx].lru_prev;
+ else
+ cache->lru_tail = entries[idx].lru_prev;
+
+ entries[idx].lru_prev = TABLE_MUTATION_MAP_INVALID_INDEX;
+ entries[idx].lru_next = TABLE_MUTATION_MAP_INVALID_INDEX;
+}
+
+/*
+ * Allocate entry from free list, evicting LRU if necessary
+ */
+static int
+query_cache_alloc_entry(QueryParseCache *cache)
+{
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+ int *buckets = QUERY_CACHE_BUCKETS(cache);
+ int idx;
+
+ if (cache->free_list_head != TABLE_MUTATION_MAP_INVALID_INDEX)
+ {
+ idx = cache->free_list_head;
+ cache->free_list_head = entries[idx].next;
+ entries[idx].in_use = true;
+ entries[idx].next = TABLE_MUTATION_MAP_INVALID_INDEX;
+ cache->num_entries++;
+ return idx;
+ }
+
+ /* No free entries - evict LRU */
+ if (cache->lru_tail == TABLE_MUTATION_MAP_INVALID_INDEX)
+ return TABLE_MUTATION_MAP_INVALID_INDEX;
+
+ idx = cache->lru_tail;
+
+ /* Remove from hash bucket */
+ int bucket = entries[idx].query_hash % cache->num_buckets;
+ int *prev_ptr = &buckets[bucket];
+ int curr = buckets[bucket];
+
+ while (curr != TABLE_MUTATION_MAP_INVALID_INDEX)
+ {
+ if (curr == idx)
+ {
+ *prev_ptr = entries[curr].next;
+ break;
+ }
+ prev_ptr = &entries[curr].next;
+ curr = entries[curr].next;
+ }
+
+ /* Remove from LRU list */
+ query_cache_lru_remove(cache, idx);
+
+ /* Reinitialize entry */
+ entries[idx].in_use = true;
+ entries[idx].next = TABLE_MUTATION_MAP_INVALID_INDEX;
+
+ return idx;
+}
+
+/*
+ * Look up a query in the cache
+ */
+static int
+query_cache_lookup(QueryParseCache *cache, uint64 hash)
+{
+ int *buckets = QUERY_CACHE_BUCKETS(cache);
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+ int bucket = hash % cache->num_buckets;
+ int idx = buckets[bucket];
+
+ while (idx != TABLE_MUTATION_MAP_INVALID_INDEX)
+ {
+ if (entries[idx].query_hash == hash)
+ return idx;
+ idx = entries[idx].next;
+ }
+
+ return TABLE_MUTATION_MAP_INVALID_INDEX;
+}
+
+/* ----------------
+ * Query normalization
+ * ----------------
+ */
+
+/*
+ * Simple query normalization:
+ * - Strip comments (-- style and C-style block comments)
+ * - Collapse whitespace
+ * - Convert to lowercase (except inside strings)
+ * - Replace literal values with placeholders
+ *
+ * This is a simplified version - pgpool2 already does this elsewhere,
+ * but we need a standalone version for the table mutation map feature.
+ */
+static size_t
+normalize_query(const char *query, char *output, size_t output_size)
+{
+ const char *src = query;
+ char *dst = output;
+ char *dst_end = output + output_size - 1;
+ bool in_string = false;
+ char string_char = 0;
+ bool last_was_space = true; /* Start true to skip leading space */
+
+ while (*src && dst < dst_end)
+ {
+ /* Handle string literals */
+ if (in_string)
+ {
+ if (*src == string_char)
+ {
+ if (*(src + 1) == string_char)
+ {
+ /* Escaped quote */
+ src += 2;
+ continue;
+ }
+ in_string = false;
+ *dst++ = '$'; /* Replace string content with placeholder */
+ }
+ src++;
+ continue;
+ }
+
+ /* Check for string literal start; double-quoted text is an identifier, so it is kept */
+ if (*src == '\'')
+ {
+ in_string = true;
+ string_char = *src;
+ src++;
+ continue;
+ }
+
+ /* Handle single-line comments */
+ if (*src == '-' && *(src + 1) == '-')
+ {
+ while (*src && *src != '\n')
+ src++;
+ continue;
+ }
+
+ /* Handle multi-line comments */
+ if (*src == '/' && *(src + 1) == '*')
+ {
+ src += 2;
+ while (*src && !(*src == '*' && *(src + 1) == '/'))
+ src++;
+ if (*src)
+ src += 2;
+ continue;
+ }
+
+ /* Handle whitespace */
+ if (*src == ' ' || *src == '\t' || *src == '\n' || *src == '\r')
+ {
+ if (!last_was_space)
+ {
+ *dst++ = ' ';
+ last_was_space = true;
+ }
+ src++;
+ continue;
+ }
+
+ /* Handle numeric literals - replace with placeholder; digits inside identifiers (e.g. t1) are kept */
+ if (((*src >= '0' && *src <= '9') || (*src == '.' && *(src + 1) >= '0' && *(src + 1) <= '9')) &&
+ !(dst > output && ((*(dst - 1) >= 'a' && *(dst - 1) <= 'z') || (*(dst - 1) >= '0' && *(dst - 1) <= '9') || *(dst - 1) == '_')))
+ {
+ while (*src && ((*src >= '0' && *src <= '9') || *src == '.'))
+ src++;
+ if (dst == output || *(dst - 1) != '$')
+ *dst++ = '$';
+ last_was_space = false;
+ continue;
+ }
+
+ /* Regular character - convert to lowercase */
+ if (*src >= 'A' && *src <= 'Z')
+ *dst++ = *src + 32;
+ else
+ *dst++ = *src;
+
+ last_was_space = false;
+ src++;
+ }
+
+ /* Remove trailing space */
+ if (dst > output && *(dst - 1) == ' ')
+ dst--;
+
+ *dst = '\0';
+ return dst - output;
+}
+
+/* ----------------
+ * Public API implementation
+ * ----------------
+ */
+
+Size
+pool_table_mutation_map_shmem_size(void)
+{
+ Size size = 0;
+ int table_buckets = pool_config->table_mutation_map_table_buckets;
+ int table_size = pool_config->table_mutation_map_table_size;
+ int query_buckets = pool_config->table_mutation_map_query_buckets;
+ int query_cache_size = pool_config->table_mutation_map_query_cache_size;
+
+ /* Main structure */
+ size += sizeof(TableMutationMapShmem);
+
+ /* Table mutation hash table */
+ size += sizeof(TableMutationHashTable);
+ size += table_buckets * sizeof(int); /* buckets array */
+ size += table_size * sizeof(TableMutationEntry); /* entries array */
+
+ /* Query parse cache */
+ size += sizeof(QueryParseCache);
+ size += query_buckets * sizeof(int); /* buckets array */
+ size += query_cache_size * sizeof(QueryParseEntry); /* entries array */
+
+ return size;
+}
+
+void
+pool_table_mutation_map_init(void)
+{
+#ifndef POOL_PRIVATE
+ Size shmem_size;
+ char *shmem_ptr;
+
+ if (!pool_config->table_mutation_map_enabled)
+ {
+ ereport(DEBUG1,
+ (errmsg("table_mutation_map: feature disabled")));
+ return;
+ }
+
+ shmem_size = pool_table_mutation_map_shmem_size();
+
+ /*
+ * Allocate from the main shared memory segment.
+ * Memory is already zeroed by initialize_shared_memory_main_segment().
+ */
+ shmem_ptr = pool_shared_memory_segment_get_chunk(shmem_size);
+ if (shmem_ptr == NULL)
+ {
+ ereport(ERROR,
+ (errmsg("table_mutation_map: failed to allocate %zu bytes of shared memory",
+ shmem_size)));
+ return;
+ }
+
+ /* Set up pointers to structures within shared memory */
+ table_mutation_map_shmem = (TableMutationMapShmem *)shmem_ptr;
+ shmem_ptr += sizeof(TableMutationMapShmem);
+
+ table_mutation_map_shmem->table_map = (TableMutationHashTable *)shmem_ptr;
+ shmem_ptr += sizeof(TableMutationHashTable);
+ shmem_ptr += pool_config->table_mutation_map_table_buckets * sizeof(int);
+ shmem_ptr += pool_config->table_mutation_map_table_size * sizeof(TableMutationEntry);
+
+ table_mutation_map_shmem->query_cache = (QueryParseCache *)shmem_ptr;
+
+ /* Initialize structures */
+ table_map_init(table_mutation_map_shmem->table_map,
+ pool_config->table_mutation_map_table_buckets,
+ pool_config->table_mutation_map_table_size);
+
+ query_cache_init(table_mutation_map_shmem->query_cache,
+ pool_config->table_mutation_map_query_buckets,
+ pool_config->table_mutation_map_query_cache_size);
+
+ /* Initialize global state */
+ table_mutation_map_shmem->state.initialized = true;
+ table_mutation_map_shmem->state.current_ttl_us = TABLE_MUTATION_MAP_DEFAULT_TTL_US;
+ get_current_time(&table_mutation_map_shmem->state.ttl_last_updated);
+ get_current_time(&table_mutation_map_shmem->state.last_cleanup_time);
+ table_mutation_map_shmem->state.global_cold_start_until.tv_sec = 0;
+ table_mutation_map_shmem->state.global_cold_start_until.tv_usec = 0;
+ table_mutation_map_shmem->state.stats_queries_checked = 0;
+ table_mutation_map_shmem->state.stats_forced_primary = 0;
+ table_mutation_map_shmem->state.stats_allowed_replica = 0;
+
+ ereport(LOG,
+ (errmsg("table_mutation_map: initialized with %zu bytes shared memory",
+ shmem_size)));
+#endif
+}
+
+void
+pool_table_mutation_map_child_init(void)
+{
+ if (!pool_config->table_mutation_map_enabled || table_mutation_map_shmem == NULL)
+ return;
+
+ get_current_time(&process_start_time);
+ cold_start_initialized = true;
+
+ ereport(DEBUG1,
+ (errmsg("table_mutation_map: child initialized, cold start period %d ms",
+ pool_config->table_mutation_map_cold_start_duration)));
+}
+
+bool
+pool_table_mutation_map_in_cold_start(void)
+{
+ struct timeval now;
+ int64 elapsed_ms;
+
+ if (!pool_config->table_mutation_map_enabled || table_mutation_map_shmem == NULL)
+ return false;
+
+ if (pool_config->table_mutation_map_cold_start_duration <= 0)
+ return false;
+
+ get_current_time(&now);
+
+ if (table_mutation_map_shmem->state.global_cold_start_until.tv_sec != 0 &&
+ elapsed_us(&now, &table_mutation_map_shmem->state.global_cold_start_until) > 0)
+ {
+ return true;
+ }
+
+ if (!cold_start_initialized)
+ return false;
+
+ elapsed_ms = elapsed_us(&process_start_time, &now) / 1000;
+
+ if (elapsed_ms < pool_config->table_mutation_map_cold_start_duration)
+ {
+ ereport(DEBUG2,
+ (errmsg("table_mutation_map: in cold start (%ld/%d ms)",
+ (long)elapsed_ms, pool_config->table_mutation_map_cold_start_duration)));
+ return true;
+ }
+
+ return false;
+}
+
+void
+pool_table_mutation_map_trigger_global_cold_start(void)
+{
+ struct timeval now;
+ int duration_ms;
+
+ if (!pool_config->table_mutation_map_enabled || table_mutation_map_shmem == NULL)
+ return;
+
+ duration_ms = pool_config->table_mutation_map_cold_start_duration;
+ if (duration_ms <= 0)
+ return;
+
+ get_current_time(&now);
+ table_mutation_map_shmem->state.global_cold_start_until = now;
+ table_mutation_map_shmem->state.global_cold_start_until.tv_sec += duration_ms / 1000;
+ table_mutation_map_shmem->state.global_cold_start_until.tv_usec += (duration_ms % 1000) * 1000;
+ if (table_mutation_map_shmem->state.global_cold_start_until.tv_usec >= 1000000)
+ {
+ table_mutation_map_shmem->state.global_cold_start_until.tv_sec +=
+ table_mutation_map_shmem->state.global_cold_start_until.tv_usec / 1000000;
+ table_mutation_map_shmem->state.global_cold_start_until.tv_usec %=
+ 1000000;
+ }
+
+ ereport(LOG,
+ (errmsg("table_mutation_map: entering global cold start for %d ms",
+ duration_ms)));
+}
+
+bool
+pool_table_mutation_map_table_is_stale(int table_oid, int dboid)
+{
+ TableMutationHashTable *map;
+ struct timeval now;
+ uint64 ttl_us;
+ uint32 hash;
+ int idx;
+ bool is_stale = false;
+
+ if (!pool_config->table_mutation_map_enabled || table_mutation_map_shmem == NULL)
+ return false;
+
+ if (table_oid <= 0 || dboid <= 0)
+ {
+ is_stale = true;
+ goto update_stats;
+ }
+
+ map = table_mutation_map_shmem->table_map;
+ hash = fnv1a_hash_table_key(table_oid, dboid);
+
+ table_map_lock();
+
+ idx = table_map_lookup(map, table_oid, dboid, hash);
+ if (idx != TABLE_MUTATION_MAP_INVALID_INDEX)
+ {
+ TableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ get_current_time(&now);
+ ttl_us = table_mutation_map_shmem->state.current_ttl_us;
+
+ int64 elapsed = elapsed_us(&entries[idx].last_write_time, &now);
+ is_stale = (elapsed < (int64)ttl_us);
+
+ ereport(DEBUG2,
+ (errmsg("table_mutation_map: table oid %d (dboid %d) elapsed=%ld us, ttl=%lu us, stale=%d",
+ table_oid, dboid, (long)elapsed, (unsigned long)ttl_us, is_stale)));
+ }
+
+ table_map_unlock();
+
+update_stats:
+ /* Update statistics */
+ __sync_fetch_and_add(&table_mutation_map_shmem->state.stats_queries_checked, 1);
+ if (is_stale)
+ __sync_fetch_and_add(&table_mutation_map_shmem->state.stats_forced_primary, 1);
+ else
+ __sync_fetch_and_add(&table_mutation_map_shmem->state.stats_allowed_replica, 1);
+
+ return is_stale;
+}
+
+void
+pool_table_mutation_map_mark_tables_written(const int *table_oids, int num_tables, int dboid)
+{
+ TableMutationHashTable *map;
+ struct timeval now;
+ int i;
+
+ if (!pool_config->table_mutation_map_enabled || table_mutation_map_shmem == NULL)
+ return;
+
+ if (num_tables <= 0 || table_oids == NULL || dboid <= 0)
+ return;
+
+ map = table_mutation_map_shmem->table_map;
+ get_current_time(&now);
+
+ table_map_lock();
+
+ /* Clean up expired entries once the map is more than 3/4 full */
+ if (map->num_entries > map->max_entries * 3 / 4)
+ {
+ /* Limit cleanup frequency to avoid O(N) scan on every write */
+ /* 100ms interval */
+ if (elapsed_us(&table_mutation_map_shmem->state.last_cleanup_time, &now) > 100000)
+ {
+ table_map_cleanup_expired(map, table_mutation_map_shmem->state.current_ttl_us);
+ table_mutation_map_shmem->state.last_cleanup_time = now;
+ }
+ }
+
+ for (i = 0; i < num_tables; i++)
+ {
+ uint32 hash;
+ int table_oid = table_oids[i];
+
+ if (table_oid > 0)
+ {
+ hash = fnv1a_hash_table_key(table_oid, dboid);
+ table_map_insert(map, table_oid, dboid, hash, &now);
+ }
+ }
+
+ table_map_unlock();
+}
+
+/*
+ * Convenience function to mark a single table as written
+ */
+void
+pool_table_mutation_map_mark_table_written(int table_oid, int dboid)
+{
+ if (table_oid > 0 && dboid > 0)
+ {
+ const int tables[1] = { table_oid };
+ pool_table_mutation_map_mark_tables_written(tables, 1, dboid);
+ }
+}
+
+void
+pool_table_mutation_map_update_ttl(uint64 delay_us)
+{
+ uint64 new_ttl;
+
+ if (!pool_config->table_mutation_map_enabled || table_mutation_map_shmem == NULL)
+ return;
+
+ /* Calculate new TTL: delay * factor, with minimum of default TTL */
+ new_ttl = (uint64)(delay_us * pool_config->table_mutation_map_ttl_factor);
+ if (new_ttl < TABLE_MUTATION_MAP_DEFAULT_TTL_US)
+ new_ttl = TABLE_MUTATION_MAP_DEFAULT_TTL_US;
+
+ /* Maximum TTL of 1 hour */
+ if (new_ttl > 3600ULL * 1000000ULL)
+ new_ttl = 3600ULL * 1000000ULL;
+
+ table_mutation_map_shmem->state.current_ttl_us = new_ttl;
+ get_current_time(&table_mutation_map_shmem->state.ttl_last_updated);
+
+ ereport(DEBUG1,
+ (errmsg("table_mutation_map: updated TTL to %lu us (delay=%lu us, factor=%.1f)",
+ (unsigned long)new_ttl, (unsigned long)delay_us,
+ pool_config->table_mutation_map_ttl_factor)));
+}
+
+bool
+pool_table_mutation_map_get_cached_parse(uint64 hash, bool *is_write,
+ char table_names[][TABLE_MUTATION_MAP_TABLE_NAME_LEN],
+ int *num_tables)
+{
+ QueryParseCache *cache;
+ int idx;
+ bool found = false;
+
+ if (!pool_config->table_mutation_map_enabled || table_mutation_map_shmem == NULL)
+ return false;
+
+ cache = table_mutation_map_shmem->query_cache;
+
+ query_cache_lock();
+
+ idx = query_cache_lookup(cache, hash);
+ if (idx != TABLE_MUTATION_MAP_INVALID_INDEX)
+ {
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+ int i;
+
+ *is_write = entries[idx].is_write;
+ *num_tables = entries[idx].num_tables;
+
+ for (i = 0; i < entries[idx].num_tables && i < TABLE_MUTATION_MAP_MAX_TABLES_PER_QUERY; i++)
+ {
+ strlcpy(table_names[i], entries[idx].table_names[i], TABLE_MUTATION_MAP_TABLE_NAME_LEN);
+ }
+
+ /* Move to front of LRU */
+ query_cache_lru_touch(cache, idx);
+ found = true;
+ }
+
+ query_cache_unlock();
+
+ return found;
+}
+
+void
+pool_table_mutation_map_cache_parse(uint64 hash, bool is_write,
+ const char table_names[][TABLE_MUTATION_MAP_TABLE_NAME_LEN],
+ int num_tables)
+{
+ QueryParseCache *cache;
+ int *buckets;
+ QueryParseEntry *entries;
+ int idx;
+ int bucket;
+
+ if (!pool_config->table_mutation_map_enabled || table_mutation_map_shmem == NULL)
+ return;
+
+ cache = table_mutation_map_shmem->query_cache;
+
+ query_cache_lock();
+
+ /* Check if already exists */
+ idx = query_cache_lookup(cache, hash);
+ if (idx != TABLE_MUTATION_MAP_INVALID_INDEX)
+ {
+ query_cache_unlock();
+ return;
+ }
+
+ /* Allocate new entry (may evict LRU) */
+ idx = query_cache_alloc_entry(cache);
+ if (idx == TABLE_MUTATION_MAP_INVALID_INDEX)
+ {
+ query_cache_unlock();
+ ereport(WARNING,
+ (errmsg("table_mutation_map: failed to allocate query cache entry")));
+ return;
+ }
+
+ entries = QUERY_CACHE_ENTRIES(cache);
+ buckets = QUERY_CACHE_BUCKETS(cache);
+
+ /* Fill in entry */
+ entries[idx].query_hash = hash;
+ entries[idx].is_write = is_write;
+ entries[idx].num_tables = (num_tables > TABLE_MUTATION_MAP_MAX_TABLES_PER_QUERY) ?
+ TABLE_MUTATION_MAP_MAX_TABLES_PER_QUERY : num_tables;
+
+ {
+ int i;
+ for (i = 0; i < entries[idx].num_tables; i++)
+ {
+ strlcpy(entries[idx].table_names[i], table_names[i], TABLE_MUTATION_MAP_TABLE_NAME_LEN);
+ }
+ }
+
+ /* Insert into hash bucket */
+ bucket = hash % cache->num_buckets;
+ entries[idx].next = buckets[bucket];
+ buckets[bucket] = idx;
+
+ /* Add to LRU list */
+ query_cache_lru_add(cache, idx);
+
+ query_cache_unlock();
+}
+
+uint64
+pool_table_mutation_map_normalize_and_hash(const char *query)
+{
+ char normalized[8192];
+ size_t len;
+
+ if (query == NULL || query[0] == '\0')
+ return 0;
+
+ len = normalize_query(query, normalized, sizeof(normalized));
+ if (len == 0)
+ return 0;
+
+ return fnv1a_hash_64(normalized, len);
+}
--
2.52.0
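
For readers skimming the thread, the TTL rule in `pool_table_mutation_map_update_ttl` at the end of the patch above (replication delay times factor, floored at a default and capped at one hour) can be restated as a small standalone function. This is only a sketch, not the patch's code: `sketch_compute_ttl_us` and `SKETCH_MIN_TTL_US` are hypothetical names, and the 1-second floor is an assumed value because `TABLE_MUTATION_MAP_DEFAULT_TTL_US` is not shown in this part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed floor: stands in for TABLE_MUTATION_MAP_DEFAULT_TTL_US (value not shown in the patch excerpt). */
#define SKETCH_MIN_TTL_US 1000000ULL
/* One-hour cap, matching the 3600ULL * 1000000ULL limit in the patch. */
#define SKETCH_MAX_TTL_US (3600ULL * 1000000ULL)

/* Hypothetical helper: TTL = delay * factor, clamped to [min, max]. */
static uint64_t
sketch_compute_ttl_us(uint64_t delay_us, double ttl_factor)
{
    double scaled;
    uint64_t ttl;

    /* The documented valid range for the factor is 1.0-100.0. */
    assert(ttl_factor >= 1.0 && ttl_factor <= 100.0);

    scaled = (double) delay_us * ttl_factor;

    /* Apply the cap before converting back to an integer. */
    if (scaled >= (double) SKETCH_MAX_TTL_US)
        return SKETCH_MAX_TTL_US;

    ttl = (uint64_t) scaled;
    if (ttl < SKETCH_MIN_TTL_US)
        ttl = SKETCH_MIN_TTL_US;

    return ttl;
}
```

With these assumed constants, a measured delay of 400 ms and the default factor of 5.0 gives a 2-second TTL, and a delay of zero falls back to the floor. The sketch checks the cap before the float-to-integer conversion, which is a small difference from the patch and only matters for implausibly large delays.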
* Re: Proposal: Recent mutated table tracking in memory
From: Tatsuo Ishii @ 2026-02-03 07:43 UTC (permalink / raw)
To: [email protected]; +Cc: [email protected]
Hi Nadav,
Thank you for updating the patch!
> Thank you for the comments!
>
> I agree with all of them. Let me know what you think of the changes and new
> naming.
I still think "memory_map" is too generic. Anything put in memory for
data mapping could be called a "memory map". I recommend changing the
name to a more feature-specific one: what about replacing "memory_map"
with "track_table_mutation"? It's a slightly longer name, but it
clearly represents the feature. Any better ideas are welcome.
- memory_map_enabled: Enable/disable the feature (default: off)
- memory_map_ttl_factor: TTL multiplier for replication delay (default: 5.0)
- memory_map_cold_start_duration: Cold start period in ms (default: 2000)
- memory_map_table_buckets: Hash buckets for table map (default: 1024)
- memory_map_table_size: Max tracked tables (default: 2048)
- memory_map_query_buckets: Hash buckets for query cache (default: 2048)
- memory_map_query_cache_size: Max cached queries (default: 10000)
Also, I feel memory_map_query_cache_size is confusing because there's
already a "query cache" feature in pgpool. Can we change it to something
like "query_parse_cache_size"?
Review comments:
(1) Why is the regression test numbered 45? Shouldn't it be 42? (The
last feature test is 041.external_replication_delay.)
(2) You enhanced the patch to deal with watchdog leader changes. That's
good. However, I don't see a test case for it in test.sh.
(3) It seems the patch does not support TRUNCATE, MERGE, PREPARE and
WITH + updating. If so, it should be noted in the docs as a limitation
of the feature.
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
* Re: Proposal: Recent mutated table tracking in memory
From: Tatsuo Ishii @ 2026-02-03 23:23 UTC (permalink / raw)
To: [email protected]; +Cc: [email protected]
From: Tatsuo Ishii <[email protected]>
Subject: Re: Proposal: Recent mutated table tracking in memory
Date: Tue, 03 Feb 2026 16:43:53 +0900 (JST)
Message-ID: <[email protected]>
> Hi Nadav,
>
> Thank you for updating the patch!
>
>> Thank you for the comments!
>>
>> I agree with all of them. Let me know what you think of the changes and new
>> naming.
>
> I still think "memory_map" is too generic. Anything put on memory for
> data mapping could be called "memory map". I recommend to change the
> name to more feature specific one: What about replacing "memory_map"
> with "track_table_mutation"? It's a little bit longer name but it
> clearly represents the feature. Any better ideas are welcome.
>
> - memory_map_enabled: Enable/disable the feature (default: off)
> - memory_map_ttl_factor: TTL multiplier for replication delay (default: 5.0)
> - memory_map_cold_start_duration: Cold start period in ms (default: 2000)
> - memory_map_table_buckets: Hash buckets for table map (default: 1024)
> - memory_map_table_size: Max tracked tables (default: 2048)
> - memory_map_query_buckets: Hash buckets for query cache (default: 2048)
> - memory_map_query_cache_size: Max cached queries (default: 10000)
>
> Also I feel memory_map_query_cache_size is confusing because there's
> already "query cache" feature in pgpool. Can we change it something
> like "query_parse_cache_size"?
>
> Review comments:
>
> (1) Why the regression test is 45? Shouldn't it be 42? (the last
> feature test is 041.external_replication_delay).
>
> (2) You enhance the patch to deal with leader watch changing. That's
> good. However, I don't see a test case for it in test.sh.
>
> (3) It seems the patch does not support TRUNCATE, MERGE, PREPARE and
> WITH + updating. If so, it should be noted in the docs as a limitation
> of the feature.
(4) It seems the patch does not consider transactions. If an UPDATE is
performed in a transaction and the transaction is rolled back, load
balancing is disabled despite the fact that the table modification did
not happen.
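
One way to picture the alternative raised in (4) is a per-session buffer: writes inside a transaction are only remembered locally, then published to the shared map on COMMIT and thrown away on ROLLBACK. The following is a minimal sketch under stated assumptions, not part of the patch: `PendingWrites`, `pending_add`, `pending_commit`, `pending_rollback` and the limit `SKETCH_MAX_PENDING` are all hypothetical names.

```c
#include <stdbool.h>

#define SKETCH_MAX_PENDING 64   /* assumed per-session limit, purely illustrative */

/* Hypothetical per-session list of tables written in the open transaction. */
typedef struct
{
    int oids[SKETCH_MAX_PENDING];
    int count;
} PendingWrites;

static void
pending_init(PendingWrites *p)
{
    p->count = 0;
}

/*
 * Remember a written table (duplicates are ignored).
 * Returns false when the buffer is full; a caller would then fall back
 * to marking the table immediately, which is the conservative choice.
 */
static bool
pending_add(PendingWrites *p, int table_oid)
{
    int i;

    for (i = 0; i < p->count; i++)
        if (p->oids[i] == table_oid)
            return true;

    if (p->count >= SKETCH_MAX_PENDING)
        return false;

    p->oids[p->count++] = table_oid;
    return true;
}

/* On COMMIT: copy out up to max_out OIDs to publish, then reset the buffer. */
static int
pending_commit(PendingWrites *p, int *out_oids, int max_out)
{
    int n = (p->count < max_out) ? p->count : max_out;
    int i;

    for (i = 0; i < n; i++)
        out_oids[i] = p->oids[i];

    p->count = 0;
    return n;
}

/* On ROLLBACK: nothing was modified, so nothing is published. */
static void
pending_rollback(PendingWrites *p)
{
    p->count = 0;
}
```

The array returned by `pending_commit` has the shape that a batch call such as the patch's `pool_table_mutation_map_mark_tables_written(tables, n, dboid)` takes, so the TTL would start at commit time rather than at statement time. Whether the extra bookkeeping is worth it compared with simply documenting the conservative behavior is a design question for the thread.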
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
* Re: Proposal: Recent mutated table tracking in memory
From: Nadav Shatz @ 2026-02-06 11:29 UTC (permalink / raw)
To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]
Hi Tatsuo,
Thank you for all the great comments and questions! I took all of them
into consideration, either adding support/tests or detailing the
limitations in the docs.
Let me know what you think of the latest patch attached here.
--
Nadav Shatz
Tailor Brands | CTO
Attachments:
[application/octet-stream] table_track.patch (95.8K, 3-table_track.patch)
From 403c0b03e050dd3a98280d4908f0c82735bb1945 Mon Sep 17 00:00:00 2001
From: Nadav Shatz <[email protected]>
Date: Tue, 6 Jan 2026 12:41:50 +0200
Subject: [PATCH] Feature: add in-memory table tracking to prevent stale reads
from replicas
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Implement "track table mutation" feature that tracks recently-written database
tables in shared memory to prevent stale reads during replication lag.
When a write (INSERT/UPDATE/DELETE/TRUNCATE/MERGE) occurs on a table, that table
is marked as "stale" for a configurable TTL period. Any SELECT on a stale
table within the TTL window is routed to primary instead of replica.
Key features:
- Shared memory hash table for tracking table mutations with TTL
- Query parse cache with LRU eviction for performance
- Cold start protection (routes all queries to primary initially)
- Automatic TTL calculation: replication_delay × configurable factor
- Per-table staleness tracking with microsecond precision
- WITH clause (CTE) support for data-modifying CTEs
- Watchdog integration: global cold start on leader change
Tracked statement types:
- INSERT, UPDATE, DELETE (including RETURNING)
- TRUNCATE (including multiple tables)
- MERGE (PostgreSQL 15+)
- WITH clauses containing INSERT/UPDATE/DELETE
New configuration parameters:
- track_table_mutation_enabled: Enable/disable the feature (default: off)
- track_table_mutation_ttl_factor: TTL multiplier for replication delay (default: 5.0)
- track_table_mutation_cold_start_duration: Cold start period in ms (default: 2000)
- track_table_mutation_table_buckets: Hash buckets for table map (default: 1024)
- track_table_mutation_table_size: Max tracked tables (default: 2048)
- track_table_mutation_query_buckets: Hash buckets for query cache (default: 2048)
- track_table_mutation_query_parse_cache_size: Max cached queries (default: 10000)
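
As a rough cross-check of the defaults listed above, the per-entry sizes quoted in the documentation hunk of this patch (about 40 bytes per tracked table and about 640 bytes per cached query) can be turned into a quick estimate. This is a back-of-the-envelope sketch: `sketch_estimate_shmem_bytes` is a hypothetical helper, and it ignores hash buckets, headers and alignment.

```c
#include <stddef.h>

/* Approximate per-entry sizes as quoted in the patch's documentation. */
#define SKETCH_TABLE_ENTRY_BYTES 40u
#define SKETCH_QUERY_ENTRY_BYTES 640u

/*
 * Hypothetical helper: entry storage only, for a given
 * track_table_mutation_table_size and
 * track_table_mutation_query_parse_cache_size.
 */
static size_t
sketch_estimate_shmem_bytes(size_t table_size, size_t query_parse_cache_size)
{
    return table_size * SKETCH_TABLE_ENTRY_BYTES
         + query_parse_cache_size * SKETCH_QUERY_ENTRY_BYTES;
}
```

For the defaults (2048 tables, 10000 cached queries) this gives 6,481,920 bytes, roughly 6.5 MB in decimal units; for 4096 tables and 50000 cached queries it gives 32,163,840 bytes, roughly 32.2 MB.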
diff --git a/doc/src/sgml/loadbalance.sgml b/doc/src/sgml/loadbalance.sgml
index ee19fabebab2210cd4abe59a711a036ac0ac8943..f2341340305b89d90d8f83f748e995f5cf8df123 100644
--- a/doc/src/sgml/loadbalance.sgml
+++ b/doc/src/sgml/loadbalance.sgml
@@ -1193,4 +1193,273 @@ dml_adaptive_object_relationship_list = 'table_1:table_2'
</variablelist>
</sect2>
+
+ <sect2 id="runtime-config-table-mutation-map">
+ <title>Track Table Mutation Configuration (Lagless Replica Reads)</title>
+
+ <para>
+ These parameters configure the track table mutation feature, which tracks recently written tables
+ to prevent stale reads from replica nodes during replication lag. This implements the
+ "lagless" architecture pattern for distributed systems with read replicas.
+ </para>
+
+ <para>
+ When a table is modified (INSERT/UPDATE/DELETE/TRUNCATE/MERGE), it is marked as "stale" for a TTL period
+ (<literal>replication_delay * track_table_mutation_ttl_factor</literal>). Any SELECT queries on stale tables are routed
+ to the primary node instead of replicas, ensuring read-after-write consistency.
+ </para>
+
+ <para>
+ This feature requires <xref linkend="guc-replication-delay-source-cmd"> to be configured
+ for monitoring replication delay from replicas.
+ </para>
+
+ <warning>
+ <para>
+ Enabling the track table mutation feature increases shared memory consumption. With default settings,
+ the feature requires approximately 6.5 MB of shared memory (about 0.1 MB for table tracking + 6.4 MB for the query parse cache).
+ Memory usage scales with configuration parameters:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ Table tracking: <literal>track_table_mutation_table_size * 40 bytes</literal> (default: 2048 * 40 = ~80 KB)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Query parse cache: <literal>track_table_mutation_query_parse_cache_size * 640 bytes</literal> (default: 10000 * 640 = ~6.4 MB)
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ For high-traffic systems with large cache sizes (e.g., <literal>track_table_mutation_query_parse_cache_size = 100000</literal>),
+ memory usage can reach 64 MB or more. Consider your system's available shared memory when enabling this feature.
+ </para>
+ </warning>
+
+ <variablelist>
+
+ <varlistentry id="guc-track-table-mutation-enabled" xreflabel="track_table_mutation_enabled">
+ <term><varname>track_table_mutation_enabled</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_enabled</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables in-memory tracking of recently written tables. When enabled, tables are marked
+ as stale after write operations, and reads are routed to primary until the TTL expires.
+ </para>
+ <para>
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ Default is <literal>off</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-ttl-factor" xreflabel="track_table_mutation_ttl_factor">
+ <term><varname>track_table_mutation_ttl_factor</varname> (<type>floating point</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_ttl_factor</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Multiplier for calculating the TTL: <literal>TTL = replication_delay * track_table_mutation_ttl_factor</literal>.
+ Higher values provide more safety margin but may reduce read replica utilization.
+ </para>
+ <para>
+ Valid range: 1.0-100.0. Default is <literal>5.0</literal>.
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-cold-start-duration" xreflabel="track_table_mutation_cold_start_duration">
+ <term><varname>track_table_mutation_cold_start_duration</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_cold_start_duration</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Duration in milliseconds to route all queries to primary after a child process starts.
+ This prevents stale reads when a new connection is established before the track table mutation
+ map is populated with recent write history.
+ </para>
+ <para>
+ When watchdog is enabled and the local node becomes the leader, Pgpool-II also triggers a
+ global cold start for this duration to avoid stale reads after leadership changes.
+ </para>
+ <para>
+ Valid range: 0-60000 ms. Default is <literal>2000</literal> (2 seconds).
+ Set to 0 to disable cold start behavior.
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-table-buckets" xreflabel="track_table_mutation_table_buckets">
+ <term><varname>track_table_mutation_table_buckets</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_table_buckets</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Number of hash buckets for the track table mutation hash table.
+ Higher values reduce hash collisions and improve lookup performance.
+ </para>
+ <para>
+ Valid range: 64-65536. Default is <literal>1024</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-table-size" xreflabel="track_table_mutation_table_size">
+ <term><varname>track_table_mutation_table_size</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_table_size</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Maximum number of tables that can be tracked simultaneously by the track table mutation map.
+ When the map is full, the oldest entries are evicted using a simple eviction strategy.
+ </para>
+ <para>
+ Valid range: 128-131072. Default is <literal>2048</literal>.
+ Memory usage: approximately 40 bytes per entry.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-query-buckets" xreflabel="track_table_mutation_query_buckets">
+ <term><varname>track_table_mutation_query_buckets</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_query_buckets</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Number of hash buckets for the query parse cache. The cache stores normalized
+ query strings mapped to their table dependencies to avoid repeated parsing.
+ </para>
+ <para>
+ Valid range: 64-65536. Default is <literal>2048</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-query-parse-cache-size" xreflabel="track_table_mutation_query_parse_cache_size">
+ <term><varname>track_table_mutation_query_parse_cache_size</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_query_parse_cache_size</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Maximum number of query parse results to cache. Uses LRU eviction when full.
+ Larger caches reduce parsing overhead but consume more shared memory.
+ </para>
+ <para>
+ Valid range: 100-1000000. Default is <literal>10000</literal>.
+ Memory usage: approximately 640 bytes per entry (~6.4 MB for default, ~64 MB for 100000 entries).
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ <sect3 id="runtime-config-track-table-mutation-example">
+ <title>Track Table Mutation Configuration Example</title>
+ <para>
+ To enable track table mutation with replication delay monitoring:
+ </para>
+ <programlisting>
+# Enable track table mutation feature
+track_table_mutation_enabled = on
+track_table_mutation_ttl_factor = 5.0
+track_table_mutation_cold_start_duration = 2000
+
+# Configure external replication delay monitoring
+replication_delay_source_cmd = '/path/to/get-replication-delay.sh'
+replication_delay_source_timeout = 10
+
+# Adjust cache sizes based on workload (increases memory usage)
+track_table_mutation_table_size = 4096 # Track up to 4096 tables (~160 KB)
+track_table_mutation_query_parse_cache_size = 50000 # Cache 50k queries (~32 MB)
+ </programlisting>
+ <para>
+ Total shared memory required for the above configuration: approximately 32.2 MB (32 MB query parse cache + 0.2 MB table map), plus overhead.
+ The default configuration (10000 query parse cache entries, 2048 tables) requires approximately 6.5 MB.
+ </para>
+ </sect3>
+
+ <sect3 id="runtime-config-track-table-mutation-limitations">
+ <title>Limitations</title>
+ <para>
+ The track table mutation feature has the following limitation:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>PREPARE</literal> statements are not tracked. When a prepared statement
+ containing data modification is executed, the table mutation is not recorded.
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ If your application uses prepared statements and requires read-after-write consistency,
+ consider using explicit transaction routing or the <literal>/*NO LOAD BALANCE*/</literal>
+ comment directive for affected queries.
+ </para>
+ <para>
+ The following statement types <emphasis>are</emphasis> tracked and will mark tables as stale:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>INSERT</literal>, <literal>UPDATE</literal>, <literal>DELETE</literal>
+ statements (including those with <literal>RETURNING</literal> clauses).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>TRUNCATE</literal> statements (including multiple tables).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>MERGE</literal> statements (PostgreSQL 15+).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>WITH</literal> clauses containing data modifications (Common Table Expressions
+ with <literal>INSERT</literal>, <literal>UPDATE</literal>, or <literal>DELETE</literal>).
+ For example, <literal>WITH deleted AS (DELETE FROM t1 RETURNING *) SELECT * FROM deleted</literal>
+ will properly mark table <literal>t1</literal> as stale.
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ <emphasis>Transaction Rollback Behavior:</emphasis> Tables are marked as stale when the
+ <literal>INSERT</literal>, <literal>UPDATE</literal>, or <literal>DELETE</literal> command
+ completes, regardless of whether the enclosing transaction is committed or rolled back.
+ This means that if a transaction is rolled back, the table remains marked as stale until
+ the TTL expires, even though no actual data modification occurred. This is by design:
+ the feature errs on the side of caution by routing more queries to the primary rather
+ than risking stale reads. The performance impact of this conservative approach is minimal
+ compared to the safety benefit of avoiding stale reads.
+ </para>
+ </sect3>
+
+ </sect2>
+
</sect1>
diff --git a/src/Makefile.am b/src/Makefile.am
index 4678ab53055e828a37b6477801640aff17ff84a7..39588af58deba045dffc01ae932115b8a9dbfcf2 100644
--- a/src/Makefile.am
+++ b/src/Makefile.am
@@ -35,6 +35,7 @@ pgpool_SOURCES = main/main.c \
rewrite/pool_timestamp.c \
rewrite/pool_lobj.c \
utils/pool_select_walker.c \
+ utils/pool_track_table_mutation.c \
utils/strlcpy.c \
utils/psprintf.c \
utils/pool_params.c \
diff --git a/src/config/pool_config_variables.c b/src/config/pool_config_variables.c
index 68abb7f41cb96d856c824a148842748bfb7a4d12..7b488269aed27cbd629bed42c5db89f27c173f9f 100644
--- a/src/config/pool_config_variables.c
+++ b/src/config/pool_config_variables.c
@@ -783,6 +783,16 @@ static struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"track_table_mutation_enabled", CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "Enable in-memory tracking of recently written tables to avoid stale reads from replicas",
+ CONFIG_VAR_TYPE_BOOL, false, 0
+ },
+ &g_pool_config.track_table_mutation_enabled,
+ false,
+ NULL, NULL, NULL
+ },
+
{
{"auto_failback", CFGCXT_RELOAD, FAILOVER_CONFIG,
"Enables nodes automatically reattach, when detached node continue streaming replication.",
@@ -1757,6 +1767,17 @@ static struct config_int_array ConfigureNamesIntArray[] =
static struct config_double ConfigureNamesDouble[] =
{
+ {
+ {"track_table_mutation_ttl_factor", CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "TTL multiplier for track table mutation (TTL = replication_delay * factor)",
+ CONFIG_VAR_TYPE_DOUBLE, false, 0
+ },
+ &g_pool_config.track_table_mutation_ttl_factor,
+ 5.0, /* boot value: 5x replication delay */
+ 1.0, 100.0, /* min, max */
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
EMPTY_CONFIG_DOUBLE
};
@@ -2355,6 +2376,61 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"track_table_mutation_cold_start_duration", CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "Duration in milliseconds to force queries to primary after child process starts.",
+ CONFIG_VAR_TYPE_INT, false, GUC_UNIT_MS
+ },
+ &g_pool_config.track_table_mutation_cold_start_duration,
+ 2000, /* 2 seconds */
+ 0, 60000, /* 0 to 60 seconds */
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_table_buckets", CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Number of hash buckets for track table mutation.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_table_buckets,
+ 1024,
+ 64, 65536,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_table_size", CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Maximum number of entries in track table mutation.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_table_size,
+ 2048,
+ 128, 131072,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_query_buckets", CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Number of hash buckets for query parse cache.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_query_buckets,
+ 2048,
+ 64, 65536,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_query_parse_cache_size", CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Maximum number of entries in query parse cache.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_query_parse_cache_size,
+ 10000,
+ 100, 1000000,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
EMPTY_CONFIG_INT
};
diff --git a/src/context/pool_query_context.c b/src/context/pool_query_context.c
index 7cf9813eb7d58678bc86a0aaa38bd3c6445b6687..5c1a2d36d810d5b9959a12605138e1149231c8ab 100644
--- a/src/context/pool_query_context.c
+++ b/src/context/pool_query_context.c
@@ -29,6 +29,7 @@
#include "utils/statistics.h"
#include "utils/pool_select_walker.h"
#include "utils/pool_stream.h"
+#include "utils/pool_track_table_mutation.h"
#include "context/pool_session_context.h"
#include "context/pool_query_context.h"
#include "parser/nodes.h"
@@ -2010,6 +2011,19 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
if (dest == POOL_PRIMARY)
{
pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
+
+ /*
+ * Resolve table and database OIDs now to populate relcache.
+ * This avoids potential hangs in CommandComplete where we shouldn't
+ * be running new queries against the backend.
+ */
+ if (pool_config->track_table_mutation_enabled &&
+ (IsA(node, InsertStmt) || IsA(node, UpdateStmt) || IsA(node, DeleteStmt)))
+ {
+ int *oids;
+ pool_extract_table_oids(node, &oids);
+ pool_track_table_mutation_get_database_oid();
+ }
}
/* Should be sent to both primary and standby? */
else if (dest == POOL_BOTH)
@@ -2139,6 +2153,107 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
{
pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
}
+ /*
+ * Check track table mutation for recently written tables.
+ * If in cold start or any table was recently written,
+ * route to primary to avoid stale reads.
+ */
+ else if (pool_config->track_table_mutation_enabled)
+ {
+ bool force_primary = false;
+
+ /* During cold start, route everything to primary */
+ if (pool_track_table_mutation_in_cold_start())
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load balance because of track table mutation cold start"),
+ errdetail("destination = PRIMARY for query= \"%s\"", query)));
+ force_primary = true;
+ }
+ else
+ {
+ /* Extract table oids and check if any are stale */
+ SelectContext ctx;
+ int dboid;
+ int num_oids;
+ int i;
+
+ memset(&ctx, 0, sizeof(ctx));
+ num_oids = pool_extract_table_oids_from_select_stmt(node, &ctx);
+ if (num_oids > 0)
+ {
+ dboid = pool_track_table_mutation_get_database_oid();
+
+ if (dboid <= 0)
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load balance because database oid was unavailable"),
+ errdetail("destination = PRIMARY for query= \"%s\"", query)));
+ force_primary = true;
+ }
+ else
+ {
+ for (i = 0; i < num_oids; i++)
+ {
+ if (pool_track_table_mutation_table_is_stale(ctx.table_oids[i], dboid))
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load balance because table \"%s\" was recently written",
+ ctx.table_names[i]),
+ errdetail("destination = PRIMARY for query= \"%s\"", query)));
+ force_primary = true;
+ break;
+ }
+ }
+ }
+ }
+ }
+
+ if (force_primary)
+ {
+ pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
+ }
+ else
+ {
+ /* Proceed with load balancing */
+ if (pool_config->statement_level_load_balance)
+ {
+ session_context->load_balance_node_id = select_load_balancing_node();
+ }
+
+ /*
+ * If the streaming replication delay is too large and
+ * prefer_lower_delay_standby is true, elect the least-delayed
+ * standby as the new load balance node; otherwise send the
+ * query to the primary.
+ */
+ if (STREAM && check_replication_delay(session_context->load_balance_node_id))
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load balance because of too much replication delay"),
+ errdetail("destination = %d for query= \"%s\"", dest, query)));
+
+ if (pool_config->prefer_lower_delay_standby)
+ {
+ int new_load_balancing_node = select_load_balancing_node();
+
+ session_context->load_balance_node_id = new_load_balancing_node;
+ session_context->query_context->load_balance_node_id = session_context->load_balance_node_id;
+ pool_set_node_to_be_sent(query_context, session_context->query_context->load_balance_node_id);
+ }
+ else
+ {
+ pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
+ }
+ }
+ else
+ {
+ session_context->query_context->load_balance_node_id = session_context->load_balance_node_id;
+ pool_set_node_to_be_sent(query_context,
+ session_context->query_context->load_balance_node_id);
+ }
+ }
+ }
else
{
if (pool_config->statement_level_load_balance)
diff --git a/src/include/pool.h b/src/include/pool.h
index ea6f87e120af866b8ed3a15790d9d8a8e009fe91..7168c1aea877856b5978de332ad636325eb9c30c 100644
--- a/src/include/pool.h
+++ b/src/include/pool.h
@@ -424,7 +424,7 @@ typedef enum
#define Min(x, y) ((x) < (y) ? (x) : (y))
-#define MAX_NUM_SEMAPHORES 8
+#define MAX_NUM_SEMAPHORES 10
#define CONN_COUNTER_SEM 0
#define REQUEST_INFO_SEM 1
#define QUERY_CACHE_STATS_SEM 2
@@ -434,6 +434,8 @@ typedef enum
#define FOLLOW_PRIMARY_SEM 6
#define MAIN_EXIT_HANDLER_SEM 7 /* used in exit_hander in pgpool main
* process */
+#define TRACK_TABLE_MUTATION_TABLE_SEM 8
+#define TRACK_TABLE_MUTATION_QUERY_SEM 9
#define MAX_REQUEST_QUEUE_SIZE 10
#define MAX_SEC_WAIT_FOR_CLUSTER_TRANSACTION 10 /* time in seconds to keep
diff --git a/src/include/pool_config.h b/src/include/pool_config.h
index 741de6cc5fc3368f813d6b6efa68eb7f8a79506b..d310922f1a932cd342f71f574b8b2db08179957c 100644
--- a/src/include/pool_config.h
+++ b/src/include/pool_config.h
@@ -365,6 +365,16 @@ typedef struct
* replication check */
char *replication_delay_source_cmd; /* external command for replication delay */
int replication_delay_source_timeout; /* timeout for external command in seconds */
+
+ /* Track table mutation configuration for tracking recently written tables */
+ bool track_table_mutation_enabled; /* Enable in-memory table tracking */
+ double track_table_mutation_ttl_factor; /* TTL multiplier for replication delay */
+ int track_table_mutation_cold_start_duration; /* Cold start duration in ms */
+ int track_table_mutation_table_buckets; /* Number of hash buckets for table map */
+ int track_table_mutation_table_size; /* Max entries in table map */
+ int track_table_mutation_query_buckets; /* Number of hash buckets for query cache */
+ int track_table_mutation_query_parse_cache_size; /* Max entries in query parse cache */
+
char *failover_command; /* execute command when failover happens */
char *follow_primary_command; /* execute command when failover is
* ended */
diff --git a/src/include/utils/pool_track_table_mutation.h b/src/include/utils/pool_track_table_mutation.h
new file mode 100644
index 0000000000000000000000000000000000000000..5cd5d4ef409645fe77e3bb02239e140456de0554
--- /dev/null
+++ b/src/include/utils/pool_track_table_mutation.h
@@ -0,0 +1,237 @@
+/* -*-pgsql-c-*- */
+/*
+ * pgpool: a language independent connection pool server for PostgreSQL
+ * written by Tatsuo Ishii
+ *
+ * Copyright (c) 2003-2026 PgPool Global Development Group
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose and without fee is hereby
+ * granted, provided that the above copyright notice appear in all
+ * copies and that both that copyright notice and this permission
+ * notice appear in supporting documentation, and that the name of the
+ * author not be used in advertising or publicity pertaining to
+ * distribution of the software without specific, written prior
+ * permission. The author makes no representations about the
+ * suitability of this software for any purpose. It is provided "as
+ * is" without express or implied warranty.
+ *
+ * pool_track_table_mutation.h: In-memory tracking of recently written tables
+ * to avoid stale reads from replicas during replication lag
+ */
+
+#ifndef POOL_TRACK_TABLE_MUTATION_H
+#define POOL_TRACK_TABLE_MUTATION_H
+
+#include "pool.h"
+#include <sys/time.h>
+
+/*
+ * Maximum table name length including schema: "schema"."table"
+ * NAMEDATALEN * 2 + 4: two names (NAMEDATALEN - 1 chars each), four quotes, a dot and NUL
+ */
+#define TRACK_TABLE_MUTATION_TABLE_NAME_LEN (NAMEDATALEN * 2 + 4)
+
+/*
+ * Maximum number of tables we track per query
+ */
+#define TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY 8
+
+/*
+ * Invalid index marker for linked lists
+ */
+#define TRACK_TABLE_MUTATION_INVALID_INDEX (-1)
+
+/*
+ * Default TTL in microseconds (100ms) used when replication delay is unknown
+ */
+#define TRACK_TABLE_MUTATION_DEFAULT_TTL_US (100 * 1000)
+
+/*
+ * Entry in the table mutation hash table (keyed by table/database oids)
+ */
+typedef struct TrackTableMutationEntry
+{
+ int table_oid; /* Table oid */
+ int dboid; /* Database oid */
+ struct timeval last_write_time; /* When the table was last written */
+ uint32 hash; /* Pre-computed hash value */
+ int next; /* Next entry in collision chain (-1 if none) */
+ bool in_use; /* Is this entry in use? */
+} TrackTableMutationEntry;
+
+/*
+ * Header for the table mutation hash table in shared memory
+ */
+typedef struct TrackTableMutationHashTable
+{
+ int num_buckets; /* Number of hash buckets */
+ int max_entries; /* Maximum entries allowed */
+ int num_entries; /* Current number of entries */
+ int free_list_head; /* Head of free entry list */
+ /* Flexible array members follow in shared memory:
+ * int buckets[num_buckets];
+ * TrackTableMutationEntry entries[max_entries];
+ */
+} TrackTableMutationHashTable;
+
+/*
+ * Entry in the query parse cache
+ */
+typedef struct QueryParseEntry
+{
+ uint64 query_hash; /* Hash of normalized query */
+ bool is_write; /* True if INSERT/UPDATE/DELETE */
+ int num_tables; /* Number of tables in query */
+ char table_names[TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY][TRACK_TABLE_MUTATION_TABLE_NAME_LEN];
+ int next; /* Next entry in collision chain */
+ int lru_prev; /* Previous in LRU list */
+ int lru_next; /* Next in LRU list */
+ bool in_use; /* Is this entry in use? */
+} QueryParseEntry;
+
+/*
+ * Header for the query parse cache in shared memory
+ */
+typedef struct QueryParseCache
+{
+ int num_buckets; /* Number of hash buckets */
+ int max_entries; /* Maximum entries allowed */
+ int num_entries; /* Current number of entries */
+ int free_list_head; /* Head of free entry list */
+ int lru_head; /* Most recently used */
+ int lru_tail; /* Least recently used */
+ /* Flexible array members follow in shared memory:
+ * int buckets[num_buckets];
+ * QueryParseEntry entries[max_entries];
+ */
+} QueryParseCache;
+
+/*
+ * Global state for track table mutation feature
+ */
+typedef struct TrackTableMutationState
+{
+ bool initialized; /* Has shared memory been initialized? */
+ uint64 current_ttl_us; /* Current TTL in microseconds */
+ struct timeval ttl_last_updated; /* When TTL was last updated */
+ struct timeval last_cleanup_time; /* When last expired cleanup ran */
+ struct timeval global_cold_start_until; /* Global cold start end time */
+ volatile uint32 stats_queries_checked; /* Number of queries checked */
+ volatile uint32 stats_forced_primary; /* Queries forced to primary */
+ volatile uint32 stats_allowed_replica; /* Queries allowed to replica */
+} TrackTableMutationState;
+
+/*
+ * Main shared memory structure containing all components
+ */
+typedef struct TrackTableMutationShmem
+{
+ TrackTableMutationState state;
+ TrackTableMutationHashTable *table_map;
+ QueryParseCache *query_cache;
+} TrackTableMutationShmem;
+
+/* ----------------
+ * Public API functions
+ * ----------------
+ */
+
+/*
+ * Initialize shared memory structures for track table mutation.
+ * Called from pgpool_main.c after pool_init_pool_info().
+ */
+extern void pool_track_table_mutation_init(void);
+
+/*
+ * Initialize per-child process state for track table mutation.
+ * Called from child.c when a new child process starts.
+ * Sets up cold start tracking.
+ */
+extern void pool_track_table_mutation_child_init(void);
+
+/*
+ * Check if the child process is in cold start period.
+ * During cold start, all queries are routed to primary.
+ * Returns true if in cold start, false otherwise.
+ */
+extern bool pool_track_table_mutation_in_cold_start(void);
+
+/*
+ * Trigger a global cold start period for all processes.
+ * Used after watchdog leader change to avoid stale reads.
+ */
+extern void pool_track_table_mutation_trigger_global_cold_start(void);
+
+/*
+ * Get oid of current database.
+ */
+extern int pool_track_table_mutation_get_database_oid(void);
+
+/*
+ * Check if a table was recently written to (is "stale").
+ * If stale, reads from this table should go to primary.
+ * Returns true if table is stale (recently written), false otherwise.
+ */
+extern bool pool_track_table_mutation_table_is_stale(int table_oid, int dboid);
+
+/*
+ * Mark tables as recently written.
+ * Called after a write (INSERT/UPDATE/DELETE/TRUNCATE/MERGE) completes.
+ * table_oids: array of table oids
+ * num_tables: number of tables in array
+ * dboid: database oid
+ */
+extern void pool_track_table_mutation_mark_tables_written(const int *table_oids, int num_tables, int dboid);
+
+/*
+ * Convenience function to mark a single table as written.
+ * table_oid: table oid
+ * dboid: database oid
+ */
+extern void pool_track_table_mutation_mark_table_written(int table_oid, int dboid);
+
+/*
+ * Update the TTL based on current replication delay.
+ * Called from pool_worker_child.c when replication delay is updated.
+ * delay_us: replication delay in microseconds
+ */
+extern void pool_track_table_mutation_update_ttl(uint64 delay_us);
+
+/*
+ * Look up cached parse result for a query.
+ * hash: hash of normalized query
+ * is_write: output - true if query is a write
+ * table_names: output - array to fill with table names
+ * num_tables: output - number of tables found
+ * Returns true if found in cache, false otherwise.
+ */
+extern bool pool_track_table_mutation_get_cached_parse(uint64 hash, bool *is_write,
+ char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
+ int *num_tables);
+
+/*
+ * Cache a parse result for a query.
+ * hash: hash of normalized query
+ * is_write: true if query is a write
+ * table_names: array of table names
+ * num_tables: number of tables
+ */
+extern void pool_track_table_mutation_cache_parse(uint64 hash, bool is_write,
+ const char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
+ int num_tables);
+
+/*
+ * Normalize a query and compute its hash.
+ * Strips comments, normalizes whitespace and literals.
+ * query: input SQL query string
+ * Returns: 64-bit hash of normalized query
+ */
+extern uint64 pool_track_table_mutation_normalize_and_hash(const char *query);
+
+/*
+ * Calculate required shared memory size for track table mutation.
+ */
+extern Size pool_track_table_mutation_shmem_size(void);
+
+#endif /* POOL_TRACK_TABLE_MUTATION_H */
diff --git a/src/main/pgpool_main.c b/src/main/pgpool_main.c
index fa05e15e7ac435e072298063f918c70aa4e5680c..302af64d0512a2f5dae95c7c361c3f18ba70b036 100644
--- a/src/main/pgpool_main.c
+++ b/src/main/pgpool_main.c
@@ -57,6 +57,7 @@
#include "auth/pool_passwd.h"
#include "auth/pool_hba.h"
#include "query_cache/pool_memqcache.h"
+#include "utils/pool_track_table_mutation.h"
#include "watchdog/wd_internal_commands.h"
#include "watchdog/wd_lifecheck.h"
#include "watchdog/watchdog.h"
@@ -1485,11 +1486,14 @@ sigusr1_interrupt_processor(void)
if (user1SignalSlot->signalFlags[SIG_WATCHDOG_STATE_CHANGED])
{
+ WD_STATES wd_state;
+
ereport(LOG,
(errmsg("Pgpool-II parent process received watchdog state change signal from watchdog")));
user1SignalSlot->signalFlags[SIG_WATCHDOG_STATE_CHANGED] = false;
- if (wd_internal_get_watchdog_local_node_state() == WD_STANDBY)
+ wd_state = wd_internal_get_watchdog_local_node_state();
+ if (wd_state == WD_STANDBY)
{
ereport(LOG,
(errmsg("we have joined the watchdog cluster as STANDBY node"),
@@ -1503,6 +1507,10 @@ sigusr1_interrupt_processor(void)
*/
pool_release_follow_primary_lock(true);
}
+ else if (wd_state == WD_COORDINATOR && pool_config->track_table_mutation_enabled)
+ {
+ pool_track_table_mutation_trigger_global_cold_start();
+ }
}
if (user1SignalSlot->signalFlags[SIG_FAILOVER_INTERRUPT])
{
@@ -3068,6 +3076,12 @@ initialize_shared_mem_objects(bool clear_memcache_oidmaps)
elog(DEBUG1, "watchdog: %zu bytes requested for shared memory", MAXALIGN(wd_ipc_get_shared_mem_size()));
}
+ if (pool_config->track_table_mutation_enabled)
+ {
+ size += MAXALIGN(pool_track_table_mutation_shmem_size());
+ elog(DEBUG1, "track_table_mutation: %zu bytes requested for shared memory", MAXALIGN(pool_track_table_mutation_shmem_size()));
+ }
+
initialize_shared_memory_main_segment(size);
/* Move the backend descriptors to shared memory */
@@ -3184,6 +3198,12 @@ initialize_shared_mem_objects(bool clear_memcache_oidmaps)
wd_ipc_initialize_data();
}
+ /* Initialize track table mutation for tracking recently written tables */
+ if (pool_config->track_table_mutation_enabled)
+ {
+ pool_track_table_mutation_init();
+ }
+
}
/*
diff --git a/src/protocol/CommandComplete.c b/src/protocol/CommandComplete.c
index a3b8f0ea194ffecc79e58566be80562a46eb75ab..8fee9381ff456272c3920aa2fcd18789bc766d70 100644
--- a/src/protocol/CommandComplete.c
+++ b/src/protocol/CommandComplete.c
@@ -38,6 +38,8 @@
#include "utils/palloc.h"
#include "utils/memutils.h"
#include "utils/pool_stream.h"
+#include "utils/pool_track_table_mutation.h"
+#include "utils/pool_select_walker.h"
static int extract_ntuples(char *message);
static POOL_STATUS handle_mismatch_tuples(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend, char *packet, int packetlen, bool command_complete);
@@ -46,6 +48,7 @@ static int forward_empty_query(POOL_CONNECTION *frontend, char *packet, int pack
static int forward_packet_to_frontend(POOL_CONNECTION *frontend, char kind, char *packet, int packetlen);
static void process_clear_cache(POOL_CONNECTION_POOL *backend);
static bool check_alter_role_statement(AlterRoleStmt *stmt);
+static void track_cte_mutations(WithClause *withClause, int dboid);
POOL_STATUS
CommandComplete(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend, bool command_complete)
@@ -304,6 +307,113 @@ handle_query_context(POOL_CONNECTION_POOL *backend)
node = session_context->query_context->parse_tree;
+ /*
+ * Track table writes for the track table mutation feature.
+ * Mark tables as written when INSERT/UPDATE/DELETE/TRUNCATE/MERGE, or DML in a WITH clause, completes.
+ */
+ if (pool_config->track_table_mutation_enabled && node != NULL)
+ {
+ char *table_name = NULL;
+ int table_oid = 0;
+ int dboid = 0;
+
+ if (IsA(node, InsertStmt))
+ {
+ InsertStmt *stmt = (InsertStmt *) node;
+ if (stmt->relation != NULL)
+ table_name = make_table_name_from_rangevar(stmt->relation);
+ /* Track CTEs with data modifications in WITH clause */
+ if (stmt->withClause != NULL)
+ {
+ dboid = pool_track_table_mutation_get_database_oid();
+ track_cte_mutations(stmt->withClause, dboid);
+ }
+ }
+ else if (IsA(node, UpdateStmt))
+ {
+ UpdateStmt *stmt = (UpdateStmt *) node;
+ if (stmt->relation != NULL)
+ table_name = make_table_name_from_rangevar(stmt->relation);
+ /* Track CTEs with data modifications in WITH clause */
+ if (stmt->withClause != NULL)
+ {
+ dboid = pool_track_table_mutation_get_database_oid();
+ track_cte_mutations(stmt->withClause, dboid);
+ }
+ }
+ else if (IsA(node, DeleteStmt))
+ {
+ DeleteStmt *stmt = (DeleteStmt *) node;
+ if (stmt->relation != NULL)
+ table_name = make_table_name_from_rangevar(stmt->relation);
+ /* Track CTEs with data modifications in WITH clause */
+ if (stmt->withClause != NULL)
+ {
+ dboid = pool_track_table_mutation_get_database_oid();
+ track_cte_mutations(stmt->withClause, dboid);
+ }
+ }
+ else if (IsA(node, SelectStmt))
+ {
+ /* SELECT itself doesn't modify tables, but WITH clause might */
+ SelectStmt *stmt = (SelectStmt *) node;
+ if (stmt->withClause != NULL)
+ {
+ dboid = pool_track_table_mutation_get_database_oid();
+ track_cte_mutations(stmt->withClause, dboid);
+ }
+ }
+ else if (IsA(node, TruncateStmt))
+ {
+ /* TRUNCATE can affect multiple tables */
+ TruncateStmt *stmt = (TruncateStmt *) node;
+ ListCell *cell;
+
+ dboid = pool_track_table_mutation_get_database_oid();
+ if (dboid > 0)
+ {
+ foreach(cell, stmt->relations)
+ {
+ RangeVar *rv = (RangeVar *) lfirst(cell);
+ if (rv != NULL)
+ {
+ table_name = make_table_name_from_rangevar(rv);
+ if (table_name != NULL)
+ {
+ table_oid = pool_table_name_to_oid(table_name);
+ if (table_oid > 0)
+ {
+ pool_track_table_mutation_mark_table_written(table_oid, dboid);
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: marked table \"%s\" as written (TRUNCATE)", table_name)));
+ }
+ }
+ }
+ }
+ }
+ /* Already handled all tables, skip the common path */
+ table_name = NULL;
+ }
+ else if (IsA(node, MergeStmt))
+ {
+ MergeStmt *stmt = (MergeStmt *) node;
+ if (stmt->relation != NULL)
+ table_name = make_table_name_from_rangevar(stmt->relation);
+ }
+
+ if (table_name != NULL)
+ {
+ table_oid = pool_table_name_to_oid(table_name);
+ dboid = pool_track_table_mutation_get_database_oid();
+ if (table_oid > 0 && dboid > 0)
+ {
+ pool_track_table_mutation_mark_table_written(table_oid, dboid);
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: marked table \"%s\" as written", table_name)));
+ }
+ }
+ }
+
if (IsA(node, PrepareStmt))
{
if (session_context->uncompleted_message)
@@ -531,6 +641,85 @@ check_alter_role_statement(AlterRoleStmt *stmt)
return false;
}
+/*
+ * Track table mutations in WITH clause CTEs.
+ * Iterates through CTEs and marks tables as written for any DML operations.
+ */
+static void
+track_cte_mutations(WithClause *withClause, int dboid)
+{
+ ListCell *cell;
+ char *table_name;
+ int table_oid;
+
+ if (withClause == NULL || dboid <= 0)
+ return;
+
+ foreach(cell, withClause->ctes)
+ {
+ CommonTableExpr *cte = (CommonTableExpr *) lfirst(cell);
+
+ if (cte == NULL || cte->ctequery == NULL)
+ continue;
+
+ /* Check what type of statement the CTE contains */
+ if (IsA(cte->ctequery, InsertStmt))
+ {
+ InsertStmt *stmt = (InsertStmt *) cte->ctequery;
+ if (stmt->relation != NULL)
+ {
+ table_name = make_table_name_from_rangevar(stmt->relation);
+ if (table_name != NULL)
+ {
+ table_oid = pool_table_name_to_oid(table_name);
+ if (table_oid > 0)
+ {
+ pool_track_table_mutation_mark_table_written(table_oid, dboid);
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: marked table \"%s\" as written (CTE INSERT)", table_name)));
+ }
+ }
+ }
+ }
+ else if (IsA(cte->ctequery, UpdateStmt))
+ {
+ UpdateStmt *stmt = (UpdateStmt *) cte->ctequery;
+ if (stmt->relation != NULL)
+ {
+ table_name = make_table_name_from_rangevar(stmt->relation);
+ if (table_name != NULL)
+ {
+ table_oid = pool_table_name_to_oid(table_name);
+ if (table_oid > 0)
+ {
+ pool_track_table_mutation_mark_table_written(table_oid, dboid);
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: marked table \"%s\" as written (CTE UPDATE)", table_name)));
+ }
+ }
+ }
+ }
+ else if (IsA(cte->ctequery, DeleteStmt))
+ {
+ DeleteStmt *stmt = (DeleteStmt *) cte->ctequery;
+ if (stmt->relation != NULL)
+ {
+ table_name = make_table_name_from_rangevar(stmt->relation);
+ if (table_name != NULL)
+ {
+ table_oid = pool_table_name_to_oid(table_name);
+ if (table_oid > 0)
+ {
+ pool_track_table_mutation_mark_table_written(table_oid, dboid);
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: marked table \"%s\" as written (CTE DELETE)", table_name)));
+ }
+ }
+ }
+ }
+ }
+}
+
/*
* Extract the number of tuples from CommandComplete message
*/
diff --git a/src/protocol/child.c b/src/protocol/child.c
index c34f057281be62feaf39db1bb605062f56dc398c..c83fb11cc26ed05c8ab2057f5e7f167fc038a1fb 100644
--- a/src/protocol/child.c
+++ b/src/protocol/child.c
@@ -57,6 +57,7 @@
#include "utils/elog.h"
#include "utils/ps_status.h"
#include "utils/timestamp.h"
+#include "utils/pool_track_table_mutation.h"
#include "context/pool_process_context.h"
#include "context/pool_session_context.h"
@@ -213,6 +214,12 @@ do_child(int *fds)
/* Initialize per process context */
pool_init_process_context();
+ /* Initialize track table mutation child state for cold start tracking */
+ if (pool_config->track_table_mutation_enabled)
+ {
+ pool_track_table_mutation_child_init();
+ }
+
/* initialize connection pool */
if (pool_init_cp())
{
diff --git a/src/sample/pgpool.conf.sample-stream b/src/sample/pgpool.conf.sample-stream
index 454fdb9e5d1fd65437b6a67f12ab62658ea08f49..0bd59a03d7c69c0fa27a7d9040f44151ade344d0 100644
--- a/src/sample/pgpool.conf.sample-stream
+++ b/src/sample/pgpool.conf.sample-stream
@@ -499,6 +499,51 @@ backend_clustering_mode = streaming_replication
#statement_level_load_balance = off
# Enables statement level load balancing
+# - Track Table Mutation (Lagless Read Replica) -
+ # WARNING: Enabling this feature increases shared memory usage
+ # Default settings require ~6.4 MB shared memory
+ # (0.1 MB table tracking + 6.3 MB query cache)
+
+#track_table_mutation_enabled = off
+ # Enable in-memory tracking of recently written tables
+ # to prevent stale reads from replicas during replication lag
+ # (change requires reload)
+
+#track_table_mutation_ttl_factor = 5.0
+ # TTL multiplier: TTL = replication_delay * factor
+ # Higher values provide more safety margin
+ # Range: 1.0-100.0 (default: 5.0)
+ # (change requires reload)
+
+#track_table_mutation_cold_start_duration = 2000
+ # Duration in milliseconds to route all queries to primary
+ # after child process starts (cold start period)
+ # Range: 0-60000 ms (default: 2000 ms = 2 seconds)
+ # Set to 0 to disable cold start behavior
+ # (change requires reload)
+
+#track_table_mutation_table_buckets = 1024
+ # Number of hash buckets for the table mutation map
+ # Higher values reduce hash collisions
+ # Range: 64-65536 (default: 1024)
+ # (change requires restart)
+
+#track_table_mutation_table_size = 2048
+ # Maximum number of tables to track simultaneously
+ # Range: 128-131072 (default: 2048)
+ # (change requires restart)
+
+#track_table_mutation_query_buckets = 2048
+ # Number of hash buckets for query parse cache
+ # Range: 64-65536 (default: 2048)
+ # (change requires restart)
+
+#track_table_mutation_query_parse_cache_size = 10000
+ # Maximum number of query parse results to cache
+ # Range: 100-1000000 (default: 10000)
+ # Memory usage: ~640 bytes per entry (~6.3 MB default, ~64 MB for 100000)
+ # (change requires restart)
+
#------------------------------------------------------------------------------
# STREAMING REPLICATION MODE
#------------------------------------------------------------------------------
diff --git a/src/streaming_replication/pool_worker_child.c b/src/streaming_replication/pool_worker_child.c
index 311b638658e66ebb56162ad9fa4392315b2df64e..2fc4afffdad148fbd026d979841027e455ba808b 100644
--- a/src/streaming_replication/pool_worker_child.c
+++ b/src/streaming_replication/pool_worker_child.c
@@ -58,6 +58,7 @@
#include "utils/pool_ip.h"
#include "utils/ps_status.h"
#include "utils/pool_stream.h"
+#include "utils/pool_track_table_mutation.h"
#include "context/pool_process_context.h"
#include "context/pool_session_context.h"
@@ -695,6 +696,7 @@ check_replication_time_lag_with_cmd(void)
double delay_ms;
uint64 delay;
uint64 delay_threshold_by_time;
+ uint64 max_delay_us = 0; /* Track maximum delay for table mutation map */
int token_count = 0;
int primary_node_id;
int save_errno;
@@ -1003,6 +1005,10 @@ check_replication_time_lag_with_cmd(void)
bkinfo->standby_delay = delay;
bkinfo->standby_delay_by_time = true;
+ /* Track maximum delay for table mutation map TTL calculation */
+ if (delay > max_delay_us)
+ max_delay_us = delay;
+
/*
* Log delay if necessary. threshold is in milliseconds, convert
* to microseconds.
@@ -1021,6 +1027,10 @@ check_replication_time_lag_with_cmd(void)
token = strtok_r(NULL, " \t\n", &saveptr);
}
+ /* Update track table mutation TTL based on maximum observed delay */
+ if (pool_config->track_table_mutation_enabled && max_delay_us > 0)
+ pool_track_table_mutation_update_ttl(max_delay_us);
+
}
PG_CATCH();
{
diff --git a/src/test/regression/tests/042.track_table_mutation/test.sh b/src/test/regression/tests/042.track_table_mutation/test.sh
new file mode 100755
index 0000000000000000000000000000000000000000..ce0f7e46409fa4ba13e714c6e97abb59f5aef836
--- /dev/null
+++ b/src/test/regression/tests/042.track_table_mutation/test.sh
@@ -0,0 +1,292 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for track table mutation feature (in-memory table tracking).
+# Tests routing of queries based on recently written tables.
+#
+source $TESTLIBS
+TESTDIR=testdir
+PSQL=$PGBIN/psql
+PSQLOPTS="-a -q -X"
+PGPOOLBIN=$PGPOOL_INSTALL_DIR/bin
+export PGDATABASE=test
+
+# Only run in streaming replication mode since that's the target use case
+for mode in s
+do
+ rm -fr $TESTDIR
+ mkdir $TESTDIR
+ cd $TESTDIR
+
+ # Create test environment with 2 nodes
+ echo -n "creating test environment..."
+ $PGPOOL_SETUP -m $mode -n 2 || exit 1
+ echo "done."
+
+ source ./bashrc.ports
+
+ # Configure track table mutation feature
+ echo "track_table_mutation_enabled = on" >> etc/pgpool.conf
+ echo "track_table_mutation_ttl_factor = 5.0" >> etc/pgpool.conf
+ echo "track_table_mutation_cold_start_duration = 2000" >> etc/pgpool.conf
+
+ # Enable load balancing explicitly
+ echo "load_balance_mode = on" >> etc/pgpool.conf
+
+ # Configure weights so we can distinguish routing
+ # Backend 0 (primary) weight=0, Backend 1 (standby) weight=1
+ # This means load balanced queries go to node 1 by default
+ echo "backend_weight0 = 0" >> etc/pgpool.conf
+ echo "backend_weight1 = 1" >> etc/pgpool.conf
+
+ # Enable debug logging to see routing decisions
+ echo "log_min_messages = debug1" >> etc/pgpool.conf
+
+ ./startall
+
+ export PGPORT=$PGPOOL_PORT
+
+ wait_for_pgpool_startup
+
+ # Create test tables
+ $PSQL test <<EOF
+CREATE TABLE t1(i INTEGER);
+CREATE TABLE t2(i INTEGER);
+CREATE TABLE t3(i INTEGER);
+EOF
+
+ echo "=== Test 1: Cold Start Routing ==="
+ # During cold start, all queries should go to primary
+ # Restart pgpool to trigger cold start
+ ./shutdownall
+ ./startall
+ wait_for_pgpool_startup
+
+ # Immediately query - should go to primary due to cold start
+ $PSQL test -c "SELECT 'cold_start_test' as marker, * FROM t1;" > /dev/null 2>&1
+
+ # Check log for cold start message (use -a to handle binary log files)
+ if grep -a -q "could not load balance because of track table mutation cold start" log/pgpool.log; then
+ echo "Test 1 PASSED: Cold start routing works"
+ else
+ echo "Test 1 FAILED: Cold start routing not detected"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 2: Wait for cold start to end ==="
+ # Wait for cold start period to end (2 seconds)
+ sleep 3
+
+ # Clear the log
+ > log/pgpool.log
+
+ # Now a clean table query should load balance (go to node 1)
+ $PSQL test -c "SELECT 'after_cold_start' as marker, * FROM t3;" > /dev/null 2>&1
+
+ # After cold start, queries to clean tables should load balance
+ # Check that it did NOT get forced to primary due to track table mutation
+ if grep -a -q "could not load balance because of track table mutation cold start" log/pgpool.log; then
+ echo "Test 2 FAILED: Still in cold start after waiting"
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 2 PASSED: Cold start ended correctly"
+
+ echo "=== Test 3: Write-then-Read Routing ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Write to t1 and then read - use single connection to ensure same session
+ $PSQL test <<EOF
+INSERT INTO t1 VALUES (1);
+SELECT 'write_read_test' as marker, * FROM t1;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ # Check log for table staleness message
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 3 PASSED: Write-then-read routing works"
+ else
+ echo "Test 3 FAILED: Table staleness not detected after write"
+ # Show relevant log entries for debugging
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 4: Clean Table Still Load Balances ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Read from t2 (never written to) - should load balance
+ $PSQL test -c "SELECT 'clean_table_test' as marker, * FROM t2;" > /dev/null 2>&1
+
+ # Should NOT see track table mutation blocking message for t2
+ if grep -a -q "could not load balance because table.*t2.*was recently written" log/pgpool.log; then
+ echo "Test 4 FAILED: Clean table incorrectly marked as stale"
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 4 PASSED: Clean tables still load balance"
+
+ echo "=== Test 5: UPDATE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Update t2 and then read - use single connection
+ $PSQL test <<EOF
+UPDATE t2 SET i = 999 WHERE i = 0;
+SELECT 'update_test' as marker, * FROM t2;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 5 PASSED: UPDATE marks table as stale"
+ else
+ echo "Test 5 FAILED: UPDATE did not mark table as stale"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 6: DELETE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Delete from t3 and then read - use single connection
+ $PSQL test <<EOF
+DELETE FROM t3 WHERE i = 0;
+SELECT 'delete_test' as marker, * FROM t3;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 6 PASSED: DELETE marks table as stale"
+ else
+ echo "Test 6 FAILED: DELETE did not mark table as stale"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 7: TRUNCATE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create a fresh table for TRUNCATE test
+ $PSQL test -c "CREATE TABLE t_truncate(i INTEGER);" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_truncate VALUES (1), (2), (3);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log again
+ > log/pgpool.log
+
+ # Truncate and then read - use single connection
+ $PSQL test <<EOF
+TRUNCATE t_truncate;
+SELECT 'truncate_test' as marker, * FROM t_truncate;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 7 PASSED: TRUNCATE marks table as stale"
+ else
+ echo "Test 7 FAILED: TRUNCATE did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 8: WITH Clause (CTE with DELETE) Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create a fresh table for WITH test
+ $PSQL test -c "CREATE TABLE t_cte(i INTEGER);" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_cte VALUES (1), (2), (3);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log again
+ > log/pgpool.log
+
+ # Use WITH clause with DELETE, then read from the table
+ $PSQL test <<EOF
+WITH deleted AS (DELETE FROM t_cte WHERE i = 1 RETURNING *)
+SELECT * FROM deleted;
+SELECT 'cte_test' as marker, * FROM t_cte;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 8 PASSED: WITH clause (CTE) marks table as stale"
+ else
+ echo "Test 8 FAILED: WITH clause (CTE) did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ # Test 9: MERGE (PostgreSQL 15+ only)
+ PG_MAJOR_VERSION=$($PSQL test -t -c "SELECT substring(version() from 'PostgreSQL ([0-9]+)');" | tr -d ' ')
+ if [ "$PG_MAJOR_VERSION" -ge 15 ] 2>/dev/null; then
+ echo "=== Test 9: MERGE Marks Table as Stale (PostgreSQL $PG_MAJOR_VERSION) ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create tables for MERGE test
+ $PSQL test -c "CREATE TABLE t_merge_target(id INTEGER PRIMARY KEY, val TEXT);" > /dev/null 2>&1
+ $PSQL test -c "CREATE TABLE t_merge_source(id INTEGER, val TEXT);" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_merge_target VALUES (1, 'old');" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_merge_source VALUES (1, 'new'), (2, 'insert');" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log again
+ > log/pgpool.log
+
+ # Use MERGE, then read from the target table
+ $PSQL test <<EOF
+MERGE INTO t_merge_target t
+USING t_merge_source s ON t.id = s.id
+WHEN MATCHED THEN UPDATE SET val = s.val
+WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.val);
+SELECT 'merge_test' as marker, * FROM t_merge_target;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 9 PASSED: MERGE marks table as stale"
+ else
+ echo "Test 9 FAILED: MERGE did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+ else
+ echo "=== Test 9: MERGE skipped (requires PostgreSQL 15+, have $PG_MAJOR_VERSION) ==="
+ fi
+
+ echo ""
+ echo "=== All Track Table Mutation Tests PASSED ==="
+
+ ./shutdownall
+
+ cd ..
+done
+
+exit 0
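
Tests 6 through 9 in the script above all end with the same log check and PASSED/FAILED report. A minimal sketch of how that repetition could be factored out follows. The helper names `stale_read_was_blocked` and `report_stale_check` are hypothetical and not part of the patch; only the grep pattern and the message wording are taken from the script.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the patch.

# Return 0 when the given pgpool log shows that load balancing was refused
# because a table was recently written (the same pattern Tests 6-9 grep for).
stale_read_was_blocked() {
    grep -a -q "could not load balance because table.*was recently written" "$1"
}

# Print the PASSED/FAILED line used by Tests 6-9 and return 0 or 1.
# $1 = test number, $2 = statement name (DELETE, TRUNCATE, ...), $3 = log file
report_stale_check() {
    if stale_read_was_blocked "$3"; then
        echo "Test $1 PASSED: $2 marks table as stale"
        return 0
    fi
    echo "Test $1 FAILED: $2 did not mark table as stale"
    return 1
}
```

A call site might then read `report_stale_check 7 TRUNCATE log/pgpool.log || { ./shutdownall; exit 1; }`, which keeps the cleanup-and-exit behaviour of the original tests in one place.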
diff --git a/src/test/regression/tests/043.track_table_mutation_watchdog/.gitignore b/src/test/regression/tests/043.track_table_mutation_watchdog/.gitignore
new file mode 100644
index 0000000000000000000000000000000000000000..fcb93d27a7e7e8a5efe6eacfb0f88f6f3c8bc765
--- /dev/null
+++ b/src/test/regression/tests/043.track_table_mutation_watchdog/.gitignore
@@ -0,0 +1,3 @@
+leader
+standby
+*.pid
diff --git a/src/test/regression/tests/043.track_table_mutation_watchdog/leader.conf b/src/test/regression/tests/043.track_table_mutation_watchdog/leader.conf
new file mode 100644
index 0000000000000000000000000000000000000000..e764600eebe47c750b9fd3d3d33505480d81b8f5
--- /dev/null
+++ b/src/test/regression/tests/043.track_table_mutation_watchdog/leader.conf
@@ -0,0 +1,25 @@
+# leader watchdog config for track_table_mutation watchdog test
+use_watchdog = on
+wd_interval = 1
+wd_priority = 2
+
+hostname0 = 'localhost'
+wd_port0 = 21004
+pgpool_port0 = 11000
+hostname1 = 'localhost'
+wd_port1 = 21104
+pgpool_port1 = 11100
+
+heartbeat_hostname0 = 'localhost'
+heartbeat_port0 = 21005
+heartbeat_hostname1 = 'localhost'
+heartbeat_port1 = 21105
+
+enable_consensus_with_half_votes = on
+
+# Enable track table mutation feature
+track_table_mutation_enabled = on
+track_table_mutation_cold_start_duration = 2000
+
+# Enable debug logging to see feature messages
+log_min_messages = debug1
diff --git a/src/test/regression/tests/043.track_table_mutation_watchdog/standby.conf b/src/test/regression/tests/043.track_table_mutation_watchdog/standby.conf
new file mode 100644
index 0000000000000000000000000000000000000000..32a5784cd9ae3b6a9499974a6c10a813469b781b
--- /dev/null
+++ b/src/test/regression/tests/043.track_table_mutation_watchdog/standby.conf
@@ -0,0 +1,27 @@
+# standby watchdog config for track_table_mutation watchdog test
+port = 11100
+pcp_port = 11105
+use_watchdog = on
+wd_interval = 1
+wd_priority = 1
+
+hostname0 = 'localhost'
+wd_port0 = 21004
+pgpool_port0 = 11000
+hostname1 = 'localhost'
+wd_port1 = 21104
+pgpool_port1 = 11100
+
+heartbeat_hostname0 = 'localhost'
+heartbeat_port0 = 21005
+heartbeat_hostname1 = 'localhost'
+heartbeat_port1 = 21105
+
+enable_consensus_with_half_votes = on
+
+# Enable track table mutation feature
+track_table_mutation_enabled = on
+track_table_mutation_cold_start_duration = 2000
+
+# Enable debug logging to see feature messages
+log_min_messages = debug1
diff --git a/src/test/regression/tests/043.track_table_mutation_watchdog/test.sh b/src/test/regression/tests/043.track_table_mutation_watchdog/test.sh
new file mode 100755
index 0000000000000000000000000000000000000000..752a6e6aa377fe0c54244975e606648101c98cf8
--- /dev/null
+++ b/src/test/regression/tests/043.track_table_mutation_watchdog/test.sh
@@ -0,0 +1,179 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for track table mutation global cold start on watchdog leader change.
+# Tests that when the watchdog leader changes, the new leader triggers
+# a global cold start to force all queries to primary.
+#
+source $TESTLIBS
+LEADER_DIR=leader
+STANDBY_DIR=standby
+PSQL=$PGBIN/psql
+success_count=0
+
+rm -fr $LEADER_DIR
+rm -fr $STANDBY_DIR
+
+mkdir $LEADER_DIR
+mkdir $STANDBY_DIR
+
+# change into the leader directory
+cd $LEADER_DIR
+
+# create leader environment with streaming replication
+echo -n "creating leader pgpool..."
+$PGPOOL_SETUP -m s -n 2 -p 11000 || exit 1
+echo "leader setup done."
+
+# copy the configurations to standby
+cp -r etc ../$STANDBY_DIR/
+
+source ./bashrc.ports
+cat ../leader.conf >> etc/pgpool.conf
+echo 0 > etc/pgpool_node_id
+
+./startall
+wait_for_pgpool_startup
+
+# back to test root dir
+cd ..
+
+# create standby environment
+mkdir $STANDBY_DIR/log
+echo -n "creating standby pgpool..."
+cat standby.conf >> $STANDBY_DIR/etc/pgpool.conf
+# since the standby uses the same pgpool-II conf as the leader, change the pid file path
+echo "pid_file_name = '$PWD/pgpool2.pid'" >> $STANDBY_DIR/etc/pgpool.conf
+echo 1 > $STANDBY_DIR/etc/pgpool_node_id
+# start the standby pgpool-II by hand
+$PGPOOL_INSTALL_DIR/bin/pgpool -D -n -f $STANDBY_DIR/etc/pgpool.conf -F $STANDBY_DIR/etc/pcp.conf -a $STANDBY_DIR/etc/pool_hba.conf > $STANDBY_DIR/log/pgpool.log 2>&1 &
+
+# Test 1: Check if leader pgpool-II started correctly
+echo "=== Test 1: Waiting for the pgpool leader... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep "I am the cluster leader node. Starting escalation process" $LEADER_DIR/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 1 PASSED: Leader brought up successfully."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 1 ]; then
+ echo "Test 1 FAILED: Leader did not start"
+ $PGPOOL_INSTALL_DIR/bin/pgpool -f $STANDBY_DIR/etc/pgpool.conf -m f stop 2>/dev/null
+ cd $LEADER_DIR && ./shutdownall
+ exit 1
+fi
+
+# Test 2: Check if standby has successfully joined
+echo "=== Test 2: Waiting for the standby to join cluster... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep "successfully joined the watchdog cluster as standby node" $STANDBY_DIR/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 2 PASSED: Standby successfully connected."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 2 ]; then
+ echo "Test 2 FAILED: Standby did not join cluster"
+ $PGPOOL_INSTALL_DIR/bin/pgpool -f $STANDBY_DIR/etc/pgpool.conf -m f stop 2>/dev/null
+ cd $LEADER_DIR && ./shutdownall
+ exit 1
+fi
+
+# Test 3: Verify track_table_mutation is enabled and working on leader
+echo "=== Test 3: Verify track_table_mutation is enabled ==="
+if grep -a "track_table_mutation: initialized" $LEADER_DIR/log/pgpool.log > /dev/null 2>&1; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 3 PASSED: track_table_mutation initialized on leader"
+else
+ echo "Test 3 FAILED: track_table_mutation not initialized on leader"
+ $PGPOOL_INSTALL_DIR/bin/pgpool -f $STANDBY_DIR/etc/pgpool.conf -m f stop 2>/dev/null
+ cd $LEADER_DIR && ./shutdownall
+ exit 1
+fi
+
+# Test 4: Stop leader pgpool and trigger failover
+echo "=== Test 4: Triggering leader failover... ==="
+$PGPOOL_INSTALL_DIR/bin/pgpool -f $LEADER_DIR/etc/pgpool.conf -m f stop
+
+echo "Checking if the Standby pgpool-II detected the leader shutdown..."
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep -a " is shutting down" $STANDBY_DIR/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 4 PASSED: Leader shutdown detected."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 4 ]; then
+ echo "Test 4 FAILED: Leader shutdown not detected"
+ $PGPOOL_INSTALL_DIR/bin/pgpool -f $STANDBY_DIR/etc/pgpool.conf -m f stop 2>/dev/null
+ cd $LEADER_DIR && ./shutdownall
+ exit 1
+fi
+
+# Test 5: Verify standby becomes new leader
+echo "=== Test 5: Checking if standby takes over as leader... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep -a "I am the cluster leader node. Starting escalation process" $STANDBY_DIR/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 5 PASSED: Standby became the new leader."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 5 ]; then
+ echo "Test 5 FAILED: Standby did not become leader"
+ $PGPOOL_INSTALL_DIR/bin/pgpool -f $STANDBY_DIR/etc/pgpool.conf -m f stop 2>/dev/null
+ cd $LEADER_DIR && ./shutdownall
+ exit 1
+fi
+
+# Test 6: Verify global cold start was triggered on new leader
+echo "=== Test 6: Checking if global cold start was triggered... ==="
+# The new leader should trigger global cold start when it becomes coordinator
+# Look for the log message that indicates global cold start was triggered
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep -a "track_table_mutation: entering global cold start" $STANDBY_DIR/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 6 PASSED: Global cold start triggered on new leader."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+# Cleanup
+$PGPOOL_INSTALL_DIR/bin/pgpool -f $STANDBY_DIR/etc/pgpool.conf -m f stop 2>/dev/null
+cd $LEADER_DIR
+./shutdownall
+
+echo ""
+echo "$success_count out of 6 successful"
+
+if test $success_count -eq 6
+then
+ echo "=== All Track Table Mutation Watchdog Tests PASSED ==="
+ exit 0
+fi
+
+exit 1
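
The watchdog test above repeats the same ten-attempt polling loop for Tests 1, 2, 4, 5 and 6. A minimal sketch of a shared polling helper follows. The name `wait_for_log_message` and its argument order are assumptions, not part of the patch. Unlike the original loops it matches the pattern as a fixed string (`grep -F`), so the `.` in the leader message is not treated as a regular-expression wildcard.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the patch.

# Poll a log file until a fixed-string pattern appears.
# $1 = pattern, $2 = log file,
# $3 = max attempts (default 10), $4 = seconds between attempts (default 2)
wait_for_log_message() {
    local pattern=$1
    local logfile=$2
    local attempts=${3:-10}
    local interval=${4:-2}
    local i=1

    while [ "$i" -le "$attempts" ]; do
        if grep -a -F -q -- "$pattern" "$logfile" 2>/dev/null; then
            return 0
        fi
        echo "[check] $i times"
        sleep "$interval"
        i=$(( i + 1 ))
    done
    return 1
}
```

With such a helper, each test could reduce to one call followed by the existing `success_count` increment on success, or the existing cleanup and `exit 1` on failure.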
diff --git a/src/utils/pool_track_table_mutation.c b/src/utils/pool_track_table_mutation.c
new file mode 100644
index 0000000000000000000000000000000000000000..d9ac5d31669d6d350c7f565e70a1f35bc6917704
--- /dev/null
+++ b/src/utils/pool_track_table_mutation.c
@@ -0,0 +1,1188 @@
+/* -*-pgsql-c-*- */
+/*
+ * pgpool: a language independent connection pool server for PostgreSQL
+ * written by Tatsuo Ishii
+ *
+ * Copyright (c) 2003-2026 PgPool Global Development Group
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose and without fee is hereby
+ * granted, provided that the above copyright notice appear in all
+ * copies and that both that copyright notice and this permission
+ * notice appear in supporting documentation, and that the name of the
+ * author not be used in advertising or publicity pertaining to
+ * distribution of the software without specific, written prior
+ * permission. The author makes no representations about the
+ * suitability of this software for any purpose. It is provided "as
+ * is" without express or implied warranty.
+ *
+ * pool_track_table_mutation.c: In-memory tracking of recently written tables
+ * to avoid stale reads from replicas during replication lag
+ *
+ * Based on the "lagless" architecture from Tailor Brands:
+ * https://medium.com/tailor-tech/using-database-read-replicas-in-distributed-systems-d80eaf6bbf8a
+ */
+
+#include <string.h>
+#include <stdlib.h>
+#include <sys/time.h>
+#include <unistd.h>
+
+#include "pool.h"
+#include "pool_config.h"
+#include "context/pool_session_context.h"
+#include "utils/pool_track_table_mutation.h"
+#include "utils/elog.h"
+#include "utils/pool_ipc.h"
+#include "utils/palloc.h"
+#include "utils/pool_relcache.h"
+
+#define DATABASE_TO_OID_QUERY "SELECT oid FROM pg_catalog.pg_database WHERE datname = '%s'"
+
+/* ----------------
+ * Local variables
+ * ----------------
+ */
+
+/* Pointer to shared memory structure */
+static TrackTableMutationShmem *track_table_mutation_shmem = NULL;
+
+/* Per-process cold start tracking (not in shared memory) */
+static struct timeval process_start_time;
+static bool cold_start_initialized = false;
+
+/* ----------------
+ * Helper macros for accessing flexible arrays in shared memory
+ * ----------------
+ */
+
+/* Get pointer to bucket array in table map */
+#define TABLE_MAP_BUCKETS(map) \
+ ((int *)((char *)(map) + sizeof(TrackTableMutationHashTable)))
+
+/* Get pointer to entry array in table map */
+#define TABLE_MAP_ENTRIES(map) \
+ ((TrackTableMutationEntry *)((char *)(map) + sizeof(TrackTableMutationHashTable) + \
+ (map)->num_buckets * sizeof(int)))
+
+/* Get pointer to bucket array in query cache */
+#define QUERY_CACHE_BUCKETS(cache) \
+ ((int *)((char *)(cache) + sizeof(QueryParseCache)))
+
+/* Get pointer to entry array in query cache */
+#define QUERY_CACHE_ENTRIES(cache) \
+ ((QueryParseEntry *)((char *)(cache) + sizeof(QueryParseCache) + \
+ (cache)->num_buckets * sizeof(int)))
+
+/* ----------------
+ * Semaphore lock helpers
+ * ----------------
+ */
+
+static inline void
+table_map_lock(void)
+{
+ pool_semaphore_lock(TRACK_TABLE_MUTATION_TABLE_SEM);
+}
+
+static inline void
+table_map_unlock(void)
+{
+ pool_semaphore_unlock(TRACK_TABLE_MUTATION_TABLE_SEM);
+}
+
+static inline void
+query_cache_lock(void)
+{
+ pool_semaphore_lock(TRACK_TABLE_MUTATION_QUERY_SEM);
+}
+
+static inline void
+query_cache_unlock(void)
+{
+ pool_semaphore_unlock(TRACK_TABLE_MUTATION_QUERY_SEM);
+}
+
+/* ----------------
+ * Hash functions
+ * ----------------
+ */
+
+/*
+ * FNV-1a hash for table/database oid pair
+ */
+static uint32
+fnv1a_hash_table_key(int table_oid, int dboid)
+{
+ uint32 hash = 2166136261u; /* FNV offset basis */
+ uint32 data[2];
+ const unsigned char *bytes;
+ size_t i;
+
+ data[0] = (uint32) table_oid;
+ data[1] = (uint32) dboid;
+ bytes = (const unsigned char *) data;
+
+ for (i = 0; i < sizeof(data); i++)
+ {
+ hash ^= bytes[i];
+ hash *= 16777619u; /* FNV prime */
+ }
+
+ return hash;
+}
+
+/*
+ * 64-bit FNV-1a hash of a byte string
+ */
+static uint64
+fnv1a_hash_64(const char *str, size_t len)
+{
+ uint64 hash = 14695981039346656037ULL; /* FNV offset basis for 64-bit */
+ size_t i;
+
+ for (i = 0; i < len; i++)
+ {
+ hash ^= (uint8)str[i];
+ hash *= 1099511628211ULL; /* FNV prime for 64-bit */
+ }
+
+ return hash;
+}
+
+/* ----------------
+ * Time utilities
+ * ----------------
+ */
+
+/*
+ * Get elapsed time in microseconds between two timevals
+ */
+static int64
+elapsed_us(struct timeval *start, struct timeval *end)
+{
+ return ((int64)(end->tv_sec - start->tv_sec) * 1000000) +
+ (end->tv_usec - start->tv_usec);
+}
+
+/*
+ * Get current time
+ */
+static void
+get_current_time(struct timeval *tv)
+{
+ gettimeofday(tv, NULL);
+}
+
+/* ----------------
+ * Database oid lookup
+ * ----------------
+ */
+
+static int
+track_table_mutation_get_database_oid_internal(void)
+{
+ int oid = 0;
+ static POOL_RELCACHE *relcache;
+ POOL_CONNECTION_POOL *backend;
+ POOL_SESSION_CONTEXT *session_context;
+
+ /* Safety check: must have shmem initialized */
+ if (track_table_mutation_shmem == NULL)
+ return oid;
+
+ session_context = pool_get_session_context(false);
+ if (session_context == NULL)
+ return oid;
+
+ /* Ensure we have a valid query context */
+ if (session_context->query_context == NULL)
+ return oid;
+
+ backend = session_context->backend;
+ if (backend == NULL || MAIN_CONNECTION(backend) == NULL || MAIN_CONNECTION(backend)->sp == NULL)
+ return oid;
+
+ /* Ensure database name is valid */
+ if (MAIN_CONNECTION(backend)->sp->database == NULL)
+ return oid;
+
+ if (!relcache)
+ {
+ relcache = pool_create_relcache(pool_config->relcache_size,
+ DATABASE_TO_OID_QUERY,
+ int_register_func,
+ int_unregister_func,
+ false);
+ if (relcache == NULL)
+ {
+ ereport(LOG,
+ (errmsg("track_table_mutation: error creating relcache while getting database OID")));
+ return oid;
+ }
+ }
+
+ oid = (int) (intptr_t) pool_search_relcache(relcache, backend,
+ MAIN_CONNECTION(backend)->sp->database);
+ return oid;
+}
+
+int
+pool_track_table_mutation_get_database_oid(void)
+{
+ return track_table_mutation_get_database_oid_internal();
+}
+
+/* ----------------
+ * Table mutation hash table operations
+ * ----------------
+ */
+
+/*
+ * Initialize table mutation hash table
+ */
+static void
+table_map_init(TrackTableMutationHashTable *map, int num_buckets, int max_entries)
+{
+ int *buckets;
+ TrackTableMutationEntry *entries;
+ int i;
+
+ map->num_buckets = num_buckets;
+ map->max_entries = max_entries;
+ map->num_entries = 0;
+ map->free_list_head = 0;
+
+ buckets = TABLE_MAP_BUCKETS(map);
+ entries = TABLE_MAP_ENTRIES(map);
+
+ /* Initialize all buckets to empty */
+ for (i = 0; i < num_buckets; i++)
+ buckets[i] = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ /* Initialize free list - chain all entries */
+ for (i = 0; i < max_entries; i++)
+ {
+ entries[i].in_use = false;
+ entries[i].next = (i < max_entries - 1) ? i + 1 : TRACK_TABLE_MUTATION_INVALID_INDEX;
+ }
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: initialized table map with %d buckets, %d max entries",
+ num_buckets, max_entries)));
+}
+
+/*
+ * Allocate an entry from the free list
+ */
+static int
+table_map_alloc_entry(TrackTableMutationHashTable *map)
+{
+ TrackTableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ int idx;
+
+ if (map->free_list_head == TRACK_TABLE_MUTATION_INVALID_INDEX)
+ return TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ idx = map->free_list_head;
+ map->free_list_head = entries[idx].next;
+ entries[idx].in_use = true;
+ entries[idx].next = TRACK_TABLE_MUTATION_INVALID_INDEX;
+ map->num_entries++;
+
+ return idx;
+}
+
+/*
+ * Free an entry back to the free list
+ */
+static void
+table_map_free_entry(TrackTableMutationHashTable *map, int idx)
+{
+ TrackTableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+
+ entries[idx].in_use = false;
+ entries[idx].next = map->free_list_head;
+ map->free_list_head = idx;
+ map->num_entries--;
+}
+
+/*
+ * Look up a table in the hash table
+ * Returns entry index or TRACK_TABLE_MUTATION_INVALID_INDEX if not found
+ * Must be called with lock held
+ */
+static int
+table_map_lookup(TrackTableMutationHashTable *map, int table_oid, int dboid, uint32 hash)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TrackTableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ int bucket = hash % map->num_buckets;
+ int idx = buckets[bucket];
+
+ while (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ if (entries[idx].hash == hash &&
+ entries[idx].table_oid == table_oid &&
+ entries[idx].dboid == dboid)
+ {
+ return idx;
+ }
+ idx = entries[idx].next;
+ }
+
+ return TRACK_TABLE_MUTATION_INVALID_INDEX;
+}
+
+/*
+ * Insert or update a table entry
+ * Must be called with lock held
+ */
+static void
+table_map_insert(TrackTableMutationHashTable *map, int table_oid, int dboid,
+ uint32 hash, struct timeval *write_time)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TrackTableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ int bucket = hash % map->num_buckets;
+ int idx;
+
+ /* Check if entry already exists */
+ idx = table_map_lookup(map, table_oid, dboid, hash);
+ if (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ /* Update existing entry */
+ entries[idx].last_write_time = *write_time;
+ return;
+ }
+
+ /* Allocate new entry */
+ idx = table_map_alloc_entry(map);
+ if (idx == TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ int b;
+ /* Table is full - evict an entry to make room */
+ /* For simplicity, evict the head of the first non-empty bucket, even if it has not expired yet */
+ for (b = 0; b < map->num_buckets; b++)
+ {
+ if (buckets[b] != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ int victim = buckets[b];
+ buckets[b] = entries[victim].next;
+ table_map_free_entry(map, victim);
+ idx = table_map_alloc_entry(map);
+ break;
+ }
+ }
+
+ if (idx == TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ ereport(WARNING,
+ (errmsg("track_table_mutation: failed to allocate entry for table oid %d (dboid %d)",
+ table_oid, dboid)));
+ return;
+ }
+ }
+
+ /* Initialize new entry */
+ entries[idx].table_oid = table_oid;
+ entries[idx].dboid = dboid;
+ entries[idx].hash = hash;
+ entries[idx].last_write_time = *write_time;
+
+ /* Insert at head of bucket chain */
+ entries[idx].next = buckets[bucket];
+ buckets[bucket] = idx;
+
+ ereport(DEBUG2,
+ (errmsg("track_table_mutation: marked table oid %d (dboid %d) as written",
+ table_oid, dboid)));
+}
+
+/*
+ * Remove expired entries from the table map
+ * Must be called with lock held
+ */
+static void
+table_map_cleanup_expired(TrackTableMutationHashTable *map, uint64 ttl_us)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TrackTableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ struct timeval now;
+ int removed = 0;
+ int b;
+
+ get_current_time(&now);
+
+ for (b = 0; b < map->num_buckets; b++)
+ {
+ int *prev_ptr = &buckets[b];
+ int idx = buckets[b];
+
+ while (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ int64 elapsed = elapsed_us(&entries[idx].last_write_time, &now);
+
+ if (elapsed > (int64)ttl_us)
+ {
+ /* Entry has expired - remove it */
+ int next = entries[idx].next;
+ *prev_ptr = next;
+ table_map_free_entry(map, idx);
+ idx = next;
+ removed++;
+ }
+ else
+ {
+ prev_ptr = &entries[idx].next;
+ idx = entries[idx].next;
+ }
+ }
+ }
+
+ if (removed > 0)
+ {
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: cleaned up %d expired table entries", removed)));
+ }
+}
+
+/* ----------------
+ * Query parse cache operations
+ * ----------------
+ */
+
+/*
+ * Initialize query parse cache
+ */
+static void
+query_cache_init(QueryParseCache *cache, int num_buckets, int max_entries)
+{
+ int *buckets;
+ QueryParseEntry *entries;
+ int i;
+
+ cache->num_buckets = num_buckets;
+ cache->max_entries = max_entries;
+ cache->num_entries = 0;
+ cache->free_list_head = 0;
+ cache->lru_head = TRACK_TABLE_MUTATION_INVALID_INDEX;
+ cache->lru_tail = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ buckets = QUERY_CACHE_BUCKETS(cache);
+ entries = QUERY_CACHE_ENTRIES(cache);
+
+ /* Initialize all buckets to empty */
+ for (i = 0; i < num_buckets; i++)
+ buckets[i] = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ /* Initialize free list */
+ for (i = 0; i < max_entries; i++)
+ {
+ entries[i].in_use = false;
+ entries[i].next = (i < max_entries - 1) ? i + 1 : TRACK_TABLE_MUTATION_INVALID_INDEX;
+ entries[i].lru_prev = TRACK_TABLE_MUTATION_INVALID_INDEX;
+ entries[i].lru_next = TRACK_TABLE_MUTATION_INVALID_INDEX;
+ }
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: initialized query cache with %d buckets, %d max entries",
+ num_buckets, max_entries)));
+}
+
+/*
+ * Move entry to front of LRU list (most recently used)
+ */
+static void
+query_cache_lru_touch(QueryParseCache *cache, int idx)
+{
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+
+ /* Already at head? */
+ if (cache->lru_head == idx)
+ return;
+
+ /* Remove from current position */
+ if (entries[idx].lru_prev != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ entries[entries[idx].lru_prev].lru_next = entries[idx].lru_next;
+ if (entries[idx].lru_next != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ entries[entries[idx].lru_next].lru_prev = entries[idx].lru_prev;
+ if (cache->lru_tail == idx)
+ cache->lru_tail = entries[idx].lru_prev;
+
+ /* Insert at head */
+ entries[idx].lru_prev = TRACK_TABLE_MUTATION_INVALID_INDEX;
+ entries[idx].lru_next = cache->lru_head;
+ if (cache->lru_head != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ entries[cache->lru_head].lru_prev = idx;
+ cache->lru_head = idx;
+ if (cache->lru_tail == TRACK_TABLE_MUTATION_INVALID_INDEX)
+ cache->lru_tail = idx;
+}
+
+/*
+ * Add entry to LRU list (at head)
+ */
+static void
+query_cache_lru_add(QueryParseCache *cache, int idx)
+{
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+
+ entries[idx].lru_prev = TRACK_TABLE_MUTATION_INVALID_INDEX;
+ entries[idx].lru_next = cache->lru_head;
+
+ if (cache->lru_head != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ entries[cache->lru_head].lru_prev = idx;
+
+ cache->lru_head = idx;
+
+ if (cache->lru_tail == TRACK_TABLE_MUTATION_INVALID_INDEX)
+ cache->lru_tail = idx;
+}
+
+/*
+ * Remove entry from LRU list
+ */
+static void
+query_cache_lru_remove(QueryParseCache *cache, int idx)
+{
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+
+ if (entries[idx].lru_prev != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ entries[entries[idx].lru_prev].lru_next = entries[idx].lru_next;
+ else
+ cache->lru_head = entries[idx].lru_next;
+
+ if (entries[idx].lru_next != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ entries[entries[idx].lru_next].lru_prev = entries[idx].lru_prev;
+ else
+ cache->lru_tail = entries[idx].lru_prev;
+
+ entries[idx].lru_prev = TRACK_TABLE_MUTATION_INVALID_INDEX;
+ entries[idx].lru_next = TRACK_TABLE_MUTATION_INVALID_INDEX;
+}
+
+/*
+ * Allocate entry from free list, evicting LRU if necessary
+ */
+static int
+query_cache_alloc_entry(QueryParseCache *cache)
+{
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+ int *buckets = QUERY_CACHE_BUCKETS(cache);
+ int idx;
+
+ if (cache->free_list_head != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ idx = cache->free_list_head;
+ cache->free_list_head = entries[idx].next;
+ entries[idx].in_use = true;
+ entries[idx].next = TRACK_TABLE_MUTATION_INVALID_INDEX;
+ cache->num_entries++;
+ return idx;
+ }
+
+ /* No free entries - evict LRU */
+ if (cache->lru_tail == TRACK_TABLE_MUTATION_INVALID_INDEX)
+ return TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ idx = cache->lru_tail;
+
+ /* Remove from hash bucket */
+ int bucket = entries[idx].query_hash % cache->num_buckets;
+ int *prev_ptr = &buckets[bucket];
+ int curr = buckets[bucket];
+
+ while (curr != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ if (curr == idx)
+ {
+ *prev_ptr = entries[curr].next;
+ break;
+ }
+ prev_ptr = &entries[curr].next;
+ curr = entries[curr].next;
+ }
+
+ /* Remove from LRU list */
+ query_cache_lru_remove(cache, idx);
+
+ /* Reinitialize entry */
+ entries[idx].in_use = true;
+ entries[idx].next = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ return idx;
+}
+
+/*
+ * Look up a query in the cache
+ */
+static int
+query_cache_lookup(QueryParseCache *cache, uint64 hash)
+{
+ int *buckets = QUERY_CACHE_BUCKETS(cache);
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+ int bucket = hash % cache->num_buckets;
+ int idx = buckets[bucket];
+
+ while (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ if (entries[idx].query_hash == hash)
+ return idx;
+ idx = entries[idx].next;
+ }
+
+ return TRACK_TABLE_MUTATION_INVALID_INDEX;
+}
+
+/* ----------------
+ * Query normalization
+ * ----------------
+ */
+
+/*
+ * Simple query normalization:
+ * - Strip comments (-- style and C-style block comments)
+ * - Collapse whitespace
+ * - Convert to lowercase (except inside strings)
+ * - Replace literal values with placeholders
+ *
+ * This is a simplified version - pgpool2 already does this elsewhere,
+ * but we need a standalone version for the track table mutation feature.
+ */
+static size_t
+normalize_query(const char *query, char *output, size_t output_size)
+{
+ const char *src = query;
+ char *dst = output;
+ char *dst_end = output + output_size - 1;
+ bool in_string = false;
+ char string_char = 0;
+ bool last_was_space = true; /* Start true to skip leading space */
+
+ while (*src && dst < dst_end)
+ {
+ /* Handle string literals */
+ if (in_string)
+ {
+ if (*src == string_char)
+ {
+ if (*(src + 1) == string_char)
+ {
+ /* Escaped quote */
+ src += 2;
+ continue;
+ }
+ in_string = false;
+ *dst++ = '$'; /* Replace string content with placeholder */
+ }
+ src++;
+ continue;
+ }
+
+ /* Check for string start */
+ if (*src == '\'' || *src == '"')
+ {
+ in_string = true;
+ string_char = *src;
+ src++;
+ continue;
+ }
+
+ /* Handle single-line comments */
+ if (*src == '-' && *(src + 1) == '-')
+ {
+ while (*src && *src != '\n')
+ src++;
+ continue;
+ }
+
+ /* Handle multi-line comments */
+ if (*src == '/' && *(src + 1) == '*')
+ {
+ src += 2;
+ while (*src && !(*src == '*' && *(src + 1) == '/'))
+ src++;
+ if (*src)
+ src += 2;
+ continue;
+ }
+
+ /* Handle whitespace */
+ if (*src == ' ' || *src == '\t' || *src == '\n' || *src == '\r')
+ {
+ if (!last_was_space)
+ {
+ *dst++ = ' ';
+ last_was_space = true;
+ }
+ src++;
+ continue;
+ }
+
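+ /* Handle numeric literals - replace with placeholder; digits inside identifiers (e.g. t1) are kept */
+ if (((*src >= '0' && *src <= '9') || (*src == '.' && *(src + 1) >= '0' && *(src + 1) <= '9')) &&
+ !(dst > output && (*(dst - 1) == '_' || (*(dst - 1) >= 'a' && *(dst - 1) <= 'z') || (*(dst - 1) >= '0' && *(dst - 1) <= '9'))))
+ {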
+ while (*src && ((*src >= '0' && *src <= '9') || *src == '.'))
+ src++;
+ if (!last_was_space && dst > output && *(dst - 1) != '$')
+ *dst++ = '$';
+ last_was_space = false;
+ continue;
+ }
+
+ /* Regular character - convert to lowercase */
+ if (*src >= 'A' && *src <= 'Z')
+ *dst++ = *src + 32;
+ else
+ *dst++ = *src;
+
+ last_was_space = false;
+ src++;
+ }
+
+ /* Remove trailing space */
+ if (dst > output && *(dst - 1) == ' ')
+ dst--;
+
+ *dst = '\0';
+ return dst - output;
+}
+
+/* ----------------
+ * Public API implementation
+ * ----------------
+ */
+
+Size
+pool_track_table_mutation_shmem_size(void)
+{
+ Size size = 0;
+ int table_buckets = pool_config->track_table_mutation_table_buckets;
+ int table_size = pool_config->track_table_mutation_table_size;
+ int query_buckets = pool_config->track_table_mutation_query_buckets;
+ int query_cache_size = pool_config->track_table_mutation_query_parse_cache_size;
+
+ /* Main structure */
+ size += sizeof(TrackTableMutationShmem);
+
+ /* Table mutation hash table */
+ size += sizeof(TrackTableMutationHashTable);
+ size += table_buckets * sizeof(int); /* buckets array */
+ size += table_size * sizeof(TrackTableMutationEntry); /* entries array */
+
+ /* Query parse cache */
+ size += sizeof(QueryParseCache);
+ size += query_buckets * sizeof(int); /* buckets array */
+ size += query_cache_size * sizeof(QueryParseEntry); /* entries array */
+
+ return size;
+}
+
+void
+pool_track_table_mutation_init(void)
+{
+#ifndef POOL_PRIVATE
+ Size shmem_size;
+ char *shmem_ptr;
+
+ if (!pool_config->track_table_mutation_enabled)
+ {
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: feature disabled")));
+ return;
+ }
+
+ shmem_size = pool_track_table_mutation_shmem_size();
+
+ /*
+ * Allocate from the main shared memory segment.
+ * Memory is already zeroed by initialize_shared_memory_main_segment().
+ */
+ shmem_ptr = pool_shared_memory_segment_get_chunk(shmem_size);
+ if (shmem_ptr == NULL)
+ {
+ ereport(ERROR,
+ (errmsg("track_table_mutation: failed to allocate %zu bytes of shared memory",
+ shmem_size)));
+ return;
+ }
+
+ /* Set up pointers to structures within shared memory */
+ track_table_mutation_shmem = (TrackTableMutationShmem *)shmem_ptr;
+ shmem_ptr += sizeof(TrackTableMutationShmem);
+
+ track_table_mutation_shmem->table_map = (TrackTableMutationHashTable *)shmem_ptr;
+ shmem_ptr += sizeof(TrackTableMutationHashTable);
+ shmem_ptr += pool_config->track_table_mutation_table_buckets * sizeof(int);
+ shmem_ptr += pool_config->track_table_mutation_table_size * sizeof(TrackTableMutationEntry);
+
+ track_table_mutation_shmem->query_cache = (QueryParseCache *)shmem_ptr;
+
+ /* Initialize structures */
+ table_map_init(track_table_mutation_shmem->table_map,
+ pool_config->track_table_mutation_table_buckets,
+ pool_config->track_table_mutation_table_size);
+
+ query_cache_init(track_table_mutation_shmem->query_cache,
+ pool_config->track_table_mutation_query_buckets,
+ pool_config->track_table_mutation_query_parse_cache_size);
+
+ /* Initialize global state */
+ track_table_mutation_shmem->state.initialized = true;
+ track_table_mutation_shmem->state.current_ttl_us = TRACK_TABLE_MUTATION_DEFAULT_TTL_US;
+ get_current_time(&track_table_mutation_shmem->state.ttl_last_updated);
+ get_current_time(&track_table_mutation_shmem->state.last_cleanup_time);
+ track_table_mutation_shmem->state.global_cold_start_until.tv_sec = 0;
+ track_table_mutation_shmem->state.global_cold_start_until.tv_usec = 0;
+ track_table_mutation_shmem->state.stats_queries_checked = 0;
+ track_table_mutation_shmem->state.stats_forced_primary = 0;
+ track_table_mutation_shmem->state.stats_allowed_replica = 0;
+
+ ereport(LOG,
+ (errmsg("track_table_mutation: initialized with %zu bytes shared memory",
+ shmem_size)));
+#endif
+}
+
+void
+pool_track_table_mutation_child_init(void)
+{
+ if (!pool_config->track_table_mutation_enabled || track_table_mutation_shmem == NULL)
+ return;
+
+ get_current_time(&process_start_time);
+ cold_start_initialized = true;
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: child initialized, cold start period %d ms",
+ pool_config->track_table_mutation_cold_start_duration)));
+}
+
+bool
+pool_track_table_mutation_in_cold_start(void)
+{
+ struct timeval now;
+ int64 elapsed_ms;
+
+ if (!pool_config->track_table_mutation_enabled || track_table_mutation_shmem == NULL)
+ return false;
+
+ if (pool_config->track_table_mutation_cold_start_duration <= 0)
+ return false;
+
+ get_current_time(&now);
+
+ /* Check for watchdog-triggered global cold start first */
+ if (track_table_mutation_shmem->state.global_cold_start_until.tv_sec != 0 &&
+ elapsed_us(&now, &track_table_mutation_shmem->state.global_cold_start_until) > 0)
+ {
+ return true;
+ }
+
+ /* Check per-process cold start */
+ if (!cold_start_initialized)
+ return false;
+
+ elapsed_ms = elapsed_us(&process_start_time, &now) / 1000;
+
+ if (elapsed_ms < pool_config->track_table_mutation_cold_start_duration)
+ {
+ ereport(DEBUG2,
+ (errmsg("track_table_mutation: in cold start (%ld/%d ms)",
+ (long)elapsed_ms, pool_config->track_table_mutation_cold_start_duration)));
+ return true;
+ }
+
+ return false;
+}
+
+void
+pool_track_table_mutation_trigger_global_cold_start(void)
+{
+ struct timeval now;
+ int duration_ms;
+
+ if (!pool_config->track_table_mutation_enabled || track_table_mutation_shmem == NULL)
+ return;
+
+ duration_ms = pool_config->track_table_mutation_cold_start_duration;
+ if (duration_ms <= 0)
+ return;
+
+ get_current_time(&now);
+ track_table_mutation_shmem->state.global_cold_start_until = now;
+ track_table_mutation_shmem->state.global_cold_start_until.tv_sec += duration_ms / 1000;
+ track_table_mutation_shmem->state.global_cold_start_until.tv_usec += (duration_ms % 1000) * 1000;
+ if (track_table_mutation_shmem->state.global_cold_start_until.tv_usec >= 1000000)
+ {
+ track_table_mutation_shmem->state.global_cold_start_until.tv_sec +=
+ track_table_mutation_shmem->state.global_cold_start_until.tv_usec / 1000000;
+ track_table_mutation_shmem->state.global_cold_start_until.tv_usec %=
+ 1000000;
+ }
+
+ ereport(LOG,
+ (errmsg("track_table_mutation: entering global cold start for %d ms",
+ duration_ms)));
+}
+
+bool
+pool_track_table_mutation_table_is_stale(int table_oid, int dboid)
+{
+ TrackTableMutationHashTable *map;
+ struct timeval now;
+ uint64 ttl_us;
+ uint32 hash;
+ int idx;
+ bool is_stale = false;
+
+ if (!pool_config->track_table_mutation_enabled || track_table_mutation_shmem == NULL)
+ return false;
+
+ if (table_oid <= 0 || dboid <= 0)
+ {
+ is_stale = true;
+ goto update_stats;
+ }
+
+ map = track_table_mutation_shmem->table_map;
+ hash = fnv1a_hash_table_key(table_oid, dboid);
+
+ table_map_lock();
+
+ idx = table_map_lookup(map, table_oid, dboid, hash);
+ if (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ TrackTableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ int64 elapsed;
+ get_current_time(&now);
+ ttl_us = track_table_mutation_shmem->state.current_ttl_us;
+ elapsed = elapsed_us(&entries[idx].last_write_time, &now);
+ is_stale = (elapsed < (int64)ttl_us);
+
+ ereport(DEBUG2,
+ (errmsg("track_table_mutation: table oid %d (dboid %d) elapsed=%ld us, ttl=%lu us, stale=%d",
+ table_oid, dboid, (long)elapsed, (unsigned long)ttl_us, is_stale)));
+ }
+
+ table_map_unlock();
+
+update_stats:
+ /* Update statistics - skip if shmem not available */
+ if (track_table_mutation_shmem != NULL)
+ {
+ __sync_fetch_and_add(&track_table_mutation_shmem->state.stats_queries_checked, 1);
+ if (is_stale)
+ __sync_fetch_and_add(&track_table_mutation_shmem->state.stats_forced_primary, 1);
+ else
+ __sync_fetch_and_add(&track_table_mutation_shmem->state.stats_allowed_replica, 1);
+ }
+
+ return is_stale;
+}
+
+void
+pool_track_table_mutation_mark_tables_written(const int *table_oids, int num_tables, int dboid)
+{
+ TrackTableMutationHashTable *map;
+ struct timeval now;
+ int i;
+
+ if (!pool_config->track_table_mutation_enabled || track_table_mutation_shmem == NULL)
+ return;
+
+ if (num_tables <= 0 || table_oids == NULL || dboid <= 0)
+ return;
+
+ map = track_table_mutation_shmem->table_map;
+ get_current_time(&now);
+
+ table_map_lock();
+
+ /* Periodically clean up expired entries */
+ if (map->num_entries > map->max_entries * 3 / 4)
+ {
+ /* Limit cleanup frequency to avoid O(N) scan on every write */
+ /* 100ms interval */
+ if (elapsed_us(&track_table_mutation_shmem->state.last_cleanup_time, &now) > 100000)
+ {
+ table_map_cleanup_expired(map, track_table_mutation_shmem->state.current_ttl_us);
+ track_table_mutation_shmem->state.last_cleanup_time = now;
+ }
+ }
+
+ for (i = 0; i < num_tables; i++)
+ {
+ uint32 hash;
+ int table_oid = table_oids[i];
+
+ if (table_oid > 0)
+ {
+ hash = fnv1a_hash_table_key(table_oid, dboid);
+ table_map_insert(map, table_oid, dboid, hash, &now);
+ }
+ }
+
+ table_map_unlock();
+}
+
+/*
+ * Convenience function to mark a single table as written
+ */
+void
+pool_track_table_mutation_mark_table_written(int table_oid, int dboid)
+{
+ if (table_oid > 0 && dboid > 0)
+ {
+ const int tables[1] = { table_oid };
+ pool_track_table_mutation_mark_tables_written(tables, 1, dboid);
+ }
+}
+
+void
+pool_track_table_mutation_update_ttl(uint64 delay_us)
+{
+ uint64 new_ttl;
+
+ if (!pool_config->track_table_mutation_enabled || track_table_mutation_shmem == NULL)
+ return;
+
+ /* Calculate new TTL: delay * factor, with minimum of default TTL */
+ new_ttl = (uint64)(delay_us * pool_config->track_table_mutation_ttl_factor);
+ if (new_ttl < TRACK_TABLE_MUTATION_DEFAULT_TTL_US)
+ new_ttl = TRACK_TABLE_MUTATION_DEFAULT_TTL_US;
+
+ /* Maximum TTL of 1 hour */
+ if (new_ttl > 3600ULL * 1000000ULL)
+ new_ttl = 3600ULL * 1000000ULL;
+
+ track_table_mutation_shmem->state.current_ttl_us = new_ttl;
+ get_current_time(&track_table_mutation_shmem->state.ttl_last_updated);
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: updated TTL to %lu us (delay=%lu us, factor=%.1f)",
+ (unsigned long)new_ttl, (unsigned long)delay_us,
+ pool_config->track_table_mutation_ttl_factor)));
+}
+
+bool
+pool_track_table_mutation_get_cached_parse(uint64 hash, bool *is_write,
+ char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
+ int *num_tables)
+{
+ QueryParseCache *cache;
+ int idx;
+ bool found = false;
+
+ if (!pool_config->track_table_mutation_enabled || track_table_mutation_shmem == NULL)
+ return false;
+
+ cache = track_table_mutation_shmem->query_cache;
+
+ query_cache_lock();
+
+ idx = query_cache_lookup(cache, hash);
+ if (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+ int i;
+
+ *is_write = entries[idx].is_write;
+ *num_tables = entries[idx].num_tables;
+
+ for (i = 0; i < entries[idx].num_tables && i < TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY; i++)
+ {
+ strlcpy(table_names[i], entries[idx].table_names[i], TRACK_TABLE_MUTATION_TABLE_NAME_LEN);
+ }
+
+ /* Move to front of LRU */
+ query_cache_lru_touch(cache, idx);
+ found = true;
+ }
+
+ query_cache_unlock();
+
+ return found;
+}
+
+void
+pool_track_table_mutation_cache_parse(uint64 hash, bool is_write,
+ const char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
+ int num_tables)
+{
+ QueryParseCache *cache;
+ int *buckets;
+ QueryParseEntry *entries;
+ int idx;
+ int bucket;
+
+ if (!pool_config->track_table_mutation_enabled || track_table_mutation_shmem == NULL)
+ return;
+
+ cache = track_table_mutation_shmem->query_cache;
+
+ query_cache_lock();
+
+ /* Check if already exists */
+ idx = query_cache_lookup(cache, hash);
+ if (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ query_cache_unlock();
+ return;
+ }
+
+ /* Allocate new entry (may evict LRU) */
+ idx = query_cache_alloc_entry(cache);
+ if (idx == TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ query_cache_unlock();
+ ereport(WARNING,
+ (errmsg("track_table_mutation: failed to allocate query cache entry")));
+ return;
+ }
+
+ entries = QUERY_CACHE_ENTRIES(cache);
+ buckets = QUERY_CACHE_BUCKETS(cache);
+
+ /* Fill in entry */
+ entries[idx].query_hash = hash;
+ entries[idx].is_write = is_write;
+ entries[idx].num_tables = (num_tables > TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY) ?
+ TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY : num_tables;
+
+ {
+ int i;
+ for (i = 0; i < entries[idx].num_tables; i++)
+ {
+ strlcpy(entries[idx].table_names[i], table_names[i], TRACK_TABLE_MUTATION_TABLE_NAME_LEN);
+ }
+ }
+
+ /* Insert into hash bucket */
+ bucket = hash % cache->num_buckets;
+ entries[idx].next = buckets[bucket];
+ buckets[bucket] = idx;
+
+ /* Add to LRU list */
+ query_cache_lru_add(cache, idx);
+
+ query_cache_unlock();
+}
+
+uint64
+pool_track_table_mutation_normalize_and_hash(const char *query)
+{
+ char normalized[8192];
+ size_t len;
+
+ if (query == NULL || query[0] == '\0')
+ return 0;
+
+ len = normalize_query(query, normalized, sizeof(normalized));
+ if (len == 0)
+ return 0;
+
+ return fnv1a_hash_64(normalized, len);
+}
--
2.52.0
* Re: Proposal: Recent mutated table tracking in memory
2026-01-06 11:25 Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-14 08:55 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-28 05:08 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-29 08:28 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-29 08:54 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-30 08:09 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-31 17:11 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-03 07:43 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-03 23:23 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-06 11:29 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
@ 2026-02-10 15:16 ` Nadav Shatz <[email protected]>
2026-02-11 10:28 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
0 siblings, 1 reply; 44+ messages in thread
From: Nadav Shatz @ 2026-02-10 15:16 UTC (permalink / raw)
To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]
Hi Tatsuo,
After reading more about disable_load_balance_on_write=dml_adaptive, I came
to think that this feature is actually an "extension" of it, since it covers
"global" behavior and not just per-transaction behavior. In any case, I think
it is clearer for it to sit under disable_load_balance_on_write rather than
as a standalone feature.
I'm attaching below an updated patch with these adjustments.
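To make the "extension" relationship concrete, here is a minimal sketch of
how a mode check could treat dml_adaptive_global as a superset of
dml_adaptive. Only the names DLBOW_DML_ADAPTIVE, DLBOW_DML_ADAPTIVE_GLOBAL
and DLBOW_IS_DML_ADAPTIVE come from the patch; the enum layout, its values,
and the helper sketch_uses_shared_tracking() are assumptions made up for
illustration, and the real definitions may differ.

```c
#include <stdbool.h>

/* Hypothetical enum: member order and values are assumed, not taken from pgpool. */
typedef enum
{
	DLBOW_OFF,
	DLBOW_TRANSACTION,
	DLBOW_TRANS_TRANSACTION,
	DLBOW_ALWAYS,
	DLBOW_DML_ADAPTIVE,
	DLBOW_DML_ADAPTIVE_GLOBAL
} SketchDlbowOption;

/* True for both adaptive modes, so existing per-transaction code paths keep working. */
#define DLBOW_IS_DML_ADAPTIVE(mode) \
	((mode) == DLBOW_DML_ADAPTIVE || (mode) == DLBOW_DML_ADAPTIVE_GLOBAL)

/* Hypothetical helper: only the global mode consults the shared-memory table map. */
static bool
sketch_uses_shared_tracking(SketchDlbowOption mode)
{
	return mode == DLBOW_DML_ADAPTIVE_GLOBAL;
}
```

With a predicate like this, code that previously compared against
DLBOW_DML_ADAPTIVE directly can switch to the macro, while the new
cross-session checks stay gated on the global mode alone.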
Please let me know what you think.
On Fri, Feb 6, 2026 at 1:29 PM Nadav Shatz <[email protected]> wrote:
> Hi Tatsuo,
>
> Thank you for all the great comments and questions! I took all of them
> into consideration, either adding support/tests or detailing the
> limitations in the docs.
>
> Let me know what you think of the latest patch attached here.
>
> On Wed, Feb 4, 2026 at 1:23 AM Tatsuo Ishii <[email protected]> wrote:
>
>> From: Tatsuo Ishii <[email protected]>
>> Subject: Re: Proposal: Recent mutated table tracking in memory
>> Date: Tue, 03 Feb 2026 16:43:53 +0900 (JST)
>> Message-ID: <[email protected]>
>>
>> > Hi Nadav,
>> >
>> > Thank you for updating the patch!
>> >
>> >> Thank you for the comments!
>> >>
>> >> I agree with all of them. Let me know what you think of the changes
>> >> and new naming.
>> >
>> > I still think "memory_map" is too generic. Anything put in memory for
>> > data mapping could be called a "memory map". I recommend changing the
>> > name to a more feature-specific one. What about replacing "memory_map"
>> > with "track_table_mutation"? It's a slightly longer name, but it
>> > clearly represents the feature. Any better ideas are welcome.
>> >
>> > - memory_map_enabled: Enable/disable the feature (default: off)
>> > - memory_map_ttl_factor: TTL multiplier for replication delay (default: 5.0)
>> > - memory_map_cold_start_duration: Cold start period in ms (default: 2000)
>> > - memory_map_table_buckets: Hash buckets for table map (default: 1024)
>> > - memory_map_table_size: Max tracked tables (default: 2048)
>> > - memory_map_query_buckets: Hash buckets for query cache (default: 2048)
>> > - memory_map_query_cache_size: Max cached queries (default: 10000)
>> >
>> > Also I feel memory_map_query_cache_size is confusing because there's
>> > already a "query cache" feature in pgpool. Can we change it to something
>> > like "query_parse_cache_size"?
>> >
>> > Review comments:
>> >
>> > (1) Why is the regression test numbered 45? Shouldn't it be 42? (The last
>> > feature test is 041.external_replication_delay.)
>> >
>> > (2) You enhanced the patch to deal with watchdog leader changes. That's
>> > good. However, I don't see a test case for it in test.sh.
>> >
>> > (3) It seems the patch does not support TRUNCATE, MERGE, PREPARE, or
>> > WITH + updating. If so, it should be noted in the docs as a limitation
>> > of the feature.
>>
>> (4) It seems the patch does not consider transactions. If an UPDATE is
>> performed in a transaction and the transaction is rolled back, load
>> balancing is disabled despite the fact that the table modification did
>> not happen.
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS K.K.
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese:http://www.sraoss.co.jp
>>
>
>
> --
> Nadav Shatz
> Tailor Brands | CTO
>
--
Nadav Shatz
Tailor Brands | CTO
Attachments:
[application/octet-stream] table_track.patch (96.3K, 3-table_track.patch)
From 469a36663b8d85d844cd94fd92afa95a4dd0160d Mon Sep 17 00:00:00 2001
From: Nadav Shatz <[email protected]>
Date: Tue, 6 Jan 2026 12:41:50 +0200
Subject: [PATCH] Feature: add in-memory table tracking to prevent stale reads
from replicas
Introduces 'dml_adaptive_global' as a new value for disable_load_balance_on_write.
This mode is a superset of dml_adaptive: it performs per-transaction local tracking
AND cross-session shared-memory tracking of recently written tables, routing reads
to primary until a TTL (based on measured replication delay) expires.
Sub-parameters (track_table_mutation_*) control TTL factor, cold start duration,
hash table sizing, and query parse cache sizing.
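
As a rough illustration of the TTL rule described above, the sketch below
derives a TTL from a measured replication delay and then decides whether a
table written at a given time still counts as stale. It is not the patch's
code: the sketch_* names are hypothetical, the 1-second floor is an assumed
stand-in for TRACK_TABLE_MUTATION_DEFAULT_TTL_US (whose value is not shown
here), and the clamping is done in floating point before the cast, which
differs slightly from the patch. The 1-hour ceiling matches the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed floor; stands in for TRACK_TABLE_MUTATION_DEFAULT_TTL_US. */
#define SKETCH_MIN_TTL_US UINT64_C(1000000)
/* Ceiling of one hour, the same cap the patch applies. */
#define SKETCH_MAX_TTL_US (UINT64_C(3600) * UINT64_C(1000000))

/* TTL = replication delay * factor, clamped to [floor, ceiling]. */
static uint64_t
sketch_compute_ttl_us(uint64_t delay_us, double factor)
{
	double		scaled = (double) delay_us * factor;

	/* Clamp while still in floating point so the cast below cannot overflow. */
	if (scaled < (double) SKETCH_MIN_TTL_US)
		return SKETCH_MIN_TTL_US;
	if (scaled > (double) SKETCH_MAX_TTL_US)
		return SKETCH_MAX_TTL_US;
	return (uint64_t) scaled;
}

/* A table is stale while less than one TTL has elapsed since its last write. */
static bool
sketch_is_stale(int64_t last_write_us, int64_t now_us, uint64_t ttl_us)
{
	return (now_us - last_write_us) < (int64_t) ttl_us;
}
```

For example, with a 2-second measured delay and the default factor of 5.0,
a table written at time t would keep routing reads to primary until
t + 10 seconds under this sketch.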
diff --git a/doc/src/sgml/loadbalance.sgml b/doc/src/sgml/loadbalance.sgml
index ee19fabebab2210cd4abe59a711a036ac0ac8943..c69814dd05cb66fb5f4f47944bd688692354a707 100644
--- a/doc/src/sgml/loadbalance.sgml
+++ b/doc/src/sgml/loadbalance.sgml
@@ -1108,6 +1108,18 @@ app_name_redirect_preference_list > database_redirect_preference_list > us
Dependent functions, triggers, and views on the tables can be configured
using <xref linkend="guc-dml-adaptive-object-relationship-list">
</para>
+
+ <para>
+ If this parameter is set to <varname>dml_adaptive_global</varname>,
+ <productname>Pgpool-II</> behaves like <varname>dml_adaptive</varname>
+ (per-transaction write tracking) and additionally uses shared memory to track
+ recently written tables across all sessions of the local Pgpool-II node. When a table is
+ written in any session, subsequent reads of that table from any session are
+ routed to primary until a TTL (based on measured replication delay) expires.
+ This prevents stale reads after writes even across different connections.
+ See <xref linkend="runtime-config-table-mutation-map"> for the sub-parameters
+ that control the shared-memory tracking behavior.
+ </para>
</listitem>
</varlistentry>
@@ -1193,4 +1205,257 @@ dml_adaptive_object_relationship_list = 'table_1:table_2'
</variablelist>
</sect2>
+
+ <sect2 id="runtime-config-table-mutation-map">
+ <title>Table Mutation Map Configuration (Lagless Replica Reads)</title>
+
+ <para>
+ These parameters configure the track table mutation feature, which is activated by setting
+ <xref linkend="guc-disable-load-balance-on-write"> to <literal>dml_adaptive_global</literal>.
+ The feature tracks recently written tables to prevent stale reads from replica nodes during
+ replication lag, implementing the "lagless" architecture pattern for distributed systems
+ with read replicas.
+ </para>
+
+ <para>
+ When a table is modified (INSERT/UPDATE/DELETE), it is marked as "stale" for a TTL period
+ (<literal>replication_delay * track_table_mutation_ttl_factor</literal>, bounded by a built-in minimum and a one-hour maximum).
+ Any SELECT on a stale table is routed to the primary node instead of a replica, ensuring read-after-write consistency.
+ </para>
+
+ <para>
+ This feature requires <xref linkend="guc-replication-delay-source-cmd"> to be configured
+ for monitoring replication delay from replicas.
+ </para>
+
+ <warning>
+ <para>
+ Enabling <literal>dml_adaptive_global</literal> increases shared memory consumption. With default settings,
+ the feature requires approximately 6.4 MB of shared memory (0.1 MB for table tracking + 6.3 MB for query cache).
+ Memory usage scales with configuration parameters:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ Table tracking: <literal>track_table_mutation_table_size * 40 bytes</literal> (default: 2048 * 40 = ~80 KB)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Query cache: <literal>track_table_mutation_query_parse_cache_size * 640 bytes</literal> (default: 10000 * 640 = ~6.3 MB)
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ For high-traffic systems with large cache sizes (e.g., <literal>track_table_mutation_query_parse_cache_size = 100000</literal>),
+ memory usage can reach 64 MB or more. Consider your system's available shared memory when using <literal>dml_adaptive_global</literal>.
+ </para>
+ </warning>
+
+ <variablelist>
+
+ <varlistentry id="guc-track-table-mutation-ttl-factor" xreflabel="track_table_mutation_ttl_factor">
+ <term><varname>track_table_mutation_ttl_factor</varname> (<type>floating point</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_ttl_factor</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Multiplier for calculating the TTL: <literal>TTL = replication_delay * track_table_mutation_ttl_factor</literal>.
+ Higher values provide more safety margin but may reduce read replica utilization.
+ </para>
+ <para>
+ Valid range: 1.0-100.0. Default is <literal>5.0</literal>.
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-cold-start-duration" xreflabel="track_table_mutation_cold_start_duration">
+ <term><varname>track_table_mutation_cold_start_duration</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_cold_start_duration</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Duration in milliseconds to route all queries to primary after a child process starts.
+ This prevents stale reads when a new connection is established before the table mutation
+ map is populated with recent write history.
+ </para>
+ <para>
+ When watchdog is enabled and the local node becomes the leader, Pgpool-II also triggers a
+ global cold start for this duration to avoid stale reads after leadership changes.
+ </para>
+ <para>
+ Valid range: 0-60000 ms. Default is <literal>2000</literal> (2 seconds).
+ Set to 0 to disable cold start behavior.
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-table-buckets" xreflabel="track_table_mutation_table_buckets">
+ <term><varname>track_table_mutation_table_buckets</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_table_buckets</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Number of hash buckets for the track table mutation hash table.
+ Higher values reduce hash collisions and improve lookup performance.
+ </para>
+ <para>
+ Valid range: 64-65536. Default is <literal>1024</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-table-size" xreflabel="track_table_mutation_table_size">
+ <term><varname>track_table_mutation_table_size</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_table_size</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Maximum number of tables that can be tracked simultaneously in the table mutation map.
+ When the map is full, the oldest entries are evicted to make room for new ones.
+ </para>
+ <para>
+ Valid range: 128-131072. Default is <literal>2048</literal>.
+ Memory usage: approximately 40 bytes per entry.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-query-buckets" xreflabel="track_table_mutation_query_buckets">
+ <term><varname>track_table_mutation_query_buckets</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_query_buckets</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Number of hash buckets for the query parse cache. The cache stores normalized
+ query strings mapped to their table dependencies to avoid repeated parsing.
+ </para>
+ <para>
+ Valid range: 64-65536. Default is <literal>2048</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-query-parse-cache-size" xreflabel="track_table_mutation_query_parse_cache_size">
+ <term><varname>track_table_mutation_query_parse_cache_size</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_query_parse_cache_size</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Maximum number of query parse results to cache. Uses LRU eviction when full.
+ Larger caches reduce parsing overhead but consume more shared memory.
+ </para>
+ <para>
+ Valid range: 100-1000000. Default is <literal>10000</literal>.
+ Memory usage: approximately 640 bytes per entry (~6.3 MB for default, ~64 MB for 100000 entries).
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ <sect3 id="runtime-config-track-table-mutation-example">
+ <title>Track Table Mutation Configuration Example</title>
+ <para>
+ To enable track table mutation with replication delay monitoring:
+ </para>
+ <programlisting>
+# Enable dml_adaptive_global mode (includes track table mutation)
+disable_load_balance_on_write = 'dml_adaptive_global'
+track_table_mutation_ttl_factor = 5.0
+track_table_mutation_cold_start_duration = 2000
+
+# Configure external replication delay monitoring
+replication_delay_source_cmd = '/path/to/get-replication-delay.sh'
+replication_delay_source_timeout = 10
+
+# Adjust cache sizes based on workload (increases memory usage)
+track_table_mutation_table_size = 4096 # Track up to 4096 tables (~160 KB)
+track_table_mutation_query_parse_cache_size = 50000 # Cache 50k queries (~31 MB)
+ </programlisting>
+ <para>
+ Total shared memory required for above configuration: approximately 31.2 MB (31 MB query cache + 0.2 MB table map + overhead).
+ Default configuration (10000 query cache entries, 2048 tables) requires approximately 6.4 MB.
+ </para>
+ </sect3>
+
+ <sect3 id="runtime-config-track-table-mutation-limitations">
+ <title>Limitations</title>
+ <para>
+ The track table mutation feature has the following limitation:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>PREPARE</literal> statements are not tracked. When a prepared statement
+ containing data modification is executed, the table mutation is not recorded.
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ If your application uses prepared statements and requires read-after-write consistency,
+ consider using explicit transaction routing or the <literal>/*NO LOAD BALANCE*/</literal>
+ comment directive for affected queries.
+ </para>
+ <para>
+ The following statement types <emphasis>are</emphasis> tracked and will mark tables as stale:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>INSERT</literal>, <literal>UPDATE</literal>, <literal>DELETE</literal>
+ statements (including those with <literal>RETURNING</literal> clauses).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>TRUNCATE</literal> statements (including multiple tables).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>MERGE</literal> statements (PostgreSQL 15+).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>WITH</literal> clauses containing data modifications (Common Table Expressions
+ with <literal>INSERT</literal>, <literal>UPDATE</literal>, or <literal>DELETE</literal>).
+ For example, <literal>WITH deleted AS (DELETE FROM t1 RETURNING *) SELECT * FROM deleted</literal>
+ will properly mark table <literal>t1</literal> as stale.
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ <emphasis>Transaction Rollback Behavior:</emphasis> Tables are marked as stale when the
+ <literal>INSERT</literal>, <literal>UPDATE</literal>, or <literal>DELETE</literal> command
+ completes, regardless of whether the enclosing transaction is committed or rolled back.
+ This means that if a transaction is rolled back, the table remains marked as stale until
+ the TTL expires, even though no actual data modification occurred. This is by design:
+ the feature errs on the side of caution by routing more queries to the primary rather
+ than risking stale reads. The performance impact of this conservative approach is minimal
+ compared to the safety benefit of avoiding stale reads.
+ </para>
+ </sect3>
+
+ </sect2>
+
</sect1>
diff --git a/src/Makefile.am b/src/Makefile.am
index 4678ab53055e828a37b6477801640aff17ff84a7..39588af58deba045dffc01ae932115b8a9dbfcf2 100644
--- a/src/Makefile.am
+++ b/src/Makefile.am
@@ -35,6 +35,7 @@ pgpool_SOURCES = main/main.c \
rewrite/pool_timestamp.c \
rewrite/pool_lobj.c \
utils/pool_select_walker.c \
+ utils/pool_track_table_mutation.c \
utils/strlcpy.c \
utils/psprintf.c \
utils/pool_params.c \
diff --git a/src/config/pool_config_variables.c b/src/config/pool_config_variables.c
index 68abb7f41cb96d856c824a148842748bfb7a4d12..623d8751677fd6f39d0e12f0e3e899171890f6e0 100644
--- a/src/config/pool_config_variables.c
+++ b/src/config/pool_config_variables.c
@@ -290,6 +290,7 @@ static const struct config_enum_entry disable_load_balance_on_write_options[] =
{"trans_transaction", DLBOW_TRANS_TRANSACTION, false},
{"always", DLBOW_ALWAYS, false},
{"dml_adaptive", DLBOW_DML_ADAPTIVE, false},
+ {"dml_adaptive_global", DLBOW_DML_ADAPTIVE_GLOBAL, false},
{NULL, 0, false}
};
@@ -1757,6 +1758,17 @@ static struct config_int_array ConfigureNamesIntArray[] =
static struct config_double ConfigureNamesDouble[] =
{
+ {
+ {"track_table_mutation_ttl_factor", CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "TTL multiplier for track table mutation (TTL = replication_delay * factor)",
+ CONFIG_VAR_TYPE_DOUBLE, false, 0
+ },
+ &g_pool_config.track_table_mutation_ttl_factor,
+ 5.0, /* boot value: 5x replication delay */
+ 1.0, 100.0, /* min, max */
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
EMPTY_CONFIG_DOUBLE
};
@@ -2355,6 +2367,61 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"track_table_mutation_cold_start_duration", CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "Duration in milliseconds to force queries to primary after child process starts.",
+ CONFIG_VAR_TYPE_INT, false, GUC_UNIT_MS
+ },
+ &g_pool_config.track_table_mutation_cold_start_duration,
+ 2000, /* 2 seconds */
+ 0, 60000, /* 0 to 60 seconds */
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_table_buckets", CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Number of hash buckets for track table mutation.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_table_buckets,
+ 1024,
+ 64, 65536,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_table_size", CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Maximum number of entries in track table mutation.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_table_size,
+ 2048,
+ 128, 131072,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_query_buckets", CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Number of hash buckets for query parse cache.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_query_buckets,
+ 2048,
+ 64, 65536,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_query_parse_cache_size", CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Maximum number of entries in query parse cache.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_query_parse_cache_size,
+ 10000,
+ 100, 1000000,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
EMPTY_CONFIG_INT
};
diff --git a/src/context/pool_query_context.c b/src/context/pool_query_context.c
index 7cf9813eb7d58678bc86a0aaa38bd3c6445b6687..d7ac89ca9bd5c6a5b58bb364d18dd48e645fd532 100644
--- a/src/context/pool_query_context.c
+++ b/src/context/pool_query_context.c
@@ -29,6 +29,7 @@
#include "utils/statistics.h"
#include "utils/pool_select_walker.h"
#include "utils/pool_stream.h"
+#include "utils/pool_track_table_mutation.h"
#include "context/pool_session_context.h"
#include "context/pool_query_context.h"
#include "parser/nodes.h"
@@ -1828,7 +1829,7 @@ is_in_list(char *name, List *list)
static bool
is_select_object_in_temp_write_list(Node *node, void *context)
{
- if (node == NULL || pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE)
+ if (node == NULL || !DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write))
return false;
if (IsA(node, RangeVar))
@@ -1836,7 +1837,7 @@ is_select_object_in_temp_write_list(Node *node, void *context)
RangeVar *rgv = (RangeVar *) node;
POOL_SESSION_CONTEXT *session_context = pool_get_session_context(false);
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && session_context->is_in_transaction)
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write) && session_context->is_in_transaction)
{
ereport(DEBUG1,
(errmsg("is_select_object_in_temp_write_list: \"%s\", found relation \"%s\"", (char *) context, rgv->relname)));
@@ -1880,7 +1881,7 @@ static char *get_associated_object_from_dml_adaptive_relations
void
check_object_relationship_list(char *name, bool is_func_name)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && pool_config->parsed_dml_adaptive_object_relationship_list)
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write) && pool_config->parsed_dml_adaptive_object_relationship_list)
{
POOL_SESSION_CONTEXT *session_context = pool_get_session_context(false);
@@ -1944,7 +1945,7 @@ add_object_into_temp_write_list(Node *node, void *context)
static void
dml_adaptive(Node *node, char *query)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE)
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write))
{
/* Set/Unset transaction status flags */
if (IsA(node, TransactionStmt))
@@ -2010,6 +2011,18 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
if (dest == POOL_PRIMARY)
{
pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
+
+ /*
+ * Resolve table and database OIDs now to populate relcache.
+ * This avoids potential hangs in CommandComplete where we shouldn't
+ * be running new queries against the backend.
+ */
+ if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ int *oids;
+ pool_extract_table_oids(node, &oids);
+ pool_track_table_mutation_get_database_oid();
+ }
}
/* Should be sent to both primary and standby? */
else if (dest == POOL_BOTH)
@@ -2139,6 +2152,107 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
{
pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
}
+ /*
+ * Check track table mutation for recently written tables.
+ * If in cold start or any table was recently written,
+ * route to primary to avoid stale reads.
+ */
+ else if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ bool force_primary = false;
+
+ /* During cold start, route everything to primary */
+ if (pool_track_table_mutation_in_cold_start())
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load balance because of track table mutation cold start"),
+ errdetail("destination = PRIMARY for query= \"%s\"", query)));
+ force_primary = true;
+ }
+ else
+ {
+ /* Extract table oids and check if any are stale */
+ SelectContext ctx;
+ int dboid;
+ int num_oids;
+ int i;
+
+ memset(&ctx, 0, sizeof(ctx));
+ num_oids = pool_extract_table_oids_from_select_stmt(node, &ctx);
+ if (num_oids > 0)
+ {
+ dboid = pool_track_table_mutation_get_database_oid();
+
+ if (dboid <= 0)
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load balance because database oid was unavailable"),
+ errdetail("destination = PRIMARY for query= \"%s\"", query)));
+ force_primary = true;
+ }
+ else
+ {
+ for (i = 0; i < num_oids; i++)
+ {
+ if (pool_track_table_mutation_table_is_stale(ctx.table_oids[i], dboid))
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load balance because table \"%s\" was recently written",
+ ctx.table_names[i]),
+ errdetail("destination = PRIMARY for query= \"%s\"", query)));
+ force_primary = true;
+ break;
+ }
+ }
+ }
+ }
+ }
+
+ if (force_primary)
+ {
+ pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
+ }
+ else
+ {
+ /* Proceed with load balancing */
+ if (pool_config->statement_level_load_balance)
+ {
+ session_context->load_balance_node_id = select_load_balancing_node();
+ }
+
+ /*
+ * If the streaming replication delay of the chosen node
+ * exceeds the threshold: when prefer_lower_delay_standby is
+ * true, elect the standby with the lowest delay as the new
+ * load balance node; otherwise send the query to the primary.
+ */
+ if (STREAM && check_replication_delay(session_context->load_balance_node_id))
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load balance because of too much replication delay"),
+ errdetail("destination = %d for query= \"%s\"", dest, query)));
+
+ if (pool_config->prefer_lower_delay_standby)
+ {
+ int new_load_balancing_node = select_load_balancing_node();
+
+ session_context->load_balance_node_id = new_load_balancing_node;
+ session_context->query_context->load_balance_node_id = session_context->load_balance_node_id;
+ pool_set_node_to_be_sent(query_context, session_context->query_context->load_balance_node_id);
+ }
+ else
+ {
+ pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
+ }
+ }
+ else
+ {
+ session_context->query_context->load_balance_node_id = session_context->load_balance_node_id;
+ pool_set_node_to_be_sent(query_context,
+ session_context->query_context->load_balance_node_id);
+ }
+ }
+ }
else
{
if (pool_config->statement_level_load_balance)
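
The routing rule added to where_to_send_main_replica() above can be summarized outside of pgpool as a small pure function. This is only a sketch of the decision order as it reads in the hunk (cold start first, then database oid availability, then per-table staleness). RouteDecision, route_select, StaleCheckFn and demo_is_stale are hypothetical names made up for illustration and are not part of the patch; the real code also goes on to apply the replication delay check after this decision.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two outcomes of the new branch. */
typedef enum { ROUTE_PRIMARY, ROUTE_LOAD_BALANCE } RouteDecision;

/* Hypothetical callback: true if (table_oid, dboid) was written recently. */
typedef bool (*StaleCheckFn)(int table_oid, int dboid);

/* Demo predicate (an assumption): pretend only oid 16384 is dirty. */
bool
demo_is_stale(int table_oid, int dboid)
{
    (void) dboid;
    return table_oid == 16384;
}

RouteDecision
route_select(bool in_cold_start, int dboid,
             const int *table_oids, int num_oids, StaleCheckFn is_stale)
{
    int i;

    assert(num_oids >= 0);
    assert(is_stale != NULL);

    /* 1. During cold start everything goes to the primary. */
    if (in_cold_start)
        return ROUTE_PRIMARY;

    /* 2. No tables extracted: nothing to check, load balance as usual. */
    if (num_oids == 0)
        return ROUTE_LOAD_BALANCE;

    /* 3. Tables present but database oid unknown: be conservative. */
    if (dboid <= 0)
        return ROUTE_PRIMARY;

    /* 4. Any recently written table forces the primary. */
    for (i = 0; i < num_oids; i++)
    {
        if (is_stale(table_oids[i], dboid))
            return ROUTE_PRIMARY;
    }
    return ROUTE_LOAD_BALANCE;
}
```

Note that a SELECT with no extractable tables is still load balanced even when the database oid is unknown, because the oid is only looked up once at least one table has been found.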
diff --git a/src/context/pool_session_context.c b/src/context/pool_session_context.c
index ded41c7fc64ceba1d1fafd6f4a9f10a750872374..a9596561a7e0265e928b957a2766f46fb4e9ebaa 100644
--- a/src/context/pool_session_context.c
+++ b/src/context/pool_session_context.c
@@ -532,7 +532,7 @@ dump_sent_message(char *caller, POOL_SENT_MESSAGE *m)
static void
dml_adaptive_init(void)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE)
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write))
{
session_context->is_in_transaction = false;
session_context->transaction_temp_write_list = NIL;
@@ -542,7 +542,7 @@ dml_adaptive_init(void)
static void
dml_adaptive_destroy(void)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && session_context)
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write) && session_context)
{
if (session_context->transaction_temp_write_list != NIL)
list_free_deep(session_context->transaction_temp_write_list);
@@ -738,10 +738,10 @@ void
pool_set_writing_transaction(void)
{
/*
- * If disable_transaction_on_write is 'off' or 'dml_adaptive', then never
- * turn on writing transaction flag.
+ * If disable_load_balance_on_write is 'off', 'dml_adaptive', or
+ * 'dml_adaptive_global', never turn on the writing transaction flag.
*/
- if (pool_config->disable_load_balance_on_write != DLBOW_OFF && pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE)
+ if (pool_config->disable_load_balance_on_write != DLBOW_OFF && !DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write))
{
pool_get_session_context(false)->writing_transaction = true;
ereport(DEBUG5,
diff --git a/src/include/pool.h b/src/include/pool.h
index ea6f87e120af866b8ed3a15790d9d8a8e009fe91..7168c1aea877856b5978de332ad636325eb9c30c 100644
--- a/src/include/pool.h
+++ b/src/include/pool.h
@@ -424,7 +424,7 @@ typedef enum
#define Min(x, y) ((x) < (y) ? (x) : (y))
-#define MAX_NUM_SEMAPHORES 8
+#define MAX_NUM_SEMAPHORES 10
#define CONN_COUNTER_SEM 0
#define REQUEST_INFO_SEM 1
#define QUERY_CACHE_STATS_SEM 2
@@ -434,6 +434,8 @@ typedef enum
#define FOLLOW_PRIMARY_SEM 6
#define MAIN_EXIT_HANDLER_SEM 7 /* used in exit_hander in pgpool main
* process */
+#define TRACK_TABLE_MUTATION_TABLE_SEM 8
+#define TRACK_TABLE_MUTATION_QUERY_SEM 9
#define MAX_REQUEST_QUEUE_SIZE 10
#define MAX_SEC_WAIT_FOR_CLUSTER_TRANSACTION 10 /* time in seconds to keep
diff --git a/src/include/pool_config.h b/src/include/pool_config.h
index 741de6cc5fc3368f813d6b6efa68eb7f8a79506b..8798b86eb3620ab36be733bb60bbb8464b0063c8 100644
--- a/src/include/pool_config.h
+++ b/src/include/pool_config.h
@@ -105,9 +105,13 @@ typedef enum DLBOW_OPTION
DLBOW_TRANSACTION,
DLBOW_TRANS_TRANSACTION,
DLBOW_ALWAYS,
- DLBOW_DML_ADAPTIVE
+ DLBOW_DML_ADAPTIVE,
+ DLBOW_DML_ADAPTIVE_GLOBAL
} DLBOW_OPTION;
+#define DLBOW_IS_DML_ADAPTIVE(opt) \
+ ((opt) == DLBOW_DML_ADAPTIVE || (opt) == DLBOW_DML_ADAPTIVE_GLOBAL)
+
typedef enum RELQTARGET_OPTION
{
RELQTARGET_PRIMARY = 1,
@@ -365,6 +369,15 @@ typedef struct
* replication check */
char *replication_delay_source_cmd; /* external command for replication delay */
int replication_delay_source_timeout; /* timeout for external command in seconds */
+
+ /* Track table mutation configuration for tracking recently written tables */
+ double track_table_mutation_ttl_factor; /* TTL multiplier for replication delay */
+ int track_table_mutation_cold_start_duration; /* Cold start duration in ms */
+ int track_table_mutation_table_buckets; /* Number of hash buckets for table map */
+ int track_table_mutation_table_size; /* Max entries in table map */
+ int track_table_mutation_query_buckets; /* Number of hash buckets for query cache */
+ int track_table_mutation_query_parse_cache_size; /* Max entries in query parse cache */
+
char *failover_command; /* execute command when failover happens */
char *follow_primary_command; /* execute command when failover is
* ended */
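
To illustrate how the two kinds of checks in this patch differ: the existing per-transaction code paths use DLBOW_IS_DML_ADAPTIVE(), which is true for both modes, while the new shared memory paths compare against DLBOW_DML_ADAPTIVE_GLOBAL only. The following is a minimal standalone sketch. The enum is copied from the hunk above, but DLBOW_OFF being first with value 1 is an assumption, and dlbow_self_check is a hypothetical helper added only for the demonstration.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum DLBOW_OPTION
{
    DLBOW_OFF = 1,              /* position and value are assumptions */
    DLBOW_TRANSACTION,
    DLBOW_TRANS_TRANSACTION,
    DLBOW_ALWAYS,
    DLBOW_DML_ADAPTIVE,
    DLBOW_DML_ADAPTIVE_GLOBAL
} DLBOW_OPTION;

/* Same definition as in the patch. */
#define DLBOW_IS_DML_ADAPTIVE(opt) \
    ((opt) == DLBOW_DML_ADAPTIVE || (opt) == DLBOW_DML_ADAPTIVE_GLOBAL)

/* Per-transaction temp write list: both adaptive modes. */
bool
uses_transaction_tracking(DLBOW_OPTION opt)
{
    return DLBOW_IS_DML_ADAPTIVE(opt);
}

/* Shared memory table tracking: global mode only. */
bool
uses_shared_tracking(DLBOW_OPTION opt)
{
    return opt == DLBOW_DML_ADAPTIVE_GLOBAL;
}

/* Hypothetical self check showing the intended superset relation. */
void
dlbow_self_check(void)
{
    assert(uses_transaction_tracking(DLBOW_DML_ADAPTIVE));
    assert(uses_transaction_tracking(DLBOW_DML_ADAPTIVE_GLOBAL));
    assert(!uses_shared_tracking(DLBOW_DML_ADAPTIVE));
    assert(uses_shared_tracking(DLBOW_DML_ADAPTIVE_GLOBAL));
}
```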
diff --git a/src/include/utils/pool_track_table_mutation.h b/src/include/utils/pool_track_table_mutation.h
new file mode 100644
index 0000000000000000000000000000000000000000..5cd5d4ef409645fe77e3bb02239e140456de0554
--- /dev/null
+++ b/src/include/utils/pool_track_table_mutation.h
@@ -0,0 +1,237 @@
+/* -*-pgsql-c-*- */
+/*
+ * pgpool: a language independent connection pool server for PostgreSQL
+ * written by Tatsuo Ishii
+ *
+ * Copyright (c) 2003-2026 PgPool Global Development Group
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose and without fee is hereby
+ * granted, provided that the above copyright notice appear in all
+ * copies and that both that copyright notice and this permission
+ * notice appear in supporting documentation, and that the name of the
+ * author not be used in advertising or publicity pertaining to
+ * distribution of the software without specific, written prior
+ * permission. The author makes no representations about the
+ * suitability of this software for any purpose. It is provided "as
+ * is" without express or implied warranty.
+ *
+ * pool_track_table_mutation.h: In-memory tracking of recently written tables
+ * to avoid stale reads from replicas during replication lag
+ */
+
+#ifndef POOL_TRACK_TABLE_MUTATION_H
+#define POOL_TRACK_TABLE_MUTATION_H
+
+#include "pool.h"
+#include <sys/time.h>
+
+/*
+ * Maximum table name length including schema: "schema"."table"
+ * NAMEDATALEN * 2 + 4 covers four quotes, one dot, and the trailing NUL
+ */
+#define TRACK_TABLE_MUTATION_TABLE_NAME_LEN (NAMEDATALEN * 2 + 4)
+
+/*
+ * Maximum number of tables we track per query
+ */
+#define TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY 8
+
+/*
+ * Invalid index marker for linked lists
+ */
+#define TRACK_TABLE_MUTATION_INVALID_INDEX (-1)
+
+/*
+ * Default TTL in microseconds (100ms) used when replication delay is unknown
+ */
+#define TRACK_TABLE_MUTATION_DEFAULT_TTL_US (100 * 1000)
+
+/*
+ * Entry in the table mutation hash table (keyed by table/database oids)
+ */
+typedef struct TrackTableMutationEntry
+{
+ int table_oid; /* Table oid */
+ int dboid; /* Database oid */
+ struct timeval last_write_time; /* When the table was last written */
+ uint32 hash; /* Pre-computed hash value */
+ int next; /* Next entry in collision chain (-1 if none) */
+ bool in_use; /* Is this entry in use? */
+} TrackTableMutationEntry;
+
+/*
+ * Header for the table mutation hash table in shared memory
+ */
+typedef struct TrackTableMutationHashTable
+{
+ int num_buckets; /* Number of hash buckets */
+ int max_entries; /* Maximum entries allowed */
+ int num_entries; /* Current number of entries */
+ int free_list_head; /* Head of free entry list */
+ /* These arrays are laid out after this header in shared memory:
+ * int buckets[num_buckets];
+ * TrackTableMutationEntry entries[max_entries];
+ */
+} TrackTableMutationHashTable;
+
+/*
+ * Entry in the query parse cache
+ */
+typedef struct QueryParseEntry
+{
+ uint64 query_hash; /* Hash of normalized query */
+ bool is_write; /* True if INSERT/UPDATE/DELETE */
+ int num_tables; /* Number of tables in query */
+ char table_names[TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY][TRACK_TABLE_MUTATION_TABLE_NAME_LEN];
+ int next; /* Next entry in collision chain */
+ int lru_prev; /* Previous in LRU list */
+ int lru_next; /* Next in LRU list */
+ bool in_use; /* Is this entry in use? */
+} QueryParseEntry;
+
+/*
+ * Header for the query parse cache in shared memory
+ */
+typedef struct QueryParseCache
+{
+ int num_buckets; /* Number of hash buckets */
+ int max_entries; /* Maximum entries allowed */
+ int num_entries; /* Current number of entries */
+ int free_list_head; /* Head of free entry list */
+ int lru_head; /* Most recently used */
+ int lru_tail; /* Least recently used */
+ /* These arrays are laid out after this header in shared memory:
+ * int buckets[num_buckets];
+ * QueryParseEntry entries[max_entries];
+ */
+} QueryParseCache;
+
+/*
+ * Global state for track table mutation feature
+ */
+typedef struct TrackTableMutationState
+{
+ bool initialized; /* Has shared memory been initialized? */
+ uint64 current_ttl_us; /* Current TTL in microseconds */
+ struct timeval ttl_last_updated; /* When TTL was last updated */
+ struct timeval last_cleanup_time; /* When last expired cleanup ran */
+ struct timeval global_cold_start_until; /* Global cold start end time */
+ volatile uint32 stats_queries_checked; /* Number of queries checked */
+ volatile uint32 stats_forced_primary; /* Queries forced to primary */
+ volatile uint32 stats_allowed_replica; /* Queries allowed to replica */
+} TrackTableMutationState;
+
+/*
+ * Main shared memory structure containing all components
+ */
+typedef struct TrackTableMutationShmem
+{
+ TrackTableMutationState state;
+ TrackTableMutationHashTable *table_map;
+ QueryParseCache *query_cache;
+} TrackTableMutationShmem;
+
+/* ----------------
+ * Public API functions
+ * ----------------
+ */
+
+/*
+ * Initialize shared memory structures for track table mutation.
+ * Called from pgpool_main.c after pool_init_pool_info().
+ */
+extern void pool_track_table_mutation_init(void);
+
+/*
+ * Initialize per-child process state for track table mutation.
+ * Called from child.c when a new child process starts.
+ * Sets up cold start tracking.
+ */
+extern void pool_track_table_mutation_child_init(void);
+
+/*
+ * Check if the child process is in cold start period.
+ * During cold start, all queries are routed to primary.
+ * Returns true if in cold start, false otherwise.
+ */
+extern bool pool_track_table_mutation_in_cold_start(void);
+
+/*
+ * Trigger a global cold start period for all processes.
+ * Used after watchdog leader change to avoid stale reads.
+ */
+extern void pool_track_table_mutation_trigger_global_cold_start(void);
+
+/*
+ * Get oid of current database.
+ */
+extern int pool_track_table_mutation_get_database_oid(void);
+
+/*
+ * Check if a table was recently written to (is "stale").
+ * If stale, reads from this table should go to primary.
+ * Returns true if table is stale (recently written), false otherwise.
+ */
+extern bool pool_track_table_mutation_table_is_stale(int table_oid, int dboid);
+
+/*
+ * Mark tables as recently written.
+ * Called after INSERT/UPDATE/DELETE queries complete.
+ * table_oids: array of table oids
+ * num_tables: number of tables in array
+ * dboid: database oid
+ */
+extern void pool_track_table_mutation_mark_tables_written(const int *table_oids, int num_tables, int dboid);
+
+/*
+ * Convenience function to mark a single table as written.
+ * table_oid: table oid
+ * dboid: database oid
+ */
+extern void pool_track_table_mutation_mark_table_written(int table_oid, int dboid);
+
+/*
+ * Update the TTL based on current replication delay.
+ * Called from pool_worker_child.c when replication delay is updated.
+ * delay_us: replication delay in microseconds
+ */
+extern void pool_track_table_mutation_update_ttl(uint64 delay_us);
+
+/*
+ * Look up cached parse result for a query.
+ * hash: hash of normalized query
+ * is_write: output - true if query is a write
+ * table_names: output - array to fill with table names
+ * num_tables: output - number of tables found
+ * Returns true if found in cache, false otherwise.
+ */
+extern bool pool_track_table_mutation_get_cached_parse(uint64 hash, bool *is_write,
+ char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
+ int *num_tables);
+
+/*
+ * Cache a parse result for a query.
+ * hash: hash of normalized query
+ * is_write: true if query is a write
+ * table_names: array of table names
+ * num_tables: number of tables
+ */
+extern void pool_track_table_mutation_cache_parse(uint64 hash, bool is_write,
+ const char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
+ int num_tables);
+
+/*
+ * Normalize a query and compute its hash.
+ * Strips comments, normalizes whitespace and literals.
+ * query: input SQL query string
+ * Returns: 64-bit hash of normalized query
+ */
+extern uint64 pool_track_table_mutation_normalize_and_hash(const char *query);
+
+/*
+ * Calculate required shared memory size for track table mutation.
+ */
+extern Size pool_track_table_mutation_shmem_size(void);
+
+#endif /* POOL_TRACK_TABLE_MUTATION_H */
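
For readers unfamiliar with the "header followed by buckets and entries" layout described in TrackTableMutationHashTable, here is a minimal sketch of an index-linked hash map in one contiguous allocation. It is not the patch's implementation: it uses malloc instead of shared memory, takes no semaphore, never evicts expired entries, and the names Map, Entry, map_create, map_mark_written and map_last_write as well as the hash function are all assumptions. Indices are used instead of pointers so that the structure stays valid regardless of where the segment is mapped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define INVALID_INDEX (-1)

typedef struct Entry
{
    int     table_oid;
    int     dboid;
    int64_t last_write_us;  /* time of last write, in microseconds */
    int     next;           /* next index in chain or free list */
    bool    in_use;
} Entry;

typedef struct Map
{
    int num_buckets;
    int max_entries;
    int num_entries;
    int free_list_head;
    /* followed by: int buckets[num_buckets]; Entry entries[max_entries]; */
} Map;

/* Offset of the entries array, rounded up to Entry's alignment. */
size_t entries_offset(int num_buckets)
{
    size_t off = sizeof(Map) + (size_t) num_buckets * sizeof(int);
    size_t align = _Alignof(Entry);
    return (off + align - 1) / align * align;
}

int *map_buckets(Map *m) { return (int *) (m + 1); }
Entry *map_entries(Map *m)
{
    return (Entry *) ((char *) m + entries_offset(m->num_buckets));
}

Map *map_create(int num_buckets, int max_entries)
{
    Map *m;
    int i;

    assert(num_buckets > 0 && max_entries > 0);
    m = malloc(entries_offset(num_buckets) + (size_t) max_entries * sizeof(Entry));
    if (m == NULL)
        return NULL;
    m->num_buckets = num_buckets;
    m->max_entries = max_entries;
    m->num_entries = 0;
    m->free_list_head = 0;
    for (i = 0; i < num_buckets; i++)
        map_buckets(m)[i] = INVALID_INDEX;
    for (i = 0; i < max_entries; i++)   /* chain all entries into the free list */
    {
        map_entries(m)[i].in_use = false;
        map_entries(m)[i].next = (i + 1 < max_entries) ? i + 1 : INVALID_INDEX;
    }
    return m;
}

/* Walk the collision chain; also reports the bucket the key hashes to. */
static int find_index(Map *m, int table_oid, int dboid, uint32_t *bucket)
{
    uint32_t h = ((uint32_t) table_oid * 2654435761u) ^ (uint32_t) dboid;
    int idx;

    *bucket = h % (uint32_t) m->num_buckets;
    for (idx = map_buckets(m)[*bucket]; idx != INVALID_INDEX; idx = map_entries(m)[idx].next)
        if (map_entries(m)[idx].table_oid == table_oid && map_entries(m)[idx].dboid == dboid)
            return idx;
    return INVALID_INDEX;
}

/* Insert or refresh a key; returns false when the map is full. */
bool map_mark_written(Map *m, int table_oid, int dboid, int64_t now_us)
{
    uint32_t bucket;
    int idx = find_index(m, table_oid, dboid, &bucket);

    if (idx == INVALID_INDEX)
    {
        Entry *e;

        idx = m->free_list_head;
        if (idx == INVALID_INDEX)
            return false;
        e = &map_entries(m)[idx];
        m->free_list_head = e->next;        /* pop from free list */
        e->table_oid = table_oid;
        e->dboid = dboid;
        e->in_use = true;
        e->next = map_buckets(m)[bucket];   /* push onto bucket chain */
        map_buckets(m)[bucket] = idx;
        m->num_entries++;
    }
    map_entries(m)[idx].last_write_us = now_us;
    return true;
}

bool map_last_write(Map *m, int table_oid, int dboid, int64_t *out)
{
    uint32_t bucket;
    int idx = find_index(m, table_oid, dboid, &bucket);

    if (idx == INVALID_INDEX)
        return false;
    *out = map_entries(m)[idx].last_write_us;
    return true;
}
```

A caller would create the map once with map_create(), call map_mark_written() for each written table, and free the single allocation with free() when done. What the real code does when the map is full is not visible in this part of the patch; the sketch simply reports failure.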
diff --git a/src/main/pgpool_main.c b/src/main/pgpool_main.c
index fa05e15e7ac435e072298063f918c70aa4e5680c..b88b0478cb150f89bd9b6b8ab38db0d6912fddd0 100644
--- a/src/main/pgpool_main.c
+++ b/src/main/pgpool_main.c
@@ -57,6 +57,7 @@
#include "auth/pool_passwd.h"
#include "auth/pool_hba.h"
#include "query_cache/pool_memqcache.h"
+#include "utils/pool_track_table_mutation.h"
#include "watchdog/wd_internal_commands.h"
#include "watchdog/wd_lifecheck.h"
#include "watchdog/watchdog.h"
@@ -1485,11 +1486,14 @@ sigusr1_interrupt_processor(void)
if (user1SignalSlot->signalFlags[SIG_WATCHDOG_STATE_CHANGED])
{
+ WD_STATES wd_state;
+
ereport(LOG,
(errmsg("Pgpool-II parent process received watchdog state change signal from watchdog")));
user1SignalSlot->signalFlags[SIG_WATCHDOG_STATE_CHANGED] = false;
- if (wd_internal_get_watchdog_local_node_state() == WD_STANDBY)
+ wd_state = wd_internal_get_watchdog_local_node_state();
+ if (wd_state == WD_STANDBY)
{
ereport(LOG,
(errmsg("we have joined the watchdog cluster as STANDBY node"),
@@ -1503,6 +1507,10 @@ sigusr1_interrupt_processor(void)
*/
pool_release_follow_primary_lock(true);
}
+ else if (wd_state == WD_COORDINATOR && pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ pool_track_table_mutation_trigger_global_cold_start();
+ }
}
if (user1SignalSlot->signalFlags[SIG_FAILOVER_INTERRUPT])
{
@@ -3068,6 +3076,12 @@ initialize_shared_mem_objects(bool clear_memcache_oidmaps)
elog(DEBUG1, "watchdog: %zu bytes requested for shared memory", MAXALIGN(wd_ipc_get_shared_mem_size()));
}
+ if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ size += MAXALIGN(pool_track_table_mutation_shmem_size());
+ elog(DEBUG1, "track_table_mutation: %zu bytes requested for shared memory", MAXALIGN(pool_track_table_mutation_shmem_size()));
+ }
+
initialize_shared_memory_main_segment(size);
/* Move the backend descriptors to shared memory */
@@ -3184,6 +3198,12 @@ initialize_shared_mem_objects(bool clear_memcache_oidmaps)
wd_ipc_initialize_data();
}
+ /* Initialize track table mutation for tracking recently written tables */
+ if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ pool_track_table_mutation_init();
+ }
+
}
/*
diff --git a/src/protocol/CommandComplete.c b/src/protocol/CommandComplete.c
index a3b8f0ea194ffecc79e58566be80562a46eb75ab..aae08a7786d5eaf7271e7b680dc04e1d01f0a629 100644
--- a/src/protocol/CommandComplete.c
+++ b/src/protocol/CommandComplete.c
@@ -38,6 +38,8 @@
#include "utils/palloc.h"
#include "utils/memutils.h"
#include "utils/pool_stream.h"
+#include "utils/pool_track_table_mutation.h"
+#include "query_cache/pool_memqcache.h"
static int extract_ntuples(char *message);
static POOL_STATUS handle_mismatch_tuples(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend, char *packet, int packetlen, bool command_complete);
@@ -304,6 +306,27 @@ handle_query_context(POOL_CONNECTION_POOL *backend)
node = session_context->query_context->parse_tree;
+ /*
+ * Track table writes for dml_adaptive_global feature.
+ * Mark tables as written when INSERT/UPDATE/DELETE/TRUNCATE/MERGE completes.
+ * Reuses pool_extract_table_oids() which handles all statement types
+ * including WITH clause CTEs.
+ */
+ if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE_GLOBAL && node != NULL)
+ {
+ int *oids;
+ int num_oids;
+
+ num_oids = pool_extract_table_oids(node, &oids);
+ if (num_oids > 0)
+ {
+ int dboid = pool_track_table_mutation_get_database_oid();
+
+ if (dboid > 0)
+ pool_track_table_mutation_mark_tables_written(oids, num_oids, dboid);
+ }
+ }
+
if (IsA(node, PrepareStmt))
{
if (session_context->uncompleted_message)
diff --git a/src/protocol/child.c b/src/protocol/child.c
index c34f057281be62feaf39db1bb605062f56dc398c..d53f571421968bd789d0b55f97e0a1eb68a813e5 100644
--- a/src/protocol/child.c
+++ b/src/protocol/child.c
@@ -57,6 +57,7 @@
#include "utils/elog.h"
#include "utils/ps_status.h"
#include "utils/timestamp.h"
+#include "utils/pool_track_table_mutation.h"
#include "context/pool_process_context.h"
#include "context/pool_session_context.h"
@@ -213,6 +214,12 @@ do_child(int *fds)
/* Initialize per process context */
pool_init_process_context();
+ /* Initialize track table mutation child state for cold start tracking */
+ if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ pool_track_table_mutation_child_init();
+ }
+
/* initialize connection pool */
if (pool_init_cp())
{
diff --git a/src/protocol/pool_proto_modules.c b/src/protocol/pool_proto_modules.c
index 47b5c8f98a5b4c92d675840eea88f7e03bb18b4c..75fc7508480d79aacc281dd5e624f9e34a998833 100644
--- a/src/protocol/pool_proto_modules.c
+++ b/src/protocol/pool_proto_modules.c
@@ -1461,7 +1461,7 @@ Parse(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend,
pool_where_to_send(query_context, query_context->original_query,
query_context->parse_tree);
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && strlen(name) != 0)
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write) && strlen(name) != 0)
pool_setall_node_to_be_sent(query_context);
if (REPLICATION)
@@ -1804,7 +1804,7 @@ Bind(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend,
return POOL_END;
}
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE &&
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write) &&
TSTATE(backend, MAIN_REPLICA ? PRIMARY_NODE_ID : REAL_MAIN_NODE_ID) == 'T')
{
pool_where_to_send(query_context, query_context->original_query,
diff --git a/src/query_cache/pool_memqcache.c b/src/query_cache/pool_memqcache.c
index f38f711469576342ce59469b085c97365116004c..dca93334e9e47bb7978064edece5ca0e40021ce3 100644
--- a/src/query_cache/pool_memqcache.c
+++ b/src/query_cache/pool_memqcache.c
@@ -1305,6 +1305,12 @@ pool_extract_table_oids(Node *node, int **oidsp)
}
return num_oids;
}
+ else if (IsA(node, MergeStmt))
+ {
+ MergeStmt *stmt = (MergeStmt *) node;
+
+ table = make_table_name_from_rangevar(stmt->relation);
+ }
else if (IsA(node, ExplainStmt))
{
ListCell *cell;
diff --git a/src/sample/pgpool.conf.sample-stream b/src/sample/pgpool.conf.sample-stream
index 454fdb9e5d1fd65437b6a67f12ab62658ea08f49..de99a7a97ba4a1a03cb3d5589d55ea61cb6e51fa 100644
--- a/src/sample/pgpool.conf.sample-stream
+++ b/src/sample/pgpool.conf.sample-stream
@@ -478,6 +478,14 @@ backend_clustering_mode = streaming_replication
# modified within the current explicit transaction will
# not be load balanced until the end of the transaction.
#
+ # dml_adaptive_global:
+ # Superset of dml_adaptive. In addition to per-transaction
+ # tracking, uses shared memory to track recently written
+ # tables across all sessions. Reads from recently written
+ # tables are routed to primary until a TTL (based on
+ # replication delay) expires. Requires additional shared
+ # memory. See track_table_mutation_* parameters below.
+ #
# always:
# if a write query is issued, read queries will
# not be load balanced until the session ends.
@@ -499,6 +507,46 @@ backend_clustering_mode = streaming_replication
#statement_level_load_balance = off
# Enables statement level load balancing
+# - Track Table Mutation (used by dml_adaptive_global) -
+ # WARNING: dml_adaptive_global increases shared memory usage
+ # Default settings require ~6.4 MB shared memory
+ # (0.1 MB table tracking + 6.3 MB query cache)
+
+#track_table_mutation_ttl_factor = 5.0
+ # TTL multiplier: TTL = replication_delay * factor
+ # Higher values provide more safety margin
+ # Range: 1.0-100.0 (default: 5.0)
+ # (change requires reload)
+
+#track_table_mutation_cold_start_duration = 2000
+ # Duration in milliseconds to route all queries to primary
+ # after child process starts (cold start period)
+ # Range: 0-60000 ms (default: 2000 ms = 2 seconds)
+ # Set to 0 to disable cold start behavior
+ # (change requires reload)
+
+#track_table_mutation_table_buckets = 1024
+ # Number of hash buckets for the table mutation map
+ # Higher values reduce hash collisions
+ # Range: 64-65536 (default: 1024)
+ # (change requires restart)
+
+#track_table_mutation_table_size = 2048
+ # Maximum number of tables to track simultaneously
+ # Range: 128-131072 (default: 2048)
+ # (change requires restart)
+
+#track_table_mutation_query_buckets = 2048
+ # Number of hash buckets for query parse cache
+ # Range: 64-65536 (default: 2048)
+ # (change requires restart)
+
+#track_table_mutation_query_parse_cache_size = 10000
+ # Maximum number of query parse results to cache
+ # Range: 100-1000000 (default: 10000)
+ # Memory usage: ~640 bytes per entry (~6.3 MB default, ~64 MB for 100000)
+ # (change requires restart)
+
#------------------------------------------------------------------------------
# STREAMING REPLICATION MODE
#------------------------------------------------------------------------------
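
The memory figures in the sample config comments can be cross-checked against the QueryParseEntry layout in pool_track_table_mutation.h. The sketch below mirrors that struct with standard C types and computes the cache size for a given configuration. It assumes NAMEDATALEN is 64 and that the cache consists only of the int bucket array plus the entry array; QueryParseEntrySketch and query_cache_bytes are hypothetical names. Under these assumptions the table_names array alone is 8 * 132 = 1056 bytes, so an entry would come to roughly 1 KB on a typical 64-bit ABI, which is more than the ~640 bytes quoted in the comment. This is an estimate from the header, not a measurement of the actual shared memory request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NAMEDATALEN 64          /* assumption: PostgreSQL default */
#define TABLE_NAME_LEN (NAMEDATALEN * 2 + 4)
#define MAX_TABLES_PER_QUERY 8

/* Mirror of QueryParseEntry from the patch, using standard C types. */
typedef struct QueryParseEntrySketch
{
    uint64_t query_hash;
    bool     is_write;
    int      num_tables;
    char     table_names[MAX_TABLES_PER_QUERY][TABLE_NAME_LEN];
    int      next;
    int      lru_prev;
    int      lru_next;
    bool     in_use;
} QueryParseEntrySketch;

/* The names array is a hard lower bound on the entry size. */
static_assert(sizeof(QueryParseEntrySketch) >= MAX_TABLES_PER_QUERY * TABLE_NAME_LEN,
              "entry must hold all table names");

/* Bytes for buckets plus entries; header and alignment padding ignored. */
size_t
query_cache_bytes(size_t num_buckets, size_t max_entries)
{
    return num_buckets * sizeof(int) + max_entries * sizeof(QueryParseEntrySketch);
}
```

Calling query_cache_bytes(2048, 10000) gives the estimate for the default track_table_mutation_query_buckets and track_table_mutation_query_parse_cache_size values.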
diff --git a/src/streaming_replication/pool_worker_child.c b/src/streaming_replication/pool_worker_child.c
index 311b638658e66ebb56162ad9fa4392315b2df64e..347f88a88688309b298311a282fe1c1ef2aa0f73 100644
--- a/src/streaming_replication/pool_worker_child.c
+++ b/src/streaming_replication/pool_worker_child.c
@@ -58,6 +58,7 @@
#include "utils/pool_ip.h"
#include "utils/ps_status.h"
#include "utils/pool_stream.h"
+#include "utils/pool_track_table_mutation.h"
#include "context/pool_process_context.h"
#include "context/pool_session_context.h"
@@ -695,6 +696,7 @@ check_replication_time_lag_with_cmd(void)
double delay_ms;
uint64 delay;
uint64 delay_threshold_by_time;
+ uint64 max_delay_us = 0; /* Track maximum delay for table mutation map */
int token_count = 0;
int primary_node_id;
int save_errno;
@@ -1003,6 +1005,10 @@ check_replication_time_lag_with_cmd(void)
bkinfo->standby_delay = delay;
bkinfo->standby_delay_by_time = true;
+ /* Track maximum delay for table mutation map TTL calculation */
+ if (delay > max_delay_us)
+ max_delay_us = delay;
+
/*
* Log delay if necessary. threshold is in milliseconds, convert
* to microseconds.
@@ -1021,6 +1027,10 @@ check_replication_time_lag_with_cmd(void)
token = strtok_r(NULL, " \t\n", &saveptr);
}
+ /* Update track table mutation TTL based on maximum observed delay */
+ if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE_GLOBAL && max_delay_us > 0)
+ pool_track_table_mutation_update_ttl(max_delay_us);
+
}
PG_CATCH();
{
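
The TTL rule described in the proposal (TTL = replication delay * track_table_mutation_ttl_factor, with a 100 ms default while the delay is unknown) and the resulting staleness test can be sketched as follows. compute_ttl_us and table_is_stale are hypothetical helpers, not functions from the patch. Treating a zero delay as "unknown" and treating a clock that went backwards as "still stale" are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/time.h>

/* Same value as TRACK_TABLE_MUTATION_DEFAULT_TTL_US in the patch (100 ms). */
#define DEFAULT_TTL_US (100 * 1000)

/* TTL in microseconds derived from the maximum observed standby delay. */
uint64_t
compute_ttl_us(uint64_t delay_us, double factor)
{
    assert(factor > 0.0);

    if (delay_us == 0)          /* assumption: 0 means "delay unknown" */
        return DEFAULT_TTL_US;
    return (uint64_t) ((double) delay_us * factor);
}

/* True while fewer than ttl_us microseconds have passed since the write. */
bool
table_is_stale(struct timeval last_write, struct timeval now, uint64_t ttl_us)
{
    int64_t elapsed_us =
        ((int64_t) now.tv_sec - (int64_t) last_write.tv_sec) * 1000000 +
        ((int64_t) now.tv_usec - (int64_t) last_write.tv_usec);

    if (elapsed_us < 0)         /* clock stepped back: stay conservative */
        return true;
    return (uint64_t) elapsed_us < ttl_us;
}
```

With a measured delay of 20 ms and the default factor of 5.0, a table stays dirty for 100 ms after its last write; a read arriving 50 ms after the write goes to the primary, one arriving 150 ms after it may be load balanced.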
diff --git a/src/test/regression/tests/042.track_table_mutation/test.sh b/src/test/regression/tests/042.track_table_mutation/test.sh
new file mode 100755
index 0000000000000000000000000000000000000000..35f2c9e0627142d7c8f3ab42f7e08541fb9e9a41
--- /dev/null
+++ b/src/test/regression/tests/042.track_table_mutation/test.sh
@@ -0,0 +1,292 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for track table mutation feature (in-memory table tracking).
+# Tests routing of queries based on recently written tables.
+#
+source $TESTLIBS
+TESTDIR=testdir
+PSQL=$PGBIN/psql
+PSQLOPTS="-a -q -X"
+PGPOOLBIN=$PGPOOL_INSTALL_DIR/bin
+export PGDATABASE=test
+
+# Only run in streaming replication mode since that's the target use case
+for mode in s
+do
+ rm -fr $TESTDIR
+ mkdir $TESTDIR
+ cd $TESTDIR
+
+ # Create test environment with 2 nodes
+ echo -n "creating test environment..."
+ $PGPOOL_SETUP -m $mode -n 2 || exit 1
+ echo "done."
+
+ source ./bashrc.ports
+
+ # Configure track table mutation feature via dml_adaptive_global
+ echo "disable_load_balance_on_write = 'dml_adaptive_global'" >> etc/pgpool.conf
+ echo "track_table_mutation_ttl_factor = 5.0" >> etc/pgpool.conf
+ echo "track_table_mutation_cold_start_duration = 2000" >> etc/pgpool.conf
+
+ # Enable load balancing explicitly
+ echo "load_balance_mode = on" >> etc/pgpool.conf
+
+ # Configure weights so we can distinguish routing
+ # Backend 0 (primary) weight=0, Backend 1 (standby) weight=1
+ # This means load balanced queries go to node 1 by default
+ echo "backend_weight0 = 0" >> etc/pgpool.conf
+ echo "backend_weight1 = 1" >> etc/pgpool.conf
+
+ # Enable debug logging to see routing decisions
+ echo "log_min_messages = debug1" >> etc/pgpool.conf
+
+ ./startall
+
+ export PGPORT=$PGPOOL_PORT
+
+ wait_for_pgpool_startup
+
+ # Create test tables
+ $PSQL test <<EOF
+CREATE TABLE t1(i INTEGER);
+CREATE TABLE t2(i INTEGER);
+CREATE TABLE t3(i INTEGER);
+EOF
+
+ echo "=== Test 1: Cold Start Routing ==="
+ # During cold start, all queries should go to primary
+ # Restart pgpool to trigger cold start
+ ./shutdownall
+ ./startall
+ wait_for_pgpool_startup
+
+ # Immediately query - should go to primary due to cold start
+ $PSQL test -c "SELECT 'cold_start_test' as marker, * FROM t1;" > /dev/null 2>&1
+
+ # Check log for cold start message (use -a to handle binary log files)
+ if grep -a -q "could not load balance because of track table mutation cold start" log/pgpool.log; then
+ echo "Test 1 PASSED: Cold start routing works"
+ else
+ echo "Test 1 FAILED: Cold start routing not detected"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 2: Wait for cold start to end ==="
+ # Wait for cold start period to end (2 seconds)
+ sleep 3
+
+ # Clear the log
+ > log/pgpool.log
+
+ # Now a clean table query should load balance (go to node 1)
+ $PSQL test -c "SELECT 'after_cold_start' as marker, * FROM t3;" > /dev/null 2>&1
+
+ # After cold start, queries to clean tables should load balance
+ # Check that it did NOT get forced to primary due to track table mutation
+ if grep -a -q "could not load balance because of track table mutation cold start" log/pgpool.log; then
+ echo "Test 2 FAILED: Still in cold start after waiting"
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 2 PASSED: Cold start ended correctly"
+
+ echo "=== Test 3: Write-then-Read Routing ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Write to t1 and then read - use single connection to ensure same session
+ $PSQL test <<EOF
+INSERT INTO t1 VALUES (1);
+SELECT 'write_read_test' as marker, * FROM t1;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ # Check log for table staleness message
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 3 PASSED: Write-then-read routing works"
+ else
+ echo "Test 3 FAILED: Table staleness not detected after write"
+ # Show relevant log entries for debugging
+ grep -a -i -E "track.table.mutation|could not load balance" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 4: Clean Table Still Load Balances ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Read from t2 (never written to) - should load balance
+ $PSQL test -c "SELECT 'clean_table_test' as marker, * FROM t2;" > /dev/null 2>&1
+
+ # Should NOT see track table mutation blocking message for t2
+ if grep -a -q "could not load balance because table.*t2.*was recently written" log/pgpool.log; then
+ echo "Test 4 FAILED: Clean table incorrectly marked as stale"
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 4 PASSED: Clean tables still load balance"
+
+ echo "=== Test 5: UPDATE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Update t2 and then read - use single connection
+ $PSQL test <<EOF
+UPDATE t2 SET i = 999 WHERE i = 0;
+SELECT 'update_test' as marker, * FROM t2;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 5 PASSED: UPDATE marks table as stale"
+ else
+ echo "Test 5 FAILED: UPDATE did not mark table as stale"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 6: DELETE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Delete from t3 and then read - use single connection
+ $PSQL test <<EOF
+DELETE FROM t3 WHERE i = 0;
+SELECT 'delete_test' as marker, * FROM t3;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 6 PASSED: DELETE marks table as stale"
+ else
+ echo "Test 6 FAILED: DELETE did not mark table as stale"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 7: TRUNCATE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create a fresh table for TRUNCATE test
+ $PSQL test -c "CREATE TABLE t_truncate(i INTEGER);" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_truncate VALUES (1), (2), (3);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log again
+ > log/pgpool.log
+
+ # Truncate and then read - use single connection
+ $PSQL test <<EOF
+TRUNCATE t_truncate;
+SELECT 'truncate_test' as marker, * FROM t_truncate;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 7 PASSED: TRUNCATE marks table as stale"
+ else
+ echo "Test 7 FAILED: TRUNCATE did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 8: WITH Clause (CTE with DELETE) Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create a fresh table for WITH test
+ $PSQL test -c "CREATE TABLE t_cte(i INTEGER);" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_cte VALUES (1), (2), (3);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log again
+ > log/pgpool.log
+
+ # Use WITH clause with DELETE, then read from the table
+ $PSQL test <<EOF
+WITH deleted AS (DELETE FROM t_cte WHERE i = 1 RETURNING *)
+SELECT * FROM deleted;
+SELECT 'cte_test' as marker, * FROM t_cte;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 8 PASSED: WITH clause (CTE) marks table as stale"
+ else
+ echo "Test 8 FAILED: WITH clause (CTE) did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ # Test 9: MERGE (PostgreSQL 15+ only)
+ PG_MAJOR_VERSION=$($PSQL test -t -c "SELECT substring(version() from 'PostgreSQL ([0-9]+)');" | tr -d ' ')
+ if [ "$PG_MAJOR_VERSION" -ge 15 ] 2>/dev/null; then
+ echo "=== Test 9: MERGE Marks Table as Stale (PostgreSQL $PG_MAJOR_VERSION) ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create tables for MERGE test
+ $PSQL test -c "CREATE TABLE t_merge_target(id INTEGER PRIMARY KEY, val TEXT);" > /dev/null 2>&1
+ $PSQL test -c "CREATE TABLE t_merge_source(id INTEGER, val TEXT);" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_merge_target VALUES (1, 'old');" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_merge_source VALUES (1, 'new'), (2, 'insert');" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log again
+ > log/pgpool.log
+
+ # Use MERGE, then read from the target table
+ $PSQL test <<EOF
+MERGE INTO t_merge_target t
+USING t_merge_source s ON t.id = s.id
+WHEN MATCHED THEN UPDATE SET val = s.val
+WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.val);
+SELECT 'merge_test' as marker, * FROM t_merge_target;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 9 PASSED: MERGE marks table as stale"
+ else
+ echo "Test 9 FAILED: MERGE did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+ else
+ echo "=== Test 9: MERGE skipped (requires PostgreSQL 15+, have $PG_MAJOR_VERSION) ==="
+ fi
+
+ echo ""
+ echo "=== All Track Table Mutation Tests PASSED ==="
+
+ ./shutdownall
+
+ cd ..
+done
+
+exit 0
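
Note: Tests 3 and 5 through 9 above repeat the same grep-and-report block. The following is a minimal sketch of how that check might be factored out, assuming the log message keeps the wording matched by the tests. The names STALE_MSG and expect_stale are hypothetical; they are not part of the patch or of the regression test library.

```shell
# Hypothetical helper (not in the patch): report PASSED/FAILED depending on
# whether the staleness message appears in the given log file.
STALE_MSG="could not load balance because table.*was recently written"

expect_stale() {
    # $1 = test label, $2 = log file to search
    if grep -a -q -e "$STALE_MSG" "$2"; then
        echo "$1 PASSED"
        return 0
    fi
    echo "$1 FAILED"
    # Show related log entries for debugging, as the tests above do
    grep -a -i "track_table_mutation" "$2" | tail -20
    return 1
}
```

A caller could then write `expect_stale "Test 5" log/pgpool.log || { ./shutdownall; exit 1; }`, which keeps the cleanup step in the test body and only moves the log check into the helper.
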
diff --git a/src/test/regression/tests/043.track_table_mutation_watchdog/.gitignore b/src/test/regression/tests/043.track_table_mutation_watchdog/.gitignore
new file mode 100644
index 0000000000000000000000000000000000000000..fcb93d27a7e7e8a5efe6eacfb0f88f6f3c8bc765
--- /dev/null
+++ b/src/test/regression/tests/043.track_table_mutation_watchdog/.gitignore
@@ -0,0 +1,3 @@
+leader
+standby
+*.pid
diff --git a/src/test/regression/tests/043.track_table_mutation_watchdog/leader.conf b/src/test/regression/tests/043.track_table_mutation_watchdog/leader.conf
new file mode 100644
index 0000000000000000000000000000000000000000..945cff9860d0357fbb0e3e9a5643124d916bd9c3
--- /dev/null
+++ b/src/test/regression/tests/043.track_table_mutation_watchdog/leader.conf
@@ -0,0 +1,25 @@
+# leader watchdog config for track_table_mutation watchdog test
+use_watchdog = on
+wd_interval = 1
+wd_priority = 2
+
+hostname0 = 'localhost'
+wd_port0 = 21004
+pgpool_port0 = 11000
+hostname1 = 'localhost'
+wd_port1 = 21104
+pgpool_port1 = 11100
+
+heartbeat_hostname0 = 'localhost'
+heartbeat_port0 = 21005
+heartbeat_hostname1 = 'localhost'
+heartbeat_port1 = 21105
+
+enable_consensus_with_half_votes = on
+
+# Enable track table mutation feature via dml_adaptive_global
+disable_load_balance_on_write = 'dml_adaptive_global'
+track_table_mutation_cold_start_duration = 2000
+
+# Enable debug logging to see feature messages
+log_min_messages = debug1
diff --git a/src/test/regression/tests/043.track_table_mutation_watchdog/standby.conf b/src/test/regression/tests/043.track_table_mutation_watchdog/standby.conf
new file mode 100644
index 0000000000000000000000000000000000000000..a11c3dfca427cf6b246451d067c30b0255b9c4ce
--- /dev/null
+++ b/src/test/regression/tests/043.track_table_mutation_watchdog/standby.conf
@@ -0,0 +1,27 @@
+# standby watchdog config for track_table_mutation watchdog test
+port = 11100
+pcp_port = 11105
+use_watchdog = on
+wd_interval = 1
+wd_priority = 1
+
+hostname0 = 'localhost'
+wd_port0 = 21004
+pgpool_port0 = 11000
+hostname1 = 'localhost'
+wd_port1 = 21104
+pgpool_port1 = 11100
+
+heartbeat_hostname0 = 'localhost'
+heartbeat_port0 = 21005
+heartbeat_hostname1 = 'localhost'
+heartbeat_port1 = 21105
+
+enable_consensus_with_half_votes = on
+
+# Enable track table mutation feature via dml_adaptive_global
+disable_load_balance_on_write = 'dml_adaptive_global'
+track_table_mutation_cold_start_duration = 2000
+
+# Enable debug logging to see feature messages
+log_min_messages = debug1
diff --git a/src/test/regression/tests/043.track_table_mutation_watchdog/test.sh b/src/test/regression/tests/043.track_table_mutation_watchdog/test.sh
new file mode 100755
index 0000000000000000000000000000000000000000..752a6e6aa377fe0c54244975e606648101c98cf8
--- /dev/null
+++ b/src/test/regression/tests/043.track_table_mutation_watchdog/test.sh
@@ -0,0 +1,179 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for track table mutation global cold start on watchdog leader change.
+# Tests that when the watchdog leader changes, the new leader triggers
+# a global cold start to force all queries to primary.
+#
+source $TESTLIBS
+LEADER_DIR=leader
+STANDBY_DIR=standby
+PSQL=$PGBIN/psql
+success_count=0
+
+rm -fr $LEADER_DIR
+rm -fr $STANDBY_DIR
+
+mkdir $LEADER_DIR
+mkdir $STANDBY_DIR
+
+# change into the leader directory
+cd $LEADER_DIR
+
+# create leader environment with streaming replication
+echo -n "creating leader pgpool..."
+$PGPOOL_SETUP -m s -n 2 -p 11000 || exit 1
+echo "leader setup done."
+
+# copy the configurations to standby
+cp -r etc ../$STANDBY_DIR/
+
+source ./bashrc.ports
+cat ../leader.conf >> etc/pgpool.conf
+echo 0 > etc/pgpool_node_id
+
+./startall
+wait_for_pgpool_startup
+
+# back to test root dir
+cd ..
+
+# create standby environment
+mkdir $STANDBY_DIR/log
+echo -n "creating standby pgpool..."
+cat standby.conf >> $STANDBY_DIR/etc/pgpool.conf
+# the standby reuses the leader's pgpool-II configuration, so change the pid file path
+echo "pid_file_name = '$PWD/pgpool2.pid'" >> $STANDBY_DIR/etc/pgpool.conf
+echo 1 > $STANDBY_DIR/etc/pgpool_node_id
+# start the standby pgpool-II by hand
+$PGPOOL_INSTALL_DIR/bin/pgpool -D -n -f $STANDBY_DIR/etc/pgpool.conf -F $STANDBY_DIR/etc/pcp.conf -a $STANDBY_DIR/etc/pool_hba.conf > $STANDBY_DIR/log/pgpool.log 2>&1 &
+
+# Test 1: Check if leader pgpool-II started correctly
+echo "=== Test 1: Waiting for the pgpool leader... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep -a "I am the cluster leader node. Starting escalation process" $LEADER_DIR/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 1 PASSED: Leader brought up successfully."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 1 ]; then
+ echo "Test 1 FAILED: Leader did not start"
+ $PGPOOL_INSTALL_DIR/bin/pgpool -f $STANDBY_DIR/etc/pgpool.conf -m f stop 2>/dev/null
+ cd $LEADER_DIR && ./shutdownall
+ exit 1
+fi
+
+# Test 2: Check if standby has successfully joined
+echo "=== Test 2: Waiting for the standby to join cluster... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep -a "successfully joined the watchdog cluster as standby node" $STANDBY_DIR/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 2 PASSED: Standby successfully connected."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 2 ]; then
+ echo "Test 2 FAILED: Standby did not join cluster"
+ $PGPOOL_INSTALL_DIR/bin/pgpool -f $STANDBY_DIR/etc/pgpool.conf -m f stop 2>/dev/null
+ cd $LEADER_DIR && ./shutdownall
+ exit 1
+fi
+
+# Test 3: Verify track_table_mutation is enabled and working on leader
+echo "=== Test 3: Verify track_table_mutation is enabled ==="
+if grep -a "track_table_mutation: initialized" $LEADER_DIR/log/pgpool.log > /dev/null 2>&1; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 3 PASSED: track_table_mutation initialized on leader"
+else
+ echo "Test 3 FAILED: track_table_mutation not initialized on leader"
+ $PGPOOL_INSTALL_DIR/bin/pgpool -f $STANDBY_DIR/etc/pgpool.conf -m f stop 2>/dev/null
+ cd $LEADER_DIR && ./shutdownall
+ exit 1
+fi
+
+# Test 4: Stop leader pgpool and trigger failover
+echo "=== Test 4: Triggering leader failover... ==="
+$PGPOOL_INSTALL_DIR/bin/pgpool -f $LEADER_DIR/etc/pgpool.conf -m f stop
+
+echo "Checking if the Standby pgpool-II detected the leader shutdown..."
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep -a " is shutting down" $STANDBY_DIR/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 4 PASSED: Leader shutdown detected."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 4 ]; then
+ echo "Test 4 FAILED: Leader shutdown not detected"
+ $PGPOOL_INSTALL_DIR/bin/pgpool -f $STANDBY_DIR/etc/pgpool.conf -m f stop 2>/dev/null
+ cd $LEADER_DIR && ./shutdownall
+ exit 1
+fi
+
+# Test 5: Verify standby becomes new leader
+echo "=== Test 5: Checking if standby takes over as leader... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep -a "I am the cluster leader node. Starting escalation process" $STANDBY_DIR/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 5 PASSED: Standby became the new leader."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 5 ]; then
+ echo "Test 5 FAILED: Standby did not become leader"
+ $PGPOOL_INSTALL_DIR/bin/pgpool -f $STANDBY_DIR/etc/pgpool.conf -m f stop 2>/dev/null
+ cd $LEADER_DIR && ./shutdownall
+ exit 1
+fi
+
+# Test 6: Verify global cold start was triggered on new leader
+echo "=== Test 6: Checking if global cold start was triggered... ==="
+# The new leader should trigger global cold start when it becomes coordinator
+# Look for the log message that indicates global cold start was triggered
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep -a "track_table_mutation: entering global cold start" $STANDBY_DIR/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 6 PASSED: Global cold start triggered on new leader."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+# Cleanup
+$PGPOOL_INSTALL_DIR/bin/pgpool -f $STANDBY_DIR/etc/pgpool.conf -m f stop 2>/dev/null
+cd $LEADER_DIR
+./shutdownall
+
+echo ""
+echo "$success_count out of 6 successful"
+
+if test $success_count -eq 6
+then
+ echo "=== All Track Table Mutation Watchdog Tests PASSED ==="
+ exit 0
+fi
+
+exit 1
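
Note: the watchdog test above polls a log file with the same ten-iteration loop in Tests 1, 2, 4, 5 and 6. The following is a minimal sketch of a polling helper that could replace those loops. The name wait_for_log and its argument order are hypothetical; the defaults (10 retries, 2 second interval) mirror the loops above.

```shell
# Hypothetical helper (not in the patch): poll a log file until a pattern
# appears. Usage: wait_for_log PATTERN FILE [RETRIES] [INTERVAL_SECONDS]
wait_for_log() {
    wfl_pattern=$1
    wfl_file=$2
    wfl_retries=${3:-10}
    wfl_interval=${4:-2}
    wfl_i=1
    while [ "$wfl_i" -le "$wfl_retries" ]; do
        # -a treats the log as text even if it contains binary bytes
        if grep -a -q -e "$wfl_pattern" "$wfl_file" 2>/dev/null; then
            return 0
        fi
        echo "[check] $wfl_i times"
        sleep "$wfl_interval"
        wfl_i=$((wfl_i + 1))
    done
    return 1
}
```

With such a helper, a check like Test 6 could report an explicit failure, for example `wait_for_log "track_table_mutation: entering global cold start" $STANDBY_DIR/log/pgpool.log || echo "Test 6 FAILED"`. Whether that wording fits the rest of the suite is a judgment call.
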
diff --git a/src/utils/pool_track_table_mutation.c b/src/utils/pool_track_table_mutation.c
new file mode 100644
index 0000000000000000000000000000000000000000..27d4f0380d43a237f518c60cdd73aba2ff51b723
--- /dev/null
+++ b/src/utils/pool_track_table_mutation.c
@@ -0,0 +1,1188 @@
+/* -*-pgsql-c-*- */
+/*
+ * pgpool: a language independent connection pool server for PostgreSQL
+ * written by Tatsuo Ishii
+ *
+ * Copyright (c) 2003-2026 PgPool Global Development Group
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose and without fee is hereby
+ * granted, provided that the above copyright notice appear in all
+ * copies and that both that copyright notice and this permission
+ * notice appear in supporting documentation, and that the name of the
+ * author not be used in advertising or publicity pertaining to
+ * distribution of the software without specific, written prior
+ * permission. The author makes no representations about the
+ * suitability of this software for any purpose. It is provided "as
+ * is" without express or implied warranty.
+ *
+ * pool_track_table_mutation.c: In-memory tracking of recently written tables
+ * to avoid stale reads from replicas during replication lag
+ *
+ * Based on the "lagless" architecture from Tailor Brands:
+ * https://medium.com/tailor-tech/using-database-read-replicas-in-distributed-systems-d80eaf6bbf8a
+ */
+
+#include <string.h>
+#include <stdlib.h>
+#include <sys/time.h>
+#include <unistd.h>
+
+#include "pool.h"
+#include "pool_config.h"
+#include "context/pool_session_context.h"
+#include "utils/pool_track_table_mutation.h"
+#include "utils/elog.h"
+#include "utils/pool_ipc.h"
+#include "utils/palloc.h"
+#include "utils/pool_relcache.h"
+
+#define DATABASE_TO_OID_QUERY "SELECT oid FROM pg_catalog.pg_database WHERE datname = '%s'"
+
+/* ----------------
+ * Local variables
+ * ----------------
+ */
+
+/* Pointer to shared memory structure */
+static TrackTableMutationShmem *track_table_mutation_shmem = NULL;
+
+/* Per-process cold start tracking (not in shared memory) */
+static struct timeval process_start_time;
+static bool cold_start_initialized = false;
+
+/* ----------------
+ * Helper macros for accessing flexible arrays in shared memory
+ * ----------------
+ */
+
+/* Get pointer to bucket array in table map */
+#define TABLE_MAP_BUCKETS(map) \
+ ((int *)((char *)(map) + sizeof(TrackTableMutationHashTable)))
+
+/* Get pointer to entry array in table map */
+#define TABLE_MAP_ENTRIES(map) \
+ ((TrackTableMutationEntry *)((char *)(map) + sizeof(TrackTableMutationHashTable) + \
+ (map)->num_buckets * sizeof(int)))
+
+/* Get pointer to bucket array in query cache */
+#define QUERY_CACHE_BUCKETS(cache) \
+ ((int *)((char *)(cache) + sizeof(QueryParseCache)))
+
+/* Get pointer to entry array in query cache */
+#define QUERY_CACHE_ENTRIES(cache) \
+ ((QueryParseEntry *)((char *)(cache) + sizeof(QueryParseCache) + \
+ (cache)->num_buckets * sizeof(int)))
+
+/* ----------------
+ * Semaphore lock helpers
+ * ----------------
+ */
+
+static inline void
+table_map_lock(void)
+{
+ pool_semaphore_lock(TRACK_TABLE_MUTATION_TABLE_SEM);
+}
+
+static inline void
+table_map_unlock(void)
+{
+ pool_semaphore_unlock(TRACK_TABLE_MUTATION_TABLE_SEM);
+}
+
+static inline void
+query_cache_lock(void)
+{
+ pool_semaphore_lock(TRACK_TABLE_MUTATION_QUERY_SEM);
+}
+
+static inline void
+query_cache_unlock(void)
+{
+ pool_semaphore_unlock(TRACK_TABLE_MUTATION_QUERY_SEM);
+}
+
+/* ----------------
+ * Hash functions
+ * ----------------
+ */
+
+/*
+ * FNV-1a hash for table/database oid pair
+ */
+static uint32
+fnv1a_hash_table_key(int table_oid, int dboid)
+{
+ uint32 hash = 2166136261u; /* FNV offset basis */
+ uint32 data[2];
+ const unsigned char *bytes;
+ size_t i;
+
+ data[0] = (uint32) table_oid;
+ data[1] = (uint32) dboid;
+ bytes = (const unsigned char *) data;
+
+ for (i = 0; i < sizeof(data); i++)
+ {
+ hash ^= bytes[i];
+ hash *= 16777619u; /* FNV prime */
+ }
+
+ return hash;
+}
+
+/*
+ * 64-bit FNV-1a hash of a byte string
+ */
+static uint64
+fnv1a_hash_64(const char *str, size_t len)
+{
+ uint64 hash = 14695981039346656037ULL; /* FNV offset basis for 64-bit */
+ size_t i;
+
+ for (i = 0; i < len; i++)
+ {
+ hash ^= (uint8)str[i];
+ hash *= 1099511628211ULL; /* FNV prime for 64-bit */
+ }
+
+ return hash;
+}
+
+/* ----------------
+ * Time utilities
+ * ----------------
+ */
+
+/*
+ * Get elapsed time in microseconds between two timevals
+ */
+static int64
+elapsed_us(struct timeval *start, struct timeval *end)
+{
+ return ((int64)(end->tv_sec - start->tv_sec) * 1000000) +
+ (end->tv_usec - start->tv_usec);
+}
+
+/*
+ * Get current time
+ */
+static void
+get_current_time(struct timeval *tv)
+{
+ gettimeofday(tv, NULL);
+}
+
+/* ----------------
+ * Database oid lookup
+ * ----------------
+ */
+
+static int
+track_table_mutation_get_database_oid_internal(void)
+{
+ int oid = 0;
+ static POOL_RELCACHE *relcache;
+ POOL_CONNECTION_POOL *backend;
+ POOL_SESSION_CONTEXT *session_context;
+
+ /* Safety check: must have shmem initialized */
+ if (track_table_mutation_shmem == NULL)
+ return oid;
+
+ session_context = pool_get_session_context(false);
+ if (session_context == NULL)
+ return oid;
+
+ /* Ensure we have a valid query context */
+ if (session_context->query_context == NULL)
+ return oid;
+
+ backend = session_context->backend;
+ if (backend == NULL || MAIN_CONNECTION(backend) == NULL || MAIN_CONNECTION(backend)->sp == NULL)
+ return oid;
+
+ /* Ensure database name is valid */
+ if (MAIN_CONNECTION(backend)->sp->database == NULL)
+ return oid;
+
+ if (!relcache)
+ {
+ relcache = pool_create_relcache(pool_config->relcache_size,
+ DATABASE_TO_OID_QUERY,
+ int_register_func,
+ int_unregister_func,
+ false);
+ if (relcache == NULL)
+ {
+ ereport(LOG,
+ (errmsg("track_table_mutation: error creating relcache while getting database OID")));
+ return oid;
+ }
+ }
+
+ oid = (int) (intptr_t) pool_search_relcache(relcache, backend,
+ MAIN_CONNECTION(backend)->sp->database);
+ return oid;
+}
+
+int
+pool_track_table_mutation_get_database_oid(void)
+{
+ return track_table_mutation_get_database_oid_internal();
+}
+
+/* ----------------
+ * Table mutation hash table operations
+ * ----------------
+ */
+
+/*
+ * Initialize table mutation hash table
+ */
+static void
+table_map_init(TrackTableMutationHashTable *map, int num_buckets, int max_entries)
+{
+ int *buckets;
+ TrackTableMutationEntry *entries;
+ int i;
+
+ map->num_buckets = num_buckets;
+ map->max_entries = max_entries;
+ map->num_entries = 0;
+ map->free_list_head = 0;
+
+ buckets = TABLE_MAP_BUCKETS(map);
+ entries = TABLE_MAP_ENTRIES(map);
+
+ /* Initialize all buckets to empty */
+ for (i = 0; i < num_buckets; i++)
+ buckets[i] = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ /* Initialize free list - chain all entries */
+ for (i = 0; i < max_entries; i++)
+ {
+ entries[i].in_use = false;
+ entries[i].next = (i < max_entries - 1) ? i + 1 : TRACK_TABLE_MUTATION_INVALID_INDEX;
+ }
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: initialized table map with %d buckets, %d max entries",
+ num_buckets, max_entries)));
+}
+
+/*
+ * Allocate an entry from the free list
+ */
+static int
+table_map_alloc_entry(TrackTableMutationHashTable *map)
+{
+ TrackTableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ int idx;
+
+ if (map->free_list_head == TRACK_TABLE_MUTATION_INVALID_INDEX)
+ return TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ idx = map->free_list_head;
+ map->free_list_head = entries[idx].next;
+ entries[idx].in_use = true;
+ entries[idx].next = TRACK_TABLE_MUTATION_INVALID_INDEX;
+ map->num_entries++;
+
+ return idx;
+}
+
+/*
+ * Free an entry back to the free list
+ */
+static void
+table_map_free_entry(TrackTableMutationHashTable *map, int idx)
+{
+ TrackTableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+
+ entries[idx].in_use = false;
+ entries[idx].next = map->free_list_head;
+ map->free_list_head = idx;
+ map->num_entries--;
+}
+
+/*
+ * Look up a table in the hash table
+ * Returns entry index or TRACK_TABLE_MUTATION_INVALID_INDEX if not found
+ * Must be called with lock held
+ */
+static int
+table_map_lookup(TrackTableMutationHashTable *map, int table_oid, int dboid, uint32 hash)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TrackTableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ int bucket = hash % map->num_buckets;
+ int idx = buckets[bucket];
+
+ while (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ if (entries[idx].hash == hash &&
+ entries[idx].table_oid == table_oid &&
+ entries[idx].dboid == dboid)
+ {
+ return idx;
+ }
+ idx = entries[idx].next;
+ }
+
+ return TRACK_TABLE_MUTATION_INVALID_INDEX;
+}
+
+/*
+ * Insert or update a table entry
+ * Must be called with lock held
+ */
+static void
+table_map_insert(TrackTableMutationHashTable *map, int table_oid, int dboid,
+ uint32 hash, struct timeval *write_time)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TrackTableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ int bucket = hash % map->num_buckets;
+ int idx;
+
+ /* Check if entry already exists */
+ idx = table_map_lookup(map, table_oid, dboid, hash);
+ if (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ /* Update existing entry */
+ entries[idx].last_write_time = *write_time;
+ return;
+ }
+
+ /* Allocate new entry */
+ idx = table_map_alloc_entry(map);
+ if (idx == TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ int b;
+ /* Table is full - evict an entry. For simplicity, take the head of the */
+ /* first non-empty bucket, even if that entry's TTL has not expired yet. */
+ for (b = 0; b < map->num_buckets; b++)
+ {
+ if (buckets[b] != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ int victim = buckets[b];
+ buckets[b] = entries[victim].next;
+ table_map_free_entry(map, victim);
+ idx = table_map_alloc_entry(map);
+ break;
+ }
+ }
+
+ if (idx == TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ ereport(WARNING,
+ (errmsg("track_table_mutation: failed to allocate entry for table oid %d (dboid %d)",
+ table_oid, dboid)));
+ return;
+ }
+ }
+
+ /* Initialize new entry */
+ entries[idx].table_oid = table_oid;
+ entries[idx].dboid = dboid;
+ entries[idx].hash = hash;
+ entries[idx].last_write_time = *write_time;
+
+ /* Insert at head of bucket chain */
+ entries[idx].next = buckets[bucket];
+ buckets[bucket] = idx;
+
+ ereport(DEBUG2,
+ (errmsg("track_table_mutation: marked table oid %d (dboid %d) as written",
+ table_oid, dboid)));
+}
+
+/*
+ * Remove expired entries from the table map
+ * Must be called with lock held
+ */
+static void
+table_map_cleanup_expired(TrackTableMutationHashTable *map, uint64 ttl_us)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TrackTableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ struct timeval now;
+ int removed = 0;
+ int b;
+
+ get_current_time(&now);
+
+ for (b = 0; b < map->num_buckets; b++)
+ {
+ int *prev_ptr = &buckets[b];
+ int idx = buckets[b];
+
+ while (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ int64 elapsed = elapsed_us(&entries[idx].last_write_time, &now);
+
+ if (elapsed > (int64)ttl_us)
+ {
+ /* Entry has expired - remove it */
+ int next = entries[idx].next;
+ *prev_ptr = next;
+ table_map_free_entry(map, idx);
+ idx = next;
+ removed++;
+ }
+ else
+ {
+ prev_ptr = &entries[idx].next;
+ idx = entries[idx].next;
+ }
+ }
+ }
+
+ if (removed > 0)
+ {
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: cleaned up %d expired table entries", removed)));
+ }
+}
+
+/* ----------------
+ * Query parse cache operations
+ * ----------------
+ */
+
+/*
+ * Initialize query parse cache
+ */
+static void
+query_cache_init(QueryParseCache *cache, int num_buckets, int max_entries)
+{
+ int *buckets;
+ QueryParseEntry *entries;
+ int i;
+
+ cache->num_buckets = num_buckets;
+ cache->max_entries = max_entries;
+ cache->num_entries = 0;
+ cache->free_list_head = 0;
+ cache->lru_head = TRACK_TABLE_MUTATION_INVALID_INDEX;
+ cache->lru_tail = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ buckets = QUERY_CACHE_BUCKETS(cache);
+ entries = QUERY_CACHE_ENTRIES(cache);
+
+ /* Initialize all buckets to empty */
+ for (i = 0; i < num_buckets; i++)
+ buckets[i] = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ /* Initialize free list */
+ for (i = 0; i < max_entries; i++)
+ {
+ entries[i].in_use = false;
+ entries[i].next = (i < max_entries - 1) ? i + 1 : TRACK_TABLE_MUTATION_INVALID_INDEX;
+ entries[i].lru_prev = TRACK_TABLE_MUTATION_INVALID_INDEX;
+ entries[i].lru_next = TRACK_TABLE_MUTATION_INVALID_INDEX;
+ }
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: initialized query cache with %d buckets, %d max entries",
+ num_buckets, max_entries)));
+}
+
+/*
+ * Move entry to front of LRU list (most recently used)
+ */
+static void
+query_cache_lru_touch(QueryParseCache *cache, int idx)
+{
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+
+ /* Already at head? */
+ if (cache->lru_head == idx)
+ return;
+
+ /* Remove from current position */
+ if (entries[idx].lru_prev != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ entries[entries[idx].lru_prev].lru_next = entries[idx].lru_next;
+ if (entries[idx].lru_next != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ entries[entries[idx].lru_next].lru_prev = entries[idx].lru_prev;
+ if (cache->lru_tail == idx)
+ cache->lru_tail = entries[idx].lru_prev;
+
+ /* Insert at head */
+ entries[idx].lru_prev = TRACK_TABLE_MUTATION_INVALID_INDEX;
+ entries[idx].lru_next = cache->lru_head;
+ if (cache->lru_head != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ entries[cache->lru_head].lru_prev = idx;
+ cache->lru_head = idx;
+ if (cache->lru_tail == TRACK_TABLE_MUTATION_INVALID_INDEX)
+ cache->lru_tail = idx;
+}
+
+/*
+ * Add entry to LRU list (at head)
+ */
+static void
+query_cache_lru_add(QueryParseCache *cache, int idx)
+{
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+
+ entries[idx].lru_prev = TRACK_TABLE_MUTATION_INVALID_INDEX;
+ entries[idx].lru_next = cache->lru_head;
+
+ if (cache->lru_head != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ entries[cache->lru_head].lru_prev = idx;
+
+ cache->lru_head = idx;
+
+ if (cache->lru_tail == TRACK_TABLE_MUTATION_INVALID_INDEX)
+ cache->lru_tail = idx;
+}
+
+/*
+ * Remove entry from LRU list
+ */
+static void
+query_cache_lru_remove(QueryParseCache *cache, int idx)
+{
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+
+ if (entries[idx].lru_prev != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ entries[entries[idx].lru_prev].lru_next = entries[idx].lru_next;
+ else
+ cache->lru_head = entries[idx].lru_next;
+
+ if (entries[idx].lru_next != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ entries[entries[idx].lru_next].lru_prev = entries[idx].lru_prev;
+ else
+ cache->lru_tail = entries[idx].lru_prev;
+
+ entries[idx].lru_prev = TRACK_TABLE_MUTATION_INVALID_INDEX;
+ entries[idx].lru_next = TRACK_TABLE_MUTATION_INVALID_INDEX;
+}
+
+/*
+ * Allocate entry from free list, evicting LRU if necessary
+ */
+static int
+query_cache_alloc_entry(QueryParseCache *cache)
+{
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+ int *buckets = QUERY_CACHE_BUCKETS(cache);
+ int idx;
+
+ if (cache->free_list_head != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ idx = cache->free_list_head;
+ cache->free_list_head = entries[idx].next;
+ entries[idx].in_use = true;
+ entries[idx].next = TRACK_TABLE_MUTATION_INVALID_INDEX;
+ cache->num_entries++;
+ return idx;
+ }
+
+ /* No free entries - evict LRU */
+ if (cache->lru_tail == TRACK_TABLE_MUTATION_INVALID_INDEX)
+ return TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ idx = cache->lru_tail;
+
+ /* Remove from hash bucket */
+ int bucket = entries[idx].query_hash % cache->num_buckets;
+ int *prev_ptr = &buckets[bucket];
+ int curr = buckets[bucket];
+
+ while (curr != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ if (curr == idx)
+ {
+ *prev_ptr = entries[curr].next;
+ break;
+ }
+ prev_ptr = &entries[curr].next;
+ curr = entries[curr].next;
+ }
+
+ /* Remove from LRU list */
+ query_cache_lru_remove(cache, idx);
+
+ /* Reinitialize entry */
+ entries[idx].in_use = true;
+ entries[idx].next = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ return idx;
+}
+
+/*
+ * Look up a query in the cache by its 64-bit hash (query text is not compared)
+ */
+static int
+query_cache_lookup(QueryParseCache *cache, uint64 hash)
+{
+ int *buckets = QUERY_CACHE_BUCKETS(cache);
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+ int bucket = hash % cache->num_buckets;
+ int idx = buckets[bucket];
+
+ while (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ if (entries[idx].query_hash == hash)
+ return idx;
+ idx = entries[idx].next;
+ }
+
+ return TRACK_TABLE_MUTATION_INVALID_INDEX;
+}
+
+/* ----------------
+ * Query normalization
+ * ----------------
+ */
+
+/*
+ * Simple query normalization:
+ * - Strip comments (-- style and C-style block comments)
+ * - Collapse whitespace
+ * - Convert to lowercase (except inside strings)
+ * - Replace literal values with placeholders
+ *
+ * This is a simplified version. Pgpool-II already does this elsewhere,
+ * but the track table mutation feature needs a standalone version.
+ */
+static size_t
+normalize_query(const char *query, char *output, size_t output_size)
+{
+ const char *src = query;
+ char *dst = output;
+ char *dst_end = output + output_size - 1;
+ bool in_string = false;
+ char string_char = 0;
+ bool last_was_space = true; /* Start true to skip leading space */
+
+ while (*src && dst < dst_end)
+ {
+ /* Handle string literals */
+ if (in_string)
+ {
+ if (*src == string_char)
+ {
+ if (*(src + 1) == string_char)
+ {
+ /* Escaped quote */
+ src += 2;
+ continue;
+ }
+ in_string = false;
+ *dst++ = '$'; /* Replace string content with placeholder */
+ }
+ src++;
+ continue;
+ }
+
+ /* Check for string start; double quotes delimit identifiers, not literals */
+ if (*src == '\'')
+ {
+ in_string = true;
+ string_char = *src;
+ src++;
+ continue;
+ }
+
+ /* Handle single-line comments */
+ if (*src == '-' && *(src + 1) == '-')
+ {
+ while (*src && *src != '\n')
+ src++;
+ continue;
+ }
+
+ /* Handle multi-line comments */
+ if (*src == '/' && *(src + 1) == '*')
+ {
+ src += 2;
+ while (*src && !(*src == '*' && *(src + 1) == '/'))
+ src++;
+ if (*src)
+ src += 2;
+ continue;
+ }
+
+ /* Handle whitespace */
+ if (*src == ' ' || *src == '\t' || *src == '\n' || *src == '\r')
+ {
+ if (!last_was_space)
+ {
+ *dst++ = ' ';
+ last_was_space = true;
+ }
+ src++;
+ continue;
+ }
+
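+ /* Handle numbers - replace with placeholder. Digits that continue an */
+ /* identifier (e.g. t1) are not literals and are copied unchanged. */
+ if (((*src >= '0' && *src <= '9') ||
+ (*src == '.' && *(src + 1) >= '0' && *(src + 1) <= '9')) &&
+ !(dst > output && (*(dst - 1) == '_' ||
+ (*(dst - 1) >= 'a' && *(dst - 1) <= 'z') ||
+ (*(dst - 1) >= '0' && *(dst - 1) <= '9'))))
+ {
+ while (*src && ((*src >= '0' && *src <= '9') || *src == '.'))
+ src++;
+ if (!last_was_space && dst > output && *(dst - 1) != '$')
+ *dst++ = '$';
+ last_was_space = false;
+ continue;
+ }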
+
+ /* Regular character - convert to lowercase */
+ if (*src >= 'A' && *src <= 'Z')
+ *dst++ = *src + 32;
+ else
+ *dst++ = *src;
+
+ last_was_space = false;
+ src++;
+ }
+
+ /* Remove trailing space */
+ if (dst > output && *(dst - 1) == ' ')
+ dst--;
+
+ *dst = '\0';
+ return dst - output;
+}
+
+/* ----------------
+ * Public API implementation
+ * ----------------
+ */
+
+Size
+pool_track_table_mutation_shmem_size(void)
+{
+ Size size = 0;
+ int table_buckets = pool_config->track_table_mutation_table_buckets;
+ int table_size = pool_config->track_table_mutation_table_size;
+ int query_buckets = pool_config->track_table_mutation_query_buckets;
+ int query_cache_size = pool_config->track_table_mutation_query_parse_cache_size;
+
+ /* Main structure */
+ size += sizeof(TrackTableMutationShmem);
+
+ /* Table mutation hash table */
+ size += sizeof(TrackTableMutationHashTable);
+ size += table_buckets * sizeof(int); /* buckets array */
+ size += table_size * sizeof(TrackTableMutationEntry); /* entries array */
+
+ /* Query parse cache */
+ size += sizeof(QueryParseCache);
+ size += query_buckets * sizeof(int); /* buckets array */
+ size += query_cache_size * sizeof(QueryParseEntry); /* entries array */
+
+ return size;
+}
+
+void
+pool_track_table_mutation_init(void)
+{
+#ifndef POOL_PRIVATE
+ Size shmem_size;
+ char *shmem_ptr;
+
+ if (pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: feature disabled")));
+ return;
+ }
+
+ shmem_size = pool_track_table_mutation_shmem_size();
+
+ /*
+ * Allocate from the main shared memory segment.
+ * Memory is already zeroed by initialize_shared_memory_main_segment().
+ */
+ shmem_ptr = pool_shared_memory_segment_get_chunk(shmem_size);
+ if (shmem_ptr == NULL)
+ {
+ ereport(ERROR,
+ (errmsg("track_table_mutation: failed to allocate %zu bytes of shared memory",
+ shmem_size)));
+ return;
+ }
+
+ /* Set up pointers to structures within shared memory */
+ track_table_mutation_shmem = (TrackTableMutationShmem *)shmem_ptr;
+ shmem_ptr += sizeof(TrackTableMutationShmem);
+
+ track_table_mutation_shmem->table_map = (TrackTableMutationHashTable *)shmem_ptr;
+ shmem_ptr += sizeof(TrackTableMutationHashTable);
+ shmem_ptr += pool_config->track_table_mutation_table_buckets * sizeof(int);
+ shmem_ptr += pool_config->track_table_mutation_table_size * sizeof(TrackTableMutationEntry);
+
+ track_table_mutation_shmem->query_cache = (QueryParseCache *)shmem_ptr;
+
+ /* Initialize structures */
+ table_map_init(track_table_mutation_shmem->table_map,
+ pool_config->track_table_mutation_table_buckets,
+ pool_config->track_table_mutation_table_size);
+
+ query_cache_init(track_table_mutation_shmem->query_cache,
+ pool_config->track_table_mutation_query_buckets,
+ pool_config->track_table_mutation_query_parse_cache_size);
+
+ /* Initialize global state */
+ track_table_mutation_shmem->state.initialized = true;
+ track_table_mutation_shmem->state.current_ttl_us = TRACK_TABLE_MUTATION_DEFAULT_TTL_US;
+ get_current_time(&track_table_mutation_shmem->state.ttl_last_updated);
+ get_current_time(&track_table_mutation_shmem->state.last_cleanup_time);
+ track_table_mutation_shmem->state.global_cold_start_until.tv_sec = 0;
+ track_table_mutation_shmem->state.global_cold_start_until.tv_usec = 0;
+ track_table_mutation_shmem->state.stats_queries_checked = 0;
+ track_table_mutation_shmem->state.stats_forced_primary = 0;
+ track_table_mutation_shmem->state.stats_allowed_replica = 0;
+
+ ereport(LOG,
+ (errmsg("track_table_mutation: initialized with %zu bytes shared memory",
+ shmem_size)));
+#endif
+}
+
+void
+pool_track_table_mutation_child_init(void)
+{
+ if (pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE_GLOBAL || track_table_mutation_shmem == NULL)
+ return;
+
+ get_current_time(&process_start_time);
+ cold_start_initialized = true;
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: child initialized, cold start period %d ms",
+ pool_config->track_table_mutation_cold_start_duration)));
+}
+
+bool
+pool_track_table_mutation_in_cold_start(void)
+{
+ struct timeval now;
+ int64 elapsed_ms;
+
+ if (pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE_GLOBAL || track_table_mutation_shmem == NULL)
+ return false;
+
+ if (pool_config->track_table_mutation_cold_start_duration <= 0)
+ return false;
+
+ get_current_time(&now);
+
+ /* Check for watchdog-triggered global cold start first */
+ if (track_table_mutation_shmem->state.global_cold_start_until.tv_sec != 0 &&
+ elapsed_us(&now, &track_table_mutation_shmem->state.global_cold_start_until) > 0)
+ {
+ return true;
+ }
+
+ /* Check per-process cold start */
+ if (!cold_start_initialized)
+ return false;
+
+ elapsed_ms = elapsed_us(&process_start_time, &now) / 1000;
+
+ if (elapsed_ms < pool_config->track_table_mutation_cold_start_duration)
+ {
+ ereport(DEBUG2,
+ (errmsg("track_table_mutation: in cold start (%ld/%d ms)",
+ (long)elapsed_ms, pool_config->track_table_mutation_cold_start_duration)));
+ return true;
+ }
+
+ return false;
+}
+
+void
+pool_track_table_mutation_trigger_global_cold_start(void)
+{
+ struct timeval now;
+ int duration_ms;
+
+ if (pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE_GLOBAL || track_table_mutation_shmem == NULL)
+ return;
+
+ duration_ms = pool_config->track_table_mutation_cold_start_duration;
+ if (duration_ms <= 0)
+ return;
+
+ get_current_time(&now);
+ track_table_mutation_shmem->state.global_cold_start_until = now;
+ track_table_mutation_shmem->state.global_cold_start_until.tv_sec += duration_ms / 1000;
+ track_table_mutation_shmem->state.global_cold_start_until.tv_usec += (duration_ms % 1000) * 1000;
+ if (track_table_mutation_shmem->state.global_cold_start_until.tv_usec >= 1000000)
+ {
+ track_table_mutation_shmem->state.global_cold_start_until.tv_sec +=
+ track_table_mutation_shmem->state.global_cold_start_until.tv_usec / 1000000;
+ track_table_mutation_shmem->state.global_cold_start_until.tv_usec %=
+ 1000000;
+ }
+
+ ereport(LOG,
+ (errmsg("track_table_mutation: entering global cold start for %d ms",
+ duration_ms)));
+}
+
+bool
+pool_track_table_mutation_table_is_stale(int table_oid, int dboid)
+{
+ TrackTableMutationHashTable *map;
+ struct timeval now;
+ uint64 ttl_us;
+ uint32 hash;
+ int idx;
+ bool is_stale = false;
+
+ if (pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE_GLOBAL || track_table_mutation_shmem == NULL)
+ return false;
+
+ if (table_oid <= 0 || dboid <= 0)
+ {
+ is_stale = true;
+ goto update_stats;
+ }
+
+ map = track_table_mutation_shmem->table_map;
+ hash = fnv1a_hash_table_key(table_oid, dboid);
+
+ table_map_lock();
+
+ idx = table_map_lookup(map, table_oid, dboid, hash);
+ if (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ TrackTableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ get_current_time(&now);
+ ttl_us = track_table_mutation_shmem->state.current_ttl_us;
+
+ int64 elapsed = elapsed_us(&entries[idx].last_write_time, &now);
+ is_stale = (elapsed < (int64)ttl_us);
+
+ ereport(DEBUG2,
+ (errmsg("track_table_mutation: table oid %d (dboid %d) elapsed=%ld us, ttl=%lu us, stale=%d",
+ table_oid, dboid, (long)elapsed, (unsigned long)ttl_us, is_stale)));
+ }
+
+ table_map_unlock();
+
+update_stats:
+ /* Update statistics - skip if shmem not available */
+ if (track_table_mutation_shmem != NULL)
+ {
+ __sync_fetch_and_add(&track_table_mutation_shmem->state.stats_queries_checked, 1);
+ if (is_stale)
+ __sync_fetch_and_add(&track_table_mutation_shmem->state.stats_forced_primary, 1);
+ else
+ __sync_fetch_and_add(&track_table_mutation_shmem->state.stats_allowed_replica, 1);
+ }
+
+ return is_stale;
+}
+
+void
+pool_track_table_mutation_mark_tables_written(const int *table_oids, int num_tables, int dboid)
+{
+ TrackTableMutationHashTable *map;
+ struct timeval now;
+ int i;
+
+ if (pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE_GLOBAL || track_table_mutation_shmem == NULL)
+ return;
+
+ if (num_tables <= 0 || table_oids == NULL || dboid <= 0)
+ return;
+
+ map = track_table_mutation_shmem->table_map;
+ get_current_time(&now);
+
+ table_map_lock();
+
+ /* Periodically clean up expired entries */
+ if (map->num_entries > map->max_entries * 3 / 4)
+ {
+ /* Limit cleanup frequency to avoid O(N) scan on every write */
+ /* 100ms interval */
+ if (elapsed_us(&track_table_mutation_shmem->state.last_cleanup_time, &now) > 100000)
+ {
+ table_map_cleanup_expired(map, track_table_mutation_shmem->state.current_ttl_us);
+ track_table_mutation_shmem->state.last_cleanup_time = now;
+ }
+ }
+
+ for (i = 0; i < num_tables; i++)
+ {
+ uint32 hash;
+ int table_oid = table_oids[i];
+
+ if (table_oid > 0)
+ {
+ hash = fnv1a_hash_table_key(table_oid, dboid);
+ table_map_insert(map, table_oid, dboid, hash, &now);
+ }
+ }
+
+ table_map_unlock();
+}
+
+/*
+ * Convenience function to mark a single table as written
+ */
+void
+pool_track_table_mutation_mark_table_written(int table_oid, int dboid)
+{
+ if (table_oid > 0 && dboid > 0)
+ {
+ const int tables[1] = { table_oid };
+ pool_track_table_mutation_mark_tables_written(tables, 1, dboid);
+ }
+}
+
+void
+pool_track_table_mutation_update_ttl(uint64 delay_us)
+{
+ uint64 new_ttl;
+
+ if (pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE_GLOBAL || track_table_mutation_shmem == NULL)
+ return;
+
+ /* Calculate new TTL: delay * factor, with minimum of default TTL */
+ new_ttl = (uint64)(delay_us * pool_config->track_table_mutation_ttl_factor);
+ if (new_ttl < TRACK_TABLE_MUTATION_DEFAULT_TTL_US)
+ new_ttl = TRACK_TABLE_MUTATION_DEFAULT_TTL_US;
+
+ /* Maximum TTL of 1 hour */
+ if (new_ttl > 3600ULL * 1000000ULL)
+ new_ttl = 3600ULL * 1000000ULL;
+
+ track_table_mutation_shmem->state.current_ttl_us = new_ttl;
+ get_current_time(&track_table_mutation_shmem->state.ttl_last_updated);
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: updated TTL to %lu us (delay=%lu us, factor=%.1f)",
+ (unsigned long)new_ttl, (unsigned long)delay_us,
+ pool_config->track_table_mutation_ttl_factor)));
+}
+
+bool
+pool_track_table_mutation_get_cached_parse(uint64 hash, bool *is_write,
+ char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
+ int *num_tables)
+{
+ QueryParseCache *cache;
+ int idx;
+ bool found = false;
+
+ if (pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE_GLOBAL || track_table_mutation_shmem == NULL)
+ return false;
+
+ cache = track_table_mutation_shmem->query_cache;
+
+ query_cache_lock();
+
+ idx = query_cache_lookup(cache, hash);
+ if (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+ int i;
+
+ *is_write = entries[idx].is_write;
+ *num_tables = entries[idx].num_tables;
+
+ for (i = 0; i < entries[idx].num_tables && i < TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY; i++)
+ {
+ strlcpy(table_names[i], entries[idx].table_names[i], TRACK_TABLE_MUTATION_TABLE_NAME_LEN);
+ }
+
+ /* Move to front of LRU */
+ query_cache_lru_touch(cache, idx);
+ found = true;
+ }
+
+ query_cache_unlock();
+
+ return found;
+}
+
+void
+pool_track_table_mutation_cache_parse(uint64 hash, bool is_write,
+ const char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
+ int num_tables)
+{
+ QueryParseCache *cache;
+ int *buckets;
+ QueryParseEntry *entries;
+ int idx;
+ int bucket;
+
+ if (pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE_GLOBAL || track_table_mutation_shmem == NULL)
+ return;
+
+ cache = track_table_mutation_shmem->query_cache;
+
+ query_cache_lock();
+
+ /* Check if already exists */
+ idx = query_cache_lookup(cache, hash);
+ if (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ query_cache_unlock();
+ return;
+ }
+
+ /* Allocate new entry (may evict LRU) */
+ idx = query_cache_alloc_entry(cache);
+ if (idx == TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ query_cache_unlock();
+ ereport(WARNING,
+ (errmsg("track_table_mutation: failed to allocate query cache entry")));
+ return;
+ }
+
+ entries = QUERY_CACHE_ENTRIES(cache);
+ buckets = QUERY_CACHE_BUCKETS(cache);
+
+ /* Fill in entry */
+ entries[idx].query_hash = hash;
+ entries[idx].is_write = is_write;
+ entries[idx].num_tables = (num_tables > TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY) ?
+ TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY : num_tables;
+
+ {
+ int i;
+ for (i = 0; i < entries[idx].num_tables; i++)
+ {
+ strlcpy(entries[idx].table_names[i], table_names[i], TRACK_TABLE_MUTATION_TABLE_NAME_LEN);
+ }
+ }
+
+ /* Insert into hash bucket */
+ bucket = hash % cache->num_buckets;
+ entries[idx].next = buckets[bucket];
+ buckets[bucket] = idx;
+
+ /* Add to LRU list */
+ query_cache_lru_add(cache, idx);
+
+ query_cache_unlock();
+}
+
+uint64
+pool_track_table_mutation_normalize_and_hash(const char *query)
+{
+ char normalized[8192];
+ size_t len;
+
+ if (query == NULL || query[0] == '\0')
+ return 0;
+
+ len = normalize_query(query, normalized, sizeof(normalized));
+ if (len == 0)
+ return 0;
+
+ return fnv1a_hash_64(normalized, len);
+}
--
2.52.0
* Re: Proposal: Recent mutated table tracking in memory
2026-01-06 11:25 Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-14 08:55 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-28 05:08 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-29 08:28 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-29 08:54 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-30 08:09 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-31 17:11 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-03 07:43 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-03 23:23 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-06 11:29 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-10 15:16 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
@ 2026-02-11 10:28 ` Tatsuo Ishii <[email protected]>
2026-02-12 09:05 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
0 siblings, 1 reply; 44+ messages in thread
From: Tatsuo Ishii @ 2026-02-11 10:28 UTC (permalink / raw)
To: [email protected]; +Cc: [email protected]
> Hi Tatsuo,
>
> After reading more about disable_load_balance_on_write=dml_adaptive, I came
> to think that this feature is actually an "extension" of it, since
> it covers "global" and not just per-transaction behavior. In any case, I
> think it makes more sense for it to sit under
> disable_load_balance_on_write rather than as a standalone setting, for clarity.
>
> I'm attaching below an updated patch with these adjustments.
>
> Please let me know what you think.
I worry about the transactional behavior with the patch:
+ This means that if a transaction is rolled back, the table remains marked as stale until
+ the TTL expires, even though no actual data modification occurred. This is by design:
This allows attackers to issue a simple command sequence continuously to
effectively disable load balancing (and increase the load on the primary) in
the whole system:
BEGIN;
UPDATE t1 SET i = 1 WHERE FALSE;
ROLLBACK;
I think if the patch allows that, we cannot accept the patch.
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
> On Fri, Feb 6, 2026 at 1:29 PM Nadav Shatz <[email protected]> wrote:
>
>> Hi Tatsuo,
>>
>> Thank you for all the great comments and questions! I took all of them
>> into consideration, either adding support/tests or detailing the
>> limitations in the docs.
>>
>> Let me know what you think of the latest patch attached here.
>>
>> On Wed, Feb 4, 2026 at 1:23 AM Tatsuo Ishii <[email protected]> wrote:
>>
>>> From: Tatsuo Ishii <[email protected]>
>>> Subject: Re: Proposal: Recent mutated table tracking in memory
>>> Date: Tue, 03 Feb 2026 16:43:53 +0900 (JST)
>>> Message-ID: <[email protected]>
>>>
>>> > Hi Nadav,
>>> >
>>> > Thank you for updating the patch!
>>> >
>>> >> Thank you for the comments!
>>> >>
>>> >> I agree with all of them. Let me know what you think of the changes
>>> and new
>>> >> naming.
>>> >
>>> > I still think "memory_map" is too generic. Anything put in memory for
>>> > data mapping could be called a "memory map". I recommend changing the
>>> > name to a more feature-specific one. What about replacing "memory_map"
>>> > with "track_table_mutation"? It's a slightly longer name, but it
>>> > clearly represents the feature. Any better ideas are welcome.
>>> >
>>> > - memory_map_enabled: Enable/disable the feature (default: off)
>>> > - memory_map_ttl_factor: TTL multiplier for replication delay (default:
>>> 5.0)
>>> > - memory_map_cold_start_duration: Cold start period in ms (default:
>>> 2000)
>>> > - memory_map_table_buckets: Hash buckets for table map (default: 1024)
>>> > - memory_map_table_size: Max tracked tables (default: 2048)
>>> > - memory_map_query_buckets: Hash buckets for query cache (default: 2048)
>>> > - memory_map_query_cache_size: Max cached queries (default: 10000)
>>> >
>>> > Also, I feel memory_map_query_cache_size is confusing because there's
>>> > already a "query cache" feature in pgpool. Can we change it to something
>>> > like "query_parse_cache_size"?
>>> >
>>> > Review comments:
>>> >
>>> > (1) Why is the regression test numbered 45? Shouldn't it be 42? (The last
>>> > feature test is 041.external_replication_delay.)
>>> >
>>> > (2) You enhanced the patch to deal with watchdog leader changes. That's
>>> > good. However, I don't see a test case for it in test.sh.
>>> >
>>> > (3) It seems the patch does not support TRUNCATE, MERGE, PREPARE and
>>> > WITH + updating. If so, it should be noted in the docs as a limitation
>>> > of the feature.
>>>
>>> (4) It seems the patch does not consider transactions. If an UPDATE is
>>> performed in a transaction and the transaction is rolled back, load
>>> balancing is disabled despite the fact that the table modification did
>>> not happen.
>>>
>>> Best regards,
>>> --
>>> Tatsuo Ishii
>>> SRA OSS K.K.
>>> English: http://www.sraoss.co.jp/index_en/
>>> Japanese:http://www.sraoss.co.jp
>>>
>>
>>
>> --
>> Nadav Shatz
>> Tailor Brands | CTO
>>
>
>
> --
> Nadav Shatz
> Tailor Brands | CTO
* Re: Proposal: Recent mutated table tracking in memory
2026-01-06 11:25 Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-14 08:55 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-28 05:08 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-29 08:28 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-29 08:54 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-30 08:09 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-31 17:11 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-03 07:43 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-03 23:23 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-06 11:29 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-10 15:16 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-11 10:28 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
@ 2026-02-12 09:05 ` Nadav Shatz <[email protected]>
2026-02-18 23:51 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-25 23:55 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
0 siblings, 2 replies; 44+ messages in thread
From: Nadav Shatz @ 2026-02-12 09:05 UTC (permalink / raw)
To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]
Hi Tatsuo,
Thank you for the careful review. You raised an important concern. I've
addressed it in the updated patch — here's the explanation:
The attack scenario you describe is now handled. In the updated patch,
writes inside explicit transactions are only flushed to the shared-memory
table map at COMMIT time. If the transaction is rolled back, the table is
never marked as stale. So the attack pattern:
BEGIN;
UPDATE t1 SET i = 1 WHERE FALSE;
ROLLBACK;
has zero effect on the shared-memory table map. The dml_adaptive_global
mode piggybacks on the existing dml_adaptive per-transaction write list
(transaction_temp_write_list). On COMMIT, the accumulated table names are
resolved to OIDs and flushed to shared memory. On ROLLBACK,
the list is simply discarded (the existing dml_adaptive behavior).
For autocommit statements (outside explicit transactions), tables are
marked immediately — but in that case the write is committed, so this is
correct.
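To make the commit-time flush concrete, here is a minimal standalone sketch in C. It is not the patch code: `Session`, `session_write`, `session_commit`, `session_rollback`, `mark_shared` and the fixed-size arrays are hypothetical stand-ins for the per-transaction write list and the shared-memory table map, and OID resolution is assumed to have happened already.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PENDING 16			/* hypothetical limit, for illustration only */

/* Hypothetical stand-in for the shared-memory table map. */
int			shared_marked[MAX_PENDING];
int			shared_count = 0;

/* Hypothetical per-session state; plays the role of the per-transaction write list. */
typedef struct
{
	bool		in_transaction;
	int			pending[MAX_PENDING];
	int			num_pending;
} Session;

/* Publish one table OID so that other sessions can see it. */
void
mark_shared(int table_oid)
{
	if (shared_count < MAX_PENDING)
		shared_marked[shared_count++] = table_oid;
}

void
session_begin(Session *s)
{
	s->in_transaction = true;
	s->num_pending = 0;
}

/* Record a write: buffered inside a transaction, published at once in autocommit. */
void
session_write(Session *s, int table_oid)
{
	if (!s->in_transaction)
	{
		mark_shared(table_oid);
		return;
	}
	if (s->num_pending < MAX_PENDING)
		s->pending[s->num_pending++] = table_oid;
}

/* COMMIT: flush the buffered writes to the shared map. */
void
session_commit(Session *s)
{
	for (int i = 0; i < s->num_pending; i++)
		mark_shared(s->pending[i]);
	s->num_pending = 0;
	s->in_transaction = false;
}

/* ROLLBACK: discard the buffered writes; the shared map is untouched. */
void
session_rollback(Session *s)
{
	s->num_pending = 0;
	s->in_transaction = false;
}
```

With this shape, a BEGIN / write / ROLLBACK loop never reaches `mark_shared`, while a write outside a transaction is published immediately.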
Regression test included. Test 042 now includes:
- Test 10: verifies that BEGIN; INSERT; ROLLBACK; SELECT does NOT route
the SELECT to primary
- Test 11: verifies that BEGIN; INSERT; COMMIT; SELECT DOES route the
SELECT to primary
Additional context on the threat model:
1. This feature requires disable_load_balance_on_write =
'dml_adaptive_global' — it is opt-in, not enabled by default. Operators who
enable it accept documented trade-offs (additional shared memory, TTL-based
staleness window).
2. An attacker who can connect and execute SQL against pgpool already has
the ability to cause far more damage (DROP TABLE, mass DELETEs, resource
exhaustion via expensive queries, connection flooding, etc.). The
table-marking via committed writes is a minor concern compared to
those vectors. Authentication, connection limits, and network security
are the appropriate defenses at that layer.
3. Even in the worst case (an attacker commits real writes in a loop),
the impact is bounded: the stale marking is temporary (TTL-based, typically
a few seconds), and only affects load-balancing decisions — it doesn't
cause data loss or correctness issues.
4. The existing dml_adaptive mode has analogous behavior: within a
transaction, a write to table T causes all reads of T to go to primary for
the remainder of that transaction. The only difference is scope —
dml_adaptive_global extends this across sessions with a TTL.
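The TTL bound from point 3 can be sketched as follows. This is a simplified standalone illustration, not the patch code: `compute_ttl_us` and `table_is_stale` are hypothetical names, and `DEFAULT_TTL_US` is an assumed floor of one second (the real value of TRACK_TABLE_MUTATION_DEFAULT_TTL_US is not restated here). The one-hour ceiling and the "delay times factor" rule follow the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEFAULT_TTL_US	1000000ULL				/* assumed floor: 1 second */
#define MAX_TTL_US		(3600ULL * 1000000ULL)	/* ceiling: 1 hour */

/* TTL = replication delay * factor, clamped to [DEFAULT_TTL_US, MAX_TTL_US]. */
uint64_t
compute_ttl_us(uint64_t delay_us, double factor)
{
	double		scaled = (double) delay_us * factor;

	/* Clamp in floating point first so the cast below cannot overflow. */
	if (scaled >= (double) MAX_TTL_US)
		return MAX_TTL_US;
	if (scaled < (double) DEFAULT_TTL_US)
		return DEFAULT_TTL_US;
	return (uint64_t) scaled;
}

/* A table is stale while less than ttl_us has elapsed since its last write. */
bool
table_is_stale(uint64_t last_write_us, uint64_t now_us, uint64_t ttl_us)
{
	/* A clock that appears to run backwards is treated conservatively. */
	if (now_us < last_write_us)
		return true;
	return (now_us - last_write_us) < ttl_us;
}
```

For example, with a measured delay of 400 ms and the default factor of 5.0, a committed write keeps its table routed to primary for 2 seconds, after which reads go back to the replicas.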
Thanks!
--
Nadav Shatz
Tailor Brands | CTO
Attachments:
[application/octet-stream] table_track.patch (99.1K, 3-table_track.patch)
From ad6acadf4661875c56ae8e5e901f16fafb5e78a2 Mon Sep 17 00:00:00 2001
From: Nadav Shatz <[email protected]>
Date: Tue, 6 Jan 2026 12:41:50 +0200
Subject: [PATCH] Feature: add in-memory table tracking to prevent stale reads
from replicas
Introduces 'dml_adaptive_global' as a new value for disable_load_balance_on_write.
This mode is a superset of dml_adaptive: it performs per-transaction local tracking
AND cross-session shared-memory tracking of recently written tables, routing reads
to primary until a TTL (based on measured replication delay) expires.
Sub-parameters (track_table_mutation_*) control TTL factor, cold start duration,
hash table sizing, and query parse cache sizing.
diff --git a/doc/src/sgml/loadbalance.sgml b/doc/src/sgml/loadbalance.sgml
index ee19fabebab2210cd4abe59a711a036ac0ac8943..74162ef2f81f38879c552438ee9321dfde34a4be 100644
--- a/doc/src/sgml/loadbalance.sgml
+++ b/doc/src/sgml/loadbalance.sgml
@@ -1108,6 +1108,18 @@ app_name_redirect_preference_list > database_redirect_preference_list > us
Dependent functions, triggers, and views on the tables can be configured
using <xref linkend="guc-dml-adaptive-object-relationship-list">
</para>
+
+ <para>
+ If this parameter is set to <varname>dml_adaptive_global</varname>,
+ <productname>Pgpool-II</> behaves like <varname>dml_adaptive</varname>
+ (per-transaction write tracking) and additionally uses shared memory to track
+ recently written tables across all sessions of the same <productname>Pgpool-II</> instance. When a table is
+ written in any session, subsequent reads of that table from any session are
+ routed to primary until a TTL (based on measured replication delay) expires.
+ This prevents stale reads after writes even across different connections.
+ See <xref linkend="runtime-config-table-mutation-map"> for the sub-parameters
+ that control the shared-memory tracking behavior.
+ </para>
</listitem>
</varlistentry>
@@ -1193,4 +1205,255 @@ dml_adaptive_object_relationship_list = 'table_1:table_2'
</variablelist>
</sect2>
+
+ <sect2 id="runtime-config-table-mutation-map">
+ <title>Table Mutation Map Configuration (Lagless Replica Reads)</title>
+
+ <para>
+ These parameters configure the track table mutation feature, which is activated by setting
+ <xref linkend="guc-disable-load-balance-on-write"> to <literal>dml_adaptive_global</literal>.
+ The feature tracks recently written tables to prevent stale reads from replica nodes during
+ replication lag, implementing the "lagless" architecture pattern for distributed systems
+ with read replicas.
+ </para>
+
+ <para>
+ When a table is modified (INSERT/UPDATE/DELETE), it is marked as "stale" for a TTL period
+ (<literal>replication_delay * track_table_mutation_ttl_factor</literal>). Any SELECT queries on stale tables are routed
+ to the primary node instead of replicas, ensuring read-after-write consistency.
+ </para>
+
+ <para>
+ This feature requires <xref linkend="guc-replication-delay-source-cmd"> to be configured
+ for monitoring replication delay from replicas.
+ </para>
+
+ <warning>
+ <para>
+ Enabling <literal>dml_adaptive_global</literal> increases shared memory consumption. With default settings,
+ the feature requires approximately 6.5 MB of shared memory (about 0.08 MB for table tracking plus 6.4 MB for the query parse cache).
+ Memory usage scales with the configuration parameters:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ Table tracking: <literal>track_table_mutation_table_size * 40 bytes</literal> (default: 2048 * 40 = 81,920 bytes, about 80 KB)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Query parse cache: <literal>track_table_mutation_query_parse_cache_size * 640 bytes</literal> (default: 10000 * 640 = 6,400,000 bytes, about 6.4 MB)
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ For high-traffic systems with large cache sizes (e.g., <literal>track_table_mutation_query_parse_cache_size = 100000</literal>),
+ memory usage can reach 64 MB or more. Consider your system's available shared memory when using <literal>dml_adaptive_global</literal>.
+ </para>
+ </warning>
+
+ <variablelist>
+
+ <varlistentry id="guc-track-table-mutation-ttl-factor" xreflabel="track_table_mutation_ttl_factor">
+ <term><varname>track_table_mutation_ttl_factor</varname> (<type>floating point</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_ttl_factor</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Multiplier for calculating the TTL: <literal>TTL = replication_delay * track_table_mutation_ttl_factor</literal>.
+ Higher values provide more safety margin but may reduce read replica utilization.
+ </para>
+ <para>
+ Valid range: 1.0-100.0. Default is <literal>5.0</literal>.
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-cold-start-duration" xreflabel="track_table_mutation_cold_start_duration">
+ <term><varname>track_table_mutation_cold_start_duration</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_cold_start_duration</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Duration in milliseconds during which all queries are routed to the primary after a child process starts.
+ This prevents stale reads when a new connection is established before the table mutation map
+ has been populated with recent write history.
+ </para>
+ <para>
+ When watchdog is enabled and the local node becomes the leader, Pgpool-II also triggers a
+ global cold start for this duration to avoid stale reads after leadership changes.
+ </para>
+ <para>
+ Valid range: 0-60000 ms. Default is <literal>2000</literal> (2 seconds).
+ Set to 0 to disable cold start behavior.
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-table-buckets" xreflabel="track_table_mutation_table_buckets">
+ <term><varname>track_table_mutation_table_buckets</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_table_buckets</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Number of hash buckets for the track table mutation hash table.
+ Higher values reduce hash collisions and improve lookup performance.
+ </para>
+ <para>
+ Valid range: 64-65536. Default is <literal>1024</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-table-size" xreflabel="track_table_mutation_table_size">
+ <term><varname>track_table_mutation_table_size</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_table_size</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Maximum number of tables that can be tracked simultaneously in the track table mutation map.
+ When the map is full, the oldest entries are evicted to make room for new ones.
+ </para>
+ <para>
+ Valid range: 128-131072. Default is <literal>2048</literal>.
+ Memory usage: approximately 40 bytes per entry.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-query-buckets" xreflabel="track_table_mutation_query_buckets">
+ <term><varname>track_table_mutation_query_buckets</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_query_buckets</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Number of hash buckets for the query parse cache. The cache stores normalized
+ query strings mapped to their table dependencies to avoid repeated parsing.
+ </para>
+ <para>
+ Valid range: 64-65536. Default is <literal>2048</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-query-parse-cache-size" xreflabel="track_table_mutation_query_parse_cache_size">
+ <term><varname>track_table_mutation_query_parse_cache_size</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_query_parse_cache_size</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Maximum number of query parse results to cache. Uses LRU eviction when full.
+ Larger caches reduce parsing overhead but consume more shared memory.
+ </para>
+ <para>
+ Valid range: 100-1000000. Default is <literal>10000</literal>.
+ Memory usage: approximately 640 bytes per entry (~6.4 MB for default, ~64 MB for 100000 entries).
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ <sect3 id="runtime-config-track-table-mutation-example">
+ <title>Track Table Mutation Configuration Example</title>
+ <para>
+ To enable track table mutation with replication delay monitoring:
+ </para>
+ <programlisting>
+# Enable dml_adaptive_global mode (includes track table mutation)
+disable_load_balance_on_write = 'dml_adaptive_global'
+track_table_mutation_ttl_factor = 5.0
+track_table_mutation_cold_start_duration = 2000
+
+# Configure external replication delay monitoring
+replication_delay_source_cmd = '/path/to/get-replication-delay.sh'
+replication_delay_source_timeout = 10
+
+# Adjust cache sizes based on workload (increases memory usage)
+track_table_mutation_table_size = 4096 # Track up to 4096 tables (~160 KB)
+track_table_mutation_query_parse_cache_size = 50000 # Cache 50k queries (~32 MB)
+ </programlisting>
+ <para>
+ Total shared memory required for above configuration: approximately 32.2 MB (32 MB query cache + 0.2 MB table map, plus overhead).
+ Default configuration (10000 query cache entries, 2048 tables) requires approximately 6.4 MB.
+ </para>
+ </sect3>
+
+ <sect3 id="runtime-config-track-table-mutation-limitations">
+ <title>Limitations</title>
+ <para>
+ The track table mutation feature has the following limitation:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>PREPARE</literal> statements are not tracked. When a prepared statement
+ containing data modification is executed, the table mutation is not recorded.
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ If your application uses prepared statements and requires read-after-write consistency,
+ consider using explicit transaction routing or the <literal>/*NO LOAD BALANCE*/</literal>
+ comment directive for affected queries.
+ </para>
+ <para>
+ The following statement types <emphasis>are</emphasis> tracked and will mark tables as stale:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>INSERT</literal>, <literal>UPDATE</literal>, <literal>DELETE</literal>
+ statements (including those with <literal>RETURNING</literal> clauses).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>TRUNCATE</literal> statements (including multiple tables).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>MERGE</literal> statements (PostgreSQL 15+).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>WITH</literal> clauses containing data modifications (Common Table Expressions
+ with <literal>INSERT</literal>, <literal>UPDATE</literal>, or <literal>DELETE</literal>).
+ For example, <literal>WITH deleted AS (DELETE FROM t1 RETURNING *) SELECT * FROM deleted</literal>
+ will properly mark table <literal>t1</literal> as stale.
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ <emphasis>Transaction Rollback Behavior:</emphasis> Within explicit transactions, tables
+ are marked as stale in shared memory only when the transaction is committed. If the
+ transaction is rolled back, no tables are marked, because the rolled-back changes never
+ become visible on the replicas. This prevents rolled-back transactions from unnecessarily
+ disabling load balancing. For autocommit statements (outside explicit transactions),
+ tables are marked immediately upon command completion.
+ </para>
+ </sect3>
+
+ </sect2>
+
</sect1>
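
The TTL formula and the memory figures quoted in the documentation above can be checked with a few lines of arithmetic. The following is only a sketch: the function names are hypothetical, the per-entry sizes (40 and 640 bytes) are the approximations stated in the docs, hash bucket overhead is ignored, and treating a zero delay as "unknown" is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fallback named in the patch header: 100 ms when the delay is unknown. */
#define SKETCH_DEFAULT_TTL_US (100 * 1000)

/* Approximate per-entry sizes quoted in the documentation. */
#define SKETCH_TABLE_ENTRY_BYTES 40
#define SKETCH_QUERY_ENTRY_BYTES 640

/*
 * TTL = replication_delay * ttl_factor.
 * Assumption: a delay of 0 means "unknown" and selects the fallback TTL.
 */
static uint64_t
sketch_ttl_us(uint64_t delay_us, double ttl_factor)
{
	assert(ttl_factor >= 1.0 && ttl_factor <= 100.0);	/* documented range */

	if (delay_us == 0)
		return SKETCH_DEFAULT_TTL_US;
	return (uint64_t) ((double) delay_us * ttl_factor);
}

/* Rough shared memory estimate for the two structures, buckets excluded. */
static size_t
sketch_shmem_bytes(size_t table_size, size_t cache_size)
{
	return table_size * SKETCH_TABLE_ENTRY_BYTES
		+ cache_size * SKETCH_QUERY_ENTRY_BYTES;
}
```

With the defaults (2048 tables, 10000 cached queries) this estimate gives 6,481,920 bytes, and with the example configuration (4096 tables, 50000 cached queries) it gives 32,163,840 bytes.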
diff --git a/src/Makefile.am b/src/Makefile.am
index 4678ab53055e828a37b6477801640aff17ff84a7..39588af58deba045dffc01ae932115b8a9dbfcf2 100644
--- a/src/Makefile.am
+++ b/src/Makefile.am
@@ -35,6 +35,7 @@ pgpool_SOURCES = main/main.c \
rewrite/pool_timestamp.c \
rewrite/pool_lobj.c \
utils/pool_select_walker.c \
+ utils/pool_track_table_mutation.c \
utils/strlcpy.c \
utils/psprintf.c \
utils/pool_params.c \
diff --git a/src/config/pool_config_variables.c b/src/config/pool_config_variables.c
index 68abb7f41cb96d856c824a148842748bfb7a4d12..623d8751677fd6f39d0e12f0e3e899171890f6e0 100644
--- a/src/config/pool_config_variables.c
+++ b/src/config/pool_config_variables.c
@@ -290,6 +290,7 @@ static const struct config_enum_entry disable_load_balance_on_write_options[] =
{"trans_transaction", DLBOW_TRANS_TRANSACTION, false},
{"always", DLBOW_ALWAYS, false},
{"dml_adaptive", DLBOW_DML_ADAPTIVE, false},
+ {"dml_adaptive_global", DLBOW_DML_ADAPTIVE_GLOBAL, false},
{NULL, 0, false}
};
@@ -1757,6 +1758,17 @@ static struct config_int_array ConfigureNamesIntArray[] =
static struct config_double ConfigureNamesDouble[] =
{
+ {
+ {"track_table_mutation_ttl_factor", CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "TTL multiplier for track table mutation (TTL = replication_delay * factor)",
+ CONFIG_VAR_TYPE_DOUBLE, false, 0
+ },
+ &g_pool_config.track_table_mutation_ttl_factor,
+ 5.0, /* boot value: 5x replication delay */
+ 1.0, 100.0, /* min, max */
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
EMPTY_CONFIG_DOUBLE
};
@@ -2355,6 +2367,61 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"track_table_mutation_cold_start_duration", CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "Duration in milliseconds to force queries to primary after child process starts.",
+ CONFIG_VAR_TYPE_INT, false, GUC_UNIT_MS
+ },
+ &g_pool_config.track_table_mutation_cold_start_duration,
+ 2000, /* 2 seconds */
+ 0, 60000, /* 0 to 60 seconds */
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_table_buckets", CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Number of hash buckets for track table mutation.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_table_buckets,
+ 1024,
+ 64, 65536,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_table_size", CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Maximum number of entries in track table mutation.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_table_size,
+ 2048,
+ 128, 131072,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_query_buckets", CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Number of hash buckets for query parse cache.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_query_buckets,
+ 2048,
+ 64, 65536,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_query_parse_cache_size", CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Maximum number of entries in query parse cache.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_query_parse_cache_size,
+ 10000,
+ 100, 1000000,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
EMPTY_CONFIG_INT
};
diff --git a/src/context/pool_query_context.c b/src/context/pool_query_context.c
index 7cf9813eb7d58678bc86a0aaa38bd3c6445b6687..aa123222eccaa8505f984dbe3224958fc79424c8 100644
--- a/src/context/pool_query_context.c
+++ b/src/context/pool_query_context.c
@@ -29,6 +29,7 @@
#include "utils/statistics.h"
#include "utils/pool_select_walker.h"
#include "utils/pool_stream.h"
+#include "utils/pool_track_table_mutation.h"
#include "context/pool_session_context.h"
#include "context/pool_query_context.h"
#include "parser/nodes.h"
@@ -1828,7 +1829,7 @@ is_in_list(char *name, List *list)
static bool
is_select_object_in_temp_write_list(Node *node, void *context)
{
- if (node == NULL || pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE)
+ if (node == NULL || !DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write))
return false;
if (IsA(node, RangeVar))
@@ -1836,7 +1837,7 @@ is_select_object_in_temp_write_list(Node *node, void *context)
RangeVar *rgv = (RangeVar *) node;
POOL_SESSION_CONTEXT *session_context = pool_get_session_context(false);
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && session_context->is_in_transaction)
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write) && session_context->is_in_transaction)
{
ereport(DEBUG1,
(errmsg("is_select_object_in_temp_write_list: \"%s\", found relation \"%s\"", (char *) context, rgv->relname)));
@@ -1880,7 +1881,7 @@ static char *get_associated_object_from_dml_adaptive_relations
void
check_object_relationship_list(char *name, bool is_func_name)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && pool_config->parsed_dml_adaptive_object_relationship_list)
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write) && pool_config->parsed_dml_adaptive_object_relationship_list)
{
POOL_SESSION_CONTEXT *session_context = pool_get_session_context(false);
@@ -1944,7 +1945,7 @@ add_object_into_temp_write_list(Node *node, void *context)
static void
dml_adaptive(Node *node, char *query)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE)
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write))
{
/* Set/Unset transaction status flags */
if (IsA(node, TransactionStmt))
@@ -1963,6 +1964,34 @@ dml_adaptive(Node *node, char *query)
}
else if (is_commit_or_rollback_query(node))
{
+ /*
+ * For dml_adaptive_global: on COMMIT, flush the accumulated
+ * table writes to shared memory. On ROLLBACK, skip the flush:
+ * the writes never committed, so there is no stale-read risk.
+ * This keeps rolled-back transactions from polluting the
+ * table map.
+ */
+ if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE_GLOBAL &&
+ is_commit_query(node) &&
+ session_context->transaction_temp_write_list != NIL)
+ {
+ ListCell *cell;
+ int dboid;
+
+ dboid = pool_track_table_mutation_get_database_oid();
+ if (dboid > 0)
+ {
+ foreach(cell, session_context->transaction_temp_write_list)
+ {
+ char *table_name = (char *) lfirst(cell);
+ int table_oid = pool_table_name_to_oid(table_name);
+
+ if (table_oid > 0)
+ pool_track_table_mutation_mark_table_written(table_oid, dboid);
+ }
+ }
+ }
+
session_context->is_in_transaction = false;
if (session_context->transaction_temp_write_list != NIL)
@@ -2010,6 +2039,18 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
if (dest == POOL_PRIMARY)
{
pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
+
+ /*
+ * Resolve table and database OIDs now to populate relcache.
+ * This avoids potential hangs in CommandComplete where we shouldn't
+ * be running new queries against the backend.
+ */
+ if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ int *oids;
+ pool_extract_table_oids(node, &oids);
+ pool_track_table_mutation_get_database_oid();
+ }
}
/* Should be sent to both primary and standby? */
else if (dest == POOL_BOTH)
@@ -2139,6 +2180,107 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
{
pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
}
+ /*
+ * Check track table mutation for recently written tables.
+ * If in cold start or any table was recently written,
+ * route to primary to avoid stale reads.
+ */
+ else if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ bool force_primary = false;
+
+ /* During cold start, route everything to primary */
+ if (pool_track_table_mutation_in_cold_start())
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load balance because of track table mutation cold start"),
+ errdetail("destination = PRIMARY for query= \"%s\"", query)));
+ force_primary = true;
+ }
+ else
+ {
+ /* Extract table oids and check if any are stale */
+ SelectContext ctx;
+ int dboid;
+ int num_oids;
+ int i;
+
+ memset(&ctx, 0, sizeof(ctx));
+ num_oids = pool_extract_table_oids_from_select_stmt(node, &ctx);
+ if (num_oids > 0)
+ {
+ dboid = pool_track_table_mutation_get_database_oid();
+
+ if (dboid <= 0)
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load balance because database oid was unavailable"),
+ errdetail("destination = PRIMARY for query= \"%s\"", query)));
+ force_primary = true;
+ }
+ else
+ {
+ for (i = 0; i < num_oids; i++)
+ {
+ if (pool_track_table_mutation_table_is_stale(ctx.table_oids[i], dboid))
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load balance because table \"%s\" was recently written",
+ ctx.table_names[i]),
+ errdetail("destination = PRIMARY for query= \"%s\"", query)));
+ force_primary = true;
+ break;
+ }
+ }
+ }
+ }
+ }
+
+ if (force_primary)
+ {
+ pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
+ }
+ else
+ {
+ /* Proceed with load balancing */
+ if (pool_config->statement_level_load_balance)
+ {
+ session_context->load_balance_node_id = select_load_balancing_node();
+ }
+
+ /*
+ * If the streaming replication delay is too large and
+ * prefer_lower_delay_standby is true, elect the standby
+ * with the lowest delay as the new load balance node;
+ * otherwise send the query to the primary.
+ */
+ if (STREAM && check_replication_delay(session_context->load_balance_node_id))
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load balance because of too much replication delay"),
+ errdetail("destination = %d for query= \"%s\"", dest, query)));
+
+ if (pool_config->prefer_lower_delay_standby)
+ {
+ int new_load_balancing_node = select_load_balancing_node();
+
+ session_context->load_balance_node_id = new_load_balancing_node;
+ session_context->query_context->load_balance_node_id = session_context->load_balance_node_id;
+ pool_set_node_to_be_sent(query_context, session_context->query_context->load_balance_node_id);
+ }
+ else
+ {
+ pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
+ }
+ }
+ else
+ {
+ session_context->query_context->load_balance_node_id = session_context->load_balance_node_id;
+ pool_set_node_to_be_sent(query_context,
+ session_context->query_context->load_balance_node_id);
+ }
+ }
+ }
else
{
if (pool_config->statement_level_load_balance)
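
The SELECT routing branch added to where_to_send_main_replica() above reduces to a small decision function. The sketch below restates it under simplifying assumptions: the names are hypothetical, per-table staleness is passed in as a precomputed array rather than looked up in shared memory, and the replica outcome stands for "proceed with normal load balancing", which in the patch is still subject to the existing replication delay check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum
{
	SKETCH_PRIMARY,
	SKETCH_REPLICA				/* eligible for normal load balancing */
} SketchDest;

/*
 * Decide where a SELECT may go:
 *  1. cold start                        -> primary
 *  2. tables present, database oid unknown -> primary
 *  3. any referenced table is stale     -> primary
 *  4. otherwise                         -> replica
 */
static SketchDest
sketch_route_select(bool in_cold_start, bool dboid_known,
					const bool *table_is_stale, size_t num_tables)
{
	size_t		i;

	assert(num_tables == 0 || table_is_stale != NULL);

	if (in_cold_start)
		return SKETCH_PRIMARY;
	if (num_tables > 0 && !dboid_known)
		return SKETCH_PRIMARY;
	for (i = 0; i < num_tables; i++)
	{
		if (table_is_stale[i])
			return SKETCH_PRIMARY;
	}
	return SKETCH_REPLICA;
}
```

A query that references no tables falls through to the replica path, which matches the patch: the database oid is only consulted when at least one table oid was extracted.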
diff --git a/src/context/pool_session_context.c b/src/context/pool_session_context.c
index ded41c7fc64ceba1d1fafd6f4a9f10a750872374..a9596561a7e0265e928b957a2766f46fb4e9ebaa 100644
--- a/src/context/pool_session_context.c
+++ b/src/context/pool_session_context.c
@@ -532,7 +532,7 @@ dump_sent_message(char *caller, POOL_SENT_MESSAGE *m)
static void
dml_adaptive_init(void)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE)
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write))
{
session_context->is_in_transaction = false;
session_context->transaction_temp_write_list = NIL;
@@ -542,7 +542,7 @@ dml_adaptive_init(void)
static void
dml_adaptive_destroy(void)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && session_context)
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write) && session_context)
{
if (session_context->transaction_temp_write_list != NIL)
list_free_deep(session_context->transaction_temp_write_list);
@@ -738,10 +738,10 @@ void
pool_set_writing_transaction(void)
{
/*
- * If disable_transaction_on_write is 'off' or 'dml_adaptive', then never
- * turn on writing transaction flag.
+ * If disable_load_balance_on_write is 'off' or 'dml_adaptive' or
+ * 'dml_adaptive_global', then never turn on writing transaction flag.
*/
- if (pool_config->disable_load_balance_on_write != DLBOW_OFF && pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE)
+ if (pool_config->disable_load_balance_on_write != DLBOW_OFF && !DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write))
{
pool_get_session_context(false)->writing_transaction = true;
ereport(DEBUG5,
diff --git a/src/include/pool.h b/src/include/pool.h
index ea6f87e120af866b8ed3a15790d9d8a8e009fe91..7168c1aea877856b5978de332ad636325eb9c30c 100644
--- a/src/include/pool.h
+++ b/src/include/pool.h
@@ -424,7 +424,7 @@ typedef enum
#define Min(x, y) ((x) < (y) ? (x) : (y))
-#define MAX_NUM_SEMAPHORES 8
+#define MAX_NUM_SEMAPHORES 10
#define CONN_COUNTER_SEM 0
#define REQUEST_INFO_SEM 1
#define QUERY_CACHE_STATS_SEM 2
@@ -434,6 +434,8 @@ typedef enum
#define FOLLOW_PRIMARY_SEM 6
#define MAIN_EXIT_HANDLER_SEM 7 /* used in exit_hander in pgpool main
* process */
+#define TRACK_TABLE_MUTATION_TABLE_SEM 8
+#define TRACK_TABLE_MUTATION_QUERY_SEM 9
#define MAX_REQUEST_QUEUE_SIZE 10
#define MAX_SEC_WAIT_FOR_CLUSTER_TRANSACTION 10 /* time in seconds to keep
diff --git a/src/include/pool_config.h b/src/include/pool_config.h
index 741de6cc5fc3368f813d6b6efa68eb7f8a79506b..8798b86eb3620ab36be733bb60bbb8464b0063c8 100644
--- a/src/include/pool_config.h
+++ b/src/include/pool_config.h
@@ -105,9 +105,13 @@ typedef enum DLBOW_OPTION
DLBOW_TRANSACTION,
DLBOW_TRANS_TRANSACTION,
DLBOW_ALWAYS,
- DLBOW_DML_ADAPTIVE
+ DLBOW_DML_ADAPTIVE,
+ DLBOW_DML_ADAPTIVE_GLOBAL
} DLBOW_OPTION;
+#define DLBOW_IS_DML_ADAPTIVE(opt) \
+ ((opt) == DLBOW_DML_ADAPTIVE || (opt) == DLBOW_DML_ADAPTIVE_GLOBAL)
+
typedef enum RELQTARGET_OPTION
{
RELQTARGET_PRIMARY = 1,
@@ -365,6 +369,15 @@ typedef struct
* replication check */
char *replication_delay_source_cmd; /* external command for replication delay */
int replication_delay_source_timeout; /* timeout for external command in seconds */
+
+ /* Track table mutation configuration for tracking recently written tables */
+ double track_table_mutation_ttl_factor; /* TTL multiplier for replication delay */
+ int track_table_mutation_cold_start_duration; /* Cold start duration in ms */
+ int track_table_mutation_table_buckets; /* Number of hash buckets for table map */
+ int track_table_mutation_table_size; /* Max entries in table map */
+ int track_table_mutation_query_buckets; /* Number of hash buckets for query cache */
+ int track_table_mutation_query_parse_cache_size; /* Max entries in query parse cache */
+
char *failover_command; /* execute command when failover happens */
char *follow_primary_command; /* execute command when failover is
* ended */
diff --git a/src/include/utils/pool_track_table_mutation.h b/src/include/utils/pool_track_table_mutation.h
new file mode 100644
index 0000000000000000000000000000000000000000..5cd5d4ef409645fe77e3bb02239e140456de0554
--- /dev/null
+++ b/src/include/utils/pool_track_table_mutation.h
@@ -0,0 +1,237 @@
+/* -*-pgsql-c-*- */
+/*
+ * pgpool: a language independent connection pool server for PostgreSQL
+ * written by Tatsuo Ishii
+ *
+ * Copyright (c) 2003-2026 PgPool Global Development Group
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose and without fee is hereby
+ * granted, provided that the above copyright notice appear in all
+ * copies and that both that copyright notice and this permission
+ * notice appear in supporting documentation, and that the name of the
+ * author not be used in advertising or publicity pertaining to
+ * distribution of the software without specific, written prior
+ * permission. The author makes no representations about the
+ * suitability of this software for any purpose. It is provided "as
+ * is" without express or implied warranty.
+ *
+ * pool_track_table_mutation.h: In-memory tracking of recently written tables
+ * to avoid stale reads from replicas during replication lag
+ */
+
+#ifndef POOL_TRACK_TABLE_MUTATION_H
+#define POOL_TRACK_TABLE_MUTATION_H
+
+#include "pool.h"
+#include <sys/time.h>
+
+/*
+ * Maximum table name length including schema: "schema"."table"
+ * Using NAMEDATALEN * 2 + 4 for quotes and dot
+ */
+#define TRACK_TABLE_MUTATION_TABLE_NAME_LEN (NAMEDATALEN * 2 + 4)
+
+/*
+ * Maximum number of tables we track per query
+ */
+#define TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY 8
+
+/*
+ * Invalid index marker for linked lists
+ */
+#define TRACK_TABLE_MUTATION_INVALID_INDEX (-1)
+
+/*
+ * Default TTL in microseconds (100ms) used when replication delay is unknown
+ */
+#define TRACK_TABLE_MUTATION_DEFAULT_TTL_US (100 * 1000)
+
+/*
+ * Entry in the table mutation hash table (keyed by table/database oids)
+ */
+typedef struct TrackTableMutationEntry
+{
+ int table_oid; /* Table oid */
+ int dboid; /* Database oid */
+ struct timeval last_write_time; /* When the table was last written */
+ uint32 hash; /* Pre-computed hash value */
+ int next; /* Next entry in collision chain (-1 if none) */
+ bool in_use; /* Is this entry in use? */
+} TrackTableMutationEntry;
+
+/*
+ * Header for the table mutation hash table in shared memory
+ */
+typedef struct TrackTableMutationHashTable
+{
+ int num_buckets; /* Number of hash buckets */
+ int max_entries; /* Maximum entries allowed */
+ int num_entries; /* Current number of entries */
+ int free_list_head; /* Head of free entry list */
+ /* Flexible array members follow in shared memory:
+ * int buckets[num_buckets];
+ * TrackTableMutationEntry entries[max_entries];
+ */
+} TrackTableMutationHashTable;
+
+/*
+ * Entry in the query parse cache
+ */
+typedef struct QueryParseEntry
+{
+ uint64 query_hash; /* Hash of normalized query */
+ bool is_write; /* True if INSERT/UPDATE/DELETE */
+ int num_tables; /* Number of tables in query */
+ char table_names[TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY][TRACK_TABLE_MUTATION_TABLE_NAME_LEN];
+ int next; /* Next entry in collision chain */
+ int lru_prev; /* Previous in LRU list */
+ int lru_next; /* Next in LRU list */
+ bool in_use; /* Is this entry in use? */
+} QueryParseEntry;
+
+/*
+ * Header for the query parse cache in shared memory
+ */
+typedef struct QueryParseCache
+{
+ int num_buckets; /* Number of hash buckets */
+ int max_entries; /* Maximum entries allowed */
+ int num_entries; /* Current number of entries */
+ int free_list_head; /* Head of free entry list */
+ int lru_head; /* Most recently used */
+ int lru_tail; /* Least recently used */
+ /* Flexible array members follow in shared memory:
+ * int buckets[num_buckets];
+ * QueryParseEntry entries[max_entries];
+ */
+} QueryParseCache;
+
+/*
+ * Global state for track table mutation feature
+ */
+typedef struct TrackTableMutationState
+{
+ bool initialized; /* Has shared memory been initialized? */
+ uint64 current_ttl_us; /* Current TTL in microseconds */
+ struct timeval ttl_last_updated; /* When TTL was last updated */
+ struct timeval last_cleanup_time; /* When last expired cleanup ran */
+ struct timeval global_cold_start_until; /* Global cold start end time */
+ volatile uint32 stats_queries_checked; /* Number of queries checked */
+ volatile uint32 stats_forced_primary; /* Queries forced to primary */
+ volatile uint32 stats_allowed_replica; /* Queries allowed to replica */
+} TrackTableMutationState;
+
+/*
+ * Main shared memory structure containing all components
+ */
+typedef struct TrackTableMutationShmem
+{
+ TrackTableMutationState state;
+ TrackTableMutationHashTable *table_map;
+ QueryParseCache *query_cache;
+} TrackTableMutationShmem;
+
+/* ----------------
+ * Public API functions
+ * ----------------
+ */
+
+/*
+ * Initialize shared memory structures for track table mutation.
+ * Called from pgpool_main.c after pool_init_pool_info().
+ */
+extern void pool_track_table_mutation_init(void);
+
+/*
+ * Initialize per-child process state for track table mutation.
+ * Called from child.c when a new child process starts.
+ * Sets up cold start tracking.
+ */
+extern void pool_track_table_mutation_child_init(void);
+
+/*
+ * Check if the child process is in cold start period.
+ * During cold start, all queries are routed to primary.
+ * Returns true if in cold start, false otherwise.
+ */
+extern bool pool_track_table_mutation_in_cold_start(void);
+
+/*
+ * Trigger a global cold start period for all processes.
+ * Used after watchdog leader change to avoid stale reads.
+ */
+extern void pool_track_table_mutation_trigger_global_cold_start(void);
+
+/*
+ * Get oid of current database.
+ */
+extern int pool_track_table_mutation_get_database_oid(void);
+
+/*
+ * Check if a table was recently written to (is "stale").
+ * If stale, reads from this table should go to primary.
+ * Returns true if table is stale (recently written), false otherwise.
+ */
+extern bool pool_track_table_mutation_table_is_stale(int table_oid, int dboid);
+
+/*
+ * Mark tables as recently written.
+ * Called after INSERT/UPDATE/DELETE queries complete.
+ * table_oids: array of table oids
+ * num_tables: number of tables in array
+ * dboid: database oid
+ */
+extern void pool_track_table_mutation_mark_tables_written(const int *table_oids, int num_tables, int dboid);
+
+/*
+ * Convenience function to mark a single table as written.
+ * table_oid: table oid
+ * dboid: database oid
+ */
+extern void pool_track_table_mutation_mark_table_written(int table_oid, int dboid);
+
+/*
+ * Update the TTL based on current replication delay.
+ * Called from pool_worker_child.c when replication delay is updated.
+ * delay_us: replication delay in microseconds
+ */
+extern void pool_track_table_mutation_update_ttl(uint64 delay_us);
+
+/*
+ * Look up cached parse result for a query.
+ * hash: hash of normalized query
+ * is_write: output - true if query is a write
+ * table_names: output - array to fill with table names
+ * num_tables: output - number of tables found
+ * Returns true if found in cache, false otherwise.
+ */
+extern bool pool_track_table_mutation_get_cached_parse(uint64 hash, bool *is_write,
+ char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
+ int *num_tables);
+
+/*
+ * Cache a parse result for a query.
+ * hash: hash of normalized query
+ * is_write: true if query is a write
+ * table_names: array of table names
+ * num_tables: number of tables
+ */
+extern void pool_track_table_mutation_cache_parse(uint64 hash, bool is_write,
+ const char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
+ int num_tables);
+
+/*
+ * Normalize a query and compute its hash.
+ * Strips comments, normalizes whitespace and literals.
+ * query: input SQL query string
+ * Returns: 64-bit hash of normalized query
+ */
+extern uint64 pool_track_table_mutation_normalize_and_hash(const char *query);
+
+/*
+ * Calculate required shared memory size for track table mutation.
+ */
+extern Size pool_track_table_mutation_shmem_size(void);
+
+#endif /* POOL_TRACK_TABLE_MUTATION_H */
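
The header above describes a hash table whose collision chains are stored as array indexes rather than pointers, which is what makes it usable from shared memory mapped at different addresses. A minimal sketch of that layout follows. It is not the patch's implementation: all names are hypothetical, sizes are fixed at compile time, timestamps are plain microsecond counters instead of struct timeval, the hash function is an arbitrary choice, there is no eviction, and the locking that the patch does with TRACK_TABLE_MUTATION_TABLE_SEM is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SK_BUCKETS 8
#define SK_ENTRIES 16
#define SK_INVALID (-1)

typedef struct
{
	int			table_oid;
	int			dboid;
	uint64_t	last_write_us;	/* time of last write, in microseconds */
	int			next;			/* next entry index in the chain, or SK_INVALID */
	bool		in_use;
} SkEntry;

typedef struct
{
	int			buckets[SK_BUCKETS];	/* head entry index per bucket */
	SkEntry		entries[SK_ENTRIES];
	int			num_entries;
} SkMap;

static void
sk_init(SkMap *m)
{
	int			i;

	for (i = 0; i < SK_BUCKETS; i++)
		m->buckets[i] = SK_INVALID;
	for (i = 0; i < SK_ENTRIES; i++)
	{
		m->entries[i].in_use = false;
		m->entries[i].next = SK_INVALID;
	}
	m->num_entries = 0;
}

/* Arbitrary multiplicative hash over (table_oid, dboid); an assumption. */
static unsigned
sk_bucket(int table_oid, int dboid)
{
	uint32_t	h = ((uint32_t) table_oid * 2654435761u) ^ (uint32_t) dboid;

	return h % SK_BUCKETS;
}

/* Walk the index-linked chain for the key's bucket. */
static SkEntry *
sk_find(SkMap *m, int table_oid, int dboid)
{
	int			i = m->buckets[sk_bucket(table_oid, dboid)];

	while (i != SK_INVALID)
	{
		SkEntry    *e = &m->entries[i];

		if (e->in_use && e->table_oid == table_oid && e->dboid == dboid)
			return e;
		i = e->next;
	}
	return NULL;
}

/* Record a write; returns false when the map is full (no eviction here). */
static bool
sk_mark_written(SkMap *m, int table_oid, int dboid, uint64_t now_us)
{
	SkEntry    *e = sk_find(m, table_oid, dboid);

	if (e == NULL)
	{
		unsigned	b = sk_bucket(table_oid, dboid);
		int			idx;

		if (m->num_entries >= SK_ENTRIES)
			return false;
		idx = m->num_entries++;
		e = &m->entries[idx];
		e->table_oid = table_oid;
		e->dboid = dboid;
		e->in_use = true;
		e->next = m->buckets[b];	/* push onto the front of the chain */
		m->buckets[b] = idx;
	}
	e->last_write_us = now_us;
	return true;
}

/* A table is stale while fewer than ttl_us microseconds have passed. */
static bool
sk_is_stale(SkMap *m, int table_oid, int dboid, uint64_t now_us, uint64_t ttl_us)
{
	const SkEntry *e = sk_find(m, table_oid, dboid);

	assert(m != NULL);
	return e != NULL && now_us < e->last_write_us + ttl_us;
}
```

Keying on both the table oid and the database oid keeps tables with the same oid in different databases apart, which mirrors the two-field key in TrackTableMutationEntry.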
diff --git a/src/main/pgpool_main.c b/src/main/pgpool_main.c
index fa05e15e7ac435e072298063f918c70aa4e5680c..b88b0478cb150f89bd9b6b8ab38db0d6912fddd0 100644
--- a/src/main/pgpool_main.c
+++ b/src/main/pgpool_main.c
@@ -57,6 +57,7 @@
#include "auth/pool_passwd.h"
#include "auth/pool_hba.h"
#include "query_cache/pool_memqcache.h"
+#include "utils/pool_track_table_mutation.h"
#include "watchdog/wd_internal_commands.h"
#include "watchdog/wd_lifecheck.h"
#include "watchdog/watchdog.h"
@@ -1485,11 +1486,14 @@ sigusr1_interrupt_processor(void)
if (user1SignalSlot->signalFlags[SIG_WATCHDOG_STATE_CHANGED])
{
+ WD_STATES wd_state;
+
ereport(LOG,
(errmsg("Pgpool-II parent process received watchdog state change signal from watchdog")));
user1SignalSlot->signalFlags[SIG_WATCHDOG_STATE_CHANGED] = false;
- if (wd_internal_get_watchdog_local_node_state() == WD_STANDBY)
+ wd_state = wd_internal_get_watchdog_local_node_state();
+ if (wd_state == WD_STANDBY)
{
ereport(LOG,
(errmsg("we have joined the watchdog cluster as STANDBY node"),
@@ -1503,6 +1507,10 @@ sigusr1_interrupt_processor(void)
*/
pool_release_follow_primary_lock(true);
}
+ else if (wd_state == WD_COORDINATOR && pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ pool_track_table_mutation_trigger_global_cold_start();
+ }
}
if (user1SignalSlot->signalFlags[SIG_FAILOVER_INTERRUPT])
{
@@ -3068,6 +3076,12 @@ initialize_shared_mem_objects(bool clear_memcache_oidmaps)
elog(DEBUG1, "watchdog: %zu bytes requested for shared memory", MAXALIGN(wd_ipc_get_shared_mem_size()));
}
+ if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ size += MAXALIGN(pool_track_table_mutation_shmem_size());
+ elog(DEBUG1, "track_table_mutation: %zu bytes requested for shared memory", MAXALIGN(pool_track_table_mutation_shmem_size()));
+ }
+
initialize_shared_memory_main_segment(size);
/* Move the backend descriptors to shared memory */
@@ -3184,6 +3198,12 @@ initialize_shared_mem_objects(bool clear_memcache_oidmaps)
wd_ipc_initialize_data();
}
+ /* Initialize track table mutation for tracking recently written tables */
+ if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ pool_track_table_mutation_init();
+ }
+
}
/*
diff --git a/src/protocol/CommandComplete.c b/src/protocol/CommandComplete.c
index a3b8f0ea194ffecc79e58566be80562a46eb75ab..b15db53248433cb3112246274ed771b79abe1392 100644
--- a/src/protocol/CommandComplete.c
+++ b/src/protocol/CommandComplete.c
@@ -38,6 +38,8 @@
#include "utils/palloc.h"
#include "utils/memutils.h"
#include "utils/pool_stream.h"
+#include "utils/pool_track_table_mutation.h"
+#include "query_cache/pool_memqcache.h"
static int extract_ntuples(char *message);
static POOL_STATUS handle_mismatch_tuples(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend, char *packet, int packetlen, bool command_complete);
@@ -304,6 +306,29 @@ handle_query_context(POOL_CONNECTION_POOL *backend)
node = session_context->query_context->parse_tree;
+ /*
+ * Track table writes for dml_adaptive_global feature.
+ * For autocommit statements (not in explicit transaction), mark tables
+ * immediately. For explicit transactions, marking is deferred to COMMIT
+ * in dml_adaptive() so that ROLLBACKed writes don't pollute the shared
+ * memory table map.
+ */
+ if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE_GLOBAL &&
+ node != NULL && !session_context->is_in_transaction)
+ {
+ int *oids;
+ int num_oids;
+
+ num_oids = pool_extract_table_oids(node, &oids);
+ if (num_oids > 0)
+ {
+ int dboid = pool_track_table_mutation_get_database_oid();
+
+ if (dboid > 0)
+ pool_track_table_mutation_mark_tables_written(oids, num_oids, dboid);
+ }
+ }
+
if (IsA(node, PrepareStmt))
{
if (session_context->uncompleted_message)
diff --git a/src/protocol/child.c b/src/protocol/child.c
index c34f057281be62feaf39db1bb605062f56dc398c..d53f571421968bd789d0b55f97e0a1eb68a813e5 100644
--- a/src/protocol/child.c
+++ b/src/protocol/child.c
@@ -57,6 +57,7 @@
#include "utils/elog.h"
#include "utils/ps_status.h"
#include "utils/timestamp.h"
+#include "utils/pool_track_table_mutation.h"
#include "context/pool_process_context.h"
#include "context/pool_session_context.h"
@@ -213,6 +214,12 @@ do_child(int *fds)
/* Initialize per process context */
pool_init_process_context();
+ /* Initialize track table mutation child state for cold start tracking */
+ if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ pool_track_table_mutation_child_init();
+ }
+
/* initialize connection pool */
if (pool_init_cp())
{
diff --git a/src/protocol/pool_proto_modules.c b/src/protocol/pool_proto_modules.c
index 47b5c8f98a5b4c92d675840eea88f7e03bb18b4c..75fc7508480d79aacc281dd5e624f9e34a998833 100644
--- a/src/protocol/pool_proto_modules.c
+++ b/src/protocol/pool_proto_modules.c
@@ -1461,7 +1461,7 @@ Parse(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend,
pool_where_to_send(query_context, query_context->original_query,
query_context->parse_tree);
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && strlen(name) != 0)
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write) && strlen(name) != 0)
pool_setall_node_to_be_sent(query_context);
if (REPLICATION)
@@ -1804,7 +1804,7 @@ Bind(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend,
return POOL_END;
}
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE &&
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write) &&
TSTATE(backend, MAIN_REPLICA ? PRIMARY_NODE_ID : REAL_MAIN_NODE_ID) == 'T')
{
pool_where_to_send(query_context, query_context->original_query,
diff --git a/src/query_cache/pool_memqcache.c b/src/query_cache/pool_memqcache.c
index f38f711469576342ce59469b085c97365116004c..dca93334e9e47bb7978064edece5ca0e40021ce3 100644
--- a/src/query_cache/pool_memqcache.c
+++ b/src/query_cache/pool_memqcache.c
@@ -1305,6 +1305,12 @@ pool_extract_table_oids(Node *node, int **oidsp)
}
return num_oids;
}
+ else if (IsA(node, MergeStmt))
+ {
+ MergeStmt *stmt = (MergeStmt *) node;
+
+ table = make_table_name_from_rangevar(stmt->relation);
+ }
else if (IsA(node, ExplainStmt))
{
ListCell *cell;
diff --git a/src/sample/pgpool.conf.sample-stream b/src/sample/pgpool.conf.sample-stream
index 454fdb9e5d1fd65437b6a67f12ab62658ea08f49..de99a7a97ba4a1a03cb3d5589d55ea61cb6e51fa 100644
--- a/src/sample/pgpool.conf.sample-stream
+++ b/src/sample/pgpool.conf.sample-stream
@@ -478,6 +478,14 @@ backend_clustering_mode = streaming_replication
# modified within the current explicit transaction will
# not be load balanced until the end of the transaction.
#
+ # dml_adaptive_global:
+ # Superset of dml_adaptive. In addition to per-transaction
+ # tracking, uses shared memory to track recently written
+ # tables across all sessions. Reads from recently written
+ # tables are routed to primary until a TTL (based on
+ # replication delay) expires. Requires additional shared
+ # memory. See track_table_mutation_* parameters below.
+ #
# always:
# if a write query is issued, read queries will
# not be load balanced until the session ends.
@@ -499,6 +507,46 @@ backend_clustering_mode = streaming_replication
#statement_level_load_balance = off
# Enables statement level load balancing
+# - Track Table Mutation (used by dml_adaptive_global) -
+ # WARNING: dml_adaptive_global increases shared memory usage
+ # Default settings require ~6.4 MB shared memory
+ # (0.1 MB table tracking + 6.3 MB query cache)
+
+#track_table_mutation_ttl_factor = 5.0
+ # TTL multiplier: TTL = replication_delay * factor
+ # Higher values provide more safety margin
+ # Range: 1.0-100.0 (default: 5.0)
+ # (change requires reload)
+
+#track_table_mutation_cold_start_duration = 2000
+ # Duration in milliseconds to route all queries to primary
+ # after child process starts (cold start period)
+ # Range: 0-60000 ms (default: 2000 ms = 2 seconds)
+ # Set to 0 to disable cold start behavior
+ # (change requires reload)
+
+#track_table_mutation_table_buckets = 1024
+ # Number of hash buckets for track table mutation
+ # Higher values reduce hash collisions
+ # Range: 64-65536 (default: 1024)
+ # (change requires restart)
+
+#track_table_mutation_table_size = 2048
+ # Maximum number of tables to track simultaneously
+ # Range: 128-131072 (default: 2048)
+ # (change requires restart)
+
+#track_table_mutation_query_buckets = 2048
+ # Number of hash buckets for query parse cache
+ # Range: 64-65536 (default: 2048)
+ # (change requires restart)
+
+#track_table_mutation_query_parse_cache_size = 10000
+ # Maximum number of query parse results to cache
+ # Range: 100-1000000 (default: 10000)
+ # Memory usage: ~640 bytes per entry (~6.3 MB default, ~64 MB for 100000)
+ # (change requires restart)
+
#------------------------------------------------------------------------------
# STREAMING REPLICATION MODE
#------------------------------------------------------------------------------
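The TTL rule documented in the sample configuration above (TTL = replication_delay * factor) can be sketched as follows. This is a minimal illustration only: sketch_ttl_us is a hypothetical name, and the real pool_track_table_mutation_update_ttl() may clamp or round differently.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: derive the dirty-table TTL
 * (microseconds) from the largest observed standby delay (microseconds)
 * and the track_table_mutation_ttl_factor setting.
 */
uint64_t
sketch_ttl_us(uint64_t max_delay_us, double ttl_factor)
{
	/* The sample configuration documents a 1.0-100.0 range for the factor. */
	assert(ttl_factor >= 1.0 && ttl_factor <= 100.0);

	return (uint64_t) ((double) max_delay_us * ttl_factor);
}
```

With the default factor of 5.0, a maximum standby delay of 200 ms (200000 us) would give a TTL of one second under this sketch.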
diff --git a/src/streaming_replication/pool_worker_child.c b/src/streaming_replication/pool_worker_child.c
index 311b638658e66ebb56162ad9fa4392315b2df64e..347f88a88688309b298311a282fe1c1ef2aa0f73 100644
--- a/src/streaming_replication/pool_worker_child.c
+++ b/src/streaming_replication/pool_worker_child.c
@@ -58,6 +58,7 @@
#include "utils/pool_ip.h"
#include "utils/ps_status.h"
#include "utils/pool_stream.h"
+#include "utils/pool_track_table_mutation.h"
#include "context/pool_process_context.h"
#include "context/pool_session_context.h"
@@ -695,6 +696,7 @@ check_replication_time_lag_with_cmd(void)
double delay_ms;
uint64 delay;
uint64 delay_threshold_by_time;
+ uint64 max_delay_us = 0; /* Track maximum delay for table mutation map */
int token_count = 0;
int primary_node_id;
int save_errno;
@@ -1003,6 +1005,10 @@ check_replication_time_lag_with_cmd(void)
bkinfo->standby_delay = delay;
bkinfo->standby_delay_by_time = true;
+ /* Track maximum delay for table mutation map TTL calculation */
+ if (delay > max_delay_us)
+ max_delay_us = delay;
+
/*
* Log delay if necessary. threshold is in milliseconds, convert
* to microseconds.
@@ -1021,6 +1027,10 @@ check_replication_time_lag_with_cmd(void)
token = strtok_r(NULL, " \t\n", &saveptr);
}
+ /* Update track table mutation TTL based on maximum observed delay */
+ if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE_GLOBAL && max_delay_us > 0)
+ pool_track_table_mutation_update_ttl(max_delay_us);
+
}
PG_CATCH();
{
diff --git a/src/test/regression/tests/042.track_table_mutation/test.sh b/src/test/regression/tests/042.track_table_mutation/test.sh
new file mode 100755
index 0000000000000000000000000000000000000000..ede56bd1968711fc15f45784c6958e12a4e4e589
--- /dev/null
+++ b/src/test/regression/tests/042.track_table_mutation/test.sh
@@ -0,0 +1,352 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for track table mutation feature (in-memory table tracking).
+# Tests routing of queries based on recently written tables.
+#
+source $TESTLIBS
+TESTDIR=testdir
+PSQL=$PGBIN/psql
+PSQLOPTS="-a -q -X"
+PGPOOLBIN=$PGPOOL_INSTALL_DIR/bin
+export PGDATABASE=test
+
+# Only run in streaming replication mode since that's the target use case
+for mode in s
+do
+ rm -fr $TESTDIR
+ mkdir $TESTDIR
+ cd $TESTDIR
+
+ # Create test environment with 2 nodes
+ echo -n "creating test environment..."
+ $PGPOOL_SETUP -m $mode -n 2 || exit 1
+ echo "done."
+
+ source ./bashrc.ports
+
+ # Configure track table mutation feature via dml_adaptive_global
+ echo "disable_load_balance_on_write = 'dml_adaptive_global'" >> etc/pgpool.conf
+ echo "track_table_mutation_ttl_factor = 5.0" >> etc/pgpool.conf
+ echo "track_table_mutation_cold_start_duration = 2000" >> etc/pgpool.conf
+
+ # Enable load balancing explicitly
+ echo "load_balance_mode = on" >> etc/pgpool.conf
+
+ # Configure weights so we can distinguish routing
+ # Backend 0 (primary) weight=0, Backend 1 (standby) weight=1
+ # This means load balanced queries go to node 1 by default
+ echo "backend_weight0 = 0" >> etc/pgpool.conf
+ echo "backend_weight1 = 1" >> etc/pgpool.conf
+
+ # Enable debug logging to see routing decisions
+ echo "log_min_messages = debug1" >> etc/pgpool.conf
+
+ ./startall
+
+ export PGPORT=$PGPOOL_PORT
+
+ wait_for_pgpool_startup
+
+ # Create test tables
+ $PSQL test <<EOF
+CREATE TABLE t1(i INTEGER);
+CREATE TABLE t2(i INTEGER);
+CREATE TABLE t3(i INTEGER);
+EOF
+
+ echo "=== Test 1: Cold Start Routing ==="
+ # During cold start, all queries should go to primary
+ # Restart pgpool to trigger cold start
+ ./shutdownall
+ ./startall
+ wait_for_pgpool_startup
+
+ # Immediately query - should go to primary due to cold start
+ $PSQL test -c "SELECT 'cold_start_test' as marker, * FROM t1;" > /dev/null 2>&1
+
+ # Check log for cold start message (use -a to handle binary log files)
+ if grep -a -q "could not load balance because of track table mutation cold start" log/pgpool.log; then
+ echo "Test 1 PASSED: Cold start routing works"
+ else
+ echo "Test 1 FAILED: Cold start routing not detected"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 2: Wait for cold start to end ==="
+ # Wait for cold start period to end (2 seconds)
+ sleep 3
+
+ # Clear the log
+ > log/pgpool.log
+
+ # Now a clean table query should load balance (go to node 1)
+ $PSQL test -c "SELECT 'after_cold_start' as marker, * FROM t3;" > /dev/null 2>&1
+
+ # After cold start, queries to clean tables should load balance
+ # Check that it did NOT get forced to primary due to track table mutation
+ if grep -a -q "could not load balance because of track table mutation cold start" log/pgpool.log; then
+ echo "Test 2 FAILED: Still in cold start after waiting"
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 2 PASSED: Cold start ended correctly"
+
+ echo "=== Test 3: Write-then-Read Routing ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Write to t1 and then read - use single connection to ensure same session
+ $PSQL test <<EOF
+INSERT INTO t1 VALUES (1);
+SELECT 'write_read_test' as marker, * FROM t1;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ # Check log for table staleness message
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 3 PASSED: Write-then-read routing works"
+ else
+ echo "Test 3 FAILED: Table staleness not detected after write"
+ # Show relevant log entries for debugging
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 4: Clean Table Still Load Balances ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Read from t2 (never written to) - should load balance
+ $PSQL test -c "SELECT 'clean_table_test' as marker, * FROM t2;" > /dev/null 2>&1
+
+ # Should NOT see track table mutation blocking message for t2
+ if grep -a -q "could not load balance because table.*t2.*was recently written" log/pgpool.log; then
+ echo "Test 4 FAILED: Clean table incorrectly marked as stale"
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 4 PASSED: Clean tables still load balance"
+
+ echo "=== Test 5: UPDATE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Update t2 and then read - use single connection
+ $PSQL test <<EOF
+UPDATE t2 SET i = 999 WHERE i = 0;
+SELECT 'update_test' as marker, * FROM t2;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 5 PASSED: UPDATE marks table as stale"
+ else
+ echo "Test 5 FAILED: UPDATE did not mark table as stale"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 6: DELETE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Delete from t3 and then read - use single connection
+ $PSQL test <<EOF
+DELETE FROM t3 WHERE i = 0;
+SELECT 'delete_test' as marker, * FROM t3;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 6 PASSED: DELETE marks table as stale"
+ else
+ echo "Test 6 FAILED: DELETE did not mark table as stale"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 7: TRUNCATE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create a fresh table for TRUNCATE test
+ $PSQL test -c "CREATE TABLE t_truncate(i INTEGER);" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_truncate VALUES (1), (2), (3);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log again
+ > log/pgpool.log
+
+ # Truncate and then read - use single connection
+ $PSQL test <<EOF
+TRUNCATE t_truncate;
+SELECT 'truncate_test' as marker, * FROM t_truncate;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 7 PASSED: TRUNCATE marks table as stale"
+ else
+ echo "Test 7 FAILED: TRUNCATE did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 8: WITH Clause (CTE with DELETE) Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create a fresh table for WITH test
+ $PSQL test -c "CREATE TABLE t_cte(i INTEGER);" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_cte VALUES (1), (2), (3);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log again
+ > log/pgpool.log
+
+ # Use WITH clause with DELETE, then read from the table
+ $PSQL test <<EOF
+WITH deleted AS (DELETE FROM t_cte WHERE i = 1 RETURNING *)
+SELECT * FROM deleted;
+SELECT 'cte_test' as marker, * FROM t_cte;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 8 PASSED: WITH clause (CTE) marks table as stale"
+ else
+ echo "Test 8 FAILED: WITH clause (CTE) did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ # Test 9: MERGE (PostgreSQL 15+ only)
+ PG_MAJOR_VERSION=$($PSQL -t -c "SELECT substring(version() from 'PostgreSQL ([0-9]+)');" | tr -d ' ')
+ if [ "$PG_MAJOR_VERSION" -ge 15 ] 2>/dev/null; then
+ echo "=== Test 9: MERGE Marks Table as Stale (PostgreSQL $PG_MAJOR_VERSION) ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create tables for MERGE test
+ $PSQL test -c "CREATE TABLE t_merge_target(id INTEGER PRIMARY KEY, val TEXT);" > /dev/null 2>&1
+ $PSQL test -c "CREATE TABLE t_merge_source(id INTEGER, val TEXT);" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_merge_target VALUES (1, 'old');" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_merge_source VALUES (1, 'new'), (2, 'insert');" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log again
+ > log/pgpool.log
+
+ # Use MERGE, then read from the target table
+ $PSQL test <<EOF
+MERGE INTO t_merge_target t
+USING t_merge_source s ON t.id = s.id
+WHEN MATCHED THEN UPDATE SET val = s.val
+WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.val);
+SELECT 'merge_test' as marker, * FROM t_merge_target;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 9 PASSED: MERGE marks table as stale"
+ else
+ echo "Test 9 FAILED: MERGE did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+ else
+ echo "=== Test 9: MERGE skipped (requires PostgreSQL 15+, have $PG_MAJOR_VERSION) ==="
+ fi
+
+ echo "=== Test 10: ROLLBACK Does NOT Mark Table as Stale ==="
+ # Create a fresh table for rollback test
+ $PSQL test -c "CREATE TABLE t_rollback(i INTEGER);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log
+ > log/pgpool.log
+
+ # Write inside a transaction, then rollback
+ $PSQL test <<EOF
+BEGIN;
+INSERT INTO t_rollback VALUES (1);
+ROLLBACK;
+SELECT 'rollback_test' as marker, * FROM t_rollback;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ # Should NOT see t_rollback marked as stale since the write was rolled back
+ if grep -a -q "could not load balance because table.*t_rollback.*was recently written" log/pgpool.log; then
+ echo "Test 10 FAILED: Rolled-back write incorrectly marked table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 10 PASSED: ROLLBACK does not mark table as stale"
+
+ echo "=== Test 11: COMMIT Marks Table as Stale ==="
+ # Create a fresh table for commit test
+ $PSQL test -c "CREATE TABLE t_commit(i INTEGER);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log
+ > log/pgpool.log
+
+ # Write inside a transaction, then commit, then read
+ $PSQL test <<EOF
+BEGIN;
+INSERT INTO t_commit VALUES (1);
+COMMIT;
+SELECT 'commit_test' as marker, * FROM t_commit;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 11 PASSED: COMMIT marks table as stale"
+ else
+ echo "Test 11 FAILED: Committed write did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ echo ""
+ echo "=== All Track Table Mutation Tests PASSED ==="
+
+ ./shutdownall
+
+ cd ..
+done
+
+exit 0
diff --git a/src/test/regression/tests/043.track_table_mutation_watchdog/.gitignore b/src/test/regression/tests/043.track_table_mutation_watchdog/.gitignore
new file mode 100644
index 0000000000000000000000000000000000000000..fcb93d27a7e7e8a5efe6eacfb0f88f6f3c8bc765
--- /dev/null
+++ b/src/test/regression/tests/043.track_table_mutation_watchdog/.gitignore
@@ -0,0 +1,3 @@
+leader
+standby
+*.pid
diff --git a/src/test/regression/tests/043.track_table_mutation_watchdog/leader.conf b/src/test/regression/tests/043.track_table_mutation_watchdog/leader.conf
new file mode 100644
index 0000000000000000000000000000000000000000..945cff9860d0357fbb0e3e9a5643124d916bd9c3
--- /dev/null
+++ b/src/test/regression/tests/043.track_table_mutation_watchdog/leader.conf
@@ -0,0 +1,25 @@
+# leader watchdog config for track_table_mutation watchdog test
+use_watchdog = on
+wd_interval = 1
+wd_priority = 2
+
+hostname0 = 'localhost'
+wd_port0 = 21004
+pgpool_port0 = 11000
+hostname1 = 'localhost'
+wd_port1 = 21104
+pgpool_port1 = 11100
+
+heartbeat_hostname0 = 'localhost'
+heartbeat_port0 = 21005
+heartbeat_hostname1 = 'localhost'
+heartbeat_port1 = 21105
+
+enable_consensus_with_half_votes = on
+
+# Enable track table mutation feature via dml_adaptive_global
+disable_load_balance_on_write = 'dml_adaptive_global'
+track_table_mutation_cold_start_duration = 2000
+
+# Enable debug logging to see feature messages
+log_min_messages = debug1
diff --git a/src/test/regression/tests/043.track_table_mutation_watchdog/standby.conf b/src/test/regression/tests/043.track_table_mutation_watchdog/standby.conf
new file mode 100644
index 0000000000000000000000000000000000000000..a11c3dfca427cf6b246451d067c30b0255b9c4ce
--- /dev/null
+++ b/src/test/regression/tests/043.track_table_mutation_watchdog/standby.conf
@@ -0,0 +1,27 @@
+# standby watchdog config for track_table_mutation watchdog test
+port = 11100
+pcp_port = 11105
+use_watchdog = on
+wd_interval = 1
+wd_priority = 1
+
+hostname0 = 'localhost'
+wd_port0 = 21004
+pgpool_port0 = 11000
+hostname1 = 'localhost'
+wd_port1 = 21104
+pgpool_port1 = 11100
+
+heartbeat_hostname0 = 'localhost'
+heartbeat_port0 = 21005
+heartbeat_hostname1 = 'localhost'
+heartbeat_port1 = 21105
+
+enable_consensus_with_half_votes = on
+
+# Enable track table mutation feature via dml_adaptive_global
+disable_load_balance_on_write = 'dml_adaptive_global'
+track_table_mutation_cold_start_duration = 2000
+
+# Enable debug logging to see feature messages
+log_min_messages = debug1
diff --git a/src/test/regression/tests/043.track_table_mutation_watchdog/test.sh b/src/test/regression/tests/043.track_table_mutation_watchdog/test.sh
new file mode 100755
index 0000000000000000000000000000000000000000..752a6e6aa377fe0c54244975e606648101c98cf8
--- /dev/null
+++ b/src/test/regression/tests/043.track_table_mutation_watchdog/test.sh
@@ -0,0 +1,179 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for track table mutation global cold start on watchdog leader change.
+# Tests that when the watchdog leader changes, the new leader triggers
+# a global cold start to force all queries to primary.
+#
+source $TESTLIBS
+LEADER_DIR=leader
+STANDBY_DIR=standby
+PSQL=$PGBIN/psql
+success_count=0
+
+rm -fr $LEADER_DIR
+rm -fr $STANDBY_DIR
+
+mkdir $LEADER_DIR
+mkdir $STANDBY_DIR
+
+# change into the leader directory
+cd $LEADER_DIR
+
+# create leader environment with streaming replication
+echo -n "creating leader pgpool..."
+$PGPOOL_SETUP -m s -n 2 -p 11000 || exit 1
+echo "leader setup done."
+
+# copy the configurations to standby
+cp -r etc ../$STANDBY_DIR/
+
+source ./bashrc.ports
+cat ../leader.conf >> etc/pgpool.conf
+echo 0 > etc/pgpool_node_id
+
+./startall
+wait_for_pgpool_startup
+
+# back to test root dir
+cd ..
+
+# create standby environment
+mkdir $STANDBY_DIR/log
+echo -n "creating standby pgpool..."
+cat standby.conf >> $STANDBY_DIR/etc/pgpool.conf
+# since we reuse the leader's pgpool-II configuration, use a different pid file path
+echo "pid_file_name = '$PWD/pgpool2.pid'" >> $STANDBY_DIR/etc/pgpool.conf
+echo 1 > $STANDBY_DIR/etc/pgpool_node_id
+# start the standby pgpool-II by hand
+$PGPOOL_INSTALL_DIR/bin/pgpool -D -n -f $STANDBY_DIR/etc/pgpool.conf -F $STANDBY_DIR/etc/pcp.conf -a $STANDBY_DIR/etc/pool_hba.conf > $STANDBY_DIR/log/pgpool.log 2>&1 &
+
+# Test 1: Check if leader pgpool-II started correctly
+echo "=== Test 1: Waiting for the pgpool leader... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep "I am the cluster leader node. Starting escalation process" $LEADER_DIR/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 1 PASSED: Leader brought up successfully."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 1 ]; then
+ echo "Test 1 FAILED: Leader did not start"
+ $PGPOOL_INSTALL_DIR/bin/pgpool -f $STANDBY_DIR/etc/pgpool.conf -m f stop 2>/dev/null
+ cd $LEADER_DIR && ./shutdownall
+ exit 1
+fi
+
+# Test 2: Check if standby has successfully joined
+echo "=== Test 2: Waiting for the standby to join cluster... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep "successfully joined the watchdog cluster as standby node" $STANDBY_DIR/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 2 PASSED: Standby successfully connected."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 2 ]; then
+ echo "Test 2 FAILED: Standby did not join cluster"
+ $PGPOOL_INSTALL_DIR/bin/pgpool -f $STANDBY_DIR/etc/pgpool.conf -m f stop 2>/dev/null
+ cd $LEADER_DIR && ./shutdownall
+ exit 1
+fi
+
+# Test 3: Verify track_table_mutation is enabled and working on leader
+echo "=== Test 3: Verify track_table_mutation is enabled ==="
+if grep -a "track_table_mutation: initialized" $LEADER_DIR/log/pgpool.log > /dev/null 2>&1; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 3 PASSED: track_table_mutation initialized on leader"
+else
+ echo "Test 3 FAILED: track_table_mutation not initialized on leader"
+ $PGPOOL_INSTALL_DIR/bin/pgpool -f $STANDBY_DIR/etc/pgpool.conf -m f stop 2>/dev/null
+ cd $LEADER_DIR && ./shutdownall
+ exit 1
+fi
+
+# Test 4: Stop leader pgpool and trigger failover
+echo "=== Test 4: Triggering leader failover... ==="
+$PGPOOL_INSTALL_DIR/bin/pgpool -f $LEADER_DIR/etc/pgpool.conf -m f stop
+
+echo "Checking if the Standby pgpool-II detected the leader shutdown..."
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep -a " is shutting down" $STANDBY_DIR/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 4 PASSED: Leader shutdown detected."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 4 ]; then
+ echo "Test 4 FAILED: Leader shutdown not detected"
+ $PGPOOL_INSTALL_DIR/bin/pgpool -f $STANDBY_DIR/etc/pgpool.conf -m f stop 2>/dev/null
+ cd $LEADER_DIR && ./shutdownall
+ exit 1
+fi
+
+# Test 5: Verify standby becomes new leader
+echo "=== Test 5: Checking if standby takes over as leader... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep -a "I am the cluster leader node. Starting escalation process" $STANDBY_DIR/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 5 PASSED: Standby became the new leader."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 5 ]; then
+ echo "Test 5 FAILED: Standby did not become leader"
+ $PGPOOL_INSTALL_DIR/bin/pgpool -f $STANDBY_DIR/etc/pgpool.conf -m f stop 2>/dev/null
+ cd $LEADER_DIR && ./shutdownall
+ exit 1
+fi
+
+# Test 6: Verify global cold start was triggered on new leader
+echo "=== Test 6: Checking if global cold start was triggered... ==="
+# The new leader should trigger global cold start when it becomes coordinator
+# Look for the log message that indicates global cold start was triggered
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep -a "track_table_mutation: entering global cold start" $STANDBY_DIR/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 6 PASSED: Global cold start triggered on new leader."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+# Cleanup
+$PGPOOL_INSTALL_DIR/bin/pgpool -f $STANDBY_DIR/etc/pgpool.conf -m f stop 2>/dev/null
+cd $LEADER_DIR
+./shutdownall
+
+echo ""
+echo "$success_count out of 6 successful"
+
+if test $success_count -eq 6
+then
+ echo "=== All Track Table Mutation Watchdog Tests PASSED ==="
+ exit 0
+fi
+
+exit 1
diff --git a/src/utils/pool_track_table_mutation.c b/src/utils/pool_track_table_mutation.c
new file mode 100644
index 0000000000000000000000000000000000000000..27d4f0380d43a237f518c60cdd73aba2ff51b723
--- /dev/null
+++ b/src/utils/pool_track_table_mutation.c
@@ -0,0 +1,1188 @@
+/* -*-pgsql-c-*- */
+/*
+ * pgpool: a language independent connection pool server for PostgreSQL
+ * written by Tatsuo Ishii
+ *
+ * Copyright (c) 2003-2026 PgPool Global Development Group
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose and without fee is hereby
+ * granted, provided that the above copyright notice appear in all
+ * copies and that both that copyright notice and this permission
+ * notice appear in supporting documentation, and that the name of the
+ * author not be used in advertising or publicity pertaining to
+ * distribution of the software without specific, written prior
+ * permission. The author makes no representations about the
+ * suitability of this software for any purpose. It is provided "as
+ * is" without express or implied warranty.
+ *
+ * pool_track_table_mutation.c: In-memory tracking of recently written tables
+ * to avoid stale reads from replicas during replication lag
+ *
+ * Based on the "lagless" architecture from Tailor Brands:
+ * https://medium.com/tailor-tech/using-database-read-replicas-in-distributed-systems-d80eaf6bbf8a
+ */
+
+#include <string.h>
+#include <stdlib.h>
+#include <sys/time.h>
+#include <unistd.h>
+
+#include "pool.h"
+#include "pool_config.h"
+#include "context/pool_session_context.h"
+#include "utils/pool_track_table_mutation.h"
+#include "utils/elog.h"
+#include "utils/pool_ipc.h"
+#include "utils/palloc.h"
+#include "utils/pool_relcache.h"
+
+#define DATABASE_TO_OID_QUERY "SELECT oid FROM pg_catalog.pg_database WHERE datname = '%s'"
+
+/* ----------------
+ * Local variables
+ * ----------------
+ */
+
+/* Pointer to shared memory structure */
+static TrackTableMutationShmem *track_table_mutation_shmem = NULL;
+
+/* Per-process cold start tracking (not in shared memory) */
+static struct timeval process_start_time;
+static bool cold_start_initialized = false;
+
+/* ----------------
+ * Helper macros for accessing flexible arrays in shared memory
+ * ----------------
+ */
+
+/* Get pointer to bucket array in table map */
+#define TABLE_MAP_BUCKETS(map) \
+ ((int *)((char *)(map) + sizeof(TrackTableMutationHashTable)))
+
+/* Get pointer to entry array in table map */
+#define TABLE_MAP_ENTRIES(map) \
+ ((TrackTableMutationEntry *)((char *)(map) + sizeof(TrackTableMutationHashTable) + \
+ (map)->num_buckets * sizeof(int)))
+
+/* Get pointer to bucket array in query cache */
+#define QUERY_CACHE_BUCKETS(cache) \
+ ((int *)((char *)(cache) + sizeof(QueryParseCache)))
+
+/* Get pointer to entry array in query cache */
+#define QUERY_CACHE_ENTRIES(cache) \
+ ((QueryParseEntry *)((char *)(cache) + sizeof(QueryParseCache) + \
+ (cache)->num_buckets * sizeof(int)))
+
+/* ----------------
+ * Semaphore lock helpers
+ * ----------------
+ */
+
+static inline void
+table_map_lock(void)
+{
+ pool_semaphore_lock(TRACK_TABLE_MUTATION_TABLE_SEM);
+}
+
+static inline void
+table_map_unlock(void)
+{
+ pool_semaphore_unlock(TRACK_TABLE_MUTATION_TABLE_SEM);
+}
+
+static inline void
+query_cache_lock(void)
+{
+ pool_semaphore_lock(TRACK_TABLE_MUTATION_QUERY_SEM);
+}
+
+static inline void
+query_cache_unlock(void)
+{
+ pool_semaphore_unlock(TRACK_TABLE_MUTATION_QUERY_SEM);
+}
+
+/* ----------------
+ * Hash functions
+ * ----------------
+ */
+
+/*
+ * FNV-1a hash for table/database oid pair
+ */
+static uint32
+fnv1a_hash_table_key(int table_oid, int dboid)
+{
+ uint32 hash = 2166136261u; /* FNV offset basis */
+ uint32 data[2];
+ const unsigned char *bytes;
+ size_t i;
+
+ data[0] = (uint32) table_oid;
+ data[1] = (uint32) dboid;
+ bytes = (const unsigned char *) data;
+
+ for (i = 0; i < sizeof(data); i++)
+ {
+ hash ^= bytes[i];
+ hash *= 16777619u; /* FNV prime */
+ }
+
+ return hash;
+}
+
+/*
+ * 64-bit FNV-1a hash of a byte string
+ */
+static uint64
+fnv1a_hash_64(const char *str, size_t len)
+{
+ uint64 hash = 14695981039346656037ULL; /* FNV offset basis for 64-bit */
+ size_t i;
+
+ for (i = 0; i < len; i++)
+ {
+ hash ^= (uint8)str[i];
+ hash *= 1099511628211ULL; /* FNV prime for 64-bit */
+ }
+
+ return hash;
+}
+
+/* ----------------
+ * Time utilities
+ * ----------------
+ */
+
+/*
+ * Get elapsed time in microseconds between two timevals
+ */
+static int64
+elapsed_us(struct timeval *start, struct timeval *end)
+{
+ return ((int64)(end->tv_sec - start->tv_sec) * 1000000) +
+ (end->tv_usec - start->tv_usec);
+}
+
+/*
+ * Get current time
+ */
+static void
+get_current_time(struct timeval *tv)
+{
+ gettimeofday(tv, NULL);
+}
+
+/* ----------------
+ * Database oid lookup
+ * ----------------
+ */
+
+static int
+track_table_mutation_get_database_oid_internal(void)
+{
+ int oid = 0;
+ static POOL_RELCACHE *relcache;
+ POOL_CONNECTION_POOL *backend;
+ POOL_SESSION_CONTEXT *session_context;
+
+ /* Safety check: must have shmem initialized */
+ if (track_table_mutation_shmem == NULL)
+ return oid;
+
+ session_context = pool_get_session_context(false);
+ if (session_context == NULL)
+ return oid;
+
+ /* Ensure we have a valid query context */
+ if (session_context->query_context == NULL)
+ return oid;
+
+ backend = session_context->backend;
+ if (backend == NULL || MAIN_CONNECTION(backend) == NULL || MAIN_CONNECTION(backend)->sp == NULL)
+ return oid;
+
+ /* Ensure database name is valid */
+ if (MAIN_CONNECTION(backend)->sp->database == NULL)
+ return oid;
+
+ if (!relcache)
+ {
+ relcache = pool_create_relcache(pool_config->relcache_size,
+ DATABASE_TO_OID_QUERY,
+ int_register_func,
+ int_unregister_func,
+ false);
+ if (relcache == NULL)
+ {
+ ereport(LOG,
+ (errmsg("track_table_mutation: error creating relcache while getting database OID")));
+ return oid;
+ }
+ }
+
+ oid = (int) (intptr_t) pool_search_relcache(relcache, backend,
+ MAIN_CONNECTION(backend)->sp->database);
+ return oid;
+}
+
+int
+pool_track_table_mutation_get_database_oid(void)
+{
+ return track_table_mutation_get_database_oid_internal();
+}
+
+/* ----------------
+ * Table mutation hash table operations
+ * ----------------
+ */
+
+/*
+ * Initialize table mutation hash table
+ */
+static void
+table_map_init(TrackTableMutationHashTable *map, int num_buckets, int max_entries)
+{
+ int *buckets;
+ TrackTableMutationEntry *entries;
+ int i;
+
+ map->num_buckets = num_buckets;
+ map->max_entries = max_entries;
+ map->num_entries = 0;
+ map->free_list_head = 0;
+
+ buckets = TABLE_MAP_BUCKETS(map);
+ entries = TABLE_MAP_ENTRIES(map);
+
+ /* Initialize all buckets to empty */
+ for (i = 0; i < num_buckets; i++)
+ buckets[i] = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ /* Initialize free list - chain all entries */
+ for (i = 0; i < max_entries; i++)
+ {
+ entries[i].in_use = false;
+ entries[i].next = (i < max_entries - 1) ? i + 1 : TRACK_TABLE_MUTATION_INVALID_INDEX;
+ }
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: initialized table map with %d buckets, %d max entries",
+ num_buckets, max_entries)));
+}
+
+/*
+ * Allocate an entry from the free list
+ */
+static int
+table_map_alloc_entry(TrackTableMutationHashTable *map)
+{
+ TrackTableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ int idx;
+
+ if (map->free_list_head == TRACK_TABLE_MUTATION_INVALID_INDEX)
+ return TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ idx = map->free_list_head;
+ map->free_list_head = entries[idx].next;
+ entries[idx].in_use = true;
+ entries[idx].next = TRACK_TABLE_MUTATION_INVALID_INDEX;
+ map->num_entries++;
+
+ return idx;
+}
+
+/*
+ * Free an entry back to the free list
+ */
+static void
+table_map_free_entry(TrackTableMutationHashTable *map, int idx)
+{
+ TrackTableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+
+ entries[idx].in_use = false;
+ entries[idx].next = map->free_list_head;
+ map->free_list_head = idx;
+ map->num_entries--;
+}
+
+/*
+ * Look up a table in the hash table
+ * Returns entry index or TRACK_TABLE_MUTATION_INVALID_INDEX if not found
+ * Must be called with lock held
+ */
+static int
+table_map_lookup(TrackTableMutationHashTable *map, int table_oid, int dboid, uint32 hash)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TrackTableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ int bucket = hash % map->num_buckets;
+ int idx = buckets[bucket];
+
+ while (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ if (entries[idx].hash == hash &&
+ entries[idx].table_oid == table_oid &&
+ entries[idx].dboid == dboid)
+ {
+ return idx;
+ }
+ idx = entries[idx].next;
+ }
+
+ return TRACK_TABLE_MUTATION_INVALID_INDEX;
+}
+
+/*
+ * Insert or update a table entry
+ * Must be called with lock held
+ */
+static void
+table_map_insert(TrackTableMutationHashTable *map, int table_oid, int dboid,
+ uint32 hash, struct timeval *write_time)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TrackTableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ int bucket = hash % map->num_buckets;
+ int idx;
+
+ /* Check if entry already exists */
+ idx = table_map_lookup(map, table_oid, dboid, hash);
+ if (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ /* Update existing entry */
+ entries[idx].last_write_time = *write_time;
+ return;
+ }
+
+ /* Allocate new entry */
+ idx = table_map_alloc_entry(map);
+ if (idx == TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ int b;
+ /* Table is full - evict an entry */
+ /* For simplicity, just use the first entry in first non-empty bucket */
+ for (b = 0; b < map->num_buckets; b++)
+ {
+ if (buckets[b] != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ int victim = buckets[b];
+ buckets[b] = entries[victim].next;
+ table_map_free_entry(map, victim);
+ idx = table_map_alloc_entry(map);
+ break;
+ }
+ }
+
+ if (idx == TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ ereport(WARNING,
+ (errmsg("track_table_mutation: failed to allocate entry for table oid %d (dboid %d)",
+ table_oid, dboid)));
+ return;
+ }
+ }
+
+ /* Initialize new entry */
+ entries[idx].table_oid = table_oid;
+ entries[idx].dboid = dboid;
+ entries[idx].hash = hash;
+ entries[idx].last_write_time = *write_time;
+
+ /* Insert at head of bucket chain */
+ entries[idx].next = buckets[bucket];
+ buckets[bucket] = idx;
+
+ ereport(DEBUG2,
+ (errmsg("track_table_mutation: marked table oid %d (dboid %d) as written",
+ table_oid, dboid)));
+}
+
+/*
+ * Remove expired entries from the table map
+ * Must be called with lock held
+ */
+static void
+table_map_cleanup_expired(TrackTableMutationHashTable *map, uint64 ttl_us)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TrackTableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ struct timeval now;
+ int removed = 0;
+ int b;
+
+ get_current_time(&now);
+
+ for (b = 0; b < map->num_buckets; b++)
+ {
+ int *prev_ptr = &buckets[b];
+ int idx = buckets[b];
+
+ while (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ int64 elapsed = elapsed_us(&entries[idx].last_write_time, &now);
+
+ if (elapsed > (int64)ttl_us)
+ {
+ /* Entry has expired - remove it */
+ int next = entries[idx].next;
+ *prev_ptr = next;
+ table_map_free_entry(map, idx);
+ idx = next;
+ removed++;
+ }
+ else
+ {
+ prev_ptr = &entries[idx].next;
+ idx = entries[idx].next;
+ }
+ }
+ }
+
+ if (removed > 0)
+ {
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: cleaned up %d expired table entries", removed)));
+ }
+}
+
+/* ----------------
+ * Query parse cache operations
+ * ----------------
+ */
+
+/*
+ * Initialize query parse cache
+ */
+static void
+query_cache_init(QueryParseCache *cache, int num_buckets, int max_entries)
+{
+ int *buckets;
+ QueryParseEntry *entries;
+ int i;
+
+ cache->num_buckets = num_buckets;
+ cache->max_entries = max_entries;
+ cache->num_entries = 0;
+ cache->free_list_head = 0;
+ cache->lru_head = TRACK_TABLE_MUTATION_INVALID_INDEX;
+ cache->lru_tail = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ buckets = QUERY_CACHE_BUCKETS(cache);
+ entries = QUERY_CACHE_ENTRIES(cache);
+
+ /* Initialize all buckets to empty */
+ for (i = 0; i < num_buckets; i++)
+ buckets[i] = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ /* Initialize free list */
+ for (i = 0; i < max_entries; i++)
+ {
+ entries[i].in_use = false;
+ entries[i].next = (i < max_entries - 1) ? i + 1 : TRACK_TABLE_MUTATION_INVALID_INDEX;
+ entries[i].lru_prev = TRACK_TABLE_MUTATION_INVALID_INDEX;
+ entries[i].lru_next = TRACK_TABLE_MUTATION_INVALID_INDEX;
+ }
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: initialized query cache with %d buckets, %d max entries",
+ num_buckets, max_entries)));
+}
+
+/*
+ * Move entry to front of LRU list (most recently used)
+ */
+static void
+query_cache_lru_touch(QueryParseCache *cache, int idx)
+{
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+
+ /* Already at head? */
+ if (cache->lru_head == idx)
+ return;
+
+ /* Remove from current position */
+ if (entries[idx].lru_prev != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ entries[entries[idx].lru_prev].lru_next = entries[idx].lru_next;
+ if (entries[idx].lru_next != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ entries[entries[idx].lru_next].lru_prev = entries[idx].lru_prev;
+ if (cache->lru_tail == idx)
+ cache->lru_tail = entries[idx].lru_prev;
+
+ /* Insert at head */
+ entries[idx].lru_prev = TRACK_TABLE_MUTATION_INVALID_INDEX;
+ entries[idx].lru_next = cache->lru_head;
+ if (cache->lru_head != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ entries[cache->lru_head].lru_prev = idx;
+ cache->lru_head = idx;
+ if (cache->lru_tail == TRACK_TABLE_MUTATION_INVALID_INDEX)
+ cache->lru_tail = idx;
+}
+
+/*
+ * Add entry to LRU list (at head)
+ */
+static void
+query_cache_lru_add(QueryParseCache *cache, int idx)
+{
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+
+ entries[idx].lru_prev = TRACK_TABLE_MUTATION_INVALID_INDEX;
+ entries[idx].lru_next = cache->lru_head;
+
+ if (cache->lru_head != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ entries[cache->lru_head].lru_prev = idx;
+
+ cache->lru_head = idx;
+
+ if (cache->lru_tail == TRACK_TABLE_MUTATION_INVALID_INDEX)
+ cache->lru_tail = idx;
+}
+
+/*
+ * Remove entry from LRU list
+ */
+static void
+query_cache_lru_remove(QueryParseCache *cache, int idx)
+{
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+
+ if (entries[idx].lru_prev != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ entries[entries[idx].lru_prev].lru_next = entries[idx].lru_next;
+ else
+ cache->lru_head = entries[idx].lru_next;
+
+ if (entries[idx].lru_next != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ entries[entries[idx].lru_next].lru_prev = entries[idx].lru_prev;
+ else
+ cache->lru_tail = entries[idx].lru_prev;
+
+ entries[idx].lru_prev = TRACK_TABLE_MUTATION_INVALID_INDEX;
+ entries[idx].lru_next = TRACK_TABLE_MUTATION_INVALID_INDEX;
+}
+
+/*
+ * Allocate entry from free list, evicting LRU if necessary
+ */
+static int
+query_cache_alloc_entry(QueryParseCache *cache)
+{
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+ int *buckets = QUERY_CACHE_BUCKETS(cache);
+ int idx;
+
+ if (cache->free_list_head != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ idx = cache->free_list_head;
+ cache->free_list_head = entries[idx].next;
+ entries[idx].in_use = true;
+ entries[idx].next = TRACK_TABLE_MUTATION_INVALID_INDEX;
+ cache->num_entries++;
+ return idx;
+ }
+
+ /* No free entries - evict LRU */
+ if (cache->lru_tail == TRACK_TABLE_MUTATION_INVALID_INDEX)
+ return TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ idx = cache->lru_tail;
+
+ /* Remove from hash bucket */
+ int bucket = entries[idx].query_hash % cache->num_buckets;
+ int *prev_ptr = &buckets[bucket];
+ int curr = buckets[bucket];
+
+ while (curr != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ if (curr == idx)
+ {
+ *prev_ptr = entries[curr].next;
+ break;
+ }
+ prev_ptr = &entries[curr].next;
+ curr = entries[curr].next;
+ }
+
+ /* Remove from LRU list */
+ query_cache_lru_remove(cache, idx);
+
+ /* Reinitialize entry */
+ entries[idx].in_use = true;
+ entries[idx].next = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ return idx;
+}
+
+/*
+ * Look up a query in the cache
+ */
+static int
+query_cache_lookup(QueryParseCache *cache, uint64 hash)
+{
+ int *buckets = QUERY_CACHE_BUCKETS(cache);
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+ int bucket = hash % cache->num_buckets;
+ int idx = buckets[bucket];
+
+ while (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ if (entries[idx].query_hash == hash)
+ return idx;
+ idx = entries[idx].next;
+ }
+
+ return TRACK_TABLE_MUTATION_INVALID_INDEX;
+}
+
+/* ----------------
+ * Query normalization
+ * ----------------
+ */
+
+/*
+ * Simple query normalization:
+ * - Strip comments (-- style and C-style block comments)
+ * - Collapse whitespace
+ * - Convert to lowercase (except inside strings)
+ * - Replace literal values with placeholders
+ *
+ * This is a simplified version - pgpool2 already does this elsewhere,
+ * but we need a standalone version for the track table mutation feature.
+ */
+static size_t
+normalize_query(const char *query, char *output, size_t output_size)
+{
+ const char *src = query;
+ char *dst = output;
+ char *dst_end = output + output_size - 1;
+ bool in_string = false;
+ char string_char = 0;
+ bool last_was_space = true; /* Start true to skip leading space */
+
+ while (*src && dst < dst_end)
+ {
+ /* Handle string literals */
+ if (in_string)
+ {
+ if (*src == string_char)
+ {
+ if (*(src + 1) == string_char)
+ {
+ /* Escaped quote */
+ src += 2;
+ continue;
+ }
+ in_string = false;
+ *dst++ = '$'; /* Replace string content with placeholder */
+ }
+ src++;
+ continue;
+ }
+
+ /* Check for string start */
+ if (*src == '\'' || *src == '"')
+ {
+ in_string = true;
+ string_char = *src;
+ src++;
+ continue;
+ }
+
+ /* Handle single-line comments */
+ if (*src == '-' && *(src + 1) == '-')
+ {
+ while (*src && *src != '\n')
+ src++;
+ continue;
+ }
+
+ /* Handle multi-line comments */
+ if (*src == '/' && *(src + 1) == '*')
+ {
+ src += 2;
+ while (*src && !(*src == '*' && *(src + 1) == '/'))
+ src++;
+ if (*src)
+ src += 2;
+ continue;
+ }
+
+ /* Handle whitespace */
+ if (*src == ' ' || *src == '\t' || *src == '\n' || *src == '\r')
+ {
+ if (!last_was_space)
+ {
+ *dst++ = ' ';
+ last_was_space = true;
+ }
+ src++;
+ continue;
+ }
+
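+ /*
+ * Handle numeric literals - replace with placeholder. A digit that
+ * directly follows an identifier character (as in "t1") is part of
+ * that identifier, not a literal, so it falls through to the regular
+ * character handling; otherwise "t1" and "t2" would hash identically.
+ */
+ if (((*src >= '0' && *src <= '9') ||
+ (*src == '.' && *(src + 1) >= '0' && *(src + 1) <= '9')) &&
+ !(dst > output &&
+ ((*(dst - 1) >= 'a' && *(dst - 1) <= 'z') ||
+ (*(dst - 1) >= '0' && *(dst - 1) <= '9') ||
+ *(dst - 1) == '_')))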
+ {
+ while (*src && ((*src >= '0' && *src <= '9') || *src == '.'))
+ src++;
+ if (!last_was_space && dst > output && *(dst - 1) != '$')
+ *dst++ = '$';
+ last_was_space = false;
+ continue;
+ }
+
+ /* Regular character - convert to lowercase */
+ if (*src >= 'A' && *src <= 'Z')
+ *dst++ = *src + 32;
+ else
+ *dst++ = *src;
+
+ last_was_space = false;
+ src++;
+ }
+
+ /* Remove trailing space */
+ if (dst > output && *(dst - 1) == ' ')
+ dst--;
+
+ *dst = '\0';
+ return dst - output;
+}
+
+/* ----------------
+ * Public API implementation
+ * ----------------
+ */
+
+Size
+pool_track_table_mutation_shmem_size(void)
+{
+ Size size = 0;
+ int table_buckets = pool_config->track_table_mutation_table_buckets;
+ int table_size = pool_config->track_table_mutation_table_size;
+ int query_buckets = pool_config->track_table_mutation_query_buckets;
+ int query_cache_size = pool_config->track_table_mutation_query_parse_cache_size;
+
+ /* Main structure */
+ size += sizeof(TrackTableMutationShmem);
+
+ /* Table mutation hash table */
+ size += sizeof(TrackTableMutationHashTable);
+ size += table_buckets * sizeof(int); /* buckets array */
+ size += table_size * sizeof(TrackTableMutationEntry); /* entries array */
+
+ /* Query parse cache */
+ size += sizeof(QueryParseCache);
+ size += query_buckets * sizeof(int); /* buckets array */
+ size += query_cache_size * sizeof(QueryParseEntry); /* entries array */
+
+ return size;
+}
+
+void
+pool_track_table_mutation_init(void)
+{
+#ifndef POOL_PRIVATE
+ Size shmem_size;
+ char *shmem_ptr;
+
+ if (pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: feature disabled")));
+ return;
+ }
+
+ shmem_size = pool_track_table_mutation_shmem_size();
+
+ /*
+ * Allocate from the main shared memory segment.
+ * Memory is already zeroed by initialize_shared_memory_main_segment().
+ */
+ shmem_ptr = pool_shared_memory_segment_get_chunk(shmem_size);
+ if (shmem_ptr == NULL)
+ {
+ ereport(ERROR,
+ (errmsg("track_table_mutation: failed to allocate %zu bytes of shared memory",
+ shmem_size)));
+ return;
+ }
+
+ /* Set up pointers to structures within shared memory */
+ track_table_mutation_shmem = (TrackTableMutationShmem *)shmem_ptr;
+ shmem_ptr += sizeof(TrackTableMutationShmem);
+
+ track_table_mutation_shmem->table_map = (TrackTableMutationHashTable *)shmem_ptr;
+ shmem_ptr += sizeof(TrackTableMutationHashTable);
+ shmem_ptr += pool_config->track_table_mutation_table_buckets * sizeof(int);
+ shmem_ptr += pool_config->track_table_mutation_table_size * sizeof(TrackTableMutationEntry);
+
+ track_table_mutation_shmem->query_cache = (QueryParseCache *)shmem_ptr;
+
+ /* Initialize structures */
+ table_map_init(track_table_mutation_shmem->table_map,
+ pool_config->track_table_mutation_table_buckets,
+ pool_config->track_table_mutation_table_size);
+
+ query_cache_init(track_table_mutation_shmem->query_cache,
+ pool_config->track_table_mutation_query_buckets,
+ pool_config->track_table_mutation_query_parse_cache_size);
+
+ /* Initialize global state */
+ track_table_mutation_shmem->state.initialized = true;
+ track_table_mutation_shmem->state.current_ttl_us = TRACK_TABLE_MUTATION_DEFAULT_TTL_US;
+ get_current_time(&track_table_mutation_shmem->state.ttl_last_updated);
+ get_current_time(&track_table_mutation_shmem->state.last_cleanup_time);
+ track_table_mutation_shmem->state.global_cold_start_until.tv_sec = 0;
+ track_table_mutation_shmem->state.global_cold_start_until.tv_usec = 0;
+ track_table_mutation_shmem->state.stats_queries_checked = 0;
+ track_table_mutation_shmem->state.stats_forced_primary = 0;
+ track_table_mutation_shmem->state.stats_allowed_replica = 0;
+
+ ereport(LOG,
+ (errmsg("track_table_mutation: initialized with %zu bytes shared memory",
+ shmem_size)));
+#endif
+}
+
+void
+pool_track_table_mutation_child_init(void)
+{
+ if (pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE_GLOBAL || track_table_mutation_shmem == NULL)
+ return;
+
+ get_current_time(&process_start_time);
+ cold_start_initialized = true;
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: child initialized, cold start period %d ms",
+ pool_config->track_table_mutation_cold_start_duration)));
+}
+
+bool
+pool_track_table_mutation_in_cold_start(void)
+{
+ struct timeval now;
+ int64 elapsed_ms;
+
+ if (pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE_GLOBAL || track_table_mutation_shmem == NULL)
+ return false;
+
+ if (pool_config->track_table_mutation_cold_start_duration <= 0)
+ return false;
+
+ get_current_time(&now);
+
+ /* Check for watchdog-triggered global cold start first */
+ if (track_table_mutation_shmem->state.global_cold_start_until.tv_sec != 0 &&
+ elapsed_us(&now, &track_table_mutation_shmem->state.global_cold_start_until) > 0)
+ {
+ return true;
+ }
+
+ /* Check per-process cold start */
+ if (!cold_start_initialized)
+ return false;
+
+ elapsed_ms = elapsed_us(&process_start_time, &now) / 1000;
+
+ if (elapsed_ms < pool_config->track_table_mutation_cold_start_duration)
+ {
+ ereport(DEBUG2,
+ (errmsg("track_table_mutation: in cold start (%ld/%d ms)",
+ (long)elapsed_ms, pool_config->track_table_mutation_cold_start_duration)));
+ return true;
+ }
+
+ return false;
+}
+
+void
+pool_track_table_mutation_trigger_global_cold_start(void)
+{
+ struct timeval now;
+ int duration_ms;
+
+ if (pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE_GLOBAL || track_table_mutation_shmem == NULL)
+ return;
+
+ duration_ms = pool_config->track_table_mutation_cold_start_duration;
+ if (duration_ms <= 0)
+ return;
+
+ get_current_time(&now);
+ track_table_mutation_shmem->state.global_cold_start_until = now;
+ track_table_mutation_shmem->state.global_cold_start_until.tv_sec += duration_ms / 1000;
+ track_table_mutation_shmem->state.global_cold_start_until.tv_usec += (duration_ms % 1000) * 1000;
+ if (track_table_mutation_shmem->state.global_cold_start_until.tv_usec >= 1000000)
+ {
+ track_table_mutation_shmem->state.global_cold_start_until.tv_sec +=
+ track_table_mutation_shmem->state.global_cold_start_until.tv_usec / 1000000;
+ track_table_mutation_shmem->state.global_cold_start_until.tv_usec %=
+ 1000000;
+ }
+
+ ereport(LOG,
+ (errmsg("track_table_mutation: entering global cold start for %d ms",
+ duration_ms)));
+}
+
+bool
+pool_track_table_mutation_table_is_stale(int table_oid, int dboid)
+{
+ TrackTableMutationHashTable *map;
+ struct timeval now;
+ uint64 ttl_us;
+ uint32 hash;
+ int idx;
+ bool is_stale = false;
+
+ if (pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE_GLOBAL || track_table_mutation_shmem == NULL)
+ return false;
+
+ if (table_oid <= 0 || dboid <= 0)
+ {
+ is_stale = true;
+ goto update_stats;
+ }
+
+ map = track_table_mutation_shmem->table_map;
+ hash = fnv1a_hash_table_key(table_oid, dboid);
+
+ table_map_lock();
+
+ idx = table_map_lookup(map, table_oid, dboid, hash);
+ if (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ TrackTableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ get_current_time(&now);
+ ttl_us = track_table_mutation_shmem->state.current_ttl_us;
+
+ int64 elapsed = elapsed_us(&entries[idx].last_write_time, &now);
+ is_stale = (elapsed < (int64)ttl_us);
+
+ ereport(DEBUG2,
+ (errmsg("track_table_mutation: table oid %d (dboid %d) elapsed=%ld us, ttl=%lu us, stale=%d",
+ table_oid, dboid, (long)elapsed, (unsigned long)ttl_us, is_stale)));
+ }
+
+ table_map_unlock();
+
+update_stats:
+ /* Update statistics - skip if shmem not available */
+ if (track_table_mutation_shmem != NULL)
+ {
+ __sync_fetch_and_add(&track_table_mutation_shmem->state.stats_queries_checked, 1);
+ if (is_stale)
+ __sync_fetch_and_add(&track_table_mutation_shmem->state.stats_forced_primary, 1);
+ else
+ __sync_fetch_and_add(&track_table_mutation_shmem->state.stats_allowed_replica, 1);
+ }
+
+ return is_stale;
+}
+
+void
+pool_track_table_mutation_mark_tables_written(const int *table_oids, int num_tables, int dboid)
+{
+ TrackTableMutationHashTable *map;
+ struct timeval now;
+ int i;
+
+ if (pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE_GLOBAL || track_table_mutation_shmem == NULL)
+ return;
+
+ if (num_tables <= 0 || table_oids == NULL || dboid <= 0)
+ return;
+
+ map = track_table_mutation_shmem->table_map;
+ get_current_time(&now);
+
+ table_map_lock();
+
+ /* Periodically clean up expired entries */
+ if (map->num_entries > map->max_entries * 3 / 4)
+ {
+ /* Limit cleanup frequency to avoid O(N) scan on every write */
+ /* 100ms interval */
+ if (elapsed_us(&track_table_mutation_shmem->state.last_cleanup_time, &now) > 100000)
+ {
+ table_map_cleanup_expired(map, track_table_mutation_shmem->state.current_ttl_us);
+ track_table_mutation_shmem->state.last_cleanup_time = now;
+ }
+ }
+
+ for (i = 0; i < num_tables; i++)
+ {
+ uint32 hash;
+ int table_oid = table_oids[i];
+
+ if (table_oid > 0)
+ {
+ hash = fnv1a_hash_table_key(table_oid, dboid);
+ table_map_insert(map, table_oid, dboid, hash, &now);
+ }
+ }
+
+ table_map_unlock();
+}
+
+/*
+ * Convenience function to mark a single table as written
+ */
+void
+pool_track_table_mutation_mark_table_written(int table_oid, int dboid)
+{
+ if (table_oid > 0 && dboid > 0)
+ {
+ const int tables[1] = { table_oid };
+ pool_track_table_mutation_mark_tables_written(tables, 1, dboid);
+ }
+}
+
+void
+pool_track_table_mutation_update_ttl(uint64 delay_us)
+{
+ uint64 new_ttl;
+
+ if (pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE_GLOBAL || track_table_mutation_shmem == NULL)
+ return;
+
+ /* Calculate new TTL: delay * factor, with minimum of default TTL */
+ new_ttl = (uint64)(delay_us * pool_config->track_table_mutation_ttl_factor);
+ if (new_ttl < TRACK_TABLE_MUTATION_DEFAULT_TTL_US)
+ new_ttl = TRACK_TABLE_MUTATION_DEFAULT_TTL_US;
+
+ /* Maximum TTL of 1 hour */
+ if (new_ttl > 3600ULL * 1000000ULL)
+ new_ttl = 3600ULL * 1000000ULL;
+
+ track_table_mutation_shmem->state.current_ttl_us = new_ttl;
+ get_current_time(&track_table_mutation_shmem->state.ttl_last_updated);
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: updated TTL to %lu us (delay=%lu us, factor=%.1f)",
+ (unsigned long)new_ttl, (unsigned long)delay_us,
+ pool_config->track_table_mutation_ttl_factor)));
+}
+
+bool
+pool_track_table_mutation_get_cached_parse(uint64 hash, bool *is_write,
+ char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
+ int *num_tables)
+{
+ QueryParseCache *cache;
+ int idx;
+ bool found = false;
+
+ if (pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE_GLOBAL || track_table_mutation_shmem == NULL)
+ return false;
+
+ cache = track_table_mutation_shmem->query_cache;
+
+ query_cache_lock();
+
+ idx = query_cache_lookup(cache, hash);
+ if (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+ int i;
+
+ *is_write = entries[idx].is_write;
+ *num_tables = entries[idx].num_tables;
+
+ for (i = 0; i < entries[idx].num_tables && i < TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY; i++)
+ {
+ strlcpy(table_names[i], entries[idx].table_names[i], TRACK_TABLE_MUTATION_TABLE_NAME_LEN);
+ }
+
+ /* Move to front of LRU */
+ query_cache_lru_touch(cache, idx);
+ found = true;
+ }
+
+ query_cache_unlock();
+
+ return found;
+}
+
+void
+pool_track_table_mutation_cache_parse(uint64 hash, bool is_write,
+ const char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
+ int num_tables)
+{
+ QueryParseCache *cache;
+ int *buckets;
+ QueryParseEntry *entries;
+ int idx;
+ int bucket;
+
+ if (pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE_GLOBAL || track_table_mutation_shmem == NULL)
+ return;
+
+ cache = track_table_mutation_shmem->query_cache;
+
+ query_cache_lock();
+
+ /* Check if already exists */
+ idx = query_cache_lookup(cache, hash);
+ if (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ query_cache_unlock();
+ return;
+ }
+
+ /* Allocate new entry (may evict LRU) */
+ idx = query_cache_alloc_entry(cache);
+ if (idx == TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ query_cache_unlock();
+ ereport(WARNING,
+ (errmsg("track_table_mutation: failed to allocate query cache entry")));
+ return;
+ }
+
+ entries = QUERY_CACHE_ENTRIES(cache);
+ buckets = QUERY_CACHE_BUCKETS(cache);
+
+ /* Fill in entry */
+ entries[idx].query_hash = hash;
+ entries[idx].is_write = is_write;
+ entries[idx].num_tables = (num_tables > TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY) ?
+ TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY : num_tables;
+
+ {
+ int i;
+ for (i = 0; i < entries[idx].num_tables; i++)
+ {
+ strlcpy(entries[idx].table_names[i], table_names[i], TRACK_TABLE_MUTATION_TABLE_NAME_LEN);
+ }
+ }
+
+ /* Insert into hash bucket */
+ bucket = hash % cache->num_buckets;
+ entries[idx].next = buckets[bucket];
+ buckets[bucket] = idx;
+
+ /* Add to LRU list */
+ query_cache_lru_add(cache, idx);
+
+ query_cache_unlock();
+}
+
+uint64
+pool_track_table_mutation_normalize_and_hash(const char *query)
+{
+ char normalized[8192];
+ size_t len;
+
+ if (query == NULL || query[0] == '\0')
+ return 0;
+
+ len = normalize_query(query, normalized, sizeof(normalized));
+ if (len == 0)
+ return 0;
+
+ return fnv1a_hash_64(normalized, len);
+}
--
2.52.0
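For readers who want to check the query-cache key computation in isolation, here is a minimal standalone sketch of 64-bit FNV-1a using the same offset basis and prime as fnv1a_hash_64() in the patch. The name fnv1a_64 and the use of <stdint.h> types (instead of pgpool's uint64/uint8) are choices made for this sketch only.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * 64-bit FNV-1a over a byte buffer: XOR each byte into the hash,
 * then multiply by the FNV prime (modulo 2^64 via unsigned overflow).
 */
static uint64_t
fnv1a_64(const char *buf, size_t len)
{
	uint64_t	hash = 14695981039346656037ULL;	/* FNV offset basis */
	size_t		i;

	for (i = 0; i < len; i++)
	{
		hash ^= (uint8_t) buf[i];
		hash *= 1099511628211ULL;	/* FNV prime */
	}

	return hash;
}
```

An empty buffer returns the offset basis unchanged. Since query_cache_lookup() in the patch compares entries by hash value only, two normalized queries that hash to the same value are treated as the same cache entry.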
* Re: Proposal: Recent mutated table tracking in memory
2026-01-06 11:25 Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-14 08:55 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-28 05:08 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-29 08:28 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-29 08:54 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-30 08:09 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-31 17:11 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-03 07:43 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-03 23:23 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-06 11:29 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-10 15:16 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-11 10:28 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-12 09:05 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
@ 2026-02-18 23:51 ` Tatsuo Ishii <[email protected]>
2026-02-19 04:40 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
1 sibling, 1 reply; 44+ messages in thread
From: Tatsuo Ishii @ 2026-02-18 23:51 UTC (permalink / raw)
To: [email protected]; +Cc: [email protected]
> Hi Tatsuo,
>
> Thank you for the careful review. You raised an important concern. I've
> addressed it in the updated patch - here's the explanation:
>
> The attack scenario you describe is now handled. In the updated patch,
> writes inside explicit transactions are only flushed to the shared-memory
> table map at COMMIT time. If the transaction is rolled back, the table is
> never marked as stale. So the attack pattern:
>
> BEGIN;
> UPDATE t1 SET i = 1 WHERE FALSE;
> ROLLBACK;
>
> has zero effect on the shared-memory table map. The dml_adaptive_global
> mode piggybacks on the existing dml_adaptive per-transaction write list
> (transaction_temp_write_list). On COMMIT, the accumulated table names are
> resolved to OIDs and flushed to shared memory. On ROLLBACK,
> the list is simply discarded (the existing dml_adaptive behavior).
>
> For autocommit statements (outside explicit transactions), tables are
> marked immediately - but in that case the write is committed, so this is
> correct.
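The commit-time flush described above can be sketched roughly as follows. This is a simplified illustration, not the patch's code: PendingWrites, SharedDirtyMap and the on_* functions are hypothetical names, the shared map is reduced to a plain set of OIDs without TTLs or locking, and OIDs are buffered directly, whereas the patch buffers table names in transaction_temp_write_list and resolves them to OIDs at COMMIT.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 16	/* hypothetical cap on tables per transaction */
#define MAX_SHARED  64	/* hypothetical size of the shared map */

/* Per-session buffer of table OIDs written inside the open transaction. */
typedef struct
{
	int		oids[MAX_PENDING];
	int		count;
} PendingWrites;

/* Stand-in for the shared-memory table map: just a set of dirty OIDs. */
typedef struct
{
	int		oids[MAX_SHARED];
	int		count;
} SharedDirtyMap;

static bool
shared_contains(const SharedDirtyMap *map, int oid)
{
	int		i;

	for (i = 0; i < map->count; i++)
		if (map->oids[i] == oid)
			return true;
	return false;
}

static void
shared_mark(SharedDirtyMap *map, int oid)
{
	if (!shared_contains(map, oid) && map->count < MAX_SHARED)
		map->oids[map->count++] = oid;
}

/* Autocommit writes are marked at once; transactional writes are buffered. */
static void
on_write(PendingWrites *pending, SharedDirtyMap *map, int oid, bool in_transaction)
{
	if (!in_transaction)
		shared_mark(map, oid);
	else if (pending->count < MAX_PENDING)
		pending->oids[pending->count++] = oid;
}

/* COMMIT: publish every buffered table to the shared map. */
static void
on_commit(PendingWrites *pending, SharedDirtyMap *map)
{
	int		i;

	for (i = 0; i < pending->count; i++)
		shared_mark(map, pending->oids[i]);
	pending->count = 0;
}

/* ROLLBACK: drop the buffer; the shared map is never touched. */
static void
on_rollback(PendingWrites *pending)
{
	pending->count = 0;
}
```

With this shape, a BEGIN; UPDATE; ROLLBACK sequence calls only on_write() with in_transaction set and then on_rollback(), so nothing reaches the shared map.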
>
> Regression test included. Test 042 now includes:
> - Test 10: verifies that BEGIN; INSERT; ROLLBACK; SELECT does NOT route
> the SELECT to primary
> - Test 11: verifies that BEGIN; INSERT; COMMIT; SELECT DOES route the
> SELECT to primary
>
> Additional context on the threat model:
>
> 1. This feature requires disable_load_balance_on_write =
> 'dml_adaptive_global' - it is opt-in, not enabled by default. Operators who
> enable it accept documented trade-offs (additional shared memory, TTL-based
> staleness window).
> 2. An attacker who can connect and execute SQL against pgpool already has
> the ability to cause far more damage (DROP TABLE, mass DELETEs, resource
> exhaustion via expensive queries, connection flooding, etc.). The
> table-marking via committed writes is a minor concern compared to
> those vectors. Authentication, connection limits, and network security
> are the appropriate defenses at that layer.
> 3. Even in the worst case (an attacker commits real writes in a loop),
> the impact is bounded: the stale marking is temporary (TTL-based, typically
> a few seconds), and only affects load-balancing decisions - it doesn't
> cause data loss or correctness issues.
> 4. The existing dml_adaptive mode has analogous behavior: within a
> transaction, a write to table T causes all reads of T to go to primary for
> the remainder of that transaction. The only difference is scope -
> dml_adaptive_global extends this across sessions with a TTL.
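The bounded, TTL-based window mentioned in point 3 follows the arithmetic in pool_track_table_mutation_update_ttl() and pool_track_table_mutation_table_is_stale() from the patch: replication delay times the TTL factor, clamped to a floor and to a one-hour ceiling. A minimal sketch is below. The function names are hypothetical, and the 1-second floor is an assumption, since the value of TRACK_TABLE_MUTATION_DEFAULT_TTL_US is not shown in this part of the thread.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEFAULT_TTL_US	1000000ULL				/* assumed floor: 1 second */
#define MAX_TTL_US		(3600ULL * 1000000ULL)	/* 1 hour ceiling, as in the patch */

/* TTL = replication delay * factor, clamped to [DEFAULT_TTL_US, MAX_TTL_US]. */
static uint64_t
compute_ttl_us(uint64_t delay_us, double factor)
{
	uint64_t	ttl = (uint64_t) ((double) delay_us * factor);

	if (ttl < DEFAULT_TTL_US)
		ttl = DEFAULT_TTL_US;
	if (ttl > MAX_TTL_US)
		ttl = MAX_TTL_US;
	return ttl;
}

/* A SELECT is routed to primary while the last write is younger than the TTL. */
static bool
read_goes_to_primary(int64_t last_write_us, int64_t now_us, uint64_t ttl_us)
{
	return (now_us - last_write_us) < (int64_t) ttl_us;
}
```

For example, with a measured delay of 400 ms and a factor of 5.0, a table stays marked for 2 seconds after its last committed write and is then eligible for replica reads again.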
>
> Thanks!
Thank you for the patch. While I am looking into it, I noticed a
regression test failure.
t-ishii$ ./regress.sh 04[12]
creating pgpool-II temporary installation ...
:
:
testing 041.external_replication_delay...ok.
testing 042.track_table_mutation...failed.
out of 2 ok:1 failed:1 timeout:0
However, if I run 042 alone, it succeeds.
t-ishii$ ./regress.sh 042
:
:
testing 042.track_table_mutation...ok.
out of 1 ok:1 failed:0 timeout:0
Can you please take a look at this? log/042.track_table_mutation is
attached.
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
Attachments:
[text/plain] 042.track_table_mutation (7.8K, 2-042.track_table_mutation)
download | inline:
creating test environment...PostgreSQL major version: 180
Starting set up in streaming replication mode
creating startall and shutdownall
creating failover script
creating database cluster /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/tests/042.track_table_mutation/testdir/data0...done.
update postgresql.conf
creating pgpool_remote_start
creating basebackup.sh
creating recovery.conf
creating database cluster /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/tests/042.track_table_mutation/testdir/data1...done.
update postgresql.conf
creating pgpool_remote_start
creating basebackup.sh
creating recovery.conf
temporarily start data0 cluster to create extensions
temporarily start pgpool-II to create standby nodes
psql: error: connection to server at "localhost" (127.0.0.1), port 11000 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
psql: error: connection to server at "localhost" (127.0.0.1), port 11000 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
psql: error: connection to server at "localhost" (127.0.0.1), port 11000 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
psql: error: connection to server at "localhost" (127.0.0.1), port 11000 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
psql: error: connection to server at "localhost" (127.0.0.1), port 11000 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
psql: error: connection to server at "localhost" (127.0.0.1), port 11000 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
psql: error: connection to server at "localhost" (127.0.0.1), port 11000 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
psql: error: connection to server at "localhost" (127.0.0.1), port 11000 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
psql: error: connection to server at "localhost" (127.0.0.1), port 11000 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
psql: error: connection to server at "localhost" (127.0.0.1), port 11000 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
psql: error: connection to server at "localhost" (127.0.0.1), port 11000 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
psql: error: connection to server at "localhost" (127.0.0.1), port 11000 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
psql: error: connection to server at "localhost" (127.0.0.1), port 11000 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
psql: error: connection to server at "localhost" (127.0.0.1), port 11000 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
psql: error: connection to server at "localhost" (127.0.0.1), port 11000 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
psql: error: connection to server at "localhost" (127.0.0.1), port 11000 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
psql: error: connection to server at "localhost" (127.0.0.1), port 11000 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
psql: error: connection to server at "localhost" (127.0.0.1), port 11000 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
psql: error: connection to server at "localhost" (127.0.0.1), port 11000 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
psql: error: connection to server at "localhost" (127.0.0.1), port 11000 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
psql: error: connection to server at "localhost" (127.0.0.1), port 11000 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
recovery node 1...ERROR: connection to host "localhost" failed with error "Connection refused"
done.
creating follow primary script
psql: error: connection to server at "localhost" (127.0.0.1), port 11000 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
shutdown all
pgpool-II setting for streaming replication mode is done.
To start the whole system, use /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/tests/042.track_table_mutation/testdir/startall.
To shutdown the whole system, use /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/tests/042.track_table_mutation/testdir/shutdownall.
pcp command user name is "t-ishii", password is "t-ishii".
Each PostgreSQL, pgpool-II and pcp port is as follows:
#1 port is 11002
#2 port is 11003
pgpool port is 11000
pcp port is 11001
The info above is in README.port.
done.
waiting for server to start....364373 2026-02-19 08:42:03.860 JST LOG: redirecting log output to logging collector process
364373 2026-02-19 08:42:03.860 JST HINT: Future log output will appear in directory "log".
done
server started
waiting for server to start....364388 2026-02-19 08:42:03.963 JST LOG: redirecting log output to logging collector process
364388 2026-02-19 08:42:03.963 JST HINT: Future log output will appear in directory "log".
done
server started
psql: error: connection to server on socket "/tmp/.s.PGSQL.11000" failed: ERROR: unable to read message kind
DETAIL: kind does not match between main(53) slot[1] (45)
=== Test 1: Cold Start Routing ===
2026-02-19 08:42:24.305: main pid 364625: DEBUG: initializing pool configuration
2026-02-19 08:42:24.305: main pid 364625: DETAIL: num_backends: 2 total_weight: 1.000000
2026-02-19 08:42:24.305: main pid 364625: DEBUG: initializing pool configuration
2026-02-19 08:42:24.305: main pid 364625: DETAIL: backend 0 weight: 0.000000 flag: 0000
2026-02-19 08:42:24.305: main pid 364625: DEBUG: initializing pool configuration
2026-02-19 08:42:24.305: main pid 364625: DETAIL: backend 1 weight: 2147483647.000000 flag: 0000
2026-02-19 08:42:24.305: main pid 364625: LOG: stop request sent to pgpool (pid: 364400). waiting for termination...
.done.
waiting for server to shut down.... done
server stopped
waiting for server to shut down.... done
server stopped
waiting for server to start....364640 2026-02-19 08:42:25.531 JST LOG: redirecting log output to logging collector process
364640 2026-02-19 08:42:25.531 JST HINT: Future log output will appear in directory "log".
done
server started
waiting for server to start....364655 2026-02-19 08:42:25.633 JST LOG: redirecting log output to logging collector process
364655 2026-02-19 08:42:25.633 JST HINT: Future log output will appear in directory "log".
done
server started
Test 1 FAILED: Cold start routing not detected
2026-02-19 08:42:45.929: main pid 364901: DEBUG: initializing pool configuration
2026-02-19 08:42:45.929: main pid 364901: DETAIL: num_backends: 2 total_weight: 1.000000
2026-02-19 08:42:45.929: main pid 364901: DEBUG: initializing pool configuration
2026-02-19 08:42:45.929: main pid 364901: DETAIL: backend 0 weight: 0.000000 flag: 0000
2026-02-19 08:42:45.929: main pid 364901: DEBUG: initializing pool configuration
2026-02-19 08:42:45.929: main pid 364901: DETAIL: backend 1 weight: 2147483647.000000 flag: 0000
2026-02-19 08:42:45.929: main pid 364901: LOG: stop request sent to pgpool (pid: 364670). waiting for termination...
.done.
waiting for server to shut down.... done
server stopped
waiting for server to shut down.... done
server stopped
* Re: Proposal: Recent mutated table tracking in memory
From: Nadav Shatz @ 2026-02-19 04:40 UTC (permalink / raw)
To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]
Thanks! I'll look into it and share an updated patch.
Nadav Shatz
Tailor Brands | CTO
On Thu, Feb 19, 2026 at 1:51 AM Tatsuo Ishii <[email protected]> wrote:
> > Hi Tatsuo,
> >
> > Thank you for the careful review. You raised an important concern. I've
> > addressed it in the updated patch ― here's the explanation:
> >
> > The attack scenario you describe is now handled. In the updated patch,
> > writes inside explicit transactions are only flushed to the shared-memory
> > table map at COMMIT time. If the transaction is rolled back, the table is
> > never marked as stale. So the attack pattern:
> >
> > BEGIN;
> > UPDATE t1 SET i = 1 WHERE FALSE;
> > ROLLBACK;
> >
> > has zero effect on the shared-memory table map. The dml_adaptive_global
> > mode piggybacks on the existing dml_adaptive per-transaction write list
> > (transaction_temp_write_list). On COMMIT, the accumulated table names are
> > resolved to OIDs and flushed to shared memory. On ROLLBACK,
> > the list is simply discarded (the existing dml_adaptive behavior).
> >
> > For autocommit statements (outside explicit transactions), tables are
> > marked immediately ― but in that case the write is committed, so this is
> > correct.
> >
> > Regression test included. Test 042 now includes:
> > - Test 10: verifies that BEGIN; INSERT; ROLLBACK; SELECT does NOT route
> > the SELECT to primary
> > - Test 11: verifies that BEGIN; INSERT; COMMIT; SELECT DOES route the
> > SELECT to primary
> >
> > Additional context on the threat model:
> >
> > 1. This feature requires disable_load_balance_on_write =
> > 'dml_adaptive_global' - it is opt-in, not enabled by default. Operators who
> > enable it accept documented trade-offs (additional shared memory, TTL-based
> > staleness window).
> > 2. An attacker who can connect and execute SQL against pgpool already has
> > the ability to cause far more damage (DROP TABLE, mass DELETEs, resource
> > exhaustion via expensive queries, connection flooding, etc.). The
> > table-marking via committed writes is a minor concern compared to
> > those vectors. Authentication, connection limits, and network security
> > are the appropriate defenses at that layer.
> > 3. Even in the worst case (an attacker commits real writes in a loop),
> > the impact is bounded: the stale marking is temporary (TTL-based, typically
> > a few seconds), and only affects load-balancing decisions - it doesn't
> > cause data loss or correctness issues.
> > 4. The existing dml_adaptive mode has analogous behavior: within a
> > transaction, a write to table T causes all reads of T to go to primary for
> > the remainder of that transaction. The only difference is scope -
> > dml_adaptive_global extends this across sessions with a TTL.
> >
> > Thanks!
>
> Thank you for the patch. While I am looking into it, I noticed a
> regression test failure.
>
> t-ishii$ ./regress.sh 04[12]
> creating pgpool-II temporary installation ...
> :
> :
> testing 041.external_replication_delay...ok.
> testing 042.track_table_mutation...failed.
> out of 2 ok:1 failed:1 timeout:0
>
> However if I run 042 only, it succeeds.
>
> t-ishii$ ./regress.sh 042
> :
> :
> testing 042.track_table_mutation...ok.
> out of 1 ok:1 failed:0 timeout:0
>
> Can you please take a look at this? log/042.track_table_mutation
> attached.
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>
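For readers following the quoted explanation above, the commit-time flush can be sketched roughly as below. This is only an illustrative model, not code from the patch: SessionWrites, SharedMap and the session_*/shared_* helpers are hypothetical names, tables are tracked by name rather than by OID, and timestamps and TTL handling are left out. It shows only the rule that writes inside an explicit transaction reach the shared map at COMMIT and are dropped on ROLLBACK, while autocommit writes are marked immediately.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MAX_PENDING 16   /* hypothetical cap on tables written per transaction */
#define MAX_SHARED  64   /* hypothetical cap on globally tracked tables */
#define NAME_LEN    64

/* Hypothetical stand-in for the per-session, per-transaction write list. */
typedef struct {
    char   names[MAX_PENDING][NAME_LEN];
    size_t count;
    int    in_transaction;
} SessionWrites;

/* Hypothetical stand-in for the shared-memory table map. */
typedef struct {
    char   names[MAX_SHARED][NAME_LEN];
    size_t count;
} SharedMap;

int shared_is_marked(const SharedMap *map, const char *table)
{
    for (size_t i = 0; i < map->count; i++)
        if (strcmp(map->names[i], table) == 0)
            return 1;
    return 0;
}

/* Mark a table as recently written; silently ignores overflow in this sketch. */
void shared_mark(SharedMap *map, const char *table)
{
    if (shared_is_marked(map, table) || map->count >= MAX_SHARED)
        return;
    snprintf(map->names[map->count], NAME_LEN, "%s", table);
    map->count++;
}

void session_begin(SessionWrites *s)
{
    s->in_transaction = 1;
    s->count = 0;
}

/* Autocommit writes are marked at once; transactional writes are deferred. */
void session_write(SessionWrites *s, SharedMap *map, const char *table)
{
    assert(table != NULL);
    if (!s->in_transaction) {
        shared_mark(map, table);
        return;
    }
    if (s->count < MAX_PENDING) {
        snprintf(s->names[s->count], NAME_LEN, "%s", table);
        s->count++;
    }
}

/* COMMIT: flush the accumulated table names to the shared map. */
void session_commit(SessionWrites *s, SharedMap *map)
{
    for (size_t i = 0; i < s->count; i++)
        shared_mark(map, s->names[i]);
    s->count = 0;
    s->in_transaction = 0;
}

/* ROLLBACK: discard the list; the shared map is never touched. */
void session_rollback(SessionWrites *s)
{
    s->count = 0;
    s->in_transaction = 0;
}
```

In this model, BEGIN; write t1; ROLLBACK leaves t1 unmarked, whereas BEGIN; write t1; COMMIT marks it, which is the behaviour Tests 10 and 11 are described as checking.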
* Re: Proposal: Recent mutated table tracking in memory
From: Nadav Shatz @ 2026-02-19 11:05 UTC (permalink / raw)
To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]
Added some handling for possible causes - works now.
On Thu, Feb 19, 2026 at 6:40 AM Nadav Shatz <[email protected]> wrote:
> Thanks! I’ll look into it and share an updated patch
>
> Nadav Shatz
> Tailor Brands | CTO
>
>
> On Thu, Feb 19, 2026 at 1:51 AM Tatsuo Ishii <[email protected]> wrote:
>
>> > Hi Tatsuo,
>> >
>> > Thank you for the careful review. You raised an important concern. I've
>> > addressed it in the updated patch - here's the explanation:
>> >
>> > The attack scenario you describe is now handled. In the updated patch,
>> > writes inside explicit transactions are only flushed to the shared-memory
>> > table map at COMMIT time. If the transaction is rolled back, the table is
>> > never marked as stale. So the attack pattern:
>> >
>> > BEGIN;
>> > UPDATE t1 SET i = 1 WHERE FALSE;
>> > ROLLBACK;
>> >
>> > has zero effect on the shared-memory table map. The dml_adaptive_global
>> > mode piggybacks on the existing dml_adaptive per-transaction write list
>> > (transaction_temp_write_list). On COMMIT, the accumulated table names are
>> > resolved to OIDs and flushed to shared memory. On ROLLBACK,
>> > the list is simply discarded (the existing dml_adaptive behavior).
>> >
>> > For autocommit statements (outside explicit transactions), tables are
>> > marked immediately - but in that case the write is committed, so this is
>> > correct.
>> >
>> > Regression test included. Test 042 now includes:
>> > - Test 10: verifies that BEGIN; INSERT; ROLLBACK; SELECT does NOT route
>> > the SELECT to primary
>> > - Test 11: verifies that BEGIN; INSERT; COMMIT; SELECT DOES route the
>> > SELECT to primary
>> >
>> > Additional context on the threat model:
>> >
>> > 1. This feature requires disable_load_balance_on_write =
>> > 'dml_adaptive_global' - it is opt-in, not enabled by default. Operators who
>> > enable it accept documented trade-offs (additional shared memory, TTL-based
>> > staleness window).
>> > 2. An attacker who can connect and execute SQL against pgpool already has
>> > the ability to cause far more damage (DROP TABLE, mass DELETEs, resource
>> > exhaustion via expensive queries, connection flooding, etc.). The
>> > table-marking via committed writes is a minor concern compared to
>> > those vectors. Authentication, connection limits, and network security
>> > are the appropriate defenses at that layer.
>> > 3. Even in the worst case (an attacker commits real writes in a loop),
>> > the impact is bounded: the stale marking is temporary (TTL-based, typically
>> > a few seconds), and only affects load-balancing decisions - it doesn't
>> > cause data loss or correctness issues.
>> > 4. The existing dml_adaptive mode has analogous behavior: within a
>> > transaction, a write to table T causes all reads of T to go to primary for
>> > the remainder of that transaction. The only difference is scope -
>> > dml_adaptive_global extends this across sessions with a TTL.
>> >
>> > Thanks!
>>
>> Thank you for the patch. While I am looking into it, I noticed a
>> regression test failure.
>>
>> t-ishii$ ./regress.sh 04[12]
>> creating pgpool-II temporary installation ...
>> :
>> :
>> testing 041.external_replication_delay...ok.
>> testing 042.track_table_mutation...failed.
>> out of 2 ok:1 failed:1 timeout:0
>>
>> However if I run 042 only, it succeeds.
>>
>> t-ishii$ ./regress.sh 042
>> :
>> :
>> testing 042.track_table_mutation...ok.
>> out of 1 ok:1 failed:0 timeout:0
>>
>> Can you please take a look at this? log/042.track_table_mutation
>> attached.
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS K.K.
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese:http://www.sraoss.co.jp
>>
>
--
Nadav Shatz
Tailor Brands | CTO
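As a rough illustration of the arithmetic documented in the attached patch (TTL = replication_delay * track_table_mutation_ttl_factor, cold start routing, and the per-entry memory figures of 40 and 640 bytes), here is a minimal sketch. The function names, the microsecond units, the "negative means never written" convention and the strict "<" comparisons are assumptions made for the example; the patch itself may differ.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { ROUTE_REPLICA = 0, ROUTE_PRIMARY = 1 } Route;

/* TTL in microseconds: measured replication delay times the configured
 * factor (documented valid range 1.0-100.0, default 5.0). */
int64_t mutation_ttl_usec(int64_t replication_delay_usec, double ttl_factor)
{
    assert(replication_delay_usec >= 0);
    assert(ttl_factor >= 1.0 && ttl_factor <= 100.0);
    return (int64_t) ((double) replication_delay_usec * ttl_factor);
}

/* Routing decision for a SELECT on one table (hypothetical helper).
 * last_write_usec < 0 means the table has no recorded write. */
Route route_select(int64_t now_usec,
                   int64_t process_start_usec,
                   int64_t cold_start_ms,
                   int64_t last_write_usec,
                   int64_t ttl_usec)
{
    /* Cold start: everything goes to primary until the window has passed.
     * A duration of 0 makes this test always false, disabling cold start. */
    if (now_usec - process_start_usec < cold_start_ms * 1000)
        return ROUTE_PRIMARY;

    /* Table was written recently and its TTL has not yet expired. */
    if (last_write_usec >= 0 && now_usec - last_write_usec < ttl_usec)
        return ROUTE_PRIMARY;

    return ROUTE_REPLICA;
}

/* Shared memory estimate from the per-entry sizes quoted in the patch docs:
 * 40 bytes per tracked table, 640 bytes per query parse cache entry. */
uint64_t shmem_estimate_bytes(uint64_t table_size, uint64_t query_cache_size)
{
    return table_size * 40u + query_cache_size * 640u;
}
```

With the documented defaults (2048 tracked tables, 10000 cache entries) the estimate comes to 6,481,920 bytes, and a 200 ms replication delay with the default factor of 5.0 gives a TTL of one second.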
Attachments:
[application/octet-stream] table_track.patch (99.8K, 3-table_track.patch)
download | inline diff:
From d819632f2dac41cbe1e01363628d1d1c2f648961 Mon Sep 17 00:00:00 2001
From: Nadav Shatz <[email protected]>
Date: Tue, 6 Jan 2026 12:41:50 +0200
Subject: [PATCH] Feature: add in-memory table tracking to prevent stale reads
from replicas
Introduces 'dml_adaptive_global' as a new value for disable_load_balance_on_write.
This mode is a superset of dml_adaptive: it performs per-transaction local tracking
AND cross-session shared-memory tracking of recently written tables, routing reads
to primary until a TTL (based on measured replication delay) expires.
Sub-parameters (track_table_mutation_*) control TTL factor, cold start duration,
hash table sizing, and query parse cache sizing.
diff --git a/doc/src/sgml/loadbalance.sgml b/doc/src/sgml/loadbalance.sgml
index ee19fabebab2210cd4abe59a711a036ac0ac8943..74162ef2f81f38879c552438ee9321dfde34a4be 100644
--- a/doc/src/sgml/loadbalance.sgml
+++ b/doc/src/sgml/loadbalance.sgml
@@ -1108,6 +1108,18 @@ app_name_redirect_preference_list > database_redirect_preference_list > us
Dependent functions, triggers, and views on the tables can be configured
using <xref linkend="guc-dml-adaptive-object-relationship-list">
</para>
+
+ <para>
+ If this parameter is set to <varname>dml_adaptive_global</varname>,
+ <productname>Pgpool-II</> behaves like <varname>dml_adaptive</varname>
+ (per-transaction write tracking) and additionally uses shared memory to track
+ recently written tables across all sessions cluster-wide. When a table is
+ written in any session, subsequent reads of that table from any session are
+ routed to primary until a TTL (based on measured replication delay) expires.
+ This prevents stale reads after writes even across different connections.
+ See <xref linkend="runtime-config-table-mutation-map"> for the sub-parameters
+ that control the shared-memory tracking behavior.
+ </para>
</listitem>
</varlistentry>
@@ -1193,4 +1205,255 @@ dml_adaptive_object_relationship_list = 'table_1:table_2'
</variablelist>
</sect2>
+
+ <sect2 id="runtime-config-table-mutation-map">
+ <title>Table Mutation Map Configuration (Lagless Replica Reads)</title>
+
+ <para>
+ These parameters configure the track table mutation feature, which is activated by setting
+ <xref linkend="guc-disable-load-balance-on-write"> to <literal>dml_adaptive_global</literal>.
+ The feature tracks recently written tables to prevent stale reads from replica nodes during
+ replication lag, implementing the "lagless" architecture pattern for distributed systems
+ with read replicas.
+ </para>
+
+ <para>
+ When a table is modified (INSERT/UPDATE/DELETE), it is marked as "stale" for a TTL period
+ (<literal>replication_delay * track_table_mutation_ttl_factor</literal>). Any SELECT queries on stale tables are routed
+ to the primary node instead of replicas, ensuring read-after-write consistency.
+ </para>
+
+ <para>
+ This feature requires <xref linkend="guc-replication-delay-source-cmd"> to be configured
+ for monitoring replication delay from replicas.
+ </para>
+
+ <warning>
+ <para>
+ Enabling <literal>dml_adaptive_global</literal> increases shared memory consumption. With default settings,
+ the feature requires approximately 6.5 MB of shared memory (about 0.08 MB for table tracking + 6.4 MB for query cache).
+ Memory usage scales with configuration parameters:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ Table tracking: <literal>track_table_mutation_table_size * 40 bytes</literal> (default: 2048 * 40 = ~80 KB)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Query cache: <literal>track_table_mutation_query_parse_cache_size * 640 bytes</literal> (default: 10000 * 640 = ~6.4 MB)
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ For high-traffic systems with large cache sizes (e.g., <literal>track_table_mutation_query_parse_cache_size = 100000</literal>),
+ memory usage can reach 64 MB or more. Consider your system's available shared memory when using <literal>dml_adaptive_global</literal>.
+ </para>
+ </warning>
+
+ <variablelist>
+
+ <varlistentry id="guc-track-table-mutation-ttl-factor" xreflabel="track_table_mutation_ttl_factor">
+ <term><varname>track_table_mutation_ttl_factor</varname> (<type>floating point</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_ttl_factor</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Multiplier for calculating the TTL: <literal>TTL = replication_delay * track_table_mutation_ttl_factor</literal>.
+ Higher values provide more safety margin but may reduce read replica utilization.
+ </para>
+ <para>
+ Valid range: 1.0-100.0. Default is <literal>5.0</literal>.
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-cold-start-duration" xreflabel="track_table_mutation_cold_start_duration">
+ <term><varname>track_table_mutation_cold_start_duration</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_cold_start_duration</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Duration in milliseconds to route all queries to primary after a child process starts.
+ This prevents stale reads when a new connection is established before the table mutation map
+ is populated with recent write history.
+ </para>
+ <para>
+ When watchdog is enabled and the local node becomes the leader, Pgpool-II also triggers a
+ global cold start for this duration to avoid stale reads after leadership changes.
+ </para>
+ <para>
+ Valid range: 0-60000 ms. Default is <literal>2000</literal> (2 seconds).
+ Set to 0 to disable cold start behavior.
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-table-buckets" xreflabel="track_table_mutation_table_buckets">
+ <term><varname>track_table_mutation_table_buckets</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_table_buckets</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Number of hash buckets for the track table mutation hash table.
+ Higher values reduce hash collisions and improve lookup performance.
+ </para>
+ <para>
+ Valid range: 64-65536. Default is <literal>1024</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-table-size" xreflabel="track_table_mutation_table_size">
+ <term><varname>track_table_mutation_table_size</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_table_size</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Maximum number of tables that can be tracked simultaneously in the table mutation map.
+ When the map is full, the oldest entries are evicted to make room for new ones.
+ </para>
+ <para>
+ Valid range: 128-131072. Default is <literal>2048</literal>.
+ Memory usage: approximately 40 bytes per entry.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-query-buckets" xreflabel="track_table_mutation_query_buckets">
+ <term><varname>track_table_mutation_query_buckets</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_query_buckets</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Number of hash buckets for the query parse cache. The cache stores normalized
+ query strings mapped to their table dependencies to avoid repeated parsing.
+ </para>
+ <para>
+ Valid range: 64-65536. Default is <literal>2048</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-query-parse-cache-size" xreflabel="track_table_mutation_query_parse_cache_size">
+ <term><varname>track_table_mutation_query_parse_cache_size</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_query_parse_cache_size</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Maximum number of query parse results to cache. Uses LRU eviction when full.
+ Larger caches reduce parsing overhead but consume more shared memory.
+ </para>
+ <para>
+ Valid range: 100-1000000. Default is <literal>10000</literal>.
+ Memory usage: approximately 640 bytes per entry (~6.4 MB for default, ~64 MB for 100000 entries).
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ <sect3 id="runtime-config-track-table-mutation-example">
+ <title>Track Table Mutation Configuration Example</title>
+ <para>
+ To enable track table mutation with replication delay monitoring:
+ </para>
+ <programlisting>
+# Enable dml_adaptive_global mode (includes track table mutation)
+disable_load_balance_on_write = 'dml_adaptive_global'
+track_table_mutation_ttl_factor = 5.0
+track_table_mutation_cold_start_duration = 2000
+
+# Configure external replication delay monitoring
+replication_delay_source_cmd = '/path/to/get-replication-delay.sh'
+replication_delay_source_timeout = 10
+
+# Adjust cache sizes based on workload (increases memory usage)
+track_table_mutation_table_size = 4096 # Track up to 4096 tables (~160 KB)
+track_table_mutation_query_parse_cache_size = 50000 # Cache 50k queries (~32 MB)
+ </programlisting>
+ <para>
+ Total shared memory required for above configuration: approximately 32.2 MB (32 MB query cache + 0.2 MB table map + overhead).
+ Default configuration (10000 query cache entries, 2048 tables) requires approximately 6.5 MB.
+ </para>
+ </sect3>
+
+ <sect3 id="runtime-config-track-table-mutation-limitations">
+ <title>Limitations</title>
+ <para>
+ The track table mutation feature has the following limitation:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>PREPARE</literal> statements are not tracked. When a prepared statement
+ containing data modification is executed, the table mutation is not recorded.
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ If your application uses prepared statements and requires read-after-write consistency,
+ consider using explicit transaction routing or the <literal>/*NO LOAD BALANCE*/</literal>
+ comment directive for affected queries.
+ </para>
+ <para>
+ The following statement types <emphasis>are</emphasis> tracked and will mark tables as stale:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>INSERT</literal>, <literal>UPDATE</literal>, <literal>DELETE</literal>
+ statements (including those with <literal>RETURNING</literal> clauses).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>TRUNCATE</literal> statements (including multiple tables).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>MERGE</literal> statements (PostgreSQL 15+).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>WITH</literal> clauses containing data modifications (Common Table Expressions
+ with <literal>INSERT</literal>, <literal>UPDATE</literal>, or <literal>DELETE</literal>).
+ For example, <literal>WITH deleted AS (DELETE FROM t1 RETURNING *) SELECT * FROM deleted</literal>
+ will properly mark table <literal>t1</literal> as stale.
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ <emphasis>Transaction Rollback Behavior:</emphasis> Within explicit transactions, tables
+ are marked as stale in shared memory only when the transaction commits. If the
+ transaction is rolled back, no tables are marked, because the changes never become
+ visible on the primary and therefore cannot cause stale reads from replicas. This also
+ keeps rolled-back transactions from needlessly disabling load balancing. For autocommit
+ statements (outside explicit transactions), tables are marked immediately upon command completion.
+ </para>
+ </sect3>
+
+ </sect2>
+
</sect1>
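
Reviewer note on the documentation hunk above: the TTL rule (TTL = replication_delay * track_table_mutation_ttl_factor) and the "stale within the TTL window" check can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the code from the patch: the sketch_* names are hypothetical, times are plain microsecond counters rather than struct timeval, and treating a delay of 0 as "unknown" (falling back to the 100 ms default from the header) is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Fallback TTL (100 ms), mirroring TRACK_TABLE_MUTATION_DEFAULT_TTL_US. */
#define SKETCH_DEFAULT_TTL_US (100 * 1000)

/* Hypothetical helper: TTL = replication delay * factor, in microseconds. */
uint64_t
sketch_compute_ttl_us(uint64_t delay_us, double ttl_factor)
{
    /* Documented valid range for track_table_mutation_ttl_factor. */
    assert(ttl_factor >= 1.0 && ttl_factor <= 100.0);

    /* Assumption: a delay of 0 means the delay is not known yet. */
    if (delay_us == 0)
        return SKETCH_DEFAULT_TTL_US;

    return (uint64_t) ((double) delay_us * ttl_factor);
}

/* Hypothetical helper: a table is stale while (now - last write) < TTL. */
bool
sketch_table_is_stale(uint64_t now_us, uint64_t last_write_us, uint64_t ttl_us)
{
    /* If the clock appears to have gone backwards, stay on the safe side. */
    if (now_us < last_write_us)
        return true;

    return (now_us - last_write_us) < ttl_us;
}
```

With a measured delay of 200 ms and the default factor of 5.0, the sketch yields a TTL of one second, so a SELECT arriving 500 ms after the write would still be sent to the primary.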
diff --git a/src/Makefile.am b/src/Makefile.am
index 4678ab53055e828a37b6477801640aff17ff84a7..39588af58deba045dffc01ae932115b8a9dbfcf2 100644
--- a/src/Makefile.am
+++ b/src/Makefile.am
@@ -35,6 +35,7 @@ pgpool_SOURCES = main/main.c \
rewrite/pool_timestamp.c \
rewrite/pool_lobj.c \
utils/pool_select_walker.c \
+ utils/pool_track_table_mutation.c \
utils/strlcpy.c \
utils/psprintf.c \
utils/pool_params.c \
diff --git a/src/config/pool_config_variables.c b/src/config/pool_config_variables.c
index 68abb7f41cb96d856c824a148842748bfb7a4d12..623d8751677fd6f39d0e12f0e3e899171890f6e0 100644
--- a/src/config/pool_config_variables.c
+++ b/src/config/pool_config_variables.c
@@ -290,6 +290,7 @@ static const struct config_enum_entry disable_load_balance_on_write_options[] =
{"trans_transaction", DLBOW_TRANS_TRANSACTION, false},
{"always", DLBOW_ALWAYS, false},
{"dml_adaptive", DLBOW_DML_ADAPTIVE, false},
+ {"dml_adaptive_global", DLBOW_DML_ADAPTIVE_GLOBAL, false},
{NULL, 0, false}
};
@@ -1757,6 +1758,17 @@ static struct config_int_array ConfigureNamesIntArray[] =
static struct config_double ConfigureNamesDouble[] =
{
+ {
+ {"track_table_mutation_ttl_factor", CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "TTL multiplier for track table mutation (TTL = replication_delay * factor)",
+ CONFIG_VAR_TYPE_DOUBLE, false, 0
+ },
+ &g_pool_config.track_table_mutation_ttl_factor,
+ 5.0, /* boot value: 5x replication delay */
+ 1.0, 100.0, /* min, max */
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
EMPTY_CONFIG_DOUBLE
};
@@ -2355,6 +2367,61 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"track_table_mutation_cold_start_duration", CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "Duration in milliseconds to force queries to primary after child process starts.",
+ CONFIG_VAR_TYPE_INT, false, GUC_UNIT_MS
+ },
+ &g_pool_config.track_table_mutation_cold_start_duration,
+ 2000, /* 2 seconds */
+ 0, 60000, /* 0 to 60 seconds */
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_table_buckets", CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Number of hash buckets for track table mutation.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_table_buckets,
+ 1024,
+ 64, 65536,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_table_size", CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Maximum number of entries in track table mutation.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_table_size,
+ 2048,
+ 128, 131072,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_query_buckets", CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Number of hash buckets for query parse cache.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_query_buckets,
+ 2048,
+ 64, 65536,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_query_parse_cache_size", CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Maximum number of entries in query parse cache.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_query_parse_cache_size,
+ 10000,
+ 100, 1000000,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
EMPTY_CONFIG_INT
};
diff --git a/src/context/pool_query_context.c b/src/context/pool_query_context.c
index 7cf9813eb7d58678bc86a0aaa38bd3c6445b6687..aa123222eccaa8505f984dbe3224958fc79424c8 100644
--- a/src/context/pool_query_context.c
+++ b/src/context/pool_query_context.c
@@ -29,6 +29,7 @@
#include "utils/statistics.h"
#include "utils/pool_select_walker.h"
#include "utils/pool_stream.h"
+#include "utils/pool_track_table_mutation.h"
#include "context/pool_session_context.h"
#include "context/pool_query_context.h"
#include "parser/nodes.h"
@@ -1828,7 +1829,7 @@ is_in_list(char *name, List *list)
static bool
is_select_object_in_temp_write_list(Node *node, void *context)
{
- if (node == NULL || pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE)
+ if (node == NULL || !DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write))
return false;
if (IsA(node, RangeVar))
@@ -1836,7 +1837,7 @@ is_select_object_in_temp_write_list(Node *node, void *context)
RangeVar *rgv = (RangeVar *) node;
POOL_SESSION_CONTEXT *session_context = pool_get_session_context(false);
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && session_context->is_in_transaction)
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write) && session_context->is_in_transaction)
{
ereport(DEBUG1,
(errmsg("is_select_object_in_temp_write_list: \"%s\", found relation \"%s\"", (char *) context, rgv->relname)));
@@ -1880,7 +1881,7 @@ static char *get_associated_object_from_dml_adaptive_relations
void
check_object_relationship_list(char *name, bool is_func_name)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && pool_config->parsed_dml_adaptive_object_relationship_list)
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write) && pool_config->parsed_dml_adaptive_object_relationship_list)
{
POOL_SESSION_CONTEXT *session_context = pool_get_session_context(false);
@@ -1944,7 +1945,7 @@ add_object_into_temp_write_list(Node *node, void *context)
static void
dml_adaptive(Node *node, char *query)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE)
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write))
{
/* Set/Unset transaction status flags */
if (IsA(node, TransactionStmt))
@@ -1963,6 +1964,34 @@ dml_adaptive(Node *node, char *query)
}
else if (is_commit_or_rollback_query(node))
{
+ /*
+ * For dml_adaptive_global: on COMMIT, flush the accumulated
+ * table writes to shared memory. On ROLLBACK, skip — the
+ * writes never committed so no stale-read risk exists.
+ * This prevents attackers from polluting the table map with
+ * rolled-back transactions.
+ */
+ if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE_GLOBAL &&
+ is_commit_query(node) &&
+ session_context->transaction_temp_write_list != NIL)
+ {
+ ListCell *cell;
+ int dboid;
+
+ dboid = pool_track_table_mutation_get_database_oid();
+ if (dboid > 0)
+ {
+ foreach(cell, session_context->transaction_temp_write_list)
+ {
+ char *table_name = (char *) lfirst(cell);
+ int table_oid = pool_table_name_to_oid(table_name);
+
+ if (table_oid > 0)
+ pool_track_table_mutation_mark_table_written(table_oid, dboid);
+ }
+ }
+ }
+
session_context->is_in_transaction = false;
if (session_context->transaction_temp_write_list != NIL)
@@ -2010,6 +2039,18 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
if (dest == POOL_PRIMARY)
{
pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
+
+ /*
+ * Resolve table and database OIDs now to populate relcache.
+ * This avoids potential hangs in CommandComplete where we shouldn't
+ * be running new queries against the backend.
+ */
+ if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ int *oids;
+ pool_extract_table_oids(node, &oids);
+ pool_track_table_mutation_get_database_oid();
+ }
}
/* Should be sent to both primary and standby? */
else if (dest == POOL_BOTH)
@@ -2139,6 +2180,107 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
{
pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
}
+ /*
+ * Check track table mutation for recently written tables.
+ * If in cold start or any table was recently written,
+ * route to primary to avoid stale reads.
+ */
+ else if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ bool force_primary = false;
+
+ /* During cold start, route everything to primary */
+ if (pool_track_table_mutation_in_cold_start())
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load balance because of track table mutation cold start"),
+ errdetail("destination = PRIMARY for query= \"%s\"", query)));
+ force_primary = true;
+ }
+ else
+ {
+ /* Extract table oids and check if any are stale */
+ SelectContext ctx;
+ int dboid;
+ int num_oids;
+ int i;
+
+ memset(&ctx, 0, sizeof(ctx));
+ num_oids = pool_extract_table_oids_from_select_stmt(node, &ctx);
+ if (num_oids > 0)
+ {
+ dboid = pool_track_table_mutation_get_database_oid();
+
+ if (dboid <= 0)
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load balance because database oid was unavailable"),
+ errdetail("destination = PRIMARY for query= \"%s\"", query)));
+ force_primary = true;
+ }
+ else
+ {
+ for (i = 0; i < num_oids; i++)
+ {
+ if (pool_track_table_mutation_table_is_stale(ctx.table_oids[i], dboid))
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load balance because table \"%s\" was recently written",
+ ctx.table_names[i]),
+ errdetail("destination = PRIMARY for query= \"%s\"", query)));
+ force_primary = true;
+ break;
+ }
+ }
+ }
+ }
+ }
+
+ if (force_primary)
+ {
+ pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
+ }
+ else
+ {
+ /* Proceed with load balancing */
+ if (pool_config->statement_level_load_balance)
+ {
+ session_context->load_balance_node_id = select_load_balancing_node();
+ }
+
+ /*
+ * If the streaming replication delay of the load balance node is
+ * too large and prefer_lower_delay_standby is true, elect the
+ * standby with the lowest delay as the new load balance node;
+ * otherwise send the query to the primary.
+ */
+ if (STREAM && check_replication_delay(session_context->load_balance_node_id))
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load balance because of too much replication delay"),
+ errdetail("destination = %d for query= \"%s\"", dest, query)));
+
+ if (pool_config->prefer_lower_delay_standby)
+ {
+ int new_load_balancing_node = select_load_balancing_node();
+
+ session_context->load_balance_node_id = new_load_balancing_node;
+ session_context->query_context->load_balance_node_id = session_context->load_balance_node_id;
+ pool_set_node_to_be_sent(query_context, session_context->query_context->load_balance_node_id);
+ }
+ else
+ {
+ pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
+ }
+ }
+ else
+ {
+ session_context->query_context->load_balance_node_id = session_context->load_balance_node_id;
+ pool_set_node_to_be_sent(query_context,
+ session_context->query_context->load_balance_node_id);
+ }
+ }
+ }
else
{
if (pool_config->statement_level_load_balance)
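
Reviewer note on the where_to_send_main_replica() hunk above: the decision order for a SELECT under dml_adaptive_global can be condensed into one small function. The sketch below is an assumption-laden illustration, not the patch itself: the sketch_* names, the SketchDest enum and the callback type are hypothetical, and the demo predicate simply pretends that OID 16384 was recently written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum
{
    SKETCH_DEST_PRIMARY,
    SKETCH_DEST_LOAD_BALANCE
} SketchDest;

/* Hypothetical stand-in for pool_track_table_mutation_table_is_stale(). */
typedef bool (*sketch_stale_fn) (int table_oid, int dboid);

/* Demo predicate (assumption): only OID 16384 counts as recently written. */
bool
sketch_demo_is_stale(int table_oid, int dboid)
{
    (void) dboid;
    return table_oid == 16384;
}

/* Decide where a SELECT goes, following the order used in the hunk above. */
SketchDest
sketch_route_select(bool in_cold_start, int dboid,
                    const int *table_oids, int num_oids,
                    sketch_stale_fn is_stale)
{
    int         i;

    assert(num_oids == 0 || table_oids != NULL);
    assert(is_stale != NULL);

    /* 1. During cold start everything goes to the primary. */
    if (in_cold_start)
        return SKETCH_DEST_PRIMARY;

    if (num_oids > 0)
    {
        /* 2. Without a database OID staleness cannot be checked. */
        if (dboid <= 0)
            return SKETCH_DEST_PRIMARY;

        /* 3. One recently written table is enough to force the primary. */
        for (i = 0; i < num_oids; i++)
        {
            if (is_stale(table_oids[i], dboid))
                return SKETCH_DEST_PRIMARY;
        }
    }

    /* 4. Otherwise the query is eligible for normal load balancing. */
    return SKETCH_DEST_LOAD_BALANCE;
}
```

A query that references no tables falls through to step 4, which matches the hunk: force_primary stays false when num_oids is 0.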
diff --git a/src/context/pool_session_context.c b/src/context/pool_session_context.c
index ded41c7fc64ceba1d1fafd6f4a9f10a750872374..a9596561a7e0265e928b957a2766f46fb4e9ebaa 100644
--- a/src/context/pool_session_context.c
+++ b/src/context/pool_session_context.c
@@ -532,7 +532,7 @@ dump_sent_message(char *caller, POOL_SENT_MESSAGE *m)
static void
dml_adaptive_init(void)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE)
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write))
{
session_context->is_in_transaction = false;
session_context->transaction_temp_write_list = NIL;
@@ -542,7 +542,7 @@ dml_adaptive_init(void)
static void
dml_adaptive_destroy(void)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && session_context)
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write) && session_context)
{
if (session_context->transaction_temp_write_list != NIL)
list_free_deep(session_context->transaction_temp_write_list);
@@ -738,10 +738,10 @@ void
pool_set_writing_transaction(void)
{
/*
- * If disable_transaction_on_write is 'off' or 'dml_adaptive', then never
- * turn on writing transaction flag.
+ * If disable_load_balance_on_write is 'off' or 'dml_adaptive' or
+ * 'dml_adaptive_global', then never turn on writing transaction flag.
*/
- if (pool_config->disable_load_balance_on_write != DLBOW_OFF && pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE)
+ if (pool_config->disable_load_balance_on_write != DLBOW_OFF && !DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write))
{
pool_get_session_context(false)->writing_transaction = true;
ereport(DEBUG5,
diff --git a/src/include/pool.h b/src/include/pool.h
index ea6f87e120af866b8ed3a15790d9d8a8e009fe91..7168c1aea877856b5978de332ad636325eb9c30c 100644
--- a/src/include/pool.h
+++ b/src/include/pool.h
@@ -424,7 +424,7 @@ typedef enum
#define Min(x, y) ((x) < (y) ? (x) : (y))
-#define MAX_NUM_SEMAPHORES 8
+#define MAX_NUM_SEMAPHORES 10
#define CONN_COUNTER_SEM 0
#define REQUEST_INFO_SEM 1
#define QUERY_CACHE_STATS_SEM 2
@@ -434,6 +434,8 @@ typedef enum
#define FOLLOW_PRIMARY_SEM 6
#define MAIN_EXIT_HANDLER_SEM 7 /* used in exit_hander in pgpool main
* process */
+#define TRACK_TABLE_MUTATION_TABLE_SEM 8
+#define TRACK_TABLE_MUTATION_QUERY_SEM 9
#define MAX_REQUEST_QUEUE_SIZE 10
#define MAX_SEC_WAIT_FOR_CLUSTER_TRANSACTION 10 /* time in seconds to keep
diff --git a/src/include/pool_config.h b/src/include/pool_config.h
index 741de6cc5fc3368f813d6b6efa68eb7f8a79506b..8798b86eb3620ab36be733bb60bbb8464b0063c8 100644
--- a/src/include/pool_config.h
+++ b/src/include/pool_config.h
@@ -105,9 +105,13 @@ typedef enum DLBOW_OPTION
DLBOW_TRANSACTION,
DLBOW_TRANS_TRANSACTION,
DLBOW_ALWAYS,
- DLBOW_DML_ADAPTIVE
+ DLBOW_DML_ADAPTIVE,
+ DLBOW_DML_ADAPTIVE_GLOBAL
} DLBOW_OPTION;
+#define DLBOW_IS_DML_ADAPTIVE(opt) \
+ ((opt) == DLBOW_DML_ADAPTIVE || (opt) == DLBOW_DML_ADAPTIVE_GLOBAL)
+
typedef enum RELQTARGET_OPTION
{
RELQTARGET_PRIMARY = 1,
@@ -365,6 +369,15 @@ typedef struct
* replication check */
char *replication_delay_source_cmd; /* external command for replication delay */
int replication_delay_source_timeout; /* timeout for external command in seconds */
+
+ /* Track table mutation configuration for tracking recently written tables */
+ double track_table_mutation_ttl_factor; /* TTL multiplier for replication delay */
+ int track_table_mutation_cold_start_duration; /* Cold start duration in ms */
+ int track_table_mutation_table_buckets; /* Number of hash buckets for table map */
+ int track_table_mutation_table_size; /* Max entries in table map */
+ int track_table_mutation_query_buckets; /* Number of hash buckets for query cache */
+ int track_table_mutation_query_parse_cache_size; /* Max entries in query parse cache */
+
char *failover_command; /* execute command when failover happens */
char *follow_primary_command; /* execute command when failover is
* ended */
diff --git a/src/include/utils/pool_track_table_mutation.h b/src/include/utils/pool_track_table_mutation.h
new file mode 100644
index 0000000000000000000000000000000000000000..5cd5d4ef409645fe77e3bb02239e140456de0554
--- /dev/null
+++ b/src/include/utils/pool_track_table_mutation.h
@@ -0,0 +1,237 @@
+/* -*-pgsql-c-*- */
+/*
+ * pgpool: a language independent connection pool server for PostgreSQL
+ * written by Tatsuo Ishii
+ *
+ * Copyright (c) 2003-2026 PgPool Global Development Group
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose and without fee is hereby
+ * granted, provided that the above copyright notice appear in all
+ * copies and that both that copyright notice and this permission
+ * notice appear in supporting documentation, and that the name of the
+ * author not be used in advertising or publicity pertaining to
+ * distribution of the software without specific, written prior
+ * permission. The author makes no representations about the
+ * suitability of this software for any purpose. It is provided "as
+ * is" without express or implied warranty.
+ *
+ * pool_track_table_mutation.h: In-memory tracking of recently written tables
+ * to avoid stale reads from replicas during replication lag
+ */
+
+#ifndef POOL_TRACK_TABLE_MUTATION_H
+#define POOL_TRACK_TABLE_MUTATION_H
+
+#include "pool.h"
+#include <sys/time.h>
+
+/*
+ * Maximum table name length including schema: "schema"."table"
+ * Using NAMEDATALEN * 2 + 4 for quotes and dot
+ */
+#define TRACK_TABLE_MUTATION_TABLE_NAME_LEN (NAMEDATALEN * 2 + 4)
+
+/*
+ * Maximum number of tables we track per query
+ */
+#define TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY 8
+
+/*
+ * Invalid index marker for linked lists
+ */
+#define TRACK_TABLE_MUTATION_INVALID_INDEX (-1)
+
+/*
+ * Default TTL in microseconds (100ms) used when replication delay is unknown
+ */
+#define TRACK_TABLE_MUTATION_DEFAULT_TTL_US (100 * 1000)
+
+/*
+ * Entry in the table mutation hash table (keyed by table/database oids)
+ */
+typedef struct TrackTableMutationEntry
+{
+ int table_oid; /* Table oid */
+ int dboid; /* Database oid */
+ struct timeval last_write_time; /* When the table was last written */
+ uint32 hash; /* Pre-computed hash value */
+ int next; /* Next entry in collision chain (-1 if none) */
+ bool in_use; /* Is this entry in use? */
+} TrackTableMutationEntry;
+
+/*
+ * Header for the table mutation hash table in shared memory
+ */
+typedef struct TrackTableMutationHashTable
+{
+ int num_buckets; /* Number of hash buckets */
+ int max_entries; /* Maximum entries allowed */
+ int num_entries; /* Current number of entries */
+ int free_list_head; /* Head of free entry list */
+ /* Flexible array members follow in shared memory:
+ * int buckets[num_buckets];
+ * TrackTableMutationEntry entries[max_entries];
+ */
+} TrackTableMutationHashTable;
+
+/*
+ * Entry in the query parse cache
+ */
+typedef struct QueryParseEntry
+{
+ uint64 query_hash; /* Hash of normalized query */
+ bool is_write; /* True if INSERT/UPDATE/DELETE */
+ int num_tables; /* Number of tables in query */
+ char table_names[TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY][TRACK_TABLE_MUTATION_TABLE_NAME_LEN];
+ int next; /* Next entry in collision chain */
+ int lru_prev; /* Previous in LRU list */
+ int lru_next; /* Next in LRU list */
+ bool in_use; /* Is this entry in use? */
+} QueryParseEntry;
+
+/*
+ * Header for the query parse cache in shared memory
+ */
+typedef struct QueryParseCache
+{
+ int num_buckets; /* Number of hash buckets */
+ int max_entries; /* Maximum entries allowed */
+ int num_entries; /* Current number of entries */
+ int free_list_head; /* Head of free entry list */
+ int lru_head; /* Most recently used */
+ int lru_tail; /* Least recently used */
+ /* Flexible array members follow in shared memory:
+ * int buckets[num_buckets];
+ * QueryParseEntry entries[max_entries];
+ */
+} QueryParseCache;
+
+/*
+ * Global state for track table mutation feature
+ */
+typedef struct TrackTableMutationState
+{
+ bool initialized; /* Has shared memory been initialized? */
+ uint64 current_ttl_us; /* Current TTL in microseconds */
+ struct timeval ttl_last_updated; /* When TTL was last updated */
+ struct timeval last_cleanup_time; /* When last expired cleanup ran */
+ struct timeval global_cold_start_until; /* Global cold start end time */
+ volatile uint32 stats_queries_checked; /* Number of queries checked */
+ volatile uint32 stats_forced_primary; /* Queries forced to primary */
+ volatile uint32 stats_allowed_replica; /* Queries allowed to replica */
+} TrackTableMutationState;
+
+/*
+ * Main shared memory structure containing all components
+ */
+typedef struct TrackTableMutationShmem
+{
+ TrackTableMutationState state;
+ TrackTableMutationHashTable *table_map;
+ QueryParseCache *query_cache;
+} TrackTableMutationShmem;
+
+/* ----------------
+ * Public API functions
+ * ----------------
+ */
+
+/*
+ * Initialize shared memory structures for track table mutation.
+ * Called from pgpool_main.c after pool_init_pool_info().
+ */
+extern void pool_track_table_mutation_init(void);
+
+/*
+ * Initialize per-child process state for track table mutation.
+ * Called from child.c when a new child process starts.
+ * Sets up cold start tracking.
+ */
+extern void pool_track_table_mutation_child_init(void);
+
+/*
+ * Check if the child process is in cold start period.
+ * During cold start, all queries are routed to primary.
+ * Returns true if in cold start, false otherwise.
+ */
+extern bool pool_track_table_mutation_in_cold_start(void);
+
+/*
+ * Trigger a global cold start period for all processes.
+ * Used after watchdog leader change to avoid stale reads.
+ */
+extern void pool_track_table_mutation_trigger_global_cold_start(void);
+
+/*
+ * Get oid of current database.
+ */
+extern int pool_track_table_mutation_get_database_oid(void);
+
+/*
+ * Check if a table was recently written to (is "stale").
+ * If stale, reads from this table should go to primary.
+ * Returns true if table is stale (recently written), false otherwise.
+ */
+extern bool pool_track_table_mutation_table_is_stale(int table_oid, int dboid);
+
+/*
+ * Mark tables as recently written.
+ * Called after INSERT/UPDATE/DELETE queries complete.
+ * table_oids: array of table oids
+ * num_tables: number of tables in array
+ * dboid: database oid
+ */
+extern void pool_track_table_mutation_mark_tables_written(const int *table_oids, int num_tables, int dboid);
+
+/*
+ * Convenience function to mark a single table as written.
+ * table_oid: table oid
+ * dboid: database oid
+ */
+extern void pool_track_table_mutation_mark_table_written(int table_oid, int dboid);
+
+/*
+ * Update the TTL based on current replication delay.
+ * Called from pool_worker_child.c when replication delay is updated.
+ * delay_us: replication delay in microseconds
+ */
+extern void pool_track_table_mutation_update_ttl(uint64 delay_us);
+
+/*
+ * Look up cached parse result for a query.
+ * hash: hash of normalized query
+ * is_write: output - true if query is a write
+ * table_names: output - array to fill with table names
+ * num_tables: output - number of tables found
+ * Returns true if found in cache, false otherwise.
+ */
+extern bool pool_track_table_mutation_get_cached_parse(uint64 hash, bool *is_write,
+ char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
+ int *num_tables);
+
+/*
+ * Cache a parse result for a query.
+ * hash: hash of normalized query
+ * is_write: true if query is a write
+ * table_names: array of table names
+ * num_tables: number of tables
+ */
+extern void pool_track_table_mutation_cache_parse(uint64 hash, bool is_write,
+ const char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
+ int num_tables);
+
+/*
+ * Normalize a query and compute its hash.
+ * Strips comments, normalizes whitespace and literals.
+ * query: input SQL query string
+ * Returns: 64-bit hash of normalized query
+ */
+extern uint64 pool_track_table_mutation_normalize_and_hash(const char *query);
+
+/*
+ * Calculate required shared memory size for track table mutation.
+ */
+extern Size pool_track_table_mutation_shmem_size(void);
+
+#endif /* POOL_TRACK_TABLE_MUTATION_H */
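
Reviewer note on the header above: the table map uses integer indexes instead of pointers (buckets[] holds entry indexes, each entry has a "next" index, and -1 marks the end of a chain), which is what makes it usable from shared memory. The following is a minimal, self-contained sketch of that layout, assuming fixed sizes and a made-up hash function; all sketch_* names are hypothetical, and the real structure also stores a timestamp and evicts entries when full.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NUM_BUCKETS   8
#define SKETCH_MAX_ENTRIES   16
#define SKETCH_INVALID_INDEX (-1)

typedef struct SketchEntry
{
    int         table_oid;
    int         dboid;
    int         next;       /* next entry in chain or free list, -1 if none */
    bool        in_use;
} SketchEntry;

typedef struct SketchTable
{
    int         buckets[SKETCH_NUM_BUCKETS];    /* head index per bucket */
    SketchEntry entries[SKETCH_MAX_ENTRIES];
    int         free_list_head;
    int         num_entries;
} SketchTable;

/* Assumption: any reasonable mixing function would do here. */
static uint32_t
sketch_hash(int table_oid, int dboid)
{
    uint32_t    h = (uint32_t) table_oid * 2654435761u;

    h ^= (uint32_t) dboid + 0x9e3779b9u + (h << 6) + (h >> 2);
    return h;
}

/* Empty all buckets and thread every entry onto the free list. */
void
sketch_table_init(SketchTable *t)
{
    int         i;

    for (i = 0; i < SKETCH_NUM_BUCKETS; i++)
        t->buckets[i] = SKETCH_INVALID_INDEX;
    for (i = 0; i < SKETCH_MAX_ENTRIES; i++)
    {
        t->entries[i].in_use = false;
        t->entries[i].next = (i + 1 < SKETCH_MAX_ENTRIES) ? i + 1 : SKETCH_INVALID_INDEX;
    }
    t->free_list_head = 0;
    t->num_entries = 0;
}

/* Walk the collision chain; return the entry index or -1 if absent. */
int
sketch_table_find(const SketchTable *t, int table_oid, int dboid)
{
    int         idx = t->buckets[sketch_hash(table_oid, dboid) % SKETCH_NUM_BUCKETS];

    while (idx != SKETCH_INVALID_INDEX)
    {
        const SketchEntry *e = &t->entries[idx];

        if (e->in_use && e->table_oid == table_oid && e->dboid == dboid)
            return idx;
        idx = e->next;
    }
    return SKETCH_INVALID_INDEX;
}

/* Insert if missing; returns the entry index, or -1 when the table is full. */
int
sketch_table_insert(SketchTable *t, int table_oid, int dboid)
{
    uint32_t    bucket = sketch_hash(table_oid, dboid) % SKETCH_NUM_BUCKETS;
    int         idx = sketch_table_find(t, table_oid, dboid);

    if (idx != SKETCH_INVALID_INDEX)
        return idx;
    if (t->free_list_head == SKETCH_INVALID_INDEX)
        return SKETCH_INVALID_INDEX;    /* the sketch does not evict */

    idx = t->free_list_head;
    t->free_list_head = t->entries[idx].next;
    t->entries[idx].table_oid = table_oid;
    t->entries[idx].dboid = dboid;
    t->entries[idx].in_use = true;
    t->entries[idx].next = t->buckets[bucket];  /* push onto chain head */
    t->buckets[bucket] = idx;
    t->num_entries++;
    return idx;
}
```

Because every link is an index relative to the entries array, the same structure stays valid in every child process regardless of where the shared memory segment is mapped.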
diff --git a/src/main/pgpool_main.c b/src/main/pgpool_main.c
index fa05e15e7ac435e072298063f918c70aa4e5680c..b88b0478cb150f89bd9b6b8ab38db0d6912fddd0 100644
--- a/src/main/pgpool_main.c
+++ b/src/main/pgpool_main.c
@@ -57,6 +57,7 @@
#include "auth/pool_passwd.h"
#include "auth/pool_hba.h"
#include "query_cache/pool_memqcache.h"
+#include "utils/pool_track_table_mutation.h"
#include "watchdog/wd_internal_commands.h"
#include "watchdog/wd_lifecheck.h"
#include "watchdog/watchdog.h"
@@ -1485,11 +1486,14 @@ sigusr1_interrupt_processor(void)
if (user1SignalSlot->signalFlags[SIG_WATCHDOG_STATE_CHANGED])
{
+ WD_STATES wd_state;
+
ereport(LOG,
(errmsg("Pgpool-II parent process received watchdog state change signal from watchdog")));
user1SignalSlot->signalFlags[SIG_WATCHDOG_STATE_CHANGED] = false;
- if (wd_internal_get_watchdog_local_node_state() == WD_STANDBY)
+ wd_state = wd_internal_get_watchdog_local_node_state();
+ if (wd_state == WD_STANDBY)
{
ereport(LOG,
(errmsg("we have joined the watchdog cluster as STANDBY node"),
@@ -1503,6 +1507,10 @@ sigusr1_interrupt_processor(void)
*/
pool_release_follow_primary_lock(true);
}
+ else if (wd_state == WD_COORDINATOR && pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ pool_track_table_mutation_trigger_global_cold_start();
+ }
}
if (user1SignalSlot->signalFlags[SIG_FAILOVER_INTERRUPT])
{
@@ -3068,6 +3076,12 @@ initialize_shared_mem_objects(bool clear_memcache_oidmaps)
elog(DEBUG1, "watchdog: %zu bytes requested for shared memory", MAXALIGN(wd_ipc_get_shared_mem_size()));
}
+ if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ size += MAXALIGN(pool_track_table_mutation_shmem_size());
+ elog(DEBUG1, "track_table_mutation: %zu bytes requested for shared memory", MAXALIGN(pool_track_table_mutation_shmem_size()));
+ }
+
initialize_shared_memory_main_segment(size);
/* Move the backend descriptors to shared memory */
@@ -3184,6 +3198,12 @@ initialize_shared_mem_objects(bool clear_memcache_oidmaps)
wd_ipc_initialize_data();
}
+ /* Initialize track table mutation for tracking recently written tables */
+ if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ pool_track_table_mutation_init();
+ }
+
}
/*
diff --git a/src/protocol/CommandComplete.c b/src/protocol/CommandComplete.c
index a3b8f0ea194ffecc79e58566be80562a46eb75ab..b15db53248433cb3112246274ed771b79abe1392 100644
--- a/src/protocol/CommandComplete.c
+++ b/src/protocol/CommandComplete.c
@@ -38,6 +38,8 @@
#include "utils/palloc.h"
#include "utils/memutils.h"
#include "utils/pool_stream.h"
+#include "utils/pool_track_table_mutation.h"
+#include "query_cache/pool_memqcache.h"
static int extract_ntuples(char *message);
static POOL_STATUS handle_mismatch_tuples(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend, char *packet, int packetlen, bool command_complete);
@@ -304,6 +306,29 @@ handle_query_context(POOL_CONNECTION_POOL *backend)
node = session_context->query_context->parse_tree;
+ /*
+ * Track table writes for dml_adaptive_global feature.
+ * For autocommit statements (not in explicit transaction), mark tables
+ * immediately. For explicit transactions, marking is deferred to COMMIT
+ * in dml_adaptive() so that ROLLBACKed writes don't pollute the shared
+ * memory table map.
+ */
+ if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE_GLOBAL &&
+ node != NULL && !session_context->is_in_transaction)
+ {
+ int *oids;
+ int num_oids;
+
+ num_oids = pool_extract_table_oids(node, &oids);
+ if (num_oids > 0)
+ {
+ int dboid = pool_track_table_mutation_get_database_oid();
+
+ if (dboid > 0)
+ pool_track_table_mutation_mark_tables_written(oids, num_oids, dboid);
+ }
+ }
+
if (IsA(node, PrepareStmt))
{
if (session_context->uncompleted_message)
diff --git a/src/protocol/child.c b/src/protocol/child.c
index c34f057281be62feaf39db1bb605062f56dc398c..d53f571421968bd789d0b55f97e0a1eb68a813e5 100644
--- a/src/protocol/child.c
+++ b/src/protocol/child.c
@@ -57,6 +57,7 @@
#include "utils/elog.h"
#include "utils/ps_status.h"
#include "utils/timestamp.h"
+#include "utils/pool_track_table_mutation.h"
#include "context/pool_process_context.h"
#include "context/pool_session_context.h"
@@ -213,6 +214,12 @@ do_child(int *fds)
/* Initialize per process context */
pool_init_process_context();
+ /* Initialize track table mutation child state for cold start tracking */
+ if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ pool_track_table_mutation_child_init();
+ }
+
/* initialize connection pool */
if (pool_init_cp())
{
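Note on the cold start check initialized above: the per-child test implied by track_table_mutation_cold_start_duration can be sketched roughly as follows. This is only an illustration under assumptions, not the code from this patch. The names sketch_elapsed_us and sketch_in_cold_start and their parameters are hypothetical; the real logic lives in src/utils/pool_track_table_mutation.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/time.h>

/* Elapsed microseconds between two timevals (same arithmetic as the patch's elapsed_us helper). */
static int64_t
sketch_elapsed_us(const struct timeval *start, const struct timeval *end)
{
    return ((int64_t) (end->tv_sec - start->tv_sec) * 1000000) +
        (end->tv_usec - start->tv_usec);
}

/*
 * Hypothetical helper: true while the child is still inside its cold start
 * window. A duration of 0 disables cold start, as the sample config states.
 */
static bool
sketch_in_cold_start(const struct timeval *child_start,
                     const struct timeval *now,
                     int duration_ms)
{
    assert(duration_ms >= 0);
    if (duration_ms == 0)
        return false;
    return sketch_elapsed_us(child_start, now) < (int64_t) duration_ms * 1000;
}
```

With the default of 2000 ms, a query arriving 0.5 s after the child started would still be sent to the primary, while one arriving 2 s or later would be eligible for load balancing again.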
diff --git a/src/protocol/pool_proto_modules.c b/src/protocol/pool_proto_modules.c
index 47b5c8f98a5b4c92d675840eea88f7e03bb18b4c..75fc7508480d79aacc281dd5e624f9e34a998833 100644
--- a/src/protocol/pool_proto_modules.c
+++ b/src/protocol/pool_proto_modules.c
@@ -1461,7 +1461,7 @@ Parse(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend,
pool_where_to_send(query_context, query_context->original_query,
query_context->parse_tree);
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && strlen(name) != 0)
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write) && strlen(name) != 0)
pool_setall_node_to_be_sent(query_context);
if (REPLICATION)
@@ -1804,7 +1804,7 @@ Bind(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend,
return POOL_END;
}
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE &&
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write) &&
TSTATE(backend, MAIN_REPLICA ? PRIMARY_NODE_ID : REAL_MAIN_NODE_ID) == 'T')
{
pool_where_to_send(query_context, query_context->original_query,
diff --git a/src/query_cache/pool_memqcache.c b/src/query_cache/pool_memqcache.c
index f38f711469576342ce59469b085c97365116004c..dca93334e9e47bb7978064edece5ca0e40021ce3 100644
--- a/src/query_cache/pool_memqcache.c
+++ b/src/query_cache/pool_memqcache.c
@@ -1305,6 +1305,12 @@ pool_extract_table_oids(Node *node, int **oidsp)
}
return num_oids;
}
+ else if (IsA(node, MergeStmt))
+ {
+ MergeStmt *stmt = (MergeStmt *) node;
+
+ table = make_table_name_from_rangevar(stmt->relation);
+ }
else if (IsA(node, ExplainStmt))
{
ListCell *cell;
diff --git a/src/sample/pgpool.conf.sample-stream b/src/sample/pgpool.conf.sample-stream
index 454fdb9e5d1fd65437b6a67f12ab62658ea08f49..de99a7a97ba4a1a03cb3d5589d55ea61cb6e51fa 100644
--- a/src/sample/pgpool.conf.sample-stream
+++ b/src/sample/pgpool.conf.sample-stream
@@ -478,6 +478,14 @@ backend_clustering_mode = streaming_replication
# modified within the current explicit transaction will
# not be load balanced until the end of the transaction.
#
+ # dml_adaptive_global:
+ # Superset of dml_adaptive. In addition to per-transaction
+ # tracking, uses shared memory to track recently written
+ # tables across all sessions. Reads from recently written
+ # tables are routed to primary until a TTL (based on
+ # replication delay) expires. Requires additional shared
+ # memory. See track_table_mutation_* parameters below.
+ #
# always:
# if a write query is issued, read queries will
# not be load balanced until the session ends.
@@ -499,6 +507,46 @@ backend_clustering_mode = streaming_replication
#statement_level_load_balance = off
# Enables statement level load balancing
+# - Track Table Mutation (used by dml_adaptive_global) -
+ # WARNING: dml_adaptive_global increases shared memory usage
+ # Default settings require ~6.4 MB shared memory
+ # (0.1 MB table tracking + 6.3 MB query cache)
+
+#track_table_mutation_ttl_factor = 5.0
+ # TTL multiplier: TTL = replication_delay * factor
+ # Higher values provide more safety margin
+ # Range: 1.0-100.0 (default: 5.0)
+ # (change requires reload)
+
+#track_table_mutation_cold_start_duration = 2000
+ # Duration in milliseconds to route all queries to primary
+ # after child process starts (cold start period)
+ # Range: 0-60000 ms (default: 2000 ms = 2 seconds)
+ # Set to 0 to disable cold start behavior
+ # (change requires reload)
+
+#track_table_mutation_table_buckets = 1024
+ # Number of hash buckets for track table mutation
+ # Higher values reduce hash collisions
+ # Range: 64-65536 (default: 1024)
+ # (change requires restart)
+
+#track_table_mutation_table_size = 2048
+ # Maximum number of tables to track simultaneously
+ # Range: 128-131072 (default: 2048)
+ # (change requires restart)
+
+#track_table_mutation_query_buckets = 2048
+ # Number of hash buckets for query parse cache
+ # Range: 64-65536 (default: 2048)
+ # (change requires restart)
+
+#track_table_mutation_query_parse_cache_size = 10000
+ # Maximum number of query parse results to cache
+ # Range: 100-1000000 (default: 10000)
+ # Memory usage: ~640 bytes per entry (~6.3 MB default, ~64 MB for 100000)
+ # (change requires restart)
+
#------------------------------------------------------------------------------
# STREAMING REPLICATION MODE
#------------------------------------------------------------------------------
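Note on the TTL rule documented in the sample config (TTL = replication_delay * factor): a minimal sketch of that arithmetic, assuming the delay is given in microseconds as in the pool_worker_child.c hunk of this patch. The function names sketch_ttl_us and sketch_table_is_stale are hypothetical and are not part of the patch; how the patch handles a zero delay is not shown in this part of the diff, so the sketch does not model it.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: TTL in microseconds = replication delay * factor.
 * The factor is expected to be within the documented range 1.0-100.0.
 */
static uint64_t
sketch_ttl_us(uint64_t max_delay_us, double factor)
{
    assert(factor >= 1.0 && factor <= 100.0);
    return (uint64_t) ((double) max_delay_us * factor);
}

/* Hypothetical helper: a table counts as recently written while age < TTL. */
static int
sketch_table_is_stale(int64_t age_us, uint64_t ttl_us)
{
    return age_us >= 0 && (uint64_t) age_us < ttl_us;
}
```

For example, with an observed maximum delay of 200 ms and the default factor of 5.0, reads of a table written less than 1 second ago would go to the primary.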
diff --git a/src/streaming_replication/pool_worker_child.c b/src/streaming_replication/pool_worker_child.c
index 311b638658e66ebb56162ad9fa4392315b2df64e..347f88a88688309b298311a282fe1c1ef2aa0f73 100644
--- a/src/streaming_replication/pool_worker_child.c
+++ b/src/streaming_replication/pool_worker_child.c
@@ -58,6 +58,7 @@
#include "utils/pool_ip.h"
#include "utils/ps_status.h"
#include "utils/pool_stream.h"
+#include "utils/pool_track_table_mutation.h"
#include "context/pool_process_context.h"
#include "context/pool_session_context.h"
@@ -695,6 +696,7 @@ check_replication_time_lag_with_cmd(void)
double delay_ms;
uint64 delay;
uint64 delay_threshold_by_time;
+ uint64 max_delay_us = 0; /* Track maximum delay for table mutation map */
int token_count = 0;
int primary_node_id;
int save_errno;
@@ -1003,6 +1005,10 @@ check_replication_time_lag_with_cmd(void)
bkinfo->standby_delay = delay;
bkinfo->standby_delay_by_time = true;
+ /* Track maximum delay for table mutation map TTL calculation */
+ if (delay > max_delay_us)
+ max_delay_us = delay;
+
/*
* Log delay if necessary. threshold is in milliseconds, convert
* to microseconds.
@@ -1021,6 +1027,10 @@ check_replication_time_lag_with_cmd(void)
token = strtok_r(NULL, " \t\n", &saveptr);
}
+ /* Update track table mutation TTL based on maximum observed delay */
+ if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE_GLOBAL && max_delay_us > 0)
+ pool_track_table_mutation_update_ttl(max_delay_us);
+
}
PG_CATCH();
{
diff --git a/src/test/regression/libs.sh b/src/test/regression/libs.sh
index 7c5a0c1821191a572430b658d80ab34554110363..1c8ae392daa10056119c09c7127e839d859d700d 100644
--- a/src/test/regression/libs.sh
+++ b/src/test/regression/libs.sh
@@ -42,6 +42,8 @@ function wait_for_failover_done {
function clean_all {
pgrep pgpool | xargs kill -9 > /dev/null 2>&1
pgrep postgres | xargs kill -9 > /dev/null 2>&1
+ # Clean up SysV IPC resources leaked by kill -9 (note: removes all IPC objects the current user may delete)
+ ipcrm --all 2>/dev/null || true
rm -f $PGSOCKET_DIR/.s.PGSQL.*
netstat -t -p 2>/dev/null|grep pgpool
}
diff --git a/src/test/regression/tests/042.track_table_mutation/test.sh b/src/test/regression/tests/042.track_table_mutation/test.sh
new file mode 100755
index 0000000000000000000000000000000000000000..8b4dd17b820d36e3fc48216ac7f0544cbf0f5a9c
--- /dev/null
+++ b/src/test/regression/tests/042.track_table_mutation/test.sh
@@ -0,0 +1,354 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for track table mutation feature (in-memory table tracking).
+# Tests routing of queries based on recently written tables.
+#
+source $TESTLIBS
+TESTDIR=testdir
+PSQL=$PGBIN/psql
+PSQLOPTS="-a -q -X"
+PGPOOLBIN=$PGPOOL_INSTALL_DIR/bin
+export PGDATABASE=test
+
+# Only run in streaming replication mode since that's the target use case
+for mode in s
+do
+ rm -fr $TESTDIR
+ mkdir $TESTDIR
+ cd $TESTDIR
+
+ # Create test environment with 2 nodes
+ echo -n "creating test environment..."
+ $PGPOOL_SETUP -m $mode -n 2 || exit 1
+ echo "done."
+
+ source ./bashrc.ports
+
+ # Configure track table mutation feature via dml_adaptive_global
+ echo "disable_load_balance_on_write = 'dml_adaptive_global'" >> etc/pgpool.conf
+ echo "track_table_mutation_ttl_factor = 5.0" >> etc/pgpool.conf
+ echo "track_table_mutation_cold_start_duration = 10000" >> etc/pgpool.conf
+
+ # Enable load balancing explicitly
+ echo "load_balance_mode = on" >> etc/pgpool.conf
+
+ # Configure weights so we can distinguish routing
+ # Backend 0 (primary) weight=0, Backend 1 (standby) weight=1
+ # This means load balanced queries go to node 1 by default
+ echo "backend_weight0 = 0" >> etc/pgpool.conf
+ echo "backend_weight1 = 1" >> etc/pgpool.conf
+
+ # Enable debug logging to see routing decisions
+ echo "log_min_messages = debug1" >> etc/pgpool.conf
+
+ ./startall
+
+ export PGPORT=$PGPOOL_PORT
+ export PGHOST=localhost
+
+ wait_for_pgpool_startup
+
+ # Create test tables
+ $PSQL test <<EOF
+CREATE TABLE t1(i INTEGER);
+CREATE TABLE t2(i INTEGER);
+CREATE TABLE t3(i INTEGER);
+EOF
+
+ echo "=== Test 1: Cold Start Routing ==="
+ # During cold start, all queries should go to primary
+ # Restart pgpool to trigger cold start
+ ./shutdownall
+ ./startall
+ wait_for_pgpool_startup
+
+ # Immediately query - should go to primary due to cold start
+ $PSQL test -c "SELECT 'cold_start_test' as marker, * FROM t1;" > /dev/null 2>&1
+
+ # Check log for cold start message (use -a to handle binary log files)
+ if grep -a -q "could not load balance because of track table mutation cold start" log/pgpool.log; then
+ echo "Test 1 PASSED: Cold start routing works"
+ else
+ echo "Test 1 FAILED: Cold start routing not detected"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 2: Wait for cold start to end ==="
+ # Wait for cold start period to end (10 seconds).
+ # Use generous margin to avoid flakiness under load (e.g. full regression suite).
+ sleep 12
+
+ # Clear the log
+ > log/pgpool.log
+
+ # Now a clean table query should load balance (go to node 1)
+ $PSQL test -c "SELECT 'after_cold_start' as marker, * FROM t3;" > /dev/null 2>&1
+
+ # After cold start, queries to clean tables should load balance
+ # Check that it did NOT get forced to primary due to track table mutation
+ if grep -a -q "could not load balance because of track table mutation cold start" log/pgpool.log; then
+ echo "Test 2 FAILED: Still in cold start after waiting"
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 2 PASSED: Cold start ended correctly"
+
+ echo "=== Test 3: Write-then-Read Routing ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Write to t1 and then read - use single connection to ensure same session
+ $PSQL test <<EOF
+INSERT INTO t1 VALUES (1);
+SELECT 'write_read_test' as marker, * FROM t1;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ # Check log for table staleness message
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 3 PASSED: Write-then-read routing works"
+ else
+ echo "Test 3 FAILED: Table staleness not detected after write"
+ # Show relevant log entries for debugging
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 4: Clean Table Still Load Balances ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Read from t2 (never written to) - should load balance
+ $PSQL test -c "SELECT 'clean_table_test' as marker, * FROM t2;" > /dev/null 2>&1
+
+ # Should NOT see track table mutation blocking message for t2
+ if grep -a -q "could not load balance because table.*t2.*was recently written" log/pgpool.log; then
+ echo "Test 4 FAILED: Clean table incorrectly marked as stale"
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 4 PASSED: Clean tables still load balance"
+
+ echo "=== Test 5: UPDATE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Update t2 and then read - use single connection
+ $PSQL test <<EOF
+UPDATE t2 SET i = 999 WHERE i = 0;
+SELECT 'update_test' as marker, * FROM t2;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 5 PASSED: UPDATE marks table as stale"
+ else
+ echo "Test 5 FAILED: UPDATE did not mark table as stale"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 6: DELETE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Delete from t3 and then read - use single connection
+ $PSQL test <<EOF
+DELETE FROM t3 WHERE i = 0;
+SELECT 'delete_test' as marker, * FROM t3;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 6 PASSED: DELETE marks table as stale"
+ else
+ echo "Test 6 FAILED: DELETE did not mark table as stale"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 7: TRUNCATE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create a fresh table for TRUNCATE test
+ $PSQL test -c "CREATE TABLE t_truncate(i INTEGER);" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_truncate VALUES (1), (2), (3);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log again
+ > log/pgpool.log
+
+ # Truncate and then read - use single connection
+ $PSQL test <<EOF
+TRUNCATE t_truncate;
+SELECT 'truncate_test' as marker, * FROM t_truncate;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 7 PASSED: TRUNCATE marks table as stale"
+ else
+ echo "Test 7 FAILED: TRUNCATE did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 8: WITH Clause (CTE with DELETE) Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create a fresh table for WITH test
+ $PSQL test -c "CREATE TABLE t_cte(i INTEGER);" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_cte VALUES (1), (2), (3);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log again
+ > log/pgpool.log
+
+ # Use WITH clause with DELETE, then read from the table
+ $PSQL test <<EOF
+WITH deleted AS (DELETE FROM t_cte WHERE i = 1 RETURNING *)
+SELECT * FROM deleted;
+SELECT 'cte_test' as marker, * FROM t_cte;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 8 PASSED: WITH clause (CTE) marks table as stale"
+ else
+ echo "Test 8 FAILED: WITH clause (CTE) did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ # Test 9: MERGE (PostgreSQL 15+ only)
+ PG_MAJOR_VERSION=$($PSQL -t -c "SELECT substring(version() from 'PostgreSQL ([0-9]+)');" | tr -d ' ')
+ if [ "$PG_MAJOR_VERSION" -ge 15 ] 2>/dev/null; then
+ echo "=== Test 9: MERGE Marks Table as Stale (PostgreSQL $PG_MAJOR_VERSION) ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create tables for MERGE test
+ $PSQL test -c "CREATE TABLE t_merge_target(id INTEGER PRIMARY KEY, val TEXT);" > /dev/null 2>&1
+ $PSQL test -c "CREATE TABLE t_merge_source(id INTEGER, val TEXT);" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_merge_target VALUES (1, 'old');" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_merge_source VALUES (1, 'new'), (2, 'insert');" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log again
+ > log/pgpool.log
+
+ # Use MERGE, then read from the target table
+ $PSQL test <<EOF
+MERGE INTO t_merge_target t
+USING t_merge_source s ON t.id = s.id
+WHEN MATCHED THEN UPDATE SET val = s.val
+WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.val);
+SELECT 'merge_test' as marker, * FROM t_merge_target;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 9 PASSED: MERGE marks table as stale"
+ else
+ echo "Test 9 FAILED: MERGE did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+ else
+ echo "=== Test 9: MERGE skipped (requires PostgreSQL 15+, have $PG_MAJOR_VERSION) ==="
+ fi
+
+ echo "=== Test 10: ROLLBACK Does NOT Mark Table as Stale ==="
+ # Create a fresh table for rollback test
+ $PSQL test -c "CREATE TABLE t_rollback(i INTEGER);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log
+ > log/pgpool.log
+
+ # Write inside a transaction, then rollback
+ $PSQL test <<EOF
+BEGIN;
+INSERT INTO t_rollback VALUES (1);
+ROLLBACK;
+SELECT 'rollback_test' as marker, * FROM t_rollback;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ # Should NOT see t_rollback marked as stale since the write was rolled back
+ if grep -a -q "could not load balance because table.*t_rollback.*was recently written" log/pgpool.log; then
+ echo "Test 10 FAILED: Rolled-back write incorrectly marked table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 10 PASSED: ROLLBACK does not mark table as stale"
+
+ echo "=== Test 11: COMMIT Marks Table as Stale ==="
+ # Create a fresh table for commit test
+ $PSQL test -c "CREATE TABLE t_commit(i INTEGER);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log
+ > log/pgpool.log
+
+ # Write inside a transaction, then commit, then read
+ $PSQL test <<EOF
+BEGIN;
+INSERT INTO t_commit VALUES (1);
+COMMIT;
+SELECT 'commit_test' as marker, * FROM t_commit;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 11 PASSED: COMMIT marks table as stale"
+ else
+ echo "Test 11 FAILED: Committed write did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ echo ""
+ echo "=== All Track Table Mutation Tests PASSED ==="
+
+ ./shutdownall
+
+ cd ..
+done
+
+exit 0
diff --git a/src/test/regression/tests/043.track_table_mutation_watchdog/.gitignore b/src/test/regression/tests/043.track_table_mutation_watchdog/.gitignore
new file mode 100644
index 0000000000000000000000000000000000000000..fcb93d27a7e7e8a5efe6eacfb0f88f6f3c8bc765
--- /dev/null
+++ b/src/test/regression/tests/043.track_table_mutation_watchdog/.gitignore
@@ -0,0 +1,3 @@
+leader
+standby
+*.pid
diff --git a/src/test/regression/tests/043.track_table_mutation_watchdog/leader.conf b/src/test/regression/tests/043.track_table_mutation_watchdog/leader.conf
new file mode 100644
index 0000000000000000000000000000000000000000..945cff9860d0357fbb0e3e9a5643124d916bd9c3
--- /dev/null
+++ b/src/test/regression/tests/043.track_table_mutation_watchdog/leader.conf
@@ -0,0 +1,25 @@
+# leader watchdog config for track_table_mutation watchdog test
+use_watchdog = on
+wd_interval = 1
+wd_priority = 2
+
+hostname0 = 'localhost'
+wd_port0 = 21004
+pgpool_port0 = 11000
+hostname1 = 'localhost'
+wd_port1 = 21104
+pgpool_port1 = 11100
+
+heartbeat_hostname0 = 'localhost'
+heartbeat_port0 = 21005
+heartbeat_hostname1 = 'localhost'
+heartbeat_port1 = 21105
+
+enable_consensus_with_half_votes = on
+
+# Enable track table mutation feature via dml_adaptive_global
+disable_load_balance_on_write = 'dml_adaptive_global'
+track_table_mutation_cold_start_duration = 2000
+
+# Enable debug logging to see feature messages
+log_min_messages = debug1
diff --git a/src/test/regression/tests/043.track_table_mutation_watchdog/standby.conf b/src/test/regression/tests/043.track_table_mutation_watchdog/standby.conf
new file mode 100644
index 0000000000000000000000000000000000000000..a11c3dfca427cf6b246451d067c30b0255b9c4ce
--- /dev/null
+++ b/src/test/regression/tests/043.track_table_mutation_watchdog/standby.conf
@@ -0,0 +1,27 @@
+# standby watchdog config for track_table_mutation watchdog test
+port = 11100
+pcp_port = 11105
+use_watchdog = on
+wd_interval = 1
+wd_priority = 1
+
+hostname0 = 'localhost'
+wd_port0 = 21004
+pgpool_port0 = 11000
+hostname1 = 'localhost'
+wd_port1 = 21104
+pgpool_port1 = 11100
+
+heartbeat_hostname0 = 'localhost'
+heartbeat_port0 = 21005
+heartbeat_hostname1 = 'localhost'
+heartbeat_port1 = 21105
+
+enable_consensus_with_half_votes = on
+
+# Enable track table mutation feature via dml_adaptive_global
+disable_load_balance_on_write = 'dml_adaptive_global'
+track_table_mutation_cold_start_duration = 2000
+
+# Enable debug logging to see feature messages
+log_min_messages = debug1
diff --git a/src/test/regression/tests/043.track_table_mutation_watchdog/test.sh b/src/test/regression/tests/043.track_table_mutation_watchdog/test.sh
new file mode 100755
index 0000000000000000000000000000000000000000..752a6e6aa377fe0c54244975e606648101c98cf8
--- /dev/null
+++ b/src/test/regression/tests/043.track_table_mutation_watchdog/test.sh
@@ -0,0 +1,179 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for track table mutation global cold start on watchdog leader change.
+# Tests that when the watchdog leader changes, the new leader triggers
+# a global cold start to force all queries to primary.
+#
+source $TESTLIBS
+LEADER_DIR=leader
+STANDBY_DIR=standby
+PSQL=$PGBIN/psql
+success_count=0
+
+rm -fr $LEADER_DIR
+rm -fr $STANDBY_DIR
+
+mkdir $LEADER_DIR
+mkdir $STANDBY_DIR
+
+# cd into leader directory
+cd $LEADER_DIR
+
+# create leader environment with streaming replication
+echo -n "creating leader pgpool..."
+$PGPOOL_SETUP -m s -n 2 -p 11000 || exit 1
+echo "leader setup done."
+
+# copy the configurations to standby
+cp -r etc ../$STANDBY_DIR/
+
+source ./bashrc.ports
+cat ../leader.conf >> etc/pgpool.conf
+echo 0 > etc/pgpool_node_id
+
+./startall
+wait_for_pgpool_startup
+
+# back to test root dir
+cd ..
+
+# create standby environment
+mkdir $STANDBY_DIR/log
+echo -n "creating standby pgpool..."
+cat standby.conf >> $STANDBY_DIR/etc/pgpool.conf
+# since we reuse the leader's pgpool-II conf, change the pid file path
+echo "pid_file_name = '$PWD/pgpool2.pid'" >> $STANDBY_DIR/etc/pgpool.conf
+echo 1 > $STANDBY_DIR/etc/pgpool_node_id
+# start the standby pgpool-II by hand
+$PGPOOL_INSTALL_DIR/bin/pgpool -D -n -f $STANDBY_DIR/etc/pgpool.conf -F $STANDBY_DIR/etc/pcp.conf -a $STANDBY_DIR/etc/pool_hba.conf > $STANDBY_DIR/log/pgpool.log 2>&1 &
+
+# Test 1: Check if leader pgpool-II started correctly
+echo "=== Test 1: Waiting for the pgpool leader... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep "I am the cluster leader node. Starting escalation process" $LEADER_DIR/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 1 PASSED: Leader brought up successfully."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 1 ]; then
+ echo "Test 1 FAILED: Leader did not start"
+ $PGPOOL_INSTALL_DIR/bin/pgpool -f $STANDBY_DIR/etc/pgpool.conf -m f stop 2>/dev/null
+ cd $LEADER_DIR && ./shutdownall
+ exit 1
+fi
+
+# Test 2: Check if standby has successfully joined
+echo "=== Test 2: Waiting for the standby to join cluster... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep "successfully joined the watchdog cluster as standby node" $STANDBY_DIR/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 2 PASSED: Standby successfully connected."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 2 ]; then
+ echo "Test 2 FAILED: Standby did not join cluster"
+ $PGPOOL_INSTALL_DIR/bin/pgpool -f $STANDBY_DIR/etc/pgpool.conf -m f stop 2>/dev/null
+ cd $LEADER_DIR && ./shutdownall
+ exit 1
+fi
+
+# Test 3: Verify track_table_mutation is enabled and working on leader
+echo "=== Test 3: Verify track_table_mutation is enabled ==="
+if grep -a "track_table_mutation: initialized" $LEADER_DIR/log/pgpool.log > /dev/null 2>&1; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 3 PASSED: track_table_mutation initialized on leader"
+else
+ echo "Test 3 FAILED: track_table_mutation not initialized on leader"
+ $PGPOOL_INSTALL_DIR/bin/pgpool -f $STANDBY_DIR/etc/pgpool.conf -m f stop 2>/dev/null
+ cd $LEADER_DIR && ./shutdownall
+ exit 1
+fi
+
+# Test 4: Stop leader pgpool and trigger failover
+echo "=== Test 4: Triggering leader failover... ==="
+$PGPOOL_INSTALL_DIR/bin/pgpool -f $LEADER_DIR/etc/pgpool.conf -m f stop
+
+echo "Checking if the Standby pgpool-II detected the leader shutdown..."
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep -a " is shutting down" $STANDBY_DIR/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 4 PASSED: Leader shutdown detected."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 4 ]; then
+ echo "Test 4 FAILED: Leader shutdown not detected"
+ $PGPOOL_INSTALL_DIR/bin/pgpool -f $STANDBY_DIR/etc/pgpool.conf -m f stop 2>/dev/null
+ cd $LEADER_DIR && ./shutdownall
+ exit 1
+fi
+
+# Test 5: Verify standby becomes new leader
+echo "=== Test 5: Checking if standby takes over as leader... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep -a "I am the cluster leader node. Starting escalation process" $STANDBY_DIR/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 5 PASSED: Standby became the new leader."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 5 ]; then
+ echo "Test 5 FAILED: Standby did not become leader"
+ $PGPOOL_INSTALL_DIR/bin/pgpool -f $STANDBY_DIR/etc/pgpool.conf -m f stop 2>/dev/null
+ cd $LEADER_DIR && ./shutdownall
+ exit 1
+fi
+
+# Test 6: Verify global cold start was triggered on new leader
+echo "=== Test 6: Checking if global cold start was triggered... ==="
+# The new leader should trigger global cold start when it becomes coordinator
+# Look for the log message that indicates global cold start was triggered
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep -a "track_table_mutation: entering global cold start" $STANDBY_DIR/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 6 PASSED: Global cold start triggered on new leader."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+# Cleanup
+$PGPOOL_INSTALL_DIR/bin/pgpool -f $STANDBY_DIR/etc/pgpool.conf -m f stop 2>/dev/null
+cd $LEADER_DIR
+./shutdownall
+
+echo ""
+echo "$success_count out of 6 successful"
+
+if test $success_count -eq 6
+then
+ echo "=== All Track Table Mutation Watchdog Tests PASSED ==="
+ exit 0
+fi
+
+exit 1
diff --git a/src/utils/pool_track_table_mutation.c b/src/utils/pool_track_table_mutation.c
new file mode 100644
index 0000000000000000000000000000000000000000..27d4f0380d43a237f518c60cdd73aba2ff51b723
--- /dev/null
+++ b/src/utils/pool_track_table_mutation.c
@@ -0,0 +1,1188 @@
+/* -*-pgsql-c-*- */
+/*
+ * pgpool: a language independent connection pool server for PostgreSQL
+ * written by Tatsuo Ishii
+ *
+ * Copyright (c) 2003-2026 PgPool Global Development Group
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose and without fee is hereby
+ * granted, provided that the above copyright notice appear in all
+ * copies and that both that copyright notice and this permission
+ * notice appear in supporting documentation, and that the name of the
+ * author not be used in advertising or publicity pertaining to
+ * distribution of the software without specific, written prior
+ * permission. The author makes no representations about the
+ * suitability of this software for any purpose. It is provided "as
+ * is" without express or implied warranty.
+ *
+ * pool_track_table_mutation.c: In-memory tracking of recently written tables
+ * to avoid stale reads from replicas during replication lag
+ *
+ * Based on the "lagless" architecture from Tailor Brands:
+ * https://medium.com/tailor-tech/using-database-read-replicas-in-distributed-systems-d80eaf6bbf8a
+ */
+
+#include <string.h>
+#include <stdlib.h>
+#include <sys/time.h>
+#include <unistd.h>
+
+#include "pool.h"
+#include "pool_config.h"
+#include "context/pool_session_context.h"
+#include "utils/pool_track_table_mutation.h"
+#include "utils/elog.h"
+#include "utils/pool_ipc.h"
+#include "utils/palloc.h"
+#include "utils/pool_relcache.h"
+
+#define DATABASE_TO_OID_QUERY "SELECT oid FROM pg_catalog.pg_database WHERE datname = '%s'"
+
+/* ----------------
+ * Local variables
+ * ----------------
+ */
+
+/* Pointer to shared memory structure */
+static TrackTableMutationShmem *track_table_mutation_shmem = NULL;
+
+/* Per-process cold start tracking (not in shared memory) */
+static struct timeval process_start_time;
+static bool cold_start_initialized = false;
+
+/* ----------------
+ * Helper macros for accessing flexible arrays in shared memory
+ * ----------------
+ */
+
+/* Get pointer to bucket array in table map */
+#define TABLE_MAP_BUCKETS(map) \
+ ((int *)((char *)(map) + sizeof(TrackTableMutationHashTable)))
+
+/* Get pointer to entry array in table map */
+#define TABLE_MAP_ENTRIES(map) \
+ ((TrackTableMutationEntry *)((char *)(map) + sizeof(TrackTableMutationHashTable) + \
+ (map)->num_buckets * sizeof(int)))
+
+/* Get pointer to bucket array in query cache */
+#define QUERY_CACHE_BUCKETS(cache) \
+ ((int *)((char *)(cache) + sizeof(QueryParseCache)))
+
+/* Get pointer to entry array in query cache */
+#define QUERY_CACHE_ENTRIES(cache) \
+ ((QueryParseEntry *)((char *)(cache) + sizeof(QueryParseCache) + \
+ (cache)->num_buckets * sizeof(int)))
+
+/* ----------------
+ * Semaphore lock helpers
+ * ----------------
+ */
+
+static inline void
+table_map_lock(void)
+{
+ pool_semaphore_lock(TRACK_TABLE_MUTATION_TABLE_SEM);
+}
+
+static inline void
+table_map_unlock(void)
+{
+ pool_semaphore_unlock(TRACK_TABLE_MUTATION_TABLE_SEM);
+}
+
+static inline void
+query_cache_lock(void)
+{
+ pool_semaphore_lock(TRACK_TABLE_MUTATION_QUERY_SEM);
+}
+
+static inline void
+query_cache_unlock(void)
+{
+ pool_semaphore_unlock(TRACK_TABLE_MUTATION_QUERY_SEM);
+}
+
+/* ----------------
+ * Hash functions
+ * ----------------
+ */
+
+/*
+ * FNV-1a hash for table/database oid pair
+ */
+static uint32
+fnv1a_hash_table_key(int table_oid, int dboid)
+{
+ uint32 hash = 2166136261u; /* FNV offset basis */
+ uint32 data[2];
+ const unsigned char *bytes;
+ size_t i;
+
+ data[0] = (uint32) table_oid;
+ data[1] = (uint32) dboid;
+ bytes = (const unsigned char *) data;
+
+ for (i = 0; i < sizeof(data); i++)
+ {
+ hash ^= bytes[i];
+ hash *= 16777619u; /* FNV prime */
+ }
+
+ return hash;
+}
+
+/*
+ * 64-bit FNV-1a hash of a byte buffer
+ */
+static uint64
+fnv1a_hash_64(const char *str, size_t len)
+{
+ uint64 hash = 14695981039346656037ULL; /* FNV offset basis for 64-bit */
+ size_t i;
+
+ for (i = 0; i < len; i++)
+ {
+ hash ^= (uint8)str[i];
+ hash *= 1099511628211ULL; /* FNV prime for 64-bit */
+ }
+
+ return hash;
+}
+
+/* ----------------
+ * Time utilities
+ * ----------------
+ */
+
+/*
+ * Get elapsed time in microseconds between two timevals
+ */
+static int64
+elapsed_us(struct timeval *start, struct timeval *end)
+{
+ return ((int64)(end->tv_sec - start->tv_sec) * 1000000) +
+ (end->tv_usec - start->tv_usec);
+}
+
+/*
+ * Get current time
+ */
+static void
+get_current_time(struct timeval *tv)
+{
+ gettimeofday(tv, NULL);
+}
+
+/* ----------------
+ * Database oid lookup
+ * ----------------
+ */
+
+static int
+track_table_mutation_get_database_oid_internal(void)
+{
+ int oid = 0;
+ static POOL_RELCACHE *relcache;
+ POOL_CONNECTION_POOL *backend;
+ POOL_SESSION_CONTEXT *session_context;
+
+ /* Safety check: must have shmem initialized */
+ if (track_table_mutation_shmem == NULL)
+ return oid;
+
+ session_context = pool_get_session_context(false);
+ if (session_context == NULL)
+ return oid;
+
+ /* Ensure we have a valid query context */
+ if (session_context->query_context == NULL)
+ return oid;
+
+ backend = session_context->backend;
+ if (backend == NULL || MAIN_CONNECTION(backend) == NULL || MAIN_CONNECTION(backend)->sp == NULL)
+ return oid;
+
+ /* Ensure database name is valid */
+ if (MAIN_CONNECTION(backend)->sp->database == NULL)
+ return oid;
+
+ if (!relcache)
+ {
+ relcache = pool_create_relcache(pool_config->relcache_size,
+ DATABASE_TO_OID_QUERY,
+ int_register_func,
+ int_unregister_func,
+ false);
+ if (relcache == NULL)
+ {
+ ereport(LOG,
+ (errmsg("track_table_mutation: error creating relcache while getting database OID")));
+ return oid;
+ }
+ }
+
+ oid = (int) (intptr_t) pool_search_relcache(relcache, backend,
+ MAIN_CONNECTION(backend)->sp->database);
+ return oid;
+}
+
+int
+pool_track_table_mutation_get_database_oid(void)
+{
+ return track_table_mutation_get_database_oid_internal();
+}
+
+/* ----------------
+ * Table mutation hash table operations
+ * ----------------
+ */
+
+/*
+ * Initialize table mutation hash table
+ */
+static void
+table_map_init(TrackTableMutationHashTable *map, int num_buckets, int max_entries)
+{
+ int *buckets;
+ TrackTableMutationEntry *entries;
+ int i;
+
+ map->num_buckets = num_buckets;
+ map->max_entries = max_entries;
+ map->num_entries = 0;
+ map->free_list_head = 0;
+
+ buckets = TABLE_MAP_BUCKETS(map);
+ entries = TABLE_MAP_ENTRIES(map);
+
+ /* Initialize all buckets to empty */
+ for (i = 0; i < num_buckets; i++)
+ buckets[i] = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ /* Initialize free list - chain all entries */
+ for (i = 0; i < max_entries; i++)
+ {
+ entries[i].in_use = false;
+ entries[i].next = (i < max_entries - 1) ? i + 1 : TRACK_TABLE_MUTATION_INVALID_INDEX;
+ }
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: initialized table map with %d buckets, %d max entries",
+ num_buckets, max_entries)));
+}
+
+/*
+ * Allocate an entry from the free list
+ */
+static int
+table_map_alloc_entry(TrackTableMutationHashTable *map)
+{
+ TrackTableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ int idx;
+
+ if (map->free_list_head == TRACK_TABLE_MUTATION_INVALID_INDEX)
+ return TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ idx = map->free_list_head;
+ map->free_list_head = entries[idx].next;
+ entries[idx].in_use = true;
+ entries[idx].next = TRACK_TABLE_MUTATION_INVALID_INDEX;
+ map->num_entries++;
+
+ return idx;
+}
+
+/*
+ * Free an entry back to the free list
+ */
+static void
+table_map_free_entry(TrackTableMutationHashTable *map, int idx)
+{
+ TrackTableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+
+ entries[idx].in_use = false;
+ entries[idx].next = map->free_list_head;
+ map->free_list_head = idx;
+ map->num_entries--;
+}
+
+/*
+ * Look up a table in the hash table
+ * Returns entry index or TRACK_TABLE_MUTATION_INVALID_INDEX if not found
+ * Must be called with lock held
+ */
+static int
+table_map_lookup(TrackTableMutationHashTable *map, int table_oid, int dboid, uint32 hash)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TrackTableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ int bucket = hash % map->num_buckets;
+ int idx = buckets[bucket];
+
+ while (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ if (entries[idx].hash == hash &&
+ entries[idx].table_oid == table_oid &&
+ entries[idx].dboid == dboid)
+ {
+ return idx;
+ }
+ idx = entries[idx].next;
+ }
+
+ return TRACK_TABLE_MUTATION_INVALID_INDEX;
+}
+
+/*
+ * Insert or update a table entry
+ * Must be called with lock held
+ */
+static void
+table_map_insert(TrackTableMutationHashTable *map, int table_oid, int dboid,
+ uint32 hash, struct timeval *write_time)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TrackTableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ int bucket = hash % map->num_buckets;
+ int idx;
+
+ /* Check if entry already exists */
+ idx = table_map_lookup(map, table_oid, dboid, hash);
+ if (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ /* Update existing entry */
+ entries[idx].last_write_time = *write_time;
+ return;
+ }
+
+ /* Allocate new entry */
+ idx = table_map_alloc_entry(map);
+ if (idx == TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ int b;
+ /* Table is full - evict an entry */
+ /* For simplicity, just use the first entry in first non-empty bucket */
+ for (b = 0; b < map->num_buckets; b++)
+ {
+ if (buckets[b] != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ int victim = buckets[b];
+ buckets[b] = entries[victim].next;
+ table_map_free_entry(map, victim);
+ idx = table_map_alloc_entry(map);
+ break;
+ }
+ }
+
+ if (idx == TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ ereport(WARNING,
+ (errmsg("track_table_mutation: failed to allocate entry for table oid %d (dboid %d)",
+ table_oid, dboid)));
+ return;
+ }
+ }
+
+ /* Initialize new entry */
+ entries[idx].table_oid = table_oid;
+ entries[idx].dboid = dboid;
+ entries[idx].hash = hash;
+ entries[idx].last_write_time = *write_time;
+
+ /* Insert at head of bucket chain */
+ entries[idx].next = buckets[bucket];
+ buckets[bucket] = idx;
+
+ ereport(DEBUG2,
+ (errmsg("track_table_mutation: marked table oid %d (dboid %d) as written",
+ table_oid, dboid)));
+}
+
+/*
+ * Remove expired entries from the table map
+ * Must be called with lock held
+ */
+static void
+table_map_cleanup_expired(TrackTableMutationHashTable *map, uint64 ttl_us)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TrackTableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ struct timeval now;
+ int removed = 0;
+ int b;
+
+ get_current_time(&now);
+
+ for (b = 0; b < map->num_buckets; b++)
+ {
+ int *prev_ptr = &buckets[b];
+ int idx = buckets[b];
+
+ while (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ int64 elapsed = elapsed_us(&entries[idx].last_write_time, &now);
+
+ if (elapsed > (int64)ttl_us)
+ {
+ /* Entry has expired - remove it */
+ int next = entries[idx].next;
+ *prev_ptr = next;
+ table_map_free_entry(map, idx);
+ idx = next;
+ removed++;
+ }
+ else
+ {
+ prev_ptr = &entries[idx].next;
+ idx = entries[idx].next;
+ }
+ }
+ }
+
+ if (removed > 0)
+ {
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: cleaned up %d expired table entries", removed)));
+ }
+}
+
+/* ----------------
+ * Query parse cache operations
+ * ----------------
+ */
+
+/*
+ * Initialize query parse cache
+ */
+static void
+query_cache_init(QueryParseCache *cache, int num_buckets, int max_entries)
+{
+ int *buckets;
+ QueryParseEntry *entries;
+ int i;
+
+ cache->num_buckets = num_buckets;
+ cache->max_entries = max_entries;
+ cache->num_entries = 0;
+ cache->free_list_head = 0;
+ cache->lru_head = TRACK_TABLE_MUTATION_INVALID_INDEX;
+ cache->lru_tail = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ buckets = QUERY_CACHE_BUCKETS(cache);
+ entries = QUERY_CACHE_ENTRIES(cache);
+
+ /* Initialize all buckets to empty */
+ for (i = 0; i < num_buckets; i++)
+ buckets[i] = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ /* Initialize free list */
+ for (i = 0; i < max_entries; i++)
+ {
+ entries[i].in_use = false;
+ entries[i].next = (i < max_entries - 1) ? i + 1 : TRACK_TABLE_MUTATION_INVALID_INDEX;
+ entries[i].lru_prev = TRACK_TABLE_MUTATION_INVALID_INDEX;
+ entries[i].lru_next = TRACK_TABLE_MUTATION_INVALID_INDEX;
+ }
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: initialized query cache with %d buckets, %d max entries",
+ num_buckets, max_entries)));
+}
+
+/*
+ * Move entry to front of LRU list (most recently used)
+ */
+static void
+query_cache_lru_touch(QueryParseCache *cache, int idx)
+{
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+
+ /* Already at head? */
+ if (cache->lru_head == idx)
+ return;
+
+ /* Remove from current position */
+ if (entries[idx].lru_prev != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ entries[entries[idx].lru_prev].lru_next = entries[idx].lru_next;
+ if (entries[idx].lru_next != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ entries[entries[idx].lru_next].lru_prev = entries[idx].lru_prev;
+ if (cache->lru_tail == idx)
+ cache->lru_tail = entries[idx].lru_prev;
+
+ /* Insert at head */
+ entries[idx].lru_prev = TRACK_TABLE_MUTATION_INVALID_INDEX;
+ entries[idx].lru_next = cache->lru_head;
+ if (cache->lru_head != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ entries[cache->lru_head].lru_prev = idx;
+ cache->lru_head = idx;
+ if (cache->lru_tail == TRACK_TABLE_MUTATION_INVALID_INDEX)
+ cache->lru_tail = idx;
+}
+
+/*
+ * Add entry to LRU list (at head)
+ */
+static void
+query_cache_lru_add(QueryParseCache *cache, int idx)
+{
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+
+ entries[idx].lru_prev = TRACK_TABLE_MUTATION_INVALID_INDEX;
+ entries[idx].lru_next = cache->lru_head;
+
+ if (cache->lru_head != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ entries[cache->lru_head].lru_prev = idx;
+
+ cache->lru_head = idx;
+
+ if (cache->lru_tail == TRACK_TABLE_MUTATION_INVALID_INDEX)
+ cache->lru_tail = idx;
+}
+
+/*
+ * Remove entry from LRU list
+ */
+static void
+query_cache_lru_remove(QueryParseCache *cache, int idx)
+{
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+
+ if (entries[idx].lru_prev != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ entries[entries[idx].lru_prev].lru_next = entries[idx].lru_next;
+ else
+ cache->lru_head = entries[idx].lru_next;
+
+ if (entries[idx].lru_next != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ entries[entries[idx].lru_next].lru_prev = entries[idx].lru_prev;
+ else
+ cache->lru_tail = entries[idx].lru_prev;
+
+ entries[idx].lru_prev = TRACK_TABLE_MUTATION_INVALID_INDEX;
+ entries[idx].lru_next = TRACK_TABLE_MUTATION_INVALID_INDEX;
+}
+
+/*
+ * Allocate entry from free list, evicting LRU if necessary
+ */
+static int
+query_cache_alloc_entry(QueryParseCache *cache)
+{
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+ int *buckets = QUERY_CACHE_BUCKETS(cache);
+ int idx;
+
+ if (cache->free_list_head != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ idx = cache->free_list_head;
+ cache->free_list_head = entries[idx].next;
+ entries[idx].in_use = true;
+ entries[idx].next = TRACK_TABLE_MUTATION_INVALID_INDEX;
+ cache->num_entries++;
+ return idx;
+ }
+
+ /* No free entries - evict LRU */
+ if (cache->lru_tail == TRACK_TABLE_MUTATION_INVALID_INDEX)
+ return TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ idx = cache->lru_tail;
+
+ /* Remove from hash bucket */
+ int bucket = entries[idx].query_hash % cache->num_buckets;
+ int *prev_ptr = &buckets[bucket];
+ int curr = buckets[bucket];
+
+ while (curr != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ if (curr == idx)
+ {
+ *prev_ptr = entries[curr].next;
+ break;
+ }
+ prev_ptr = &entries[curr].next;
+ curr = entries[curr].next;
+ }
+
+ /* Remove from LRU list */
+ query_cache_lru_remove(cache, idx);
+
+ /* Reinitialize entry */
+ entries[idx].in_use = true;
+ entries[idx].next = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ return idx;
+}
+
+/*
+ * Look up a query in the cache
+ */
+static int
+query_cache_lookup(QueryParseCache *cache, uint64 hash)
+{
+ int *buckets = QUERY_CACHE_BUCKETS(cache);
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+ int bucket = hash % cache->num_buckets;
+ int idx = buckets[bucket];
+
+ while (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ if (entries[idx].query_hash == hash)
+ return idx;
+ idx = entries[idx].next;
+ }
+
+ return TRACK_TABLE_MUTATION_INVALID_INDEX;
+}
+
+/* ----------------
+ * Query normalization
+ * ----------------
+ */
+
+/*
+ * Simple query normalization:
+ * - Strip comments (-- style and C-style block comments)
+ * - Collapse whitespace
+ * - Convert to lowercase (except inside strings)
+ * - Replace literal values with placeholders
+ *
+ * This is a simplified version - pgpool2 already does this elsewhere,
+ * but we need a standalone version for the track table mutation feature.
+ */
+static size_t
+normalize_query(const char *query, char *output, size_t output_size)
+{
+ const char *src = query;
+ char *dst = output;
+ char *dst_end = output + output_size - 1;
+ bool in_string = false;
+ char string_char = 0;
+ bool last_was_space = true; /* Start true to skip leading space */
+
+ while (*src && dst < dst_end)
+ {
+ /* Handle string literals */
+ if (in_string)
+ {
+ if (*src == string_char)
+ {
+ if (*(src + 1) == string_char)
+ {
+ /* Escaped quote */
+ src += 2;
+ continue;
+ }
+ in_string = false;
+ *dst++ = '$'; /* Replace string content with placeholder */
+ }
+ src++;
+ continue;
+ }
+
+ /* Check for string start */
+ if (*src == '\'' || *src == '"')
+ {
+ in_string = true;
+ string_char = *src;
+ src++;
+ continue;
+ }
+
+ /* Handle single-line comments */
+ if (*src == '-' && *(src + 1) == '-')
+ {
+ while (*src && *src != '\n')
+ src++;
+ continue;
+ }
+
+ /* Handle multi-line comments */
+ if (*src == '/' && *(src + 1) == '*')
+ {
+ src += 2;
+ while (*src && !(*src == '*' && *(src + 1) == '/'))
+ src++;
+ if (*src)
+ src += 2;
+ continue;
+ }
+
+ /* Handle whitespace */
+ if (*src == ' ' || *src == '\t' || *src == '\n' || *src == '\r')
+ {
+ if (!last_was_space)
+ {
+ *dst++ = ' ';
+ last_was_space = true;
+ }
+ src++;
+ continue;
+ }
+
+ /* Handle numbers - replace with placeholder */
+ if ((*src >= '0' && *src <= '9') ||
+ (*src == '.' && *(src + 1) >= '0' && *(src + 1) <= '9'))
+ {
+ while (*src && ((*src >= '0' && *src <= '9') || *src == '.'))
+ src++;
+ if (!last_was_space && dst > output && *(dst - 1) != '$')
+ *dst++ = '$';
+ last_was_space = false;
+ continue;
+ }
+
+ /* Regular character - convert to lowercase */
+ if (*src >= 'A' && *src <= 'Z')
+ *dst++ = *src + 32;
+ else
+ *dst++ = *src;
+
+ last_was_space = false;
+ src++;
+ }
+
+ /* Remove trailing space */
+ if (dst > output && *(dst - 1) == ' ')
+ dst--;
+
+ *dst = '\0';
+ return dst - output;
+}
+
+/* ----------------
+ * Public API implementation
+ * ----------------
+ */
+
+Size
+pool_track_table_mutation_shmem_size(void)
+{
+ Size size = 0;
+ int table_buckets = pool_config->track_table_mutation_table_buckets;
+ int table_size = pool_config->track_table_mutation_table_size;
+ int query_buckets = pool_config->track_table_mutation_query_buckets;
+ int query_cache_size = pool_config->track_table_mutation_query_parse_cache_size;
+
+ /* Main structure */
+ size += sizeof(TrackTableMutationShmem);
+
+ /* Table mutation hash table */
+ size += sizeof(TrackTableMutationHashTable);
+ size += table_buckets * sizeof(int); /* buckets array */
+ size += table_size * sizeof(TrackTableMutationEntry); /* entries array */
+
+ /* Query parse cache */
+ size += sizeof(QueryParseCache);
+ size += query_buckets * sizeof(int); /* buckets array */
+ size += query_cache_size * sizeof(QueryParseEntry); /* entries array */
+
+ return size;
+}
+
+void
+pool_track_table_mutation_init(void)
+{
+#ifndef POOL_PRIVATE
+ Size shmem_size;
+ char *shmem_ptr;
+
+ if (pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: feature disabled")));
+ return;
+ }
+
+ shmem_size = pool_track_table_mutation_shmem_size();
+
+ /*
+ * Allocate from the main shared memory segment.
+ * Memory is already zeroed by initialize_shared_memory_main_segment().
+ */
+ shmem_ptr = pool_shared_memory_segment_get_chunk(shmem_size);
+ if (shmem_ptr == NULL)
+ {
+ ereport(ERROR,
+ (errmsg("track_table_mutation: failed to allocate %zu bytes of shared memory",
+ shmem_size)));
+ return;
+ }
+
+ /* Set up pointers to structures within shared memory */
+ track_table_mutation_shmem = (TrackTableMutationShmem *)shmem_ptr;
+ shmem_ptr += sizeof(TrackTableMutationShmem);
+
+ track_table_mutation_shmem->table_map = (TrackTableMutationHashTable *)shmem_ptr;
+ shmem_ptr += sizeof(TrackTableMutationHashTable);
+ shmem_ptr += pool_config->track_table_mutation_table_buckets * sizeof(int);
+ shmem_ptr += pool_config->track_table_mutation_table_size * sizeof(TrackTableMutationEntry);
+
+ track_table_mutation_shmem->query_cache = (QueryParseCache *)shmem_ptr;
+
+ /* Initialize structures */
+ table_map_init(track_table_mutation_shmem->table_map,
+ pool_config->track_table_mutation_table_buckets,
+ pool_config->track_table_mutation_table_size);
+
+ query_cache_init(track_table_mutation_shmem->query_cache,
+ pool_config->track_table_mutation_query_buckets,
+ pool_config->track_table_mutation_query_parse_cache_size);
+
+ /* Initialize global state */
+ track_table_mutation_shmem->state.initialized = true;
+ track_table_mutation_shmem->state.current_ttl_us = TRACK_TABLE_MUTATION_DEFAULT_TTL_US;
+ get_current_time(&track_table_mutation_shmem->state.ttl_last_updated);
+ get_current_time(&track_table_mutation_shmem->state.last_cleanup_time);
+ track_table_mutation_shmem->state.global_cold_start_until.tv_sec = 0;
+ track_table_mutation_shmem->state.global_cold_start_until.tv_usec = 0;
+ track_table_mutation_shmem->state.stats_queries_checked = 0;
+ track_table_mutation_shmem->state.stats_forced_primary = 0;
+ track_table_mutation_shmem->state.stats_allowed_replica = 0;
+
+ ereport(LOG,
+ (errmsg("track_table_mutation: initialized with %zu bytes shared memory",
+ shmem_size)));
+#endif
+}
+
+void
+pool_track_table_mutation_child_init(void)
+{
+ if (pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE_GLOBAL || track_table_mutation_shmem == NULL)
+ return;
+
+ get_current_time(&process_start_time);
+ cold_start_initialized = true;
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: child initialized, cold start period %d ms",
+ pool_config->track_table_mutation_cold_start_duration)));
+}
+
+bool
+pool_track_table_mutation_in_cold_start(void)
+{
+ struct timeval now;
+ int64 elapsed_ms;
+
+ if (pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE_GLOBAL || track_table_mutation_shmem == NULL)
+ return false;
+
+ if (pool_config->track_table_mutation_cold_start_duration <= 0)
+ return false;
+
+ get_current_time(&now);
+
+ /* Check for watchdog-triggered global cold start first */
+ if (track_table_mutation_shmem->state.global_cold_start_until.tv_sec != 0 &&
+ elapsed_us(&now, &track_table_mutation_shmem->state.global_cold_start_until) > 0)
+ {
+ return true;
+ }
+
+ /* Check per-process cold start */
+ if (!cold_start_initialized)
+ return false;
+
+ elapsed_ms = elapsed_us(&process_start_time, &now) / 1000;
+
+ if (elapsed_ms < pool_config->track_table_mutation_cold_start_duration)
+ {
+ ereport(DEBUG2,
+ (errmsg("track_table_mutation: in cold start (%ld/%d ms)",
+ (long)elapsed_ms, pool_config->track_table_mutation_cold_start_duration)));
+ return true;
+ }
+
+ return false;
+}
+
+void
+pool_track_table_mutation_trigger_global_cold_start(void)
+{
+ struct timeval now;
+ int duration_ms;
+
+ if (pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE_GLOBAL || track_table_mutation_shmem == NULL)
+ return;
+
+ duration_ms = pool_config->track_table_mutation_cold_start_duration;
+ if (duration_ms <= 0)
+ return;
+
+ get_current_time(&now);
+ track_table_mutation_shmem->state.global_cold_start_until = now;
+ track_table_mutation_shmem->state.global_cold_start_until.tv_sec += duration_ms / 1000;
+ track_table_mutation_shmem->state.global_cold_start_until.tv_usec += (duration_ms % 1000) * 1000;
+ if (track_table_mutation_shmem->state.global_cold_start_until.tv_usec >= 1000000)
+ {
+ track_table_mutation_shmem->state.global_cold_start_until.tv_sec +=
+ track_table_mutation_shmem->state.global_cold_start_until.tv_usec / 1000000;
+ track_table_mutation_shmem->state.global_cold_start_until.tv_usec %=
+ 1000000;
+ }
+
+ ereport(LOG,
+ (errmsg("track_table_mutation: entering global cold start for %d ms",
+ duration_ms)));
+}
+
+bool
+pool_track_table_mutation_table_is_stale(int table_oid, int dboid)
+{
+ TrackTableMutationHashTable *map;
+ struct timeval now;
+ uint64 ttl_us;
+ uint32 hash;
+ int idx;
+ bool is_stale = false;
+
+ if (pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE_GLOBAL || track_table_mutation_shmem == NULL)
+ return false;
+
+ if (table_oid <= 0 || dboid <= 0)
+ {
+ is_stale = true;
+ goto update_stats;
+ }
+
+ map = track_table_mutation_shmem->table_map;
+ hash = fnv1a_hash_table_key(table_oid, dboid);
+
+ table_map_lock();
+
+ idx = table_map_lookup(map, table_oid, dboid, hash);
+ if (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ TrackTableMutationEntry *entries = TABLE_MAP_ENTRIES(map);
+ get_current_time(&now);
+ ttl_us = track_table_mutation_shmem->state.current_ttl_us;
+
+ int64 elapsed = elapsed_us(&entries[idx].last_write_time, &now);
+ is_stale = (elapsed < (int64)ttl_us);
+
+ ereport(DEBUG2,
+ (errmsg("track_table_mutation: table oid %d (dboid %d) elapsed=%ld us, ttl=%lu us, stale=%d",
+ table_oid, dboid, (long)elapsed, (unsigned long)ttl_us, is_stale)));
+ }
+
+ table_map_unlock();
+
+update_stats:
+ /* Update statistics - skip if shmem not available */
+ if (track_table_mutation_shmem != NULL)
+ {
+ __sync_fetch_and_add(&track_table_mutation_shmem->state.stats_queries_checked, 1);
+ if (is_stale)
+ __sync_fetch_and_add(&track_table_mutation_shmem->state.stats_forced_primary, 1);
+ else
+ __sync_fetch_and_add(&track_table_mutation_shmem->state.stats_allowed_replica, 1);
+ }
+
+ return is_stale;
+}
+
+void
+pool_track_table_mutation_mark_tables_written(const int *table_oids, int num_tables, int dboid)
+{
+ TrackTableMutationHashTable *map;
+ struct timeval now;
+ int i;
+
+ if (pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE_GLOBAL || track_table_mutation_shmem == NULL)
+ return;
+
+ if (num_tables <= 0 || table_oids == NULL || dboid <= 0)
+ return;
+
+ map = track_table_mutation_shmem->table_map;
+ get_current_time(&now);
+
+ table_map_lock();
+
+ /* Periodically clean up expired entries */
+ if (map->num_entries > map->max_entries * 3 / 4)
+ {
+ /* Limit cleanup frequency to avoid O(N) scan on every write */
+ /* 100ms interval */
+ if (elapsed_us(&track_table_mutation_shmem->state.last_cleanup_time, &now) > 100000)
+ {
+ table_map_cleanup_expired(map, track_table_mutation_shmem->state.current_ttl_us);
+ track_table_mutation_shmem->state.last_cleanup_time = now;
+ }
+ }
+
+ for (i = 0; i < num_tables; i++)
+ {
+ uint32 hash;
+ int table_oid = table_oids[i];
+
+ if (table_oid > 0)
+ {
+ hash = fnv1a_hash_table_key(table_oid, dboid);
+ table_map_insert(map, table_oid, dboid, hash, &now);
+ }
+ }
+
+ table_map_unlock();
+}
+
+/*
+ * Convenience function to mark a single table as written
+ */
+void
+pool_track_table_mutation_mark_table_written(int table_oid, int dboid)
+{
+ if (table_oid > 0 && dboid > 0)
+ {
+ const int tables[1] = { table_oid };
+ pool_track_table_mutation_mark_tables_written(tables, 1, dboid);
+ }
+}
+
+void
+pool_track_table_mutation_update_ttl(uint64 delay_us)
+{
+ uint64 new_ttl;
+
+ if (pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE_GLOBAL || track_table_mutation_shmem == NULL)
+ return;
+
+ /* Calculate new TTL: delay * factor, with minimum of default TTL */
+ new_ttl = (uint64)(delay_us * pool_config->track_table_mutation_ttl_factor);
+ if (new_ttl < TRACK_TABLE_MUTATION_DEFAULT_TTL_US)
+ new_ttl = TRACK_TABLE_MUTATION_DEFAULT_TTL_US;
+
+ /* Maximum TTL of 1 hour */
+ if (new_ttl > 3600ULL * 1000000ULL)
+ new_ttl = 3600ULL * 1000000ULL;
+
+ track_table_mutation_shmem->state.current_ttl_us = new_ttl;
+ get_current_time(&track_table_mutation_shmem->state.ttl_last_updated);
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: updated TTL to %lu us (delay=%lu us, factor=%.1f)",
+ (unsigned long)new_ttl, (unsigned long)delay_us,
+ pool_config->track_table_mutation_ttl_factor)));
+}
+
+bool
+pool_track_table_mutation_get_cached_parse(uint64 hash, bool *is_write,
+ char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
+ int *num_tables)
+{
+ QueryParseCache *cache;
+ int idx;
+ bool found = false;
+
+ if (pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE_GLOBAL || track_table_mutation_shmem == NULL)
+ return false;
+
+ cache = track_table_mutation_shmem->query_cache;
+
+ query_cache_lock();
+
+ idx = query_cache_lookup(cache, hash);
+ if (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ QueryParseEntry *entries = QUERY_CACHE_ENTRIES(cache);
+ int i;
+
+ *is_write = entries[idx].is_write;
+ *num_tables = entries[idx].num_tables;
+
+ for (i = 0; i < entries[idx].num_tables && i < TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY; i++)
+ {
+ strlcpy(table_names[i], entries[idx].table_names[i], TRACK_TABLE_MUTATION_TABLE_NAME_LEN);
+ }
+
+ /* Move to front of LRU */
+ query_cache_lru_touch(cache, idx);
+ found = true;
+ }
+
+ query_cache_unlock();
+
+ return found;
+}
+
+void
+pool_track_table_mutation_cache_parse(uint64 hash, bool is_write,
+ const char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
+ int num_tables)
+{
+ QueryParseCache *cache;
+ int *buckets;
+ QueryParseEntry *entries;
+ int idx;
+ int bucket;
+
+ if (pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE_GLOBAL || track_table_mutation_shmem == NULL)
+ return;
+
+ cache = track_table_mutation_shmem->query_cache;
+
+ query_cache_lock();
+
+ /* Check if already exists */
+ idx = query_cache_lookup(cache, hash);
+ if (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ query_cache_unlock();
+ return;
+ }
+
+ /* Allocate new entry (may evict LRU) */
+ idx = query_cache_alloc_entry(cache);
+ if (idx == TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ query_cache_unlock();
+ ereport(WARNING,
+ (errmsg("track_table_mutation: failed to allocate query cache entry")));
+ return;
+ }
+
+ entries = QUERY_CACHE_ENTRIES(cache);
+ buckets = QUERY_CACHE_BUCKETS(cache);
+
+ /* Fill in entry */
+ entries[idx].query_hash = hash;
+ entries[idx].is_write = is_write;
+ entries[idx].num_tables = (num_tables > TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY) ?
+ TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY : num_tables;
+
+ {
+ int i;
+ for (i = 0; i < entries[idx].num_tables; i++)
+ {
+ strlcpy(entries[idx].table_names[i], table_names[i], TRACK_TABLE_MUTATION_TABLE_NAME_LEN);
+ }
+ }
+
+ /* Insert into hash bucket */
+ bucket = hash % cache->num_buckets;
+ entries[idx].next = buckets[bucket];
+ buckets[bucket] = idx;
+
+ /* Add to LRU list */
+ query_cache_lru_add(cache, idx);
+
+ query_cache_unlock();
+}
+
+uint64
+pool_track_table_mutation_normalize_and_hash(const char *query)
+{
+ char normalized[8192];
+ size_t len;
+
+ if (query == NULL || query[0] == '\0')
+ return 0;
+
+ len = normalize_query(query, normalized, sizeof(normalized));
+ if (len == 0)
+ return 0;
+
+ return fnv1a_hash_64(normalized, len);
+}
--
2.53.0
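For readers who want to check the hashing in the patch above in isolation, here is a minimal standalone sketch of FNV-1a in both widths. The function names fnv1a_32 and fnv1a_64 are illustrative and not taken from the patch; the constants are the standard FNV offset bases and primes that the patch also uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit FNV-1a: XOR each byte into the hash, then multiply by the prime. */
static uint32_t
fnv1a_32(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint32_t	hash = 2166136261u;		/* FNV offset basis */
	size_t		i;

	assert(buf != NULL || len == 0);
	for (i = 0; i < len; i++)
	{
		hash ^= p[i];
		hash *= 16777619u;				/* FNV prime */
	}
	return hash;
}

/* 64-bit FNV-1a over an arbitrary byte buffer. */
static uint64_t
fnv1a_64(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint64_t	hash = 14695981039346656037ULL;	/* 64-bit offset basis */
	size_t		i;

	assert(buf != NULL || len == 0);
	for (i = 0; i < len; i++)
	{
		hash ^= p[i];
		hash *= 1099511628211ULL;		/* 64-bit FNV prime */
	}
	return hash;
}
```

An empty input returns the offset basis unchanged. Note that the patch's table-key hash feeds the raw bytes of two integers into the 32-bit variant, so its result depends on host byte order; that should be harmless as long as the hash only lives in shared memory on one host.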
* Re: Proposal: Recent mutated table tracking in memory
2026-01-06 11:25 Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-14 08:55 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-28 05:08 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-29 08:28 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-29 08:54 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-30 08:09 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-31 17:11 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-03 07:43 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-03 23:23 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-06 11:29 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-10 15:16 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-11 10:28 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-12 09:05 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-18 23:51 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-19 04:40 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-19 11:05 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
@ 2026-02-26 00:02 ` Tatsuo Ishii <[email protected]>
1 sibling, 0 replies; 44+ messages in thread
From: Tatsuo Ishii @ 2026-02-26 00:02 UTC (permalink / raw)
To: [email protected]; +Cc: [email protected]
> Added some handling for possible causes - works now.
Unfortunately this doesn't work here. The 042 test still fails if it is
executed *after* 041, i.e.:
./regress.sh 04[12] <-- 042 fails
./regress.sh 042 <-- Ok
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
* Re: Proposal: Recent mutated table tracking in memory
@ 2026-02-26 07:47 ` Tatsuo Ishii <[email protected]>
2026-02-26 15:26 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
1 sibling, 1 reply; 44+ messages in thread
From: Tatsuo Ishii @ 2026-02-26 07:47 UTC (permalink / raw)
To: [email protected]; +Cc: [email protected]
> Added some handling for possible causes - works now.
Here are comments for the patch.
- Some code lines are too long. We recommend limiting each source code
line to 78 characters. You can use the following script to detect overlong
lines (you can ignore reports for files other than *.[c]). See
https://wiki.postgresql.org/wiki/Committing_checklist
git diff origin/master | grep -E '^(\+|diff)' | sed 's/^+//' | expand -t4 | awk "length > 78 || /^diff/"
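The width that matters for this check is the one after tab expansion, which is why the pipeline runs expand -t4 before measuring. As a rough sketch of that measurement in C (expanded_width and line_too_long are hypothetical helper names, and the code assumes single-byte characters):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Column width of one line after expanding each tab to the next
 * multiple of tabstop, mirroring what "expand -t4" does before the
 * awk length test. Counts bytes, so it assumes ASCII source.
 */
static size_t
expanded_width(const char *line, size_t tabstop)
{
	size_t		col = 0;

	assert(line != NULL && tabstop > 0);
	for (; *line != '\0' && *line != '\n'; line++)
	{
		if (*line == '\t')
			col += tabstop - (col % tabstop);	/* jump to next tab stop */
		else
			col++;
	}
	return col;
}

/* True when the line exceeds the 78-column limit with 4-column tabs. */
static int
line_too_long(const char *line)
{
	return expanded_width(line, 4) > 78;
}
```

A line indented with one tab therefore starts at column 4, so it has 74 columns left for code.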
--- /dev/null
+++ b/src/test/regression/tests/043.track_table_mutation_watchdog/.gitignore
Please do not add a .gitignore file. .gitignore files are maintained by
the pgpool core developers.
+++ b/src/test/regression/tests/043.track_table_mutation_watchdog/leader.conf
To test watchdog, you should use the standard watchdog_setup tool.
+++ b/src/utils/pool_track_table_mutation.c
+static inline void
+query_cache_lock(void)
"query_cache_*" is confusing since we already have a query cache
feature. Please use a different name.
+static int
+track_table_mutation_get_database_oid_internal(void)
+{
:
:
+ /* Ensure we have a valid query context */
+ if (session_context->query_context == NULL)
+ return oid;
Why is this needed? The query context is not used in this function.
+/* ----------------
+ * Public API implementation
+ * ----------------
+ */
Please add comments describing what these functions do.
+Size
+pool_track_table_mutation_shmem_size(void)
+void
+pool_track_table_mutation_init(void)
+void
+pool_track_table_mutation_child_init(void)
+bool
+pool_track_table_mutation_in_cold_start(void)
+void
+pool_track_table_mutation_trigger_global_cold_start(void)
+bool
+pool_track_table_mutation_table_is_stale(int table_oid, int dboid)
The __sync_fetch_and_add functions are old. I recommend replacing them
with ordinary statements, using a semaphore to protect the critical region.
+ __sync_fetch_and_add(&track_table_mutation_shmem->state.stats_queries_checked, 1);
Please add comments describing what these functions do.
+pool_track_table_mutation_mark_tables_written(const int *table_oids, int num_tables, int dboid)
+void
+pool_track_table_mutation_update_ttl(uint64 delay_us)
+bool
+pool_track_table_mutation_get_cached_parse(uint64 hash, bool *is_write,
+void
+pool_track_table_mutation_cache_parse(uint64 hash, bool is_write,
+ const char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
+ int num_tables)
+uint64
+pool_track_table_mutation_normalize_and_hash(const char *query)
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
* Re: Proposal: Recent mutated table tracking in memory
From: Nadav Shatz @ 2026-02-26 15:26 UTC (permalink / raw)
To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]
Hi Tatsuo,
Thank you for the thorough review and the thoughtful pushback on the
cross-session security concern. You're right that "dml_adaptive just hits
himself in the foot, your patch allows him to hit someone else's foot" —
that asymmetry needs a technical answer, not just a threat-model argument.
I've addressed this in the updated patch with two mechanisms:
*1. Maximum staleness cap (`track_table_mutation_max_staleness`)*
New configuration parameter (default: 60 seconds, range: 0–3600000 ms).
This bounds how long any single table entry can continuously force primary
routing, measured from the first write that created the entry. Even under
sustained writes, the entry expires after this period. If the table is
written to again after expiry, a fresh entry is created.
This directly addresses the concern: the worst-case cross-session impact is
bounded and configurable. An operator can look at this parameter and know:
"no matter what, a table's staleness effect on other sessions cannot exceed
X seconds continuously."
For legitimately busy tables, the brief gap between forced expiry and the
next write re-marking the table is negligible — typically milliseconds,
since writes are frequent. The correctness guarantee is preserved.
*2. Database-scoped isolation (documented)*
The tracking is already scoped by database OID — writes in one database
never affect routing decisions for sessions in a different database. I've
documented this explicitly as a security boundary in the docs. In
multi-tenant deployments with separate databases, tenants are isolated from
each other's write activity.
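One way to picture the database scoping is a composite lookup key, sketched below. The key layout, the bucket count, and the hash mixing (a splitmix64-style finalizer) are assumptions chosen for illustration; the patch may key and hash its shared-memory table differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_BUCKETS 1024		/* hypothetical bucket count, power of two */

/* Hypothetical composite key: a table is identified only within a database. */
typedef struct
{
	int			dboid;
	int			table_oid;
} table_key;

/* Two keys match only if both the database OID and the table OID match. */
static bool
table_key_equal(table_key a, table_key b)
{
	return a.dboid == b.dboid && a.table_oid == b.table_oid;
}

/* Map a key to a bucket by mixing both OIDs into one 64-bit value. */
static uint32_t
table_key_bucket(table_key k)
{
	uint64_t	h = ((uint64_t) (uint32_t) k.dboid << 32) |
		(uint32_t) k.table_oid;

	h ^= h >> 30;
	h *= 0xbf58476d1ce4e5b9ULL;
	h ^= h >> 27;
	h *= 0x94d049bb133111ebULL;
	h ^= h >> 31;
	return (uint32_t) (h & (DEMO_BUCKETS - 1));
}

/* Linear scan over tracked keys; stands in for a real hash-table lookup. */
static bool
tracked_contains(const table_key *tracked, size_t n, table_key probe)
{
	size_t		i;

	for (i = 0; i < n; i++)
	{
		if (table_key_equal(tracked[i], probe))
			return true;
	}
	return false;
}
```

Because the database OID is part of the key, a write recorded for a table OID in one database is never matched by a lookup for the same table OID in another database.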
Combined with the existing safeguards (committed writes only, bounded table
map size, opt-in mode), the cross-session impact is now bounded in both
duration and scope.
I've also addressed the other review comments, including the tests and
code structure; those changes are in the updated patch as well.
Thanks!
--
Nadav Shatz
Tailor Brands | CTO
Attachments:
[application/octet-stream] table_track.patch (102.1K, 3-table_track.patch)
From f4065c1cae848812515f028dfb90e05f5fbe53fc Mon Sep 17 00:00:00 2001
From: Nadav Shatz <[email protected]>
Date: Tue, 6 Jan 2026 12:41:50 +0200
Subject: [PATCH] feat(load_balance): add in-memory table mutation tracking
Introduces 'dml_adaptive_global' as a new value for disable_load_balance_on_write.
This mode is a superset of dml_adaptive: it performs per-transaction local tracking
AND cross-session shared-memory tracking of recently written tables, routing reads
to primary until a TTL (based on measured replication delay) expires.
Sub-parameters (track_table_mutation_*) control TTL factor, cold start duration,
hash table sizing, and query parse cache sizing.
diff --git a/doc/src/sgml/loadbalance.sgml b/doc/src/sgml/loadbalance.sgml
index ee19fabebab2210cd4abe59a711a036ac0ac8943..1838a57913e9acb933bfcbf70cce32122740a490 100644
--- a/doc/src/sgml/loadbalance.sgml
+++ b/doc/src/sgml/loadbalance.sgml
@@ -1108,6 +1108,18 @@ app_name_redirect_preference_list > database_redirect_preference_list > us
Dependent functions, triggers, and views on the tables can be configured
using <xref linkend="guc-dml-adaptive-object-relationship-list">
</para>
+
+ <para>
+ If this parameter is set to <varname>dml_adaptive_global</varname>,
+ <productname>Pgpool-II</> behaves like <varname>dml_adaptive</varname>
+ (per-transaction write tracking) and additionally uses shared memory to track
+ recently written tables across all sessions cluster-wide. When a table is
+ written in any session, subsequent reads of that table from any session are
+ routed to primary until a TTL (based on measured replication delay) expires.
+ This prevents stale reads after writes even across different connections.
+ See <xref linkend="runtime-config-table-mutation-map"> for the sub-parameters
+ that control the shared-memory tracking behavior.
+ </para>
</listitem>
</varlistentry>
@@ -1193,4 +1205,321 @@ dml_adaptive_object_relationship_list = 'table_1:table_2'
</variablelist>
</sect2>
+
+ <sect2 id="runtime-config-table-mutation-map">
+ <title>Table Mutation Map Configuration (Lagless Replica Reads)</title>
+
+ <para>
+ These parameters configure the track table mutation feature, which is activated by setting
+ <xref linkend="guc-disable-load-balance-on-write"> to <literal>dml_adaptive_global</literal>.
+ The feature tracks recently written tables to prevent stale reads from replica nodes during
+ replication lag, implementing the "lagless" architecture pattern for distributed systems
+ with read replicas.
+ </para>
+
+ <para>
+ When a table is modified (INSERT/UPDATE/DELETE), it is marked as "stale" for a TTL period
+ (<literal>replication_delay * track_table_mutation_ttl_factor</literal>). Any SELECT queries on stale tables are routed
+ to the primary node instead of replicas, ensuring read-after-write consistency.
+ </para>
+
+ <para>
+ This feature requires <xref linkend="guc-replication-delay-source-cmd"> to be configured
+ for monitoring replication delay from replicas.
+ </para>
+
+ <warning>
+ <para>
+ Enabling <literal>dml_adaptive_global</literal> increases shared memory consumption. With default settings,
+ the feature requires approximately 6.4 MB of shared memory (0.1 MB for table tracking + 6.3 MB for query cache).
+ Memory usage scales with configuration parameters:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ Table tracking: <literal>track_table_mutation_table_size * 40 bytes</literal> (default: 2048 * 40 = ~80 KB)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Query cache: <literal>track_table_mutation_query_parse_cache_size * 640 bytes</literal> (default: 10000 * 640 = ~6.3 MB)
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ For high-traffic systems with large cache sizes (e.g., <literal>track_table_mutation_query_parse_cache_size = 100000</literal>),
+ memory usage can reach 64 MB or more. Consider your system's available shared memory when using <literal>dml_adaptive_global</literal>.
+ </para>
+ </warning>
+
+ <variablelist>
+
+ <varlistentry id="guc-track-table-mutation-ttl-factor" xreflabel="track_table_mutation_ttl_factor">
+ <term><varname>track_table_mutation_ttl_factor</varname> (<type>floating point</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_ttl_factor</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Multiplier for calculating the TTL: <literal>TTL = replication_delay * track_table_mutation_ttl_factor</literal>.
+ Higher values provide more safety margin but may reduce read replica utilization.
+ </para>
+ <para>
+ Valid range: 1.0-100.0. Default is <literal>5.0</literal>.
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-max-staleness" xreflabel="track_table_mutation_max_staleness">
+ <term><varname>track_table_mutation_max_staleness</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_max_staleness</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Maximum duration in milliseconds that a single table entry can continuously force queries to primary,
+ measured from when the table was first marked stale. When this cap is reached, the entry is expired
+ regardless of recent writes. If the table is written to again after expiry, a fresh tracking entry
+ is created.
+ </para>
+ <para>
+ This parameter bounds the cross-session impact of table mutation tracking. Even if a table is written
+ to in a tight loop, its effect on other sessions' load balancing is limited to this duration. For
+ legitimately busy tables, the gap between forced expiry and the next write re-marking the table is
+ negligible (typically milliseconds).
+ </para>
+ <para>
+ Set to 0 to disable the cap (not recommended for production).
+ Valid range: 0-3600000 ms. Default is <literal>60000</literal> (60 seconds).
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-cold-start-duration" xreflabel="track_table_mutation_cold_start_duration">
+ <term><varname>track_table_mutation_cold_start_duration</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_cold_start_duration</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Duration in milliseconds to route all queries to primary after a child process starts.
+ This prevents stale reads when a new connection is established before the table mutation map
+ is populated with recent write history.
+ </para>
+ <para>
+ When watchdog is enabled and the local node becomes the leader, Pgpool-II also triggers a
+ global cold start for this duration to avoid stale reads after leadership changes.
+ </para>
+ <para>
+ Valid range: 0-60000 ms. Default is <literal>2000</literal> (2 seconds).
+ Set to 0 to disable cold start behavior.
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-table-buckets" xreflabel="track_table_mutation_table_buckets">
+ <term><varname>track_table_mutation_table_buckets</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_table_buckets</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Number of hash buckets for the track table mutation hash table.
+ Higher values reduce hash collisions and improve lookup performance.
+ </para>
+ <para>
+ Valid range: 64-65536. Default is <literal>1024</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-table-size" xreflabel="track_table_mutation_table_size">
+ <term><varname>track_table_mutation_table_size</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_table_size</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Maximum number of tables that can be tracked simultaneously in the table mutation map.
+ When full, oldest entries are evicted using a simple eviction strategy.
+ </para>
+ <para>
+ Valid range: 128-131072. Default is <literal>2048</literal>.
+ Memory usage: approximately 40 bytes per entry.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-query-buckets" xreflabel="track_table_mutation_query_buckets">
+ <term><varname>track_table_mutation_query_buckets</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_query_buckets</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Number of hash buckets for the query parse cache. The cache stores normalized
+ query strings mapped to their table dependencies to avoid repeated parsing.
+ </para>
+ <para>
+ Valid range: 64-65536. Default is <literal>2048</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-query-parse-cache-size" xreflabel="track_table_mutation_query_parse_cache_size">
+ <term><varname>track_table_mutation_query_parse_cache_size</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_query_parse_cache_size</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Maximum number of query parse results to cache. Uses LRU eviction when full.
+ Larger caches reduce parsing overhead but consume more shared memory.
+ </para>
+ <para>
+ Valid range: 100-1000000. Default is <literal>10000</literal>.
+ Memory usage: approximately 640 bytes per entry (~6.3 MB for default, ~64 MB for 100000 entries).
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ <sect3 id="runtime-config-track-table-mutation-example">
+ <title>Track Table Mutation Configuration Example</title>
+ <para>
+ To enable track table mutation with replication delay monitoring:
+ </para>
+ <programlisting>
+# Enable dml_adaptive_global mode (includes track table mutation)
+disable_load_balance_on_write = 'dml_adaptive_global'
+track_table_mutation_ttl_factor = 5.0
+track_table_mutation_max_staleness = 60000
+track_table_mutation_cold_start_duration = 2000
+
+# Configure external replication delay monitoring
+replication_delay_source_cmd = '/path/to/get-replication-delay.sh'
+replication_delay_source_timeout = 10
+
+# Adjust cache sizes based on workload (increases memory usage)
+track_table_mutation_table_size = 4096 # Track up to 4096 tables (~160 KB)
+track_table_mutation_query_parse_cache_size = 50000 # Cache 50k queries (~31 MB)
+ </programlisting>
+ <para>
+ Total shared memory required for above configuration: approximately 31.2 MB (31 MB query cache + 0.2 MB table map + overhead).
+ Default configuration (10000 query cache entries, 2048 tables) requires approximately 6.4 MB.
+ </para>
+ </sect3>
+
+ <sect3 id="runtime-config-track-table-mutation-limitations">
+ <title>Limitations</title>
+ <para>
+ The track table mutation feature has the following limitation:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>PREPARE</literal> statements are not tracked. When a prepared statement
+ containing data modification is executed, the table mutation is not recorded.
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ If your application uses prepared statements and requires read-after-write consistency,
+ consider using explicit transaction routing or the <literal>/*NO LOAD BALANCE*/</literal>
+ comment directive for affected queries.
+ </para>
+ <para>
+ The following statement types <emphasis>are</emphasis> tracked and will mark tables as stale:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>INSERT</literal>, <literal>UPDATE</literal>, <literal>DELETE</literal>
+ statements (including those with <literal>RETURNING</literal> clauses).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>TRUNCATE</literal> statements (including multiple tables).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>MERGE</literal> statements (PostgreSQL 15+).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>WITH</literal> clauses containing data modifications (Common Table Expressions
+ with <literal>INSERT</literal>, <literal>UPDATE</literal>, or <literal>DELETE</literal>).
+ For example, <literal>WITH deleted AS (DELETE FROM t1 RETURNING *) SELECT * FROM deleted</literal>
+ will properly mark table <literal>t1</literal> as stale.
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ <emphasis>Transaction Rollback Behavior:</emphasis> Within explicit transactions, tables
+ are only marked as stale in shared memory when the transaction is committed. If the
+ transaction is rolled back, no tables are marked, since no actual data modification
+ occurred on replicas. This prevents rolled-back transactions from unnecessarily
+ disabling load balancing. For autocommit statements (outside explicit transactions),
+ tables are marked immediately upon command completion.
+ </para>
+
+ <para>
+ <emphasis>Cross-Session Impact and Safety Bounds:</emphasis>
+ Unlike <literal>dml_adaptive</literal> (which only affects the session that issued the write),
+ <literal>dml_adaptive_global</literal> affects all sessions reading the same table in the same database.
+ The following safety mechanisms bound this cross-session impact:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <emphasis>Maximum staleness cap:</emphasis> The <xref linkend="guc-track-table-mutation-max-staleness">
+ parameter (default: 60 seconds) limits how long any single table entry can continuously force primary
+ routing. Even under sustained writes, the entry expires after this period and is only renewed by
+ subsequent committed writes.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <emphasis>Database isolation:</emphasis> Table staleness tracking is scoped by database OID. Writes
+ in one database never affect load balancing decisions for sessions connected to a different database.
+ In multi-tenant deployments where tenants use separate databases, one tenant's write activity cannot
+ influence another tenant's query routing.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <emphasis>Committed writes only:</emphasis> Only committed transactions mark tables as stale.
+ Rolled-back transactions have no effect on the shared tracking state.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <emphasis>Bounded table map size:</emphasis> The shared memory table map has a fixed maximum size
+ (<xref linkend="guc-track-table-mutation-table-size">). At most this many tables can be marked stale
+ simultaneously, providing a natural ceiling on the feature's impact.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </sect3>
+
+ </sect2>
+
</sect1>
diff --git a/src/Makefile.am b/src/Makefile.am
index 4678ab53055e828a37b6477801640aff17ff84a7..39588af58deba045dffc01ae932115b8a9dbfcf2 100644
--- a/src/Makefile.am
+++ b/src/Makefile.am
@@ -35,6 +35,7 @@ pgpool_SOURCES = main/main.c \
rewrite/pool_timestamp.c \
rewrite/pool_lobj.c \
utils/pool_select_walker.c \
+ utils/pool_track_table_mutation.c \
utils/strlcpy.c \
utils/psprintf.c \
utils/pool_params.c \
diff --git a/src/config/pool_config_variables.c b/src/config/pool_config_variables.c
index ce13c42f6a81cbecd87ef35c5507d0ff2d7a7f85..a6b909d427f30366fab7325fa2169068c489a263 100644
--- a/src/config/pool_config_variables.c
+++ b/src/config/pool_config_variables.c
@@ -290,6 +290,7 @@ static const struct config_enum_entry disable_load_balance_on_write_options[] =
{"trans_transaction", DLBOW_TRANS_TRANSACTION, false},
{"always", DLBOW_ALWAYS, false},
{"dml_adaptive", DLBOW_DML_ADAPTIVE, false},
+ {"dml_adaptive_global", DLBOW_DML_ADAPTIVE_GLOBAL, false},
{NULL, 0, false}
};
@@ -1777,6 +1778,19 @@ static struct config_int_array ConfigureNamesIntArray[] =
static struct config_double ConfigureNamesDouble[] =
{
+ {
+ {"track_table_mutation_ttl_factor",
+ CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "TTL multiplier for track table mutation "
+ "(TTL = replication_delay * factor)",
+ CONFIG_VAR_TYPE_DOUBLE, false, 0
+ },
+ &g_pool_config.track_table_mutation_ttl_factor,
+ 5.0, /* boot value: 5x replication delay */
+ 1.0, 100.0, /* min, max */
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
EMPTY_CONFIG_DOUBLE
};
@@ -2397,6 +2411,81 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"track_table_mutation_max_staleness",
+ CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "Maximum duration in milliseconds that a "
+ "table can be marked stale from its first "
+ "write. 0 disables the cap.",
+ CONFIG_VAR_TYPE_INT, false, GUC_UNIT_MS
+ },
+ &g_pool_config.track_table_mutation_max_staleness,
+ 60000, /* 60 seconds */
+ 0, 3600000, /* 0 to 1 hour */
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_cold_start_duration",
+ CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "Duration in milliseconds to force queries "
+ "to primary after child process starts.",
+ CONFIG_VAR_TYPE_INT, false, GUC_UNIT_MS
+ },
+ &g_pool_config.track_table_mutation_cold_start_duration,
+ 2000, /* 2 seconds */
+ 0, 60000, /* 0 to 60 seconds */
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_table_buckets",
+ CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Number of hash buckets for track table mutation.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_table_buckets,
+ 1024,
+ 64, 65536,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_table_size",
+ CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Maximum number of entries in track table mutation.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_table_size,
+ 2048,
+ 128, 131072,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_query_buckets",
+ CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Number of hash buckets for query parse cache.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_query_buckets,
+ 2048,
+ 64, 65536,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_query_parse_cache_size",
+ CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Maximum number of entries in query parse cache.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_query_parse_cache_size,
+ 10000,
+ 100, 1000000,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
EMPTY_CONFIG_INT
};
diff --git a/src/context/pool_query_context.c b/src/context/pool_query_context.c
index 7cf9813eb7d58678bc86a0aaa38bd3c6445b6687..683b0ec66fabd708a5a61a54ba0697bf869ecafe 100644
--- a/src/context/pool_query_context.c
+++ b/src/context/pool_query_context.c
@@ -29,6 +29,7 @@
#include "utils/statistics.h"
#include "utils/pool_select_walker.h"
#include "utils/pool_stream.h"
+#include "utils/pool_track_table_mutation.h"
#include "context/pool_session_context.h"
#include "context/pool_query_context.h"
#include "parser/nodes.h"
@@ -1828,15 +1829,23 @@ is_in_list(char *name, List *list)
static bool
is_select_object_in_temp_write_list(Node *node, void *context)
{
- if (node == NULL || pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE)
+ if (node == NULL ||
+ !DLBOW_IS_DML_ADAPTIVE(
+ pool_config->disable_load_balance_on_write))
return false;
if (IsA(node, RangeVar))
{
RangeVar *rgv = (RangeVar *) node;
- POOL_SESSION_CONTEXT *session_context = pool_get_session_context(false);
+ POOL_SESSION_CONTEXT *session_context;
+ bool is_adaptive;
+
+ session_context = pool_get_session_context(false);
+ is_adaptive = DLBOW_IS_DML_ADAPTIVE(
+ pool_config->disable_load_balance_on_write);
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && session_context->is_in_transaction)
+ if (is_adaptive &&
+ session_context->is_in_transaction)
{
ereport(DEBUG1,
(errmsg("is_select_object_in_temp_write_list: \"%s\", found relation \"%s\"", (char *) context, rgv->relname)));
@@ -1880,7 +1889,13 @@ static char *get_associated_object_from_dml_adaptive_relations
void
check_object_relationship_list(char *name, bool is_func_name)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && pool_config->parsed_dml_adaptive_object_relationship_list)
+ bool is_adaptive;
+
+ is_adaptive = DLBOW_IS_DML_ADAPTIVE(
+ pool_config->disable_load_balance_on_write);
+
+ if (is_adaptive &&
+ pool_config->parsed_dml_adaptive_object_relationship_list)
{
POOL_SESSION_CONTEXT *session_context = pool_get_session_context(false);
@@ -1944,7 +1959,7 @@ add_object_into_temp_write_list(Node *node, void *context)
static void
dml_adaptive(Node *node, char *query)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE)
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write))
{
/* Set/Unset transaction status flags */
if (IsA(node, TransactionStmt))
@@ -1963,6 +1978,46 @@ dml_adaptive(Node *node, char *query)
}
else if (is_commit_or_rollback_query(node))
{
+ /*
+ * For dml_adaptive_global: on COMMIT, flush
+ * the accumulated table writes to shared
+ * memory. On ROLLBACK, skip -- the writes
+ * never committed so no stale-read risk
+ * exists. This prevents polluting the table
+ * map with rolled-back transactions.
+ */
+ int dlbow =
+ pool_config->disable_load_balance_on_write;
+ List *wlist =
+ session_context->transaction_temp_write_list;
+
+ if (dlbow == DLBOW_DML_ADAPTIVE_GLOBAL &&
+ is_commit_query(node) &&
+ wlist != NIL)
+ {
+ ListCell *cell;
+ int dboid;
+
+ dboid =
+ pool_track_table_mutation_get_database_oid();
+ if (dboid > 0)
+ {
+ foreach(cell, wlist)
+ {
+ char *tname;
+ int toid;
+
+ tname = (char *) lfirst(cell);
+ toid =
+ pool_table_name_to_oid(tname);
+
+ if (toid > 0)
+ pool_track_table_mutation_mark_table_written(
+ toid, dboid);
+ }
+ }
+ }
+
session_context->is_in_transaction = false;
if (session_context->transaction_temp_write_list != NIL)
@@ -2010,6 +2065,20 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
if (dest == POOL_PRIMARY)
{
pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
+
+ /*
+ * Resolve table and database OIDs now to populate relcache.
+ * This avoids potential hangs in CommandComplete where we shouldn't
+ * be running new queries against the backend.
+ */
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ int *oids;
+
+ pool_extract_table_oids(node, &oids);
+ pool_track_table_mutation_get_database_oid();
+ }
}
/* Should be sent to both primary and standby? */
else if (dest == POOL_BOTH)
@@ -2139,6 +2208,154 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
{
pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
}
+ /*
+ * Check track table mutation for recently
+ * written tables. If in cold start or any
+ * table was recently written, route to
+ * primary to avoid stale reads.
+ */
+ else if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ bool force_primary = false;
+ int lb_node;
+ POOL_QUERY_CONTEXT *qctx =
+ session_context->query_context;
+
+ if (pool_track_table_mutation_in_cold_start())
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load balance"
+ " because of track table"
+ " mutation cold start"),
+ errdetail("destination = PRIMARY"
+ " for query= \"%s\"",
+ query)));
+ force_primary = true;
+ }
+ else
+ {
+ SelectContext ctx;
+ int dboid;
+ int num_oids;
+ int i;
+
+ memset(&ctx, 0, sizeof(ctx));
+ num_oids =
+ pool_extract_table_oids_from_select_stmt(
+ node, &ctx);
+ if (num_oids > 0)
+ {
+ dboid =
+ pool_track_table_mutation_get_database_oid();
+
+ if (dboid <= 0)
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load"
+ " balance because"
+ " database oid was"
+ " unavailable"),
+ errdetail("destination"
+ " = PRIMARY for"
+ " query= \"%s\"",
+ query)));
+ force_primary = true;
+ }
+ else
+ {
+ for (i = 0; i < num_oids; i++)
+ {
+ bool stale;
+
+ stale =
+ pool_track_table_mutation_table_is_stale(
+ ctx.table_oids[i],
+ dboid);
+ if (stale)
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load"
+ " balance because"
+ " table \"%s\" was"
+ " recently written",
+ ctx.table_names[i]),
+ errdetail("destination"
+ " = PRIMARY for"
+ " query= \"%s\"",
+ query)));
+ force_primary = true;
+ break;
+ }
+ }
+ }
+ }
+ }
+
+ if (force_primary)
+ {
+ pool_set_node_to_be_sent(
+ query_context,
+ PRIMARY_NODE_ID);
+ }
+ else
+ {
+ if (pool_config->statement_level_load_balance)
+ {
+ session_context->load_balance_node_id =
+ select_load_balancing_node();
+ }
+
+ /*
+ * If replication delay is too much,
+ * and prefer_lower_delay_standby is
+ * true then elect the lowest-delayed
+ * node, otherwise send to primary.
+ */
+ lb_node =
+ session_context->load_balance_node_id;
+ if (STREAM &&
+ check_replication_delay(lb_node))
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load"
+ " balance because of"
+ " too much replication"
+ " delay"),
+ errdetail("destination"
+ " = %d for"
+ " query= \"%s\"",
+ dest, query)));
+
+ if (pool_config->prefer_lower_delay_standby)
+ {
+ lb_node =
+ select_load_balancing_node();
+ session_context->load_balance_node_id =
+ lb_node;
+ qctx->load_balance_node_id =
+ lb_node;
+ pool_set_node_to_be_sent(
+ query_context,
+ lb_node);
+ }
+ else
+ {
+ pool_set_node_to_be_sent(
+ query_context,
+ PRIMARY_NODE_ID);
+ }
+ }
+ else
+ {
+ qctx->load_balance_node_id =
+ session_context->load_balance_node_id;
+ pool_set_node_to_be_sent(
+ query_context,
+ qctx->load_balance_node_id);
+ }
+ }
+ }
else
{
if (pool_config->statement_level_load_balance)
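The commit-time flush in dml_adaptive(), together with the autocommit path in CommandComplete.c, can be pictured with the following minimal sketch. It is not code from the patch: the types and function names (SharedMap, Session, session_record_write, session_end) are hypothetical stand-ins, and table OIDs are buffered directly, whereas the patch buffers table names and resolves them to OIDs at COMMIT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 16			/* hypothetical cap for this sketch */
#define MAX_SHARED  64			/* hypothetical cap for this sketch */

/* Stand-in for the shared-memory table map: a plain list of table OIDs. */
typedef struct SharedMap
{
    int     oids[MAX_SHARED];
    size_t  count;
} SharedMap;

/* Stand-in for the per-session state (transaction_temp_write_list). */
typedef struct Session
{
    bool    in_transaction;
    int     pending[MAX_PENDING];
    size_t  num_pending;
} Session;

/* Publish one table OID to the shared map, ignoring duplicates. */
static void
shared_mark_written(SharedMap *map, int oid)
{
    size_t  i;

    for (i = 0; i < map->count; i++)
        if (map->oids[i] == oid)
            return;				/* already tracked */
    assert(map->count < MAX_SHARED);
    map->oids[map->count++] = oid;
}

static void
session_begin(Session *s)
{
    s->in_transaction = true;
    s->num_pending = 0;
}

/* A write completed: publish at once in autocommit, otherwise buffer it. */
static void
session_record_write(Session *s, SharedMap *map, int oid)
{
    if (!s->in_transaction)
    {
        shared_mark_written(map, oid);
        return;
    }
    assert(s->num_pending < MAX_PENDING);
    s->pending[s->num_pending++] = oid;
}

/* End the transaction; only COMMIT publishes the buffered writes. */
static void
session_end(Session *s, SharedMap *map, bool committed)
{
    size_t  i;

    if (committed)
        for (i = 0; i < s->num_pending; i++)
            shared_mark_written(map, s->pending[i]);
    s->num_pending = 0;
    s->in_transaction = false;
}
```

In this sketch a ROLLBACK simply drops the buffer, so other sessions never see tables whose writes did not commit; that is the property the comment in the hunk above describes.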
diff --git a/src/context/pool_session_context.c b/src/context/pool_session_context.c
index ded41c7fc64ceba1d1fafd6f4a9f10a750872374..3ebd68e105adc4e94fd8ef96c871d4b04bed8ae0 100644
--- a/src/context/pool_session_context.c
+++ b/src/context/pool_session_context.c
@@ -532,7 +532,7 @@ dump_sent_message(char *caller, POOL_SENT_MESSAGE *m)
static void
dml_adaptive_init(void)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE)
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write))
{
session_context->is_in_transaction = false;
session_context->transaction_temp_write_list = NIL;
@@ -542,7 +542,9 @@ dml_adaptive_init(void)
static void
dml_adaptive_destroy(void)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && session_context)
+ if (DLBOW_IS_DML_ADAPTIVE(
+ pool_config->disable_load_balance_on_write) &&
+ session_context)
{
if (session_context->transaction_temp_write_list != NIL)
list_free_deep(session_context->transaction_temp_write_list);
@@ -738,10 +740,13 @@ void
pool_set_writing_transaction(void)
{
/*
- * If disable_transaction_on_write is 'off' or 'dml_adaptive', then never
- * turn on writing transaction flag.
+ * If disable_load_balance_on_write is 'off' or 'dml_adaptive' or
+ * 'dml_adaptive_global', then never turn on writing transaction flag.
*/
- if (pool_config->disable_load_balance_on_write != DLBOW_OFF && pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE)
+ if (pool_config->disable_load_balance_on_write !=
+ DLBOW_OFF &&
+ !DLBOW_IS_DML_ADAPTIVE(
+ pool_config->disable_load_balance_on_write))
{
pool_get_session_context(false)->writing_transaction = true;
ereport(DEBUG5,
diff --git a/src/include/pool.h b/src/include/pool.h
index ea6f87e120af866b8ed3a15790d9d8a8e009fe91..7168c1aea877856b5978de332ad636325eb9c30c 100644
--- a/src/include/pool.h
+++ b/src/include/pool.h
@@ -424,7 +424,7 @@ typedef enum
#define Min(x, y) ((x) < (y) ? (x) : (y))
-#define MAX_NUM_SEMAPHORES 8
+#define MAX_NUM_SEMAPHORES 10
#define CONN_COUNTER_SEM 0
#define REQUEST_INFO_SEM 1
#define QUERY_CACHE_STATS_SEM 2
@@ -434,6 +434,8 @@ typedef enum
#define FOLLOW_PRIMARY_SEM 6
#define MAIN_EXIT_HANDLER_SEM 7 /* used in exit_hander in pgpool main
* process */
+#define TRACK_TABLE_MUTATION_TABLE_SEM 8
+#define TRACK_TABLE_MUTATION_QUERY_SEM 9
#define MAX_REQUEST_QUEUE_SIZE 10
#define MAX_SEC_WAIT_FOR_CLUSTER_TRANSACTION 10 /* time in seconds to keep
diff --git a/src/include/pool_config.h b/src/include/pool_config.h
index 9a397d1666b408dc16dad743a955a718ccbf23f5..c1e6ecc6f0ce62b2fa9d7560f0a199b40126908e 100644
--- a/src/include/pool_config.h
+++ b/src/include/pool_config.h
@@ -105,9 +105,13 @@ typedef enum DLBOW_OPTION
DLBOW_TRANSACTION,
DLBOW_TRANS_TRANSACTION,
DLBOW_ALWAYS,
- DLBOW_DML_ADAPTIVE
+ DLBOW_DML_ADAPTIVE,
+ DLBOW_DML_ADAPTIVE_GLOBAL
} DLBOW_OPTION;
+#define DLBOW_IS_DML_ADAPTIVE(opt) \
+ ((opt) == DLBOW_DML_ADAPTIVE || (opt) == DLBOW_DML_ADAPTIVE_GLOBAL)
+
typedef enum RELQTARGET_OPTION
{
RELQTARGET_PRIMARY = 1,
@@ -365,6 +369,24 @@ typedef struct
* replication check */
char *replication_delay_source_cmd; /* external command for replication delay */
int replication_delay_source_timeout; /* timeout for external command in seconds */
+
+ /* Track table mutation configuration */
+ double track_table_mutation_ttl_factor; /* TTL multiplier for
+ * replication delay */
+ int track_table_mutation_max_staleness; /* max staleness
+ * duration ms */
+ int track_table_mutation_cold_start_duration; /* cold start
+ * duration ms */
+ int track_table_mutation_table_buckets; /* hash buckets for
+ * table map */
+ int track_table_mutation_table_size; /* max table map
+ * entries */
+ int track_table_mutation_query_buckets; /* hash buckets for
+ * query cache */
+ int track_table_mutation_query_parse_cache_size; /* max query
+ * cache
+ * entries */
+
char *failover_command; /* execute command when failover happens */
char *follow_primary_command; /* execute command when failover is
* ended */
diff --git a/src/include/utils/pool_track_table_mutation.h b/src/include/utils/pool_track_table_mutation.h
new file mode 100644
index 0000000000000000000000000000000000000000..b0de2d8093430500c9c1796c418c9bd1d0edd4b3
--- /dev/null
+++ b/src/include/utils/pool_track_table_mutation.h
@@ -0,0 +1,245 @@
+/* -*-pgsql-c-*- */
+/*
+ * pgpool: a language independent connection pool server for PostgreSQL
+ * written by Tatsuo Ishii
+ *
+ * Copyright (c) 2003-2026 PgPool Global Development Group
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose and without fee is hereby
+ * granted, provided that the above copyright notice appear in all
+ * copies and that both that copyright notice and this permission
+ * notice appear in supporting documentation, and that the name of the
+ * author not be used in advertising or publicity pertaining to
+ * distribution of the software without specific, written prior
+ * permission. The author makes no representations about the
+ * suitability of this software for any purpose. It is provided "as
+ * is" without express or implied warranty.
+ *
+ * pool_track_table_mutation.h: In-memory tracking of
+ * recently written tables to prevent stale reads.
+ */
+
+#ifndef POOL_TRACK_TABLE_MUTATION_H
+#define POOL_TRACK_TABLE_MUTATION_H
+
+#include "pool.h"
+#include <sys/time.h>
+
+/*
+ * Maximum length of a schema-qualified table name: "schema"."table"
+ * NAMEDATALEN * 2 + 4 covers both names, four quotes, the dot and the NUL
+ */
+#define TRACK_TABLE_MUTATION_TABLE_NAME_LEN (NAMEDATALEN * 2 + 4)
+
+/*
+ * Maximum number of tables we track per query
+ */
+#define TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY 8
+
+/*
+ * Invalid index marker for linked lists
+ */
+#define TRACK_TABLE_MUTATION_INVALID_INDEX (-1)
+
+/*
+ * Default TTL in microseconds (100ms) used when replication delay is unknown
+ */
+#define TRACK_TABLE_MUTATION_DEFAULT_TTL_US (100 * 1000)
+
+/*
+ * Entry in the table mutation hash table (keyed by table/database oids)
+ */
+typedef struct TrackTableMutationEntry
+{
+ int table_oid; /* Table oid */
+ int dboid; /* Database oid */
+ struct timeval first_write_time; /* When the entry was first created */
+ struct timeval last_write_time; /* When the table was last written */
+ uint32 hash; /* Pre-computed hash value */
+ int next; /* Next in collision chain */
+ bool in_use; /* Is this entry in use? */
+} TrackTableMutationEntry;
+
+/*
+ * Header for the table mutation hash table in shared memory
+ */
+typedef struct TrackTableMutationHashTable
+{
+ int num_buckets; /* Number of hash buckets */
+ int max_entries; /* Maximum entries allowed */
+ int num_entries; /* Current number of entries */
+ int free_list_head; /* Head of free entry list */
+ /* Variable-length arrays follow this header in shared memory:
+ * int buckets[num_buckets];
+ * TrackTableMutationEntry entries[max_entries];
+ */
+} TrackTableMutationHashTable;
+
+/*
+ * Entry in the query parse cache
+ */
+typedef struct QueryParseEntry
+{
+ uint64 query_hash; /* Hash of normalized query */
+ bool is_write; /* True if INSERT/UPDATE/DELETE */
+ int num_tables; /* Number of tables in query */
+ char table_names
+ [TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY]
+ [TRACK_TABLE_MUTATION_TABLE_NAME_LEN];
+ int next; /* Next entry in collision chain */
+ int lru_prev; /* Previous in LRU list */
+ int lru_next; /* Next in LRU list */
+ bool in_use; /* Is this entry in use? */
+} QueryParseEntry;
+
+/*
+ * Header for the query parse cache in shared memory
+ */
+typedef struct QueryParseCache
+{
+ int num_buckets; /* Number of hash buckets */
+ int max_entries; /* Maximum entries allowed */
+ int num_entries; /* Current number of entries */
+ int free_list_head; /* Head of free entry list */
+ int lru_head; /* Most recently used */
+ int lru_tail; /* Least recently used */
+ /* Variable-length arrays follow this header in shared memory:
+ * int buckets[num_buckets];
+ * QueryParseEntry entries[max_entries];
+ */
+} QueryParseCache;
+
+/*
+ * Global state for track table mutation feature
+ */
+typedef struct TrackTableMutationState
+{
+ bool initialized; /* Shmem initialized? */
+ uint64 current_ttl_us; /* Current TTL in microseconds */
+ struct timeval ttl_last_updated; /* When TTL was last updated */
+ struct timeval last_cleanup_time; /* When last expired cleanup ran */
+ struct timeval global_cold_start_until; /* Global cold start end time */
+ uint32 stats_queries_checked; /* Queries checked */
+ uint32 stats_forced_primary; /* Forced to primary */
+ uint32 stats_allowed_replica; /* Allowed to replica */
+} TrackTableMutationState;
+
+/*
+ * Main shared memory structure containing all components
+ */
+typedef struct TrackTableMutationShmem
+{
+ TrackTableMutationState state;
+ TrackTableMutationHashTable *table_map;
+ QueryParseCache *query_cache;
+} TrackTableMutationShmem;
+
+/* ----------------
+ * Public API functions
+ * ----------------
+ */
+
+/*
+ * Initialize shared memory structures for track table mutation.
+ * Called from pgpool_main.c after pool_init_pool_info().
+ */
+extern void pool_track_table_mutation_init(void);
+
+/*
+ * Initialize per-child process state for track table mutation.
+ * Called from child.c when a new child process starts.
+ * Sets up cold start tracking.
+ */
+extern void pool_track_table_mutation_child_init(void);
+
+/*
+ * Check if the child process is in cold start period.
+ * During cold start, all queries are routed to primary.
+ * Returns true if in cold start, false otherwise.
+ */
+extern bool pool_track_table_mutation_in_cold_start(void);
+
+/*
+ * Trigger a global cold start period for all processes.
+ * Used after watchdog leader change to avoid stale reads.
+ */
+extern void pool_track_table_mutation_trigger_global_cold_start(void);
+
+/*
+ * Get oid of current database.
+ */
+extern int pool_track_table_mutation_get_database_oid(void);
+
+/*
+ * Check if a table was recently written to (is "stale").
+ * If stale, reads from this table should go to primary.
+ * Returns true if table is stale (recently written), false otherwise.
+ */
+extern bool pool_track_table_mutation_table_is_stale(
+ int table_oid, int dboid);
+
+/*
+ * Mark tables as recently written.
+ * Called after a write query (INSERT/UPDATE/DELETE/TRUNCATE etc.) completes.
+ * table_oids: array of table oids
+ * num_tables: number of tables in array
+ * dboid: database oid
+ */
+extern void pool_track_table_mutation_mark_tables_written(
+ const int *table_oids, int num_tables, int dboid);
+
+/*
+ * Convenience function to mark a single table as written.
+ * table_oid: table oid
+ * dboid: database oid
+ */
+extern void pool_track_table_mutation_mark_table_written(
+ int table_oid, int dboid);
+
+/*
+ * Update the TTL based on current replication delay.
+ * Called from pool_worker_child.c when replication delay is updated.
+ * delay_us: replication delay in microseconds
+ */
+extern void pool_track_table_mutation_update_ttl(uint64 delay_us);
+
+/*
+ * Look up cached parse result for a query.
+ * hash: hash of normalized query
+ * is_write: output - true if query is a write
+ * table_names: output - array to fill with table names
+ * num_tables: output - number of tables found
+ * Returns true if found in cache, false otherwise.
+ */
+extern bool pool_track_table_mutation_get_cached_parse(
+ uint64 hash, bool *is_write,
+ char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
+ int *num_tables);
+
+/*
+ * Cache a parse result for a query.
+ * hash: hash of normalized query
+ * is_write: true if query is a write
+ * table_names: array of table names
+ * num_tables: number of tables
+ */
+extern void pool_track_table_mutation_cache_parse(
+ uint64 hash, bool is_write,
+ const char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
+ int num_tables);
+
+/*
+ * Normalize a query and compute its hash.
+ * Strips comments, normalizes whitespace and literals.
+ * query: input SQL query string
+ * Returns: 64-bit hash of normalized query
+ */
+extern uint64 pool_track_table_mutation_normalize_and_hash(const char *query);
+
+/*
+ * Calculate required shared memory size for track table mutation.
+ */
+extern Size pool_track_table_mutation_shmem_size(void);
+
+#endif /* POOL_TRACK_TABLE_MUTATION_H */
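The header above declares pool_track_table_mutation_update_ttl() and pool_track_table_mutation_table_is_stale() but this part of the patch does not show their bodies. The sketch below is one plausible reading of the documented behaviour, not the patch's implementation: TTL is the replication delay times the configured factor, with the 100 ms default when the delay is unknown, and an optional cap measured from the first write. Timestamps are plain microsecond counters instead of struct timeval, and all names (WriteStamp, ttl_from_delay_us, entry_is_stale) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors TRACK_TABLE_MUTATION_DEFAULT_TTL_US (100 ms). */
#define DEFAULT_TTL_US (100 * 1000)

/* Hypothetical reduction of TrackTableMutationEntry to its two timestamps. */
typedef struct WriteStamp
{
    uint64_t    first_write_us;	/* when the entry was created */
    uint64_t    last_write_us;	/* when the table was last written */
} WriteStamp;

/* TTL = replication delay * factor; fall back to the default if unknown. */
static uint64_t
ttl_from_delay_us(uint64_t delay_us, double factor)
{
    if (delay_us == 0)
        return DEFAULT_TTL_US;
    return (uint64_t) ((double) delay_us * factor);
}

/*
 * Assumed rule: a table is stale while the last write is younger than the
 * TTL, unless max_staleness_us is non-zero and that much time has already
 * passed since the first write (0 disables the cap).
 */
static bool
entry_is_stale(const WriteStamp *e, uint64_t now_us,
               uint64_t ttl_us, uint64_t max_staleness_us)
{
    assert(now_us >= e->last_write_us);
    assert(e->last_write_us >= e->first_write_us);

    if (now_us - e->last_write_us >= ttl_us)
        return false;			/* TTL window has passed */
    if (max_staleness_us != 0 &&
        now_us - e->first_write_us >= max_staleness_us)
        return false;			/* cap reached despite recent writes */
    return true;
}
```

With a 20 ms replication delay and the default factor of 5.0, ttl_from_delay_us() yields 100 ms, so a SELECT arriving 50 ms after a write would be sent to the primary under this reading.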
diff --git a/src/main/pgpool_main.c b/src/main/pgpool_main.c
index fa05e15e7ac435e072298063f918c70aa4e5680c..395191a1c53a1d76438ce52148375ea89f4f32cf 100644
--- a/src/main/pgpool_main.c
+++ b/src/main/pgpool_main.c
@@ -57,6 +57,7 @@
#include "auth/pool_passwd.h"
#include "auth/pool_hba.h"
#include "query_cache/pool_memqcache.h"
+#include "utils/pool_track_table_mutation.h"
#include "watchdog/wd_internal_commands.h"
#include "watchdog/wd_lifecheck.h"
#include "watchdog/watchdog.h"
@@ -1485,11 +1486,14 @@ sigusr1_interrupt_processor(void)
if (user1SignalSlot->signalFlags[SIG_WATCHDOG_STATE_CHANGED])
{
+ WD_STATES wd_state;
+
ereport(LOG,
(errmsg("Pgpool-II parent process received watchdog state change signal from watchdog")));
user1SignalSlot->signalFlags[SIG_WATCHDOG_STATE_CHANGED] = false;
- if (wd_internal_get_watchdog_local_node_state() == WD_STANDBY)
+ wd_state = wd_internal_get_watchdog_local_node_state();
+ if (wd_state == WD_STANDBY)
{
ereport(LOG,
(errmsg("we have joined the watchdog cluster as STANDBY node"),
@@ -1503,6 +1507,12 @@ sigusr1_interrupt_processor(void)
*/
pool_release_follow_primary_lock(true);
}
+ else if (wd_state == WD_COORDINATOR &&
+ pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ pool_track_table_mutation_trigger_global_cold_start();
+ }
}
if (user1SignalSlot->signalFlags[SIG_FAILOVER_INTERRUPT])
{
@@ -3068,6 +3078,16 @@ initialize_shared_mem_objects(bool clear_memcache_oidmaps)
elog(DEBUG1, "watchdog: %zu bytes requested for shared memory", MAXALIGN(wd_ipc_get_shared_mem_size()));
}
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ size += MAXALIGN(pool_track_table_mutation_shmem_size());
+ elog(DEBUG1,
+ "track_table_mutation: %zu bytes requested"
+ " for shared memory",
+ MAXALIGN(pool_track_table_mutation_shmem_size()));
+ }
+
initialize_shared_memory_main_segment(size);
/* Move the backend descriptors to shared memory */
@@ -3184,6 +3204,13 @@ initialize_shared_mem_objects(bool clear_memcache_oidmaps)
wd_ipc_initialize_data();
}
+ /* Initialize track table mutation for recently written tables */
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ pool_track_table_mutation_init();
+ }
+
}
/*
diff --git a/src/protocol/CommandComplete.c b/src/protocol/CommandComplete.c
index a3b8f0ea194ffecc79e58566be80562a46eb75ab..a4ec83f938c74a339ab6a1b8bca2dc547cc5c219 100644
--- a/src/protocol/CommandComplete.c
+++ b/src/protocol/CommandComplete.c
@@ -38,6 +38,8 @@
#include "utils/palloc.h"
#include "utils/memutils.h"
#include "utils/pool_stream.h"
+#include "utils/pool_track_table_mutation.h"
+#include "query_cache/pool_memqcache.h"
static int extract_ntuples(char *message);
static POOL_STATUS handle_mismatch_tuples(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend, char *packet, int packetlen, bool command_complete);
@@ -304,6 +306,33 @@ handle_query_context(POOL_CONNECTION_POOL *backend)
node = session_context->query_context->parse_tree;
+ /*
+ * Track table writes for dml_adaptive_global feature.
+ * For autocommit statements (not in explicit transaction), mark tables
+ * immediately. For explicit transactions, marking is deferred to COMMIT
+ * in dml_adaptive() so that ROLLBACKed writes don't pollute the shared
+ * memory table map.
+ */
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL &&
+ node != NULL &&
+ !session_context->is_in_transaction)
+ {
+ int *oids;
+ int num_oids;
+
+ num_oids = pool_extract_table_oids(node, &oids);
+ if (num_oids > 0)
+ {
+ int dboid;
+
+ dboid = pool_track_table_mutation_get_database_oid();
+ if (dboid > 0)
+ pool_track_table_mutation_mark_tables_written(
+ oids, num_oids, dboid);
+ }
+ }
+
if (IsA(node, PrepareStmt))
{
if (session_context->uncompleted_message)
diff --git a/src/protocol/child.c b/src/protocol/child.c
index c34f057281be62feaf39db1bb605062f56dc398c..316b76239d163bfdb428f03446384059261f34be 100644
--- a/src/protocol/child.c
+++ b/src/protocol/child.c
@@ -57,6 +57,7 @@
#include "utils/elog.h"
#include "utils/ps_status.h"
#include "utils/timestamp.h"
+#include "utils/pool_track_table_mutation.h"
#include "context/pool_process_context.h"
#include "context/pool_session_context.h"
@@ -213,6 +214,13 @@ do_child(int *fds)
/* Initialize per process context */
pool_init_process_context();
+ /* Initialize track table mutation child state for cold start tracking */
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ pool_track_table_mutation_child_init();
+ }
+
/* initialize connection pool */
if (pool_init_cp())
{
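The cold start handling wired in above (per-child initialisation in do_child(), plus the global trigger on becoming watchdog coordinator) could behave roughly as sketched below. This is an illustration under assumptions: the real state lives partly in shared memory and uses struct timeval, while the sketch keeps two microsecond deadlines in one hypothetical ColdStart struct, and the function names are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical holder for the two deadlines, in microseconds. */
typedef struct ColdStart
{
    uint64_t    child_until_us;		/* set when a child process starts */
    uint64_t    global_until_us;	/* set on watchdog leader change */
} ColdStart;

/* Called once per child: route to primary for duration_ms from now. */
static void
cold_start_child_init(ColdStart *cs, uint64_t now_us, uint32_t duration_ms)
{
    assert(cs != NULL);
    cs->child_until_us = now_us + (uint64_t) duration_ms * 1000;
}

/* Called on leader change: every process observes the new deadline. */
static void
cold_start_trigger_global(ColdStart *cs, uint64_t now_us, uint32_t duration_ms)
{
    assert(cs != NULL);
    cs->global_until_us = now_us + (uint64_t) duration_ms * 1000;
}

/* True while either deadline is still in the future. */
static bool
in_cold_start(const ColdStart *cs, uint64_t now_us)
{
    assert(cs != NULL);
    return now_us < cs->child_until_us || now_us < cs->global_until_us;
}
```

A duration of 0 makes the deadline equal to the current time, so in_cold_start() is false immediately, which matches the documented "set to 0 to disable" behaviour of track_table_mutation_cold_start_duration.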
diff --git a/src/protocol/pool_proto_modules.c b/src/protocol/pool_proto_modules.c
index f9458bb557acb8128a6f0d3411d4f08c1f598c29..706abff5bdbd24fee407ee2a82e8911f74695ea6 100644
--- a/src/protocol/pool_proto_modules.c
+++ b/src/protocol/pool_proto_modules.c
@@ -1461,7 +1461,9 @@ Parse(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend,
pool_where_to_send(query_context, query_context->original_query,
query_context->parse_tree);
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && strlen(name) != 0)
+ if (DLBOW_IS_DML_ADAPTIVE(
+ pool_config->disable_load_balance_on_write)
+ && strlen(name) != 0)
pool_setall_node_to_be_sent(query_context);
if (REPLICATION)
@@ -1804,7 +1806,7 @@ Bind(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend,
return POOL_END;
}
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE &&
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write) &&
TSTATE(backend, MAIN_REPLICA ? PRIMARY_NODE_ID : REAL_MAIN_NODE_ID) == 'T')
{
pool_where_to_send(query_context, query_context->original_query,
diff --git a/src/query_cache/pool_memqcache.c b/src/query_cache/pool_memqcache.c
index f38f711469576342ce59469b085c97365116004c..dca93334e9e47bb7978064edece5ca0e40021ce3 100644
--- a/src/query_cache/pool_memqcache.c
+++ b/src/query_cache/pool_memqcache.c
@@ -1305,6 +1305,12 @@ pool_extract_table_oids(Node *node, int **oidsp)
}
return num_oids;
}
+ else if (IsA(node, MergeStmt))
+ {
+ MergeStmt *stmt = (MergeStmt *) node;
+
+ table = make_table_name_from_rangevar(stmt->relation);
+ }
else if (IsA(node, ExplainStmt))
{
ListCell *cell;
diff --git a/src/sample/pgpool.conf.sample-stream b/src/sample/pgpool.conf.sample-stream
index 1ac982907d2de3ad8cb8d1c70a73dc41428c1327..00132d534f4c78f6844df16aee67365a8d102a61 100644
--- a/src/sample/pgpool.conf.sample-stream
+++ b/src/sample/pgpool.conf.sample-stream
@@ -478,6 +478,14 @@ backend_clustering_mode = streaming_replication
# modified within the current explicit transaction will
# not be load balanced until the end of the transaction.
#
+ # dml_adaptive_global:
+ # Superset of dml_adaptive. In addition to per-transaction
+ # tracking, uses shared memory to track recently written
+ # tables across all sessions. Reads from recently written
+ # tables are routed to primary until a TTL (based on
+ # replication delay) expires. Requires additional shared
+ # memory. See track_table_mutation_* parameters below.
+ #
# always:
# if a write query is issued, read queries will
# not be load balanced until the session ends.
@@ -499,6 +507,54 @@ backend_clustering_mode = streaming_replication
#statement_level_load_balance = off
# Enables statement level load balancing
+# - Track Table Mutation (used by dml_adaptive_global) -
+ # WARNING: dml_adaptive_global increases shared memory usage
+ # Default settings require ~6.4 MB shared memory
+ # (0.1 MB table tracking + 6.3 MB query cache)
+
+#track_table_mutation_ttl_factor = 5.0
+ # TTL multiplier: TTL = replication_delay * factor
+ # Higher values provide more safety margin
+ # Range: 1.0-100.0 (default: 5.0)
+ # (change requires reload)
+
+#track_table_mutation_max_staleness = 60000
+ # Maximum duration (ms) a table can stay marked stale,
+ # measured from its first write. Bounds cross-session impact:
+ # even under continuous writes the mark expires after this
+ # period, and only a new write renews it.
+ # 0 disables the cap. Range: 0-3600000 (default: 60000 = 60s)
+ # (change requires reload)
+
+#track_table_mutation_cold_start_duration = 2000
+ # Duration in milliseconds to route all queries to primary
+ # after child process starts (cold start period)
+ # Range: 0-60000 ms (default: 2000 ms = 2 seconds)
+ # Set to 0 to disable cold start behavior
+ # (change requires reload)
+
+#track_table_mutation_table_buckets = 1024
+ # Number of hash buckets for track table mutation
+ # Higher values reduce hash collisions
+ # Range: 64-65536 (default: 1024)
+ # (change requires restart)
+
+#track_table_mutation_table_size = 2048
+ # Maximum number of tables to track simultaneously
+ # Range: 128-131072 (default: 2048)
+ # (change requires restart)
+
+#track_table_mutation_query_buckets = 2048
+ # Number of hash buckets for query parse cache
+ # Range: 64-65536 (default: 2048)
+ # (change requires restart)
+
+#track_table_mutation_query_parse_cache_size = 10000
+ # Maximum number of query parse results to cache
+ # Range: 100-1000000 (default: 10000)
+ # Memory usage: ~640 bytes per entry (~6.3 MB default, ~64 MB for 100000)
+ # (change requires restart)
+
#------------------------------------------------------------------------------
# STREAMING REPLICATION MODE
#------------------------------------------------------------------------------
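The memory figures quoted in the sample configuration can be checked against the struct layout in pool_track_table_mutation.h. The sketch below mirrors QueryParseEntry under the assumption that NAMEDATALEN is 64, and estimates the query cache footprint as buckets plus entries; how pool_track_table_mutation_shmem_size() actually sums the pieces is not shown in this part of the patch, so treat the formula and the names (QueryParseEntrySketch, query_cache_bytes) as illustrative only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NAMEDATALEN 64			/* assumption: the usual PostgreSQL value */
#define TABLE_NAME_LEN (NAMEDATALEN * 2 + 4)
#define MAX_TABLES_PER_QUERY 8

/* Field-for-field copy of QueryParseEntry from the header in the patch. */
typedef struct QueryParseEntrySketch
{
    uint64_t    query_hash;
    bool        is_write;
    int         num_tables;
    char        table_names[MAX_TABLES_PER_QUERY][TABLE_NAME_LEN];
    int         next;
    int         lru_prev;
    int         lru_next;
    bool        in_use;
} QueryParseEntrySketch;

/* Estimated bytes for the bucket array plus the entry array. */
static size_t
query_cache_bytes(size_t num_buckets, size_t max_entries)
{
    assert(num_buckets > 0 && max_entries > 0);
    return num_buckets * sizeof(int) +
        max_entries * sizeof(QueryParseEntrySketch);
}
```

Printing sizeof(QueryParseEntrySketch) and query_cache_bytes(2048, 10000) on the target platform gives a quick cross-check of the per-entry and default totals documented for track_table_mutation_query_parse_cache_size.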
diff --git a/src/streaming_replication/pool_worker_child.c b/src/streaming_replication/pool_worker_child.c
index 311b638658e66ebb56162ad9fa4392315b2df64e..7eaf63010d79c75ba82a27361bff6fcdbc60dfc6 100644
--- a/src/streaming_replication/pool_worker_child.c
+++ b/src/streaming_replication/pool_worker_child.c
@@ -58,6 +58,7 @@
#include "utils/pool_ip.h"
#include "utils/ps_status.h"
#include "utils/pool_stream.h"
+#include "utils/pool_track_table_mutation.h"
#include "context/pool_process_context.h"
#include "context/pool_session_context.h"
@@ -695,6 +696,7 @@ check_replication_time_lag_with_cmd(void)
double delay_ms;
uint64 delay;
uint64 delay_threshold_by_time;
+ uint64 max_delay_us = 0; /* Track max delay for mutation map */
int token_count = 0;
int primary_node_id;
int save_errno;
@@ -1003,6 +1005,10 @@ check_replication_time_lag_with_cmd(void)
bkinfo->standby_delay = delay;
bkinfo->standby_delay_by_time = true;
+ /* Track maximum delay for table mutation map TTL calculation */
+ if (delay > max_delay_us)
+ max_delay_us = delay;
+
/*
* Log delay if necessary. threshold is in milliseconds, convert
* to microseconds.
@@ -1021,6 +1027,12 @@ check_replication_time_lag_with_cmd(void)
token = strtok_r(NULL, " \t\n", &saveptr);
}
+ /* Update table mutation TTL based on max observed delay */
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL &&
+ max_delay_us > 0)
+ pool_track_table_mutation_update_ttl(max_delay_us);
+
}
PG_CATCH();
{
diff --git a/src/test/regression/libs.sh b/src/test/regression/libs.sh
index 7c5a0c1821191a572430b658d80ab34554110363..1c8ae392daa10056119c09c7127e839d859d700d 100644
--- a/src/test/regression/libs.sh
+++ b/src/test/regression/libs.sh
@@ -42,6 +42,8 @@ function wait_for_failover_done {
function clean_all {
pgrep pgpool | xargs kill -9 > /dev/null 2>&1
pgrep postgres | xargs kill -9 > /dev/null 2>&1
+ # Clean up leaked SysV IPC resources left behind by kill -9
+ ipcrm --all 2>/dev/null || true
rm -f $PGSOCKET_DIR/.s.PGSQL.*
netstat -t -p 2>/dev/null|grep pgpool
}
diff --git a/src/test/regression/tests/042.track_table_mutation/test.sh b/src/test/regression/tests/042.track_table_mutation/test.sh
new file mode 100755
index 0000000000000000000000000000000000000000..8b4dd17b820d36e3fc48216ac7f0544cbf0f5a9c
--- /dev/null
+++ b/src/test/regression/tests/042.track_table_mutation/test.sh
@@ -0,0 +1,354 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for track table mutation feature (in-memory table tracking).
+# Tests routing of queries based on recently written tables.
+#
+source $TESTLIBS
+TESTDIR=testdir
+PSQL=$PGBIN/psql
+PSQLOPTS="-a -q -X"
+PGPOOLBIN=$PGPOOL_INSTALL_DIR/bin
+export PGDATABASE=test
+
+# Only run in streaming replication mode since that's the target use case
+for mode in s
+do
+ rm -fr $TESTDIR
+ mkdir $TESTDIR
+ cd $TESTDIR
+
+ # Create test environment with 2 nodes
+ echo -n "creating test environment..."
+ $PGPOOL_SETUP -m $mode -n 2 || exit 1
+ echo "done."
+
+ source ./bashrc.ports
+
+ # Configure track table mutation feature via dml_adaptive_global
+ echo "disable_load_balance_on_write = 'dml_adaptive_global'" >> etc/pgpool.conf
+ echo "track_table_mutation_ttl_factor = 5.0" >> etc/pgpool.conf
+ echo "track_table_mutation_cold_start_duration = 10000" >> etc/pgpool.conf
+
+ # Enable load balancing explicitly
+ echo "load_balance_mode = on" >> etc/pgpool.conf
+
+ # Configure weights so we can distinguish routing
+ # Backend 0 (primary) weight=0, Backend 1 (standby) weight=1
+ # This means load balanced queries go to node 1 by default
+ echo "backend_weight0 = 0" >> etc/pgpool.conf
+ echo "backend_weight1 = 1" >> etc/pgpool.conf
+
+ # Enable debug logging to see routing decisions
+ echo "log_min_messages = debug1" >> etc/pgpool.conf
+
+ ./startall
+
+ export PGPORT=$PGPOOL_PORT
+ export PGHOST=localhost
+
+ wait_for_pgpool_startup
+
+ # Create test tables
+ $PSQL test <<EOF
+CREATE TABLE t1(i INTEGER);
+CREATE TABLE t2(i INTEGER);
+CREATE TABLE t3(i INTEGER);
+EOF
+
+ echo "=== Test 1: Cold Start Routing ==="
+ # During cold start, all queries should go to primary
+ # Restart pgpool to trigger cold start
+ ./shutdownall
+ ./startall
+ wait_for_pgpool_startup
+
+ # Immediately query - should go to primary due to cold start
+ $PSQL test -c "SELECT 'cold_start_test' as marker, * FROM t1;" > /dev/null 2>&1
+
+ # Check log for cold start message (use -a to handle binary log files)
+ if grep -a -q "could not load balance because of track table mutation cold start" log/pgpool.log; then
+ echo "Test 1 PASSED: Cold start routing works"
+ else
+ echo "Test 1 FAILED: Cold start routing not detected"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 2: Wait for cold start to end ==="
+ # Wait for cold start period to end (10 seconds).
+ # Use generous margin to avoid flakiness under load (e.g. full regression suite).
+ sleep 12
+
+ # Clear the log
+ > log/pgpool.log
+
+ # Now a clean table query should load balance (go to node 1)
+ $PSQL test -c "SELECT 'after_cold_start' as marker, * FROM t3;" > /dev/null 2>&1
+
+ # After cold start, queries to clean tables should load balance
+ # Check that it did NOT get forced to primary due to track table mutation
+ if grep -a -q "could not load balance because of track table mutation cold start" log/pgpool.log; then
+ echo "Test 2 FAILED: Still in cold start after waiting"
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 2 PASSED: Cold start ended correctly"
+
+ echo "=== Test 3: Write-then-Read Routing ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Write to t1 and then read - use single connection to ensure same session
+ $PSQL test <<EOF
+INSERT INTO t1 VALUES (1);
+SELECT 'write_read_test' as marker, * FROM t1;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ # Check log for table staleness message
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 3 PASSED: Write-then-read routing works"
+ else
+ echo "Test 3 FAILED: Table staleness not detected after write"
+ # Show relevant log entries for debugging
+ grep -a -i -E "track table mutation|recently written" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 4: Clean Table Still Load Balances ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Read from t2 (never written to) - should load balance
+ $PSQL test -c "SELECT 'clean_table_test' as marker, * FROM t2;" > /dev/null 2>&1
+
+ # Should NOT see track table mutation blocking message for t2
+ if grep -a -q "could not load balance because table.*t2.*was recently written" log/pgpool.log; then
+ echo "Test 4 FAILED: Clean table incorrectly marked as stale"
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 4 PASSED: Clean tables still load balance"
+
+ echo "=== Test 5: UPDATE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Update t2 and then read - use single connection
+ $PSQL test <<EOF
+UPDATE t2 SET i = 999 WHERE i = 0;
+SELECT 'update_test' as marker, * FROM t2;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 5 PASSED: UPDATE marks table as stale"
+ else
+ echo "Test 5 FAILED: UPDATE did not mark table as stale"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 6: DELETE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Delete from t3 and then read - use single connection
+ $PSQL test <<EOF
+DELETE FROM t3 WHERE i = 0;
+SELECT 'delete_test' as marker, * FROM t3;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 6 PASSED: DELETE marks table as stale"
+ else
+ echo "Test 6 FAILED: DELETE did not mark table as stale"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 7: TRUNCATE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create a fresh table for TRUNCATE test
+ $PSQL test -c "CREATE TABLE t_truncate(i INTEGER);" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_truncate VALUES (1), (2), (3);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log again
+ > log/pgpool.log
+
+ # Truncate and then read - use single connection
+ $PSQL test <<EOF
+TRUNCATE t_truncate;
+SELECT 'truncate_test' as marker, * FROM t_truncate;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 7 PASSED: TRUNCATE marks table as stale"
+ else
+ echo "Test 7 FAILED: TRUNCATE did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 8: WITH Clause (CTE with DELETE) Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create a fresh table for WITH test
+ $PSQL test -c "CREATE TABLE t_cte(i INTEGER);" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_cte VALUES (1), (2), (3);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log again
+ > log/pgpool.log
+
+ # Use WITH clause with DELETE, then read from the table
+ $PSQL test <<EOF
+WITH deleted AS (DELETE FROM t_cte WHERE i = 1 RETURNING *)
+SELECT * FROM deleted;
+SELECT 'cte_test' as marker, * FROM t_cte;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 8 PASSED: WITH clause (CTE) marks table as stale"
+ else
+ echo "Test 8 FAILED: WITH clause (CTE) did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ # Test 9: MERGE (PostgreSQL 15+ only)
+ PG_MAJOR_VERSION=$($PSQL -t -c "SELECT substring(version() from 'PostgreSQL ([0-9]+)');" | tr -d ' ')
+ if [ "$PG_MAJOR_VERSION" -ge 15 ] 2>/dev/null; then
+ echo "=== Test 9: MERGE Marks Table as Stale (PostgreSQL $PG_MAJOR_VERSION) ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create tables for MERGE test
+ $PSQL test -c "CREATE TABLE t_merge_target(id INTEGER PRIMARY KEY, val TEXT);" > /dev/null 2>&1
+ $PSQL test -c "CREATE TABLE t_merge_source(id INTEGER, val TEXT);" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_merge_target VALUES (1, 'old');" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_merge_source VALUES (1, 'new'), (2, 'insert');" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log again
+ > log/pgpool.log
+
+ # Use MERGE, then read from the target table
+ $PSQL test <<EOF
+MERGE INTO t_merge_target t
+USING t_merge_source s ON t.id = s.id
+WHEN MATCHED THEN UPDATE SET val = s.val
+WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.val);
+SELECT 'merge_test' as marker, * FROM t_merge_target;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 9 PASSED: MERGE marks table as stale"
+ else
+ echo "Test 9 FAILED: MERGE did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+ else
+ echo "=== Test 9: MERGE skipped (requires PostgreSQL 15+, have $PG_MAJOR_VERSION) ==="
+ fi
+
+ echo "=== Test 10: ROLLBACK Does NOT Mark Table as Stale ==="
+ # Create a fresh table for rollback test
+ $PSQL test -c "CREATE TABLE t_rollback(i INTEGER);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log
+ > log/pgpool.log
+
+ # Write inside a transaction, then rollback
+ $PSQL test <<EOF
+BEGIN;
+INSERT INTO t_rollback VALUES (1);
+ROLLBACK;
+SELECT 'rollback_test' as marker, * FROM t_rollback;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ # Should NOT see t_rollback marked as stale since the write was rolled back
+ if grep -a -q "could not load balance because table.*t_rollback.*was recently written" log/pgpool.log; then
+ echo "Test 10 FAILED: Rolled-back write incorrectly marked table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 10 PASSED: ROLLBACK does not mark table as stale"
+
+ echo "=== Test 11: COMMIT Marks Table as Stale ==="
+ # Create a fresh table for commit test
+ $PSQL test -c "CREATE TABLE t_commit(i INTEGER);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log
+ > log/pgpool.log
+
+ # Write inside a transaction, then commit, then read
+ $PSQL test <<EOF
+BEGIN;
+INSERT INTO t_commit VALUES (1);
+COMMIT;
+SELECT 'commit_test' as marker, * FROM t_commit;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 11 PASSED: COMMIT marks table as stale"
+ else
+ echo "Test 11 FAILED: Committed write did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ echo ""
+ echo "=== All Track Table Mutation Tests PASSED ==="
+
+ ./shutdownall
+
+ cd ..
+done
+
+exit 0
diff --git a/src/test/regression/tests/043.track_table_mutation_watchdog/test.sh b/src/test/regression/tests/043.track_table_mutation_watchdog/test.sh
new file mode 100755
index 0000000000000000000000000000000000000000..c50c213d6fd741508aab78930808bac303e72b1c
--- /dev/null
+++ b/src/test/regression/tests/043.track_table_mutation_watchdog/test.sh
@@ -0,0 +1,184 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# Test script for track table mutation global cold start
+# on watchdog leader change.
+#
+# Uses $WATCHDOG_SETUP to create a 2-node watchdog cluster,
+# then verifies that when the leader is stopped the new
+# leader triggers a global cold start.
+#-------------------------------------------------------------------
+source $TESTLIBS
+TESTDIR=testdir
+PSQL=$PGBIN/psql
+success_count=0
+
+dir=`pwd`
+rm -fr $TESTDIR
+mkdir $TESTDIR
+cd $TESTDIR
+
+# Create 2-node watchdog cluster
+$WATCHDOG_SETUP -wn 2 || exit 1
+
+# Ensure per-node scripts are executable
+# (sed -i in watchdog_setup can strip permissions)
+chmod 755 pgpool*/startall pgpool*/shutdownall
+
+# Append track_table_mutation config to both nodes
+for i in 0 1
+do
+ cat >> pgpool${i}/etc/pgpool.conf <<EOF
+disable_load_balance_on_write = 'dml_adaptive_global'
+track_table_mutation_cold_start_duration = 2000
+enable_consensus_with_half_votes = on
+log_min_messages = debug1
+EOF
+done
+
+./startall
+export PCPPASSFILE=$dir/$TESTDIR/pgpool0/pcppass
+
+# Wait for watchdog lifecheck on node 0
+echo -n "waiting for watchdog node 0 starting up..."
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep "lifecheck started" \
+ pgpool0/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ break
+ fi
+ sleep 2
+done
+echo "done."
+
+# Test 1: Verify leader came up
+echo "=== Test 1: Waiting for the pgpool leader... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep "I am the cluster leader node" \
+ pgpool0/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 1 PASSED: Leader brought up."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 1 ]; then
+ echo "Test 1 FAILED: Leader did not start"
+ ./shutdownall
+ exit 1
+fi
+
+# Test 2: Verify standby joined cluster
+echo "=== Test 2: Waiting for standby to join... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep "successfully joined the watchdog cluster" \
+ pgpool1/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 2 PASSED: Standby joined."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 2 ]; then
+ echo "Test 2 FAILED: Standby did not join"
+ ./shutdownall
+ exit 1
+fi
+
+# Test 3: Verify track_table_mutation initialized
+echo "=== Test 3: Verify feature initialized ==="
+if grep -a "track_table_mutation: initialized" \
+ pgpool0/log/pgpool.log > /dev/null 2>&1; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 3 PASSED: Feature initialized."
+else
+ echo "Test 3 FAILED: Feature not initialized"
+ ./shutdownall
+ exit 1
+fi
+
+# Test 4: Stop leader (pgpool0) to trigger failover
+echo "=== Test 4: Stopping leader... ==="
+cd pgpool0
+source ./bashrc.ports
+$PGPOOL_INSTALL_DIR/bin/pgpool \
+ -f etc/pgpool.conf -m f stop
+cd ..
+
+echo "Checking standby detected shutdown..."
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep -a "is shutting down" \
+ pgpool1/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 4 PASSED: Shutdown detected."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 4 ]; then
+ echo "Test 4 FAILED: Shutdown not detected"
+ ./shutdownall
+ exit 1
+fi
+
+# Test 5: Verify standby became new leader
+echo "=== Test 5: Checking standby takes over... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep -a "I am the cluster leader node" \
+ pgpool1/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 5 PASSED: Standby became leader."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 5 ]; then
+ echo "Test 5 FAILED: Standby did not become leader"
+ ./shutdownall
+ exit 1
+fi
+
+# Test 6: Verify global cold start was triggered
+echo "=== Test 6: Checking global cold start... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep -a "track_table_mutation: global cold start" \
+ pgpool1/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 6 PASSED: Global cold start triggered."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+# Cleanup
+./shutdownall
+
+echo ""
+echo "$success_count out of 6 successful"
+
+if test $success_count -eq 6
+then
+ echo "=== All Watchdog Tests PASSED ==="
+ exit 0
+fi
+
+exit 1
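Not part of the patch: the polling loops in 043.track_table_mutation_watchdog/test.sh all repeat the same grep-and-sleep pattern. Below is a minimal sketch of how that pattern could be factored out. It assumes a hypothetical helper called wait_for_log; the name, argument order, and defaults are illustrative only and do not exist in the pgpool test library.

```shell
# Hypothetical helper (an assumption, not in the patch or in $TESTLIBS):
# poll a log file until a fixed string appears, or give up after a
# number of attempts.
# Usage: wait_for_log PATTERN FILE [TRIES] [DELAY_SECONDS]
wait_for_log() {
    pattern=$1
    logfile=$2
    tries=${3:-10}
    delay=${4:-2}
    n=1
    while [ "$n" -le "$tries" ]; do
        # -a: treat the log as text, -F: fixed string, -q: no output
        if grep -a -F -q -- "$pattern" "$logfile" 2>/dev/null; then
            return 0
        fi
        echo "[check] $n times"
        n=$((n + 1))
        # Skip the sleep once the last attempt has been made
        [ "$n" -le "$tries" ] && sleep "$delay"
    done
    return 1
}
```

With such a helper, a check like Test 6 could become a single call, for example `wait_for_log "track_table_mutation: global cold start" pgpool1/log/pgpool.log`, followed by an explicit FAILED message when it returns non-zero.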
diff --git a/src/utils/pool_track_table_mutation.c b/src/utils/pool_track_table_mutation.c
new file mode 100644
index 0000000000000000000000000000000000000000..ee09b3f509ed7919c84d1c2ce6906c9c94e5cb06
--- /dev/null
+++ b/src/utils/pool_track_table_mutation.c
@@ -0,0 +1,1453 @@
+/* -*-pgsql-c-*- */
+/*
+ * pgpool: a language independent connection pool server for PostgreSQL
+ * written by Tatsuo Ishii
+ *
+ * Copyright (c) 2003-2026 PgPool Global Development Group
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose and without fee is hereby
+ * granted, provided that the above copyright notice appear in all
+ * copies and that both that copyright notice and this permission
+ * notice appear in supporting documentation, and that the name of the
+ * author not be used in advertising or publicity pertaining to
+ * distribution of the software without specific, written prior
+ * permission. The author makes no representations about the
+ * suitability of this software for any purpose. It is provided "as
+ * is" without express or implied warranty.
+ *
+ * pool_track_table_mutation.c: In-memory tracking of recently
+ * written tables to prevent stale reads from replicas.
+ *
+ * Based on the "lagless" architecture from Tailor Brands.
+ */
+
+#include <string.h>
+#include <stdlib.h>
+#include <sys/time.h>
+#include <unistd.h>
+
+#include "pool.h"
+#include "pool_config.h"
+#include "context/pool_session_context.h"
+#include "utils/pool_track_table_mutation.h"
+#include "utils/elog.h"
+#include "utils/pool_ipc.h"
+#include "utils/palloc.h"
+#include "utils/pool_relcache.h"
+
+#define DATABASE_TO_OID_QUERY \
+ "SELECT oid FROM pg_catalog.pg_database" \
+ " WHERE datname = '%s'"
+
+/*
+ * Helper macro: true when the feature is not active.
+ */
+#define TRACK_TABLE_MUTATION_DISABLED() \
+ (pool_config->disable_load_balance_on_write != \
+ DLBOW_DML_ADAPTIVE_GLOBAL || \
+ track_table_mutation_shmem == NULL)
+
+/* ----------------
+ * Local variables
+ * ----------------
+ */
+
+/* Pointer to shared memory structure */
+static TrackTableMutationShmem *track_table_mutation_shmem = NULL;
+
+/* Per-process cold start tracking (not in shared memory) */
+static struct timeval process_start_time;
+static bool cold_start_initialized = false;
+
+/* ----------------
+ * Helper macros for flexible arrays in shared memory
+ * ----------------
+ */
+
+/* Get pointer to bucket array in table map */
+#define TABLE_MAP_BUCKETS(map) \
+ ((int *)((char *)(map) + \
+ sizeof(TrackTableMutationHashTable)))
+
+/* Get pointer to entry array in table map */
+#define TABLE_MAP_ENTRIES(map) \
+ ((TrackTableMutationEntry *)((char *)(map) + \
+ sizeof(TrackTableMutationHashTable) + \
+ (map)->num_buckets * sizeof(int)))
+
+/* Get pointer to bucket array in parse cache */
+#define PARSE_CACHE_BUCKETS(cache) \
+ ((int *)((char *)(cache) + sizeof(QueryParseCache)))
+
+/* Get pointer to entry array in parse cache */
+#define PARSE_CACHE_ENTRIES(cache) \
+ ((QueryParseEntry *)((char *)(cache) + \
+ sizeof(QueryParseCache) + \
+ (cache)->num_buckets * sizeof(int)))
+
+/* ----------------
+ * Semaphore lock helpers
+ * ----------------
+ */
+
+static inline void
+table_map_lock(void)
+{
+ pool_semaphore_lock(TRACK_TABLE_MUTATION_TABLE_SEM);
+}
+
+static inline void
+table_map_unlock(void)
+{
+ pool_semaphore_unlock(TRACK_TABLE_MUTATION_TABLE_SEM);
+}
+
+static inline void
+parse_cache_lock(void)
+{
+ pool_semaphore_lock(TRACK_TABLE_MUTATION_QUERY_SEM);
+}
+
+static inline void
+parse_cache_unlock(void)
+{
+ pool_semaphore_unlock(TRACK_TABLE_MUTATION_QUERY_SEM);
+}
+
+/* ----------------
+ * Hash functions
+ * ----------------
+ */
+
+/*
+ * FNV-1a hash for table/database oid pair
+ */
+static uint32
+fnv1a_hash_table_key(int table_oid, int dboid)
+{
+ uint32 hash = 2166136261u; /* FNV offset basis */
+ uint32 data[2];
+ const unsigned char *bytes;
+ size_t i;
+
+ data[0] = (uint32) table_oid;
+ data[1] = (uint32) dboid;
+ bytes = (const unsigned char *) data;
+
+ for (i = 0; i < sizeof(data); i++)
+ {
+ hash ^= bytes[i];
+ hash *= 16777619u; /* FNV prime */
+ }
+
+ return hash;
+}
+
+/*
+ * 64-bit FNV-1a hash of a byte string
+ */
+static uint64
+fnv1a_hash_64(const char *str, size_t len)
+{
+ /* FNV offset basis for 64-bit */
+ uint64 hash = 14695981039346656037ULL;
+ size_t i;
+
+ for (i = 0; i < len; i++)
+ {
+ hash ^= (uint8)str[i];
+ hash *= 1099511628211ULL; /* FNV prime */
+ }
+
+ return hash;
+}
+
+/* ----------------
+ * Time utilities
+ * ----------------
+ */
+
+/*
+ * Get elapsed time in microseconds between two timevals
+ */
+static int64
+elapsed_us(struct timeval *start, struct timeval *end)
+{
+ return ((int64)(end->tv_sec - start->tv_sec) * 1000000)
+ + (end->tv_usec - start->tv_usec);
+}
+
+/*
+ * Get current time
+ */
+static void
+get_current_time(struct timeval *tv)
+{
+ gettimeofday(tv, NULL);
+}
+
+/* ----------------
+ * Database oid lookup
+ * ----------------
+ */
+
+static int
+track_table_mutation_get_database_oid_internal(void)
+{
+ int oid = 0;
+ static POOL_RELCACHE *relcache;
+ POOL_CONNECTION_POOL *backend;
+ POOL_SESSION_CONTEXT *session_context;
+
+ /* Safety check: must have shmem initialized */
+ if (track_table_mutation_shmem == NULL)
+ return oid;
+
+ session_context = pool_get_session_context(false);
+ if (session_context == NULL)
+ return oid;
+
+ backend = session_context->backend;
+ if (backend == NULL ||
+ MAIN_CONNECTION(backend) == NULL ||
+ MAIN_CONNECTION(backend)->sp == NULL)
+ return oid;
+
+ /* Ensure database name is valid */
+ if (MAIN_CONNECTION(backend)->sp->database == NULL)
+ return oid;
+
+ if (!relcache)
+ {
+ relcache = pool_create_relcache(
+ pool_config->relcache_size,
+ DATABASE_TO_OID_QUERY,
+ int_register_func,
+ int_unregister_func,
+ false);
+ if (relcache == NULL)
+ {
+ ereport(LOG,
+ (errmsg("track_table_mutation: "
+ "error creating relcache")));
+ return oid;
+ }
+ }
+
+ oid = (int) (intptr_t) pool_search_relcache(
+ relcache, backend,
+ MAIN_CONNECTION(backend)->sp->database);
+ return oid;
+}
+
+int
+pool_track_table_mutation_get_database_oid(void)
+{
+ return track_table_mutation_get_database_oid_internal();
+}
+
+/* ----------------
+ * Table mutation hash table operations
+ * ----------------
+ */
+
+/*
+ * Initialize table mutation hash table
+ */
+static void
+table_map_init(TrackTableMutationHashTable *map,
+ int num_buckets, int max_entries)
+{
+ int *buckets;
+ TrackTableMutationEntry *entries;
+ int i;
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ map->num_buckets = num_buckets;
+ map->max_entries = max_entries;
+ map->num_entries = 0;
+ map->free_list_head = 0;
+
+ buckets = TABLE_MAP_BUCKETS(map);
+ entries = TABLE_MAP_ENTRIES(map);
+
+ /* Initialize all buckets to empty */
+ for (i = 0; i < num_buckets; i++)
+ buckets[i] = invalid;
+
+ /* Initialize free list - chain all entries */
+ for (i = 0; i < max_entries; i++)
+ {
+ entries[i].in_use = false;
+ entries[i].next = (i < max_entries - 1) ?
+ i + 1 : invalid;
+ }
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "table map init %d buckets, "
+ "%d max entries",
+ num_buckets, max_entries)));
+}
+
+/*
+ * Allocate an entry from the free list
+ */
+static int
+table_map_alloc_entry(TrackTableMutationHashTable *map)
+{
+ TrackTableMutationEntry *entries;
+ int idx;
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ entries = TABLE_MAP_ENTRIES(map);
+
+ if (map->free_list_head == invalid)
+ return invalid;
+
+ idx = map->free_list_head;
+ map->free_list_head = entries[idx].next;
+ entries[idx].in_use = true;
+ entries[idx].next = invalid;
+ map->num_entries++;
+
+ return idx;
+}
+
+/*
+ * Free an entry back to the free list
+ */
+static void
+table_map_free_entry(TrackTableMutationHashTable *map,
+ int idx)
+{
+ TrackTableMutationEntry *entries;
+
+ entries = TABLE_MAP_ENTRIES(map);
+
+ entries[idx].in_use = false;
+ entries[idx].next = map->free_list_head;
+ map->free_list_head = idx;
+ map->num_entries--;
+}
+
+/*
+ * Look up a table in the hash table.
+ * Returns entry index or INVALID_INDEX if not found.
+ * Must be called with lock held.
+ */
+static int
+table_map_lookup(TrackTableMutationHashTable *map,
+ int table_oid, int dboid,
+ uint32 hash)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TrackTableMutationEntry *entries;
+ int bucket = hash % map->num_buckets;
+ int idx = buckets[bucket];
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ entries = TABLE_MAP_ENTRIES(map);
+
+ while (idx != invalid)
+ {
+ if (entries[idx].hash == hash &&
+ entries[idx].table_oid == table_oid &&
+ entries[idx].dboid == dboid)
+ {
+ return idx;
+ }
+ idx = entries[idx].next;
+ }
+
+ return invalid;
+}
+
+/*
+ * Insert or update a table entry.
+ * Must be called with lock held.
+ */
+static void
+table_map_insert(TrackTableMutationHashTable *map,
+ int table_oid, int dboid,
+ uint32 hash,
+ struct timeval *write_time)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TrackTableMutationEntry *entries;
+ int bucket = hash % map->num_buckets;
+ int idx;
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ entries = TABLE_MAP_ENTRIES(map);
+
+ /* Check if entry already exists */
+ idx = table_map_lookup(map, table_oid, dboid, hash);
+ if (idx != invalid)
+ {
+ /* Update last write time; keep first_write_time */
+ entries[idx].last_write_time = *write_time;
+ return;
+ }
+
+ /* Allocate new entry */
+ idx = table_map_alloc_entry(map);
+ if (idx == invalid)
+ {
+ int b;
+
+ /* Table is full - evict head of first non-empty bucket */
+ for (b = 0; b < map->num_buckets; b++)
+ {
+ if (buckets[b] != invalid)
+ {
+ int victim = buckets[b];
+
+ buckets[b] = entries[victim].next;
+ table_map_free_entry(map, victim);
+ idx = table_map_alloc_entry(map);
+ break;
+ }
+ }
+
+ if (idx == invalid)
+ {
+ ereport(WARNING,
+ (errmsg("track_table_mutation: "
+ "failed to allocate entry "
+ "for oid %d (dboid %d)",
+ table_oid, dboid)));
+ return;
+ }
+ }
+
+ /* Initialize new entry */
+ entries[idx].table_oid = table_oid;
+ entries[idx].dboid = dboid;
+ entries[idx].hash = hash;
+ entries[idx].first_write_time = *write_time;
+ entries[idx].last_write_time = *write_time;
+
+ /* Insert at head of bucket chain */
+ entries[idx].next = buckets[bucket];
+ buckets[bucket] = idx;
+
+ ereport(DEBUG2,
+ (errmsg("track_table_mutation: "
+ "marked oid %d (dboid %d) written",
+ table_oid, dboid)));
+}
+
+/*
+ * Remove expired entries from the table map.
+ * Must be called with lock held.
+ */
+static void
+table_map_cleanup_expired(
+ TrackTableMutationHashTable *map, uint64 ttl_us)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TrackTableMutationEntry *entries;
+ struct timeval now;
+ int64 max_stale_us;
+ int removed = 0;
+ int b;
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ entries = TABLE_MAP_ENTRIES(map);
+ get_current_time(&now);
+
+ max_stale_us = (int64)pool_config
+ ->track_table_mutation_max_staleness * 1000LL;
+
+ for (b = 0; b < map->num_buckets; b++)
+ {
+ int *prev_ptr = &buckets[b];
+ int idx = buckets[b];
+
+ while (idx != invalid)
+ {
+ int64 age;
+ int64 total_age;
+ bool expired;
+
+ age = elapsed_us(
+ &entries[idx].last_write_time, &now);
+ expired = (age > (int64)ttl_us);
+
+ /*
+ * Also evict entries that exceed
+ * max_staleness from first write.
+ */
+ if (!expired && max_stale_us > 0)
+ {
+ total_age = elapsed_us(
+ &entries[idx].first_write_time,
+ &now);
+ expired = (total_age >= max_stale_us);
+ }
+
+ if (expired)
+ {
+ /* Entry has expired - remove it */
+ int next = entries[idx].next;
+
+ *prev_ptr = next;
+ table_map_free_entry(map, idx);
+ idx = next;
+ removed++;
+ }
+ else
+ {
+ prev_ptr = &entries[idx].next;
+ idx = entries[idx].next;
+ }
+ }
+ }
+
+ if (removed > 0)
+ {
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "cleaned up %d expired entries",
+ removed)));
+ }
+}
+
+/* ----------------
+ * Parse cache operations
+ * ----------------
+ */
+
+/*
+ * Initialize parse cache
+ */
+static void
+parse_cache_init(QueryParseCache *cache,
+ int num_buckets, int max_entries)
+{
+ int *buckets;
+ QueryParseEntry *entries;
+ int i;
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ cache->num_buckets = num_buckets;
+ cache->max_entries = max_entries;
+ cache->num_entries = 0;
+ cache->free_list_head = 0;
+ cache->lru_head = invalid;
+ cache->lru_tail = invalid;
+
+ buckets = PARSE_CACHE_BUCKETS(cache);
+ entries = PARSE_CACHE_ENTRIES(cache);
+
+ /* Initialize all buckets to empty */
+ for (i = 0; i < num_buckets; i++)
+ buckets[i] = invalid;
+
+ /* Initialize free list */
+ for (i = 0; i < max_entries; i++)
+ {
+ entries[i].in_use = false;
+ entries[i].next = (i < max_entries - 1) ?
+ i + 1 : invalid;
+ entries[i].lru_prev = invalid;
+ entries[i].lru_next = invalid;
+ }
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "parse cache init %d buckets, "
+ "%d max entries",
+ num_buckets, max_entries)));
+}
+
+/*
+ * Move entry to front of LRU list (most recently used)
+ */
+static void
+parse_cache_lru_touch(QueryParseCache *cache, int idx)
+{
+ QueryParseEntry *entries = PARSE_CACHE_ENTRIES(cache);
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ /* Already at head? */
+ if (cache->lru_head == idx)
+ return;
+
+ /* Remove from current position */
+ if (entries[idx].lru_prev != invalid)
+ entries[entries[idx].lru_prev].lru_next =
+ entries[idx].lru_next;
+ if (entries[idx].lru_next != invalid)
+ entries[entries[idx].lru_next].lru_prev =
+ entries[idx].lru_prev;
+ if (cache->lru_tail == idx)
+ cache->lru_tail = entries[idx].lru_prev;
+
+ /* Insert at head */
+ entries[idx].lru_prev = invalid;
+ entries[idx].lru_next = cache->lru_head;
+ if (cache->lru_head != invalid)
+ entries[cache->lru_head].lru_prev = idx;
+ cache->lru_head = idx;
+ if (cache->lru_tail == invalid)
+ cache->lru_tail = idx;
+}
+
+/*
+ * Add entry to LRU list (at head)
+ */
+static void
+parse_cache_lru_add(QueryParseCache *cache, int idx)
+{
+ QueryParseEntry *entries = PARSE_CACHE_ENTRIES(cache);
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ entries[idx].lru_prev = invalid;
+ entries[idx].lru_next = cache->lru_head;
+
+ if (cache->lru_head != invalid)
+ entries[cache->lru_head].lru_prev = idx;
+
+ cache->lru_head = idx;
+
+ if (cache->lru_tail == invalid)
+ cache->lru_tail = idx;
+}
+
+/*
+ * Remove entry from LRU list
+ */
+static void
+parse_cache_lru_remove(QueryParseCache *cache, int idx)
+{
+ QueryParseEntry *entries = PARSE_CACHE_ENTRIES(cache);
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ if (entries[idx].lru_prev != invalid)
+ entries[entries[idx].lru_prev].lru_next =
+ entries[idx].lru_next;
+ else
+ cache->lru_head = entries[idx].lru_next;
+
+ if (entries[idx].lru_next != invalid)
+ entries[entries[idx].lru_next].lru_prev =
+ entries[idx].lru_prev;
+ else
+ cache->lru_tail = entries[idx].lru_prev;
+
+ entries[idx].lru_prev = invalid;
+ entries[idx].lru_next = invalid;
+}
+
+/*
+ * Allocate entry from free list, evicting LRU if needed
+ */
+static int
+parse_cache_alloc_entry(QueryParseCache *cache)
+{
+ QueryParseEntry *entries = PARSE_CACHE_ENTRIES(cache);
+ int *buckets = PARSE_CACHE_BUCKETS(cache);
+ int idx;
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ if (cache->free_list_head != invalid)
+ {
+ idx = cache->free_list_head;
+ cache->free_list_head = entries[idx].next;
+ entries[idx].in_use = true;
+ entries[idx].next = invalid;
+ cache->num_entries++;
+ return idx;
+ }
+
+ /* No free entries - evict LRU */
+ if (cache->lru_tail == invalid)
+ return invalid;
+
+ idx = cache->lru_tail;
+
+ /* Remove from hash bucket */
+ {
+ int bucket;
+ int *prev_ptr;
+ int curr;
+
+ bucket = entries[idx].query_hash %
+ cache->num_buckets;
+ prev_ptr = &buckets[bucket];
+ curr = buckets[bucket];
+
+ while (curr != invalid)
+ {
+ if (curr == idx)
+ {
+ *prev_ptr = entries[curr].next;
+ break;
+ }
+ prev_ptr = &entries[curr].next;
+ curr = entries[curr].next;
+ }
+ }
+
+ /* Remove from LRU list */
+ parse_cache_lru_remove(cache, idx);
+
+ /* Reinitialize entry */
+ entries[idx].in_use = true;
+ entries[idx].next = invalid;
+
+ return idx;
+}
+
+/*
+ * Look up a query in the parse cache
+ */
+static int
+parse_cache_lookup(QueryParseCache *cache, uint64 hash)
+{
+ int *buckets = PARSE_CACHE_BUCKETS(cache);
+ QueryParseEntry *entries = PARSE_CACHE_ENTRIES(cache);
+ int bucket = hash % cache->num_buckets;
+ int idx = buckets[bucket];
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ while (idx != invalid)
+ {
+ if (entries[idx].query_hash == hash)
+ return idx;
+ idx = entries[idx].next;
+ }
+
+ return invalid;
+}
+
+/* ----------------
+ * Query normalization
+ * ----------------
+ */
+
+/*
+ * Simple query normalization:
+ * - Strip comments (-- and C-style block comments)
+ * - Collapse whitespace
+ * - Convert to lowercase outside quoted text
+ * - Replace quoted text ('...' or "...") with '$'; reduce digit runs to at most one '$'
+ */
+static size_t
+normalize_query(const char *query, char *output,
+ size_t output_size)
+{
+ const char *src = query;
+ char *dst = output;
+ char *dst_end = output + output_size - 1;
+ bool in_string = false;
+ char string_char = 0;
+ bool last_was_space = true;
+
+ while (*src && dst < dst_end)
+ {
+ /* Handle string literals */
+ if (in_string)
+ {
+ if (*src == string_char)
+ {
+ if (*(src + 1) == string_char)
+ {
+ /* Escaped quote */
+ src += 2;
+ continue;
+ }
+ in_string = false;
+ /* Replace string with placeholder */
+ *dst++ = '$';
+ }
+ src++;
+ continue;
+ }
+
+ /* Check for string start */
+ if (*src == '\'' || *src == '"')
+ {
+ in_string = true;
+ string_char = *src;
+ src++;
+ continue;
+ }
+
+ /* Handle single-line comments */
+ if (*src == '-' && *(src + 1) == '-')
+ {
+ while (*src && *src != '\n')
+ src++;
+ continue;
+ }
+
+ /* Handle multi-line comments */
+ if (*src == '/' && *(src + 1) == '*')
+ {
+ src += 2;
+ while (*src &&
+ !(*src == '*' && *(src + 1) == '/'))
+ src++;
+ if (*src)
+ src += 2;
+ continue;
+ }
+
+ /* Handle whitespace */
+ if (*src == ' ' || *src == '\t' ||
+ *src == '\n' || *src == '\r')
+ {
+ if (!last_was_space)
+ {
+ *dst++ = ' ';
+ last_was_space = true;
+ }
+ src++;
+ continue;
+ }
+
+ /* Handle numbers - replace with placeholder */
+ if ((*src >= '0' && *src <= '9') ||
+ (*src == '.' && *(src + 1) >= '0' &&
+ *(src + 1) <= '9'))
+ {
+ while (*src &&
+ ((*src >= '0' && *src <= '9') ||
+ *src == '.'))
+ src++;
+ if (!last_was_space &&
+ dst > output && *(dst - 1) != '$')
+ *dst++ = '$';
+ last_was_space = false;
+ continue;
+ }
+
+ /* Regular character - convert to lowercase */
+ if (*src >= 'A' && *src <= 'Z')
+ *dst++ = *src + 32;
+ else
+ *dst++ = *src;
+
+ last_was_space = false;
+ src++;
+ }
+
+ /* Remove trailing space */
+ if (dst > output && *(dst - 1) == ' ')
+ dst--;
+
+ *dst = '\0';
+ return dst - output;
+}
+
+/* ----------------
+ * Public API implementation
+ * ----------------
+ */
+
+/*
+ * Calculate the total shared memory size required
+ * for the track table mutation feature.
+ */
+Size
+pool_track_table_mutation_shmem_size(void)
+{
+ Size size = 0;
+ int tbl_bkt;
+ int tbl_sz;
+ int qry_bkt;
+ int qry_sz;
+
+ tbl_bkt = pool_config->track_table_mutation_table_buckets;
+ tbl_sz = pool_config->track_table_mutation_table_size;
+ qry_bkt = pool_config->track_table_mutation_query_buckets;
+ qry_sz = pool_config->track_table_mutation_query_parse_cache_size;
+
+ /* Main structure */
+ size += sizeof(TrackTableMutationShmem);
+
+ /* Table mutation hash table */
+ size += sizeof(TrackTableMutationHashTable);
+ size += tbl_bkt * sizeof(int);
+ size += tbl_sz * sizeof(TrackTableMutationEntry);
+
+ /* Parse cache */
+ size += sizeof(QueryParseCache);
+ size += qry_bkt * sizeof(int);
+ size += qry_sz * sizeof(QueryParseEntry);
+
+ return size;
+}
+
+/*
+ * Initialize shared memory structures for the
+ * track table mutation feature. Allocates and sets
+ * up the table map and parse cache in shared memory.
+ * Called once from pgpool main process at startup.
+ */
+void
+pool_track_table_mutation_init(void)
+{
+#ifndef POOL_PRIVATE
+ Size shmem_size;
+ char *shmem_ptr;
+ TrackTableMutationState *st;
+ int tbl_bkt;
+ int tbl_sz;
+ int qry_bkt;
+ int qry_sz;
+
+ if (pool_config->disable_load_balance_on_write !=
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "feature disabled")));
+ return;
+ }
+
+ tbl_bkt = pool_config->track_table_mutation_table_buckets;
+ tbl_sz = pool_config->track_table_mutation_table_size;
+ qry_bkt = pool_config->track_table_mutation_query_buckets;
+ qry_sz = pool_config->track_table_mutation_query_parse_cache_size;
+
+ shmem_size = pool_track_table_mutation_shmem_size();
+
+ /*
+ * Allocate from the main shared memory segment.
+ * Memory is zeroed by
+ * initialize_shared_memory_main_segment().
+ */
+ shmem_ptr = pool_shared_memory_segment_get_chunk(
+ shmem_size);
+ if (shmem_ptr == NULL)
+ {
+ ereport(ERROR,
+ (errmsg("track_table_mutation: "
+ "failed to allocate %zu bytes",
+ shmem_size)));
+ return;
+ }
+
+ /* Set up pointers within shared memory */
+ track_table_mutation_shmem =
+ (TrackTableMutationShmem *) shmem_ptr;
+ shmem_ptr += sizeof(TrackTableMutationShmem);
+
+ track_table_mutation_shmem->table_map =
+ (TrackTableMutationHashTable *) shmem_ptr;
+ shmem_ptr += sizeof(TrackTableMutationHashTable);
+ shmem_ptr += tbl_bkt * sizeof(int);
+ shmem_ptr += tbl_sz * sizeof(TrackTableMutationEntry);
+
+ track_table_mutation_shmem->query_cache =
+ (QueryParseCache *) shmem_ptr;
+
+ /* Initialize structures */
+ table_map_init(
+ track_table_mutation_shmem->table_map,
+ tbl_bkt, tbl_sz);
+
+ parse_cache_init(
+ track_table_mutation_shmem->query_cache,
+ qry_bkt, qry_sz);
+
+ /* Initialize global state */
+ st = &track_table_mutation_shmem->state;
+ st->initialized = true;
+ st->current_ttl_us = TRACK_TABLE_MUTATION_DEFAULT_TTL_US;
+ get_current_time(&st->ttl_last_updated);
+ get_current_time(&st->last_cleanup_time);
+ st->global_cold_start_until.tv_sec = 0;
+ st->global_cold_start_until.tv_usec = 0;
+ st->stats_queries_checked = 0;
+ st->stats_forced_primary = 0;
+ st->stats_allowed_replica = 0;
+
+ ereport(LOG,
+ (errmsg("track_table_mutation: "
+ "initialized with %zu bytes shmem",
+ shmem_size)));
+#endif
+}
+
+/*
+ * Initialize per-child process state.
+ * Records the process start time for cold start
+ * period tracking. Called when a child process starts.
+ */
+void
+pool_track_table_mutation_child_init(void)
+{
+ int dur;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return;
+
+ get_current_time(&process_start_time);
+ cold_start_initialized = true;
+ dur = pool_config->track_table_mutation_cold_start_duration;
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "child init, cold start %d ms",
+ dur)));
+}
+
+/*
+ * Check if the process is in cold start period.
+ * During cold start, all queries are routed to
+ * primary to avoid stale reads. Checks both
+ * per-process and global (watchdog) cold start.
+ */
+bool
+pool_track_table_mutation_in_cold_start(void)
+{
+ struct timeval now;
+ int64 elapsed_ms;
+ int dur;
+ TrackTableMutationState *st;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return false;
+
+ dur = pool_config->track_table_mutation_cold_start_duration;
+ if (dur <= 0)
+ return false;
+
+ get_current_time(&now);
+ st = &track_table_mutation_shmem->state;
+
+ /* Check watchdog-triggered global cold start */
+ if (st->global_cold_start_until.tv_sec != 0 &&
+ elapsed_us(&now,
+ &st->global_cold_start_until) > 0)
+ {
+ return true;
+ }
+
+ /* Check per-process cold start */
+ if (!cold_start_initialized)
+ return false;
+
+ elapsed_ms = elapsed_us(&process_start_time, &now) / 1000;
+
+ if (elapsed_ms < dur)
+ {
+ ereport(DEBUG2,
+ (errmsg("track_table_mutation: "
+ "cold start (%ld/%d ms)",
+ (long)elapsed_ms, dur)));
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Trigger a global cold start for all processes.
+ * Sets the cold start end time in shared memory.
+ * Called after watchdog leader change to force all
+ * queries to primary during the transition.
+ */
+void
+pool_track_table_mutation_trigger_global_cold_start(void)
+{
+ struct timeval now;
+ struct timeval *until;
+ int dur;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return;
+
+ dur = pool_config->track_table_mutation_cold_start_duration;
+ if (dur <= 0)
+ return;
+
+ get_current_time(&now);
+ until = &track_table_mutation_shmem->state
+ .global_cold_start_until;
+ *until = now;
+ until->tv_sec += dur / 1000;
+ until->tv_usec += (dur % 1000) * 1000;
+ if (until->tv_usec >= 1000000)
+ {
+ until->tv_sec += until->tv_usec / 1000000;
+ until->tv_usec %= 1000000;
+ }
+
+ ereport(LOG,
+ (errmsg("track_table_mutation: "
+ "global cold start for %d ms",
+ dur)));
+}
+
+/*
+ * Check if a table was recently written (is "stale").
+ * Returns true if reads should go to primary because
+ * the table was written within the current TTL window.
+ */
+bool
+pool_track_table_mutation_table_is_stale(
+ int table_oid, int dboid)
+{
+ TrackTableMutationHashTable *map;
+ struct timeval now;
+ uint64 ttl_us;
+ uint32 hash;
+ int idx;
+ bool is_stale = false;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return false;
+
+ if (table_oid <= 0 || dboid <= 0)
+ {
+ is_stale = true;
+ goto update_stats;
+ }
+
+ map = track_table_mutation_shmem->table_map;
+ hash = fnv1a_hash_table_key(table_oid, dboid);
+
+ table_map_lock();
+
+ idx = table_map_lookup(map, table_oid, dboid, hash);
+ if (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ TrackTableMutationEntry *entries;
+ int64 age;
+ int64 total_age;
+ int64 max_stale_us;
+
+ entries = TABLE_MAP_ENTRIES(map);
+ get_current_time(&now);
+ ttl_us = track_table_mutation_shmem->state
+ .current_ttl_us;
+
+ age = elapsed_us(
+ &entries[idx].last_write_time, &now);
+ is_stale = (age < (int64)ttl_us);
+
+ /*
+ * Enforce max_staleness hard cap: no entry
+ * can force primary routing longer than
+ * max_staleness from its first write.
+ */
+ if (is_stale)
+ {
+ max_stale_us = (int64)pool_config
+ ->track_table_mutation_max_staleness
+ * 1000LL;
+ if (max_stale_us > 0)
+ {
+ total_age = elapsed_us(
+ &entries[idx].first_write_time,
+ &now);
+ if (total_age >= max_stale_us)
+ is_stale = false;
+ }
+ }
+
+ ereport(DEBUG2,
+ (errmsg("track_table_mutation: "
+ "oid %d dboid %d "
+ "elapsed=%ld ttl=%lu stale=%d",
+ table_oid, dboid,
+ (long)age,
+ (unsigned long)ttl_us,
+ is_stale)));
+ }
+
+ table_map_unlock();
+
+update_stats:
+ /* Update statistics using semaphore */
+ if (track_table_mutation_shmem != NULL)
+ {
+ TrackTableMutationState *st;
+
+ table_map_lock();
+ st = &track_table_mutation_shmem->state;
+ st->stats_queries_checked++;
+ if (is_stale)
+ st->stats_forced_primary++;
+ else
+ st->stats_allowed_replica++;
+ table_map_unlock();
+ }
+
+ return is_stale;
+}
+
+/*
+ * Mark multiple tables as recently written.
+ * Called after DML queries complete to record
+ * which tables were modified.
+ */
+void
+pool_track_table_mutation_mark_tables_written(
+ const int *table_oids, int num_tables, int dboid)
+{
+ TrackTableMutationHashTable *map;
+ TrackTableMutationState *st;
+ struct timeval now;
+ int i;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return;
+
+ if (num_tables <= 0 || table_oids == NULL ||
+ dboid <= 0)
+ return;
+
+ map = track_table_mutation_shmem->table_map;
+ st = &track_table_mutation_shmem->state;
+ get_current_time(&now);
+
+ table_map_lock();
+
+ /* Periodically clean up expired entries */
+ if (map->num_entries > map->max_entries * 3 / 4)
+ {
+ int64 since_cleanup;
+
+ since_cleanup = elapsed_us(
+ &st->last_cleanup_time, &now);
+ /* 100ms interval */
+ if (since_cleanup > 100000)
+ {
+ table_map_cleanup_expired(
+ map, st->current_ttl_us);
+ st->last_cleanup_time = now;
+ }
+ }
+
+ for (i = 0; i < num_tables; i++)
+ {
+ uint32 hash;
+ int table_oid = table_oids[i];
+
+ if (table_oid > 0)
+ {
+ hash = fnv1a_hash_table_key(
+ table_oid, dboid);
+ table_map_insert(map, table_oid,
+ dboid, hash, &now);
+ }
+ }
+
+ table_map_unlock();
+}
+
+/*
+ * Mark a single table as recently written.
+ */
+void
+pool_track_table_mutation_mark_table_written(
+ int table_oid, int dboid)
+{
+ if (table_oid > 0 && dboid > 0)
+ {
+ const int tables[1] = { table_oid };
+
+ pool_track_table_mutation_mark_tables_written(
+ tables, 1, dboid);
+ }
+}
+
+/*
+ * Update the staleness TTL based on observed
+ * replication delay. New TTL = delay * factor,
+ * clamped to [default_ttl, 1 hour].
+ */
+void
+pool_track_table_mutation_update_ttl(uint64 delay_us)
+{
+ uint64 new_ttl;
+ double factor;
+ TrackTableMutationState *st;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return;
+
+ factor = pool_config->track_table_mutation_ttl_factor;
+ new_ttl = (uint64)(delay_us * factor);
+ if (new_ttl < TRACK_TABLE_MUTATION_DEFAULT_TTL_US)
+ new_ttl = TRACK_TABLE_MUTATION_DEFAULT_TTL_US;
+
+ /* Maximum TTL of 1 hour */
+ if (new_ttl > 3600ULL * 1000000ULL)
+ new_ttl = 3600ULL * 1000000ULL;
+
+ st = &track_table_mutation_shmem->state;
+ st->current_ttl_us = new_ttl;
+ get_current_time(&st->ttl_last_updated);
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "TTL=%lu us (delay=%lu factor=%.1f)",
+ (unsigned long)new_ttl,
+ (unsigned long)delay_us,
+ factor)));
+}
+
+/*
+ * Look up a cached parse result by query hash.
+ * Returns true and fills output parameters if
+ * the query was found in the parse cache.
+ */
+bool
+pool_track_table_mutation_get_cached_parse(
+ uint64 hash, bool *is_write,
+ char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
+ int *num_tables)
+{
+ QueryParseCache *cache;
+ int idx;
+ bool found = false;
+ int max_tables;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return false;
+
+ max_tables = TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY;
+ cache = track_table_mutation_shmem->query_cache;
+
+ parse_cache_lock();
+
+ idx = parse_cache_lookup(cache, hash);
+ if (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ QueryParseEntry *entries;
+ int i;
+ int namelen;
+
+ entries = PARSE_CACHE_ENTRIES(cache);
+ namelen = TRACK_TABLE_MUTATION_TABLE_NAME_LEN;
+ *is_write = entries[idx].is_write;
+ *num_tables = entries[idx].num_tables;
+
+ for (i = 0;
+ i < entries[idx].num_tables &&
+ i < max_tables;
+ i++)
+ {
+ strlcpy(table_names[i],
+ entries[idx].table_names[i],
+ namelen);
+ }
+
+ /* Move to front of LRU */
+ parse_cache_lru_touch(cache, idx);
+ found = true;
+ }
+
+ parse_cache_unlock();
+
+ return found;
+}
+
+/*
+ * Store a parse result in the shared cache.
+ * Evicts the LRU entry if the cache is full.
+ */
+void
+pool_track_table_mutation_cache_parse(
+ uint64 hash, bool is_write,
+ const char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
+ int num_tables)
+{
+ QueryParseCache *cache;
+ int *buckets;
+ QueryParseEntry *entries;
+ int idx;
+ int bucket;
+ int max_tables;
+ int namelen;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return;
+
+ max_tables = TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY;
+ namelen = TRACK_TABLE_MUTATION_TABLE_NAME_LEN;
+ cache = track_table_mutation_shmem->query_cache;
+
+ parse_cache_lock();
+
+ /* Check if already exists */
+ idx = parse_cache_lookup(cache, hash);
+ if (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ parse_cache_unlock();
+ return;
+ }
+
+ /* Allocate new entry (may evict LRU) */
+ idx = parse_cache_alloc_entry(cache);
+ if (idx == TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ parse_cache_unlock();
+ ereport(WARNING,
+ (errmsg("track_table_mutation: "
+ "parse cache alloc failed")));
+ return;
+ }
+
+ entries = PARSE_CACHE_ENTRIES(cache);
+ buckets = PARSE_CACHE_BUCKETS(cache);
+
+ /* Fill in entry */
+ entries[idx].query_hash = hash;
+ entries[idx].is_write = is_write;
+ entries[idx].num_tables =
+ (num_tables > max_tables) ?
+ max_tables : num_tables;
+
+ {
+ int i;
+
+ for (i = 0; i < entries[idx].num_tables; i++)
+ {
+ strlcpy(entries[idx].table_names[i],
+ table_names[i], namelen);
+ }
+ }
+
+ /* Insert into hash bucket */
+ bucket = hash % cache->num_buckets;
+ entries[idx].next = buckets[bucket];
+ buckets[bucket] = idx;
+
+ /* Add to LRU list */
+ parse_cache_lru_add(cache, idx);
+
+ parse_cache_unlock();
+}
+
+/*
+ * Normalize a SQL query and compute its 64-bit hash.
+ * Strips comments, collapses whitespace, lowercases,
+ * and replaces literals with placeholders.
+ */
+uint64
+pool_track_table_mutation_normalize_and_hash(
+ const char *query)
+{
+ char normalized[8192];
+ size_t len;
+
+ if (query == NULL || query[0] == '\0')
+ return 0;
+
+ len = normalize_query(query, normalized,
+ sizeof(normalized));
+ if (len == 0)
+ return 0;
+
+ return fnv1a_hash_64(normalized, len);
+}
--
2.53.0
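The fnv1a_hash_64() helper called by pool_track_table_mutation_normalize_and_hash() is defined elsewhere in the patch and is not shown in this part. Assuming it is the standard 64-bit FNV-1a with the usual offset basis and prime, it would look roughly like the sketch below (sketch_fnv1a_64 is a hypothetical name, not the patch's function):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the patch's fnv1a_hash_64().
 * Standard 64-bit FNV-1a: XOR each byte into the hash,
 * then multiply by the FNV prime (wrapping modulo 2^64).
 */
static uint64_t
sketch_fnv1a_64(const char *data, size_t len)
{
	uint64_t	hash = 14695981039346656037ULL;	/* FNV offset basis */
	size_t		i;

	for (i = 0; i < len; i++)
	{
		hash ^= (unsigned char) data[i];
		hash *= 1099511628211ULL;	/* FNV prime */
	}

	return hash;
}
```

Because the hash is taken over the normalized query text, two queries that differ only in literals or whitespace would map to the same parse cache key.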
* Re: Proposal: Recent mutated table tracking in memory
From: Tatsuo Ishii @ 2026-03-09 05:18 UTC (permalink / raw)
To: [email protected]; +Cc: [email protected]
Hi Nadav,
Sorry for the late response. I have been working on this issue.
> > Added some handling for possible causes - works now.
>
> Unfortunately this doesn't work here. Still 042 test fails if it is
> executed *after* 041. i.e.
>
> ./regress.sh 04[12] <-- 042 fails
> ./regress.sh 042 <-- Ok
I ran the following script to see whether any sockets are left over after the 041 test.
./regress.sh '041';netstat -ap|grep 11000;./regress.sh 042
:
testing 041.external_replication_delay...ok.
out of 1 ok:1 failed:0 timeout:0
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 0.0.0.0:11000 0.0.0.0:* LISTEN 1401942/bash
tcp 0 0 localhost:36366 localhost:11000 TIME_WAIT -
tcp 0 0 localhost:36380 localhost:11000 TIME_WAIT -
tcp 0 0 localhost:36384 localhost:11000 TIME_WAIT -
tcp 0 0 localhost:36390 localhost:11000 TIME_WAIT -
tcp 0 0 localhost:43580 localhost:11000 TIME_WAIT -
tcp 0 0 localhost:43596 localhost:11000 TIME_WAIT -
tcp6 0 0 [::]:11000 [::]:* LISTEN 1401942/bash
unix 2 [ ACC ] STREAM LISTENING 10164557 1401942/bash /tmp/.s.PGSQL.11000
creating pgpool-II temporary installation ...
moving pgpool_setup to temporary installation path ...
moving watchdog_setup to temporary installation path ...
using pgpool-II at /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/temp/installed
*************************
REGRESSION MODE : install
Pgpool-II version : pgpool-II version 4.8devel (mitsukakeboshi)
Pgpool-II install path : /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/temp/installed
PostgreSQL bin : /usr/local/pgsql/bin
PostgreSQL Major version : 18
pgbench : /usr/local/pgsql/bin/pgbench
PostgreSQL jdbc : /usr/local/pgsql/share/postgresql-9.2-1003.jdbc4.jar
*************************
testing 042.track_table_mutation...failed.
out of 1 ok:0 failed:1 timeout:0
It seems the cause of the issue is the bash process:
unix 2 [ ACC ] STREAM LISTENING 10164557 1401942/bash /tmp/.s.PGSQL.11000
It keeps listening on the socket even after the test has finished, which
prevents pgpool in the 042 test from binding the socket and causes the
test failure. Possible solutions are:
1) fix the external replication delay check to close the listening
socket before starting bash.
2) close the listening socket when the streaming replication check
worker process forks.
While investigating the issue, I found that similar problems exist in other
places. For example, the pcp process inherits pgpool's listening sockets,
which pcp does not need. I posted a proposal to fix the issue:
https://www.postgresql.org/message-id/20260302.100028.1346768433787074248.ishii%40postgresql.org
This includes fix #2. I plan to commit the patch today. Once our
buildfarm reports no new problems (it will take 2-3 days), I am going
to test your patch again.
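As a rough illustration of option 2 (a sketch only: the function name and the fd array are hypothetical, and this is not the code from the posted patch), a freshly forked worker could drop the listening sockets it inherited from the parent:

```c
#include <unistd.h>

/*
 * Hypothetical helper: close every listening socket the child
 * inherited but does not need. Each slot is set to -1 so that a
 * later call does not close an unrelated, reused descriptor.
 * Returns the number of descriptors actually closed.
 */
static int
sketch_close_inherited_listeners(int *fds, int nfds)
{
	int		i;
	int		closed = 0;

	for (i = 0; i < nfds; i++)
	{
		if (fds[i] >= 0 && close(fds[i]) == 0)
			closed++;
		fds[i] = -1;
	}

	return closed;
}
```

The child would call this right after fork() returns 0 and before it runs any external command, so that a long-lived bash process can no longer keep the port bound. Marking the sockets close-on-exec (FD_CLOEXEC) would be an alternative for the exec case.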
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
* Re: Proposal: Recent mutated table tracking in memory
From: Nadav Shatz @ 2026-03-09 09:22 UTC (permalink / raw)
To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]
Thank you for looking into this, fixing it and getting back to me.
Looking forward to your update.
--
Nadav Shatz
Tailor Brands | CTO
* Re: Proposal: Recent mutated table tracking in memory
From: Tatsuo Ishii @ 2026-03-23 05:13 UTC (permalink / raw)
To: [email protected]; +Cc: [email protected]
> Thank you for looking into this, fixing it and getting back to me.
>
> Looking forward to your update.
It seems my commit fixed the issue.
https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=18f7f632de982d8fb5d0da2f2fdc48e26ac467e7
So I will continue the review.
+ <para>
+ This feature requires <xref linkend="guc-replication-delay-source-cmd"> to be configured
+ for monitoring replication delay from replicas.
+ </para>
Why does this feature require replication_delay_source_cmd to be set? Why
can't we also enable the feature when delay_threshold_by_time > 0?
Both replication_delay_source_cmd and delay_threshold_by_time should
provide the standby delay in time, which is enough information to
run the feature.
1. documentation
- I get a compile error.
openjade -wall -wno-unused-param -wno-empty -wfully-tagged -c /usr/share/sgml/docbook/stylesheet/dsssl/modular/catalog -d stylesheet.dsl -t sgml -i output-html -V html-index pgpool.sgml
openjade:loadbalance.sgml:1122:21:X: reference to non-existent ID "RUNTIME-CONFIG-TRACK-TABLE-MUTATION"
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
* Re: Proposal: Recent mutated table tracking in memory
From: Nadav Shatz @ 2026-03-23 13:07 UTC (permalink / raw)
To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]
Hi Tatsuo,
Thank you for the thorough review and the fix to the tests! Here's the
updated patch addressing all your comments.
re - replication_delay_source_cmd requirement
Good catch: the feature now also works when `delay_threshold_by_time > 0`.
I've added the TTL update call to `check_replication_time_lag()` (the
pg_stat_replication path), in addition to
`check_replication_time_lag_with_cmd()`. The docs are updated to reflect
that either `replication_delay_source_cmd` or `delay_threshold_by_time` can
provide the time-based delay.
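For reference, whichever path measures the delay, the TTL is derived the same way as in pool_track_table_mutation_update_ttl(): delay times factor, clamped to a floor and to one hour. A standalone sketch of that calculation follows; the 1-second floor is an assumption for illustration (the real value of TRACK_TABLE_MUTATION_DEFAULT_TTL_US is not shown here), and the function name is hypothetical:

```c
#include <stdint.h>

/* Assumed floor for illustration only; not taken from the patch. */
#define SKETCH_DEFAULT_TTL_US	1000000ULL
/* One hour upper bound, as in the patch. */
#define SKETCH_MAX_TTL_US		(3600ULL * 1000000ULL)

/*
 * TTL = replication delay * factor, clamped to [floor, 1 hour].
 * The clamp is done in double before the cast so that a very large
 * delay cannot overflow the conversion back to uint64_t.
 */
static uint64_t
sketch_compute_ttl_us(uint64_t delay_us, double factor)
{
	double		scaled = (double) delay_us * factor;

	if (!(scaled > (double) SKETCH_DEFAULT_TTL_US))
		return SKETCH_DEFAULT_TTL_US;	/* also covers NaN and negatives */
	if (scaled >= (double) SKETCH_MAX_TTL_US)
		return SKETCH_MAX_TTL_US;

	return (uint64_t) scaled;
}
```

With the default factor of 5.0, a measured delay of 400 ms would keep a written table pinned to the primary for 2 seconds.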
re - Documentation compile error
Fixed: the xref was pointing to `runtime-config-track-table-mutation`, but
the actual section ID is `runtime-config-table-mutation-map`.
Thanks again and looking forward to hearing back from you.
--
Nadav Shatz
Tailor Brands | CTO
Attachments:
[application/octet-stream] table_track.patch (103.6K, 3-table_track.patch)
download | inline diff:
From 0a42bca011460e83156e5181ca7e2c4895b689c6 Mon Sep 17 00:00:00 2001
From: Nadav Shatz <[email protected]>
Date: Tue, 6 Jan 2026 12:41:50 +0200
Subject: [PATCH] feat(load_balance): add in-memory table mutation tracking
Introduces 'dml_adaptive_global' as a new value for disable_load_balance_on_write.
This mode is a superset of dml_adaptive: it performs per-transaction local tracking
AND cross-session shared-memory tracking of recently written tables, routing reads
to primary until a TTL (based on measured replication delay) expires.
Sub-parameters (track_table_mutation_*) control TTL factor, cold start duration,
hash table sizing, and query parse cache sizing.
---
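The routing decision described above boils down to a per-table age check. A minimal standalone sketch, assuming microsecond timestamps and an optional hard cap like the max-staleness check in the earlier revision (the function and parameter names here are hypothetical, not the patch's API):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when a read of the table should go to the primary.
 * - The table is "stale" while its last write is younger than the TTL.
 * - If max_stale_us > 0, the table stops forcing primary routing once
 *   that much time has passed since its first tracked write.
 */
static bool
sketch_table_is_stale(int64_t first_write_us, int64_t last_write_us,
					  int64_t now_us, int64_t ttl_us,
					  int64_t max_stale_us)
{
	if (now_us - last_write_us >= ttl_us)
		return false;			/* TTL expired: replica is allowed */

	if (max_stale_us > 0 && now_us - first_write_us >= max_stale_us)
		return false;			/* hard cap reached */

	return true;
}
```

The cap matters for tables that are written continuously: without it, every new write would push the TTL window forward and reads of that table would never return to the replicas.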
doc/src/sgml/loadbalance.sgml | 334 ++++
src/Makefile.am | 1 +
src/config/pool_config_variables.c | 89 +
src/context/pool_query_context.c | 227 ++-
src/context/pool_session_context.c | 15 +-
src/include/pool.h | 4 +-
src/include/pool_config.h | 24 +-
src/include/utils/pool_track_table_mutation.h | 245 +++
src/main/pgpool_main.c | 29 +-
src/protocol/CommandComplete.c | 29 +
src/protocol/child.c | 8 +
src/protocol/pool_proto_modules.c | 6 +-
src/query_cache/pool_memqcache.c | 6 +
src/sample/pgpool.conf.sample-stream | 56 +
src/streaming_replication/pool_worker_child.c | 24 +
src/test/regression/libs.sh | 2 +
.../tests/042.track_table_mutation/test.sh | 354 ++++
.../043.track_table_mutation_watchdog/test.sh | 184 +++
src/utils/pool_track_table_mutation.c | 1453 +++++++++++++++++
19 files changed, 3075 insertions(+), 15 deletions(-)
create mode 100644 src/include/utils/pool_track_table_mutation.h
create mode 100755 src/test/regression/tests/042.track_table_mutation/test.sh
create mode 100755 src/test/regression/tests/043.track_table_mutation_watchdog/test.sh
create mode 100644 src/utils/pool_track_table_mutation.c
diff --git a/doc/src/sgml/loadbalance.sgml b/doc/src/sgml/loadbalance.sgml
index 9e1e7b39b..7384ce81a 100644
--- a/doc/src/sgml/loadbalance.sgml
+++ b/doc/src/sgml/loadbalance.sgml
@@ -1110,6 +1110,18 @@ app_name_redirect_preference_list > database_redirect_preference_list > us
Dependent functions, triggers, and views on the tables can be configured
using <xref linkend="guc-dml-adaptive-object-relationship-list">
</para>
+
+ <para>
+ If this parameter is set to <varname>dml_adaptive_global</varname>,
+ <productname>Pgpool-II</> behaves like <varname>dml_adaptive</varname>
+ (per-transaction write tracking) and additionally uses shared memory to track
+ recently written tables across all sessions cluster-wide. When a table is
+ written in any session, subsequent reads of that table from any session are
+ routed to primary until a TTL (based on measured replication delay) expires.
+ This prevents stale reads after writes even across different connections.
+ See <xref linkend="runtime-config-table-mutation-map"> for the sub-parameters
+ that control the shared-memory tracking behavior.
+ </para>
</listitem>
</varlistentry>
@@ -1195,4 +1207,326 @@ dml_adaptive_object_relationship_list = 'table_1:table_2'
</variablelist>
</sect2>
+
+ <sect2 id="runtime-config-table-mutation-map">
+ <title>Table Mutation Map Configuration (Lagless Replica Reads)</title>
+
+ <para>
+ These parameters configure the track table mutation feature, which is activated by setting
+ <xref linkend="guc-disable-load-balance-on-write"> to <literal>dml_adaptive_global</literal>.
+ The feature tracks recently written tables to prevent stale reads from replica nodes during
+ replication lag, implementing the "lagless" architecture pattern for distributed systems
+ with read replicas.
+ </para>
+
+ <para>
+ When a table is modified (INSERT/UPDATE/DELETE, as well as TRUNCATE and MERGE), it is marked as "stale" for a TTL period
+ (<literal>replication_delay * track_table_mutation_ttl_factor</literal>). Any SELECT query on a stale table is routed
+ to the primary node instead of a replica, ensuring read-after-write consistency.
+ </para>
+
+ <para>
+ This feature requires time-based replication delay monitoring, which can be provided either by configuring
+ <xref linkend="guc-replication-delay-source-cmd"> (external command mode) or by setting
+ <xref linkend="guc-delay-threshold-by-time"> (which uses <literal>pg_stat_replication.replay_lag</literal>
+ from PostgreSQL 10+). At least one of these must be configured for the TTL calculation to work.
+ </para>
+
+ <warning>
+ <para>
+ Enabling <literal>dml_adaptive_global</literal> increases shared memory consumption. With default settings,
+ the feature requires approximately 6.5 MB of shared memory (0.1 MB for table tracking + 6.4 MB for the query parse cache).
+ Memory usage scales with configuration parameters:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ Table tracking: <literal>track_table_mutation_table_size * 40 bytes</literal> (default: 2048 * 40 = ~80 KB)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Query cache: <literal>track_table_mutation_query_parse_cache_size * 640 bytes</literal> (default: 10000 * 640 = ~6.4 MB)
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ For high-traffic systems with large cache sizes (e.g., <literal>track_table_mutation_query_parse_cache_size = 100000</literal>),
+ memory usage can reach 64 MB or more. Consider your system's available shared memory when using <literal>dml_adaptive_global</literal>.
+ </para>
+ </warning>
+
+ <variablelist>
+
+ <varlistentry id="guc-track-table-mutation-ttl-factor" xreflabel="track_table_mutation_ttl_factor">
+ <term><varname>track_table_mutation_ttl_factor</varname> (<type>floating point</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_ttl_factor</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Multiplier for calculating the TTL: <literal>TTL = replication_delay * track_table_mutation_ttl_factor</literal>.
+ Higher values provide more safety margin but may reduce read replica utilization.
+ </para>
+ <para>
+ Valid range: 1.0-100.0. Default is <literal>5.0</literal>.
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-max-staleness" xreflabel="track_table_mutation_max_staleness">
+ <term><varname>track_table_mutation_max_staleness</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_max_staleness</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Maximum duration in milliseconds that a single table entry can continuously force queries to primary,
+ measured from when the table was first marked stale. When this cap is reached, the entry is expired
+ regardless of recent writes. If the table is written to again after expiry, a fresh tracking entry
+ is created.
+ </para>
+ <para>
+ This parameter bounds the cross-session impact of table mutation tracking. Even if a table is written
+ to in a tight loop, its effect on other sessions' load balancing is limited to this duration. For
+ legitimately busy tables, the gap between forced expiry and the next write re-marking the table is
+ negligible (typically milliseconds).
+ </para>
+ <para>
+ Set to 0 to disable the cap (not recommended for production).
+ Valid range: 0-3600000 ms. Default is <literal>60000</literal> (60 seconds).
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-cold-start-duration" xreflabel="track_table_mutation_cold_start_duration">
+ <term><varname>track_table_mutation_cold_start_duration</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_cold_start_duration</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Duration in milliseconds to route all queries to primary after a child process starts.
+ This prevents stale reads when a new connection is established before the table mutation
+ map is populated with recent write history.
+ </para>
+ <para>
+ When watchdog is enabled and the local node becomes the leader, <productname>Pgpool-II</> also triggers a
+ global cold start for this duration to avoid stale reads after leadership changes.
+ </para>
+ <para>
+ Valid range: 0-60000 ms. Default is <literal>2000</literal> (2 seconds).
+ Set to 0 to disable cold start behavior.
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-table-buckets" xreflabel="track_table_mutation_table_buckets">
+ <term><varname>track_table_mutation_table_buckets</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_table_buckets</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Number of hash buckets for the track table mutation hash table.
+ Higher values reduce hash collisions and improve lookup performance.
+ </para>
+ <para>
+ Valid range: 64-65536. Default is <literal>1024</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-table-size" xreflabel="track_table_mutation_table_size">
+ <term><varname>track_table_mutation_table_size</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_table_size</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Maximum number of tables that can be tracked simultaneously in the table mutation map.
+ When the map is full, the oldest entries are evicted to make room for new ones.
+ </para>
+ <para>
+ Valid range: 128-131072. Default is <literal>2048</literal>.
+ Memory usage: approximately 40 bytes per entry.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-query-buckets" xreflabel="track_table_mutation_query_buckets">
+ <term><varname>track_table_mutation_query_buckets</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_query_buckets</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Number of hash buckets for the query parse cache. The cache stores normalized
+ query strings mapped to their table dependencies to avoid repeated parsing.
+ </para>
+ <para>
+ Valid range: 64-65536. Default is <literal>2048</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-query-parse-cache-size" xreflabel="track_table_mutation_query_parse_cache_size">
+ <term><varname>track_table_mutation_query_parse_cache_size</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_query_parse_cache_size</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Maximum number of query parse results to cache. Uses LRU eviction when full.
+ Larger caches reduce parsing overhead but consume more shared memory.
+ </para>
+ <para>
+ Valid range: 100-1000000. Default is <literal>10000</literal>.
+ Memory usage: approximately 640 bytes per entry (~6.4 MB for default, ~64 MB for 100000 entries).
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ <sect3 id="runtime-config-track-table-mutation-example">
+ <title>Track Table Mutation Configuration Example</title>
+ <para>
+ To enable track table mutation with replication delay monitoring:
+ </para>
+ <programlisting>
+# Enable dml_adaptive_global mode (includes track table mutation)
+disable_load_balance_on_write = 'dml_adaptive_global'
+track_table_mutation_ttl_factor = 5.0
+track_table_mutation_max_staleness = 60000
+track_table_mutation_cold_start_duration = 2000
+
+# Option A: Use external command for replication delay
+replication_delay_source_cmd = '/path/to/get-replication-delay.sh'
+replication_delay_source_timeout = 10
+
+# Option B: Use pg_stat_replication replay_lag (PG 10+)
+# delay_threshold_by_time = 1000
+
+# Adjust cache sizes based on workload (increases memory usage)
+track_table_mutation_table_size = 4096
+track_table_mutation_query_parse_cache_size = 50000
+ </programlisting>
+ <para>
+ Total shared memory required for the above configuration: approximately 32.2 MB (32 MB query parse cache + 0.2 MB table map), plus overhead.
+ The default configuration (10000 query cache entries, 2048 tables) requires approximately 6.5 MB.
+ </para>
+ </sect3>
+
+ <sect3 id="runtime-config-track-table-mutation-limitations">
+ <title>Limitations</title>
+ <para>
+ The track table mutation feature has the following limitation:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>PREPARE</literal> statements are not tracked. When a prepared statement
+ containing data modification is executed, the table mutation is not recorded.
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ If your application uses prepared statements and requires read-after-write consistency,
+ consider using explicit transaction routing or the <literal>/*NO LOAD BALANCE*/</literal>
+ comment directive for affected queries.
+ </para>
+ <para>
+ The following statement types <emphasis>are</emphasis> tracked and will mark tables as stale:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>INSERT</literal>, <literal>UPDATE</literal>, <literal>DELETE</literal>
+ statements (including those with <literal>RETURNING</literal> clauses).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>TRUNCATE</literal> statements (including multiple tables).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>MERGE</literal> statements (PostgreSQL 15+).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>WITH</literal> clauses containing data modifications (Common Table Expressions
+ with <literal>INSERT</literal>, <literal>UPDATE</literal>, or <literal>DELETE</literal>).
+ For example, <literal>WITH deleted AS (DELETE FROM t1 RETURNING *) SELECT * FROM deleted</literal>
+ will properly mark table <literal>t1</literal> as stale.
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ <emphasis>Transaction Rollback Behavior:</emphasis> Within explicit transactions, tables
+ are only marked as stale in shared memory when the transaction is committed. If the
+ transaction is rolled back, no tables are marked, because nothing was committed on the
+ primary and therefore nothing is replicated. This prevents rolled-back transactions from
+ unnecessarily disabling load balancing. For autocommit statements (outside explicit
+ transactions), tables are marked immediately upon command completion.
+ </para>
+
+ <para>
+ <emphasis>Cross-Session Impact and Safety Bounds:</emphasis>
+ Unlike <literal>dml_adaptive</literal> (which only affects the session that issued the write),
+ <literal>dml_adaptive_global</literal> affects all sessions reading the same table in the same database.
+ The following safety mechanisms bound this cross-session impact:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <emphasis>Maximum staleness cap:</emphasis> The <xref linkend="guc-track-table-mutation-max-staleness">
+ parameter (default: 60 seconds) limits how long any single table entry can continuously force primary
+ routing. Even under sustained writes, the entry expires after this period and is only renewed by
+ subsequent committed writes.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <emphasis>Database isolation:</emphasis> Table staleness tracking is scoped by database OID. Writes
+ in one database never affect load balancing decisions for sessions connected to a different database.
+ In multi-tenant deployments where tenants use separate databases, one tenant's write activity cannot
+ influence another tenant's query routing.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <emphasis>Committed writes only:</emphasis> Only committed transactions mark tables as stale.
+ Rolled-back transactions have no effect on the shared tracking state.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <emphasis>Bounded table map size:</emphasis> The shared memory table map has a fixed maximum size
+ (<xref linkend="guc-track-table-mutation-table-size">). At most this many tables can be marked stale
+ simultaneously, providing a natural ceiling on the feature's impact.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </sect3>
+
+ </sect2>
+
</sect1>
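The TTL and staleness rules described in the documentation above can be sketched roughly as follows. This is an illustrative sketch, not the code from the patch: compute_ttl_us and table_is_stale are hypothetical helper names, timestamps are plain microsecond counters rather than struct timeval, and treating a delay of 0 as "unknown" is an assumption. The 100 ms fallback mirrors TRACK_TABLE_MUTATION_DEFAULT_TTL_US from the patch header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fallback TTL (100 ms) used when no replication delay is known;
 * mirrors TRACK_TABLE_MUTATION_DEFAULT_TTL_US in the patch header. */
#define DEMO_DEFAULT_TTL_US (100 * 1000)

/* Hypothetical helper: TTL = replication_delay * ttl_factor.
 * Assumption: a delay of 0 means "not measured yet". */
static uint64_t
compute_ttl_us(uint64_t replication_delay_us, double ttl_factor)
{
    if (replication_delay_us == 0)
        return DEMO_DEFAULT_TTL_US;
    return (uint64_t) ((double) replication_delay_us * ttl_factor);
}

/* Hypothetical helper: is a table still stale at now_us?
 * The caller must pass now_us >= first_write_us and now_us >= last_write_us.
 * max_staleness_ms == 0 disables the cap, as in the documentation. */
static bool
table_is_stale(uint64_t now_us, uint64_t first_write_us,
               uint64_t last_write_us, uint64_t ttl_us,
               uint64_t max_staleness_ms)
{
    /* The cap is measured from the first write and wins over recent writes. */
    if (max_staleness_ms > 0 &&
        now_us - first_write_us >= max_staleness_ms * 1000)
        return false;

    /* Otherwise the entry is stale while the TTL since the last write runs. */
    return now_us - last_write_us < ttl_us;
}
```

With a measured delay of 200 ms and the default factor of 5.0, the sketch yields a TTL of one second. A table written continuously for longer than track_table_mutation_max_staleness stops forcing primary routing until the next write creates a fresh entry.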
diff --git a/src/Makefile.am b/src/Makefile.am
index 4678ab530..39588af58 100644
--- a/src/Makefile.am
+++ b/src/Makefile.am
@@ -35,6 +35,7 @@ pgpool_SOURCES = main/main.c \
rewrite/pool_timestamp.c \
rewrite/pool_lobj.c \
utils/pool_select_walker.c \
+ utils/pool_track_table_mutation.c \
utils/strlcpy.c \
utils/psprintf.c \
utils/pool_params.c \
diff --git a/src/config/pool_config_variables.c b/src/config/pool_config_variables.c
index ce13c42f6..a6b909d42 100644
--- a/src/config/pool_config_variables.c
+++ b/src/config/pool_config_variables.c
@@ -290,6 +290,7 @@ static const struct config_enum_entry disable_load_balance_on_write_options[] =
{"trans_transaction", DLBOW_TRANS_TRANSACTION, false},
{"always", DLBOW_ALWAYS, false},
{"dml_adaptive", DLBOW_DML_ADAPTIVE, false},
+ {"dml_adaptive_global", DLBOW_DML_ADAPTIVE_GLOBAL, false},
{NULL, 0, false}
};
@@ -1777,6 +1778,19 @@ static struct config_int_array ConfigureNamesIntArray[] =
static struct config_double ConfigureNamesDouble[] =
{
+ {
+ {"track_table_mutation_ttl_factor",
+ CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "TTL multiplier for track table mutation "
+ "(TTL = replication_delay * factor)",
+ CONFIG_VAR_TYPE_DOUBLE, false, 0
+ },
+ &g_pool_config.track_table_mutation_ttl_factor,
+ 5.0, /* boot value: 5x replication delay */
+ 1.0, 100.0, /* min, max */
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
EMPTY_CONFIG_DOUBLE
};
@@ -2397,6 +2411,81 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"track_table_mutation_max_staleness",
+ CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "Maximum duration in milliseconds that a "
+ "table can be marked stale from its first "
+ "write. 0 disables the cap.",
+ CONFIG_VAR_TYPE_INT, false, GUC_UNIT_MS
+ },
+ &g_pool_config.track_table_mutation_max_staleness,
+ 60000, /* 60 seconds */
+ 0, 3600000, /* 0 to 1 hour */
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_cold_start_duration",
+ CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "Duration in milliseconds to force queries "
+ "to primary after child process starts.",
+ CONFIG_VAR_TYPE_INT, false, GUC_UNIT_MS
+ },
+ &g_pool_config.track_table_mutation_cold_start_duration,
+ 2000, /* 2 seconds */
+ 0, 60000, /* 0 to 60 seconds */
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_table_buckets",
+ CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Number of hash buckets for track table mutation.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_table_buckets,
+ 1024,
+ 64, 65536,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_table_size",
+ CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Maximum number of entries in track table mutation.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_table_size,
+ 2048,
+ 128, 131072,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_query_buckets",
+ CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Number of hash buckets for query parse cache.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_query_buckets,
+ 2048,
+ 64, 65536,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_query_parse_cache_size",
+ CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Maximum number of entries in query parse cache.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_query_parse_cache_size,
+ 10000,
+ 100, 1000000,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
EMPTY_CONFIG_INT
};
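The two sizing parameters registered above (track_table_mutation_table_size and track_table_mutation_query_parse_cache_size) drive most of the shared memory cost. A rough estimate, using the approximate per-entry figures quoted in the documentation part of this patch (40 and 640 bytes), might look like the sketch below. estimate_shmem_bytes is a hypothetical helper, not part of the patch, and it ignores bucket arrays and headers.

```c
#include <stddef.h>

/* Approximate per-entry sizes as quoted in the patch documentation.
 * The real sizes depend on struct layout and platform. */
#define DOC_TABLE_ENTRY_BYTES 40
#define DOC_QUERY_ENTRY_BYTES 640

/* Hypothetical helper: estimated shared memory in bytes for the entry
 * arrays only (bucket arrays and headers are not counted). */
static size_t
estimate_shmem_bytes(size_t table_size, size_t query_parse_cache_size)
{
    return table_size * DOC_TABLE_ENTRY_BYTES +
        query_parse_cache_size * DOC_QUERY_ENTRY_BYTES;
}
```

For the defaults (2048 tables, 10000 cached queries) this gives 6,481,920 bytes, and for the example configuration in the documentation (4096 and 50000) it gives 32,163,840 bytes.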
diff --git a/src/context/pool_query_context.c b/src/context/pool_query_context.c
index 7cf9813eb..683b0ec66 100644
--- a/src/context/pool_query_context.c
+++ b/src/context/pool_query_context.c
@@ -29,6 +29,7 @@
#include "utils/statistics.h"
#include "utils/pool_select_walker.h"
#include "utils/pool_stream.h"
+#include "utils/pool_track_table_mutation.h"
#include "context/pool_session_context.h"
#include "context/pool_query_context.h"
#include "parser/nodes.h"
@@ -1828,15 +1829,23 @@ is_in_list(char *name, List *list)
static bool
is_select_object_in_temp_write_list(Node *node, void *context)
{
- if (node == NULL || pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE)
+ if (node == NULL ||
+ !DLBOW_IS_DML_ADAPTIVE(
+ pool_config->disable_load_balance_on_write))
return false;
if (IsA(node, RangeVar))
{
RangeVar *rgv = (RangeVar *) node;
- POOL_SESSION_CONTEXT *session_context = pool_get_session_context(false);
+ POOL_SESSION_CONTEXT *session_context;
+ bool is_adaptive;
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && session_context->is_in_transaction)
+ session_context = pool_get_session_context(false);
+ is_adaptive = DLBOW_IS_DML_ADAPTIVE(
+ pool_config->disable_load_balance_on_write);
+
+ if (is_adaptive &&
+ session_context->is_in_transaction)
{
ereport(DEBUG1,
(errmsg("is_select_object_in_temp_write_list: \"%s\", found relation \"%s\"", (char *) context, rgv->relname)));
@@ -1880,7 +1889,13 @@ static char *get_associated_object_from_dml_adaptive_relations
void
check_object_relationship_list(char *name, bool is_func_name)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && pool_config->parsed_dml_adaptive_object_relationship_list)
+ bool is_adaptive;
+
+ is_adaptive = DLBOW_IS_DML_ADAPTIVE(
+ pool_config->disable_load_balance_on_write);
+
+ if (is_adaptive &&
+ pool_config->parsed_dml_adaptive_object_relationship_list)
{
POOL_SESSION_CONTEXT *session_context = pool_get_session_context(false);
@@ -1944,7 +1959,7 @@ add_object_into_temp_write_list(Node *node, void *context)
static void
dml_adaptive(Node *node, char *query)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE)
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write))
{
/* Set/Unset transaction status flags */
if (IsA(node, TransactionStmt))
@@ -1963,6 +1978,46 @@ dml_adaptive(Node *node, char *query)
}
else if (is_commit_or_rollback_query(node))
{
+ /*
+ * For dml_adaptive_global: on COMMIT, flush
+ * the accumulated table writes to shared
+ * memory. On ROLLBACK, skip -- the writes
+ * never committed so no stale-read risk
+ * exists. This prevents polluting the table
+ * map with rolled-back transactions.
+ */
+ int dlbow =
+ pool_config->disable_load_balance_on_write;
+ List *wlist =
+ session_context->transaction_temp_write_list;
+
+ if (dlbow == DLBOW_DML_ADAPTIVE_GLOBAL &&
+ is_commit_query(node) &&
+ wlist != NIL)
+ {
+ ListCell *cell;
+ int dboid;
+
+ dboid =
+ pool_track_table_mutation_get_database_oid();
+ if (dboid > 0)
+ {
+ foreach(cell, wlist)
+ {
+ char *tname;
+ int toid;
+
+ tname = (char *) lfirst(cell);
+ toid =
+ pool_table_name_to_oid(tname);
+
+ if (toid > 0)
+ pool_track_table_mutation_mark_table_written(
+ toid, dboid);
+ }
+ }
+ }
+
session_context->is_in_transaction = false;
if (session_context->transaction_temp_write_list != NIL)
@@ -2010,6 +2065,20 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
if (dest == POOL_PRIMARY)
{
pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
+
+ /*
+ * Resolve table and database OIDs now to populate relcache.
+ * This avoids potential hangs in CommandComplete where we shouldn't
+ * be running new queries against the backend.
+ */
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ int *oids;
+
+ pool_extract_table_oids(node, &oids);
+ pool_track_table_mutation_get_database_oid();
+ }
}
/* Should be sent to both primary and standby? */
else if (dest == POOL_BOTH)
@@ -2139,6 +2208,154 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
{
pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
}
+ /*
+ * Check track table mutation for recently
+ * written tables. If in cold start or any
+ * table was recently written, route to
+ * primary to avoid stale reads.
+ */
+ else if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ bool force_primary = false;
+ int lb_node;
+ POOL_QUERY_CONTEXT *qctx =
+ session_context->query_context;
+
+ if (pool_track_table_mutation_in_cold_start())
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load balance"
+ " because of track table"
+ " mutation cold start"),
+ errdetail("destination = PRIMARY"
+ " for query= \"%s\"",
+ query)));
+ force_primary = true;
+ }
+ else
+ {
+ SelectContext ctx;
+ int dboid;
+ int num_oids;
+ int i;
+
+ memset(&ctx, 0, sizeof(ctx));
+ num_oids =
+ pool_extract_table_oids_from_select_stmt(
+ node, &ctx);
+ if (num_oids > 0)
+ {
+ dboid =
+ pool_track_table_mutation_get_database_oid();
+
+ if (dboid <= 0)
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load"
+ " balance because"
+ " database oid was"
+ " unavailable"),
+ errdetail("destination"
+ " = PRIMARY for"
+ " query= \"%s\"",
+ query)));
+ force_primary = true;
+ }
+ else
+ {
+ for (i = 0; i < num_oids; i++)
+ {
+ bool stale;
+
+ stale =
+ pool_track_table_mutation_table_is_stale(
+ ctx.table_oids[i],
+ dboid);
+ if (stale)
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load"
+ " balance because"
+ " table \"%s\" was"
+ " recently written",
+ ctx.table_names[i]),
+ errdetail("destination"
+ " = PRIMARY for"
+ " query= \"%s\"",
+ query)));
+ force_primary = true;
+ break;
+ }
+ }
+ }
+ }
+ }
+
+ if (force_primary)
+ {
+ pool_set_node_to_be_sent(
+ query_context,
+ PRIMARY_NODE_ID);
+ }
+ else
+ {
+ if (pool_config->statement_level_load_balance)
+ {
+ session_context->load_balance_node_id =
+ select_load_balancing_node();
+ }
+
+ /*
+ * If replication delay is too much,
+ * and prefer_lower_delay_standby is
+ * true then elect the lowest-delayed
+ * node, otherwise send to primary.
+ */
+ lb_node =
+ session_context->load_balance_node_id;
+ if (STREAM &&
+ check_replication_delay(lb_node))
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load"
+ " balance because of"
+ " too much replication"
+ " delay"),
+ errdetail("destination"
+ " = %d for"
+ " query= \"%s\"",
+ dest, query)));
+
+ if (pool_config->prefer_lower_delay_standby)
+ {
+ lb_node =
+ select_load_balancing_node();
+ session_context->load_balance_node_id =
+ lb_node;
+ qctx->load_balance_node_id =
+ lb_node;
+ pool_set_node_to_be_sent(
+ query_context,
+ lb_node);
+ }
+ else
+ {
+ pool_set_node_to_be_sent(
+ query_context,
+ PRIMARY_NODE_ID);
+ }
+ }
+ else
+ {
+ qctx->load_balance_node_id =
+ session_context->load_balance_node_id;
+ pool_set_node_to_be_sent(
+ query_context,
+ qctx->load_balance_node_id);
+ }
+ }
+ }
else
{
if (pool_config->statement_level_load_balance)
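The routing branch added to where_to_send_main_replica() above reduces to a short decision sequence. The sketch below restates it as a standalone function for readability; it is a simplification under stated assumptions, not the patch code. decide_route, RouteDecision, stale_check_fn and the demo_* names are hypothetical, and the callback stands in for pool_track_table_mutation_table_is_stale().

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum
{
    ROUTE_PRIMARY,          /* force the query to the primary */
    ROUTE_LOAD_BALANCE      /* eligible for normal load balancing */
} RouteDecision;

/* Hypothetical stand-in for pool_track_table_mutation_table_is_stale(). */
typedef bool (*stale_check_fn) (int table_oid, int dboid);

static RouteDecision
decide_route(bool in_cold_start, const int *table_oids, int num_oids,
             int dboid, stale_check_fn is_stale)
{
    int i;

    if (in_cold_start)
        return ROUTE_PRIMARY;       /* write history not populated yet */
    if (num_oids <= 0)
        return ROUTE_LOAD_BALANCE;  /* no tables to check */
    if (dboid <= 0)
        return ROUTE_PRIMARY;       /* database OID unavailable: be safe */

    for (i = 0; i < num_oids; i++)
    {
        if (is_stale(table_oids[i], dboid))
            return ROUTE_PRIMARY;   /* one stale table is enough */
    }
    return ROUTE_LOAD_BALANCE;
}

/* Demo data: OID 16384 is treated as recently written. */
static bool
demo_is_stale(int table_oid, int dboid)
{
    (void) dboid;
    return table_oid == 16384;
}

static const int demo_oids_fresh[] = {16385};
static const int demo_oids_mixed[] = {16385, 16384};
```

Note that ROUTE_LOAD_BALANCE only means the query is eligible: in the patch, the existing replication delay check can still send it to the primary afterwards.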
diff --git a/src/context/pool_session_context.c b/src/context/pool_session_context.c
index ded41c7fc..3ebd68e10 100644
--- a/src/context/pool_session_context.c
+++ b/src/context/pool_session_context.c
@@ -532,7 +532,7 @@ dump_sent_message(char *caller, POOL_SENT_MESSAGE *m)
static void
dml_adaptive_init(void)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE)
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write))
{
session_context->is_in_transaction = false;
session_context->transaction_temp_write_list = NIL;
@@ -542,7 +542,9 @@ dml_adaptive_init(void)
static void
dml_adaptive_destroy(void)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && session_context)
+ if (DLBOW_IS_DML_ADAPTIVE(
+ pool_config->disable_load_balance_on_write) &&
+ session_context)
{
if (session_context->transaction_temp_write_list != NIL)
list_free_deep(session_context->transaction_temp_write_list);
@@ -738,10 +740,13 @@ void
pool_set_writing_transaction(void)
{
/*
- * If disable_transaction_on_write is 'off' or 'dml_adaptive', then never
- * turn on writing transaction flag.
+ * If disable_load_balance_on_write is 'off' or 'dml_adaptive' or
+ * 'dml_adaptive_global', then never turn on writing transaction flag.
*/
- if (pool_config->disable_load_balance_on_write != DLBOW_OFF && pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE)
+ if (pool_config->disable_load_balance_on_write !=
+ DLBOW_OFF &&
+ !DLBOW_IS_DML_ADAPTIVE(
+ pool_config->disable_load_balance_on_write))
{
pool_get_session_context(false)->writing_transaction = true;
ereport(DEBUG5,
diff --git a/src/include/pool.h b/src/include/pool.h
index 65907dcf1..0e901691a 100644
--- a/src/include/pool.h
+++ b/src/include/pool.h
@@ -424,7 +424,7 @@ typedef enum
#define Min(x, y) ((x) < (y) ? (x) : (y))
-#define MAX_NUM_SEMAPHORES 8
+#define MAX_NUM_SEMAPHORES 10
#define CONN_COUNTER_SEM 0
#define REQUEST_INFO_SEM 1
#define QUERY_CACHE_STATS_SEM 2
@@ -434,6 +434,8 @@ typedef enum
#define FOLLOW_PRIMARY_SEM 6
#define MAIN_EXIT_HANDLER_SEM 7 /* used in exit_hander in pgpool main
* process */
+#define TRACK_TABLE_MUTATION_TABLE_SEM 8
+#define TRACK_TABLE_MUTATION_QUERY_SEM 9
#define MAX_REQUEST_QUEUE_SIZE 10
#define MAX_SEC_WAIT_FOR_CLUSTER_TRANSACTION 10 /* time in seconds to keep
diff --git a/src/include/pool_config.h b/src/include/pool_config.h
index 9a397d166..c1e6ecc6f 100644
--- a/src/include/pool_config.h
+++ b/src/include/pool_config.h
@@ -105,9 +105,13 @@ typedef enum DLBOW_OPTION
DLBOW_TRANSACTION,
DLBOW_TRANS_TRANSACTION,
DLBOW_ALWAYS,
- DLBOW_DML_ADAPTIVE
+ DLBOW_DML_ADAPTIVE,
+ DLBOW_DML_ADAPTIVE_GLOBAL
} DLBOW_OPTION;
+#define DLBOW_IS_DML_ADAPTIVE(opt) \
+ ((opt) == DLBOW_DML_ADAPTIVE || (opt) == DLBOW_DML_ADAPTIVE_GLOBAL)
+
typedef enum RELQTARGET_OPTION
{
RELQTARGET_PRIMARY = 1,
@@ -365,6 +369,24 @@ typedef struct
* replication check */
char *replication_delay_source_cmd; /* external command for replication delay */
int replication_delay_source_timeout; /* timeout for external command in seconds */
+
+ /* Track table mutation configuration */
+ double track_table_mutation_ttl_factor; /* TTL multiplier for
+ * replication delay */
+ int track_table_mutation_max_staleness; /* max staleness
+ * duration ms */
+ int track_table_mutation_cold_start_duration; /* cold start
+ * duration ms */
+ int track_table_mutation_table_buckets; /* hash buckets for
+ * table map */
+ int track_table_mutation_table_size; /* max table map
+ * entries */
+ int track_table_mutation_query_buckets; /* hash buckets for
+ * query cache */
+ int track_table_mutation_query_parse_cache_size; /* max query
+ * cache
+ * entries */
+
char *failover_command; /* execute command when failover happens */
char *follow_primary_command; /* execute command when failover is
* ended */
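Because the table map and query cache limits configured here translate into fixed-size entry arrays, it may be worth measuring the per-entry footprint on the target platform rather than relying on approximate figures. The sketch below mirrors the two entry structs from pool_track_table_mutation.h in this patch under local, hypothetical names (DemoTableEntry, DemoQueryEntry). NAMEDATALEN = 64 is an assumption (the usual PostgreSQL value), and the results depend on the compiler's padding and the size of struct timeval.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

#define DEMO_NAMEDATALEN 64     /* assumption: usual PostgreSQL value */
#define DEMO_TABLE_NAME_LEN (DEMO_NAMEDATALEN * 2 + 4)
#define DEMO_MAX_TABLES_PER_QUERY 8

/* Local mirror of TrackTableMutationEntry from the patch header. */
typedef struct DemoTableEntry
{
    int            table_oid;
    int            dboid;
    struct timeval first_write_time;
    struct timeval last_write_time;
    uint32_t       hash;
    int            next;
    bool           in_use;
} DemoTableEntry;

/* Local mirror of QueryParseEntry from the patch header. */
typedef struct DemoQueryEntry
{
    uint64_t query_hash;
    bool     is_write;
    int      num_tables;
    char     table_names[DEMO_MAX_TABLES_PER_QUERY][DEMO_TABLE_NAME_LEN];
    int      next;
    int      lru_prev;
    int      lru_next;
    bool     in_use;
} DemoQueryEntry;

/* Per-entry sizes as the compiler lays them out on this platform. */
static size_t
demo_table_entry_size(void)
{
    return sizeof(DemoTableEntry);
}

static size_t
demo_query_entry_size(void)
{
    return sizeof(DemoQueryEntry);
}
```

Multiplying these sizes by track_table_mutation_table_size and track_table_mutation_query_parse_cache_size gives a platform-specific lower bound on the shared memory the entry arrays need. The table name array alone accounts for 8 * 132 bytes per query cache entry under the assumptions above.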
diff --git a/src/include/utils/pool_track_table_mutation.h b/src/include/utils/pool_track_table_mutation.h
new file mode 100644
index 000000000..b0de2d809
--- /dev/null
+++ b/src/include/utils/pool_track_table_mutation.h
@@ -0,0 +1,245 @@
+/* -*-pgsql-c-*- */
+/*
+ * pgpool: a language independent connection pool server for PostgreSQL
+ * written by Tatsuo Ishii
+ *
+ * Copyright (c) 2003-2026 PgPool Global Development Group
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose and without fee is hereby
+ * granted, provided that the above copyright notice appear in all
+ * copies and that both that copyright notice and this permission
+ * notice appear in supporting documentation, and that the name of the
+ * author not be used in advertising or publicity pertaining to
+ * distribution of the software without specific, written prior
+ * permission. The author makes no representations about the
+ * suitability of this software for any purpose. It is provided "as
+ * is" without express or implied warranty.
+ *
+ * pool_track_table_mutation.h: In-memory tracking of
+ * recently written tables to prevent stale reads.
+ */
+
+#ifndef POOL_TRACK_TABLE_MUTATION_H
+#define POOL_TRACK_TABLE_MUTATION_H
+
+#include "pool.h"
+#include <sys/time.h>
+
+/*
+ * Maximum table name length including schema: "schema"."table"
+ * Using NAMEDATALEN * 2 + 4 for quotes and dot
+ */
+#define TRACK_TABLE_MUTATION_TABLE_NAME_LEN (NAMEDATALEN * 2 + 4)
+
+/*
+ * Maximum number of tables we track per query
+ */
+#define TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY 8
+
+/*
+ * Invalid index marker for linked lists
+ */
+#define TRACK_TABLE_MUTATION_INVALID_INDEX (-1)
+
+/*
+ * Default TTL in microseconds (100ms) used when replication delay is unknown
+ */
+#define TRACK_TABLE_MUTATION_DEFAULT_TTL_US (100 * 1000)
+
+/*
+ * Entry in the table mutation hash table (keyed by table/database oids)
+ */
+typedef struct TrackTableMutationEntry
+{
+ int table_oid; /* Table oid */
+ int dboid; /* Database oid */
+ struct timeval first_write_time; /* When the entry was first created */
+ struct timeval last_write_time; /* When the table was last written */
+ uint32 hash; /* Pre-computed hash value */
+ int next; /* Next in collision chain */
+ bool in_use; /* Is this entry in use? */
+} TrackTableMutationEntry;
+
+/*
+ * Header for the table mutation hash table in shared memory
+ */
+typedef struct TrackTableMutationHashTable
+{
+ int num_buckets; /* Number of hash buckets */
+ int max_entries; /* Maximum entries allowed */
+ int num_entries; /* Current number of entries */
+ int free_list_head; /* Head of free entry list */
+ /* Flexible array members follow in shared memory:
+ * int buckets[num_buckets];
+ * TrackTableMutationEntry entries[max_entries];
+ */
+} TrackTableMutationHashTable;
+
+/*
+ * Entry in the query parse cache
+ */
+typedef struct QueryParseEntry
+{
+ uint64 query_hash; /* Hash of normalized query */
+ bool is_write; /* True if INSERT/UPDATE/DELETE */
+ int num_tables; /* Number of tables in query */
+ char table_names
+ [TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY]
+ [TRACK_TABLE_MUTATION_TABLE_NAME_LEN];
+ int next; /* Next entry in collision chain */
+ int lru_prev; /* Previous in LRU list */
+ int lru_next; /* Next in LRU list */
+ bool in_use; /* Is this entry in use? */
+} QueryParseEntry;
+
+/*
+ * Header for the query parse cache in shared memory
+ */
+typedef struct QueryParseCache
+{
+ int num_buckets; /* Number of hash buckets */
+ int max_entries; /* Maximum entries allowed */
+ int num_entries; /* Current number of entries */
+ int free_list_head; /* Head of free entry list */
+ int lru_head; /* Most recently used */
+ int lru_tail; /* Least recently used */
+ /* Variable-length arrays follow this header in shared memory:
+ * int buckets[num_buckets];
+ * QueryParseEntry entries[max_entries];
+ */
+} QueryParseCache;
+
+/*
+ * Global state for track table mutation feature
+ */
+typedef struct TrackTableMutationState
+{
+ bool initialized; /* Shmem initialized? */
+ uint64 current_ttl_us; /* Current TTL in microseconds */
+ struct timeval ttl_last_updated; /* When TTL was last updated */
+ struct timeval last_cleanup_time; /* When last expired cleanup ran */
+ struct timeval global_cold_start_until; /* Global cold start end time */
+ uint32 stats_queries_checked; /* Queries checked */
+ uint32 stats_forced_primary; /* Forced to primary */
+ uint32 stats_allowed_replica; /* Allowed to replica */
+} TrackTableMutationState;
+
+/*
+ * Main shared memory structure containing all components
+ */
+typedef struct TrackTableMutationShmem
+{
+ TrackTableMutationState state;
+ TrackTableMutationHashTable *table_map;
+ QueryParseCache *query_cache;
+} TrackTableMutationShmem;
+
+/* ----------------
+ * Public API functions
+ * ----------------
+ */
+
+/*
+ * Initialize shared memory structures for track table mutation.
+ * Called from pgpool_main.c after pool_init_pool_info().
+ */
+extern void pool_track_table_mutation_init(void);
+
+/*
+ * Initialize per-child process state for track table mutation.
+ * Called from child.c when a new child process starts.
+ * Sets up cold start tracking.
+ */
+extern void pool_track_table_mutation_child_init(void);
+
+/*
+ * Check if the child process is in cold start period.
+ * During cold start, all queries are routed to primary.
+ * Returns true if in cold start, false otherwise.
+ */
+extern bool pool_track_table_mutation_in_cold_start(void);
+
+/*
+ * Trigger a global cold start period for all processes.
+ * Used after watchdog leader change to avoid stale reads.
+ */
+extern void pool_track_table_mutation_trigger_global_cold_start(void);
+
+/*
+ * Get oid of current database.
+ */
+extern int pool_track_table_mutation_get_database_oid(void);
+
+/*
+ * Check if a table was recently written to (is "stale").
+ * If stale, reads from this table should go to primary.
+ * Returns true if table is stale (recently written), false otherwise.
+ */
+extern bool pool_track_table_mutation_table_is_stale(
+ int table_oid, int dboid);
+
+/*
+ * Mark tables as recently written.
+ * Called after INSERT/UPDATE/DELETE queries complete.
+ * table_oids: array of table oids
+ * num_tables: number of tables in array
+ * dboid: database oid
+ */
+extern void pool_track_table_mutation_mark_tables_written(
+ const int *table_oids, int num_tables, int dboid);
+
+/*
+ * Convenience function to mark a single table as written.
+ * table_oid: table oid
+ * dboid: database oid
+ */
+extern void pool_track_table_mutation_mark_table_written(
+ int table_oid, int dboid);
+
+/*
+ * Update the TTL based on current replication delay.
+ * Called from pool_worker_child.c when replication delay is updated.
+ * delay_us: replication delay in microseconds
+ */
+extern void pool_track_table_mutation_update_ttl(uint64 delay_us);
+
+/*
+ * Look up cached parse result for a query.
+ * hash: hash of normalized query
+ * is_write: output - true if query is a write
+ * table_names: output - array to fill with table names
+ * num_tables: output - number of tables found
+ * Returns true if found in cache, false otherwise.
+ */
+extern bool pool_track_table_mutation_get_cached_parse(
+ uint64 hash, bool *is_write,
+ char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
+ int *num_tables);
+
+/*
+ * Cache a parse result for a query.
+ * hash: hash of normalized query
+ * is_write: true if query is a write
+ * table_names: array of table names
+ * num_tables: number of tables
+ */
+extern void pool_track_table_mutation_cache_parse(
+ uint64 hash, bool is_write,
+ const char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
+ int num_tables);
+
+/*
+ * Normalize a query and compute its hash.
+ * Strips comments, normalizes whitespace and literals.
+ * query: input SQL query string
+ * Returns: 64-bit hash of normalized query
+ */
+extern uint64 pool_track_table_mutation_normalize_and_hash(const char *query);
+
+/*
+ * Calculate required shared memory size for track table mutation.
+ */
+extern Size pool_track_table_mutation_shmem_size(void);
+
+#endif /* POOL_TRACK_TABLE_MUTATION_H */
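Reader's note on the header above: the staleness rule described for
pool_track_table_mutation_table_is_stale() can be sketched as follows.
This is a minimal sketch under assumptions, not the patch's code. The
names elapsed_us and sketch_table_is_stale are hypothetical, and the
way the TTL interacts with track_table_mutation_max_staleness is
inferred from the comments in this patch (stale while the last write
is younger than the TTL, but never longer than the cap counted from
the first write).

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/time.h>

/* Microseconds elapsed from 'from' to 'to'; 0 if 'to' is not later. */
static uint64_t
elapsed_us(const struct timeval *from, const struct timeval *to)
{
	int64_t		us;

	us = ((int64_t) to->tv_sec - (int64_t) from->tv_sec) * 1000000
		+ ((int64_t) to->tv_usec - (int64_t) from->tv_usec);
	return us > 0 ? (uint64_t) us : 0;
}

/*
 * Hypothetical staleness rule (an assumption, not taken from the patch):
 * a table is stale while its last write is younger than ttl_us, unless
 * the entry is older than max_staleness_us (0 disables the cap).
 */
static bool
sketch_table_is_stale(const struct timeval *first_write,
					  const struct timeval *last_write,
					  const struct timeval *now,
					  uint64_t ttl_us, uint64_t max_staleness_us)
{
	if (elapsed_us(last_write, now) >= ttl_us)
		return false;			/* TTL window has passed */
	if (max_staleness_us > 0 &&
		elapsed_us(first_write, now) >= max_staleness_us)
		return false;			/* cap from first write reached */
	return true;
}
```

A caller would fill first_write and last_write from the table map
entry, take now from gettimeofday(), and route the SELECT to the
primary when the function returns true.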
diff --git a/src/main/pgpool_main.c b/src/main/pgpool_main.c
index bf7c452e2..d4e274f02 100644
--- a/src/main/pgpool_main.c
+++ b/src/main/pgpool_main.c
@@ -57,6 +57,7 @@
#include "auth/pool_passwd.h"
#include "auth/pool_hba.h"
#include "query_cache/pool_memqcache.h"
+#include "utils/pool_track_table_mutation.h"
#include "watchdog/wd_internal_commands.h"
#include "watchdog/wd_lifecheck.h"
#include "watchdog/watchdog.h"
@@ -1500,11 +1501,14 @@ sigusr1_interrupt_processor(void)
if (user1SignalSlot->signalFlags[SIG_WATCHDOG_STATE_CHANGED])
{
+ WD_STATES wd_state;
+
ereport(LOG,
(errmsg("Pgpool-II parent process received watchdog state change signal from watchdog")));
user1SignalSlot->signalFlags[SIG_WATCHDOG_STATE_CHANGED] = false;
- if (wd_internal_get_watchdog_local_node_state() == WD_STANDBY)
+ wd_state = wd_internal_get_watchdog_local_node_state();
+ if (wd_state == WD_STANDBY)
{
ereport(LOG,
(errmsg("we have joined the watchdog cluster as STANDBY node"),
@@ -1518,6 +1522,12 @@ sigusr1_interrupt_processor(void)
*/
pool_release_follow_primary_lock(true);
}
+ else if (wd_state == WD_COORDINATOR &&
+ pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ pool_track_table_mutation_trigger_global_cold_start();
+ }
}
if (user1SignalSlot->signalFlags[SIG_FAILOVER_INTERRUPT])
{
@@ -3083,6 +3093,16 @@ initialize_shared_mem_objects(bool clear_memcache_oidmaps)
elog(DEBUG1, "watchdog: %zu bytes requested for shared memory", MAXALIGN(wd_ipc_get_shared_mem_size()));
}
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ size += MAXALIGN(pool_track_table_mutation_shmem_size());
+ elog(DEBUG1,
+ "track_table_mutation: %zu bytes requested"
+ " for shared memory",
+ MAXALIGN(pool_track_table_mutation_shmem_size()));
+ }
+
initialize_shared_memory_main_segment(size);
/* Move the backend descriptors to shared memory */
@@ -3199,6 +3219,13 @@ initialize_shared_mem_objects(bool clear_memcache_oidmaps)
wd_ipc_initialize_data();
}
+ /* Initialize track table mutation for recently written tables */
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ pool_track_table_mutation_init();
+ }
+
}
/*
diff --git a/src/protocol/CommandComplete.c b/src/protocol/CommandComplete.c
index a3b8f0ea1..a4ec83f93 100644
--- a/src/protocol/CommandComplete.c
+++ b/src/protocol/CommandComplete.c
@@ -38,6 +38,8 @@
#include "utils/palloc.h"
#include "utils/memutils.h"
#include "utils/pool_stream.h"
+#include "utils/pool_track_table_mutation.h"
+#include "query_cache/pool_memqcache.h"
static int extract_ntuples(char *message);
static POOL_STATUS handle_mismatch_tuples(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend, char *packet, int packetlen, bool command_complete);
@@ -304,6 +306,33 @@ handle_query_context(POOL_CONNECTION_POOL *backend)
node = session_context->query_context->parse_tree;
+ /*
+ * Track table writes for dml_adaptive_global feature.
+ * For autocommit statements (not in explicit transaction), mark tables
+ * immediately. For explicit transactions, marking is deferred to COMMIT
+ * in dml_adaptive() so that rolled-back writes don't pollute the shared
+ * memory table map.
+ */
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL &&
+ node != NULL &&
+ !session_context->is_in_transaction)
+ {
+ int *oids;
+ int num_oids;
+
+ num_oids = pool_extract_table_oids(node, &oids);
+ if (num_oids > 0)
+ {
+ int dboid;
+
+ dboid = pool_track_table_mutation_get_database_oid();
+ if (dboid > 0)
+ pool_track_table_mutation_mark_tables_written(
+ oids, num_oids, dboid);
+ }
+ }
+
if (IsA(node, PrepareStmt))
{
if (session_context->uncompleted_message)
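Reader's note on the hunk above: the "mark immediately in autocommit,
defer to COMMIT inside a transaction" behaviour can be sketched as
below. This is only an illustration under assumptions. The type
SketchPendingWrites, the functions sketch_on_write and
sketch_on_transaction_end, and the recording callback are all
hypothetical; the callback stands in for
pool_track_table_mutation_mark_tables_written(). How the patch
actually stores per-transaction oids is not shown in this section.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_PENDING 64	/* hypothetical per-session limit */

typedef struct SketchPendingWrites
{
	int			oids[SKETCH_MAX_PENDING];
	int			num_oids;
} SketchPendingWrites;

typedef void (*sketch_mark_fn) (const int *oids, int num_oids, int dboid);

/* Stand-in for the shared-memory marking call; it only counts oids. */
static int	sketch_marked_total = 0;

static void
sketch_record_mark(const int *oids, int num_oids, int dboid)
{
	(void) oids;
	(void) dboid;
	sketch_marked_total += num_oids;
}

/* Called when a write statement completes. */
static void
sketch_on_write(SketchPendingWrites *pending, bool in_transaction,
				const int *oids, int num_oids, int dboid,
				sketch_mark_fn mark)
{
	int			i, j;

	if (!in_transaction)
	{
		mark(oids, num_oids, dboid);	/* autocommit: visible at once */
		return;
	}
	for (i = 0; i < num_oids; i++)
	{
		bool		seen = false;

		for (j = 0; j < pending->num_oids; j++)
			if (pending->oids[j] == oids[i])
				seen = true;
		/* A real implementation must handle overflow, not drop oids. */
		if (!seen && pending->num_oids < SKETCH_MAX_PENDING)
			pending->oids[pending->num_oids++] = oids[i];
	}
}

/* Called at COMMIT (committed = true) or ROLLBACK (committed = false). */
static void
sketch_on_transaction_end(SketchPendingWrites *pending, bool committed,
						  int dboid, sketch_mark_fn mark)
{
	if (committed && pending->num_oids > 0)
		mark(pending->oids, pending->num_oids, dboid);
	pending->num_oids = 0;		/* rolled-back writes are discarded */
}
```

Deferring the mark keeps rolled-back writes out of the shared map,
which is the behaviour that tests 10 and 11 of the regression script
in this patch check from the outside.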
diff --git a/src/protocol/child.c b/src/protocol/child.c
index c34f05728..316b76239 100644
--- a/src/protocol/child.c
+++ b/src/protocol/child.c
@@ -57,6 +57,7 @@
#include "utils/elog.h"
#include "utils/ps_status.h"
#include "utils/timestamp.h"
+#include "utils/pool_track_table_mutation.h"
#include "context/pool_process_context.h"
#include "context/pool_session_context.h"
@@ -213,6 +214,13 @@ do_child(int *fds)
/* Initialize per process context */
pool_init_process_context();
+ /* Initialize track table mutation child state for cold start tracking */
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ pool_track_table_mutation_child_init();
+ }
+
/* initialize connection pool */
if (pool_init_cp())
{
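Reader's note on the hunk above: a cold start check of the kind
declared as pool_track_table_mutation_in_cold_start() might look like
the sketch below. It is an assumption-laden illustration, not the
patch's implementation. The helper names (tv_before, tv_add_ms,
sketch_in_cold_start) are hypothetical, and treating a duration of 0
as disabling both the per-child and the global window is my reading
of the sample configuration comment.

```c
#include <stdbool.h>
#include <sys/time.h>

/* True if a is strictly earlier than b. */
static bool
tv_before(const struct timeval *a, const struct timeval *b)
{
	if (a->tv_sec != b->tv_sec)
		return a->tv_sec < b->tv_sec;
	return a->tv_usec < b->tv_usec;
}

/* Return t advanced by ms milliseconds (ms must be >= 0). */
static struct timeval
tv_add_ms(struct timeval t, long ms)
{
	t.tv_sec += ms / 1000;
	t.tv_usec += (ms % 1000) * 1000;
	if (t.tv_usec >= 1000000)
	{
		t.tv_sec += 1;
		t.tv_usec -= 1000000;
	}
	return t;
}

/*
 * Hypothetical check: a child is in cold start until duration_ms has
 * passed since it started, or until the shared global deadline (set,
 * for example, after a watchdog leader change) has passed.
 */
static bool
sketch_in_cold_start(const struct timeval *now,
					 const struct timeval *child_started,
					 const struct timeval *global_until,
					 long duration_ms)
{
	struct timeval local_until;

	if (duration_ms <= 0)
		return false;			/* assumed: 0 disables cold start */
	local_until = tv_add_ms(*child_started, duration_ms);
	return tv_before(now, &local_until) || tv_before(now, global_until);
}
```

While the function returns true, every query would be sent to the
primary, since the child cannot yet know which tables were written
before it started.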
diff --git a/src/protocol/pool_proto_modules.c b/src/protocol/pool_proto_modules.c
index f9458bb55..706abff5b 100644
--- a/src/protocol/pool_proto_modules.c
+++ b/src/protocol/pool_proto_modules.c
@@ -1461,7 +1461,9 @@ Parse(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend,
pool_where_to_send(query_context, query_context->original_query,
query_context->parse_tree);
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && strlen(name) != 0)
+ if (DLBOW_IS_DML_ADAPTIVE(
+ pool_config->disable_load_balance_on_write)
+ && strlen(name) != 0)
pool_setall_node_to_be_sent(query_context);
if (REPLICATION)
@@ -1804,7 +1806,7 @@ Bind(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend,
return POOL_END;
}
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE &&
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write) &&
TSTATE(backend, MAIN_REPLICA ? PRIMARY_NODE_ID : REAL_MAIN_NODE_ID) == 'T')
{
pool_where_to_send(query_context, query_context->original_query,
diff --git a/src/query_cache/pool_memqcache.c b/src/query_cache/pool_memqcache.c
index f38f71146..dca93334e 100644
--- a/src/query_cache/pool_memqcache.c
+++ b/src/query_cache/pool_memqcache.c
@@ -1305,6 +1305,12 @@ pool_extract_table_oids(Node *node, int **oidsp)
}
return num_oids;
}
+ else if (IsA(node, MergeStmt))
+ {
+ MergeStmt *stmt = (MergeStmt *) node;
+
+ table = make_table_name_from_rangevar(stmt->relation);
+ }
else if (IsA(node, ExplainStmt))
{
ListCell *cell;
diff --git a/src/sample/pgpool.conf.sample-stream b/src/sample/pgpool.conf.sample-stream
index 1ac982907..00132d534 100644
--- a/src/sample/pgpool.conf.sample-stream
+++ b/src/sample/pgpool.conf.sample-stream
@@ -478,6 +478,14 @@ backend_clustering_mode = streaming_replication
# modified within the current explicit transaction will
# not be load balanced until the end of the transaction.
#
+ # dml_adaptive_global:
+ # Superset of dml_adaptive. In addition to per-transaction
+ # tracking, uses shared memory to track recently written
+ # tables across all sessions. Reads from recently written
+ # tables are routed to primary until a TTL (based on
+ # replication delay) expires. Requires additional shared
+ # memory. See track_table_mutation_* parameters below.
+ #
# always:
# if a write query is issued, read queries will
# not be load balanced until the session ends.
@@ -499,6 +507,54 @@ backend_clustering_mode = streaming_replication
#statement_level_load_balance = off
# Enables statement level load balancing
+# - Track Table Mutation (used by dml_adaptive_global) -
+ # WARNING: dml_adaptive_global increases shared memory usage
+ # Default settings require ~6.4 MB shared memory
+ # (0.1 MB table tracking + 6.3 MB query cache)
+
+#track_table_mutation_ttl_factor = 5.0
+ # TTL multiplier: TTL = replication_delay * factor
+ # Higher values provide more safety margin
+ # Range: 1.0-100.0 (default: 5.0)
+ # (change requires reload)
+
+#track_table_mutation_max_staleness = 60000
+ # Maximum duration (ms) a table can stay marked stale,
+ # measured from its first write. Bounds cross-session
+ # impact: even under continuous writes, staleness expires
+ # after this period; only a later write marks it again.
+ # 0 disables the cap. Range: 0-3600000 (default: 60000 = 60s)
+ # (change requires reload)
+
+#track_table_mutation_cold_start_duration = 2000
+ # Duration in milliseconds to route all queries to primary
+ # after a child process starts (cold start period)
+ # Range: 0-60000 ms (default: 2000 ms = 2 seconds)
+ # Set to 0 to disable cold start behavior
+ # (change requires reload)
+
+#track_table_mutation_table_buckets = 1024
+ # Number of hash buckets for track table mutation
+ # Higher values reduce hash collisions
+ # Range: 64-65536 (default: 1024)
+ # (change requires restart)
+
+#track_table_mutation_table_size = 2048
+ # Maximum number of tables to track simultaneously
+ # Range: 128-131072 (default: 2048)
+ # (change requires restart)
+
+#track_table_mutation_query_buckets = 2048
+ # Number of hash buckets for query parse cache
+ # Range: 64-65536 (default: 2048)
+ # (change requires restart)
+
+#track_table_mutation_query_parse_cache_size = 10000
+ # Maximum number of query parse results to cache
+ # Range: 100-1000000 (default: 10000)
+ # Memory usage: ~640 bytes per entry (~6.3 MB default, ~64 MB for 100000)
+ # (change requires restart)
+
#------------------------------------------------------------------------------
# STREAMING REPLICATION MODE
#------------------------------------------------------------------------------
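Reader's note on the memory figures in the sample above: they can be
sanity-checked with a small sizing sketch that mirrors the struct
layouts from pool_track_table_mutation.h. Everything prefixed
Sketch/sketch_ is hypothetical. NAMEDATALEN is assumed to be
PostgreSQL's default of 64, MAXALIGN padding is ignored, and the
layout "header, then int buckets[], then entries[]" is taken from the
header comments. Under those assumptions the table_names array alone
is 8 * 132 = 1056 bytes, so the per-entry figure quoted in the sample
may be worth re-checking against the real sizeof.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* Assumption: PostgreSQL's default; the patch gets it via pool.h. */
#define SKETCH_NAMEDATALEN 64
#define SKETCH_TABLE_NAME_LEN (SKETCH_NAMEDATALEN * 2 + 4)
#define SKETCH_MAX_TABLES_PER_QUERY 8

/* Mirrors TrackTableMutationEntry field by field. */
typedef struct SketchTableEntry
{
	int			table_oid;
	int			dboid;
	struct timeval first_write_time;
	struct timeval last_write_time;
	uint32_t	hash;
	int			next;
	bool		in_use;
} SketchTableEntry;

/* Mirrors QueryParseEntry field by field. */
typedef struct SketchQueryEntry
{
	uint64_t	query_hash;
	bool		is_write;
	int			num_tables;
	char		table_names[SKETCH_MAX_TABLES_PER_QUERY]
						   [SKETCH_TABLE_NAME_LEN];
	int			next;
	int			lru_prev;
	int			lru_next;
	bool		in_use;
} SketchQueryEntry;

/* Header ints, then int buckets[num_buckets], then the entry array. */
static size_t
sketch_region_size(size_t header_ints, size_t num_buckets,
				   size_t max_entries, size_t entry_size)
{
	return header_ints * sizeof(int)
		+ num_buckets * sizeof(int)
		+ max_entries * entry_size;
}

/* TrackTableMutationHashTable has four int header fields. */
static size_t
sketch_table_map_size(size_t num_buckets, size_t max_entries)
{
	return sketch_region_size(4, num_buckets, max_entries,
							  sizeof(SketchTableEntry));
}

/* QueryParseCache has six int header fields. */
static size_t
sketch_query_cache_size(size_t num_buckets, size_t max_entries)
{
	return sketch_region_size(6, num_buckets, max_entries,
							  sizeof(SketchQueryEntry));
}
```

Printing sketch_table_map_size(1024, 2048) and
sketch_query_cache_size(2048, 10000) with printf("%zu") gives the
byte counts for the default settings on a given platform.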
diff --git a/src/streaming_replication/pool_worker_child.c b/src/streaming_replication/pool_worker_child.c
index 311b63865..3ad806e3e 100644
--- a/src/streaming_replication/pool_worker_child.c
+++ b/src/streaming_replication/pool_worker_child.c
@@ -58,6 +58,7 @@
#include "utils/pool_ip.h"
#include "utils/ps_status.h"
#include "utils/pool_stream.h"
+#include "utils/pool_track_table_mutation.h"
#include "context/pool_process_context.h"
#include "context/pool_session_context.h"
@@ -419,6 +420,7 @@ check_replication_time_lag(void)
BackendInfo *bkinfo;
uint64 lag;
uint64 delay_threshold_by_time;
+ uint64 max_delay_us = 0;
ErrorContextCallback callback;
int active_standby_node;
bool replication_delay_by_time;
@@ -643,6 +645,10 @@ check_replication_time_lag(void)
* seconds to micro
* seconds */
+ /* Track max delay for mutation TTL */
+ if (lag > max_delay_us)
+ max_delay_us = lag;
+
/* Log delay if necessary */
if ((pool_config->log_standby_delay == LSD_ALWAYS && lag > 0) ||
(pool_config->log_standby_delay == LSD_OVER_THRESHOLD &&
@@ -668,6 +674,13 @@ check_replication_time_lag(void)
}
}
+ /*
+ * Update track table mutation TTL from the max observed time-based delay.
+ */
+ if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE_GLOBAL &&
+ replication_delay_by_time && max_delay_us > 0)
+ pool_track_table_mutation_update_ttl(max_delay_us);
+
error_context_stack = callback.previous;
}
@@ -695,6 +708,7 @@ check_replication_time_lag_with_cmd(void)
double delay_ms;
uint64 delay;
uint64 delay_threshold_by_time;
+ uint64 max_delay_us = 0; /* Track max delay for mutation map */
int token_count = 0;
int primary_node_id;
int save_errno;
@@ -1003,6 +1017,10 @@ check_replication_time_lag_with_cmd(void)
bkinfo->standby_delay = delay;
bkinfo->standby_delay_by_time = true;
+ /* Track maximum delay for table mutation map TTL calculation */
+ if (delay > max_delay_us)
+ max_delay_us = delay;
+
/*
* Log delay if necessary. threshold is in milliseconds, convert
* to microseconds.
@@ -1021,6 +1039,12 @@ check_replication_time_lag_with_cmd(void)
token = strtok_r(NULL, " \t\n", &saveptr);
}
+ /* Update table mutation TTL based on max observed delay */
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL &&
+ max_delay_us > 0)
+ pool_track_table_mutation_update_ttl(max_delay_us);
+
}
PG_CATCH();
{
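Reader's note on the two hunks above: both functions collect the
largest standby delay and hand it to
pool_track_table_mutation_update_ttl(). The TTL derivation described
in the sample configuration (TTL = replication_delay * factor) can be
sketched as below. This is a hedged illustration, not the patch's
code. The function names are hypothetical, clamping the factor to the
documented 1.0-100.0 range is an assumption, and so is falling back
to the 100 ms default when no delay has been observed.

```c
#include <stddef.h>
#include <stdint.h>

/* Same value as TRACK_TABLE_MUTATION_DEFAULT_TTL_US in the header. */
#define SKETCH_DEFAULT_TTL_US ((uint64_t) 100 * 1000)

/* Largest delay across all standbys, in microseconds. */
static uint64_t
sketch_max_delay_us(const uint64_t *delays_us, size_t num_standbys)
{
	uint64_t	max_us = 0;
	size_t		i;

	for (i = 0; i < num_standbys; i++)
		if (delays_us[i] > max_us)
			max_us = delays_us[i];
	return max_us;
}

/*
 * Hypothetical TTL rule: delay * factor, with the factor clamped to
 * the documented range and a default when the delay is unknown (0).
 */
static uint64_t
sketch_compute_ttl_us(uint64_t max_delay_us, double factor)
{
	if (factor < 1.0)
		factor = 1.0;
	if (factor > 100.0)
		factor = 100.0;
	if (max_delay_us == 0)
		return SKETCH_DEFAULT_TTL_US;
	return (uint64_t) ((double) max_delay_us * factor);
}
```

Using the maximum rather than the average delay is the conservative
choice: a read may be balanced to any standby, so the TTL has to
cover the slowest one.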
diff --git a/src/test/regression/libs.sh b/src/test/regression/libs.sh
index 7c5a0c182..1c8ae392d 100644
--- a/src/test/regression/libs.sh
+++ b/src/test/regression/libs.sh
@@ -42,6 +42,8 @@ function wait_for_failover_done {
function clean_all {
pgrep pgpool | xargs kill -9 > /dev/null 2>&1
pgrep postgres | xargs kill -9 > /dev/null 2>&1
+ # Clean up SysV IPC resources leaked by kill -9 (removes every object this user may delete)
+ ipcrm --all 2>/dev/null || true
rm -f $PGSOCKET_DIR/.s.PGSQL.*
netstat -t -p 2>/dev/null|grep pgpool
}
diff --git a/src/test/regression/tests/042.track_table_mutation/test.sh b/src/test/regression/tests/042.track_table_mutation/test.sh
new file mode 100755
index 000000000..8b4dd17b8
--- /dev/null
+++ b/src/test/regression/tests/042.track_table_mutation/test.sh
@@ -0,0 +1,354 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for track table mutation feature (in-memory table tracking).
+# Tests routing of queries based on recently written tables.
+#
+source $TESTLIBS
+TESTDIR=testdir
+PSQL=$PGBIN/psql
+PSQLOPTS="-a -q -X"
+PGPOOLBIN=$PGPOOL_INSTALL_DIR/bin
+export PGDATABASE=test
+
+# Only run in streaming replication mode since that's the target use case
+for mode in s
+do
+ rm -fr $TESTDIR
+ mkdir $TESTDIR
+ cd $TESTDIR
+
+ # Create test environment with 2 nodes
+ echo -n "creating test environment..."
+ $PGPOOL_SETUP -m $mode -n 2 || exit 1
+ echo "done."
+
+ source ./bashrc.ports
+
+ # Configure track table mutation feature via dml_adaptive_global
+ echo "disable_load_balance_on_write = 'dml_adaptive_global'" >> etc/pgpool.conf
+ echo "track_table_mutation_ttl_factor = 5.0" >> etc/pgpool.conf
+ echo "track_table_mutation_cold_start_duration = 10000" >> etc/pgpool.conf
+
+ # Enable load balancing explicitly
+ echo "load_balance_mode = on" >> etc/pgpool.conf
+
+ # Configure weights so we can distinguish routing
+ # Backend 0 (primary) weight=0, Backend 1 (standby) weight=1
+ # This means load balanced queries go to node 1 by default
+ echo "backend_weight0 = 0" >> etc/pgpool.conf
+ echo "backend_weight1 = 1" >> etc/pgpool.conf
+
+ # Enable debug logging to see routing decisions
+ echo "log_min_messages = debug1" >> etc/pgpool.conf
+
+ ./startall
+
+ export PGPORT=$PGPOOL_PORT
+ export PGHOST=localhost
+
+ wait_for_pgpool_startup
+
+ # Create test tables
+ $PSQL test <<EOF
+CREATE TABLE t1(i INTEGER);
+CREATE TABLE t2(i INTEGER);
+CREATE TABLE t3(i INTEGER);
+EOF
+
+ echo "=== Test 1: Cold Start Routing ==="
+ # During cold start, all queries should go to primary
+ # Restart pgpool to trigger cold start
+ ./shutdownall
+ ./startall
+ wait_for_pgpool_startup
+
+ # Immediately query - should go to primary due to cold start
+ $PSQL test -c "SELECT 'cold_start_test' as marker, * FROM t1;" > /dev/null 2>&1
+
+ # Check log for cold start message (use -a to handle binary log files)
+ if grep -a -q "could not load balance because of track table mutation cold start" log/pgpool.log; then
+ echo "Test 1 PASSED: Cold start routing works"
+ else
+ echo "Test 1 FAILED: Cold start routing not detected"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 2: Wait for cold start to end ==="
+ # Wait for cold start period to end (10 seconds).
+ # Use generous margin to avoid flakiness under load (e.g. full regression suite).
+ sleep 12
+
+ # Clear the log
+ > log/pgpool.log
+
+ # Now a clean table query should load balance (go to node 1)
+ $PSQL test -c "SELECT 'after_cold_start' as marker, * FROM t3;" > /dev/null 2>&1
+
+ # After cold start, queries to clean tables should load balance
+ # Check that it did NOT get forced to primary due to track table mutation
+ if grep -a -q "could not load balance because of track table mutation cold start" log/pgpool.log; then
+ echo "Test 2 FAILED: Still in cold start after waiting"
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 2 PASSED: Cold start ended correctly"
+
+ echo "=== Test 3: Write-then-Read Routing ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Write to t1 and then read - use single connection to ensure same session
+ $PSQL test <<EOF
+INSERT INTO t1 VALUES (1);
+SELECT 'write_read_test' as marker, * FROM t1;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ # Check log for table staleness message
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 3 PASSED: Write-then-read routing works"
+ else
+ echo "Test 3 FAILED: Table staleness not detected after write"
+ # Show relevant log entries for debugging
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 4: Clean Table Still Load Balances ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Read from t2 (never written to) - should load balance
+ $PSQL test -c "SELECT 'clean_table_test' as marker, * FROM t2;" > /dev/null 2>&1
+
+ # Should NOT see track table mutation blocking message for t2
+ if grep -a -q "could not load balance because table.*t2.*was recently written" log/pgpool.log; then
+ echo "Test 4 FAILED: Clean table incorrectly marked as stale"
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 4 PASSED: Clean tables still load balance"
+
+ echo "=== Test 5: UPDATE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Update t2 and then read - use single connection
+ $PSQL test <<EOF
+UPDATE t2 SET i = 999 WHERE i = 0;
+SELECT 'update_test' as marker, * FROM t2;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 5 PASSED: UPDATE marks table as stale"
+ else
+ echo "Test 5 FAILED: UPDATE did not mark table as stale"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 6: DELETE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Delete from t3 and then read - use single connection
+ $PSQL test <<EOF
+DELETE FROM t3 WHERE i = 0;
+SELECT 'delete_test' as marker, * FROM t3;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 6 PASSED: DELETE marks table as stale"
+ else
+ echo "Test 6 FAILED: DELETE did not mark table as stale"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 7: TRUNCATE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create a fresh table for TRUNCATE test
+ $PSQL test -c "CREATE TABLE t_truncate(i INTEGER);" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_truncate VALUES (1), (2), (3);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log again
+ > log/pgpool.log
+
+ # Truncate and then read - use single connection
+ $PSQL test <<EOF
+TRUNCATE t_truncate;
+SELECT 'truncate_test' as marker, * FROM t_truncate;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 7 PASSED: TRUNCATE marks table as stale"
+ else
+ echo "Test 7 FAILED: TRUNCATE did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 8: WITH Clause (CTE with DELETE) Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create a fresh table for WITH test
+ $PSQL test -c "CREATE TABLE t_cte(i INTEGER);" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_cte VALUES (1), (2), (3);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log again
+ > log/pgpool.log
+
+ # Use WITH clause with DELETE, then read from the table
+ $PSQL test <<EOF
+WITH deleted AS (DELETE FROM t_cte WHERE i = 1 RETURNING *)
+SELECT * FROM deleted;
+SELECT 'cte_test' as marker, * FROM t_cte;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 8 PASSED: WITH clause (CTE) marks table as stale"
+ else
+ echo "Test 8 FAILED: WITH clause (CTE) did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ # Test 9: MERGE (PostgreSQL 15+ only)
+ PG_MAJOR_VERSION=$($PSQL -t -c "SELECT substring(version() from 'PostgreSQL ([0-9]+)');" | tr -d ' ')
+ if [ "$PG_MAJOR_VERSION" -ge 15 ] 2>/dev/null; then
+ echo "=== Test 9: MERGE Marks Table as Stale (PostgreSQL $PG_MAJOR_VERSION) ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create tables for MERGE test
+ $PSQL test -c "CREATE TABLE t_merge_target(id INTEGER PRIMARY KEY, val TEXT);" > /dev/null 2>&1
+ $PSQL test -c "CREATE TABLE t_merge_source(id INTEGER, val TEXT);" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_merge_target VALUES (1, 'old');" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_merge_source VALUES (1, 'new'), (2, 'insert');" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log again
+ > log/pgpool.log
+
+ # Use MERGE, then read from the target table
+ $PSQL test <<EOF
+MERGE INTO t_merge_target t
+USING t_merge_source s ON t.id = s.id
+WHEN MATCHED THEN UPDATE SET val = s.val
+WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.val);
+SELECT 'merge_test' as marker, * FROM t_merge_target;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 9 PASSED: MERGE marks table as stale"
+ else
+ echo "Test 9 FAILED: MERGE did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+ else
+ echo "=== Test 9: MERGE skipped (requires PostgreSQL 15+, have $PG_MAJOR_VERSION) ==="
+ fi
+
+ echo "=== Test 10: ROLLBACK Does NOT Mark Table as Stale ==="
+ # Create a fresh table for rollback test
+ $PSQL test -c "CREATE TABLE t_rollback(i INTEGER);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log
+ > log/pgpool.log
+
+ # Write inside a transaction, then rollback
+ $PSQL test <<EOF
+BEGIN;
+INSERT INTO t_rollback VALUES (1);
+ROLLBACK;
+SELECT 'rollback_test' as marker, * FROM t_rollback;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ # Should NOT see t_rollback marked as stale since the write was rolled back
+ if grep -a -q "could not load balance because table.*t_rollback.*was recently written" log/pgpool.log; then
+ echo "Test 10 FAILED: Rolled-back write incorrectly marked table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 10 PASSED: ROLLBACK does not mark table as stale"
+
+ echo "=== Test 11: COMMIT Marks Table as Stale ==="
+ # Create a fresh table for commit test
+ $PSQL test -c "CREATE TABLE t_commit(i INTEGER);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log
+ > log/pgpool.log
+
+ # Write inside a transaction, then commit, then read
+ $PSQL test <<EOF
+BEGIN;
+INSERT INTO t_commit VALUES (1);
+COMMIT;
+SELECT 'commit_test' as marker, * FROM t_commit;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 11 PASSED: COMMIT marks table as stale"
+ else
+ echo "Test 11 FAILED: Committed write did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ echo ""
+ echo "=== All Track Table Mutation Tests PASSED ==="
+
+ ./shutdownall
+
+ cd ..
+done
+
+exit 0
diff --git a/src/test/regression/tests/043.track_table_mutation_watchdog/test.sh b/src/test/regression/tests/043.track_table_mutation_watchdog/test.sh
new file mode 100755
index 000000000..c50c213d6
--- /dev/null
+++ b/src/test/regression/tests/043.track_table_mutation_watchdog/test.sh
@@ -0,0 +1,184 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# Test script for track table mutation global cold start
+# on watchdog leader change.
+#
+# Uses $WATCHDOG_SETUP to create a 2-node watchdog cluster,
+# then verifies that when the leader is stopped the new
+# leader triggers a global cold start.
+#-------------------------------------------------------------------
+source $TESTLIBS
+TESTDIR=testdir
+PSQL=$PGBIN/psql
+success_count=0
+
+dir=`pwd`
+rm -fr $TESTDIR
+mkdir $TESTDIR
+cd $TESTDIR
+
+# Create 2-node watchdog cluster
+$WATCHDOG_SETUP -wn 2 || exit 1
+
+# Ensure per-node scripts are executable
+# (sed -i in watchdog_setup can strip permissions)
+chmod 755 pgpool*/startall pgpool*/shutdownall
+
+# Append track_table_mutation config to both nodes
+for i in 0 1
+do
+ cat >> pgpool${i}/etc/pgpool.conf <<EOF
+disable_load_balance_on_write = 'dml_adaptive_global'
+track_table_mutation_cold_start_duration = 2000
+enable_consensus_with_half_votes = on
+log_min_messages = debug1
+EOF
+done
+
+./startall
+export PCPPASSFILE=$dir/$TESTDIR/pgpool0/pcppass
+
+# Wait for watchdog lifecheck on node 0
+echo -n "waiting for watchdog node 0 starting up..."
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep "lifecheck started" \
+ pgpool0/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ break
+ fi
+ sleep 2
+done
+echo "done."
+
+# Test 1: Verify leader came up
+echo "=== Test 1: Waiting for the pgpool leader... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep "I am the cluster leader node" \
+ pgpool0/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 1 PASSED: Leader brought up."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 1 ]; then
+ echo "Test 1 FAILED: Leader did not start"
+ ./shutdownall
+ exit 1
+fi
+
+# Test 2: Verify standby joined cluster
+echo "=== Test 2: Waiting for standby to join... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep "successfully joined the watchdog cluster" \
+ pgpool1/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 2 PASSED: Standby joined."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 2 ]; then
+ echo "Test 2 FAILED: Standby did not join"
+ ./shutdownall
+ exit 1
+fi
+
+# Test 3: Verify track_table_mutation initialized
+echo "=== Test 3: Verify feature initialized ==="
+if grep -a "track_table_mutation: initialized" \
+ pgpool0/log/pgpool.log > /dev/null 2>&1; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 3 PASSED: Feature initialized."
+else
+ echo "Test 3 FAILED: Feature not initialized"
+ ./shutdownall
+ exit 1
+fi
+
+# Test 4: Stop leader (pgpool0) to trigger failover
+echo "=== Test 4: Stopping leader... ==="
+cd pgpool0
+source ./bashrc.ports
+$PGPOOL_INSTALL_DIR/bin/pgpool \
+ -f etc/pgpool.conf -m f stop
+cd ..
+
+echo "Checking standby detected shutdown..."
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep -a "is shutting down" \
+ pgpool1/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 4 PASSED: Shutdown detected."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 4 ]; then
+ echo "Test 4 FAILED: Shutdown not detected"
+ ./shutdownall
+ exit 1
+fi
+
+# Test 5: Verify standby became new leader
+echo "=== Test 5: Checking standby takes over... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep -a "I am the cluster leader node" \
+ pgpool1/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 5 PASSED: Standby became leader."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 5 ]; then
+ echo "Test 5 FAILED: Standby did not become leader"
+ ./shutdownall
+ exit 1
+fi
+
+# Test 6: Verify global cold start was triggered
+echo "=== Test 6: Checking global cold start... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep -a "track_table_mutation: global cold start" \
+ pgpool1/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 6 PASSED: Global cold start triggered."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+# Cleanup
+./shutdownall
+
+echo ""
+echo "$success_count out of 6 successful"
+
+if test $success_count -eq 6
+then
+ echo "=== All Watchdog Tests PASSED ==="
+ exit 0
+fi
+
+exit 1
diff --git a/src/utils/pool_track_table_mutation.c b/src/utils/pool_track_table_mutation.c
new file mode 100644
index 000000000..ee09b3f50
--- /dev/null
+++ b/src/utils/pool_track_table_mutation.c
@@ -0,0 +1,1453 @@
+/* -*-pgsql-c-*- */
+/*
+ * pgpool: a language independent connection pool server for PostgreSQL
+ * written by Tatsuo Ishii
+ *
+ * Copyright (c) 2003-2026 PgPool Global Development Group
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose and without fee is hereby
+ * granted, provided that the above copyright notice appear in all
+ * copies and that both that copyright notice and this permission
+ * notice appear in supporting documentation, and that the name of the
+ * author not be used in advertising or publicity pertaining to
+ * distribution of the software without specific, written prior
+ * permission. The author makes no representations about the
+ * suitability of this software for any purpose. It is provided "as
+ * is" without express or implied warranty.
+ *
+ * pool_track_table_mutation.c: In-memory tracking of recently
+ * written tables to prevent stale reads from replicas.
+ *
+ * Based on the "lagless" architecture from Tailor Brands.
+ */
+
+#include <string.h>
+#include <stdlib.h>
+#include <sys/time.h>
+#include <unistd.h>
+
+#include "pool.h"
+#include "pool_config.h"
+#include "context/pool_session_context.h"
+#include "utils/pool_track_table_mutation.h"
+#include "utils/elog.h"
+#include "utils/pool_ipc.h"
+#include "utils/palloc.h"
+#include "utils/pool_relcache.h"
+
+#define DATABASE_TO_OID_QUERY \
+ "SELECT oid FROM pg_catalog.pg_database" \
+ " WHERE datname = '%s'"
+
+/*
+ * Helper macro: true when the feature is not active.
+ */
+#define TRACK_TABLE_MUTATION_DISABLED() \
+ (pool_config->disable_load_balance_on_write != \
+ DLBOW_DML_ADAPTIVE_GLOBAL || \
+ track_table_mutation_shmem == NULL)
+
+/* ----------------
+ * Local variables
+ * ----------------
+ */
+
+/* Pointer to shared memory structure */
+static TrackTableMutationShmem *track_table_mutation_shmem = NULL;
+
+/* Per-process cold start tracking (not in shared memory) */
+static struct timeval process_start_time;
+static bool cold_start_initialized = false;
+
+/* ----------------
+ * Helper macros for flexible arrays in shared memory
+ * ----------------
+ */
+
+/* Get pointer to bucket array in table map */
+#define TABLE_MAP_BUCKETS(map) \
+ ((int *)((char *)(map) + \
+ sizeof(TrackTableMutationHashTable)))
+
+/* Get pointer to entry array in table map */
+#define TABLE_MAP_ENTRIES(map) \
+ ((TrackTableMutationEntry *)((char *)(map) + \
+ sizeof(TrackTableMutationHashTable) + \
+ (map)->num_buckets * sizeof(int)))
+
+/* Get pointer to bucket array in parse cache */
+#define PARSE_CACHE_BUCKETS(cache) \
+ ((int *)((char *)(cache) + sizeof(QueryParseCache)))
+
+/* Get pointer to entry array in parse cache */
+#define PARSE_CACHE_ENTRIES(cache) \
+ ((QueryParseEntry *)((char *)(cache) + \
+ sizeof(QueryParseCache) + \
+ (cache)->num_buckets * sizeof(int)))
+
+/* ----------------
+ * Semaphore lock helpers
+ * ----------------
+ */
+
+static inline void
+table_map_lock(void)
+{
+ pool_semaphore_lock(TRACK_TABLE_MUTATION_TABLE_SEM);
+}
+
+static inline void
+table_map_unlock(void)
+{
+ pool_semaphore_unlock(TRACK_TABLE_MUTATION_TABLE_SEM);
+}
+
+static inline void
+parse_cache_lock(void)
+{
+ pool_semaphore_lock(TRACK_TABLE_MUTATION_QUERY_SEM);
+}
+
+static inline void
+parse_cache_unlock(void)
+{
+ pool_semaphore_unlock(TRACK_TABLE_MUTATION_QUERY_SEM);
+}
+
+/* ----------------
+ * Hash functions
+ * ----------------
+ */
+
+/*
+ * FNV-1a hash for table/database oid pair
+ */
+static uint32
+fnv1a_hash_table_key(int table_oid, int dboid)
+{
+ uint32 hash = 2166136261u; /* FNV offset basis */
+ uint32 data[2];
+ const unsigned char *bytes;
+ size_t i;
+
+ data[0] = (uint32) table_oid;
+ data[1] = (uint32) dboid;
+ bytes = (const unsigned char *) data;
+
+ for (i = 0; i < sizeof(data); i++)
+ {
+ hash ^= bytes[i];
+ hash *= 16777619u; /* FNV prime */
+ }
+
+ return hash;
+}
+
+/*
+ * 64-bit FNV-1a hash of a byte string
+ */
+static uint64
+fnv1a_hash_64(const char *str, size_t len)
+{
+ /* FNV offset basis for 64-bit */
+ uint64 hash = 14695981039346656037ULL;
+ size_t i;
+
+ for (i = 0; i < len; i++)
+ {
+ hash ^= (uint8)str[i];
+ hash *= 1099511628211ULL; /* FNV prime */
+ }
+
+ return hash;
+}
+
+/* ----------------
+ * Time utilities
+ * ----------------
+ */
+
+/*
+ * Get elapsed time in microseconds between two timevals
+ */
+static int64
+elapsed_us(struct timeval *start, struct timeval *end)
+{
+ return ((int64)(end->tv_sec - start->tv_sec) * 1000000)
+ + (end->tv_usec - start->tv_usec);
+}
+
+/*
+ * Get current time
+ */
+static void
+get_current_time(struct timeval *tv)
+{
+ gettimeofday(tv, NULL);
+}
+
+/* ----------------
+ * Database oid lookup
+ * ----------------
+ */
+
+static int
+track_table_mutation_get_database_oid_internal(void)
+{
+ int oid = 0;
+ static POOL_RELCACHE *relcache;
+ POOL_CONNECTION_POOL *backend;
+ POOL_SESSION_CONTEXT *session_context;
+
+ /* Safety check: must have shmem initialized */
+ if (track_table_mutation_shmem == NULL)
+ return oid;
+
+ session_context = pool_get_session_context(false);
+ if (session_context == NULL)
+ return oid;
+
+ backend = session_context->backend;
+ if (backend == NULL ||
+ MAIN_CONNECTION(backend) == NULL ||
+ MAIN_CONNECTION(backend)->sp == NULL)
+ return oid;
+
+ /* Ensure database name is valid */
+ if (MAIN_CONNECTION(backend)->sp->database == NULL)
+ return oid;
+
+ if (!relcache)
+ {
+ relcache = pool_create_relcache(
+ pool_config->relcache_size,
+ DATABASE_TO_OID_QUERY,
+ int_register_func,
+ int_unregister_func,
+ false);
+ if (relcache == NULL)
+ {
+ ereport(LOG,
+ (errmsg("track_table_mutation: "
+ "error creating relcache")));
+ return oid;
+ }
+ }
+
+ oid = (int) (intptr_t) pool_search_relcache(
+ relcache, backend,
+ MAIN_CONNECTION(backend)->sp->database);
+ return oid;
+}
+
+int
+pool_track_table_mutation_get_database_oid(void)
+{
+ return track_table_mutation_get_database_oid_internal();
+}
+
+/* ----------------
+ * Table mutation hash table operations
+ * ----------------
+ */
+
+/*
+ * Initialize table mutation hash table
+ */
+static void
+table_map_init(TrackTableMutationHashTable *map,
+ int num_buckets, int max_entries)
+{
+ int *buckets;
+ TrackTableMutationEntry *entries;
+ int i;
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ map->num_buckets = num_buckets;
+ map->max_entries = max_entries;
+ map->num_entries = 0;
+ map->free_list_head = 0;
+
+ buckets = TABLE_MAP_BUCKETS(map);
+ entries = TABLE_MAP_ENTRIES(map);
+
+ /* Initialize all buckets to empty */
+ for (i = 0; i < num_buckets; i++)
+ buckets[i] = invalid;
+
+ /* Initialize free list - chain all entries */
+ for (i = 0; i < max_entries; i++)
+ {
+ entries[i].in_use = false;
+ entries[i].next = (i < max_entries - 1) ?
+ i + 1 : invalid;
+ }
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "table map init %d buckets, "
+ "%d max entries",
+ num_buckets, max_entries)));
+}
+
+/*
+ * Allocate an entry from the free list
+ */
+static int
+table_map_alloc_entry(TrackTableMutationHashTable *map)
+{
+ TrackTableMutationEntry *entries;
+ int idx;
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ entries = TABLE_MAP_ENTRIES(map);
+
+ if (map->free_list_head == invalid)
+ return invalid;
+
+ idx = map->free_list_head;
+ map->free_list_head = entries[idx].next;
+ entries[idx].in_use = true;
+ entries[idx].next = invalid;
+ map->num_entries++;
+
+ return idx;
+}
+
+/*
+ * Free an entry back to the free list
+ */
+static void
+table_map_free_entry(TrackTableMutationHashTable *map,
+ int idx)
+{
+ TrackTableMutationEntry *entries;
+
+ entries = TABLE_MAP_ENTRIES(map);
+
+ entries[idx].in_use = false;
+ entries[idx].next = map->free_list_head;
+ map->free_list_head = idx;
+ map->num_entries--;
+}
+
+/*
+ * Look up a table in the hash table.
+ * Returns entry index or INVALID_INDEX if not found.
+ * Must be called with lock held.
+ */
+static int
+table_map_lookup(TrackTableMutationHashTable *map,
+ int table_oid, int dboid,
+ uint32 hash)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TrackTableMutationEntry *entries;
+ int bucket = hash % map->num_buckets;
+ int idx = buckets[bucket];
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ entries = TABLE_MAP_ENTRIES(map);
+
+ while (idx != invalid)
+ {
+ if (entries[idx].hash == hash &&
+ entries[idx].table_oid == table_oid &&
+ entries[idx].dboid == dboid)
+ {
+ return idx;
+ }
+ idx = entries[idx].next;
+ }
+
+ return invalid;
+}
+
+/*
+ * Insert or update a table entry.
+ * Must be called with lock held.
+ */
+static void
+table_map_insert(TrackTableMutationHashTable *map,
+ int table_oid, int dboid,
+ uint32 hash,
+ struct timeval *write_time)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TrackTableMutationEntry *entries;
+ int bucket = hash % map->num_buckets;
+ int idx;
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ entries = TABLE_MAP_ENTRIES(map);
+
+ /* Check if entry already exists */
+ idx = table_map_lookup(map, table_oid, dboid, hash);
+ if (idx != invalid)
+ {
+ /* Update last write time; keep first_write_time */
+ entries[idx].last_write_time = *write_time;
+ return;
+ }
+
+ /* Allocate new entry */
+ idx = table_map_alloc_entry(map);
+ if (idx == invalid)
+ {
+ int b;
+
+ /* Map full - evict head of first non-empty bucket */
+ for (b = 0; b < map->num_buckets; b++)
+ {
+ if (buckets[b] != invalid)
+ {
+ int victim = buckets[b];
+
+ buckets[b] = entries[victim].next;
+ table_map_free_entry(map, victim);
+ idx = table_map_alloc_entry(map);
+ break;
+ }
+ }
+
+ if (idx == invalid)
+ {
+ ereport(WARNING,
+ (errmsg("track_table_mutation: "
+ "failed to allocate entry "
+ "for oid %d (dboid %d)",
+ table_oid, dboid)));
+ return;
+ }
+ }
+
+ /* Initialize new entry */
+ entries[idx].table_oid = table_oid;
+ entries[idx].dboid = dboid;
+ entries[idx].hash = hash;
+ entries[idx].first_write_time = *write_time;
+ entries[idx].last_write_time = *write_time;
+
+ /* Insert at head of bucket chain */
+ entries[idx].next = buckets[bucket];
+ buckets[bucket] = idx;
+
+ ereport(DEBUG2,
+ (errmsg("track_table_mutation: "
+ "marked oid %d (dboid %d) written",
+ table_oid, dboid)));
+}
+
+/*
+ * Remove expired entries from the table map.
+ * Must be called with lock held.
+ */
+static void
+table_map_cleanup_expired(
+ TrackTableMutationHashTable *map, uint64 ttl_us)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TrackTableMutationEntry *entries;
+ struct timeval now;
+ int64 max_stale_us;
+ int removed = 0;
+ int b;
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ entries = TABLE_MAP_ENTRIES(map);
+ get_current_time(&now);
+
+ max_stale_us = (int64)pool_config
+ ->track_table_mutation_max_staleness * 1000LL;
+
+ for (b = 0; b < map->num_buckets; b++)
+ {
+ int *prev_ptr = &buckets[b];
+ int idx = buckets[b];
+
+ while (idx != invalid)
+ {
+ int64 age;
+ int64 total_age;
+ bool expired;
+
+ age = elapsed_us(
+ &entries[idx].last_write_time, &now);
+ expired = (age > (int64)ttl_us);
+
+ /*
+ * Also evict entries that exceed
+ * max_staleness from first write.
+ */
+ if (!expired && max_stale_us > 0)
+ {
+ total_age = elapsed_us(
+ &entries[idx].first_write_time,
+ &now);
+ expired = (total_age >= max_stale_us);
+ }
+
+ if (expired)
+ {
+ /* Entry has expired - remove it */
+ int next = entries[idx].next;
+
+ *prev_ptr = next;
+ table_map_free_entry(map, idx);
+ idx = next;
+ removed++;
+ }
+ else
+ {
+ prev_ptr = &entries[idx].next;
+ idx = entries[idx].next;
+ }
+ }
+ }
+
+ if (removed > 0)
+ {
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "cleaned up %d expired entries",
+ removed)));
+ }
+}
+
+/* ----------------
+ * Parse cache operations
+ * ----------------
+ */
+
+/*
+ * Initialize parse cache
+ */
+static void
+parse_cache_init(QueryParseCache *cache,
+ int num_buckets, int max_entries)
+{
+ int *buckets;
+ QueryParseEntry *entries;
+ int i;
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ cache->num_buckets = num_buckets;
+ cache->max_entries = max_entries;
+ cache->num_entries = 0;
+ cache->free_list_head = 0;
+ cache->lru_head = invalid;
+ cache->lru_tail = invalid;
+
+ buckets = PARSE_CACHE_BUCKETS(cache);
+ entries = PARSE_CACHE_ENTRIES(cache);
+
+ /* Initialize all buckets to empty */
+ for (i = 0; i < num_buckets; i++)
+ buckets[i] = invalid;
+
+ /* Initialize free list */
+ for (i = 0; i < max_entries; i++)
+ {
+ entries[i].in_use = false;
+ entries[i].next = (i < max_entries - 1) ?
+ i + 1 : invalid;
+ entries[i].lru_prev = invalid;
+ entries[i].lru_next = invalid;
+ }
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "parse cache init %d buckets, "
+ "%d max entries",
+ num_buckets, max_entries)));
+}
+
+/*
+ * Move entry to front of LRU list (most recently used)
+ */
+static void
+parse_cache_lru_touch(QueryParseCache *cache, int idx)
+{
+ QueryParseEntry *entries = PARSE_CACHE_ENTRIES(cache);
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ /* Already at head? */
+ if (cache->lru_head == idx)
+ return;
+
+ /* Remove from current position */
+ if (entries[idx].lru_prev != invalid)
+ entries[entries[idx].lru_prev].lru_next =
+ entries[idx].lru_next;
+ if (entries[idx].lru_next != invalid)
+ entries[entries[idx].lru_next].lru_prev =
+ entries[idx].lru_prev;
+ if (cache->lru_tail == idx)
+ cache->lru_tail = entries[idx].lru_prev;
+
+ /* Insert at head */
+ entries[idx].lru_prev = invalid;
+ entries[idx].lru_next = cache->lru_head;
+ if (cache->lru_head != invalid)
+ entries[cache->lru_head].lru_prev = idx;
+ cache->lru_head = idx;
+ if (cache->lru_tail == invalid)
+ cache->lru_tail = idx;
+}
+
+/*
+ * Add entry to LRU list (at head)
+ */
+static void
+parse_cache_lru_add(QueryParseCache *cache, int idx)
+{
+ QueryParseEntry *entries = PARSE_CACHE_ENTRIES(cache);
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ entries[idx].lru_prev = invalid;
+ entries[idx].lru_next = cache->lru_head;
+
+ if (cache->lru_head != invalid)
+ entries[cache->lru_head].lru_prev = idx;
+
+ cache->lru_head = idx;
+
+ if (cache->lru_tail == invalid)
+ cache->lru_tail = idx;
+}
+
+/*
+ * Remove entry from LRU list
+ */
+static void
+parse_cache_lru_remove(QueryParseCache *cache, int idx)
+{
+ QueryParseEntry *entries = PARSE_CACHE_ENTRIES(cache);
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ if (entries[idx].lru_prev != invalid)
+ entries[entries[idx].lru_prev].lru_next =
+ entries[idx].lru_next;
+ else
+ cache->lru_head = entries[idx].lru_next;
+
+ if (entries[idx].lru_next != invalid)
+ entries[entries[idx].lru_next].lru_prev =
+ entries[idx].lru_prev;
+ else
+ cache->lru_tail = entries[idx].lru_prev;
+
+ entries[idx].lru_prev = invalid;
+ entries[idx].lru_next = invalid;
+}
+
+/*
+ * Allocate entry from free list, evicting LRU if needed
+ */
+static int
+parse_cache_alloc_entry(QueryParseCache *cache)
+{
+ QueryParseEntry *entries = PARSE_CACHE_ENTRIES(cache);
+ int *buckets = PARSE_CACHE_BUCKETS(cache);
+ int idx;
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ if (cache->free_list_head != invalid)
+ {
+ idx = cache->free_list_head;
+ cache->free_list_head = entries[idx].next;
+ entries[idx].in_use = true;
+ entries[idx].next = invalid;
+ cache->num_entries++;
+ return idx;
+ }
+
+ /* No free entries - evict LRU */
+ if (cache->lru_tail == invalid)
+ return invalid;
+
+ idx = cache->lru_tail;
+
+ /* Remove from hash bucket */
+ {
+ int bucket;
+ int *prev_ptr;
+ int curr;
+
+ bucket = entries[idx].query_hash %
+ cache->num_buckets;
+ prev_ptr = &buckets[bucket];
+ curr = buckets[bucket];
+
+ while (curr != invalid)
+ {
+ if (curr == idx)
+ {
+ *prev_ptr = entries[curr].next;
+ break;
+ }
+ prev_ptr = &entries[curr].next;
+ curr = entries[curr].next;
+ }
+ }
+
+ /* Remove from LRU list */
+ parse_cache_lru_remove(cache, idx);
+
+ /* Reinitialize entry */
+ entries[idx].in_use = true;
+ entries[idx].next = invalid;
+
+ return idx;
+}
+
+/*
+ * Look up a query in the parse cache
+ */
+static int
+parse_cache_lookup(QueryParseCache *cache, uint64 hash)
+{
+ int *buckets = PARSE_CACHE_BUCKETS(cache);
+ QueryParseEntry *entries = PARSE_CACHE_ENTRIES(cache);
+ int bucket = hash % cache->num_buckets;
+ int idx = buckets[bucket];
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ while (idx != invalid)
+ {
+ if (entries[idx].query_hash == hash)
+ return idx;
+ idx = entries[idx].next;
+ }
+
+ return invalid;
+}
+
+/* ----------------
+ * Query normalization
+ * ----------------
+ */
+
+/*
+ * Simple query normalization:
+ * - Strip comments (-- and C-style block comments)
+ * - Collapse whitespace
+ * - Convert to lowercase outside quoted text
+ * - Elide quoted text and digit runs (mostly as '$')
+ */
+static size_t
+normalize_query(const char *query, char *output,
+ size_t output_size)
+{
+ const char *src = query;
+ char *dst = output;
+ char *dst_end = output + output_size - 1;
+ bool in_string = false;
+ char string_char = 0;
+ bool last_was_space = true;
+
+ while (*src && dst < dst_end)
+ {
+ /* Handle string literals */
+ if (in_string)
+ {
+ if (*src == string_char)
+ {
+ if (*(src + 1) == string_char)
+ {
+ /* Escaped quote */
+ src += 2;
+ continue;
+ }
+ in_string = false;
+ /* Replace string with placeholder */
+ *dst++ = '$';
+ }
+ src++;
+ continue;
+ }
+
+ /* Check for string start */
+ if (*src == '\'' || *src == '"')
+ {
+ in_string = true;
+ string_char = *src;
+ src++;
+ continue;
+ }
+
+ /* Handle single-line comments */
+ if (*src == '-' && *(src + 1) == '-')
+ {
+ while (*src && *src != '\n')
+ src++;
+ continue;
+ }
+
+ /* Handle multi-line comments */
+ if (*src == '/' && *(src + 1) == '*')
+ {
+ src += 2;
+ while (*src &&
+ !(*src == '*' && *(src + 1) == '/'))
+ src++;
+ if (*src)
+ src += 2;
+ continue;
+ }
+
+ /* Handle whitespace */
+ if (*src == ' ' || *src == '\t' ||
+ *src == '\n' || *src == '\r')
+ {
+ if (!last_was_space)
+ {
+ *dst++ = ' ';
+ last_was_space = true;
+ }
+ src++;
+ continue;
+ }
+
+ /* Handle numbers - replace with placeholder */
+ if ((*src >= '0' && *src <= '9') ||
+ (*src == '.' && *(src + 1) >= '0' &&
+ *(src + 1) <= '9'))
+ {
+ while (*src &&
+ ((*src >= '0' && *src <= '9') ||
+ *src == '.'))
+ src++;
+ if (!last_was_space &&
+ dst > output && *(dst - 1) != '$')
+ *dst++ = '$';
+ last_was_space = false;
+ continue;
+ }
+
+ /* Regular character - convert to lowercase */
+ if (*src >= 'A' && *src <= 'Z')
+ *dst++ = *src + 32;
+ else
+ *dst++ = *src;
+
+ last_was_space = false;
+ src++;
+ }
+
+ /* Remove trailing space */
+ if (dst > output && *(dst - 1) == ' ')
+ dst--;
+
+ *dst = '\0';
+ return dst - output;
+}
+
+/* ----------------
+ * Public API implementation
+ * ----------------
+ */
+
+/*
+ * Calculate the total shared memory size required
+ * for the track table mutation feature.
+ */
+Size
+pool_track_table_mutation_shmem_size(void)
+{
+ Size size = 0;
+ int tbl_bkt;
+ int tbl_sz;
+ int qry_bkt;
+ int qry_sz;
+
+ tbl_bkt = pool_config->track_table_mutation_table_buckets;
+ tbl_sz = pool_config->track_table_mutation_table_size;
+ qry_bkt = pool_config->track_table_mutation_query_buckets;
+ qry_sz = pool_config->track_table_mutation_query_parse_cache_size;
+
+ /* Main structure */
+ size += sizeof(TrackTableMutationShmem);
+
+ /* Table mutation hash table */
+ size += sizeof(TrackTableMutationHashTable);
+ size += tbl_bkt * sizeof(int);
+ size += tbl_sz * sizeof(TrackTableMutationEntry);
+
+ /* Parse cache */
+ size += sizeof(QueryParseCache);
+ size += qry_bkt * sizeof(int);
+ size += qry_sz * sizeof(QueryParseEntry);
+
+ return size;
+}
+
+/*
+ * Initialize shared memory structures for the
+ * track table mutation feature. Allocates and sets
+ * up the table map and parse cache in shared memory.
+ * Called once from pgpool main process at startup.
+ */
+void
+pool_track_table_mutation_init(void)
+{
+#ifndef POOL_PRIVATE
+ Size shmem_size;
+ char *shmem_ptr;
+ TrackTableMutationState *st;
+ int tbl_bkt;
+ int tbl_sz;
+ int qry_bkt;
+ int qry_sz;
+
+ if (pool_config->disable_load_balance_on_write !=
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "feature disabled")));
+ return;
+ }
+
+ tbl_bkt = pool_config->track_table_mutation_table_buckets;
+ tbl_sz = pool_config->track_table_mutation_table_size;
+ qry_bkt = pool_config->track_table_mutation_query_buckets;
+ qry_sz = pool_config->track_table_mutation_query_parse_cache_size;
+
+ shmem_size = pool_track_table_mutation_shmem_size();
+
+ /*
+ * Allocate from the main shared memory segment.
+ * Memory is zeroed by
+ * initialize_shared_memory_main_segment().
+ */
+ shmem_ptr = pool_shared_memory_segment_get_chunk(
+ shmem_size);
+ if (shmem_ptr == NULL)
+ {
+ ereport(ERROR,
+ (errmsg("track_table_mutation: "
+ "failed to allocate %zu bytes",
+ shmem_size)));
+ return;
+ }
+
+ /* Set up pointers within shared memory */
+ track_table_mutation_shmem =
+ (TrackTableMutationShmem *) shmem_ptr;
+ shmem_ptr += sizeof(TrackTableMutationShmem);
+
+ track_table_mutation_shmem->table_map =
+ (TrackTableMutationHashTable *) shmem_ptr;
+ shmem_ptr += sizeof(TrackTableMutationHashTable);
+ shmem_ptr += tbl_bkt * sizeof(int);
+ shmem_ptr += tbl_sz * sizeof(TrackTableMutationEntry);
+
+ track_table_mutation_shmem->query_cache =
+ (QueryParseCache *) shmem_ptr;
+
+ /* Initialize structures */
+ table_map_init(
+ track_table_mutation_shmem->table_map,
+ tbl_bkt, tbl_sz);
+
+ parse_cache_init(
+ track_table_mutation_shmem->query_cache,
+ qry_bkt, qry_sz);
+
+ /* Initialize global state */
+ st = &track_table_mutation_shmem->state;
+ st->initialized = true;
+ st->current_ttl_us = TRACK_TABLE_MUTATION_DEFAULT_TTL_US;
+ get_current_time(&st->ttl_last_updated);
+ get_current_time(&st->last_cleanup_time);
+ st->global_cold_start_until.tv_sec = 0;
+ st->global_cold_start_until.tv_usec = 0;
+ st->stats_queries_checked = 0;
+ st->stats_forced_primary = 0;
+ st->stats_allowed_replica = 0;
+
+ ereport(LOG,
+ (errmsg("track_table_mutation: "
+ "initialized with %zu bytes shmem",
+ shmem_size)));
+#endif
+}
+
+/*
+ * Initialize per-child process state.
+ * Records the process start time for cold start
+ * period tracking. Called when a child process starts.
+ */
+void
+pool_track_table_mutation_child_init(void)
+{
+ int dur;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return;
+
+ get_current_time(&process_start_time);
+ cold_start_initialized = true;
+ dur = pool_config->track_table_mutation_cold_start_duration;
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "child init, cold start %d ms",
+ dur)));
+}
+
+/*
+ * Check if the process is in cold start period.
+ * During cold start, all queries are routed to
+ * primary to avoid stale reads. Checks both
+ * per-process and global (watchdog) cold start.
+ */
+bool
+pool_track_table_mutation_in_cold_start(void)
+{
+ struct timeval now;
+ int64 elapsed_ms;
+ int dur;
+ TrackTableMutationState *st;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return false;
+
+ dur = pool_config->track_table_mutation_cold_start_duration;
+ if (dur <= 0)
+ return false;
+
+ get_current_time(&now);
+ st = &track_table_mutation_shmem->state;
+
+ /* Check watchdog-triggered global cold start */
+ if (st->global_cold_start_until.tv_sec != 0 &&
+ elapsed_us(&now,
+ &st->global_cold_start_until) > 0)
+ {
+ return true;
+ }
+
+ /* Check per-process cold start */
+ if (!cold_start_initialized)
+ return false;
+
+ elapsed_ms = elapsed_us(&process_start_time, &now) / 1000;
+
+ if (elapsed_ms < dur)
+ {
+ ereport(DEBUG2,
+ (errmsg("track_table_mutation: "
+ "cold start (%ld/%d ms)",
+ (long)elapsed_ms, dur)));
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Trigger a global cold start for all processes.
+ * Sets the cold start end time in shared memory.
+ * Called after watchdog leader change to force all
+ * queries to primary during the transition.
+ */
+void
+pool_track_table_mutation_trigger_global_cold_start(void)
+{
+ struct timeval now;
+ struct timeval *until;
+ int dur;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return;
+
+ dur = pool_config->track_table_mutation_cold_start_duration;
+ if (dur <= 0)
+ return;
+
+ get_current_time(&now);
+ until = &track_table_mutation_shmem->state
+ .global_cold_start_until;
+ *until = now;
+ until->tv_sec += dur / 1000;
+ until->tv_usec += (dur % 1000) * 1000;
+ if (until->tv_usec >= 1000000)
+ {
+ until->tv_sec += until->tv_usec / 1000000;
+ until->tv_usec %= 1000000;
+ }
+
+ ereport(LOG,
+ (errmsg("track_table_mutation: "
+ "global cold start for %d ms",
+ dur)));
+}
+
+/*
+ * Check if a table was recently written (is "stale").
+ * Returns true if reads should go to primary because
+ * the table was written within the current TTL window.
+ */
+bool
+pool_track_table_mutation_table_is_stale(
+ int table_oid, int dboid)
+{
+ TrackTableMutationHashTable *map;
+ struct timeval now;
+ uint64 ttl_us;
+ uint32 hash;
+ int idx;
+ bool is_stale = false;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return false;
+
+ if (table_oid <= 0 || dboid <= 0)
+ {
+ is_stale = true;
+ goto update_stats;
+ }
+
+ map = track_table_mutation_shmem->table_map;
+ hash = fnv1a_hash_table_key(table_oid, dboid);
+
+ table_map_lock();
+
+ idx = table_map_lookup(map, table_oid, dboid, hash);
+ if (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ TrackTableMutationEntry *entries;
+ int64 age;
+ int64 total_age;
+ int64 max_stale_us;
+
+ entries = TABLE_MAP_ENTRIES(map);
+ get_current_time(&now);
+ ttl_us = track_table_mutation_shmem->state
+ .current_ttl_us;
+
+ age = elapsed_us(
+ &entries[idx].last_write_time, &now);
+ is_stale = (age < (int64)ttl_us);
+
+ /*
+ * Enforce max_staleness hard cap: no entry
+ * can force primary routing longer than
+ * max_staleness from its first write.
+ */
+ if (is_stale)
+ {
+ max_stale_us = (int64)pool_config
+ ->track_table_mutation_max_staleness
+ * 1000LL;
+ if (max_stale_us > 0)
+ {
+ total_age = elapsed_us(
+ &entries[idx].first_write_time,
+ &now);
+ if (total_age >= max_stale_us)
+ is_stale = false;
+ }
+ }
+
+ ereport(DEBUG2,
+ (errmsg("track_table_mutation: "
+ "oid %d dboid %d "
+ "elapsed=%ld ttl=%lu stale=%d",
+ table_oid, dboid,
+ (long)age,
+ (unsigned long)ttl_us,
+ is_stale)));
+ }
+
+ table_map_unlock();
+
+update_stats:
+ /* Update statistics under the table map semaphore */
+ if (track_table_mutation_shmem != NULL)
+ {
+ TrackTableMutationState *st;
+
+ table_map_lock();
+ st = &track_table_mutation_shmem->state;
+ st->stats_queries_checked++;
+ if (is_stale)
+ st->stats_forced_primary++;
+ else
+ st->stats_allowed_replica++;
+ table_map_unlock();
+ }
+
+ return is_stale;
+}
+
+/*
+ * Mark multiple tables as recently written.
+ * Called after DML queries complete to record
+ * which tables were modified.
+ */
+void
+pool_track_table_mutation_mark_tables_written(
+ const int *table_oids, int num_tables, int dboid)
+{
+ TrackTableMutationHashTable *map;
+ TrackTableMutationState *st;
+ struct timeval now;
+ int i;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return;
+
+ if (num_tables <= 0 || table_oids == NULL ||
+ dboid <= 0)
+ return;
+
+ map = track_table_mutation_shmem->table_map;
+ st = &track_table_mutation_shmem->state;
+ get_current_time(&now);
+
+ table_map_lock();
+
+ /* Clean up expired entries once over 3/4 full */
+ if (map->num_entries > map->max_entries * 3 / 4)
+ {
+ int64 since_cleanup;
+
+ since_cleanup = elapsed_us(
+ &st->last_cleanup_time, &now);
+ /* 100ms interval */
+ if (since_cleanup > 100000)
+ {
+ table_map_cleanup_expired(
+ map, st->current_ttl_us);
+ st->last_cleanup_time = now;
+ }
+ }
+
+ for (i = 0; i < num_tables; i++)
+ {
+ uint32 hash;
+ int table_oid = table_oids[i];
+
+ if (table_oid > 0)
+ {
+ hash = fnv1a_hash_table_key(
+ table_oid, dboid);
+ table_map_insert(map, table_oid,
+ dboid, hash, &now);
+ }
+ }
+
+ table_map_unlock();
+}
+
+/*
+ * Mark a single table as recently written.
+ */
+void
+pool_track_table_mutation_mark_table_written(
+ int table_oid, int dboid)
+{
+ if (table_oid > 0 && dboid > 0)
+ {
+ const int tables[1] = { table_oid };
+
+ pool_track_table_mutation_mark_tables_written(
+ tables, 1, dboid);
+ }
+}
+
+/*
+ * Update the staleness TTL based on observed
+ * replication delay. New TTL = delay * factor,
+ * clamped to [default_ttl, 1 hour].
+ */
+void
+pool_track_table_mutation_update_ttl(uint64 delay_us)
+{
+ uint64 new_ttl;
+ double factor;
+ TrackTableMutationState *st;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return;
+
+ factor = pool_config->track_table_mutation_ttl_factor;
+ new_ttl = (uint64)(delay_us * factor);
+ if (new_ttl < TRACK_TABLE_MUTATION_DEFAULT_TTL_US)
+ new_ttl = TRACK_TABLE_MUTATION_DEFAULT_TTL_US;
+
+ /* Maximum TTL of 1 hour */
+ if (new_ttl > 3600ULL * 1000000ULL)
+ new_ttl = 3600ULL * 1000000ULL;
+
+ st = &track_table_mutation_shmem->state;
+ st->current_ttl_us = new_ttl;
+ get_current_time(&st->ttl_last_updated);
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "TTL=%lu us (delay=%lu factor=%.1f)",
+ (unsigned long)new_ttl,
+ (unsigned long)delay_us,
+ factor)));
+}
+
+/*
+ * Look up a cached parse result by query hash.
+ * Returns true and fills output parameters if
+ * the query was found in the parse cache.
+ */
+bool
+pool_track_table_mutation_get_cached_parse(
+ uint64 hash, bool *is_write,
+ char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
+ int *num_tables)
+{
+ QueryParseCache *cache;
+ int idx;
+ bool found = false;
+ int max_tables;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return false;
+
+ max_tables = TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY;
+ cache = track_table_mutation_shmem->query_cache;
+
+ parse_cache_lock();
+
+ idx = parse_cache_lookup(cache, hash);
+ if (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ QueryParseEntry *entries;
+ int i;
+ int namelen;
+
+ entries = PARSE_CACHE_ENTRIES(cache);
+ namelen = TRACK_TABLE_MUTATION_TABLE_NAME_LEN;
+ *is_write = entries[idx].is_write;
+ *num_tables = entries[idx].num_tables;
+
+ for (i = 0;
+ i < entries[idx].num_tables &&
+ i < max_tables;
+ i++)
+ {
+ strlcpy(table_names[i],
+ entries[idx].table_names[i],
+ namelen);
+ }
+
+ /* Move to front of LRU */
+ parse_cache_lru_touch(cache, idx);
+ found = true;
+ }
+
+ parse_cache_unlock();
+
+ return found;
+}
+
+/*
+ * Store a parse result in the shared cache.
+ * Evicts the LRU entry if the cache is full.
+ */
+void
+pool_track_table_mutation_cache_parse(
+ uint64 hash, bool is_write,
+ const char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
+ int num_tables)
+{
+ QueryParseCache *cache;
+ int *buckets;
+ QueryParseEntry *entries;
+ int idx;
+ int bucket;
+ int max_tables;
+ int namelen;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return;
+
+ max_tables = TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY;
+ namelen = TRACK_TABLE_MUTATION_TABLE_NAME_LEN;
+ cache = track_table_mutation_shmem->query_cache;
+
+ parse_cache_lock();
+
+ /* Check if already exists */
+ idx = parse_cache_lookup(cache, hash);
+ if (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ parse_cache_unlock();
+ return;
+ }
+
+ /* Allocate new entry (may evict LRU) */
+ idx = parse_cache_alloc_entry(cache);
+ if (idx == TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ parse_cache_unlock();
+ ereport(WARNING,
+ (errmsg("track_table_mutation: "
+ "parse cache alloc failed")));
+ return;
+ }
+
+ entries = PARSE_CACHE_ENTRIES(cache);
+ buckets = PARSE_CACHE_BUCKETS(cache);
+
+ /* Fill in entry */
+ entries[idx].query_hash = hash;
+ entries[idx].is_write = is_write;
+ entries[idx].num_tables =
+ (num_tables > max_tables) ?
+ max_tables : num_tables;
+
+ {
+ int i;
+
+ for (i = 0; i < entries[idx].num_tables; i++)
+ {
+ strlcpy(entries[idx].table_names[i],
+ table_names[i], namelen);
+ }
+ }
+
+ /* Insert into hash bucket */
+ bucket = hash % cache->num_buckets;
+ entries[idx].next = buckets[bucket];
+ buckets[bucket] = idx;
+
+ /* Add to LRU list */
+ parse_cache_lru_add(cache, idx);
+
+ parse_cache_unlock();
+}
+
+/*
+ * Normalize a SQL query and compute its 64-bit hash.
+ * Strips comments, collapses whitespace, lowercases,
+ * and replaces literals with placeholders.
+ */
+uint64
+pool_track_table_mutation_normalize_and_hash(
+ const char *query)
+{
+ char normalized[8192];
+ size_t len;
+
+ if (query == NULL || query[0] == '\0')
+ return 0;
+
+ len = normalize_query(query, normalized,
+ sizeof(normalized));
+ if (len == 0)
+ return 0;
+
+ return fnv1a_hash_64(normalized, len);
+}
--
2.53.0
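
The patch above calls normalize_query() and fnv1a_hash_64() but neither body appears in this part of the diff. As a rough illustration only, the sketch below shows one way such a pair could look. The names fnv1a_64, normalize_simple and normalize_and_hash_simple are hypothetical and are not taken from the patch. The normalizer is deliberately simplified: it only lowercases and collapses whitespace, and does not strip comments or replace literals with placeholders as the patch comment describes. The hash uses the standard 64-bit FNV-1a offset basis and prime.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit FNV-1a over a byte buffer (standard offset basis and prime). */
static uint64_t
fnv1a_64(const char *data, size_t len)
{
	uint64_t	h = UINT64_C(0xcbf29ce484222325);
	size_t		i;

	assert(data != NULL || len == 0);
	for (i = 0; i < len; i++)
	{
		h ^= (unsigned char) data[i];
		h *= UINT64_C(0x100000001b3);
	}
	return h;
}

/*
 * Simplified normalizer (assumption, not the patch's normalize_query):
 * lowercases, collapses whitespace runs to one space, trims both ends.
 * Returns the output length, or 0 if nothing fits or nothing remains.
 */
static size_t
normalize_simple(const char *in, char *out, size_t outsz)
{
	size_t		n = 0;
	int			pending_space = 0;

	if (outsz == 0)
		return 0;
	for (; *in != '\0'; in++)
	{
		unsigned char c = (unsigned char) *in;

		if (isspace(c))
		{
			pending_space = (n > 0);	/* drop leading whitespace */
			continue;
		}
		if (pending_space)
		{
			if (n + 1 >= outsz)
				return 0;
			out[n++] = ' ';
			pending_space = 0;
		}
		if (n + 1 >= outsz)
			return 0;
		out[n++] = (char) tolower(c);
	}
	out[n] = '\0';
	return n;
}

/* Mirrors the shape of the patch's wrapper: 0 means "no usable hash". */
static uint64_t
normalize_and_hash_simple(const char *query)
{
	char		buf[8192];
	size_t		len;

	if (query == NULL || query[0] == '\0')
		return 0;
	len = normalize_simple(query, buf, sizeof(buf));
	if (len == 0)
		return 0;
	return fnv1a_64(buf, len);
}
```

With this sketch, "SELECT  *  FROM t1" and "select * from t1" hash to the same value, which is what lets a parse cache keyed on the hash reuse one entry for differently formatted copies of the same query. Because 0 doubles as the "no hash" return value, a caller would presumably skip the cache when it sees 0.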
* Re: Proposal: Recent mutated table tracking in memory
2026-01-06 11:25 Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-14 08:55 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-28 05:08 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-29 08:28 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-29 08:54 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-30 08:09 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-31 17:11 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-03 07:43 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-03 23:23 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-06 11:29 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-10 15:16 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-11 10:28 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-12 09:05 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-18 23:51 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-19 04:40 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-19 11:05 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-26 07:47 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-26 15:26 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-03-09 05:18 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-03-09 09:22 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-03-23 05:13 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-03-23 13:07 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
@ 2026-04-07 00:08 ` Tatsuo Ishii <[email protected]>
2026-04-07 05:45 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
0 siblings, 1 reply; 44+ messages in thread
From: Tatsuo Ishii @ 2026-04-07 00:08 UTC (permalink / raw)
To: [email protected]; +Cc: [email protected]
Hi Nadav,
> Hi Tatsuo,
>
> Thank you for the thorough review and the fix to the tests!. Here's the
> updated patch addressing all your comments.
>
> re - replication_delay_source_cmd requirement
>
> Good catch ― the feature now also works when `delay_threshold_by_time > 0`.
> I've added the TTL update call to `check_replication_time_lag()` (the
> pg_stat_replication path), not just
> `check_replication_time_lag_with_cmd()`. The docs are updated to reflect
> that either `replication_delay_source_cmd` or `delay_threshold_by_time` can
> provide the time-based delay.
>
> re - Documentation compile error
>
> Fixed ― the xref was pointing to `runtime-config-track-table-mutation` but
> the actual section ID is `runtime-config-table-mutation-map`.
>
> Thanks again and looking forward to hearing back from you.
While looking into your patch, I noticed that the following part would be
better committed as a separate patch, since it is actually a bug fix
unrelated to the feature: the query cache is not invalidated when a MERGE
statement is executed. I will take care of it and get back to you.
--- a/src/query_cache/pool_memqcache.c
+++ b/src/query_cache/pool_memqcache.c
@@ -1305,6 +1305,12 @@ pool_extract_table_oids(Node *node, int **oidsp)
}
return num_oids;
}
+ else if (IsA(node, MergeStmt))
+ {
+ MergeStmt *stmt = (MergeStmt *) node;
+
+ table = make_table_name_from_rangevar(stmt->relation);
+ }
else if (IsA(node, ExplainStmt))
{
ListCell *cell;
Thanks for letting me know about the bug!
Regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
* Re: Proposal: Recent mutated table tracking in memory
From: Nadav Shatz @ 2026-04-07 05:45 UTC (permalink / raw)
To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]
Hi Tatsuo,
Yes I ran into it during the work on the feature. Let me know if you want
me to separately submit it.
Cheers
Nadav Shatz
Tailor Brands | CTO
* Re: Proposal: Recent mutated table tracking in memory
From: Tatsuo Ishii @ 2026-04-07 09:10 UTC (permalink / raw)
To: [email protected]; +Cc: [email protected]
Hi Nadav,
> Hi Tatsuo,
>
> Yes I ran into it during the work on the feature. Let me know if you want
> me to separately submit it.
Thank you for the offer, but I have already pushed that part.
https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=24755985692be577bdcf487ebddb2c2ff6116661
Regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
* Re: Proposal: Recent mutated table tracking in memory
From: Tatsuo Ishii @ 2026-04-07 09:43 UTC (permalink / raw)
To: [email protected]; +Cc: [email protected]
Hi Nadav,
>> Yes I ran into it during the work on the feature. Let me know if you want
>> me to separately submit it.
>
> Thank you for the offer, but I have already pushed that part.
>
> https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=24755985692be577bdcf487ebddb2c2ff6116661
I have modified your patch by just running pgindent (plus a small
addition to typedefs.list). I have not done a detailed code review yet.
I also wrote a commit message that tries to summarize the feature.
Please let me know of any corrections or enhancements.
Based on this, I will start a more detailed review. It will take a
while.
Regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
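
The routing rule summarized in the commit message and documentation of the attached patch (cold start, TTL = replication delay * track_table_mutation_ttl_factor, and the track_table_mutation_max_staleness cap) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the patch's code: the TableMark and TrackConfig structs and all function names are hypothetical, times are in microseconds, and measuring the TTL from the most recent write is an assumption about how the patch applies it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-table tracking record (field names are illustrative). */
typedef struct
{
	int64_t		first_marked_us;	/* when the entry was first marked stale */
	int64_t		last_write_us;		/* most recent write to the table */
} TableMark;

/* Hypothetical view of the tuning knobs named in the commit message. */
typedef struct
{
	double		ttl_factor;			/* track_table_mutation_ttl_factor */
	int64_t		max_staleness_ms;	/* track_table_mutation_max_staleness, 0 = no cap */
	int64_t		cold_start_ms;		/* track_table_mutation_cold_start_duration */
} TrackConfig;

/* TTL = replication_delay * ttl_factor, as stated in the documentation. */
static int64_t
ttl_us(int64_t replication_delay_us, double factor)
{
	return (int64_t) ((double) replication_delay_us * factor);
}

/* True while reads of this table should avoid the standbys. */
static bool
table_is_stale(const TableMark *m, const TrackConfig *cfg,
			   int64_t delay_us, int64_t now_us)
{
	assert(cfg != NULL);
	if (m == NULL)
		return false;			/* table is not tracked */

	/* Cap reached: the entry expires regardless of recent writes. */
	if (cfg->max_staleness_ms > 0 &&
		now_us - m->first_marked_us >= cfg->max_staleness_ms * 1000)
		return false;

	/* Assumption: the TTL window is measured from the last write. */
	return now_us - m->last_write_us < ttl_us(delay_us, cfg->ttl_factor);
}

/* Routing decision for one SELECT that touches a single table. */
static bool
route_to_primary(const TableMark *m, const TrackConfig *cfg,
				 int64_t delay_us, int64_t child_start_us, int64_t now_us)
{
	assert(cfg != NULL);

	/* Cold start: send everything to primary right after child start. */
	if (cfg->cold_start_ms > 0 &&
		now_us - child_start_us < cfg->cold_start_ms * 1000)
		return true;

	return table_is_stale(m, cfg, delay_us, now_us);
}
```

For example, with the default factor of 5.0 and a measured delay of 100 ms, a table written at t = 10.0 s keeps its SELECTs on the primary until t = 10.5 s in this sketch. A query touching several tables would go to the primary if any one of them is stale.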
Attachments:
[application/octet-stream] v1-0001-Feature-load-balancing-control-by-table-tracking.patch (108.0K, 2-v1-0001-Feature-load-balancing-control-by-table-tracking.patch)
download | inline diff:
From 9006612e720a031f8de93bbe7f0d314061dbd28b Mon Sep 17 00:00:00 2001
From: Tatsuo Ishii <[email protected]>
Date: Tue, 7 Apr 2026 18:27:43 +0900
Subject: [PATCH v1] Feature: load balancing control by table tracking.
Prevent routing of read-only queries to a standby while the tables
used in the query may still be stale because of the replication delay
collected by the streaming replication process. To enable this feature,
set disable_load_balance_on_write to dml_adaptive_global.
In this mode, when tables are modified by
INSERT/UPDATE/DELETE/TRUNCATE/MERGE or a data-modifying WITH, SELECTs
using those tables are not load balanced for a certain period;
i.e., they are routed to the primary PostgreSQL server to avoid stale
data caused by replication delay.
Unlike dml_adaptive mode, the table modifications described above are
detected even if they happen in other sessions (in dml_adaptive, table
modifications are only detected in the same transaction). Note,
however, that you cannot use dml_adaptive_object_relationship_list to
track dependencies between tables and other objects.
Besides dml_adaptive_global, there are some tuning knobs for the
feature:
- track_table_mutation_ttl_factor
  Parameter to calculate TTL of each tracking data.
- track_table_mutation_max_staleness
  Maximum duration in milliseconds that a single table entry can
  continuously force queries to primary.
- track_table_mutation_cold_start_duration
  Duration in milliseconds to route all queries to primary after a
  child process starts.
- track_table_mutation_table_buckets
  Number of hash buckets for the track table mutation hash table.
- track_table_mutation_table_size
  Maximum number of tables that can be tracked simultaneously in the
  track table mutation.
- track_table_mutation_query_buckets
  Number of hash buckets for the query parse cache.
- track_table_mutation_query_parse_cache_size
  Maximum number of query parse results to cache.
Author: Nadav Shatz <[email protected]>
Reviewed-by: Tatsuo Ishii <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/20260407.181009.1762204033074164841.ishii%40postgresql.org#58c139c1a7f8d5562865921d0733667b
---
doc/src/sgml/loadbalance.sgml | 334 ++++
src/Makefile.am | 1 +
src/config/pool_config_variables.c | 90 +
src/context/pool_query_context.c | 235 ++-
src/context/pool_session_context.c | 15 +-
src/include/pool.h | 4 +-
src/include/pool_config.h | 28 +-
src/include/utils/pool_track_table_mutation.h | 247 +++
src/main/pgpool_main.c | 29 +-
src/protocol/CommandComplete.c | 28 +
src/protocol/child.c | 8 +
src/protocol/pool_proto_modules.c | 6 +-
src/sample/pgpool.conf.sample-stream | 56 +
src/streaming_replication/pool_worker_child.c | 24 +
src/test/regression/libs.sh | 2 +
.../tests/042.track_table_mutation/test.sh | 354 ++++
.../043.track_table_mutation_watchdog/test.sh | 184 +++
src/tools/pgindent/typedefs.list | 6 +
src/utils/pool_track_table_mutation.c | 1450 +++++++++++++++++
19 files changed, 3080 insertions(+), 21 deletions(-)
create mode 100644 src/include/utils/pool_track_table_mutation.h
create mode 100755 src/test/regression/tests/042.track_table_mutation/test.sh
create mode 100755 src/test/regression/tests/043.track_table_mutation_watchdog/test.sh
create mode 100644 src/utils/pool_track_table_mutation.c
diff --git a/doc/src/sgml/loadbalance.sgml b/doc/src/sgml/loadbalance.sgml
index 9e1e7b39b..7384ce81a 100644
--- a/doc/src/sgml/loadbalance.sgml
+++ b/doc/src/sgml/loadbalance.sgml
@@ -1110,6 +1110,18 @@ app_name_redirect_preference_list > database_redirect_preference_list > us
Dependent functions, triggers, and views on the tables can be configured
using <xref linkend="guc-dml-adaptive-object-relationship-list">
</para>
+
+ <para>
+ If this parameter is set to <varname>dml_adaptive_global</varname>,
+ <productname>Pgpool-II</> behaves like <varname>dml_adaptive</varname>
+ (per-transaction write tracking) and additionally uses shared memory to track
+ recently written tables across all sessions cluster-wide. When a table is
+ written in any session, subsequent reads of that table from any session are
+ routed to primary until a TTL (based on measured replication delay) expires.
+ This prevents stale reads after writes even across different connections.
+ See <xref linkend="runtime-config-table-mutation-map"> for the sub-parameters
+ that control the shared-memory tracking behavior.
+ </para>
</listitem>
</varlistentry>
@@ -1195,4 +1207,326 @@ dml_adaptive_object_relationship_list = 'table_1:table_2'
</variablelist>
</sect2>
+
+ <sect2 id="runtime-config-table-mutation-map">
+ <title>Table Mutation Map Configuration (Lagless Replica Reads)</title>
+
+ <para>
+ These parameters configure the track table mutation feature, which is activated by setting
+ <xref linkend="guc-disable-load-balance-on-write"> to <literal>dml_adaptive_global</literal>.
+ The feature tracks recently written tables to prevent stale reads from replica nodes during
+ replication lag, implementing the "lagless" architecture pattern for distributed systems
+ with read replicas.
+ </para>
+
+ <para>
+ When a table is modified (INSERT/UPDATE/DELETE/TRUNCATE/MERGE), it is marked as "stale" for a TTL period
+ (<literal>replication_delay * track_table_mutation_ttl_factor</literal>). Any SELECT queries on stale tables are routed
+ to the primary node instead of replicas, ensuring read-after-write consistency.
+ </para>
+
+ <para>
+ This feature requires time-based replication delay monitoring. This can be provided by either
+ <xref linkend="guc-replication-delay-source-cmd"> (external command mode) or by setting
+ <xref linkend="guc-delay-threshold-by-time"> (which uses <literal>pg_stat_replication.replay_lag</literal>
+ from PostgreSQL 10+). At least one of these must be configured for the TTL calculation to work.
+ </para>
+
+ <warning>
+ <para>
+ Enabling <literal>dml_adaptive_global</literal> increases shared memory consumption. With default settings,
+ the feature requires approximately 6.4 MB of shared memory (0.1 MB for table tracking + 6.3 MB for query cache).
+ Memory usage scales with configuration parameters:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ Table tracking: <literal>track_table_mutation_table_size * 40 bytes</literal> (default: 2048 * 40 = ~80 KB)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Query cache: <literal>track_table_mutation_query_parse_cache_size * 640 bytes</literal> (default: 10000 * 640 = ~6.3 MB)
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ For high-traffic systems with large cache sizes (e.g., <literal>track_table_mutation_query_parse_cache_size = 100000</literal>),
+ memory usage can reach 64 MB or more. Consider your system's available shared memory when using <literal>dml_adaptive_global</literal>.
+ </para>
+ </warning>
+
+ <variablelist>
+
+ <varlistentry id="guc-track-table-mutation-ttl-factor" xreflabel="track_table_mutation_ttl_factor">
+ <term><varname>track_table_mutation_ttl_factor</varname> (<type>floating point</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_ttl_factor</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Multiplier for calculating the TTL: <literal>TTL = replication_delay * track_table_mutation_ttl_factor</literal>.
+ Higher values provide more safety margin but may reduce read replica utilization.
+ </para>
+ <para>
+ Valid range: 1.0-100.0. Default is <literal>5.0</literal>.
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-max-staleness" xreflabel="track_table_mutation_max_staleness">
+ <term><varname>track_table_mutation_max_staleness</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_max_staleness</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Maximum duration in milliseconds that a single table entry can continuously force queries to primary,
+ measured from when the table was first marked stale. When this cap is reached, the entry is expired
+ regardless of recent writes. If the table is written to again after expiry, a fresh tracking entry
+ is created.
+ </para>
+ <para>
+ This parameter bounds the cross-session impact of table mutation tracking. Even if a table is written
+ to in a tight loop, its effect on other sessions' load balancing is limited to this duration. For
+ legitimately busy tables, the gap between forced expiry and the next write re-marking the table is
+ negligible (typically milliseconds).
+ </para>
+ <para>
+ Set to 0 to disable the cap (not recommended for production).
+ Valid range: 0-3600000 ms. Default is <literal>60000</literal> (60 seconds).
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-cold-start-duration" xreflabel="track_table_mutation_cold_start_duration">
+ <term><varname>track_table_mutation_cold_start_duration</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_cold_start_duration</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Duration in milliseconds to route all queries to primary after a child process starts.
+ This prevents stale reads when a new connection is established before the track table mutation
+ is populated with recent write history.
+ </para>
+ <para>
+ When watchdog is enabled and the local node becomes the leader, Pgpool-II also triggers a
+ global cold start for this duration to avoid stale reads after leadership changes.
+ </para>
+ <para>
+ Valid range: 0-60000 ms. Default is <literal>2000</literal> (2 seconds).
+ Set to 0 to disable cold start behavior.
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-table-buckets" xreflabel="track_table_mutation_table_buckets">
+ <term><varname>track_table_mutation_table_buckets</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_table_buckets</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Number of hash buckets for the track table mutation hash table.
+ Higher values reduce hash collisions and improve lookup performance.
+ </para>
+ <para>
+ Valid range: 64-65536. Default is <literal>1024</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-table-size" xreflabel="track_table_mutation_table_size">
+ <term><varname>track_table_mutation_table_size</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_table_size</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Maximum number of tables that can be tracked simultaneously in the track table mutation.
+ When full, oldest entries are evicted using a simple eviction strategy.
+ </para>
+ <para>
+ Valid range: 128-131072. Default is <literal>2048</literal>.
+ Memory usage: approximately 40 bytes per entry.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-query-buckets" xreflabel="track_table_mutation_query_buckets">
+ <term><varname>track_table_mutation_query_buckets</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_query_buckets</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Number of hash buckets for the query parse cache. The cache stores normalized
+ query strings mapped to their table dependencies to avoid repeated parsing.
+ </para>
+ <para>
+ Valid range: 64-65536. Default is <literal>2048</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-query-parse-cache-size" xreflabel="track_table_mutation_query_parse_cache_size">
+ <term><varname>track_table_mutation_query_parse_cache_size</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_query_parse_cache_size</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Maximum number of query parse results to cache. Uses LRU eviction when full.
+ Larger caches reduce parsing overhead but consume more shared memory.
+ </para>
+ <para>
+ Valid range: 100-1000000. Default is <literal>10000</literal>.
+ Memory usage: approximately 640 bytes per entry (~6.3 MB for default, ~64 MB for 100000 entries).
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ <sect3 id="runtime-config-track-table-mutation-example">
+ <title>Track Table Mutation Configuration Example</title>
+ <para>
+ To enable track table mutation with replication delay monitoring:
+ </para>
+ <programlisting>
+# Enable dml_adaptive_global mode (includes track table mutation)
+disable_load_balance_on_write = 'dml_adaptive_global'
+track_table_mutation_ttl_factor = 5.0
+track_table_mutation_max_staleness = 60000
+track_table_mutation_cold_start_duration = 2000
+
+# Option A: Use external command for replication delay
+replication_delay_source_cmd = '/path/to/get-replication-delay.sh'
+replication_delay_source_timeout = 10
+
+# Option B: Use pg_stat_replication replay_lag (PG 10+)
+# delay_threshold_by_time = 1000
+
+# Adjust cache sizes based on workload (increases memory usage)
+track_table_mutation_table_size = 4096
+track_table_mutation_query_parse_cache_size = 50000
+ </programlisting>
+ <para>
+ Total shared memory required for above configuration: approximately 31.2 MB (31 MB query cache + 0.2 MB table map + overhead).
+ Default configuration (10000 query cache entries, 2048 tables) requires approximately 6.4 MB.
+ </para>
+ </sect3>
+
+ <sect3 id="runtime-config-track-table-mutation-limitations">
+ <title>Limitations</title>
+ <para>
+ The track table mutation feature has the following limitation:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>PREPARE</literal> statements are not tracked. When a prepared statement
+ containing data modification is executed, the table mutation is not recorded.
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ If your application uses prepared statements and requires read-after-write consistency,
+ consider using explicit transaction routing or the <literal>/*NO LOAD BALANCE*/</literal>
+ comment directive for affected queries.
+ </para>
+ <para>
+ The following statement types <emphasis>are</emphasis> tracked and will mark tables as stale:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>INSERT</literal>, <literal>UPDATE</literal>, <literal>DELETE</literal>
+ statements (including those with <literal>RETURNING</literal> clauses).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>TRUNCATE</literal> statements (including those that list multiple tables).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>MERGE</literal> statements (PostgreSQL 15+).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>WITH</literal> clauses containing data modifications (Common Table Expressions
+ with <literal>INSERT</literal>, <literal>UPDATE</literal>, or <literal>DELETE</literal>).
+ For example, <literal>WITH deleted AS (DELETE FROM t1 RETURNING *) SELECT * FROM deleted</literal>
+ will properly mark table <literal>t1</literal> as stale.
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ <emphasis>Transaction Rollback Behavior:</emphasis> Within explicit transactions, tables
+ are marked as stale in shared memory only when the transaction commits. If the
+ transaction is rolled back, no tables are marked, because the writes never become
+ visible on the primary or the replicas and no stale read is possible. This prevents
+ rolled-back transactions from unnecessarily disabling load balancing. For autocommit
+ statements (outside explicit transactions), tables are marked immediately upon command completion.
+ </para>
+
+ <para>
+ <emphasis>Cross-Session Impact and Safety Bounds:</emphasis>
+ Unlike <literal>dml_adaptive</literal> (which only affects the session that issued the write),
+ <literal>dml_adaptive_global</literal> affects all sessions reading the same table in the same database.
+ The following safety mechanisms bound this cross-session impact:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <emphasis>Maximum staleness cap:</emphasis> The <xref linkend="guc-track-table-mutation-max-staleness">
+ parameter (default: 60 seconds) limits how long any single table entry, measured from its first
+ write, can continuously force primary routing. Even under sustained writes, the entry expires after
+ this period and is only renewed by subsequent committed writes. Setting it to 0 disables the cap.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <emphasis>Database isolation:</emphasis> Table staleness tracking is scoped by database OID. Writes
+ in one database never affect load balancing decisions for sessions connected to a different database.
+ In multi-tenant deployments where tenants use separate databases, one tenant's write activity cannot
+ influence another tenant's query routing.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <emphasis>Committed writes only:</emphasis> Only committed transactions mark tables as stale.
+ Rolled-back transactions have no effect on the shared tracking state.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <emphasis>Bounded table map size:</emphasis> The shared memory table map has a fixed maximum size
+ (<xref linkend="guc-track-table-mutation-table-size">). At most this many tables can be marked stale
+ simultaneously, providing a natural ceiling on the feature's impact.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </sect3>
+
+ </sect2>
+
</sect1>
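The query cache sizing quoted in the documentation above can be checked with a few lines of arithmetic. This is only a rough sketch: the 640-byte per-entry figure and the 100-1000000 range are taken from the documentation text, while the names sketch_query_cache_bytes and SKETCH_QUERY_ENTRY_BYTES are hypothetical and do not exist in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Approximate per-entry size quoted in the documentation (an estimate). */
#define SKETCH_QUERY_ENTRY_BYTES 640

/*
 * Hypothetical helper: estimate the query parse cache footprint in bytes
 * for a given track_table_mutation_query_parse_cache_size value.
 */
static size_t
sketch_query_cache_bytes(size_t entries)
{
	/* Valid range documented for the parameter. */
	assert(entries >= 100 && entries <= 1000000);
	return entries * SKETCH_QUERY_ENTRY_BYTES;
}
```

With the default of 10000 entries this gives 6,400,000 bytes, and with 50000 entries 32,000,000 bytes. The table map and bucket arrays come on top of that.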
diff --git a/src/Makefile.am b/src/Makefile.am
index 4678ab530..39588af58 100644
--- a/src/Makefile.am
+++ b/src/Makefile.am
@@ -35,6 +35,7 @@ pgpool_SOURCES = main/main.c \
rewrite/pool_timestamp.c \
rewrite/pool_lobj.c \
utils/pool_select_walker.c \
+ utils/pool_track_table_mutation.c \
utils/strlcpy.c \
utils/psprintf.c \
utils/pool_params.c \
diff --git a/src/config/pool_config_variables.c b/src/config/pool_config_variables.c
index ce13c42f6..d5f4fb605 100644
--- a/src/config/pool_config_variables.c
+++ b/src/config/pool_config_variables.c
@@ -290,6 +290,7 @@ static const struct config_enum_entry disable_load_balance_on_write_options[] =
{"trans_transaction", DLBOW_TRANS_TRANSACTION, false},
{"always", DLBOW_ALWAYS, false},
{"dml_adaptive", DLBOW_DML_ADAPTIVE, false},
+ {"dml_adaptive_global", DLBOW_DML_ADAPTIVE_GLOBAL, false},
{NULL, 0, false}
};
@@ -1777,6 +1778,19 @@ static struct config_int_array ConfigureNamesIntArray[] =
static struct config_double ConfigureNamesDouble[] =
{
+ {
+ {"track_table_mutation_ttl_factor",
+ CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "TTL multiplier for track table mutation "
+ "(TTL = replication_delay * factor)",
+ CONFIG_VAR_TYPE_DOUBLE, false, 0
+ },
+ &g_pool_config.track_table_mutation_ttl_factor,
+ 5.0, /* boot value: 5x replication delay */
+ 1.0, 100.0, /* min, max */
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
EMPTY_CONFIG_DOUBLE
};
@@ -2397,6 +2411,81 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"track_table_mutation_max_staleness",
+ CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "Maximum duration in milliseconds that a "
+ "table can be marked stale from its first "
+ "write. 0 disables the cap.",
+ CONFIG_VAR_TYPE_INT, false, GUC_UNIT_MS
+ },
+ &g_pool_config.track_table_mutation_max_staleness,
+ 60000, /* 60 seconds */
+ 0, 3600000, /* 0 to 1 hour */
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_cold_start_duration",
+ CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "Duration in milliseconds to force queries "
+ "to primary after child process starts.",
+ CONFIG_VAR_TYPE_INT, false, GUC_UNIT_MS
+ },
+ &g_pool_config.track_table_mutation_cold_start_duration,
+ 2000, /* 2 seconds */
+ 0, 60000, /* 0 to 60 seconds */
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_table_buckets",
+ CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Number of hash buckets for track table mutation.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_table_buckets,
+ 1024,
+ 64, 65536,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_table_size",
+ CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Maximum number of entries in track table mutation.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_table_size,
+ 2048,
+ 128, 131072,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_query_buckets",
+ CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Number of hash buckets for query parse cache.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_query_buckets,
+ 2048,
+ 64, 65536,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_query_parse_cache_size",
+ CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Maximum number of entries in query parse cache.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_query_parse_cache_size,
+ 10000,
+ 100, 1000000,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
EMPTY_CONFIG_INT
};
@@ -4615,6 +4704,7 @@ static const char *
BackendFlagsShowFunc(int index)
{
unsigned short flag = g_pool_config.backend_desc->backend_info[index].flag;
+
return pool_flag_to_str(flag);
}
diff --git a/src/context/pool_query_context.c b/src/context/pool_query_context.c
index a056ac596..0190d3673 100644
--- a/src/context/pool_query_context.c
+++ b/src/context/pool_query_context.c
@@ -29,6 +29,7 @@
#include "utils/statistics.h"
#include "utils/pool_select_walker.h"
#include "utils/pool_stream.h"
+#include "utils/pool_track_table_mutation.h"
#include "context/pool_session_context.h"
#include "context/pool_query_context.h"
#include "parser/nodes.h"
@@ -1828,15 +1829,23 @@ is_in_list(char *name, List *list)
static bool
is_select_object_in_temp_write_list(Node *node, void *context)
{
- if (node == NULL || pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE)
+ if (node == NULL ||
+ !DLBOW_IS_DML_ADAPTIVE(
+ pool_config->disable_load_balance_on_write))
return false;
if (IsA(node, RangeVar))
{
RangeVar *rgv = (RangeVar *) node;
- POOL_SESSION_CONTEXT *session_context = pool_get_session_context(false);
+ POOL_SESSION_CONTEXT *session_context;
+ bool is_adaptive;
+
+ session_context = pool_get_session_context(false);
+ is_adaptive = DLBOW_IS_DML_ADAPTIVE(
+ pool_config->disable_load_balance_on_write);
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && session_context->is_in_transaction)
+ if (is_adaptive &&
+ session_context->is_in_transaction)
{
ereport(DEBUG1,
(errmsg("is_select_object_in_temp_write_list: \"%s\", found relation \"%s\"", (char *) context, rgv->relname)));
@@ -1880,7 +1889,13 @@ static char *get_associated_object_from_dml_adaptive_relations
void
check_object_relationship_list(char *name, bool is_func_name)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && pool_config->parsed_dml_adaptive_object_relationship_list)
+ bool is_adaptive;
+
+ is_adaptive = DLBOW_IS_DML_ADAPTIVE(
+ pool_config->disable_load_balance_on_write);
+
+ if (is_adaptive &&
+ pool_config->parsed_dml_adaptive_object_relationship_list)
{
POOL_SESSION_CONTEXT *session_context = pool_get_session_context(false);
@@ -1947,7 +1962,7 @@ add_object_into_temp_write_list(Node *node, void *context)
static void
dml_adaptive(Node *node, char *query)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE)
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write))
{
/* Set/Unset transaction status flags */
if (IsA(node, TransactionStmt))
@@ -1966,6 +1981,45 @@ dml_adaptive(Node *node, char *query)
}
else if (is_commit_or_rollback_query(node))
{
+ /*
+ * For dml_adaptive_global: on COMMIT, flush the accumulated
+ * table writes to shared memory. On ROLLBACK, skip -- the
+ * writes never committed so no stale-read risk exists. This
+ * prevents polluting the table map with rolled-back
+ * transactions.
+ */
+ int dlbow =
+ pool_config->disable_load_balance_on_write;
+ List *wlist =
+ session_context->transaction_temp_write_list;
+
+ if (dlbow == DLBOW_DML_ADAPTIVE_GLOBAL &&
+ is_commit_query(node) &&
+ wlist != NIL)
+ {
+ ListCell *cell;
+ int dboid;
+
+ dboid =
+ pool_track_table_mutation_get_database_oid();
+ if (dboid > 0)
+ {
+ foreach(cell, wlist)
+ {
+ char *tname;
+ int toid;
+
+ tname = (char *) lfirst(cell);
+ toid =
+ pool_table_name_to_oid(tname);
+
+ if (toid > 0)
+ pool_track_table_mutation_mark_table_written(
+ toid, dboid);
+ }
+ }
+ }
+
session_context->is_in_transaction = false;
if (session_context->transaction_temp_write_list != NIL)
@@ -2008,7 +2062,7 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
session_context = pool_get_session_context(false);
backend = session_context->backend;
- /*
+ /*
* Collect/discard information for disable_load_balance_on_write =
* dml_adaptive case.
*/
@@ -2022,6 +2076,20 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
if (dest == POOL_PRIMARY)
{
pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
+
+ /*
+ * Resolve table and database OIDs now to populate relcache. This
+ * avoids potential hangs in CommandComplete where we shouldn't be
+ * running new queries against the backend.
+ */
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ int *oids;
+
+ pool_extract_table_oids(node, &oids);
+ pool_track_table_mutation_get_database_oid();
+ }
}
/* Should be sent to both primary and standby? */
else if (dest == POOL_BOTH)
@@ -2151,6 +2219,153 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
{
pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
}
+
+ /*
+ * Check track table mutation for recently written tables. If
+ * in cold start or any table was recently written, route to
+ * primary to avoid stale reads.
+ */
+ else if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ bool force_primary = false;
+ int lb_node;
+ POOL_QUERY_CONTEXT *qctx =
+ session_context->query_context;
+
+ if (pool_track_table_mutation_in_cold_start())
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load balance"
+ " because of track table"
+ " mutation cold start"),
+ errdetail("destination = PRIMARY"
+ " for query= \"%s\"",
+ query)));
+ force_primary = true;
+ }
+ else
+ {
+ SelectContext ctx;
+ int dboid;
+ int num_oids;
+ int i;
+
+ memset(&ctx, 0, sizeof(ctx));
+ num_oids =
+ pool_extract_table_oids_from_select_stmt(
+ node, &ctx);
+ if (num_oids > 0)
+ {
+ dboid =
+ pool_track_table_mutation_get_database_oid();
+
+ if (dboid <= 0)
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load"
+ " balance because"
+ " database oid was"
+ " unavailable"),
+ errdetail("destination"
+ " = PRIMARY for"
+ " query= \"%s\"",
+ query)));
+ force_primary = true;
+ }
+ else
+ {
+ for (i = 0; i < num_oids; i++)
+ {
+ bool stale;
+
+ stale =
+ pool_track_table_mutation_table_is_stale(
+ ctx.table_oids[i],
+ dboid);
+ if (stale)
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load"
+ " balance because"
+ " table \"%s\" was"
+ " recently written",
+ ctx.table_names[i]),
+ errdetail("destination"
+ " = PRIMARY for"
+ " query= \"%s\"",
+ query)));
+ force_primary = true;
+ break;
+ }
+ }
+ }
+ }
+ }
+
+ if (force_primary)
+ {
+ pool_set_node_to_be_sent(
+ query_context,
+ PRIMARY_NODE_ID);
+ }
+ else
+ {
+ if (pool_config->statement_level_load_balance)
+ {
+ session_context->load_balance_node_id =
+ select_load_balancing_node();
+ }
+
+ /*
+ * If replication delay is too much, and
+ * prefer_lower_delay_standby is true then elect the
+ * lowest-delayed node, otherwise send to primary.
+ */
+ lb_node =
+ session_context->load_balance_node_id;
+ if (STREAM &&
+ check_replication_delay(lb_node))
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load"
+ " balance because of"
+ " too much replication"
+ " delay"),
+ errdetail("destination"
+ " = %d for"
+ " query= \"%s\"",
+ dest, query)));
+
+ if (pool_config->prefer_lower_delay_standby)
+ {
+ lb_node =
+ select_load_balancing_node();
+ session_context->load_balance_node_id =
+ lb_node;
+ qctx->load_balance_node_id =
+ lb_node;
+ pool_set_node_to_be_sent(
+ query_context,
+ lb_node);
+ }
+ else
+ {
+ pool_set_node_to_be_sent(
+ query_context,
+ PRIMARY_NODE_ID);
+ }
+ }
+ else
+ {
+ qctx->load_balance_node_id =
+ session_context->load_balance_node_id;
+ pool_set_node_to_be_sent(
+ query_context,
+ qctx->load_balance_node_id);
+ }
+ }
+ }
else
{
if (pool_config->statement_level_load_balance)
@@ -2171,7 +2386,8 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
errdetail("destination = %d for query= \"%s\"", dest, query)));
/*
- * If prefer_lower_delay_standby is on, choose lower delay standby.
+ * If prefer_lower_delay_standby is on, choose lower
+ * delay standby.
*/
if (pool_config->prefer_lower_delay_standby)
{
@@ -2181,7 +2397,8 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
session_context->query_context->load_balance_node_id = session_context->load_balance_node_id;
pool_set_node_to_be_sent(query_context, session_context->query_context->load_balance_node_id);
}
- else /* delay is too much. prefer to send to primary */
+ else /* delay is too much. prefer to send to
+ * primary */
{
pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
}
@@ -2191,7 +2408,7 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
* Not streaming replication mode, or delay_threshold is 0
* or replication delay is small enough.
*/
- else
+ else
{
session_context->query_context->load_balance_node_id = session_context->load_balance_node_id;
pool_set_node_to_be_sent(query_context,
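The commit-deferred marking in dml_adaptive() can be illustrated in isolation. The sketch below is not the patch's code: SketchSession, SketchSharedMap and the sketch_* functions are hypothetical stand-ins for the session's transaction_temp_write_list and the shared table map, and fixed-size arrays replace the real list and hash table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_PENDING 8
#define SKETCH_MAX_SHARED  16

/* Hypothetical per-session state: writes seen inside an open transaction. */
typedef struct SketchSession
{
	bool	in_transaction;
	int		pending[SKETCH_MAX_PENDING];
	size_t	num_pending;
} SketchSession;

/* Hypothetical stand-in for the shared-memory table map. */
typedef struct SketchSharedMap
{
	int		stale[SKETCH_MAX_SHARED];
	size_t	num_stale;
} SketchSharedMap;

/* Record a table as stale, ignoring duplicates and overflow. */
static void
sketch_mark(SketchSharedMap *map, int table_oid)
{
	size_t	i;

	for (i = 0; i < map->num_stale; i++)
		if (map->stale[i] == table_oid)
			return;
	if (map->num_stale < SKETCH_MAX_SHARED)
		map->stale[map->num_stale++] = table_oid;
}

/* A write completed: defer inside a transaction, mark at once otherwise. */
static void
sketch_on_write(SketchSession *s, SketchSharedMap *map, int table_oid)
{
	if (!s->in_transaction)
		sketch_mark(map, table_oid);
	else if (s->num_pending < SKETCH_MAX_PENDING)
		s->pending[s->num_pending++] = table_oid;
}

/* COMMIT flushes the pending writes; ROLLBACK discards them. */
static void
sketch_on_transaction_end(SketchSession *s, SketchSharedMap *map,
						  bool committed)
{
	size_t	i;

	if (committed)
		for (i = 0; i < s->num_pending; i++)
			sketch_mark(map, s->pending[i]);
	s->num_pending = 0;
	s->in_transaction = false;
}
```

The point of the split is that a rolled-back transaction leaves the shared map untouched, so other sessions keep load balancing reads of that table.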
diff --git a/src/context/pool_session_context.c b/src/context/pool_session_context.c
index ded41c7fc..05d0b635b 100644
--- a/src/context/pool_session_context.c
+++ b/src/context/pool_session_context.c
@@ -532,7 +532,7 @@ dump_sent_message(char *caller, POOL_SENT_MESSAGE *m)
static void
dml_adaptive_init(void)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE)
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write))
{
session_context->is_in_transaction = false;
session_context->transaction_temp_write_list = NIL;
@@ -542,7 +542,9 @@ dml_adaptive_init(void)
static void
dml_adaptive_destroy(void)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && session_context)
+ if (DLBOW_IS_DML_ADAPTIVE(
+ pool_config->disable_load_balance_on_write) &&
+ session_context)
{
if (session_context->transaction_temp_write_list != NIL)
list_free_deep(session_context->transaction_temp_write_list);
@@ -738,10 +740,13 @@ void
pool_set_writing_transaction(void)
{
/*
- * If disable_transaction_on_write is 'off' or 'dml_adaptive', then never
- * turn on writing transaction flag.
+ * If disable_load_balance_on_write is 'off' or 'dml_adaptive' or
+ * 'dml_adaptive_global', then never turn on writing transaction flag.
*/
- if (pool_config->disable_load_balance_on_write != DLBOW_OFF && pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE)
+ if (pool_config->disable_load_balance_on_write !=
+ DLBOW_OFF &&
+ !DLBOW_IS_DML_ADAPTIVE(
+ pool_config->disable_load_balance_on_write))
{
pool_get_session_context(false)->writing_transaction = true;
ereport(DEBUG5,
diff --git a/src/include/pool.h b/src/include/pool.h
index 65907dcf1..0e901691a 100644
--- a/src/include/pool.h
+++ b/src/include/pool.h
@@ -424,7 +424,7 @@ typedef enum
#define Min(x, y) ((x) < (y) ? (x) : (y))
-#define MAX_NUM_SEMAPHORES 8
+#define MAX_NUM_SEMAPHORES 10
#define CONN_COUNTER_SEM 0
#define REQUEST_INFO_SEM 1
#define QUERY_CACHE_STATS_SEM 2
@@ -434,6 +434,8 @@ typedef enum
#define FOLLOW_PRIMARY_SEM 6
#define MAIN_EXIT_HANDLER_SEM 7 /* used in exit_hander in pgpool main
* process */
+#define TRACK_TABLE_MUTATION_TABLE_SEM 8
+#define TRACK_TABLE_MUTATION_QUERY_SEM 9
#define MAX_REQUEST_QUEUE_SIZE 10
#define MAX_SEC_WAIT_FOR_CLUSTER_TRANSACTION 10 /* time in seconds to keep
diff --git a/src/include/pool_config.h b/src/include/pool_config.h
index 9a397d166..ae507dc60 100644
--- a/src/include/pool_config.h
+++ b/src/include/pool_config.h
@@ -105,9 +105,13 @@ typedef enum DLBOW_OPTION
DLBOW_TRANSACTION,
DLBOW_TRANS_TRANSACTION,
DLBOW_ALWAYS,
- DLBOW_DML_ADAPTIVE
+ DLBOW_DML_ADAPTIVE,
+ DLBOW_DML_ADAPTIVE_GLOBAL
} DLBOW_OPTION;
+#define DLBOW_IS_DML_ADAPTIVE(opt) \
+ ((opt) == DLBOW_DML_ADAPTIVE || (opt) == DLBOW_DML_ADAPTIVE_GLOBAL)
+
typedef enum RELQTARGET_OPTION
{
RELQTARGET_PRIMARY = 1,
@@ -363,8 +367,26 @@ typedef struct
char *sr_check_password; /* password for sr_check_user */
char *sr_check_database; /* PostgreSQL database name for streaming
* replication check */
- char *replication_delay_source_cmd; /* external command for replication delay */
- int replication_delay_source_timeout; /* timeout for external command in seconds */
+ char *replication_delay_source_cmd; /* external command for
+ * replication delay */
+ int replication_delay_source_timeout; /* timeout for external
+ * command in seconds */
+
+ /* Track table mutation configuration */
+ double track_table_mutation_ttl_factor; /* TTL multiplier for
+ * replication delay */
+ int track_table_mutation_max_staleness; /* max staleness duration
+ * ms */
+ int track_table_mutation_cold_start_duration; /* cold start duration
+ * ms */
+ int track_table_mutation_table_buckets; /* hash buckets for table
+ * map */
+ int track_table_mutation_table_size; /* max table map entries */
+ int track_table_mutation_query_buckets; /* hash buckets for query
+ * cache */
+ int track_table_mutation_query_parse_cache_size; /* max query cache
+ * entries */
+
char *failover_command; /* execute command when failover happens */
char *follow_primary_command; /* execute command when failover is
* ended */
diff --git a/src/include/utils/pool_track_table_mutation.h b/src/include/utils/pool_track_table_mutation.h
new file mode 100644
index 000000000..28dec1c8a
--- /dev/null
+++ b/src/include/utils/pool_track_table_mutation.h
@@ -0,0 +1,247 @@
+/* -*-pgsql-c-*- */
+/*
+ * pgpool: a language independent connection pool server for PostgreSQL
+ * written by Tatsuo Ishii
+ *
+ * Copyright (c) 2003-2026 PgPool Global Development Group
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose and without fee is hereby
+ * granted, provided that the above copyright notice appear in all
+ * copies and that both that copyright notice and this permission
+ * notice appear in supporting documentation, and that the name of the
+ * author not be used in advertising or publicity pertaining to
+ * distribution of the software without specific, written prior
+ * permission. The author makes no representations about the
+ * suitability of this software for any purpose. It is provided "as
+ * is" without express or implied warranty.
+ *
+ * pool_track_table_mutation.h: In-memory tracking of
+ * recently written tables to prevent stale reads.
+ */
+
+#ifndef POOL_TRACK_TABLE_MUTATION_H
+#define POOL_TRACK_TABLE_MUTATION_H
+
+#include "pool.h"
+#include <sys/time.h>
+
+/*
+ * Maximum table name length including schema: "schema"."table"
+ * Using NAMEDATALEN * 2 + 4 for quotes and dot
+ */
+#define TRACK_TABLE_MUTATION_TABLE_NAME_LEN (NAMEDATALEN * 2 + 4)
+
+/*
+ * Maximum number of tables we track per query
+ */
+#define TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY 8
+
+/*
+ * Invalid index marker for linked lists
+ */
+#define TRACK_TABLE_MUTATION_INVALID_INDEX (-1)
+
+/*
+ * Default TTL in microseconds (100ms) used when replication delay is unknown
+ */
+#define TRACK_TABLE_MUTATION_DEFAULT_TTL_US (100 * 1000)
+
+/*
+ * Entry in the table mutation hash table (keyed by table/database oids)
+ */
+typedef struct TrackTableMutationEntry
+{
+ int table_oid; /* Table oid */
+ int dboid; /* Database oid */
+ struct timeval first_write_time; /* When the entry was first created */
+ struct timeval last_write_time; /* When the table was last written */
+ uint32 hash; /* Pre-computed hash value */
+ int next; /* Next in collision chain */
+ bool in_use; /* Is this entry in use? */
+} TrackTableMutationEntry;
+
+/*
+ * Header for the table mutation hash table in shared memory
+ */
+typedef struct TrackTableMutationHashTable
+{
+ int num_buckets; /* Number of hash buckets */
+ int max_entries; /* Maximum entries allowed */
+ int num_entries; /* Current number of entries */
+ int free_list_head; /* Head of free entry list */
+
+ /*
+ * Flexible array members follow in shared memory: int
+ * buckets[num_buckets]; TrackTableMutationEntry entries[max_entries];
+ */
+} TrackTableMutationHashTable;
+
+/*
+ * Entry in the query parse cache
+ */
+typedef struct QueryParseEntry
+{
+ uint64 query_hash; /* Hash of normalized query */
+ bool is_write; /* True if INSERT/UPDATE/DELETE */
+ int num_tables; /* Number of tables in query */
+ char table_names
+ [ TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY]
+ [ TRACK_TABLE_MUTATION_TABLE_NAME_LEN];
+ int next; /* Next entry in collision chain */
+ int lru_prev; /* Previous in LRU list */
+ int lru_next; /* Next in LRU list */
+ bool in_use; /* Is this entry in use? */
+} QueryParseEntry;
+
+/*
+ * Header for the query parse cache in shared memory
+ */
+typedef struct QueryParseCache
+{
+ int num_buckets; /* Number of hash buckets */
+ int max_entries; /* Maximum entries allowed */
+ int num_entries; /* Current number of entries */
+ int free_list_head; /* Head of free entry list */
+ int lru_head; /* Most recently used */
+ int lru_tail; /* Least recently used */
+
+ /*
+ * Flexible array members follow in shared memory: int
+ * buckets[num_buckets]; QueryParseEntry entries[max_entries];
+ */
+} QueryParseCache;
+
+/*
+ * Global state for track table mutation feature
+ */
+typedef struct TrackTableMutationState
+{
+ bool initialized; /* Shmem initialized? */
+ uint64 current_ttl_us; /* Current TTL in microseconds */
+ struct timeval ttl_last_updated; /* When TTL was last updated */
+ struct timeval last_cleanup_time; /* When last expired cleanup ran */
+ struct timeval global_cold_start_until; /* Global cold start end time */
+ uint32 stats_queries_checked; /* Queries checked */
+ uint32 stats_forced_primary; /* Forced to primary */
+ uint32 stats_allowed_replica; /* Allowed to replica */
+} TrackTableMutationState;
+
+/*
+ * Main shared memory structure containing all components
+ */
+typedef struct TrackTableMutationShmem
+{
+ TrackTableMutationState state;
+ TrackTableMutationHashTable *table_map;
+ QueryParseCache *query_cache;
+} TrackTableMutationShmem;
+
+/* ----------------
+ * Public API functions
+ * ----------------
+ */
+
+/*
+ * Initialize shared memory structures for track table mutation.
+ * Called from pgpool_main.c after pool_init_pool_info().
+ */
+extern void pool_track_table_mutation_init(void);
+
+/*
+ * Initialize per-child process state for track table mutation.
+ * Called from child.c when a new child process starts.
+ * Sets up cold start tracking.
+ */
+extern void pool_track_table_mutation_child_init(void);
+
+/*
+ * Check if the child process is in cold start period.
+ * During cold start, all queries are routed to primary.
+ * Returns true if in cold start, false otherwise.
+ */
+extern bool pool_track_table_mutation_in_cold_start(void);
+
+/*
+ * Trigger a global cold start period for all processes.
+ * Used after watchdog leader change to avoid stale reads.
+ */
+extern void pool_track_table_mutation_trigger_global_cold_start(void);
+
+/*
+ * Get oid of current database.
+ */
+extern int pool_track_table_mutation_get_database_oid(void);
+
+/*
+ * Check if a table was recently written to (is "stale").
+ * If stale, reads from this table should go to primary.
+ * Returns true if table is stale (recently written), false otherwise.
+ */
+extern bool pool_track_table_mutation_table_is_stale(
+ int table_oid, int dboid);
+
+/*
+ * Mark tables as recently written.
+ * Called after INSERT/UPDATE/DELETE queries complete.
+ * table_oids: array of table oids
+ * num_tables: number of tables in array
+ * dboid: database oid
+ */
+extern void pool_track_table_mutation_mark_tables_written(
+ const int *table_oids, int num_tables, int dboid);
+
+/*
+ * Convenience function to mark a single table as written.
+ * table_oid: table oid
+ * dboid: database oid
+ */
+extern void pool_track_table_mutation_mark_table_written(
+ int table_oid, int dboid);
+
+/*
+ * Update the TTL based on current replication delay.
+ * Called from pool_worker_child.c when replication delay is updated.
+ * delay_us: replication delay in microseconds
+ */
+extern void pool_track_table_mutation_update_ttl(uint64 delay_us);
+
+/*
+ * Look up cached parse result for a query.
+ * hash: hash of normalized query
+ * is_write: output - true if query is a write
+ * table_names: output - array to fill with table names
+ * num_tables: output - number of tables found
+ * Returns true if found in cache, false otherwise.
+ */
+extern bool pool_track_table_mutation_get_cached_parse(
+ uint64 hash, bool *is_write,
+ char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
+ int *num_tables);
+
+/*
+ * Cache a parse result for a query.
+ * hash: hash of normalized query
+ * is_write: true if query is a write
+ * table_names: array of table names
+ * num_tables: number of tables
+ */
+extern void pool_track_table_mutation_cache_parse(
+ uint64 hash, bool is_write,
+ const char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
+ int num_tables);
+
+/*
+ * Normalize a query and compute its hash.
+ * Strips comments, normalizes whitespace and literals.
+ * query: input SQL query string
+ * Returns: 64-bit hash of normalized query
+ */
+extern uint64 pool_track_table_mutation_normalize_and_hash(const char *query);
+
+/*
+ * Calculate required shared memory size for track table mutation.
+ */
+extern Size pool_track_table_mutation_shmem_size(void);
+
+#endif /* POOL_TRACK_TABLE_MUTATION_H */
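How the TTL and the maximum staleness cap might combine in a staleness check can be sketched as follows. This is an illustration under stated assumptions, not the implementation in pool_track_table_mutation.c: the sketch_* names are hypothetical, a delay of 0 is assumed to mean "unknown", and the cap is assumed to be measured from first_write_time as the parameter description says.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Fallback TTL (100 ms), mirroring TRACK_TABLE_MUTATION_DEFAULT_TTL_US. */
#define SKETCH_DEFAULT_TTL_US (100 * 1000)

/* Hypothetical, simplified view of one tracked table entry. */
typedef struct SketchEntry
{
	uint64_t	first_write_us;	/* when the entry was first created */
	uint64_t	last_write_us;	/* most recent committed write */
} SketchEntry;

/* TTL = replication delay * factor; assumes delay 0 means "unknown". */
static uint64_t
sketch_compute_ttl_us(uint64_t delay_us, double ttl_factor)
{
	assert(ttl_factor >= 1.0);
	if (delay_us == 0)
		return SKETCH_DEFAULT_TTL_US;
	return (uint64_t) ((double) delay_us * ttl_factor);
}

/* Returns true if reads of this table should still go to the primary. */
static bool
sketch_table_is_stale(const SketchEntry *e, uint64_t now_us,
					  uint64_t ttl_us, uint64_t max_staleness_us)
{
	/* Clock went backwards: stay on the safe side. */
	if (now_us < e->last_write_us)
		return true;
	/* TTL window is measured from the last write. */
	if (now_us - e->last_write_us >= ttl_us)
		return false;
	/* Cap is measured from the first write; 0 disables it. */
	if (max_staleness_us != 0 &&
		now_us - e->first_write_us >= max_staleness_us)
		return false;
	return true;
}
```

With a 200 ms replication delay and the default factor of 5.0, the TTL is one second; a table written 100 ms ago is then stale unless the cap has already elapsed since its first write.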
diff --git a/src/main/pgpool_main.c b/src/main/pgpool_main.c
index bf7c452e2..d4e274f02 100644
--- a/src/main/pgpool_main.c
+++ b/src/main/pgpool_main.c
@@ -57,6 +57,7 @@
#include "auth/pool_passwd.h"
#include "auth/pool_hba.h"
#include "query_cache/pool_memqcache.h"
+#include "utils/pool_track_table_mutation.h"
#include "watchdog/wd_internal_commands.h"
#include "watchdog/wd_lifecheck.h"
#include "watchdog/watchdog.h"
@@ -1500,11 +1501,14 @@ sigusr1_interrupt_processor(void)
if (user1SignalSlot->signalFlags[SIG_WATCHDOG_STATE_CHANGED])
{
+ WD_STATES wd_state;
+
ereport(LOG,
(errmsg("Pgpool-II parent process received watchdog state change signal from watchdog")));
user1SignalSlot->signalFlags[SIG_WATCHDOG_STATE_CHANGED] = false;
- if (wd_internal_get_watchdog_local_node_state() == WD_STANDBY)
+ wd_state = wd_internal_get_watchdog_local_node_state();
+ if (wd_state == WD_STANDBY)
{
ereport(LOG,
(errmsg("we have joined the watchdog cluster as STANDBY node"),
@@ -1518,6 +1522,12 @@ sigusr1_interrupt_processor(void)
*/
pool_release_follow_primary_lock(true);
}
+ else if (wd_state == WD_COORDINATOR &&
+ pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ pool_track_table_mutation_trigger_global_cold_start();
+ }
}
if (user1SignalSlot->signalFlags[SIG_FAILOVER_INTERRUPT])
{
@@ -3083,6 +3093,16 @@ initialize_shared_mem_objects(bool clear_memcache_oidmaps)
elog(DEBUG1, "watchdog: %zu bytes requested for shared memory", MAXALIGN(wd_ipc_get_shared_mem_size()));
}
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ size += MAXALIGN(pool_track_table_mutation_shmem_size());
+ elog(DEBUG1,
+ "track_table_mutation: %zu bytes requested"
+ " for shared memory",
+ MAXALIGN(pool_track_table_mutation_shmem_size()));
+ }
+
initialize_shared_memory_main_segment(size);
/* Move the backend descriptors to shared memory */
@@ -3199,6 +3219,13 @@ initialize_shared_mem_objects(bool clear_memcache_oidmaps)
wd_ipc_initialize_data();
}
+ /* Initialize track table mutation for recently written tables */
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ pool_track_table_mutation_init();
+ }
+
}
/*
diff --git a/src/protocol/CommandComplete.c b/src/protocol/CommandComplete.c
index a3b8f0ea1..f445f268b 100644
--- a/src/protocol/CommandComplete.c
+++ b/src/protocol/CommandComplete.c
@@ -38,6 +38,8 @@
#include "utils/palloc.h"
#include "utils/memutils.h"
#include "utils/pool_stream.h"
+#include "utils/pool_track_table_mutation.h"
+#include "query_cache/pool_memqcache.h"
static int extract_ntuples(char *message);
static POOL_STATUS handle_mismatch_tuples(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend, char *packet, int packetlen, bool command_complete);
@@ -304,6 +306,32 @@ handle_query_context(POOL_CONNECTION_POOL *backend)
node = session_context->query_context->parse_tree;
+ /*
+ * Track table writes for dml_adaptive_global feature. For autocommit
+ * statements (not in explicit transaction), mark tables immediately. For
+ * explicit transactions, marking is deferred to COMMIT in dml_adaptive()
+ * so that ROLLBACKed writes don't pollute the shared memory table map.
+ */
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL &&
+ node != NULL &&
+ !session_context->is_in_transaction)
+ {
+ int *oids;
+ int num_oids;
+
+ num_oids = pool_extract_table_oids(node, &oids);
+ if (num_oids > 0)
+ {
+ int dboid;
+
+ dboid = pool_track_table_mutation_get_database_oid();
+ if (dboid > 0)
+ pool_track_table_mutation_mark_tables_written(
+ oids, num_oids, dboid);
+ }
+ }
+
if (IsA(node, PrepareStmt))
{
if (session_context->uncompleted_message)
diff --git a/src/protocol/child.c b/src/protocol/child.c
index c34f05728..316b76239 100644
--- a/src/protocol/child.c
+++ b/src/protocol/child.c
@@ -57,6 +57,7 @@
#include "utils/elog.h"
#include "utils/ps_status.h"
#include "utils/timestamp.h"
+#include "utils/pool_track_table_mutation.h"
#include "context/pool_process_context.h"
#include "context/pool_session_context.h"
@@ -213,6 +214,13 @@ do_child(int *fds)
/* Initialize per process context */
pool_init_process_context();
+ /* Initialize track table mutation child state for cold start tracking */
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ pool_track_table_mutation_child_init();
+ }
+
/* initialize connection pool */
if (pool_init_cp())
{
diff --git a/src/protocol/pool_proto_modules.c b/src/protocol/pool_proto_modules.c
index f9458bb55..74ee00d16 100644
--- a/src/protocol/pool_proto_modules.c
+++ b/src/protocol/pool_proto_modules.c
@@ -1461,7 +1461,9 @@ Parse(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend,
pool_where_to_send(query_context, query_context->original_query,
query_context->parse_tree);
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && strlen(name) != 0)
+ if (DLBOW_IS_DML_ADAPTIVE(
+ pool_config->disable_load_balance_on_write)
+ && strlen(name) != 0)
pool_setall_node_to_be_sent(query_context);
if (REPLICATION)
@@ -1804,7 +1806,7 @@ Bind(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend,
return POOL_END;
}
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE &&
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write) &&
TSTATE(backend, MAIN_REPLICA ? PRIMARY_NODE_ID : REAL_MAIN_NODE_ID) == 'T')
{
pool_where_to_send(query_context, query_context->original_query,
diff --git a/src/sample/pgpool.conf.sample-stream b/src/sample/pgpool.conf.sample-stream
index 1ac982907..00132d534 100644
--- a/src/sample/pgpool.conf.sample-stream
+++ b/src/sample/pgpool.conf.sample-stream
@@ -478,6 +478,14 @@ backend_clustering_mode = streaming_replication
# modified within the current explicit transaction will
# not be load balanced until the end of the transaction.
#
+ # dml_adaptive_global:
+ # Superset of dml_adaptive. In addition to per-transaction
+ # tracking, uses shared memory to track recently written
+ # tables across all sessions. Reads from recently written
+ # tables are routed to primary until a TTL (based on
+ # replication delay) expires. Requires additional shared
+ # memory. See track_table_mutation_* parameters below.
+ #
# always:
# if a write query is issued, read queries will
# not be load balanced until the session ends.
@@ -499,6 +507,54 @@ backend_clustering_mode = streaming_replication
#statement_level_load_balance = off
# Enables statement level load balancing
+# - Track Table Mutation (used by dml_adaptive_global) -
+ # WARNING: dml_adaptive_global increases shared memory usage
+ # Default settings require ~6.4 MB shared memory
+ # (0.1 MB table tracking + 6.3 MB query cache)
+
+#track_table_mutation_ttl_factor = 5.0
+ # TTL multiplier: TTL = replication_delay * factor
+ # Higher values provide more safety margin
+ # Range: 1.0-100.0 (default: 5.0)
+ # (change requires reload)
+
+#track_table_mutation_max_staleness = 60000
+ # Maximum duration (ms) a table can be marked stale
+ # from its first write. Bounds cross-session impact:
+ # even under continuous writes, staleness expires
+ # after this period and is only renewed by new writes.
+ # 0 disables the cap. Range: 0-3600000 (default: 60000 = 60s)
+ # (change requires reload)
+
+#track_table_mutation_cold_start_duration = 2000
+ # Duration in milliseconds to route all queries to primary
+ # after child process starts (cold start period)
+ # Range: 0-60000 ms (default: 2000 ms = 2 seconds)
+ # Set to 0 to disable cold start behavior
+ # (change requires reload)
+
+#track_table_mutation_table_buckets = 1024
+ # Number of hash buckets for track table mutation
+ # Higher values reduce hash collisions
+ # Range: 64-65536 (default: 1024)
+ # (change requires restart)
+
+#track_table_mutation_table_size = 2048
+ # Maximum number of tables to track simultaneously
+ # Range: 128-131072 (default: 2048)
+ # (change requires restart)
+
+#track_table_mutation_query_buckets = 2048
+ # Number of hash buckets for query parse cache
+ # Range: 64-65536 (default: 2048)
+ # (change requires restart)
+
+#track_table_mutation_query_parse_cache_size = 10000
+ # Maximum number of query parse results to cache
+ # Range: 100-1000000 (default: 10000)
+ # Memory usage: ~640 bytes per entry (~6.3 MB default, ~64 MB for 100000)
+ # (change requires restart)
+
#------------------------------------------------------------------------------
# STREAMING REPLICATION MODE
#------------------------------------------------------------------------------
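One possible reading of how the TTL and track_table_mutation_max_staleness interact is sketched below. The function name table_is_stale and its parameters are hypothetical, and the exact rule used by the patch may differ: here a table counts as stale only while the last write is younger than the TTL and, when the cap is non-zero, the first write is younger than the cap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical staleness check. All timestamps are in microseconds;
 * max_staleness_ms is in milliseconds, as in the sample configuration,
 * and 0 disables the cap.
 */
bool
table_is_stale(int64_t now_us,
			   int64_t first_write_us,
			   int64_t last_write_us,
			   int64_t ttl_us,
			   int64_t max_staleness_ms)
{
	/* The TTL window is measured from the most recent write. */
	if (now_us - last_write_us >= ttl_us)
		return false;

	/* The cap is measured from the first write and bounds continuous writers. */
	if (max_staleness_ms > 0 &&
		now_us - first_write_us >= max_staleness_ms * 1000)
		return false;

	return true;
}
```

With a 1 second TTL and the default 60000 ms cap, a table written continuously for 61 seconds would stop being treated as stale under this reading even though its last write is only half a second old; with the cap set to 0 it would stay stale.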
diff --git a/src/streaming_replication/pool_worker_child.c b/src/streaming_replication/pool_worker_child.c
index 311b63865..cdd570396 100644
--- a/src/streaming_replication/pool_worker_child.c
+++ b/src/streaming_replication/pool_worker_child.c
@@ -58,6 +58,7 @@
#include "utils/pool_ip.h"
#include "utils/ps_status.h"
#include "utils/pool_stream.h"
+#include "utils/pool_track_table_mutation.h"
#include "context/pool_process_context.h"
#include "context/pool_session_context.h"
@@ -419,6 +420,7 @@ check_replication_time_lag(void)
BackendInfo *bkinfo;
uint64 lag;
uint64 delay_threshold_by_time;
+ uint64 max_delay_us = 0;
ErrorContextCallback callback;
int active_standby_node;
bool replication_delay_by_time;
@@ -643,6 +645,10 @@ check_replication_time_lag(void)
* seconds to micro
* seconds */
+ /* Track max delay for mutation TTL */
+ if (lag > max_delay_us)
+ max_delay_us = lag;
+
/* Log delay if necessary */
if ((pool_config->log_standby_delay == LSD_ALWAYS && lag > 0) ||
(pool_config->log_standby_delay == LSD_OVER_THRESHOLD &&
@@ -668,6 +674,13 @@ check_replication_time_lag(void)
}
}
+ /*
+ * Update track table mutation TTL from the max observed time-based
+ * replication delay.
+ */
+ if (replication_delay_by_time && max_delay_us > 0)
+ pool_track_table_mutation_update_ttl(max_delay_us);
+
error_context_stack = callback.previous;
}
@@ -695,6 +708,7 @@ check_replication_time_lag_with_cmd(void)
double delay_ms;
uint64 delay;
uint64 delay_threshold_by_time;
+ uint64 max_delay_us = 0; /* Track max delay for mutation map */
int token_count = 0;
int primary_node_id;
int save_errno;
@@ -1003,6 +1017,10 @@ check_replication_time_lag_with_cmd(void)
bkinfo->standby_delay = delay;
bkinfo->standby_delay_by_time = true;
+ /* Track maximum delay for table mutation map TTL calculation */
+ if (delay > max_delay_us)
+ max_delay_us = delay;
+
/*
* Log delay if necessary. threshold is in milliseconds, convert
* to microseconds.
@@ -1021,6 +1039,12 @@ check_replication_time_lag_with_cmd(void)
token = strtok_r(NULL, " \t\n", &saveptr);
}
+ /* Update table mutation TTL based on max observed delay */
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL &&
+ max_delay_us > 0)
+ pool_track_table_mutation_update_ttl(max_delay_us);
+
}
PG_CATCH();
{
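The TTL derivation used by both lag checks above (take the largest standby delay, then multiply by track_table_mutation_ttl_factor) can be sketched as follows. The helper names max_delay_us_of and compute_ttl_us are hypothetical, and the body of pool_track_table_mutation_update_ttl() is not part of this excerpt, so the arithmetic follows the sample configuration comment "TTL = replication_delay * factor" rather than the actual function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Largest delay (microseconds) observed across all standby nodes. */
uint64_t
max_delay_us_of(const uint64_t *delays, size_t n)
{
	uint64_t	max = 0;

	for (size_t i = 0; i < n; i++)
		if (delays[i] > max)
			max = delays[i];
	return max;
}

/*
 * Hypothetical TTL update: a zero delay leaves the current TTL unchanged,
 * mirroring the "max_delay_us > 0" guard in the patch.
 */
uint64_t
compute_ttl_us(uint64_t max_delay, double factor, uint64_t current_ttl)
{
	if (max_delay == 0)
		return current_ttl;
	return (uint64_t) ((double) max_delay * factor);
}
```

Using the maximum rather than an average is the conservative choice: a read may be balanced to any standby, so the slowest one determines how long a written table must stay pinned to the primary.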
diff --git a/src/test/regression/libs.sh b/src/test/regression/libs.sh
index 7c5a0c182..1c8ae392d 100644
--- a/src/test/regression/libs.sh
+++ b/src/test/regression/libs.sh
@@ -42,6 +42,8 @@ function wait_for_failover_done {
function clean_all {
pgrep pgpool | xargs kill -9 > /dev/null 2>&1
pgrep postgres | xargs kill -9 > /dev/null 2>&1
+ # Clean up leaked SysV IPC resources left behind by kill -9
+ ipcrm --all 2>/dev/null || true
rm -f $PGSOCKET_DIR/.s.PGSQL.*
netstat -t -p 2>/dev/null|grep pgpool
}
diff --git a/src/test/regression/tests/042.track_table_mutation/test.sh b/src/test/regression/tests/042.track_table_mutation/test.sh
new file mode 100755
index 000000000..8b4dd17b8
--- /dev/null
+++ b/src/test/regression/tests/042.track_table_mutation/test.sh
@@ -0,0 +1,354 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for track table mutation feature (in-memory table tracking).
+# Tests routing of queries based on recently written tables.
+#
+source $TESTLIBS
+TESTDIR=testdir
+PSQL=$PGBIN/psql
+PSQLOPTS="-a -q -X"
+PGPOOLBIN=$PGPOOL_INSTALL_DIR/bin
+export PGDATABASE=test
+
+# Only run in streaming replication mode since that's the target use case
+for mode in s
+do
+ rm -fr $TESTDIR
+ mkdir $TESTDIR
+ cd $TESTDIR
+
+ # Create test environment with 2 nodes
+ echo -n "creating test environment..."
+ $PGPOOL_SETUP -m $mode -n 2 || exit 1
+ echo "done."
+
+ source ./bashrc.ports
+
+ # Configure track table mutation feature via dml_adaptive_global
+ echo "disable_load_balance_on_write = 'dml_adaptive_global'" >> etc/pgpool.conf
+ echo "track_table_mutation_ttl_factor = 5.0" >> etc/pgpool.conf
+ echo "track_table_mutation_cold_start_duration = 10000" >> etc/pgpool.conf
+
+ # Enable load balancing explicitly
+ echo "load_balance_mode = on" >> etc/pgpool.conf
+
+ # Configure weights so we can distinguish routing
+ # Backend 0 (primary) weight=0, Backend 1 (standby) weight=1
+ # This means load balanced queries go to node 1 by default
+ echo "backend_weight0 = 0" >> etc/pgpool.conf
+ echo "backend_weight1 = 1" >> etc/pgpool.conf
+
+ # Enable debug logging to see routing decisions
+ echo "log_min_messages = debug1" >> etc/pgpool.conf
+
+ ./startall
+
+ export PGPORT=$PGPOOL_PORT
+ export PGHOST=localhost
+
+ wait_for_pgpool_startup
+
+ # Create test tables
+ $PSQL test <<EOF
+CREATE TABLE t1(i INTEGER);
+CREATE TABLE t2(i INTEGER);
+CREATE TABLE t3(i INTEGER);
+EOF
+
+ echo "=== Test 1: Cold Start Routing ==="
+ # During cold start, all queries should go to primary
+ # Restart pgpool to trigger cold start
+ ./shutdownall
+ ./startall
+ wait_for_pgpool_startup
+
+ # Immediately query - should go to primary due to cold start
+ $PSQL test -c "SELECT 'cold_start_test' as marker, * FROM t1;" > /dev/null 2>&1
+
+ # Check log for cold start message (use -a to handle binary log files)
+ if grep -a -q "could not load balance because of track table mutation cold start" log/pgpool.log; then
+ echo "Test 1 PASSED: Cold start routing works"
+ else
+ echo "Test 1 FAILED: Cold start routing not detected"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 2: Wait for cold start to end ==="
+ # Wait for cold start period to end (10 seconds).
+ # Use generous margin to avoid flakiness under load (e.g. full regression suite).
+ sleep 12
+
+ # Clear the log
+ > log/pgpool.log
+
+ # Now a clean table query should load balance (go to node 1)
+ $PSQL test -c "SELECT 'after_cold_start' as marker, * FROM t3;" > /dev/null 2>&1
+
+ # After cold start, queries to clean tables should load balance
+ # Check that it did NOT get forced to primary due to track table mutation
+ if grep -a -q "could not load balance because of track table mutation cold start" log/pgpool.log; then
+ echo "Test 2 FAILED: Still in cold start after waiting"
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 2 PASSED: Cold start ended correctly"
+
+ echo "=== Test 3: Write-then-Read Routing ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Write to t1 and then read - use single connection to ensure same session
+ $PSQL test <<EOF
+INSERT INTO t1 VALUES (1);
+SELECT 'write_read_test' as marker, * FROM t1;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ # Check log for table staleness message
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 3 PASSED: Write-then-read routing works"
+ else
+ echo "Test 3 FAILED: Table staleness not detected after write"
+ # Show relevant log entries for debugging
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 4: Clean Table Still Load Balances ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Read from t2 (never written to) - should load balance
+ $PSQL test -c "SELECT 'clean_table_test' as marker, * FROM t2;" > /dev/null 2>&1
+
+ # Should NOT see track table mutation blocking message for t2
+ if grep -a -q "could not load balance because table.*t2.*was recently written" log/pgpool.log; then
+ echo "Test 4 FAILED: Clean table incorrectly marked as stale"
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 4 PASSED: Clean tables still load balance"
+
+ echo "=== Test 5: UPDATE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Update t2 and then read - use single connection
+ $PSQL test <<EOF
+UPDATE t2 SET i = 999 WHERE i = 0;
+SELECT 'update_test' as marker, * FROM t2;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 5 PASSED: UPDATE marks table as stale"
+ else
+ echo "Test 5 FAILED: UPDATE did not mark table as stale"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 6: DELETE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Delete from t3 and then read - use single connection
+ $PSQL test <<EOF
+DELETE FROM t3 WHERE i = 0;
+SELECT 'delete_test' as marker, * FROM t3;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 6 PASSED: DELETE marks table as stale"
+ else
+ echo "Test 6 FAILED: DELETE did not mark table as stale"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 7: TRUNCATE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create a fresh table for TRUNCATE test
+ $PSQL test -c "CREATE TABLE t_truncate(i INTEGER);" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_truncate VALUES (1), (2), (3);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log again
+ > log/pgpool.log
+
+ # Truncate and then read - use single connection
+ $PSQL test <<EOF
+TRUNCATE t_truncate;
+SELECT 'truncate_test' as marker, * FROM t_truncate;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 7 PASSED: TRUNCATE marks table as stale"
+ else
+ echo "Test 7 FAILED: TRUNCATE did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 8: WITH Clause (CTE with DELETE) Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create a fresh table for WITH test
+ $PSQL test -c "CREATE TABLE t_cte(i INTEGER);" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_cte VALUES (1), (2), (3);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log again
+ > log/pgpool.log
+
+ # Use WITH clause with DELETE, then read from the table
+ $PSQL test <<EOF
+WITH deleted AS (DELETE FROM t_cte WHERE i = 1 RETURNING *)
+SELECT * FROM deleted;
+SELECT 'cte_test' as marker, * FROM t_cte;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 8 PASSED: WITH clause (CTE) marks table as stale"
+ else
+ echo "Test 8 FAILED: WITH clause (CTE) did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ # Test 9: MERGE (PostgreSQL 15+ only)
+ PG_MAJOR_VERSION=$($PSQL -t -c "SELECT substring(version() from 'PostgreSQL ([0-9]+)');" | tr -d ' ')
+ if [ "$PG_MAJOR_VERSION" -ge 15 ] 2>/dev/null; then
+ echo "=== Test 9: MERGE Marks Table as Stale (PostgreSQL $PG_MAJOR_VERSION) ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create tables for MERGE test
+ $PSQL test -c "CREATE TABLE t_merge_target(id INTEGER PRIMARY KEY, val TEXT);" > /dev/null 2>&1
+ $PSQL test -c "CREATE TABLE t_merge_source(id INTEGER, val TEXT);" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_merge_target VALUES (1, 'old');" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_merge_source VALUES (1, 'new'), (2, 'insert');" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log again
+ > log/pgpool.log
+
+ # Use MERGE, then read from the target table
+ $PSQL test <<EOF
+MERGE INTO t_merge_target t
+USING t_merge_source s ON t.id = s.id
+WHEN MATCHED THEN UPDATE SET val = s.val
+WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.val);
+SELECT 'merge_test' as marker, * FROM t_merge_target;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 9 PASSED: MERGE marks table as stale"
+ else
+ echo "Test 9 FAILED: MERGE did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+ else
+ echo "=== Test 9: MERGE skipped (requires PostgreSQL 15+, have $PG_MAJOR_VERSION) ==="
+ fi
+
+ echo "=== Test 10: ROLLBACK Does NOT Mark Table as Stale ==="
+ # Create a fresh table for rollback test
+ $PSQL test -c "CREATE TABLE t_rollback(i INTEGER);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log
+ > log/pgpool.log
+
+ # Write inside a transaction, then rollback
+ $PSQL test <<EOF
+BEGIN;
+INSERT INTO t_rollback VALUES (1);
+ROLLBACK;
+SELECT 'rollback_test' as marker, * FROM t_rollback;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ # Should NOT see t_rollback marked as stale since the write was rolled back
+ if grep -a -q "could not load balance because table.*t_rollback.*was recently written" log/pgpool.log; then
+ echo "Test 10 FAILED: Rolled-back write incorrectly marked table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 10 PASSED: ROLLBACK does not mark table as stale"
+
+ echo "=== Test 11: COMMIT Marks Table as Stale ==="
+ # Create a fresh table for commit test
+ $PSQL test -c "CREATE TABLE t_commit(i INTEGER);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log
+ > log/pgpool.log
+
+ # Write inside a transaction, then commit, then read
+ $PSQL test <<EOF
+BEGIN;
+INSERT INTO t_commit VALUES (1);
+COMMIT;
+SELECT 'commit_test' as marker, * FROM t_commit;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 11 PASSED: COMMIT marks table as stale"
+ else
+ echo "Test 11 FAILED: Committed write did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ echo ""
+ echo "=== All Track Table Mutation Tests PASSED ==="
+
+ ./shutdownall
+
+ cd ..
+done
+
+exit 0
diff --git a/src/test/regression/tests/043.track_table_mutation_watchdog/test.sh b/src/test/regression/tests/043.track_table_mutation_watchdog/test.sh
new file mode 100755
index 000000000..c50c213d6
--- /dev/null
+++ b/src/test/regression/tests/043.track_table_mutation_watchdog/test.sh
@@ -0,0 +1,184 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# Test script for track table mutation global cold start
+# on watchdog leader change.
+#
+# Uses $WATCHDOG_SETUP to create a 2-node watchdog cluster,
+# then verifies that when the leader is stopped the new
+# leader triggers a global cold start.
+#-------------------------------------------------------------------
+source $TESTLIBS
+TESTDIR=testdir
+PSQL=$PGBIN/psql
+success_count=0
+
+dir=`pwd`
+rm -fr $TESTDIR
+mkdir $TESTDIR
+cd $TESTDIR
+
+# Create 2-node watchdog cluster
+$WATCHDOG_SETUP -wn 2 || exit 1
+
+# Ensure per-node scripts are executable
+# (sed -i in watchdog_setup can strip permissions)
+chmod 755 pgpool*/startall pgpool*/shutdownall
+
+# Append track_table_mutation config to both nodes
+for i in 0 1
+do
+ cat >> pgpool${i}/etc/pgpool.conf <<EOF
+disable_load_balance_on_write = 'dml_adaptive_global'
+track_table_mutation_cold_start_duration = 2000
+enable_consensus_with_half_votes = on
+log_min_messages = debug1
+EOF
+done
+
+./startall
+export PCPPASSFILE=$dir/$TESTDIR/pgpool0/pcppass
+
+# Wait for watchdog lifecheck on node 0
+echo -n "waiting for watchdog node 0 starting up..."
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep "lifecheck started" \
+ pgpool0/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ break
+ fi
+ sleep 2
+done
+echo "done."
+
+# Test 1: Verify leader came up
+echo "=== Test 1: Waiting for the pgpool leader... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep "I am the cluster leader node" \
+ pgpool0/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 1 PASSED: Leader brought up."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 1 ]; then
+ echo "Test 1 FAILED: Leader did not start"
+ ./shutdownall
+ exit 1
+fi
+
+# Test 2: Verify standby joined cluster
+echo "=== Test 2: Waiting for standby to join... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep "successfully joined the watchdog cluster" \
+ pgpool1/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 2 PASSED: Standby joined."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 2 ]; then
+ echo "Test 2 FAILED: Standby did not join"
+ ./shutdownall
+ exit 1
+fi
+
+# Test 3: Verify track_table_mutation initialized
+echo "=== Test 3: Verify feature initialized ==="
+if grep -a "track_table_mutation: initialized" \
+ pgpool0/log/pgpool.log > /dev/null 2>&1; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 3 PASSED: Feature initialized."
+else
+ echo "Test 3 FAILED: Feature not initialized"
+ ./shutdownall
+ exit 1
+fi
+
+# Test 4: Stop leader (pgpool0) to trigger failover
+echo "=== Test 4: Stopping leader... ==="
+cd pgpool0
+source ./bashrc.ports
+$PGPOOL_INSTALL_DIR/bin/pgpool \
+ -f etc/pgpool.conf -m f stop
+cd ..
+
+echo "Checking standby detected shutdown..."
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep -a "is shutting down" \
+ pgpool1/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 4 PASSED: Shutdown detected."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 4 ]; then
+ echo "Test 4 FAILED: Shutdown not detected"
+ ./shutdownall
+ exit 1
+fi
+
+# Test 5: Verify standby became new leader
+echo "=== Test 5: Checking standby takes over... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep -a "I am the cluster leader node" \
+ pgpool1/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 5 PASSED: Standby became leader."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 5 ]; then
+ echo "Test 5 FAILED: Standby did not become leader"
+ ./shutdownall
+ exit 1
+fi
+
+# Test 6: Verify global cold start was triggered
+echo "=== Test 6: Checking global cold start... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep -a "track_table_mutation: global cold start" \
+ pgpool1/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 6 PASSED: Global cold start triggered."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+# Cleanup
+./shutdownall
+
+echo ""
+echo "$success_count out of 6 successful"
+
+if test $success_count -eq 6
+then
+ echo "=== All Watchdog Tests PASSED ==="
+ exit 0
+fi
+
+exit 1
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 939200965..0f1fa884c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -431,6 +431,8 @@ PublicationObjSpec
PublicationObjSpecType
PublicationTable
Query
+QueryParseCache
+QueryParseEntry
QuerySource
RELQTARGET_OPTION
RTEKind
@@ -519,6 +521,10 @@ TableLikeClause
TableSampleClause
TargetEntry
TokenizedLine
+TrackTableMutationEntry
+TrackTableMutationHashTable
+TrackTableMutationShmem
+TrackTableMutationState
TransactionId
TransactionStmt
TransactionStmtKind
diff --git a/src/utils/pool_track_table_mutation.c b/src/utils/pool_track_table_mutation.c
new file mode 100644
index 000000000..9be46b28f
--- /dev/null
+++ b/src/utils/pool_track_table_mutation.c
@@ -0,0 +1,1450 @@
+/* -*-pgsql-c-*- */
+/*
+ * pgpool: a language independent connection pool server for PostgreSQL
+ * written by Tatsuo Ishii
+ *
+ * Copyright (c) 2003-2026 PgPool Global Development Group
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose and without fee is hereby
+ * granted, provided that the above copyright notice appear in all
+ * copies and that both that copyright notice and this permission
+ * notice appear in supporting documentation, and that the name of the
+ * author not be used in advertising or publicity pertaining to
+ * distribution of the software without specific, written prior
+ * permission. The author makes no representations about the
+ * suitability of this software for any purpose. It is provided "as
+ * is" without express or implied warranty.
+ *
+ * pool_track_table_mutation.c: In-memory tracking of recently
+ * written tables to prevent stale reads from replicas.
+ *
+ * Based on the "lagless" architecture from Tailor Brands.
+ */
+
+#include <string.h>
+#include <stdlib.h>
+#include <sys/time.h>
+#include <unistd.h>
+
+#include "pool.h"
+#include "pool_config.h"
+#include "context/pool_session_context.h"
+#include "utils/pool_track_table_mutation.h"
+#include "utils/elog.h"
+#include "utils/pool_ipc.h"
+#include "utils/palloc.h"
+#include "utils/pool_relcache.h"
+
+#define DATABASE_TO_OID_QUERY \
+ "SELECT oid FROM pg_catalog.pg_database" \
+ " WHERE datname = '%s'"
+
+/*
+ * Helper macro: true when the feature is not active.
+ */
+#define TRACK_TABLE_MUTATION_DISABLED() \
+ (pool_config->disable_load_balance_on_write != \
+ DLBOW_DML_ADAPTIVE_GLOBAL || \
+ track_table_mutation_shmem == NULL)
+
+/* ----------------
+ * Local variables
+ * ----------------
+ */
+
+/* Pointer to shared memory structure */
+static TrackTableMutationShmem *track_table_mutation_shmem = NULL;
+
+/* Per-process cold start tracking (not in shared memory) */
+static struct timeval process_start_time;
+static bool cold_start_initialized = false;
+
+/* ----------------
+ * Helper macros for flexible arrays in shared memory
+ * ----------------
+ */
+
+/* Get pointer to bucket array in table map */
+#define TABLE_MAP_BUCKETS(map) \
+ ((int *)((char *)(map) + \
+ sizeof(TrackTableMutationHashTable)))
+
+/* Get pointer to entry array in table map */
+#define TABLE_MAP_ENTRIES(map) \
+ ((TrackTableMutationEntry *)((char *)(map) + \
+ sizeof(TrackTableMutationHashTable) + \
+ (map)->num_buckets * sizeof(int)))
+
+/* Get pointer to bucket array in parse cache */
+#define PARSE_CACHE_BUCKETS(cache) \
+ ((int *)((char *)(cache) + sizeof(QueryParseCache)))
+
+/* Get pointer to entry array in parse cache */
+#define PARSE_CACHE_ENTRIES(cache) \
+ ((QueryParseEntry *)((char *)(cache) + \
+ sizeof(QueryParseCache) + \
+ (cache)->num_buckets * sizeof(int)))
+
+/* ----------------
+ * Semaphore lock helpers
+ * ----------------
+ */
+
+static inline void
+table_map_lock(void)
+{
+ pool_semaphore_lock(TRACK_TABLE_MUTATION_TABLE_SEM);
+}
+
+static inline void
+table_map_unlock(void)
+{
+ pool_semaphore_unlock(TRACK_TABLE_MUTATION_TABLE_SEM);
+}
+
+static inline void
+parse_cache_lock(void)
+{
+ pool_semaphore_lock(TRACK_TABLE_MUTATION_QUERY_SEM);
+}
+
+static inline void
+parse_cache_unlock(void)
+{
+ pool_semaphore_unlock(TRACK_TABLE_MUTATION_QUERY_SEM);
+}
+
+/* ----------------
+ * Hash functions
+ * ----------------
+ */
+
+/*
+ * FNV-1a hash for table/database oid pair
+ */
+static uint32
+fnv1a_hash_table_key(int table_oid, int dboid)
+{
+ uint32 hash = 2166136261u; /* FNV offset basis */
+ uint32 data[2];
+ const unsigned char *bytes;
+ size_t i;
+
+ data[0] = (uint32) table_oid;
+ data[1] = (uint32) dboid;
+ bytes = (const unsigned char *) data;
+
+ for (i = 0; i < sizeof(data); i++)
+ {
+ hash ^= bytes[i];
+ hash *= 16777619u; /* FNV prime */
+ }
+
+ return hash;
+}
+
+/*
+ * 64-bit FNV-1a hash of a byte string
+ */
+static uint64
+fnv1a_hash_64(const char *str, size_t len)
+{
+ /* FNV offset basis for 64-bit */
+ uint64 hash = 14695981039346656037ULL;
+ size_t i;
+
+ for (i = 0; i < len; i++)
+ {
+ hash ^= (uint8) str[i];
+ hash *= 1099511628211ULL; /* FNV prime */
+ }
+
+ return hash;
+}
+
+/* ----------------
+ * Time utilities
+ * ----------------
+ */
+
+/*
+ * Get elapsed time in microseconds between two timevals
+ */
+static int64
+elapsed_us(struct timeval *start, struct timeval *end)
+{
+ return ((int64) (end->tv_sec - start->tv_sec) * 1000000)
+ + (end->tv_usec - start->tv_usec);
+}
+
+/*
+ * Get current time
+ */
+static void
+get_current_time(struct timeval *tv)
+{
+ gettimeofday(tv, NULL);
+}
+
+/* ----------------
+ * Database oid lookup
+ * ----------------
+ */
+
+static int
+track_table_mutation_get_database_oid_internal(void)
+{
+ int oid = 0;
+ static POOL_RELCACHE *relcache;
+ POOL_CONNECTION_POOL *backend;
+ POOL_SESSION_CONTEXT *session_context;
+
+ /* Safety check: must have shmem initialized */
+ if (track_table_mutation_shmem == NULL)
+ return oid;
+
+ session_context = pool_get_session_context(false);
+ if (session_context == NULL)
+ return oid;
+
+ backend = session_context->backend;
+ if (backend == NULL ||
+ MAIN_CONNECTION(backend) == NULL ||
+ MAIN_CONNECTION(backend)->sp == NULL)
+ return oid;
+
+ /* Ensure database name is valid */
+ if (MAIN_CONNECTION(backend)->sp->database == NULL)
+ return oid;
+
+ if (!relcache)
+ {
+ relcache = pool_create_relcache(
+ pool_config->relcache_size,
+ DATABASE_TO_OID_QUERY,
+ int_register_func,
+ int_unregister_func,
+ false);
+ if (relcache == NULL)
+ {
+ ereport(LOG,
+ (errmsg("track_table_mutation: "
+ "error creating relcache")));
+ return oid;
+ }
+ }
+
+ oid = (int) (intptr_t) pool_search_relcache(
+ relcache, backend,
+ MAIN_CONNECTION(backend)->sp->database);
+ return oid;
+}
+
+int
+pool_track_table_mutation_get_database_oid(void)
+{
+ return track_table_mutation_get_database_oid_internal();
+}
+
+/* ----------------
+ * Table mutation hash table operations
+ * ----------------
+ */
+
+/*
+ * Initialize table mutation hash table
+ */
+static void
+table_map_init(TrackTableMutationHashTable *map,
+ int num_buckets, int max_entries)
+{
+ int *buckets;
+ TrackTableMutationEntry *entries;
+ int i;
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ map->num_buckets = num_buckets;
+ map->max_entries = max_entries;
+ map->num_entries = 0;
+ map->free_list_head = 0;
+
+ buckets = TABLE_MAP_BUCKETS(map);
+ entries = TABLE_MAP_ENTRIES(map);
+
+ /* Initialize all buckets to empty */
+ for (i = 0; i < num_buckets; i++)
+ buckets[i] = invalid;
+
+ /* Initialize free list - chain all entries */
+ for (i = 0; i < max_entries; i++)
+ {
+ entries[i].in_use = false;
+ entries[i].next = (i < max_entries - 1) ?
+ i + 1 : invalid;
+ }
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "table map init %d buckets, "
+ "%d max entries",
+ num_buckets, max_entries)));
+}
+
+/*
+ * Allocate an entry from the free list
+ */
+static int
+table_map_alloc_entry(TrackTableMutationHashTable *map)
+{
+ TrackTableMutationEntry *entries;
+ int idx;
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ entries = TABLE_MAP_ENTRIES(map);
+
+ if (map->free_list_head == invalid)
+ return invalid;
+
+ idx = map->free_list_head;
+ map->free_list_head = entries[idx].next;
+ entries[idx].in_use = true;
+ entries[idx].next = invalid;
+ map->num_entries++;
+
+ return idx;
+}
+
+/*
+ * Free an entry back to the free list
+ */
+static void
+table_map_free_entry(TrackTableMutationHashTable *map,
+ int idx)
+{
+ TrackTableMutationEntry *entries;
+
+ entries = TABLE_MAP_ENTRIES(map);
+
+ entries[idx].in_use = false;
+ entries[idx].next = map->free_list_head;
+ map->free_list_head = idx;
+ map->num_entries--;
+}
+
+/*
+ * Look up a table in the hash table.
+ * Returns entry index or INVALID_INDEX if not found.
+ * Must be called with lock held.
+ */
+static int
+table_map_lookup(TrackTableMutationHashTable *map,
+ int table_oid, int dboid,
+ uint32 hash)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TrackTableMutationEntry *entries;
+ int bucket = hash % map->num_buckets;
+ int idx = buckets[bucket];
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ entries = TABLE_MAP_ENTRIES(map);
+
+ while (idx != invalid)
+ {
+ if (entries[idx].hash == hash &&
+ entries[idx].table_oid == table_oid &&
+ entries[idx].dboid == dboid)
+ {
+ return idx;
+ }
+ idx = entries[idx].next;
+ }
+
+ return invalid;
+}
+
+/*
+ * Insert or update a table entry.
+ * Must be called with lock held.
+ */
+static void
+table_map_insert(TrackTableMutationHashTable *map,
+ int table_oid, int dboid,
+ uint32 hash,
+ struct timeval *write_time)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TrackTableMutationEntry *entries;
+ int bucket = hash % map->num_buckets;
+ int idx;
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ entries = TABLE_MAP_ENTRIES(map);
+
+ /* Check if entry already exists */
+ idx = table_map_lookup(map, table_oid, dboid, hash);
+ if (idx != invalid)
+ {
+ /* Update last write time; keep first_write_time */
+ entries[idx].last_write_time = *write_time;
+ return;
+ }
+
+ /* Allocate new entry */
+ idx = table_map_alloc_entry(map);
+ if (idx == invalid)
+ {
+ int b;
+
+ /* Table is full - evict the head entry of the first non-empty bucket, regardless of its age */
+ for (b = 0; b < map->num_buckets; b++)
+ {
+ if (buckets[b] != invalid)
+ {
+ int victim = buckets[b];
+
+ buckets[b] = entries[victim].next;
+ table_map_free_entry(map, victim);
+ idx = table_map_alloc_entry(map);
+ break;
+ }
+ }
+
+ if (idx == invalid)
+ {
+ ereport(WARNING,
+ (errmsg("track_table_mutation: "
+ "failed to allocate entry "
+ "for oid %d (dboid %d)",
+ table_oid, dboid)));
+ return;
+ }
+ }
+
+ /* Initialize new entry */
+ entries[idx].table_oid = table_oid;
+ entries[idx].dboid = dboid;
+ entries[idx].hash = hash;
+ entries[idx].first_write_time = *write_time;
+ entries[idx].last_write_time = *write_time;
+
+ /* Insert at head of bucket chain */
+ entries[idx].next = buckets[bucket];
+ buckets[bucket] = idx;
+
+ ereport(DEBUG2,
+ (errmsg("track_table_mutation: "
+ "marked oid %d (dboid %d) written",
+ table_oid, dboid)));
+}
+
+/*
+ * Remove expired entries from the table map.
+ * Must be called with lock held.
+ */
+static void
+table_map_cleanup_expired(
+ TrackTableMutationHashTable *map, uint64 ttl_us)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TrackTableMutationEntry *entries;
+ struct timeval now;
+ int64 max_stale_us;
+ int removed = 0;
+ int b;
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ entries = TABLE_MAP_ENTRIES(map);
+ get_current_time(&now);
+
+ max_stale_us = (int64) pool_config
+ ->track_table_mutation_max_staleness * 1000LL;
+
+ for (b = 0; b < map->num_buckets; b++)
+ {
+ int *prev_ptr = &buckets[b];
+ int idx = buckets[b];
+
+ while (idx != invalid)
+ {
+ int64 age;
+ int64 total_age;
+ bool expired;
+
+ age = elapsed_us(
+ &entries[idx].last_write_time, &now);
+ expired = (age > (int64) ttl_us);
+
+ /*
+ * Also evict entries that exceed max_staleness from first write.
+ */
+ if (!expired && max_stale_us > 0)
+ {
+ total_age = elapsed_us(
+ &entries[idx].first_write_time,
+ &now);
+ expired = (total_age >= max_stale_us);
+ }
+
+ if (expired)
+ {
+ /* Entry has expired - remove it */
+ int next = entries[idx].next;
+
+ *prev_ptr = next;
+ table_map_free_entry(map, idx);
+ idx = next;
+ removed++;
+ }
+ else
+ {
+ prev_ptr = &entries[idx].next;
+ idx = entries[idx].next;
+ }
+ }
+ }
+
+ if (removed > 0)
+ {
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "cleaned up %d expired entries",
+ removed)));
+ }
+}
+
+/* ----------------
+ * Parse cache operations
+ * ----------------
+ */
+
+/*
+ * Initialize parse cache
+ */
+static void
+parse_cache_init(QueryParseCache * cache,
+ int num_buckets, int max_entries)
+{
+ int *buckets;
+ QueryParseEntry *entries;
+ int i;
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ cache->num_buckets = num_buckets;
+ cache->max_entries = max_entries;
+ cache->num_entries = 0;
+ cache->free_list_head = (max_entries > 0) ? 0 : invalid;
+ cache->lru_head = invalid;
+ cache->lru_tail = invalid;
+
+ buckets = PARSE_CACHE_BUCKETS(cache);
+ entries = PARSE_CACHE_ENTRIES(cache);
+
+ /* Initialize all buckets to empty */
+ for (i = 0; i < num_buckets; i++)
+ buckets[i] = invalid;
+
+ /* Initialize free list */
+ for (i = 0; i < max_entries; i++)
+ {
+ entries[i].in_use = false;
+ entries[i].next = (i < max_entries - 1) ?
+ i + 1 : invalid;
+ entries[i].lru_prev = invalid;
+ entries[i].lru_next = invalid;
+ }
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "parse cache init %d buckets, "
+ "%d max entries",
+ num_buckets, max_entries)));
+}
+
+/*
+ * Move entry to front of LRU list (most recently used)
+ */
+static void
+parse_cache_lru_touch(QueryParseCache * cache, int idx)
+{
+ QueryParseEntry *entries = PARSE_CACHE_ENTRIES(cache);
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ /* Already at head? */
+ if (cache->lru_head == idx)
+ return;
+
+ /* Remove from current position */
+ if (entries[idx].lru_prev != invalid)
+ entries[entries[idx].lru_prev].lru_next =
+ entries[idx].lru_next;
+ if (entries[idx].lru_next != invalid)
+ entries[entries[idx].lru_next].lru_prev =
+ entries[idx].lru_prev;
+ if (cache->lru_tail == idx)
+ cache->lru_tail = entries[idx].lru_prev;
+
+ /* Insert at head */
+ entries[idx].lru_prev = invalid;
+ entries[idx].lru_next = cache->lru_head;
+ if (cache->lru_head != invalid)
+ entries[cache->lru_head].lru_prev = idx;
+ cache->lru_head = idx;
+ if (cache->lru_tail == invalid)
+ cache->lru_tail = idx;
+}
+
+/*
+ * Add entry to LRU list (at head)
+ */
+static void
+parse_cache_lru_add(QueryParseCache * cache, int idx)
+{
+ QueryParseEntry *entries = PARSE_CACHE_ENTRIES(cache);
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ entries[idx].lru_prev = invalid;
+ entries[idx].lru_next = cache->lru_head;
+
+ if (cache->lru_head != invalid)
+ entries[cache->lru_head].lru_prev = idx;
+
+ cache->lru_head = idx;
+
+ if (cache->lru_tail == invalid)
+ cache->lru_tail = idx;
+}
+
+/*
+ * Remove entry from LRU list
+ */
+static void
+parse_cache_lru_remove(QueryParseCache * cache, int idx)
+{
+ QueryParseEntry *entries = PARSE_CACHE_ENTRIES(cache);
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ if (entries[idx].lru_prev != invalid)
+ entries[entries[idx].lru_prev].lru_next =
+ entries[idx].lru_next;
+ else
+ cache->lru_head = entries[idx].lru_next;
+
+ if (entries[idx].lru_next != invalid)
+ entries[entries[idx].lru_next].lru_prev =
+ entries[idx].lru_prev;
+ else
+ cache->lru_tail = entries[idx].lru_prev;
+
+ entries[idx].lru_prev = invalid;
+ entries[idx].lru_next = invalid;
+}
+
+/*
+ * Allocate entry from free list, evicting LRU if needed
+ */
+static int
+parse_cache_alloc_entry(QueryParseCache * cache)
+{
+ QueryParseEntry *entries = PARSE_CACHE_ENTRIES(cache);
+ int *buckets = PARSE_CACHE_BUCKETS(cache);
+ int idx;
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ if (cache->free_list_head != invalid)
+ {
+ idx = cache->free_list_head;
+ cache->free_list_head = entries[idx].next;
+ entries[idx].in_use = true;
+ entries[idx].next = invalid;
+ cache->num_entries++;
+ return idx;
+ }
+
+ /* No free entries - evict LRU */
+ if (cache->lru_tail == invalid)
+ return invalid;
+
+ idx = cache->lru_tail;
+
+ /* Remove from hash bucket */
+ {
+ int bucket;
+ int *prev_ptr;
+ int curr;
+
+ bucket = entries[idx].query_hash %
+ cache->num_buckets;
+ prev_ptr = &buckets[bucket];
+ curr = buckets[bucket];
+
+ while (curr != invalid)
+ {
+ if (curr == idx)
+ {
+ *prev_ptr = entries[curr].next;
+ break;
+ }
+ prev_ptr = &entries[curr].next;
+ curr = entries[curr].next;
+ }
+ }
+
+ /* Remove from LRU list */
+ parse_cache_lru_remove(cache, idx);
+
+ /* Reinitialize entry */
+ entries[idx].in_use = true;
+ entries[idx].next = invalid;
+
+ return idx;
+}
+
+/*
+ * Look up a query in the parse cache
+ */
+static int
+parse_cache_lookup(QueryParseCache * cache, uint64 hash)
+{
+ int *buckets = PARSE_CACHE_BUCKETS(cache);
+ QueryParseEntry *entries = PARSE_CACHE_ENTRIES(cache);
+ int bucket = hash % cache->num_buckets;
+ int idx = buckets[bucket];
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ while (idx != invalid)
+ {
+ if (entries[idx].query_hash == hash)
+ return idx;
+ idx = entries[idx].next;
+ }
+
+ return invalid;
+}
+
+/* ----------------
+ * Query normalization
+ * ----------------
+ */
+
+/*
+ * Simple query normalization:
+ * - Strip comments (-- and C-style block comments)
+ * - Collapse whitespace
+ * - Convert to lowercase
+ * - Replace quoted text ('...' and "...") with a '$' placeholder
+ * - Drop numeric literals, emitting '$' only when one directly follows a non-space character
+ */
+static size_t
+normalize_query(const char *query, char *output,
+ size_t output_size)
+{
+ const char *src = query;
+ char *dst = output;
+ char *dst_end = output + output_size - 1;
+ bool in_string = false;
+ char string_char = 0;
+ bool last_was_space = true;
+
+ while (*src && dst < dst_end)
+ {
+ /* Handle string literals */
+ if (in_string)
+ {
+ if (*src == string_char)
+ {
+ if (*(src + 1) == string_char)
+ {
+ /* Escaped quote */
+ src += 2;
+ continue;
+ }
+ in_string = false;
+ /* Replace string with placeholder */
+ *dst++ = '$';
+ }
+ src++;
+ continue;
+ }
+
+ /* Check for string start */
+ if (*src == '\'' || *src == '"')
+ {
+ in_string = true;
+ string_char = *src;
+ src++;
+ continue;
+ }
+
+ /* Handle single-line comments */
+ if (*src == '-' && *(src + 1) == '-')
+ {
+ while (*src && *src != '\n')
+ src++;
+ continue;
+ }
+
+ /* Handle multi-line comments */
+ if (*src == '/' && *(src + 1) == '*')
+ {
+ src += 2;
+ while (*src &&
+ !(*src == '*' && *(src + 1) == '/'))
+ src++;
+ if (*src)
+ src += 2;
+ continue;
+ }
+
+ /* Handle whitespace */
+ if (*src == ' ' || *src == '\t' ||
+ *src == '\n' || *src == '\r')
+ {
+ if (!last_was_space)
+ {
+ *dst++ = ' ';
+ last_was_space = true;
+ }
+ src++;
+ continue;
+ }
+
+ /* Handle numbers - replace with placeholder */
+ if ((*src >= '0' && *src <= '9') ||
+ (*src == '.' && *(src + 1) >= '0' &&
+ *(src + 1) <= '9'))
+ {
+ while (*src &&
+ ((*src >= '0' && *src <= '9') ||
+ *src == '.'))
+ src++;
+ if (!last_was_space &&
+ dst > output && *(dst - 1) != '$')
+ *dst++ = '$';
+ last_was_space = false;
+ continue;
+ }
+
+ /* Regular character - convert to lowercase */
+ if (*src >= 'A' && *src <= 'Z')
+ *dst++ = *src + 32;
+ else
+ *dst++ = *src;
+
+ last_was_space = false;
+ src++;
+ }
+
+ /* Remove trailing space */
+ if (dst > output && *(dst - 1) == ' ')
+ dst--;
+
+ *dst = '\0';
+ return dst - output;
+}
+
+/* ----------------
+ * Public API implementation
+ * ----------------
+ */
+
+/*
+ * Calculate the total shared memory size required
+ * for the track table mutation feature.
+ */
+Size
+pool_track_table_mutation_shmem_size(void)
+{
+ Size size = 0;
+ int tbl_bkt;
+ int tbl_sz;
+ int qry_bkt;
+ int qry_sz;
+
+ tbl_bkt = pool_config->track_table_mutation_table_buckets;
+ tbl_sz = pool_config->track_table_mutation_table_size;
+ qry_bkt = pool_config->track_table_mutation_query_buckets;
+ qry_sz = pool_config->track_table_mutation_query_parse_cache_size;
+
+ /* Main structure */
+ size += sizeof(TrackTableMutationShmem);
+
+ /* Table mutation hash table */
+ size += sizeof(TrackTableMutationHashTable);
+ size += tbl_bkt * sizeof(int);
+ size += tbl_sz * sizeof(TrackTableMutationEntry);
+
+ /* Parse cache */
+ size += sizeof(QueryParseCache);
+ size += qry_bkt * sizeof(int);
+ size += qry_sz * sizeof(QueryParseEntry);
+
+ return size;
+}
+
+/*
+ * Initialize shared memory structures for the
+ * track table mutation feature. Allocates and sets
+ * up the table map and parse cache in shared memory.
+ * Called once from pgpool main process at startup.
+ */
+void
+pool_track_table_mutation_init(void)
+{
+#ifndef POOL_PRIVATE
+ Size shmem_size;
+ char *shmem_ptr;
+ TrackTableMutationState *st;
+ int tbl_bkt;
+ int tbl_sz;
+ int qry_bkt;
+ int qry_sz;
+
+ if (pool_config->disable_load_balance_on_write !=
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "feature disabled")));
+ return;
+ }
+
+ tbl_bkt = pool_config->track_table_mutation_table_buckets;
+ tbl_sz = pool_config->track_table_mutation_table_size;
+ qry_bkt = pool_config->track_table_mutation_query_buckets;
+ qry_sz = pool_config->track_table_mutation_query_parse_cache_size;
+
+ shmem_size = pool_track_table_mutation_shmem_size();
+
+ /*
+ * Allocate from the main shared memory segment. Memory is zeroed by
+ * initialize_shared_memory_main_segment().
+ */
+ shmem_ptr = pool_shared_memory_segment_get_chunk(
+ shmem_size);
+ if (shmem_ptr == NULL)
+ {
+ ereport(ERROR,
+ (errmsg("track_table_mutation: "
+ "failed to allocate %zu bytes",
+ shmem_size)));
+ return;
+ }
+
+ /* Set up pointers within shared memory */
+ track_table_mutation_shmem =
+ (TrackTableMutationShmem *) shmem_ptr;
+ shmem_ptr += sizeof(TrackTableMutationShmem);
+
+ track_table_mutation_shmem->table_map =
+ (TrackTableMutationHashTable *) shmem_ptr;
+ shmem_ptr += sizeof(TrackTableMutationHashTable);
+ shmem_ptr += tbl_bkt * sizeof(int);
+ shmem_ptr += tbl_sz * sizeof(TrackTableMutationEntry);
+
+ track_table_mutation_shmem->query_cache =
+ (QueryParseCache *) shmem_ptr;
+
+ /* Initialize structures */
+ table_map_init(
+ track_table_mutation_shmem->table_map,
+ tbl_bkt, tbl_sz);
+
+ parse_cache_init(
+ track_table_mutation_shmem->query_cache,
+ qry_bkt, qry_sz);
+
+ /* Initialize global state */
+ st = &track_table_mutation_shmem->state;
+ st->initialized = true;
+ st->current_ttl_us = TRACK_TABLE_MUTATION_DEFAULT_TTL_US;
+ get_current_time(&st->ttl_last_updated);
+ get_current_time(&st->last_cleanup_time);
+ st->global_cold_start_until.tv_sec = 0;
+ st->global_cold_start_until.tv_usec = 0;
+ st->stats_queries_checked = 0;
+ st->stats_forced_primary = 0;
+ st->stats_allowed_replica = 0;
+
+ ereport(LOG,
+ (errmsg("track_table_mutation: "
+ "initialized with %zu bytes shmem",
+ shmem_size)));
+#endif
+}
+
+/*
+ * Initialize per-child process state.
+ * Records the process start time for cold start
+ * period tracking. Called when a child process starts.
+ */
+void
+pool_track_table_mutation_child_init(void)
+{
+ int dur;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return;
+
+ get_current_time(&process_start_time);
+ cold_start_initialized = true;
+ dur = pool_config->track_table_mutation_cold_start_duration;
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "child init, cold start %d ms",
+ dur)));
+}
+
+/*
+ * Check if the process is in cold start period.
+ * During cold start, all queries are routed to
+ * primary to avoid stale reads. Checks both
+ * per-process and global (watchdog) cold start.
+ */
+bool
+pool_track_table_mutation_in_cold_start(void)
+{
+ struct timeval now;
+ int64 elapsed_ms;
+ int dur;
+ TrackTableMutationState *st;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return false;
+
+ dur = pool_config->track_table_mutation_cold_start_duration;
+ if (dur <= 0)
+ return false;
+
+ get_current_time(&now);
+ st = &track_table_mutation_shmem->state;
+
+ /* Check watchdog-triggered global cold start */
+ if (st->global_cold_start_until.tv_sec != 0 &&
+ elapsed_us(&now,
+ &st->global_cold_start_until) > 0)
+ {
+ return true;
+ }
+
+ /* Check per-process cold start */
+ if (!cold_start_initialized)
+ return false;
+
+ elapsed_ms = elapsed_us(&process_start_time, &now) / 1000;
+
+ if (elapsed_ms < dur)
+ {
+ ereport(DEBUG2,
+ (errmsg("track_table_mutation: "
+ "cold start (%ld/%d ms)",
+ (long) elapsed_ms, dur)));
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Trigger a global cold start for all processes.
+ * Sets the cold start end time in shared memory.
+ * Called after watchdog leader change to force all
+ * queries to primary during the transition.
+ */
+void
+pool_track_table_mutation_trigger_global_cold_start(void)
+{
+ struct timeval now;
+ struct timeval *until;
+ int dur;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return;
+
+ dur = pool_config->track_table_mutation_cold_start_duration;
+ if (dur <= 0)
+ return;
+
+ get_current_time(&now);
+ until = &track_table_mutation_shmem->state
+ .global_cold_start_until;
+ *until = now;
+ until->tv_sec += dur / 1000;
+ until->tv_usec += (dur % 1000) * 1000;
+ if (until->tv_usec >= 1000000)
+ {
+ until->tv_sec += until->tv_usec / 1000000;
+ until->tv_usec %= 1000000;
+ }
+
+ ereport(LOG,
+ (errmsg("track_table_mutation: "
+ "global cold start for %d ms",
+ dur)));
+}
+
+/*
+ * Check if a table was recently written (is "stale").
+ * Returns true if reads should go to primary because
+ * the table was written within the current TTL window.
+ */
+bool
+pool_track_table_mutation_table_is_stale(
+ int table_oid, int dboid)
+{
+ TrackTableMutationHashTable *map;
+ struct timeval now;
+ uint64 ttl_us;
+ uint32 hash;
+ int idx;
+ bool is_stale = false;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return false;
+
+ if (table_oid <= 0 || dboid <= 0)
+ {
+ is_stale = true;
+ goto update_stats;
+ }
+
+ map = track_table_mutation_shmem->table_map;
+ hash = fnv1a_hash_table_key(table_oid, dboid);
+
+ table_map_lock();
+
+ idx = table_map_lookup(map, table_oid, dboid, hash);
+ if (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ TrackTableMutationEntry *entries;
+ int64 age;
+ int64 total_age;
+ int64 max_stale_us;
+
+ entries = TABLE_MAP_ENTRIES(map);
+ get_current_time(&now);
+ ttl_us = track_table_mutation_shmem->state
+ .current_ttl_us;
+
+ age = elapsed_us(
+ &entries[idx].last_write_time, &now);
+ is_stale = (age < (int64) ttl_us);
+
+ /*
+ * Enforce max_staleness hard cap: no entry can force primary routing
+ * longer than max_staleness from its first write.
+ */
+ if (is_stale)
+ {
+ max_stale_us = (int64) pool_config
+ ->track_table_mutation_max_staleness
+ * 1000LL;
+ if (max_stale_us > 0)
+ {
+ total_age = elapsed_us(
+ &entries[idx].first_write_time,
+ &now);
+ if (total_age >= max_stale_us)
+ is_stale = false;
+ }
+ }
+
+ ereport(DEBUG2,
+ (errmsg("track_table_mutation: "
+ "oid %d dboid %d "
+ "elapsed=%ld ttl=%lu stale=%d",
+ table_oid, dboid,
+ (long) age,
+ (unsigned long) ttl_us,
+ is_stale)));
+ }
+
+ table_map_unlock();
+
+update_stats:
+ /* Update statistics using semaphore */
+ if (track_table_mutation_shmem != NULL)
+ {
+ TrackTableMutationState *st;
+
+ table_map_lock();
+ st = &track_table_mutation_shmem->state;
+ st->stats_queries_checked++;
+ if (is_stale)
+ st->stats_forced_primary++;
+ else
+ st->stats_allowed_replica++;
+ table_map_unlock();
+ }
+
+ return is_stale;
+}
+
+/*
+ * Mark multiple tables as recently written.
+ * Called after DML queries complete to record
+ * which tables were modified.
+ */
+void
+pool_track_table_mutation_mark_tables_written(
+ const int *table_oids, int num_tables, int dboid)
+{
+ TrackTableMutationHashTable *map;
+ TrackTableMutationState *st;
+ struct timeval now;
+ int i;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return;
+
+ if (num_tables <= 0 || table_oids == NULL ||
+ dboid <= 0)
+ return;
+
+ map = track_table_mutation_shmem->table_map;
+ st = &track_table_mutation_shmem->state;
+ get_current_time(&now);
+
+ table_map_lock();
+
+ /* Periodically clean up expired entries */
+ if (map->num_entries > map->max_entries * 3 / 4)
+ {
+ int64 since_cleanup;
+
+ since_cleanup = elapsed_us(
+ &st->last_cleanup_time, &now);
+ /* 100ms interval */
+ if (since_cleanup > 100000)
+ {
+ table_map_cleanup_expired(
+ map, st->current_ttl_us);
+ st->last_cleanup_time = now;
+ }
+ }
+
+ for (i = 0; i < num_tables; i++)
+ {
+ uint32 hash;
+ int table_oid = table_oids[i];
+
+ if (table_oid > 0)
+ {
+ hash = fnv1a_hash_table_key(
+ table_oid, dboid);
+ table_map_insert(map, table_oid,
+ dboid, hash, &now);
+ }
+ }
+
+ table_map_unlock();
+}
+
+/*
+ * Mark a single table as recently written.
+ */
+void
+pool_track_table_mutation_mark_table_written(
+ int table_oid, int dboid)
+{
+ if (table_oid > 0 && dboid > 0)
+ {
+ const int tables[1] = {table_oid};
+
+ pool_track_table_mutation_mark_tables_written(
+ tables, 1, dboid);
+ }
+}
+
+/*
+ * Update the staleness TTL based on observed
+ * replication delay. New TTL = delay * factor,
+ * clamped to [default_ttl, 1 hour].
+ */
+void
+pool_track_table_mutation_update_ttl(uint64 delay_us)
+{
+ uint64 new_ttl;
+ double factor;
+ TrackTableMutationState *st;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return;
+
+ factor = pool_config->track_table_mutation_ttl_factor;
+ new_ttl = (uint64) (delay_us * factor);
+ if (new_ttl < TRACK_TABLE_MUTATION_DEFAULT_TTL_US)
+ new_ttl = TRACK_TABLE_MUTATION_DEFAULT_TTL_US;
+
+ /* Maximum TTL of 1 hour */
+ if (new_ttl > 3600ULL * 1000000ULL)
+ new_ttl = 3600ULL * 1000000ULL;
+
+ st = &track_table_mutation_shmem->state;
+ st->current_ttl_us = new_ttl;
+ get_current_time(&st->ttl_last_updated);
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "TTL=%lu us (delay=%lu factor=%.1f)",
+ (unsigned long) new_ttl,
+ (unsigned long) delay_us,
+ factor)));
+}
+
+/*
+ * Look up a cached parse result by query hash.
+ * Returns true and fills output parameters if
+ * the query was found in the parse cache.
+ */
+bool
+pool_track_table_mutation_get_cached_parse(
+ uint64 hash, bool *is_write,
+ char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
+ int *num_tables)
+{
+ QueryParseCache *cache;
+ int idx;
+ bool found = false;
+ int max_tables;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return false;
+
+ max_tables = TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY;
+ cache = track_table_mutation_shmem->query_cache;
+
+ parse_cache_lock();
+
+ idx = parse_cache_lookup(cache, hash);
+ if (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ QueryParseEntry *entries;
+ int i;
+ int namelen;
+
+ entries = PARSE_CACHE_ENTRIES(cache);
+ namelen = TRACK_TABLE_MUTATION_TABLE_NAME_LEN;
+ *is_write = entries[idx].is_write;
+ *num_tables = entries[idx].num_tables;
+
+ for (i = 0;
+ i < entries[idx].num_tables &&
+ i < max_tables;
+ i++)
+ {
+ strlcpy(table_names[i],
+ entries[idx].table_names[i],
+ namelen);
+ }
+
+ /* Move to front of LRU */
+ parse_cache_lru_touch(cache, idx);
+ found = true;
+ }
+
+ parse_cache_unlock();
+
+ return found;
+}
+
+/*
+ * Store a parse result in the shared cache.
+ * Evicts the LRU entry if the cache is full.
+ */
+void
+pool_track_table_mutation_cache_parse(
+ uint64 hash, bool is_write,
+ const char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
+ int num_tables)
+{
+ QueryParseCache *cache;
+ int *buckets;
+ QueryParseEntry *entries;
+ int idx;
+ int bucket;
+ int max_tables;
+ int namelen;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return;
+
+ max_tables = TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY;
+ namelen = TRACK_TABLE_MUTATION_TABLE_NAME_LEN;
+ cache = track_table_mutation_shmem->query_cache;
+
+ parse_cache_lock();
+
+ /* Check if already exists */
+ idx = parse_cache_lookup(cache, hash);
+ if (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ parse_cache_unlock();
+ return;
+ }
+
+ /* Allocate new entry (may evict LRU) */
+ idx = parse_cache_alloc_entry(cache);
+ if (idx == TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ parse_cache_unlock();
+ ereport(WARNING,
+ (errmsg("track_table_mutation: "
+ "parse cache alloc failed")));
+ return;
+ }
+
+ entries = PARSE_CACHE_ENTRIES(cache);
+ buckets = PARSE_CACHE_BUCKETS(cache);
+
+ /* Fill in entry */
+ entries[idx].query_hash = hash;
+ entries[idx].is_write = is_write;
+ entries[idx].num_tables =
+ (num_tables > max_tables) ?
+ max_tables : num_tables;
+
+ {
+ int i;
+
+ for (i = 0; i < entries[idx].num_tables; i++)
+ {
+ strlcpy(entries[idx].table_names[i],
+ table_names[i], namelen);
+ }
+ }
+
+ /* Insert into hash bucket */
+ bucket = hash % cache->num_buckets;
+ entries[idx].next = buckets[bucket];
+ buckets[bucket] = idx;
+
+ /* Add to LRU list */
+ parse_cache_lru_add(cache, idx);
+
+ parse_cache_unlock();
+}
+
+/*
+ * Normalize a SQL query and compute its 64-bit hash.
+ * Strips comments, collapses whitespace, lowercases,
+ * and replaces literals with placeholders.
+ */
+uint64
+pool_track_table_mutation_normalize_and_hash(
+ const char *query)
+{
+ char normalized[8192];
+ size_t len;
+
+ if (query == NULL || query[0] == '\0')
+ return 0;
+
+ len = normalize_query(query, normalized,
+ sizeof(normalized));
+ if (len == 0)
+ return 0;
+
+ return fnv1a_hash_64(normalized, len);
+}
--
2.43.0
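
The sizing logic in pool_track_table_mutation_shmem_size() above can be sketched in standalone form. This is only an illustration under stated assumptions: the sketch_/SKETCH_ names are hypothetical, and the per-entry sizes of 40 and 640 bytes are the figures quoted in the documentation hunk later in this thread, not measured sizeof() values.

```c
#include <stddef.h>

/*
 * Assumed per-entry sizes, taken from the figures quoted in the
 * documentation hunk (40 and 640 bytes). The real sizeof() values
 * depend on the struct definitions and the platform.
 */
#define SKETCH_TABLE_ENTRY_BYTES 40
#define SKETCH_QUERY_ENTRY_BYTES 640

/* One hash table: an int per bucket plus a fixed-size slot per entry. */
static size_t
sketch_hash_bytes(size_t buckets, size_t entries, size_t entry_bytes)
{
	return buckets * sizeof(int) + entries * entry_bytes;
}

/* Table map plus parse cache; the small header structs are ignored. */
static size_t
sketch_total_bytes(size_t tbl_bkt, size_t tbl_sz,
				   size_t qry_bkt, size_t qry_sz)
{
	return sketch_hash_bytes(tbl_bkt, tbl_sz, SKETCH_TABLE_ENTRY_BYTES) +
		sketch_hash_bytes(qry_bkt, qry_sz, SKETCH_QUERY_ENTRY_BYTES);
}
```

Assuming 1024 buckets and 2048 entries for the table map, 2048 buckets and 10000 entries for the parse cache, and a 4-byte int, sketch_total_bytes(1024, 2048, 2048, 10000) comes to 6,494,208 bytes, or about 6.2 MiB. Almost all of it is the parse cache.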
* Re: Proposal: Recent mutated table tracking in memory
From: Nadav Shatz @ 2026-04-09 07:21 UTC (permalink / raw)
To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]
Hi Tatsuo,
Looks good to me, thanks!
Please go ahead with your review. I look forward to hearing back from you.
Best,
On Tue, Apr 7, 2026 at 11:44 AM Tatsuo Ishii <[email protected]> wrote:
> Hi Nadav,
>
> >> Yes, I ran into it during the work on the feature. Let me know if you want
> >> me to separately submit it.
> >
> > Thank you for the offer, but I have already pushed that part.
> >
> > https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=24755985692be577bdcf487ebddb2c2ff6116661
>
> I have modified your patch only by running pgindent (plus a small
> addition to typedefs.list). No detailed code review has been done yet.
> I also wrote a commit message that tries to summarize the feature.
> Please let me know of any corrections or enhancements.
>
> Based on this, I will start a more detailed review. It will take a
> while.
>
> Regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>
--
Nadav Shatz
Tailor Brands | CTO
* Re: Proposal: Recent mutated table tracking in memory
From: Tatsuo Ishii @ 2026-04-14 22:43 UTC (permalink / raw)
To: ; +Cc: [email protected]
Hi Nadav,
> Hi Tatsuo,
>
> Looks good to me, thanks!
>
> Please go ahead with your review. I look forward to hearing back from you.
Here are the code review results.
diff --git a/doc/src/sgml/loadbalance.sgml b/doc/src/sgml/loadbalance.sgml
index 9e1e7b39b..7384ce81a 100644
--- a/doc/src/sgml/loadbalance.sgml
+++ b/doc/src/sgml/loadbalance.sgml
:
+ <sect2 id="runtime-config-table-mutation-map">
+ <title>Table Mutation Map Configuration (Lagless Replica Reads)</title>
"(Lagless Replica Reads)" sounds like an advertisement to me. It
should be removed.
+ <para>
+ These parameters configure the track table mutation feature, which is activated by setting
+ <xref linkend="guc-disable-load-balance-on-write"> to <literal>dml_adaptive_global</literal>.
+ The feature tracks recently written tables to prevent stale reads from replica nodes during
+ replication lag, implementing the "lagless" architecture pattern for distributed systems
+ with read replicas.
I don't think the feature guarantees "lagless" reads at all times and
in all cases.
+ <para>
+ This feature requires time-based replication delay monitoring. This can be provided by either
+ <xref linkend="guc-replication-delay-source-cmd"> (external command mode) or by setting
+ <xref linkend="guc-delay-threshold-by-time"> (which uses <literal>pg_stat_replication.replay_lag</literal>
+ from PostgreSQL 10+). At least one of these must be configured for the TTL calculation to work.
If neither of these is set, what happens? Is it an error? This needs
to be described.
+ </para>
+
+ <warning>
+ <para>
+ Enabling <literal>dml_adaptive_global</literal> increases shared memory consumption. With default settings,
+ the feature requires approximately 6.4 MB of shared memory (0.1 MB for table tracking + 6.3 MB for query cache).
"query cache" should be "query parse cache".
+ Memory usage scales with configuration parameters:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ Table tracking: <literal>track_table_mutation_table_size * 40 bytes</literal> (default: 2048 * 40 = ~80 KB)
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Query cache: <literal>track_table_mutation_query_parse_cache_size * 640 bytes</literal> (default: 10000 * 640 = ~6.3 MB)
"query cache" should be "query parse cache".
+ <title>Limitations</title>
I think the number of tables tracked in a SELECT is limited to 8. It
should be mentioned.
diff --git a/src/context/pool_query_context.c b/src/context/pool_query_context.c
index a056ac596..0190d3673 100644
--- a/src/context/pool_query_context.c
+++ b/src/context/pool_query_context.c
@@ -1828,15 +1829,23 @@ is_in_list(char *name, List *list)
static bool
is_select_object_in_temp_write_list(Node *node, void *context)
{
- if (node == NULL || pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE)
+ if (node == NULL ||
+ !DLBOW_IS_DML_ADAPTIVE(
+ pool_config->disable_load_balance_on_write))
You don't need to split the line.
+ is_adaptive = DLBOW_IS_DML_ADAPTIVE(
+ pool_config->disable_load_balance_on_write);
You don't need to split the line.
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && session_context->is_in_transaction)
+ if (is_adaptive &&
+ session_context->is_in_transaction)
{
ereport(DEBUG1,
(errmsg("is_select_object_in_temp_write_list: \"%s\", found relation \"%s\"", (char *) context, rgv->relname)));
This line is too long. Please split.
@@ -1880,7 +1889,13 @@ static char *get_associated_object_from_dml_adaptive_relations
void
check_object_relationship_list(char *name, bool is_func_name)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && pool_config->parsed_dml_adaptive_object_relationship_list)
+ bool is_adaptive;
+
+ is_adaptive = DLBOW_IS_DML_ADAPTIVE(
+ pool_config->disable_load_balance_on_write);
I wrote in the commit message:
modifications are only detected in the same transaction). Note,
however, you cannot use dml_adaptive_object_relationship_list to track
dependency among table and other objects.
In my understanding the feature does not use
dml_adaptive_object_relationship_list. If this is correct, why is
check_object_relationship_list() called here in the
dml_adaptive_global case? If the feature does use
dml_adaptive_object_relationship_list, test cases should be included.
diff --git a/src/utils/pool_track_table_mutation.c b/src/utils/pool_track_table_mutation.c
new file mode 100644
index 000000000..9be46b28f
--- /dev/null
+++ b/src/utils/pool_track_table_mutation.c
It seems the following functions are not used anywhere. I wonder if
this feature actually uses the "query parse cache".
pool_track_table_mutation_get_cached_parse
pool_track_table_mutation_cache_parse
pool_track_table_mutation_normalize_and_hash
Besides the code review, I modified one of the regression tests to
check whether the feature coexists with the existing in-memory query
cache feature. After applying the attached patch, I ran 006.memqcache
and got the following result.
cd src/test/regression
./regress.sh 006
creating pgpool-II temporary installation ...
moving pgpool_setup to temporary installation path ...
moving watchdog_setup to temporary installation path ...
using pgpool-II at /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/temp/installed
*************************
REGRESSION MODE : install
Pgpool-II version : pgpool-II version 4.8devel (mitsukakeboshi)
Pgpool-II install path : /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/temp/installed
PostgreSQL bin : /usr/local/pgsql/bin
PostgreSQL Major version : 18
pgbench : /usr/local/pgsql/bin/pgbench
PostgreSQL jdbc : /usr/local/pgsql/share/postgresql-9.2-1003.jdbc4.jar
*************************
testing 006.memqcache...failed.
out of 1 ok:0 failed:1 timeout:0
log/006.memqcache shows:
../expected.txt result.txt differ: char 1, line 1
So I checked the test script and found the error was generated by a
Java program test.
java jdbctest > result.txt 2>&1
cmp ../expected.txt result.txt
if [ $? != 0 ];then
./shutdownall
exit 1
fi
In jdbctest.java:
/*
* Cache test in an explicit transaction
*/
conn.setAutoCommit(false);
// execute DML. This should prevent SELECTs from using query cache in the transaction.
sql = "UPDATE t1 SET i = 2;";
pst = conn.createStatement();
pst.executeUpdate(sql);
pst.close();
// should not use the cache and should return "2", rather than "1"
prest = conn.prepareStatement("SELECT * FROM t1");
rs = prest.executeQuery();
The expected file (expected.txt) has "2" but the result file
(testdir/result.txt) was "1". This is the reason why the test
failed. I wonder if there's something wrong with the feature when the
query cache is enabled. Can you look into this?
Regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
Attachments:
[text/x-patch] 006.memqcache.patch (545B, 2-006.memqcache.patch)
diff --git a/src/test/regression/tests/006.memqcache/test.sh b/src/test/regression/tests/006.memqcache/test.sh
index f2371744d..1e854acc4 100755
--- a/src/test/regression/tests/006.memqcache/test.sh
+++ b/src/test/regression/tests/006.memqcache/test.sh
@@ -37,6 +37,7 @@ do
echo "log_per_node_statement = on" >> etc/pgpool.conf
echo "log_client_messages = on" >> etc/pgpool.conf
echo "log_min_messages = debug5" >> etc/pgpool.conf
+ echo "disable_load_balance_on_write = dml_adaptive_global" >> etc/pgpool.conf
source ./bashrc.ports
* Re: Proposal: Recent mutated table tracking in memory
@ 2026-04-15 12:17 ` Nadav Shatz <[email protected]>
2026-04-19 07:24 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
0 siblings, 1 reply; 44+ messages in thread
From: Nadav Shatz @ 2026-04-15 12:17 UTC (permalink / raw)
To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]
Hi Tatsuo,
Thank you for the detailed review. The attached patch addresses all items.
memqcache bug fix
-----------------
Good catch. The root cause: pool_set_writing_transaction() was
explicitly skipping dml_adaptive_global, so
pool_is_writing_transaction() always returned false in this mode.
The query cache fetch guard at pool_proto_modules.c:270
(!pool_is_writing_transaction()) then served stale cached results
after DML in the same transaction.
Fix: pool_set_writing_transaction() now sets the flag for
dml_adaptive_global (only 'off' and 'dml_adaptive' skip it). This
ensures the query cache is properly bypassed after writes within
the same transaction.
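The before/after behaviour can be modelled in a few lines of C. This
is only a sketch: the enum and function names below are hypothetical
stand-ins, not pgpool's real definitions, and the mode list is
simplified to what this discussion needs.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the disable_load_balance_on_write modes.
 * The real DLBOW_* enum in pgpool is not reproduced here. */
typedef enum
{
    WM_OFF,
    WM_TRANSACTION,
    WM_DML_ADAPTIVE,
    WM_DML_ADAPTIVE_GLOBAL
} SketchWriteMode;

/* Before the fix: both adaptive modes skipped the writing flag. */
bool sets_writing_flag_before(SketchWriteMode mode)
{
    return mode != WM_OFF &&
           mode != WM_DML_ADAPTIVE &&
           mode != WM_DML_ADAPTIVE_GLOBAL;
}

/* After the fix: only 'off' and 'dml_adaptive' skip the flag. */
bool sets_writing_flag_after(SketchWriteMode mode)
{
    return mode != WM_OFF && mode != WM_DML_ADAPTIVE;
}

/* Model of the cache fetch guard: a cached result may be served only
 * while the transaction has not been marked as writing. */
bool may_fetch_from_cache(bool writing_transaction)
{
    return !writing_transaction;
}
```

With WM_DML_ADAPTIVE_GLOBAL, the "before" predicate leaves the flag
unset, so the guard still allows a cache fetch after the UPDATE; the
"after" predicate sets the flag and the guard refuses the fetch.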
Removed dead query parse cache code (~700 lines)
-------------------------------------------------
You're right -- pool_track_table_mutation_get_cached_parse,
pool_track_table_mutation_cache_parse, and
pool_track_table_mutation_normalize_and_hash were never called.
These were leftover from an earlier design where we planned to
cache SQL parse results in shared memory. The feature ended up
using pgpool's existing parser directly, and this code was never
wired up.
Removed: QueryParseCache and QueryParseEntry structs, all related
static functions, the TRACK_TABLE_MUTATION_QUERY_SEM semaphore,
and the track_table_mutation_query_buckets /
track_table_mutation_query_parse_cache_size configuration
parameters. This also reduces shared memory usage from ~6.4 MB
to ~80 KB with default settings.
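The memory figures follow from simple multiplication. A small sketch,
assuming the per-entry sizes quoted in the documentation (40 bytes per
tracked table, 640 bytes per parse cache entry) rather than sizes
measured from the actual structs; the function and macro names are
made up for illustration.

```c
#include <stddef.h>

/* Per-entry sizes as quoted in the docs under review (assumed, not
 * derived from sizeof on the real structs). */
#define SKETCH_TABLE_ENTRY_BYTES 40
#define SKETCH_PARSE_ENTRY_BYTES 640

/* Shared memory used by the table map for a given table_size setting. */
size_t table_map_bytes(size_t table_size)
{
    return table_size * SKETCH_TABLE_ENTRY_BYTES;
}

/* Shared memory the removed query parse cache would have used. */
size_t parse_cache_bytes(size_t cache_size)
{
    return cache_size * SKETCH_PARSE_ENTRY_BYTES;
}
```

table_map_bytes(2048) is 81920 bytes (80 KB) and
parse_cache_bytes(10000) is 6400000 bytes, which is where nearly all
of the earlier total came from.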
check_object_relationship_list scope
-------------------------------------
You're correct -- dml_adaptive_global does not use
dml_adaptive_object_relationship_list. Changed
check_object_relationship_list() to check for DLBOW_DML_ADAPTIVE
only, not DLBOW_IS_DML_ADAPTIVE (which includes global).
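To illustrate the difference between the two checks, here is a sketch
with hypothetical names. The macro body is my assumption of what a
"matches either adaptive mode" test looks like; it is not copied from
the pgpool headers.

```c
#include <stdbool.h>

/* Hypothetical mode values, for illustration only. */
typedef enum
{
    SK_OFF,
    SK_DML_ADAPTIVE,
    SK_DML_ADAPTIVE_GLOBAL
} SketchDlbow;

/* Assumed shape of the broad check: true for either adaptive mode. */
#define SKETCH_IS_DML_ADAPTIVE(m) \
    ((m) == SK_DML_ADAPTIVE || (m) == SK_DML_ADAPTIVE_GLOBAL)

/* Narrow check used for the relationship list: plain dml_adaptive
 * only, and only when a list has actually been parsed. */
bool consults_relationship_list(SketchDlbow mode, bool list_is_parsed)
{
    return mode == SK_DML_ADAPTIVE && list_is_parsed;
}
```

Under this model the global mode still passes the broad macro, but
never reaches the relationship list lookup.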
Documentation fixes
-------------------
- Removed "(Lagless Replica Reads)" from section title and
"lagless" language from description.
- Described fallback behavior when neither
replication_delay_source_cmd nor delay_threshold_by_time is
configured (TTL stays at 100ms default minimum).
- "query cache" references removed (the query parse cache is gone).
- Added 128-table-per-SELECT limit to Limitations section
(uses POOL_MAX_SELECT_OIDS).
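For the TTL fallback in particular, the documented behaviour can be
sketched roughly as below. The function name is hypothetical, and
clamping a computed TTL up to the 100 ms minimum is my reading of
"default minimum", not something taken from the patch itself.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MIN_TTL_MS 100.0 /* default minimum TTL from the docs */

/* TTL = replication delay * factor when a delay source is configured;
 * otherwise the TTL stays at the minimum and is never updated. */
double sketch_ttl_ms(bool have_delay_source, double delay_ms, double factor)
{
    double ttl;

    assert(delay_ms >= 0.0 && factor > 0.0);

    if (!have_delay_source)
        return SKETCH_MIN_TTL_MS;

    ttl = delay_ms * factor;
    /* Assumed: never go below the minimum even with a delay source. */
    return ttl < SKETCH_MIN_TTL_MS ? SKETCH_MIN_TTL_MS : ttl;
}
```

With no delay source the result is always 100 ms regardless of the
real lag, which is why the docs now warn about suboptimal routing in
that configuration.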
Code style fixes
----------------
- DLBOW_IS_DML_ADAPTIVE() calls no longer split across lines.
- Split the long errmsg line in
is_select_object_in_temp_write_list.
- Removed redundant is_adaptive variable in
is_select_object_in_temp_write_list (the check at function
entry already guarantees it).
Thanks!
--
Nadav Shatz
Tailor Brands | CTO
Attachments:
[application/octet-stream] v2-0001-address-review.patch (34.2K, 3-v2-0001-address-review.patch)
From ceebe131825941e1d49dd071bf32ffcb021339a5 Mon Sep 17 00:00:00 2001
From: Nadav Shatz <[email protected]>
Date: Wed, 15 Apr 2026 11:44:21 +0300
Subject: [PATCH] Address review: remove query parse cache, fix memqcache bug.
- Remove dead query parse cache code (QueryParseCache,
QueryParseEntry, and all related functions). These were
never wired up; the feature uses pgpool's existing parser.
This removes ~700 lines, the TRACK_TABLE_MUTATION_QUERY_SEM
semaphore, and the track_table_mutation_query_buckets and
track_table_mutation_query_parse_cache_size parameters.
- Fix stale read from query cache (memqcache) when
dml_adaptive_global is active. pool_set_writing_transaction()
was skipping dml_adaptive_global, so pool_is_writing_transaction()
always returned false, allowing cached results after DML in the
same transaction. Now dml_adaptive_global sets the flag so the
query cache is properly skipped after writes.
- Restrict check_object_relationship_list() to dml_adaptive only.
dml_adaptive_global does not use
dml_adaptive_object_relationship_list.
- Fix docs: remove marketing language, describe behavior when
no delay source is configured, add 128-table-per-SELECT limit
to limitations, fix line length and split issues.
Author: Nadav Shatz <[email protected]>
---
doc/src/sgml/loadbalance.sgml | 82 +--
src/config/pool_config_variables.c | 24 -
src/context/pool_query_context.c | 31 +-
src/context/pool_session_context.c | 10 +-
src/include/pool.h | 3 +-
src/include/pool_config.h | 4 -
src/include/utils/pool_track_table_mutation.h | 80 ---
src/sample/pgpool.conf.sample-stream | 13 +-
src/tools/pgindent/typedefs.list | 2 -
src/utils/pool_track_table_mutation.c | 550 +-----------------
10 files changed, 42 insertions(+), 757 deletions(-)
diff --git a/doc/src/sgml/loadbalance.sgml b/doc/src/sgml/loadbalance.sgml
index 7384ce81a..d4fbcf1a5 100644
--- a/doc/src/sgml/loadbalance.sgml
+++ b/doc/src/sgml/loadbalance.sgml
@@ -1209,14 +1209,13 @@ dml_adaptive_object_relationship_list = 'table_1:table_2'
</sect2>
<sect2 id="runtime-config-table-mutation-map">
- <title>Table Mutation Map Configuration (Lagless Replica Reads)</title>
+ <title>Table Mutation Tracking Configuration</title>
<para>
These parameters configure the track table mutation feature, which is activated by setting
<xref linkend="guc-disable-load-balance-on-write"> to <literal>dml_adaptive_global</literal>.
The feature tracks recently written tables to prevent stale reads from replica nodes during
- replication lag, implementing the "lagless" architecture pattern for distributed systems
- with read replicas.
+ replication lag.
</para>
<para>
@@ -1229,30 +1228,16 @@ dml_adaptive_object_relationship_list = 'table_1:table_2'
This feature requires time-based replication delay monitoring. This can be provided by either
<xref linkend="guc-replication-delay-source-cmd"> (external command mode) or by setting
<xref linkend="guc-delay-threshold-by-time"> (which uses <literal>pg_stat_replication.replay_lag</literal>
- from PostgreSQL 10+). At least one of these must be configured for the TTL calculation to work.
+ from PostgreSQL 10+). If neither is configured, the TTL remains at its default minimum value
+ (100 milliseconds) and is never updated based on actual replication delay, which may result
+ in suboptimal routing decisions.
</para>
<warning>
<para>
Enabling <literal>dml_adaptive_global</literal> increases shared memory consumption. With default settings,
- the feature requires approximately 6.4 MB of shared memory (0.1 MB for table tracking + 6.3 MB for query cache).
- Memory usage scales with configuration parameters:
- </para>
- <itemizedlist>
- <listitem>
- <para>
- Table tracking: <literal>track_table_mutation_table_size * 40 bytes</literal> (default: 2048 * 40 = ~80 KB)
- </para>
- </listitem>
- <listitem>
- <para>
- Query cache: <literal>track_table_mutation_query_parse_cache_size * 640 bytes</literal> (default: 10000 * 640 = ~6.3 MB)
- </para>
- </listitem>
- </itemizedlist>
- <para>
- For high-traffic systems with large cache sizes (e.g., <literal>track_table_mutation_query_parse_cache_size = 100000</literal>),
- memory usage can reach 64 MB or more. Consider your system's available shared memory when using <literal>dml_adaptive_global</literal>.
+ the feature requires approximately 80 KB of shared memory for table tracking:
+ <literal>track_table_mutation_table_size * 40 bytes</literal> (default: 2048 * 40 = ~80 KB).
</para>
</warning>
@@ -1364,43 +1349,6 @@ dml_adaptive_object_relationship_list = 'table_1:table_2'
</listitem>
</varlistentry>
- <varlistentry id="guc-track-table-mutation-query-buckets" xreflabel="track_table_mutation_query_buckets">
- <term><varname>track_table_mutation_query_buckets</varname> (<type>integer</type>)
- <indexterm>
- <primary><varname>track_table_mutation_query_buckets</varname> configuration parameter</primary>
- </indexterm>
- </term>
- <listitem>
- <para>
- Number of hash buckets for the query parse cache. The cache stores normalized
- query strings mapped to their table dependencies to avoid repeated parsing.
- </para>
- <para>
- Valid range: 64-65536. Default is <literal>2048</literal>.
- This parameter can only be set at server start.
- </para>
- </listitem>
- </varlistentry>
-
- <varlistentry id="guc-track-table-mutation-query-parse-cache-size" xreflabel="track_table_mutation_query_parse_cache_size">
- <term><varname>track_table_mutation_query_parse_cache_size</varname> (<type>integer</type>)
- <indexterm>
- <primary><varname>track_table_mutation_query_parse_cache_size</varname> configuration parameter</primary>
- </indexterm>
- </term>
- <listitem>
- <para>
- Maximum number of query parse results to cache. Uses LRU eviction when full.
- Larger caches reduce parsing overhead but consume more shared memory.
- </para>
- <para>
- Valid range: 100-1000000. Default is <literal>10000</literal>.
- Memory usage: approximately 640 bytes per entry (~6.3 MB for default, ~64 MB for 100000 entries).
- This parameter can only be set at server start.
- </para>
- </listitem>
- </varlistentry>
-
</variablelist>
<sect3 id="runtime-config-track-table-mutation-example">
@@ -1422,20 +1370,19 @@ replication_delay_source_timeout = 10
# Option B: Use pg_stat_replication replay_lag (PG 10+)
# delay_threshold_by_time = 1000
-# Adjust cache sizes based on workload (increases memory usage)
+# Adjust table map size based on workload
track_table_mutation_table_size = 4096
-track_table_mutation_query_parse_cache_size = 50000
</programlisting>
<para>
- Total shared memory required for above configuration: approximately 31.2 MB (31 MB query cache + 0.2 MB table map + overhead).
- Default configuration (10000 query cache entries, 2048 tables) requires approximately 6.4 MB.
+ Shared memory required for above configuration: approximately 160 KB for the table map.
+ Default configuration (2048 tables) requires approximately 80 KB.
</para>
</sect3>
<sect3 id="runtime-config-track-table-mutation-limitations">
<title>Limitations</title>
<para>
- The track table mutation feature has the following limitation:
+ The track table mutation feature has the following limitations:
</para>
<itemizedlist>
<listitem>
@@ -1444,6 +1391,13 @@ track_table_mutation_query_parse_cache_size = 50000
containing data modification is executed, the table mutation is not recorded.
</para>
</listitem>
+ <listitem>
+ <para>
+ A maximum of 128 tables can be tracked per SELECT query for staleness checking.
+ This limit is shared with the query cache subsystem
+ (<literal>POOL_MAX_SELECT_OIDS</literal>).
+ </para>
+ </listitem>
</itemizedlist>
<para>
If your application uses prepared statements and requires read-after-write consistency,
diff --git a/src/config/pool_config_variables.c b/src/config/pool_config_variables.c
index d5f4fb605..bbd65b176 100644
--- a/src/config/pool_config_variables.c
+++ b/src/config/pool_config_variables.c
@@ -2462,30 +2462,6 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
- {
- {"track_table_mutation_query_buckets",
- CFGCXT_INIT, LOAD_BALANCE_CONFIG,
- "Number of hash buckets for query parse cache.",
- CONFIG_VAR_TYPE_INT, false, 0
- },
- &g_pool_config.track_table_mutation_query_buckets,
- 2048,
- 64, 65536,
- NULL, NULL, NULL
- },
-
- {
- {"track_table_mutation_query_parse_cache_size",
- CFGCXT_INIT, LOAD_BALANCE_CONFIG,
- "Maximum number of entries in query parse cache.",
- CONFIG_VAR_TYPE_INT, false, 0
- },
- &g_pool_config.track_table_mutation_query_parse_cache_size,
- 10000,
- 100, 1000000,
- NULL, NULL, NULL
- },
-
/* End-of-list marker */
EMPTY_CONFIG_INT
};
diff --git a/src/context/pool_query_context.c b/src/context/pool_query_context.c
index 0190d3673..c20a3a420 100644
--- a/src/context/pool_query_context.c
+++ b/src/context/pool_query_context.c
@@ -1830,27 +1830,25 @@ static bool
is_select_object_in_temp_write_list(Node *node, void *context)
{
if (node == NULL ||
- !DLBOW_IS_DML_ADAPTIVE(
- pool_config->disable_load_balance_on_write))
+ !DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write))
return false;
if (IsA(node, RangeVar))
{
RangeVar *rgv = (RangeVar *) node;
POOL_SESSION_CONTEXT *session_context;
- bool is_adaptive;
session_context = pool_get_session_context(false);
- is_adaptive = DLBOW_IS_DML_ADAPTIVE(
- pool_config->disable_load_balance_on_write);
- if (is_adaptive &&
- session_context->is_in_transaction)
+ if (session_context->is_in_transaction)
{
ereport(DEBUG1,
- (errmsg("is_select_object_in_temp_write_list: \"%s\", found relation \"%s\"", (char *) context, rgv->relname)));
+ (errmsg("is_select_object_in_temp_write_list:"
+ " \"%s\", found relation \"%s\"",
+ (char *) context, rgv->relname)));
- return is_in_list(rgv->relname, session_context->transaction_temp_write_list);
+ return is_in_list(rgv->relname,
+ session_context->transaction_temp_write_list);
}
}
@@ -1891,8 +1889,9 @@ check_object_relationship_list(char *name, bool is_func_name)
{
bool is_adaptive;
- is_adaptive = DLBOW_IS_DML_ADAPTIVE(
- pool_config->disable_load_balance_on_write);
+ is_adaptive =
+ (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE);
if (is_adaptive &&
pool_config->parsed_dml_adaptive_object_relationship_list)
@@ -1902,8 +1901,8 @@ check_object_relationship_list(char *name, bool is_func_name)
if (session_context->is_in_transaction)
{
char *right_token =
- get_associated_object_from_dml_adaptive_relations
- (name, is_func_name ? OBJECT_TYPE_FUNCTION : OBJECT_TYPE_RELATION);
+ get_associated_object_from_dml_adaptive_relations
+ (name, is_func_name ? OBJECT_TYPE_FUNCTION : OBJECT_TYPE_RELATION);
if (right_token)
{
@@ -1989,9 +1988,9 @@ dml_adaptive(Node *node, char *query)
* transactions.
*/
int dlbow =
- pool_config->disable_load_balance_on_write;
+ pool_config->disable_load_balance_on_write;
List *wlist =
- session_context->transaction_temp_write_list;
+ session_context->transaction_temp_write_list;
if (dlbow == DLBOW_DML_ADAPTIVE_GLOBAL &&
is_commit_query(node) &&
@@ -2231,7 +2230,7 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
bool force_primary = false;
int lb_node;
POOL_QUERY_CONTEXT *qctx =
- session_context->query_context;
+ session_context->query_context;
if (pool_track_table_mutation_in_cold_start())
{
diff --git a/src/context/pool_session_context.c b/src/context/pool_session_context.c
index 05d0b635b..be30f1a7c 100644
--- a/src/context/pool_session_context.c
+++ b/src/context/pool_session_context.c
@@ -740,13 +740,15 @@ void
pool_set_writing_transaction(void)
{
/*
- * If disable_load_balance_on_write is 'off' or 'dml_adaptive' or
- * 'dml_adaptive_global', then never turn on writing transaction flag.
+ * If disable_load_balance_on_write is 'off' or 'dml_adaptive', then never
+ * turn on writing transaction flag. For dml_adaptive_global we do set it
+ * so that the query cache (memqcache) is properly skipped after DML
+ * within the same transaction.
*/
if (pool_config->disable_load_balance_on_write !=
DLBOW_OFF &&
- !DLBOW_IS_DML_ADAPTIVE(
- pool_config->disable_load_balance_on_write))
+ pool_config->disable_load_balance_on_write !=
+ DLBOW_DML_ADAPTIVE)
{
pool_get_session_context(false)->writing_transaction = true;
ereport(DEBUG5,
diff --git a/src/include/pool.h b/src/include/pool.h
index 0e901691a..79d7988fc 100644
--- a/src/include/pool.h
+++ b/src/include/pool.h
@@ -424,7 +424,7 @@ typedef enum
#define Min(x, y) ((x) < (y) ? (x) : (y))
-#define MAX_NUM_SEMAPHORES 10
+#define MAX_NUM_SEMAPHORES 9
#define CONN_COUNTER_SEM 0
#define REQUEST_INFO_SEM 1
#define QUERY_CACHE_STATS_SEM 2
@@ -435,7 +435,6 @@ typedef enum
#define MAIN_EXIT_HANDLER_SEM 7 /* used in exit_hander in pgpool main
* process */
#define TRACK_TABLE_MUTATION_TABLE_SEM 8
-#define TRACK_TABLE_MUTATION_QUERY_SEM 9
#define MAX_REQUEST_QUEUE_SIZE 10
#define MAX_SEC_WAIT_FOR_CLUSTER_TRANSACTION 10 /* time in seconds to keep
diff --git a/src/include/pool_config.h b/src/include/pool_config.h
index ae507dc60..b8abadd50 100644
--- a/src/include/pool_config.h
+++ b/src/include/pool_config.h
@@ -382,10 +382,6 @@ typedef struct
int track_table_mutation_table_buckets; /* hash buckets for table
* map */
int track_table_mutation_table_size; /* max table map entries */
- int track_table_mutation_query_buckets; /* hash buckets for query
- * cache */
- int track_table_mutation_query_parse_cache_size; /* max query cache
- * entries */
char *failover_command; /* execute command when failover happens */
char *follow_primary_command; /* execute command when failover is
diff --git a/src/include/utils/pool_track_table_mutation.h b/src/include/utils/pool_track_table_mutation.h
index 28dec1c8a..dfbac666d 100644
--- a/src/include/utils/pool_track_table_mutation.h
+++ b/src/include/utils/pool_track_table_mutation.h
@@ -26,17 +26,6 @@
#include "pool.h"
#include <sys/time.h>
-/*
- * Maximum table name length including schema: "schema"."table"
- * Using NAMEDATALEN * 2 + 4 for quotes and dot
- */
-#define TRACK_TABLE_MUTATION_TABLE_NAME_LEN (NAMEDATALEN * 2 + 4)
-
-/*
- * Maximum number of tables we track per query
- */
-#define TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY 8
-
/*
* Invalid index marker for linked lists
*/
@@ -77,41 +66,6 @@ typedef struct TrackTableMutationHashTable
*/
} TrackTableMutationHashTable;
-/*
- * Entry in the query parse cache
- */
-typedef struct QueryParseEntry
-{
- uint64 query_hash; /* Hash of normalized query */
- bool is_write; /* True if INSERT/UPDATE/DELETE */
- int num_tables; /* Number of tables in query */
- char table_names
- [ TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY]
- [ TRACK_TABLE_MUTATION_TABLE_NAME_LEN];
- int next; /* Next entry in collision chain */
- int lru_prev; /* Previous in LRU list */
- int lru_next; /* Next in LRU list */
- bool in_use; /* Is this entry in use? */
-} QueryParseEntry;
-
-/*
- * Header for the query parse cache in shared memory
- */
-typedef struct QueryParseCache
-{
- int num_buckets; /* Number of hash buckets */
- int max_entries; /* Maximum entries allowed */
- int num_entries; /* Current number of entries */
- int free_list_head; /* Head of free entry list */
- int lru_head; /* Most recently used */
- int lru_tail; /* Least recently used */
-
- /*
- * Flexible array members follow in shared memory: int
- * buckets[num_buckets]; QueryParseEntry entries[max_entries];
- */
-} QueryParseCache;
-
/*
* Global state for track table mutation feature
*/
@@ -134,7 +88,6 @@ typedef struct TrackTableMutationShmem
{
TrackTableMutationState state;
TrackTableMutationHashTable *table_map;
- QueryParseCache *query_cache;
} TrackTableMutationShmem;
/* ----------------
@@ -206,39 +159,6 @@ extern void pool_track_table_mutation_mark_table_written(
*/
extern void pool_track_table_mutation_update_ttl(uint64 delay_us);
-/*
- * Look up cached parse result for a query.
- * hash: hash of normalized query
- * is_write: output - true if query is a write
- * table_names: output - array to fill with table names
- * num_tables: output - number of tables found
- * Returns true if found in cache, false otherwise.
- */
-extern bool pool_track_table_mutation_get_cached_parse(
- uint64 hash, bool *is_write,
- char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
- int *num_tables);
-
-/*
- * Cache a parse result for a query.
- * hash: hash of normalized query
- * is_write: true if query is a write
- * table_names: array of table names
- * num_tables: number of tables
- */
-extern void pool_track_table_mutation_cache_parse(
- uint64 hash, bool is_write,
- const char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
- int num_tables);
-
-/*
- * Normalize a query and compute its hash.
- * Strips comments, normalizes whitespace and literals.
- * query: input SQL query string
- * Returns: 64-bit hash of normalized query
- */
-extern uint64 pool_track_table_mutation_normalize_and_hash(const char *query);
-
/*
* Calculate required shared memory size for track table mutation.
*/
diff --git a/src/sample/pgpool.conf.sample-stream b/src/sample/pgpool.conf.sample-stream
index 00132d534..ce9b92da0 100644
--- a/src/sample/pgpool.conf.sample-stream
+++ b/src/sample/pgpool.conf.sample-stream
@@ -509,8 +509,7 @@ backend_clustering_mode = streaming_replication
# - Track Table Mutation (used by dml_adaptive_global) -
# WARNING: dml_adaptive_global increases shared memory usage
- # Default settings require ~6.4 MB shared memory
- # (0.1 MB table tracking + 6.3 MB query cache)
+ # Default settings require ~80 KB shared memory for table tracking
#track_table_mutation_ttl_factor = 5.0
# TTL multiplier: TTL = replication_delay * factor
@@ -544,16 +543,6 @@ backend_clustering_mode = streaming_replication
# Range: 128-131072 (default: 2048)
# (change requires restart)
-#track_table_mutation_query_buckets = 2048
- # Number of hash buckets for query parse cache
- # Range: 64-65536 (default: 2048)
- # (change requires restart)
-
-#track_table_mutation_query_parse_cache_size = 10000
- # Maximum number of query parse results to cache
- # Range: 100-1000000 (default: 10000)
- # Memory usage: ~640 bytes per entry (~6.3 MB default, ~64 MB for 100000)
- # (change requires restart)
#------------------------------------------------------------------------------
# STREAMING REPLICATION MODE
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 0f1fa884c..467ec114c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -431,8 +431,6 @@ PublicationObjSpec
PublicationObjSpecType
PublicationTable
Query
-QueryParseCache
-QueryParseEntry
QuerySource
RELQTARGET_OPTION
RTEKind
diff --git a/src/utils/pool_track_table_mutation.c b/src/utils/pool_track_table_mutation.c
index 9be46b28f..e7771e7bf 100644
--- a/src/utils/pool_track_table_mutation.c
+++ b/src/utils/pool_track_table_mutation.c
@@ -76,16 +76,6 @@ static bool cold_start_initialized = false;
sizeof(TrackTableMutationHashTable) + \
(map)->num_buckets * sizeof(int)))
-/* Get pointer to bucket array in parse cache */
-#define PARSE_CACHE_BUCKETS(cache) \
- ((int *)((char *)(cache) + sizeof(QueryParseCache)))
-
-/* Get pointer to entry array in parse cache */
-#define PARSE_CACHE_ENTRIES(cache) \
- ((QueryParseEntry *)((char *)(cache) + \
- sizeof(QueryParseCache) + \
- (cache)->num_buckets * sizeof(int)))
-
/* ----------------
* Semaphore lock helpers
* ----------------
@@ -103,18 +93,6 @@ table_map_unlock(void)
pool_semaphore_unlock(TRACK_TABLE_MUTATION_TABLE_SEM);
}
-static inline void
-parse_cache_lock(void)
-{
- pool_semaphore_lock(TRACK_TABLE_MUTATION_QUERY_SEM);
-}
-
-static inline void
-parse_cache_unlock(void)
-{
- pool_semaphore_unlock(TRACK_TABLE_MUTATION_QUERY_SEM);
-}
-
/* ----------------
* Hash functions
* ----------------
@@ -144,25 +122,6 @@ fnv1a_hash_table_key(int table_oid, int dboid)
return hash;
}
-/*
- * FNV-1a hash for 64-bit value
- */
-static uint64
-fnv1a_hash_64(const char *str, size_t len)
-{
- /* FNV offset basis for 64-bit */
- uint64 hash = 14695981039346656037ULL;
- size_t i;
-
- for (i = 0; i < len; i++)
- {
- hash ^= (uint8) str[i];
- hash *= 1099511628211ULL; /* FNV prime */
- }
-
- return hash;
-}
-
/* ----------------
* Time utilities
* ----------------
@@ -514,334 +473,6 @@ table_map_cleanup_expired(
}
}
-/* ----------------
- * Parse cache operations
- * ----------------
- */
-
-/*
- * Initialize parse cache
- */
-static void
-parse_cache_init(QueryParseCache * cache,
- int num_buckets, int max_entries)
-{
- int *buckets;
- QueryParseEntry *entries;
- int i;
- int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
-
- cache->num_buckets = num_buckets;
- cache->max_entries = max_entries;
- cache->num_entries = 0;
- cache->free_list_head = 0;
- cache->lru_head = invalid;
- cache->lru_tail = invalid;
-
- buckets = PARSE_CACHE_BUCKETS(cache);
- entries = PARSE_CACHE_ENTRIES(cache);
-
- /* Initialize all buckets to empty */
- for (i = 0; i < num_buckets; i++)
- buckets[i] = invalid;
-
- /* Initialize free list */
- for (i = 0; i < max_entries; i++)
- {
- entries[i].in_use = false;
- entries[i].next = (i < max_entries - 1) ?
- i + 1 : invalid;
- entries[i].lru_prev = invalid;
- entries[i].lru_next = invalid;
- }
-
- ereport(DEBUG1,
- (errmsg("track_table_mutation: "
- "parse cache init %d buckets, "
- "%d max entries",
- num_buckets, max_entries)));
-}
-
-/*
- * Move entry to front of LRU list (most recently used)
- */
-static void
-parse_cache_lru_touch(QueryParseCache * cache, int idx)
-{
- QueryParseEntry *entries = PARSE_CACHE_ENTRIES(cache);
- int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
-
- /* Already at head? */
- if (cache->lru_head == idx)
- return;
-
- /* Remove from current position */
- if (entries[idx].lru_prev != invalid)
- entries[entries[idx].lru_prev].lru_next =
- entries[idx].lru_next;
- if (entries[idx].lru_next != invalid)
- entries[entries[idx].lru_next].lru_prev =
- entries[idx].lru_prev;
- if (cache->lru_tail == idx)
- cache->lru_tail = entries[idx].lru_prev;
-
- /* Insert at head */
- entries[idx].lru_prev = invalid;
- entries[idx].lru_next = cache->lru_head;
- if (cache->lru_head != invalid)
- entries[cache->lru_head].lru_prev = idx;
- cache->lru_head = idx;
- if (cache->lru_tail == invalid)
- cache->lru_tail = idx;
-}
-
-/*
- * Add entry to LRU list (at head)
- */
-static void
-parse_cache_lru_add(QueryParseCache * cache, int idx)
-{
- QueryParseEntry *entries = PARSE_CACHE_ENTRIES(cache);
- int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
-
- entries[idx].lru_prev = invalid;
- entries[idx].lru_next = cache->lru_head;
-
- if (cache->lru_head != invalid)
- entries[cache->lru_head].lru_prev = idx;
-
- cache->lru_head = idx;
-
- if (cache->lru_tail == invalid)
- cache->lru_tail = idx;
-}
-
-/*
- * Remove entry from LRU list
- */
-static void
-parse_cache_lru_remove(QueryParseCache * cache, int idx)
-{
- QueryParseEntry *entries = PARSE_CACHE_ENTRIES(cache);
- int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
-
- if (entries[idx].lru_prev != invalid)
- entries[entries[idx].lru_prev].lru_next =
- entries[idx].lru_next;
- else
- cache->lru_head = entries[idx].lru_next;
-
- if (entries[idx].lru_next != invalid)
- entries[entries[idx].lru_next].lru_prev =
- entries[idx].lru_prev;
- else
- cache->lru_tail = entries[idx].lru_prev;
-
- entries[idx].lru_prev = invalid;
- entries[idx].lru_next = invalid;
-}
-
-/*
- * Allocate entry from free list, evicting LRU if needed
- */
-static int
-parse_cache_alloc_entry(QueryParseCache * cache)
-{
- QueryParseEntry *entries = PARSE_CACHE_ENTRIES(cache);
- int *buckets = PARSE_CACHE_BUCKETS(cache);
- int idx;
- int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
-
- if (cache->free_list_head != invalid)
- {
- idx = cache->free_list_head;
- cache->free_list_head = entries[idx].next;
- entries[idx].in_use = true;
- entries[idx].next = invalid;
- cache->num_entries++;
- return idx;
- }
-
- /* No free entries - evict LRU */
- if (cache->lru_tail == invalid)
- return invalid;
-
- idx = cache->lru_tail;
-
- /* Remove from hash bucket */
- {
- int bucket;
- int *prev_ptr;
- int curr;
-
- bucket = entries[idx].query_hash %
- cache->num_buckets;
- prev_ptr = &buckets[bucket];
- curr = buckets[bucket];
-
- while (curr != invalid)
- {
- if (curr == idx)
- {
- *prev_ptr = entries[curr].next;
- break;
- }
- prev_ptr = &entries[curr].next;
- curr = entries[curr].next;
- }
- }
-
- /* Remove from LRU list */
- parse_cache_lru_remove(cache, idx);
-
- /* Reinitialize entry */
- entries[idx].in_use = true;
- entries[idx].next = invalid;
-
- return idx;
-}
-
-/*
- * Look up a query in the parse cache
- */
-static int
-parse_cache_lookup(QueryParseCache * cache, uint64 hash)
-{
- int *buckets = PARSE_CACHE_BUCKETS(cache);
- QueryParseEntry *entries = PARSE_CACHE_ENTRIES(cache);
- int bucket = hash % cache->num_buckets;
- int idx = buckets[bucket];
- int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
-
- while (idx != invalid)
- {
- if (entries[idx].query_hash == hash)
- return idx;
- idx = entries[idx].next;
- }
-
- return invalid;
-}
-
-/* ----------------
- * Query normalization
- * ----------------
- */
-
-/*
- * Simple query normalization:
- * - Strip comments (-- and C-style block comments)
- * - Collapse whitespace
- * - Convert to lowercase (except inside strings)
- * - Replace literal values with placeholders
- */
-static size_t
-normalize_query(const char *query, char *output,
- size_t output_size)
-{
- const char *src = query;
- char *dst = output;
- char *dst_end = output + output_size - 1;
- bool in_string = false;
- char string_char = 0;
- bool last_was_space = true;
-
- while (*src && dst < dst_end)
- {
- /* Handle string literals */
- if (in_string)
- {
- if (*src == string_char)
- {
- if (*(src + 1) == string_char)
- {
- /* Escaped quote */
- src += 2;
- continue;
- }
- in_string = false;
- /* Replace string with placeholder */
- *dst++ = '$';
- }
- src++;
- continue;
- }
-
- /* Check for string start */
- if (*src == '\'' || *src == '"')
- {
- in_string = true;
- string_char = *src;
- src++;
- continue;
- }
-
- /* Handle single-line comments */
- if (*src == '-' && *(src + 1) == '-')
- {
- while (*src && *src != '\n')
- src++;
- continue;
- }
-
- /* Handle multi-line comments */
- if (*src == '/' && *(src + 1) == '*')
- {
- src += 2;
- while (*src &&
- !(*src == '*' && *(src + 1) == '/'))
- src++;
- if (*src)
- src += 2;
- continue;
- }
-
- /* Handle whitespace */
- if (*src == ' ' || *src == '\t' ||
- *src == '\n' || *src == '\r')
- {
- if (!last_was_space)
- {
- *dst++ = ' ';
- last_was_space = true;
- }
- src++;
- continue;
- }
-
- /* Handle numbers - replace with placeholder */
- if ((*src >= '0' && *src <= '9') ||
- (*src == '.' && *(src + 1) >= '0' &&
- *(src + 1) <= '9'))
- {
- while (*src &&
- ((*src >= '0' && *src <= '9') ||
- *src == '.'))
- src++;
- if (!last_was_space &&
- dst > output && *(dst - 1) != '$')
- *dst++ = '$';
- last_was_space = false;
- continue;
- }
-
- /* Regular character - convert to lowercase */
- if (*src >= 'A' && *src <= 'Z')
- *dst++ = *src + 32;
- else
- *dst++ = *src;
-
- last_was_space = false;
- src++;
- }
-
- /* Remove trailing space */
- if (dst > output && *(dst - 1) == ' ')
- dst--;
-
- *dst = '\0';
- return dst - output;
-}
/* ----------------
* Public API implementation
@@ -858,13 +489,9 @@ pool_track_table_mutation_shmem_size(void)
Size size = 0;
int tbl_bkt;
int tbl_sz;
- int qry_bkt;
- int qry_sz;
tbl_bkt = pool_config->track_table_mutation_table_buckets;
tbl_sz = pool_config->track_table_mutation_table_size;
- qry_bkt = pool_config->track_table_mutation_query_buckets;
- qry_sz = pool_config->track_table_mutation_query_parse_cache_size;
/* Main structure */
size += sizeof(TrackTableMutationShmem);
@@ -874,11 +501,6 @@ pool_track_table_mutation_shmem_size(void)
size += tbl_bkt * sizeof(int);
size += tbl_sz * sizeof(TrackTableMutationEntry);
- /* Parse cache */
- size += sizeof(QueryParseCache);
- size += qry_bkt * sizeof(int);
- size += qry_sz * sizeof(QueryParseEntry);
-
return size;
}
@@ -897,8 +519,6 @@ pool_track_table_mutation_init(void)
TrackTableMutationState *st;
int tbl_bkt;
int tbl_sz;
- int qry_bkt;
- int qry_sz;
if (pool_config->disable_load_balance_on_write !=
DLBOW_DML_ADAPTIVE_GLOBAL)
@@ -911,8 +531,6 @@ pool_track_table_mutation_init(void)
tbl_bkt = pool_config->track_table_mutation_table_buckets;
tbl_sz = pool_config->track_table_mutation_table_size;
- qry_bkt = pool_config->track_table_mutation_query_buckets;
- qry_sz = pool_config->track_table_mutation_query_parse_cache_size;
shmem_size = pool_track_table_mutation_shmem_size();
@@ -938,22 +556,12 @@ pool_track_table_mutation_init(void)
track_table_mutation_shmem->table_map =
(TrackTableMutationHashTable *) shmem_ptr;
- shmem_ptr += sizeof(TrackTableMutationHashTable);
- shmem_ptr += tbl_bkt * sizeof(int);
- shmem_ptr += tbl_sz * sizeof(TrackTableMutationEntry);
- track_table_mutation_shmem->query_cache =
- (QueryParseCache *) shmem_ptr;
-
- /* Initialize structures */
+ /* Initialize table map */
table_map_init(
track_table_mutation_shmem->table_map,
tbl_bkt, tbl_sz);
- parse_cache_init(
- track_table_mutation_shmem->query_cache,
- qry_bkt, qry_sz);
-
/* Initialize global state */
st = &track_table_mutation_shmem->state;
st->initialized = true;
@@ -1292,159 +900,3 @@ pool_track_table_mutation_update_ttl(uint64 delay_us)
(unsigned long) delay_us,
factor)));
}
-
-/*
- * Look up a cached parse result by query hash.
- * Returns true and fills output parameters if
- * the query was found in the parse cache.
- */
-bool
-pool_track_table_mutation_get_cached_parse(
- uint64 hash, bool *is_write,
- char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
- int *num_tables)
-{
- QueryParseCache *cache;
- int idx;
- bool found = false;
- int max_tables;
-
- if (TRACK_TABLE_MUTATION_DISABLED())
- return false;
-
- max_tables = TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY;
- cache = track_table_mutation_shmem->query_cache;
-
- parse_cache_lock();
-
- idx = parse_cache_lookup(cache, hash);
- if (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
- {
- QueryParseEntry *entries;
- int i;
- int namelen;
-
- entries = PARSE_CACHE_ENTRIES(cache);
- namelen = TRACK_TABLE_MUTATION_TABLE_NAME_LEN;
- *is_write = entries[idx].is_write;
- *num_tables = entries[idx].num_tables;
-
- for (i = 0;
- i < entries[idx].num_tables &&
- i < max_tables;
- i++)
- {
- strlcpy(table_names[i],
- entries[idx].table_names[i],
- namelen);
- }
-
- /* Move to front of LRU */
- parse_cache_lru_touch(cache, idx);
- found = true;
- }
-
- parse_cache_unlock();
-
- return found;
-}
-
-/*
- * Store a parse result in the shared cache.
- * Evicts the LRU entry if the cache is full.
- */
-void
-pool_track_table_mutation_cache_parse(
- uint64 hash, bool is_write,
- const char table_names[][TRACK_TABLE_MUTATION_TABLE_NAME_LEN],
- int num_tables)
-{
- QueryParseCache *cache;
- int *buckets;
- QueryParseEntry *entries;
- int idx;
- int bucket;
- int max_tables;
- int namelen;
-
- if (TRACK_TABLE_MUTATION_DISABLED())
- return;
-
- max_tables = TRACK_TABLE_MUTATION_MAX_TABLES_PER_QUERY;
- namelen = TRACK_TABLE_MUTATION_TABLE_NAME_LEN;
- cache = track_table_mutation_shmem->query_cache;
-
- parse_cache_lock();
-
- /* Check if already exists */
- idx = parse_cache_lookup(cache, hash);
- if (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
- {
- parse_cache_unlock();
- return;
- }
-
- /* Allocate new entry (may evict LRU) */
- idx = parse_cache_alloc_entry(cache);
- if (idx == TRACK_TABLE_MUTATION_INVALID_INDEX)
- {
- parse_cache_unlock();
- ereport(WARNING,
- (errmsg("track_table_mutation: "
- "parse cache alloc failed")));
- return;
- }
-
- entries = PARSE_CACHE_ENTRIES(cache);
- buckets = PARSE_CACHE_BUCKETS(cache);
-
- /* Fill in entry */
- entries[idx].query_hash = hash;
- entries[idx].is_write = is_write;
- entries[idx].num_tables =
- (num_tables > max_tables) ?
- max_tables : num_tables;
-
- {
- int i;
-
- for (i = 0; i < entries[idx].num_tables; i++)
- {
- strlcpy(entries[idx].table_names[i],
- table_names[i], namelen);
- }
- }
-
- /* Insert into hash bucket */
- bucket = hash % cache->num_buckets;
- entries[idx].next = buckets[bucket];
- buckets[bucket] = idx;
-
- /* Add to LRU list */
- parse_cache_lru_add(cache, idx);
-
- parse_cache_unlock();
-}
-
-/*
- * Normalize a SQL query and compute its 64-bit hash.
- * Strips comments, collapses whitespace, lowercases,
- * and replaces literals with placeholders.
- */
-uint64
-pool_track_table_mutation_normalize_and_hash(
- const char *query)
-{
- char normalized[8192];
- size_t len;
-
- if (query == NULL || query[0] == '\0')
- return 0;
-
- len = normalize_query(query, normalized,
- sizeof(normalized));
- if (len == 0)
- return 0;
-
- return fnv1a_hash_64(normalized, len);
-}
--
2.53.0
* Re: Proposal: Recent mutated table tracking in memory
From: Tatsuo Ishii @ 2026-04-19 07:24 UTC (permalink / raw)
To: [email protected]; +Cc: [email protected]
> Hi Tatsuo,
>
> Thank you for the detailed review. The attached patch addresses all items.
I guess the attached patch is on top of
v1-0001-Feature-load-balancing-control-by-table-tracking.patch. To
apply v2-0001-address-review.patch, we need to apply
v1-0001-Feature-load-balancing-control-by-table-tracking.patch first.
Unfortunately, due to a recent commit, it no longer applies. Can you
please provide v1 + v2 rebased against the latest master branch?
Also, regression test number 042 is already used by a recent commit.
Can you renumber 042.track_table_mutation and
043.track_table_mutation_watchdog to 043.track_table_mutation and
044.track_table_mutation_watchdog accordingly?
Looking forward to seeing the new patch.
Regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
> memqcache bug fix
> -----------------
>
> Good catch. The root cause: pool_set_writing_transaction() was
> explicitly skipping dml_adaptive_global, so
> pool_is_writing_transaction() always returned false in this mode.
> The query cache fetch guard at pool_proto_modules.c:270
> (!pool_is_writing_transaction()) then served stale cached results
> after DML in the same transaction.
>
> Fix: pool_set_writing_transaction() now sets the flag for
> dml_adaptive_global (only 'off' and 'dml_adaptive' skip it). This
> ensures the query cache is properly bypassed after writes within
> the same transaction.
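
As an illustration of the fix described above, here is a minimal C sketch of
the flag decision before and after the change. The enum and the sketch_*
function names are hypothetical stand-ins, not pgpool-II's actual definitions
(the real code tests pool_config->disable_load_balance_on_write against the
DLBOW_* constants); SKETCH_DLBOW_OTHER stands for any non-adaptive mode other
than 'off'.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the disable_load_balance_on_write modes. */
typedef enum
{
	SKETCH_DLBOW_OFF,
	SKETCH_DLBOW_OTHER,				/* any other non-adaptive mode */
	SKETCH_DLBOW_DML_ADAPTIVE,
	SKETCH_DLBOW_DML_ADAPTIVE_GLOBAL
} SketchDlbowMode;

/* Before the fix: both adaptive modes skipped the writing-transaction flag. */
bool
sketch_sets_writing_flag_before(SketchDlbowMode mode)
{
	return mode != SKETCH_DLBOW_OFF &&
		mode != SKETCH_DLBOW_DML_ADAPTIVE &&
		mode != SKETCH_DLBOW_DML_ADAPTIVE_GLOBAL;
}

/* After the fix: only 'off' and 'dml_adaptive' skip the flag. */
bool
sketch_sets_writing_flag_after(SketchDlbowMode mode)
{
	return mode != SKETCH_DLBOW_OFF &&
		mode != SKETCH_DLBOW_DML_ADAPTIVE;
}

/* Cache fetch guard: a cached result may be served only if no write
 * has happened in the current transaction. */
bool
sketch_may_fetch_from_query_cache(bool writing_transaction)
{
	return !writing_transaction;
}
```

With the "before" predicate, a write under dml_adaptive_global leaves the flag
unset, so the guard still allows a cache fetch; with the "after" predicate the
flag is set and the guard blocks the fetch.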
>
> Removed dead query parse cache code (~700 lines)
> -------------------------------------------------
>
> You're right -- pool_track_table_mutation_get_cached_parse,
> pool_track_table_mutation_cache_parse, and
> pool_track_table_mutation_normalize_and_hash were never called.
> These were leftover from an earlier design where we planned to
> cache SQL parse results in shared memory. The feature ended up
> using pgpool's existing parser directly, and this code was never
> wired up.
>
> Removed: QueryParseCache and QueryParseEntry structs, all related
> static functions, the TRACK_TABLE_MUTATION_QUERY_SEM semaphore,
> and the track_table_mutation_query_buckets /
> track_table_mutation_query_parse_cache_size configuration
> parameters. This also reduces shared memory usage from ~6.4 MB
> to ~80 KB with default settings.
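
The memory figures above can be reproduced with a rough back-of-the-envelope
calculation. The sketch below assumes the per-entry sizes quoted in the
documentation under review (40 bytes per table entry, 640 bytes per parse
cache entry); these are not measured sizeof() values, and the sketch ignores
the struct headers and bucket arrays. All names are hypothetical.

```c
#include <stddef.h>

/* Per-entry sizes as quoted in the thread, not measured with sizeof(). */
#define SKETCH_TABLE_ENTRY_BYTES 40
#define SKETCH_PARSE_ENTRY_BYTES 640

/* Approximate bytes used by the table map entries. */
size_t
sketch_table_tracking_bytes(size_t table_size)
{
	return table_size * SKETCH_TABLE_ENTRY_BYTES;
}

/* Approximate bytes the removed query parse cache entries would have used. */
size_t
sketch_parse_cache_bytes(size_t cache_size)
{
	return cache_size * SKETCH_PARSE_ENTRY_BYTES;
}
```

With the defaults, 2048 table entries come to 81920 bytes (80 KiB), and the
10000 parse cache entries would have come to 6400000 bytes, which is where
nearly all of the earlier ~6.4 MB estimate came from.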
>
> check_object_relationship_list scope
> -------------------------------------
>
> You're correct -- dml_adaptive_global does not use
> dml_adaptive_object_relationship_list. Changed
> check_object_relationship_list() to check for DLBOW_DML_ADAPTIVE
> only, not DLBOW_IS_DML_ADAPTIVE (which includes global).
>
> Documentation fixes
> -------------------
>
> - Removed "(Lagless Replica Reads)" from section title and
> "lagless" language from description.
>
> - Described fallback behavior when neither
> replication_delay_source_cmd nor delay_threshold_by_time is
> configured (TTL stays at 100ms default minimum).
>
> - "query cache" references removed (the query parse cache is gone).
>
> - Added 128-table-per-SELECT limit to Limitations section
> (uses POOL_MAX_SELECT_OIDS).
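
The TTL fallback mentioned in the list above can be sketched as follows. This
assumes the rule stated in the sample configuration (TTL = replication_delay *
factor) combined with a 100 ms floor; the function name, the constant name and
the exact clamping are assumptions for illustration, not the patch's actual
code in pool_track_table_mutation_update_ttl().

```c
#include <stdint.h>

/* 100 ms floor in microseconds, per the fallback described in the thread. */
#define SKETCH_MIN_TTL_US 100000ULL

/*
 * Compute a TTL in microseconds from the measured replication delay.
 * When no delay source is configured the delay stays at 0, so the
 * result stays at the floor.
 */
uint64_t
sketch_compute_ttl_us(uint64_t delay_us, double factor)
{
	uint64_t	ttl_us = (uint64_t) ((double) delay_us * factor);

	if (ttl_us < SKETCH_MIN_TTL_US)
		ttl_us = SKETCH_MIN_TTL_US;
	return ttl_us;
}
```

For example, with the default factor of 5.0, a measured delay of 50 ms yields
a 250 ms TTL, while a delay of 0 (no delay source configured) or 10 ms yields
the 100 ms floor.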
>
> Code style fixes
> ----------------
>
> - DLBOW_IS_DML_ADAPTIVE() calls no longer split across lines.
>
> - Split the long errmsg line in
> is_select_object_in_temp_write_list.
>
> - Removed redundant is_adaptive variable in
> is_select_object_in_temp_write_list (the check at function
> entry already guarantees it).
>
> Thanks!
>
> On Wed, Apr 15, 2026 at 1:43 AM Tatsuo Ishii <[email protected]> wrote:
>
>> Hi Nadav,
>>
>> > Hi Tatsuo,
>> >
>> > Looks good to me thanks!
>> >
>> > Please go ahead with your review. Waiting to hear back from you.
>>
>> Here are the code review results.
>>
>> diff --git a/doc/src/sgml/loadbalance.sgml b/doc/src/sgml/loadbalance.sgml
>> index 9e1e7b39b..7384ce81a 100644
>> --- a/doc/src/sgml/loadbalance.sgml
>> +++ b/doc/src/sgml/loadbalance.sgml
>> :
>> + <sect2 id="runtime-config-table-mutation-map">
>> + <title>Table Mutation Map Configuration (Lagless Replica Reads)</title>
>>
>> "(Lagless Replica Reads)" sounds like an advertisement to me. It
>> should be removed.
>>
>> + <para>
>> + These parameters configure the track table mutation feature, which is
>> activated by setting
>> + <xref linkend="guc-disable-load-balance-on-write"> to
>> <literal>dml_adaptive_global</literal>.
>> + The feature tracks recently written tables to prevent stale reads from
>> replica nodes during
>> + replication lag, implementing the "lagless" architecture pattern for
>> distributed systems
>> + with read replicas.
>>
>> I think the feature does not guarantee "lagless" reads at all times and in all cases.
>>
>> + <para>
>> + This feature requires time-based replication delay monitoring. This
>> can be provided by either
>> + <xref linkend="guc-replication-delay-source-cmd"> (external command
>> mode) or by setting
>> + <xref linkend="guc-delay-threshold-by-time"> (which uses
>> <literal>pg_stat_replication.replay_lag</literal>
>> + from PostgreSQL 10+). At least one of these must be configured for the
>> TTL calculation to work.
>>
>> If neither of these is set, what happens? An error? This needs to be described.
>>
>> + </para>
>> +
>> + <warning>
>> + <para>
>> + Enabling <literal>dml_adaptive_global</literal> increases shared
>> memory consumption. With default settings,
>> + the feature requires approximately 6.4 MB of shared memory (0.1 MB
>> for table tracking + 6.3 MB for query cache).
>>
>> "query cache" should be "query parse cache".
>>
>> + Memory usage scales with configuration parameters:
>> + </para>
>> + <itemizedlist>
>> + <listitem>
>> + <para>
>> + Table tracking: <literal>track_table_mutation_table_size * 40
>> bytes</literal> (default: 2048 * 40 = ~80 KB)
>> + </para>
>> + </listitem>
>> + <listitem>
>> + <para>
>> + Query cache: <literal>track_table_mutation_query_parse_cache_size *
>> 640 bytes</literal> (default: 10000 * 640 = ~6.3 MB)
>>
>> "query cache" should be "query parse cache".
>>
>> + <title>Limitations</title>
>>
>> I think the number of tables tracked in a SELECT is limited to 8. It should
>> be mentioned.
>>
>> diff --git a/src/context/pool_query_context.c
>> b/src/context/pool_query_context.c
>> index a056ac596..0190d3673 100644
>> --- a/src/context/pool_query_context.c
>> +++ b/src/context/pool_query_context.c
>> @@ -1828,15 +1829,23 @@ is_in_list(char *name, List *list)
>> static bool
>> is_select_object_in_temp_write_list(Node *node, void *context)
>> {
>> - if (node == NULL || pool_config->disable_load_balance_on_write !=
>> DLBOW_DML_ADAPTIVE)
>> + if (node == NULL ||
>> + !DLBOW_IS_DML_ADAPTIVE(
>> +
>> pool_config->disable_load_balance_on_write))
>>
>> You don't need to split the line.
>>
>> + is_adaptive = DLBOW_IS_DML_ADAPTIVE(
>> +
>> pool_config->disable_load_balance_on_write);
>>
>> You don't need to split the line.
>>
>> - if (pool_config->disable_load_balance_on_write ==
>> DLBOW_DML_ADAPTIVE && session_context->is_in_transaction)
>> + if (is_adaptive &&
>> + session_context->is_in_transaction)
>> {
>> ereport(DEBUG1,
>>
>> (errmsg("is_select_object_in_temp_write_list: \"%s\", found relation
>> \"%s\"", (char *) context, rgv->relname)));
>> This line is too long. Please split.
>>
>> @@ -1880,7 +1889,13 @@ static char
>> *get_associated_object_from_dml_adaptive_relations
>> void
>> check_object_relationship_list(char *name, bool is_func_name)
>> {
>> - if (pool_config->disable_load_balance_on_write ==
>> DLBOW_DML_ADAPTIVE &&
>> pool_config->parsed_dml_adaptive_object_relationship_list)
>> + bool is_adaptive;
>> +
>> + is_adaptive = DLBOW_IS_DML_ADAPTIVE(
>> +
>> pool_config->disable_load_balance_on_write);
>>
>> I wrote in the commit message:
>>
>> modifications are only detected in the same transaction). Note,
>> however, you cannot use dml_adaptive_object_relationship_list to track
>> dependency among table and other objects.
>>
>> In my understanding the feature does not use
>> dml_adaptive_object_relationship_list. If this is correct, why
>> check_object_relationship_list() is called here in case
>> dml_adaptive_global? If the feature uses
>> dml_adaptive_object_relationship_list, test cases should be included.
>>
>> diff --git a/src/utils/pool_track_table_mutation.c
>> b/src/utils/pool_track_table_mutation.c
>> new file mode 100644
>> index 000000000..9be46b28f
>> --- /dev/null
>> +++ b/src/utils/pool_track_table_mutation.c
>>
>> It seems following functions are not used anywhere. I wonder if this
>> feature actually use "query parse cache".
>>
>> pool_track_table_mutation_get_cached_parse
>> pool_track_table_mutation_cache_parse
>> pool_track_table_mutation_normalize_and_hash
>>
>> Besides the code review, I modified one of the regression tests to check
>> whether the feature coexists with the existing in-memory query cache
>> feature. After applying the attached patch, I ran 006.memqcache and got the
>> following result.
>>
>> cd src/test/regression
>> ./regress.sh 006
>> creating pgpool-II temporary installation ...
>> moving pgpool_setup to temporary installation path ...
>> moving watchdog_setup to temporary installation path ...
>> using pgpool-II at
>> /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/temp/installed
>> *************************
>> REGRESSION MODE : install
>> Pgpool-II version : pgpool-II version 4.8devel (mitsukakeboshi)
>> Pgpool-II install path :
>> /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/temp/installed
>> PostgreSQL bin : /usr/local/pgsql/bin
>> PostgreSQL Major version : 18
>> pgbench : /usr/local/pgsql/bin/pgbench
>> PostgreSQL jdbc :
>> /usr/local/pgsql/share/postgresql-9.2-1003.jdbc4.jar
>> *************************
>> testing 006.memqcache...failed.
>> out of 1 ok:0 failed:1 timeout:0
>>
>> log/006.memqcache shows:
>>
>> ../expected.txt result.txt differ: char 1, line 1
>>
>> So I checked the test script and found the error was generated by a
>> Java program test.
>>
>> java jdbctest > result.txt 2>&1
>> cmp ../expected.txt result.txt
>> if [ $? != 0 ];then
>> ./shutdownall
>> exit 1
>> fi
>>
>> In jdbctest.java:
>>
>> /*
>> * Cache test in an explicit transaction
>> */
>> conn.setAutoCommit(false);
>> // execute DML. This should prevent SELECTs from using
>> query cache in the transaction.
>> sql = "UPDATE t1 SET i = 2;";
>> pst = conn.createStatement();
>> pst.executeUpdate(sql);
>> pst.close();
>> // should not use the cache and should return "2", rather
>> than "1"
>> prest = conn.prepareStatement("SELECT * FROM t1");
>> rs = prest.executeQuery();
>>
>> The expected file (expected.txt) has "2" but the result file
>> (testdir/result.txt) was "1". This is the reason why the test
>> failed. I wonder if there's something wrong with the feature when the
>> query cache is enabled. Can you look into this?
>>
>> Regards,
>> --
>> Tatsuo Ishii
>> SRA OSS K.K.
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese:http://www.sraoss.co.jp
>>
>
>
> --
> Nadav Shatz
> Tailor Brands | CTO
* Re: Proposal: Recent mutated table tracking in memory
From: Nadav Shatz @ 2026-04-19 14:29 UTC (permalink / raw)
To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]
Hi Tatsuo,
Rebased onto current master, renumbered the regression tests
(043/044 to avoid collision with the new 042.ssl_reload), and
combined everything into a single commit.
Attached: v2-0001-Feature-load-balancing-control-by-table-tracking.patch
Looking forward to your review.
On Sun, Apr 19, 2026 at 10:25 AM Tatsuo Ishii <[email protected]> wrote:
> > Hi Tatsuo,
> >
> > Thank you for the detailed review. Attached patch addresses all items.
>
> I guess the attached patch is on top of
> v1-0001-Feature-load-balancing-control-by-table-tracking.patch. To
> apply v2-0001-address-review.patch, we need to apply
> v1-0001-Feature-load-balancing-control-by-table-tracking.patch first.
> Unfortunately due to recent commit, it does not apply anymore. Can you
> please provide v1 + v2 that are rebased against latest master branch?
> Also 042 regression test is already used by recent commit. Can you
> renumber 042.track_table_mutation and
> 043.track_table_mutation_watchdog to 043.track_table_mutation and
> 044.track_table_mutation_watchdog accordingly?
>
> Looking forward to seeing new patch.
>
> Regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>
>
> > memqcache bug fix
> > -----------------
> >
> > Good catch. The root cause: pool_set_writing_transaction() was
> > explicitly skipping dml_adaptive_global, so
> > pool_is_writing_transaction() always returned false in this mode.
> > The query cache fetch guard at pool_proto_modules.c:270
> > (!pool_is_writing_transaction()) then served stale cached results
> > after DML in the same transaction.
> >
> > Fix: pool_set_writing_transaction() now sets the flag for
> > dml_adaptive_global (only 'off' and 'dml_adaptive' skip it). This
> > ensures the query cache is properly bypassed after writes within
> > the same transaction.
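
For readers following the thread, the flag rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the enum, its member names, and both helper functions are hypothetical stand-ins and are not the identifiers used in the patch. Only the rule itself (the modes 'off' and 'dml_adaptive' skip the flag, and the cache fetch is guarded by the negated flag) is taken from the description.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the disable_load_balance_on_write modes.
 * The real enum in pool_config.h may differ in names and order. */
typedef enum
{
	SKETCH_DLBOW_OFF,
	SKETCH_DLBOW_TRANSACTION,
	SKETCH_DLBOW_DML_ADAPTIVE,
	SKETCH_DLBOW_DML_ADAPTIVE_GLOBAL
} SketchDlbowMode;

/* After the fix: only 'off' and 'dml_adaptive' skip setting the
 * writing-transaction flag; dml_adaptive_global now sets it. */
static bool
sketch_should_set_writing_transaction(SketchDlbowMode mode)
{
	return mode != SKETCH_DLBOW_OFF && mode != SKETCH_DLBOW_DML_ADAPTIVE;
}

/* Query cache fetch guard: the cache may be consulted only while no
 * write has been seen in the current transaction. */
static bool
sketch_may_fetch_from_query_cache(bool writing_transaction)
{
	return !writing_transaction;
}
```

With this rule, a DML statement in dml_adaptive_global mode sets the flag, so a later SELECT in the same transaction fails the guard and is not answered from the cache.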
> >
> > Removed dead query parse cache code (~700 lines)
> > -------------------------------------------------
> >
> > You're right -- pool_track_table_mutation_get_cached_parse,
> > pool_track_table_mutation_cache_parse, and
> > pool_track_table_mutation_normalize_and_hash were never called.
> > These were leftover from an earlier design where we planned to
> > cache SQL parse results in shared memory. The feature ended up
> > using pgpool's existing parser directly, and this code was never
> > wired up.
> >
> > Removed: QueryParseCache and QueryParseEntry structs, all related
> > static functions, the TRACK_TABLE_MUTATION_QUERY_SEM semaphore,
> > and the track_table_mutation_query_buckets /
> > track_table_mutation_query_parse_cache_size configuration
> > parameters. This also reduces shared memory usage from ~6.4 MB
> > to ~80 KB with default settings.
> >
> > check_object_relationship_list scope
> > -------------------------------------
> >
> > You're correct -- dml_adaptive_global does not use
> > dml_adaptive_object_relationship_list. Changed
> > check_object_relationship_list() to check for DLBOW_DML_ADAPTIVE
> > only, not DLBOW_IS_DML_ADAPTIVE (which includes global).
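
The difference between the two checks can be illustrated with a small sketch. It assumes that DLBOW_IS_DML_ADAPTIVE simply tests for either adaptive mode; the macro body is not shown in this message, so that is a guess. All names below are hypothetical stand-ins.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the relevant modes. */
typedef enum
{
	REL_SKETCH_OFF,
	REL_SKETCH_DML_ADAPTIVE,
	REL_SKETCH_DML_ADAPTIVE_GLOBAL
} RelSketchMode;

/* Assumed shape of the "is any adaptive mode" test (includes global). */
#define REL_SKETCH_IS_DML_ADAPTIVE(mode) \
	((mode) == REL_SKETCH_DML_ADAPTIVE || \
	 (mode) == REL_SKETCH_DML_ADAPTIVE_GLOBAL)

/* The relationship list is consulted only in plain dml_adaptive mode,
 * and only when a parsed list is actually present. */
static bool
rel_sketch_should_check_relationships(RelSketchMode mode,
									  const void *parsed_list)
{
	return mode == REL_SKETCH_DML_ADAPTIVE && parsed_list != NULL;
}
```

The broader macro stays in use where both modes share behavior, while the relationship-list check uses the narrower equality test.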
> >
> > Documentation fixes
> > -------------------
> >
> > - Removed "(Lagless Replica Reads)" from section title and
> > "lagless" language from description.
> >
> > - Described fallback behavior when neither
> > replication_delay_source_cmd nor delay_threshold_by_time is
> > configured (TTL stays at 100ms default minimum).
> >
> > - "query cache" references removed (the query parse cache is gone).
> >
> > - Added 128-table-per-SELECT limit to Limitations section
> > (uses POOL_MAX_SELECT_OIDS).
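
The TTL fallback mentioned in the documentation fixes above can be sketched as below. This is a minimal illustration, not the code in the patch: the function name, the constant name, and the "delay source configured" flag are hypothetical, and treating 100 ms as a floor even when a delay is measured is an assumption based on the phrase "default minimum".

```c
#include <assert.h>
#include <stdint.h>

/* Assumed floor for the TTL, in milliseconds (hypothetical name). */
#define SKETCH_MIN_TTL_MS 100

/* TTL = replication_delay * ttl_factor, never below the minimum.
 * When no time-based delay source is configured, the TTL stays at
 * the minimum and is never updated from measured delay. */
static int64_t
sketch_ttl_ms(int64_t replication_delay_ms, double ttl_factor,
			  int delay_source_configured)
{
	int64_t		ttl;

	assert(ttl_factor >= 1.0);	/* documented valid range starts at 1.0 */

	if (!delay_source_configured || replication_delay_ms <= 0)
		return SKETCH_MIN_TTL_MS;

	ttl = (int64_t) ((double) replication_delay_ms * ttl_factor);
	return ttl < SKETCH_MIN_TTL_MS ? SKETCH_MIN_TTL_MS : ttl;
}
```

For example, with the default factor of 5.0 a measured delay of 200 ms gives a TTL of 1000 ms, while an unconfigured delay source leaves the TTL at 100 ms.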
> >
> > Code style fixes
> > ----------------
> >
> > - DLBOW_IS_DML_ADAPTIVE() calls no longer split across lines.
> >
> > - Split the long errmsg line in
> > is_select_object_in_temp_write_list.
> >
> > - Removed redundant is_adaptive variable in
> > is_select_object_in_temp_write_list (the check at function
> > entry already guarantees it).
> >
> > Thanks!
> >
> > On Wed, Apr 15, 2026 at 1:43 AM Tatsuo Ishii <[email protected]> wrote:
> >
> >> Hi Nadav,
> >>
> >> > Hi Tatsuo,
> >> >
> >> > Looks good to me thanks!
> >> >
> >> > Please go ahead with your review. waiting to hear back from you.
> >>
> >> Here are the code review results.
> >>
> >> diff --git a/doc/src/sgml/loadbalance.sgml b/doc/src/sgml/loadbalance.sgml
> >> index 9e1e7b39b..7384ce81a 100644
> >> --- a/doc/src/sgml/loadbalance.sgml
> >> +++ b/doc/src/sgml/loadbalance.sgml
> >> :
> >> + <sect2 id="runtime-config-table-mutation-map">
> >> + <title>Table Mutation Map Configuration (Lagless Replica Reads)</title>
> >>
> >> "(Lagless Replica Reads)" sounds like an advertisement to me. It
> >> should be removed.
> >>
> >> + <para>
> >> + These parameters configure the track table mutation feature, which is activated by setting
> >> + <xref linkend="guc-disable-load-balance-on-write"> to <literal>dml_adaptive_global</literal>.
> >> + The feature tracks recently written tables to prevent stale reads from replica nodes during
> >> + replication lag, implementing the "lagless" architecture pattern for distributed systems
> >> + with read replicas.
> >>
> >> I think the feature does not guarantee "lagless" anytime, in all cases.
> >>
> >> + <para>
> >> + This feature requires time-based replication delay monitoring. This can be provided by either
> >> + <xref linkend="guc-replication-delay-source-cmd"> (external command mode) or by setting
> >> + <xref linkend="guc-delay-threshold-by-time"> (which uses <literal>pg_stat_replication.replay_lag</literal>
> >> + from PostgreSQL 10+). At least one of these must be configured for the TTL calculation to work.
> >> If one of these is not set, what happens? Error? Need to describe it.
> >>
> >> + </para>
> >> +
> >> + <warning>
> >> + <para>
> >> + Enabling <literal>dml_adaptive_global</literal> increases shared memory consumption. With default settings,
> >> + the feature requires approximately 6.4 MB of shared memory (0.1 MB for table tracking + 6.3 MB for query cache).
> >>
> >> "query cache" should be "query parse cache".
> >>
> >> + Memory usage scales with configuration parameters:
> >> + </para>
> >> + <itemizedlist>
> >> + <listitem>
> >> + <para>
> >> + Table tracking: <literal>track_table_mutation_table_size * 40 bytes</literal> (default: 2048 * 40 = ~80 KB)
> >> + </para>
> >> + </listitem>
> >> + <listitem>
> >> + <para>
> >> + Query cache: <literal>track_table_mutation_query_parse_cache_size * 640 bytes</literal> (default: 10000 * 640 = ~6.3 MB)
> >>
> >> "query cache" should be "query parse cache".
> >>
> >> + <title>Limitations</title>
> >>
> >> I think the number of tables tracked in a SELECT is limited to 8. It should
> >> be mentioned.
> >>
> >> diff --git a/src/context/pool_query_context.c
> >> b/src/context/pool_query_context.c
> >> index a056ac596..0190d3673 100644
> >> --- a/src/context/pool_query_context.c
> >> +++ b/src/context/pool_query_context.c
> >> @@ -1828,15 +1829,23 @@ is_in_list(char *name, List *list)
> >> static bool
> >> is_select_object_in_temp_write_list(Node *node, void *context)
> >> {
> >> - if (node == NULL || pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE)
> >> + if (node == NULL ||
> >> + !DLBOW_IS_DML_ADAPTIVE(
> >> +
> >> pool_config->disable_load_balance_on_write))
> >>
> >> You don't need to split the line.
> >>
> >> + is_adaptive = DLBOW_IS_DML_ADAPTIVE(
> >> +
> >> pool_config->disable_load_balance_on_write);
> >>
> >> You don't need to split the line.
> >>
> >> - if (pool_config->disable_load_balance_on_write ==
> >> DLBOW_DML_ADAPTIVE && session_context->is_in_transaction)
> >> + if (is_adaptive &&
> >> + session_context->is_in_transaction)
> >> {
> >> ereport(DEBUG1,
> >>
> >> (errmsg("is_select_object_in_temp_write_list: \"%s\", found relation
> >> \"%s\"", (char *) context, rgv->relname)));
> >> This line is too long. Please split.
> >>
> >> @@ -1880,7 +1889,13 @@ static char
> >> *get_associated_object_from_dml_adaptive_relations
> >> void
> >> check_object_relationship_list(char *name, bool is_func_name)
> >> {
> >> - if (pool_config->disable_load_balance_on_write ==
> >> DLBOW_DML_ADAPTIVE &&
> >> pool_config->parsed_dml_adaptive_object_relationship_list)
> >> + bool is_adaptive;
> >> +
> >> + is_adaptive = DLBOW_IS_DML_ADAPTIVE(
> >> +
> >> pool_config->disable_load_balance_on_write);
> >>
> >> I wrote in the commit message:
> >>
> >> modifications are only detected in the same transaction). Note,
> >> however, you cannot use dml_adaptive_object_relationship_list to track
> >> dependency among table and other objects.
> >>
> >> In my understanding the feature does not use
> >> dml_adaptive_object_relationship_list. If this is correct, why
> >> check_object_relationship_list() is called here in case
> >> dml_adaptive_global? If the feature uses
> >> dml_adaptive_object_relationship_list, test cases should be included.
> >>
> >> diff --git a/src/utils/pool_track_table_mutation.c
> >> b/src/utils/pool_track_table_mutation.c
> >> new file mode 100644
> >> index 000000000..9be46b28f
> >> --- /dev/null
> >> +++ b/src/utils/pool_track_table_mutation.c
> >>
> >> It seems following functions are not used anywhere. I wonder if this
> >> feature actually use "query parse cache".
> >>
> >> pool_track_table_mutation_get_cached_parse
> >> pool_track_table_mutation_cache_parse
> >> pool_track_table_mutation_normalize_and_hash
> >>
> >> Besides the code review, I modified one of the regression tests to check
> >> whether the feature coexists with the existing in-memory query cache
> >> feature. After applying the attached patch, I ran 006.memqcache and got the
> >> following result.
> >>
> >> cd src/test/regression
> >> ./regress.sh 006
> >> creating pgpool-II temporary installation ...
> >> moving pgpool_setup to temporary installation path ...
> >> moving watchdog_setup to temporary installation path ...
> >> using pgpool-II at
> >> /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/temp/installed
> >> *************************
> >> REGRESSION MODE : install
> >> Pgpool-II version : pgpool-II version 4.8devel (mitsukakeboshi)
> >> Pgpool-II install path :
> >> /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/temp/installed
> >> PostgreSQL bin : /usr/local/pgsql/bin
> >> PostgreSQL Major version : 18
> >> pgbench : /usr/local/pgsql/bin/pgbench
> >> PostgreSQL jdbc :
> >> /usr/local/pgsql/share/postgresql-9.2-1003.jdbc4.jar
> >> *************************
> >> testing 006.memqcache...failed.
> >> out of 1 ok:0 failed:1 timeout:0
> >>
> >> log/006.memqcache shows:
> >>
> >> ../expected.txt result.txt differ: char 1, line 1
> >>
> >> So I checked the test script and found the error was generated by a
> >> Java program test.
> >>
> >> java jdbctest > result.txt 2>&1
> >> cmp ../expected.txt result.txt
> >> if [ $? != 0 ];then
> >> ./shutdownall
> >> exit 1
> >> fi
> >>
> >> In jdbctest.java:
> >>
> >> /*
> >> * Cache test in an explicit transaction
> >> */
> >> conn.setAutoCommit(false);
> >> // execute DML. This should prevent SELECTs from using
> >> query cache in the transaction.
> >> sql = "UPDATE t1 SET i = 2;";
> >> pst = conn.createStatement();
> >> pst.executeUpdate(sql);
> >> pst.close();
> >> // should not use the cache and should return "2", rather than "1"
> >> prest = conn.prepareStatement("SELECT * FROM t1");
> >> rs = prest.executeQuery();
> >>
> >> The expected file (expected.txt) has "2" but the result file
> >> (testdir/result.txt) was "1". This is the reason why the test
> >> failed. I wonder if there's something wrong with the feature when the
> >> query cache is enabled. Can you look into this?
> >>
> >> Regards,
> >> --
> >> Tatsuo Ishii
> >> SRA OSS K.K.
> >> English: http://www.sraoss.co.jp/index_en/
> >> Japanese:http://www.sraoss.co.jp
> >>
> >
> >
> > --
> > Nadav Shatz
> > Tailor Brands | CTO
>
--
Nadav Shatz
Tailor Brands | CTO
Attachments:
[application/octet-stream] v2-0001-Feature-load-balancing-control-by-table-tracking.patch (89.7K, 3-v2-0001-Feature-load-balancing-control-by-table-tracking.patch)
download | inline diff:
From 1ad39659cf4cec0baeabfc3d02ea9b88163e9046 Mon Sep 17 00:00:00 2001
From: Nadav Shatz <[email protected]>
Date: Sun, 19 Apr 2026 17:10:24 +0300
Subject: [PATCH v2] Feature: load balancing control by table tracking.
Prevent read-only queries from being routed to a standby when the
replication delay of tables used in the query exceeds a certain value
collected by the streaming replication process. To enable this feature,
set disable_load_balance_on_write to dml_adaptive_global.
In this mode, when tables are modified by
INSERT/UPDATE/DELETE/TRUNCATE/MERGE/data-modifying WITH, SELECTs using
those tables are not load balanced for a certain period:
i.e. they are routed to the primary PostgreSQL server to avoid stale
data caused by replication delay.
Unlike dml_adaptive mode, any table modifications described above are
detected even if they happen in other sessions (in dml_adaptive, table
modifications are only detected in the same transaction). Note,
however, that you cannot use dml_adaptive_object_relationship_list to
track dependencies among tables and other objects.
Besides dml_adaptive_global, there are some tuning knobs for the
feature:
- track_table_mutation_ttl_factor
Parameter to calculate TTL of each tracking data.
- track_table_mutation_max_staleness
Maximum duration in milliseconds that a single table entry can
continuously force queries to primary.
- track_table_mutation_cold_start_duration
Duration in milliseconds to route all queries to primary after a
child process starts.
- track_table_mutation_table_buckets
Number of hash buckets for the track table mutation hash table.
- track_table_mutation_table_size
Maximum number of tables that can be tracked simultaneously in the
track table mutation hash table.
Author: Nadav Shatz <[email protected]>
Reviewed-by: Tatsuo Ishii <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/20260407.181009.1762204033074164841.ishii%40postgresql.org#58c139c1a7f8d5562865921d0733667b
---
doc/src/sgml/loadbalance.sgml | 288 ++++++
src/Makefile.am | 1 +
src/config/pool_config_variables.c | 65 ++
src/context/pool_query_context.c | 242 ++++-
src/context/pool_session_context.c | 17 +-
src/include/pool.h | 3 +-
src/include/pool_config.h | 24 +-
src/include/utils/pool_track_table_mutation.h | 167 ++++
src/main/pgpool_main.c | 29 +-
src/protocol/CommandComplete.c | 28 +
src/protocol/child.c | 8 +
src/protocol/pool_proto_modules.c | 6 +-
src/sample/pgpool.conf.sample-stream | 45 +
src/streaming_replication/pool_worker_child.c | 24 +
src/test/regression/libs.sh | 2 +
.../tests/043.track_table_mutation/test.sh | 354 +++++++
.../044.track_table_mutation_watchdog/test.sh | 184 ++++
src/tools/pgindent/typedefs.list | 4 +
src/utils/pool_track_table_mutation.c | 902 ++++++++++++++++++
19 files changed, 2368 insertions(+), 25 deletions(-)
create mode 100644 src/include/utils/pool_track_table_mutation.h
create mode 100755 src/test/regression/tests/043.track_table_mutation/test.sh
create mode 100755 src/test/regression/tests/044.track_table_mutation_watchdog/test.sh
create mode 100644 src/utils/pool_track_table_mutation.c
diff --git a/doc/src/sgml/loadbalance.sgml b/doc/src/sgml/loadbalance.sgml
index 9e1e7b39b..d4fbcf1a5 100644
--- a/doc/src/sgml/loadbalance.sgml
+++ b/doc/src/sgml/loadbalance.sgml
@@ -1110,6 +1110,18 @@ app_name_redirect_preference_list > database_redirect_preference_list > us
Dependent functions, triggers, and views on the tables can be configured
using <xref linkend="guc-dml-adaptive-object-relationship-list">
</para>
+
+ <para>
+ If this parameter is set to <varname>dml_adaptive_global</varname>,
+ <productname>Pgpool-II</> behaves like <varname>dml_adaptive</varname>
+ (per-transaction write tracking) and additionally uses shared memory to track
+ recently written tables across all sessions cluster-wide. When a table is
+ written in any session, subsequent reads of that table from any session are
+ routed to primary until a TTL (based on measured replication delay) expires.
+ This prevents stale reads after writes even across different connections.
+ See <xref linkend="runtime-config-table-mutation-map"> for the sub-parameters
+ that control the shared-memory tracking behavior.
+ </para>
</listitem>
</varlistentry>
@@ -1195,4 +1207,280 @@ dml_adaptive_object_relationship_list = 'table_1:table_2'
</variablelist>
</sect2>
+
+ <sect2 id="runtime-config-table-mutation-map">
+ <title>Table Mutation Tracking Configuration</title>
+
+ <para>
+ These parameters configure the track table mutation feature, which is activated by setting
+ <xref linkend="guc-disable-load-balance-on-write"> to <literal>dml_adaptive_global</literal>.
+ The feature tracks recently written tables to prevent stale reads from replica nodes during
+ replication lag.
+ </para>
+
+ <para>
+ When a table is modified (for example by INSERT, UPDATE, or DELETE), it is marked as "stale" for a TTL period
+ (<literal>replication_delay * track_table_mutation_ttl_factor</literal>). Any SELECT queries on stale tables are routed
+ to the primary node instead of replicas, ensuring read-after-write consistency.
+ </para>
+
+ <para>
+ This feature requires time-based replication delay monitoring. This can be provided by either
+ <xref linkend="guc-replication-delay-source-cmd"> (external command mode) or by setting
+ <xref linkend="guc-delay-threshold-by-time"> (which uses <literal>pg_stat_replication.replay_lag</literal>
+ from PostgreSQL 10+). If neither is configured, the TTL remains at its default minimum value
+ (100 milliseconds) and is never updated based on actual replication delay, which may result
+ in suboptimal routing decisions.
+ </para>
+
+ <warning>
+ <para>
+ Enabling <literal>dml_adaptive_global</literal> increases shared memory consumption. With default settings,
+ the feature requires approximately 80 KB of shared memory for table tracking:
+ <literal>track_table_mutation_table_size * 40 bytes</literal> (default: 2048 * 40 = ~80 KB).
+ </para>
+ </warning>
+
+ <variablelist>
+
+ <varlistentry id="guc-track-table-mutation-ttl-factor" xreflabel="track_table_mutation_ttl_factor">
+ <term><varname>track_table_mutation_ttl_factor</varname> (<type>floating point</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_ttl_factor</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Multiplier for calculating the TTL: <literal>TTL = replication_delay * track_table_mutation_ttl_factor</literal>.
+ Higher values provide more safety margin but may reduce read replica utilization.
+ </para>
+ <para>
+ Valid range: 1.0-100.0. Default is <literal>5.0</literal>.
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-max-staleness" xreflabel="track_table_mutation_max_staleness">
+ <term><varname>track_table_mutation_max_staleness</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_max_staleness</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Maximum duration in milliseconds that a single table entry can continuously force queries to primary,
+ measured from when the table was first marked stale. When this cap is reached, the entry is expired
+ regardless of recent writes. If the table is written to again after expiry, a fresh tracking entry
+ is created.
+ </para>
+ <para>
+ This parameter bounds the cross-session impact of table mutation tracking. Even if a table is written
+ to in a tight loop, its effect on other sessions' load balancing is limited to this duration. For
+ legitimately busy tables, the gap between forced expiry and the next write re-marking the table is
+ negligible (typically milliseconds).
+ </para>
+ <para>
+ Set to 0 to disable the cap (not recommended for production).
+ Valid range: 0-3600000 ms. Default is <literal>60000</literal> (60 seconds).
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-cold-start-duration" xreflabel="track_table_mutation_cold_start_duration">
+ <term><varname>track_table_mutation_cold_start_duration</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_cold_start_duration</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Duration in milliseconds to route all queries to primary after a child process starts.
+ This prevents stale reads when a new connection is established before the track table mutation hash table
+ is populated with recent write history.
+ </para>
+ <para>
+ When watchdog is enabled and the local node becomes the leader, Pgpool-II also triggers a
+ global cold start for this duration to avoid stale reads after leadership changes.
+ </para>
+ <para>
+ Valid range: 0-60000 ms. Default is <literal>2000</literal> (2 seconds).
+ Set to 0 to disable cold start behavior.
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-table-buckets" xreflabel="track_table_mutation_table_buckets">
+ <term><varname>track_table_mutation_table_buckets</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_table_buckets</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Number of hash buckets for the track table mutation hash table.
+ Higher values reduce hash collisions and improve lookup performance.
+ </para>
+ <para>
+ Valid range: 64-65536. Default is <literal>1024</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-table-size" xreflabel="track_table_mutation_table_size">
+ <term><varname>track_table_mutation_table_size</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_table_size</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Maximum number of tables that can be tracked simultaneously in the track table mutation hash table.
+ When full, oldest entries are evicted using a simple eviction strategy.
+ </para>
+ <para>
+ Valid range: 128-131072. Default is <literal>2048</literal>.
+ Memory usage: approximately 40 bytes per entry.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ <sect3 id="runtime-config-track-table-mutation-example">
+ <title>Track Table Mutation Configuration Example</title>
+ <para>
+ To enable track table mutation with replication delay monitoring:
+ </para>
+ <programlisting>
+# Enable dml_adaptive_global mode (includes track table mutation)
+disable_load_balance_on_write = 'dml_adaptive_global'
+track_table_mutation_ttl_factor = 5.0
+track_table_mutation_max_staleness = 60000
+track_table_mutation_cold_start_duration = 2000
+
+# Option A: Use external command for replication delay
+replication_delay_source_cmd = '/path/to/get-replication-delay.sh'
+replication_delay_source_timeout = 10
+
+# Option B: Use pg_stat_replication replay_lag (PG 10+)
+# delay_threshold_by_time = 1000
+
+# Adjust table map size based on workload
+track_table_mutation_table_size = 4096
+ </programlisting>
+ <para>
+ Shared memory required for above configuration: approximately 160 KB for the table map.
+ Default configuration (2048 tables) requires approximately 80 KB.
+ </para>
+ </sect3>
+
+ <sect3 id="runtime-config-track-table-mutation-limitations">
+ <title>Limitations</title>
+ <para>
+ The track table mutation feature has the following limitations:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>PREPARE</literal> statements are not tracked. When a prepared statement
+ containing data modification is executed, the table mutation is not recorded.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ A maximum of 128 tables can be tracked per SELECT query for staleness checking.
+ This limit is shared with the query cache subsystem
+ (<literal>POOL_MAX_SELECT_OIDS</literal>).
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ If your application uses prepared statements and requires read-after-write consistency,
+ consider using explicit transaction routing or the <literal>/*NO LOAD BALANCE*/</literal>
+ comment directive for affected queries.
+ </para>
+ <para>
+ The following statement types <emphasis>are</emphasis> tracked and will mark tables as stale:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>INSERT</literal>, <literal>UPDATE</literal>, <literal>DELETE</literal>
+ statements (including those with <literal>RETURNING</literal> clauses).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>TRUNCATE</literal> statements (including multiple tables).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>MERGE</literal> statements (PostgreSQL 15+).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>WITH</literal> clauses containing data modifications (Common Table Expressions
+ with <literal>INSERT</literal>, <literal>UPDATE</literal>, or <literal>DELETE</literal>).
+ For example, <literal>WITH deleted AS (DELETE FROM t1 RETURNING *) SELECT * FROM deleted</literal>
+ will properly mark table <literal>t1</literal> as stale.
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ <emphasis>Transaction Rollback Behavior:</emphasis> Within explicit transactions, tables
+ are only marked as stale in shared memory when the transaction is committed. If the
+ transaction is rolled back, no tables are marked, because nothing was committed on the
+ primary and so no change reaches the replicas. This prevents rolled-back transactions from
+ unnecessarily disabling load balancing. For autocommit statements (outside explicit transactions),
+ tables are marked immediately upon command completion.
+ </para>
+
+ <para>
+ <emphasis>Cross-Session Impact and Safety Bounds:</emphasis>
+ Unlike <literal>dml_adaptive</literal> (which only affects the session that issued the write),
+ <literal>dml_adaptive_global</literal> affects all sessions reading the same table in the same database.
+ The following safety mechanisms bound this cross-session impact:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <emphasis>Maximum staleness cap:</emphasis> The <xref linkend="guc-track-table-mutation-max-staleness">
+ parameter (default: 60 seconds) limits how long a single table entry can force primary routing,
+ measured from the first write recorded for that entry. Once the cap is reached the entry expires, even
+ if writes continued in the meantime; only a later committed write marks the table stale again. A value of 0 disables the cap.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <emphasis>Database isolation:</emphasis> Table staleness tracking is scoped by database OID. Writes
+ in one database never affect load balancing decisions for sessions connected to a different database.
+ In multi-tenant deployments where tenants use separate databases, one tenant's write activity cannot
+ influence another tenant's query routing.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <emphasis>Committed writes only:</emphasis> Only committed transactions mark tables as stale.
+ Rolled-back transactions have no effect on the shared tracking state.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <emphasis>Bounded table map size:</emphasis> The shared memory table map has a fixed maximum size
+ (<xref linkend="guc-track-table-mutation-table-size">). At most this many tables can be marked stale
+ simultaneously, providing a natural ceiling on the feature's impact.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </sect3>
+
+ </sect2>
+
</sect1>
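
Note on the staleness rules documented above: the sketch below shows one
way the TTL (replication delay * track_table_mutation_ttl_factor) and the
track_table_mutation_max_staleness cap could combine into a single
"is this table stale?" decision. It is a minimal illustration under stated
assumptions, not the code in utils/pool_track_table_mutation.c, which may
differ. The names sketch_ttl_us and sketch_is_stale are hypothetical, and
plain microsecond counters stand in for struct timeval.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: derive the TTL in microseconds from the measured
 * replication delay and the configured factor (documented range 1.0-100.0).
 */
static uint64_t
sketch_ttl_us(uint64_t delay_us, double ttl_factor)
{
	assert(ttl_factor >= 1.0 && ttl_factor <= 100.0);
	return (uint64_t) ((double) delay_us * ttl_factor);
}

/*
 * Hypothetical check: an entry is stale while the last write is younger
 * than the TTL, unless the cap (counted from the first write) has been
 * reached. max_staleness_ms == 0 disables the cap.
 */
static bool
sketch_is_stale(uint64_t now_us, uint64_t first_write_us,
				uint64_t last_write_us, uint64_t ttl_us,
				uint64_t max_staleness_ms)
{
	assert(first_write_us <= last_write_us && last_write_us <= now_us);

	/* TTL since the most recent write has run out */
	if (now_us - last_write_us >= ttl_us)
		return false;

	/* cap since the first write has been reached */
	if (max_staleness_ms > 0 &&
		now_us - first_write_us >= max_staleness_ms * 1000)
		return false;

	return true;
}
```

With a 200 ms delay and the default factor of 5.0, sketch_ttl_us() yields a
1 second TTL, so a SELECT arriving 0.5 s after the write would go to the
primary and one arriving 1.5 s after it could be load balanced.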
diff --git a/src/Makefile.am b/src/Makefile.am
index 4678ab530..39588af58 100644
--- a/src/Makefile.am
+++ b/src/Makefile.am
@@ -35,6 +35,7 @@ pgpool_SOURCES = main/main.c \
rewrite/pool_timestamp.c \
rewrite/pool_lobj.c \
utils/pool_select_walker.c \
+ utils/pool_track_table_mutation.c \
utils/strlcpy.c \
utils/psprintf.c \
utils/pool_params.c \
diff --git a/src/config/pool_config_variables.c b/src/config/pool_config_variables.c
index b775b2106..3039e32f0 100644
--- a/src/config/pool_config_variables.c
+++ b/src/config/pool_config_variables.c
@@ -290,6 +290,7 @@ static const struct config_enum_entry disable_load_balance_on_write_options[] =
{"trans_transaction", DLBOW_TRANS_TRANSACTION, false},
{"always", DLBOW_ALWAYS, false},
{"dml_adaptive", DLBOW_DML_ADAPTIVE, false},
+ {"dml_adaptive_global", DLBOW_DML_ADAPTIVE_GLOBAL, false},
{NULL, 0, false}
};
@@ -1777,6 +1778,19 @@ static struct config_int_array ConfigureNamesIntArray[] =
static struct config_double ConfigureNamesDouble[] =
{
+ {
+ {"track_table_mutation_ttl_factor",
+ CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "TTL multiplier for track table mutation "
+ "(TTL = replication_delay * factor)",
+ CONFIG_VAR_TYPE_DOUBLE, false, 0
+ },
+ &g_pool_config.track_table_mutation_ttl_factor,
+ 5.0, /* boot value: 5x replication delay */
+ 1.0, 100.0, /* min, max */
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
EMPTY_CONFIG_DOUBLE
};
@@ -2397,6 +2411,57 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"track_table_mutation_max_staleness",
+ CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "Maximum duration in milliseconds that a "
+ "table can be marked stale from its first "
+ "write. 0 disables the cap.",
+ CONFIG_VAR_TYPE_INT, false, GUC_UNIT_MS
+ },
+ &g_pool_config.track_table_mutation_max_staleness,
+ 60000, /* 60 seconds */
+ 0, 3600000, /* 0 to 1 hour */
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_cold_start_duration",
+ CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "Duration in milliseconds to force queries "
+ "to primary after child process starts.",
+ CONFIG_VAR_TYPE_INT, false, GUC_UNIT_MS
+ },
+ &g_pool_config.track_table_mutation_cold_start_duration,
+ 2000, /* 2 seconds */
+ 0, 60000, /* 0 to 60 seconds */
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_table_buckets",
+ CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Number of hash buckets for track table mutation.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_table_buckets,
+ 1024,
+ 64, 65536,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_table_size",
+ CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Maximum number of entries in track table mutation.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_table_size,
+ 2048,
+ 128, 131072,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
EMPTY_CONFIG_INT
};
diff --git a/src/context/pool_query_context.c b/src/context/pool_query_context.c
index a056ac596..c20a3a420 100644
--- a/src/context/pool_query_context.c
+++ b/src/context/pool_query_context.c
@@ -29,6 +29,7 @@
#include "utils/statistics.h"
#include "utils/pool_select_walker.h"
#include "utils/pool_stream.h"
+#include "utils/pool_track_table_mutation.h"
#include "context/pool_session_context.h"
#include "context/pool_query_context.h"
#include "parser/nodes.h"
@@ -1828,20 +1829,26 @@ is_in_list(char *name, List *list)
static bool
is_select_object_in_temp_write_list(Node *node, void *context)
{
- if (node == NULL || pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE)
+ if (node == NULL ||
+ !DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write))
return false;
if (IsA(node, RangeVar))
{
RangeVar *rgv = (RangeVar *) node;
- POOL_SESSION_CONTEXT *session_context = pool_get_session_context(false);
+ POOL_SESSION_CONTEXT *session_context;
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && session_context->is_in_transaction)
+ session_context = pool_get_session_context(false);
+
+ if (session_context->is_in_transaction)
{
ereport(DEBUG1,
- (errmsg("is_select_object_in_temp_write_list: \"%s\", found relation \"%s\"", (char *) context, rgv->relname)));
+ (errmsg("is_select_object_in_temp_write_list:"
+ " \"%s\", found relation \"%s\"",
+ (char *) context, rgv->relname)));
- return is_in_list(rgv->relname, session_context->transaction_temp_write_list);
+ return is_in_list(rgv->relname,
+ session_context->transaction_temp_write_list);
}
}
@@ -1880,15 +1887,22 @@ static char *get_associated_object_from_dml_adaptive_relations
void
check_object_relationship_list(char *name, bool is_func_name)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && pool_config->parsed_dml_adaptive_object_relationship_list)
+ bool is_adaptive;
+
+ is_adaptive =
+ (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE);
+
+ if (is_adaptive &&
+ pool_config->parsed_dml_adaptive_object_relationship_list)
{
POOL_SESSION_CONTEXT *session_context = pool_get_session_context(false);
if (session_context->is_in_transaction)
{
char *right_token =
- get_associated_object_from_dml_adaptive_relations
- (name, is_func_name ? OBJECT_TYPE_FUNCTION : OBJECT_TYPE_RELATION);
+ get_associated_object_from_dml_adaptive_relations
+ (name, is_func_name ? OBJECT_TYPE_FUNCTION : OBJECT_TYPE_RELATION);
if (right_token)
{
@@ -1947,7 +1961,7 @@ add_object_into_temp_write_list(Node *node, void *context)
static void
dml_adaptive(Node *node, char *query)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE)
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write))
{
/* Set/Unset transaction status flags */
if (IsA(node, TransactionStmt))
@@ -1966,6 +1980,45 @@ dml_adaptive(Node *node, char *query)
}
else if (is_commit_or_rollback_query(node))
{
+ /*
+ * For dml_adaptive_global: on COMMIT, flush the accumulated
+ * table writes to shared memory. On ROLLBACK, skip -- the
+ * writes never committed so no stale-read risk exists. This
+ * prevents polluting the table map with rolled-back
+ * transactions.
+ */
+ int dlbow =
+ pool_config->disable_load_balance_on_write;
+ List *wlist =
+ session_context->transaction_temp_write_list;
+
+ if (dlbow == DLBOW_DML_ADAPTIVE_GLOBAL &&
+ is_commit_query(node) &&
+ wlist != NIL)
+ {
+ ListCell *cell;
+ int dboid;
+
+ dboid =
+ pool_track_table_mutation_get_database_oid();
+ if (dboid > 0)
+ {
+ foreach(cell, wlist)
+ {
+ char *tname;
+ int toid;
+
+ tname = (char *) lfirst(cell);
+ toid =
+ pool_table_name_to_oid(tname);
+
+ if (toid > 0)
+ pool_track_table_mutation_mark_table_written(
+ toid, dboid);
+ }
+ }
+ }
+
session_context->is_in_transaction = false;
if (session_context->transaction_temp_write_list != NIL)
@@ -2008,7 +2061,7 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
session_context = pool_get_session_context(false);
backend = session_context->backend;
- /*
+ /*
* Collect/discard information for disable_load_balance_on_write =
* dml_adaptive case.
*/
@@ -2022,6 +2075,20 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
if (dest == POOL_PRIMARY)
{
pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
+
+ /*
+ * Resolve table and database OIDs now to populate relcache. This
+ * avoids potential hangs in CommandComplete where we shouldn't be
+ * running new queries against the backend.
+ */
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ int *oids;
+
+ pool_extract_table_oids(node, &oids);
+ pool_track_table_mutation_get_database_oid();
+ }
}
/* Should be sent to both primary and standby? */
else if (dest == POOL_BOTH)
@@ -2151,6 +2218,153 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
{
pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
}
+
+ /*
+ * Check track table mutation for recently written tables. If
+ * in cold start or any table was recently written, route to
+ * primary to avoid stale reads.
+ */
+ else if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ bool force_primary = false;
+ int lb_node;
+ POOL_QUERY_CONTEXT *qctx =
+ session_context->query_context;
+
+ if (pool_track_table_mutation_in_cold_start())
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load balance"
+ " because of track table"
+ " mutation cold start"),
+ errdetail("destination = PRIMARY"
+ " for query= \"%s\"",
+ query)));
+ force_primary = true;
+ }
+ else
+ {
+ SelectContext ctx;
+ int dboid;
+ int num_oids;
+ int i;
+
+ memset(&ctx, 0, sizeof(ctx));
+ num_oids =
+ pool_extract_table_oids_from_select_stmt(
+ node, &ctx);
+ if (num_oids > 0)
+ {
+ dboid =
+ pool_track_table_mutation_get_database_oid();
+
+ if (dboid <= 0)
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load"
+ " balance because"
+ " database oid was"
+ " unavailable"),
+ errdetail("destination"
+ " = PRIMARY for"
+ " query= \"%s\"",
+ query)));
+ force_primary = true;
+ }
+ else
+ {
+ for (i = 0; i < num_oids; i++)
+ {
+ bool stale;
+
+ stale =
+ pool_track_table_mutation_table_is_stale(
+ ctx.table_oids[i],
+ dboid);
+ if (stale)
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load"
+ " balance because"
+ " table \"%s\" was"
+ " recently written",
+ ctx.table_names[i]),
+ errdetail("destination"
+ " = PRIMARY for"
+ " query= \"%s\"",
+ query)));
+ force_primary = true;
+ break;
+ }
+ }
+ }
+ }
+ }
+
+ if (force_primary)
+ {
+ pool_set_node_to_be_sent(
+ query_context,
+ PRIMARY_NODE_ID);
+ }
+ else
+ {
+ if (pool_config->statement_level_load_balance)
+ {
+ session_context->load_balance_node_id =
+ select_load_balancing_node();
+ }
+
+ /*
+ * If replication delay is too much, and
+ * prefer_lower_delay_standby is true then elect the
+ * lowest-delayed node, otherwise send to primary.
+ */
+ lb_node =
+ session_context->load_balance_node_id;
+ if (STREAM &&
+ check_replication_delay(lb_node))
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load"
+ " balance because of"
+ " too much replication"
+ " delay"),
+ errdetail("destination"
+ " = %d for"
+ " query= \"%s\"",
+ dest, query)));
+
+ if (pool_config->prefer_lower_delay_standby)
+ {
+ lb_node =
+ select_load_balancing_node();
+ session_context->load_balance_node_id =
+ lb_node;
+ qctx->load_balance_node_id =
+ lb_node;
+ pool_set_node_to_be_sent(
+ query_context,
+ lb_node);
+ }
+ else
+ {
+ pool_set_node_to_be_sent(
+ query_context,
+ PRIMARY_NODE_ID);
+ }
+ }
+ else
+ {
+ qctx->load_balance_node_id =
+ session_context->load_balance_node_id;
+ pool_set_node_to_be_sent(
+ query_context,
+ qctx->load_balance_node_id);
+ }
+ }
+ }
else
{
if (pool_config->statement_level_load_balance)
@@ -2171,7 +2385,8 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
errdetail("destination = %d for query= \"%s\"", dest, query)));
/*
- * If prefer_lower_delay_standby is on, choose lower delay standby.
+ * If prefer_lower_delay_standby is on, choose lower
+ * delay standby.
*/
if (pool_config->prefer_lower_delay_standby)
{
@@ -2181,7 +2396,8 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
session_context->query_context->load_balance_node_id = session_context->load_balance_node_id;
pool_set_node_to_be_sent(query_context, session_context->query_context->load_balance_node_id);
}
- else /* delay is too much. prefer to send to primary */
+ else /* delay is too much. prefer to send to
+ * primary */
{
pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
}
@@ -2191,7 +2407,7 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
* Not streaming replication mode, or delay_threshold is 0
* or replication delay is small enough.
*/
- else
+ else
{
session_context->query_context->load_balance_node_id = session_context->load_balance_node_id;
pool_set_node_to_be_sent(query_context,
diff --git a/src/context/pool_session_context.c b/src/context/pool_session_context.c
index ded41c7fc..be30f1a7c 100644
--- a/src/context/pool_session_context.c
+++ b/src/context/pool_session_context.c
@@ -532,7 +532,7 @@ dump_sent_message(char *caller, POOL_SENT_MESSAGE *m)
static void
dml_adaptive_init(void)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE)
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write))
{
session_context->is_in_transaction = false;
session_context->transaction_temp_write_list = NIL;
@@ -542,7 +542,9 @@ dml_adaptive_init(void)
static void
dml_adaptive_destroy(void)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && session_context)
+ if (DLBOW_IS_DML_ADAPTIVE(
+ pool_config->disable_load_balance_on_write) &&
+ session_context)
{
if (session_context->transaction_temp_write_list != NIL)
list_free_deep(session_context->transaction_temp_write_list);
@@ -738,10 +740,15 @@ void
pool_set_writing_transaction(void)
{
/*
- * If disable_transaction_on_write is 'off' or 'dml_adaptive', then never
- * turn on writing transaction flag.
+ * If disable_load_balance_on_write is 'off' or 'dml_adaptive', then never
+ * turn on writing transaction flag. For dml_adaptive_global we do set it
+ * so that the query cache (memqcache) is properly skipped after DML
+ * within the same transaction.
*/
- if (pool_config->disable_load_balance_on_write != DLBOW_OFF && pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE)
+ if (pool_config->disable_load_balance_on_write !=
+ DLBOW_OFF &&
+ pool_config->disable_load_balance_on_write !=
+ DLBOW_DML_ADAPTIVE)
{
pool_get_session_context(false)->writing_transaction = true;
ereport(DEBUG5,
diff --git a/src/include/pool.h b/src/include/pool.h
index 65907dcf1..79d7988fc 100644
--- a/src/include/pool.h
+++ b/src/include/pool.h
@@ -424,7 +424,7 @@ typedef enum
#define Min(x, y) ((x) < (y) ? (x) : (y))
-#define MAX_NUM_SEMAPHORES 8
+#define MAX_NUM_SEMAPHORES 9
#define CONN_COUNTER_SEM 0
#define REQUEST_INFO_SEM 1
#define QUERY_CACHE_STATS_SEM 2
@@ -434,6 +434,7 @@ typedef enum
#define FOLLOW_PRIMARY_SEM 6
#define MAIN_EXIT_HANDLER_SEM 7 /* used in exit_hander in pgpool main
* process */
+#define TRACK_TABLE_MUTATION_TABLE_SEM 8
#define MAX_REQUEST_QUEUE_SIZE 10
#define MAX_SEC_WAIT_FOR_CLUSTER_TRANSACTION 10 /* time in seconds to keep
diff --git a/src/include/pool_config.h b/src/include/pool_config.h
index 9a397d166..b8abadd50 100644
--- a/src/include/pool_config.h
+++ b/src/include/pool_config.h
@@ -105,9 +105,13 @@ typedef enum DLBOW_OPTION
DLBOW_TRANSACTION,
DLBOW_TRANS_TRANSACTION,
DLBOW_ALWAYS,
- DLBOW_DML_ADAPTIVE
+ DLBOW_DML_ADAPTIVE,
+ DLBOW_DML_ADAPTIVE_GLOBAL
} DLBOW_OPTION;
+#define DLBOW_IS_DML_ADAPTIVE(opt) \
+ ((opt) == DLBOW_DML_ADAPTIVE || (opt) == DLBOW_DML_ADAPTIVE_GLOBAL)
+
typedef enum RELQTARGET_OPTION
{
RELQTARGET_PRIMARY = 1,
@@ -363,8 +367,22 @@ typedef struct
char *sr_check_password; /* password for sr_check_user */
char *sr_check_database; /* PostgreSQL database name for streaming
* replication check */
- char *replication_delay_source_cmd; /* external command for replication delay */
- int replication_delay_source_timeout; /* timeout for external command in seconds */
+ char *replication_delay_source_cmd; /* external command for
+ * replication delay */
+ int replication_delay_source_timeout; /* timeout for external
+ * command in seconds */
+
+ /* Track table mutation configuration */
+ double track_table_mutation_ttl_factor; /* TTL multiplier for
+ * replication delay */
+ int track_table_mutation_max_staleness; /* max staleness duration
+ * ms */
+ int track_table_mutation_cold_start_duration; /* cold start duration
+ * ms */
+ int track_table_mutation_table_buckets; /* hash buckets for table
+ * map */
+ int track_table_mutation_table_size; /* max table map entries */
+
char *failover_command; /* execute command when failover happens */
char *follow_primary_command; /* execute command when failover is
* ended */
diff --git a/src/include/utils/pool_track_table_mutation.h b/src/include/utils/pool_track_table_mutation.h
new file mode 100644
index 000000000..dfbac666d
--- /dev/null
+++ b/src/include/utils/pool_track_table_mutation.h
@@ -0,0 +1,167 @@
+/* -*-pgsql-c-*- */
+/*
+ * pgpool: a language independent connection pool server for PostgreSQL
+ * written by Tatsuo Ishii
+ *
+ * Copyright (c) 2003-2026 PgPool Global Development Group
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose and without fee is hereby
+ * granted, provided that the above copyright notice appear in all
+ * copies and that both that copyright notice and this permission
+ * notice appear in supporting documentation, and that the name of the
+ * author not be used in advertising or publicity pertaining to
+ * distribution of the software without specific, written prior
+ * permission. The author makes no representations about the
+ * suitability of this software for any purpose. It is provided "as
+ * is" without express or implied warranty.
+ *
+ * pool_track_table_mutation.h: In-memory tracking of
+ * recently written tables to prevent stale reads.
+ */
+
+#ifndef POOL_TRACK_TABLE_MUTATION_H
+#define POOL_TRACK_TABLE_MUTATION_H
+
+#include "pool.h"
+#include <sys/time.h>
+
+/*
+ * Invalid index marker for linked lists
+ */
+#define TRACK_TABLE_MUTATION_INVALID_INDEX (-1)
+
+/*
+ * Default TTL in microseconds (100ms) used when replication delay is unknown
+ */
+#define TRACK_TABLE_MUTATION_DEFAULT_TTL_US (100 * 1000)
+
+/*
+ * Entry in the table mutation hash table (keyed by table/database oids)
+ */
+typedef struct TrackTableMutationEntry
+{
+ int table_oid; /* Table oid */
+ int dboid; /* Database oid */
+ struct timeval first_write_time; /* When the entry was first created */
+ struct timeval last_write_time; /* When the table was last written */
+ uint32 hash; /* Pre-computed hash value */
+ int next; /* Next in collision chain */
+ bool in_use; /* Is this entry in use? */
+} TrackTableMutationEntry;
+
+/*
+ * Header for the table mutation hash table in shared memory
+ */
+typedef struct TrackTableMutationHashTable
+{
+ int num_buckets; /* Number of hash buckets */
+ int max_entries; /* Maximum entries allowed */
+ int num_entries; /* Current number of entries */
+ int free_list_head; /* Head of free entry list */
+
+ /*
+ * Followed in shared memory by int buckets[num_buckets] and then
+ * TrackTableMutationEntry entries[max_entries] (not declared here).
+ */
+} TrackTableMutationHashTable;
+
+/*
+ * Global state for track table mutation feature
+ */
+typedef struct TrackTableMutationState
+{
+ bool initialized; /* Shmem initialized? */
+ uint64 current_ttl_us; /* Current TTL in microseconds */
+ struct timeval ttl_last_updated; /* When TTL was last updated */
+ struct timeval last_cleanup_time; /* When last expired cleanup ran */
+ struct timeval global_cold_start_until; /* Global cold start end time */
+ uint32 stats_queries_checked; /* Queries checked */
+ uint32 stats_forced_primary; /* Forced to primary */
+ uint32 stats_allowed_replica; /* Allowed to replica */
+} TrackTableMutationState;
+
+/*
+ * Main shared memory structure containing all components
+ */
+typedef struct TrackTableMutationShmem
+{
+ TrackTableMutationState state;
+ TrackTableMutationHashTable *table_map;
+} TrackTableMutationShmem;
+
+/* ----------------
+ * Public API functions
+ * ----------------
+ */
+
+/*
+ * Initialize shared memory structures for track table mutation.
+ * Called from pgpool_main.c after pool_init_pool_info().
+ */
+extern void pool_track_table_mutation_init(void);
+
+/*
+ * Initialize per-child process state for track table mutation.
+ * Called from child.c when a new child process starts.
+ * Sets up cold start tracking.
+ */
+extern void pool_track_table_mutation_child_init(void);
+
+/*
+ * Check if the child process is in cold start period.
+ * During cold start, all queries are routed to primary.
+ * Returns true if in cold start, false otherwise.
+ */
+extern bool pool_track_table_mutation_in_cold_start(void);
+
+/*
+ * Trigger a global cold start period for all processes.
+ * Used after watchdog leader change to avoid stale reads.
+ */
+extern void pool_track_table_mutation_trigger_global_cold_start(void);
+
+/*
+ * Get oid of current database.
+ */
+extern int pool_track_table_mutation_get_database_oid(void);
+
+/*
+ * Check if a table was recently written to (is "stale").
+ * If stale, reads from this table should go to primary.
+ * Returns true if table is stale (recently written), false otherwise.
+ */
+extern bool pool_track_table_mutation_table_is_stale(
+ int table_oid, int dboid);
+
+/*
+ * Mark tables as recently written.
+ * Called after INSERT/UPDATE/DELETE queries complete.
+ * table_oids: array of table oids
+ * num_tables: number of tables in array
+ * dboid: database oid
+ */
+extern void pool_track_table_mutation_mark_tables_written(
+ const int *table_oids, int num_tables, int dboid);
+
+/*
+ * Convenience function to mark a single table as written.
+ * table_oid: table oid
+ * dboid: database oid
+ */
+extern void pool_track_table_mutation_mark_table_written(
+ int table_oid, int dboid);
+
+/*
+ * Update the TTL based on current replication delay.
+ * Called from pool_worker_child.c when replication delay is updated.
+ * delay_us: replication delay in microseconds
+ */
+extern void pool_track_table_mutation_update_ttl(uint64 delay_us);
+
+/*
+ * Calculate required shared memory size for track table mutation.
+ */
+extern Size pool_track_table_mutation_shmem_size(void);
+
+#endif /* POOL_TRACK_TABLE_MUTATION_H */
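
Note on the layout declared in the header above: the table map is an
index-linked hash table (bucket heads, a "next" index per entry, and a free
list) rather than a pointer-linked one, which is what lets it live in
shared memory. The sketch below illustrates that technique only. It is an
assumption about how the structures are meant to be used, not the patch's
implementation: the Sk* names, the tiny fixed sizes and the hash function
are all hypothetical, and locking (TRACK_TABLE_MUTATION_TABLE_SEM in the
patch) is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SK_INVALID	(-1)	/* same role as TRACK_TABLE_MUTATION_INVALID_INDEX */
#define SK_BUCKETS	8		/* hypothetical, tiny on purpose */
#define SK_ENTRIES	4

typedef struct SkEntry
{
	int			table_oid;
	int			dboid;
	int			next;		/* next index in bucket chain or free list */
	bool		in_use;
} SkEntry;

typedef struct SkTable
{
	int			num_entries;
	int			free_list_head;
	int			buckets[SK_BUCKETS];	/* head entry index per bucket */
	SkEntry		entries[SK_ENTRIES];
} SkTable;

/* Arbitrary mixing function for the sketch; not the patch's hash. */
static uint32_t
sk_hash(int table_oid, int dboid)
{
	return ((uint32_t) table_oid * 2654435761u) ^ (uint32_t) dboid;
}

/* Empty all buckets and thread every entry onto the free list. */
static void
sk_init(SkTable *t)
{
	int			i;

	assert(t != NULL);
	for (i = 0; i < SK_BUCKETS; i++)
		t->buckets[i] = SK_INVALID;
	for (i = 0; i < SK_ENTRIES; i++)
	{
		t->entries[i].in_use = false;
		t->entries[i].next = (i + 1 < SK_ENTRIES) ? i + 1 : SK_INVALID;
	}
	t->free_list_head = 0;
	t->num_entries = 0;
}

/* Walk the collision chain; return the entry index or SK_INVALID. */
static int
sk_find(const SkTable *t, int table_oid, int dboid)
{
	int			idx;

	assert(t != NULL);
	idx = t->buckets[sk_hash(table_oid, dboid) % SK_BUCKETS];
	while (idx != SK_INVALID)
	{
		const SkEntry *e = &t->entries[idx];

		if (e->in_use && e->table_oid == table_oid && e->dboid == dboid)
			return idx;
		idx = e->next;
	}
	return SK_INVALID;
}

/* Return the existing or newly claimed index, or SK_INVALID when full. */
static int
sk_insert(SkTable *t, int table_oid, int dboid)
{
	int			idx = sk_find(t, table_oid, dboid);
	uint32_t	b = sk_hash(table_oid, dboid) % SK_BUCKETS;

	if (idx != SK_INVALID)
		return idx;				/* already tracked */
	if (t->free_list_head == SK_INVALID)
		return SK_INVALID;		/* table full */

	/* pop from the free list, then push onto the bucket chain */
	idx = t->free_list_head;
	t->free_list_head = t->entries[idx].next;
	t->entries[idx].table_oid = table_oid;
	t->entries[idx].dboid = dboid;
	t->entries[idx].in_use = true;
	t->entries[idx].next = t->buckets[b];
	t->buckets[b] = idx;
	t->num_entries++;
	return idx;
}
```

Because the key is the (table_oid, dboid) pair, the same table OID in two
databases occupies two separate entries, which matches the database
isolation described in the documentation part of the patch.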
diff --git a/src/main/pgpool_main.c b/src/main/pgpool_main.c
index 32bcb0a1f..e41c575be 100644
--- a/src/main/pgpool_main.c
+++ b/src/main/pgpool_main.c
@@ -57,6 +57,7 @@
#include "auth/pool_passwd.h"
#include "auth/pool_hba.h"
#include "query_cache/pool_memqcache.h"
+#include "utils/pool_track_table_mutation.h"
#include "watchdog/wd_internal_commands.h"
#include "watchdog/wd_lifecheck.h"
#include "watchdog/watchdog.h"
@@ -1501,11 +1502,14 @@ sigusr1_interrupt_processor(void)
if (user1SignalSlot->signalFlags[SIG_WATCHDOG_STATE_CHANGED])
{
+ WD_STATES wd_state;
+
ereport(LOG,
(errmsg("Pgpool-II parent process received watchdog state change signal from watchdog")));
user1SignalSlot->signalFlags[SIG_WATCHDOG_STATE_CHANGED] = false;
- if (wd_internal_get_watchdog_local_node_state() == WD_STANDBY)
+ wd_state = wd_internal_get_watchdog_local_node_state();
+ if (wd_state == WD_STANDBY)
{
ereport(LOG,
(errmsg("we have joined the watchdog cluster as STANDBY node"),
@@ -1519,6 +1523,12 @@ sigusr1_interrupt_processor(void)
*/
pool_release_follow_primary_lock(true);
}
+ else if (wd_state == WD_COORDINATOR &&
+ pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ pool_track_table_mutation_trigger_global_cold_start();
+ }
}
if (user1SignalSlot->signalFlags[SIG_FAILOVER_INTERRUPT])
{
@@ -3084,6 +3094,16 @@ initialize_shared_mem_objects(bool clear_memcache_oidmaps)
elog(DEBUG1, "watchdog: %zu bytes requested for shared memory", MAXALIGN(wd_ipc_get_shared_mem_size()));
}
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ size += MAXALIGN(pool_track_table_mutation_shmem_size());
+ elog(DEBUG1,
+ "track_table_mutation: %zu bytes requested"
+ " for shared memory",
+ MAXALIGN(pool_track_table_mutation_shmem_size()));
+ }
+
initialize_shared_memory_main_segment(size);
/* Move the backend descriptors to shared memory */
@@ -3200,6 +3220,13 @@ initialize_shared_mem_objects(bool clear_memcache_oidmaps)
wd_ipc_initialize_data();
}
+ /* Initialize track table mutation for recently written tables */
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ pool_track_table_mutation_init();
+ }
+
}
/*
diff --git a/src/protocol/CommandComplete.c b/src/protocol/CommandComplete.c
index a3b8f0ea1..f445f268b 100644
--- a/src/protocol/CommandComplete.c
+++ b/src/protocol/CommandComplete.c
@@ -38,6 +38,8 @@
#include "utils/palloc.h"
#include "utils/memutils.h"
#include "utils/pool_stream.h"
+#include "utils/pool_track_table_mutation.h"
+#include "query_cache/pool_memqcache.h"
static int extract_ntuples(char *message);
static POOL_STATUS handle_mismatch_tuples(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend, char *packet, int packetlen, bool command_complete);
@@ -304,6 +306,32 @@ handle_query_context(POOL_CONNECTION_POOL *backend)
node = session_context->query_context->parse_tree;
+ /*
+ * Track table writes for dml_adaptive_global feature. For autocommit
+ * statements (not in explicit transaction), mark tables immediately. For
+ * explicit transactions, marking is deferred to COMMIT in dml_adaptive()
+ * so that ROLLBACKed writes don't pollute the shared memory table map.
+ */
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL &&
+ node != NULL &&
+ !session_context->is_in_transaction)
+ {
+ int *oids;
+ int num_oids;
+
+ num_oids = pool_extract_table_oids(node, &oids);
+ if (num_oids > 0)
+ {
+ int dboid;
+
+ dboid = pool_track_table_mutation_get_database_oid();
+ if (dboid > 0)
+ pool_track_table_mutation_mark_tables_written(
+ oids, num_oids, dboid);
+ }
+ }
+
if (IsA(node, PrepareStmt))
{
if (session_context->uncompleted_message)
diff --git a/src/protocol/child.c b/src/protocol/child.c
index 761876f53..4a527c84c 100644
--- a/src/protocol/child.c
+++ b/src/protocol/child.c
@@ -57,6 +57,7 @@
#include "utils/elog.h"
#include "utils/ps_status.h"
#include "utils/timestamp.h"
+#include "utils/pool_track_table_mutation.h"
#include "context/pool_process_context.h"
#include "context/pool_session_context.h"
@@ -213,6 +214,13 @@ do_child(int *fds)
/* Initialize per process context */
pool_init_process_context();
+ /* Initialize track table mutation child state for cold start tracking */
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ pool_track_table_mutation_child_init();
+ }
+
/* initialize connection pool */
if (pool_init_cp())
{
diff --git a/src/protocol/pool_proto_modules.c b/src/protocol/pool_proto_modules.c
index f9458bb55..74ee00d16 100644
--- a/src/protocol/pool_proto_modules.c
+++ b/src/protocol/pool_proto_modules.c
@@ -1461,7 +1461,9 @@ Parse(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend,
pool_where_to_send(query_context, query_context->original_query,
query_context->parse_tree);
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && strlen(name) != 0)
+ if (DLBOW_IS_DML_ADAPTIVE(
+ pool_config->disable_load_balance_on_write)
+ && strlen(name) != 0)
pool_setall_node_to_be_sent(query_context);
if (REPLICATION)
@@ -1804,7 +1806,7 @@ Bind(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend,
return POOL_END;
}
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE &&
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write) &&
TSTATE(backend, MAIN_REPLICA ? PRIMARY_NODE_ID : REAL_MAIN_NODE_ID) == 'T')
{
pool_where_to_send(query_context, query_context->original_query,
diff --git a/src/sample/pgpool.conf.sample-stream b/src/sample/pgpool.conf.sample-stream
index 1ac982907..ce9b92da0 100644
--- a/src/sample/pgpool.conf.sample-stream
+++ b/src/sample/pgpool.conf.sample-stream
@@ -478,6 +478,14 @@ backend_clustering_mode = streaming_replication
# modified within the current explicit transaction will
# not be load balanced until the end of the transaction.
#
+ # dml_adaptive_global:
+ # Superset of dml_adaptive. In addition to per-transaction
+ # tracking, uses shared memory to track recently written
+ # tables across all sessions. Reads from recently written
+ # tables are routed to primary until a TTL (based on
+ # replication delay) expires. Requires additional shared
+ # memory. See track_table_mutation_* parameters below.
+ #
# always:
# if a write query is issued, read queries will
# not be load balanced until the session ends.
@@ -499,6 +507,43 @@ backend_clustering_mode = streaming_replication
#statement_level_load_balance = off
# Enables statement level load balancing
+# - Track Table Mutation (used by dml_adaptive_global) -
+ # WARNING: dml_adaptive_global increases shared memory usage
+ # Default settings require ~80 KB shared memory for table tracking
+
+#track_table_mutation_ttl_factor = 5.0
+ # TTL multiplier: TTL = replication_delay * factor
+ # Higher values provide more safety margin
+ # Range: 1.0-100.0 (default: 5.0)
+ # (change requires reload)
+
+#track_table_mutation_max_staleness = 60000
+ # Maximum duration (ms) a table can be marked stale
+ # from its first write. Bounds cross-session impact:
+ # even under continuous writes, staleness expires
+ # after this period and is only renewed by new writes.
+ # 0 disables the cap. Range: 0-3600000 (default: 60000 = 60s)
+ # (change requires reload)
+
+#track_table_mutation_cold_start_duration = 2000
+ # Duration in milliseconds to route all queries to primary
+ # after child process starts (cold start period)
+ # Range: 0-60000 ms (default: 2000 ms = 2 seconds)
+ # Set to 0 to disable cold start behavior
+ # (change requires reload)
+
+#track_table_mutation_table_buckets = 1024
+ # Number of hash buckets for track table mutation
+ # Higher values reduce hash collisions
+ # Range: 64-65536 (default: 1024)
+ # (change requires restart)
+
+#track_table_mutation_table_size = 2048
+ # Maximum number of tables to track simultaneously
+ # Range: 128-131072 (default: 2048)
+ # (change requires restart)
+
+
#------------------------------------------------------------------------------
# STREAMING REPLICATION MODE
#------------------------------------------------------------------------------
diff --git a/src/streaming_replication/pool_worker_child.c b/src/streaming_replication/pool_worker_child.c
index 311b63865..cdd570396 100644
--- a/src/streaming_replication/pool_worker_child.c
+++ b/src/streaming_replication/pool_worker_child.c
@@ -58,6 +58,7 @@
#include "utils/pool_ip.h"
#include "utils/ps_status.h"
#include "utils/pool_stream.h"
+#include "utils/pool_track_table_mutation.h"
#include "context/pool_process_context.h"
#include "context/pool_session_context.h"
@@ -419,6 +420,7 @@ check_replication_time_lag(void)
BackendInfo *bkinfo;
uint64 lag;
uint64 delay_threshold_by_time;
+ uint64 max_delay_us = 0;
ErrorContextCallback callback;
int active_standby_node;
bool replication_delay_by_time;
@@ -643,6 +645,10 @@ check_replication_time_lag(void)
* seconds to micro
* seconds */
+ /* Track max delay for mutation TTL */
+ if (lag > max_delay_us)
+ max_delay_us = lag;
+
/* Log delay if necessary */
if ((pool_config->log_standby_delay == LSD_ALWAYS && lag > 0) ||
(pool_config->log_standby_delay == LSD_OVER_THRESHOLD &&
@@ -668,6 +674,13 @@ check_replication_time_lag(void)
}
}
+ /*
+ * Update track table mutation TTL from the max observed time-based
+ * replication delay.
+ */
+ if (replication_delay_by_time && max_delay_us > 0)
+ pool_track_table_mutation_update_ttl(max_delay_us);
+
error_context_stack = callback.previous;
}
@@ -695,6 +708,7 @@ check_replication_time_lag_with_cmd(void)
double delay_ms;
uint64 delay;
uint64 delay_threshold_by_time;
+ uint64 max_delay_us = 0; /* Track max delay for mutation map */
int token_count = 0;
int primary_node_id;
int save_errno;
@@ -1003,6 +1017,10 @@ check_replication_time_lag_with_cmd(void)
bkinfo->standby_delay = delay;
bkinfo->standby_delay_by_time = true;
+ /* Track maximum delay for table mutation map TTL calculation */
+ if (delay > max_delay_us)
+ max_delay_us = delay;
+
/*
* Log delay if necessary. threshold is in milliseconds, convert
* to microseconds.
@@ -1021,6 +1039,12 @@ check_replication_time_lag_with_cmd(void)
token = strtok_r(NULL, " \t\n", &saveptr);
}
+ /* Update table mutation TTL based on max observed delay */
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL &&
+ max_delay_us > 0)
+ pool_track_table_mutation_update_ttl(max_delay_us);
+
}
PG_CATCH();
{
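
For readers following the arithmetic: the sample pgpool.conf in this patch states TTL = replication_delay * factor, and the two functions changed above pass the largest observed standby delay (in microseconds) to pool_track_table_mutation_update_ttl(). A minimal shell sketch of that formula follows. The function name ttl_us is made up for illustration, and any clamping the real update function may apply is not modeled here.

```shell
# Hypothetical helper, not part of the patch.
# ttl_us MAX_DELAY_US FACTOR
# Prints the TTL in microseconds as MAX_DELAY_US * FACTOR,
# truncated to an integer. No minimum/maximum clamp is applied
# (assumption: the real code may clamp).
ttl_us() {
    awk -v d="$1" -v f="$2" 'BEGIN { printf "%d\n", d * f }'
}
```

With a maximum standby delay of 200 ms and the default factor of 5.0, `ttl_us 200000 5.0` gives a one-second window.
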
diff --git a/src/test/regression/libs.sh b/src/test/regression/libs.sh
index 7c5a0c182..1c8ae392d 100644
--- a/src/test/regression/libs.sh
+++ b/src/test/regression/libs.sh
@@ -42,6 +42,8 @@ function wait_for_failover_done {
function clean_all {
pgrep pgpool | xargs kill -9 > /dev/null 2>&1
pgrep postgres | xargs kill -9 > /dev/null 2>&1
+ # Clean up SysV IPC resources leaked by kill -9 (removes every IPC object this user may delete)
+ ipcrm --all 2>/dev/null || true
rm -f $PGSOCKET_DIR/.s.PGSQL.*
netstat -t -p 2>/dev/null|grep pgpool
}
diff --git a/src/test/regression/tests/043.track_table_mutation/test.sh b/src/test/regression/tests/043.track_table_mutation/test.sh
new file mode 100755
index 000000000..8b4dd17b8
--- /dev/null
+++ b/src/test/regression/tests/043.track_table_mutation/test.sh
@@ -0,0 +1,354 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for track table mutation feature (in-memory table tracking).
+# Tests routing of queries based on recently written tables.
+#
+source $TESTLIBS
+TESTDIR=testdir
+PSQL=$PGBIN/psql
+PSQLOPTS="-a -q -X"
+PGPOOLBIN=$PGPOOL_INSTALL_DIR/bin
+export PGDATABASE=test
+
+# Only run in streaming replication mode since that's the target use case
+for mode in s
+do
+ rm -fr $TESTDIR
+ mkdir $TESTDIR
+ cd $TESTDIR
+
+ # Create test environment with 2 nodes
+ echo -n "creating test environment..."
+ $PGPOOL_SETUP -m $mode -n 2 || exit 1
+ echo "done."
+
+ source ./bashrc.ports
+
+ # Configure track table mutation feature via dml_adaptive_global
+ echo "disable_load_balance_on_write = 'dml_adaptive_global'" >> etc/pgpool.conf
+ echo "track_table_mutation_ttl_factor = 5.0" >> etc/pgpool.conf
+ echo "track_table_mutation_cold_start_duration = 10000" >> etc/pgpool.conf
+
+ # Enable load balancing explicitly
+ echo "load_balance_mode = on" >> etc/pgpool.conf
+
+ # Configure weights so we can distinguish routing
+ # Backend 0 (primary) weight=0, Backend 1 (standby) weight=1
+ # This means load balanced queries go to node 1 by default
+ echo "backend_weight0 = 0" >> etc/pgpool.conf
+ echo "backend_weight1 = 1" >> etc/pgpool.conf
+
+ # Enable debug logging to see routing decisions
+ echo "log_min_messages = debug1" >> etc/pgpool.conf
+
+ ./startall
+
+ export PGPORT=$PGPOOL_PORT
+ export PGHOST=localhost
+
+ wait_for_pgpool_startup
+
+ # Create test tables
+ $PSQL test <<EOF
+CREATE TABLE t1(i INTEGER);
+CREATE TABLE t2(i INTEGER);
+CREATE TABLE t3(i INTEGER);
+EOF
+
+ echo "=== Test 1: Cold Start Routing ==="
+ # During cold start, all queries should go to primary
+ # Restart pgpool to trigger cold start
+ ./shutdownall
+ ./startall
+ wait_for_pgpool_startup
+
+ # Immediately query - should go to primary due to cold start
+ $PSQL test -c "SELECT 'cold_start_test' as marker, * FROM t1;" > /dev/null 2>&1
+
+ # Check log for cold start message (use -a to handle binary log files)
+ if grep -a -q "could not load balance because of track table mutation cold start" log/pgpool.log; then
+ echo "Test 1 PASSED: Cold start routing works"
+ else
+ echo "Test 1 FAILED: Cold start routing not detected"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 2: Wait for cold start to end ==="
+ # Wait for cold start period to end (10 seconds).
+ # Use generous margin to avoid flakiness under load (e.g. full regression suite).
+ sleep 12
+
+ # Clear the log
+ > log/pgpool.log
+
+ # Now a clean table query should load balance (go to node 1)
+ $PSQL test -c "SELECT 'after_cold_start' as marker, * FROM t3;" > /dev/null 2>&1
+
+ # After cold start, queries to clean tables should load balance
+ # Check that it did NOT get forced to primary due to track table mutation
+ if grep -a -q "could not load balance because of track table mutation cold start" log/pgpool.log; then
+ echo "Test 2 FAILED: Still in cold start after waiting"
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 2 PASSED: Cold start ended correctly"
+
+ echo "=== Test 3: Write-then-Read Routing ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Write to t1 and then read - use single connection to ensure same session
+ $PSQL test <<EOF
+INSERT INTO t1 VALUES (1);
+SELECT 'write_read_test' as marker, * FROM t1;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ # Check log for table staleness message
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 3 PASSED: Write-then-read routing works"
+ else
+ echo "Test 3 FAILED: Table staleness not detected after write"
+ # Show relevant log entries for debugging
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 4: Clean Table Still Load Balances ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Read from t2 (never written to) - should load balance
+ $PSQL test -c "SELECT 'clean_table_test' as marker, * FROM t2;" > /dev/null 2>&1
+
+ # Should NOT see track table mutation blocking message for t2
+ if grep -a -q "could not load balance because table.*t2.*was recently written" log/pgpool.log; then
+ echo "Test 4 FAILED: Clean table incorrectly marked as stale"
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 4 PASSED: Clean tables still load balance"
+
+ echo "=== Test 5: UPDATE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Update t2 and then read - use single connection
+ $PSQL test <<EOF
+UPDATE t2 SET i = 999 WHERE i = 0;
+SELECT 'update_test' as marker, * FROM t2;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 5 PASSED: UPDATE marks table as stale"
+ else
+ echo "Test 5 FAILED: UPDATE did not mark table as stale"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 6: DELETE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Delete from t3 and then read - use single connection
+ $PSQL test <<EOF
+DELETE FROM t3 WHERE i = 0;
+SELECT 'delete_test' as marker, * FROM t3;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 6 PASSED: DELETE marks table as stale"
+ else
+ echo "Test 6 FAILED: DELETE did not mark table as stale"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 7: TRUNCATE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create a fresh table for TRUNCATE test
+ $PSQL test -c "CREATE TABLE t_truncate(i INTEGER);" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_truncate VALUES (1), (2), (3);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log again
+ > log/pgpool.log
+
+ # Truncate and then read - use single connection
+ $PSQL test <<EOF
+TRUNCATE t_truncate;
+SELECT 'truncate_test' as marker, * FROM t_truncate;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 7 PASSED: TRUNCATE marks table as stale"
+ else
+ echo "Test 7 FAILED: TRUNCATE did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 8: WITH Clause (CTE with DELETE) Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create a fresh table for WITH test
+ $PSQL test -c "CREATE TABLE t_cte(i INTEGER);" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_cte VALUES (1), (2), (3);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log again
+ > log/pgpool.log
+
+ # Use WITH clause with DELETE, then read from the table
+ $PSQL test <<EOF
+WITH deleted AS (DELETE FROM t_cte WHERE i = 1 RETURNING *)
+SELECT * FROM deleted;
+SELECT 'cte_test' as marker, * FROM t_cte;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 8 PASSED: WITH clause (CTE) marks table as stale"
+ else
+ echo "Test 8 FAILED: WITH clause (CTE) did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ # Test 9: MERGE (PostgreSQL 15+ only)
+ PG_MAJOR_VERSION=$($PSQL -t -c "SELECT substring(version() from 'PostgreSQL ([0-9]+)');" | tr -d ' ')
+ if [ "$PG_MAJOR_VERSION" -ge 15 ] 2>/dev/null; then
+ echo "=== Test 9: MERGE Marks Table as Stale (PostgreSQL $PG_MAJOR_VERSION) ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create tables for MERGE test
+ $PSQL test -c "CREATE TABLE t_merge_target(id INTEGER PRIMARY KEY, val TEXT);" > /dev/null 2>&1
+ $PSQL test -c "CREATE TABLE t_merge_source(id INTEGER, val TEXT);" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_merge_target VALUES (1, 'old');" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_merge_source VALUES (1, 'new'), (2, 'insert');" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log again
+ > log/pgpool.log
+
+ # Use MERGE, then read from the target table
+ $PSQL test <<EOF
+MERGE INTO t_merge_target t
+USING t_merge_source s ON t.id = s.id
+WHEN MATCHED THEN UPDATE SET val = s.val
+WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.val);
+SELECT 'merge_test' as marker, * FROM t_merge_target;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 9 PASSED: MERGE marks table as stale"
+ else
+ echo "Test 9 FAILED: MERGE did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+ else
+ echo "=== Test 9: MERGE skipped (requires PostgreSQL 15+, have $PG_MAJOR_VERSION) ==="
+ fi
+
+ echo "=== Test 10: ROLLBACK Does NOT Mark Table as Stale ==="
+ # Create a fresh table for rollback test
+ $PSQL test -c "CREATE TABLE t_rollback(i INTEGER);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log
+ > log/pgpool.log
+
+ # Write inside a transaction, then rollback
+ $PSQL test <<EOF
+BEGIN;
+INSERT INTO t_rollback VALUES (1);
+ROLLBACK;
+SELECT 'rollback_test' as marker, * FROM t_rollback;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ # Should NOT see t_rollback marked as stale since the write was rolled back
+ if grep -a -q "could not load balance because table.*t_rollback.*was recently written" log/pgpool.log; then
+ echo "Test 10 FAILED: Rolled-back write incorrectly marked table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 10 PASSED: ROLLBACK does not mark table as stale"
+
+ echo "=== Test 11: COMMIT Marks Table as Stale ==="
+ # Create a fresh table for commit test
+ $PSQL test -c "CREATE TABLE t_commit(i INTEGER);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log
+ > log/pgpool.log
+
+ # Write inside a transaction, then commit, then read
+ $PSQL test <<EOF
+BEGIN;
+INSERT INTO t_commit VALUES (1);
+COMMIT;
+SELECT 'commit_test' as marker, * FROM t_commit;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 11 PASSED: COMMIT marks table as stale"
+ else
+ echo "Test 11 FAILED: Committed write did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ echo ""
+ echo "=== All Track Table Mutation Tests PASSED ==="
+
+ ./shutdownall
+
+ cd ..
+done
+
+exit 0
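
The test above repeats one pattern many times: run a query, grep the pgpool log, then print PASSED or FAILED. A possible way to factor that out is sketched below. The helper names expect_in_log and expect_not_in_log are hypothetical; they are not part of libs.sh or of this patch.

```shell
# Hypothetical helpers, not part of the patch or of libs.sh.

# expect_in_log LABEL PATTERN LOGFILE
# Returns 0 and prints "LABEL PASSED" if PATTERN (a grep basic
# regular expression) occurs in LOGFILE; otherwise returns 1.
# -a makes grep treat a log containing binary bytes as text.
expect_in_log() {
    local label="$1"
    local pattern="$2"
    local logfile="$3"
    if grep -a -q "$pattern" "$logfile"; then
        echo "$label PASSED"
        return 0
    fi
    echo "$label FAILED: pattern not found: $pattern"
    return 1
}

# expect_not_in_log LABEL PATTERN LOGFILE
# Inverse check: returns 1 if PATTERN occurs in LOGFILE.
expect_not_in_log() {
    local label="$1"
    local pattern="$2"
    local logfile="$3"
    if grep -a -q "$pattern" "$logfile"; then
        echo "$label FAILED: unexpected match: $pattern"
        return 1
    fi
    echo "$label PASSED"
    return 0
}
```

A call site could then read `expect_in_log "Test 3" "could not load balance because table.*was recently written" log/pgpool.log || { ./shutdownall; exit 1; }`, which keeps the shutdown-and-exit handling at the caller.
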
diff --git a/src/test/regression/tests/044.track_table_mutation_watchdog/test.sh b/src/test/regression/tests/044.track_table_mutation_watchdog/test.sh
new file mode 100755
index 000000000..c50c213d6
--- /dev/null
+++ b/src/test/regression/tests/044.track_table_mutation_watchdog/test.sh
@@ -0,0 +1,184 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# Test script for track table mutation global cold start
+# on watchdog leader change.
+#
+# Uses $WATCHDOG_SETUP to create a 2-node watchdog cluster,
+# then verifies that when the leader is stopped the new
+# leader triggers a global cold start.
+#-------------------------------------------------------------------
+source $TESTLIBS
+TESTDIR=testdir
+PSQL=$PGBIN/psql
+success_count=0
+
+dir=`pwd`
+rm -fr $TESTDIR
+mkdir $TESTDIR
+cd $TESTDIR
+
+# Create 2-node watchdog cluster
+$WATCHDOG_SETUP -wn 2 || exit 1
+
+# Ensure per-node scripts are executable
+# (sed -i in watchdog_setup can strip permissions)
+chmod 755 pgpool*/startall pgpool*/shutdownall
+
+# Append track_table_mutation config to both nodes
+for i in 0 1
+do
+ cat >> pgpool${i}/etc/pgpool.conf <<EOF
+disable_load_balance_on_write = 'dml_adaptive_global'
+track_table_mutation_cold_start_duration = 2000
+enable_consensus_with_half_votes = on
+log_min_messages = debug1
+EOF
+done
+
+./startall
+export PCPPASSFILE=$dir/$TESTDIR/pgpool0/pcppass
+
+# Wait for watchdog lifecheck on node 0
+echo -n "waiting for watchdog node 0 starting up..."
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep "lifecheck started" \
+ pgpool0/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ break
+ fi
+ sleep 2
+done
+echo "done."
+
+# Test 1: Verify leader came up
+echo "=== Test 1: Waiting for the pgpool leader... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep "I am the cluster leader node" \
+ pgpool0/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 1 PASSED: Leader brought up."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 1 ]; then
+ echo "Test 1 FAILED: Leader did not start"
+ ./shutdownall
+ exit 1
+fi
+
+# Test 2: Verify standby joined cluster
+echo "=== Test 2: Waiting for standby to join... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep "successfully joined the watchdog cluster" \
+ pgpool1/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 2 PASSED: Standby joined."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 2 ]; then
+ echo "Test 2 FAILED: Standby did not join"
+ ./shutdownall
+ exit 1
+fi
+
+# Test 3: Verify track_table_mutation initialized
+echo "=== Test 3: Verify feature initialized ==="
+if grep -a "track_table_mutation: initialized" \
+ pgpool0/log/pgpool.log > /dev/null 2>&1; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 3 PASSED: Feature initialized."
+else
+ echo "Test 3 FAILED: Feature not initialized"
+ ./shutdownall
+ exit 1
+fi
+
+# Test 4: Stop leader (pgpool0) to trigger failover
+echo "=== Test 4: Stopping leader... ==="
+cd pgpool0
+source ./bashrc.ports
+$PGPOOL_INSTALL_DIR/bin/pgpool \
+ -f etc/pgpool.conf -m f stop
+cd ..
+
+echo "Checking standby detected shutdown..."
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep -a "is shutting down" \
+ pgpool1/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 4 PASSED: Shutdown detected."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 4 ]; then
+ echo "Test 4 FAILED: Shutdown not detected"
+ ./shutdownall
+ exit 1
+fi
+
+# Test 5: Verify standby became new leader
+echo "=== Test 5: Checking standby takes over... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep -a "I am the cluster leader node" \
+ pgpool1/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 5 PASSED: Standby became leader."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 5 ]; then
+ echo "Test 5 FAILED: Standby did not become leader"
+ ./shutdownall
+ exit 1
+fi
+
+# Test 6: Verify global cold start was triggered
+echo "=== Test 6: Checking global cold start... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep -a "track_table_mutation: global cold start" \
+ pgpool1/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 6 PASSED: Global cold start triggered."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+# Cleanup
+./shutdownall
+
+echo ""
+echo "$success_count out of 6 successful"
+
+if test $success_count -eq 6
+then
+ echo "=== All Watchdog Tests PASSED ==="
+ exit 0
+fi
+
+exit 1
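
The watchdog test polls a log file with the same ten-iteration loop in six places. A small polling helper along the following lines could replace those loops. This is only a sketch: wait_for_log is a hypothetical name, and the defaults (10 tries, 2 seconds apart) are copied from the loops in the test.

```shell
# Hypothetical helper, not part of the patch or of libs.sh.

# wait_for_log PATTERN LOGFILE [TRIES] [INTERVAL]
# Polls LOGFILE until PATTERN appears. Returns 0 on the first
# match, or 1 after TRIES unsuccessful checks. Sleeps INTERVAL
# seconds after each unsuccessful check.
wait_for_log() {
    local pattern="$1"
    local logfile="$2"
    local tries="${3:-10}"
    local interval="${4:-2}"
    local i=1
    while [ "$i" -le "$tries" ]; do
        # A missing log file counts as "not found yet".
        if grep -a -q "$pattern" "$logfile" 2>/dev/null; then
            return 0
        fi
        echo "[check] $i times"
        sleep "$interval"
        i=$((i + 1))
    done
    return 1
}
```

For example, the leader check could become `wait_for_log "I am the cluster leader node" pgpool1/log/pgpool.log || { ./shutdownall; exit 1; }`, which also makes a failure in the last step stop the script immediately instead of relying on the final success count.
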
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 939200965..467ec114c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -519,6 +519,10 @@ TableLikeClause
TableSampleClause
TargetEntry
TokenizedLine
+TrackTableMutationEntry
+TrackTableMutationHashTable
+TrackTableMutationShmem
+TrackTableMutationState
TransactionId
TransactionStmt
TransactionStmtKind
diff --git a/src/utils/pool_track_table_mutation.c b/src/utils/pool_track_table_mutation.c
new file mode 100644
index 000000000..e7771e7bf
--- /dev/null
+++ b/src/utils/pool_track_table_mutation.c
@@ -0,0 +1,902 @@
+/* -*-pgsql-c-*- */
+/*
+ * pgpool: a language independent connection pool server for PostgreSQL
+ * written by Tatsuo Ishii
+ *
+ * Copyright (c) 2003-2026 PgPool Global Development Group
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose and without fee is hereby
+ * granted, provided that the above copyright notice appear in all
+ * copies and that both that copyright notice and this permission
+ * notice appear in supporting documentation, and that the name of the
+ * author not be used in advertising or publicity pertaining to
+ * distribution of the software without specific, written prior
+ * permission. The author makes no representations about the
+ * suitability of this software for any purpose. It is provided "as
+ * is" without express or implied warranty.
+ *
+ * pool_track_table_mutation.c: In-memory tracking of recently
+ * written tables to prevent stale reads from replicas.
+ *
+ * Based on the "lagless" architecture from Tailor Brands.
+ */
+
+#include <string.h>
+#include <stdlib.h>
+#include <sys/time.h>
+#include <unistd.h>
+
+#include "pool.h"
+#include "pool_config.h"
+#include "context/pool_session_context.h"
+#include "utils/pool_track_table_mutation.h"
+#include "utils/elog.h"
+#include "utils/pool_ipc.h"
+#include "utils/palloc.h"
+#include "utils/pool_relcache.h"
+
+#define DATABASE_TO_OID_QUERY \
+ "SELECT oid FROM pg_catalog.pg_database" \
+ " WHERE datname = '%s'"
+
+/*
+ * Helper macro: true when the feature is not active.
+ */
+#define TRACK_TABLE_MUTATION_DISABLED() \
+ (pool_config->disable_load_balance_on_write != \
+ DLBOW_DML_ADAPTIVE_GLOBAL || \
+ track_table_mutation_shmem == NULL)
+
+/* ----------------
+ * Local variables
+ * ----------------
+ */
+
+/* Pointer to shared memory structure */
+static TrackTableMutationShmem *track_table_mutation_shmem = NULL;
+
+/* Per-process cold start tracking (not in shared memory) */
+static struct timeval process_start_time;
+static bool cold_start_initialized = false;
+
+/* ----------------
+ * Helper macros for flexible arrays in shared memory
+ * ----------------
+ */
+
+/* Get pointer to bucket array in table map */
+#define TABLE_MAP_BUCKETS(map) \
+ ((int *)((char *)(map) + \
+ sizeof(TrackTableMutationHashTable)))
+
+/* Get pointer to entry array in table map */
+#define TABLE_MAP_ENTRIES(map) \
+ ((TrackTableMutationEntry *)((char *)(map) + \
+ sizeof(TrackTableMutationHashTable) + \
+ (map)->num_buckets * sizeof(int)))
+
+/* ----------------
+ * Semaphore lock helpers
+ * ----------------
+ */
+
+static inline void
+table_map_lock(void)
+{
+ pool_semaphore_lock(TRACK_TABLE_MUTATION_TABLE_SEM);
+}
+
+static inline void
+table_map_unlock(void)
+{
+ pool_semaphore_unlock(TRACK_TABLE_MUTATION_TABLE_SEM);
+}
+
+/* ----------------
+ * Hash functions
+ * ----------------
+ */
+
+/*
+ * FNV-1a hash for table/database oid pair
+ */
+static uint32
+fnv1a_hash_table_key(int table_oid, int dboid)
+{
+ uint32 hash = 2166136261u; /* FNV offset basis */
+ uint32 data[2];
+ const unsigned char *bytes;
+ size_t i;
+
+ data[0] = (uint32) table_oid;
+ data[1] = (uint32) dboid;
+ bytes = (const unsigned char *) data;
+
+ for (i = 0; i < sizeof(data); i++)
+ {
+ hash ^= bytes[i];
+ hash *= 16777619u; /* FNV prime */
+ }
+
+ return hash;
+}
+
+/* ----------------
+ * Time utilities
+ * ----------------
+ */
+
+/*
+ * Get elapsed time in microseconds between two timevals
+ */
+static int64
+elapsed_us(struct timeval *start, struct timeval *end)
+{
+ return ((int64) (end->tv_sec - start->tv_sec) * 1000000)
+ + (end->tv_usec - start->tv_usec);
+}
+
+/*
+ * Get current time
+ */
+static void
+get_current_time(struct timeval *tv)
+{
+ gettimeofday(tv, NULL);
+}
+
+/* ----------------
+ * Database oid lookup
+ * ----------------
+ */
+
+static int
+track_table_mutation_get_database_oid_internal(void)
+{
+ int oid = 0;
+ static POOL_RELCACHE *relcache;
+ POOL_CONNECTION_POOL *backend;
+ POOL_SESSION_CONTEXT *session_context;
+
+ /* Safety check: must have shmem initialized */
+ if (track_table_mutation_shmem == NULL)
+ return oid;
+
+ session_context = pool_get_session_context(false);
+ if (session_context == NULL)
+ return oid;
+
+ backend = session_context->backend;
+ if (backend == NULL ||
+ MAIN_CONNECTION(backend) == NULL ||
+ MAIN_CONNECTION(backend)->sp == NULL)
+ return oid;
+
+ /* Ensure database name is valid */
+ if (MAIN_CONNECTION(backend)->sp->database == NULL)
+ return oid;
+
+ if (!relcache)
+ {
+ relcache = pool_create_relcache(
+ pool_config->relcache_size,
+ DATABASE_TO_OID_QUERY,
+ int_register_func,
+ int_unregister_func,
+ false);
+ if (relcache == NULL)
+ {
+ ereport(LOG,
+ (errmsg("track_table_mutation: "
+ "error creating relcache")));
+ return oid;
+ }
+ }
+
+ oid = (int) (intptr_t) pool_search_relcache(
+ relcache, backend,
+ MAIN_CONNECTION(backend)->sp->database);
+ return oid;
+}
+
+int
+pool_track_table_mutation_get_database_oid(void)
+{
+ return track_table_mutation_get_database_oid_internal();
+}
+
+/* ----------------
+ * Table mutation hash table operations
+ * ----------------
+ */
+
+/*
+ * Initialize table mutation hash table
+ */
+static void
+table_map_init(TrackTableMutationHashTable *map,
+ int num_buckets, int max_entries)
+{
+ int *buckets;
+ TrackTableMutationEntry *entries;
+ int i;
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ map->num_buckets = num_buckets;
+ map->max_entries = max_entries;
+ map->num_entries = 0;
+ map->free_list_head = 0;
+
+ buckets = TABLE_MAP_BUCKETS(map);
+ entries = TABLE_MAP_ENTRIES(map);
+
+ /* Initialize all buckets to empty */
+ for (i = 0; i < num_buckets; i++)
+ buckets[i] = invalid;
+
+ /* Initialize free list - chain all entries */
+ for (i = 0; i < max_entries; i++)
+ {
+ entries[i].in_use = false;
+ entries[i].next = (i < max_entries - 1) ?
+ i + 1 : invalid;
+ }
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "table map init %d buckets, "
+ "%d max entries",
+ num_buckets, max_entries)));
+}
+
+/*
+ * Allocate an entry from the free list
+ */
+static int
+table_map_alloc_entry(TrackTableMutationHashTable *map)
+{
+ TrackTableMutationEntry *entries;
+ int idx;
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ entries = TABLE_MAP_ENTRIES(map);
+
+ if (map->free_list_head == invalid)
+ return invalid;
+
+ idx = map->free_list_head;
+ map->free_list_head = entries[idx].next;
+ entries[idx].in_use = true;
+ entries[idx].next = invalid;
+ map->num_entries++;
+
+ return idx;
+}
+
+/*
+ * Free an entry back to the free list
+ */
+static void
+table_map_free_entry(TrackTableMutationHashTable *map,
+ int idx)
+{
+ TrackTableMutationEntry *entries;
+
+ entries = TABLE_MAP_ENTRIES(map);
+
+ entries[idx].in_use = false;
+ entries[idx].next = map->free_list_head;
+ map->free_list_head = idx;
+ map->num_entries--;
+}
+
+/*
+ * Look up a table in the hash table.
+ * Returns entry index or INVALID_INDEX if not found.
+ * Must be called with lock held.
+ */
+static int
+table_map_lookup(TrackTableMutationHashTable *map,
+ int table_oid, int dboid,
+ uint32 hash)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TrackTableMutationEntry *entries;
+ int bucket = hash % map->num_buckets;
+ int idx = buckets[bucket];
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ entries = TABLE_MAP_ENTRIES(map);
+
+ while (idx != invalid)
+ {
+ if (entries[idx].hash == hash &&
+ entries[idx].table_oid == table_oid &&
+ entries[idx].dboid == dboid)
+ {
+ return idx;
+ }
+ idx = entries[idx].next;
+ }
+
+ return invalid;
+}
+
+/*
+ * Insert or update a table entry.
+ * Must be called with lock held.
+ */
+static void
+table_map_insert(TrackTableMutationHashTable *map,
+ int table_oid, int dboid,
+ uint32 hash,
+ struct timeval *write_time)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TrackTableMutationEntry *entries;
+ int bucket = hash % map->num_buckets;
+ int idx;
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ entries = TABLE_MAP_ENTRIES(map);
+
+ /* Check if entry already exists */
+ idx = table_map_lookup(map, table_oid, dboid, hash);
+ if (idx != invalid)
+ {
+ /* Update last write time; keep first_write_time */
+ entries[idx].last_write_time = *write_time;
+ return;
+ }
+
+ /* Allocate new entry */
+ idx = table_map_alloc_entry(map);
+ if (idx == invalid)
+ {
+ int b;
+
+ /* Table is full - evict the head entry of the first non-empty bucket */
+ for (b = 0; b < map->num_buckets; b++)
+ {
+ if (buckets[b] != invalid)
+ {
+ int victim = buckets[b];
+
+ buckets[b] = entries[victim].next;
+ table_map_free_entry(map, victim);
+ idx = table_map_alloc_entry(map);
+ break;
+ }
+ }
+
+ if (idx == invalid)
+ {
+ ereport(WARNING,
+ (errmsg("track_table_mutation: "
+ "failed to allocate entry "
+ "for oid %d (dboid %d)",
+ table_oid, dboid)));
+ return;
+ }
+ }
+
+ /* Initialize new entry */
+ entries[idx].table_oid = table_oid;
+ entries[idx].dboid = dboid;
+ entries[idx].hash = hash;
+ entries[idx].first_write_time = *write_time;
+ entries[idx].last_write_time = *write_time;
+
+ /* Insert at head of bucket chain */
+ entries[idx].next = buckets[bucket];
+ buckets[bucket] = idx;
+
+ ereport(DEBUG2,
+ (errmsg("track_table_mutation: "
+ "marked oid %d (dboid %d) written",
+ table_oid, dboid)));
+}
+
+/*
+ * Remove expired entries from the table map.
+ * Must be called with lock held.
+ */
+static void
+table_map_cleanup_expired(
+ TrackTableMutationHashTable *map, uint64 ttl_us)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TrackTableMutationEntry *entries;
+ struct timeval now;
+ int64 max_stale_us;
+ int removed = 0;
+ int b;
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ entries = TABLE_MAP_ENTRIES(map);
+ get_current_time(&now);
+
+ max_stale_us = (int64) pool_config
+ ->track_table_mutation_max_staleness * 1000LL;
+
+ for (b = 0; b < map->num_buckets; b++)
+ {
+ int *prev_ptr = &buckets[b];
+ int idx = buckets[b];
+
+ while (idx != invalid)
+ {
+ int64 age;
+ int64 total_age;
+ bool expired;
+
+ age = elapsed_us(
+ &entries[idx].last_write_time, &now);
+ expired = (age > (int64) ttl_us);
+
+ /*
+ * Also evict entries that exceed max_staleness from first write.
+ */
+ if (!expired && max_stale_us > 0)
+ {
+ total_age = elapsed_us(
+ &entries[idx].first_write_time,
+ &now);
+ expired = (total_age >= max_stale_us);
+ }
+
+ if (expired)
+ {
+ /* Entry has expired - remove it */
+ int next = entries[idx].next;
+
+ *prev_ptr = next;
+ table_map_free_entry(map, idx);
+ idx = next;
+ removed++;
+ }
+ else
+ {
+ prev_ptr = &entries[idx].next;
+ idx = entries[idx].next;
+ }
+ }
+ }
+
+ if (removed > 0)
+ {
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "cleaned up %d expired entries",
+ removed)));
+ }
+}
+
+
+/* ----------------
+ * Public API implementation
+ * ----------------
+ */
+
+/*
+ * Calculate the total shared memory size required
+ * for the track table mutation feature.
+ */
+Size
+pool_track_table_mutation_shmem_size(void)
+{
+ Size size = 0;
+ int tbl_bkt;
+ int tbl_sz;
+
+ tbl_bkt = pool_config->track_table_mutation_table_buckets;
+ tbl_sz = pool_config->track_table_mutation_table_size;
+
+ /* Main structure */
+ size += sizeof(TrackTableMutationShmem);
+
+ /* Table mutation hash table */
+ size += sizeof(TrackTableMutationHashTable);
+ size += tbl_bkt * sizeof(int);
+ size += tbl_sz * sizeof(TrackTableMutationEntry);
+
+ return size;
+}
+
+/*
+ * Initialize shared memory structures for the
+ * track table mutation feature. Allocates and sets
+ * up the table map in shared memory.
+ * Called once from pgpool main process at startup.
+ */
+void
+pool_track_table_mutation_init(void)
+{
+#ifndef POOL_PRIVATE
+ Size shmem_size;
+ char *shmem_ptr;
+ TrackTableMutationState *st;
+ int tbl_bkt;
+ int tbl_sz;
+
+ if (pool_config->disable_load_balance_on_write !=
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "feature disabled")));
+ return;
+ }
+
+ tbl_bkt = pool_config->track_table_mutation_table_buckets;
+ tbl_sz = pool_config->track_table_mutation_table_size;
+
+ shmem_size = pool_track_table_mutation_shmem_size();
+
+ /*
+ * Allocate from the main shared memory segment. Memory is zeroed by
+ * initialize_shared_memory_main_segment().
+ */
+ shmem_ptr = pool_shared_memory_segment_get_chunk(
+ shmem_size);
+ if (shmem_ptr == NULL)
+ {
+ ereport(ERROR,
+ (errmsg("track_table_mutation: "
+ "failed to allocate %zu bytes",
+ shmem_size)));
+ return;
+ }
+
+ /* Set up pointers within shared memory */
+ track_table_mutation_shmem =
+ (TrackTableMutationShmem *) shmem_ptr;
+ shmem_ptr += sizeof(TrackTableMutationShmem);
+
+ track_table_mutation_shmem->table_map =
+ (TrackTableMutationHashTable *) shmem_ptr;
+
+ /* Initialize table map */
+ table_map_init(
+ track_table_mutation_shmem->table_map,
+ tbl_bkt, tbl_sz);
+
+ /* Initialize global state */
+ st = &track_table_mutation_shmem->state;
+ st->initialized = true;
+ st->current_ttl_us = TRACK_TABLE_MUTATION_DEFAULT_TTL_US;
+ get_current_time(&st->ttl_last_updated);
+ get_current_time(&st->last_cleanup_time);
+ st->global_cold_start_until.tv_sec = 0;
+ st->global_cold_start_until.tv_usec = 0;
+ st->stats_queries_checked = 0;
+ st->stats_forced_primary = 0;
+ st->stats_allowed_replica = 0;
+
+ ereport(LOG,
+ (errmsg("track_table_mutation: "
+ "initialized with %zu bytes shmem",
+ shmem_size)));
+#endif
+}
+
+/*
+ * Initialize per-child process state.
+ * Records the process start time for cold start
+ * period tracking. Called when a child process starts.
+ */
+void
+pool_track_table_mutation_child_init(void)
+{
+ int dur;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return;
+
+ get_current_time(&process_start_time);
+ cold_start_initialized = true;
+ dur = pool_config->track_table_mutation_cold_start_duration;
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "child init, cold start %d ms",
+ dur)));
+}
+
+/*
+ * Check if the process is in cold start period.
+ * During cold start, all queries are routed to
+ * primary to avoid stale reads. Checks both
+ * per-process and global (watchdog) cold start.
+ */
+bool
+pool_track_table_mutation_in_cold_start(void)
+{
+ struct timeval now;
+ int64 elapsed_ms;
+ int dur;
+ TrackTableMutationState *st;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return false;
+
+ dur = pool_config->track_table_mutation_cold_start_duration;
+ if (dur <= 0)
+ return false;
+
+ get_current_time(&now);
+ st = &track_table_mutation_shmem->state;
+
+ /* Check watchdog-triggered global cold start */
+ if (st->global_cold_start_until.tv_sec != 0 &&
+ elapsed_us(&now,
+ &st->global_cold_start_until) > 0)
+ {
+ return true;
+ }
+
+ /* Check per-process cold start */
+ if (!cold_start_initialized)
+ return false;
+
+ elapsed_ms = elapsed_us(&process_start_time, &now) / 1000;
+
+ if (elapsed_ms < dur)
+ {
+ ereport(DEBUG2,
+ (errmsg("track_table_mutation: "
+ "cold start (%ld/%d ms)",
+ (long) elapsed_ms, dur)));
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Trigger a global cold start for all processes.
+ * Sets the cold start end time in shared memory.
+ * Called after watchdog leader change to force all
+ * queries to primary during the transition.
+ */
+void
+pool_track_table_mutation_trigger_global_cold_start(void)
+{
+ struct timeval now;
+ struct timeval *until;
+ int dur;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return;
+
+ dur = pool_config->track_table_mutation_cold_start_duration;
+ if (dur <= 0)
+ return;
+
+ get_current_time(&now);
+ until = &track_table_mutation_shmem->state
+ .global_cold_start_until;
+ *until = now;
+ until->tv_sec += dur / 1000;
+ until->tv_usec += (dur % 1000) * 1000;
+ if (until->tv_usec >= 1000000)
+ {
+ until->tv_sec += until->tv_usec / 1000000;
+ until->tv_usec %= 1000000;
+ }
+
+ ereport(LOG,
+ (errmsg("track_table_mutation: "
+ "global cold start for %d ms",
+ dur)));
+}
+
+/*
+ * Check if a table was recently written (is "stale").
+ * Returns true if reads should go to primary because
+ * the table was written within the current TTL window.
+ */
+bool
+pool_track_table_mutation_table_is_stale(
+ int table_oid, int dboid)
+{
+ TrackTableMutationHashTable *map;
+ struct timeval now;
+ uint64 ttl_us;
+ uint32 hash;
+ int idx;
+ bool is_stale = false;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return false;
+
+ if (table_oid <= 0 || dboid <= 0)
+ {
+ is_stale = true;
+ goto update_stats;
+ }
+
+ map = track_table_mutation_shmem->table_map;
+ hash = fnv1a_hash_table_key(table_oid, dboid);
+
+ table_map_lock();
+
+ idx = table_map_lookup(map, table_oid, dboid, hash);
+ if (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ TrackTableMutationEntry *entries;
+ int64 age;
+ int64 total_age;
+ int64 max_stale_us;
+
+ entries = TABLE_MAP_ENTRIES(map);
+ get_current_time(&now);
+ ttl_us = track_table_mutation_shmem->state
+ .current_ttl_us;
+
+ age = elapsed_us(
+ &entries[idx].last_write_time, &now);
+ is_stale = (age < (int64) ttl_us);
+
+ /*
+ * Enforce max_staleness hard cap: no entry can force primary routing
+ * longer than max_staleness from its first write.
+ */
+ if (is_stale)
+ {
+ max_stale_us = (int64) pool_config
+ ->track_table_mutation_max_staleness
+ * 1000LL;
+ if (max_stale_us > 0)
+ {
+ total_age = elapsed_us(
+ &entries[idx].first_write_time,
+ &now);
+ if (total_age >= max_stale_us)
+ is_stale = false;
+ }
+ }
+
+ ereport(DEBUG2,
+ (errmsg("track_table_mutation: "
+ "oid %d dboid %d "
+ "elapsed=%ld ttl=%lu stale=%d",
+ table_oid, dboid,
+ (long) age,
+ (unsigned long) ttl_us,
+ is_stale)));
+ }
+
+ table_map_unlock();
+
+update_stats:
+ /* Update statistics using semaphore */
+ if (track_table_mutation_shmem != NULL)
+ {
+ TrackTableMutationState *st;
+
+ table_map_lock();
+ st = &track_table_mutation_shmem->state;
+ st->stats_queries_checked++;
+ if (is_stale)
+ st->stats_forced_primary++;
+ else
+ st->stats_allowed_replica++;
+ table_map_unlock();
+ }
+
+ return is_stale;
+}
+
+/*
+ * Mark multiple tables as recently written.
+ * Called after DML queries complete to record
+ * which tables were modified.
+ */
+void
+pool_track_table_mutation_mark_tables_written(
+ const int *table_oids, int num_tables, int dboid)
+{
+ TrackTableMutationHashTable *map;
+ TrackTableMutationState *st;
+ struct timeval now;
+ int i;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return;
+
+ if (num_tables <= 0 || table_oids == NULL ||
+ dboid <= 0)
+ return;
+
+ map = track_table_mutation_shmem->table_map;
+ st = &track_table_mutation_shmem->state;
+ get_current_time(&now);
+
+ table_map_lock();
+
+ /* Periodically clean up expired entries */
+ if (map->num_entries > map->max_entries * 3 / 4)
+ {
+ int64 since_cleanup;
+
+ since_cleanup = elapsed_us(
+ &st->last_cleanup_time, &now);
+ /* 100ms interval */
+ if (since_cleanup > 100000)
+ {
+ table_map_cleanup_expired(
+ map, st->current_ttl_us);
+ st->last_cleanup_time = now;
+ }
+ }
+
+ for (i = 0; i < num_tables; i++)
+ {
+ uint32 hash;
+ int table_oid = table_oids[i];
+
+ if (table_oid > 0)
+ {
+ hash = fnv1a_hash_table_key(
+ table_oid, dboid);
+ table_map_insert(map, table_oid,
+ dboid, hash, &now);
+ }
+ }
+
+ table_map_unlock();
+}
+
+/*
+ * Mark a single table as recently written.
+ */
+void
+pool_track_table_mutation_mark_table_written(
+ int table_oid, int dboid)
+{
+ if (table_oid > 0 && dboid > 0)
+ {
+ const int tables[1] = {table_oid};
+
+ pool_track_table_mutation_mark_tables_written(
+ tables, 1, dboid);
+ }
+}
+
+/*
+ * Update the staleness TTL based on observed
+ * replication delay. New TTL = delay * factor,
+ * clamped to [default_ttl, 1 hour].
+ */
+void
+pool_track_table_mutation_update_ttl(uint64 delay_us)
+{
+ uint64 new_ttl;
+ double factor;
+ TrackTableMutationState *st;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return;
+
+ factor = pool_config->track_table_mutation_ttl_factor;
+ new_ttl = (uint64) (delay_us * factor);
+ if (new_ttl < TRACK_TABLE_MUTATION_DEFAULT_TTL_US)
+ new_ttl = TRACK_TABLE_MUTATION_DEFAULT_TTL_US;
+
+ /* Maximum TTL of 1 hour */
+ if (new_ttl > 3600ULL * 1000000ULL)
+ new_ttl = 3600ULL * 1000000ULL;
+
+ st = &track_table_mutation_shmem->state;
+ st->current_ttl_us = new_ttl;
+ get_current_time(&st->ttl_last_updated);
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "TTL=%lu us (delay=%lu factor=%.1f)",
+ (unsigned long) new_ttl,
+ (unsigned long) delay_us,
+ factor)));
+}
--
2.53.0
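For readers following the TTL logic in the patch above, here is a minimal standalone sketch of the clamp done in pool_track_table_mutation_update_ttl() and of the staleness test with its max_staleness cap. It is a model, not the patch code: compute_ttl_us() and is_stale() are illustrative names, and DEFAULT_TTL_US is assumed to be 100 ms, based on the "100ms default minimum" wording used elsewhere in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value: the thread describes the default minimum TTL as 100 ms. */
#define DEFAULT_TTL_US 100000ULL
/* 1 hour upper bound, matching the cap in the patch. */
#define MAX_TTL_US (3600ULL * 1000000ULL)

/* TTL = delay * factor, clamped to [DEFAULT_TTL_US, MAX_TTL_US]. */
uint64_t
compute_ttl_us(uint64_t delay_us, double factor)
{
	uint64_t	ttl = (uint64_t) ((double) delay_us * factor);

	if (ttl < DEFAULT_TTL_US)
		ttl = DEFAULT_TTL_US;
	if (ttl > MAX_TTL_US)
		ttl = MAX_TTL_US;
	return ttl;
}

/*
 * A table counts as stale (reads go to primary) while its last write is
 * younger than the TTL. If max_stale_us > 0, an entry whose first write is
 * at least that old no longer forces primary routing.
 */
bool
is_stale(int64_t since_last_write_us, int64_t since_first_write_us,
		 uint64_t ttl_us, int64_t max_stale_us)
{
	if (since_last_write_us >= (int64_t) ttl_us)
		return false;
	if (max_stale_us > 0 && since_first_write_us >= max_stale_us)
		return false;
	return true;
}
```

With a 50 ms delay and a factor of 5.0 this gives a 250 ms TTL, while a 10 ms delay is raised to the assumed 100 ms floor.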
* Re: Proposal: Recent mutated table tracking in memory
2026-01-06 11:25 Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-14 08:55 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-28 05:08 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-29 08:28 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-29 08:54 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-30 08:09 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-31 17:11 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-03 07:43 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-03 23:23 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-06 11:29 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-10 15:16 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-11 10:28 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-12 09:05 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-18 23:51 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-19 04:40 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-19 11:05 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-26 07:47 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-26 15:26 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-03-09 05:18 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-03-09 09:22 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-03-23 05:13 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-03-23 13:07 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-04-07 00:08 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-04-07 05:45 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-04-07 09:10 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-04-07 09:43 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-04-09 07:21 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-04-14 22:43 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-04-15 12:17 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-04-19 07:24 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-04-19 14:29 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
@ 2026-04-23 08:14 ` Tatsuo Ishii <[email protected]>
2026-04-23 14:16 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
0 siblings, 1 reply; 44+ messages in thread
From: Tatsuo Ishii @ 2026-04-23 08:14 UTC (permalink / raw)
To: [email protected]; +Cc: [email protected]
Hi Nadav,
Unfortunately the mutated 006.memqcache failed (timeout).
>> > memqcache bug fix
>> > -----------------
>> >
>> > Good catch. The root cause: pool_set_writing_transaction() was
>> > explicitly skipping dml_adaptive_global, so
>> > pool_is_writing_transaction() always returned false in this mode.
>> > The query cache fetch guard at pool_proto_modules.c:270
>> > (!pool_is_writing_transaction()) then served stale cached results
>> > after DML in the same transaction.
>> >
>> > Fix: pool_set_writing_transaction() now sets the flag for
>> > dml_adaptive_global (only 'off' and 'dml_adaptive' skip it). This
>> > ensures the query cache is properly bypassed after writes within
>> > the same transaction.
Regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
> Hi Tatsuo,
>
> Rebased onto current master, renumbered the regression tests
> (043/044 to avoid collision with the new 042.ssl_reload), and
> combined everything into a single commit.
>
> Attached: v2-0001-Feature-load-balancing-control-by-table-tracking.patch
>
> Looking forward to your review.
>
>
> On Sun, Apr 19, 2026 at 10:25 AM Tatsuo Ishii <[email protected]> wrote:
>
>> > Hi Tatsuo,
>> >
>> > Thank you for the detailed review. Attached patch addresses all items.
>>
>> I guess the attached patch is on top of
>> v1-0001-Feature-load-balancing-control-by-table-tracking.patch. To
>> apply v2-0001-address-review.patch, we need to apply
>> v1-0001-Feature-load-balancing-control-by-table-tracking.patch first.
>> Unfortunately due to recent commit, it does not apply anymore. Can you
>> please provide v1 + v2 that are rebased against latest master branch?
>> Also 042 regression test is already used by recent commit. Can you
>> renumber 042.track_table_mutation and
>> 043.track_table_mutation_watchdog to 043.track_table_mutation and
>> 044.track_table_mutation_watchdog accordingly?
>>
>> Looking forward to seeing new patch.
>>
>> Regards,
>> --
>> Tatsuo Ishii
>> SRA OSS K.K.
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese:http://www.sraoss.co.jp
>>
>>
>> > memqcache bug fix
>> > -----------------
>> >
>> > Good catch. The root cause: pool_set_writing_transaction() was
>> > explicitly skipping dml_adaptive_global, so
>> > pool_is_writing_transaction() always returned false in this mode.
>> > The query cache fetch guard at pool_proto_modules.c:270
>> > (!pool_is_writing_transaction()) then served stale cached results
>> > after DML in the same transaction.
>> >
>> > Fix: pool_set_writing_transaction() now sets the flag for
>> > dml_adaptive_global (only 'off' and 'dml_adaptive' skip it). This
>> > ensures the query cache is properly bypassed after writes within
>> > the same transaction.
>> >
>> > Removed dead query parse cache code (~700 lines)
>> > -------------------------------------------------
>> >
>> > You're right -- pool_track_table_mutation_get_cached_parse,
>> > pool_track_table_mutation_cache_parse, and
>> > pool_track_table_mutation_normalize_and_hash were never called.
>> > These were leftover from an earlier design where we planned to
>> > cache SQL parse results in shared memory. The feature ended up
>> > using pgpool's existing parser directly, and this code was never
>> > wired up.
>> >
>> > Removed: QueryParseCache and QueryParseEntry structs, all related
>> > static functions, the TRACK_TABLE_MUTATION_QUERY_SEM semaphore,
>> > and the track_table_mutation_query_buckets /
>> > track_table_mutation_query_parse_cache_size configuration
>> > parameters. This also reduces shared memory usage from ~6.4 MB
>> > to ~80 KB with default settings.
>> >
>> > check_object_relationship_list scope
>> > -------------------------------------
>> >
>> > You're correct -- dml_adaptive_global does not use
>> > dml_adaptive_object_relationship_list. Changed
>> > check_object_relationship_list() to check for DLBOW_DML_ADAPTIVE
>> > only, not DLBOW_IS_DML_ADAPTIVE (which includes global).
>> >
>> > Documentation fixes
>> > -------------------
>> >
>> > - Removed "(Lagless Replica Reads)" from section title and
>> > "lagless" language from description.
>> >
>> > - Described fallback behavior when neither
>> > replication_delay_source_cmd nor delay_threshold_by_time is
>> > configured (TTL stays at 100ms default minimum).
>> >
>> > - "query cache" references removed (the query parse cache is gone).
>> >
>> > - Added 128-table-per-SELECT limit to Limitations section
>> > (uses POOL_MAX_SELECT_OIDS).
>> >
>> > Code style fixes
>> > ----------------
>> >
>> > - DLBOW_IS_DML_ADAPTIVE() calls no longer split across lines.
>> >
>> > - Split the long errmsg line in
>> > is_select_object_in_temp_write_list.
>> >
>> > - Removed redundant is_adaptive variable in
>> > is_select_object_in_temp_write_list (the check at function
>> > entry already guarantees it).
>> >
>> > Thanks!
>> >
>> > On Wed, Apr 15, 2026 at 1:43 AM Tatsuo Ishii <[email protected]> wrote:
>> >
>> >> Hi Nadav,
>> >>
>> >> > Hi Tatsuo,
>> >> >
>> >> > Looks good to me thanks!
>> >> >
>> >> > Please go ahead with your review. waiting to hear back from you.
>> >>
>> >> Here are the code review results.
>> >>
>> >> diff --git a/doc/src/sgml/loadbalance.sgml b/doc/src/sgml/loadbalance.sgml
>> >> index 9e1e7b39b..7384ce81a 100644
>> >> --- a/doc/src/sgml/loadbalance.sgml
>> >> +++ b/doc/src/sgml/loadbalance.sgml
>> >> :
>> >> + <sect2 id="runtime-config-table-mutation-map">
>> >> + <title>Table Mutation Map Configuration (Lagless Replica Reads)</title>
>> >>
>> >> "(Lagless Replica Reads)" sounds like an advertisement to me. It
>> >> should be removed.
>> >>
>> >> + <para>
>> >> + These parameters configure the track table mutation feature, which is activated by setting
>> >> + <xref linkend="guc-disable-load-balance-on-write"> to <literal>dml_adaptive_global</literal>.
>> >> + The feature tracks recently written tables to prevent stale reads from replica nodes during
>> >> + replication lag, implementing the "lagless" architecture pattern for distributed systems
>> >> + with read replicas.
>> >>
>> >> I think the feature does not guarantee "lagless" reads at all times, in all cases.
>> >>
>> >> + <para>
>> >> + This feature requires time-based replication delay monitoring. This can be provided by either
>> >> + <xref linkend="guc-replication-delay-source-cmd"> (external command mode) or by setting
>> >> + <xref linkend="guc-delay-threshold-by-time"> (which uses <literal>pg_stat_replication.replay_lag</literal>
>> >> + from PostgreSQL 10+). At least one of these must be configured for the TTL calculation to work.
>> >>
>> >> If neither of these is set, what happens? An error? This needs to be described.
>> >>
>> >> + </para>
>> >> +
>> >> + <warning>
>> >> + <para>
>> >> + Enabling <literal>dml_adaptive_global</literal> increases shared memory consumption. With default settings,
>> >> + the feature requires approximately 6.4 MB of shared memory (0.1 MB for table tracking + 6.3 MB for query cache).
>> >>
>> >> "query cache" should be "query parse cache".
>> >>
>> >> + Memory usage scales with configuration parameters:
>> >> + </para>
>> >> + <itemizedlist>
>> >> + <listitem>
>> >> + <para>
>> >> + Table tracking: <literal>track_table_mutation_table_size * 40 bytes</literal> (default: 2048 * 40 = ~80 KB)
>> >> + </para>
>> >> + </listitem>
>> >> + <listitem>
>> >> + <para>
>> >> + Query cache: <literal>track_table_mutation_query_parse_cache_size * 640 bytes</literal> (default: 10000 * 640 = ~6.3 MB)
>> >>
>> >> "query cache" should be "query parse cache".
>> >>
>> >> + <title>Limitations</title>
>> >>
>> >> I think the number of tables tracked in a SELECT is limited to 8. It should
>> >> be mentioned.
>> >>
>> >> diff --git a/src/context/pool_query_context.c b/src/context/pool_query_context.c
>> >> index a056ac596..0190d3673 100644
>> >> --- a/src/context/pool_query_context.c
>> >> +++ b/src/context/pool_query_context.c
>> >> @@ -1828,15 +1829,23 @@ is_in_list(char *name, List *list)
>> >> static bool
>> >> is_select_object_in_temp_write_list(Node *node, void *context)
>> >> {
>> >> - if (node == NULL || pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE)
>> >> + if (node == NULL ||
>> >> + !DLBOW_IS_DML_ADAPTIVE(
>> >> + pool_config->disable_load_balance_on_write))
>> >>
>> >> You don't need to split the line.
>> >>
>> >> + is_adaptive = DLBOW_IS_DML_ADAPTIVE(
>> >> + pool_config->disable_load_balance_on_write);
>> >>
>> >> You don't need to split the line.
>> >>
>> >> - if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && session_context->is_in_transaction)
>> >> + if (is_adaptive &&
>> >> + session_context->is_in_transaction)
>> >> {
>> >> ereport(DEBUG1,
>> >> (errmsg("is_select_object_in_temp_write_list: \"%s\", found relation \"%s\"", (char *) context, rgv->relname)));
>> >> This line is too long. Please split.
>> >>
>> >> @@ -1880,7 +1889,13 @@ static char *get_associated_object_from_dml_adaptive_relations
>> >> void
>> >> check_object_relationship_list(char *name, bool is_func_name)
>> >> {
>> >> - if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && pool_config->parsed_dml_adaptive_object_relationship_list)
>> >> + bool is_adaptive;
>> >> +
>> >> + is_adaptive = DLBOW_IS_DML_ADAPTIVE(
>> >> + pool_config->disable_load_balance_on_write);
>> >>
>> >> I wrote in the commit message:
>> >>
>> >> modifications are only detected in the same transaction). Note,
>> >> however, you cannot use dml_adaptive_object_relationship_list to track
>> >> dependency among table and other objects.
>> >>
>> >> In my understanding the feature does not use
>> >> dml_adaptive_object_relationship_list. If this is correct, why is
>> >> check_object_relationship_list() called here in the
>> >> dml_adaptive_global case? If the feature uses
>> >> dml_adaptive_object_relationship_list, test cases should be included.
>> >>
>> >> diff --git a/src/utils/pool_track_table_mutation.c b/src/utils/pool_track_table_mutation.c
>> >> new file mode 100644
>> >> index 000000000..9be46b28f
>> >> --- /dev/null
>> >> +++ b/src/utils/pool_track_table_mutation.c
>> >>
>> >> It seems the following functions are not used anywhere. I wonder if this
>> >> feature actually uses the "query parse cache".
>> >>
>> >> pool_track_table_mutation_get_cached_parse
>> >> pool_track_table_mutation_cache_parse
>> >> pool_track_table_mutation_normalize_and_hash
>> >>
>> >> Besides the code review, I mutated one of the regression tests to check
>> >> whether the feature coexists with the existing in-memory query cache
>> >> feature. After applying the attached patch, I ran 006.memqcache and got
>> >> the following result.
>> >>
>> >> cd src/test/regression
>> >> ./regress.sh 006
>> >> creating pgpool-II temporary installation ...
>> >> moving pgpool_setup to temporary installation path ...
>> >> moving watchdog_setup to temporary installation path ...
>> >> using pgpool-II at /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/temp/installed
>> >> *************************
>> >> REGRESSION MODE : install
>> >> Pgpool-II version : pgpool-II version 4.8devel (mitsukakeboshi)
>> >> Pgpool-II install path : /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/temp/installed
>> >> PostgreSQL bin : /usr/local/pgsql/bin
>> >> PostgreSQL Major version : 18
>> >> pgbench : /usr/local/pgsql/bin/pgbench
>> >> PostgreSQL jdbc : /usr/local/pgsql/share/postgresql-9.2-1003.jdbc4.jar
>> >> *************************
>> >> testing 006.memqcache...failed.
>> >> out of 1 ok:0 failed:1 timeout:0
>> >>
>> >> log/006.memqcache shows:
>> >>
>> >> ../expected.txt result.txt differ: char 1, line 1
>> >>
>> >> So I checked the test script and found the error was generated by a
>> >> Java program test.
>> >>
>> >> java jdbctest > result.txt 2>&1
>> >> cmp ../expected.txt result.txt
>> >> if [ $? != 0 ];then
>> >> ./shutdownall
>> >> exit 1
>> >> fi
>> >>
>> >> In jdbctest.java:
>> >>
>> >> /*
>> >> * Cache test in an explicit transaction
>> >> */
>> >> conn.setAutoCommit(false);
>> >> // execute DML. This should prevent SELECTs from using query cache in the transaction.
>> >> sql = "UPDATE t1 SET i = 2;";
>> >> pst = conn.createStatement();
>> >> pst.executeUpdate(sql);
>> >> pst.close();
>> >> // should not use the cache and should return "2", rather than "1"
>> >> prest = conn.prepareStatement("SELECT * FROM t1");
>> >> rs = prest.executeQuery();
>> >>
>> >> The expected file (expected.txt) has "2" but the result file
>> >> (testdir/result.txt) was "1". This is the reason why the test
>> >> failed. I wonder if there's something wrong with the feature when the
>> >> query cache is enabled. Can you look into this?
>> >>
>> >> Regards,
>> >> --
>> >> Tatsuo Ishii
>> >> SRA OSS K.K.
>> >> English: http://www.sraoss.co.jp/index_en/
>> >> Japanese:http://www.sraoss.co.jp
>> >>
>> >
>> >
>> > --
>> > Nadav Shatz
>> > Tailor Brands | CTO
>>
>
>
> --
> Nadav Shatz
> Tailor Brands | CTO
* Re: Proposal: Recent mutated table tracking in memory
From: Nadav Shatz @ 2026-04-23 14:16 UTC (permalink / raw)
To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]
Hi Tatsuo,
Good catch on the 006.memqcache timeout. My previous fix had
unintended side effects -- setting writing_transaction for
dml_adaptive_global also changed routing behavior (it forced the
whole transaction to primary, effectively reducing the feature to
'transaction' mode). That's what caused the hang.
Fixed properly in v3: instead of touching writing_transaction,
added a memqcache-specific guard that checks whether the current
dml_adaptive* session has tracked writes in the current
transaction, and skips the cache fetch if so.
Attached: v3-0001-Feature-load-balancing-control-by-table-tracking.patch
Changes in v3 vs v2:
- pool_set_writing_transaction() reverted to original behavior
(dml_adaptive_global no longer sets writing_transaction, so
routing stays per-table as intended).
- Added new helper pool_has_dml_adaptive_write_in_transaction()
in pool_session_context.c. Returns true when the current session
is in dml_adaptive* mode, is inside an explicit transaction, and
has already tracked at least one write (via
transaction_temp_write_list).
- The two memqcache fetch guards in pool_proto_modules.c
(simple query at line 270, extended query at line 1028) now
also call !pool_has_dml_adaptive_write_in_transaction().
Autocommit writes in dml_adaptive_global are still handled by
the existing pool_invalidate_query_cache() at COMMIT time --
no change needed there.
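
To make the new guard easier to review, here is a minimal standalone sketch of the logic described above. It is a simplified model and not the code in the patch: ModelDlbowMode, ModelSession and both function names are hypothetical stand-ins for the real pgpool session context, temp_write_count stands in for "transaction_temp_write_list is non-empty", and the real fetch guards in pool_proto_modules.c check further conditions that are omitted here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for pgpool types; not the real definitions. */
typedef enum
{
	MODE_OFF,
	MODE_TRANSACTION,
	MODE_DML_ADAPTIVE,
	MODE_DML_ADAPTIVE_GLOBAL
} ModelDlbowMode;

typedef struct
{
	ModelDlbowMode mode;
	bool		in_transaction;		/* inside an explicit transaction */
	size_t		temp_write_count;	/* writes tracked in this transaction */
} ModelSession;

/*
 * True when the session is in a dml_adaptive* mode, is inside an explicit
 * transaction, and has tracked at least one write in that transaction.
 */
bool
model_has_dml_adaptive_write_in_transaction(const ModelSession *s)
{
	if (s == NULL)
		return false;
	if (s->mode != MODE_DML_ADAPTIVE && s->mode != MODE_DML_ADAPTIVE_GLOBAL)
		return false;
	if (!s->in_transaction)
		return false;
	return s->temp_write_count > 0;
}

/*
 * Simplified fetch guard: skip the query cache if either the existing
 * writing_transaction flag is set or the new helper reports a tracked write.
 */
bool
model_may_fetch_from_query_cache(const ModelSession *s,
								 bool writing_transaction)
{
	return !writing_transaction &&
		!model_has_dml_adaptive_write_in_transaction(s);
}
```

In this model, the jdbctest case (UPDATE followed by SELECT in one explicit transaction under dml_adaptive_global) has temp_write_count > 0, so the cache fetch is skipped without touching writing_transaction.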
Verified locally by mutating 006.memqcache with
disable_load_balance_on_write = 'dml_adaptive_global' in the
streaming replication mode (the only mode where dml_adaptive
applies) and the jdbctest now correctly returns "2" instead of
the stale cached "1". Both 006.memqcache and 043.track_table_mutation
pass.
Thanks!
On Thu, Apr 23, 2026 at 11:14 AM Tatsuo Ishii <[email protected]> wrote:
> Hi Nadav,
>
> Unfortunately the mutated 006.memqcache failed (timeout).
>
> >> > memqcache bug fix
> >> > -----------------
> >> >
> >> > Good catch. The root cause: pool_set_writing_transaction() was
> >> > explicitly skipping dml_adaptive_global, so
> >> > pool_is_writing_transaction() always returned false in this mode.
> >> > The query cache fetch guard at pool_proto_modules.c:270
> >> > (!pool_is_writing_transaction()) then served stale cached results
> >> > after DML in the same transaction.
> >> >
> >> > Fix: pool_set_writing_transaction() now sets the flag for
> >> > dml_adaptive_global (only 'off' and 'dml_adaptive' skip it). This
> >> > ensures the query cache is properly bypassed after writes within
> >> > the same transaction.
>
> Regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>
> > Hi Tatsuo,
> >
> > Rebased onto current master, renumbered the regression tests
> > (043/044 to avoid collision with the new 042.ssl_reload), and
> > combined everything into a single commit.
> >
> > Attached: v2-0001-Feature-load-balancing-control-by-table-tracking.patch
> >
> > Looking forward to your review.
> >
> >
> > On Sun, Apr 19, 2026 at 10:25 AM Tatsuo Ishii <[email protected]> wrote:
> >
> >> > Hi Tatsuo,
> >> >
> >> > Thank you for the detailed review. Attached patch addresses all items.
> >>
> >> I guess the attached patch is on top of
> >> v1-0001-Feature-load-balancing-control-by-table-tracking.patch. To
> >> apply v2-0001-address-review.patch, we need to apply
> >> v1-0001-Feature-load-balancing-control-by-table-tracking.patch first.
> >> Unfortunately due to recent commit, it does not apply anymore. Can you
> >> please provide v1 + v2 that are rebased against latest master branch?
> >> Also 042 regression test is already used by recent commit. Can you
> >> renumber 042.track_table_mutation and
> >> 043.track_table_mutation_watchdog to 043.track_table_mutation and
> >> 044.track_table_mutation_watchdog accordingly?
> >>
> >> Looking forward to seeing new patch.
> >>
> >> Regards,
> >> --
> >> Tatsuo Ishii
> >> SRA OSS K.K.
> >> English: http://www.sraoss.co.jp/index_en/
> >> Japanese:http://www.sraoss.co.jp
> >>
> >>
> >> > memqcache bug fix
> >> > -----------------
> >> >
> >> > Good catch. The root cause: pool_set_writing_transaction() was
> >> > explicitly skipping dml_adaptive_global, so
> >> > pool_is_writing_transaction() always returned false in this mode.
> >> > The query cache fetch guard at pool_proto_modules.c:270
> >> > (!pool_is_writing_transaction()) then served stale cached results
> >> > after DML in the same transaction.
> >> >
> >> > Fix: pool_set_writing_transaction() now sets the flag for
> >> > dml_adaptive_global (only 'off' and 'dml_adaptive' skip it). This
> >> > ensures the query cache is properly bypassed after writes within
> >> > the same transaction.
> >> >
> >> > Removed dead query parse cache code (~700 lines)
> >> > -------------------------------------------------
> >> >
> >> > You're right -- pool_track_table_mutation_get_cached_parse,
> >> > pool_track_table_mutation_cache_parse, and
> >> > pool_track_table_mutation_normalize_and_hash were never called.
> >> > These were leftover from an earlier design where we planned to
> >> > cache SQL parse results in shared memory. The feature ended up
> >> > using pgpool's existing parser directly, and this code was never
> >> > wired up.
> >> >
> >> > Removed: QueryParseCache and QueryParseEntry structs, all related
> >> > static functions, the TRACK_TABLE_MUTATION_QUERY_SEM semaphore,
> >> > and the track_table_mutation_query_buckets /
> >> > track_table_mutation_query_parse_cache_size configuration
> >> > parameters. This also reduces shared memory usage from ~6.4 MB
> >> > to ~80 KB with default settings.
> >> >
> >> > check_object_relationship_list scope
> >> > -------------------------------------
> >> >
> >> > You're correct -- dml_adaptive_global does not use
> >> > dml_adaptive_object_relationship_list. Changed
> >> > check_object_relationship_list() to check for DLBOW_DML_ADAPTIVE
> >> > only, not DLBOW_IS_DML_ADAPTIVE (which includes global).
> >> >
> >> > Documentation fixes
> >> > -------------------
> >> >
> >> > - Removed "(Lagless Replica Reads)" from section title and
> >> > "lagless" language from description.
> >> >
> >> > - Described fallback behavior when neither
> >> > replication_delay_source_cmd nor delay_threshold_by_time is
> >> > configured (TTL stays at 100ms default minimum).
> >> >
> >> > - "query cache" references removed (the query parse cache is gone).
> >> >
> >> > - Added 128-table-per-SELECT limit to Limitations section
> >> > (uses POOL_MAX_SELECT_OIDS).
> >> >
> >> > Code style fixes
> >> > ----------------
> >> >
> >> > - DLBOW_IS_DML_ADAPTIVE() calls no longer split across lines.
> >> >
> >> > - Split the long errmsg line in
> >> > is_select_object_in_temp_write_list.
> >> >
> >> > - Removed redundant is_adaptive variable in
> >> > is_select_object_in_temp_write_list (the check at function
> >> > entry already guarantees it).
> >> >
> >> > Thanks!
> >> >
> >> > On Wed, Apr 15, 2026 at 1:43 AM Tatsuo Ishii <[email protected]>
> >> wrote:
> >> >
> >> >> Hi Nadav,
> >> >>
> >> >> > Hi Tatsuo,
> >> >> >
> >> >> > Looks good to me thanks!
> >> >> >
> >> >> > Please go ahead with your review. waiting to hear back from you.
> >> >>
> >> >> Here are the code review results.
> >> >>
> >> >> diff --git a/doc/src/sgml/loadbalance.sgml
> >> b/doc/src/sgml/loadbalance.sgml
> >> >> index 9e1e7b39b..7384ce81a 100644
> >> >> --- a/doc/src/sgml/loadbalance.sgml
> >> >> +++ b/doc/src/sgml/loadbalance.sgml
> >> >> :
> >> >> + <sect2 id="runtime-config-table-mutation-map">
> >> >> + <title>Table Mutation Map Configuration (Lagless Replica
> >> Reads)</title>
> >> >>
> >> >> "(Lagless Replica Reads)" sounds like an advertisement to me. It
> >> >> should be removed.
> >> >>
> >> >> + <para>
> >> >> + These parameters configure the track table mutation feature,
> which
> >> is
> >> >> activated by setting
> >> >> + <xref linkend="guc-disable-load-balance-on-write"> to
> >> >> <literal>dml_adaptive_global</literal>.
> >> >> + The feature tracks recently written tables to prevent stale reads
> >> from
> >> >> replica nodes during
> >> >> + replication lag, implementing the "lagless" architecture pattern
> for
> >> >> distributed systems
> >> >> + with read replicas.
> >> >>
> >> >> I think the feature does not guarantee "lagless" anytime, in all
> cases.
> >> >>
> >> >> + <para>
> >> >> + This feature requires time-based replication delay monitoring.
> This
> >> >> can be provided by either
> >> >> + <xref linkend="guc-replication-delay-source-cmd"> (external
> command
> >> >> mode) or by setting
> >> >> + <xref linkend="guc-delay-threshold-by-time"> (which uses
> >> >> <literal>pg_stat_replication.replay_lag</literal>
> >> >> + from PostgreSQL 10+). At least one of these must be configured
> for
> >> the
> >> >> TTL calculation to work.
> >> >>
> >> >> If one of these is not set, what happens? Error? Need to describe it.
> >> >>
> >> >> + </para>
> >> >> +
> >> >> + <warning>
> >> >> + <para>
> >> >> + Enabling <literal>dml_adaptive_global</literal> increases shared
> >> >> memory consumption. With default settings,
> >> >> + the feature requires approximately 6.4 MB of shared memory (0.1
> MB
> >> >> for table tracking + 6.3 MB for query cache).
> >> >>
> >> >> "query cache" should be "query parse cache".
> >> >>
> >> >> + Memory usage scales with configuration parameters:
> >> >> + </para>
> >> >> + <itemizedlist>
> >> >> + <listitem>
> >> >> + <para>
> >> >> + Table tracking: <literal>track_table_mutation_table_size * 40
> >> >> bytes</literal> (default: 2048 * 40 = ~80 KB)
> >> >> + </para>
> >> >> + </listitem>
> >> >> + <listitem>
> >> >> + <para>
> >> >> + Query cache:
> >> <literal>track_table_mutation_query_parse_cache_size *
> >> >> 640 bytes</literal> (default: 10000 * 640 = ~6.3 MB)
> >> >>
> >> >> "query cache" should be "query parse cache".
> >> >>
> >> >> + <title>Limitations</title>
> >> >>
> >> >> I think the number of tables tracked in a SELECT is limited to 8. It should
> >> >> be mentioned.
> >> >>
> >> >> diff --git a/src/context/pool_query_context.c
> >> >> b/src/context/pool_query_context.c
> >> >> index a056ac596..0190d3673 100644
> >> >> --- a/src/context/pool_query_context.c
> >> >> +++ b/src/context/pool_query_context.c
> >> >> @@ -1828,15 +1829,23 @@ is_in_list(char *name, List *list)
> >> >> static bool
> >> >> is_select_object_in_temp_write_list(Node *node, void *context)
> >> >> {
> >> >> - if (node == NULL ||
> pool_config->disable_load_balance_on_write
> >> !=
> >> >> DLBOW_DML_ADAPTIVE)
> >> >> + if (node == NULL ||
> >> >> + !DLBOW_IS_DML_ADAPTIVE(
> >> >> +
> >> >> pool_config->disable_load_balance_on_write))
> >> >>
> >> >> You don't need to split the line.
> >> >>
> >> >> + is_adaptive = DLBOW_IS_DML_ADAPTIVE(
> >> >> +
> >> >> pool_config->disable_load_balance_on_write);
> >> >>
> >> >> You don't need to split the line.
> >> >>
> >> >> - if (pool_config->disable_load_balance_on_write ==
> >> >> DLBOW_DML_ADAPTIVE && session_context->is_in_transaction)
> >> >> + if (is_adaptive &&
> >> >> + session_context->is_in_transaction)
> >> >> {
> >> >> ereport(DEBUG1,
> >> >>
> >> >> (errmsg("is_select_object_in_temp_write_list: \"%s\", found relation
> >> >> \"%s\"", (char *) context, rgv->relname)));
> >> >> This line is too long. Please split.
> >> >>
> >> >> @@ -1880,7 +1889,13 @@ static char
> >> >> *get_associated_object_from_dml_adaptive_relations
> >> >> void
> >> >> check_object_relationship_list(char *name, bool is_func_name)
> >> >> {
> >> >> - if (pool_config->disable_load_balance_on_write ==
> >> >> DLBOW_DML_ADAPTIVE &&
> >> >> pool_config->parsed_dml_adaptive_object_relationship_list)
> >> >> + bool is_adaptive;
> >> >> +
> >> >> + is_adaptive = DLBOW_IS_DML_ADAPTIVE(
> >> >> +
> >> >> pool_config->disable_load_balance_on_write);
> >> >>
> >> >> I wrote in the commit message:
> >> >>
> >> >> modifications are only detected in the same transaction). Note,
> >> >> however, you cannot use dml_adaptive_object_relationship_list to
> track
> >> >> dependency among table and other objects.
> >> >>
> >> >> In my understanding the feature does not use
> >> >> dml_adaptive_object_relationship_list. If this is correct, why
> >> >> check_object_relationship_list() is called here in case
> >> >> dml_adaptive_global? If the feature uses
> >> >> dml_adaptive_object_relationship_list, test cases should be included.
> >> >>
> >> >> diff --git a/src/utils/pool_track_table_mutation.c
> >> >> b/src/utils/pool_track_table_mutation.c
> >> >> new file mode 100644
> >> >> index 000000000..9be46b28f
> >> >> --- /dev/null
> >> >> +++ b/src/utils/pool_track_table_mutation.c
> >> >>
> >> >> It seems following functions are not used anywhere. I wonder if this
> >> >> feature actually use "query parse cache".
> >> >>
> >> >> pool_track_table_mutation_get_cached_parse
> >> >> pool_track_table_mutation_cache_parse
> >> >> pool_track_table_mutation_normalize_and_hash
> >> >>
> >> >> Besides the code review, I mutated one of the regression tests to check
> >> >> whether the feature coexists with the existing in-memory query cache
> >> >> feature. After applying the attached patch, I ran 006.memqcache and got
> >> >> the following result.
> >> >>
> >> >> cd src/test/regression
> >> >> ./regress.sh 006
> >> >> creating pgpool-II temporary installation ...
> >> >> moving pgpool_setup to temporary installation path ...
> >> >> moving watchdog_setup to temporary installation path ...
> >> >> using pgpool-II at
> >> >>
> >>
> /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/temp/installed
> >> >> *************************
> >> >> REGRESSION MODE : install
> >> >> Pgpool-II version : pgpool-II version 4.8devel
> (mitsukakeboshi)
> >> >> Pgpool-II install path :
> >> >>
> >>
> /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/temp/installed
> >> >> PostgreSQL bin : /usr/local/pgsql/bin
> >> >> PostgreSQL Major version : 18
> >> >> pgbench : /usr/local/pgsql/bin/pgbench
> >> >> PostgreSQL jdbc :
> >> >> /usr/local/pgsql/share/postgresql-9.2-1003.jdbc4.jar
> >> >> *************************
> >> >> testing 006.memqcache...failed.
> >> >> out of 1 ok:0 failed:1 timeout:0
> >> >>
> >> >> log/006.memqcache shows:
> >> >>
> >> >> ../expected.txt result.txt differ: char 1, line 1
> >> >>
> >> >> So I checked the test script and found the error was generated by a
> >> >> Java program test.
> >> >>
> >> >> java jdbctest > result.txt 2>&1
> >> >> cmp ../expected.txt result.txt
> >> >> if [ $? != 0 ];then
> >> >> ./shutdownall
> >> >> exit 1
> >> >> fi
> >> >>
> >> >> In jdbctest.java:
> >> >>
> >> >> /*
> >> >> * Cache test in an explicit transaction
> >> >> */
> >> >> conn.setAutoCommit(false);
> >> >> // execute DML. This should prevent SELECTs from
> using
> >> >> query cache in the transaction.
> >> >> sql = "UPDATE t1 SET i = 2;";
> >> >> pst = conn.createStatement();
> >> >> pst.executeUpdate(sql);
> >> >> pst.close();
> >> >> // should not use the cache and should return "2",
> >> rather
> >> >> than "1"
> >> >> prest = conn.prepareStatement("SELECT * FROM t1");
> >> >> rs = prest.executeQuery();
> >> >>
> >> >> The expected file (expected.txt) has "2" but the result file
> >> >> (testdir/result.txt) was "1". This is the reason why the test
> >> >> failed. I wonder if there's something wrong with the feature when the
> >> >> query cache is enabled. Can you look into this?
> >> >>
> >> >> Regards,
> >> >> --
> >> >> Tatsuo Ishii
> >> >> SRA OSS K.K.
> >> >> English: http://www.sraoss.co.jp/index_en/
> >> >> Japanese:http://www.sraoss.co.jp
> >> >>
> >> >
> >> >
> >> > --
> >> > Nadav Shatz
> >> > Tailor Brands | CTO
> >>
> >
> >
> > --
> > Nadav Shatz
> > Tailor Brands | CTO
>
--
Nadav Shatz
Tailor Brands | CTO
Attachments:
[application/octet-stream] v3-0001-Feature-load-balancing-control-by-table-tracking.patch (91.9K, 3-v3-0001-Feature-load-balancing-control-by-table-tracking.patch)
From 4842ee89551faba04082219e5ed62169b164008e Mon Sep 17 00:00:00 2001
From: Nadav Shatz <[email protected]>
Date: Sun, 19 Apr 2026 17:10:24 +0300
Subject: [PATCH v3] Feature: load balancing control by table tracking.
Prevent routing of read only queries to standby if replication delay
of tables used in the query exceeds certain amount of value
collected by streaming replication process. To enable this feature,
set disable_load_balance_on_write to dml_adaptive_global.
In this mode, when tables are modified by
INSERT/UPDATE/DELETE/TRUNCATE/MERGE/data modification WITH, for
a certain period SELECTs using the tables are not load balanced,
i.e. they are routed to the primary PostgreSQL server to avoid data
staleness caused by replication delay.
Unlike dml_adaptive mode, any table modifications described above are
detected even if they happen in other sessions (in dml_adaptive, table
modifications are only detected in the same transaction). Note,
however, you cannot use dml_adaptive_object_relationship_list to track
dependency among table and other objects.
Besides dml_adaptive_global, there are some tuning knobs for the
feature:
- track_table_mutation_ttl_factor
Parameter to calculate TTL of each tracking data.
- track_table_mutation_max_staleness
Maximum duration in milliseconds that a single table entry can
continuously force queries to primary.
- track_table_mutation_cold_start_duration
Duration in milliseconds to route all queries to primary after a
child process starts.
- track_table_mutation_table_buckets
Number of hash buckets for the track table mutation hash table.
- track_table_mutation_table_size
Maximum number of tables that can be tracked simultaneously in the
track table mutation.
Author: Nadav Shatz <[email protected]>
Reviewed-by: Tatsuo Ishii <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/20260407.181009.1762204033074164841.ishii%40postgresql.org#58c139c1a7f8d5562865921d0733667b
---
doc/src/sgml/loadbalance.sgml | 288 ++++++
src/Makefile.am | 1 +
src/config/pool_config_variables.c | 65 ++
src/context/pool_query_context.c | 242 ++++-
src/context/pool_session_context.c | 37 +-
src/include/context/pool_session_context.h | 1 +
src/include/pool.h | 3 +-
src/include/pool_config.h | 24 +-
src/include/utils/pool_track_table_mutation.h | 167 ++++
src/main/pgpool_main.c | 29 +-
src/protocol/CommandComplete.c | 28 +
src/protocol/child.c | 8 +
src/protocol/pool_proto_modules.c | 8 +-
src/sample/pgpool.conf.sample-stream | 45 +
src/streaming_replication/pool_worker_child.c | 24 +
src/test/regression/libs.sh | 2 +
.../tests/043.track_table_mutation/test.sh | 354 +++++++
.../044.track_table_mutation_watchdog/test.sh | 184 ++++
src/tools/pgindent/typedefs.list | 4 +
src/utils/pool_track_table_mutation.c | 902 ++++++++++++++++++
20 files changed, 2391 insertions(+), 25 deletions(-)
create mode 100644 src/include/utils/pool_track_table_mutation.h
create mode 100755 src/test/regression/tests/043.track_table_mutation/test.sh
create mode 100755 src/test/regression/tests/044.track_table_mutation_watchdog/test.sh
create mode 100644 src/utils/pool_track_table_mutation.c
diff --git a/doc/src/sgml/loadbalance.sgml b/doc/src/sgml/loadbalance.sgml
index 9e1e7b39b..d4fbcf1a5 100644
--- a/doc/src/sgml/loadbalance.sgml
+++ b/doc/src/sgml/loadbalance.sgml
@@ -1110,6 +1110,18 @@ app_name_redirect_preference_list > database_redirect_preference_list > us
Dependent functions, triggers, and views on the tables can be configured
using <xref linkend="guc-dml-adaptive-object-relationship-list">
</para>
+
+ <para>
+ If this parameter is set to <varname>dml_adaptive_global</varname>,
+ <productname>Pgpool-II</> behaves like <varname>dml_adaptive</varname>
+ (per-transaction write tracking) and additionally uses shared memory to track
+ recently written tables across all sessions cluster-wide. When a table is
+ written in any session, subsequent reads of that table from any session are
+ routed to primary until a TTL (based on measured replication delay) expires.
+ This prevents stale reads after writes even across different connections.
+ See <xref linkend="runtime-config-table-mutation-map"> for the sub-parameters
+ that control the shared-memory tracking behavior.
+ </para>
</listitem>
</varlistentry>
@@ -1195,4 +1207,280 @@ dml_adaptive_object_relationship_list = 'table_1:table_2'
</variablelist>
</sect2>
+
+ <sect2 id="runtime-config-table-mutation-map">
+ <title>Table Mutation Tracking Configuration</title>
+
+ <para>
+ These parameters configure the track table mutation feature, which is activated by setting
+ <xref linkend="guc-disable-load-balance-on-write"> to <literal>dml_adaptive_global</literal>.
+ The feature tracks recently written tables to prevent stale reads from replica nodes during
+ replication lag.
+ </para>
+
+ <para>
+ When a table is modified (INSERT/UPDATE/DELETE), it is marked as "stale" for a TTL period
+ (<literal>replication_delay * track_table_mutation_ttl_factor</literal>). Any SELECT queries on stale tables are routed
+ to the primary node instead of replicas, ensuring read-after-write consistency.
+ </para>
+
+ <para>
+ This feature requires time-based replication delay monitoring. This can be provided by either
+ <xref linkend="guc-replication-delay-source-cmd"> (external command mode) or by setting
+ <xref linkend="guc-delay-threshold-by-time"> (which uses <literal>pg_stat_replication.replay_lag</literal>
+ from PostgreSQL 10+). If neither is configured, the TTL remains at its default minimum value
+ (100 milliseconds) and is never updated based on actual replication delay, which may result
+ in suboptimal routing decisions.
+ </para>
+
+ <warning>
+ <para>
+ Enabling <literal>dml_adaptive_global</literal> increases shared memory consumption. With default settings,
+ the feature requires approximately 80 KB of shared memory for table tracking:
+ <literal>track_table_mutation_table_size * 40 bytes</literal> (default: 2048 * 40 = ~80 KB).
+ </para>
+ </warning>
+
+ <variablelist>
+
+ <varlistentry id="guc-track-table-mutation-ttl-factor" xreflabel="track_table_mutation_ttl_factor">
+ <term><varname>track_table_mutation_ttl_factor</varname> (<type>floating point</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_ttl_factor</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Multiplier for calculating the TTL: <literal>TTL = replication_delay * track_table_mutation_ttl_factor</literal>.
+ Higher values provide more safety margin but may reduce read replica utilization.
+ </para>
+ <para>
+ Valid range: 1.0-100.0. Default is <literal>5.0</literal>.
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-max-staleness" xreflabel="track_table_mutation_max_staleness">
+ <term><varname>track_table_mutation_max_staleness</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_max_staleness</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Maximum duration in milliseconds that a single table entry can continuously force queries to primary,
+ measured from when the table was first marked stale. When this cap is reached, the entry is expired
+ regardless of recent writes. If the table is written to again after expiry, a fresh tracking entry
+ is created.
+ </para>
+ <para>
+ This parameter bounds the cross-session impact of table mutation tracking. Even if a table is written
+ to in a tight loop, its effect on other sessions' load balancing is limited to this duration. For
+ legitimately busy tables, the gap between forced expiry and the next write re-marking the table is
+ negligible (typically milliseconds).
+ </para>
+ <para>
+ Set to 0 to disable the cap (not recommended for production).
+ Valid range: 0-3600000 ms. Default is <literal>60000</literal> (60 seconds).
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-cold-start-duration" xreflabel="track_table_mutation_cold_start_duration">
+ <term><varname>track_table_mutation_cold_start_duration</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_cold_start_duration</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Duration in milliseconds to route all queries to primary after a child process starts.
+ This prevents stale reads when a new connection is established before the track table mutation
+ is populated with recent write history.
+ </para>
+ <para>
+ When watchdog is enabled and the local node becomes the leader, Pgpool-II also triggers a
+ global cold start for this duration to avoid stale reads after leadership changes.
+ </para>
+ <para>
+ Valid range: 0-60000 ms. Default is <literal>2000</literal> (2 seconds).
+ Set to 0 to disable cold start behavior.
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-table-buckets" xreflabel="track_table_mutation_table_buckets">
+ <term><varname>track_table_mutation_table_buckets</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_table_buckets</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Number of hash buckets for the track table mutation hash table.
+ Higher values reduce hash collisions and improve lookup performance.
+ </para>
+ <para>
+ Valid range: 64-65536. Default is <literal>1024</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-table-size" xreflabel="track_table_mutation_table_size">
+ <term><varname>track_table_mutation_table_size</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_table_size</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Maximum number of tables that can be tracked simultaneously in the track table mutation.
+ When full, oldest entries are evicted using a simple eviction strategy.
+ </para>
+ <para>
+ Valid range: 128-131072. Default is <literal>2048</literal>.
+ Memory usage: approximately 40 bytes per entry.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ <sect3 id="runtime-config-track-table-mutation-example">
+ <title>Track Table Mutation Configuration Example</title>
+ <para>
+ To enable track table mutation with replication delay monitoring:
+ </para>
+ <programlisting>
+# Enable dml_adaptive_global mode (includes track table mutation)
+disable_load_balance_on_write = 'dml_adaptive_global'
+track_table_mutation_ttl_factor = 5.0
+track_table_mutation_max_staleness = 60000
+track_table_mutation_cold_start_duration = 2000
+
+# Option A: Use external command for replication delay
+replication_delay_source_cmd = '/path/to/get-replication-delay.sh'
+replication_delay_source_timeout = 10
+
+# Option B: Use pg_stat_replication replay_lag (PG 10+)
+# delay_threshold_by_time = 1000
+
+# Adjust table map size based on workload
+track_table_mutation_table_size = 4096
+ </programlisting>
+ <para>
+ Shared memory required for above configuration: approximately 160 KB for the table map.
+ Default configuration (2048 tables) requires approximately 80 KB.
+ </para>
+ </sect3>
+
+ <sect3 id="runtime-config-track-table-mutation-limitations">
+ <title>Limitations</title>
+ <para>
+ The track table mutation feature has the following limitations:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>PREPARE</literal> statements are not tracked. When a prepared statement
+ containing data modification is executed, the table mutation is not recorded.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ A maximum of 128 tables can be tracked per SELECT query for staleness checking.
+ This limit is shared with the query cache subsystem
+ (<literal>POOL_MAX_SELECT_OIDS</literal>).
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ If your application uses prepared statements and requires read-after-write consistency,
+ consider using explicit transaction routing or the <literal>/*NO LOAD BALANCE*/</literal>
+ comment directive for affected queries.
+ </para>
+ <para>
+ The following statement types <emphasis>are</emphasis> tracked and will mark tables as stale:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>INSERT</literal>, <literal>UPDATE</literal>, <literal>DELETE</literal>
+ statements (including those with <literal>RETURNING</literal> clauses).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>TRUNCATE</literal> statements (including multiple tables).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>MERGE</literal> statements (PostgreSQL 15+).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>WITH</literal> clauses containing data modifications (Common Table Expressions
+ with <literal>INSERT</literal>, <literal>UPDATE</literal>, or <literal>DELETE</literal>).
+ For example, <literal>WITH deleted AS (DELETE FROM t1 RETURNING *) SELECT * FROM deleted</literal>
+ will properly mark table <literal>t1</literal> as stale.
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ <emphasis>Transaction Rollback Behavior:</emphasis> Within explicit transactions, tables
+ are only marked as stale in shared memory when the transaction is committed. If the
+ transaction is rolled back, no tables are marked, since no actual data modification
+ occurred on replicas. This prevents rolled-back transactions from unnecessarily
+ disabling load balancing. For autocommit statements (outside explicit transactions),
+ tables are marked immediately upon command completion.
+ </para>
+
+ <para>
+ <emphasis>Cross-Session Impact and Safety Bounds:</emphasis>
+ Unlike <literal>dml_adaptive</literal> (which only affects the session that issued the write),
+ <literal>dml_adaptive_global</literal> affects all sessions reading the same table in the same database.
+ The following safety mechanisms bound this cross-session impact:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <emphasis>Maximum staleness cap:</emphasis> The <xref linkend="guc-track-table-mutation-max-staleness">
+ parameter (default: 60 seconds) limits how long any single table entry can continuously force primary
+ routing. Even under sustained writes, the entry expires after this period and is only renewed by
+ subsequent committed writes.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <emphasis>Database isolation:</emphasis> Table staleness tracking is scoped by database OID. Writes
+ in one database never affect load balancing decisions for sessions connected to a different database.
+ In multi-tenant deployments where tenants use separate databases, one tenant's write activity cannot
+ influence another tenant's query routing.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <emphasis>Committed writes only:</emphasis> Only committed transactions mark tables as stale.
+ Rolled-back transactions have no effect on the shared tracking state.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <emphasis>Bounded table map size:</emphasis> The shared memory table map has a fixed maximum size
+ (<xref linkend="guc-track-table-mutation-table-size">). At most this many tables can be marked stale
+ simultaneously, providing a natural ceiling on the feature's impact.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </sect3>
+
+ </sect2>
+
</sect1>
diff --git a/src/Makefile.am b/src/Makefile.am
index 4678ab530..39588af58 100644
--- a/src/Makefile.am
+++ b/src/Makefile.am
@@ -35,6 +35,7 @@ pgpool_SOURCES = main/main.c \
rewrite/pool_timestamp.c \
rewrite/pool_lobj.c \
utils/pool_select_walker.c \
+ utils/pool_track_table_mutation.c \
utils/strlcpy.c \
utils/psprintf.c \
utils/pool_params.c \
diff --git a/src/config/pool_config_variables.c b/src/config/pool_config_variables.c
index b775b2106..3039e32f0 100644
--- a/src/config/pool_config_variables.c
+++ b/src/config/pool_config_variables.c
@@ -290,6 +290,7 @@ static const struct config_enum_entry disable_load_balance_on_write_options[] =
{"trans_transaction", DLBOW_TRANS_TRANSACTION, false},
{"always", DLBOW_ALWAYS, false},
{"dml_adaptive", DLBOW_DML_ADAPTIVE, false},
+ {"dml_adaptive_global", DLBOW_DML_ADAPTIVE_GLOBAL, false},
{NULL, 0, false}
};
@@ -1777,6 +1778,19 @@ static struct config_int_array ConfigureNamesIntArray[] =
static struct config_double ConfigureNamesDouble[] =
{
+ {
+ {"track_table_mutation_ttl_factor",
+ CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "TTL multiplier for track table mutation "
+ "(TTL = replication_delay * factor)",
+ CONFIG_VAR_TYPE_DOUBLE, false, 0
+ },
+ &g_pool_config.track_table_mutation_ttl_factor,
+ 5.0, /* boot value: 5x replication delay */
+ 1.0, 100.0, /* min, max */
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
EMPTY_CONFIG_DOUBLE
};
@@ -2397,6 +2411,57 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"track_table_mutation_max_staleness",
+ CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "Maximum duration in milliseconds that a "
+ "table can be marked stale from its first "
+ "write. 0 disables the cap.",
+ CONFIG_VAR_TYPE_INT, false, GUC_UNIT_MS
+ },
+ &g_pool_config.track_table_mutation_max_staleness,
+ 60000, /* 60 seconds */
+ 0, 3600000, /* 0 to 1 hour */
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_cold_start_duration",
+ CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "Duration in milliseconds to force queries "
+ "to primary after child process starts.",
+ CONFIG_VAR_TYPE_INT, false, GUC_UNIT_MS
+ },
+ &g_pool_config.track_table_mutation_cold_start_duration,
+ 2000, /* 2 seconds */
+ 0, 60000, /* 0 to 60 seconds */
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_table_buckets",
+ CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Number of hash buckets for track table mutation.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_table_buckets,
+ 1024,
+ 64, 65536,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_table_size",
+ CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Maximum number of entries in track table mutation.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_table_size,
+ 2048,
+ 128, 131072,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
EMPTY_CONFIG_INT
};
diff --git a/src/context/pool_query_context.c b/src/context/pool_query_context.c
index a056ac596..c20a3a420 100644
--- a/src/context/pool_query_context.c
+++ b/src/context/pool_query_context.c
@@ -29,6 +29,7 @@
#include "utils/statistics.h"
#include "utils/pool_select_walker.h"
#include "utils/pool_stream.h"
+#include "utils/pool_track_table_mutation.h"
#include "context/pool_session_context.h"
#include "context/pool_query_context.h"
#include "parser/nodes.h"
@@ -1828,20 +1829,26 @@ is_in_list(char *name, List *list)
static bool
is_select_object_in_temp_write_list(Node *node, void *context)
{
- if (node == NULL || pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE)
+ if (node == NULL ||
+ !DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write))
return false;
if (IsA(node, RangeVar))
{
RangeVar *rgv = (RangeVar *) node;
- POOL_SESSION_CONTEXT *session_context = pool_get_session_context(false);
+ POOL_SESSION_CONTEXT *session_context;
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && session_context->is_in_transaction)
+ session_context = pool_get_session_context(false);
+
+ if (session_context->is_in_transaction)
{
ereport(DEBUG1,
- (errmsg("is_select_object_in_temp_write_list: \"%s\", found relation \"%s\"", (char *) context, rgv->relname)));
+ (errmsg("is_select_object_in_temp_write_list:"
+ " \"%s\", found relation \"%s\"",
+ (char *) context, rgv->relname)));
- return is_in_list(rgv->relname, session_context->transaction_temp_write_list);
+ return is_in_list(rgv->relname,
+ session_context->transaction_temp_write_list);
}
}
@@ -1880,15 +1887,22 @@ static char *get_associated_object_from_dml_adaptive_relations
void
check_object_relationship_list(char *name, bool is_func_name)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && pool_config->parsed_dml_adaptive_object_relationship_list)
+ bool is_adaptive;
+
+ is_adaptive =
+ (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE);
+
+ if (is_adaptive &&
+ pool_config->parsed_dml_adaptive_object_relationship_list)
{
POOL_SESSION_CONTEXT *session_context = pool_get_session_context(false);
if (session_context->is_in_transaction)
{
char *right_token =
- get_associated_object_from_dml_adaptive_relations
- (name, is_func_name ? OBJECT_TYPE_FUNCTION : OBJECT_TYPE_RELATION);
+ get_associated_object_from_dml_adaptive_relations
+ (name, is_func_name ? OBJECT_TYPE_FUNCTION : OBJECT_TYPE_RELATION);
if (right_token)
{
@@ -1947,7 +1961,7 @@ add_object_into_temp_write_list(Node *node, void *context)
static void
dml_adaptive(Node *node, char *query)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE)
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write))
{
/* Set/Unset transaction status flags */
if (IsA(node, TransactionStmt))
@@ -1966,6 +1980,45 @@ dml_adaptive(Node *node, char *query)
}
else if (is_commit_or_rollback_query(node))
{
+ /*
+ * For dml_adaptive_global: on COMMIT, flush the accumulated
+ * table writes to shared memory. On ROLLBACK, skip -- the
+ * writes never committed so no stale-read risk exists. This
+ * prevents polluting the table map with rolled-back
+ * transactions.
+ */
+ int dlbow =
+ pool_config->disable_load_balance_on_write;
+ List *wlist =
+ session_context->transaction_temp_write_list;
+
+ if (dlbow == DLBOW_DML_ADAPTIVE_GLOBAL &&
+ is_commit_query(node) &&
+ wlist != NIL)
+ {
+ ListCell *cell;
+ int dboid;
+
+ dboid =
+ pool_track_table_mutation_get_database_oid();
+ if (dboid > 0)
+ {
+ foreach(cell, wlist)
+ {
+ char *tname;
+ int toid;
+
+ tname = (char *) lfirst(cell);
+ toid =
+ pool_table_name_to_oid(tname);
+
+ if (toid > 0)
+ pool_track_table_mutation_mark_table_written(
+ toid, dboid);
+ }
+ }
+ }
+
session_context->is_in_transaction = false;
if (session_context->transaction_temp_write_list != NIL)
@@ -2008,7 +2061,7 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
session_context = pool_get_session_context(false);
backend = session_context->backend;
- /*
+ /*
* Collect/discard information for disable_load_balance_on_write =
* dml_adaptive case.
*/
@@ -2022,6 +2075,20 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
if (dest == POOL_PRIMARY)
{
pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
+
+ /*
+ * Resolve table and database OIDs now to populate relcache. This
+ * avoids potential hangs in CommandComplete where we shouldn't be
+ * running new queries against the backend.
+ */
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ int *oids;
+
+ pool_extract_table_oids(node, &oids);
+ pool_track_table_mutation_get_database_oid();
+ }
}
/* Should be sent to both primary and standby? */
else if (dest == POOL_BOTH)
@@ -2151,6 +2218,153 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
{
pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
}
+
+ /*
+ * Check track table mutation for recently written tables. If
+ * in cold start or any table was recently written, route to
+ * primary to avoid stale reads.
+ */
+ else if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ bool force_primary = false;
+ int lb_node;
+ POOL_QUERY_CONTEXT *qctx =
+ session_context->query_context;
+
+ if (pool_track_table_mutation_in_cold_start())
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load balance"
+ " because of track table"
+ " mutation cold start"),
+ errdetail("destination = PRIMARY"
+ " for query= \"%s\"",
+ query)));
+ force_primary = true;
+ }
+ else
+ {
+ SelectContext ctx;
+ int dboid;
+ int num_oids;
+ int i;
+
+ memset(&ctx, 0, sizeof(ctx));
+ num_oids =
+ pool_extract_table_oids_from_select_stmt(
+ node, &ctx);
+ if (num_oids > 0)
+ {
+ dboid =
+ pool_track_table_mutation_get_database_oid();
+
+ if (dboid <= 0)
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load"
+ " balance because"
+ " database oid was"
+ " unavailable"),
+ errdetail("destination"
+ " = PRIMARY for"
+ " query= \"%s\"",
+ query)));
+ force_primary = true;
+ }
+ else
+ {
+ for (i = 0; i < num_oids; i++)
+ {
+ bool stale;
+
+ stale =
+ pool_track_table_mutation_table_is_stale(
+ ctx.table_oids[i],
+ dboid);
+ if (stale)
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load"
+ " balance because"
+ " table \"%s\" was"
+ " recently written",
+ ctx.table_names[i]),
+ errdetail("destination"
+ " = PRIMARY for"
+ " query= \"%s\"",
+ query)));
+ force_primary = true;
+ break;
+ }
+ }
+ }
+ }
+ }
+
+ if (force_primary)
+ {
+ pool_set_node_to_be_sent(
+ query_context,
+ PRIMARY_NODE_ID);
+ }
+ else
+ {
+ if (pool_config->statement_level_load_balance)
+ {
+ session_context->load_balance_node_id =
+ select_load_balancing_node();
+ }
+
+ /*
+ * If replication delay is too much, and
+ * prefer_lower_delay_standby is true then elect the
+ * lowest-delayed node, otherwise send to primary.
+ */
+ lb_node =
+ session_context->load_balance_node_id;
+ if (STREAM &&
+ check_replication_delay(lb_node))
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load"
+ " balance because of"
+ " too much replication"
+ " delay"),
+ errdetail("destination"
+ " = %d for"
+ " query= \"%s\"",
+ dest, query)));
+
+ if (pool_config->prefer_lower_delay_standby)
+ {
+ lb_node =
+ select_load_balancing_node();
+ session_context->load_balance_node_id =
+ lb_node;
+ qctx->load_balance_node_id =
+ lb_node;
+ pool_set_node_to_be_sent(
+ query_context,
+ lb_node);
+ }
+ else
+ {
+ pool_set_node_to_be_sent(
+ query_context,
+ PRIMARY_NODE_ID);
+ }
+ }
+ else
+ {
+ qctx->load_balance_node_id =
+ session_context->load_balance_node_id;
+ pool_set_node_to_be_sent(
+ query_context,
+ qctx->load_balance_node_id);
+ }
+ }
+ }
else
{
if (pool_config->statement_level_load_balance)
@@ -2171,7 +2385,8 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
errdetail("destination = %d for query= \"%s\"", dest, query)));
/*
- * If prefer_lower_delay_standby is on, choose lower delay standby.
+ * If prefer_lower_delay_standby is on, choose lower
+ * delay standby.
*/
if (pool_config->prefer_lower_delay_standby)
{
@@ -2181,7 +2396,8 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
session_context->query_context->load_balance_node_id = session_context->load_balance_node_id;
pool_set_node_to_be_sent(query_context, session_context->query_context->load_balance_node_id);
}
- else /* delay is too much. prefer to send to primary */
+ else /* delay is too much. prefer to send to
+ * primary */
{
pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
}
@@ -2191,7 +2407,7 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
* Not streaming replication mode, or delay_threshold is 0
* or replication delay is small enough.
*/
- else
+ else
{
session_context->query_context->load_balance_node_id = session_context->load_balance_node_id;
pool_set_node_to_be_sent(query_context,
diff --git a/src/context/pool_session_context.c b/src/context/pool_session_context.c
index ded41c7fc..1e777b983 100644
--- a/src/context/pool_session_context.c
+++ b/src/context/pool_session_context.c
@@ -532,7 +532,7 @@ dump_sent_message(char *caller, POOL_SENT_MESSAGE *m)
static void
dml_adaptive_init(void)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE)
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write))
{
session_context->is_in_transaction = false;
session_context->transaction_temp_write_list = NIL;
@@ -542,7 +542,9 @@ dml_adaptive_init(void)
static void
dml_adaptive_destroy(void)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && session_context)
+ if (DLBOW_IS_DML_ADAPTIVE(
+ pool_config->disable_load_balance_on_write) &&
+ session_context)
{
if (session_context->transaction_temp_write_list != NIL)
list_free_deep(session_context->transaction_temp_write_list);
@@ -738,10 +740,13 @@ void
pool_set_writing_transaction(void)
{
/*
- * If disable_transaction_on_write is 'off' or 'dml_adaptive', then never
- * turn on writing transaction flag.
+ * If disable_load_balance_on_write is 'off' or 'dml_adaptive' or
+ * 'dml_adaptive_global', then never turn on writing transaction flag.
*/
- if (pool_config->disable_load_balance_on_write != DLBOW_OFF && pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE)
+ if (pool_config->disable_load_balance_on_write !=
+ DLBOW_OFF &&
+ !DLBOW_IS_DML_ADAPTIVE(
+ pool_config->disable_load_balance_on_write))
{
pool_get_session_context(false)->writing_transaction = true;
ereport(DEBUG5,
@@ -758,6 +763,28 @@ pool_is_writing_transaction(void)
return pool_get_session_context(false)->writing_transaction;
}
+/*
+ * Do we have a DML write in this transaction tracked by dml_adaptive
+ * or dml_adaptive_global mode? Used to bypass the query cache when
+ * those modes are active, since they do not set writing_transaction.
+ */
+bool
+pool_has_dml_adaptive_write_in_transaction(void)
+{
+ POOL_SESSION_CONTEXT *s;
+
+ if (!DLBOW_IS_DML_ADAPTIVE(
+ pool_config->disable_load_balance_on_write))
+ return false;
+
+ s = pool_get_session_context(true);
+ if (s == NULL)
+ return false;
+
+ return s->is_in_transaction &&
+ s->transaction_temp_write_list != NIL;
+}
+
/*
* Error doesn't occur in this transaction yet.
*/
diff --git a/src/include/context/pool_session_context.h b/src/include/context/pool_session_context.h
index 446357de3..5d43eac37 100644
--- a/src/include/context/pool_session_context.h
+++ b/src/include/context/pool_session_context.h
@@ -380,6 +380,7 @@ extern POOL_SENT_MESSAGE *pool_get_sent_message_by_query_context(POOL_QUERY_CONT
extern void pool_unset_writing_transaction(void);
extern void pool_set_writing_transaction(void);
extern bool pool_is_writing_transaction(void);
+extern bool pool_has_dml_adaptive_write_in_transaction(void);
extern void pool_unset_failed_transaction(void);
extern void pool_set_failed_transaction(void);
extern bool pool_is_failed_transaction(void);
diff --git a/src/include/pool.h b/src/include/pool.h
index 65907dcf1..79d7988fc 100644
--- a/src/include/pool.h
+++ b/src/include/pool.h
@@ -424,7 +424,7 @@ typedef enum
#define Min(x, y) ((x) < (y) ? (x) : (y))
-#define MAX_NUM_SEMAPHORES 8
+#define MAX_NUM_SEMAPHORES 9
#define CONN_COUNTER_SEM 0
#define REQUEST_INFO_SEM 1
#define QUERY_CACHE_STATS_SEM 2
@@ -434,6 +434,7 @@ typedef enum
#define FOLLOW_PRIMARY_SEM 6
#define MAIN_EXIT_HANDLER_SEM 7 /* used in exit_hander in pgpool main
* process */
+#define TRACK_TABLE_MUTATION_TABLE_SEM 8
#define MAX_REQUEST_QUEUE_SIZE 10
#define MAX_SEC_WAIT_FOR_CLUSTER_TRANSACTION 10 /* time in seconds to keep
diff --git a/src/include/pool_config.h b/src/include/pool_config.h
index 9a397d166..b8abadd50 100644
--- a/src/include/pool_config.h
+++ b/src/include/pool_config.h
@@ -105,9 +105,13 @@ typedef enum DLBOW_OPTION
DLBOW_TRANSACTION,
DLBOW_TRANS_TRANSACTION,
DLBOW_ALWAYS,
- DLBOW_DML_ADAPTIVE
+ DLBOW_DML_ADAPTIVE,
+ DLBOW_DML_ADAPTIVE_GLOBAL
} DLBOW_OPTION;
+#define DLBOW_IS_DML_ADAPTIVE(opt) \
+ ((opt) == DLBOW_DML_ADAPTIVE || (opt) == DLBOW_DML_ADAPTIVE_GLOBAL)
+
typedef enum RELQTARGET_OPTION
{
RELQTARGET_PRIMARY = 1,
@@ -363,8 +367,22 @@ typedef struct
char *sr_check_password; /* password for sr_check_user */
char *sr_check_database; /* PostgreSQL database name for streaming
* replication check */
- char *replication_delay_source_cmd; /* external command for replication delay */
- int replication_delay_source_timeout; /* timeout for external command in seconds */
+ char *replication_delay_source_cmd; /* external command for
+ * replication delay */
+ int replication_delay_source_timeout; /* timeout for external
+ * command in seconds */
+
+ /* Track table mutation configuration */
+ double track_table_mutation_ttl_factor; /* TTL multiplier for
+ * replication delay */
+ int track_table_mutation_max_staleness; /* max staleness duration
+ * ms */
+ int track_table_mutation_cold_start_duration; /* cold start duration
+ * ms */
+ int track_table_mutation_table_buckets; /* hash buckets for table
+ * map */
+ int track_table_mutation_table_size; /* max table map entries */
+
char *failover_command; /* execute command when failover happens */
char *follow_primary_command; /* execute command when failover is
* ended */
diff --git a/src/include/utils/pool_track_table_mutation.h b/src/include/utils/pool_track_table_mutation.h
new file mode 100644
index 000000000..dfbac666d
--- /dev/null
+++ b/src/include/utils/pool_track_table_mutation.h
@@ -0,0 +1,167 @@
+/* -*-pgsql-c-*- */
+/*
+ * pgpool: a language independent connection pool server for PostgreSQL
+ * written by Tatsuo Ishii
+ *
+ * Copyright (c) 2003-2026 PgPool Global Development Group
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose and without fee is hereby
+ * granted, provided that the above copyright notice appear in all
+ * copies and that both that copyright notice and this permission
+ * notice appear in supporting documentation, and that the name of the
+ * author not be used in advertising or publicity pertaining to
+ * distribution of the software without specific, written prior
+ * permission. The author makes no representations about the
+ * suitability of this software for any purpose. It is provided "as
+ * is" without express or implied warranty.
+ *
+ * pool_track_table_mutation.h: In-memory tracking of
+ * recently written tables to prevent stale reads.
+ */
+
+#ifndef POOL_TRACK_TABLE_MUTATION_H
+#define POOL_TRACK_TABLE_MUTATION_H
+
+#include "pool.h"
+#include <sys/time.h>
+
+/*
+ * Invalid index marker for linked lists
+ */
+#define TRACK_TABLE_MUTATION_INVALID_INDEX (-1)
+
+/*
+ * Default TTL in microseconds (100ms) used when replication delay is unknown
+ */
+#define TRACK_TABLE_MUTATION_DEFAULT_TTL_US (100 * 1000)
+
+/*
+ * Entry in the table mutation hash table (keyed by table/database oids)
+ */
+typedef struct TrackTableMutationEntry
+{
+ int table_oid; /* Table oid */
+ int dboid; /* Database oid */
+ struct timeval first_write_time; /* When the entry was first created */
+ struct timeval last_write_time; /* When the table was last written */
+ uint32 hash; /* Pre-computed hash value */
+ int next; /* Next in collision chain */
+ bool in_use; /* Is this entry in use? */
+} TrackTableMutationEntry;
+
+/*
+ * Header for the table mutation hash table in shared memory
+ */
+typedef struct TrackTableMutationHashTable
+{
+ int num_buckets; /* Number of hash buckets */
+ int max_entries; /* Maximum entries allowed */
+ int num_entries; /* Current number of entries */
+ int free_list_head; /* Head of free entry list */
+
+ /*
+ * Two arrays follow this header in shared memory:
+ * int buckets[num_buckets]; TrackTableMutationEntry entries[max_entries];
+ */
+} TrackTableMutationHashTable;
+
+/*
+ * Global state for track table mutation feature
+ */
+typedef struct TrackTableMutationState
+{
+ bool initialized; /* Shmem initialized? */
+ uint64 current_ttl_us; /* Current TTL in microseconds */
+ struct timeval ttl_last_updated; /* When TTL was last updated */
+ struct timeval last_cleanup_time; /* When last expired cleanup ran */
+ struct timeval global_cold_start_until; /* Global cold start end time */
+ uint32 stats_queries_checked; /* Queries checked */
+ uint32 stats_forced_primary; /* Forced to primary */
+ uint32 stats_allowed_replica; /* Allowed to replica */
+} TrackTableMutationState;
+
+/*
+ * Main shared memory structure containing all components
+ */
+typedef struct TrackTableMutationShmem
+{
+ TrackTableMutationState state;
+ TrackTableMutationHashTable *table_map;
+} TrackTableMutationShmem;
+
+/* ----------------
+ * Public API functions
+ * ----------------
+ */
+
+/*
+ * Initialize shared memory structures for track table mutation.
+ * Called from pgpool_main.c after pool_init_pool_info().
+ */
+extern void pool_track_table_mutation_init(void);
+
+/*
+ * Initialize per-child process state for track table mutation.
+ * Called from child.c when a new child process starts.
+ * Sets up cold start tracking.
+ */
+extern void pool_track_table_mutation_child_init(void);
+
+/*
+ * Check if the child process is in cold start period.
+ * During cold start, all queries are routed to primary.
+ * Returns true if in cold start, false otherwise.
+ */
+extern bool pool_track_table_mutation_in_cold_start(void);
+
+/*
+ * Trigger a global cold start period for all processes.
+ * Used after watchdog leader change to avoid stale reads.
+ */
+extern void pool_track_table_mutation_trigger_global_cold_start(void);
+
+/*
+ * Get oid of current database.
+ */
+extern int pool_track_table_mutation_get_database_oid(void);
+
+/*
+ * Check if a table was recently written to (is "stale").
+ * If stale, reads from this table should go to primary.
+ * Returns true if table is stale (recently written), false otherwise.
+ */
+extern bool pool_track_table_mutation_table_is_stale(
+ int table_oid, int dboid);
+
+/*
+ * Mark tables as recently written. Called when an autocommit write
+ * completes; writes in explicit transactions are marked at COMMIT.
+ * table_oids: array of table oids
+ * num_tables: number of tables in array
+ * dboid: database oid
+ */
+extern void pool_track_table_mutation_mark_tables_written(
+ const int *table_oids, int num_tables, int dboid);
+
+/*
+ * Convenience function to mark a single table as written.
+ * table_oid: table oid
+ * dboid: database oid
+ */
+extern void pool_track_table_mutation_mark_table_written(
+ int table_oid, int dboid);
+
+/*
+ * Update the TTL based on current replication delay.
+ * Called from pool_worker_child.c when replication delay is updated.
+ * delay_us: replication delay in microseconds
+ */
+extern void pool_track_table_mutation_update_ttl(uint64 delay_us);
+
+/*
+ * Calculate required shared memory size for track table mutation.
+ */
+extern Size pool_track_table_mutation_shmem_size(void);
+
+#endif /* POOL_TRACK_TABLE_MUTATION_H */
diff --git a/src/main/pgpool_main.c b/src/main/pgpool_main.c
index 32bcb0a1f..e41c575be 100644
--- a/src/main/pgpool_main.c
+++ b/src/main/pgpool_main.c
@@ -57,6 +57,7 @@
#include "auth/pool_passwd.h"
#include "auth/pool_hba.h"
#include "query_cache/pool_memqcache.h"
+#include "utils/pool_track_table_mutation.h"
#include "watchdog/wd_internal_commands.h"
#include "watchdog/wd_lifecheck.h"
#include "watchdog/watchdog.h"
@@ -1501,11 +1502,14 @@ sigusr1_interrupt_processor(void)
if (user1SignalSlot->signalFlags[SIG_WATCHDOG_STATE_CHANGED])
{
+ WD_STATES wd_state;
+
ereport(LOG,
(errmsg("Pgpool-II parent process received watchdog state change signal from watchdog")));
user1SignalSlot->signalFlags[SIG_WATCHDOG_STATE_CHANGED] = false;
- if (wd_internal_get_watchdog_local_node_state() == WD_STANDBY)
+ wd_state = wd_internal_get_watchdog_local_node_state();
+ if (wd_state == WD_STANDBY)
{
ereport(LOG,
(errmsg("we have joined the watchdog cluster as STANDBY node"),
@@ -1519,6 +1523,12 @@ sigusr1_interrupt_processor(void)
*/
pool_release_follow_primary_lock(true);
}
+ else if (wd_state == WD_COORDINATOR &&
+ pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ pool_track_table_mutation_trigger_global_cold_start();
+ }
}
if (user1SignalSlot->signalFlags[SIG_FAILOVER_INTERRUPT])
{
@@ -3084,6 +3094,16 @@ initialize_shared_mem_objects(bool clear_memcache_oidmaps)
elog(DEBUG1, "watchdog: %zu bytes requested for shared memory", MAXALIGN(wd_ipc_get_shared_mem_size()));
}
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ size += MAXALIGN(pool_track_table_mutation_shmem_size());
+ elog(DEBUG1,
+ "track_table_mutation: %zu bytes requested"
+ " for shared memory",
+ MAXALIGN(pool_track_table_mutation_shmem_size()));
+ }
+
initialize_shared_memory_main_segment(size);
/* Move the backend descriptors to shared memory */
@@ -3200,6 +3220,13 @@ initialize_shared_mem_objects(bool clear_memcache_oidmaps)
wd_ipc_initialize_data();
}
+ /* Initialize track table mutation for recently written tables */
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ pool_track_table_mutation_init();
+ }
+
}
/*
diff --git a/src/protocol/CommandComplete.c b/src/protocol/CommandComplete.c
index a3b8f0ea1..f445f268b 100644
--- a/src/protocol/CommandComplete.c
+++ b/src/protocol/CommandComplete.c
@@ -38,6 +38,8 @@
#include "utils/palloc.h"
#include "utils/memutils.h"
#include "utils/pool_stream.h"
+#include "utils/pool_track_table_mutation.h"
+#include "query_cache/pool_memqcache.h"
static int extract_ntuples(char *message);
static POOL_STATUS handle_mismatch_tuples(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend, char *packet, int packetlen, bool command_complete);
@@ -304,6 +306,32 @@ handle_query_context(POOL_CONNECTION_POOL *backend)
node = session_context->query_context->parse_tree;
+ /*
+ * Track table writes for dml_adaptive_global feature. For autocommit
+ * statements (not in explicit transaction), mark tables immediately. For
+ * explicit transactions, marking is deferred to COMMIT in dml_adaptive()
+ * so that ROLLBACKed writes don't pollute the shared memory table map.
+ */
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL &&
+ node != NULL &&
+ !session_context->is_in_transaction)
+ {
+ int *oids;
+ int num_oids;
+
+ num_oids = pool_extract_table_oids(node, &oids);
+ if (num_oids > 0)
+ {
+ int dboid;
+
+ dboid = pool_track_table_mutation_get_database_oid();
+ if (dboid > 0)
+ pool_track_table_mutation_mark_tables_written(
+ oids, num_oids, dboid);
+ }
+ }
+
if (IsA(node, PrepareStmt))
{
if (session_context->uncompleted_message)
diff --git a/src/protocol/child.c b/src/protocol/child.c
index 761876f53..4a527c84c 100644
--- a/src/protocol/child.c
+++ b/src/protocol/child.c
@@ -57,6 +57,7 @@
#include "utils/elog.h"
#include "utils/ps_status.h"
#include "utils/timestamp.h"
+#include "utils/pool_track_table_mutation.h"
#include "context/pool_process_context.h"
#include "context/pool_session_context.h"
@@ -213,6 +214,13 @@ do_child(int *fds)
/* Initialize per process context */
pool_init_process_context();
+ /* Initialize track table mutation child state for cold start tracking */
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ pool_track_table_mutation_child_init();
+ }
+
/* initialize connection pool */
if (pool_init_cp())
{
diff --git a/src/protocol/pool_proto_modules.c b/src/protocol/pool_proto_modules.c
index f9458bb55..5bee63a15 100644
--- a/src/protocol/pool_proto_modules.c
+++ b/src/protocol/pool_proto_modules.c
@@ -268,6 +268,7 @@ SimpleQuery(POOL_CONNECTION *frontend,
*/
if (pool_config->memory_cache_enabled && is_likely_select &&
!pool_is_writing_transaction() &&
+ !pool_has_dml_adaptive_write_in_transaction() &&
TSTATE(backend, MAIN_REPLICA ? PRIMARY_NODE_ID : REAL_MAIN_NODE_ID) != 'E' &&
!query_cache_disabled())
{
@@ -1025,6 +1026,7 @@ Execute(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend,
* partial_fetch is true, cannot use cache.
*/
if (pool_config->memory_cache_enabled && !pool_is_writing_transaction() &&
+ !pool_has_dml_adaptive_write_in_transaction() &&
(TSTATE(backend, MAIN_REPLICA ? PRIMARY_NODE_ID : REAL_MAIN_NODE_ID) != 'E')
&& pool_is_likely_select(query) && !query_cache_disabled() &&
(query_context->atEnd || num_rows == 0) &&
@@ -1461,7 +1463,9 @@ Parse(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend,
pool_where_to_send(query_context, query_context->original_query,
query_context->parse_tree);
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && strlen(name) != 0)
+ if (DLBOW_IS_DML_ADAPTIVE(
+ pool_config->disable_load_balance_on_write)
+ && strlen(name) != 0)
pool_setall_node_to_be_sent(query_context);
if (REPLICATION)
@@ -1804,7 +1808,7 @@ Bind(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend,
return POOL_END;
}
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE &&
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write) &&
TSTATE(backend, MAIN_REPLICA ? PRIMARY_NODE_ID : REAL_MAIN_NODE_ID) == 'T')
{
pool_where_to_send(query_context, query_context->original_query,
diff --git a/src/sample/pgpool.conf.sample-stream b/src/sample/pgpool.conf.sample-stream
index 1ac982907..ce9b92da0 100644
--- a/src/sample/pgpool.conf.sample-stream
+++ b/src/sample/pgpool.conf.sample-stream
@@ -478,6 +478,14 @@ backend_clustering_mode = streaming_replication
# modified within the current explicit transaction will
# not be load balanced until the end of the transaction.
#
+ # dml_adaptive_global:
+ # Superset of dml_adaptive. In addition to per-transaction
+ # tracking, uses shared memory to track recently written
+ # tables across all sessions. Reads from recently written
+ # tables are routed to primary until a TTL (based on
+ # replication delay) expires. Requires additional shared
+ # memory. See track_table_mutation_* parameters below.
+ #
# always:
# if a write query is issued, read queries will
# not be load balanced until the session ends.
@@ -499,6 +507,43 @@ backend_clustering_mode = streaming_replication
#statement_level_load_balance = off
# Enables statement level load balancing
+# - Track Table Mutation (used by dml_adaptive_global) -
+ # WARNING: dml_adaptive_global increases shared memory usage
+ # Default settings require ~80 KB shared memory for table tracking
+
+#track_table_mutation_ttl_factor = 5.0
+ # TTL multiplier: TTL = replication_delay * factor
+ # Higher values provide more safety margin
+ # Range: 1.0-100.0 (default: 5.0)
+ # (change requires reload)
+
+#track_table_mutation_max_staleness = 60000
+ # Maximum duration (ms) a table can stay marked stale,
+ # measured from its first write. Bounds cross-session
+ # impact: even under continuous writes the mark expires
+ # after this period; only a later write marks it again.
+ # 0 disables the cap. Range: 0-3600000 (default: 60000 = 60s)
+ # (change requires reload)
+
+#track_table_mutation_cold_start_duration = 2000
+ # Duration in milliseconds to route all queries to primary
+ # after child process starts (cold start period)
+ # Range: 0-60000 ms (default: 2000 ms = 2 seconds)
+ # Set to 0 to disable cold start behavior
+ # (change requires reload)
+
+#track_table_mutation_table_buckets = 1024
+ # Number of hash buckets for track table mutation
+ # Higher values reduce hash collisions
+ # Range: 64-65536 (default: 1024)
+ # (change requires restart)
+
+#track_table_mutation_table_size = 2048
+ # Maximum number of tables to track simultaneously
+ # Range: 128-131072 (default: 2048)
+ # (change requires restart)
+
+
#------------------------------------------------------------------------------
# STREAMING REPLICATION MODE
#------------------------------------------------------------------------------
diff --git a/src/streaming_replication/pool_worker_child.c b/src/streaming_replication/pool_worker_child.c
index 311b63865..cdd570396 100644
--- a/src/streaming_replication/pool_worker_child.c
+++ b/src/streaming_replication/pool_worker_child.c
@@ -58,6 +58,7 @@
#include "utils/pool_ip.h"
#include "utils/ps_status.h"
#include "utils/pool_stream.h"
+#include "utils/pool_track_table_mutation.h"
#include "context/pool_process_context.h"
#include "context/pool_session_context.h"
@@ -419,6 +420,7 @@ check_replication_time_lag(void)
BackendInfo *bkinfo;
uint64 lag;
uint64 delay_threshold_by_time;
+ uint64 max_delay_us = 0;
ErrorContextCallback callback;
int active_standby_node;
bool replication_delay_by_time;
@@ -643,6 +645,10 @@ check_replication_time_lag(void)
* seconds to micro
* seconds */
+ /* Track max delay for mutation TTL */
+ if (lag > max_delay_us)
+ max_delay_us = lag;
+
/* Log delay if necessary */
if ((pool_config->log_standby_delay == LSD_ALWAYS && lag > 0) ||
(pool_config->log_standby_delay == LSD_OVER_THRESHOLD &&
@@ -668,6 +674,13 @@ check_replication_time_lag(void)
}
}
+ /* Update track table mutation TTL from the max observed time-based
+ * replication delay (dml_adaptive_global only). */
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL &&
+ replication_delay_by_time && max_delay_us > 0)
+ pool_track_table_mutation_update_ttl(max_delay_us);
+
error_context_stack = callback.previous;
}
@@ -695,6 +708,7 @@ check_replication_time_lag_with_cmd(void)
double delay_ms;
uint64 delay;
uint64 delay_threshold_by_time;
+ uint64 max_delay_us = 0; /* Track max delay for mutation TTL */
int token_count = 0;
int primary_node_id;
int save_errno;
@@ -1003,6 +1017,10 @@ check_replication_time_lag_with_cmd(void)
bkinfo->standby_delay = delay;
bkinfo->standby_delay_by_time = true;
+ /* Track maximum delay for table mutation map TTL calculation */
+ if (delay > max_delay_us)
+ max_delay_us = delay;
+
/*
* Log delay if necessary. threshold is in milliseconds, convert
* to microseconds.
@@ -1021,6 +1039,12 @@ check_replication_time_lag_with_cmd(void)
token = strtok_r(NULL, " \t\n", &saveptr);
}
+ /* Update table mutation TTL based on max observed delay */
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL &&
+ max_delay_us > 0)
+ pool_track_table_mutation_update_ttl(max_delay_us);
+
}
PG_CATCH();
{
diff --git a/src/test/regression/libs.sh b/src/test/regression/libs.sh
index 7c5a0c182..1c8ae392d 100644
--- a/src/test/regression/libs.sh
+++ b/src/test/regression/libs.sh
@@ -42,6 +42,8 @@ function wait_for_failover_done {
function clean_all {
pgrep pgpool | xargs kill -9 > /dev/null 2>&1
pgrep postgres | xargs kill -9 > /dev/null 2>&1
+ # Clean up leaked SysV IPC resources left behind by kill -9
+ ipcrm --all 2>/dev/null || true
rm -f $PGSOCKET_DIR/.s.PGSQL.*
netstat -t -p 2>/dev/null|grep pgpool
}
diff --git a/src/test/regression/tests/043.track_table_mutation/test.sh b/src/test/regression/tests/043.track_table_mutation/test.sh
new file mode 100755
index 000000000..8b4dd17b8
--- /dev/null
+++ b/src/test/regression/tests/043.track_table_mutation/test.sh
@@ -0,0 +1,354 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for track table mutation feature (in-memory table tracking).
+# Tests routing of queries based on recently written tables.
+#
+source $TESTLIBS
+TESTDIR=testdir
+PSQL=$PGBIN/psql
+PSQLOPTS="-a -q -X"
+PGPOOLBIN=$PGPOOL_INSTALL_DIR/bin
+export PGDATABASE=test
+
+# Only run in streaming replication mode since that's the target use case
+for mode in s
+do
+ rm -fr $TESTDIR
+ mkdir $TESTDIR
+ cd $TESTDIR
+
+ # Create test environment with 2 nodes
+ echo -n "creating test environment..."
+ $PGPOOL_SETUP -m $mode -n 2 || exit 1
+ echo "done."
+
+ source ./bashrc.ports
+
+ # Configure track table mutation feature via dml_adaptive_global
+ echo "disable_load_balance_on_write = 'dml_adaptive_global'" >> etc/pgpool.conf
+ echo "track_table_mutation_ttl_factor = 5.0" >> etc/pgpool.conf
+ echo "track_table_mutation_cold_start_duration = 10000" >> etc/pgpool.conf
+
+ # Enable load balancing explicitly
+ echo "load_balance_mode = on" >> etc/pgpool.conf
+
+ # Configure weights so we can distinguish routing
+ # Backend 0 (primary) weight=0, Backend 1 (standby) weight=1
+ # This means load balanced queries go to node 1 by default
+ echo "backend_weight0 = 0" >> etc/pgpool.conf
+ echo "backend_weight1 = 1" >> etc/pgpool.conf
+
+ # Enable debug logging to see routing decisions
+ echo "log_min_messages = debug1" >> etc/pgpool.conf
+
+ ./startall
+
+ export PGPORT=$PGPOOL_PORT
+ export PGHOST=localhost
+
+ wait_for_pgpool_startup
+
+ # Create test tables
+ $PSQL test <<EOF
+CREATE TABLE t1(i INTEGER);
+CREATE TABLE t2(i INTEGER);
+CREATE TABLE t3(i INTEGER);
+EOF
+
+ echo "=== Test 1: Cold Start Routing ==="
+ # During cold start, all queries should go to primary
+ # Restart pgpool to trigger cold start
+ ./shutdownall
+ ./startall
+ wait_for_pgpool_startup
+
+ # Immediately query - should go to primary due to cold start
+ $PSQL test -c "SELECT 'cold_start_test' as marker, * FROM t1;" > /dev/null 2>&1
+
+ # Check log for cold start message (use -a to handle binary log files)
+ if grep -a -q "could not load balance because of track table mutation cold start" log/pgpool.log; then
+ echo "Test 1 PASSED: Cold start routing works"
+ else
+ echo "Test 1 FAILED: Cold start routing not detected"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 2: Wait for cold start to end ==="
+ # Wait for cold start period to end (10 seconds).
+ # Use generous margin to avoid flakiness under load (e.g. full regression suite).
+ sleep 12
+
+ # Clear the log
+ > log/pgpool.log
+
+ # Now a clean table query should load balance (go to node 1)
+ $PSQL test -c "SELECT 'after_cold_start' as marker, * FROM t3;" > /dev/null 2>&1
+
+ # After cold start, queries to clean tables should load balance
+ # Check that it was NOT forced to primary by the cold start rule
+ if grep -a -q "could not load balance because of track table mutation cold start" log/pgpool.log; then
+ echo "Test 2 FAILED: Still in cold start after waiting"
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 2 PASSED: Cold start ended correctly"
+
+ echo "=== Test 3: Write-then-Read Routing ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Write to t1 and then read - use single connection to ensure same session
+ $PSQL test <<EOF
+INSERT INTO t1 VALUES (1);
+SELECT 'write_read_test' as marker, * FROM t1;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ # Check log for table staleness message
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 3 PASSED: Write-then-read routing works"
+ else
+ echo "Test 3 FAILED: Table staleness not detected after write"
+ # Show relevant log entries for debugging
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 4: Clean Table Still Load Balances ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Read from t2 (never written to) - should load balance
+ $PSQL test -c "SELECT 'clean_table_test' as marker, * FROM t2;" > /dev/null 2>&1
+
+ # Should NOT see track table mutation blocking message for t2
+ if grep -a -q "could not load balance because table.*t2.*was recently written" log/pgpool.log; then
+ echo "Test 4 FAILED: Clean table incorrectly marked as stale"
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 4 PASSED: Clean tables still load balance"
+
+ echo "=== Test 5: UPDATE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Update t2 and then read - use single connection
+ $PSQL test <<EOF
+UPDATE t2 SET i = 999 WHERE i = 0;
+SELECT 'update_test' as marker, * FROM t2;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 5 PASSED: UPDATE marks table as stale"
+ else
+ echo "Test 5 FAILED: UPDATE did not mark table as stale"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 6: DELETE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Delete from t3 and then read - use single connection
+ $PSQL test <<EOF
+DELETE FROM t3 WHERE i = 0;
+SELECT 'delete_test' as marker, * FROM t3;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 6 PASSED: DELETE marks table as stale"
+ else
+ echo "Test 6 FAILED: DELETE did not mark table as stale"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 7: TRUNCATE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create a fresh table for TRUNCATE test
+ $PSQL test -c "CREATE TABLE t_truncate(i INTEGER);" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_truncate VALUES (1), (2), (3);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log again
+ > log/pgpool.log
+
+ # Truncate and then read - use single connection
+ $PSQL test <<EOF
+TRUNCATE t_truncate;
+SELECT 'truncate_test' as marker, * FROM t_truncate;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 7 PASSED: TRUNCATE marks table as stale"
+ else
+ echo "Test 7 FAILED: TRUNCATE did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 8: WITH Clause (CTE with DELETE) Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create a fresh table for WITH test
+ $PSQL test -c "CREATE TABLE t_cte(i INTEGER);" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_cte VALUES (1), (2), (3);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log again
+ > log/pgpool.log
+
+ # Use WITH clause with DELETE, then read from the table
+ $PSQL test <<EOF
+WITH deleted AS (DELETE FROM t_cte WHERE i = 1 RETURNING *)
+SELECT * FROM deleted;
+SELECT 'cte_test' as marker, * FROM t_cte;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 8 PASSED: WITH clause (CTE) marks table as stale"
+ else
+ echo "Test 8 FAILED: WITH clause (CTE) did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ # Test 9: MERGE (PostgreSQL 15+ only)
+ PG_MAJOR_VERSION=$($PSQL -t -c "SELECT substring(version() from 'PostgreSQL ([0-9]+)');" | tr -d ' ')
+ if [ "$PG_MAJOR_VERSION" -ge 15 ] 2>/dev/null; then
+ echo "=== Test 9: MERGE Marks Table as Stale (PostgreSQL $PG_MAJOR_VERSION) ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create tables for MERGE test
+ $PSQL test -c "CREATE TABLE t_merge_target(id INTEGER PRIMARY KEY, val TEXT);" > /dev/null 2>&1
+ $PSQL test -c "CREATE TABLE t_merge_source(id INTEGER, val TEXT);" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_merge_target VALUES (1, 'old');" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_merge_source VALUES (1, 'new'), (2, 'insert');" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log again
+ > log/pgpool.log
+
+ # Use MERGE, then read from the target table
+ $PSQL test <<EOF
+MERGE INTO t_merge_target t
+USING t_merge_source s ON t.id = s.id
+WHEN MATCHED THEN UPDATE SET val = s.val
+WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.val);
+SELECT 'merge_test' as marker, * FROM t_merge_target;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 9 PASSED: MERGE marks table as stale"
+ else
+ echo "Test 9 FAILED: MERGE did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+ else
+ echo "=== Test 9: MERGE skipped (requires PostgreSQL 15+, have $PG_MAJOR_VERSION) ==="
+ fi
+
+ echo "=== Test 10: ROLLBACK Does NOT Mark Table as Stale ==="
+ # Create a fresh table for rollback test
+ $PSQL test -c "CREATE TABLE t_rollback(i INTEGER);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log
+ > log/pgpool.log
+
+ # Write inside a transaction, then rollback
+ $PSQL test <<EOF
+BEGIN;
+INSERT INTO t_rollback VALUES (1);
+ROLLBACK;
+SELECT 'rollback_test' as marker, * FROM t_rollback;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ # Should NOT see t_rollback marked as stale since the write was rolled back
+ if grep -a -q "could not load balance because table.*t_rollback.*was recently written" log/pgpool.log; then
+ echo "Test 10 FAILED: Rolled-back write incorrectly marked table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 10 PASSED: ROLLBACK does not mark table as stale"
+
+ echo "=== Test 11: COMMIT Marks Table as Stale ==="
+ # Create a fresh table for commit test
+ $PSQL test -c "CREATE TABLE t_commit(i INTEGER);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log
+ > log/pgpool.log
+
+ # Write inside a transaction, then commit, then read
+ $PSQL test <<EOF
+BEGIN;
+INSERT INTO t_commit VALUES (1);
+COMMIT;
+SELECT 'commit_test' as marker, * FROM t_commit;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 11 PASSED: COMMIT marks table as stale"
+ else
+ echo "Test 11 FAILED: Committed write did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ echo ""
+ echo "=== All Track Table Mutation Tests PASSED ==="
+
+ ./shutdownall
+
+ cd ..
+done
+
+exit 0
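Tests 3 to 11 above all end with the same log check. As a minimal sketch only, that check could be factored into a helper like the one below. The name `forced_to_primary` is hypothetical and not part of the patch; the message text is copied from the grep patterns the test already uses.

```shell
# Hypothetical helper, not part of the patch.
# $1 = pgpool log file, $2 = optional table name to narrow the match.
# Returns 0 if the log shows a read that was kept off the replica because
# the table was recently written, non-zero otherwise.
forced_to_primary() {
    # -a treats the log as text even if it contains binary bytes
    grep -a -q "could not load balance because table.*${2:-}.*was recently written" "$1"
}
```

A negative check in the style of Test 4 would then read `if forced_to_primary log/pgpool.log t2; then ... fail ...; fi`. Passing the table name keeps a positive check from matching a message about some other table.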
diff --git a/src/test/regression/tests/044.track_table_mutation_watchdog/test.sh b/src/test/regression/tests/044.track_table_mutation_watchdog/test.sh
new file mode 100755
index 000000000..c50c213d6
--- /dev/null
+++ b/src/test/regression/tests/044.track_table_mutation_watchdog/test.sh
@@ -0,0 +1,184 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# Test script for track table mutation global cold start
+# on watchdog leader change.
+#
+# Uses $WATCHDOG_SETUP to create a 2-node watchdog cluster,
+# then verifies that when the leader is stopped the new
+# leader triggers a global cold start.
+#-------------------------------------------------------------------
+source $TESTLIBS
+TESTDIR=testdir
+PSQL=$PGBIN/psql
+success_count=0
+
+dir=`pwd`
+rm -fr $TESTDIR
+mkdir $TESTDIR
+cd $TESTDIR
+
+# Create 2-node watchdog cluster
+$WATCHDOG_SETUP -wn 2 || exit 1
+
+# Ensure per-node scripts are executable
+# (sed -i in watchdog_setup can strip permissions)
+chmod 755 pgpool*/startall pgpool*/shutdownall
+
+# Append track_table_mutation config to both nodes
+for i in 0 1
+do
+ cat >> pgpool${i}/etc/pgpool.conf <<EOF
+disable_load_balance_on_write = 'dml_adaptive_global'
+track_table_mutation_cold_start_duration = 2000
+enable_consensus_with_half_votes = on
+log_min_messages = debug1
+EOF
+done
+
+./startall
+export PCPPASSFILE=$dir/$TESTDIR/pgpool0/pcppass
+
+# Wait for watchdog lifecheck on node 0
+echo -n "waiting for watchdog node 0 starting up..."
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep "lifecheck started" \
+ pgpool0/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ break
+ fi
+ sleep 2
+done
+echo "done."
+
+# Test 1: Verify leader came up
+echo "=== Test 1: Waiting for the pgpool leader... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep "I am the cluster leader node" \
+ pgpool0/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 1 PASSED: Leader brought up."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 1 ]; then
+ echo "Test 1 FAILED: Leader did not start"
+ ./shutdownall
+ exit 1
+fi
+
+# Test 2: Verify standby joined cluster
+echo "=== Test 2: Waiting for standby to join... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep "successfully joined the watchdog cluster" \
+ pgpool1/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 2 PASSED: Standby joined."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 2 ]; then
+ echo "Test 2 FAILED: Standby did not join"
+ ./shutdownall
+ exit 1
+fi
+
+# Test 3: Verify track_table_mutation initialized
+echo "=== Test 3: Verify feature initialized ==="
+if grep -a "track_table_mutation: initialized" \
+ pgpool0/log/pgpool.log > /dev/null 2>&1; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 3 PASSED: Feature initialized."
+else
+ echo "Test 3 FAILED: Feature not initialized"
+ ./shutdownall
+ exit 1
+fi
+
+# Test 4: Stop leader (pgpool0) to trigger failover
+echo "=== Test 4: Stopping leader... ==="
+cd pgpool0
+source ./bashrc.ports
+$PGPOOL_INSTALL_DIR/bin/pgpool \
+ -f etc/pgpool.conf -m f stop
+cd ..
+
+echo "Checking standby detected shutdown..."
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep -a "is shutting down" \
+ pgpool1/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 4 PASSED: Shutdown detected."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 4 ]; then
+ echo "Test 4 FAILED: Shutdown not detected"
+ ./shutdownall
+ exit 1
+fi
+
+# Test 5: Verify standby became new leader
+echo "=== Test 5: Checking standby takes over... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep -a "I am the cluster leader node" \
+ pgpool1/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 5 PASSED: Standby became leader."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 5 ]; then
+ echo "Test 5 FAILED: Standby did not become leader"
+ ./shutdownall
+ exit 1
+fi
+
+# Test 6: Verify global cold start was triggered
+echo "=== Test 6: Checking global cold start... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep -a "track_table_mutation: global cold start" \
+ pgpool1/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 6 PASSED: Global cold start triggered."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+# Cleanup
+./shutdownall
+
+echo ""
+echo "$success_count out of 6 successful"
+
+if test $success_count -eq 6
+then
+ echo "=== All Watchdog Tests PASSED ==="
+ exit 0
+fi
+
+exit 1
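The watchdog test above repeats the same polling loop (up to 10 greps, 2 seconds apart) for each log message it waits on. As a sketch under that assumption, the loop could be written once as a function. The name `wait_for_log` and its argument order are hypothetical and do not come from the patch or from libs.sh.

```shell
# Hypothetical helper, not part of the patch.
# Poll a log file until a pattern appears or the attempts run out.
# $1 = pattern, $2 = log file,
# $3 = max attempts (default 10), $4 = seconds between attempts (default 2)
wait_for_log() {
    _wfl_try=1
    while [ "$_wfl_try" -le "${3:-10}" ]; do
        # -a treats the log as text even if it contains binary bytes
        if grep -a -q -- "$1" "$2" 2>/dev/null; then
            return 0
        fi
        echo "[check] $_wfl_try times"
        sleep "${4:-2}"
        _wfl_try=$((_wfl_try + 1))
    done
    return 1
}
```

With such a helper, a step like Test 6 could become `wait_for_log "track_table_mutation: global cold start" pgpool1/log/pgpool.log || { echo "Test 6 FAILED"; ./shutdownall; exit 1; }`, which also gives every step the same explicit failure path.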
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 939200965..467ec114c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -519,6 +519,10 @@ TableLikeClause
TableSampleClause
TargetEntry
TokenizedLine
+TrackTableMutationEntry
+TrackTableMutationHashTable
+TrackTableMutationShmem
+TrackTableMutationState
TransactionId
TransactionStmt
TransactionStmtKind
diff --git a/src/utils/pool_track_table_mutation.c b/src/utils/pool_track_table_mutation.c
new file mode 100644
index 000000000..e7771e7bf
--- /dev/null
+++ b/src/utils/pool_track_table_mutation.c
@@ -0,0 +1,902 @@
+/* -*-pgsql-c-*- */
+/*
+ * pgpool: a language independent connection pool server for PostgreSQL
+ * written by Tatsuo Ishii
+ *
+ * Copyright (c) 2003-2026 PgPool Global Development Group
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose and without fee is hereby
+ * granted, provided that the above copyright notice appear in all
+ * copies and that both that copyright notice and this permission
+ * notice appear in supporting documentation, and that the name of the
+ * author not be used in advertising or publicity pertaining to
+ * distribution of the software without specific, written prior
+ * permission. The author makes no representations about the
+ * suitability of this software for any purpose. It is provided "as
+ * is" without express or implied warranty.
+ *
+ * pool_track_table_mutation.c: In-memory tracking of recently
+ * written tables to prevent stale reads from replicas.
+ *
+ * Based on the "lagless" architecture from Tailor Brands.
+ */
+
+#include <string.h>
+#include <stdlib.h>
+#include <sys/time.h>
+#include <unistd.h>
+
+#include "pool.h"
+#include "pool_config.h"
+#include "context/pool_session_context.h"
+#include "utils/pool_track_table_mutation.h"
+#include "utils/elog.h"
+#include "utils/pool_ipc.h"
+#include "utils/palloc.h"
+#include "utils/pool_relcache.h"
+
+#define DATABASE_TO_OID_QUERY \
+ "SELECT oid FROM pg_catalog.pg_database" \
+ " WHERE datname = '%s'"
+
+/*
+ * Helper macro: true when the feature is not active.
+ */
+#define TRACK_TABLE_MUTATION_DISABLED() \
+ (pool_config->disable_load_balance_on_write != \
+ DLBOW_DML_ADAPTIVE_GLOBAL || \
+ track_table_mutation_shmem == NULL)
+
+/* ----------------
+ * Local variables
+ * ----------------
+ */
+
+/* Pointer to shared memory structure */
+static TrackTableMutationShmem *track_table_mutation_shmem = NULL;
+
+/* Per-process cold start tracking (not in shared memory) */
+static struct timeval process_start_time;
+static bool cold_start_initialized = false;
+
+/* ----------------
+ * Helper macros for flexible arrays in shared memory
+ * ----------------
+ */
+
+/* Get pointer to bucket array in table map */
+#define TABLE_MAP_BUCKETS(map) \
+ ((int *)((char *)(map) + \
+ sizeof(TrackTableMutationHashTable)))
+
+/* Get pointer to entry array in table map */
+#define TABLE_MAP_ENTRIES(map) \
+ ((TrackTableMutationEntry *)((char *)(map) + \
+ sizeof(TrackTableMutationHashTable) + \
+ (map)->num_buckets * sizeof(int)))
+
+/* ----------------
+ * Semaphore lock helpers
+ * ----------------
+ */
+
+static inline void
+table_map_lock(void)
+{
+ pool_semaphore_lock(TRACK_TABLE_MUTATION_TABLE_SEM);
+}
+
+static inline void
+table_map_unlock(void)
+{
+ pool_semaphore_unlock(TRACK_TABLE_MUTATION_TABLE_SEM);
+}
+
+/* ----------------
+ * Hash functions
+ * ----------------
+ */
+
+/*
+ * FNV-1a hash for table/database oid pair
+ */
+static uint32
+fnv1a_hash_table_key(int table_oid, int dboid)
+{
+ uint32 hash = 2166136261u; /* FNV offset basis */
+ uint32 data[2];
+ const unsigned char *bytes;
+ size_t i;
+
+ data[0] = (uint32) table_oid;
+ data[1] = (uint32) dboid;
+ bytes = (const unsigned char *) data;
+
+ for (i = 0; i < sizeof(data); i++)
+ {
+ hash ^= bytes[i];
+ hash *= 16777619u; /* FNV prime */
+ }
+
+ return hash;
+}
+
+/* ----------------
+ * Time utilities
+ * ----------------
+ */
+
+/*
+ * Get elapsed time in microseconds between two timevals
+ */
+static int64
+elapsed_us(struct timeval *start, struct timeval *end)
+{
+ return ((int64) (end->tv_sec - start->tv_sec) * 1000000)
+ + (end->tv_usec - start->tv_usec);
+}
+
+/*
+ * Get current time
+ */
+static void
+get_current_time(struct timeval *tv)
+{
+ gettimeofday(tv, NULL);
+}
+
+/* ----------------
+ * Database oid lookup
+ * ----------------
+ */
+
+static int
+track_table_mutation_get_database_oid_internal(void)
+{
+ int oid = 0;
+ static POOL_RELCACHE *relcache;
+ POOL_CONNECTION_POOL *backend;
+ POOL_SESSION_CONTEXT *session_context;
+
+ /* Safety check: must have shmem initialized */
+ if (track_table_mutation_shmem == NULL)
+ return oid;
+
+ session_context = pool_get_session_context(false);
+ if (session_context == NULL)
+ return oid;
+
+ backend = session_context->backend;
+ if (backend == NULL ||
+ MAIN_CONNECTION(backend) == NULL ||
+ MAIN_CONNECTION(backend)->sp == NULL)
+ return oid;
+
+ /* Ensure database name is valid */
+ if (MAIN_CONNECTION(backend)->sp->database == NULL)
+ return oid;
+
+ if (!relcache)
+ {
+ relcache = pool_create_relcache(
+ pool_config->relcache_size,
+ DATABASE_TO_OID_QUERY,
+ int_register_func,
+ int_unregister_func,
+ false);
+ if (relcache == NULL)
+ {
+ ereport(LOG,
+ (errmsg("track_table_mutation: "
+ "error creating relcache")));
+ return oid;
+ }
+ }
+
+ oid = (int) (intptr_t) pool_search_relcache(
+ relcache, backend,
+ MAIN_CONNECTION(backend)->sp->database);
+ return oid;
+}
+
+int
+pool_track_table_mutation_get_database_oid(void)
+{
+ return track_table_mutation_get_database_oid_internal();
+}
+
+/* ----------------
+ * Table mutation hash table operations
+ * ----------------
+ */
+
+/*
+ * Initialize table mutation hash table
+ */
+static void
+table_map_init(TrackTableMutationHashTable *map,
+ int num_buckets, int max_entries)
+{
+ int *buckets;
+ TrackTableMutationEntry *entries;
+ int i;
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ map->num_buckets = num_buckets;
+ map->max_entries = max_entries;
+ map->num_entries = 0;
+ map->free_list_head = 0;
+
+ buckets = TABLE_MAP_BUCKETS(map);
+ entries = TABLE_MAP_ENTRIES(map);
+
+ /* Initialize all buckets to empty */
+ for (i = 0; i < num_buckets; i++)
+ buckets[i] = invalid;
+
+ /* Initialize free list - chain all entries */
+ for (i = 0; i < max_entries; i++)
+ {
+ entries[i].in_use = false;
+ entries[i].next = (i < max_entries - 1) ?
+ i + 1 : invalid;
+ }
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "table map init %d buckets, "
+ "%d max entries",
+ num_buckets, max_entries)));
+}
+
+/*
+ * Allocate an entry from the free list
+ */
+static int
+table_map_alloc_entry(TrackTableMutationHashTable *map)
+{
+ TrackTableMutationEntry *entries;
+ int idx;
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ entries = TABLE_MAP_ENTRIES(map);
+
+ if (map->free_list_head == invalid)
+ return invalid;
+
+ idx = map->free_list_head;
+ map->free_list_head = entries[idx].next;
+ entries[idx].in_use = true;
+ entries[idx].next = invalid;
+ map->num_entries++;
+
+ return idx;
+}
+
+/*
+ * Free an entry back to the free list
+ */
+static void
+table_map_free_entry(TrackTableMutationHashTable *map,
+ int idx)
+{
+ TrackTableMutationEntry *entries;
+
+ entries = TABLE_MAP_ENTRIES(map);
+
+ entries[idx].in_use = false;
+ entries[idx].next = map->free_list_head;
+ map->free_list_head = idx;
+ map->num_entries--;
+}
+
+/*
+ * Look up a table in the hash table.
+ * Returns entry index or INVALID_INDEX if not found.
+ * Must be called with lock held.
+ */
+static int
+table_map_lookup(TrackTableMutationHashTable *map,
+ int table_oid, int dboid,
+ uint32 hash)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TrackTableMutationEntry *entries;
+ int bucket = hash % map->num_buckets;
+ int idx = buckets[bucket];
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ entries = TABLE_MAP_ENTRIES(map);
+
+ while (idx != invalid)
+ {
+ if (entries[idx].hash == hash &&
+ entries[idx].table_oid == table_oid &&
+ entries[idx].dboid == dboid)
+ {
+ return idx;
+ }
+ idx = entries[idx].next;
+ }
+
+ return invalid;
+}
+
+/*
+ * Insert or update a table entry.
+ * Must be called with lock held.
+ */
+static void
+table_map_insert(TrackTableMutationHashTable *map,
+ int table_oid, int dboid,
+ uint32 hash,
+ struct timeval *write_time)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TrackTableMutationEntry *entries;
+ int bucket = hash % map->num_buckets;
+ int idx;
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ entries = TABLE_MAP_ENTRIES(map);
+
+ /* Check if entry already exists */
+ idx = table_map_lookup(map, table_oid, dboid, hash);
+ if (idx != invalid)
+ {
+ /* Update last write time; keep first_write_time */
+ entries[idx].last_write_time = *write_time;
+ return;
+ }
+
+ /* Allocate new entry */
+ idx = table_map_alloc_entry(map);
+ if (idx == invalid)
+ {
+ int b;
+
+ /* Table is full - evict head entry of first non-empty bucket */
+ for (b = 0; b < map->num_buckets; b++)
+ {
+ if (buckets[b] != invalid)
+ {
+ int victim = buckets[b];
+
+ buckets[b] = entries[victim].next;
+ table_map_free_entry(map, victim);
+ idx = table_map_alloc_entry(map);
+ break;
+ }
+ }
+
+ if (idx == invalid)
+ {
+ ereport(WARNING,
+ (errmsg("track_table_mutation: "
+ "failed to allocate entry "
+ "for oid %d (dboid %d)",
+ table_oid, dboid)));
+ return;
+ }
+ }
+
+ /* Initialize new entry */
+ entries[idx].table_oid = table_oid;
+ entries[idx].dboid = dboid;
+ entries[idx].hash = hash;
+ entries[idx].first_write_time = *write_time;
+ entries[idx].last_write_time = *write_time;
+
+ /* Insert at head of bucket chain */
+ entries[idx].next = buckets[bucket];
+ buckets[bucket] = idx;
+
+ ereport(DEBUG2,
+ (errmsg("track_table_mutation: "
+ "marked oid %d (dboid %d) written",
+ table_oid, dboid)));
+}
+
+/*
+ * Remove expired entries from the table map.
+ * Must be called with lock held.
+ */
+static void
+table_map_cleanup_expired(
+ TrackTableMutationHashTable *map, uint64 ttl_us)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TrackTableMutationEntry *entries;
+ struct timeval now;
+ int64 max_stale_us;
+ int removed = 0;
+ int b;
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ entries = TABLE_MAP_ENTRIES(map);
+ get_current_time(&now);
+
+ max_stale_us = (int64) pool_config
+ ->track_table_mutation_max_staleness * 1000LL;
+
+ for (b = 0; b < map->num_buckets; b++)
+ {
+ int *prev_ptr = &buckets[b];
+ int idx = buckets[b];
+
+ while (idx != invalid)
+ {
+ int64 age;
+ int64 total_age;
+ bool expired;
+
+ age = elapsed_us(
+ &entries[idx].last_write_time, &now);
+ expired = (age > (int64) ttl_us);
+
+ /*
+ * Also evict entries that exceed max_staleness from first write.
+ */
+ if (!expired && max_stale_us > 0)
+ {
+ total_age = elapsed_us(
+ &entries[idx].first_write_time,
+ &now);
+ expired = (total_age >= max_stale_us);
+ }
+
+ if (expired)
+ {
+ /* Entry has expired - remove it */
+ int next = entries[idx].next;
+
+ *prev_ptr = next;
+ table_map_free_entry(map, idx);
+ idx = next;
+ removed++;
+ }
+ else
+ {
+ prev_ptr = &entries[idx].next;
+ idx = entries[idx].next;
+ }
+ }
+ }
+
+ if (removed > 0)
+ {
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "cleaned up %d expired entries",
+ removed)));
+ }
+}
+
+
+/* ----------------
+ * Public API implementation
+ * ----------------
+ */
+
+/*
+ * Calculate the total shared memory size required
+ * for the track table mutation feature.
+ */
+Size
+pool_track_table_mutation_shmem_size(void)
+{
+ Size size = 0;
+ int tbl_bkt;
+ int tbl_sz;
+
+ tbl_bkt = pool_config->track_table_mutation_table_buckets;
+ tbl_sz = pool_config->track_table_mutation_table_size;
+
+ /* Main structure */
+ size += sizeof(TrackTableMutationShmem);
+
+ /* Table mutation hash table */
+ size += sizeof(TrackTableMutationHashTable);
+ size += tbl_bkt * sizeof(int);
+ size += tbl_sz * sizeof(TrackTableMutationEntry);
+
+ return size;
+}
+
+/*
+ * Initialize shared memory structures for the
+ * track table mutation feature. Allocates and sets
+ * up the table map in shared memory.
+ * Called once from pgpool main process at startup.
+ */
+void
+pool_track_table_mutation_init(void)
+{
+#ifndef POOL_PRIVATE
+ Size shmem_size;
+ char *shmem_ptr;
+ TrackTableMutationState *st;
+ int tbl_bkt;
+ int tbl_sz;
+
+ if (pool_config->disable_load_balance_on_write !=
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "feature disabled")));
+ return;
+ }
+
+ tbl_bkt = pool_config->track_table_mutation_table_buckets;
+ tbl_sz = pool_config->track_table_mutation_table_size;
+
+ shmem_size = pool_track_table_mutation_shmem_size();
+
+ /*
+ * Allocate from the main shared memory segment. Memory is zeroed by
+ * initialize_shared_memory_main_segment().
+ */
+ shmem_ptr = pool_shared_memory_segment_get_chunk(
+ shmem_size);
+ if (shmem_ptr == NULL)
+ {
+ ereport(ERROR,
+ (errmsg("track_table_mutation: "
+ "failed to allocate %zu bytes",
+ shmem_size)));
+ return;
+ }
+
+ /* Set up pointers within shared memory */
+ track_table_mutation_shmem =
+ (TrackTableMutationShmem *) shmem_ptr;
+ shmem_ptr += sizeof(TrackTableMutationShmem);
+
+ track_table_mutation_shmem->table_map =
+ (TrackTableMutationHashTable *) shmem_ptr;
+
+ /* Initialize table map */
+ table_map_init(
+ track_table_mutation_shmem->table_map,
+ tbl_bkt, tbl_sz);
+
+ /* Initialize global state */
+ st = &track_table_mutation_shmem->state;
+ st->initialized = true;
+ st->current_ttl_us = TRACK_TABLE_MUTATION_DEFAULT_TTL_US;
+ get_current_time(&st->ttl_last_updated);
+ get_current_time(&st->last_cleanup_time);
+ st->global_cold_start_until.tv_sec = 0;
+ st->global_cold_start_until.tv_usec = 0;
+ st->stats_queries_checked = 0;
+ st->stats_forced_primary = 0;
+ st->stats_allowed_replica = 0;
+
+ ereport(LOG,
+ (errmsg("track_table_mutation: "
+ "initialized with %zu bytes shmem",
+ shmem_size)));
+#endif
+}
+
+/*
+ * Initialize per-child process state.
+ * Records the process start time for cold start
+ * period tracking. Called when a child process starts.
+ */
+void
+pool_track_table_mutation_child_init(void)
+{
+ int dur;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return;
+
+ get_current_time(&process_start_time);
+ cold_start_initialized = true;
+ dur = pool_config->track_table_mutation_cold_start_duration;
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "child init, cold start %d ms",
+ dur)));
+}
+
+/*
+ * Check if the process is in cold start period.
+ * During cold start, all queries are routed to
+ * primary to avoid stale reads. Checks both
+ * per-process and global (watchdog) cold start.
+ */
+bool
+pool_track_table_mutation_in_cold_start(void)
+{
+ struct timeval now;
+ int64 elapsed_ms;
+ int dur;
+ TrackTableMutationState *st;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return false;
+
+ dur = pool_config->track_table_mutation_cold_start_duration;
+ if (dur <= 0)
+ return false;
+
+ get_current_time(&now);
+ st = &track_table_mutation_shmem->state;
+
+ /* Check watchdog-triggered global cold start */
+ if (st->global_cold_start_until.tv_sec != 0 &&
+ elapsed_us(&now,
+ &st->global_cold_start_until) > 0)
+ {
+ return true;
+ }
+
+ /* Check per-process cold start */
+ if (!cold_start_initialized)
+ return false;
+
+ elapsed_ms = elapsed_us(&process_start_time, &now) / 1000;
+
+ if (elapsed_ms < dur)
+ {
+ ereport(DEBUG2,
+ (errmsg("track_table_mutation: "
+ "cold start (%ld/%d ms)",
+ (long) elapsed_ms, dur)));
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Trigger a global cold start for all processes.
+ * Sets the cold start end time in shared memory.
+ * Called after watchdog leader change to force all
+ * queries to primary during the transition.
+ */
+void
+pool_track_table_mutation_trigger_global_cold_start(void)
+{
+ struct timeval now;
+ struct timeval *until;
+ int dur;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return;
+
+ dur = pool_config->track_table_mutation_cold_start_duration;
+ if (dur <= 0)
+ return;
+
+ get_current_time(&now);
+ until = &track_table_mutation_shmem->state
+ .global_cold_start_until;
+ *until = now;
+ until->tv_sec += dur / 1000;
+ until->tv_usec += (dur % 1000) * 1000;
+ if (until->tv_usec >= 1000000)
+ {
+ until->tv_sec += until->tv_usec / 1000000;
+ until->tv_usec %= 1000000;
+ }
+
+ ereport(LOG,
+ (errmsg("track_table_mutation: "
+ "global cold start for %d ms",
+ dur)));
+}
+
+/*
+ * Check if a table was recently written (is "stale").
+ * Returns true if reads should go to primary: the table was written
+ * within the current TTL window, or table_oid/dboid is not valid.
+ */
+bool
+pool_track_table_mutation_table_is_stale(
+ int table_oid, int dboid)
+{
+ TrackTableMutationHashTable *map;
+ struct timeval now;
+ uint64 ttl_us;
+ uint32 hash;
+ int idx;
+ bool is_stale = false;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return false;
+
+ if (table_oid <= 0 || dboid <= 0)
+ {
+ is_stale = true;
+ goto update_stats;
+ }
+
+ map = track_table_mutation_shmem->table_map;
+ hash = fnv1a_hash_table_key(table_oid, dboid);
+
+ table_map_lock();
+
+ idx = table_map_lookup(map, table_oid, dboid, hash);
+ if (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ TrackTableMutationEntry *entries;
+ int64 age;
+ int64 total_age;
+ int64 max_stale_us;
+
+ entries = TABLE_MAP_ENTRIES(map);
+ get_current_time(&now);
+ ttl_us = track_table_mutation_shmem->state
+ .current_ttl_us;
+
+ age = elapsed_us(
+ &entries[idx].last_write_time, &now);
+ is_stale = (age < (int64) ttl_us);
+
+ /*
+ * Enforce max_staleness hard cap: no entry can force primary routing
+ * longer than max_staleness from its first write.
+ */
+ if (is_stale)
+ {
+ max_stale_us = (int64) pool_config
+ ->track_table_mutation_max_staleness
+ * 1000LL;
+ if (max_stale_us > 0)
+ {
+ total_age = elapsed_us(
+ &entries[idx].first_write_time,
+ &now);
+ if (total_age >= max_stale_us)
+ is_stale = false;
+ }
+ }
+
+ ereport(DEBUG2,
+ (errmsg("track_table_mutation: "
+ "oid %d dboid %d "
+ "elapsed=%ld ttl=%lu stale=%d",
+ table_oid, dboid,
+ (long) age,
+ (unsigned long) ttl_us,
+ is_stale)));
+ }
+
+ table_map_unlock();
+
+update_stats:
+ /* Update statistics using semaphore */
+ if (track_table_mutation_shmem != NULL)
+ {
+ TrackTableMutationState *st;
+
+ table_map_lock();
+ st = &track_table_mutation_shmem->state;
+ st->stats_queries_checked++;
+ if (is_stale)
+ st->stats_forced_primary++;
+ else
+ st->stats_allowed_replica++;
+ table_map_unlock();
+ }
+
+ return is_stale;
+}
+
+/*
+ * Mark multiple tables as recently written.
+ * Called after DML queries complete to record
+ * which tables were modified.
+ */
+void
+pool_track_table_mutation_mark_tables_written(
+ const int *table_oids, int num_tables, int dboid)
+{
+ TrackTableMutationHashTable *map;
+ TrackTableMutationState *st;
+ struct timeval now;
+ int i;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return;
+
+ if (num_tables <= 0 || table_oids == NULL ||
+ dboid <= 0)
+ return;
+
+ map = track_table_mutation_shmem->table_map;
+ st = &track_table_mutation_shmem->state;
+ get_current_time(&now);
+
+ table_map_lock();
+
+ /* Periodically clean up expired entries */
+ if (map->num_entries > map->max_entries * 3 / 4)
+ {
+ int64 since_cleanup;
+
+ since_cleanup = elapsed_us(
+ &st->last_cleanup_time, &now);
+ /* 100ms interval */
+ if (since_cleanup > 100000)
+ {
+ table_map_cleanup_expired(
+ map, st->current_ttl_us);
+ st->last_cleanup_time = now;
+ }
+ }
+
+ for (i = 0; i < num_tables; i++)
+ {
+ uint32 hash;
+ int table_oid = table_oids[i];
+
+ if (table_oid > 0)
+ {
+ hash = fnv1a_hash_table_key(
+ table_oid, dboid);
+ table_map_insert(map, table_oid,
+ dboid, hash, &now);
+ }
+ }
+
+ table_map_unlock();
+}
+
+/*
+ * Mark a single table as recently written.
+ */
+void
+pool_track_table_mutation_mark_table_written(
+ int table_oid, int dboid)
+{
+ if (table_oid > 0 && dboid > 0)
+ {
+ const int tables[1] = {table_oid};
+
+ pool_track_table_mutation_mark_tables_written(
+ tables, 1, dboid);
+ }
+}
+
+/*
+ * Update the staleness TTL based on observed
+ * replication delay. New TTL = delay * factor,
+ * clamped to [default_ttl, 1 hour].
+ */
+void
+pool_track_table_mutation_update_ttl(uint64 delay_us)
+{
+ uint64 new_ttl;
+ double factor;
+ TrackTableMutationState *st;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return;
+
+ factor = pool_config->track_table_mutation_ttl_factor;
+ new_ttl = (uint64) (delay_us * factor);
+ if (new_ttl < TRACK_TABLE_MUTATION_DEFAULT_TTL_US)
+ new_ttl = TRACK_TABLE_MUTATION_DEFAULT_TTL_US;
+
+ /* Maximum TTL of 1 hour */
+ if (new_ttl > 3600ULL * 1000000ULL)
+ new_ttl = 3600ULL * 1000000ULL;
+
+ st = &track_table_mutation_shmem->state;
+ st->current_ttl_us = new_ttl;
+ get_current_time(&st->ttl_last_updated);
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "TTL=%lu us (delay=%lu factor=%.1f)",
+ (unsigned long) new_ttl,
+ (unsigned long) delay_us,
+ factor)));
+}
--
2.54.0
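
[Editorial sketch] The routing decision made by pool_track_table_mutation_update_ttl() and pool_track_table_mutation_table_is_stale() in the patch above can be reduced to two pure functions. This is a minimal sketch, not the patch's code: every sketch_* and SKETCH_* name is hypothetical, and the 100 ms floor is an assumption taken from the "100ms default minimum" mentioned later in the thread (the actual value of TRACK_TABLE_MUTATION_DEFAULT_TTL_US is not visible in this hunk). Locking, shared memory and statistics are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed floor (100 ms) and the 1 hour ceiling used in the patch. */
#define SKETCH_MIN_TTL_US 100000ULL
#define SKETCH_MAX_TTL_US (3600ULL * 1000000ULL)

/* TTL = replication delay * factor, clamped to [min, max]. */
static uint64_t
sketch_compute_ttl_us(uint64_t delay_us, double factor)
{
	uint64_t	ttl = (uint64_t) ((double) delay_us * factor);

	if (ttl < SKETCH_MIN_TTL_US)
		ttl = SKETCH_MIN_TTL_US;
	if (ttl > SKETCH_MAX_TTL_US)
		ttl = SKETCH_MAX_TTL_US;
	return ttl;
}

/*
 * A table forces primary routing while its last write is younger than
 * the TTL, unless its first write is already older than the hard cap.
 * max_stale_us <= 0 disables the cap, as in the patch.
 */
static bool
sketch_is_stale(int64_t since_last_write_us,
				int64_t since_first_write_us,
				uint64_t ttl_us,
				int64_t max_stale_us)
{
	if (since_last_write_us >= (int64_t) ttl_us)
		return false;			/* TTL window has passed */
	if (max_stale_us > 0 && since_first_write_us >= max_stale_us)
		return false;			/* hard cap reached */
	return true;
}
```

With a 50 ms observed delay and the default factor of 5.0 the TTL would be 250 ms; a delay of zero falls back to the assumed 100 ms floor. A table that keeps being rewritten stays "stale" only until the cap measured from its first write is reached.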
* Re: Proposal: Recent mutated table tracking in memory
2026-01-06 11:25 Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-14 08:55 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-28 05:08 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-29 08:28 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-29 08:54 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-30 08:09 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-31 17:11 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-03 07:43 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-03 23:23 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-06 11:29 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-10 15:16 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-11 10:28 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-12 09:05 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-18 23:51 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-19 04:40 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-19 11:05 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-26 07:47 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-26 15:26 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-03-09 05:18 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-03-09 09:22 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-03-23 05:13 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-03-23 13:07 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-04-07 00:08 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-04-07 05:45 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-04-07 09:10 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-04-07 09:43 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-04-09 07:21 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-04-14 22:43 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-04-15 12:17 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-04-19 07:24 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-04-19 14:29 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-04-23 08:14 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-04-23 14:16 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
@ 2026-05-18 09:54 ` Nadav Shatz <[email protected]>
2026-05-18 10:11 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
0 siblings, 1 reply; 44+ messages in thread
From: Nadav Shatz @ 2026-05-18 09:54 UTC (permalink / raw)
To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]
Hi Tatsuo,
Any update on this proposal?
Thanks
On Thu, Apr 23, 2026 at 5:16 PM Nadav Shatz <[email protected]> wrote:
> Hi Tatsuo,
>
> Good catch on the 006.memqcache timeout. My previous fix had
> unintended side effects -- setting writing_transaction for
> dml_adaptive_global also changed routing behavior (it forced the
> whole transaction to primary, effectively reducing the feature to
> 'transaction' mode). That's what caused the hang.
>
> Fixed properly in v3: instead of touching writing_transaction,
> added a memqcache-specific guard that checks whether the current
> dml_adaptive* session has tracked writes in the current
> transaction, and skips the cache fetch if so.
>
> Attached: v3-0001-Feature-load-balancing-control-by-table-tracking.patch
>
> Changes in v3 vs v2:
>
> - pool_set_writing_transaction() reverted to original behavior
> (dml_adaptive_global no longer sets writing_transaction, so
> routing stays per-table as intended).
>
> - Added new helper pool_has_dml_adaptive_write_in_transaction()
> in pool_session_context.c. Returns true when the current session
> is in dml_adaptive* mode, is inside an explicit transaction, and
> has already tracked at least one write (via
> transaction_temp_write_list).
>
> - The two memqcache fetch guards in pool_proto_modules.c
> (simple query at line 270, extended query at line 1028) now
> also call !pool_has_dml_adaptive_write_in_transaction().
> Autocommit writes in dml_adaptive_global are still handled by
> the existing pool_invalidate_query_cache() at COMMIT time --
> no change needed there.
>
> Verified locally by mutating 006.memqcache with
> disable_load_balance_on_write = 'dml_adaptive_global' in the
> streaming replication mode (the only mode where dml_adaptive
> applies) and the jdbctest now correctly returns "2" instead of
> the stale cached "1". Both 006.memqcache and 043.track_table_mutation
> pass.
>
> Thanks!
>
> On Thu, Apr 23, 2026 at 11:14 AM Tatsuo Ishii <[email protected]>
> wrote:
>
>> Hi Nadav,
>>
>> Unfortunately the mutated 006.memqcache failed (timeout).
>>
>> >> > memqcache bug fix
>> >> > -----------------
>> >> >
>> >> > Good catch. The root cause: pool_set_writing_transaction() was
>> >> > explicitly skipping dml_adaptive_global, so
>> >> > pool_is_writing_transaction() always returned false in this mode.
>> >> > The query cache fetch guard at pool_proto_modules.c:270
>> >> > (!pool_is_writing_transaction()) then served stale cached results
>> >> > after DML in the same transaction.
>> >> >
>> >> > Fix: pool_set_writing_transaction() now sets the flag for
>> >> > dml_adaptive_global (only 'off' and 'dml_adaptive' skip it). This
>> >> > ensures the query cache is properly bypassed after writes within
>> >> > the same transaction.
>>
>> Regards,
>> --
>> Tatsuo Ishii
>> SRA OSS K.K.
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese:http://www.sraoss.co.jp
>>
>> > Hi Tatsuo,
>> >
>> > Rebased onto current master, renumbered the regression tests
>> > (043/044 to avoid collision with the new 042.ssl_reload), and
>> > combined everything into a single commit.
>> >
>> > Attached: v2-0001-Feature-load-balancing-control-by-table-tracking.patch
>> >
>> > Looking forward to your review.
>> >
>> >
>> > On Sun, Apr 19, 2026 at 10:25 AM Tatsuo Ishii <[email protected]>
>> wrote:
>> >
>> >> > Hi Tatsuo,
>> >> >
>> >> > Thank you for the detailed review. The attached patch addresses all items.
>> >>
>> >> I guess the attached patch is on top of
>> >> v1-0001-Feature-load-balancing-control-by-table-tracking.patch. To
>> >> apply v2-0001-address-review.patch, we need to apply
>> >> v1-0001-Feature-load-balancing-control-by-table-tracking.patch first.
>> >> Unfortunately, due to a recent commit, it does not apply anymore. Can you
>> >> please provide v1 + v2 rebased against the latest master branch?
>> >> Also, regression test number 042 is already used by a recent commit. Can you
>> >> renumber 042.track_table_mutation and
>> >> 043.track_table_mutation_watchdog to 043.track_table_mutation and
>> >> 044.track_table_mutation_watchdog accordingly?
>> >>
>> >> Looking forward to seeing new patch.
>> >>
>> >> Regards,
>> >> --
>> >> Tatsuo Ishii
>> >> SRA OSS K.K.
>> >> English: http://www.sraoss.co.jp/index_en/
>> >> Japanese:http://www.sraoss.co.jp
>> >>
>> >>
>> >> > memqcache bug fix
>> >> > -----------------
>> >> >
>> >> > Good catch. The root cause: pool_set_writing_transaction() was
>> >> > explicitly skipping dml_adaptive_global, so
>> >> > pool_is_writing_transaction() always returned false in this mode.
>> >> > The query cache fetch guard at pool_proto_modules.c:270
>> >> > (!pool_is_writing_transaction()) then served stale cached results
>> >> > after DML in the same transaction.
>> >> >
>> >> > Fix: pool_set_writing_transaction() now sets the flag for
>> >> > dml_adaptive_global (only 'off' and 'dml_adaptive' skip it). This
>> >> > ensures the query cache is properly bypassed after writes within
>> >> > the same transaction.
>> >> >
>> >> > Removed dead query parse cache code (~700 lines)
>> >> > -------------------------------------------------
>> >> >
>> >> > You're right -- pool_track_table_mutation_get_cached_parse,
>> >> > pool_track_table_mutation_cache_parse, and
>> >> > pool_track_table_mutation_normalize_and_hash were never called.
>> >> > These were leftover from an earlier design where we planned to
>> >> > cache SQL parse results in shared memory. The feature ended up
>> >> > using pgpool's existing parser directly, and this code was never
>> >> > wired up.
>> >> >
>> >> > Removed: QueryParseCache and QueryParseEntry structs, all related
>> >> > static functions, the TRACK_TABLE_MUTATION_QUERY_SEM semaphore,
>> >> > and the track_table_mutation_query_buckets /
>> >> > track_table_mutation_query_parse_cache_size configuration
>> >> > parameters. This also reduces shared memory usage from ~6.4 MB
>> >> > to ~80 KB with default settings.
>> >> >
>> >> > check_object_relationship_list scope
>> >> > -------------------------------------
>> >> >
>> >> > You're correct -- dml_adaptive_global does not use
>> >> > dml_adaptive_object_relationship_list. Changed
>> >> > check_object_relationship_list() to check for DLBOW_DML_ADAPTIVE
>> >> > only, not DLBOW_IS_DML_ADAPTIVE (which includes global).
>> >> >
>> >> > Documentation fixes
>> >> > -------------------
>> >> >
>> >> > - Removed "(Lagless Replica Reads)" from section title and
>> >> > "lagless" language from description.
>> >> >
>> >> > - Described fallback behavior when neither
>> >> > replication_delay_source_cmd nor delay_threshold_by_time is
>> >> > configured (TTL stays at 100ms default minimum).
>> >> >
>> >> > - "query cache" references removed (the query parse cache is gone).
>> >> >
>> >> > - Added 128-table-per-SELECT limit to Limitations section
>> >> > (uses POOL_MAX_SELECT_OIDS).
>> >> >
>> >> > Code style fixes
>> >> > ----------------
>> >> >
>> >> > - DLBOW_IS_DML_ADAPTIVE() calls no longer split across lines.
>> >> >
>> >> > - Split the long errmsg line in
>> >> > is_select_object_in_temp_write_list.
>> >> >
>> >> > - Removed redundant is_adaptive variable in
>> >> > is_select_object_in_temp_write_list (the check at function
>> >> > entry already guarantees it).
>> >> >
>> >> > Thanks!
>> >> >
>> >> > On Wed, Apr 15, 2026 at 1:43 AM Tatsuo Ishii <[email protected]>
>> >> wrote:
>> >> >
>> >> >> Hi Nadav,
>> >> >>
>> >> >> > Hi Tatsuo,
>> >> >> >
>> >> >> > Looks good to me thanks!
>> >> >> >
>> >> >> > Please go ahead with your review. waiting to hear back from you.
>> >> >>
>> >> >> Here are the code review results.
>> >> >>
>> >> >> diff --git a/doc/src/sgml/loadbalance.sgml
>> >> b/doc/src/sgml/loadbalance.sgml
>> >> >> index 9e1e7b39b..7384ce81a 100644
>> >> >> --- a/doc/src/sgml/loadbalance.sgml
>> >> >> +++ b/doc/src/sgml/loadbalance.sgml
>> >> >> :
>> >> >> + <sect2 id="runtime-config-table-mutation-map">
>> >> >> + <title>Table Mutation Map Configuration (Lagless Replica
>> >> Reads)</title>
>> >> >>
>> >> >> "(Lagless Replica Reads)" sounds like an advertisement to me. It
>> >> >> should be removed.
>> >> >>
>> >> >> + <para>
>> >> >> + These parameters configure the track table mutation feature,
>> which
>> >> is
>> >> >> activated by setting
>> >> >> + <xref linkend="guc-disable-load-balance-on-write"> to
>> >> >> <literal>dml_adaptive_global</literal>.
>> >> >> + The feature tracks recently written tables to prevent stale
>> reads
>> >> from
>> >> >> replica nodes during
>> >> >> + replication lag, implementing the "lagless" architecture
>> pattern for
>> >> >> distributed systems
>> >> >> + with read replicas.
>> >> >>
>> >> >> I don't think the feature guarantees "lagless" reads at all times, in
>> >> >> all cases.
>> >> >>
>> >> >> + <para>
>> >> >> + This feature requires time-based replication delay monitoring.
>> This
>> >> >> can be provided by either
>> >> >> + <xref linkend="guc-replication-delay-source-cmd"> (external
>> command
>> >> >> mode) or by setting
>> >> >> + <xref linkend="guc-delay-threshold-by-time"> (which uses
>> >> >> <literal>pg_stat_replication.replay_lag</literal>
>> >> >> + from PostgreSQL 10+). At least one of these must be configured
>> for
>> >> the
>> >> >> TTL calculation to work.
>> >> >>
>> >> >> If one of these is not set, what happens? Error? Need to describe
>> it.
>> >> >>
>> >> >> + </para>
>> >> >> +
>> >> >> + <warning>
>> >> >> + <para>
>> >> >> + Enabling <literal>dml_adaptive_global</literal> increases
>> shared
>> >> >> memory consumption. With default settings,
>> >> >> + the feature requires approximately 6.4 MB of shared memory
>> (0.1 MB
>> >> >> for table tracking + 6.3 MB for query cache).
>> >> >>
>> >> >> "query cache" should be "query parse cache".
>> >> >>
>> >> >> + Memory usage scales with configuration parameters:
>> >> >> + </para>
>> >> >> + <itemizedlist>
>> >> >> + <listitem>
>> >> >> + <para>
>> >> >> + Table tracking: <literal>track_table_mutation_table_size * 40
>> >> >> bytes</literal> (default: 2048 * 40 = ~80 KB)
>> >> >> + </para>
>> >> >> + </listitem>
>> >> >> + <listitem>
>> >> >> + <para>
>> >> >> + Query cache:
>> >> <literal>track_table_mutation_query_parse_cache_size *
>> >> >> 640 bytes</literal> (default: 10000 * 640 = ~6.3 MB)
>> >> >>
>> >> >> "query cache" should be "query parse cache".
>> >> >>
>> >> >> + <title>Limitations</title>
>> >> >>
>> >> >> I think the number of tables tracked in a SELECT is limited to 8. It
>> >> >> should be mentioned.
>> >> >>
>> >> >> diff --git a/src/context/pool_query_context.c
>> >> >> b/src/context/pool_query_context.c
>> >> >> index a056ac596..0190d3673 100644
>> >> >> --- a/src/context/pool_query_context.c
>> >> >> +++ b/src/context/pool_query_context.c
>> >> >> @@ -1828,15 +1829,23 @@ is_in_list(char *name, List *list)
>> >> >> static bool
>> >> >> is_select_object_in_temp_write_list(Node *node, void *context)
>> >> >> {
>> >> >> - if (node == NULL ||
>> pool_config->disable_load_balance_on_write
>> >> !=
>> >> >> DLBOW_DML_ADAPTIVE)
>> >> >> + if (node == NULL ||
>> >> >> + !DLBOW_IS_DML_ADAPTIVE(
>> >> >> +
>> >> >> pool_config->disable_load_balance_on_write))
>> >> >>
>> >> >> You don't need to split the line.
>> >> >>
>> >> >> + is_adaptive = DLBOW_IS_DML_ADAPTIVE(
>> >> >> +
>> >> >> pool_config->disable_load_balance_on_write);
>> >> >>
>> >> >> You don't need to split the line.
>> >> >>
>> >> >> - if (pool_config->disable_load_balance_on_write ==
>> >> >> DLBOW_DML_ADAPTIVE && session_context->is_in_transaction)
>> >> >> + if (is_adaptive &&
>> >> >> + session_context->is_in_transaction)
>> >> >> {
>> >> >> ereport(DEBUG1,
>> >> >>
>> >> >> (errmsg("is_select_object_in_temp_write_list: \"%s\", found relation
>> >> >> \"%s\"", (char *) context, rgv->relname)));
>> >> >> This line is too long. Please split.
>> >> >>
>> >> >> @@ -1880,7 +1889,13 @@ static char
>> >> >> *get_associated_object_from_dml_adaptive_relations
>> >> >> void
>> >> >> check_object_relationship_list(char *name, bool is_func_name)
>> >> >> {
>> >> >> - if (pool_config->disable_load_balance_on_write ==
>> >> >> DLBOW_DML_ADAPTIVE &&
>> >> >> pool_config->parsed_dml_adaptive_object_relationship_list)
>> >> >> + bool is_adaptive;
>> >> >> +
>> >> >> + is_adaptive = DLBOW_IS_DML_ADAPTIVE(
>> >> >> +
>> >> >> pool_config->disable_load_balance_on_write);
>> >> >>
>> >> >> I wrote in the commit message:
>> >> >>
>> >> >> modifications are only detected in the same transaction). Note,
>> >> >> however, you cannot use dml_adaptive_object_relationship_list to
>> track
>> >> >> dependency among table and other objects.
>> >> >>
>> >> >> In my understanding the feature does not use
>> >> >> dml_adaptive_object_relationship_list. If this is correct, why is
>> >> >> check_object_relationship_list() called here in the
>> >> >> dml_adaptive_global case? If the feature does use
>> >> >> dml_adaptive_object_relationship_list, test cases should be
>> >> >> included.
>> >> >>
>> >> >> diff --git a/src/utils/pool_track_table_mutation.c
>> >> >> b/src/utils/pool_track_table_mutation.c
>> >> >> new file mode 100644
>> >> >> index 000000000..9be46b28f
>> >> >> --- /dev/null
>> >> >> +++ b/src/utils/pool_track_table_mutation.c
>> >> >>
>> >> >> It seems the following functions are not used anywhere. I wonder if this
>> >> >> feature actually uses the "query parse cache".
>> >> >>
>> >> >> pool_track_table_mutation_get_cached_parse
>> >> >> pool_track_table_mutation_cache_parse
>> >> >> pool_track_table_mutation_normalize_and_hash
>> >> >>
>> >> >> Besides the code review, I mutated one of the regression tests to check
>> >> >> whether the feature coexists with the existing in-memory query cache
>> >> >> feature. After applying the attached patch, I ran 006.memqcache and got
>> >> >> the following result.
>> >> >>
>> >> >> cd src/test/regression
>> >> >> ./regress.sh 006
>> >> >> creating pgpool-II temporary installation ...
>> >> >> moving pgpool_setup to temporary installation path ...
>> >> >> moving watchdog_setup to temporary installation path ...
>> >> >> using pgpool-II at
>> >> >>
>> >>
>> /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/temp/installed
>> >> >> *************************
>> >> >> REGRESSION MODE : install
>> >> >> Pgpool-II version : pgpool-II version 4.8devel
>> (mitsukakeboshi)
>> >> >> Pgpool-II install path :
>> >> >>
>> >>
>> /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/temp/installed
>> >> >> PostgreSQL bin : /usr/local/pgsql/bin
>> >> >> PostgreSQL Major version : 18
>> >> >> pgbench : /usr/local/pgsql/bin/pgbench
>> >> >> PostgreSQL jdbc :
>> >> >> /usr/local/pgsql/share/postgresql-9.2-1003.jdbc4.jar
>> >> >> *************************
>> >> >> testing 006.memqcache...failed.
>> >> >> out of 1 ok:0 failed:1 timeout:0
>> >> >>
>> >> >> log/006.memqcache shows:
>> >> >>
>> >> >> ../expected.txt result.txt differ: char 1, line 1
>> >> >>
>> >> >> So I checked the test script and found the error was generated by a
>> >> >> Java program test.
>> >> >>
>> >> >> java jdbctest > result.txt 2>&1
>> >> >> cmp ../expected.txt result.txt
>> >> >> if [ $? != 0 ];then
>> >> >> ./shutdownall
>> >> >> exit 1
>> >> >> fi
>> >> >>
>> >> >> In jdbctest.java:
>> >> >>
>> >> >> /*
>> >> >> * Cache test in an explicit transaction
>> >> >> */
>> >> >> conn.setAutoCommit(false);
>> >> >> // execute DML. This should prevent SELECTs from
>> using
>> >> >> query cache in the transaction.
>> >> >> sql = "UPDATE t1 SET i = 2;";
>> >> >> pst = conn.createStatement();
>> >> >> pst.executeUpdate(sql);
>> >> >> pst.close();
>> >> >> // should not use the cache and should return "2",
>> >> rather
>> >> >> than "1"
>> >> >> prest = conn.prepareStatement("SELECT * FROM t1");
>> >> >> rs = prest.executeQuery();
>> >> >>
>> >> >> The expected file (expected.txt) has "2" but the result file
>> >> >> (testdir/result.txt) had "1". This is why the test
>> >> >> failed. I wonder if there's something wrong with the feature when
>> >> >> the query cache is enabled. Can you look into this?
>> >> >>
>> >> >> Regards,
>> >> >> --
>> >> >> Tatsuo Ishii
>> >> >> SRA OSS K.K.
>> >> >> English: http://www.sraoss.co.jp/index_en/
>> >> >> Japanese:http://www.sraoss.co.jp
>> >> >>
>> >> >
>> >> >
>> >> > --
>> >> > Nadav Shatz
>> >> > Tailor Brands | CTO
>> >>
>> >
>> >
>> > --
>> > Nadav Shatz
>> > Tailor Brands | CTO
>>
>
>
> --
> Nadav Shatz
> Tailor Brands | CTO
>
--
Nadav Shatz
Tailor Brands | CTO
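
[Editorial sketch] The v3 memqcache guard described in the quoted message above (skip the query cache fetch when a dml_adaptive* session has already tracked a write inside an explicit transaction) might look roughly like this. It is a self-contained illustration under stated assumptions: the Sketch* types and sketch_* functions are hypothetical stand-ins for pgpool's session context, and tracked_writes stands in for "transaction_temp_write_list is non-empty". Only the three conditions themselves come from the message.

```c
#include <stdbool.h>

/* Hypothetical mirror of the disable_load_balance_on_write modes. */
typedef enum
{
	SKETCH_DLBOW_OFF,
	SKETCH_DLBOW_TRANSACTION,
	SKETCH_DLBOW_DML_ADAPTIVE,
	SKETCH_DLBOW_DML_ADAPTIVE_GLOBAL
} SketchDlbowMode;

/* Hypothetical slice of the session state the guard needs. */
typedef struct
{
	SketchDlbowMode mode;
	bool		in_explicit_transaction;
	int			tracked_writes;	/* writes recorded in this transaction */
} SketchSession;

/* True only when all three conditions from the v3 description hold. */
static bool
sketch_has_dml_adaptive_write_in_transaction(const SketchSession *s)
{
	bool		adaptive = (s->mode == SKETCH_DLBOW_DML_ADAPTIVE ||
							s->mode == SKETCH_DLBOW_DML_ADAPTIVE_GLOBAL);

	return adaptive && s->in_explicit_transaction && s->tracked_writes > 0;
}

/* Cache fetch is allowed only if neither guard objects. */
static bool
sketch_may_fetch_from_query_cache(const SketchSession *s,
								  bool writing_transaction)
{
	return !writing_transaction &&
		!sketch_has_dml_adaptive_write_in_transaction(s);
}
```

The point of the design, as the message explains it, is that the writing_transaction flag is left alone, so routing stays per-table while the cache is still bypassed after a write in the same transaction.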
* Re: Proposal: Recent mutated table tracking in memory
@ 2026-05-18 10:11 ` Tatsuo Ishii <[email protected]>
2026-05-20 04:28 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
0 siblings, 1 reply; 44+ messages in thread
From: Tatsuo Ishii @ 2026-05-18 10:11 UTC (permalink / raw)
To: [email protected]; +Cc: [email protected]
Hi Nadav,
Sorry, I missed your last email.
Will check & test tomorrow.
Regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
> Hi Tatsuo,
>
> Any update on this proposal?
>
> Thanks
>
> On Thu, Apr 23, 2026 at 5:16 PM Nadav Shatz <[email protected]> wrote:
>
>> Hi Tatsuo,
>>
>> Good catch on the 006.memqcache timeout. My previous fix had
>> unintended side effects -- setting writing_transaction for
>> dml_adaptive_global also changed routing behavior (it forced the
>> whole transaction to primary, effectively reducing the feature to
>> 'transaction' mode). That's what caused the hang.
>>
>> Fixed properly in v3: instead of touching writing_transaction,
>> added a memqcache-specific guard that checks whether the current
>> dml_adaptive* session has tracked writes in the current
>> transaction, and skips the cache fetch if so.
>>
>> Attached: v3-0001-Feature-load-balancing-control-by-table-tracking.patch
>>
>> Changes in v3 vs v2:
>>
>> - pool_set_writing_transaction() reverted to original behavior
>> (dml_adaptive_global no longer sets writing_transaction, so
>> routing stays per-table as intended).
>>
>> - Added new helper pool_has_dml_adaptive_write_in_transaction()
>> in pool_session_context.c. Returns true when the current session
>> is in dml_adaptive* mode, is inside an explicit transaction, and
>> has already tracked at least one write (via
>> transaction_temp_write_list).
>>
>> - The two memqcache fetch guards in pool_proto_modules.c
>> (simple query at line 270, extended query at line 1028) now
>> also call !pool_has_dml_adaptive_write_in_transaction().
>> Autocommit writes in dml_adaptive_global are still handled by
>> the existing pool_invalidate_query_cache() at COMMIT time --
>> no change needed there.
>>
>> Verified locally by mutating 006.memqcache with
>> disable_load_balance_on_write = 'dml_adaptive_global' in the
>> streaming replication mode (the only mode where dml_adaptive
>> applies) and the jdbctest now correctly returns "2" instead of
>> the stale cached "1". Both 006.memqcache and 043.track_table_mutation
>> pass.
>>
>> Thanks!
>>
>> On Thu, Apr 23, 2026 at 11:14 AM Tatsuo Ishii <[email protected]>
>> wrote:
>>
>>> Hi Nadav,
>>>
>>> Unfortunately the mutated 006.memqcache failed (timeout).
>>>
>>> >> > memqcache bug fix
>>> >> > -----------------
>>> >> >
>>> >> > Good catch. The root cause: pool_set_writing_transaction() was
>>> >> > explicitly skipping dml_adaptive_global, so
>>> >> > pool_is_writing_transaction() always returned false in this mode.
>>> >> > The query cache fetch guard at pool_proto_modules.c:270
>>> >> > (!pool_is_writing_transaction()) then served stale cached results
>>> >> > after DML in the same transaction.
>>> >> >
>>> >> > Fix: pool_set_writing_transaction() now sets the flag for
>>> >> > dml_adaptive_global (only 'off' and 'dml_adaptive' skip it). This
>>> >> > ensures the query cache is properly bypassed after writes within
>>> >> > the same transaction.
>>>
>>> Regards,
>>> --
>>> Tatsuo Ishii
>>> SRA OSS K.K.
>>> English: http://www.sraoss.co.jp/index_en/
>>> Japanese:http://www.sraoss.co.jp
>>>
>>> > Hi Tatsuo,
>>> >
>>> > Rebased onto current master, renumbered the regression tests
>>> > (043/044 to avoid collision with the new 042.ssl_reload), and
>>> > combined everything into a single commit.
>>> >
>>> > Attached: v2-0001-Feature-load-balancing-control-by-table-tracking.patch
>>> >
>>> > Looking forward to your review.
>>> >
>>> >
>>> > On Sun, Apr 19, 2026 at 10:25 AM Tatsuo Ishii <[email protected]>
>>> wrote:
>>> >
>>> >> > Hi Tatsuo,
>>> >> >
>>> >> > Thank you for the detailed review. The attached patch addresses all items.
>>> >>
>>> >> I guess the attached patch is on top of
>>> >> v1-0001-Feature-load-balancing-control-by-table-tracking.patch. To
>>> >> apply v2-0001-address-review.patch, we need to apply
>>> >> v1-0001-Feature-load-balancing-control-by-table-tracking.patch first.
>>> >> Unfortunately, due to a recent commit, it does not apply anymore. Can you
>>> >> please provide v1 + v2 rebased against the latest master branch?
>>> >> Also, regression test number 042 is already used by a recent commit. Can you
>>> >> renumber 042.track_table_mutation and
>>> >> 043.track_table_mutation_watchdog to 043.track_table_mutation and
>>> >> 044.track_table_mutation_watchdog accordingly?
>>> >>
>>> >> Looking forward to seeing new patch.
>>> >>
>>> >> Regards,
>>> >> --
>>> >> Tatsuo Ishii
>>> >> SRA OSS K.K.
>>> >> English: http://www.sraoss.co.jp/index_en/
>>> >> Japanese:http://www.sraoss.co.jp
>>> >>
>>> >>
>>> >> > memqcache bug fix
>>> >> > -----------------
>>> >> >
>>> >> > Good catch. The root cause: pool_set_writing_transaction() was
>>> >> > explicitly skipping dml_adaptive_global, so
>>> >> > pool_is_writing_transaction() always returned false in this mode.
>>> >> > The query cache fetch guard at pool_proto_modules.c:270
>>> >> > (!pool_is_writing_transaction()) then served stale cached results
>>> >> > after DML in the same transaction.
>>> >> >
>>> >> > Fix: pool_set_writing_transaction() now sets the flag for
>>> >> > dml_adaptive_global (only 'off' and 'dml_adaptive' skip it). This
>>> >> > ensures the query cache is properly bypassed after writes within
>>> >> > the same transaction.
>>> >> >
>>> >> > Removed dead query parse cache code (~700 lines)
>>> >> > -------------------------------------------------
>>> >> >
>>> >> > You're right -- pool_track_table_mutation_get_cached_parse,
>>> >> > pool_track_table_mutation_cache_parse, and
>>> >> > pool_track_table_mutation_normalize_and_hash were never called.
>>> >> > These were leftover from an earlier design where we planned to
>>> >> > cache SQL parse results in shared memory. The feature ended up
>>> >> > using pgpool's existing parser directly, and this code was never
>>> >> > wired up.
>>> >> >
>>> >> > Removed: QueryParseCache and QueryParseEntry structs, all related
>>> >> > static functions, the TRACK_TABLE_MUTATION_QUERY_SEM semaphore,
>>> >> > and the track_table_mutation_query_buckets /
>>> >> > track_table_mutation_query_parse_cache_size configuration
>>> >> > parameters. This also reduces shared memory usage from ~6.4 MB
>>> >> > to ~80 KB with default settings.
>>> >> >
>>> >> > check_object_relationship_list scope
>>> >> > -------------------------------------
>>> >> >
>>> >> > You're correct -- dml_adaptive_global does not use
>>> >> > dml_adaptive_object_relationship_list. Changed
>>> >> > check_object_relationship_list() to check for DLBOW_DML_ADAPTIVE
>>> >> > only, not DLBOW_IS_DML_ADAPTIVE (which includes global).
>>> >> >
>>> >> > Documentation fixes
>>> >> > -------------------
>>> >> >
>>> >> > - Removed "(Lagless Replica Reads)" from section title and
>>> >> > "lagless" language from description.
>>> >> >
>>> >> > - Described fallback behavior when neither
>>> >> > replication_delay_source_cmd nor delay_threshold_by_time is
>>> >> > configured (TTL stays at 100ms default minimum).
>>> >> >
>>> >> > - "query cache" references removed (the query parse cache is gone).
>>> >> >
>>> >> > - Added 128-table-per-SELECT limit to Limitations section
>>> >> > (uses POOL_MAX_SELECT_OIDS).
>>> >> >
>>> >> > Code style fixes
>>> >> > ----------------
>>> >> >
>>> >> > - DLBOW_IS_DML_ADAPTIVE() calls no longer split across lines.
>>> >> >
>>> >> > - Split the long errmsg line in
>>> >> > is_select_object_in_temp_write_list.
>>> >> >
>>> >> > - Removed redundant is_adaptive variable in
>>> >> > is_select_object_in_temp_write_list (the check at function
>>> >> > entry already guarantees it).
>>> >> >
>>> >> > Thanks!
>>> >> >
>>> >> > On Wed, Apr 15, 2026 at 1:43 AM Tatsuo Ishii <[email protected]>
>>> >> wrote:
>>> >> >
>>> >> >> Hi Nadav,
>>> >> >>
>>> >> >> > Hi Tatsuo,
>>> >> >> >
>>> >> >> > Looks good to me thanks!
>>> >> >> >
>>> >> >> > Please go ahead with your review. waiting to hear back from you.
>>> >> >>
>>> >> >> Here are the code review results.
>>> >> >>
>>> >> >> diff --git a/doc/src/sgml/loadbalance.sgml
>>> >> b/doc/src/sgml/loadbalance.sgml
>>> >> >> index 9e1e7b39b..7384ce81a 100644
>>> >> >> --- a/doc/src/sgml/loadbalance.sgml
>>> >> >> +++ b/doc/src/sgml/loadbalance.sgml
>>> >> >> :
>>> >> >> + <sect2 id="runtime-config-table-mutation-map">
>>> >> >> + <title>Table Mutation Map Configuration (Lagless Replica
>>> >> Reads)</title>
>>> >> >>
>>> >> >> "(Lagless Replica Reads)" sounds like an advertisement to me. It
>>> >> >> should be removed.
>>> >> >>
>>> >> >> + <para>
>>> >> >> + These parameters configure the track table mutation feature,
>>> which
>>> >> is
>>> >> >> activated by setting
>>> >> >> + <xref linkend="guc-disable-load-balance-on-write"> to
>>> >> >> <literal>dml_adaptive_global</literal>.
>>> >> >> + The feature tracks recently written tables to prevent stale
>>> reads
>>> >> from
>>> >> >> replica nodes during
>>> >> >> + replication lag, implementing the "lagless" architecture
>>> pattern for
>>> >> >> distributed systems
>>> >> >> + with read replicas.
>>> >> >>
>>> >> >> I think the feature does not guarantee "lagless" behavior at all
>>> >> >> times, in all cases.
>>> >> >>
>>> >> >> + <para>
>>> >> >> + This feature requires time-based replication delay monitoring.
>>> This
>>> >> >> can be provided by either
>>> >> >> + <xref linkend="guc-replication-delay-source-cmd"> (external
>>> command
>>> >> >> mode) or by setting
>>> >> >> + <xref linkend="guc-delay-threshold-by-time"> (which uses
>>> >> >> <literal>pg_stat_replication.replay_lag</literal>
>>> >> >> + from PostgreSQL 10+). At least one of these must be configured
>>> for
>>> >> the
>>> >> >> TTL calculation to work.
>>> >> >>
>>> >> >> If neither of these is set, what happens? An error? This needs to
>>> >> >> be described.
>>> >> >>
>>> >> >> + </para>
>>> >> >> +
>>> >> >> + <warning>
>>> >> >> + <para>
>>> >> >> + Enabling <literal>dml_adaptive_global</literal> increases
>>> shared
>>> >> >> memory consumption. With default settings,
>>> >> >> + the feature requires approximately 6.4 MB of shared memory
>>> (0.1 MB
>>> >> >> for table tracking + 6.3 MB for query cache).
>>> >> >>
>>> >> >> "query cache" should be "query parse cache".
>>> >> >>
>>> >> >> + Memory usage scales with configuration parameters:
>>> >> >> + </para>
>>> >> >> + <itemizedlist>
>>> >> >> + <listitem>
>>> >> >> + <para>
>>> >> >> + Table tracking: <literal>track_table_mutation_table_size * 40
>>> >> >> bytes</literal> (default: 2048 * 40 = ~80 KB)
>>> >> >> + </para>
>>> >> >> + </listitem>
>>> >> >> + <listitem>
>>> >> >> + <para>
>>> >> >> + Query cache:
>>> >> <literal>track_table_mutation_query_parse_cache_size *
>>> >> >> 640 bytes</literal> (default: 10000 * 640 = ~6.3 MB)
>>> >> >>
>>> >> >> "query cache" should be "query parse cache".
>>> >> >>
>>> >> >> + <title>Limitations</title>
>>> >> >>
>>> >> >> I think the number of tables tracked in a SELECT is limited to 8. It
>>> >> >> should be mentioned.
>>> >> >>
>>> >> >> diff --git a/src/context/pool_query_context.c
>>> >> >> b/src/context/pool_query_context.c
>>> >> >> index a056ac596..0190d3673 100644
>>> >> >> --- a/src/context/pool_query_context.c
>>> >> >> +++ b/src/context/pool_query_context.c
>>> >> >> @@ -1828,15 +1829,23 @@ is_in_list(char *name, List *list)
>>> >> >> static bool
>>> >> >> is_select_object_in_temp_write_list(Node *node, void *context)
>>> >> >> {
>>> >> >> - if (node == NULL ||
>>> pool_config->disable_load_balance_on_write
>>> >> !=
>>> >> >> DLBOW_DML_ADAPTIVE)
>>> >> >> + if (node == NULL ||
>>> >> >> + !DLBOW_IS_DML_ADAPTIVE(
>>> >> >> +
>>> >> >> pool_config->disable_load_balance_on_write))
>>> >> >>
>>> >> >> You don't need to split the line.
>>> >> >>
>>> >> >> + is_adaptive = DLBOW_IS_DML_ADAPTIVE(
>>> >> >> +
>>> >> >> pool_config->disable_load_balance_on_write);
>>> >> >>
>>> >> >> You don't need to split the line.
>>> >> >>
>>> >> >> - if (pool_config->disable_load_balance_on_write ==
>>> >> >> DLBOW_DML_ADAPTIVE && session_context->is_in_transaction)
>>> >> >> + if (is_adaptive &&
>>> >> >> + session_context->is_in_transaction)
>>> >> >> {
>>> >> >> ereport(DEBUG1,
>>> >> >>
>>> >> >> (errmsg("is_select_object_in_temp_write_list: \"%s\", found relation
>>> >> >> \"%s\"", (char *) context, rgv->relname)));
>>> >> >> This line is too long. Please split.
>>> >> >>
>>> >> >> @@ -1880,7 +1889,13 @@ static char
>>> >> >> *get_associated_object_from_dml_adaptive_relations
>>> >> >> void
>>> >> >> check_object_relationship_list(char *name, bool is_func_name)
>>> >> >> {
>>> >> >> - if (pool_config->disable_load_balance_on_write ==
>>> >> >> DLBOW_DML_ADAPTIVE &&
>>> >> >> pool_config->parsed_dml_adaptive_object_relationship_list)
>>> >> >> + bool is_adaptive;
>>> >> >> +
>>> >> >> + is_adaptive = DLBOW_IS_DML_ADAPTIVE(
>>> >> >> +
>>> >> >> pool_config->disable_load_balance_on_write);
>>> >> >>
>>> >> >> I wrote in the commit message:
>>> >> >>
>>> >> >> modifications are only detected in the same transaction). Note,
>>> >> >> however, you cannot use dml_adaptive_object_relationship_list to
>>> >> >> track dependency among table and other objects.
>>> >> >>
>>> >> >> In my understanding the feature does not use
>>> >> >> dml_adaptive_object_relationship_list. If this is correct, why is
>>> >> >> check_object_relationship_list() called here in the
>>> >> >> dml_adaptive_global case? If the feature does use
>>> >> >> dml_adaptive_object_relationship_list, test cases should be
>>> >> >> included.
>>> >> >>
>>> >> >> diff --git a/src/utils/pool_track_table_mutation.c
>>> >> >> b/src/utils/pool_track_table_mutation.c
>>> >> >> new file mode 100644
>>> >> >> index 000000000..9be46b28f
>>> >> >> --- /dev/null
>>> >> >> +++ b/src/utils/pool_track_table_mutation.c
>>> >> >>
>>> >> >> It seems following functions are not used anywhere. I wonder if this
>>> >> >> feature actually use "query parse cache".
>>> >> >>
>>> >> >> pool_track_table_mutation_get_cached_parse
>>> >> >> pool_track_table_mutation_cache_parse
>>> >> >> pool_track_table_mutation_normalize_and_hash
>>> >> >>
>>> >> >> Besides the code review, I modified one of the regression tests to
>>> >> >> check whether the feature coexists with the existing in-memory
>>> >> >> query cache feature. After applying the attached patch, I ran
>>> >> >> 006.memqcache and got the following result.
>>> >> >>
>>> >> >> cd src/test/regression
>>> >> >> ./regress.sh 006
>>> >> >> creating pgpool-II temporary installation ...
>>> >> >> moving pgpool_setup to temporary installation path ...
>>> >> >> moving watchdog_setup to temporary installation path ...
>>> >> >> using pgpool-II at
>>> >> >>
>>> >>
>>> /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/temp/installed
>>> >> >> *************************
>>> >> >> REGRESSION MODE : install
>>> >> >> Pgpool-II version : pgpool-II version 4.8devel
>>> (mitsukakeboshi)
>>> >> >> Pgpool-II install path :
>>> >> >>
>>> >>
>>> /home/t-ishii/work/Pgpool-II/current/pgpool2/src/test/regression/temp/installed
>>> >> >> PostgreSQL bin : /usr/local/pgsql/bin
>>> >> >> PostgreSQL Major version : 18
>>> >> >> pgbench : /usr/local/pgsql/bin/pgbench
>>> >> >> PostgreSQL jdbc :
>>> >> >> /usr/local/pgsql/share/postgresql-9.2-1003.jdbc4.jar
>>> >> >> *************************
>>> >> >> testing 006.memqcache...failed.
>>> >> >> out of 1 ok:0 failed:1 timeout:0
>>> >> >>
>>> >> >> log/006.memqcache shows:
>>> >> >>
>>> >> >> ../expected.txt result.txt differ: char 1, line 1
>>> >> >>
>>> >> >> So I checked the test script and found the error was generated by a
>>> >> >> Java program test.
>>> >> >>
>>> >> >> java jdbctest > result.txt 2>&1
>>> >> >> cmp ../expected.txt result.txt
>>> >> >> if [ $? != 0 ];then
>>> >> >> ./shutdownall
>>> >> >> exit 1
>>> >> >> fi
>>> >> >>
>>> >> >> In jdbctest.java:
>>> >> >>
>>> >> >> /*
>>> >> >> * Cache test in an explicit transaction
>>> >> >> */
>>> >> >> conn.setAutoCommit(false);
>>> >> >> // execute DML. This should prevent SELECTs from
>>> using
>>> >> >> query cache in the transaction.
>>> >> >> sql = "UPDATE t1 SET i = 2;";
>>> >> >> pst = conn.createStatement();
>>> >> >> pst.executeUpdate(sql);
>>> >> >> pst.close();
>>> >> >> // should not use the cache and should return "2",
>>> >> rather
>>> >> >> than "1"
>>> >> >> prest = conn.prepareStatement("SELECT * FROM t1");
>>> >> >> rs = prest.executeQuery();
>>> >> >>
>>> >> >> The expected file (expected.txt) has "2" but the result file
>>> >> >> (testdir/result.txt) was "1". This is the reason why the test
>>> >> >> failed. I wonder if there's something wrong with the feature when
>>> >> >> the query cache is enabled. Can you look into this?
>>> >> >>
>>> >> >> Regards,
>>> >> >> --
>>> >> >> Tatsuo Ishii
>>> >> >> SRA OSS K.K.
>>> >> >> English: http://www.sraoss.co.jp/index_en/
>>> >> >> Japanese:http://www.sraoss.co.jp
>>> >> >>
>>> >> >
>>> >> >
>>> >> > --
>>> >> > Nadav Shatz
>>> >> > Tailor Brands | CTO
>>> >>
>>> >
>>> >
>>> > --
>>> > Nadav Shatz
>>> > Tailor Brands | CTO
>>>
>>
>>
>> --
>> Nadav Shatz
>> Tailor Brands | CTO
>>
>
>
> --
> Nadav Shatz
> Tailor Brands | CTO
^ permalink raw reply [nested|flat] 44+ messages in thread
* Re: Proposal: Recent mutated table tracking in memory
2026-01-06 11:25 Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-14 08:55 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-28 05:08 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-29 08:28 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-29 08:54 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-30 08:09 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-31 17:11 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-03 07:43 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-03 23:23 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-06 11:29 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-10 15:16 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-11 10:28 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-12 09:05 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-18 23:51 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-19 04:40 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-19 11:05 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-26 07:47 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-26 15:26 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-03-09 05:18 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-03-09 09:22 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-03-23 05:13 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-03-23 13:07 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-04-07 00:08 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-04-07 05:45 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-04-07 09:10 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-04-07 09:43 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-04-09 07:21 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-04-14 22:43 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-04-15 12:17 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-04-19 07:24 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-04-19 14:29 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-04-23 08:14 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-04-23 14:16 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-05-18 09:54 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-05-18 10:11 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
@ 2026-05-20 04:28 ` Tatsuo Ishii <[email protected]>
2026-05-20 12:25 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
0 siblings, 1 reply; 44+ messages in thread
From: Tatsuo Ishii @ 2026-05-20 04:28 UTC (permalink / raw)
To: [email protected]; +Cc: [email protected]
> Hi Nadav,
>
> Sorry, I missed your last email.
> Will check & test tomorrow.
I finally got a chance to test your v3 patch.
Unfortunately the test failed with timeout again.
testing 006.memqcache...timeout.
out of 1 ok:0 failed:0 timeout:1
^ permalink raw reply [nested|flat] 44+ messages in thread
* Re: Proposal: Recent mutated table tracking in memory
2026-01-06 11:25 Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-14 08:55 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-28 05:08 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-29 08:28 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-29 08:54 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-30 08:09 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-31 17:11 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-03 07:43 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-03 23:23 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-06 11:29 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-10 15:16 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-11 10:28 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-12 09:05 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-18 23:51 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-19 04:40 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-19 11:05 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-26 07:47 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-26 15:26 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-03-09 05:18 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-03-09 09:22 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-03-23 05:13 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-03-23 13:07 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-04-07 00:08 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-04-07 05:45 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-04-07 09:10 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-04-07 09:43 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-04-09 07:21 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-04-14 22:43 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-04-15 12:17 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-04-19 07:24 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-04-19 14:29 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-04-23 08:14 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-04-23 14:16 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-05-18 09:54 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-05-18 10:11 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-05-20 04:28 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
@ 2026-05-20 12:25 ` Nadav Shatz <[email protected]>
2026-05-21 09:50 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
0 siblings, 1 reply; 44+ messages in thread
From: Nadav Shatz @ 2026-05-20 12:25 UTC (permalink / raw)
To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]
Hi Tatsuo,
Thanks for checking the V3, sorry for missing the test issue.
I reproduced the timeout locally. Found and fixed the root cause.
Root cause
----------
In CommandComplete.c, the autocommit write-tracking code was
gated only on session_context->is_in_transaction, not on the
cluster mode.
In native replication and snapshot isolation modes,
dml_adaptive() is never called (it lives inside
where_to_send_main_replica), so is_in_transaction is never set
to true even inside an explicit BEGIN/COMMIT block. That meant
every DML in those modes was treated as autocommit by the
write-tracking code, triggering
pool_track_table_mutation_get_database_oid() (which does a
relcache do_query) while a transaction was actually in flight
on the backend connection. The do_query conflicts with the
in-flight transaction and hangs the session. Subsequent
shutdown then hangs in terminate_all_childrens / waitpid.
Fix
---
Gate the autocommit write-tracking in CommandComplete.c on
MAIN_REPLICA in addition to the existing checks.
dml_adaptive_global is only meaningful in streaming replication
mode anyway (the matching routing logic in
where_to_send_main_replica is already SR-only), so this just
makes the autocommit path consistent.
Also broadened the query cache bypass to all dml_adaptive*
modes. The new helper pool_has_dml_adaptive_write_in_transaction()
checks the existing memqcache DML oid buffer (oidbufp via the
new pool_has_dml_table_oids()), which is populated for any DML
in any cluster mode and reset on transaction boundary. This
fixes the original "SELECT returns stale 1 instead of 2 after
UPDATE" regression in streaming replication and avoids the same
class of bug in plain dml_adaptive too.
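The bypass idea can be sketched roughly as follows. This is an illustration under assumptions, not the real memqcache code: the buffer type, its size, and all function names here are hypothetical, and only the behavior (filled on DML, reset on transaction boundary, checked before a cache fetch) follows the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_DML_OIDS 16	/* arbitrary size for the sketch */

/* Hypothetical per-session buffer of table oids touched by DML. */
typedef struct
{
	unsigned int oids[SKETCH_MAX_DML_OIDS];
	size_t		count;
} DmlOidBuffer;

/* Record a table oid when a DML statement completes. */
static void
note_dml_table(DmlOidBuffer *buf, unsigned int oid)
{
	assert(buf != NULL);
	/* A full buffer drops the oid, but count stays > 0, so the bypass holds. */
	if (buf->count < SKETCH_MAX_DML_OIDS)
		buf->oids[buf->count++] = oid;
}

/* Called on COMMIT/ROLLBACK: the next transaction starts clean. */
static void
reset_on_transaction_end(DmlOidBuffer *buf)
{
	buf->count = 0;
}

/* True if any DML has been seen since the last transaction boundary. */
static bool
has_dml_in_transaction(const DmlOidBuffer *buf)
{
	return buf->count > 0;
}

/* The cache fetch guard: never serve cached rows after a write. */
static bool
may_serve_from_query_cache(const DmlOidBuffer *buf)
{
	return !has_dml_in_transaction(buf);
}
```

The point of checking the buffer rather than a mode-specific flag is that the guard no longer depends on which routing path set is_in_transaction.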
Verified
--------
- 006.memqcache with disable_load_balance_on_write =
'dml_adaptive_global' appended in all three modes: PASS
- 043.track_table_mutation: PASS
Attached: v4-0001-Feature-load-balancing-control-by-table-tracking.patch
Thanks!
On Wed, May 20, 2026 at 7:28 AM Tatsuo Ishii <[email protected]> wrote:
> > Hi Nadav,
> >
> > Sorry, I missed your last email.
> > Will check & test tomorrow.
>
> I finally got a chance to test your v3 patch.
> Unfortunately the test failed with timeout again.
>
> testing 006.memqcache...timeout.
> out of 1 ok:0 failed:0 timeout:1
>
> From src/test/regression/log/006.memqcache:
>
> 2026-05-20 13:08:33.798: main pid 3562591: LOG: stop request sent to
> pgpool (pid: 3561918). waiting for termination...
> .....2026-05-20 13:08:38.799: main pid 3562591: LOG: stop request sent to
> pgpool (pid: 3561918). waiting for termination...
> .....2026-05-20 13:08:43.801: main pid 3562591: LOG: stop request sent to
> pgpool (pid: 3561918). waiting for termination...
>
> It seems pgpool main process won't stop.
>
> Regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>
--
Nadav Shatz
Tailor Brands | CTO
^ permalink raw reply [nested|flat] 44+ messages in thread
* Re: Proposal: Recent mutated table tracking in memory
2026-01-06 11:25 Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-14 08:55 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-28 05:08 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-29 08:28 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-29 08:54 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-01-30 08:09 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-01-31 17:11 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-03 07:43 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-03 23:23 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-06 11:29 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-10 15:16 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-11 10:28 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-12 09:05 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-18 23:51 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-19 04:40 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-19 11:05 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-02-26 07:47 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-02-26 15:26 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-03-09 05:18 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-03-09 09:22 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-03-23 05:13 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-03-23 13:07 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-04-07 00:08 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-04-07 05:45 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-04-07 09:10 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-04-07 09:43 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-04-09 07:21 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-04-14 22:43 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-04-15 12:17 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-04-19 07:24 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-04-19 14:29 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-04-23 08:14 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-04-23 14:16 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-05-18 09:54 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
2026-05-18 10:11 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-05-20 04:28 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
2026-05-20 12:25 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
@ 2026-05-21 09:50 ` Tatsuo Ishii <[email protected]>
2026-05-23 11:18 ` Re: Proposal: Recent mutated table tracking in memory Tatsuo Ishii <[email protected]>
0 siblings, 1 reply; 44+ messages in thread
From: Tatsuo Ishii @ 2026-05-21 09:50 UTC (permalink / raw)
To: [email protected]; +Cc: [email protected]
Hi Nadav,
> Hi Tatsuo,
>
> Thanks for checking the V3, sorry for missing the test issue.
>
> I reproduced the timeout locally. Found and fixed the root cause.
>
> Root cause
> ----------
>
> In CommandComplete.c, the autocommit write-tracking code was
> gated only on session_context->is_in_transaction, not on the
> cluster mode.
I think you are talking about the logic that judges whether we are in an
explicit transaction here. Currently dml_adaptive checks whether the
supplied query is a transaction-starting command like BEGIN. IMO this
is fundamentally wrong because the command may fail for various reasons.
The correct way is to check the transaction state using the TSTATE
macro. Note that the macro can only be used after at least one ready for
query response has been returned from the backend (simple query protocol
case), or after a command complete response has been returned from the
backend (extended query protocol case).
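To illustrate the difference, here is a minimal sketch. It is not
Pgpool-II code: BackendModel and the three helper functions are
hypothetical names made up for this example, and the only fact it relies
on is that the backend reports 'I' (idle), 'T' (in a transaction block)
or 'E' (failed transaction block) in its ready for query response.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical model of one backend connection. tstate holds the last
 * transaction status byte reported by the backend:
 * 'I' = idle, 'T' = in transaction block, 'E' = failed transaction block.
 */
typedef struct
{
	char		tstate;
} BackendModel;

/*
 * Fragile approach: guess the transaction state from the query text.
 * This says "in transaction" even if the server later rejects the command.
 */
static bool
guess_from_query_text(const char *query)
{
	assert(query != NULL);
	return strncmp(query, "BEGIN", 5) == 0;
}

/*
 * Robust approach: trust the state the backend reported. Only valid after
 * at least one status byte has been received from the backend.
 */
static bool
in_explicit_transaction(const BackendModel *backend)
{
	assert(backend != NULL);
	return backend->tstate == 'T' || backend->tstate == 'E';
}

/* Record the status byte carried by a ready for query response. */
static void
receive_ready_for_query(BackendModel *backend, char status)
{
	assert(status == 'I' || status == 'T' || status == 'E');
	backend->tstate = status;
}
```

For a command such as "BEGIN ISOLATION LEVEL bogus", which the server
rejects, guess_from_query_text() returns true while the backend keeps
reporting 'I', so in_explicit_transaction() correctly returns false.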
> In native replication and snapshot isolation modes,
> dml_adaptive() is never called (it lives inside
> where_to_send_main_replica), so is_in_transaction is never set
> to true even inside an explicit BEGIN/COMMIT block. That meant
> every DML in those modes was treated as autocommit by the
> write-tracking code, triggering
> pool_track_table_mutation_get_database_oid() ― which does a
> relcache do_query ― while a transaction was actually in flight
> on the backend connection. The do_query conflicts with the
> in-flight transaction and hangs the session.
Assuming "a transaction was actually in flight" means a transaction
was open (explicit transaction), not really. do_query can be called
inside or outside of an explicit transaction.
Anyway, I found that dml_adaptive is completely broken (it returns wrong
results if the query cache is enabled). Unless there are users of the
feature, maybe we should remove dml_adaptive entirely?
Regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
^ permalink raw reply [nested|flat] 44+ messages in thread
* Re: Proposal: Recent mutated table tracking in memory
@ 2026-05-23 11:18 ` Tatsuo Ishii <[email protected]>
2026-05-24 17:00 ` Re: Proposal: Recent mutated table tracking in memory Nadav Shatz <[email protected]>
0 siblings, 1 reply; 44+ messages in thread
From: Tatsuo Ishii @ 2026-05-23 11:18 UTC (permalink / raw)
To: [email protected]; +Cc: [email protected]
> I think you are talking about the logic that judges whether we are in an
> explicit transaction here. Currently dml_adaptive checks whether the
> supplied query is a transaction-starting command like BEGIN. IMO this
> is fundamentally wrong because the command may fail for various reasons.
> The correct way is to check the transaction state using the TSTATE
> macro. Note that the macro can only be used after at least one ready for
> query response has been returned from the backend (simple query protocol
> case), or after a command complete response has been returned from the
> backend (extended query protocol case).
>
>> In native replication and snapshot isolation modes,
>> dml_adaptive() is never called (it lives inside
>> where_to_send_main_replica), so is_in_transaction is never set
>> to true even inside an explicit BEGIN/COMMIT block. That meant
>> every DML in those modes was treated as autocommit by the
>> write-tracking code, triggering
>> pool_track_table_mutation_get_database_oid() ― which does a
>> relcache do_query ― while a transaction was actually in flight
>> on the backend connection. The do_query conflicts with the
>> in-flight transaction and hangs the session.
>
> Assuming "a transaction was actually in flight" means a transaction
> was open (explicit transaction), not really. do_query can be called
> inside or outside of an explicit transaction.
>
> Anyway, I found that dml_adaptive is completely broken (it returns wrong
> results if the query cache is enabled). Unless there are users of the
> feature, maybe we should remove dml_adaptive entirely?
It appears that the other options of disable_load_balance_on_write are
all broken too, except "transaction". I don't want to discard all of
them, so I came up with the attached patch.
The query cache relies on is_writing_transaction of the session context
to judge whether the cache can be safely used. However,
disable_load_balance_on_write sets it to true when it should not, and
vice versa, for its own purposes. To fix this, a new session context
variable "really_writing_transaction" is introduced. It is almost the
same as the existing writing_transaction, but it faithfully tracks
whether a writing query has appeared in an explicit transaction. The
query cache uses it instead of the writing_transaction variable.
Currently, the master branch is broken because of commit 2ae004a48. If
you want to try the patch, I recommend checking out 48e1d6d3c and then
applying the patch.
Regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
^ permalink raw reply [nested|flat] 44+ messages in thread
* Re: Proposal: Recent mutated table tracking in memory
@ 2026-05-24 17:00 ` Nadav Shatz <[email protected]>
0 siblings, 0 replies; 44+ messages in thread
From: Nadav Shatz @ 2026-05-24 17:00 UTC (permalink / raw)
To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]
Hi Tatsuo,
Your really_writing_transaction approach is the right fix -- it
addresses the root cause across all DLBOW modes, not just ours.
Thanks for digging into it.
I applied your v1 patch and rebased our feature on top. Attaching
both patches separately so they can land independently in the
order you prefer:
v5-0001-Fix-disable_load_balance_on_write-and-query-cache.patch
-- your patch unchanged (just rebased to apply cleanly on
current master without our feature underneath).
v5-0002-Feature-load-balancing-control-by-table-tracking.patch
-- our feature, on top of your fix.
Changes in v5-0002 vs v4:
- Dropped pool_has_dml_adaptive_write_in_transaction() helper and
the matching pool_has_dml_table_oids() exposure. The cache
fetch guards in pool_proto_modules.c now correctly use
pool_is_really_writing_transaction() from your patch, so the
helper became redundant.
- Kept the MAIN_REPLICA gate in CommandComplete.c for the
autocommit mark-stale branch. dml_adaptive_global is only
meaningful in streaming replication mode (matches the routing
logic in where_to_send_main_replica), and gating prevents the
hang we saw in native_replication where the autocommit branch
could run while an explicit transaction was actually in flight
on the backend.
I tried to run 006.memqcache with the mutation against the
combined branch but local master is currently broken (commit
2ae004a48 as you noted), so the standby setup fails before
reaching the jdbctest part. Both patches build cleanly and our
043.track_table_mutation passes on an earlier base. Will retest
once master is unbroken.
Thanks!
--
Nadav Shatz
Tailor Brands | CTO
Attachments:
[application/octet-stream] v5-0001-Fix-disable_load_balance_on_write-and-query-cache.patch (8.3K, 3-v5-0001-Fix-disable_load_balance_on_write-and-query-cache.patch)
download | inline diff:
From 7f9a3bcb13f30b0ceffd1448f0fac98a9a6e713a Mon Sep 17 00:00:00 2001
From: Tatsuo Ishii <[email protected]>
Date: Sun, 24 May 2026 19:25:10 +0300
Subject: [PATCH v5 1/2] Fix disable_load_balance_on_write and query cache.
The disable_load_balance_on_write parameter accepts four options:
transaction (the default)
trans_transaction
dml_adaptive
always
It turned out that all options except "transaction" break the query
cache feature. Sometimes a query result is cached even when there's a
write query in a transaction, and sometimes a query is not cached even
when it should be.
The query cache relies on is_writing_transaction of the session context
to judge whether the cache can be safely used. However,
disable_load_balance_on_write sets it to true when it should not, and
vice versa, for its own purposes. To fix this, a new session context
variable "really_writing_transaction" is introduced. It is almost the
same as the existing writing_transaction, but it faithfully tracks
whether a writing query has appeared in an explicit transaction. The
query cache uses it instead of the writing_transaction variable.
Author: Tatsuo Ishii <[email protected]>
Discussion:
Backpatch-through: v4.3
---
src/context/pool_session_context.c | 22 ++++++++++++++++++++++
src/include/context/pool_session_context.h | 19 ++++++++++++++++++-
src/protocol/CommandComplete.c | 1 +
src/protocol/pool_process_query.c | 1 +
src/protocol/pool_proto_modules.c | 17 +++++++++++++----
5 files changed, 55 insertions(+), 5 deletions(-)
diff --git a/src/context/pool_session_context.c b/src/context/pool_session_context.c
index ded41c7fc..a87cce164 100644
--- a/src/context/pool_session_context.c
+++ b/src/context/pool_session_context.c
@@ -125,6 +125,7 @@ pool_init_session_context(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backe
/* We don't have a write query in this transaction yet */
pool_unset_writing_transaction();
+ pool_unset_really_writing_transaction();
/* Error doesn't occur in this transaction yet */
pool_unset_failed_transaction();
@@ -731,6 +732,12 @@ pool_unset_writing_transaction(void)
}
}
+void
+pool_unset_really_writing_transaction(void)
+{
+ pool_get_session_context(false)->really_writing_transaction = false;
+}
+
/*
* We have a write query in this transaction.
*/
@@ -749,6 +756,12 @@ pool_set_writing_transaction(void)
}
}
+void
+pool_set_really_writing_transaction(void)
+{
+ pool_get_session_context(false)->really_writing_transaction = true;
+}
+
/*
* Do we have a write query in this transaction?
*/
@@ -758,6 +771,15 @@ pool_is_writing_transaction(void)
return pool_get_session_context(false)->writing_transaction;
}
+/*
+ * Do we really have a write query in this transaction?
+ */
+bool
+pool_is_really_writing_transaction(void)
+{
+ return pool_get_session_context(false)->really_writing_transaction;
+}
+
/*
* Error doesn't occur in this transaction yet.
*/
diff --git a/src/include/context/pool_session_context.h b/src/include/context/pool_session_context.h
index eba56982b..a5098e16a 100644
--- a/src/include/context/pool_session_context.h
+++ b/src/include/context/pool_session_context.h
@@ -209,9 +209,23 @@ typedef struct
/* If true, the command in progress has finished successfully. */
bool command_success;
- /* If true, write query has been appeared in this transaction */
+ /*
+ * If true, a write query has appeared in this transaction. Note that
+ * the flag may not be turned off even if a transaction is started or
+ * committed if disable_load_balance_on_write is other than "transaction".
+ * Also if disable_load_balance_on_write is "dml_adaptive", the flag is
+ * never turned on.
+ */
bool writing_transaction;
+ /*
+ * Unlike "writing_transaction", this flag is turned on whenever a writing
+ * query is issued in an explicit transaction, and is turned off when the
+ * transaction is closed. It is also turned off when a new transaction
+ * starts. This flag is referenced by the query cache.
+ */
+ bool really_writing_transaction;
+
/* If true, error occurred in this transaction */
bool failed_transaction;
@@ -384,8 +398,11 @@ extern void pool_set_sent_message_state(POOL_SENT_MESSAGE *message);
extern void pool_zap_query_context_in_sent_messages(POOL_QUERY_CONTEXT *query_context);
extern POOL_SENT_MESSAGE *pool_get_sent_message_by_query_context(POOL_QUERY_CONTEXT *query_context);
extern void pool_unset_writing_transaction(void);
+extern void pool_unset_really_writing_transaction(void);
extern void pool_set_writing_transaction(void);
+extern void pool_set_really_writing_transaction(void);
extern bool pool_is_writing_transaction(void);
+extern bool pool_is_really_writing_transaction(void);
extern void pool_unset_failed_transaction(void);
extern void pool_set_failed_transaction(void);
extern bool pool_is_failed_transaction(void);
diff --git a/src/protocol/CommandComplete.c b/src/protocol/CommandComplete.c
index a3b8f0ea1..1f63a0e8d 100644
--- a/src/protocol/CommandComplete.c
+++ b/src/protocol/CommandComplete.c
@@ -370,6 +370,7 @@ handle_query_context(POOL_CONNECTION_POOL *backend)
if (pool_config->disable_load_balance_on_write != DLBOW_TRANS_TRANSACTION)
pool_unset_writing_transaction();
+ pool_unset_really_writing_transaction();
pool_unset_failed_transaction();
pool_unset_transaction_isolation();
diff --git a/src/protocol/pool_process_query.c b/src/protocol/pool_process_query.c
index dacaa9d5a..fdc8d97e0 100644
--- a/src/protocol/pool_process_query.c
+++ b/src/protocol/pool_process_query.c
@@ -4187,6 +4187,7 @@ start_internal_transaction(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *back
/* Mark that we started new transaction */
INTERNAL_TRANSACTION_STARTED(backend, i) = true;
pool_unset_writing_transaction();
+ pool_unset_really_writing_transaction();
}
}
}
diff --git a/src/protocol/pool_proto_modules.c b/src/protocol/pool_proto_modules.c
index 65ed190ef..86fb5f8a8 100644
--- a/src/protocol/pool_proto_modules.c
+++ b/src/protocol/pool_proto_modules.c
@@ -270,7 +270,7 @@ SimpleQuery(POOL_CONNECTION *frontend,
* query cache.
*/
if (pool_config->memory_cache_enabled && is_likely_select &&
- !pool_is_writing_transaction() &&
+ !pool_is_really_writing_transaction() &&
TSTATE(backend, MAIN_REPLICA ? PRIMARY_NODE_ID : REAL_MAIN_NODE_ID) != 'E' &&
!query_cache_disabled())
{
@@ -1029,7 +1029,7 @@ Execute(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend,
* message has 0 row argument, we maybe able to use cache. If
* partial_fetch is true, cannot use cache.
*/
- if (pool_config->memory_cache_enabled && !pool_is_writing_transaction() &&
+ if (pool_config->memory_cache_enabled && !pool_is_really_writing_transaction() &&
(TSTATE(backend, MAIN_REPLICA ? PRIMARY_NODE_ID : REAL_MAIN_NODE_ID) != 'E')
&& pool_is_likely_select(query) && !query_cache_disabled() &&
(query_context->atEnd || num_rows == 0) &&
@@ -1276,6 +1276,8 @@ Execute(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend,
if (!pool_is_transaction_read_only(node))
{
pool_set_writing_transaction();
+ if (TSTATE(backend, MAIN_REPLICA ? PRIMARY_NODE_ID : REAL_MAIN_NODE_ID) == 'T')
+ pool_set_really_writing_transaction();
}
}
}
@@ -4745,7 +4747,7 @@ pool_at_command_success(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend
{
if (pool_config->disable_load_balance_on_write != DLBOW_TRANS_TRANSACTION)
pool_unset_writing_transaction();
-
+ pool_unset_really_writing_transaction();
pool_unset_failed_transaction();
pool_unset_transaction_isolation();
}
@@ -4759,7 +4761,7 @@ pool_at_command_success(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend
{
if (pool_config->disable_load_balance_on_write != DLBOW_TRANS_TRANSACTION)
pool_unset_writing_transaction();
-
+ pool_unset_really_writing_transaction();
pool_unset_failed_transaction();
pool_unset_transaction_isolation();
}
@@ -4804,6 +4806,13 @@ pool_at_command_success(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend
(errmsg("not SET TRANSACTION READ ONLY")));
pool_set_writing_transaction();
+
+ /*
+ * If we are in a transaction, we need to set
+ * really_writing_transaction so that the query cache is disabled.
+ */
+ if (TSTATE(backend, MAIN_REPLICA ? PRIMARY_NODE_ID : REAL_MAIN_NODE_ID) == 'T')
+ pool_set_really_writing_transaction();
}
}
--
2.54.0
[application/octet-stream] v5-0002-Feature-load-balancing-control-by-table-tracking.patch (90.1K, 4-v5-0002-Feature-load-balancing-control-by-table-tracking.patch)
download | inline diff:
From 8f6f731934925c795ab880abc64bf2796a511c89 Mon Sep 17 00:00:00 2001
From: Nadav Shatz <[email protected]>
Date: Sun, 19 Apr 2026 17:10:24 +0300
Subject: [PATCH v5 2/2] Feature: load balancing control by table tracking.
Prevent read-only queries from being routed to a standby while the
tables they use may still be affected by replication delay, based on the
delay collected by the streaming replication check process. To enable
this feature, set disable_load_balance_on_write to dml_adaptive_global.
In this mode, when tables are modified by
INSERT/UPDATE/DELETE/TRUNCATE/MERGE/data-modifying WITH, SELECTs using
those tables are not load balanced for a certain period: i.e. they are
routed to the primary PostgreSQL server to avoid stale data caused by
replication delay.
Unlike dml_adaptive mode, the table modifications described above are
detected even if they happen in other sessions (in dml_adaptive, table
modifications are only detected in the same transaction). Note,
however, that you cannot use dml_adaptive_object_relationship_list to
track dependencies between tables and other objects.
Besides dml_adaptive_global, there are some tuning knobs for the
feature:
- track_table_mutation_ttl_factor
Parameter to calculate TTL of each tracking data.
- track_table_mutation_max_staleness
Maximum duration in milliseconds that a single table entry can
continuously force queries to primary.
- track_table_mutation_cold_start_duration
Duration in milliseconds to route all queries to primary after a
child process starts.
- track_table_mutation_table_buckets
Number of hash buckets for the track table mutation hash table.
- track_table_mutation_table_size
Maximum number of tables that can be tracked simultaneously in the
track table mutation.
Author: Nadav Shatz <[email protected]>
Reviewed-by: Tatsuo Ishii <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/20260407.181009.1762204033074164841.ishii%40postgresql.org#58c139c1a7f8d5562865921d0733667b
---
doc/src/sgml/loadbalance.sgml | 288 ++++++
src/Makefile.am | 1 +
src/config/pool_config_variables.c | 65 ++
src/context/pool_query_context.c | 242 ++++-
src/context/pool_session_context.c | 15 +-
src/include/pool.h | 3 +-
src/include/pool_config.h | 24 +-
src/include/utils/pool_track_table_mutation.h | 167 ++++
src/main/pgpool_main.c | 29 +-
src/protocol/CommandComplete.c | 36 +
src/protocol/child.c | 8 +
src/protocol/pool_proto_modules.c | 6 +-
src/sample/pgpool.conf.sample-stream | 45 +
src/streaming_replication/pool_worker_child.c | 24 +
src/test/regression/libs.sh | 2 +
.../tests/043.track_table_mutation/test.sh | 354 +++++++
.../044.track_table_mutation_watchdog/test.sh | 184 ++++
src/tools/pgindent/typedefs.list | 4 +
src/utils/pool_track_table_mutation.c | 902 ++++++++++++++++++
19 files changed, 2374 insertions(+), 25 deletions(-)
create mode 100644 src/include/utils/pool_track_table_mutation.h
create mode 100755 src/test/regression/tests/043.track_table_mutation/test.sh
create mode 100755 src/test/regression/tests/044.track_table_mutation_watchdog/test.sh
create mode 100644 src/utils/pool_track_table_mutation.c
diff --git a/doc/src/sgml/loadbalance.sgml b/doc/src/sgml/loadbalance.sgml
index 9e1e7b39b..d4fbcf1a5 100644
--- a/doc/src/sgml/loadbalance.sgml
+++ b/doc/src/sgml/loadbalance.sgml
@@ -1110,6 +1110,18 @@ app_name_redirect_preference_list > database_redirect_preference_list > us
Dependent functions, triggers, and views on the tables can be configured
using <xref linkend="guc-dml-adaptive-object-relationship-list">
</para>
+
+ <para>
+ If this parameter is set to <varname>dml_adaptive_global</varname>,
+ <productname>Pgpool-II</> behaves like <varname>dml_adaptive</varname>
+ (per-transaction write tracking) and additionally uses shared memory to track
+ recently written tables across all sessions cluster-wide. When a table is
+ written in any session, subsequent reads of that table from any session are
+ routed to primary until a TTL (based on measured replication delay) expires.
+ This prevents stale reads after writes even across different connections.
+ See <xref linkend="runtime-config-table-mutation-map"> for the sub-parameters
+ that control the shared-memory tracking behavior.
+ </para>
</listitem>
</varlistentry>
@@ -1195,4 +1207,280 @@ dml_adaptive_object_relationship_list = 'table_1:table_2'
</variablelist>
</sect2>
+
+ <sect2 id="runtime-config-table-mutation-map">
+ <title>Table Mutation Tracking Configuration</title>
+
+ <para>
+ These parameters configure the track table mutation feature, which is activated by setting
+ <xref linkend="guc-disable-load-balance-on-write"> to <literal>dml_adaptive_global</literal>.
+ The feature tracks recently written tables to prevent stale reads from replica nodes during
+ replication lag.
+ </para>
+
+ <para>
+ When a table is modified (INSERT/UPDATE/DELETE), it is marked as "stale" for a TTL period
+ (<literal>replication_delay * track_table_mutation_ttl_factor</literal>). Any SELECT queries on stale tables are routed
+ to the primary node instead of replicas, ensuring read-after-write consistency.
+ </para>
+
+ <para>
+ This feature requires time-based replication delay monitoring. This can be provided by either
+ <xref linkend="guc-replication-delay-source-cmd"> (external command mode) or by setting
+ <xref linkend="guc-delay-threshold-by-time"> (which uses <literal>pg_stat_replication.replay_lag</literal>
+ from PostgreSQL 10+). If neither is configured, the TTL remains at its default minimum value
+ (100 milliseconds) and is never updated based on actual replication delay, which may result
+ in suboptimal routing decisions.
+ </para>
+
+ <warning>
+ <para>
+ Enabling <literal>dml_adaptive_global</literal> increases shared memory consumption. With default settings,
+ the feature requires approximately 80 KB of shared memory for table tracking:
+ <literal>track_table_mutation_table_size * 40 bytes</literal> (default: 2048 * 40 = ~80 KB).
+ </para>
+ </warning>
+
+ <variablelist>
+
+ <varlistentry id="guc-track-table-mutation-ttl-factor" xreflabel="track_table_mutation_ttl_factor">
+ <term><varname>track_table_mutation_ttl_factor</varname> (<type>floating point</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_ttl_factor</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Multiplier for calculating the TTL: <literal>TTL = replication_delay * track_table_mutation_ttl_factor</literal>.
+ Higher values provide more safety margin but may reduce read replica utilization.
+ </para>
+ <para>
+ Valid range: 1.0-100.0. Default is <literal>5.0</literal>.
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-max-staleness" xreflabel="track_table_mutation_max_staleness">
+ <term><varname>track_table_mutation_max_staleness</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_max_staleness</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Maximum duration in milliseconds that a single table entry can continuously force queries to primary,
+ measured from when the table was first marked stale. When this cap is reached, the entry is expired
+ regardless of recent writes. If the table is written to again after expiry, a fresh tracking entry
+ is created.
+ </para>
+ <para>
+ This parameter bounds the cross-session impact of table mutation tracking. Even if a table is written
+ to in a tight loop, its effect on other sessions' load balancing is limited to this duration. For
+ legitimately busy tables, the gap between forced expiry and the next write re-marking the table is
+ negligible (typically milliseconds).
+ </para>
+ <para>
+ Set to 0 to disable the cap (not recommended for production).
+ Valid range: 0-3600000 ms. Default is <literal>60000</literal> (60 seconds).
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-cold-start-duration" xreflabel="track_table_mutation_cold_start_duration">
+ <term><varname>track_table_mutation_cold_start_duration</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_cold_start_duration</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Duration in milliseconds to route all queries to primary after a child process starts.
+ This prevents stale reads when a new connection is established before the table mutation
+ map is populated with recent write history.
+ </para>
+ <para>
+ When watchdog is enabled and the local node becomes the leader, Pgpool-II also triggers a
+ global cold start for this duration to avoid stale reads after leadership changes.
+ </para>
+ <para>
+ Valid range: 0-60000 ms. Default is <literal>2000</literal> (2 seconds).
+ Set to 0 to disable cold start behavior.
+ This parameter can be changed by reloading the <productname>Pgpool-II</> configurations.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-table-buckets" xreflabel="track_table_mutation_table_buckets">
+ <term><varname>track_table_mutation_table_buckets</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_table_buckets</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Number of hash buckets in the table mutation hash table.
+ Higher values reduce hash collisions and improve lookup performance.
+ </para>
+ <para>
+ Valid range: 64-65536. Default is <literal>1024</literal>.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="guc-track-table-mutation-table-size" xreflabel="track_table_mutation_table_size">
+ <term><varname>track_table_mutation_table_size</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>track_table_mutation_table_size</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Maximum number of tables that can be tracked simultaneously in the table mutation map.
+ When the map is full, the oldest entries are evicted.
+ </para>
+ <para>
+ Valid range: 128-131072. Default is <literal>2048</literal>.
+ Memory usage: approximately 40 bytes per entry.
+ This parameter can only be set at server start.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ <sect3 id="runtime-config-track-table-mutation-example">
+ <title>Table Mutation Tracking Configuration Example</title>
+ <para>
+ To enable table mutation tracking with replication delay monitoring:
+ </para>
+ <programlisting>
+# Enable dml_adaptive_global mode (includes track table mutation)
+disable_load_balance_on_write = 'dml_adaptive_global'
+track_table_mutation_ttl_factor = 5.0
+track_table_mutation_max_staleness = 60000
+track_table_mutation_cold_start_duration = 2000
+
+# Option A: Use external command for replication delay
+replication_delay_source_cmd = '/path/to/get-replication-delay.sh'
+replication_delay_source_timeout = 10
+
+# Option B: Use pg_stat_replication replay_lag (PG 10+)
+# delay_threshold_by_time = 1000
+
+# Adjust table map size based on workload
+track_table_mutation_table_size = 4096
+ </programlisting>
+ <para>
+ Shared memory required for above configuration: approximately 160 KB for the table map.
+ Default configuration (2048 tables) requires approximately 80 KB.
+ </para>
+ </sect3>
+
+ <sect3 id="runtime-config-track-table-mutation-limitations">
+ <title>Limitations</title>
+ <para>
+ The table mutation tracking feature has the following limitations:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>PREPARE</literal> statements are not tracked. When a prepared statement
+ containing data modification is executed, the table mutation is not recorded.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ A maximum of 128 tables can be tracked per SELECT query for staleness checking.
+ This limit is shared with the query cache subsystem
+ (<literal>POOL_MAX_SELECT_OIDS</literal>).
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ If your application uses prepared statements and requires read-after-write consistency,
+ consider using explicit transaction routing or the <literal>/*NO LOAD BALANCE*/</literal>
+ comment directive for affected queries.
+ </para>
+ <para>
+ The following statement types <emphasis>are</emphasis> tracked and will mark tables as stale:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>INSERT</literal>, <literal>UPDATE</literal>, <literal>DELETE</literal>
+ statements (including those with <literal>RETURNING</literal> clauses).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>TRUNCATE</literal> statements (including multiple tables).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>MERGE</literal> statements (PostgreSQL 15+).
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <literal>WITH</literal> clauses containing data modifications (Common Table Expressions
+ with <literal>INSERT</literal>, <literal>UPDATE</literal>, or <literal>DELETE</literal>).
+ For example, <literal>WITH deleted AS (DELETE FROM t1 RETURNING *) SELECT * FROM deleted</literal>
+ will properly mark table <literal>t1</literal> as stale.
+ </para>
+ </listitem>
+ </itemizedlist>
+ <para>
+ <emphasis>Transaction Rollback Behavior:</emphasis> Within explicit transactions, tables
+ are marked as stale in shared memory only when the transaction commits. If the
+ transaction is rolled back, no tables are marked, because no change became visible
+ that a replica could lag behind. This prevents rolled-back transactions from unnecessarily
+ disabling load balancing. For autocommit statements (outside explicit transactions),
+ tables are marked immediately upon command completion.
+ </para>
+
+ <para>
+ <emphasis>Cross-Session Impact and Safety Bounds:</emphasis>
+ Unlike <literal>dml_adaptive</literal> (which only affects the session that issued the write),
+ <literal>dml_adaptive_global</literal> affects all sessions reading the same table in the same database.
+ The following safety mechanisms bound this cross-session impact:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <emphasis>Maximum staleness cap:</emphasis> The <xref linkend="guc-track-table-mutation-max-staleness">
+ parameter (default: 60 seconds) limits how long any single table entry can continuously force primary
+ routing. Even under sustained writes, the entry expires after this period and is only renewed by
+ subsequent committed writes.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <emphasis>Database isolation:</emphasis> Table staleness tracking is scoped by database OID. Writes
+ in one database never affect load balancing decisions for sessions connected to a different database.
+ In multi-tenant deployments where tenants use separate databases, one tenant's write activity cannot
+ influence another tenant's query routing.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <emphasis>Committed writes only:</emphasis> Only committed transactions mark tables as stale.
+ Rolled-back transactions have no effect on the shared tracking state.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <emphasis>Bounded table map size:</emphasis> The shared memory table map has a fixed maximum size
+ (<xref linkend="guc-track-table-mutation-table-size">). At most this many tables can be marked stale
+ simultaneously, providing a natural ceiling on the feature's impact.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </sect3>
+
+ </sect2>
+
</sect1>
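
Reviewer note (not part of the patch): the TTL and staleness rules described in the documentation hunk above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not code from the patch. All sketch_* names are hypothetical, timestamps are plain microsecond counters rather than struct timeval, and the 100 ms floor is an assumption based on the "default minimum value" wording in the docs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors TRACK_TABLE_MUTATION_DEFAULT_TTL_US from the patch header (100 ms). */
#define SKETCH_DEFAULT_TTL_US (100 * 1000)

/*
 * Hypothetical helper: TTL = replication_delay * ttl_factor.
 * Assumption: the result never drops below the 100 ms default.
 */
static uint64_t
sketch_compute_ttl_us(uint64_t delay_us, double ttl_factor)
{
	uint64_t	ttl;

	/* documented valid range of track_table_mutation_ttl_factor */
	assert(ttl_factor >= 1.0 && ttl_factor <= 100.0);

	ttl = (uint64_t) ((double) delay_us * ttl_factor);
	return ttl < SKETCH_DEFAULT_TTL_US ? SKETCH_DEFAULT_TTL_US : ttl;
}

/*
 * Hypothetical staleness check: an entry is stale while the last write is
 * younger than the TTL, unless the max-staleness cap (measured from the
 * first write) has been reached. A cap of 0 disables the cap.
 */
static bool
sketch_table_is_stale(uint64_t now_us,
					  uint64_t first_write_us,
					  uint64_t last_write_us,
					  uint64_t ttl_us,
					  uint64_t max_staleness_ms)
{
	assert(now_us >= last_write_us && last_write_us >= first_write_us);

	if (now_us - last_write_us >= ttl_us)
		return false;			/* TTL elapsed since the last write */

	if (max_staleness_ms > 0 &&
		now_us - first_write_us >= max_staleness_ms * 1000)
		return false;			/* cap reached despite recent writes */

	return true;
}
```

With a measured delay of 200 ms and the default factor of 5.0, the sketch yields a TTL of one second. A table written continuously for longer than track_table_mutation_max_staleness stops forcing primary routing until the next write creates a fresh entry.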
diff --git a/src/Makefile.am b/src/Makefile.am
index 4678ab530..39588af58 100644
--- a/src/Makefile.am
+++ b/src/Makefile.am
@@ -35,6 +35,7 @@ pgpool_SOURCES = main/main.c \
rewrite/pool_timestamp.c \
rewrite/pool_lobj.c \
utils/pool_select_walker.c \
+ utils/pool_track_table_mutation.c \
utils/strlcpy.c \
utils/psprintf.c \
utils/pool_params.c \
diff --git a/src/config/pool_config_variables.c b/src/config/pool_config_variables.c
index f4c73b2aa..587e01cb2 100644
--- a/src/config/pool_config_variables.c
+++ b/src/config/pool_config_variables.c
@@ -290,6 +290,7 @@ static const struct config_enum_entry disable_load_balance_on_write_options[] =
{"trans_transaction", DLBOW_TRANS_TRANSACTION, false},
{"always", DLBOW_ALWAYS, false},
{"dml_adaptive", DLBOW_DML_ADAPTIVE, false},
+ {"dml_adaptive_global", DLBOW_DML_ADAPTIVE_GLOBAL, false},
{NULL, 0, false}
};
@@ -1777,6 +1778,19 @@ static struct config_int_array ConfigureNamesIntArray[] =
static struct config_double ConfigureNamesDouble[] =
{
+ {
+ {"track_table_mutation_ttl_factor",
+ CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "TTL multiplier for track table mutation "
+ "(TTL = replication_delay * factor)",
+ CONFIG_VAR_TYPE_DOUBLE, false, 0
+ },
+ &g_pool_config.track_table_mutation_ttl_factor,
+ 5.0, /* boot value: 5x replication delay */
+ 1.0, 100.0, /* min, max */
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
EMPTY_CONFIG_DOUBLE
};
@@ -2397,6 +2411,57 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"track_table_mutation_max_staleness",
+ CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "Maximum duration in milliseconds that a "
+ "table can be marked stale from its first "
+ "write. 0 disables the cap.",
+ CONFIG_VAR_TYPE_INT, false, GUC_UNIT_MS
+ },
+ &g_pool_config.track_table_mutation_max_staleness,
+ 60000, /* 60 seconds */
+ 0, 3600000, /* 0 to 1 hour */
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_cold_start_duration",
+ CFGCXT_RELOAD, LOAD_BALANCE_CONFIG,
+ "Duration in milliseconds to force queries "
+ "to primary after child process starts.",
+ CONFIG_VAR_TYPE_INT, false, GUC_UNIT_MS
+ },
+ &g_pool_config.track_table_mutation_cold_start_duration,
+ 2000, /* 2 seconds */
+ 0, 60000, /* 0 to 60 seconds */
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_table_buckets",
+ CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Number of hash buckets for track table mutation.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_table_buckets,
+ 1024,
+ 64, 65536,
+ NULL, NULL, NULL
+ },
+
+ {
+ {"track_table_mutation_table_size",
+ CFGCXT_INIT, LOAD_BALANCE_CONFIG,
+ "Maximum number of entries in track table mutation.",
+ CONFIG_VAR_TYPE_INT, false, 0
+ },
+ &g_pool_config.track_table_mutation_table_size,
+ 2048,
+ 128, 131072,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
EMPTY_CONFIG_INT
};
diff --git a/src/context/pool_query_context.c b/src/context/pool_query_context.c
index fbadd2088..bc0256c0b 100644
--- a/src/context/pool_query_context.c
+++ b/src/context/pool_query_context.c
@@ -29,6 +29,7 @@
#include "utils/statistics.h"
#include "utils/pool_select_walker.h"
#include "utils/pool_stream.h"
+#include "utils/pool_track_table_mutation.h"
#include "context/pool_session_context.h"
#include "context/pool_query_context.h"
#include "parser/nodes.h"
@@ -1828,20 +1829,26 @@ is_in_list(char *name, List *list)
static bool
is_select_object_in_temp_write_list(Node *node, void *context)
{
- if (node == NULL || pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE)
+ if (node == NULL ||
+ !DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write))
return false;
if (IsA(node, RangeVar))
{
RangeVar *rgv = (RangeVar *) node;
- POOL_SESSION_CONTEXT *session_context = pool_get_session_context(false);
+ POOL_SESSION_CONTEXT *session_context;
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && session_context->is_in_transaction)
+ session_context = pool_get_session_context(false);
+
+ if (session_context->is_in_transaction)
{
ereport(DEBUG1,
- (errmsg("is_select_object_in_temp_write_list: \"%s\", found relation \"%s\"", (char *) context, rgv->relname)));
+ (errmsg("is_select_object_in_temp_write_list:"
+ " \"%s\", found relation \"%s\"",
+ (char *) context, rgv->relname)));
- return is_in_list(rgv->relname, session_context->transaction_temp_write_list);
+ return is_in_list(rgv->relname,
+ session_context->transaction_temp_write_list);
}
}
@@ -1880,15 +1887,22 @@ static char *get_associated_object_from_dml_adaptive_relations
void
check_object_relationship_list(char *name, bool is_func_name)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && pool_config->parsed_dml_adaptive_object_relationship_list)
+ bool is_adaptive;
+
+ is_adaptive =
+ (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE);
+
+ if (is_adaptive &&
+ pool_config->parsed_dml_adaptive_object_relationship_list)
{
POOL_SESSION_CONTEXT *session_context = pool_get_session_context(false);
if (session_context->is_in_transaction)
{
char *right_token =
- get_associated_object_from_dml_adaptive_relations
- (name, is_func_name ? OBJECT_TYPE_FUNCTION : OBJECT_TYPE_RELATION);
+ get_associated_object_from_dml_adaptive_relations
+ (name, is_func_name ? OBJECT_TYPE_FUNCTION : OBJECT_TYPE_RELATION);
if (right_token)
{
@@ -1947,7 +1961,7 @@ add_object_into_temp_write_list(Node *node, void *context)
static void
dml_adaptive(Node *node, char *query)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE)
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write))
{
/* Set/Unset transaction status flags */
if (IsA(node, TransactionStmt))
@@ -1966,6 +1980,45 @@ dml_adaptive(Node *node, char *query)
}
else if (is_commit_or_rollback_query(node))
{
+ /*
+ * For dml_adaptive_global: on COMMIT, flush the accumulated
+ * table writes to shared memory. On ROLLBACK, skip -- the
+ * writes never committed so no stale-read risk exists. This
+ * prevents polluting the table map with rolled-back
+ * transactions.
+ */
+ int dlbow =
+ pool_config->disable_load_balance_on_write;
+ List *wlist =
+ session_context->transaction_temp_write_list;
+
+ if (dlbow == DLBOW_DML_ADAPTIVE_GLOBAL &&
+ is_commit_query(node) &&
+ wlist != NIL)
+ {
+ ListCell *cell;
+ int dboid;
+
+ dboid =
+ pool_track_table_mutation_get_database_oid();
+ if (dboid > 0)
+ {
+ foreach(cell, wlist)
+ {
+ char *tname;
+ int toid;
+
+ tname = (char *) lfirst(cell);
+ toid =
+ pool_table_name_to_oid(tname);
+
+ if (toid > 0)
+ pool_track_table_mutation_mark_table_written(
+ toid, dboid);
+ }
+ }
+ }
+
session_context->is_in_transaction = false;
if (session_context->transaction_temp_write_list != NIL)
@@ -2008,7 +2061,7 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
session_context = pool_get_session_context(false);
backend = session_context->backend;
- /*
+ /*
* Collect/discard information for disable_load_balance_on_write =
* dml_adaptive case.
*/
@@ -2022,6 +2075,20 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
if (dest == POOL_PRIMARY)
{
pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
+
+ /*
+ * Resolve table and database OIDs now to populate relcache. This
+ * avoids potential hangs in CommandComplete where we shouldn't be
+ * running new queries against the backend.
+ */
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ int *oids;
+
+ pool_extract_table_oids(node, &oids);
+ pool_track_table_mutation_get_database_oid();
+ }
}
/* Should be sent to both primary and standby? */
else if (dest == POOL_BOTH)
@@ -2149,6 +2216,153 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
{
pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
}
+
+ /*
+ * Check track table mutation for recently written tables. If
+ * in cold start or any table was recently written, route to
+ * primary to avoid stale reads.
+ */
+ else if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ bool force_primary = false;
+ int lb_node;
+ POOL_QUERY_CONTEXT *qctx =
+ session_context->query_context;
+
+ if (pool_track_table_mutation_in_cold_start())
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load balance"
+ " because of track table"
+ " mutation cold start"),
+ errdetail("destination = PRIMARY"
+ " for query= \"%s\"",
+ query)));
+ force_primary = true;
+ }
+ else
+ {
+ SelectContext ctx;
+ int dboid;
+ int num_oids;
+ int i;
+
+ memset(&ctx, 0, sizeof(ctx));
+ num_oids =
+ pool_extract_table_oids_from_select_stmt(
+ node, &ctx);
+ if (num_oids > 0)
+ {
+ dboid =
+ pool_track_table_mutation_get_database_oid();
+
+ if (dboid <= 0)
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load"
+ " balance because"
+ " database oid was"
+ " unavailable"),
+ errdetail("destination"
+ " = PRIMARY for"
+ " query= \"%s\"",
+ query)));
+ force_primary = true;
+ }
+ else
+ {
+ for (i = 0; i < num_oids; i++)
+ {
+ bool stale;
+
+ stale =
+ pool_track_table_mutation_table_is_stale(
+ ctx.table_oids[i],
+ dboid);
+ if (stale)
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load"
+ " balance because"
+ " table \"%s\" was"
+ " recently written",
+ ctx.table_names[i]),
+ errdetail("destination"
+ " = PRIMARY for"
+ " query= \"%s\"",
+ query)));
+ force_primary = true;
+ break;
+ }
+ }
+ }
+ }
+ }
+
+ if (force_primary)
+ {
+ pool_set_node_to_be_sent(
+ query_context,
+ PRIMARY_NODE_ID);
+ }
+ else
+ {
+ if (pool_config->statement_level_load_balance)
+ {
+ session_context->load_balance_node_id =
+ select_load_balancing_node();
+ }
+
+ /*
+ * If replication delay is too much, and
+ * prefer_lower_delay_standby is true then elect the
+ * lowest-delayed node, otherwise send to primary.
+ */
+ lb_node =
+ session_context->load_balance_node_id;
+ if (STREAM &&
+ check_replication_delay(lb_node))
+ {
+ ereport(DEBUG1,
+ (errmsg("could not load"
+ " balance because of"
+ " too much replication"
+ " delay"),
+ errdetail("destination"
+ " = %d for"
+ " query= \"%s\"",
+ dest, query)));
+
+ if (pool_config->prefer_lower_delay_standby)
+ {
+ lb_node =
+ select_load_balancing_node();
+ session_context->load_balance_node_id =
+ lb_node;
+ qctx->load_balance_node_id =
+ lb_node;
+ pool_set_node_to_be_sent(
+ query_context,
+ lb_node);
+ }
+ else
+ {
+ pool_set_node_to_be_sent(
+ query_context,
+ PRIMARY_NODE_ID);
+ }
+ }
+ else
+ {
+ qctx->load_balance_node_id =
+ session_context->load_balance_node_id;
+ pool_set_node_to_be_sent(
+ query_context,
+ qctx->load_balance_node_id);
+ }
+ }
+ }
else
{
if (pool_config->statement_level_load_balance)
@@ -2169,7 +2383,8 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
errdetail("destination = %d for query= \"%s\"", dest, query)));
/*
- * If prefer_lower_delay_standby is on, choose lower delay standby.
+ * If prefer_lower_delay_standby is on, choose lower
+ * delay standby.
*/
if (pool_config->prefer_lower_delay_standby)
{
@@ -2179,7 +2394,8 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
session_context->query_context->load_balance_node_id = session_context->load_balance_node_id;
pool_set_node_to_be_sent(query_context, session_context->query_context->load_balance_node_id);
}
- else /* delay is too much. prefer to send to primary */
+ else /* delay is too much. prefer to send to
+ * primary */
{
pool_set_node_to_be_sent(query_context, PRIMARY_NODE_ID);
}
@@ -2189,7 +2405,7 @@ where_to_send_main_replica(POOL_QUERY_CONTEXT *query_context, char *query, Node
* Not streaming replication mode, or delay_threshold is 0
* or replication delay is small enough.
*/
- else
+ else
{
session_context->query_context->load_balance_node_id = session_context->load_balance_node_id;
pool_set_node_to_be_sent(query_context,
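
Reviewer note (not part of the patch): the dml_adaptive_global branch added to where_to_send_main_replica() above boils down to a short decision chain. The sketch below restates it in isolation. The sketch_* names, the SketchDest enum, and the callback type are hypothetical stand-ins for the pgpool internals, and the subsequent replication-delay check on the chosen load balance node is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum
{
	SKETCH_DEST_PRIMARY,
	SKETCH_DEST_LOAD_BALANCE
} SketchDest;

/* Hypothetical stand-in for pool_track_table_mutation_table_is_stale(). */
typedef bool (*sketch_stale_fn) (int table_oid, int dboid);

/* Example predicate for illustration only: pretends that oid 42 is stale. */
static bool
sketch_oid_42_is_stale(int table_oid, int dboid)
{
	(void) dboid;
	return table_oid == 42;
}

/*
 * Decision order as in the patch: cold start first, then an unavailable
 * database oid, then the per-table staleness lookup.
 */
static SketchDest
sketch_route_select(bool in_cold_start, int dboid,
					const int *table_oids, int num_oids,
					sketch_stale_fn is_stale)
{
	int			i;

	assert(is_stale != NULL);

	if (in_cold_start)
		return SKETCH_DEST_PRIMARY;

	if (num_oids <= 0)
		return SKETCH_DEST_LOAD_BALANCE;	/* no tables to check */

	if (dboid <= 0)
		return SKETCH_DEST_PRIMARY; /* cannot scope the lookup, play safe */

	for (i = 0; i < num_oids; i++)
	{
		if (is_stale(table_oids[i], dboid))
			return SKETCH_DEST_PRIMARY;
	}

	return SKETCH_DEST_LOAD_BALANCE;
}
```

One design point this makes visible: a SELECT that references no tables skips the database oid lookup entirely and stays eligible for load balancing, which matches the num_oids > 0 guard in the hunk.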
diff --git a/src/context/pool_session_context.c b/src/context/pool_session_context.c
index a87cce164..4c411ff51 100644
--- a/src/context/pool_session_context.c
+++ b/src/context/pool_session_context.c
@@ -533,7 +533,7 @@ dump_sent_message(char *caller, POOL_SENT_MESSAGE *m)
static void
dml_adaptive_init(void)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE)
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write))
{
session_context->is_in_transaction = false;
session_context->transaction_temp_write_list = NIL;
@@ -543,7 +543,9 @@ dml_adaptive_init(void)
static void
dml_adaptive_destroy(void)
{
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && session_context)
+ if (DLBOW_IS_DML_ADAPTIVE(
+ pool_config->disable_load_balance_on_write) &&
+ session_context)
{
if (session_context->transaction_temp_write_list != NIL)
list_free_deep(session_context->transaction_temp_write_list);
@@ -745,10 +747,13 @@ void
pool_set_writing_transaction(void)
{
/*
- * If disable_transaction_on_write is 'off' or 'dml_adaptive', then never
- * turn on writing transaction flag.
+ * If disable_load_balance_on_write is 'off' or 'dml_adaptive' or
+ * 'dml_adaptive_global', then never turn on writing transaction flag.
*/
- if (pool_config->disable_load_balance_on_write != DLBOW_OFF && pool_config->disable_load_balance_on_write != DLBOW_DML_ADAPTIVE)
+ if (pool_config->disable_load_balance_on_write !=
+ DLBOW_OFF &&
+ !DLBOW_IS_DML_ADAPTIVE(
+ pool_config->disable_load_balance_on_write))
{
pool_get_session_context(false)->writing_transaction = true;
ereport(DEBUG5,
diff --git a/src/include/pool.h b/src/include/pool.h
index fea5744f3..549aed30f 100644
--- a/src/include/pool.h
+++ b/src/include/pool.h
@@ -424,7 +424,7 @@ typedef enum
#define Min(x, y) ((x) < (y) ? (x) : (y))
-#define MAX_NUM_SEMAPHORES 8
+#define MAX_NUM_SEMAPHORES 9
#define CONN_COUNTER_SEM 0
#define REQUEST_INFO_SEM 1
#define QUERY_CACHE_STATS_SEM 2
@@ -434,6 +434,7 @@ typedef enum
#define FOLLOW_PRIMARY_SEM 6
#define MAIN_EXIT_HANDLER_SEM 7 /* used in exit_hander in pgpool main
* process */
+#define TRACK_TABLE_MUTATION_TABLE_SEM 8
#define MAX_REQUEST_QUEUE_SIZE 10
#define MAX_SEC_WAIT_FOR_CLUSTER_TRANSACTION 10 /* time in seconds to keep
diff --git a/src/include/pool_config.h b/src/include/pool_config.h
index 9a397d166..b8abadd50 100644
--- a/src/include/pool_config.h
+++ b/src/include/pool_config.h
@@ -105,9 +105,13 @@ typedef enum DLBOW_OPTION
DLBOW_TRANSACTION,
DLBOW_TRANS_TRANSACTION,
DLBOW_ALWAYS,
- DLBOW_DML_ADAPTIVE
+ DLBOW_DML_ADAPTIVE,
+ DLBOW_DML_ADAPTIVE_GLOBAL
} DLBOW_OPTION;
+#define DLBOW_IS_DML_ADAPTIVE(opt) \
+ ((opt) == DLBOW_DML_ADAPTIVE || (opt) == DLBOW_DML_ADAPTIVE_GLOBAL)
+
typedef enum RELQTARGET_OPTION
{
RELQTARGET_PRIMARY = 1,
@@ -363,8 +367,22 @@ typedef struct
char *sr_check_password; /* password for sr_check_user */
char *sr_check_database; /* PostgreSQL database name for streaming
* replication check */
- char *replication_delay_source_cmd; /* external command for replication delay */
- int replication_delay_source_timeout; /* timeout for external command in seconds */
+ char *replication_delay_source_cmd; /* external command for
+ * replication delay */
+ int replication_delay_source_timeout; /* timeout for external
+ * command in seconds */
+
+ /* Track table mutation configuration */
+ double track_table_mutation_ttl_factor; /* TTL multiplier for
+ * replication delay */
+ int track_table_mutation_max_staleness; /* max staleness duration
+ * ms */
+ int track_table_mutation_cold_start_duration; /* cold start duration
+ * ms */
+ int track_table_mutation_table_buckets; /* hash buckets for table
+ * map */
+ int track_table_mutation_table_size; /* max table map entries */
+
char *failover_command; /* execute command when failover happens */
char *follow_primary_command; /* execute command when failover is
* ended */
diff --git a/src/include/utils/pool_track_table_mutation.h b/src/include/utils/pool_track_table_mutation.h
new file mode 100644
index 000000000..dfbac666d
--- /dev/null
+++ b/src/include/utils/pool_track_table_mutation.h
@@ -0,0 +1,167 @@
+/* -*-pgsql-c-*- */
+/*
+ * pgpool: a language independent connection pool server for PostgreSQL
+ * written by Tatsuo Ishii
+ *
+ * Copyright (c) 2003-2026 PgPool Global Development Group
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose and without fee is hereby
+ * granted, provided that the above copyright notice appear in all
+ * copies and that both that copyright notice and this permission
+ * notice appear in supporting documentation, and that the name of the
+ * author not be used in advertising or publicity pertaining to
+ * distribution of the software without specific, written prior
+ * permission. The author makes no representations about the
+ * suitability of this software for any purpose. It is provided "as
+ * is" without express or implied warranty.
+ *
+ * pool_track_table_mutation.h: In-memory tracking of
+ * recently written tables to prevent stale reads.
+ */
+
+#ifndef POOL_TRACK_TABLE_MUTATION_H
+#define POOL_TRACK_TABLE_MUTATION_H
+
+#include "pool.h"
+#include <sys/time.h>
+
+/*
+ * Invalid index marker for linked lists
+ */
+#define TRACK_TABLE_MUTATION_INVALID_INDEX (-1)
+
+/*
+ * Default TTL in microseconds (100ms) used when replication delay is unknown
+ */
+#define TRACK_TABLE_MUTATION_DEFAULT_TTL_US (100 * 1000)
+
+/*
+ * Entry in the table mutation hash table (keyed by table/database oids)
+ */
+typedef struct TrackTableMutationEntry
+{
+ int table_oid; /* Table oid */
+ int dboid; /* Database oid */
+ struct timeval first_write_time; /* When the entry was first created */
+ struct timeval last_write_time; /* When the table was last written */
+ uint32 hash; /* Pre-computed hash value */
+ int next; /* Next in collision chain */
+ bool in_use; /* Is this entry in use? */
+} TrackTableMutationEntry;
+
+/*
+ * Header for the table mutation hash table in shared memory
+ */
+typedef struct TrackTableMutationHashTable
+{
+ int num_buckets; /* Number of hash buckets */
+ int max_entries; /* Maximum entries allowed */
+ int num_entries; /* Current number of entries */
+ int free_list_head; /* Head of free entry list */
+
+ /*
+ * Flexible array members follow in shared memory: int
+ * buckets[num_buckets]; TrackTableMutationEntry entries[max_entries];
+ */
+} TrackTableMutationHashTable;
+
+/*
+ * Global state for track table mutation feature
+ */
+typedef struct TrackTableMutationState
+{
+ bool initialized; /* Shmem initialized? */
+ uint64 current_ttl_us; /* Current TTL in microseconds */
+ struct timeval ttl_last_updated; /* When TTL was last updated */
+ struct timeval last_cleanup_time; /* When last expired cleanup ran */
+ struct timeval global_cold_start_until; /* Global cold start end time */
+ uint32 stats_queries_checked; /* Queries checked */
+ uint32 stats_forced_primary; /* Forced to primary */
+ uint32 stats_allowed_replica; /* Allowed to replica */
+} TrackTableMutationState;
+
+/*
+ * Main shared memory structure containing all components
+ */
+typedef struct TrackTableMutationShmem
+{
+ TrackTableMutationState state;
+ TrackTableMutationHashTable *table_map;
+} TrackTableMutationShmem;
+
+/* ----------------
+ * Public API functions
+ * ----------------
+ */
+
+/*
+ * Initialize shared memory structures for track table mutation.
+ * Called from pgpool_main.c after pool_init_pool_info().
+ */
+extern void pool_track_table_mutation_init(void);
+
+/*
+ * Initialize per-child process state for track table mutation.
+ * Called from child.c when a new child process starts.
+ * Sets up cold start tracking.
+ */
+extern void pool_track_table_mutation_child_init(void);
+
+/*
+ * Check if the child process is in cold start period.
+ * During cold start, all queries are routed to primary.
+ * Returns true if in cold start, false otherwise.
+ */
+extern bool pool_track_table_mutation_in_cold_start(void);
+
+/*
+ * Trigger a global cold start period for all processes.
+ * Used after watchdog leader change to avoid stale reads.
+ */
+extern void pool_track_table_mutation_trigger_global_cold_start(void);
+
+/*
+ * Get oid of current database.
+ */
+extern int pool_track_table_mutation_get_database_oid(void);
+
+/*
+ * Check if a table was recently written to (is "stale").
+ * If stale, reads from this table should go to primary.
+ * Returns true if table is stale (recently written), false otherwise.
+ */
+extern bool pool_track_table_mutation_table_is_stale(
+ int table_oid, int dboid);
+
+/*
+ * Mark tables as recently written.
+ * Called after INSERT/UPDATE/DELETE queries complete.
+ * table_oids: array of table oids
+ * num_tables: number of tables in array
+ * dboid: database oid
+ */
+extern void pool_track_table_mutation_mark_tables_written(
+ const int *table_oids, int num_tables, int dboid);
+
+/*
+ * Convenience function to mark a single table as written.
+ * table_oid: table oid
+ * dboid: database oid
+ */
+extern void pool_track_table_mutation_mark_table_written(
+ int table_oid, int dboid);
+
+/*
+ * Update the TTL based on current replication delay.
+ * Called from pool_worker_child.c when replication delay is updated.
+ * delay_us: replication delay in microseconds
+ */
+extern void pool_track_table_mutation_update_ttl(uint64 delay_us);
+
+/*
+ * Calculate required shared memory size for track table mutation.
+ */
+extern Size pool_track_table_mutation_shmem_size(void);
+
+#endif /* POOL_TRACK_TABLE_MUTATION_H */
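
Reviewer note (not part of the patch): the documentation quotes roughly 40 bytes per table entry. The sketch below copies the TrackTableMutationEntry layout from the header above into a local struct so the per-entry size can be checked with sizeof on a given platform. SketchEntry and sketch_table_map_bytes are hypothetical names, and the estimate ignores any MAXALIGN padding between the arrays. On platforms where struct timeval is 16 bytes (common on 64-bit systems) the struct as declared is larger than 40 bytes, so the documented figure may be worth re-checking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* Local copy of the entry layout, with uint32 spelled as uint32_t. */
typedef struct SketchEntry
{
	int			table_oid;
	int			dboid;
	struct timeval first_write_time;
	struct timeval last_write_time;
	uint32_t	hash;
	int			next;
	bool		in_use;
} SketchEntry;

/*
 * Hypothetical size estimate for the table map: four int header fields,
 * one int per bucket, and one entry per tracked table.
 */
static size_t
sketch_table_map_bytes(int num_buckets, int max_entries)
{
	assert(num_buckets > 0 && max_entries > 0);

	return 4 * sizeof(int)
		+ (size_t) num_buckets * sizeof(int)
		+ (size_t) max_entries * sizeof(SketchEntry);
}
```

Calling sketch_table_map_bytes(1024, 2048) gives the estimate for the default settings. Comparing it with the DEBUG1 "bytes requested for shared memory" message added in pgpool_main.c would show how close the documented estimate is.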
diff --git a/src/main/pgpool_main.c b/src/main/pgpool_main.c
index dbf1bfd14..f16e0f576 100644
--- a/src/main/pgpool_main.c
+++ b/src/main/pgpool_main.c
@@ -57,6 +57,7 @@
#include "auth/pool_passwd.h"
#include "auth/pool_hba.h"
#include "query_cache/pool_memqcache.h"
+#include "utils/pool_track_table_mutation.h"
#include "watchdog/wd_internal_commands.h"
#include "watchdog/wd_lifecheck.h"
#include "watchdog/watchdog.h"
@@ -1501,11 +1502,14 @@ sigusr1_interrupt_processor(void)
if (user1SignalSlot->signalFlags[SIG_WATCHDOG_STATE_CHANGED])
{
+ WD_STATES wd_state;
+
ereport(LOG,
(errmsg("Pgpool-II parent process received watchdog state change signal from watchdog")));
user1SignalSlot->signalFlags[SIG_WATCHDOG_STATE_CHANGED] = false;
- if (wd_internal_get_watchdog_local_node_state() == WD_STANDBY)
+ wd_state = wd_internal_get_watchdog_local_node_state();
+ if (wd_state == WD_STANDBY)
{
ereport(LOG,
(errmsg("we have joined the watchdog cluster as STANDBY node"),
@@ -1519,6 +1523,12 @@ sigusr1_interrupt_processor(void)
*/
pool_release_follow_primary_lock(true);
}
+ else if (wd_state == WD_COORDINATOR &&
+ pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ pool_track_table_mutation_trigger_global_cold_start();
+ }
}
if (user1SignalSlot->signalFlags[SIG_FAILOVER_INTERRUPT])
{
@@ -3084,6 +3094,16 @@ initialize_shared_mem_objects(bool clear_memcache_oidmaps)
elog(DEBUG1, "watchdog: %zu bytes requested for shared memory", MAXALIGN(wd_ipc_get_shared_mem_size()));
}
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ size += MAXALIGN(pool_track_table_mutation_shmem_size());
+ elog(DEBUG1,
+ "track_table_mutation: %zu bytes requested"
+ " for shared memory",
+ MAXALIGN(pool_track_table_mutation_shmem_size()));
+ }
+
initialize_shared_memory_main_segment(size);
/* Move the backend descriptors to shared memory */
@@ -3202,6 +3222,13 @@ initialize_shared_mem_objects(bool clear_memcache_oidmaps)
/* initialize pcp worker child pids */
memset(Req_info->pcp_worker_pids, 0, sizeof(Req_info->pcp_worker_pids));
+
+ /* Initialize track table mutation for recently written tables */
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ pool_track_table_mutation_init();
+ }
}
/*
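A note on the shared-memory request above. The TABLE_MAP_BUCKETS and TABLE_MAP_ENTRIES macros in pool_track_table_mutation.c (later in this patch) imply a layout of a fixed header, then the bucket array, then the entry array. Below is a minimal sketch of how a size for that layout could be computed. SketchHashTable, SketchEntry and all of their fields are hypothetical stand-ins, not the patch's TrackTableMutationHashTable or TrackTableMutationEntry, and the real pool_track_table_mutation_shmem_size() may account for more than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header; the field list is an assumption for illustration. */
typedef struct
{
	int			num_buckets;
	int			max_entries;
	int			num_entries;
	int			free_list_head;
} SketchHashTable;

/* Hypothetical entry; the real entry layout is not shown in this section. */
typedef struct
{
	int			table_oid;
	int			dboid;
	int64_t		first_write_us;
	int64_t		last_write_us;
	int			next;			/* index of next entry in chain or free list */
	bool		in_use;
} SketchEntry;

/* Bytes needed for header + bucket array + entry array, in that order. */
static size_t
sketch_table_map_size(int num_buckets, int max_entries)
{
	assert(num_buckets > 0 && max_entries > 0);
	return sizeof(SketchHashTable)
		+ (size_t) num_buckets * sizeof(int)
		+ (size_t) max_entries * sizeof(SketchEntry);
}

/* Bucket array starts right after the header. */
static int *
sketch_buckets(SketchHashTable *map)
{
	return (int *) ((char *) map + sizeof(SketchHashTable));
}

/* Entry array starts right after the bucket array. */
static SketchEntry *
sketch_entries(SketchHashTable *map)
{
	return (SketchEntry *) ((char *) map + sizeof(SketchHashTable)
							+ (size_t) map->num_buckets * sizeof(int));
}
```

In this sketch an odd bucket count would leave the entry array on a 4-byte boundary even though the entry holds 8-byte fields, so rounding the bucket array size up to the entry alignment may be worth considering if the real layout is similar.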
diff --git a/src/protocol/CommandComplete.c b/src/protocol/CommandComplete.c
index 1f63a0e8d..1af3a818b 100644
--- a/src/protocol/CommandComplete.c
+++ b/src/protocol/CommandComplete.c
@@ -38,6 +38,8 @@
#include "utils/palloc.h"
#include "utils/memutils.h"
#include "utils/pool_stream.h"
+#include "utils/pool_track_table_mutation.h"
+#include "query_cache/pool_memqcache.h"
static int extract_ntuples(char *message);
static POOL_STATUS handle_mismatch_tuples(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend, char *packet, int packetlen, bool command_complete);
@@ -304,6 +306,40 @@ handle_query_context(POOL_CONNECTION_POOL *backend)
node = session_context->query_context->parse_tree;
+ /*
+ * Track table writes for the dml_adaptive_global feature. This is only
+ * meaningful in streaming replication mode (MAIN_REPLICA), where
+ * dml_adaptive() maintains is_in_transaction. In other cluster modes the
+ * flag is never set, so running this code while actually inside a
+ * backend transaction would issue a relcache do_query that conflicts with
+ * the in-flight transaction and hangs the session.
+ *
+ * Autocommit statements (outside an explicit transaction) mark their
+ * tables immediately. Inside an explicit transaction, marking is deferred
+ * to COMMIT in dml_adaptive() so that rolled-back writes never reach the
+ * shared memory table map.
+ */
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL &&
+ MAIN_REPLICA &&
+ node != NULL &&
+ !session_context->is_in_transaction)
+ {
+ int *oids;
+ int num_oids;
+
+ num_oids = pool_extract_table_oids(node, &oids);
+ if (num_oids > 0)
+ {
+ int dboid;
+
+ dboid = pool_track_table_mutation_get_database_oid();
+ if (dboid > 0)
+ pool_track_table_mutation_mark_tables_written(
+ oids, num_oids, dboid);
+ }
+ }
+
if (IsA(node, PrepareStmt))
{
if (session_context->uncompleted_message)
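The comment in the CommandComplete.c hunk above describes two paths: autocommit writes are marked immediately, while writes inside an explicit transaction are marked only at COMMIT and dropped on ROLLBACK. A minimal sketch of that rule follows. Every name here (SketchSession, sketch_on_write and so on) is hypothetical, the "mark" step just records OIDs in a local array instead of the shared-memory map, and the fixed-size pending list is an assumption; this is not how dml_adaptive() is actually written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_PENDING 64

/* Hypothetical per-session state. */
typedef struct
{
	bool		in_transaction;
	int			pending[SKETCH_MAX_PENDING];	/* OIDs written in this xact */
	int			num_pending;
} SketchSession;

/* Stand-in for the shared-memory map: a log of marked table OIDs. */
static int	sketch_marked[SKETCH_MAX_PENDING];
static int	sketch_num_marked = 0;

static void
sketch_mark_table_written(int table_oid)
{
	if (sketch_num_marked < SKETCH_MAX_PENDING)
		sketch_marked[sketch_num_marked++] = table_oid;
}

static void
sketch_on_begin(SketchSession *s)
{
	assert(s != NULL);
	s->in_transaction = true;
	s->num_pending = 0;
}

/* Called when a write to table_oid completes. */
static void
sketch_on_write(SketchSession *s, int table_oid)
{
	assert(s != NULL);
	if (!s->in_transaction)
	{
		/* Autocommit: the write is already durable, mark right away. */
		sketch_mark_table_written(table_oid);
		return;
	}
	/* Explicit transaction: remember the OID, decide at COMMIT/ROLLBACK. */
	/* A full list silently drops the OID here; real code needs a policy. */
	if (s->num_pending < SKETCH_MAX_PENDING)
		s->pending[s->num_pending++] = table_oid;
}

/* Called at COMMIT (committed = true) or ROLLBACK (committed = false). */
static void
sketch_on_end(SketchSession *s, bool committed)
{
	int			i;

	assert(s != NULL);
	if (committed)
	{
		for (i = 0; i < s->num_pending; i++)
			sketch_mark_table_written(s->pending[i]);
	}
	s->num_pending = 0;
	s->in_transaction = false;
}
```

The point of deferring is that a rolled-back INSERT never changes what replicas will return, so marking it would only push reads to the primary for no benefit. This matches what Tests 10 and 11 in the regression script check.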
diff --git a/src/protocol/child.c b/src/protocol/child.c
index 761876f53..4a527c84c 100644
--- a/src/protocol/child.c
+++ b/src/protocol/child.c
@@ -57,6 +57,7 @@
#include "utils/elog.h"
#include "utils/ps_status.h"
#include "utils/timestamp.h"
+#include "utils/pool_track_table_mutation.h"
#include "context/pool_process_context.h"
#include "context/pool_session_context.h"
@@ -213,6 +214,13 @@ do_child(int *fds)
/* Initialize per process context */
pool_init_process_context();
+ /* Initialize track table mutation child state for cold start tracking */
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ pool_track_table_mutation_child_init();
+ }
+
/* initialize connection pool */
if (pool_init_cp())
{
diff --git a/src/protocol/pool_proto_modules.c b/src/protocol/pool_proto_modules.c
index 86fb5f8a8..9c2905147 100644
--- a/src/protocol/pool_proto_modules.c
+++ b/src/protocol/pool_proto_modules.c
@@ -1468,7 +1468,9 @@ Parse(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend,
pool_where_to_send(query_context, query_context->original_query,
query_context->parse_tree);
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE && strlen(name) != 0)
+ if (DLBOW_IS_DML_ADAPTIVE(
+ pool_config->disable_load_balance_on_write)
+ && strlen(name) != 0)
pool_setall_node_to_be_sent(query_context);
if (REPLICATION)
@@ -1811,7 +1813,7 @@ Bind(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend,
return POOL_END;
}
- if (pool_config->disable_load_balance_on_write == DLBOW_DML_ADAPTIVE &&
+ if (DLBOW_IS_DML_ADAPTIVE(pool_config->disable_load_balance_on_write) &&
TSTATE(backend, MAIN_REPLICA ? PRIMARY_NODE_ID : REAL_MAIN_NODE_ID) == 'T')
{
pool_where_to_send(query_context, query_context->original_query,
diff --git a/src/sample/pgpool.conf.sample-stream b/src/sample/pgpool.conf.sample-stream
index 1ac982907..ce9b92da0 100644
--- a/src/sample/pgpool.conf.sample-stream
+++ b/src/sample/pgpool.conf.sample-stream
@@ -478,6 +478,14 @@ backend_clustering_mode = streaming_replication
# modified within the current explicit transaction will
# not be load balanced until the end of the transaction.
#
+ # dml_adaptive_global:
+ # Superset of dml_adaptive. In addition to per-transaction
+ # tracking, uses shared memory to track recently written
+ # tables across all sessions. Reads from recently written
+ # tables are routed to primary until a TTL (based on
+ # replication delay) expires. Requires additional shared
+ # memory. See track_table_mutation_* parameters below.
+ #
# always:
# if a write query is issued, read queries will
# not be load balanced until the session ends.
@@ -499,6 +507,43 @@ backend_clustering_mode = streaming_replication
#statement_level_load_balance = off
# Enables statement level load balancing
+# - Track Table Mutation (used by dml_adaptive_global) -
+ # WARNING: dml_adaptive_global increases shared memory usage
+ # Default settings require ~80 KB shared memory for table tracking
+
+#track_table_mutation_ttl_factor = 5.0
+ # TTL multiplier: TTL = replication_delay * factor
+ # Higher values provide more safety margin
+ # Range: 1.0-100.0 (default: 5.0)
+ # (change requires reload)
+
+#track_table_mutation_max_staleness = 60000
+ # Maximum duration (ms) a table can be marked stale
+ # from its first write. Bounds cross-session impact:
+ # even under continuous writes, staleness expires
+ # after this period and is only renewed by new writes.
+ # 0 disables the cap. Range: 0-3600000 (default: 60000 = 60s)
+ # (change requires reload)
+
+#track_table_mutation_cold_start_duration = 2000
+ # Duration in milliseconds to route all queries to primary
+ # after child process starts (cold start period)
+ # Range: 0-60000 ms (default: 2000 ms = 2 seconds)
+ # Set to 0 to disable cold start behavior
+ # (change requires reload)
+
+#track_table_mutation_table_buckets = 1024
+ # Number of hash buckets for track table mutation
+ # Higher values reduce hash collisions
+ # Range: 64-65536 (default: 1024)
+ # (change requires restart)
+
+#track_table_mutation_table_size = 2048
+ # Maximum number of tables to track simultaneously
+ # Range: 128-131072 (default: 2048)
+ # (change requires restart)
+
+
#------------------------------------------------------------------------------
# STREAMING REPLICATION MODE
#------------------------------------------------------------------------------
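To make the two timing parameters above concrete, here is a minimal sketch of how a staleness check could combine them. It assumes TTL = replication_delay * track_table_mutation_ttl_factor, as the sample file states, and reads track_table_mutation_max_staleness as a cap measured from the first write, with 0 disabling the cap. The function names and the choice to keep a first-write and a last-write timestamp per table are assumptions for illustration; the patch's actual check may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* TTL in microseconds from the observed replication delay and the factor. */
static int64_t
sketch_ttl_us(uint64_t delay_us, double ttl_factor)
{
	assert(ttl_factor >= 1.0);	/* documented range is 1.0-100.0 */
	return (int64_t) ((double) delay_us * ttl_factor);
}

/*
 * Decide whether a table should still be read from the primary.
 * All timestamps are microseconds on the same clock.
 */
static bool
sketch_is_stale(int64_t first_write_us, int64_t last_write_us,
				int64_t now_us, int64_t ttl_us, int max_staleness_ms)
{
	/* The TTL window is measured from the most recent write. */
	if (now_us - last_write_us >= ttl_us)
		return false;

	/* Optional cap measured from the first write; 0 means no cap. */
	if (max_staleness_ms > 0 &&
		now_us - first_write_us >= (int64_t) max_staleness_ms * 1000)
		return false;

	return true;
}
```

For example, with a 200 ms replication delay and the default factor of 5.0 the TTL comes out at one second, so a SELECT arriving 999,999 us after the write would go to the primary and one arriving at 1,000,000 us could be load balanced again.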
diff --git a/src/streaming_replication/pool_worker_child.c b/src/streaming_replication/pool_worker_child.c
index 311b63865..cdd570396 100644
--- a/src/streaming_replication/pool_worker_child.c
+++ b/src/streaming_replication/pool_worker_child.c
@@ -58,6 +58,7 @@
#include "utils/pool_ip.h"
#include "utils/ps_status.h"
#include "utils/pool_stream.h"
+#include "utils/pool_track_table_mutation.h"
#include "context/pool_process_context.h"
#include "context/pool_session_context.h"
@@ -419,6 +420,7 @@ check_replication_time_lag(void)
BackendInfo *bkinfo;
uint64 lag;
uint64 delay_threshold_by_time;
+ uint64 max_delay_us = 0;
ErrorContextCallback callback;
int active_standby_node;
bool replication_delay_by_time;
@@ -643,6 +645,10 @@ check_replication_time_lag(void)
* seconds to micro
* seconds */
+ /* Track max delay for mutation TTL */
+ if (lag > max_delay_us)
+ max_delay_us = lag;
+
/* Log delay if necessary */
if ((pool_config->log_standby_delay == LSD_ALWAYS && lag > 0) ||
(pool_config->log_standby_delay == LSD_OVER_THRESHOLD &&
@@ -668,6 +674,13 @@ check_replication_time_lag(void)
}
}
+ /*
+ * Update track table mutation TTL from the max observed time-based
+ * replication delay.
+ */
+ if (replication_delay_by_time && max_delay_us > 0)
+ pool_track_table_mutation_update_ttl(max_delay_us);
+
error_context_stack = callback.previous;
}
@@ -695,6 +708,7 @@ check_replication_time_lag_with_cmd(void)
double delay_ms;
uint64 delay;
uint64 delay_threshold_by_time;
+ uint64 max_delay_us = 0; /* Track max delay for mutation TTL */
int token_count = 0;
int primary_node_id;
int save_errno;
@@ -1003,6 +1017,10 @@ check_replication_time_lag_with_cmd(void)
bkinfo->standby_delay = delay;
bkinfo->standby_delay_by_time = true;
+ /* Track maximum delay for table mutation map TTL calculation */
+ if (delay > max_delay_us)
+ max_delay_us = delay;
+
/*
* Log delay if necessary. threshold is in milliseconds, convert
* to microseconds.
@@ -1021,6 +1039,12 @@ check_replication_time_lag_with_cmd(void)
token = strtok_r(NULL, " \t\n", &saveptr);
}
+ /* Update table mutation TTL based on max observed delay */
+ if (pool_config->disable_load_balance_on_write ==
+ DLBOW_DML_ADAPTIVE_GLOBAL &&
+ max_delay_us > 0)
+ pool_track_table_mutation_update_ttl(max_delay_us);
+
}
PG_CATCH();
{
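Both worker-child hunks above follow the same pattern: remember the largest delay seen across standbys during one check cycle, then feed it to pool_track_table_mutation_update_ttl() only if it is non-zero. A minimal sketch of that reduction is below. The function names and the plain array of per-standby delays are assumptions; the real loops walk BackendInfo entries and parsed command output. Using the maximum presumably reflects that a load-balanced read may land on any standby, so the slowest one bounds the unsafe window.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Largest delay (microseconds) among the standbys seen in one check cycle. */
static uint64_t
sketch_max_delay_us(const uint64_t *standby_delay_us, int num_standbys)
{
	uint64_t	max_delay_us = 0;
	int			i;

	assert(num_standbys >= 0);
	for (i = 0; i < num_standbys; i++)
	{
		if (standby_delay_us[i] > max_delay_us)
			max_delay_us = standby_delay_us[i];
	}
	return max_delay_us;
}

/*
 * A zero maximum means no delay was observed this cycle, so the previous
 * TTL is left untouched rather than being reset to zero.
 */
static bool
sketch_should_update_ttl(uint64_t max_delay_us)
{
	return max_delay_us > 0;
}
```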
diff --git a/src/test/regression/libs.sh b/src/test/regression/libs.sh
index 7c5a0c182..1c8ae392d 100644
--- a/src/test/regression/libs.sh
+++ b/src/test/regression/libs.sh
@@ -42,6 +42,8 @@ function wait_for_failover_done {
function clean_all {
pgrep pgpool | xargs kill -9 > /dev/null 2>&1
pgrep postgres | xargs kill -9 > /dev/null 2>&1
+ # Remove SysV IPC objects leaked by kill -9 (--all is not limited to pgpool's)
+ ipcrm --all 2>/dev/null || true
rm -f $PGSOCKET_DIR/.s.PGSQL.*
netstat -t -p 2>/dev/null|grep pgpool
}
diff --git a/src/test/regression/tests/043.track_table_mutation/test.sh b/src/test/regression/tests/043.track_table_mutation/test.sh
new file mode 100755
index 000000000..8b4dd17b8
--- /dev/null
+++ b/src/test/regression/tests/043.track_table_mutation/test.sh
@@ -0,0 +1,354 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# test script for track table mutation feature (in-memory table tracking).
+# Tests routing of queries based on recently written tables.
+#
+source $TESTLIBS
+TESTDIR=testdir
+PSQL=$PGBIN/psql
+PSQLOPTS="-a -q -X"
+PGPOOLBIN=$PGPOOL_INSTALL_DIR/bin
+export PGDATABASE=test
+
+# Only run in streaming replication mode since that's the target use case
+for mode in s
+do
+ rm -fr $TESTDIR
+ mkdir $TESTDIR
+ cd $TESTDIR
+
+ # Create test environment with 2 nodes
+ echo -n "creating test environment..."
+ $PGPOOL_SETUP -m $mode -n 2 || exit 1
+ echo "done."
+
+ source ./bashrc.ports
+
+ # Configure track table mutation feature via dml_adaptive_global
+ echo "disable_load_balance_on_write = 'dml_adaptive_global'" >> etc/pgpool.conf
+ echo "track_table_mutation_ttl_factor = 5.0" >> etc/pgpool.conf
+ echo "track_table_mutation_cold_start_duration = 10000" >> etc/pgpool.conf
+
+ # Enable load balancing explicitly
+ echo "load_balance_mode = on" >> etc/pgpool.conf
+
+ # Configure weights so we can distinguish routing
+ # Backend 0 (primary) weight=0, Backend 1 (standby) weight=1
+ # This means load balanced queries go to node 1 by default
+ echo "backend_weight0 = 0" >> etc/pgpool.conf
+ echo "backend_weight1 = 1" >> etc/pgpool.conf
+
+ # Enable debug logging to see routing decisions
+ echo "log_min_messages = debug1" >> etc/pgpool.conf
+
+ ./startall
+
+ export PGPORT=$PGPOOL_PORT
+ export PGHOST=localhost
+
+ wait_for_pgpool_startup
+
+ # Create test tables
+ $PSQL test <<EOF
+CREATE TABLE t1(i INTEGER);
+CREATE TABLE t2(i INTEGER);
+CREATE TABLE t3(i INTEGER);
+EOF
+
+ echo "=== Test 1: Cold Start Routing ==="
+ # During cold start, all queries should go to primary
+ # Restart pgpool to trigger cold start
+ ./shutdownall
+ ./startall
+ wait_for_pgpool_startup
+
+ # Immediately query - should go to primary due to cold start
+ $PSQL test -c "SELECT 'cold_start_test' as marker, * FROM t1;" > /dev/null 2>&1
+
+ # Check log for cold start message (use -a to handle binary log files)
+ if grep -a -q "could not load balance because of track table mutation cold start" log/pgpool.log; then
+ echo "Test 1 PASSED: Cold start routing works"
+ else
+ echo "Test 1 FAILED: Cold start routing not detected"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 2: Wait for cold start to end ==="
+ # Wait for cold start period to end (10 seconds).
+ # Use generous margin to avoid flakiness under load (e.g. full regression suite).
+ sleep 12
+
+ # Clear the log
+ > log/pgpool.log
+
+ # Now a clean table query should load balance (go to node 1)
+ $PSQL test -c "SELECT 'after_cold_start' as marker, * FROM t3;" > /dev/null 2>&1
+
+ # After cold start, queries to clean tables should load balance
+ # Check that it did NOT get forced to primary due to track table mutation
+ if grep -a -q "could not load balance because of track table mutation cold start" log/pgpool.log; then
+ echo "Test 2 FAILED: Still in cold start after waiting"
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 2 PASSED: Cold start ended correctly"
+
+ echo "=== Test 3: Write-then-Read Routing ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Write to t1 and then read - use single connection to ensure same session
+ $PSQL test <<EOF
+INSERT INTO t1 VALUES (1);
+SELECT 'write_read_test' as marker, * FROM t1;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ # Check log for table staleness message
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 3 PASSED: Write-then-read routing works"
+ else
+ echo "Test 3 FAILED: Table staleness not detected after write"
+ # Show relevant log entries for debugging
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 4: Clean Table Still Load Balances ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Read from t2 (never written to) - should load balance
+ $PSQL test -c "SELECT 'clean_table_test' as marker, * FROM t2;" > /dev/null 2>&1
+
+ # Should NOT see track table mutation blocking message for t2
+ if grep -a -q "could not load balance because table.*t2.*was recently written" log/pgpool.log; then
+ echo "Test 4 FAILED: Clean table incorrectly marked as stale"
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 4 PASSED: Clean tables still load balance"
+
+ echo "=== Test 5: UPDATE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Update t2 and then read - use single connection
+ $PSQL test <<EOF
+UPDATE t2 SET i = 999 WHERE i = 0;
+SELECT 'update_test' as marker, * FROM t2;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 5 PASSED: UPDATE marks table as stale"
+ else
+ echo "Test 5 FAILED: UPDATE did not mark table as stale"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 6: DELETE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Delete from t3 and then read - use single connection
+ $PSQL test <<EOF
+DELETE FROM t3 WHERE i = 0;
+SELECT 'delete_test' as marker, * FROM t3;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 6 PASSED: DELETE marks table as stale"
+ else
+ echo "Test 6 FAILED: DELETE did not mark table as stale"
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 7: TRUNCATE Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create a fresh table for TRUNCATE test
+ $PSQL test -c "CREATE TABLE t_truncate(i INTEGER);" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_truncate VALUES (1), (2), (3);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log again
+ > log/pgpool.log
+
+ # Truncate and then read - use single connection
+ $PSQL test <<EOF
+TRUNCATE t_truncate;
+SELECT 'truncate_test' as marker, * FROM t_truncate;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 7 PASSED: TRUNCATE marks table as stale"
+ else
+ echo "Test 7 FAILED: TRUNCATE did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ echo "=== Test 8: WITH Clause (CTE with DELETE) Marks Table as Stale ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create a fresh table for WITH test
+ $PSQL test -c "CREATE TABLE t_cte(i INTEGER);" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_cte VALUES (1), (2), (3);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log again
+ > log/pgpool.log
+
+ # Use WITH clause with DELETE, then read from the table
+ $PSQL test <<EOF
+WITH deleted AS (DELETE FROM t_cte WHERE i = 1 RETURNING *)
+SELECT * FROM deleted;
+SELECT 'cte_test' as marker, * FROM t_cte;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 8 PASSED: WITH clause (CTE) marks table as stale"
+ else
+ echo "Test 8 FAILED: WITH clause (CTE) did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ # Test 9: MERGE (PostgreSQL 15+ only)
+ PG_MAJOR_VERSION=$($PSQL -t -c "SELECT substring(version() from 'PostgreSQL ([0-9]+)');" | tr -d ' ')
+ if [ "$PG_MAJOR_VERSION" -ge 15 ] 2>/dev/null; then
+ echo "=== Test 9: MERGE Marks Table as Stale (PostgreSQL $PG_MAJOR_VERSION) ==="
+ # Clear the log
+ > log/pgpool.log
+
+ # Create tables for MERGE test
+ $PSQL test -c "CREATE TABLE t_merge_target(id INTEGER PRIMARY KEY, val TEXT);" > /dev/null 2>&1
+ $PSQL test -c "CREATE TABLE t_merge_source(id INTEGER, val TEXT);" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_merge_target VALUES (1, 'old');" > /dev/null 2>&1
+ $PSQL test -c "INSERT INTO t_merge_source VALUES (1, 'new'), (2, 'insert');" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log again
+ > log/pgpool.log
+
+ # Use MERGE, then read from the target table
+ $PSQL test <<EOF
+MERGE INTO t_merge_target t
+USING t_merge_source s ON t.id = s.id
+WHEN MATCHED THEN UPDATE SET val = s.val
+WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.val);
+SELECT 'merge_test' as marker, * FROM t_merge_target;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 9 PASSED: MERGE marks table as stale"
+ else
+ echo "Test 9 FAILED: MERGE did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+ else
+ echo "=== Test 9: MERGE skipped (requires PostgreSQL 15+, have $PG_MAJOR_VERSION) ==="
+ fi
+
+ echo "=== Test 10: ROLLBACK Does NOT Mark Table as Stale ==="
+ # Create a fresh table for rollback test
+ $PSQL test -c "CREATE TABLE t_rollback(i INTEGER);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log
+ > log/pgpool.log
+
+ # Write inside a transaction, then rollback
+ $PSQL test <<EOF
+BEGIN;
+INSERT INTO t_rollback VALUES (1);
+ROLLBACK;
+SELECT 'rollback_test' as marker, * FROM t_rollback;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ # Should NOT see t_rollback marked as stale since the write was rolled back
+ if grep -a -q "could not load balance because table.*t_rollback.*was recently written" log/pgpool.log; then
+ echo "Test 10 FAILED: Rolled-back write incorrectly marked table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+ echo "Test 10 PASSED: ROLLBACK does not mark table as stale"
+
+ echo "=== Test 11: COMMIT Marks Table as Stale ==="
+ # Create a fresh table for commit test
+ $PSQL test -c "CREATE TABLE t_commit(i INTEGER);" > /dev/null 2>&1
+
+ # Wait for any TTL to expire
+ sleep 3
+
+ # Clear the log
+ > log/pgpool.log
+
+ # Write inside a transaction, then commit, then read
+ $PSQL test <<EOF
+BEGIN;
+INSERT INTO t_commit VALUES (1);
+COMMIT;
+SELECT 'commit_test' as marker, * FROM t_commit;
+EOF
+
+ # Small delay to ensure log is flushed
+ sleep 0.5
+
+ if grep -a -q "could not load balance because table.*was recently written" log/pgpool.log; then
+ echo "Test 11 PASSED: COMMIT marks table as stale"
+ else
+ echo "Test 11 FAILED: Committed write did not mark table as stale"
+ grep -a -i "track_table_mutation" log/pgpool.log | tail -20
+ ./shutdownall
+ exit 1
+ fi
+
+ echo ""
+ echo "=== All Track Table Mutation Tests PASSED ==="
+
+ ./shutdownall
+
+ cd ..
+done
+
+exit 0
diff --git a/src/test/regression/tests/044.track_table_mutation_watchdog/test.sh b/src/test/regression/tests/044.track_table_mutation_watchdog/test.sh
new file mode 100755
index 000000000..c50c213d6
--- /dev/null
+++ b/src/test/regression/tests/044.track_table_mutation_watchdog/test.sh
@@ -0,0 +1,184 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------------
+# Test script for track table mutation global cold start
+# on watchdog leader change.
+#
+# Uses $WATCHDOG_SETUP to create a 2-node watchdog cluster,
+# then verifies that when the leader is stopped the new
+# leader triggers a global cold start.
+#-------------------------------------------------------------------
+source $TESTLIBS
+TESTDIR=testdir
+PSQL=$PGBIN/psql
+success_count=0
+
+dir=`pwd`
+rm -fr $TESTDIR
+mkdir $TESTDIR
+cd $TESTDIR
+
+# Create 2-node watchdog cluster
+$WATCHDOG_SETUP -wn 2 || exit 1
+
+# Ensure per-node scripts are executable
+# (sed -i in watchdog_setup can strip permissions)
+chmod 755 pgpool*/startall pgpool*/shutdownall
+
+# Append track_table_mutation config to both nodes
+for i in 0 1
+do
+ cat >> pgpool${i}/etc/pgpool.conf <<EOF
+disable_load_balance_on_write = 'dml_adaptive_global'
+track_table_mutation_cold_start_duration = 2000
+enable_consensus_with_half_votes = on
+log_min_messages = debug1
+EOF
+done
+
+./startall
+export PCPPASSFILE=$dir/$TESTDIR/pgpool0/pcppass
+
+# Wait for watchdog lifecheck on node 0
+echo -n "waiting for watchdog node 0 starting up..."
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep "lifecheck started" \
+ pgpool0/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ break
+ fi
+ sleep 2
+done
+echo "done."
+
+# Test 1: Verify leader came up
+echo "=== Test 1: Waiting for the pgpool leader... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep "I am the cluster leader node" \
+ pgpool0/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 1 PASSED: Leader brought up."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 1 ]; then
+ echo "Test 1 FAILED: Leader did not start"
+ ./shutdownall
+ exit 1
+fi
+
+# Test 2: Verify standby joined cluster
+echo "=== Test 2: Waiting for standby to join... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep "successfully joined the watchdog cluster" \
+ pgpool1/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 2 PASSED: Standby joined."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 2 ]; then
+ echo "Test 2 FAILED: Standby did not join"
+ ./shutdownall
+ exit 1
+fi
+
+# Test 3: Verify track_table_mutation initialized
+echo "=== Test 3: Verify feature initialized ==="
+if grep -a "track_table_mutation: initialized" \
+ pgpool0/log/pgpool.log > /dev/null 2>&1; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 3 PASSED: Feature initialized."
+else
+ echo "Test 3 FAILED: Feature not initialized"
+ ./shutdownall
+ exit 1
+fi
+
+# Test 4: Stop leader (pgpool0) to trigger a watchdog leader change
+echo "=== Test 4: Stopping leader... ==="
+cd pgpool0
+source ./bashrc.ports
+$PGPOOL_INSTALL_DIR/bin/pgpool \
+ -f etc/pgpool.conf -m f stop
+cd ..
+
+echo "Checking standby detected shutdown..."
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep -a "is shutting down" \
+ pgpool1/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 4 PASSED: Shutdown detected."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 4 ]; then
+ echo "Test 4 FAILED: Shutdown not detected"
+ ./shutdownall
+ exit 1
+fi
+
+# Test 5: Verify standby became new leader
+echo "=== Test 5: Checking standby takes over... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep -a "I am the cluster leader node" \
+ pgpool1/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 5 PASSED: Standby became leader."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+if [ $success_count -lt 5 ]; then
+ echo "Test 5 FAILED: Standby did not become leader"
+ ./shutdownall
+ exit 1
+fi
+
+# Test 6: Verify global cold start was triggered
+echo "=== Test 6: Checking global cold start... ==="
+for i in 1 2 3 4 5 6 7 8 9 10
+do
+ grep -a "track_table_mutation: global cold start" \
+ pgpool1/log/pgpool.log > /dev/null 2>&1
+ if [ $? = 0 ]; then
+ success_count=$(( success_count + 1 ))
+ echo "Test 6 PASSED: Global cold start triggered."
+ break
+ fi
+ echo "[check] $i times"
+ sleep 2
+done
+
+# Cleanup
+./shutdownall
+
+echo ""
+echo "$success_count out of 6 successful"
+
+if test $success_count -eq 6
+then
+ echo "=== All Watchdog Tests PASSED ==="
+ exit 0
+fi
+
+exit 1
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 939200965..467ec114c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -519,6 +519,10 @@ TableLikeClause
TableSampleClause
TargetEntry
TokenizedLine
+TrackTableMutationEntry
+TrackTableMutationHashTable
+TrackTableMutationShmem
+TrackTableMutationState
TransactionId
TransactionStmt
TransactionStmtKind
diff --git a/src/utils/pool_track_table_mutation.c b/src/utils/pool_track_table_mutation.c
new file mode 100644
index 000000000..e7771e7bf
--- /dev/null
+++ b/src/utils/pool_track_table_mutation.c
@@ -0,0 +1,902 @@
+/* -*-pgsql-c-*- */
+/*
+ * pgpool: a language independent connection pool server for PostgreSQL
+ * written by Tatsuo Ishii
+ *
+ * Copyright (c) 2003-2026 PgPool Global Development Group
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose and without fee is hereby
+ * granted, provided that the above copyright notice appear in all
+ * copies and that both that copyright notice and this permission
+ * notice appear in supporting documentation, and that the name of the
+ * author not be used in advertising or publicity pertaining to
+ * distribution of the software without specific, written prior
+ * permission. The author makes no representations about the
+ * suitability of this software for any purpose. It is provided "as
+ * is" without express or implied warranty.
+ *
+ * pool_track_table_mutation.c: In-memory tracking of recently
+ * written tables to prevent stale reads from replicas.
+ *
+ * Based on the "lagless" architecture from Tailor Brands.
+ */
+
+#include <string.h>
+#include <stdlib.h>
+#include <sys/time.h>
+#include <unistd.h>
+
+#include "pool.h"
+#include "pool_config.h"
+#include "context/pool_session_context.h"
+#include "utils/pool_track_table_mutation.h"
+#include "utils/elog.h"
+#include "utils/pool_ipc.h"
+#include "utils/palloc.h"
+#include "utils/pool_relcache.h"
+
+#define DATABASE_TO_OID_QUERY \
+ "SELECT oid FROM pg_catalog.pg_database" \
+ " WHERE datname = '%s'"
+
+/*
+ * Helper macro: true when the feature is not active.
+ */
+#define TRACK_TABLE_MUTATION_DISABLED() \
+ (pool_config->disable_load_balance_on_write != \
+ DLBOW_DML_ADAPTIVE_GLOBAL || \
+ track_table_mutation_shmem == NULL)
+
+/* ----------------
+ * Local variables
+ * ----------------
+ */
+
+/* Pointer to shared memory structure */
+static TrackTableMutationShmem *track_table_mutation_shmem = NULL;
+
+/* Per-process cold start tracking (not in shared memory) */
+static struct timeval process_start_time;
+static bool cold_start_initialized = false;
+
+/* ----------------
+ * Helper macros for flexible arrays in shared memory
+ * ----------------
+ */
+
+/* Get pointer to bucket array in table map */
+#define TABLE_MAP_BUCKETS(map) \
+ ((int *)((char *)(map) + \
+ sizeof(TrackTableMutationHashTable)))
+
+/* Get pointer to entry array in table map */
+#define TABLE_MAP_ENTRIES(map) \
+ ((TrackTableMutationEntry *)((char *)(map) + \
+ sizeof(TrackTableMutationHashTable) + \
+ (map)->num_buckets * sizeof(int)))
+
+/* ----------------
+ * Semaphore lock helpers
+ * ----------------
+ */
+
+static inline void
+table_map_lock(void)
+{
+ pool_semaphore_lock(TRACK_TABLE_MUTATION_TABLE_SEM);
+}
+
+static inline void
+table_map_unlock(void)
+{
+ pool_semaphore_unlock(TRACK_TABLE_MUTATION_TABLE_SEM);
+}
+
+/* ----------------
+ * Hash functions
+ * ----------------
+ */
+
+/*
+ * FNV-1a hash for table/database oid pair
+ */
+static uint32
+fnv1a_hash_table_key(int table_oid, int dboid)
+{
+ uint32 hash = 2166136261u; /* FNV offset basis */
+ uint32 data[2];
+ const unsigned char *bytes;
+ size_t i;
+
+ data[0] = (uint32) table_oid;
+ data[1] = (uint32) dboid;
+ bytes = (const unsigned char *) data;
+
+ for (i = 0; i < sizeof(data); i++)
+ {
+ hash ^= bytes[i];
+ hash *= 16777619u; /* FNV prime */
+ }
+
+ return hash;
+}
+
+/* ----------------
+ * Time utilities
+ * ----------------
+ */
+
+/*
+ * Get elapsed time in microseconds between two timevals
+ */
+static int64
+elapsed_us(struct timeval *start, struct timeval *end)
+{
+ return ((int64) (end->tv_sec - start->tv_sec) * 1000000)
+ + (end->tv_usec - start->tv_usec);
+}
+
+/*
+ * Get current time
+ */
+static void
+get_current_time(struct timeval *tv)
+{
+ gettimeofday(tv, NULL);
+}
+
+/* ----------------
+ * Database oid lookup
+ * ----------------
+ */
+
+static int
+track_table_mutation_get_database_oid_internal(void)
+{
+ int oid = 0;
+ static POOL_RELCACHE *relcache;
+ POOL_CONNECTION_POOL *backend;
+ POOL_SESSION_CONTEXT *session_context;
+
+ /* Safety check: must have shmem initialized */
+ if (track_table_mutation_shmem == NULL)
+ return oid;
+
+ session_context = pool_get_session_context(false);
+ if (session_context == NULL)
+ return oid;
+
+ backend = session_context->backend;
+ if (backend == NULL ||
+ MAIN_CONNECTION(backend) == NULL ||
+ MAIN_CONNECTION(backend)->sp == NULL)
+ return oid;
+
+ /* Ensure database name is valid */
+ if (MAIN_CONNECTION(backend)->sp->database == NULL)
+ return oid;
+
+ if (!relcache)
+ {
+ relcache = pool_create_relcache(
+ pool_config->relcache_size,
+ DATABASE_TO_OID_QUERY,
+ int_register_func,
+ int_unregister_func,
+ false);
+ if (relcache == NULL)
+ {
+ ereport(LOG,
+ (errmsg("track_table_mutation: "
+ "error creating relcache")));
+ return oid;
+ }
+ }
+
+ oid = (int) (intptr_t) pool_search_relcache(
+ relcache, backend,
+ MAIN_CONNECTION(backend)->sp->database);
+ return oid;
+}
+
+int
+pool_track_table_mutation_get_database_oid(void)
+{
+ return track_table_mutation_get_database_oid_internal();
+}
+
+/* ----------------
+ * Table mutation hash table operations
+ * ----------------
+ */
+
+/*
+ * Initialize table mutation hash table
+ */
+static void
+table_map_init(TrackTableMutationHashTable *map,
+ int num_buckets, int max_entries)
+{
+ int *buckets;
+ TrackTableMutationEntry *entries;
+ int i;
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ map->num_buckets = num_buckets;
+ map->max_entries = max_entries;
+ map->num_entries = 0;
+ map->free_list_head = 0;
+
+ buckets = TABLE_MAP_BUCKETS(map);
+ entries = TABLE_MAP_ENTRIES(map);
+
+ /* Initialize all buckets to empty */
+ for (i = 0; i < num_buckets; i++)
+ buckets[i] = invalid;
+
+ /* Initialize free list - chain all entries */
+ for (i = 0; i < max_entries; i++)
+ {
+ entries[i].in_use = false;
+ entries[i].next = (i < max_entries - 1) ?
+ i + 1 : invalid;
+ }
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "table map init %d buckets, "
+ "%d max entries",
+ num_buckets, max_entries)));
+}
+
+/*
+ * Allocate an entry from the free list
+ */
+static int
+table_map_alloc_entry(TrackTableMutationHashTable *map)
+{
+ TrackTableMutationEntry *entries;
+ int idx;
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ entries = TABLE_MAP_ENTRIES(map);
+
+ if (map->free_list_head == invalid)
+ return invalid;
+
+ idx = map->free_list_head;
+ map->free_list_head = entries[idx].next;
+ entries[idx].in_use = true;
+ entries[idx].next = invalid;
+ map->num_entries++;
+
+ return idx;
+}
+
+/*
+ * Free an entry back to the free list
+ */
+static void
+table_map_free_entry(TrackTableMutationHashTable *map,
+ int idx)
+{
+ TrackTableMutationEntry *entries;
+
+ entries = TABLE_MAP_ENTRIES(map);
+
+ entries[idx].in_use = false;
+ entries[idx].next = map->free_list_head;
+ map->free_list_head = idx;
+ map->num_entries--;
+}
+
+/*
+ * Look up a table in the hash table.
+ * Returns entry index or INVALID_INDEX if not found.
+ * Must be called with lock held.
+ */
+static int
+table_map_lookup(TrackTableMutationHashTable *map,
+ int table_oid, int dboid,
+ uint32 hash)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TrackTableMutationEntry *entries;
+ int bucket = hash % map->num_buckets;
+ int idx = buckets[bucket];
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ entries = TABLE_MAP_ENTRIES(map);
+
+ while (idx != invalid)
+ {
+ if (entries[idx].hash == hash &&
+ entries[idx].table_oid == table_oid &&
+ entries[idx].dboid == dboid)
+ {
+ return idx;
+ }
+ idx = entries[idx].next;
+ }
+
+ return invalid;
+}
+
+/*
+ * Insert or update a table entry.
+ * Must be called with lock held.
+ */
+static void
+table_map_insert(TrackTableMutationHashTable *map,
+ int table_oid, int dboid,
+ uint32 hash,
+ struct timeval *write_time)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TrackTableMutationEntry *entries;
+ int bucket = hash % map->num_buckets;
+ int idx;
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ entries = TABLE_MAP_ENTRIES(map);
+
+ /* Check if entry already exists */
+ idx = table_map_lookup(map, table_oid, dboid, hash);
+ if (idx != invalid)
+ {
+ /* Update last write time; keep first_write_time */
+ entries[idx].last_write_time = *write_time;
+ return;
+ }
+
+ /* Allocate new entry */
+ idx = table_map_alloc_entry(map);
+ if (idx == invalid)
+ {
+ int b;
+
+ /* Table is full - evict the head entry of the first non-empty bucket */
+ for (b = 0; b < map->num_buckets; b++)
+ {
+ if (buckets[b] != invalid)
+ {
+ int victim = buckets[b];
+
+ buckets[b] = entries[victim].next;
+ table_map_free_entry(map, victim);
+ idx = table_map_alloc_entry(map);
+ break;
+ }
+ }
+
+ if (idx == invalid)
+ {
+ ereport(WARNING,
+ (errmsg("track_table_mutation: "
+ "failed to allocate entry "
+ "for oid %d (dboid %d)",
+ table_oid, dboid)));
+ return;
+ }
+ }
+
+ /* Initialize new entry */
+ entries[idx].table_oid = table_oid;
+ entries[idx].dboid = dboid;
+ entries[idx].hash = hash;
+ entries[idx].first_write_time = *write_time;
+ entries[idx].last_write_time = *write_time;
+
+ /* Insert at head of bucket chain */
+ entries[idx].next = buckets[bucket];
+ buckets[bucket] = idx;
+
+ ereport(DEBUG2,
+ (errmsg("track_table_mutation: "
+ "marked oid %d (dboid %d) written",
+ table_oid, dboid)));
+}
+
+/*
+ * Remove expired entries from the table map.
+ * Must be called with lock held.
+ */
+static void
+table_map_cleanup_expired(
+ TrackTableMutationHashTable *map, uint64 ttl_us)
+{
+ int *buckets = TABLE_MAP_BUCKETS(map);
+ TrackTableMutationEntry *entries;
+ struct timeval now;
+ int64 max_stale_us;
+ int removed = 0;
+ int b;
+ int invalid = TRACK_TABLE_MUTATION_INVALID_INDEX;
+
+ entries = TABLE_MAP_ENTRIES(map);
+ get_current_time(&now);
+
+ max_stale_us = (int64) pool_config
+ ->track_table_mutation_max_staleness * 1000LL;
+
+ for (b = 0; b < map->num_buckets; b++)
+ {
+ int *prev_ptr = &buckets[b];
+ int idx = buckets[b];
+
+ while (idx != invalid)
+ {
+ int64 age;
+ int64 total_age;
+ bool expired;
+
+ age = elapsed_us(
+ &entries[idx].last_write_time, &now);
+ expired = (age > (int64) ttl_us);
+
+ /*
+ * Also evict entries that exceed max_staleness from first write.
+ */
+ if (!expired && max_stale_us > 0)
+ {
+ total_age = elapsed_us(
+ &entries[idx].first_write_time,
+ &now);
+ expired = (total_age >= max_stale_us);
+ }
+
+ if (expired)
+ {
+ /* Entry has expired - remove it */
+ int next = entries[idx].next;
+
+ *prev_ptr = next;
+ table_map_free_entry(map, idx);
+ idx = next;
+ removed++;
+ }
+ else
+ {
+ prev_ptr = &entries[idx].next;
+ idx = entries[idx].next;
+ }
+ }
+ }
+
+ if (removed > 0)
+ {
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "cleaned up %d expired entries",
+ removed)));
+ }
+}
+
+
+/* ----------------
+ * Public API implementation
+ * ----------------
+ */
+
+/*
+ * Calculate the total shared memory size required
+ * for the track table mutation feature.
+ */
+Size
+pool_track_table_mutation_shmem_size(void)
+{
+ Size size = 0;
+ int tbl_bkt;
+ int tbl_sz;
+
+ tbl_bkt = pool_config->track_table_mutation_table_buckets;
+ tbl_sz = pool_config->track_table_mutation_table_size;
+
+ /* Main structure */
+ size += sizeof(TrackTableMutationShmem);
+
+ /* Table mutation hash table */
+ size += sizeof(TrackTableMutationHashTable);
+ size += tbl_bkt * sizeof(int);
+ size += tbl_sz * sizeof(TrackTableMutationEntry);
+
+ return size;
+}
+
+/*
+ * Initialize shared memory structures for the
+ * track table mutation feature. Allocates and sets
+ * up the table map in shared memory.
+ * Called once from pgpool main process at startup.
+ */
+void
+pool_track_table_mutation_init(void)
+{
+#ifndef POOL_PRIVATE
+ Size shmem_size;
+ char *shmem_ptr;
+ TrackTableMutationState *st;
+ int tbl_bkt;
+ int tbl_sz;
+
+ if (pool_config->disable_load_balance_on_write !=
+ DLBOW_DML_ADAPTIVE_GLOBAL)
+ {
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "feature disabled")));
+ return;
+ }
+
+ tbl_bkt = pool_config->track_table_mutation_table_buckets;
+ tbl_sz = pool_config->track_table_mutation_table_size;
+
+ shmem_size = pool_track_table_mutation_shmem_size();
+
+ /*
+ * Allocate from the main shared memory segment. Memory is zeroed by
+ * initialize_shared_memory_main_segment().
+ */
+ shmem_ptr = pool_shared_memory_segment_get_chunk(
+ shmem_size);
+ if (shmem_ptr == NULL)
+ {
+ ereport(ERROR,
+ (errmsg("track_table_mutation: "
+ "failed to allocate %zu bytes",
+ shmem_size)));
+ return;
+ }
+
+ /* Set up pointers within shared memory */
+ track_table_mutation_shmem =
+ (TrackTableMutationShmem *) shmem_ptr;
+ shmem_ptr += sizeof(TrackTableMutationShmem);
+
+ track_table_mutation_shmem->table_map =
+ (TrackTableMutationHashTable *) shmem_ptr;
+
+ /* Initialize table map */
+ table_map_init(
+ track_table_mutation_shmem->table_map,
+ tbl_bkt, tbl_sz);
+
+ /* Initialize global state */
+ st = &track_table_mutation_shmem->state;
+ st->initialized = true;
+ st->current_ttl_us = TRACK_TABLE_MUTATION_DEFAULT_TTL_US;
+ get_current_time(&st->ttl_last_updated);
+ get_current_time(&st->last_cleanup_time);
+ st->global_cold_start_until.tv_sec = 0;
+ st->global_cold_start_until.tv_usec = 0;
+ st->stats_queries_checked = 0;
+ st->stats_forced_primary = 0;
+ st->stats_allowed_replica = 0;
+
+ ereport(LOG,
+ (errmsg("track_table_mutation: "
+ "initialized with %zu bytes shmem",
+ shmem_size)));
+#endif
+}
+
+/*
+ * Initialize per-child process state.
+ * Records the process start time for cold start
+ * period tracking. Called when a child process starts.
+ */
+void
+pool_track_table_mutation_child_init(void)
+{
+ int dur;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return;
+
+ get_current_time(&process_start_time);
+ cold_start_initialized = true;
+ dur = pool_config->track_table_mutation_cold_start_duration;
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "child init, cold start %d ms",
+ dur)));
+}
+
+/*
+ * Check if the process is in cold start period.
+ * During cold start, all queries are routed to
+ * primary to avoid stale reads. Checks both
+ * per-process and global (watchdog) cold start.
+ */
+bool
+pool_track_table_mutation_in_cold_start(void)
+{
+ struct timeval now;
+ int64 elapsed_ms;
+ int dur;
+ TrackTableMutationState *st;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return false;
+
+ dur = pool_config->track_table_mutation_cold_start_duration;
+ if (dur <= 0)
+ return false;
+
+ get_current_time(&now);
+ st = &track_table_mutation_shmem->state;
+
+ /* Check watchdog-triggered global cold start */
+ if (st->global_cold_start_until.tv_sec != 0 &&
+ elapsed_us(&now,
+ &st->global_cold_start_until) > 0)
+ {
+ return true;
+ }
+
+ /* Check per-process cold start */
+ if (!cold_start_initialized)
+ return false;
+
+ elapsed_ms = elapsed_us(&process_start_time, &now) / 1000;
+
+ if (elapsed_ms < dur)
+ {
+ ereport(DEBUG2,
+ (errmsg("track_table_mutation: "
+ "cold start (%ld/%d ms)",
+ (long) elapsed_ms, dur)));
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Trigger a global cold start for all processes.
+ * Sets the cold start end time in shared memory.
+ * Called after watchdog leader change to force all
+ * queries to primary during the transition.
+ */
+void
+pool_track_table_mutation_trigger_global_cold_start(void)
+{
+ struct timeval now;
+ struct timeval *until;
+ int dur;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return;
+
+ dur = pool_config->track_table_mutation_cold_start_duration;
+ if (dur <= 0)
+ return;
+
+ get_current_time(&now);
+ until = &track_table_mutation_shmem->state
+ .global_cold_start_until;
+ *until = now;
+ until->tv_sec += dur / 1000;
+ until->tv_usec += (dur % 1000) * 1000;
+ if (until->tv_usec >= 1000000)
+ {
+ until->tv_sec += until->tv_usec / 1000000;
+ until->tv_usec %= 1000000;
+ }
+
+ ereport(LOG,
+ (errmsg("track_table_mutation: "
+ "global cold start for %d ms",
+ dur)));
+}
+
+/*
+ * Check if a table was recently written (is "stale").
+ * Returns true if reads should go to primary because
+ * the table was written within the current TTL window.
+ */
+bool
+pool_track_table_mutation_table_is_stale(
+ int table_oid, int dboid)
+{
+ TrackTableMutationHashTable *map;
+ struct timeval now;
+ uint64 ttl_us;
+ uint32 hash;
+ int idx;
+ bool is_stale = false;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return false;
+
+ if (table_oid <= 0 || dboid <= 0)
+ {
+ is_stale = true;
+ goto update_stats;
+ }
+
+ map = track_table_mutation_shmem->table_map;
+ hash = fnv1a_hash_table_key(table_oid, dboid);
+
+ table_map_lock();
+
+ idx = table_map_lookup(map, table_oid, dboid, hash);
+ if (idx != TRACK_TABLE_MUTATION_INVALID_INDEX)
+ {
+ TrackTableMutationEntry *entries;
+ int64 age;
+ int64 total_age;
+ int64 max_stale_us;
+
+ entries = TABLE_MAP_ENTRIES(map);
+ get_current_time(&now);
+ ttl_us = track_table_mutation_shmem->state
+ .current_ttl_us;
+
+ age = elapsed_us(
+ &entries[idx].last_write_time, &now);
+ is_stale = (age < (int64) ttl_us);
+
+ /*
+ * Enforce max_staleness hard cap: no entry can force primary routing
+ * longer than max_staleness from its first write.
+ */
+ if (is_stale)
+ {
+ max_stale_us = (int64) pool_config
+ ->track_table_mutation_max_staleness
+ * 1000LL;
+ if (max_stale_us > 0)
+ {
+ total_age = elapsed_us(
+ &entries[idx].first_write_time,
+ &now);
+ if (total_age >= max_stale_us)
+ is_stale = false;
+ }
+ }
+
+ ereport(DEBUG2,
+ (errmsg("track_table_mutation: "
+ "oid %d dboid %d "
+ "elapsed=%ld ttl=%lu stale=%d",
+ table_oid, dboid,
+ (long) age,
+ (unsigned long) ttl_us,
+ is_stale)));
+ }
+
+ table_map_unlock();
+
+update_stats:
+ /* Update statistics using semaphore */
+ if (track_table_mutation_shmem != NULL)
+ {
+ TrackTableMutationState *st;
+
+ table_map_lock();
+ st = &track_table_mutation_shmem->state;
+ st->stats_queries_checked++;
+ if (is_stale)
+ st->stats_forced_primary++;
+ else
+ st->stats_allowed_replica++;
+ table_map_unlock();
+ }
+
+ return is_stale;
+}
+
+/*
+ * Mark multiple tables as recently written.
+ * Called after DML queries complete to record
+ * which tables were modified.
+ */
+void
+pool_track_table_mutation_mark_tables_written(
+ const int *table_oids, int num_tables, int dboid)
+{
+ TrackTableMutationHashTable *map;
+ TrackTableMutationState *st;
+ struct timeval now;
+ int i;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return;
+
+ if (num_tables <= 0 || table_oids == NULL ||
+ dboid <= 0)
+ return;
+
+ map = track_table_mutation_shmem->table_map;
+ st = &track_table_mutation_shmem->state;
+ get_current_time(&now);
+
+ table_map_lock();
+
+ /* Periodically clean up expired entries */
+ if (map->num_entries > map->max_entries * 3 / 4)
+ {
+ int64 since_cleanup;
+
+ since_cleanup = elapsed_us(
+ &st->last_cleanup_time, &now);
+ /* 100ms interval */
+ if (since_cleanup > 100000)
+ {
+ table_map_cleanup_expired(
+ map, st->current_ttl_us);
+ st->last_cleanup_time = now;
+ }
+ }
+
+ for (i = 0; i < num_tables; i++)
+ {
+ uint32 hash;
+ int table_oid = table_oids[i];
+
+ if (table_oid > 0)
+ {
+ hash = fnv1a_hash_table_key(
+ table_oid, dboid);
+ table_map_insert(map, table_oid,
+ dboid, hash, &now);
+ }
+ }
+
+ table_map_unlock();
+}
+
+/*
+ * Mark a single table as recently written.
+ */
+void
+pool_track_table_mutation_mark_table_written(
+ int table_oid, int dboid)
+{
+ if (table_oid > 0 && dboid > 0)
+ {
+ const int tables[1] = {table_oid};
+
+ pool_track_table_mutation_mark_tables_written(
+ tables, 1, dboid);
+ }
+}
+
+/*
+ * Update the staleness TTL based on observed
+ * replication delay. New TTL = delay * factor,
+ * clamped to [default_ttl, 1 hour].
+ */
+void
+pool_track_table_mutation_update_ttl(uint64 delay_us)
+{
+ uint64 new_ttl;
+ double factor;
+ TrackTableMutationState *st;
+
+ if (TRACK_TABLE_MUTATION_DISABLED())
+ return;
+
+ factor = pool_config->track_table_mutation_ttl_factor;
+ new_ttl = (uint64) (delay_us * factor);
+ if (new_ttl < TRACK_TABLE_MUTATION_DEFAULT_TTL_US)
+ new_ttl = TRACK_TABLE_MUTATION_DEFAULT_TTL_US;
+
+ /* Maximum TTL of 1 hour */
+ if (new_ttl > 3600ULL * 1000000ULL)
+ new_ttl = 3600ULL * 1000000ULL;
+
+ st = &track_table_mutation_shmem->state;
+ st->current_ttl_us = new_ttl;
+ get_current_time(&st->ttl_last_updated);
+
+ ereport(DEBUG1,
+ (errmsg("track_table_mutation: "
+ "TTL=%lu us (delay=%lu factor=%.1f)",
+ (unsigned long) new_ttl,
+ (unsigned long) delay_us,
+ factor)));
+}
--
2.54.0
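For readers following the patch, the read-side decision made in pool_track_table_mutation_table_is_stale() comes down to two comparisons. The sketch below is a simplified illustration under stated assumptions, not the patch's code: timestamps are plain microsecond counters rather than struct timeval, and the names sketch_entry and sketch_is_stale are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified entry: timestamps in microseconds. */
typedef struct
{
    int64_t first_write_us;  /* first write since the entry was created */
    int64_t last_write_us;   /* most recent write */
} sketch_entry;

/*
 * Return true when a read of the table should go to the primary.
 * A max_stale_us of 0 or less disables the hard cap.
 */
static bool
sketch_is_stale(const sketch_entry *e, int64_t now_us,
                int64_t ttl_us, int64_t max_stale_us)
{
    assert(e != NULL);

    /* Outside the TTL window measured from the last write: not stale. */
    if (now_us - e->last_write_us >= ttl_us)
        return false;

    /* Hard cap measured from the first write overrides the TTL window. */
    if (max_stale_us > 0 && now_us - e->first_write_us >= max_stale_us)
        return false;

    return true;
}
```

The hard cap is measured from the first write, so a table that is written continuously cannot pin reads to the primary for longer than max_staleness.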
* Re: Proposal: Recent mutated table tracking in memory
From: Tatsuo Ishii @ 2026-02-25 23:55 UTC
To: [email protected]; +Cc: [email protected]
> Hi Tatsuo,
>
> Thank you for the careful review. You raised an important concern. I've
> addressed it in the updated patch; here's the explanation:
>
> The attack scenario you describe is now handled. In the updated patch,
> writes inside explicit transactions are only flushed to the shared-memory
> table map at COMMIT time. If the transaction is rolled back, the table is
> never marked as stale. So the attack pattern:
>
> BEGIN;
> UPDATE t1 SET i = 1 WHERE FALSE;
> ROLLBACK;
>
> has zero effect on the shared-memory table map. The dml_adaptive_global
> mode piggybacks on the existing dml_adaptive per-transaction write list
> (transaction_temp_write_list). On COMMIT, the accumulated table names are
> resolved to OIDs and flushed to shared memory. On ROLLBACK,
> the list is simply discarded (the existing dml_adaptive behavior).
Ok.
> For autocommit statements (outside explicit transactions), tables are
> marked immediately, but in that case the write is committed, so this is
> correct.
Agreed.
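The commit-time behavior discussed above can be illustrated with a minimal sketch. It is not pgpool code: sketch_session, sketch_oid_set and the sketch_* functions are hypothetical names, the shared-memory table map is replaced by a small in-process array, and tables are identified directly by OID instead of being resolved from names at COMMIT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_TABLES 16

/* Hypothetical stand-in for the shared table map: a small set of OIDs. */
typedef struct
{
    int    oids[SKETCH_MAX_TABLES];
    size_t count;
} sketch_oid_set;

/* Per-session state: writes seen inside the current explicit transaction. */
typedef struct
{
    bool           in_transaction;
    sketch_oid_set pending;
} sketch_session;

static bool
sketch_set_contains(const sketch_oid_set *s, int oid)
{
    for (size_t i = 0; i < s->count; i++)
    {
        if (s->oids[i] == oid)
            return true;
    }
    return false;
}

/* Add an OID if it is not already present and there is room left. */
static void
sketch_set_add(sketch_oid_set *s, int oid)
{
    if (!sketch_set_contains(s, oid) && s->count < SKETCH_MAX_TABLES)
        s->oids[s->count++] = oid;
}

/* BEGIN: start collecting writes privately. */
static void
sketch_begin(sketch_session *ses)
{
    ses->in_transaction = true;
    ses->pending.count = 0;
}

/* A write is deferred inside a transaction and published at once otherwise. */
static void
sketch_on_write(sketch_session *ses, sketch_oid_set *shared, int oid)
{
    assert(oid > 0);
    if (ses->in_transaction)
        sketch_set_add(&ses->pending, oid);
    else
        sketch_set_add(shared, oid);
}

/* COMMIT publishes the pending writes; ROLLBACK simply drops them. */
static void
sketch_end(sketch_session *ses, sketch_oid_set *shared, bool committed)
{
    if (committed)
    {
        for (size_t i = 0; i < ses->pending.count; i++)
            sketch_set_add(shared, ses->pending.oids[i]);
    }
    ses->pending.count = 0;
    ses->in_transaction = false;
}
```

In this model a BEGIN / write / ROLLBACK sequence leaves the shared set untouched, while the same sequence ending in COMMIT, or a write outside any transaction, adds the table to it.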
> Regression test included. Test 042 now includes:
> - Test 10: verifies that BEGIN; INSERT; ROLLBACK; SELECT does NOT route
> the SELECT to primary
> - Test 11: verifies that BEGIN; INSERT; COMMIT; SELECT DOES route the
> SELECT to primary
>
> Additional context on the threat model:
>
> 1. This feature requires disable_load_balance_on_write =
> 'dml_adaptive_global'; it is opt-in, not enabled by default. Operators who
> enable it accept documented trade-offs (additional shared memory, TTL-based
> staleness window).
Ok.
> 2. An attacker who can connect and execute SQL against pgpool already has
> the ability to cause far more damage (DROP TABLE, mass DELETEs, resource
> exhaustion via expensive queries, connection flooding, etc.). The
> table-marking via committed writes is a minor concern compared to
> those vectors.
These existing risks are widely understood by PostgreSQL users and
already accepted (with various measures/workarounds). But this concern
is a new one. If it is left unaddressed, it will make users hesitate to
use the new feature. I don't want to add a feature that users might
avoid, especially when we know that it is technically possible to
remove the concern.
> Authentication, connection limits, and network security
> are the appropriate defenses at that layer.
> 3. Even in the worst case (an attacker commits real writes in a loop),
> the impact is bounded: the stale marking is temporary (TTL-based, typically
> a few seconds), and only affects load-balancing decisions; it doesn't
> cause data loss or correctness issues.
But it makes Pgpool's load balancing feature less useful. In my
understanding, the new feature tries to keep load balancing useful
while also avoiding stale reads.
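As a rough illustration of how long a table stays marked, the sketch below mirrors the clamping rule in the patch's pool_track_table_mutation_update_ttl(): TTL = delay * factor, bounded below by a default and above by one hour. The name sketch_ttl_us and the floor_us parameter are assumptions made for this example; the patch takes its floor from TRACK_TABLE_MUTATION_DEFAULT_TTL_US, whose value is not shown in this part of the thread.

```c
#include <assert.h>
#include <stdint.h>

/* One hour upper bound, matching the cap used in the patch. */
#define SKETCH_MAX_TTL_US (3600ULL * 1000000ULL)

/*
 * Compute a TTL in microseconds from the observed replication delay.
 * floor_us is a hypothetical stand-in for the patch's default TTL.
 */
static uint64_t
sketch_ttl_us(uint64_t delay_us, double factor, uint64_t floor_us)
{
    double   scaled;
    uint64_t ttl;

    assert(factor >= 0.0);
    assert(floor_us <= SKETCH_MAX_TTL_US);

    /* Clamp in floating point first so the cast below cannot overflow. */
    scaled = (double) delay_us * factor;
    if (scaled >= (double) SKETCH_MAX_TTL_US)
        return SKETCH_MAX_TTL_US;

    ttl = (uint64_t) scaled;
    if (ttl < floor_us)
        ttl = floor_us;
    return ttl;
}
```

With a factor of 5.0 (the default listed in the original proposal) and 200 ms of replication delay, sketch_ttl_us(200000, 5.0, 500000) gives 1000000, that is, a one second window during which reads of the table go to the primary.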
> 4. The existing dml_adaptive mode has analogous behavior: within a
> transaction, a write to table T causes all reads of T to go to primary for
> the remainder of that transaction. The only difference is scope:
> dml_adaptive_global extends this across sessions with a TTL.
The scope difference is huge. With the existing dml_adaptive mode, an
attacker only hits himself in the foot. Your patch allows him to hit
someone else's foot.
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp