public inbox for [email protected]help / color / mirror / Atom feed
Autofail back inconsistent 3+ messages / 2 participants [nested] [flat]
* Autofail back inconsistent @ 2025-11-04 05:23 VASUKI M <[email protected]> 2025-11-04 11:40 ` Re: Autofail back inconsistent Tatsuo Ishii <[email protected]> 0 siblings, 1 reply; 3+ messages in thread From: VASUKI M @ 2025-11-04 05:23 UTC (permalink / raw) To: [email protected]; Bo Peng <[email protected]>; +Cc: [email protected]; [email protected] Hi Pgpool-II Community, I’ve been working with PostgreSQL for quite some time, and while managing backend servers with Pgpool-II, I noticed that there’s no built-in data synchronization before standby nodes are reattached. To address this, I’ve developed a new enhancement for Pgpool-II called *“Automatic Resync and Reattach”*, aimed at improving high availability and reducing manual intervention after node failures. This feature automatically performs: - *pg_rewind synchronization* and *WAL replay* for failed or lagging standby nodes - *Automatic reattachment* of standby nodes once they are safely resynced This enhancement overcomes some limitations of existing features: - *Online Recovery* – requires manual execution of pcp_recovery_node - *Auto Failback* – may reattach inconsistent nodes without synchronization With *Auto Resync & Reattach*, Pgpool-II ensures data safety and consistency during failover recovery, providing a more robust and self-healing HA environment. I’ve created a dedicated GitHub repository with setup instructions, configuration samples, and architecture diagrams here: https://github.com/BharatDBPG/pgpool2-auto-resync The repository includes: - Enhanced failover.sh and follow_primary.sh scripts (handling both primary and standby failures) - A new resync.sh script integrating pg_rewind + WAL replay - Sample configuration files for a *1 Primary + 2 Standby* setup I would love to receive feedback or suggestions from the community regarding possible improvements or integration ideas. I will be happy to share a patch if needed. Thank you for your time and support. Best regards, *Vasuki MCDAC,CHENNAI* https://github.com/BharatDBPG/pgpool2-auto-resync ^ permalink raw reply [nested|flat] 3+ messages in thread
* Re: Autofail back inconsistent 2025-11-04 05:23 Autofail back inconsistent VASUKI M <[email protected]> @ 2025-11-04 11:40 ` Tatsuo Ishii <[email protected]> 2025-11-07 11:25 ` Re: Autofail back inconsistent VASUKI M <[email protected]> 0 siblings, 1 reply; 3+ messages in thread From: Tatsuo Ishii @ 2025-11-04 11:40 UTC (permalink / raw) To: [email protected]; +Cc: [email protected]; [email protected]; [email protected] > Hi Pgpool-II Community, Thank you for the proposal. > I’ve been working with PostgreSQL for quite some time, and while managing > backend servers with Pgpool-II, I noticed that there’s no built-in data > synchronization before standby nodes are reattached. > > To address this, I’ve developed a new enhancement for Pgpool-II called > *“Automatic > Resync and Reattach”*, aimed at improving high availability and reducing > manual intervention after node failures. > > This feature automatically performs: > > - > > *pg_rewind synchronization* and *WAL replay* for failed or lagging > standby nodes > - > > *Automatic reattachment* of standby nodes once they are safely resynced > > This enhancement overcomes some limitations of existing features: > > - > > *Online Recovery* – requires manual execution of pcp_recovery_node > - > > *Auto Failback* – may reattach inconsistent nodes without synchronization > > With *Auto Resync & Reattach*, Pgpool-II ensures data safety and > consistency during failover recovery, providing a more robust and > self-healing HA environment. > > I’ve created a dedicated GitHub repository with setup instructions, > configuration samples, and architecture diagrams here: > https://github.com/BharatDBPG/pgpool2-auto-resync > > The repository includes: > > - > > Enhanced failover.sh and follow_primary.sh scripts (handling both > primary and standby failures) > - > > A new resync.sh script integrating pg_rewind + WAL replay > - > > Sample configuration files for a *1 Primary + 2 Standby* setup > > I would love to receive feedback or suggestions from the community > regarding possible improvements or integration ideas. > > I will be happy to share a patch if needed. > Thank you for your time and support. > > Best regards, > > *Vasuki MCDAC,CHENNAI* > > https://github.com/BharatDBPG/pgpool2-auto-resync Yes, it would be nice to share a patch. As we follow the PostgreSQL development model (not using GitHub), we are familiar with discussing on a patch base. Best regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp ^ permalink raw reply [nested|flat] 3+ messages in thread
* Re: Autofail back inconsistent 2025-11-04 05:23 Autofail back inconsistent VASUKI M <[email protected]> 2025-11-04 11:40 ` Re: Autofail back inconsistent Tatsuo Ishii <[email protected]> @ 2025-11-07 11:25 ` VASUKI M <[email protected]> 0 siblings, 0 replies; 3+ messages in thread From: VASUKI M @ 2025-11-07 11:25 UTC (permalink / raw) To: Tatsuo Ishii <[email protected]>; Bo Peng <[email protected]>; +Cc: [email protected]; [email protected] Thank you for your reply Tatsuo. Overview: This enhancement ensures data consistency during standby node reattachment by automatically performing pg_rewind synchronization and WAL replay before allowing a node to rejoin the cluster. This automation reduces manual intervention, improves fault recovery speed, and enhances high availability. Key Changes - Enhanced `failover.sh` and `follow_primary.sh` scripts to trigger automatic synchronization. - Added a new `resync.sh` script for safe data synchronization and reattach. - Included sample configuration files for a 3-node cluster (1 primary + 2 standbys). - Improved recovery automation and minimized risk of inconsistent data after failover. Testing The feature has been successfully tested on a **3-node local cluster setup**: - 1 Primary PostgreSQL node - 2 Standby nodes managed through Pgpool-II - Verified automatic resync and reattach after simulated failover events. Patch Details The patch is based on the latest Pgpool-II master branch (commit `76a06e5af`). Please find the attached patch file: **0001-Add-Automatic-Resync-and-Reattach-feature-pg_rewind-.patch** I would appreciate it if the community could review the changes and provide feedback or suggestions for improvement. Thank you for your time and support. Best regards, **Vasuki M** CDAC Chennai Email: [email protected] On Tue, 4 Nov 2025 at 17:10, Tatsuo Ishii <[email protected]> wrote: > > Hi Pgpool-II Community, > > Thank you for the proposal. > > > I’ve been working with PostgreSQL for quite some time, and while managing > > backend servers with Pgpool-II, I noticed that there’s no built-in data > > synchronization before standby nodes are reattached. > > > > To address this, I’ve developed a new enhancement for Pgpool-II called > > *“Automatic > > Resync and Reattach”*, aimed at improving high availability and reducing > > manual intervention after node failures. > > > > This feature automatically performs: > > > > - > > > > *pg_rewind synchronization* and *WAL replay* for failed or lagging > > standby nodes > > - > > > > *Automatic reattachment* of standby nodes once they are safely > resynced > > > > This enhancement overcomes some limitations of existing features: > > > > - > > > > *Online Recovery* – requires manual execution of pcp_recovery_node > > - > > > > *Auto Failback* – may reattach inconsistent nodes without > synchronization > > > > With *Auto Resync & Reattach*, Pgpool-II ensures data safety and > > consistency during failover recovery, providing a more robust and > > self-healing HA environment. > > > > I’ve created a dedicated GitHub repository with setup instructions, > > configuration samples, and architecture diagrams here: > > https://github.com/BharatDBPG/pgpool2-auto-resync > > > > The repository includes: > > > > - > > > > Enhanced failover.sh and follow_primary.sh scripts (handling both > > primary and standby failures) > > - > > > > A new resync.sh script integrating pg_rewind + WAL replay > > - > > > > Sample configuration files for a *1 Primary + 2 Standby* setup > > > > I would love to receive feedback or suggestions from the community > > regarding possible improvements or integration ideas. > > > > I will be happy to share a patch if needed. > > Thank you for your time and support. > > > > Best regards, > > > > *Vasuki MCDAC,CHENNAI* > > > > https://github.com/BharatDBPG/pgpool2-auto-resync > > Yes, it would be nice to share a patch. As we follow the PostgreSQL > development model (not using GitHub), we are familiar with discussing > on a patch base. > > Best regards, > -- > Tatsuo Ishii > SRA OSS K.K. > English: http://www.sraoss.co.jp/index_en/ > Japanese:http://www.sraoss.co.jp > Attachments: [text/x-patch] 0001-Add-Automatic-Resync-and-Reattach-feature-pg_rewind-.patch (24.3K, 3-0001-Add-Automatic-Resync-and-Reattach-feature-pg_rewind-.patch) download | inline diff: From 1a08939048d0a397bb101c3b61708b16f2dab00a Mon Sep 17 00:00:00 2001 From: BharatDBPG <[email protected]> Date: Fri, 7 Nov 2025 16:42:06 +0530 Subject: [PATCH] Add Automatic Resync and Reattach feature (pg_rewind + WAL replay) This patch introduces a new enhancement to Pgpool-II that ensures data consistency during node reattachment by performing pg_rewind synchronization and WAL replay before reattaching a standby node. Changes include: - Enhanced failover.sh and follow_primary.sh scripts - New resync.sh script for automatic synchronization and safe reattach - New sample configuration files for 1 primary and 2 standby setup This enhancement improves high availability by combining data safety with automation, reducing manual recovery and reattach operations. Tested successfully on a 3-node (1 primary + 2 standby) local cluster. Signed-off-by: Vasuki M <[email protected]> Signed-off-by: BharatDBPG <[email protected]> --- src/sample/Postgresql_main.conf.sample | 25 +++ src/sample/pcp.conf.sample | 2 +- src/sample/pg_hba.conf.sample | 13 ++ src/sample/pool_hba.conf.sample | 4 + src/sample/scripts/failover.sh.sample | 102 +++--------- src/sample/scripts/follow_primary.sh.sample | 163 ++++---------------- src/sample/scripts/recovery.sh.sample | 87 +++++++++++ src/sample/scripts/resync.sh | 155 +++++++++++++++++++ 8 files changed, 340 insertions(+), 211 deletions(-) create mode 100644 src/sample/Postgresql_main.conf.sample create mode 100644 src/sample/pg_hba.conf.sample create mode 100755 src/sample/scripts/recovery.sh.sample create mode 100755 src/sample/scripts/resync.sh diff --git a/src/sample/Postgresql_main.conf.sample b/src/sample/Postgresql_main.conf.sample new file mode 100644 index 000000000..6e28b87ba --- /dev/null +++ b/src/sample/Postgresql_main.conf.sample @@ -0,0 +1,25 @@ +# PostgreSQL Configuration (Sample) +# Safe to publish – paths, ports, and credentials removed + +#connection settings + +listen_addresses = '*' +port=5432 +max_connections = 100 +password_encryption = md5 +#replication settings +wal_level = replica +wal_keep_size= '1GB' +max_wal_senders = 10 +archive_mode = on +archive_command = 'cp %p /var/lib/postgresql/18/archive_wal/%f' +#hot standby settings +hot_standby = on +#logging settings +log_destination = 'stderr' +logging_collector = on +log_directory = '/var/log/postgresql' +log_filename = 'postgresql-18-main.log' +log_min_messages = info +# Data Directory (example only, adjust in production) +# data_directory = '/var/lib/postgresql/18/main' diff --git a/src/sample/pcp.conf.sample b/src/sample/pcp.conf.sample index 7e3f6202f..3fecd0234 100644 --- a/src/sample/pcp.conf.sample +++ b/src/sample/pcp.conf.sample @@ -25,4 +25,4 @@ # Lines beginning with '#' (pound) are comments and will # be ignored. Again, no spaces or tabs allowed before '#'. -# USERID:MD5PASSWD +# USERID:MD5PASSWD[plain] diff --git a/src/sample/pg_hba.conf.sample b/src/sample/pg_hba.conf.sample new file mode 100644 index 000000000..ab05d1495 --- /dev/null +++ b/src/sample/pg_hba.conf.sample @@ -0,0 +1,13 @@ +#change scram-sha to md5 to connect with pgpool +# PostgreSQL Client Authentication Configuration File (Sample) +# Safe version for GitHub (remove IPs or private info) + +# TYPE DATABASE USER ADDRESS METHOD +local all all trust +host all all 127.0.0.1/32 md5 +host replication repl 127.0.0.1/32 md5 +host all all ::1/128 md5 +#add entry for host and replication with your user +#assuming user:postgres +host replication postgres 127.0.0.1/32 md5 +host all postgres 127.0.0.1/32 md5 diff --git a/src/sample/pool_hba.conf.sample b/src/sample/pool_hba.conf.sample index d0b9d1283..a05786413 100644 --- a/src/sample/pool_hba.conf.sample +++ b/src/sample/pool_hba.conf.sample @@ -70,3 +70,7 @@ local all all trust # IPv4 local connections: host all all 127.0.0.1/32 trust host all all ::1/128 trust +# TYPE DATABASE USER ADDRESS METHOD +host all all 0.0.0.0/0 md5 +host replication repl 0.0.0.0/0 md5 +host replication postgres 127.0.0.1/32 md5 diff --git a/src/sample/scripts/failover.sh.sample b/src/sample/scripts/failover.sh.sample index 01878d7e5..28db4bef2 100755 --- a/src/sample/scripts/failover.sh.sample +++ b/src/sample/scripts/failover.sh.sample @@ -1,84 +1,30 @@ #!/bin/bash -# This script is run by failover_command. - -set -o xtrace - -# Special values: -# 1) %d = failed node id -# 2) %h = failed node hostname -# 3) %p = failed node port number -# 4) %D = failed node database cluster path -# 5) %m = new main node id -# 6) %H = new main node hostname -# 7) %M = old main node id -# 8) %P = old primary node id -# 9) %r = new main port number -# 10) %R = new main database cluster path -# 11) %N = old primary node hostname -# 12) %S = old primary node port number -# 13) %% = '%' character +# failover.sh — Pgpool-II failover handling script +# Args: %d %h %p %D %m %H %M %P FAILED_NODE_ID="$1" -FAILED_NODE_HOST="$2" -FAILED_NODE_PORT="$3" -FAILED_NODE_PGDATA="$4" -NEW_MAIN_NODE_ID="$5" -NEW_MAIN_NODE_HOST="$6" -OLD_MAIN_NODE_ID="$7" -OLD_PRIMARY_NODE_ID="$8" -NEW_MAIN_NODE_PORT="$9" -NEW_MAIN_NODE_PGDATA="${10}" -OLD_PRIMARY_NODE_HOST="${11}" -OLD_PRIMARY_NODE_PORT="${12}" - -PGHOME=/usr/pgsql-17 -REPL_SLOT_NAME=$(echo ${FAILED_NODE_HOST,,} | tr -- -. _) -POSTGRESQL_STARTUP_USER=postgres -SSH_KEY_FILE=id_rsa_pgpool -SSH_OPTIONS="-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ~/.ssh/${SSH_KEY_FILE}" - - -echo failover.sh: start: failed_node_id=$FAILED_NODE_ID failed_host=$FAILED_NODE_HOST \ - old_primary_node_id=$OLD_PRIMARY_NODE_ID new_main_node_id=$NEW_MAIN_NODE_ID new_main_host=$NEW_MAIN_NODE_HOST - -## If there's no main node anymore, skip failover. -if [ $NEW_MAIN_NODE_ID -lt 0 ]; then - echo failover.sh: All nodes are down. Skipping failover. - exit 0 -fi - -## Test passwordless SSH -ssh -T ${SSH_OPTIONS} ${POSTGRESQL_STARTUP_USER}@${NEW_MAIN_NODE_HOST} ls /tmp > /dev/null - -if [ $? -ne 0 ]; then - echo failover.sh: passwordless SSH to ${POSTGRESQL_STARTUP_USER}@${NEW_MAIN_NODE_HOST} failed. Please setup passwordless SSH. - exit 1 -fi - -## If Standby node is down, skip failover. -if [ $OLD_PRIMARY_NODE_ID != "-1" -a $FAILED_NODE_ID != $OLD_PRIMARY_NODE_ID ]; then - - # If Standby node is down, drop replication slot. - ${PGHOME}/bin/psql -h ${OLD_PRIMARY_NODE_HOST} -p ${OLD_PRIMARY_NODE_PORT} postgres \ - -c "SELECT pg_drop_replication_slot('${REPL_SLOT_NAME}');" >/dev/null 2>&1 - - if [ $? -ne 0 ]; then - echo ERROR: failover.sh: drop replication slot \"${REPL_SLOT_NAME}\" failed. You may need to drop replication slot manually. - fi - - echo failover.sh: end: standby node is down. Skipping failover. - exit 0 -fi - -## Promote Standby node. -echo failover.sh: primary node is down, promote new_main_node_id=$NEW_MAIN_NODE_ID on ${NEW_MAIN_NODE_HOST}. - -ssh -T ${SSH_OPTIONS} ${POSTGRESQL_STARTUP_USER}@${NEW_MAIN_NODE_HOST} ${PGHOME}/bin/pg_ctl -D ${NEW_MAIN_NODE_PGDATA} -w promote - -if [ $? -ne 0 ]; then - echo ERROR: failover.sh: promote failed - exit 1 +FAILED_HOST="$2" +FAILED_PORT="$3" +FAILED_DATA="$4" +NEW_PRIMARY_NODE_ID="$5" +NEW_PRIMARY_HOST="$6" +NEW_PRIMARY_PORT="$7" + +#change this as per your requirement +PG_BIN="/usr/lib/postgresql/18/bin" +PGDATA="/var/lib/postgresql/18/main" +LOG="/var/log/pgpool/failover.log" + +mkdir -p /var/log/pgpool +echo "$(date '+%F %T') Failover triggered: failed_node=$FAILED_NODE_ID host=$FAILED_HOST:$FAILED_PORT new_primary=$NEW_PRIMARY_HOST:$NEW_PRIMARY_PORT" >> "$LOG" + +# 1️⃣ If the failed node was the old primary +if [ "$FAILED_NODE_ID" != "$NEW_PRIMARY_NODE_ID" ]; then + echo "$(date '+%F %T') Promoting new primary at $NEW_PRIMARY_HOST:$NEW_PRIMARY_PORT..." >> "$LOG" + sudo -u postgres "$PG_BIN/pg_ctl" promote -D "$PGDATA" >> "$LOG" 2>&1 + echo "$(date '+%F %T') Promotion completed for node_id=$NEW_PRIMARY_NODE_ID" >> "$LOG" +else + echo "$(date '+%F %T') Standby node $FAILED_HOST:$FAILED_PORT failed. No promotion required." >> "$LOG" fi -echo failover.sh: end: new_main_node_id=$NEW_MAIN_NODE_ID on ${NEW_MAIN_NODE_HOST} was successfully promoted to primary exit 0 diff --git a/src/sample/scripts/follow_primary.sh.sample b/src/sample/scripts/follow_primary.sh.sample index eb83b7123..32d4ef1ea 100755 --- a/src/sample/scripts/follow_primary.sh.sample +++ b/src/sample/scripts/follow_primary.sh.sample @@ -1,137 +1,36 @@ #!/bin/bash -# This script is run after failover_command to synchronize the Standby with the new Primary. - -set -o xtrace - -# Special values: -# 1) %d = node id -# 2) %h = hostname -# 3) %p = port number -# 4) %D = node database cluster path -# 5) %m = new primary node id -# 6) %H = new primary node hostname -# 7) %M = old main node id -# 8) %P = old primary node id -# 9) %r = new primary port number -# 10) %R = new primary database cluster path -# 11) %N = old primary node hostname -# 12) %S = old primary node port number -# 13) %% = '%' character - -NODE_ID="$1" -NODE_HOST="$2" -NODE_PORT="$3" -NODE_PGDATA="$4" -NEW_PRIMARY_NODE_ID="$5" -NEW_PRIMARY_NODE_HOST="$6" -OLD_MAIN_NODE_ID="$7" -OLD_PRIMARY_NODE_ID="$8" -NEW_PRIMARY_NODE_PORT="$9" -NEW_PRIMARY_NODE_PGDATA="${10}" - -PGHOME=/usr/pgsql-17 -REPLUSER=repl -PCP_USER=pgpool -PGPOOL_PATH=/usr/bin -PCP_PORT=9898 -REPL_SLOT_NAME=$(echo ${NODE_HOST,,} | tr -- -. _) -POSTGRESQL_STARTUP_USER=postgres -SSH_KEY_FILE=id_rsa_pgpool -SSH_OPTIONS="-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ~/.ssh/${SSH_KEY_FILE}" - -echo follow_primary.sh: start: Standby node ${NODE_ID} - -# Check the connection status of Standby -${PGHOME}/bin/pg_isready -h ${NODE_HOST} -p ${NODE_PORT} > /dev/null 2>&1 - -if [ $? -ne 0 ]; then - echo follow_primary.sh: node_id=${NODE_ID} is not running. skipping follow primary command - exit 0 -fi - -# Test passwordless SSH -ssh -T ${SSH_OPTIONS} ${POSTGRESQL_STARTUP_USER}@${NEW_PRIMARY_NODE_HOST} ls /tmp > /dev/null - -if [ $? -ne 0 ]; then - echo follow_main.sh: passwordless SSH to ${POSTGRESQL_STARTUP_USER}@${NEW_PRIMARY_NODE_HOST} failed. Please setup passwordless SSH. - exit 1 -fi - -# Get PostgreSQL major version -PGVERSION=`${PGHOME}/bin/psql -V | awk '{print $3}' | sed 's/\..*//' | sed 's/\([0-9]*\)[a-zA-Z].*/\1/'` - -if [ $PGVERSION -ge 12 ]; then - RECOVERYCONF=${NODE_PGDATA}/myrecovery.conf -else - RECOVERYCONF=${NODE_PGDATA}/recovery.conf -fi - -# Synchronize Standby with the new Primary. -echo follow_primary.sh: pg_rewind for node ${NODE_ID} - -# Run checkpoint command to update control file before running pg_rewind -${PGHOME}/bin/psql -h ${NEW_PRIMARY_NODE_HOST} -p ${NEW_PRIMARY_NODE_PORT} postgres -c "checkpoint;" - -# Create replication slot "${REPL_SLOT_NAME}" -${PGHOME}/bin/psql -h ${NEW_PRIMARY_NODE_HOST} -p ${NEW_PRIMARY_NODE_PORT} postgres \ - -c "SELECT pg_create_physical_replication_slot('${REPL_SLOT_NAME}');" >/dev/null 2>&1 - -if [ $? -ne 0 ]; then - echo follow_primary.sh: create replication slot \"${REPL_SLOT_NAME}\" failed. You may need to create replication slot manually. -fi - -ssh -T ${SSH_OPTIONS} ${POSTGRESQL_STARTUP_USER}@${NODE_HOST} " - - set -o errexit - - ${PGHOME}/bin/pg_ctl -w -m f -D ${NODE_PGDATA} stop - - ${PGHOME}/bin/pg_rewind -D ${NODE_PGDATA} --source-server=\"user=${POSTGRESQL_STARTUP_USER} host=${NEW_PRIMARY_NODE_HOST} port=${NEW_PRIMARY_NODE_PORT} dbname=postgres\" - - [ -d \"${NODE_PGDATA}\" ] && rm -rf ${NODE_PGDATA}/pg_replslot/* - - cat > ${RECOVERYCONF} << EOT -primary_conninfo = 'host=${NEW_PRIMARY_NODE_HOST} port=${NEW_PRIMARY_NODE_PORT} user=${REPLUSER} application_name=${NODE_HOST} passfile=''/var/lib/pgsql/.pgpass''' -recovery_target_timeline = 'latest' -primary_slot_name = '${REPL_SLOT_NAME}' -EOT - - if [ ${PGVERSION} -ge 12 ]; then - sed -i -e \"\\\$ainclude_if_exists = '$(echo ${RECOVERYCONF} | sed -e 's/\//\\\//g')'\" \ - -e \"/^include_if_exists = '$(echo ${RECOVERYCONF} | sed -e 's/\//\\\//g')'/d\" ${NODE_PGDATA}/postgresql.conf - touch ${NODE_PGDATA}/standby.signal - else - echo \"standby_mode = 'on'\" >> ${RECOVERYCONF} - fi - - ${PGHOME}/bin/pg_ctl -l /dev/null -w -D ${NODE_PGDATA} start - -" - -# If start Standby successfully, attach this node -if [ $? -eq 0 ]; then - - # Run pcp_attact_node to attach Standby node to Pgpool-II. - ${PGPOOL_PATH}/pcp_attach_node -w -h localhost -U $PCP_USER -p ${PCP_PORT} -n ${NODE_ID} - - if [ $? -ne 0 ]; then - echo ERROR: follow_primary.sh: end: pcp_attach_node failed - exit 1 - fi - +# follow_primary.sh +# Called by Pgpool-II after a new primary is elected. +# Args: %d (failed node id) +# %h (failed host) +# %p (failed port) +# %D (failed data directory) +# %m (new primary node id) +# %H (new primary host) +# %M (new primary port) + +FAILED_ID="$1" +FAILED_HOST="$2" +FAILED_PORT="$3" +FAILED_DATA="$4" +NEW_MAIN_ID="$5" +NEW_MAIN_HOST="$6" +NEW_MAIN_PORT="$7" + +LOG="/var/log/pgpool/follow_primary.log" +mkdir -p /var/log/pgpool + +echo "$(date '+%F %T') follow_primary: failed=$FAILED_HOST:$FAILED_PORT new_primary=$NEW_MAIN_HOST:$NEW_MAIN_PORT" >> "$LOG" + +# Path to your resync script (this handles recloning/resyncing) +RESYNC_SCRIPT="/etc/pgpool2/resync.sh" + +if [ -x "$RESYNC_SCRIPT" ]; then + echo "$(date '+%F %T') Starting resync of old primary ($FAILED_HOST:$FAILED_PORT) from new primary ($NEW_MAIN_HOST:$NEW_MAIN_PORT)..." >> "$LOG" + "$RESYNC_SCRIPT" "$FAILED_HOST" "$NEW_MAIN_HOST" "$FAILED_PORT" "$NEW_MAIN_PORT" "$FAILED_DATA" >> "$LOG" 2>&1 & + echo "$(date '+%F %T') Resync script triggered successfully." >> "$LOG" else - - # If start Standby failed, drop replication slot "${REPL_SLOT_NAME}" - ${PGHOME}/bin/psql -h ${NEW_PRIMARY_NODE_HOST} -p ${NEW_PRIMARY_NODE_PORT} postgres \ - -c "SELECT pg_drop_replication_slot('${REPL_SLOT_NAME}');" >/dev/null 2>&1 - - if [ $? -ne 0 ]; then - echo ERROR: follow_primary.sh: drop replication slot \"${REPL_SLOT_NAME}\" failed. You may need to drop replication slot manually. - fi - - echo ERROR: follow_primary.sh: end: follow primary command failed - exit 1 + echo "$(date '+%F %T') ERROR: Resync script not found or not executable at $RESYNC_SCRIPT" >> "$LOG" fi -echo follow_primary.sh: end: follow primary command is completed successfully exit 0 diff --git a/src/sample/scripts/recovery.sh.sample b/src/sample/scripts/recovery.sh.sample new file mode 100755 index 000000000..18b2acdb9 --- /dev/null +++ b/src/sample/scripts/recovery.sh.sample @@ -0,0 +1,87 @@ +#!/bin/bash +#===================================================================== +# recovery.sh +# Called automatically by Pgpool-II when a down node becomes available +# or when online recovery is triggered. +# This script identifies the current primary via PCP and resyncs +# the target standby (old node) using resync.sh. +#===================================================================== + +set -euo pipefail + +#change this as per ur required path +NODE_ID="${1:-}" +LOG=/var/log/pgpool/recovery.log +mkdir -p /var/log/pgpool +exec >>"$LOG" 2>&1 + +echo "=====================================================================" +echo "$(date '+%F %T') [INFO] recovery.sh called with NODE_ID=$NODE_ID" + +# --- Mapping Node IDs to ports and data directories ----------------- +case "$NODE_ID" in + 0) + TARGET_PGDATA="/var/lib/postgresql/18/main" + TARGET_PORT=5432 + TARGET_HOST="localhost" + ;; + 1) + TARGET_PGDATA="/var/lib/postgresql/18/standby1" + TARGET_PORT=5433 + TARGET_HOST="localhost" + ;; + 2) + TARGET_PGDATA="/var/lib/postgresql/18/standby2" + TARGET_PORT=5434 + TARGET_HOST="localhost" + ;; + *) + echo "$(date '+%F %T') [ERROR] Unknown node id: $NODE_ID" + exit 1 + ;; +esac + +echo "$(date '+%F %T') [INFO] Target node: $TARGET_HOST:$TARGET_PORT ($TARGET_PGDATA)" + +# --- PCP settings --------------------------------------------------- +PCP_PORT=9898 +PCP_USER=postgres + +# --- Detect current primary using pcp_node_info --------------------- +echo "$(date '+%F %T') [INFO] Detecting current primary node..." +#PRIMARY_INFO=$(sudo -u postgres pcp_node_info -h localhost -U "$PCP_USER" -p "$PCP_PORT" 2>/dev/null | grep "primary primary" | head -n 1) +# Detect current primary node +PRIMARY_INFO=$(sudo -u postgres pcp_node_info -h localhost -U "$PCP_USER" -p "$PCP_PORT" 2>/dev/null \ + | awk '$7=="primary" {print $1 ":" $2; exit}') + +if [ -z "$PRIMARY_INFO" ]; then + echo "$(date '+%F %T') [ERROR] Could not determine primary via pcp_node_info" + exit 1 +fi + +# --- Split host:port correctly --- +PRIMARY_HOST=${PRIMARY_INFO%%:*} # everything before colon +PRIMARY_PORT=${PRIMARY_INFO##*:} # everything after colon + +echo "$(date '+%F %T') [INFO] Current primary detected: $PRIMARY_HOST:$PRIMARY_PORT" + +# --- Safety check: do not resync the current primary --- +if [ "$PRIMARY_PORT" = "$TARGET_PORT" ]; then + echo "$(date '+%F %T') [WARN] Target node is already primary; skipping resync." + exit 0 +fi + +# --- Perform the resync --- +echo "$(date '+%F %T') [INFO] Calling resync.sh for node $NODE_ID..." +/etc/pgpool2/resync.sh "$TARGET_HOST" "$PRIMARY_HOST" "$TARGET_PORT" "$PRIMARY_PORT" "$TARGET_PGDATA" >> /var/log/pgpool/resync.log 2>&1 + +RC=$? +if [ $RC -eq 0 ]; then + echo "$(date '+%F %T') [INFO] Recovery completed successfully for node $NODE_ID." +else + echo "$(date '+%F %T') [ERROR] Recovery failed for node $NODE_ID (exit=$RC). Check resync.log." +fi + + +echo "=====================================================================" +exit $RC diff --git a/src/sample/scripts/resync.sh b/src/sample/scripts/resync.sh new file mode 100755 index 000000000..8bd692a3a --- /dev/null +++ b/src/sample/scripts/resync.sh @@ -0,0 +1,155 @@ +#!/bin/bash +#this is for 3 node setup assuming 5432 as primary and 5433,5434 are standbys +# Improved resync.sh +# Args: $1 = old_host (node to resync), $2 = new_primary_host, $3 = old_port, $4 = new_port, $5 = old_data_dir +set -euo pipefail + +OLD_HOST="${1:-}" +NEW_HOST="${2:-}" +OLD_PORT="${3:-}" +NEW_PORT="${4:-}" +OLD_PGDATA="${5:-}" + +# Adjust these if your binaries or primary data path differ +PG_BIN="/usr/lib/postgresql/18/bin" +PRIMARY_PGDATA="/var/lib/postgresql/18/main" + +PCP_PORT=9898 +PCP_USER=postgres +LOG=/var/log/pgpool/resync.log + +mkdir -p /var/log/pgpool +# Basic argument dump for debugging +echo "$(date '+%F %T') ==== resync.sh start ====" >> "$LOG" +echo "$(date '+%F %T') args: OLD_HOST=$OLD_HOST NEW_HOST=$NEW_HOST OLD_PORT=$OLD_PORT NEW_PORT=$NEW_PORT OLD_PGDATA=$OLD_PGDATA" >> "$LOG" +echo "$(date '+%F %T') environment: PG_BIN=$PG_BIN PCP_PORT=$PCP_PORT PCP_USER=$PCP_USER" >> "$LOG" + +# Basic validation +if [ -z "$OLD_HOST" ] || [ -z "$NEW_HOST" ] || [ -z "$OLD_PORT" ] || [ -z "$NEW_PORT" ] || [ -z "$OLD_PGDATA" ]; then + echo "$(date '+%F %T') ERROR: missing required arguments" >> "$LOG" + exit 1 +fi + +# Stop target Postgres (ignore if already stopped) +echo "$(date '+%F %T') stopping target postgres (if running): $OLD_PGDATA" >> "$LOG" +sudo -u postgres "$PG_BIN/pg_ctl" -D "$OLD_PGDATA" stop -m immediate >> "$LOG" 2>&1 || true +sleep 2 + +# Try to auto-determine NODE_ID_FOR_PCP from pcp_node_info (safer than static mapping) +NODE_ID_FOR_PCP="" +PCP_OUT=$(sudo -u postgres pcp_node_info -h localhost -p "$PCP_PORT" -U "$PCP_USER" 2>/dev/null || true) +if [ -n "$PCP_OUT" ]; then + # iterate lines and find matching host:port; pcp_node_info lines: "<host> <port> ..." + idx=0 + while read -r line; do + hs=$(echo "$line" | awk '{print $1}') + pt=$(echo "$line" | awk '{print $2}') + if [ "$hs" = "$OLD_HOST" ] && [ "$pt" = "$OLD_PORT" ]; then + NODE_ID_FOR_PCP="$idx" + break + fi + idx=$((idx+1)) + done <<< "$PCP_OUT" +fi + +# Fallback static mapping if we couldn't determine node id +if [ -z "$NODE_ID_FOR_PCP" ]; then + echo "$(date '+%F %T') WARN: could not detect node id via pcp_node_info; using port->id fallback" >> "$LOG" + case "$OLD_PORT" in + 5432) NODE_ID_FOR_PCP=0 ;; + 5433) NODE_ID_FOR_PCP=1 ;; + 5434) NODE_ID_FOR_PCP=2 ;; + *) echo "$(date '+%F %T') ERROR: unknown OLD_PORT=$OLD_PORT" >> "$LOG"; exit 1 ;; + esac +fi +echo "$(date '+%F %T') NODE_ID_FOR_PCP=$NODE_ID_FOR_PCP" >> "$LOG" + +# Ensure NEW_HOST/NEW_PORT present +if [ -z "$NEW_HOST" ] || [ -z "$NEW_PORT" ]; then + echo "$(date '+%F %T') ERROR: NEW_HOST or NEW_PORT empty" >> "$LOG" + exit 1 +fi + +# Print pg_controldata system ids (best-effort) +SYS_SRC=$(sudo -u postgres "$PG_BIN/pg_controldata" "$OLD_PGDATA" 2>/dev/null | awk -F: '/Database system identifier/ {print $2}' | tr -d ' ' || true) +SYS_TGT=$(sudo -u postgres "$PG_BIN/pg_controldata" "$PRIMARY_PGDATA" 2>/dev/null | awk -F: '/Database system identifier/ {print $2}' | tr -d ' ' || true) +echo "$(date '+%F %T') systemid old=$SYS_SRC current_primary=$SYS_TGT" >> "$LOG" + +# Try pg_rewind +echo "$(date '+%F %T') Attempting pg_rewind: target=$OLD_PGDATA <- source=$NEW_HOST:$NEW_PORT" >> "$LOG" +if sudo -u postgres "$PG_BIN/pg_rewind" --target-pgdata="$OLD_PGDATA" --source-server="host=$NEW_HOST port=$NEW_PORT user=postgres dbname=postgres" >> "$LOG" 2>&1; then + echo "$(date '+%F %T') pg_rewind succeeded" >> "$LOG" +else + echo "$(date '+%F %T') pg_rewind failed; falling back to pg_basebackup" >> "$LOG" + # Remove contents but keep directory itself + LOG="/var/log/pgpool/resync.log" + sudo rm -rf "$OLD_PGDATA" >> "$LOG" 2>&1 + sudo mkdir -p "$OLD_PGDATA" >> "$LOG" 2>&1 + sudo chown -R postgres:postgres "$OLD_PGDATA" >> "$LOG" 2>&1 + sudo chmod 700 "$OLD_PGDATA" >> "$LOG" 2>&1 + + + # Run pg_basebackup (requires repl user and .pgpass) + echo "$(date '+%F %T') running pg_basebackup from $NEW_HOST:$NEW_PORT to $OLD_PGDATA" >> "$LOG" + if ! sudo -u postgres "$PG_BIN/pg_basebackup" -h "$NEW_HOST" -p "$NEW_PORT" -U repl -D "$OLD_PGDATA" -Fp -Xs -P -R >> "$LOG" 2>&1; then + echo "$(date '+%F %T') pg_basebackup failed; aborting resync" >> "$LOG" + exit 1 + fi + echo "$(date '+%F %T') pg_basebackup succeeded" >> "$LOG" +fi + +# Ensure primary_conninfo exists in postgresql.auto.conf (safe append) +sudo -u postgres bash -c " + conf=\"$OLD_PGDATA/postgresql.auto.conf\" + if ! grep -q primary_conninfo \"\$conf\" 2>/dev/null; then + echo \"primary_conninfo = 'host=$NEW_HOST port=$NEW_PORT user=repl passfile=/var/lib/postgresql/.pgpass'\" >> \"\$conf\" + fi +" + +# Create standby.signal for modern Postgres +sudo -u postgres touch "$OLD_PGDATA/standby.signal" + +# Fix ownership/permissions +sudo chown -R postgres:postgres "$OLD_PGDATA" +sudo chmod 700 "$OLD_PGDATA" || true + +case "$OLD_PORT" in + 5432) CONF="/etc/postgresql/18/main/postgresql.conf" ;; + 5433) CONF="/etc/postgresql/18/standby1/postgresql.conf" ;; + 5434) CONF="/etc/postgresql/18/standby2/postgresql.conf" ;; + *) CONF="/etc/postgresql/18/main/postgresql.conf" ;; +esac + + +echo "$(date '+%F %T') starting postgres at $OLD_PGDATA using config $CONF" +sudo -u postgres "$PG_BIN/pg_ctl" -D "$OLD_PGDATA" -o "-c config_file=$CONF" -w start >>"$LOG" 2>&1 || echo "$(date '+%F %T') pg_ctl start returned non-zero (check postgres logs)" + + +# Wait for socket with pg_isready (up to ~30s) +attempt=0 +until sudo -u postgres "$PG_BIN/pg_isready" -q -p "$OLD_PORT" || [ $attempt -ge 20 ]; do + attempt=$((attempt+1)) + sleep 2 +done +if [ $attempt -ge 15 ]; then + echo "$(date '+%F %T') WARNING: server at $OLD_PGDATA didn't become ready within timeout" >> "$LOG" +fi + +# Attach node to Pgpool via PCP with retries +RETRIES=6 +i=0 +while [ $i -lt $RETRIES ]; do + if sudo -u postgres pcp_attach_node -h localhost -p "$PCP_PORT" -U "$PCP_USER" -n "$NODE_ID_FOR_PCP" >> "$LOG" 2>&1; then + echo "$(date '+%F %T') pcp_attach_node succeeded for node $NODE_ID_FOR_PCP" >> "$LOG" + break + fi + echo "$(date '+%F %T') pcp_attach_node attempt $((i+1)) failed; sleeping and retrying..." >> "$LOG" + i=$((i+1)) + sleep 3 +done +if [ $i -ge $RETRIES ]; then + echo "$(date '+%F %T') ERROR: pcp_attach_node failed after $RETRIES attempts" >> "$LOG" +fi + +echo "$(date '+%F %T') ==== resync complete ====" >> "$LOG" +exit 0 -- 2.43.0 ^ permalink raw reply [nested|flat] 3+ messages in thread
end of thread, other threads:[~2025-11-07 11:25 UTC | newest] Thread overview: 3+ messages (download: mbox.gz follow: Atom feed) -- links below jump to the message on this page -- 2025-11-04 05:23 Autofail back inconsistent VASUKI M <[email protected]> 2025-11-04 11:40 ` Tatsuo Ishii <[email protected]> 2025-11-07 11:25 ` VASUKI M <[email protected]>
This inbox is served by agora; see mirroring instructions for how to clone and mirror all data and code used for this inbox