Autofail back inconsistent

public inbox for [email protected]  
help / color / mirror / Atom feed

Autofail back inconsistent
3+ messages / 2 participants
[nested] [flat]

* Autofail back inconsistent
@ 2025-11-04 05:23 VASUKI M <[email protected]>
  2025-11-04 11:40 ` Re: Autofail back inconsistent Tatsuo Ishii <[email protected]>
  0 siblings, 1 reply; 3+ messages in thread

From: VASUKI M @ 2025-11-04 05:23 UTC (permalink / raw)
  To: [email protected]; Bo Peng <[email protected]>; +Cc: [email protected]; [email protected]

Hi Pgpool-II Community,

I’ve been working with PostgreSQL for quite some time, and while managing
backend servers with Pgpool-II, I noticed that there’s no built-in data
synchronization before standby nodes are reattached.

To address this, I’ve developed a new enhancement for Pgpool-II called
*“Automatic
Resync and Reattach”*, aimed at improving high availability and reducing
manual intervention after node failures.

This feature automatically performs:

   -

   *pg_rewind synchronization* and *WAL replay* for failed or lagging
   standby nodes
   -

   *Automatic reattachment* of standby nodes once they are safely resynced

This enhancement overcomes some limitations of existing features:

   -

   *Online Recovery* – requires manual execution of pcp_recovery_node
   -

   *Auto Failback* – may reattach inconsistent nodes without synchronization

With *Auto Resync & Reattach*, Pgpool-II ensures data safety and
consistency during failover recovery, providing a more robust and
self-healing HA environment.

I’ve created a dedicated GitHub repository with setup instructions,
configuration samples, and architecture diagrams here:
 https://github.com/BharatDBPG/pgpool2-auto-resync

The repository includes:

   -

   Enhanced failover.sh and follow_primary.sh scripts (handling both
   primary and standby failures)
   -

   A new resync.sh script integrating pg_rewind + WAL replay
   -

   Sample configuration files for a *1 Primary + 2 Standby* setup

I would love to receive feedback or suggestions from the community
regarding possible improvements or integration ideas.

I will be happy to share a patch if needed.
Thank you for your time and support.

Best regards,

*Vasuki MCDAC,CHENNAI*

https://github.com/BharatDBPG/pgpool2-auto-resync

^ permalink  raw  reply  [nested|flat] 3+ messages in thread

* Re: Autofail back inconsistent
  2025-11-04 05:23 Autofail back inconsistent VASUKI M <[email protected]>
@ 2025-11-04 11:40 ` Tatsuo Ishii <[email protected]>
  2025-11-07 11:25   ` Re: Autofail back inconsistent VASUKI M <[email protected]>
  0 siblings, 1 reply; 3+ messages in thread

From: Tatsuo Ishii @ 2025-11-04 11:40 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]; [email protected]; [email protected]

> Hi Pgpool-II Community,

Thank you for the proposal.

> I’ve been working with PostgreSQL for quite some time, and while managing
> backend servers with Pgpool-II, I noticed that there’s no built-in data
> synchronization before standby nodes are reattached.
> 
> To address this, I’ve developed a new enhancement for Pgpool-II called
> *“Automatic
> Resync and Reattach”*, aimed at improving high availability and reducing
> manual intervention after node failures.
> 
> This feature automatically performs:
> 
>    -
> 
>    *pg_rewind synchronization* and *WAL replay* for failed or lagging
>    standby nodes
>    -
> 
>    *Automatic reattachment* of standby nodes once they are safely resynced
> 
> This enhancement overcomes some limitations of existing features:
> 
>    -
> 
>    *Online Recovery* – requires manual execution of pcp_recovery_node
>    -
> 
>    *Auto Failback* – may reattach inconsistent nodes without synchronization
> 
> With *Auto Resync & Reattach*, Pgpool-II ensures data safety and
> consistency during failover recovery, providing a more robust and
> self-healing HA environment.
> 
> I’ve created a dedicated GitHub repository with setup instructions,
> configuration samples, and architecture diagrams here:
>  https://github.com/BharatDBPG/pgpool2-auto-resync
> 
> The repository includes:
> 
>    -
> 
>    Enhanced failover.sh and follow_primary.sh scripts (handling both
>    primary and standby failures)
>    -
> 
>    A new resync.sh script integrating pg_rewind + WAL replay
>    -
> 
>    Sample configuration files for a *1 Primary + 2 Standby* setup
> 
> I would love to receive feedback or suggestions from the community
> regarding possible improvements or integration ideas.
> 
> I will be happy to share a patch if needed.
> Thank you for your time and support.
> 
> Best regards,
> 
> *Vasuki MCDAC,CHENNAI*
> 
> https://github.com/BharatDBPG/pgpool2-auto-resync

Yes, it would be nice to share a patch. As we follow the PostgreSQL
development model (not using GitHub), we are familiar with discussing
on a patch base.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp


^ permalink  raw  reply  [nested|flat] 3+ messages in thread

* Re: Autofail back inconsistent
  2025-11-04 05:23 Autofail back inconsistent VASUKI M <[email protected]>
  2025-11-04 11:40 ` Re: Autofail back inconsistent Tatsuo Ishii <[email protected]>
@ 2025-11-07 11:25   ` VASUKI M <[email protected]>
  0 siblings, 0 replies; 3+ messages in thread

From: VASUKI M @ 2025-11-07 11:25 UTC (permalink / raw)
  To: Tatsuo Ishii <[email protected]>; Bo Peng <[email protected]>; +Cc: [email protected]; [email protected]

Thank you for your reply Tatsuo.

Overview:
This enhancement ensures data consistency during standby node reattachment
by automatically performing pg_rewind synchronization and WAL replay before
allowing a node to rejoin the cluster. This automation reduces manual
intervention, improves fault recovery speed, and enhances high availability.

Key Changes
- Enhanced `failover.sh` and `follow_primary.sh` scripts to trigger
automatic synchronization.
- Added a new `resync.sh` script for safe data synchronization and reattach.
- Included sample configuration files for a 3-node cluster (1 primary + 2
standbys).
- Improved recovery automation and minimized risk of inconsistent data
after failover.

 Testing
The feature has been successfully tested on a **3-node local cluster
setup**:
- 1 Primary PostgreSQL node
- 2 Standby nodes managed through Pgpool-II
- Verified automatic resync and reattach after simulated failover events.

Patch Details
The patch is based on the latest Pgpool-II master branch (commit
`76a06e5af`).

Please find the attached patch file:
**0001-Add-Automatic-Resync-and-Reattach-feature-pg_rewind-.patch**

I would appreciate it if the community could review the changes and provide
feedback or suggestions for improvement.

Thank you for your time and support.

Best regards,
**Vasuki M**
CDAC Chennai
Email: [email protected]

On Tue, 4 Nov 2025 at 17:10, Tatsuo Ishii <[email protected]> wrote:

> > Hi Pgpool-II Community,
>
> Thank you for the proposal.
>
> > I’ve been working with PostgreSQL for quite some time, and while managing
> > backend servers with Pgpool-II, I noticed that there’s no built-in data
> > synchronization before standby nodes are reattached.
> >
> > To address this, I’ve developed a new enhancement for Pgpool-II called
> > *“Automatic
> > Resync and Reattach”*, aimed at improving high availability and reducing
> > manual intervention after node failures.
> >
> > This feature automatically performs:
> >
> >    -
> >
> >    *pg_rewind synchronization* and *WAL replay* for failed or lagging
> >    standby nodes
> >    -
> >
> >    *Automatic reattachment* of standby nodes once they are safely
> resynced
> >
> > This enhancement overcomes some limitations of existing features:
> >
> >    -
> >
> >    *Online Recovery* – requires manual execution of pcp_recovery_node
> >    -
> >
> >    *Auto Failback* – may reattach inconsistent nodes without
> synchronization
> >
> > With *Auto Resync & Reattach*, Pgpool-II ensures data safety and
> > consistency during failover recovery, providing a more robust and
> > self-healing HA environment.
> >
> > I’ve created a dedicated GitHub repository with setup instructions,
> > configuration samples, and architecture diagrams here:
> >  https://github.com/BharatDBPG/pgpool2-auto-resync
> >
> > The repository includes:
> >
> >    -
> >
> >    Enhanced failover.sh and follow_primary.sh scripts (handling both
> >    primary and standby failures)
> >    -
> >
> >    A new resync.sh script integrating pg_rewind + WAL replay
> >    -
> >
> >    Sample configuration files for a *1 Primary + 2 Standby* setup
> >
> > I would love to receive feedback or suggestions from the community
> > regarding possible improvements or integration ideas.
> >
> > I will be happy to share a patch if needed.
> > Thank you for your time and support.
> >
> > Best regards,
> >
> > *Vasuki MCDAC,CHENNAI*
> >
> > https://github.com/BharatDBPG/pgpool2-auto-resync
>
> Yes, it would be nice to share a patch. As we follow the PostgreSQL
> development model (not using GitHub), we are familiar with discussing
> on a patch base.
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>


Attachments:

  [text/x-patch] 0001-Add-Automatic-Resync-and-Reattach-feature-pg_rewind-.patch (24.3K, 3-0001-Add-Automatic-Resync-and-Reattach-feature-pg_rewind-.patch)
  download | inline diff:
From 1a08939048d0a397bb101c3b61708b16f2dab00a Mon Sep 17 00:00:00 2001
From: BharatDBPG <[email protected]>
Date: Fri, 7 Nov 2025 16:42:06 +0530
Subject: [PATCH] Add Automatic Resync and Reattach feature (pg_rewind + WAL
 replay)

This patch introduces a new enhancement to Pgpool-II that ensures data
consistency during node reattachment by performing pg_rewind synchronization
and WAL replay before reattaching a standby node.

Changes include:
- Enhanced failover.sh and follow_primary.sh scripts
- New resync.sh script for automatic synchronization and safe reattach
- New sample configuration files for 1 primary and 2 standby setup

This enhancement improves high availability by combining data safety
with automation, reducing manual recovery and reattach operations.

Tested successfully on a 3-node (1 primary + 2 standby) local cluster.

Signed-off-by: Vasuki M <[email protected]>
Signed-off-by: BharatDBPG <[email protected]>
---
 src/sample/Postgresql_main.conf.sample      |  25 +++
 src/sample/pcp.conf.sample                  |   2 +-
 src/sample/pg_hba.conf.sample               |  13 ++
 src/sample/pool_hba.conf.sample             |   4 +
 src/sample/scripts/failover.sh.sample       | 102 +++---------
 src/sample/scripts/follow_primary.sh.sample | 163 ++++----------------
 src/sample/scripts/recovery.sh.sample       |  87 +++++++++++
 src/sample/scripts/resync.sh                | 155 +++++++++++++++++++
 8 files changed, 340 insertions(+), 211 deletions(-)
 create mode 100644 src/sample/Postgresql_main.conf.sample
 create mode 100644 src/sample/pg_hba.conf.sample
 create mode 100755 src/sample/scripts/recovery.sh.sample
 create mode 100755 src/sample/scripts/resync.sh

diff --git a/src/sample/Postgresql_main.conf.sample b/src/sample/Postgresql_main.conf.sample
new file mode 100644
index 000000000..6e28b87ba
--- /dev/null
+++ b/src/sample/Postgresql_main.conf.sample
@@ -0,0 +1,25 @@
+# PostgreSQL Configuration (Sample)
+# Safe to publish – paths, ports, and credentials removed
+
+#connection settings
+
+listen_addresses = '*'
+port=5432
+max_connections = 100
+password_encryption = md5
+#replication settings
+wal_level = replica
+wal_keep_size= '1GB'
+max_wal_senders = 10
+archive_mode = on
+archive_command = 'cp %p /var/lib/postgresql/18/archive_wal/%f'
+#hot standby settings
+hot_standby = on
+#logging settings
+log_destination = 'stderr'
+logging_collector = on
+log_directory = '/var/log/postgresql'
+log_filename = 'postgresql-18-main.log'
+log_min_messages = info
+# Data Directory (example only, adjust in production)
+# data_directory = '/var/lib/postgresql/18/main'
diff --git a/src/sample/pcp.conf.sample b/src/sample/pcp.conf.sample
index 7e3f6202f..3fecd0234 100644
--- a/src/sample/pcp.conf.sample
+++ b/src/sample/pcp.conf.sample
@@ -25,4 +25,4 @@
 # Lines beginning with '#' (pound) are comments and will
 # be ignored. Again, no spaces or tabs allowed before '#'.
 
-# USERID:MD5PASSWD
+# USERID:MD5PASSWD[plain]
diff --git a/src/sample/pg_hba.conf.sample b/src/sample/pg_hba.conf.sample
new file mode 100644
index 000000000..ab05d1495
--- /dev/null
+++ b/src/sample/pg_hba.conf.sample
@@ -0,0 +1,13 @@
+#change scram-sha to md5 to connect with  pgpool
+# PostgreSQL Client Authentication Configuration File (Sample)
+# Safe version for GitHub (remove IPs or private info)
+
+# TYPE  DATABASE        USER            ADDRESS                 METHOD
+local   all             all                                     trust
+host    all             all             127.0.0.1/32            md5
+host    replication     repl            127.0.0.1/32            md5
+host    all             all             ::1/128                 md5
+#add entry for host and replication with your user
+#assuming user:postgres
+host    replication     postgres        127.0.0.1/32            md5
+host    all             postgres        127.0.0.1/32            md5
diff --git a/src/sample/pool_hba.conf.sample b/src/sample/pool_hba.conf.sample
index d0b9d1283..a05786413 100644
--- a/src/sample/pool_hba.conf.sample
+++ b/src/sample/pool_hba.conf.sample
@@ -70,3 +70,7 @@ local   all         all                               trust
 # IPv4 local connections:
 host    all         all         127.0.0.1/32          trust
 host    all         all         ::1/128               trust
+# TYPE  DATABASE        USER            ADDRESS                 METHOD
+host    all             all             0.0.0.0/0               md5
+host    replication     repl            0.0.0.0/0               md5
+host	replication	postgres	127.0.0.1/32		md5
diff --git a/src/sample/scripts/failover.sh.sample b/src/sample/scripts/failover.sh.sample
index 01878d7e5..28db4bef2 100755
--- a/src/sample/scripts/failover.sh.sample
+++ b/src/sample/scripts/failover.sh.sample
@@ -1,84 +1,30 @@
 #!/bin/bash
-# This script is run by failover_command.
-
-set -o xtrace
-
-# Special values:
-# 1)  %d = failed node id
-# 2)  %h = failed node hostname
-# 3)  %p = failed node port number
-# 4)  %D = failed node database cluster path
-# 5)  %m = new main node id
-# 6)  %H = new main node hostname
-# 7)  %M = old main node id
-# 8)  %P = old primary node id
-# 9)  %r = new main port number
-# 10) %R = new main database cluster path
-# 11) %N = old primary node hostname
-# 12) %S = old primary node port number
-# 13) %% = '%' character
+# failover.sh — Pgpool-II failover handling script
+# Args: %d %h %p %D %m %H %M %P
 
 FAILED_NODE_ID="$1"
-FAILED_NODE_HOST="$2"
-FAILED_NODE_PORT="$3"
-FAILED_NODE_PGDATA="$4"
-NEW_MAIN_NODE_ID="$5"
-NEW_MAIN_NODE_HOST="$6"
-OLD_MAIN_NODE_ID="$7"
-OLD_PRIMARY_NODE_ID="$8"
-NEW_MAIN_NODE_PORT="$9"
-NEW_MAIN_NODE_PGDATA="${10}"
-OLD_PRIMARY_NODE_HOST="${11}"
-OLD_PRIMARY_NODE_PORT="${12}"
-
-PGHOME=/usr/pgsql-17
-REPL_SLOT_NAME=$(echo ${FAILED_NODE_HOST,,} | tr -- -. _)
-POSTGRESQL_STARTUP_USER=postgres
-SSH_KEY_FILE=id_rsa_pgpool
-SSH_OPTIONS="-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ~/.ssh/${SSH_KEY_FILE}"
-
-
-echo failover.sh: start: failed_node_id=$FAILED_NODE_ID failed_host=$FAILED_NODE_HOST \
-    old_primary_node_id=$OLD_PRIMARY_NODE_ID new_main_node_id=$NEW_MAIN_NODE_ID new_main_host=$NEW_MAIN_NODE_HOST
-
-## If there's no main node anymore, skip failover.
-if [ $NEW_MAIN_NODE_ID -lt 0 ]; then
-    echo failover.sh: All nodes are down. Skipping failover.
-    exit 0
-fi
-
-## Test passwordless SSH
-ssh -T ${SSH_OPTIONS} ${POSTGRESQL_STARTUP_USER}@${NEW_MAIN_NODE_HOST} ls /tmp > /dev/null
-
-if [ $? -ne 0 ]; then
-    echo failover.sh: passwordless SSH to ${POSTGRESQL_STARTUP_USER}@${NEW_MAIN_NODE_HOST} failed. Please setup passwordless SSH.
-    exit 1
-fi
-
-## If Standby node is down, skip failover.
-if [ $OLD_PRIMARY_NODE_ID != "-1" -a $FAILED_NODE_ID != $OLD_PRIMARY_NODE_ID ]; then
-
-    # If Standby node is down, drop replication slot.
-    ${PGHOME}/bin/psql -h ${OLD_PRIMARY_NODE_HOST} -p ${OLD_PRIMARY_NODE_PORT} postgres \
-        -c "SELECT pg_drop_replication_slot('${REPL_SLOT_NAME}');"  >/dev/null 2>&1
-
-    if [ $? -ne 0 ]; then
-        echo ERROR: failover.sh: drop replication slot \"${REPL_SLOT_NAME}\" failed. You may need to drop replication slot manually.
-    fi
-
-    echo failover.sh: end: standby node is down. Skipping failover.
-    exit 0
-fi
-
-## Promote Standby node.
-echo failover.sh: primary node is down, promote new_main_node_id=$NEW_MAIN_NODE_ID on ${NEW_MAIN_NODE_HOST}.
-
-ssh -T ${SSH_OPTIONS} ${POSTGRESQL_STARTUP_USER}@${NEW_MAIN_NODE_HOST} ${PGHOME}/bin/pg_ctl -D ${NEW_MAIN_NODE_PGDATA} -w promote
-
-if [ $? -ne 0 ]; then
-    echo ERROR: failover.sh: promote failed
-    exit 1
+FAILED_HOST="$2"
+FAILED_PORT="$3"
+FAILED_DATA="$4"
+NEW_PRIMARY_NODE_ID="$5"
+NEW_PRIMARY_HOST="$6"
+NEW_PRIMARY_PORT="$7"
+
+#change this as per your requirement
+PG_BIN="/usr/lib/postgresql/18/bin"
+PGDATA="/var/lib/postgresql/18/main"
+LOG="/var/log/pgpool/failover.log"
+
+mkdir -p /var/log/pgpool
+echo "$(date '+%F %T') Failover triggered: failed_node=$FAILED_NODE_ID host=$FAILED_HOST:$FAILED_PORT new_primary=$NEW_PRIMARY_HOST:$NEW_PRIMARY_PORT" >> "$LOG"
+
+# 1️⃣ If the failed node was the old primary
+if [ "$FAILED_NODE_ID" != "$NEW_PRIMARY_NODE_ID" ]; then
+    echo "$(date '+%F %T') Promoting new primary at $NEW_PRIMARY_HOST:$NEW_PRIMARY_PORT..." >> "$LOG"
+    sudo -u postgres "$PG_BIN/pg_ctl" promote -D "$PGDATA" >> "$LOG" 2>&1
+    echo "$(date '+%F %T') Promotion completed for node_id=$NEW_PRIMARY_NODE_ID" >> "$LOG"
+else
+    echo "$(date '+%F %T') Standby node $FAILED_HOST:$FAILED_PORT failed. No promotion required." >> "$LOG"
 fi
 
-echo failover.sh: end: new_main_node_id=$NEW_MAIN_NODE_ID on ${NEW_MAIN_NODE_HOST} was successfully promoted to primary
 exit 0
diff --git a/src/sample/scripts/follow_primary.sh.sample b/src/sample/scripts/follow_primary.sh.sample
index eb83b7123..32d4ef1ea 100755
--- a/src/sample/scripts/follow_primary.sh.sample
+++ b/src/sample/scripts/follow_primary.sh.sample
@@ -1,137 +1,36 @@
 #!/bin/bash
-# This script is run after failover_command to synchronize the Standby with the new Primary.
-
-set -o xtrace
-
-# Special values:
-# 1)  %d = node id
-# 2)  %h = hostname
-# 3)  %p = port number
-# 4)  %D = node database cluster path
-# 5)  %m = new primary node id
-# 6)  %H = new primary node hostname
-# 7)  %M = old main node id
-# 8)  %P = old primary node id
-# 9)  %r = new primary port number
-# 10) %R = new primary database cluster path
-# 11) %N = old primary node hostname
-# 12) %S = old primary node port number
-# 13) %% = '%' character
-
-NODE_ID="$1"
-NODE_HOST="$2"
-NODE_PORT="$3"
-NODE_PGDATA="$4"
-NEW_PRIMARY_NODE_ID="$5"
-NEW_PRIMARY_NODE_HOST="$6"
-OLD_MAIN_NODE_ID="$7"
-OLD_PRIMARY_NODE_ID="$8"
-NEW_PRIMARY_NODE_PORT="$9"
-NEW_PRIMARY_NODE_PGDATA="${10}"
-
-PGHOME=/usr/pgsql-17
-REPLUSER=repl
-PCP_USER=pgpool
-PGPOOL_PATH=/usr/bin
-PCP_PORT=9898
-REPL_SLOT_NAME=$(echo ${NODE_HOST,,} | tr -- -. _)
-POSTGRESQL_STARTUP_USER=postgres
-SSH_KEY_FILE=id_rsa_pgpool
-SSH_OPTIONS="-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ~/.ssh/${SSH_KEY_FILE}"
-
-echo follow_primary.sh: start: Standby node ${NODE_ID}
-
-# Check the connection status of Standby
-${PGHOME}/bin/pg_isready -h ${NODE_HOST} -p ${NODE_PORT} > /dev/null 2>&1
-
-if [ $? -ne 0 ]; then
-    echo follow_primary.sh: node_id=${NODE_ID} is not running. skipping follow primary command
-    exit 0
-fi
-
-# Test passwordless SSH
-ssh -T ${SSH_OPTIONS} ${POSTGRESQL_STARTUP_USER}@${NEW_PRIMARY_NODE_HOST} ls /tmp > /dev/null
-
-if [ $? -ne 0 ]; then
-    echo follow_main.sh: passwordless SSH to ${POSTGRESQL_STARTUP_USER}@${NEW_PRIMARY_NODE_HOST} failed. Please setup passwordless SSH.
-    exit 1
-fi
-
-# Get PostgreSQL major version
-PGVERSION=`${PGHOME}/bin/psql -V | awk '{print $3}' | sed 's/\..*//' | sed 's/\([0-9]*\)[a-zA-Z].*/\1/'`
-
-if [ $PGVERSION -ge 12 ]; then
-    RECOVERYCONF=${NODE_PGDATA}/myrecovery.conf
-else
-    RECOVERYCONF=${NODE_PGDATA}/recovery.conf
-fi
-
-# Synchronize Standby with the new Primary.
-echo follow_primary.sh: pg_rewind for node ${NODE_ID}
-
-# Run checkpoint command to update control file before running pg_rewind
-${PGHOME}/bin/psql -h ${NEW_PRIMARY_NODE_HOST} -p ${NEW_PRIMARY_NODE_PORT} postgres -c "checkpoint;"
-
-# Create replication slot "${REPL_SLOT_NAME}"
-${PGHOME}/bin/psql -h ${NEW_PRIMARY_NODE_HOST} -p ${NEW_PRIMARY_NODE_PORT} postgres \
-    -c "SELECT pg_create_physical_replication_slot('${REPL_SLOT_NAME}');"  >/dev/null 2>&1
-
-if [ $? -ne 0 ]; then
-    echo follow_primary.sh: create replication slot \"${REPL_SLOT_NAME}\" failed. You may need to create replication slot manually.
-fi
-
-ssh -T ${SSH_OPTIONS} ${POSTGRESQL_STARTUP_USER}@${NODE_HOST} "
-
-    set -o errexit
-
-    ${PGHOME}/bin/pg_ctl -w -m f -D ${NODE_PGDATA} stop
-
-    ${PGHOME}/bin/pg_rewind -D ${NODE_PGDATA} --source-server=\"user=${POSTGRESQL_STARTUP_USER} host=${NEW_PRIMARY_NODE_HOST} port=${NEW_PRIMARY_NODE_PORT} dbname=postgres\"
-
-    [ -d \"${NODE_PGDATA}\" ] && rm -rf ${NODE_PGDATA}/pg_replslot/*
-
-    cat > ${RECOVERYCONF} << EOT
-primary_conninfo = 'host=${NEW_PRIMARY_NODE_HOST} port=${NEW_PRIMARY_NODE_PORT} user=${REPLUSER} application_name=${NODE_HOST} passfile=''/var/lib/pgsql/.pgpass'''
-recovery_target_timeline = 'latest'
-primary_slot_name = '${REPL_SLOT_NAME}'
-EOT
-
-    if [ ${PGVERSION} -ge 12 ]; then
-        sed -i -e \"\\\$ainclude_if_exists = '$(echo ${RECOVERYCONF} | sed -e 's/\//\\\//g')'\" \
-               -e \"/^include_if_exists = '$(echo ${RECOVERYCONF} | sed -e 's/\//\\\//g')'/d\" ${NODE_PGDATA}/postgresql.conf
-        touch ${NODE_PGDATA}/standby.signal
-    else
-        echo \"standby_mode = 'on'\" >> ${RECOVERYCONF}
-    fi
-
-    ${PGHOME}/bin/pg_ctl -l /dev/null -w -D ${NODE_PGDATA} start
-
-"
-
-# If start Standby successfully, attach this node
-if [ $? -eq 0 ]; then
-
-    # Run pcp_attact_node to attach Standby node to Pgpool-II.
-    ${PGPOOL_PATH}/pcp_attach_node -w -h localhost -U $PCP_USER -p ${PCP_PORT} -n ${NODE_ID}
-
-    if [ $? -ne 0 ]; then
-        echo ERROR: follow_primary.sh: end: pcp_attach_node failed
-        exit 1
-    fi
-
+# follow_primary.sh
+# Called by Pgpool-II after a new primary is elected.
+# Args: %d (failed node id)
+#       %h (failed host)
+#       %p (failed port)
+#       %D (failed data directory)
+#       %m (new primary node id)
+#       %H (new primary host)
+#       %M (new primary port)
+
+FAILED_ID="$1"
+FAILED_HOST="$2"
+FAILED_PORT="$3"
+FAILED_DATA="$4"
+NEW_MAIN_ID="$5"
+NEW_MAIN_HOST="$6"
+NEW_MAIN_PORT="$7"
+
+LOG="/var/log/pgpool/follow_primary.log"
+mkdir -p /var/log/pgpool
+
+echo "$(date '+%F %T') follow_primary: failed=$FAILED_HOST:$FAILED_PORT new_primary=$NEW_MAIN_HOST:$NEW_MAIN_PORT" >> "$LOG"
+
+# Path to your resync script (this handles recloning/resyncing)
+RESYNC_SCRIPT="/etc/pgpool2/resync.sh"
+
+if [ -x "$RESYNC_SCRIPT" ]; then
+    echo "$(date '+%F %T') Starting resync of old primary ($FAILED_HOST:$FAILED_PORT) from new primary ($NEW_MAIN_HOST:$NEW_MAIN_PORT)..." >> "$LOG"
+    "$RESYNC_SCRIPT" "$FAILED_HOST" "$NEW_MAIN_HOST" "$FAILED_PORT" "$NEW_MAIN_PORT" "$FAILED_DATA" >> "$LOG" 2>&1 &
+    echo "$(date '+%F %T') Resync script triggered successfully." >> "$LOG"
 else
-
-    # If start Standby failed, drop replication slot "${REPL_SLOT_NAME}"
-    ${PGHOME}/bin/psql -h ${NEW_PRIMARY_NODE_HOST} -p ${NEW_PRIMARY_NODE_PORT} postgres \
-        -c "SELECT pg_drop_replication_slot('${REPL_SLOT_NAME}');"  >/dev/null 2>&1
-
-    if [ $? -ne 0 ]; then
-        echo ERROR: follow_primary.sh: drop replication slot \"${REPL_SLOT_NAME}\" failed. You may need to drop replication slot manually.
-    fi
-
-    echo ERROR: follow_primary.sh: end: follow primary command failed
-    exit 1
+    echo "$(date '+%F %T') ERROR: Resync script not found or not executable at $RESYNC_SCRIPT" >> "$LOG"
 fi
 
-echo follow_primary.sh: end: follow primary command is completed successfully
 exit 0
diff --git a/src/sample/scripts/recovery.sh.sample b/src/sample/scripts/recovery.sh.sample
new file mode 100755
index 000000000..18b2acdb9
--- /dev/null
+++ b/src/sample/scripts/recovery.sh.sample
@@ -0,0 +1,87 @@
+#!/bin/bash
+#=====================================================================
+# recovery.sh
+# Called automatically by Pgpool-II when a down node becomes available
+# or when online recovery is triggered.
+# This script identifies the current primary via PCP and resyncs
+# the target standby (old node) using resync.sh.
+#=====================================================================
+
+set -euo pipefail
+
+#change this as per ur required path
+NODE_ID="${1:-}"
+LOG=/var/log/pgpool/recovery.log
+mkdir -p /var/log/pgpool
+exec >>"$LOG" 2>&1
+
+echo "====================================================================="
+echo "$(date '+%F %T') [INFO] recovery.sh called with NODE_ID=$NODE_ID"
+
+# --- Mapping Node IDs to ports and data directories -----------------
+case "$NODE_ID" in
+  0)
+    TARGET_PGDATA="/var/lib/postgresql/18/main"
+    TARGET_PORT=5432
+    TARGET_HOST="localhost"
+    ;;
+  1)
+    TARGET_PGDATA="/var/lib/postgresql/18/standby1"
+    TARGET_PORT=5433
+    TARGET_HOST="localhost"
+    ;;
+  2)
+    TARGET_PGDATA="/var/lib/postgresql/18/standby2"
+    TARGET_PORT=5434
+    TARGET_HOST="localhost"
+    ;;
+  *)
+    echo "$(date '+%F %T') [ERROR] Unknown node id: $NODE_ID"
+    exit 1
+    ;;
+esac
+
+echo "$(date '+%F %T') [INFO] Target node: $TARGET_HOST:$TARGET_PORT ($TARGET_PGDATA)"
+
+# --- PCP settings ---------------------------------------------------
+PCP_PORT=9898
+PCP_USER=postgres
+
+# --- Detect current primary using pcp_node_info ---------------------
+echo "$(date '+%F %T') [INFO] Detecting current primary node..."
+#PRIMARY_INFO=$(sudo -u postgres pcp_node_info -h localhost -U "$PCP_USER" -p "$PCP_PORT" 2>/dev/null | grep "primary primary" | head -n 1)
+# Detect current primary node
+PRIMARY_INFO=$(sudo -u postgres pcp_node_info -h localhost -U "$PCP_USER" -p "$PCP_PORT" 2>/dev/null \
+  | awk '$7=="primary" {print $1 ":" $2; exit}')
+
+if [ -z "$PRIMARY_INFO" ]; then
+  echo "$(date '+%F %T') [ERROR] Could not determine primary via pcp_node_info"
+  exit 1
+fi
+
+# --- Split host:port correctly ---
+PRIMARY_HOST=${PRIMARY_INFO%%:*}   # everything before colon
+PRIMARY_PORT=${PRIMARY_INFO##*:}   # everything after colon
+
+echo "$(date '+%F %T') [INFO] Current primary detected: $PRIMARY_HOST:$PRIMARY_PORT"
+
+# --- Safety check: do not resync the current primary ---
+if [ "$PRIMARY_PORT" = "$TARGET_PORT" ]; then
+  echo "$(date '+%F %T') [WARN] Target node is already primary; skipping resync."
+  exit 0
+fi
+
+# --- Perform the resync ---
+echo "$(date '+%F %T') [INFO] Calling resync.sh for node $NODE_ID..."
+/etc/pgpool2/resync.sh "$TARGET_HOST" "$PRIMARY_HOST" "$TARGET_PORT" "$PRIMARY_PORT" "$TARGET_PGDATA" >> /var/log/pgpool/resync.log 2>&1
+
+RC=$?
+if [ $RC -eq 0 ]; then
+  echo "$(date '+%F %T') [INFO] Recovery completed successfully for node $NODE_ID."
+else
+  echo "$(date '+%F %T') [ERROR] Recovery failed for node $NODE_ID (exit=$RC). Check resync.log."
+fi
+
+
+echo "====================================================================="
+exit $RC
diff --git a/src/sample/scripts/resync.sh b/src/sample/scripts/resync.sh
new file mode 100755
index 000000000..8bd692a3a
--- /dev/null
+++ b/src/sample/scripts/resync.sh
@@ -0,0 +1,155 @@
+#!/bin/bash
+#this is for 3 node setup assuming 5432 as primary and 5433,5434 are standbys
+# Improved resync.sh
+# Args: $1 = old_host (node to resync), $2 = new_primary_host, $3 = old_port, $4 = new_port, $5 = old_data_dir
+set -euo pipefail
+
+OLD_HOST="${1:-}"
+NEW_HOST="${2:-}"
+OLD_PORT="${3:-}"
+NEW_PORT="${4:-}"
+OLD_PGDATA="${5:-}"
+
+# Adjust these if your binaries or primary data path differ
+PG_BIN="/usr/lib/postgresql/18/bin"
+PRIMARY_PGDATA="/var/lib/postgresql/18/main"
+
+PCP_PORT=9898
+PCP_USER=postgres
+LOG=/var/log/pgpool/resync.log
+
+mkdir -p /var/log/pgpool
+# Basic argument dump for debugging
+echo "$(date '+%F %T') ==== resync.sh start ====" >> "$LOG"
+echo "$(date '+%F %T') args: OLD_HOST=$OLD_HOST NEW_HOST=$NEW_HOST OLD_PORT=$OLD_PORT NEW_PORT=$NEW_PORT OLD_PGDATA=$OLD_PGDATA" >> "$LOG"
+echo "$(date '+%F %T') environment: PG_BIN=$PG_BIN PCP_PORT=$PCP_PORT PCP_USER=$PCP_USER" >> "$LOG"
+
+# Basic validation
+if [ -z "$OLD_HOST" ] || [ -z "$NEW_HOST" ] || [ -z "$OLD_PORT" ] || [ -z "$NEW_PORT" ] || [ -z "$OLD_PGDATA" ]; then
+  echo "$(date '+%F %T') ERROR: missing required arguments" >> "$LOG"
+  exit 1
+fi
+
+# Stop target Postgres (ignore if already stopped)
+echo "$(date '+%F %T') stopping target postgres (if running): $OLD_PGDATA" >> "$LOG"
+sudo -u postgres "$PG_BIN/pg_ctl" -D "$OLD_PGDATA" stop -m immediate >> "$LOG" 2>&1 || true
+sleep 2
+
+# Try to auto-determine NODE_ID_FOR_PCP from pcp_node_info (safer than static mapping)
+NODE_ID_FOR_PCP=""
+PCP_OUT=$(sudo -u postgres pcp_node_info -h localhost -p "$PCP_PORT" -U "$PCP_USER" 2>/dev/null || true)
+if [ -n "$PCP_OUT" ]; then
+  # iterate lines and find matching host:port; pcp_node_info lines: "<host> <port> ..."
+  idx=0
+  while read -r line; do
+    hs=$(echo "$line" | awk '{print $1}')
+    pt=$(echo "$line" | awk '{print $2}')
+    if [ "$hs" = "$OLD_HOST" ] && [ "$pt" = "$OLD_PORT" ]; then
+      NODE_ID_FOR_PCP="$idx"
+      break
+    fi
+    idx=$((idx+1))
+  done <<< "$PCP_OUT"
+fi
+
+# Fallback static mapping if we couldn't determine node id
+if [ -z "$NODE_ID_FOR_PCP" ]; then
+  echo "$(date '+%F %T') WARN: could not detect node id via pcp_node_info; using port->id fallback" >> "$LOG"
+  case "$OLD_PORT" in
+    5432) NODE_ID_FOR_PCP=0 ;;
+    5433) NODE_ID_FOR_PCP=1 ;;
+    5434) NODE_ID_FOR_PCP=2 ;;
+    *) echo "$(date '+%F %T') ERROR: unknown OLD_PORT=$OLD_PORT" >> "$LOG"; exit 1 ;;
+  esac
+fi
+echo "$(date '+%F %T') NODE_ID_FOR_PCP=$NODE_ID_FOR_PCP" >> "$LOG"
+
+# Ensure NEW_HOST/NEW_PORT present
+if [ -z "$NEW_HOST" ] || [ -z "$NEW_PORT" ]; then
+  echo "$(date '+%F %T') ERROR: NEW_HOST or NEW_PORT empty" >> "$LOG"
+  exit 1
+fi
+
+# Print pg_controldata system ids (best-effort)
+SYS_SRC=$(sudo -u postgres "$PG_BIN/pg_controldata" "$OLD_PGDATA" 2>/dev/null | awk -F: '/Database system identifier/ {print $2}' | tr -d ' ' || true)
+SYS_TGT=$(sudo -u postgres "$PG_BIN/pg_controldata" "$PRIMARY_PGDATA" 2>/dev/null | awk -F: '/Database system identifier/ {print $2}' | tr -d ' ' || true)
+echo "$(date '+%F %T') systemid old=$SYS_SRC current_primary=$SYS_TGT" >> "$LOG"
+
+# Try pg_rewind
+echo "$(date '+%F %T') Attempting pg_rewind: target=$OLD_PGDATA <- source=$NEW_HOST:$NEW_PORT" >> "$LOG"
+if sudo -u postgres "$PG_BIN/pg_rewind" --target-pgdata="$OLD_PGDATA" --source-server="host=$NEW_HOST port=$NEW_PORT user=postgres dbname=postgres" >> "$LOG" 2>&1; then
+  echo "$(date '+%F %T') pg_rewind succeeded" >> "$LOG"
+else
+  echo "$(date '+%F %T') pg_rewind failed; falling back to pg_basebackup" >> "$LOG"
+  # Remove contents but keep directory itself
+  LOG="/var/log/pgpool/resync.log"
+  sudo rm -rf "$OLD_PGDATA" >> "$LOG" 2>&1
+  sudo mkdir -p "$OLD_PGDATA" >> "$LOG" 2>&1
+  sudo chown -R postgres:postgres "$OLD_PGDATA" >> "$LOG" 2>&1
+  sudo chmod 700 "$OLD_PGDATA" >> "$LOG" 2>&1
+
+
+  # Run pg_basebackup (requires repl user and .pgpass)
+  echo "$(date '+%F %T') running pg_basebackup from $NEW_HOST:$NEW_PORT to $OLD_PGDATA" >> "$LOG"
+  if ! sudo -u postgres "$PG_BIN/pg_basebackup" -h "$NEW_HOST" -p "$NEW_PORT" -U repl -D "$OLD_PGDATA" -Fp -Xs -P -R >> "$LOG" 2>&1; then
+    echo "$(date '+%F %T') pg_basebackup failed; aborting resync" >> "$LOG"
+    exit 1
+  fi
+  echo "$(date '+%F %T') pg_basebackup succeeded" >> "$LOG"
+fi
+
+# Ensure primary_conninfo exists in postgresql.auto.conf (safe append)
+sudo -u postgres bash -c "
+  conf=\"$OLD_PGDATA/postgresql.auto.conf\"
+  if ! grep -q primary_conninfo \"\$conf\" 2>/dev/null; then
+    echo \"primary_conninfo = 'host=$NEW_HOST port=$NEW_PORT user=repl passfile=/var/lib/postgresql/.pgpass'\" >> \"\$conf\"
+  fi
+"
+
+# Create standby.signal for modern Postgres
+sudo -u postgres touch "$OLD_PGDATA/standby.signal"
+
+# Fix ownership/permissions
+sudo chown -R postgres:postgres "$OLD_PGDATA"
+sudo chmod 700 "$OLD_PGDATA" || true
+
+case "$OLD_PORT" in
+  5432) CONF="/etc/postgresql/18/main/postgresql.conf" ;;
+  5433) CONF="/etc/postgresql/18/standby1/postgresql.conf" ;;
+  5434) CONF="/etc/postgresql/18/standby2/postgresql.conf" ;;
+  *) CONF="/etc/postgresql/18/main/postgresql.conf" ;;
+esac
+
+
+echo "$(date '+%F %T') starting postgres at $OLD_PGDATA using config $CONF"
+sudo -u postgres "$PG_BIN/pg_ctl" -D "$OLD_PGDATA" -o "-c config_file=$CONF" -w start >>"$LOG" 2>&1 || echo "$(date '+%F %T') pg_ctl start returned non-zero (check postgres logs)"
+
+
+# Wait for socket with pg_isready (up to ~30s)
+attempt=0
+until sudo -u postgres "$PG_BIN/pg_isready" -q -p "$OLD_PORT" || [ $attempt -ge 20 ]; do
+  attempt=$((attempt+1))
+  sleep 2
+done
+if [ $attempt -ge 15 ]; then
+  echo "$(date '+%F %T') WARNING: server at $OLD_PGDATA didn't become ready within timeout" >> "$LOG"
+fi
+
+# Attach node to Pgpool via PCP with retries
+RETRIES=6
+i=0
+while [ $i -lt $RETRIES ]; do
+  if sudo -u postgres pcp_attach_node -h localhost -p "$PCP_PORT" -U "$PCP_USER" -n "$NODE_ID_FOR_PCP" >> "$LOG" 2>&1; then
+    echo "$(date '+%F %T') pcp_attach_node succeeded for node $NODE_ID_FOR_PCP" >> "$LOG"
+    break
+  fi
+  echo "$(date '+%F %T') pcp_attach_node attempt $((i+1)) failed; sleeping and retrying..." >> "$LOG"
+  i=$((i+1))
+  sleep 3
+done
+if [ $i -ge $RETRIES ]; then
+  echo "$(date '+%F %T') ERROR: pcp_attach_node failed after $RETRIES attempts" >> "$LOG"
+fi
+
+echo "$(date '+%F %T') ==== resync complete ====" >> "$LOG"
+exit 0
-- 
2.43.0



^ permalink  raw  reply  [nested|flat] 3+ messages in thread

end of thread, other threads:[~2025-11-07 11:25 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2025-11-04 05:23 Autofail back inconsistent VASUKI M <[email protected]>
2025-11-04 11:40 ` Tatsuo Ishii <[email protected]>
2025-11-07 11:25   ` VASUKI M <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox