Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1rUaqQ-00BpgP-S0 for pgsql-general@arkaria.postgresql.org; Mon, 29 Jan 2024 23:15:15 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1rUaqP-001AIG-MN for pgsql-general@arkaria.postgresql.org; Mon, 29 Jan 2024 23:15:13 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1rUaqP-001AI6-75 for pgsql-general@lists.postgresql.org; Mon, 29 Jan 2024 23:15:13 +0000 Received: from mail.pernau.at ([2a01:4f8:151:4086::2]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1rUaqL-0042wk-Dl for pgsql-general@lists.postgresql.org; Mon, 29 Jan 2024 23:15:11 +0000 Received: from pernau.at (localhost.localdomain [127.0.0.1]) by mail.pernau.at (Postfix) with ESMTPSA id 315952FA50B5; Tue, 30 Jan 2024 00:15:06 +0100 (CET) MIME-Version: 1.0 Date: Tue, 30 Jan 2024 00:15:06 +0100 From: Klaus Darilion To: Shaheed Haque Cc: pgsql-general list Subject: Re: Monitoring logical replication In-Reply-To: References: Message-ID: X-Sender: klaus.mailinglists@pernau.at Content-Type: text/plain; charset=US-ASCII; format=flowed Content-Transfer-Encoding: 7bit List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hi Saheed! I monitor our replication this way: 1. Every 10 seconds i fetch the current LSN and write it into a table, next with the current timestamp. Further I fetch confirmend LSNs from the replication slots and delete old entries in lsn2data table. calculate_logical_replication_lag.php: >>" TMP_FILE="/tmp/logical_replication_lag.csv" # Schwellwerte warn=300 crit=600 final_output="$CMK_HEADER\nP $CMK_NAME " # move to a directory where user postgresl may reside (sudo) cd /tmp # Lag auslesen. Waehrend dem initialen aufsynchen eines Subscribers gibt es temporaere Subscriptions, mit dem Namen reg_xxx1-pid-sync-pid. # Damit diese nicht getrackt werden gibt es die huebsche LIKE Clause. rm -f "$TMP_FILE" sudo -u postgres psql regdns -c "COPY (SELECT subscriber,lag FROM get_replication_lag() WHERE subscriber LIKE '%\_____' ORDER BY 2 DESC) TO '$TMP_FILE' With CSV" 2>&1> /dev/null LC=$(sudo -u postgres psql -t regdns -c "SELECT count(*) FROM get_replication_lag();" | tr -d ' ') if [ $LC == "0" ]; then echo -e "$CMK_HEADER\n0 $CMK_NAME - No Slaves with Replication found - maybe we are a slave?" > $CMK_SPOOLDIR/$CMK_SPOOLFILE exit 0; fi grep $(hostname | cut -d '-' -f2) "$TMP_FILE" > /dev/null if [ $? != "0" ]; then echo -e "$CMK_HEADER\n2 $CMK_NAME - Postgres Output does not seem valid. Please check script $ME and output in $TMP_FILE" > $CMK_SPOOLDIR/$CMK_SPOOLFILE exit 1; fi # CSV in Array einlesen IFS=$'\n' read -d '' -r -a input_file < "$TMP_FILE" # Auswerten maxlag=0 for i in "${input_file[@]}"; do node=`echo $i | awk -F "," '{print $1}' | tr -- _ -` lag=`echo $i | awk -F "," '{print $2}'` final_output="$final_output$node=$lag;$warn;$crit|" # https://unix.stackexchange.com/questions/186663/is-there-a-unix-command-that-gives-the-minimum-maximum-of-two-numbers maxlag=$(( maxlag > lag ? maxlag : lag )) done final_output="${final_output}max-lag=$maxlag;$warn;$crit" # Letztes Pipe Zeichen rausschneiden #final_output=`echo $final_output | rev | cut -c 2- | rev` # Spool File schreiben echo -e $final_output > $CMK_SPOOLDIR/$CMK_SPOOLFILE logger -t "$MEBASE" "$final_output" ) 200>/tmp/`basename $0`.exclusivelock 3. During initial sync I check the status on the subscriber. Once it has synced all tables of the publication, it will send me an email. # # Managed by Puppet: modules/pdns/templates/check_pglogical_subscription.sh.erb # # # This script checks and eventually creates the subscription, and wait until the initial sync is finished # PUB=regdns2020_pub SLEEP=5 PREFIX=check_pglogical_subscription.sh NUMTAB=175 SECONDS=0 date while true; do echo "SELECT * from pg_subscription;" | sudo -u postgres psql -t regdns | grep -q $PUB if [ $? -eq 0 ]; then echo "OK: Host is subscribed to '$PUB'. Checking for table count ..." break fi echo "ERROR: Host is not subscribed to '$PUB'. Subscribing to master ..." logger -t $PREFIX "ERROR: Host is not subscribed to '$PUB'. Subscribing to master ..." echo "CREATE SUBSCRIPTION `hostname -s|tr -- - _` CONNECTION 'host=XXXXX dbname=XXXX user=XXXXX password=XXXXXX PUBLICATION regdns2020_pub;" | sudo -u postgres psql regdns && touch /etc/regdns.schema_subscription.created echo "Re-Checking in $SLEEP seconds ..." logger -t $PREFIX "Re-Checking in $SLEEP seconds ..." sleep $SLEEP done while true; do COUNT=$(echo "SELECT count(*) from pg_subscription_rel;" | sudo -u postgres psql -t regdns | head -1 | xargs) if [ $COUNT -eq $NUMTAB ]; then echo "OK: Subscription '$PUB' contains $NUMTAB tables - that is OK. Checking for initial-sync status ..." logger -t $PREFIX "OK: Subscription '$PUB' contains $NUMTAB tables - that is OK. Checking for initial-sync status ..." break fi echo "ERROR: Subscription '$PUB' contains $COUNT tables, but should contain $NUMTAB table. Re-Checking in $SLEEP seconds ..." logger -t $PREFIX "ERROR: Subscription '$PUB' contains $COUNT tables, but should contain $NUMTAB table. Re-Checking in $SLEEP seconds ..." sleep $SLEEP done while true; do COUNTFIN=$(echo "SELECT count(*) from pg_subscription_rel WHERE srsubstate='r';" | sudo -u postgres psql -t regdns | head -1 | xargs) if [ $COUNTFIN -eq $NUMTAB ]; then echo "OK: Initial sync of $COUNTFIN/$NUMTAB tables finished in $SECONDS seconds." logger -t $PREFIX "OK: Initial sync of $COUNTFIN/$NUMTAB tables finished in $SECONDS seconds." echo "OK: Initial sync of $COUNTFIN/$NUMTAB tables finished in $SECONDS seconds." | mailx -s "$HOST $SECONDS seconds to subscribe" -- root break fi echo "PROGRESS: Initial sync of $COUNTFIN/$NUMTAB tables finished. Re-Checking in $SLEEP seconds ..." logger -t $PREFIX "PROGRESS: Initial sync of $COUNTFIN/$NUMTAB tables finished. Re-Checking in $SLEEP seconds ..." sleep $SLEEP done regards Klaus Am 2023-10-07 17:31, schrieb Shaheed Haque: > Hi, > > I've been playing with logical replication (currently on PG14), > specifically in an AWS RDS Postgres context, but NOT using AWS' own > replication tooling. I'm generally familiar with the challenges of > distributed systems (such causality, time synchronisation etc), but > not especially familiar with PG. > > In looking at how to tell how a given subscriber has caught up with > its publisher, there is plenty of advice around the Web, for example > https://dba.stackexchange.com/questions/314324/monitor-logical-replication-using-lsn. > Like this example, much advice ends up talking about using separate > queries on the publisher and the subscriber to compare LSNs. First, (I > think) I understand the core difficulty that comparing LSNs is > inherently racy, but given that, I'm a bit unclear as to why a single > query on the publisher is not enough...IIUC: > > * Changes sent from the publisher to the subscriber are identified by > LSN. > * The publisher knows it's own current latest LSN > (pg_current_wal_lsn()), but this seems not to be exposed at the > subscriber. > * The subscriber knows what it has applied locally and even tells the > publisher (pg_stat_subscription.latest_end_lsn), but it does not seem > to be exposed at the publisher. > > Have I missed something? Is there a way to track the LSN delta (given > that this is known to be racy) just by querying one end? > > Second, how do folk "know" when replication is "done". For example, if > the two LSNs continued to match for 1 * replication lag? Or N * > replication lag? What would be a plausible N? > > Third, as we know when logical replication is started, the initial > table state is captured in a snapshot, and sent across using COPY > TABLE under the covers. Now, let's say that the publisher is idle > (i.e. no SQL writes to the user's schema...obviously pg_catalog might > change as replication is configured and enabled) and that the > replication starts with the publisher as LSN_start. How could one know > when the copying is done: > > * I initially assumed that the publisher's LSN would not change from > LSN_start, but as the copying proceeds, I see that it DOES change > (presumably because there are updates happening to pg_catalog, such as > the temporary slots coming and going). > * Is there some kind of singleton state on either publisher or > subscriber that could be checked to know? (At the moment, I am > counting the records in all copied tables). > > I realise that the knowledge that the publisher is "idle" is a special > case, but right now, my test for being "done" is: > > * Number of records in copied tables matches AND the publisher's > pg_stat_subscription matches the subscriber's > pg_stat_subscription.latest_end_lsn. > > Plus or minus the bit about replication lag, is there a better way? > > Thanks, Shaheed