Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1rUmGv-00Cy6K-M7 for pgsql-general@arkaria.postgresql.org; Tue, 30 Jan 2024 11:27:22 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1rUmGu-004hLP-Rk for pgsql-general@arkaria.postgresql.org; Tue, 30 Jan 2024 11:27:20 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1rUmGu-004hLH-Ei for pgsql-general@lists.postgresql.org; Tue, 30 Jan 2024 11:27:20 +0000 Received: from mail-ua1-x92a.google.com ([2607:f8b0:4864:20::92a]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1rUmGq-004VtE-Ne for pgsql-general@lists.postgresql.org; Tue, 30 Jan 2024 11:27:19 +0000 Received: by mail-ua1-x92a.google.com with SMTP id a1e0cc1a2514c-7d2a67daa25so1792797241.0 for ; Tue, 30 Jan 2024 03:27:16 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1706614035; x=1707218835; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=dbE1omNCMMue4G+VoGLtL0o0HSl5ALKq9LXdjC/f0IU=; b=IzkQxmzBJQk/39e91y0jaTnbUuCb91Lwg5bDb80xs6lXwRXzETcIgzXKCC4rk/E+qp 6nEQPPdEdi7x8SYlAPrttwtHVG+hTX+7QnUS2WtwXtWEAO0xE5t9vueopLTtSmUCUe9U Rv+ZeUGX1rZoF0qFjb1chB3R/UjSgfpzDDbXbPtzgZYRVzokQdp7S001SHlbHipnKYYz VtMndS1/JMvl1aTxbWzH72swcoy1ck+dAHi4Ra74Th1hHMzdzJLHtdHxKZLUtAMUclpm b+pJu/EzJ2jjozZ1SottPS4eGFO3rAfs48BJr/v6G5/ADJY7ozR9DZTifDnAvkeCpqLQ yEGg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1706614035; x=1707218835; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=dbE1omNCMMue4G+VoGLtL0o0HSl5ALKq9LXdjC/f0IU=; b=hA5Bii8D2tj+/2PM+e4nZHfoZ66TXcy5zHWX2GrT67QGu48lqSwgJrB3H5i5mKFRqY JVkqoQM7ty+rNTlLD4kbjwF7SC9dppelcMpq4LebPgdTVQ5+fykyTL73AdUper/DjNU+ TI6KRhuYGI7TijXADMYZry0PbPwJ3W1gNqqI1movRK+pUBYMX+Q7KtjAW/UdpC9o9Y2q AOswl7jra6iC+O9jl5iRNzWCK64tcDZfG6GQXbPsQTwPLlz5giNAlSx0QlOgCrHkdVrh Oy5OTAFH7lw+eJTzFNCJKsUeDgrhJ26p9l7Bn6P/Rvj/lZRPRrypvzGPnNmgZl5ESq+k Xdiw== X-Gm-Message-State: AOJu0Yygzizennwrcl6QdQQx62LQVTp2tltABcoPa0/nL04aYEiNJ6CI zzwuv+w87VgoRuX2U+AWonxA6h0qe7+jj6Pvkx+S34wkjf9OGHVfPhJcl8fofwnC2oPaxpS9wnv W5NrckBz3d1+HOhUlVC/eL3vbEl8W6Hp4oac= X-Google-Smtp-Source: AGHT+IHBMfdk96a78tAoqQQDHXb3+UGW8la/fO6Mf2hpOWPtnFAPf4PtzKGvwdN19HnR10/QYZZs5HtNwj+33GT062A= X-Received: by 2002:a05:6122:288b:b0:4bd:73cb:e6d1 with SMTP id fl11-20020a056122288b00b004bd73cbe6d1mr3636979vkb.15.1706614034864; Tue, 30 Jan 2024 03:27:14 -0800 (PST) MIME-Version: 1.0 References: In-Reply-To: From: Shaheed Haque Date: Tue, 30 Jan 2024 17:27:04 +0600 Message-ID: Subject: Re: Monitoring logical replication To: Klaus Darilion Cc: pgsql-general list Content-Type: multipart/alternative; boundary="00000000000051211e0610280b47" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --00000000000051211e0610280b47 Content-Type: text/plain; charset="UTF-8" This is great, thank you for posting. I'm currently a subcontinent or two away from my dev env, but will compare your approach with mine (you are using some facilities of psql I'm not familiar with). At least you have confirmed that LSNs are the place to start. Thanks again, Shaheed On Tue, 30 Jan 2024, 05:15 Klaus Darilion, wrote: > Hi Saheed! > > I monitor our replication this way: > > 1. Every 10 seconds i fetch the current LSN and write it into a table, > next with the current timestamp. Further I fetch confirmend LSNs from > the replication slots and delete old entries in lsn2data table. > > calculate_logical_replication_lag.php: > > > $path = realpath(dirname(__FILE__) . "/../inc"); > set_include_path($path . PATH_SEPARATOR . get_include_path()); > > require_once('config.php'); > $config_int['syslogprefix'] = basename(__FILE__); > require_once('logging.php'); > > $dbuser="replication_lag_user"; > $dbpass="XXXXXXXXXXXXXXXXXXXX"; > if (!$dbconn = pg_pconnect('host='.$config_int['dbhost'].' > dbname='.$config_int['dbname'].' user='.$dbuser.' password='.$dbpass)) { > print "Sorry, database connection failed"; > exit; > } > > $accuracy = 10; // in seconds > > // > // Preparations: > // > // CREATE TABLE lsn2data( > // lsn pg_lsn PRIMARY KEY, > // seen timestamp NOT NULL DEFAULT NOW() > // ); > // CREATE ROLE replication_lag_user WITH LOGIN PASSWORD > 'XXXXXXXXXXXXXXXXXXX'; > // GRANT ALL ON TABLE lsn2data TO replication_lag_user; > // > // CREATE OR REPLACE FUNCTION get_replication_lag() RETURNS TABLE > (subscriber name, lag bigint) AS > // $BODY$ > // DECLARE > // subscriber name; > // BEGIN > // FOR subscriber IN > // SELECT slot_name FROM pg_replication_slots > // LOOP > // RETURN QUERY SELECT slot_name, EXTRACT(EPOCH FROM > NOW()-seen)::bigint lag from lsn2data,pg_replication_slots WHERE > slot_name=subscriber AND lsn < confirmed_flush_lsn ORDER BY lsn DESC > LIMIT 1; > // END LOOP; > // RETURN; > // END > // $BODY$ > // LANGUAGE plpgsql; > // > while (1) { > $dbq = pg_query("INSERT INTO lsn2data (lsn) VALUES > (pg_current_wal_lsn())"); > if ($dbq === FALSE) { > mylog(LOG_ERROR, "SQL query error: > ".pg_last_error()."\n"); > exit(1); > } > > $dbq = pg_query("DELETE FROM lsn2data WHERE lsn < (". > "SELECT lsn FROM lsn2data WHERE lsn < (". > "SELECT confirmed_flush_lsn FROM > pg_replication_slots ORDER BY confirmed_flush_lsn ASC LIMIT 1". > ") ORDER BY lsn DESC LIMIT 1". > ")" > ); > if ($dbq === FALSE) { > mylog(LOG_ERROR, "SQL query error: > ".pg_last_error()."\n"); > exit(1); > } > sleep($accuracy); > } > > 2. I graph the replications lags (converted from LSN to seconds) in my > check_mk monitoring: > > #!/bin/bash > > # > # Managed by Puppet: > modules/base/files/monitoring/check_logical_replication_lag.sh > # > # Check the logical replication lag and export performance data for each > subscriber > # > > # exit on error > #set -e > > #Make sure this script only runs one at a time > ( > > ME=$0 > MEBASE=`basename $0` > > mylog () { > echo "$MEBASE: $1" > logger -t "$MEBASE" "$1" > } > > flock -x -w 1 200 > if [ $? != "0" ]; then > #echo "ERROR: $0 is already running ... exit" > logger -t "$MEBASE" "ERROR: $0 is already running ... exit" > exit 1 > fi > > # Do stuff > > # Variablen fuer Monitoring > CMK_SPOOLDIR=/var/lib/check_mk_agent/spool > CMK_NAME=$MEBASE > CMK_SPOOLFILE=600_`basename ${CMK_NAME}`.txt > CMK_HEADER="<<>>" > TMP_FILE="/tmp/logical_replication_lag.csv" > > # Schwellwerte > warn=300 > crit=600 > > final_output="$CMK_HEADER\nP $CMK_NAME " > > # move to a directory where user postgresl may reside (sudo) > cd /tmp > > # Lag auslesen. Waehrend dem initialen aufsynchen eines Subscribers gibt > es temporaere Subscriptions, mit dem Namen reg_xxx1-pid-sync-pid. > # Damit diese nicht getrackt werden gibt es die huebsche LIKE Clause. > rm -f "$TMP_FILE" > sudo -u postgres psql regdns -c "COPY (SELECT subscriber,lag FROM > get_replication_lag() WHERE subscriber LIKE '%\_____' ORDER BY 2 DESC) > TO '$TMP_FILE' With CSV" 2>&1> /dev/null > LC=$(sudo -u postgres psql -t regdns -c "SELECT count(*) FROM > get_replication_lag();" | tr -d ' ') > > if [ $LC == "0" ]; then > echo -e "$CMK_HEADER\n0 $CMK_NAME - No Slaves with Replication > found - maybe we are a slave?" > $CMK_SPOOLDIR/$CMK_SPOOLFILE > exit 0; > fi > > grep $(hostname | cut -d '-' -f2) "$TMP_FILE" > /dev/null > if [ $? != "0" ]; then > echo -e "$CMK_HEADER\n2 $CMK_NAME - Postgres Output does not > seem valid. Please check script $ME and output in $TMP_FILE" > > $CMK_SPOOLDIR/$CMK_SPOOLFILE > exit 1; > fi > > # CSV in Array einlesen > IFS=$'\n' read -d '' -r -a input_file < "$TMP_FILE" > > # Auswerten > maxlag=0 > for i in "${input_file[@]}"; do > node=`echo $i | awk -F "," '{print $1}' | tr -- _ -` > lag=`echo $i | awk -F "," '{print $2}'` > final_output="$final_output$node=$lag;$warn;$crit|" > # > > https://unix.stackexchange.com/questions/186663/is-there-a-unix-command-that-gives-the-minimum-maximum-of-two-numbers > maxlag=$(( maxlag > lag ? maxlag : lag )) > done > final_output="${final_output}max-lag=$maxlag;$warn;$crit" > > # Letztes Pipe Zeichen rausschneiden > #final_output=`echo $final_output | rev | cut -c 2- | rev` > > # Spool File schreiben > echo -e $final_output > $CMK_SPOOLDIR/$CMK_SPOOLFILE > logger -t "$MEBASE" "$final_output" > > > ) 200>/tmp/`basename $0`.exclusivelock > > > 3. During initial sync I check the status on the subscriber. Once it has > synced all tables of the publication, it will send me an email. > # > # Managed by Puppet: > modules/pdns/templates/check_pglogical_subscription.sh.erb > # > > # > # This script checks and eventually creates the subscription, and wait > until the initial sync is finished > # > > PUB=regdns2020_pub > SLEEP=5 > PREFIX=check_pglogical_subscription.sh > NUMTAB=175 > > SECONDS=0 > date > while true; do > echo "SELECT * from pg_subscription;" | sudo -u postgres psql -t > regdns | grep -q $PUB > if [ $? -eq 0 ]; then > echo "OK: Host is subscribed to '$PUB'. Checking for > table count ..." > break > fi > echo "ERROR: Host is not subscribed to '$PUB'. Subscribing to > master ..." > logger -t $PREFIX "ERROR: Host is not subscribed to '$PUB'. > Subscribing to master ..." > echo "CREATE SUBSCRIPTION `hostname -s|tr -- - _` CONNECTION > 'host=XXXXX dbname=XXXX user=XXXXX password=XXXXXX PUBLICATION > regdns2020_pub;" | sudo -u postgres psql regdns && touch > /etc/regdns.schema_subscription.created > echo "Re-Checking in $SLEEP seconds ..." > logger -t $PREFIX "Re-Checking in $SLEEP seconds ..." > sleep $SLEEP > done > > while true; do > COUNT=$(echo "SELECT count(*) from pg_subscription_rel;" | sudo > -u postgres psql -t regdns | head -1 | xargs) > if [ $COUNT -eq $NUMTAB ]; then > echo "OK: Subscription '$PUB' contains $NUMTAB tables - > that is OK. Checking for initial-sync status ..." > logger -t $PREFIX "OK: Subscription '$PUB' contains > $NUMTAB tables - that is OK. Checking for initial-sync status ..." > break > fi > echo "ERROR: Subscription '$PUB' contains $COUNT tables, but > should contain $NUMTAB table. Re-Checking in $SLEEP seconds ..." > logger -t $PREFIX "ERROR: Subscription '$PUB' contains $COUNT > tables, but should contain $NUMTAB table. Re-Checking in $SLEEP seconds > ..." > sleep $SLEEP > done > > while true; do > COUNTFIN=$(echo "SELECT count(*) from pg_subscription_rel WHERE > srsubstate='r';" | sudo -u postgres psql -t regdns | head -1 | xargs) > if [ $COUNTFIN -eq $NUMTAB ]; then > echo "OK: Initial sync of $COUNTFIN/$NUMTAB tables > finished in $SECONDS seconds." > logger -t $PREFIX "OK: Initial sync of $COUNTFIN/$NUMTAB > tables finished in $SECONDS seconds." > echo "OK: Initial sync of $COUNTFIN/$NUMTAB tables > finished in $SECONDS seconds." | mailx -s "$HOST $SECONDS seconds to > subscribe" -- root > break > fi > echo "PROGRESS: Initial sync of $COUNTFIN/$NUMTAB tables > finished. Re-Checking in $SLEEP seconds ..." > logger -t $PREFIX "PROGRESS: Initial sync of $COUNTFIN/$NUMTAB > tables finished. Re-Checking in $SLEEP seconds ..." > sleep $SLEEP > done > > > > regards > Klaus > > > > > Am 2023-10-07 17:31, schrieb Shaheed Haque: > > Hi, > > > > I've been playing with logical replication (currently on PG14), > > specifically in an AWS RDS Postgres context, but NOT using AWS' own > > replication tooling. I'm generally familiar with the challenges of > > distributed systems (such causality, time synchronisation etc), but > > not especially familiar with PG. > > > > In looking at how to tell how a given subscriber has caught up with > > its publisher, there is plenty of advice around the Web, for example > > > https://dba.stackexchange.com/questions/314324/monitor-logical-replication-using-lsn > . > > Like this example, much advice ends up talking about using separate > > queries on the publisher and the subscriber to compare LSNs. First, (I > > think) I understand the core difficulty that comparing LSNs is > > inherently racy, but given that, I'm a bit unclear as to why a single > > query on the publisher is not enough...IIUC: > > > > * Changes sent from the publisher to the subscriber are identified > by > > LSN. > > * The publisher knows it's own current latest LSN > > (pg_current_wal_lsn()), but this seems not to be exposed at the > > subscriber. > > * The subscriber knows what it has applied locally and even tells > the > > publisher (pg_stat_subscription.latest_end_lsn), but it does not seem > > to be exposed at the publisher. > > > > Have I missed something? Is there a way to track the LSN delta (given > > that this is known to be racy) just by querying one end? > > > > Second, how do folk "know" when replication is "done". For example, if > > the two LSNs continued to match for 1 * replication lag? Or N * > > replication lag? What would be a plausible N? > > > > Third, as we know when logical replication is started, the initial > > table state is captured in a snapshot, and sent across using COPY > > TABLE under the covers. Now, let's say that the publisher is idle > > (i.e. no SQL writes to the user's schema...obviously pg_catalog might > > change as replication is configured and enabled) and that the > > replication starts with the publisher as LSN_start. How could one know > > when the copying is done: > > > > * I initially assumed that the publisher's LSN would not change > from > > LSN_start, but as the copying proceeds, I see that it DOES change > > (presumably because there are updates happening to pg_catalog, such as > > the temporary slots coming and going). > > * Is there some kind of singleton state on either publisher or > > subscriber that could be checked to know? (At the moment, I am > > counting the records in all copied tables). > > > > I realise that the knowledge that the publisher is "idle" is a special > > case, but right now, my test for being "done" is: > > > > * Number of records in copied tables matches AND the publisher's > > pg_stat_subscription matches the subscriber's > > pg_stat_subscription.latest_end_lsn. > > > > Plus or minus the bit about replication lag, is there a better way? > > > > Thanks, Shaheed > --00000000000051211e0610280b47 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
This is great, thank you for posting. I'm curren= tly a subcontinent or two away from my dev env, but will compare your appro= ach with mine (you are using some facilities of psql I'm not familiar w= ith). At least you have confirmed that LSNs are the place to start.=C2=A0

Thanks again, Shaheed


On Tue, 30 Jan 2024, 05:15 Klaus Darilion, &= lt;klaus.mailinglists@perna= u.at> wrote:
Hi Saheed!

I monitor our replication this way:

1. Every 10 seconds i fetch the current LSN and write it into a table,
next with the current timestamp. Further I fetch confirmend LSNs from
the replication slots and delete old entries in lsn2data table.

calculate_logical_replication_lag.php:

<?php

$path =3D realpath(dirname(__FILE__) . "/../inc");
set_include_path($path . PATH_SEPARATOR . get_include_path());

require_once('config.php');
$config_int['syslogprefix'] =3D basename(__FILE__);
require_once('logging.php');

$dbuser=3D"replication_lag_user";
$dbpass=3D"XXXXXXXXXXXXXXXXXXXX";
if (!$dbconn =3D pg_pconnect('host=3D'.$config_int['dbhost'= ].'
dbname=3D'.$config_int['dbname'].' user=3D'.$dbuser.= 9; password=3D'.$dbpass)) {
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0print "Sorry, database connection fa= iled";
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0exit;
}

$accuracy =3D 10; // in seconds

//
// Preparations:
//
// CREATE TABLE lsn2data(
//=C2=A0 =C2=A0 lsn pg_lsn PRIMARY KEY,
//=C2=A0 =C2=A0 seen timestamp NOT NULL DEFAULT NOW()
// );
// CREATE ROLE replication_lag_user WITH LOGIN PASSWORD
'XXXXXXXXXXXXXXXXXXX';
// GRANT ALL ON TABLE lsn2data TO replication_lag_user;
//
// CREATE OR REPLACE FUNCTION get_replication_lag() RETURNS TABLE
(subscriber name, lag bigint) AS
// $BODY$
// DECLARE
//=C2=A0 =C2=A0 =C2=A0subscriber name;
// BEGIN
//=C2=A0 =C2=A0 =C2=A0FOR subscriber IN
//=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0SELECT slot_name FROM pg_replication_sl= ots
//=C2=A0 =C2=A0 =C2=A0LOOP
//=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0RETURN QUERY SELECT slot_name, EXTRACT(= EPOCH FROM
NOW()-seen)::bigint lag from lsn2data,pg_replication_slots WHERE
slot_name=3Dsubscriber AND lsn < confirmed_flush_lsn ORDER BY lsn DESC <= br> LIMIT 1;
//=C2=A0 =C2=A0 =C2=A0END LOOP;
//=C2=A0 =C2=A0 =C2=A0RETURN;
// END
// $BODY$
// LANGUAGE plpgsql;
//
while (1) {
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0$dbq =3D pg_query("INSERT INTO lsn2d= ata (lsn) VALUES
(pg_current_wal_lsn())");
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if ($dbq =3D=3D=3D FALSE) {
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0mylog(LOG_ERR= OR, "SQL query error:
".pg_last_error()."\n");
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0exit(1);
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0}

=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0$dbq =3D pg_query("DELETE FROM lsn2d= ata WHERE lsn < (".
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0"SELECT lsn FROM lsn2data WHERE lsn < (&quo= t;.
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0"SELECT confirmed_flush_lsn FROM
pg_replication_slots ORDER BY confirmed_flush_lsn ASC LIMIT 1".
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0") ORDER BY lsn DESC LIMIT 1".
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0")"
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0);
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if ($dbq =3D=3D=3D FALSE) {
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0mylog(LOG_ERR= OR, "SQL query error:
".pg_last_error()."\n");
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0exit(1);
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0}
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0sleep($accuracy);
}

2. I graph the replications lags (converted from LSN to seconds) in my
check_mk monitoring:

#!/bin/bash

#
# Managed by Puppet:
modules/base/files/monitoring/check_logical_replication_lag.sh
#
# Check the logical replication lag and export performance data for each subscriber
#

# exit on error
#set -e

#Make sure this script only runs one at a time
(

=C2=A0 =C2=A0ME=3D$0
=C2=A0 =C2=A0MEBASE=3D`basename $0`

=C2=A0 =C2=A0mylog () {
=C2=A0 =C2=A0 =C2=A0echo "$MEBASE: $1"
=C2=A0 =C2=A0 =C2=A0logger -t "$MEBASE" "$1"
=C2=A0 =C2=A0}

=C2=A0 =C2=A0flock -x -w 1 200
=C2=A0 =C2=A0if [ $? !=3D "0" ]; then
=C2=A0 =C2=A0 =C2=A0#echo "ERROR: $0 is already running ... exit"=
=C2=A0 =C2=A0 =C2=A0logger -t "$MEBASE" "ERROR: $0 is alread= y running ... exit"
=C2=A0 =C2=A0 =C2=A0exit 1
=C2=A0 =C2=A0fi

=C2=A0 =C2=A0# Do stuff

# Variablen fuer Monitoring
CMK_SPOOLDIR=3D/var/lib/check_mk_agent/spool
CMK_NAME=3D$MEBASE
CMK_SPOOLFILE=3D600_`basename ${CMK_NAME}`.txt
CMK_HEADER=3D"<<<local>>>"
TMP_FILE=3D"/tmp/logical_replication_lag.csv"

# Schwellwerte
warn=3D300
crit=3D600

final_output=3D"$CMK_HEADER\nP $CMK_NAME "

# move to a directory where user postgresl may reside (sudo)
cd /tmp

# Lag auslesen. Waehrend dem initialen aufsynchen eines Subscribers gibt es temporaere Subscriptions, mit dem Namen reg_xxx1-pid-sync-pid.
# Damit diese nicht getrackt werden gibt es die huebsche LIKE Clause.
rm -f "$TMP_FILE"
sudo -u postgres psql regdns -c "COPY (SELECT subscriber,lag FROM
get_replication_lag() WHERE subscriber LIKE '%\_____' ORDER BY 2 DE= SC)
TO '$TMP_FILE' With CSV" 2>&1> /dev/null
LC=3D$(sudo -u postgres psql -t regdns -c "SELECT count(*) FROM
get_replication_lag();" | tr -d ' ')

if [ $LC =3D=3D "0" ]; then
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0echo -e "$CMK_HEADER\n0 $CMK_NAME - = No Slaves with Replication
found - maybe we are a slave?" > $CMK_SPOOLDIR/$CMK_SPOOLFILE
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0exit 0;
fi

grep $(hostname | cut -d '-' -f2) "$TMP_FILE" > /dev/n= ull
if [ $? !=3D "0" ]; then
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0echo -e "$CMK_HEADER\n2 $CMK_NAME - = Postgres Output does not
seem valid. Please check script $ME and output in $TMP_FILE" >
$CMK_SPOOLDIR/$CMK_SPOOLFILE
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0exit 1;
fi

# CSV in Array einlesen
IFS=3D$'\n' read -d '' -r -a input_file < "$TMP_FIL= E"

# Auswerten
maxlag=3D0
for i in "${input_file[@]}"; do
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0node=3D`echo $i | awk -F=C2=A0 ",&qu= ot; '{print $1}' | tr -- _ -`
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0lag=3D`echo $i | awk -F=C2=A0 ",&quo= t; '{print $2}'`
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0final_output=3D"$final_output$node= =3D$lag;$warn;$crit|"
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0#
https://unix.stackexchange.com/questions/186663= /is-there-a-unix-command-that-gives-the-minimum-maximum-of-two-numbers<= br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0maxlag=3D$(( maxlag > lag ? maxlag : l= ag ))
done
final_output=3D"${final_output}max-lag=3D$maxlag;$warn;$crit"

# Letztes Pipe Zeichen rausschneiden
#final_output=3D`echo $final_output | rev | cut -c 2- | rev`

# Spool File schreiben
echo -e $final_output > $CMK_SPOOLDIR/$CMK_SPOOLFILE
logger -t "$MEBASE" "$final_output"


) 200>/tmp/`basename $0`.exclusivelock


3. During initial sync I check the status on the subscriber. Once it has synced all tables of the publication, it will send me an email.
#
# Managed by Puppet:
modules/pdns/templates/check_pglogical_subscription.sh.erb
#

#
# This script checks and eventually creates the subscription, and wait
until the initial sync is finished
#

PUB=3Dregdns2020_pub
SLEEP=3D5
PREFIX=3Dcheck_pglogical_subscription.sh
NUMTAB=3D175

SECONDS=3D0
date
while true; do
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0echo "SELECT * from pg_subscription;= " | sudo -u postgres psql -t
regdns | grep -q $PUB
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if [ $? -eq 0 ]; then
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0echo "OK= : Host is subscribed to '$PUB'. Checking for
table count ..."
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0break
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0fi
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0echo "ERROR: Host is not subscribed = to '$PUB'. Subscribing to
master ..."
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0logger -t $PREFIX "ERROR: Host is no= t subscribed to '$PUB'.
Subscribing to master ..."
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0echo "CREATE SUBSCRIPTION `hostname = -s|tr -- - _` CONNECTION
'host=3DXXXXX dbname=3DXXXX user=3DXXXXX password=3DXXXXXX PUBLICATION =
regdns2020_pub;" | sudo -u postgres psql regdns && touch
/etc/regdns.schema_subscription.created
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0echo "Re-Checking in $SLEEP seconds = ..."
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0logger -t $PREFIX "Re-Checking in $S= LEEP seconds ..."
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0sleep $SLEEP
done

while true; do
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0COUNT=3D$(echo "SELECT count(*) from= pg_subscription_rel;" | sudo
-u postgres psql -t regdns | head -1 | xargs)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if [ $COUNT -eq $NUMTAB ]; then
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0echo "OK= : Subscription '$PUB' contains $NUMTAB tables -
that is OK. Checking for initial-sync status ..."
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0logger -t $PR= EFIX "OK: Subscription '$PUB' contains
$NUMTAB tables - that is OK. Checking for initial-sync status ..."
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0break
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0fi
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0echo "ERROR: Subscription '$PUB&= #39; contains $COUNT tables, but
should contain $NUMTAB table. Re-Checking in $SLEEP seconds ..."
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0logger -t $PREFIX=C2=A0 "ERROR: Subs= cription '$PUB' contains $COUNT
tables, but should contain $NUMTAB table. Re-Checking in $SLEEP seconds ..."
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0sleep $SLEEP
done

while true; do
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0COUNTFIN=3D$(echo "SELECT count(*) f= rom pg_subscription_rel WHERE
srsubstate=3D'r';" | sudo -u postgres psql -t regdns | head -1= | xargs)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if [ $COUNTFIN -eq $NUMTAB ]; then
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0echo "OK= : Initial sync of $COUNTFIN/$NUMTAB tables
finished in $SECONDS seconds."
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0logger -t $PR= EFIX "OK: Initial sync of $COUNTFIN/$NUMTAB
tables finished in $SECONDS seconds."
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0echo "OK= : Initial sync of $COUNTFIN/$NUMTAB tables
finished in $SECONDS seconds." | mailx -s "$HOST $SECONDS seconds= to
subscribe" -- root
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0break
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0fi
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0echo "PROGRESS: Initial sync of $COU= NTFIN/$NUMTAB tables
finished. Re-Checking in $SLEEP seconds ..."
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0logger -t $PREFIX=C2=A0 "PROGRESS: I= nitial sync of $COUNTFIN/$NUMTAB
tables finished. Re-Checking in $SLEEP seconds ..."
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0sleep $SLEEP
done



regards
Klaus




Am 2023-10-07 17:31, schrieb Shaheed Haque:
> Hi,
>
> I've been playing with logical replication (currently on PG14), > specifically in an AWS RDS Postgres context, but NOT using AWS' ow= n
> replication tooling. I'm generally familiar with the challenges of=
> distributed systems (such causality, time synchronisation etc), but > not especially familiar with PG.
>
> In looking at how to tell how a given subscriber has caught up with > its publisher, there is plenty of advice around the Web, for example > = https://dba.stackexchange.com/questions/314324/monitor-logical-replication-= using-lsn.
> Like this example, much advice ends up talking about using separate > queries on the publisher and the subscriber to compare LSNs. First, (I=
> think) I understand the core difficulty that comparing LSNs is
> inherently racy, but given that, I'm a bit unclear as to why a sin= gle
> query on the publisher is not enough...IIUC:
>
>=C2=A0 =C2=A0 =C2=A0 =C2=A0* Changes sent from the publisher to the sub= scriber are identified by
> LSN.
>=C2=A0 =C2=A0 =C2=A0 =C2=A0* The publisher knows it's own current l= atest LSN
> (pg_current_wal_lsn()), but this seems not to be exposed at the
> subscriber.
>=C2=A0 =C2=A0 =C2=A0 =C2=A0* The subscriber knows what it has applied l= ocally and even tells the
> publisher (pg_stat_subscription.latest_end_lsn), but it does not seem<= br> > to be exposed at the publisher.
>
> Have I missed something? Is there a way to track the LSN delta (given<= br> > that this is known to be racy) just by querying one end?
>
> Second, how do folk "know" when replication is "done&qu= ot;. For example, if
> the two LSNs continued to match for 1 * replication lag? Or N *
> replication lag? What would be a plausible N?
>
> Third, as we know when logical replication is started, the initial
> table state is captured in a snapshot, and sent across using COPY
> TABLE under the covers. Now, let's say that the publisher is idle<= br> > (i.e. no SQL writes to the user's schema...obviously pg_catalog mi= ght
> change as replication is configured and enabled) and that the
> replication starts with the publisher as LSN_start. How could one know=
> when the copying is done:
>
>=C2=A0 =C2=A0 =C2=A0 =C2=A0* I initially assumed that the publisher'= ;s LSN would not change from
> LSN_start, but as the copying proceeds, I see that it DOES change
> (presumably because there are updates happening to pg_catalog, such as=
> the temporary slots coming and going).
>=C2=A0 =C2=A0 =C2=A0 =C2=A0* Is there some kind of singleton state on e= ither publisher or
> subscriber that could be checked to know? (At the moment, I am
> counting the records in all copied tables).
>
> I realise that the knowledge that the publisher is "idle" is= a special
> case, but right now, my test for being "done" is:
>
>=C2=A0 =C2=A0 =C2=A0 =C2=A0* Number of records in copied tables matches= AND the publisher's
> pg_stat_subscription matches the subscriber's
> pg_stat_subscription.latest_end_lsn.
>
> Plus or minus the bit about replication lag, is there a better way? >
> Thanks, Shaheed
--00000000000051211e0610280b47--