Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sJUkf-005Cjb-LD for pgsql-general@arkaria.postgresql.org; Tue, 18 Jun 2024 09:03:41 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1sJUkd-00AC2h-CD for pgsql-general@arkaria.postgresql.org; Tue, 18 Jun 2024 09:03:40 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sJUkc-00AC2Z-PF for pgsql-general@lists.postgresql.org; Tue, 18 Jun 2024 09:03:39 +0000 Received: from mail-vk1-xa33.google.com ([2607:f8b0:4864:20::a33]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1sJUkV-001syt-TI for pgsql-general@lists.postgresql.org; Tue, 18 Jun 2024 09:03:38 +0000 Received: by mail-vk1-xa33.google.com with SMTP id 71dfb90a1353d-4ecf11aa0d6so1566237e0c.3 for ; Tue, 18 Jun 2024 02:03:31 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1718701410; x=1719306210; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=CjrMTyPAxAPX9GpNBTt062XTVsJ0ygOayV5NwAXoHXs=; b=LHfPzA+SdeNMQbSALKSRQh370dp5k3sRBk7VUMBx5DsGsRQa5N1+pO7vD8xyeSlM07 b8n81rUljMYOq1IU6g5qhKYbM66Vh3IlIFtMxPLKqKuLTDDRvqY7wz7kYKFOn7yo9Bq2 hfnZEVYuyxMg5//8D969UjecS2WhX7VjspT0LW0ZlI5CvrW81XuTNV68Lbvc/OS97hWW sA/2U6guY3auTC2kGWxcqMSCq/PWz5M3AVGoE3qXKlNdRhjPsGzTbL3NCQvlzLNc07WF c8EhXja1TmtdWJG+qCklHMGwT4n4aGvmAAARTtRFqGZdYz6L+dODTyiDVAle9aaimdRi 6t3A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1718701410; x=1719306210; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=CjrMTyPAxAPX9GpNBTt062XTVsJ0ygOayV5NwAXoHXs=; b=gVLwFAGQj5l6uSg5ohD0ckO4QnYlOciSOXyNbbMHo6rJ7M/q4U1lILKNMXT4QmfuQL 1y0uzudEwb4q1PzMVNG/doVQRut4A0Q4Uf0IaurQs6kXBQZtl1TmKxMSD8itkl79cu8r iy5b3PL/6D5LhBOwqY2DPeM6X6dFwG+XNUBiXbg1un2+ayL0vJYMIppRKBkaLmWByCHs Sr9tYgNy4wTVk1K+3i/Y4ZPek5yjqYMpMiJAO0w/PnA/AAVND/wm1aMesV8/h6DOtEbu y1kmMmt02A6x6iiNkwDhLvFUR4H1lwQGZ1+t/8/RYr+SdvCLdelvQyKAKLuF7ecO5bDg hIbA== X-Gm-Message-State: AOJu0YxM3SiYK5puTYAqNSiTu3cth9L6p6WKgcDeQMNv7XkiFIzlGX0K rXtBrFfZ1uYORZVIveSBmBzMI2fKr4mnqVm3BZAcZKcSWw5dTPxwIwv8mbOWDCExPGg1IrxGE/t Ea3d/XqQZvkWcII5bjMpCym8KuylAidF0 X-Google-Smtp-Source: AGHT+IGGIiyEs3L5HeZw/ZaixunZXXVkUslDavbPAeruxyIjfT5Ko1o/KJp+dZNveRaT55i818liywmRghSpwUSDX6c= X-Received: by 2002:a05:6122:c9f:b0:4ec:fc20:a51a with SMTP id 71dfb90a1353d-4ee3e980ac8mr11124608e0c.9.1718701410212; Tue, 18 Jun 2024 02:03:30 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Shaheed Haque Date: Tue, 18 Jun 2024 10:03:19 +0100 Message-ID: Subject: Re: Monitoring logical replication To: pgsql-general list Cc: Klaus Darilion Content-Type: multipart/alternative; boundary="00000000000007e4d3061b265b86" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --00000000000007e4d3061b265b86 Content-Type: text/plain; charset="UTF-8" Hi all, Is there an "official" pairing of LSN values on the publication and subscription sides that should be used to track the delta between the two systems? I ask because Google is full of different pairs being used. I tried to identify the highest level interface points exposed, i.e. what is documented on https://www.postgresql.org/docs/current/replication-origins.html, the pg_stat_subscription table, the pg_stat_publication table and the pg_current_wal_lsn() function on the publisher, but these seem to be barely used. Am I missing something? Thanks, Shaheed P.S. On a related note, I see a (stalled?) discussion on providing LSN -> timestamp conversion , I'd just like to say that something like that would be very useful. On Tue, 30 Jan 2024 at 11:27, Shaheed Haque wrote: > This is great, thank you for posting. I'm currently a subcontinent or two > away from my dev env, but will compare your approach with mine (you are > using some facilities of psql I'm not familiar with). At least you have > confirmed that LSNs are the place to start. > > Thanks again, Shaheed > > > On Tue, 30 Jan 2024, 05:15 Klaus Darilion, > wrote: > >> Hi Saheed! >> >> I monitor our replication this way: >> >> 1. Every 10 seconds i fetch the current LSN and write it into a table, >> next with the current timestamp. Further I fetch confirmend LSNs from >> the replication slots and delete old entries in lsn2data table. >> >> calculate_logical_replication_lag.php: >> >> > >> $path = realpath(dirname(__FILE__) . "/../inc"); >> set_include_path($path . PATH_SEPARATOR . get_include_path()); >> >> require_once('config.php'); >> $config_int['syslogprefix'] = basename(__FILE__); >> require_once('logging.php'); >> >> $dbuser="replication_lag_user"; >> $dbpass="XXXXXXXXXXXXXXXXXXXX"; >> if (!$dbconn = pg_pconnect('host='.$config_int['dbhost'].' >> dbname='.$config_int['dbname'].' user='.$dbuser.' password='.$dbpass)) { >> print "Sorry, database connection failed"; >> exit; >> } >> >> $accuracy = 10; // in seconds >> >> // >> // Preparations: >> // >> // CREATE TABLE lsn2data( >> // lsn pg_lsn PRIMARY KEY, >> // seen timestamp NOT NULL DEFAULT NOW() >> // ); >> // CREATE ROLE replication_lag_user WITH LOGIN PASSWORD >> 'XXXXXXXXXXXXXXXXXXX'; >> // GRANT ALL ON TABLE lsn2data TO replication_lag_user; >> // >> // CREATE OR REPLACE FUNCTION get_replication_lag() RETURNS TABLE >> (subscriber name, lag bigint) AS >> // $BODY$ >> // DECLARE >> // subscriber name; >> // BEGIN >> // FOR subscriber IN >> // SELECT slot_name FROM pg_replication_slots >> // LOOP >> // RETURN QUERY SELECT slot_name, EXTRACT(EPOCH FROM >> NOW()-seen)::bigint lag from lsn2data,pg_replication_slots WHERE >> slot_name=subscriber AND lsn < confirmed_flush_lsn ORDER BY lsn DESC >> LIMIT 1; >> // END LOOP; >> // RETURN; >> // END >> // $BODY$ >> // LANGUAGE plpgsql; >> // >> while (1) { >> $dbq = pg_query("INSERT INTO lsn2data (lsn) VALUES >> (pg_current_wal_lsn())"); >> if ($dbq === FALSE) { >> mylog(LOG_ERROR, "SQL query error: >> ".pg_last_error()."\n"); >> exit(1); >> } >> >> $dbq = pg_query("DELETE FROM lsn2data WHERE lsn < (". >> "SELECT lsn FROM lsn2data WHERE lsn < (". >> "SELECT confirmed_flush_lsn FROM >> pg_replication_slots ORDER BY confirmed_flush_lsn ASC LIMIT 1". >> ") ORDER BY lsn DESC LIMIT 1". >> ")" >> ); >> if ($dbq === FALSE) { >> mylog(LOG_ERROR, "SQL query error: >> ".pg_last_error()."\n"); >> exit(1); >> } >> sleep($accuracy); >> } >> >> 2. I graph the replications lags (converted from LSN to seconds) in my >> check_mk monitoring: >> >> #!/bin/bash >> >> # >> # Managed by Puppet: >> modules/base/files/monitoring/check_logical_replication_lag.sh >> # >> # Check the logical replication lag and export performance data for each >> subscriber >> # >> >> # exit on error >> #set -e >> >> #Make sure this script only runs one at a time >> ( >> >> ME=$0 >> MEBASE=`basename $0` >> >> mylog () { >> echo "$MEBASE: $1" >> logger -t "$MEBASE" "$1" >> } >> >> flock -x -w 1 200 >> if [ $? != "0" ]; then >> #echo "ERROR: $0 is already running ... exit" >> logger -t "$MEBASE" "ERROR: $0 is already running ... exit" >> exit 1 >> fi >> >> # Do stuff >> >> # Variablen fuer Monitoring >> CMK_SPOOLDIR=/var/lib/check_mk_agent/spool >> CMK_NAME=$MEBASE >> CMK_SPOOLFILE=600_`basename ${CMK_NAME}`.txt >> CMK_HEADER="<<>>" >> TMP_FILE="/tmp/logical_replication_lag.csv" >> >> # Schwellwerte >> warn=300 >> crit=600 >> >> final_output="$CMK_HEADER\nP $CMK_NAME " >> >> # move to a directory where user postgresl may reside (sudo) >> cd /tmp >> >> # Lag auslesen. Waehrend dem initialen aufsynchen eines Subscribers gibt >> es temporaere Subscriptions, mit dem Namen reg_xxx1-pid-sync-pid. >> # Damit diese nicht getrackt werden gibt es die huebsche LIKE Clause. >> rm -f "$TMP_FILE" >> sudo -u postgres psql regdns -c "COPY (SELECT subscriber,lag FROM >> get_replication_lag() WHERE subscriber LIKE '%\_____' ORDER BY 2 DESC) >> TO '$TMP_FILE' With CSV" 2>&1> /dev/null >> LC=$(sudo -u postgres psql -t regdns -c "SELECT count(*) FROM >> get_replication_lag();" | tr -d ' ') >> >> if [ $LC == "0" ]; then >> echo -e "$CMK_HEADER\n0 $CMK_NAME - No Slaves with Replication >> found - maybe we are a slave?" > $CMK_SPOOLDIR/$CMK_SPOOLFILE >> exit 0; >> fi >> >> grep $(hostname | cut -d '-' -f2) "$TMP_FILE" > /dev/null >> if [ $? != "0" ]; then >> echo -e "$CMK_HEADER\n2 $CMK_NAME - Postgres Output does not >> seem valid. Please check script $ME and output in $TMP_FILE" > >> $CMK_SPOOLDIR/$CMK_SPOOLFILE >> exit 1; >> fi >> >> # CSV in Array einlesen >> IFS=$'\n' read -d '' -r -a input_file < "$TMP_FILE" >> >> # Auswerten >> maxlag=0 >> for i in "${input_file[@]}"; do >> node=`echo $i | awk -F "," '{print $1}' | tr -- _ -` >> lag=`echo $i | awk -F "," '{print $2}'` >> final_output="$final_output$node=$lag;$warn;$crit|" >> # >> >> https://unix.stackexchange.com/questions/186663/is-there-a-unix-command-that-gives-the-minimum-maximum-of-two-numbers >> maxlag=$(( maxlag > lag ? maxlag : lag )) >> done >> final_output="${final_output}max-lag=$maxlag;$warn;$crit" >> >> # Letztes Pipe Zeichen rausschneiden >> #final_output=`echo $final_output | rev | cut -c 2- | rev` >> >> # Spool File schreiben >> echo -e $final_output > $CMK_SPOOLDIR/$CMK_SPOOLFILE >> logger -t "$MEBASE" "$final_output" >> >> >> ) 200>/tmp/`basename $0`.exclusivelock >> >> >> 3. During initial sync I check the status on the subscriber. Once it has >> synced all tables of the publication, it will send me an email. >> # >> # Managed by Puppet: >> modules/pdns/templates/check_pglogical_subscription.sh.erb >> # >> >> # >> # This script checks and eventually creates the subscription, and wait >> until the initial sync is finished >> # >> >> PUB=regdns2020_pub >> SLEEP=5 >> PREFIX=check_pglogical_subscription.sh >> NUMTAB=175 >> >> SECONDS=0 >> date >> while true; do >> echo "SELECT * from pg_subscription;" | sudo -u postgres psql -t >> regdns | grep -q $PUB >> if [ $? -eq 0 ]; then >> echo "OK: Host is subscribed to '$PUB'. Checking for >> table count ..." >> break >> fi >> echo "ERROR: Host is not subscribed to '$PUB'. Subscribing to >> master ..." >> logger -t $PREFIX "ERROR: Host is not subscribed to '$PUB'. >> Subscribing to master ..." >> echo "CREATE SUBSCRIPTION `hostname -s|tr -- - _` CONNECTION >> 'host=XXXXX dbname=XXXX user=XXXXX password=XXXXXX PUBLICATION >> regdns2020_pub;" | sudo -u postgres psql regdns && touch >> /etc/regdns.schema_subscription.created >> echo "Re-Checking in $SLEEP seconds ..." >> logger -t $PREFIX "Re-Checking in $SLEEP seconds ..." >> sleep $SLEEP >> done >> >> while true; do >> COUNT=$(echo "SELECT count(*) from pg_subscription_rel;" | sudo >> -u postgres psql -t regdns | head -1 | xargs) >> if [ $COUNT -eq $NUMTAB ]; then >> echo "OK: Subscription '$PUB' contains $NUMTAB tables - >> that is OK. Checking for initial-sync status ..." >> logger -t $PREFIX "OK: Subscription '$PUB' contains >> $NUMTAB tables - that is OK. Checking for initial-sync status ..." >> break >> fi >> echo "ERROR: Subscription '$PUB' contains $COUNT tables, but >> should contain $NUMTAB table. Re-Checking in $SLEEP seconds ..." >> logger -t $PREFIX "ERROR: Subscription '$PUB' contains $COUNT >> tables, but should contain $NUMTAB table. Re-Checking in $SLEEP seconds >> ..." >> sleep $SLEEP >> done >> >> while true; do >> COUNTFIN=$(echo "SELECT count(*) from pg_subscription_rel WHERE >> srsubstate='r';" | sudo -u postgres psql -t regdns | head -1 | xargs) >> if [ $COUNTFIN -eq $NUMTAB ]; then >> echo "OK: Initial sync of $COUNTFIN/$NUMTAB tables >> finished in $SECONDS seconds." >> logger -t $PREFIX "OK: Initial sync of $COUNTFIN/$NUMTAB >> tables finished in $SECONDS seconds." >> echo "OK: Initial sync of $COUNTFIN/$NUMTAB tables >> finished in $SECONDS seconds." | mailx -s "$HOST $SECONDS seconds to >> subscribe" -- root >> break >> fi >> echo "PROGRESS: Initial sync of $COUNTFIN/$NUMTAB tables >> finished. Re-Checking in $SLEEP seconds ..." >> logger -t $PREFIX "PROGRESS: Initial sync of $COUNTFIN/$NUMTAB >> tables finished. Re-Checking in $SLEEP seconds ..." >> sleep $SLEEP >> done >> >> >> >> regards >> Klaus >> >> >> >> >> Am 2023-10-07 17:31, schrieb Shaheed Haque: >> > Hi, >> > >> > I've been playing with logical replication (currently on PG14), >> > specifically in an AWS RDS Postgres context, but NOT using AWS' own >> > replication tooling. I'm generally familiar with the challenges of >> > distributed systems (such causality, time synchronisation etc), but >> > not especially familiar with PG. >> > >> > In looking at how to tell how a given subscriber has caught up with >> > its publisher, there is plenty of advice around the Web, for example >> > >> https://dba.stackexchange.com/questions/314324/monitor-logical-replication-using-lsn >> . >> > Like this example, much advice ends up talking about using separate >> > queries on the publisher and the subscriber to compare LSNs. First, (I >> > think) I understand the core difficulty that comparing LSNs is >> > inherently racy, but given that, I'm a bit unclear as to why a single >> > query on the publisher is not enough...IIUC: >> > >> > * Changes sent from the publisher to the subscriber are >> identified by >> > LSN. >> > * The publisher knows it's own current latest LSN >> > (pg_current_wal_lsn()), but this seems not to be exposed at the >> > subscriber. >> > * The subscriber knows what it has applied locally and even tells >> the >> > publisher (pg_stat_subscription.latest_end_lsn), but it does not seem >> > to be exposed at the publisher. >> > >> > Have I missed something? Is there a way to track the LSN delta (given >> > that this is known to be racy) just by querying one end? >> > >> > Second, how do folk "know" when replication is "done". For example, if >> > the two LSNs continued to match for 1 * replication lag? Or N * >> > replication lag? What would be a plausible N? >> > >> > Third, as we know when logical replication is started, the initial >> > table state is captured in a snapshot, and sent across using COPY >> > TABLE under the covers. Now, let's say that the publisher is idle >> > (i.e. no SQL writes to the user's schema...obviously pg_catalog might >> > change as replication is configured and enabled) and that the >> > replication starts with the publisher as LSN_start. How could one know >> > when the copying is done: >> > >> > * I initially assumed that the publisher's LSN would not change >> from >> > LSN_start, but as the copying proceeds, I see that it DOES change >> > (presumably because there are updates happening to pg_catalog, such as >> > the temporary slots coming and going). >> > * Is there some kind of singleton state on either publisher or >> > subscriber that could be checked to know? (At the moment, I am >> > counting the records in all copied tables). >> > >> > I realise that the knowledge that the publisher is "idle" is a special >> > case, but right now, my test for being "done" is: >> > >> > * Number of records in copied tables matches AND the publisher's >> > pg_stat_subscription matches the subscriber's >> > pg_stat_subscription.latest_end_lsn. >> > >> > Plus or minus the bit about replication lag, is there a better way? >> > >> > Thanks, Shaheed >> > --00000000000007e4d3061b265b86 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hi all,

Is there an "of= ficial" pairing of LSN values on the publication and subscription side= s that should be used to track the delta between the two systems? I ask bec= ause Google is full of different pairs being used. I tried to identify the = highest level interface points exposed, i.e. what is documented on https= ://www.postgresql.org/docs/current/replication-origins.html,=C2=A0the p= g_stat_subscription table, the pg_stat_publication table and the=20 pg_current_wal_lsn() function on the publisher, but these seem to be barely= used.

Am I missing something?

Thanks, Shaheed

P.S. On a related note, I = see a (stalled?) discussion on providing LSN -> timestamp conve= rsion, I'd just like to say that something like that would be very = useful.

=C2=A0

On Tue, 30 Jan 2024 at 11:27, = Shaheed Haque <shaheedhaque@gm= ail.com> wrote:
This is great, thank you for posting. I'm= currently a subcontinent or two away from my dev env, but will compare you= r approach with mine (you are using some facilities of psql I'm not fam= iliar with). At least you have confirmed that LSNs are the place to start.= =C2=A0

Thanks again, Sha= heed


On Tue, 30 Jan 2024, 05:15 Klaus Dar= ilion, <klaus.mailinglists@pernau.at> wrote:
Hi Saheed!

I monitor our replication this way:

1. Every 10 seconds i fetch the current LSN and write it into a table,
next with the current timestamp. Further I fetch confirmend LSNs from
the replication slots and delete old entries in lsn2data table.

calculate_logical_replication_lag.php:

<?php

$path =3D realpath(dirname(__FILE__) . "/../inc");
set_include_path($path . PATH_SEPARATOR . get_include_path());

require_once('config.php');
$config_int['syslogprefix'] =3D basename(__FILE__);
require_once('logging.php');

$dbuser=3D"replication_lag_user";
$dbpass=3D"XXXXXXXXXXXXXXXXXXXX";
if (!$dbconn =3D pg_pconnect('host=3D'.$config_int['dbhost'= ].'
dbname=3D'.$config_int['dbname'].' user=3D'.$dbuser.= 9; password=3D'.$dbpass)) {
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0print "Sorry, database connection fa= iled";
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0exit;
}

$accuracy =3D 10; // in seconds

//
// Preparations:
//
// CREATE TABLE lsn2data(
//=C2=A0 =C2=A0 lsn pg_lsn PRIMARY KEY,
//=C2=A0 =C2=A0 seen timestamp NOT NULL DEFAULT NOW()
// );
// CREATE ROLE replication_lag_user WITH LOGIN PASSWORD
'XXXXXXXXXXXXXXXXXXX';
// GRANT ALL ON TABLE lsn2data TO replication_lag_user;
//
// CREATE OR REPLACE FUNCTION get_replication_lag() RETURNS TABLE
(subscriber name, lag bigint) AS
// $BODY$
// DECLARE
//=C2=A0 =C2=A0 =C2=A0subscriber name;
// BEGIN
//=C2=A0 =C2=A0 =C2=A0FOR subscriber IN
//=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0SELECT slot_name FROM pg_replication_sl= ots
//=C2=A0 =C2=A0 =C2=A0LOOP
//=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0RETURN QUERY SELECT slot_name, EXTRACT(= EPOCH FROM
NOW()-seen)::bigint lag from lsn2data,pg_replication_slots WHERE
slot_name=3Dsubscriber AND lsn < confirmed_flush_lsn ORDER BY lsn DESC <= br> LIMIT 1;
//=C2=A0 =C2=A0 =C2=A0END LOOP;
//=C2=A0 =C2=A0 =C2=A0RETURN;
// END
// $BODY$
// LANGUAGE plpgsql;
//
while (1) {
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0$dbq =3D pg_query("INSERT INTO lsn2d= ata (lsn) VALUES
(pg_current_wal_lsn())");
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if ($dbq =3D=3D=3D FALSE) {
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0mylog(LOG_ERR= OR, "SQL query error:
".pg_last_error()."\n");
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0exit(1);
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0}

=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0$dbq =3D pg_query("DELETE FROM lsn2d= ata WHERE lsn < (".
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0"SELECT lsn FROM lsn2data WHERE lsn < (&quo= t;.
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0"SELECT confirmed_flush_lsn FROM
pg_replication_slots ORDER BY confirmed_flush_lsn ASC LIMIT 1".
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0") ORDER BY lsn DESC LIMIT 1".
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0")"
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0);
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if ($dbq =3D=3D=3D FALSE) {
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0mylog(LOG_ERR= OR, "SQL query error:
".pg_last_error()."\n");
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0exit(1);
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0}
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0sleep($accuracy);
}

2. I graph the replications lags (converted from LSN to seconds) in my
check_mk monitoring:

#!/bin/bash

#
# Managed by Puppet:
modules/base/files/monitoring/check_logical_replication_lag.sh
#
# Check the logical replication lag and export performance data for each subscriber
#

# exit on error
#set -e

#Make sure this script only runs one at a time
(

=C2=A0 =C2=A0ME=3D$0
=C2=A0 =C2=A0MEBASE=3D`basename $0`

=C2=A0 =C2=A0mylog () {
=C2=A0 =C2=A0 =C2=A0echo "$MEBASE: $1"
=C2=A0 =C2=A0 =C2=A0logger -t "$MEBASE" "$1"
=C2=A0 =C2=A0}

=C2=A0 =C2=A0flock -x -w 1 200
=C2=A0 =C2=A0if [ $? !=3D "0" ]; then
=C2=A0 =C2=A0 =C2=A0#echo "ERROR: $0 is already running ... exit"=
=C2=A0 =C2=A0 =C2=A0logger -t "$MEBASE" "ERROR: $0 is alread= y running ... exit"
=C2=A0 =C2=A0 =C2=A0exit 1
=C2=A0 =C2=A0fi

=C2=A0 =C2=A0# Do stuff

# Variablen fuer Monitoring
CMK_SPOOLDIR=3D/var/lib/check_mk_agent/spool
CMK_NAME=3D$MEBASE
CMK_SPOOLFILE=3D600_`basename ${CMK_NAME}`.txt
CMK_HEADER=3D"<<<local>>>"
TMP_FILE=3D"/tmp/logical_replication_lag.csv"

# Schwellwerte
warn=3D300
crit=3D600

final_output=3D"$CMK_HEADER\nP $CMK_NAME "

# move to a directory where user postgresl may reside (sudo)
cd /tmp

# Lag auslesen. Waehrend dem initialen aufsynchen eines Subscribers gibt es temporaere Subscriptions, mit dem Namen reg_xxx1-pid-sync-pid.
# Damit diese nicht getrackt werden gibt es die huebsche LIKE Clause.
rm -f "$TMP_FILE"
sudo -u postgres psql regdns -c "COPY (SELECT subscriber,lag FROM
get_replication_lag() WHERE subscriber LIKE '%\_____' ORDER BY 2 DE= SC)
TO '$TMP_FILE' With CSV" 2>&1> /dev/null
LC=3D$(sudo -u postgres psql -t regdns -c "SELECT count(*) FROM
get_replication_lag();" | tr -d ' ')

if [ $LC =3D=3D "0" ]; then
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0echo -e "$CMK_HEADER\n0 $CMK_NAME - = No Slaves with Replication
found - maybe we are a slave?" > $CMK_SPOOLDIR/$CMK_SPOOLFILE
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0exit 0;
fi

grep $(hostname | cut -d '-' -f2) "$TMP_FILE" > /dev/n= ull
if [ $? !=3D "0" ]; then
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0echo -e "$CMK_HEADER\n2 $CMK_NAME - = Postgres Output does not
seem valid. Please check script $ME and output in $TMP_FILE" >
$CMK_SPOOLDIR/$CMK_SPOOLFILE
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0exit 1;
fi

# CSV in Array einlesen
IFS=3D$'\n' read -d '' -r -a input_file < "$TMP_FIL= E"

# Auswerten
maxlag=3D0
for i in "${input_file[@]}"; do
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0node=3D`echo $i | awk -F=C2=A0 ",&qu= ot; '{print $1}' | tr -- _ -`
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0lag=3D`echo $i | awk -F=C2=A0 ",&quo= t; '{print $2}'`
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0final_output=3D"$final_output$node= =3D$lag;$warn;$crit|"
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0#
https://unix.stackexchange.com/questions/186663= /is-there-a-unix-command-that-gives-the-minimum-maximum-of-two-numbers<= br> =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0maxlag=3D$(( maxlag > lag ? maxlag : l= ag ))
done
final_output=3D"${final_output}max-lag=3D$maxlag;$warn;$crit"

# Letztes Pipe Zeichen rausschneiden
#final_output=3D`echo $final_output | rev | cut -c 2- | rev`

# Spool File schreiben
echo -e $final_output > $CMK_SPOOLDIR/$CMK_SPOOLFILE
logger -t "$MEBASE" "$final_output"


) 200>/tmp/`basename $0`.exclusivelock


3. During initial sync I check the status on the subscriber. Once it has synced all tables of the publication, it will send me an email.
#
# Managed by Puppet:
modules/pdns/templates/check_pglogical_subscription.sh.erb
#

#
# This script checks and eventually creates the subscription, and wait
until the initial sync is finished
#

PUB=3Dregdns2020_pub
SLEEP=3D5
PREFIX=3Dcheck_pglogical_subscription.sh
NUMTAB=3D175

SECONDS=3D0
date
while true; do
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0echo "SELECT * from pg_subscription;= " | sudo -u postgres psql -t
regdns | grep -q $PUB
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if [ $? -eq 0 ]; then
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0echo "OK= : Host is subscribed to '$PUB'. Checking for
table count ..."
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0break
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0fi
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0echo "ERROR: Host is not subscribed = to '$PUB'. Subscribing to
master ..."
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0logger -t $PREFIX "ERROR: Host is no= t subscribed to '$PUB'.
Subscribing to master ..."
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0echo "CREATE SUBSCRIPTION `hostname = -s|tr -- - _` CONNECTION
'host=3DXXXXX dbname=3DXXXX user=3DXXXXX password=3DXXXXXX PUBLICATION =
regdns2020_pub;" | sudo -u postgres psql regdns && touch
/etc/regdns.schema_subscription.created
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0echo "Re-Checking in $SLEEP seconds = ..."
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0logger -t $PREFIX "Re-Checking in $S= LEEP seconds ..."
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0sleep $SLEEP
done

while true; do
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0COUNT=3D$(echo "SELECT count(*) from= pg_subscription_rel;" | sudo
-u postgres psql -t regdns | head -1 | xargs)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if [ $COUNT -eq $NUMTAB ]; then
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0echo "OK= : Subscription '$PUB' contains $NUMTAB tables -
that is OK. Checking for initial-sync status ..."
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0logger -t $PR= EFIX "OK: Subscription '$PUB' contains
$NUMTAB tables - that is OK. Checking for initial-sync status ..."
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0break
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0fi
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0echo "ERROR: Subscription '$PUB&= #39; contains $COUNT tables, but
should contain $NUMTAB table. Re-Checking in $SLEEP seconds ..."
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0logger -t $PREFIX=C2=A0 "ERROR: Subs= cription '$PUB' contains $COUNT
tables, but should contain $NUMTAB table. Re-Checking in $SLEEP seconds ..."
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0sleep $SLEEP
done

while true; do
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0COUNTFIN=3D$(echo "SELECT count(*) f= rom pg_subscription_rel WHERE
srsubstate=3D'r';" | sudo -u postgres psql -t regdns | head -1= | xargs)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if [ $COUNTFIN -eq $NUMTAB ]; then
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0echo "OK= : Initial sync of $COUNTFIN/$NUMTAB tables
finished in $SECONDS seconds."
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0logger -t $PR= EFIX "OK: Initial sync of $COUNTFIN/$NUMTAB
tables finished in $SECONDS seconds."
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0echo "OK= : Initial sync of $COUNTFIN/$NUMTAB tables
finished in $SECONDS seconds." | mailx -s "$HOST $SECONDS seconds= to
subscribe" -- root
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0break
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0fi
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0echo "PROGRESS: Initial sync of $COU= NTFIN/$NUMTAB tables
finished. Re-Checking in $SLEEP seconds ..."
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0logger -t $PREFIX=C2=A0 "PROGRESS: I= nitial sync of $COUNTFIN/$NUMTAB
tables finished. Re-Checking in $SLEEP seconds ..."
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0sleep $SLEEP
done



regards
Klaus




Am 2023-10-07 17:31, schrieb Shaheed Haque:
> Hi,
>
> I've been playing with logical replication (currently on PG14), > specifically in an AWS RDS Postgres context, but NOT using AWS' ow= n
> replication tooling. I'm generally familiar with the challenges of=
> distributed systems (such causality, time synchronisation etc), but > not especially familiar with PG.
>
> In looking at how to tell how a given subscriber has caught up with > its publisher, there is plenty of advice around the Web, for example > = https://dba.stackexchange.com/questions/314324/monitor-logical-replication-= using-lsn.
> Like this example, much advice ends up talking about using separate > queries on the publisher and the subscriber to compare LSNs. First, (I=
> think) I understand the core difficulty that comparing LSNs is
> inherently racy, but given that, I'm a bit unclear as to why a sin= gle
> query on the publisher is not enough...IIUC:
>
>=C2=A0 =C2=A0 =C2=A0 =C2=A0* Changes sent from the publisher to the sub= scriber are identified by
> LSN.
>=C2=A0 =C2=A0 =C2=A0 =C2=A0* The publisher knows it's own current l= atest LSN
> (pg_current_wal_lsn()), but this seems not to be exposed at the
> subscriber.
>=C2=A0 =C2=A0 =C2=A0 =C2=A0* The subscriber knows what it has applied l= ocally and even tells the
> publisher (pg_stat_subscription.latest_end_lsn), but it does not seem<= br> > to be exposed at the publisher.
>
> Have I missed something? Is there a way to track the LSN delta (given<= br> > that this is known to be racy) just by querying one end?
>
> Second, how do folk "know" when replication is "done&qu= ot;. For example, if
> the two LSNs continued to match for 1 * replication lag? Or N *
> replication lag? What would be a plausible N?
>
> Third, as we know when logical replication is started, the initial
> table state is captured in a snapshot, and sent across using COPY
> TABLE under the covers. Now, let's say that the publisher is idle<= br> > (i.e. no SQL writes to the user's schema...obviously pg_catalog mi= ght
> change as replication is configured and enabled) and that the
> replication starts with the publisher as LSN_start. How could one know=
> when the copying is done:
>
>=C2=A0 =C2=A0 =C2=A0 =C2=A0* I initially assumed that the publisher'= ;s LSN would not change from
> LSN_start, but as the copying proceeds, I see that it DOES change
> (presumably because there are updates happening to pg_catalog, such as=
> the temporary slots coming and going).
>=C2=A0 =C2=A0 =C2=A0 =C2=A0* Is there some kind of singleton state on e= ither publisher or
> subscriber that could be checked to know? (At the moment, I am
> counting the records in all copied tables).
>
> I realise that the knowledge that the publisher is "idle" is= a special
> case, but right now, my test for being "done" is:
>
>=C2=A0 =C2=A0 =C2=A0 =C2=A0* Number of records in copied tables matches= AND the publisher's
> pg_stat_subscription matches the subscriber's
> pg_stat_subscription.latest_end_lsn.
>
> Plus or minus the bit about replication lag, is there a better way? >
> Thanks, Shaheed
--00000000000007e4d3061b265b86--