Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1slTAd-007y10-U1 for pgsql-general@arkaria.postgresql.org; Tue, 03 Sep 2024 13:02:08 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1slTAd-00BWLD-3p for pgsql-general@arkaria.postgresql.org; Tue, 03 Sep 2024 13:02:07 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1slTAc-00BWL4-Iw for pgsql-general@lists.postgresql.org; Tue, 03 Sep 2024 13:02:06 +0000 Received: from mail-qt1-x82b.google.com ([2607:f8b0:4864:20::82b]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1slTAZ-000S4j-Tt for pgsql-general@lists.postgresql.org; Tue, 03 Sep 2024 13:02:05 +0000 Received: by mail-qt1-x82b.google.com with SMTP id d75a77b69052e-456930b923fso21084411cf.1 for ; Tue, 03 Sep 2024 06:02:03 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1725368523; x=1725973323; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=t3TP/p5hJp+eiCoaocJpBS/t9WzqQV8dMohusKXz5aQ=; b=nW9Q/0aK//n2SM4hWfUOXGaUG6E9HAMuu+aFWVr6+b64xpjas7y0+4suo6riKoFLpa 0Mc+oKA27yOTREDvtzX05LBWkqA3u1k3DypSgy34vl2E+UAfCnTFJTVpeWf4fJjHIFIb hhTsSmgo/SGUfCDYdwTGAkXrA1w/pUVWb2jT8mcp7e5cegEsnJGaz3e3J6/VCgsBORxJ 3LITcUsGiGLLJGw8CCmv2lSFghWldFsunt7ngC0DJn8aQE2RNqsH5dSDB+OvJ7Ucn3yL 5kI4+FFQP70qNcBud3KBwssb1VOJj3XYcw22Pni+re8rBskUl7JeGjZViTpy4d3L8pgo P+ZA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1725368523; x=1725973323; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=t3TP/p5hJp+eiCoaocJpBS/t9WzqQV8dMohusKXz5aQ=; b=fwlG/E9rju8CI9O1OGeD7d0Uqr+kV7gKn4hpNGkSExjGYeOIcdptu88u5GD6XLzafB mdm4welTvgrg3t+RF1Imol4Tbsnali9HnofvUsOgcEY8f5Rn8ZqgXapFOnOiiMNbcOl2 coS2P9ldW40RQNkyCMk4ZJnq5zdPNlyIfug5EvEukCDnCVBK7QlrRsVLObiqOWIGypDC mAYdR848gkSZDTV64VFpyGeWc3qgkY1hsoi2qc+JlNYpSjEQaMTxmjlcd+5GTps1OEWR AuNsiCruCsCjpFc6P4nNEO75A4NJsM0ak7ac5gBlo6HXkUIH5KRiAmP9JCunnUBrh7en 5FLg== X-Gm-Message-State: AOJu0Yx9V4IWb9/f1UzMrJilXwKyjUeiFwVSm82KJ5+EM97nX27X+nSN xiAQQ0tjn6hM7CfR6yilhU7GGRnGI29U4GE8+eKfZsSu7GroXzfPm3XDjHRM6+TS3YDiQHhsLR7 aedzQ49Nf/zh4WsK+07I4V/hs5O0= X-Google-Smtp-Source: AGHT+IHgU2eGukZKNyNKrlCROXst8UXn9mk4Pt9rFQZzjqy7kWu+vG9011UdpFYdQQK22Lnq6vsHdC4NoDeF9Hz+5FM= X-Received: by 2002:a05:622a:90b:b0:456:8102:5d64 with SMTP id d75a77b69052e-456f1669e8fmr155713761cf.3.1725368522272; Tue, 03 Sep 2024 06:02:02 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Shaheed Haque Date: Tue, 3 Sep 2024 14:01:50 +0100 Message-ID: Subject: Re: Monitoring logical replication To: Ron Johnson Cc: pgsql-general list Content-Type: multipart/alternative; boundary="000000000000e0776e062136a975" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000e0776e062136a975 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Mostly to close the loop on this, now that I have things going seemingly reliably... On Tue, 18 Jun 2024 at 14:33, Ron Johnson wrote: > On Tue, Jun 18, 2024 at 5:03=E2=80=AFAM Shaheed Haque > wrote: > >> Hi all, >> >> Is there an "official" pairing of LSN values on the publication and >> subscription sides that should be used to track the delta between the tw= o >> systems? I ask because Google is full of different pairs being used. I >> tried to identify the highest level interface points exposed, i.e. what = is >> documented on >> https://www.postgresql.org/docs/current/replication-origins.html, the >> pg_stat_subscription table, the pg_stat_publication table and the >> pg_current_wal_lsn() function on the publisher, but these seem to be bar= ely >> used. >> > > The attached scripts (whose guts I took from a Stack Exchange post) might > be a good starting point. It certainly works for physical replication! > > >> P.S. On a related note, I see a (stalled?) discussion on providing LSN >> -> timestamp conversion >> , >> I'd just like to say that something like that would be very useful. >> > > Out of curiosity, how does that work? Is an instance's initial LSN reall= y > based on Epoch? > According to the docs at https://www.postgresql.org/docs/current/datatype-pg-lsn.html, the LSN is "a 64-bit integer, representing a byte position in the write-ahead log stream", so I guess some black magic is required to turn that into a timestamp. My use case might not be all that common, as I am NOT interested in a long-term replica. What I am doing is making a copy of an "old" Django deployment to a "new" deployment such that when the two ends are in close sync, I can freeze traffic to the old deployment, pause for any final catchup, and then run a Django migration on the new, before switching on the new (thereby minimising the down time for the app). That being said: - It turns out that one cannot use the LSN alone to check for sync since, during the initial "full table copying" phase of the replication, the LSNs at the two ends seem to be the same. - Instead, I track three metrics on each end: - On the publication end: - "count(*)" for each table - pg_current_wal_lsn() - tuples_processed from pg_stat_progress_copy for each table - On the subscription end: - "count(*)" for each table - latest_end_lsn from pg_stat_subscription - tuples_processed from pg_stat_progress_copy for each table - The sync is considered complete when all three metrics are aligned. - I then freeze activity on the "old" deployment, wait for any in-flight metric changes to come through, and then run the migration on the "new" deployment. As always perhaps that will be of use to somebody, but YMMV. Thanks all, Shaheed --000000000000e0776e062136a975 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Mostly to close the loop on this, no= w that I have things going seemingly reliably...

=
On Tue, 18= Jun 2024 at 14:33, Ron Johnson <ronljohnsonjr@gmail.com> wrote:
On Tue, Jun 18= , 2024 at 5:03=E2=80=AFAM Shaheed Haque <shaheedhaque@gmail.com> wrote:
Hi all,

Is there an "= ;official" pairing of LSN values on the publication and subscription s= ides that should be used to track the delta between the two systems? I ask = because Google is full of different pairs being used. I tried to identify t= he highest level interface points exposed, i.e. what is documented on https://www.postgresql.org/docs/current/replication-origins.= html,=C2=A0the pg_stat_subscription table, the pg_stat_publication tabl= e and the=20 pg_current_wal_lsn() function on the publisher, but these seem to be barely= used.

The attached scripts (wh= ose guts I took from a Stack Exchange post) might be a good starting point.= =C2=A0 It certainly works for physical replication!
=C2=A0
<= blockquote class=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-l= eft:1px solid rgb(204,204,204);padding-left:1ex">
P.S.= On a related note, I see a (stalled?) discussion on providing LSN -> timestamp conversion, I'd just like to say= that something like that would be very useful.
=C2=A0
Out of curiosity, how does that work?=C2=A0 Is an instan= ce's initial LSN really based on Epoch?
<= div>
According to the docs at https://www.postgresql.org/docs/c= urrent/datatype-pg-lsn.html, the LSN is "a 64-bit integer, represe= nting a byte position in the write-ahead log stream", so I guess some = black magic is required to turn that into a timestamp.

My= use case might not be all that common, as I am NOT interested in a long-te= rm replica. What I am doing is making a copy of an "old" Django d= eployment to a "new" deployment=20 such that when the two ends are in close sync, I can freeze traffic to=20 the old deployment, pause for any final catchup, and then run a Django=20 migration on the new, before switching on the new (thereby minimising=20 the down time for the app). That being said:
  • It turns out that o= ne cannot use the LSN alone to check for sync since, during the initial &qu= ot;full table copying" phase of the replication, the LSNs at the two e= nds seem to be the same.
  • Instead, I track three metrics on each end= :
    • On the publication end:
      • "count(*)" for = each table
      • pg_current_wal_lsn()
      • tuples_processed from pg_st= at_progress_copy for each table
    • On the subscription end:
    • <= ul>
    • "count(*)" for each table
    • latest_end_lsn from pg_s= tat_subscription
    • tuples_processed from pg_stat_progress_copy for ea= ch table
  • The sync is considered complete when all three m= etrics are aligned.
  • I then freeze activity on the "old" d= eployment, wait for any in-flight metric changes to come through, and then = run the migration on the "new" deployment.
  • As always= perhaps that will be of use to somebody, but YMMV.

    Thanks all,

    Shaheed
    --000000000000e0776e062136a975--