From: Shaheed Haque
Date: Mon, 2 Sep 2024 08:42:05 +0100
Subject: Re: Postgres Logical Replication - how to see what subscriber is doing with received data?
To: Muhammad Ikram
Cc: Michael Jaskiewicz, pgsql-general@postgresql.org

Hi Muhammad,

On Mon, 2 Sep 2024, 07:08 Muhammad Ikram <mmikram@gmail.com> wrote:

> Hi Shaheed,
>
> Maybe these considerations could help you or give a hint about the problem?
>
> Check whether wal_receiver_timeout being set to 0 could cause issues, such
> as not detecting network problems quickly enough. Consider re-evaluating
> this setting if you see connection issues.
>
> If you notice that some data is missing on the subscriber, could you
> increase max_slot_wal_keep_size on the publisher so that WAL is not deleted
> until it has been applied on the subscriber?
>
> Do you have the flexibility to increase max_worker_processes,
> max_logical_replication_workers, work_mem and maintenance_work_mem on the
> subscriber (in case the bottleneck is on the subscriber)?
>
> If there's significant lag, consider whether it might be more efficient to
> drop the subscription and re-initialize it from scratch using a new base
> backup, depending on the data volume and how long it might take for the
> existing replication to catch up.

Thanks for the kind hints, I'll certainly look into those.

My main interest, however, was the "visibility" question, i.e. getting an
understanding of the gap between the two ends of a replication slot, ideally
in human terms (e.g. tables x records).

I understand the difficulties of trying to produce a meaningful metric that
spans two (or more) systems but, let's be honest, trying to diagnose which
knobs to tweak (whether in the application, PG, the OS or the network) is
basically black magic when all we really have is a pair of opaque LSNs.

> Regards,
> Muhammad Ikram
>
> On Sun, Sep 1, 2024 at 9:22 PM Shaheed Haque <shaheedhaque@gmail.com> wrote:
>
>> Since nobody more knowledgeable has replied...
>>
>> I'm very interested in this area and still surprised that there is no
>> official/convenient/standard way to approach this (see
>> https://www.postgresql.org/message-id/CAHAc2jdAHvp7tFZBP37awcth%3DT3h5WXCN9KjZOvuTNJaAAC_hg%40mail.gmail.com
>> ).
>>
>> Based partly on that thread, I ended up with a script that connects to
>> both ends of the replication, and basically loops while comparing the
>> counts in each table.
>>
>> On Fri, 30 Aug 2024, 12:38 Michael Jaskiewicz <mjaskiewicz@ghx.com> wrote:
>>
>>> I've got two Postgres 13 databases on AWS RDS.
>>>
>>>    - One is a master, the other a slave using logical replication.
>>>    - Replication has fallen behind by about 350Gb.
>>>    - The slave was maxed out in terms of CPU for the past four days
>>>      because of some jobs that were ongoing so I'm not sure what logical
>>>      replication was able to replicate during that time.
>>>    - I killed those jobs and now CPU on the master and slave are both
>>>      low.
>>>    - I look at the subscriber via `select * from pg_stat_subscription;`
>>>      and see that latest_end_lsn is advancing albeit very slowly.
>>>    - The publisher says write/flush/replay lags are all 13 minutes
>>>      behind but it's been like that for most of the day.
>>>    - I see no errors in the logs on either the publisher or subscriber
>>>      outside of some simple SQL errors that users have been making.
>>>    - CloudWatch reports low CPU utilization, low I/O, and low network.
>>>
>>> Is there anything I can do here? Previously I set wal_receiver_timeout
>>> to 0 because I had replication issues, and that helped things. I wish I
>>> had *some* visibility here to get any kind of confidence that it's going
>>> to pull through, but other than these lsn values and database logs, I'm
>>> not sure what to check.
>>>
>>> Sincerely,
>>>
>>> mj
>
> --
> Muhammad Ikram
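
As a rough illustration of the "visibility" idea discussed in this thread, the sketch below turns a pair of LSNs into a byte gap and compares per-table row counts from both ends of the replication. It is not the script mentioned in the thread, only a minimal sketch under stated assumptions: the function names are hypothetical, `conn` is assumed to be any Python DB-API connection (for example one opened with a PostgreSQL driver), and table names are assumed to be trusted, unqualified names. The LSN strings would typically come from `pg_current_wal_lsn()` or `pg_replication_slots.confirmed_flush_lsn` on the publisher and from `pg_stat_subscription.latest_end_lsn` on the subscriber.

```python
"""Sketch: express a logical-replication gap in WAL bytes and in rows per table."""
from typing import Dict, Iterable


def lsn_to_int(lsn: str) -> int:
    """Convert a textual pg_lsn such as '16/B374D848' into a byte position.

    The text form is two hexadecimal numbers: the high and low 32 bits.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lsn_gap_bytes(publisher_lsn: str, subscriber_lsn: str) -> int:
    """WAL bytes between the publisher's position and the subscriber's."""
    return lsn_to_int(publisher_lsn) - lsn_to_int(subscriber_lsn)


def count_rows(conn, tables: Iterable[str]) -> Dict[str, int]:
    """Return {table: row count} using a DB-API connection (assumption).

    Table names cannot be bound as query parameters, so they are quoted as
    identifiers here; only pass names you trust. count(*) scans the table,
    which can be slow on large tables.
    """
    counts: Dict[str, int] = {}
    cur = conn.cursor()
    try:
        for table in tables:
            quoted = '"' + table.replace('"', '""') + '"'
            cur.execute(f"SELECT count(*) FROM {quoted}")
            counts[table] = cur.fetchone()[0]
    finally:
        cur.close()
    return counts


def diff_counts(publisher: Dict[str, int], subscriber: Dict[str, int]) -> Dict[str, int]:
    """Per-table difference (publisher minus subscriber), omitting equal tables."""
    return {
        table: rows - subscriber.get(table, 0)
        for table, rows in publisher.items()
        if rows != subscriber.get(table, 0)
    }


if __name__ == "__main__":
    # Illustrative values only, not taken from the thread.
    print(lsn_gap_bytes("16/B374D848", "16/B374D800"), "bytes of WAL behind")
    print(diff_counts({"orders": 10, "users": 5}, {"orders": 7, "users": 5}))
```

Running `count_rows` against both ends in a loop and printing `diff_counts` gives a per-table view that should shrink toward empty as the subscriber catches up. Row counts only reflect inserts and deletes, not updates, and the byte gap measures WAL volume rather than rows, so both numbers are indicative at best. On the server side, `pg_wal_lsn_diff()` performs the same LSN subtraction in SQL.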