Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1ss5Ss-00601D-By for pgsql-general@arkaria.postgresql.org; Sat, 21 Sep 2024 19:08:19 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1ss5Sp-008aZu-Pb for pgsql-general@arkaria.postgresql.org; Sat, 21 Sep 2024 19:08:17 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1ss5Sp-008aZm-AT for pgsql-general@lists.postgresql.org; Sat, 21 Sep 2024 19:08:16 +0000 Received: from mail-oo1-xc2f.google.com ([2607:f8b0:4864:20::c2f]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1ss5So-000MmI-6k for pgsql-general@lists.postgresql.org; Sat, 21 Sep 2024 19:08:15 +0000 Received: by mail-oo1-xc2f.google.com with SMTP id 006d021491bc7-5d5eec95a74so1505685eaf.1 for ; Sat, 21 Sep 2024 12:08:14 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1726945693; x=1727550493; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:mime-version:from:to:cc:subject :date:message-id:reply-to; bh=IIgiKNAmn9oYr8iySMTipIW81SihY9FSutRcKPBoJ74=; b=jboCnwcTsKmf2yi3NJEJ+nzWG8RIGzQEA9GwTesszeYKY0n/tZaW3mNUfknypgeXMS 8NE1knYgrF3Fw9k+pZNGOt8hjQ8MUCJPiQwc8+dIaWc272JFR2MhYKwk7bjG9jjDPyxX LzhNgpOgA6xmn+oz1PEFTQZ62WbI7nj3pxfIOSNnkfPXf6BEw1f/7oYoJzg0rGdcMLqo sodEySf/V2QT5dx8KT4L8y1JtrX7VnrQjxLWRL13srv7lcNUDdn0qEmVzwaeGhIbBmXF A5vRlNcFpkzmMC0tGEGrouxd3TQK70ThRq8Zk1mxHrMGXZZ/QgnaIfK0wzXP38ebBTgX ryLA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1726945693; x=1727550493; h=cc:to:subject:message-id:date:from:mime-version:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=IIgiKNAmn9oYr8iySMTipIW81SihY9FSutRcKPBoJ74=; b=HpBC2pN/TOV2lnWA4t618lLowEe13YhzLc7n+VRJBGSxkLSS5rJB10PoklsYa7EB1d MoO1it+d/ioDLIsr3Ed+iGQQbMT4iNVzu6Nwbt80SviUD2gc+kd4SSyaDI3AB5neebKj LCiqY4FctgHmqlqris1QnSwmbehinJfHbRaiw5igGtNhPuoa36CoH4QuautoSRe2LmWb wPpp32RnxmeRF8Sk29ZUX8dCIj4fKAXZin9Z5yyMqPhu0XUIy9Bcf/AOJoJs5F3O2uJu rXqQA/p2wRF7tkG32ibqkPHNfSa+neYKKaX/0dVjojEvzzE/SSNCHeIRroSHyO7QpR9h S0LA== X-Gm-Message-State: AOJu0YyQF0JBCXnLrLd+ZzaPXLv6EQtWbFlwN1a7LaL9JagYhhHWQZ/p EnQNFMrVbgc8sEmtBBXnxN2/hpoKX5NyTWbm2hA/Mxs9OG79Mp5pgEuzu84ltYHQ3Ajjv5Pc97x oUTK8HeKJ+VxwLwvix2TbxZz0TZnA9TXr X-Google-Smtp-Source: AGHT+IG2jsJ99wifOxP+u+5F4CkRFfkqLPS9KSdpX3zQ+4d/N3zh+kiEvOJVTjfvWRq8rEqowluJ5rXlZg0ZAyHqV+I= X-Received: by 2002:a05:6870:1494:b0:260:e611:c09 with SMTP id 586e51a60fabf-2803a8e8724mr3766587fac.47.1726945693423; Sat, 21 Sep 2024 12:08:13 -0700 (PDT) MIME-Version: 1.0 From: Ramakrishna m Date: Sun, 22 Sep 2024 00:38:02 +0530 Message-ID: Subject: Logical Replication Delay To: pgsql-general@lists.postgresql.org Cc: ravisql09@gmail.com Content-Type: multipart/alternative; boundary="0000000000009a61090622a5e0c3" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --0000000000009a61090622a5e0c3 Content-Type: text/plain; charset="UTF-8" Hi Team, We have configured bidirectional replication (but traffic can only flow in one direction) between two data centers (distance: 1000 km, maximum Network latency: 100 ms) with an application TPS (transactions per second) of 700 at maximum. We are fine with handling up to 500 TPS without observing any lag between the two data centers. However, when TPS increases, we notice a lag in WAL files of over 100 GB (initially, it was 1 TB, but after tuning, it was reduced to 100 GB). During peak times, WAL files are generated at a rate of 4 GB per minute. All transactions (Tx) take less than 200 ms, with a maximum of 1 second at times (no long-running transactions). *Here are the configured parameters and resources:* - *OS*: Ubuntu - *RAM*: 376 GB - *CPU*: 64 cores - *Swap*: 32 GB - *PostgreSQL Version*: 16.4 (each side has 3 nodes with Patroni and etcd configured) - *DB Size*: 15 TB *Parameters configured on both sides:* Name Setting Unit log_replication_commands off logical_decoding_work_mem 524288 kB max_logical_replication_workers 16 max_parallel_apply_workers_per_subscription 2 max_replication_slots 20 max_sync_workers_per_subscription 2 max_wal_senders 20 max_worker_processes 40 wal_level logical wal_receiver_timeout 600000 ms wal_segment_size 1073741824 B wal_sender_timeout 600000 ms *Optimizations applied:* 1. Vacuum freeze is managed during off-hours; no aggressive vacuum is triggered during business hours. 2. Converted a few tables to unlogged. 3. Removed unwanted tables from publication. 4. Partitioned all large tables. *Pending:* 1. Turning off/tuning autovacuum parameters to avoid triggering during business hours. *Not possible: *We are running all tables in a single publication, and it is not possible to separate them. I would greatly appreciate any suggestions you may have to help avoid logical replication delays, whether through tuning database or operating system parameters, or any other recommendations -- Thanks & Regards, Ram. --0000000000009a61090622a5e0c3 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable

Hi Team,

We have configured bidirectional replication (but traffic can only= flow in one direction) between two data centers (distance: 1000 km, maximu= m Network latency: 100 ms) with an application TPS (transactions per second= ) of 700 at maximum.

We are fine with hand= ling up to 500 TPS without observing any lag between the two data centers. = However, when TPS increases, we notice a lag in WAL files of over 100 GB (i= nitially, it was 1 TB, but after tuning, it was reduced to 100 GB). During = peak times, WAL files are generated at a rate of 4 GB per minute.

All transactions (Tx) take less than 200 ms, with a= maximum of 1 second at times (no long-running transactions).

= Here are the configured parameters and resource= s:

  • OS: Ubun= tu
  • RAM: 376 GB
  • CPU: 64 cores
  • Swap: 32 GB
  • PostgreSQL Version: 16.4 (each side has 3 nodes w= ith Patroni and etcd configured)
  • DB Size: 15 TB

Par= ameters configured on both sides:

= Name=C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 Setting=C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0Unit
= 20<= td>


log_replication_commandsoff
logi= cal_decoding_work_mem524288kB
max_= logical_replication_workers16
max_parallel_apply_workers_per_= subscription=C2=A02
max_replication_slots20
max= _sync_workers_per_subscription2<= /td>
max_wal_senders
max_worker_processes40
wal_levellogical
wal= _receiver_timeout600000= ms
wal_segm= ent_size1073741824B
wal_sender_tim= eout600000ms

= Optimizations applied:

  1. Vacuu= m freeze is managed during off-hours; no aggressive vacuum is triggered dur= ing business hours.
  2. Converted a few tabl= es to unlogged.
  3. Removed unwanted tables = from publication.
  4. Partitioned all large = tables.

Pending:

  1. Turning off/tuning autovacuum parameter= s to avoid triggering during business hours.

Not possible:=C2=A0We are running all tables in a= single publication, and it is not possible to separate them.

= I would greatly appreciate any suggestions you may have to help avoid logic= al replication delays, whether through tuning database or operating system = parameters, or any other recommendations


--
Thanks & Regards,
Ram.
--0000000000009a61090622a5e0c3--