public inbox for [email protected]help / color / mirror / Atom feed
Pg14 replication issue , recovery stucks in a random file without advancing while streaming from primary 6+ messages / 3 participants [nested] [flat]
* Pg14 replication issue , recovery stucks in a random file without advancing while streaming from primary @ 2026-06-04 21:13 Jorge Daniel <[email protected]> 0 siblings, 2 replies; 6+ messages in thread From: Jorge Daniel @ 2026-06-04 21:13 UTC (permalink / raw) To: [email protected] <[email protected]> Good day to everyone We're asking the PG-comunity for some help if it is possible. We have a primary with 2 secondaries: the primary went down and one of the secondaries was promoted. The orphaned Secondary reconnected to the new primary and is replicating ok. We had to reconstruct a new secondary, we did it as we always do with the basic and dependable: pg_basebackup -h uspgvento14r.us.local -U replicator -p 5432 -D $PGDATA -Fp -Xs -P -R --checkpoint=fast --create-slot --slot=us_vento_replica_slot_aux It ran for 5hrs without problem. When it finished: $ pg_ctl start The recovery was running until a consistent point was reached, the database opened and started streaming the rest from the Primary. After an hour or so, the recovery got stuck in a certain wal file. No more log entries about it (debug 1) and after some hours the stream connection got disconnected. The secondary is hung on that wal file and not going forward with the rest of the wal file list. We re-tried this several times, changing the storage (just in case), with a new box with the same original Ubuntu 22.04 instead of 24.04 (just in case), but the result was the same. Even though we have the 22.04 and 24.04 in parallel, we saw both replica engines freeze on the same file (everytime we re-created the stuck-wal-file changed, clearly). We're out of ideas of what's happening. Could you please shed some light here? Primary: uspgvento14r Version server_version | 14.11 (Ubuntu 14.11-0ubuntu0.22.04.1) pg_stat_replication: pid | 4177197 usesysid | 16474 usename | replicator application_name | 14_replica client_addr | 192.168.11.33 client_hostname | client_port | 53534 backend_start | 2026-06-04 01:25:18.100137-07 backend_xmin | state | streaming sent_lsn | 8EBA/CDA68930 write_lsn | 8EBA/CDA68930 flush_lsn | 8EBA/CDA68930 replay_lsn | 8EB7/968FEBE8 write_lag | 00:00:00.001543 flush_lag | 00:00:00.00225 replay_lag | 01:00:22.530041 sync_priority | 0 sync_state | async reply_time | 2026-06-04 03:33:28.037703-07 Log Primary 2026-06-04 01:25:18 PDT [unknown] [unknown] 192.168.12.34 [unknown] [4177197] LOG: connection received: host=192.168.12.34 port=53534 2026-06-04 01:25:18 PDT [2858] LOG: background worker "logical replication worker" (PID 4177182) exited with exit code 1 2026-06-04 01:25:18 PDT [unknown] replicator 192.168.12.34 [unknown] [4177197] LOG: connection authenticated: identity="replicator" method=md5 (/etc/postgresql/14/main/pg_hba.conf:95) 2026-06-04 01:25:18 PDT [unknown] replicator 192.168.12.34 [unknown] [4177197] LOG: replication connection authorized: user=replicator application_name=14_uspgvento14Rb SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256) -- 2026-06-04 04:02:26 PDT [unknown] replicator 192.168.12.34 14_uspgvento14Rb [4177197] LOG: disconnection: session time: 2:37:08.575 user=replicator database= host=192.168.12.34 port=53534 Secondary : 14_uspgvento14Rb 192.168.12.34 Version: server_version | 14.11 (Ubuntu 14.11-0ubuntu0.24.04.1) Ubuntu 14.23-1.pgdg24.04+1 ..... 2026-06-04 08:25:14 UTC [197740] DEBUG: got WAL segment from archive 2026-06-04 08:25:14 UTC [197740] LOG: restored log file "0000000200008EA40000002D" from archive 2026-06-04 08:25:14 UTC [197740] DEBUG: got WAL segment from archive 2026-06-04 08:25:14 UTC [197740] LOG: restored log file "0000000200008EA40000002E" from archive 2026-06-04 08:25:14 UTC [197740] DEBUG: got WAL segment from archive 2026-06-04 08:25:14 UTC [197740] LOG: restored log file "0000000200008EA40000002F" from archive 2026-06-04 08:25:14 UTC [197740] DEBUG: got WAL segment from archive 2026-06-04 08:25:14 UTC [197740] LOG: restored log file "0000000200008EA400000030" from archive 2026-06-04 08:25:14 UTC [197740] DEBUG: got WAL segment from archive 026-06-04 08:25:15 UTC [197747] DEBUG: checkpoint sync: number=4 file=base/6176124/2840_fsm time=0.011 ms 2026-06-04 08:25:15 UTC [197747] DEBUG: checkpoint sync: number=5 file=base/6176124/6599312.33 time=1.493 ms ...... 2026-06-04 08:25:18 UTC [197740] LOG: restored log file "0000000200008EA400000044" from archive 2026-06-04 08:25:18 UTC [197740] DEBUG: got WAL segment from archive 2026-06-04 08:25:18 UTC [197740] DEBUG: end of backup reached 2026-06-04 08:25:18 UTC [197740] CONTEXT: WAL redo at 8EA4/44680C08 for XLOG/BACKUP_END: 8E91/8B065578 2026-06-04 08:25:18 UTC [197740] LOG: consistent recovery state reached at 8EA4/44680C30 2026-06-04 08:25:18 UTC [197738] LOG: database system is ready to accept read-only connections cp: cannot stat '/pg_data/pg14_wal_archive/0000000200008EA400000045': No such file or directory 2026-06-04 08:25:18 UTC [207957] LOG: started streaming WAL from primary at 8EA4/45000000 on timeline 2 2026-06-04 08:25:39 UTC [unknown] [unknown] [local] [unknown] [208039] LOG: connection received: host=[local] 2026-06-04 08:25:39 UTC postgres postgres [local] [unknown] [208039] LOG: connection authorized: user=postgres database=postgres application_name=psql 2026-06-04 08:26:15 UTC [197747] LOG: restartpoint starting: time 2026-06-04 08:26:15 UTC [197747] DEBUG: performing replication slot checkpoint ...... 22026-06-04 08:30:15 UTC [197747] DEBUG: checkpoint sync: number=4 file=base/6176124/2840_fsm time=0.006 ms 2026-06-04 08:30:15 UTC [197747] DEBUG: checkpoint sync: number=5 file=base/6176124/6599312.33 time=1.572 ms 2026-06-04 08:30:15 UTC [197747] DEBUG: checkpoint sync: number=6 file=base/6176124/6602460.56 time=1.791 ms ..... 2026-06-04 09:35:15 UTC [197747] DEBUG: checkpoint sync: number=782 file=base/6176124/6601067.15 time=0.003 ms 2026-06-04 09:35:15 UTC [197747] DEBUG: checkpoint sync: number=783 file=base/6176124/6374431.10 time=0.006 ms 2026-06-04 09:35:15 UTC [197747] LOG: restartpoint complete: wrote 99500 buffers (19.0%); 0 WAL file(s) added, 0 removed, 25 recycled; write=239.934 s, sync=0.039 s, total=239.989 s; sync files=783, longest=0.001 s, average=0.001 s; distance=250897 kB, estimate=13117629 kB 2026-06-04 09:35:15 UTC [197747] LOG: recovery restart point at 8EB7/7550F228 2026-06-04 09:35:15 UTC [197747] DETAIL: Last completed transaction was at log time 2026-06-04 02:33:05.505911-07. ^@^@^@^@^@^@^@^@——> Forever<——— postgres=# select pg_last_wal_replay_lsn(); pg_last_wal_replay_lsn ------------------------ 8EB7/968FEBE8 postgres@uspgvento14Rb:/pg_data/data14/pg_wal$ ps -ef |grep wal postgres 207957 197738 6 08:25 ? 00:04:52 postgres: 14_uspgvento14Rb: walreceiver streaming 8EB7/A39875E8 postgres 211309 163479 0 09:36 pts/12 00:00:00 grep wal postgres@uspgvento14Rb:/pg_data/data14/pg_wal$ ps -ef |grep reco postgres 197740 197738 16 08:11 ? 00:14:26 postgres: 14_uspgvento14Rb: startup recovering 0000000200008EB700000096 postgres 211311 163479 0 09:36 pts/12 00:00:00 grep reco Content of the wal : postgres@uspgvento14b:/pg_data/data14/pg_wal$ pg_waldump 0000000200008EB700000096 |grep 8FEBE8 rmgr: MultiXact len (rec/tot): 54/ 54, tx: 2681967401, lsn: 8EB7/968FEBE8, prev 8EB7/968FEBA8, desc: CREATE_ID 2439167 offset 5217620 nmembers 2: 2681967400 (keysh) 2681967401 (keysh) rmgr: Heap len (rec/tot): 54/ 54, tx: 2681967401, lsn: 8EB7/968FEC20, prev 8EB7/968FEBE8, desc: LOCK off 35: xid 2439167: flags 0x00 IS_MULTI LOCK_ONLY KEYSHR_LOCK , blkref #0: rel 1663/6176124/6474188 blk 29758258 There are still plenty of files to process: postgres@uspgvento14Rb:/pg_data/data14/pg_wal$ ls -ltr |grep -A5 -B5 0000000200008EB700000096 -rw------- 1 postgres postgres 16777216 Jun 4 09:32 0000000200008EB700000091 -rw------- 1 postgres postgres 16777216 Jun 4 09:32 0000000200008EB700000092 -rw------- 1 postgres postgres 16777216 Jun 4 09:32 0000000200008EB700000093 -rw------- 1 postgres postgres 16777216 Jun 4 09:32 0000000200008EB700000094 -rw------- 1 postgres postgres 16777216 Jun 4 09:32 0000000200008EB700000095 -rw------- 1 postgres postgres 16777216 Jun 4 09:33 0000000200008EB700000096 -rw------- 1 postgres postgres 16777216 Jun 4 09:33 0000000200008EB700000097 -rw------- 1 postgres postgres 16777216 Jun 4 09:33 0000000200008EB700000098 -rw------- 1 postgres postgres 16777216 Jun 4 09:33 0000000200008EB700000099 -rw------- 1 postgres postgres 16777216 Jun 4 09:34 0000000200008EB70000009A -rw------- 1 postgres postgres 16777216 Jun 4 09:34 0000000200008EB70000009B Postgresql.conf Secondary (primary is the same, but with more memory) All non-default values: shared_buffers = 4GB # min 128kB temp_buffers = 512MB # min 800kB work_mem = 512MB # min 64kB maintenance_work_mem = 4GB # min 1MB autovacuum_work_mem = 512MB # min 1MB, or -1 to use maintenance_work_mem max_stack_depth = 7MB bgwriter_lru_maxpages = 1000 # max buffers written/round, 0 disables wal_level = logical # minimal, replica, or logical wal_log_hints = on # also do full page writes of non-critical updates checkpoint_completion_target = 0.8 # checkpoint target duration, 0.0 - 1.0 checkpoint_warning = 600s # 0 disables max_wal_size = 500GB min_wal_size = 50GB max_wal_senders = 10 # max number of walsender processes max_replication_slots = 30 # max number of replication slots wal_keep_size = 150GB # in megabytes; 0 disables 30720=30GB 1.5 day max_slot_wal_keep_size = 500GB # in megabytes; -1 disables 307200=300GB 4 days wal_sender_timeout = 600s # in milliseconds; 0 disables max_standby_archive_delay = 48h # max delay before canceling queries max_standby_streaming_delay = 48h # max delay before canceling queries hot_standby_feedback = on # send info from standby to prevent wal_receiver_timeout = 600s # time that receiver waits for ^ permalink raw reply [nested|flat] 6+ messages in thread
* Re:Pg14 replication issue , recovery stucks in a random file without advancing while streaming from primary @ 2026-06-04 23:51 hello from Burnside Project <[email protected]> parent: Jorge Daniel <[email protected]> 1 sibling, 0 replies; 6+ messages in thread From: hello from Burnside Project @ 2026-06-04 23:51 UTC (permalink / raw) To: Jorge Daniel <[email protected]>; +Cc: [email protected] <[email protected]> Replication status On the primary: SELECT * FROM pg_stat_replication; Shows: * Is the standby connected? * What LSN has been sent? * What LSN has been replayed? 2. WAL receiver status On the standby: SELECT * FROM pg_stat_wal_receiver; Shows: * Is WAL still being received? * From which server? * Latest WAL location received? 3. Recovery progress On the standby: SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp(); This tells you whether: * WAL is arriving but not replaying. * WAL is replaying slowly. * Replay has completely stopped. 4. PostgreSQL logs Look for messages such as: invalid record length PANIC could not read WAL requested timeline waiting for WAL Best regards, Arif Rahman mailto:[email protected] https://burnsideproject.ai https://github.com/orgs/burnside-project Capture, transform, and learn from PostgreSQL Email Disclaimer: This email and any attachments may contain privileged and confidential information intended solely for the use of the individual or entity to whom it is addressed. If you are not the intended recipient, you are hereby notified that any dissemination, distribution, copying, or use of this email or its contents is strictly prohibited. If you have received this email in error, please notify the sender immediately by replying to this message and delete it from your system. Please note that any views or opinions expressed in this email are solely those of the author and do not necessarily represent those of Burnside Project LLC. Although we have taken precautions to ensure this email is free of viruses or other malicious software, we cannot guarantee the security or integrity of email communications. Recipients should verify attachments for possible threats. Thank you. Burnside Project From: Jorge Daniel <[email protected]> To: "[email protected]"<[email protected]> Date: Thu, 04 Jun 2026 14:13:45 -0700 Subject: Pg14 replication issue , recovery stucks in a random file without advancing while streaming from primary Good day to everyone We're asking the PG-comunity for some help if it is possible. We have a primary with 2 secondaries: the primary went down and one of the secondaries was promoted. The orphaned Secondary reconnected to the new primary and is replicating ok. We had to reconstruct a new secondary, we did it as we always do with the basic and dependable: pg_basebackup -h uspgvento14r.us.local -U replicator -p 5432 -D $PGDATA -Fp -Xs -P -R --checkpoint=fast --create-slot --slot=us_vento_replica_slot_aux It ran for 5hrs without problem. When it finished: $ pg_ctl start The recovery was running until a consistent point was reached, the database opened and started streaming the rest from the Primary. After an hour or so, the recovery got stuck in a certain wal file. No more log entries about it (debug 1) and after some hours the stream connection got disconnected. The secondary is hung on that wal file and not going forward with the rest of the wal file list. We re-tried this several times, changing the storage (just in case), with a new box with the same original Ubuntu 22.04 instead of 24.04 (just in case), but the result was the same. Even though we have the 22.04 and 24.04 in parallel, we saw both replica engines freeze on the same file (everytime we re-created the stuck-wal-file changed, clearly). We're out of ideas of what's happening. Could you please shed some light here? Primary: uspgvento14r Version server_version | 14.11 (Ubuntu 14.11-0ubuntu0.22.04.1) pg_stat_replication: pid | 4177197 usesysid | 16474 usename | replicator application_name | 14_replica client_addr | 192.168.11.33 client_hostname | client_port | 53534 backend_start | 2026-06-04 01:25:18.100137-07 backend_xmin | state | streaming sent_lsn | 8EBA/CDA68930 write_lsn | 8EBA/CDA68930 flush_lsn | 8EBA/CDA68930 replay_lsn | 8EB7/968FEBE8 write_lag | 00:00:00.001543 flush_lag | 00:00:00.00225 replay_lag | 01:00:22.530041 sync_priority | 0 sync_state | async reply_time | 2026-06-04 03:33:28.037703-07 Log Primary 2026-06-04 01:25:18 PDT [unknown] [unknown] 192.168.12.34 [unknown] [4177197] LOG: connection received: host=192.168.12.34 port=53534 2026-06-04 01:25:18 PDT [2858] LOG: background worker "logical replication worker" (PID 4177182) exited with exit code 1 2026-06-04 01:25:18 PDT [unknown] replicator 192.168.12.34 [unknown] [4177197] LOG: connection authenticated: identity="replicator" method=md5 (/etc/postgresql/14/main/pg_hba.conf:95) 2026-06-04 01:25:18 PDT [unknown] replicator 192.168.12.34 [unknown] [4177197] LOG: replication connection authorized: user=replicator application_name=14_uspgvento14Rb SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256) -- 2026-06-04 04:02:26 PDT [unknown] replicator 192.168.12.34 14_uspgvento14Rb [4177197] LOG: disconnection: session time: 2:37:08.575 user=replicator database= host=192.168.12.34 port=53534 Secondary : 14_uspgvento14Rb 192.168.12.34 Version: server_version | 14.11 (Ubuntu 14.11-0ubuntu0.24.04.1) Ubuntu 14.23-1.pgdg24.04+1 ..... 2026-06-04 08:25:14 UTC [197740] DEBUG: got WAL segment from archive 2026-06-04 08:25:14 UTC [197740] LOG: restored log file "0000000200008EA40000002D" from archive 2026-06-04 08:25:14 UTC [197740] DEBUG: got WAL segment from archive 2026-06-04 08:25:14 UTC [197740] LOG: restored log file "0000000200008EA40000002E" from archive 2026-06-04 08:25:14 UTC [197740] DEBUG: got WAL segment from archive 2026-06-04 08:25:14 UTC [197740] LOG: restored log file "0000000200008EA40000002F" from archive 2026-06-04 08:25:14 UTC [197740] DEBUG: got WAL segment from archive 2026-06-04 08:25:14 UTC [197740] LOG: restored log file "0000000200008EA400000030" from archive 2026-06-04 08:25:14 UTC [197740] DEBUG: got WAL segment from archive 026-06-04 08:25:15 UTC [197747] DEBUG: checkpoint sync: number=4 file=base/6176124/2840_fsm time=0.011 ms 2026-06-04 08:25:15 UTC [197747] DEBUG: checkpoint sync: number=5 file=base/6176124/6599312.33 time=1.493 ms ...... 2026-06-04 08:25:18 UTC [197740] LOG: restored log file "0000000200008EA400000044" from archive 2026-06-04 08:25:18 UTC [197740] DEBUG: got WAL segment from archive 2026-06-04 08:25:18 UTC [197740] DEBUG: end of backup reached 2026-06-04 08:25:18 UTC [197740] CONTEXT: WAL redo at 8EA4/44680C08 for XLOG/BACKUP_END: 8E91/8B065578 2026-06-04 08:25:18 UTC [197740] LOG: consistent recovery state reached at 8EA4/44680C30 2026-06-04 08:25:18 UTC [197738] LOG: database system is ready to accept read-only connections cp: cannot stat '/pg_data/pg14_wal_archive/0000000200008EA400000045': No such file or directory 2026-06-04 08:25:18 UTC [207957] LOG: started streaming WAL from primary at 8EA4/45000000 on timeline 2 2026-06-04 08:25:39 UTC [unknown] [unknown] [local] [unknown] [208039] LOG: connection received: host=[local] 2026-06-04 08:25:39 UTC postgres postgres [local] [unknown] [208039] LOG: connection authorized: user=postgres database=postgres application_name=psql 2026-06-04 08:26:15 UTC [197747] LOG: restartpoint starting: time 2026-06-04 08:26:15 UTC [197747] DEBUG: performing replication slot checkpoint ...... 22026-06-04 08:30:15 UTC [197747] DEBUG: checkpoint sync: number=4 file=base/6176124/2840_fsm time=0.006 ms 2026-06-04 08:30:15 UTC [197747] DEBUG: checkpoint sync: number=5 file=base/6176124/6599312.33 time=1.572 ms 2026-06-04 08:30:15 UTC [197747] DEBUG: checkpoint sync: number=6 file=base/6176124/6602460.56 time=1.791 ms ..... 2026-06-04 09:35:15 UTC [197747] DEBUG: checkpoint sync: number=782 file=base/6176124/6601067.15 time=0.003 ms 2026-06-04 09:35:15 UTC [197747] DEBUG: checkpoint sync: number=783 file=base/6176124/6374431.10 time=0.006 ms 2026-06-04 09:35:15 UTC [197747] LOG: restartpoint complete: wrote 99500 buffers (19.0%); 0 WAL file(s) added, 0 removed, 25 recycled; write=239.934 s, sync=0.039 s, total=239.989 s; sync files=783, longest=0.001 s, average=0.001 s; distance=250897 kB, estimate=13117629 kB 2026-06-04 09:35:15 UTC [197747] LOG: recovery restart point at 8EB7/7550F228 2026-06-04 09:35:15 UTC [197747] DETAIL: Last completed transaction was at log time 2026-06-04 02:33:05.505911-07. ^@^@^@^@^@^@^@^@——> Forever<——— postgres=# select pg_last_wal_replay_lsn(); pg_last_wal_replay_lsn ------------------------ 8EB7/968FEBE8 postgres@uspgvento14Rb:/pg_data/data14/pg_wal$ ps -ef |grep wal postgres 207957 197738 6 08:25 ? 00:04:52 postgres: 14_uspgvento14Rb: walreceiver streaming 8EB7/A39875E8 postgres 211309 163479 0 09:36 pts/12 00:00:00 grep wal postgres@uspgvento14Rb:/pg_data/data14/pg_wal$ ps -ef |grep reco postgres 197740 197738 16 08:11 ? 00:14:26 postgres: 14_uspgvento14Rb: startup recovering 0000000200008EB700000096 postgres 211311 163479 0 09:36 pts/12 00:00:00 grep reco Content of the wal : postgres@uspgvento14b:/pg_data/data14/pg_wal$ pg_waldump 0000000200008EB700000096 |grep 8FEBE8 rmgr: MultiXact len (rec/tot): 54/ 54, tx: 2681967401, lsn: 8EB7/968FEBE8, prev 8EB7/968FEBA8, desc: CREATE_ID 2439167 offset 5217620 nmembers 2: 2681967400 (keysh) 2681967401 (keysh) rmgr: Heap len (rec/tot): 54/ 54, tx: 2681967401, lsn: 8EB7/968FEC20, prev 8EB7/968FEBE8, desc: LOCK off 35: xid 2439167: flags 0x00 IS_MULTI LOCK_ONLY KEYSHR_LOCK , blkref #0: rel 1663/6176124/6474188 blk 29758258 There are still plenty of files to process: postgres@uspgvento14Rb:/pg_data/data14/pg_wal$ ls -ltr |grep -A5 -B5 0000000200008EB700000096 -rw------- 1 postgres postgres 16777216 Jun 4 09:32 0000000200008EB700000091 -rw------- 1 postgres postgres 16777216 Jun 4 09:32 0000000200008EB700000092 -rw------- 1 postgres postgres 16777216 Jun 4 09:32 0000000200008EB700000093 -rw------- 1 postgres postgres 16777216 Jun 4 09:32 0000000200008EB700000094 -rw------- 1 postgres postgres 16777216 Jun 4 09:32 0000000200008EB700000095 -rw------- 1 postgres postgres 16777216 Jun 4 09:33 0000000200008EB700000096 -rw------- 1 postgres postgres 16777216 Jun 4 09:33 0000000200008EB700000097 -rw------- 1 postgres postgres 16777216 Jun 4 09:33 0000000200008EB700000098 -rw------- 1 postgres postgres 16777216 Jun 4 09:33 0000000200008EB700000099 -rw------- 1 postgres postgres 16777216 Jun 4 09:34 0000000200008EB70000009A -rw------- 1 postgres postgres 16777216 Jun 4 09:34 0000000200008EB70000009B Postgresql.conf Secondary (primary is the same, but with more memory) All non-default values: shared_buffers = 4GB # min 128kB temp_buffers = 512MB # min 800kB work_mem = 512MB # min 64kB maintenance_work_mem = 4GB # min 1MB autovacuum_work_mem = 512MB # min 1MB, or -1 to use maintenance_work_mem max_stack_depth = 7MB bgwriter_lru_maxpages = 1000 # max buffers written/round, 0 disables wal_level = logical # minimal, replica, or logical wal_log_hints = on # also do full page writes of non-critical updates checkpoint_completion_target = 0.8 # checkpoint target duration, 0.0 - 1.0 checkpoint_warning = 600s # 0 disables max_wal_size = 500GB min_wal_size = 50GB max_wal_senders = 10 # max number of walsender processes max_replication_slots = 30 # max number of replication slots wal_keep_size = 150GB # in megabytes; 0 disables 30720=30GB 1.5 day max_slot_wal_keep_size = 500GB # in megabytes; -1 disables 307200=300GB 4 days wal_sender_timeout = 600s # in milliseconds; 0 disables max_standby_archive_delay = 48h # max delay before canceling queries max_standby_streaming_delay = 48h # max delay before canceling queries hot_standby_feedback = on # send info from standby to prevent wal_receiver_timeout = 600s # time that receiver waits for ^ permalink raw reply [nested|flat] 6+ messages in thread
* Re: Pg14 replication issue , recovery stucks in a random file without advancing while streaming from primary @ 2026-06-05 09:04 Laurenz Albe <[email protected]> parent: Jorge Daniel <[email protected]> 1 sibling, 1 reply; 6+ messages in thread From: Laurenz Albe @ 2026-06-05 09:04 UTC (permalink / raw) To: Jorge Daniel <[email protected]>; [email protected] <[email protected]> On Thu, 2026-06-04 at 21:13 +0000, Jorge Daniel wrote: > server_version | 14.11 (Ubuntu 14.11-0ubuntu0.22.04.1) Try again with 14.23. Yours, Laurenz Albe ^ permalink raw reply [nested|flat] 6+ messages in thread
* Re: Pg14 replication issue , recovery stucks in a random file without advancing while streaming from primary @ 2026-06-05 10:00 Jorge Daniel <[email protected]> parent: Laurenz Albe <[email protected]> 0 siblings, 1 reply; 6+ messages in thread From: Jorge Daniel @ 2026-06-05 10:00 UTC (permalink / raw) To: Laurenz Albe <[email protected]>; +Cc: [email protected] <[email protected]>; [email protected] <[email protected]> Hello Laurenz , thx u for the help . We tried on our first 3 attempts, then we downgraded to mirror the Primary Version , then we downgraded the OS version . We tried so far this combinations : * Ubuntu 24.04 + PG 14.23 and with 2 different storage devices (suspecting about an issue in the disk layer ) * Ubuntu 24.04 + PG 14.11 * Ubuntu 22.04 + PG 14.11 -> Same versions of the Primary In all these environments we had the same result : stopped replaying wal-files while still receiving via streaming . No log lines about it ,none We never saw this scenario ever Kind Regards Jorge Fernandez > El 5 jun 2026, a las 11:04, Laurenz Albe <[email protected]> escribió: > > On Thu, 2026-06-04 at 21:13 +0000, Jorge Daniel wrote: >> server_version | 14.11 (Ubuntu 14.11-0ubuntu0.22.04.1) > > Try again with 14.23. > > Yours, > Laurenz Albe ^ permalink raw reply [nested|flat] 6+ messages in thread
* Re: Pg14 replication issue , recovery stucks in a random file without advancing while streaming from primary @ 2026-06-05 11:55 Laurenz Albe <[email protected]> parent: Jorge Daniel <[email protected]> 0 siblings, 1 reply; 6+ messages in thread From: Laurenz Albe @ 2026-06-05 11:55 UTC (permalink / raw) To: Jorge Daniel <[email protected]>; +Cc: [email protected] <[email protected]>; [email protected] <[email protected]> On Fri, 2026-06-05 at 10:00 +0000, Jorge Daniel wrote: > We tried on our first 3 attempts, then we downgraded to mirror the Primary Version , then we downgraded the OS version . > We tried so far this combinations : > > * Ubuntu 24.04 + PG 14.23 and with 2 different storage devices (suspecting about an issue in the disk layer ) > * Ubuntu 24.04 + PG 14.11 > * Ubuntu 22.04 + PG 14.11 -> Same versions of the Primary > > In all these environments we had the same result : stopped replaying wal-files while still receiving via streaming . > No log lines about it ,none > > We never saw this scenario ever Perhaps you are suffering from this bug [1] up to 14.22 and from this bug [2] in 12.23 (which was introduced by the previous bugfix). Yours, Laurenz Albe [1]: https://postgr.es/m/flat/CACV2tSw3VYS7d27ftO_cs%2BaF3M54%2BJwWBbqSGLcKoG9cvyb6EA%40mail.gmail.com [2]: https://postgr.es/m/flat/19490-9c59c6a583513b99%40postgresql.org ^ permalink raw reply [nested|flat] 6+ messages in thread
* Re: Pg14 replication issue , recovery stucks in a random file without advancing while streaming from primary @ 2026-06-08 10:28 Jorge Daniel <[email protected]> parent: Laurenz Albe <[email protected]> 0 siblings, 0 replies; 6+ messages in thread From: Jorge Daniel @ 2026-06-08 10:28 UTC (permalink / raw) To: Laurenz Albe <[email protected]>; +Cc: [email protected] <[email protected]>; [email protected] <[email protected]> Thanks for the help Laurenz After researching on those bugs (this was not exactly the same conditions but the recovery stopped ) we made another try with a minor version from the primary (14.11) and installed 14.09. The result was different with the primary connected and streaming 100% on the secondary , with 100% replay achieved , ie :recovery didn't stop working!! state | streaming sent_lsn | 8F32/C9A72F18 write_lsn | 8F32/C9A72F18 flush_lsn | 8F32/C9A72F18 replay_lsn | 8F32/C9A72F18 It was running the whole weekend normally without hanging up on syncing, but we detect that one particular table was not synced (the others seems to being updated normally) Primary pmx=# select now(), * from tstamp; -[ RECORD 1 ]------------------------- now | 2026-06-07 04:27:25.443001-07 tstamp | 2026-06-07 04:27:01.9337-07 ———>here Secondary pmx=# select now(), max(time) from activity_log; -[ RECORD 1 ]---------------------- now | 2026-06-07 04:27:01.368247-07 tstamp | 2026-06-04 18:02:02.446899-07———>here Did we hit another bug? > El 5 jun 2026, a las 13:55, Laurenz Albe <[email protected]> escribió: > > On Fri, 2026-06-05 at 10:00 +0000, Jorge Daniel wrote: >> We tried on our first 3 attempts, then we downgraded to mirror the Primary Version , then we downgraded the OS version . >> We tried so far this combinations : >> >> * Ubuntu 24.04 + PG 14.23 and with 2 different storage devices (suspecting about an issue in the disk layer ) >> * Ubuntu 24.04 + PG 14.11 >> * Ubuntu 22.04 + PG 14.11 -> Same versions of the Primary >> >> In all these environments we had the same result : stopped replaying wal-files while still receiving via streaming . >> No log lines about it ,none >> >> We never saw this scenario ever > > Perhaps you are suffering from this bug [1] up to 14.22 and from this bug [2] in 12.23 > (which was introduced by the previous bugfix). > > Yours, > Laurenz Albe > > > [1]: https://postgr.es/m/flat/CACV2tSw3VYS7d27ftO_cs%2BaF3M54%2BJwWBbqSGLcKoG9cvyb6EA%40mail.gmail.com > [2]: https://postgr.es/m/flat/19490-9c59c6a583513b99%40postgresql.org ^ permalink raw reply [nested|flat] 6+ messages in thread
end of thread, other threads:[~2026-06-08 10:28 UTC | newest] Thread overview: 6+ messages (download: mbox mbox.gz follow: Atom feed) -- links below jump to the message on this page -- 2026-06-04 21:13 Pg14 replication issue , recovery stucks in a random file without advancing while streaming from primary Jorge Daniel <[email protected]> 2026-06-04 23:51 ` Re:Pg14 replication issue , recovery stucks in a random file without advancing while streaming from primary hello from Burnside Project <[email protected]> 2026-06-05 09:04 ` Laurenz Albe <[email protected]> 2026-06-05 10:00 ` Jorge Daniel <[email protected]> 2026-06-05 11:55 ` Laurenz Albe <[email protected]> 2026-06-08 10:28 ` Jorge Daniel <[email protected]>
This inbox is served by agora; see mirroring instructions for how to clone and mirror all data and code used for this inbox