Pg14 replication issue: recovery gets stuck on a random file without advancing while streaming from primary

public inbox for [email protected]  

* Pg14  replication issue , recovery stucks in a random file without advancing  while streaming from primary
@ 2026-06-04 21:13  Jorge Daniel <[email protected]>
  0 siblings, 2 replies; 6+ messages in thread

From: Jorge Daniel @ 2026-06-04 21:13 UTC
  To: [email protected] <[email protected]>

Good day to everyone,

We're asking the PG community for some help, if possible.

We had a primary with two secondaries. The primary went down and one of the secondaries was promoted. The orphaned secondary reconnected to the new primary and is replicating fine.
We then had to build a new secondary, which we did the way we always do, with the basic and dependable:

pg_basebackup -h uspgvento14r.us.local -U replicator -p 5432 -D $PGDATA -Fp -Xs -P -R --checkpoint=fast --create-slot --slot=us_vento_replica_slot_aux

It ran for 5 hours without problems. When it finished:
$ pg_ctl start 

Recovery ran until a consistent point was reached; the database then opened and started streaming the rest of the WAL from the primary.
After an hour or so, recovery got stuck on a particular WAL file. There were no further log entries about it (at log level debug1), and after a few hours the streaming connection was disconnected.
The secondary is hung on that WAL file and does not move on to the rest of the WAL files.

We retried this several times: with different storage (just in case) and on a new box running the original Ubuntu 22.04 instead of 24.04 (just in case), but the result was the same.
With the 22.04 and 24.04 replicas running in parallel, we saw both freeze on the same file (the WAL file they got stuck on changed every time we re-created them, of course).
We're out of ideas about what's happening.
Could you please shed some light here?


Primary: uspgvento14r

Version 
server_version | 14.11 (Ubuntu 14.11-0ubuntu0.22.04.1)

pg_stat_replication: 

pid              | 4177197
usesysid         | 16474
usename          | replicator
application_name | 14_replica
client_addr      | 192.168.11.33
client_hostname  |
client_port      | 53534
backend_start    | 2026-06-04 01:25:18.100137-07
backend_xmin     |
state            | streaming
sent_lsn         | 8EBA/CDA68930
write_lsn        | 8EBA/CDA68930
flush_lsn        | 8EBA/CDA68930
replay_lsn       | 8EB7/968FEBE8
write_lag        | 00:00:00.001543
flush_lag        | 00:00:00.00225
replay_lag       | 01:00:22.530041
sync_priority    | 0
sync_state       | async
reply_time       | 2026-06-04 03:33:28.037703-07
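
To put a number on the gap between sent_lsn and replay_lsn in the output above, the two positions can be subtracted. The following is only a minimal Python sketch and not part of the original post; the helper names lsn_to_int and lsn_gap_bytes are made up for illustration. It assumes the usual textual LSN form: two hexadecimal numbers separated by a slash, holding the high and low 32 bits of a 64-bit WAL position.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '8EBA/CDA68930' to a 64-bit integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lsn_gap_bytes(ahead: str, behind: str) -> int:
    """Return the number of WAL bytes between two LSNs (ahead minus behind)."""
    return lsn_to_int(ahead) - lsn_to_int(behind)


if __name__ == "__main__":
    # Values copied from the pg_stat_replication output above.
    sent_lsn = "8EBA/CDA68930"
    replay_lsn = "8EB7/968FEBE8"
    gap = lsn_gap_bytes(sent_lsn, replay_lsn)
    print(f"replay is {gap} bytes ({gap / 1024**3:.2f} GiB) behind sent_lsn")
```

On a live primary the same difference can be obtained directly in SQL with pg_wal_lsn_diff(sent_lsn, replay_lsn).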

Log Primary 

2026-06-04 01:25:18 PDT [unknown] [unknown] 192.168.12.34 [unknown] [4177197] LOG:  connection received: host=192.168.12.34 port=53534
2026-06-04 01:25:18 PDT     [2858] LOG:  background worker "logical replication worker" (PID 4177182) exited with exit code 1
2026-06-04 01:25:18 PDT [unknown] replicator 192.168.12.34 [unknown] [4177197] LOG:  connection authenticated: identity="replicator" method=md5 (/etc/postgresql/14/main/pg_hba.conf:95)
2026-06-04 01:25:18 PDT [unknown] replicator 192.168.12.34 [unknown] [4177197] LOG:  replication connection authorized: user=replicator application_name=14_uspgvento14Rb SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256)
--
2026-06-04 04:02:26 PDT [unknown] replicator 192.168.12.34 14_uspgvento14Rb [4177197] LOG:  disconnection: session time: 2:37:08.575 user=replicator database= host=192.168.12.34 port=53534




Secondary  : 14_uspgvento14Rb 192.168.12.34

Version:
server_version | 14.11 (Ubuntu 14.11-0ubuntu0.24.04.1)

Ubuntu 14.23-1.pgdg24.04+1


.....
2026-06-04 08:25:14 UTC     [197740] DEBUG:  got WAL segment from archive
2026-06-04 08:25:14 UTC     [197740] LOG:  restored log file "0000000200008EA40000002D" from archive
2026-06-04 08:25:14 UTC     [197740] DEBUG:  got WAL segment from archive
2026-06-04 08:25:14 UTC     [197740] LOG:  restored log file "0000000200008EA40000002E" from archive
2026-06-04 08:25:14 UTC     [197740] DEBUG:  got WAL segment from archive
2026-06-04 08:25:14 UTC     [197740] LOG:  restored log file "0000000200008EA40000002F" from archive
2026-06-04 08:25:14 UTC     [197740] DEBUG:  got WAL segment from archive
2026-06-04 08:25:14 UTC     [197740] LOG:  restored log file "0000000200008EA400000030" from archive
2026-06-04 08:25:14 UTC     [197740] DEBUG:  got WAL segment from archive
2026-06-04 08:25:15 UTC     [197747] DEBUG:  checkpoint sync: number=4 file=base/6176124/2840_fsm time=0.011 ms
2026-06-04 08:25:15 UTC     [197747] DEBUG:  checkpoint sync: number=5 file=base/6176124/6599312.33 time=1.493 ms
......
2026-06-04 08:25:18 UTC     [197740] LOG:  restored log file "0000000200008EA400000044" from archive
2026-06-04 08:25:18 UTC     [197740] DEBUG:  got WAL segment from archive
2026-06-04 08:25:18 UTC     [197740] DEBUG:  end of backup reached
2026-06-04 08:25:18 UTC     [197740] CONTEXT:  WAL redo at 8EA4/44680C08 for XLOG/BACKUP_END: 8E91/8B065578
2026-06-04 08:25:18 UTC     [197740] LOG:  consistent recovery state reached at 8EA4/44680C30
2026-06-04 08:25:18 UTC     [197738] LOG:  database system is ready to accept read-only connections
cp: cannot stat '/pg_data/pg14_wal_archive/0000000200008EA400000045': No such file or directory
2026-06-04 08:25:18 UTC     [207957] LOG:  started streaming WAL from primary at 8EA4/45000000 on timeline 2
2026-06-04 08:25:39 UTC [unknown] [unknown] [local] [unknown] [208039] LOG:  connection received: host=[local]
2026-06-04 08:25:39 UTC postgres postgres [local] [unknown] [208039] LOG:  connection authorized: user=postgres database=postgres application_name=psql
2026-06-04 08:26:15 UTC     [197747] LOG:  restartpoint starting: time
2026-06-04 08:26:15 UTC     [197747] DEBUG:  performing replication slot checkpoint
......
2026-06-04 08:30:15 UTC     [197747] DEBUG:  checkpoint sync: number=4 file=base/6176124/2840_fsm time=0.006 ms
2026-06-04 08:30:15 UTC     [197747] DEBUG:  checkpoint sync: number=5 file=base/6176124/6599312.33 time=1.572 ms
2026-06-04 08:30:15 UTC     [197747] DEBUG:  checkpoint sync: number=6 file=base/6176124/6602460.56 time=1.791 ms
.....
2026-06-04 09:35:15 UTC     [197747] DEBUG:  checkpoint sync: number=782 file=base/6176124/6601067.15 time=0.003 ms
2026-06-04 09:35:15 UTC     [197747] DEBUG:  checkpoint sync: number=783 file=base/6176124/6374431.10 time=0.006 ms
2026-06-04 09:35:15 UTC     [197747] LOG:  restartpoint complete: wrote 99500 buffers (19.0%); 0 WAL file(s) added, 0 removed, 25 recycled; write=239.934 s, sync=0.039 s, total=239.989 s; sync files=783, longest=0.001 s, average=0.001 s; distance=250897 kB, estimate=13117629 kB
2026-06-04 09:35:15 UTC     [197747] LOG:  recovery restart point at 8EB7/7550F228
2026-06-04 09:35:15 UTC     [197747] DETAIL:  Last completed transaction was at log time 2026-06-04 02:33:05.505911-07.
^@^@^@^@^@^@^@^@  --> forever <--





postgres=# select pg_last_wal_replay_lsn();
 pg_last_wal_replay_lsn
------------------------
 8EB7/968FEBE8


postgres@uspgvento14Rb:/pg_data/data14/pg_wal$ ps -ef |grep wal
postgres  207957  197738  6 08:25 ?        00:04:52 postgres: 14_uspgvento14Rb: walreceiver streaming 8EB7/A39875E8
postgres  211309  163479  0 09:36 pts/12   00:00:00 grep wal
postgres@uspgvento14Rb:/pg_data/data14/pg_wal$ ps -ef |grep reco
postgres  197740  197738 16 08:11 ?        00:14:26 postgres: 14_uspgvento14Rb: startup recovering 0000000200008EB700000096
postgres  211311  163479  0 09:36 pts/12   00:00:00 grep reco



Content of the WAL:

postgres@uspgvento14b:/pg_data/data14/pg_wal$ pg_waldump 0000000200008EB700000096 |grep 8FEBE8
rmgr: MultiXact   len (rec/tot):     54/    54, tx: 2681967401, lsn: 8EB7/968FEBE8, prev 8EB7/968FEBA8, desc: CREATE_ID 2439167 offset 5217620 nmembers 2: 2681967400 (keysh) 2681967401 (keysh)
rmgr: Heap        len (rec/tot):     54/    54, tx: 2681967401, lsn: 8EB7/968FEC20, prev 8EB7/968FEBE8, desc: LOCK off 35: xid 2439167: flags 0x00 IS_MULTI LOCK_ONLY KEYSHR_LOCK , blkref #0: rel 1663/6176124/6474188 blk 29758258
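
The replay LSN 8EB7/968FEBE8 and the segment name shown by the startup process are consistent with each other, which can be checked by hand. The following is a minimal Python sketch, not part of the original post; the function names are made up for illustration. It assumes the default 16 MB WAL segment size (which matches the 16777216-byte files in the pg_wal listing in this message) and timeline 2 (as reported in the "started streaming WAL" log line).

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16 MB WAL segments


def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '8EB7/968FEBE8' to a 64-bit integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_file_name(timeline: int, lsn: str, segment_size: int = WAL_SEGMENT_SIZE) -> str:
    """Name of the WAL segment file that contains the given LSN."""
    segment_no = lsn_to_int(lsn) // segment_size
    # Number of segments that fit into one 4 GB "xlogid" (256 for 16 MB segments).
    segments_per_xlogid = 0x100000000 // segment_size
    return "%08X%08X%08X" % (
        timeline,
        segment_no // segments_per_xlogid,
        segment_no % segments_per_xlogid,
    )


def offset_in_segment(lsn: str, segment_size: int = WAL_SEGMENT_SIZE) -> int:
    """Byte offset of the LSN inside its segment file."""
    return lsn_to_int(lsn) % segment_size


if __name__ == "__main__":
    print(wal_file_name(2, "8EB7/968FEBE8"))
    print(hex(offset_in_segment("8EB7/968FEBE8")))
```

On a server that is not in recovery, the built-in function pg_walfile_name() performs the same mapping; it cannot be executed on a standby, which is why a manual calculation can be handy there.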


There are still plenty of files to process: 
postgres@uspgvento14Rb:/pg_data/data14/pg_wal$ ls -ltr |grep -A5 -B5 0000000200008EB700000096
-rw------- 1 postgres postgres 16777216 Jun  4 09:32 0000000200008EB700000091
-rw------- 1 postgres postgres 16777216 Jun  4 09:32 0000000200008EB700000092
-rw------- 1 postgres postgres 16777216 Jun  4 09:32 0000000200008EB700000093
-rw------- 1 postgres postgres 16777216 Jun  4 09:32 0000000200008EB700000094
-rw------- 1 postgres postgres 16777216 Jun  4 09:32 0000000200008EB700000095
-rw------- 1 postgres postgres 16777216 Jun  4 09:33 0000000200008EB700000096
-rw------- 1 postgres postgres 16777216 Jun  4 09:33 0000000200008EB700000097
-rw------- 1 postgres postgres 16777216 Jun  4 09:33 0000000200008EB700000098
-rw------- 1 postgres postgres 16777216 Jun  4 09:33 0000000200008EB700000099
-rw------- 1 postgres postgres 16777216 Jun  4 09:34 0000000200008EB70000009A
-rw------- 1 postgres postgres 16777216 Jun  4 09:34 0000000200008EB70000009B


postgresql.conf on the secondary (the primary is the same, but with more memory)

All non-default values:

shared_buffers = 4GB                    # min 128kB
temp_buffers = 512MB                    # min 800kB
work_mem = 512MB                                # min 64kB
maintenance_work_mem = 4GB              # min 1MB
autovacuum_work_mem =  512MB            # min 1MB, or -1 to use maintenance_work_mem
max_stack_depth = 7MB
bgwriter_lru_maxpages = 1000            # max buffers written/round, 0 disables

wal_level = logical                     # minimal, replica, or logical
wal_log_hints = on                      # also do full page writes of non-critical updates

checkpoint_completion_target = 0.8      # checkpoint target duration, 0.0 - 1.0
checkpoint_warning = 600s               # 0 disables
max_wal_size = 500GB
min_wal_size = 50GB

max_wal_senders = 10            # max number of walsender processes
max_replication_slots = 30      # max number of replication slots
wal_keep_size = 150GB           # in megabytes; 0 disables  30720=30GB  1.5 day
max_slot_wal_keep_size = 500GB  # in megabytes; -1 disables 307200=300GB 4 days
wal_sender_timeout = 600s       # in milliseconds; 0 disables

max_standby_archive_delay = 48h         # max delay before canceling queries
max_standby_streaming_delay = 48h       # max delay before canceling queries
hot_standby_feedback = on               # send info from standby to prevent
wal_receiver_timeout = 600s             # time that receiver waits for




* Re:Pg14 replication issue , recovery stucks in a random file without advancing while streaming from primary
@ 2026-06-04 23:51  hello from Burnside Project <[email protected]>
  parent: Jorge Daniel <[email protected]>
  1 sibling, 0 replies; 6+ messages in thread

From: hello from Burnside Project @ 2026-06-04 23:51 UTC
  To: Jorge Daniel <[email protected]>; +Cc: [email protected] <[email protected]>

1. Replication status

On the primary:

SELECT * FROM pg_stat_replication;

Shows:

* Is the standby connected?
* What LSN has been sent?
* What LSN has been replayed?

2. WAL receiver status

On the standby:

SELECT * FROM pg_stat_wal_receiver;

Shows:

* Is WAL still being received?
* From which server?
* Latest WAL location received?

3. Recovery progress

On the standby:

SELECT
    pg_last_wal_receive_lsn(),
    pg_last_wal_replay_lsn(),
    pg_last_xact_replay_timestamp();

This tells you whether:

* WAL is arriving but not replaying.
* WAL is replaying slowly.
* Replay has completely stopped.
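
One way to tell these cases apart is to sample the receive and replay positions twice, some seconds apart, and compare how far each moved. The following is only a minimal Python sketch under that assumption; the function name classify_replay and the returned labels are hypothetical, and the inputs are the text values returned by pg_last_wal_receive_lsn() and pg_last_wal_replay_lsn().

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '8EB7/968FEBE8' to a 64-bit integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def classify_replay(receive_before: str, replay_before: str,
                    receive_after: str, replay_after: str) -> str:
    """Classify standby progress from two (receive, replay) LSN samples."""
    received = lsn_to_int(receive_after) - lsn_to_int(receive_before)
    replayed = lsn_to_int(replay_after) - lsn_to_int(replay_before)
    backlog = lsn_to_int(receive_after) - lsn_to_int(replay_after)

    if replayed == 0 and backlog > 0:
        # Replay did not move although unreplayed WAL is available.
        if received > 0:
            return "arriving but not replaying"
        return "replay stopped"
    if backlog > 0 and replayed < received:
        # Replay moves, but more slowly than new WAL arrives.
        return "replaying slowly"
    return "keeping up"
```

Called with a replay position that stays at 8EB7/968FEBE8 in both samples while the receive position advances, this returns "arriving but not replaying", which is the situation described in the original report.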



4. PostgreSQL logs

Look for messages such as:

invalid record length
PANIC
could not read WAL
requested timeline
waiting for WAL
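
A small filter can pull such lines out of a large log file. The following is a minimal Python sketch; the function name find_suspicious_lines is hypothetical, and the patterns are simply the phrases listed above used as case-insensitive substrings. The exact wording of PostgreSQL messages can differ between versions, so the patterns are a starting point rather than a complete list.

```python
PATTERNS = (
    "invalid record length",
    "PANIC",
    "could not read WAL",
    "requested timeline",
    "waiting for WAL",
)


def find_suspicious_lines(lines, patterns=PATTERNS):
    """Return (line_number, pattern, line) for each line matching a pattern.

    Matching is a case-insensitive substring test; a line is reported once,
    for the first pattern it matches.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        for pattern in patterns:
            if pattern.lower() in lowered:
                hits.append((number, pattern, line.rstrip("\n")))
                break
    return hits
```

It accepts any iterable of lines, so an open log file can be passed in directly.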



Best regards,
Arif Rahman

https://burnsideproject.ai
https://github.com/orgs/burnside-project
Capture, transform, and learn from PostgreSQL














* Re: Pg14  replication issue , recovery stucks in a random file without advancing  while streaming from primary
@ 2026-06-05 09:04  Laurenz Albe <[email protected]>
  parent: Jorge Daniel <[email protected]>
  1 sibling, 1 reply; 6+ messages in thread

From: Laurenz Albe @ 2026-06-05 09:04 UTC
  To: Jorge Daniel <[email protected]>; [email protected] <[email protected]>

On Thu, 2026-06-04 at 21:13 +0000, Jorge Daniel wrote:
> server_version | 14.11 (Ubuntu 14.11-0ubuntu0.22.04.1)

Try again with 14.23.

Yours,
Laurenz Albe







* Re: Pg14  replication issue , recovery stucks in a random file without advancing  while streaming from primary
@ 2026-06-05 10:00  Jorge Daniel <[email protected]>
  parent: Laurenz Albe <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: Jorge Daniel @ 2026-06-05 10:00 UTC
  To: Laurenz Albe <[email protected]>; +Cc: [email protected] <[email protected]>; [email protected] <[email protected]>

Hello Laurenz, thank you for the help.

We tried that in our first 3 attempts; then we downgraded PostgreSQL to match the primary's version, and then we downgraded the OS version.
So far we have tried these combinations:

* Ubuntu 24.04 + PG 14.23, with 2 different storage devices (we suspected an issue in the disk layer)
* Ubuntu 24.04 + PG 14.11
* Ubuntu 22.04 + PG 14.11 -> same versions as the primary

In all these environments we had the same result: WAL replay stopped while WAL was still being received via streaming.
There are no log lines about it, none at all.

We have never seen this scenario before.

Kind Regards 
Jorge Fernandez 

> On 5 Jun 2026, at 11:04, Laurenz Albe <[email protected]> wrote:
> 
> On Thu, 2026-06-04 at 21:13 +0000, Jorge Daniel wrote:
>> server_version | 14.11 (Ubuntu 14.11-0ubuntu0.22.04.1)
> 
> Try again with 14.23.
> 
> Yours,
> Laurenz Albe




* Re: Pg14  replication issue , recovery stucks in a random file without advancing  while streaming from primary
@ 2026-06-05 11:55  Laurenz Albe <[email protected]>
  parent: Jorge Daniel <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: Laurenz Albe @ 2026-06-05 11:55 UTC
  To: Jorge Daniel <[email protected]>; +Cc: [email protected] <[email protected]>; [email protected] <[email protected]>

On Fri, 2026-06-05 at 10:00 +0000, Jorge Daniel wrote:
> We tried on our first 3 attempts, then we downgraded to mirror the Primary Version , then we downgraded the OS version . 
> We tried so far this combinations : 
> 
> * Ubuntu 24.04 + PG 14.23  and with  2 different storage devices (suspecting  about an issue in  the disk layer )
> * Ubuntu 24.04 + PG  14.11 
> * Ubuntu 22.04 + PG  14.11 -> Same versions of the Primary 
> 
> In all these environments  we had the same result : stopped replaying wal-files while still receiving via streaming . 
> No log lines about it ,none
> 
> We never saw this scenario ever 

Perhaps you are suffering from this bug [1] up to 14.22 and from this bug [2] in 14.23
(which was introduced by the previous bugfix).

Yours,
Laurenz Albe


 [1]: https://postgr.es/m/flat/CACV2tSw3VYS7d27ftO_cs%2BaF3M54%2BJwWBbqSGLcKoG9cvyb6EA%40mail.gmail.com
 [2]: https://postgr.es/m/flat/19490-9c59c6a583513b99%40postgresql.org







* Re: Pg14  replication issue , recovery stucks in a random file without advancing  while streaming from primary
@ 2026-06-08 10:28  Jorge Daniel <[email protected]>
  parent: Laurenz Albe <[email protected]>
  0 siblings, 0 replies; 6+ messages in thread

From: Jorge Daniel @ 2026-06-08 10:28 UTC
  To: Laurenz Albe <[email protected]>; +Cc: [email protected] <[email protected]>; [email protected] <[email protected]>

Thanks for the help, Laurenz.

After researching those bugs (the conditions were not exactly the same as ours, but recovery stopped there too), we made another attempt with a minor version older than the primary's (14.11) and installed 14.9.
This time the result was different: the secondary stayed connected to the primary and streaming, with 100% of the WAL replayed, i.e. recovery did not stop working!

state            | streaming
sent_lsn         | 8F32/C9A72F18
write_lsn        | 8F32/C9A72F18
flush_lsn        | 8F32/C9A72F18
replay_lsn       | 8F32/C9A72F18

It ran normally the whole weekend without hanging, but we noticed that one particular table was not in sync (the others seem to be updated normally).

Primary 

pmx=# select now(), * from tstamp;
-[ RECORD 1 ]-------------------------
now    | 2026-06-07 04:27:25.443001-07
tstamp | 2026-06-07 04:27:01.9337-07  <-- here

Secondary 
pmx=# select now(), max(time) from activity_log;
-[ RECORD 1 ]----------------------
now    | 2026-06-07 04:27:01.368247-07
tstamp | 2026-06-04 18:02:02.446899-07  <-- here


Did we hit another bug?  


> On 5 Jun 2026, at 13:55, Laurenz Albe <[email protected]> wrote:
> 
> On Fri, 2026-06-05 at 10:00 +0000, Jorge Daniel wrote:
>> We tried on our first 3 attempts, then we downgraded to mirror the Primary Version , then we downgraded the OS version . 
>> We tried so far this combinations : 
>> 
>> * Ubuntu 24.04 + PG 14.23  and with  2 different storage devices (suspecting  about an issue in  the disk layer )
>> * Ubuntu 24.04 + PG  14.11 
>> * Ubuntu 22.04 + PG  14.11 -> Same versions of the Primary 
>> 
>> In all these environments  we had the same result : stopped replaying wal-files while still receiving via streaming . 
>> No log lines about it ,none
>> 
>> We never saw this scenario ever
> 
> Perhaps you are suffering from this bug [1] up to 14.22 and from this bug [2] in 14.23
> (which was introduced by the previous bugfix).
> 
> Yours,
> Laurenz Albe
> 
> 
> [1]: https://postgr.es/m/flat/CACV2tSw3VYS7d27ftO_cs%2BaF3M54%2BJwWBbqSGLcKoG9cvyb6EA%40mail.gmail.com
> [2]: https://postgr.es/m/flat/19490-9c59c6a583513b99%40postgresql.org


