Date: Thu, 04 Jun 2026 16:51:25 -0700
From: hello from Burnside Project <hello@burnsideproject.ai>
To: "Jorge Daniel" <elgaita@hotmail.com>
Cc: "pgsql-admin@lists.postgresql.org" <pgsql-admin@lists.postgresql.org>
Message-Id: 
 <19e950c8337.61bdc2a03840616.4821016053268419825@burnsideproject.ai>
In-Reply-To: <A05772F4-7C72-4151-9562-7C1A84A886E2@hotmail.com>
References: <A05772F4-7C72-4151-9562-7C1A84A886E2@hotmail.com>
Subject: Re:Pg14 replication issue , recovery stucks in a random file
 without advancing while streaming from primary
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----=_Part_13096318_1715792670.1780617085752"
Importance: Medium
Disposition-Notification-To: "hello from Burnside Project"
 <hello@burnsideproject.ai>
User-Agent: Zoho Mail
Archived-At: 
 <https://www.postgresql.org/message-id/19e950c8337.61bdc2a03840616.4821016053268419825%40burnsideproject.ai>
Precedence: bulk

------=_Part_13096318_1715792670.1780617085752
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

1. Replication status


On the primary:


SELECT * FROM pg_stat_replication;


Shows:


* Is the standby connected?

* What LSN has been sent?

* What LSN has been replayed?
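
To put a number on the gap between what has been sent and what has been replayed, the two LSNs can be subtracted. The following is only a minimal sketch in Python: the helper names lsn_to_int and lsn_gap_bytes are made up for this illustration, and the sample values are the sent_lsn and replay_lsn from the original report quoted further down. On the server itself, pg_wal_lsn_diff() performs the same subtraction.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert an LSN printed as 'HI/LO' (two hex numbers) into a byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def lsn_gap_bytes(ahead: str, behind: str) -> int:
    """Number of WAL bytes between two LSNs (ahead minus behind)."""
    return lsn_to_int(ahead) - lsn_to_int(behind)


if __name__ == "__main__":
    # sent_lsn and replay_lsn as shown in the quoted pg_stat_replication output.
    gap = lsn_gap_bytes("8EBA/CDA68930", "8EB7/968FEBE8")
    print(f"replay is {gap} bytes ({gap / 1024**3:.2f} GiB) behind sent_lsn")
```

A gap that keeps growing while write_lsn and flush_lsn follow sent_lsn suggests the standby is receiving WAL but not applying it.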


2. WAL receiver status


On the standby:

SELECT * FROM pg_stat_wal_receiver;


Shows:


* Is WAL still being received?

* From which server?

* Latest WAL location received?
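
To relate a WAL location to a file in pg_wal, the LSN can be mapped to a segment file name. This is a rough sketch in Python, assuming the default 16 MB segment size (the 16777216-byte files in the original report are consistent with that); wal_file_name is a name invented for this example. On a primary, pg_walfile_name() performs this mapping.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16 MB WAL segments


def wal_file_name(timeline: int, lsn: str, segment_size: int = WAL_SEGMENT_SIZE) -> str:
    """Return the 24-character WAL segment file name that contains the given LSN."""
    hi, lo = lsn.split("/")
    position = (int(hi, 16) << 32) | int(lo, 16)
    segment_number = position // segment_size
    # Number of segments that fit into one 4 GB "xlog id" (256 for 16 MB segments).
    segments_per_xlogid = 0x100000000 // segment_size
    return "%08X%08X%08X" % (
        timeline,
        segment_number // segments_per_xlogid,
        segment_number % segments_per_xlogid,
    )


if __name__ == "__main__":
    # Replay LSN and walreceiver position from the quoted report, timeline 2.
    print(wal_file_name(2, "8EB7/968FEBE8"))
    print(wal_file_name(2, "8EB7/A39875E8"))
```

Comparing the file for the received location with the file the startup process reports shows how many segments are waiting to be replayed.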


3. Recovery progress


On the standby:


SELECT
    pg_last_wal_receive_lsn(),
    pg_last_wal_replay_lsn(),
    pg_last_xact_replay_timestamp();


This tells you whether:


* WAL is arriving but not replaying.

* WAL is replaying slowly.

* Replay has completely stopped.
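
One way to tell these cases apart is to run the query twice, some time apart, and compare the two samples. The sketch below is a hypothetical Python helper, not part of PostgreSQL: classify_replay and its labels are invented here, and each sample is assumed to be a (receive_lsn, replay_lsn) pair copied from the query output.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert an LSN printed as 'HI/LO' (two hex numbers) into a byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def classify_replay(first, second) -> str:
    """Compare two (receive_lsn, replay_lsn) samples taken some time apart."""
    recv1, replay1 = (lsn_to_int(x) for x in first)
    recv2, replay2 = (lsn_to_int(x) for x in second)
    received = recv2 - recv1   # WAL bytes received between the samples
    replayed = replay2 - replay1  # WAL bytes replayed between the samples
    if replayed > 0:
        # Replay is moving; call it slow if the backlog grew in the meantime.
        backlog_grew = (recv2 - replay2) > (recv1 - replay1)
        return "replaying slowly" if backlog_grew else "replaying and keeping up"
    if received > 0:
        return "arriving but not replaying"
    if recv2 > replay2:
        return "replay stopped"  # nothing new arrived, but a backlog remains
    return "idle"  # nothing received and nothing left to replay


if __name__ == "__main__":
    # LSNs that appear in the quoted report: the replay position never moves.
    print(classify_replay(("8EB7/A39875E8", "8EB7/968FEBE8"),
                          ("8EBA/CDA68930", "8EB7/968FEBE8")))
```

The sampling interval matters: a long-running replay of a single record can look like a stop if the two samples are taken too close together.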


4. PostgreSQL logs


Look for messages such as:


invalid record length

PANIC

could not read WAL

requested timeline

waiting for WAL
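
Searching a large log for these strings can be scripted. What follows is a small Python sketch, assuming the log is available as plain text lines; find_suspicious_lines is a name made up here, and the sample lines in the demo are invented for the demonstration, not real server output.

```python
import re

# The message fragments listed above, matched case-sensitively as written.
PATTERNS = [
    "invalid record length",
    "PANIC",
    "could not read WAL",
    "requested timeline",
    "waiting for WAL",
]


def find_suspicious_lines(lines, patterns=PATTERNS):
    """Return (line_number, text) for every line containing one of the patterns."""
    regex = re.compile("|".join(re.escape(p) for p in patterns))
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if regex.search(line)
    ]


if __name__ == "__main__":
    # Invented sample lines, only to show the shape of the result.
    sample = [
        "LOG:  restartpoint starting: time\n",
        "PANIC:  example failure\n",
        "LOG:  example: invalid record length\n",
    ]
    for number, text in find_suspicious_lines(sample):
        print(number, text)
```

To scan a real file, pass an open file object, for example find_suspicious_lines(open(path)) with path pointing at the standby's log.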


Best regards,

Arif Rahman


mailto:arif.rahman@burnsideproject.ai

https://burnsideproject.ai


https://github.com/orgs/burnside-project


Capture, transform, and learn from PostgreSQL



From: Jorge Daniel <elgaita@hotmail.com>
To: "pgsql-admin@lists.postgresql.org"<pgsql-admin@lists.postgresql.org>
Date: Thu, 04 Jun 2026 14:13:45 -0700
Subject: Pg14 replication issue , recovery stucks in a random file without advancing while streaming from primary


Good day to everyone,

We're asking the PG community for some help, if possible.

We have a primary with 2 secondaries: the primary went down and one of the secondaries was promoted. The orphaned secondary reconnected to the new primary and is replicating OK.
We had to rebuild a new secondary, and we did it as we always do, with the basic and dependable:

pg_basebackup -h uspgvento14r.us.local -U replicator -p 5432 -D $PGDATA -Fp -Xs -P -R --checkpoint=fast --create-slot --slot=us_vento_replica_slot_aux

It ran for 5 hours without problems. When it finished:
$ pg_ctl start

Recovery ran until a consistent point was reached, the database opened and started streaming the rest from the primary.
After an hour or so, recovery got stuck on a certain WAL file. There were no more log entries about it (at debug1), and after some hours the streaming connection was disconnected.
The secondary is hung on that WAL file and is not moving on to the rest of the WAL files.

We retried this several times, changing the storage (just in case) and using a new box with the same original Ubuntu 22.04 instead of 24.04 (just in case), but the result was the same.
With the 22.04 and 24.04 replicas running in parallel, we saw both freeze on the same file (each time we re-created them, the stuck WAL file changed, of course).
We're out of ideas about what's happening.
Could you please shed some light here?


Primary: uspgvento14r

Version
server_version | 14.11 (Ubuntu 14.11-0ubuntu0.22.04.1)

pg_stat_replication:

pid              | 4177197
usesysid         | 16474
usename          | replicator
application_name | 14_replica
client_addr      | 192.168.11.33
client_hostname  |
client_port      | 53534
backend_start    | 2026-06-04 01:25:18.100137-07
backend_xmin     |
state            | streaming
sent_lsn         | 8EBA/CDA68930
write_lsn        | 8EBA/CDA68930
flush_lsn        | 8EBA/CDA68930
replay_lsn       | 8EB7/968FEBE8
write_lag        | 00:00:00.001543
flush_lag        | 00:00:00.00225
replay_lag       | 01:00:22.530041
sync_priority    | 0
sync_state       | async
reply_time       | 2026-06-04 03:33:28.037703-07

Log Primary

2026-06-04 01:25:18 PDT [unknown] [unknown] 192.168.12.34 [unknown] [4177197] LOG:  connection received: host=192.168.12.34 port=53534
2026-06-04 01:25:18 PDT     [2858] LOG:  background worker "logical replication worker" (PID 4177182) exited with exit code 1
2026-06-04 01:25:18 PDT [unknown] replicator 192.168.12.34 [unknown] [4177197] LOG:  connection authenticated: identity="replicator" method=md5 (/etc/postgresql/14/main/pg_hba.conf:95)
2026-06-04 01:25:18 PDT [unknown] replicator 192.168.12.34 [unknown] [4177197] LOG:  replication connection authorized: user=replicator application_name=14_uspgvento14Rb SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256)
--
2026-06-04 04:02:26 PDT [unknown] replicator 192.168.12.34 14_uspgvento14Rb [4177197] LOG:  disconnection: session time: 2:37:08.575 user=replicator database= host=192.168.12.34 port=53534



Secondary  : 14_uspgvento14Rb 192.168.12.34

Version:
server_version | 14.11 (Ubuntu 14.11-0ubuntu0.24.04.1)

Ubuntu 14.23-1.pgdg24.04+1


.....
2026-06-04 08:25:14 UTC     [197740] DEBUG:  got WAL segment from archive
2026-06-04 08:25:14 UTC     [197740] LOG:  restored log file "0000000200008EA40000002D" from archive
2026-06-04 08:25:14 UTC     [197740] DEBUG:  got WAL segment from archive
2026-06-04 08:25:14 UTC     [197740] LOG:  restored log file "0000000200008EA40000002E" from archive
2026-06-04 08:25:14 UTC     [197740] DEBUG:  got WAL segment from archive
2026-06-04 08:25:14 UTC     [197740] LOG:  restored log file "0000000200008EA40000002F" from archive
2026-06-04 08:25:14 UTC     [197740] DEBUG:  got WAL segment from archive
2026-06-04 08:25:14 UTC     [197740] LOG:  restored log file "0000000200008EA400000030" from archive
2026-06-04 08:25:14 UTC     [197740] DEBUG:  got WAL segment from archive
2026-06-04 08:25:15 UTC     [197747] DEBUG:  checkpoint sync: number=4 file=base/6176124/2840_fsm time=0.011 ms
2026-06-04 08:25:15 UTC     [197747] DEBUG:  checkpoint sync: number=5 file=base/6176124/6599312.33 time=1.493 ms
......
2026-06-04 08:25:18 UTC     [197740] LOG:  restored log file "0000000200008EA400000044" from archive
2026-06-04 08:25:18 UTC     [197740] DEBUG:  got WAL segment from archive
2026-06-04 08:25:18 UTC     [197740] DEBUG:  end of backup reached
2026-06-04 08:25:18 UTC     [197740] CONTEXT:  WAL redo at 8EA4/44680C08 for XLOG/BACKUP_END: 8E91/8B065578
2026-06-04 08:25:18 UTC     [197740] LOG:  consistent recovery state reached at 8EA4/44680C30
2026-06-04 08:25:18 UTC     [197738] LOG:  database system is ready to accept read-only connections
cp: cannot stat '/pg_data/pg14_wal_archive/0000000200008EA400000045': No such file or directory
2026-06-04 08:25:18 UTC     [207957] LOG:  started streaming WAL from primary at 8EA4/45000000 on timeline 2
2026-06-04 08:25:39 UTC [unknown] [unknown] [local] [unknown] [208039] LOG:  connection received: host=[local]
2026-06-04 08:25:39 UTC postgres postgres [local] [unknown] [208039] LOG:  connection authorized: user=postgres database=postgres application_name=psql
2026-06-04 08:26:15 UTC     [197747] LOG:  restartpoint starting: time
2026-06-04 08:26:15 UTC     [197747] DEBUG:  performing replication slot checkpoint
......
2026-06-04 08:30:15 UTC     [197747] DEBUG:  checkpoint sync: number=4 file=base/6176124/2840_fsm time=0.006 ms
2026-06-04 08:30:15 UTC     [197747] DEBUG:  checkpoint sync: number=5 file=base/6176124/6599312.33 time=1.572 ms
2026-06-04 08:30:15 UTC     [197747] DEBUG:  checkpoint sync: number=6 file=base/6176124/6602460.56 time=1.791 ms
.....
2026-06-04 09:35:15 UTC     [197747] DEBUG:  checkpoint sync: number=782 file=base/6176124/6601067.15 time=0.003 ms
2026-06-04 09:35:15 UTC     [197747] DEBUG:  checkpoint sync: number=783 file=base/6176124/6374431.10 time=0.006 ms
2026-06-04 09:35:15 UTC     [197747] LOG:  restartpoint complete: wrote 99500 buffers (19.0%); 0 WAL file(s) added, 0 removed, 25 recycled; write=239.934 s, sync=0.039 s, total=239.989 s; sync files=783, longest=0.001 s, average=0.001 s; distance=250897 kB, estimate=13117629 kB
2026-06-04 09:35:15 UTC     [197747] LOG:  recovery restart point at 8EB7/7550F228
2026-06-04 09:35:15 UTC     [197747] DETAIL:  Last completed transaction was at log time 2026-06-04 02:33:05.505911-07.
^@^@^@^@^@^@^@^@ --> Forever <--


postgres=# select pg_last_wal_replay_lsn();
 pg_last_wal_replay_lsn
------------------------
 8EB7/968FEBE8


postgres@uspgvento14Rb:/pg_data/data14/pg_wal$ ps -ef |grep wal
postgres  207957  197738  6 08:25 ?        00:04:52 postgres: 14_uspgvento14Rb: walreceiver streaming 8EB7/A39875E8
postgres  211309  163479  0 09:36 pts/12   00:00:00 grep wal
postgres@uspgvento14Rb:/pg_data/data14/pg_wal$ ps -ef |grep reco
postgres  197740  197738 16 08:11 ?        00:14:26 postgres: 14_uspgvento14Rb: startup recovering 0000000200008EB700000096
postgres  211311  163479  0 09:36 pts/12   00:00:00 grep reco


Content of the WAL:

postgres@uspgvento14b:/pg_data/data14/pg_wal$ pg_waldump 0000000200008EB700000096 |grep 8FEBE8
rmgr: MultiXact   len (rec/tot):     54/    54, tx: 2681967401, lsn: 8EB7/968FEBE8, prev 8EB7/968FEBA8, desc: CREATE_ID 2439167 offset 5217620 nmembers 2: 2681967400 (keysh) 2681967401 (keysh)
rmgr: Heap        len (rec/tot):     54/    54, tx: 2681967401, lsn: 8EB7/968FEC20, prev 8EB7/968FEBE8, desc: LOCK off 35: xid 2439167: flags 0x00 IS_MULTI LOCK_ONLY KEYSHR_LOCK , blkref #0: rel 1663/6176124/6474188 blk 29758258


There are still plenty of files to process:
postgres@uspgvento14Rb:/pg_data/data14/pg_wal$ ls -ltr |grep -A5 -B5 0000000200008EB700000096
-rw------- 1 postgres postgres 16777216 Jun  4 09:32 0000000200008EB700000091
-rw------- 1 postgres postgres 16777216 Jun  4 09:32 0000000200008EB700000092
-rw------- 1 postgres postgres 16777216 Jun  4 09:32 0000000200008EB700000093
-rw------- 1 postgres postgres 16777216 Jun  4 09:32 0000000200008EB700000094
-rw------- 1 postgres postgres 16777216 Jun  4 09:32 0000000200008EB700000095
-rw------- 1 postgres postgres 16777216 Jun  4 09:33 0000000200008EB700000096
-rw------- 1 postgres postgres 16777216 Jun  4 09:33 0000000200008EB700000097
-rw------- 1 postgres postgres 16777216 Jun  4 09:33 0000000200008EB700000098
-rw------- 1 postgres postgres 16777216 Jun  4 09:33 0000000200008EB700000099
-rw------- 1 postgres postgres 16777216 Jun  4 09:34 0000000200008EB70000009A
-rw------- 1 postgres postgres 16777216 Jun  4 09:34 0000000200008EB70000009B


postgresql.conf on the secondary (the primary is the same, but with more memory)

All non-default values:

shared_buffers = 4GB                    # min 128kB
temp_buffers = 512MB                    # min 800kB
work_mem = 512MB                                # min 64kB
maintenance_work_mem = 4GB              # min 1MB
autovacuum_work_mem =  512MB            # min 1MB, or -1 to use maintenance_work_mem
max_stack_depth = 7MB
bgwriter_lru_maxpages = 1000            # max buffers written/round, 0 disables

wal_level = logical                     # minimal, replica, or logical
wal_log_hints = on                      # also do full page writes of non-critical updates

checkpoint_completion_target = 0.8      # checkpoint target duration, 0.0 - 1.0
checkpoint_warning = 600s               # 0 disables
max_wal_size = 500GB
min_wal_size = 50GB

max_wal_senders = 10            # max number of walsender processes
max_replication_slots = 30      # max number of replication slots
wal_keep_size = 150GB           # in megabytes; 0 disables  30720=30GB  1.5 day
max_slot_wal_keep_size = 500GB  # in megabytes; -1 disables 307200=300GB 4 days
wal_sender_timeout = 600s       # in milliseconds; 0 disables

max_standby_archive_delay = 48h         # max delay before canceling queries
max_standby_streaming_delay = 48h       # max delay before canceling queries
hot_standby_feedback = on               # send info from standby to prevent
wal_receiver_timeout = 600s             # time that receiver waits for
------=_Part_13096318_1715792670.1780617085752
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head>=
<meta content=3D"text/html;charset=3DUTF-8" http-equiv=3D"Content-Type"></h=
ead><body ><div style=3D"font-family: Verdana, Arial, Helvetica, sans-serif=
; font-size: 10pt;"><div>Replication status<br></div><div><br></div><div>On=
 the primary:<br></div><div><br></div><div>SELECT * FROM pg_stat_replicatio=
n;<br></div><div><br></div><div>Shows:<br></div><div><br></div><div>* Is th=
e standby connected?<br></div><div>* What LSN has been sent?<br></div><div>=
* What LSN has been replayed?<br></div><div><br></div><div><br></div><div>2=
. WAL receiver status<br></div><div><br></div><div>On the standby:<br></div=
><div>SELECT * FROM pg_stat_wal_receiver;<br></div><div><br></div><div>Show=
s:<br></div><div><br></div><div>* Is WAL still being received?<br></div><di=
v>* From which server?<br></div><div>* Latest WAL location received?<br></d=
iv><div><br></div><div>3. Recovery progress<br></div><div><br></div><div>On=
 the standby:<br></div><div><br></div><div>SELECT<br></div><div>&nbsp;&nbsp=
;&nbsp; pg_last_wal_receive_lsn(),<br></div><div>&nbsp;&nbsp;&nbsp; pg_last=
_wal_replay_lsn(),<br></div><div>&nbsp;&nbsp;&nbsp; pg_last_xact_replay_tim=
estamp();<br></div><div>&nbsp;&nbsp;&nbsp;<br></div><div><br></div><div>Thi=
s tells you whether:<br></div><div><br></div><div>* WAL is arriving but not=
 replaying.<br></div><div>* WAL is replaying slowly.<br></div><div>* Replay=
 has completely stopped.<br></div><div><br></div><div>4. PostgreSQL logs<br=
></div><div><br></div><div>Look for messages such as:<br></div><div><br></d=
iv><div>invalid record length<br></div><div>PANIC<br></div><div>could not r=
ead WAL<br></div><div>requested timeline<br></div><div>waiting for WAL<br><=
/div><div><br></div><div id=3D"Zm-_Id_-Sgn" data-sigid=3D"771820715" data-z=
bluepencil-ignore=3D"true"><div><span class=3D"size" style=3D"font-size:16p=
x">Best regards,&nbsp; </span><br></div><div><span class=3D"size" style=3D"=
font-size:16px">Arif Rahman&nbsp;</span><br></div><div><br></div><div><a ta=
rget=3D"_blank" href=3D"mailto:arif.rahman@burnsideproject.ai"><span class=
=3D"size" style=3D"font-size:16px">arif.rahman@burnsideproject.ai</span></a=
><span class=3D"size" style=3D"font-size:16px">&nbsp;</span><br></div><div>=
<a target=3D"_blank" href=3D"https://burnsideproject.ai"><span class=3D"siz=
e" style=3D"font-size:16px">https://burnsideproject.ai</span></a><span clas=
s=3D"size" style=3D"font-size:16px">&nbsp;</span><br></div><div><br></div><=
div><a target=3D"_blank" href=3D"https://github.com/orgs/burnside-project">=
<span class=3D"size" style=3D"font-size:16px">https://github.com/orgs/burns=
ide-project</span></a><span class=3D"size" style=3D"font-size:16px">&nbsp;<=
/span><br></div><div><br></div><div><span class=3D"size" style=3D"font-size=
:16px">Capture, transform, and learn from PostgreSQL</span><br></div><div><=
br></div><p class=3D"p1" style=3D"margin: 0px; font-style: normal; font-var=
iant-caps: normal; font-weight: 400; font-width: normal; line-height: norma=
l; font-size-adjust: none; font-kerning: auto; font-variant-alternates: nor=
mal; font-variant-ligatures: normal; font-variant-numeric: normal; font-var=
iant-east-asian: normal; font-variant-position: normal; font-feature-settin=
gs: normal; font-optical-sizing: auto; font-variation-settings: normal; let=
ter-spacing: normal; text-align: start; text-indent: 0px; text-transform: n=
one; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px=
; text-decoration: none;"><span class=3D"colour" style=3D"color:rgb(14, 14,=
 14)"><b><span class=3D"size" style=3D"font-size:16px">Email Disclaimer:</s=
pan></b><span class=3D"size" style=3D"font-size:16px"><br></span></span></p=
><p class=3D"p2" style=3D"margin: 0px; font-style: normal; font-variant-cap=
s: normal; font-weight: 400; font-width: normal; line-height: normal; font-=
size-adjust: none; font-kerning: auto; font-variant-alternates: normal; fon=
t-variant-ligatures: normal; font-variant-numeric: normal; font-variant-eas=
t-asian: normal; font-variant-position: normal; font-feature-settings: norm=
al; font-optical-sizing: auto; font-variation-settings: normal; min-height:=
 19.7px; letter-spacing: normal; text-align: start; text-indent: 0px; text-=
transform: none; white-space: normal; word-spacing: 0px; -webkit-text-strok=
e-width: 0px; text-decoration: none;"><span class=3D"colour" style=3D"color=
:rgb(14, 14, 14)"><span class=3D"size" style=3D"font-size:16px"><br></span>=
</span></p><p class=3D"p1" style=3D"margin: 0px; font-style: normal; font-v=
ariant-caps: normal; font-weight: 400; font-width: normal; line-height: nor=
mal; font-size-adjust: none; font-kerning: auto; font-variant-alternates: n=
ormal; font-variant-ligatures: normal; font-variant-numeric: normal; font-v=
ariant-east-asian: normal; font-variant-position: normal; font-feature-sett=
ings: normal; font-optical-sizing: auto; font-variation-settings: normal; l=
etter-spacing: normal; text-align: start; text-indent: 0px; text-transform:=
 none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0=
px; text-decoration: none;"><span class=3D"colour" style=3D"color:rgb(14, 1=
4, 14)"><span class=3D"size" style=3D"font-size:16px">This email and any at=
tachments may contain privileged and confidential information intended sole=
ly for the use of the individual or entity to whom it is addressed. If you =
are not the intended recipient, you are hereby notified that any disseminat=
ion, distribution, copying, or use of this email or its contents is strictl=
y prohibited. If you have received this email in error, please notify the s=
ender immediately by replying to this message and delete it from your syste=
m.<br></span></span></p><p class=3D"p2" style=3D"margin: 0px; font-style: n=
ormal; font-variant-caps: normal; font-weight: 400; font-width: normal; lin=
e-height: normal; font-size-adjust: none; font-kerning: auto; font-variant-=
alternates: normal; font-variant-ligatures: normal; font-variant-numeric: n=
ormal; font-variant-east-asian: normal; font-variant-position: normal; font=
-feature-settings: normal; font-optical-sizing: auto; font-variation-settin=
gs: normal; min-height: 19.7px; letter-spacing: normal; text-align: start; =
text-indent: 0px; text-transform: none; white-space: normal; word-spacing: =
0px; -webkit-text-stroke-width: 0px; text-decoration: none;"><span class=3D=
"colour" style=3D"color:rgb(14, 14, 14)"><span class=3D"size" style=3D"font=
-size:16px"><br></span></span></p><p class=3D"p1" style=3D"margin: 0px; fon=
t-style: normal; font-variant-caps: normal; font-weight: 400; font-width: n=
ormal; line-height: normal; font-size-adjust: none; font-kerning: auto; fon=
t-variant-alternates: normal; font-variant-ligatures: normal; font-variant-=
numeric: normal; font-variant-east-asian: normal; font-variant-position: no=
rmal; font-feature-settings: normal; font-optical-sizing: auto; font-variat=
ion-settings: normal; letter-spacing: normal; text-align: start; text-inden=
t: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webk=
it-text-stroke-width: 0px; text-decoration: none;"><span class=3D"colour" s=
tyle=3D"color:rgb(14, 14, 14)"><span class=3D"size" style=3D"font-size:16px=
">Please note that any views or opinions expressed in this email are solely=
 those of the author and do not necessarily represent those of Burnside Pro=
ject LLC. Although we have taken precautions to ensure this email is free o=
f viruses or other malicious software, we cannot guarantee the security or =
integrity of email communications. Recipients should verify attachments for=
 possible threats.<br></span></span></p><p class=3D"p2" style=3D"margin: 0p=
x; font-style: normal; font-variant-caps: normal; font-weight: 400; font-wi=
dth: normal; line-height: normal; font-size-adjust: none; font-kerning: aut=
o; font-variant-alternates: normal; font-variant-ligatures: normal; font-va=
riant-numeric: normal; font-variant-east-asian: normal; font-variant-positi=
on: normal; font-feature-settings: normal; font-optical-sizing: auto; font-=
variation-settings: normal; min-height: 19.7px; letter-spacing: normal; tex=
t-align: start; text-indent: 0px; text-transform: none; white-space: normal=
; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;=
"><span class=3D"colour" style=3D"color:rgb(14, 14, 14)"><span class=3D"siz=
e" style=3D"font-size:16px"><br></span></span></p><p class=3D"p1" style=3D"=
margin: 0px; font-style: normal; font-variant-caps: normal; font-weight: 40=
0; font-width: normal; line-height: normal; font-size-adjust: none; font-ke=
rning: auto; font-variant-alternates: normal; font-variant-ligatures: norma=
l; font-variant-numeric: normal; font-variant-east-asian: normal; font-vari=
ant-position: normal; font-feature-settings: normal; font-optical-sizing: a=
uto; font-variation-settings: normal; letter-spacing: normal; text-align: s=
tart; text-indent: 0px; text-transform: none; white-space: normal; word-spa=
cing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;"><span cl=
ass=3D"colour" style=3D"color:rgb(14, 14, 14)"><span class=3D"size" style=
=3D"font-size:16px">Thank you.<br></span></span></p><div><span class=3D"siz=
e" style=3D"font-size:16px">Burnside Project</span><br></div><div><br></div=
></div><div><br></div><div class=3D"zmail_extra_hr" style=3D"border-top: 1p=
x solid rgb(204, 204, 204); height: 0px; margin-top: 10px; margin-bottom: 1=
0px; line-height: 0px; display: none;"><br></div><div class=3D"zmail_extra"=
 data-zbluepencil-ignore=3D"true" style=3D"clear: both;"><div><br></div><di=
v id=3D"Zm-_Id_-Sgn1">From: Jorge Daniel &lt;elgaita@hotmail.com&gt;<br>To:=
 "pgsql-admin@lists.postgresql.org"&lt;pgsql-admin@lists.postgresql.org&gt;=
<br>Date: Thu, 04 Jun 2026 14:13:45 -0700<br>Subject: Pg14 replication issu=
e , recovery stucks in a random file without advancing while streaming from=
 primary<br></div><div><br></div><blockquote id=3D"blockquote_zmail" style=
=3D"margin: 0px;"><div>Good day to  everyone <br> <br>We're  asking the PG-=
comunity for some help if it is possible. <br> <br>We have a primary with 2=
 secondaries: the primary went down and one of the secondaries was promoted=
. The orphaned Secondary reconnected to the new primary and is replicating =
ok. <br>We had to reconstruct a new secondary, we did it as we always do wi=
th the basic and dependable: <br> <br>pg_basebackup -h uspgvento14r.us.loca=
l -U replicator -p 5432 -D $PGDATA -Fp -Xs -P -R --checkpoint=3Dfast --crea=
te-slot --slot=3Dus_vento_replica_slot_aux <br> <br>It ran for 5hrs without=
 problem. When it finished: <br>$ pg_ctl start <br> <br>The recovery was ru=
nning until a consistent point was reached, the database opened and started=
 streaming the rest from the Primary. <br>After an hour or so, the recovery=
 got stuck in a certain wal file. No more log entries about it (debug 1) an=
d after some hours the stream connection got disconnected. <br>The secondar=
y is hung on that wal file and not going forward with the rest of the wal f=
ile list. <br> <br>We re-tried this several times, changing the storage (ju=
st in case), with a new box with the same original Ubuntu 22.04 instead of =
24.04 (just in case),  but the result was the same. <br>Even though we have=
 the 22.04 and 24.04 in parallel, we saw both replica engines freeze on the=
 same file (everytime we re-created the stuck-wal-file changed, clearly). <=
br>We're out of ideas of what's happening. <br>Could you please shed some l=
ight here? <br> <br> <br>Primary: uspgvento14r <br> <br>Version <br>server_=
version | 14.11 (Ubuntu 14.11-0ubuntu0.22.04.1) <br> <br>pg_stat_replicatio=
n: <br> <br>pid              | 4177197 <br>usesysid         | 16474 <br>use=
name          | replicator <br>application_name | 14_replica <br>client_add=
r      | 192.168.11.33 <br>client_hostname  | <br>client_port      | 53534 =
<br>backend_start    | 2026-06-04 01:25:18.100137-07 <br>backend_xmin     |=
 <br>state            | streaming <br>sent_lsn         | 8EBA/CDA68930 <br>=
write_lsn        | 8EBA/CDA68930 <br>flush_lsn        | 8EBA/CDA68930 <br>r=
eplay_lsn       | 8EB7/968FEBE8 <br>write_lag        | 00:00:00.001543 <br>=
flush_lag        | 00:00:00.00225 <br>replay_lag       | 01:00:22.530041 <b=
r>sync_priority    | 0 <br>sync_state       | async <br>reply_time       | =
2026-06-04 03:33:28.037703-07 <br> <br>Log Primary <br> <br>2026-06-04 01:2=
5:18 PDT [unknown] [unknown] 192.168.12.34 [unknown] [4177197] LOG:  connec=
tion received: host=3D192.168.12.34 port=3D53534 <br>2026-06-04 01:25:18 PD=
T     [2858] LOG:  background worker "logical replication worker" (PID 4177=
182) exited with exit code 1 <br>2026-06-04 01:25:18 PDT [unknown] replicat=
or 192.168.12.34 [unknown] [4177197] LOG:  connection authenticated: identi=
ty=3D"replicator" method=3Dmd5 (/etc/postgresql/14/main/pg_hba.conf:95) <br=
>2026-06-04 01:25:18 PDT [unknown] replicator 192.168.12.34 [unknown] [4177=
197] LOG:  replication connection authorized: user=3Dreplicator application=
_name=3D14_uspgvento14Rb SSL enabled (protocol=3DTLSv1.3, cipher=3DTLS_AES_=
256_GCM_SHA384, bits=3D256) <br>-- <br>2026-06-04 04:02:26 PDT [unknown] re=
plicator 192.168.12.34 14_uspgvento14Rb [4177197] LOG:  disconnection: sess=
ion time: 2:37:08.575 user=3Dreplicator database=3D host=3D192.168.12.34 po=
rt=3D53534 <br> <br> <br> <br> <br>Secondary  : 14_uspgvento14Rb 192.168.12=
.34 <br> <br>Version: <br>server_version | 14.11 (Ubuntu 14.11-0ubuntu0.24.=
04.1) <br> <br>Ubuntu 14.23-1.pgdg24.04+1 <br> <br> <br>..... <br>2026-06-0=
4 08:25:14 UTC     [197740] DEBUG:  got WAL segment from archive <br>2026-0=
6-04 08:25:14 UTC     [197740] LOG:  restored log file "0000000200008EA4000=
0002D" from archive <br>2026-06-04 08:25:14 UTC     [197740] DEBUG:  got WA=
L segment from archive <br>2026-06-04 08:25:14 UTC     [197740] LOG:  resto=
red log file "0000000200008EA40000002E" from archive <br>2026-06-04 08:25:1=
4 UTC     [197740] DEBUG:  got WAL segment from archive <br>2026-06-04 08:2=
5:14 UTC     [197740] LOG:  restored log file "0000000200008EA40000002F" fr=
om archive <br>2026-06-04 08:25:14 UTC     [197740] DEBUG:  got WAL segment=
 from archive <br>2026-06-04 08:25:14 UTC     [197740] LOG:  restored log f=
ile "0000000200008EA400000030" from archive <br>2026-06-04 08:25:14 UTC    =
 [197740] DEBUG:  got WAL segment from archive <br>026-06-04 08:25:15 UTC  =
   [197747] DEBUG:  checkpoint sync: number=3D4 file=3Dbase/6176124/2840_fs=
m time=3D0.011 ms <br>2026-06-04 08:25:15 UTC     [197747] DEBUG:  checkpoi=
nt sync: number=3D5 file=3Dbase/6176124/6599312.33 time=3D1.493 ms <br>....=
.. <br>2026-06-04 08:25:18 UTC     [197740] LOG:  restored log file "000000=
0200008EA400000044" from archive <br>2026-06-04 08:25:18 UTC     [197740] D=
EBUG:  got WAL segment from archive <br>2026-06-04 08:25:18 UTC     [197740=
] DEBUG:  end of backup reached <br>2026-06-04 08:25:18 UTC     [197740] CO=
NTEXT:  WAL redo at 8EA4/44680C08 for XLOG/BACKUP_END: 8E91/8B065578 <br>20=
26-06-04 08:25:18 UTC     [197740] LOG:  consistent recovery state reached =
at 8EA4/44680C30 <br>2026-06-04 08:25:18 UTC     [197738] LOG:  database sy=
stem is ready to accept read-only connections <br>cp: cannot stat '/pg_data=
/pg14_wal_archive/0000000200008EA400000045': No such file or directory <br>=
2026-06-04 08:25:18 UTC     [207957] LOG:  started streaming WAL from prima=
ry at 8EA4/45000000 on timeline 2 <br>2026-06-04 08:25:39 UTC [unknown] [un=
known] [local] [unknown] [208039] LOG:  connection received: host=3D[local]=
 <br>2026-06-04 08:25:39 UTC postgres postgres [local] [unknown] [208039] L=
OG:  connection authorized: user=3Dpostgres database=3Dpostgres application=
_name=3Dpsql <br>2026-06-04 08:26:15 UTC     [197747] LOG:  restartpoint st=
arting: time <br>2026-06-04 08:26:15 UTC     [197747] DEBUG:  performing re=
plication slot checkpoint <br>...... <br>22026-06-04 08:30:15 UTC     [1977=
47] DEBUG:  checkpoint sync: number=3D4 file=3Dbase/6176124/2840_fsm time=
=3D0.006 ms <br>2026-06-04 08:30:15 UTC     [197747] DEBUG:  checkpoint syn=
c: number=3D5 file=3Dbase/6176124/6599312.33 time=3D1.572 ms <br>2026-06-04=
 08:30:15 UTC     [197747] DEBUG:  checkpoint sync: number=3D6 file=3Dbase/=
6176124/6602460.56 time=3D1.791 ms <br>..... <br>2026-06-04 09:35:15 UTC   =
  [197747] DEBUG:  checkpoint sync: number=3D782 file=3Dbase/6176124/660106=
7.15 time=3D0.003 ms <br>2026-06-04 09:35:15 UTC     [197747] DEBUG:  check=
point sync: number=3D783 file=3Dbase/6176124/6374431.10 time=3D0.006 ms <br=
>2026-06-04 09:35:15 UTC     [197747] LOG:  restartpoint complete: wrote 99=
500 buffers (19.0%); 0 WAL file(s) added, 0 removed, 25 recycled; write=3D2=
39.934 s, sync=3D0.039 s, total=3D239.989 s; sync files=3D783, longest=3D0.=
001 s, average=3D0.001 s; distance=3D250897 kB, estimate=3D13117629 kB <br>=
2026-06-04 09:35:15 UTC     [197747] LOG:  recovery restart point at 8EB7/7=
550F228 <br>2026-06-04 09:35:15 UTC     [197747] DETAIL:  Last completed tr=
ansaction was at log time 2026-06-04 02:33:05.505911-07. <br>^@^@^@^@^@^@^@=
^@=E2=80=94=E2=80=94&gt; Forever&lt;=E2=80=94=E2=80=94=E2=80=94&nbsp;&nbsp;=
&nbsp;&nbsp; <br> <br> <br> <br> <br> <br>postgres=3D# select pg_last_wal_r=
eplay_lsn(); <br> pg_last_wal_replay_lsn <br>------------------------ <br> =
8EB7/968FEBE8 <br> <br> <br>postgres@uspgvento14Rb:/pg_data/data14/pg_wal$ =
ps -ef |grep wal <br>postgres  207957  197738  6 08:25 ?        00:04:52 po=
stgres: 14_uspgvento14Rb: walreceiver streaming 8EB7/A39875E8 <br>postgres =
 211309  163479  0 09:36 pts/12   00:00:00 grep wal <br>postgres@uspgvento1=
4Rb:/pg_data/data14/pg_wal$ ps -ef |grep reco <br>postgres  197740  197738 =
16 08:11 ?        00:14:26 postgres: 14_uspgvento14Rb: startup recovering 0=
000000200008EB700000096 <br>postgres  211311  163479  0 09:36 pts/12   00:0=
0:00 grep reco <br> <br>Content of the WAL file: <br> <br>postgres@us=
pgvento14b:/pg_data/data14/pg_wal$ pg_waldump 0000000200008EB700000096 |gre=
p 8FEBE8 <br>rmgr: MultiXact   len (rec/tot):     54/    54, tx: 2681967401=
, lsn: 8EB7/968FEBE8, prev 8EB7/968FEBA8, desc: CREATE_ID 2439167 offset 52=
17620 nmembers 2: 2681967400 (keysh) 2681967401 (keysh) <br>rmgr: Heap     =
   len (rec/tot):     54/    54, tx: 2681967401, lsn: 8EB7/968FEC20, prev 8=
EB7/968FEBE8, desc: LOCK off 35: xid 2439167: flags 0x00 IS_MULTI LOCK_ONLY=
 KEYSHR_LOCK , blkref #0: rel 1663/6176124/6474188 blk 29758258 <br> <br> <=
br>There are still plenty of files to process: <br>postgres@uspgvento14Rb:/=
pg_data/data14/pg_wal$ ls -ltr |grep -A5 -B5 0000000200008EB700000096 <br>-=
rw------- 1 postgres postgres 16777216 Jun  4 09:32 0000000200008EB70000009=
1 <br>-rw------- 1 postgres postgres 16777216 Jun  4 09:32 0000000200008EB7=
00000092 <br>-rw------- 1 postgres postgres 16777216 Jun  4 09:32 000000020=
0008EB700000093 <br>-rw------- 1 postgres postgres 16777216 Jun  4 09:32 00=
00000200008EB700000094 <br>-rw------- 1 postgres postgres 16777216 Jun  4 0=
9:32 0000000200008EB700000095 <br>-rw------- 1 postgres postgres 16777216 J=
un  4 09:33 0000000200008EB700000096 <br>-rw------- 1 postgres postgres 167=
77216 Jun  4 09:33 0000000200008EB700000097 <br>-rw------- 1 postgres postg=
res 16777216 Jun  4 09:33 0000000200008EB700000098 <br>-rw------- 1 postgre=
s postgres 16777216 Jun  4 09:33 0000000200008EB700000099 <br>-rw------- 1 =
postgres postgres 16777216 Jun  4 09:34 0000000200008EB70000009A <br>-rw---=
---- 1 postgres postgres 16777216 Jun  4 09:34 0000000200008EB70000009B <br=
> <br> <br>Postgresql.conf Secondary (primary is the same, but with more me=
mory) <br> <br>All non-default values: <br> <br>shared_buffers =3D 4GB     =
               # min 128kB <br>temp_buffers =3D 512MB                    # =
min 800kB <br>work_mem =3D 512MB                                # min 64kB =
<br>maintenance_work_mem =3D 4GB              # min 1MB <br>autovacuum_work=
_mem =3D  512MB            # min 1MB, or -1 to use maintenance_work_mem <br=
>max_stack_depth =3D 7MB <br>bgwriter_lru_maxpages =3D 1000            # ma=
x buffers written/round, 0 disables <br> <br>wal_level =3D logical         =
            # minimal, replica, or logical <br>wal_log_hints =3D on        =
              # also do full page writes of non-critical updates <br> <br>c=
heckpoint_completion_target =3D 0.8      # checkpoint target duration, 0.0 =
- 1.0 <br>checkpoint_warning =3D 600s               # 0 disables <br>max_wa=
l_size =3D 500GB <br>min_wal_size =3D 50GB <br> <br>max_wal_senders =3D 10 =
           # max number of walsender processes <br>max_replication_slots =
=3D 30      # max number of replication slots <br>wal_keep_size =3D 150GB  =
         # in megabytes; 0 disables  30720=3D30GB  1.5 day <br>max_slot_wal=
_keep_size =3D 500GB  # in megabytes; -1 disables 307200=3D300GB 4 days <br=
>wal_sender_timeout =3D 600s       # in milliseconds; 0 disables <br> <br>m=
ax_standby_archive_delay =3D 48h         # max delay before canceling queri=
es <br>max_standby_streaming_delay =3D 48h       # max delay before canceli=
ng queries <br>hot_standby_feedback =3D on               # send info from s=
tandby to prevent <br>wal_receiver_timeout =3D 600s             # time that=
 receiver waits for <br> <br></div></blockquote></div><div><br></div></div>=
<br></body></html>
------=_Part_13096318_1715792670.1780617085752--