MIME-Version: 1.0
From: Dennis White <dwhite@seawardmoon.com>
Date: Tue, 15 Oct 2024 11:36:26 -0400
Message-ID: 
 <CAE=rie9H6p51y8S=nCqzY-3rv0rvgQJcx8Qjoy1ryyhdKcty-w@mail.gmail.com>
Subject: Logical replication slot wal_status "lost" with
 max_slot_wal_keep_size
 = -1
To: Pgsql-admin <pgsql-admin@lists.postgresql.org>
Content-Type: multipart/alternative; boundary="0000000000000e3537062485b873"
Archived-At: 
 <https://www.postgresql.org/message-id/CAE%3Drie9H6p51y8S%3DnCqzY-3rv0rvgQJcx8Qjoy1ryyhdKcty-w%40mail.gmail.com>
Precedence: bulk

--0000000000000e3537062485b873
Content-Type: text/plain; charset="UTF-8"

My project's replication is failing with the following error:

2024-10-15 14:03:38.446 UTC [2840947] STATEMENT:  SELECT pg_catalog.set_config('search_path', '', false);
2024-10-15 14:03:38.446 UTC [2840947] ERROR:  cannot read from logical replication slot "track_subscription"
2024-10-15 14:03:38.446 UTC [2840947] DETAIL:  This slot has been invalidated because it exceeded the maximum reserved size.
2024-10-15 14:03:38.446 UTC [2840947] STATEMENT:  START_REPLICATION SLOT "track_subscription" LOGICAL 1380B/CBFAEFF0 (proto_version '2', publication_names '"track_ingestion"')


trackdb=# select * from pg_replication_slots;
     slot_name      |  plugin  | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase
--------------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------
 track_subscription | pgoutput | logical   |  16402 | trackdb  | f         | f      |            |      |    406428081 |             | 1380B/BAB7B328      | lost       |               | f
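
To put the two LSNs above side by side: a pg_lsn value is printed as two
hexadecimal numbers, the high and low 32 bits of a 64-bit WAL byte
position. The following is a minimal sketch (the helper names parse_lsn
and lsn_gap_bytes are made up for illustration) that converts the
START_REPLICATION position from the log and the slot's
confirmed_flush_lsn into a byte distance:

```python
def parse_lsn(text: str) -> int:
    """Convert a pg_lsn string such as '1380B/CBFAEFF0' to a byte position."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lsn_gap_bytes(newer: str, older: str) -> int:
    """Bytes of WAL between two LSNs (newer minus older)."""
    return parse_lsn(newer) - parse_lsn(older)


# Values copied from the log and the pg_replication_slots row above.
requested_start = "1380B/CBFAEFF0"
confirmed_flush = "1380B/BAB7B328"

gap = lsn_gap_bytes(requested_start, confirmed_flush)
print(f"{gap} bytes ({gap / 1024 ** 2:.1f} MiB)")
```

This is only the distance between those two positions. It does not say
how much WAL the slot was holding back when it was invalidated, because
restart_lsn is empty in the row above.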

Publisher and Subscriber DB versions:
PostgreSQL 14.12 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-22), 64-bit

Publisher System settings:
max_slot_wal_keep_size = -1
max_wal_size = 12GB
wal_keep_size = 0

I have controls in place to keep the replication lag from growing too
large, so I was surprised to see wal_status become "lost", given what I
have read about the default value of max_slot_wal_keep_size (-1).
My search on this problem suggests I should increase max_wal_size to 96GB
and perhaps set max_slot_wal_keep_size = 0.
Is this correct, or is there something else I should do to prevent this
from *ever* happening again?
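
As an illustration of the kind of lag control mentioned above (a sketch
only, not the actual control in place), a per-slot check might look like
this. The function name slot_alert and the max_lag_bytes threshold are
hypothetical; the wal_status values are the ones documented for
pg_replication_slots (reserved, extended, unreserved, lost).

```python
from typing import Optional

# wal_status values under which the slot still has the WAL it needs.
USABLE = {"reserved", "extended"}


def slot_alert(wal_status: Optional[str], lag_bytes: Optional[int],
               max_lag_bytes: int) -> str:
    """Return 'ok', 'warn', or 'critical' for one replication slot.

    max_lag_bytes is a hypothetical, site-specific threshold.
    """
    if wal_status == "lost":
        return "critical"  # required WAL was removed; slot is unusable
    if wal_status == "unreserved":
        return "critical"  # WAL may be removed at the next checkpoint
    if wal_status not in USABLE:
        return "warn"      # NULL or unexpected status: look at it by hand
    if lag_bytes is not None and lag_bytes > max_lag_bytes:
        return "warn"      # slot is healthy but lag exceeds the threshold
    return "ok"


# The slot shown earlier has wal_status = 'lost'.
print(slot_alert("lost", None, 8 * 1024 ** 3))
```

The point of treating "unreserved" as critical is that it is the last
state from which the slot can still recover before it becomes "lost".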

Thanks,
Dennis


--0000000000000e3537062485b873--