MIME-Version: 1.0
References: 
 <CADzfLwURy8_BYyqrvr6rhTXsW3=5QMRLHuNati3CgY0nKRSwyw@mail.gmail.com>
 <202603242222.5i7awkn7jpdr@alvherre.pgsql>
In-Reply-To: <202603242222.5i7awkn7jpdr@alvherre.pgsql>
From: Srinath Reddy Sadipiralla <srinath2133@gmail.com>
Date: Wed, 25 Mar 2026 08:27:04 +0530
Message-ID: 
 <CAFC+b6pK9ogeSpMA8hg18XhC1eNPcsKWBwoC5OySXi4iTxwtRw@mail.gmail.com>
Subject: Re: Adding REPACK [concurrently]
To: Alvaro Herrera <alvherre@alvh.no-ip.org>
Cc: Mihail Nikalayeu <mihailnikalayeu@gmail.com>,
 Antonin Houska <ah@cybertec.at>,
	Matthias van de Meent <boekewurm+postgres@gmail.com>,
	Pg Hackers <pgsql-hackers@lists.postgresql.org>,
 Robert Treat <rob@xzilla.net>
Content-Type: multipart/alternative; boundary="000000000000f6971a064dd06c59"
Archived-At: 
 <https://www.postgresql.org/message-id/CAFC%2Bb6pK9ogeSpMA8hg18XhC1eNPcsKWBwoC5OySXi4iTxwtRw%40mail.gmail.com>
Precedence: bulk

--000000000000f6971a064dd06c59
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hello Alvaro,

On Wed, Mar 25, 2026 at 4:02 AM Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:

>
> Many thanks for the review.  I have applied fixes for these, so here's
> v44.
>

Thanks for the patches.

> - 0004 is Antonin's bugfix from the crash reported by Srinath.
>

I think it's "0004 is Srinath's bugfix from the crash reported by Srinath."
;-)
After I provided the analysis and fix for the crash [1], Antonin tried to
reproduce it using the isolation tester; for this he even proposed changes
to the isolation tester (so cool ... btw I reviewed it) [2].

I have done another round of stress testing on v43, this time with more
tests. As I did previously [3], I ran the concurrency test - went well.

crash test:
After a crash I observed that the repack worker files are cleaned up during
server restart by RemovePgTempFiles, but the transient table's relation
files are not cleaned up. Maybe we can clean these up as well during server
restart; I will think about this more.

physical replication test:
I ran REPACK (concurrently) on the primary and checked on the replica that
the data is intact, using the 4 verifications I did in [3] - went well.

Then, as suggested by Alvaro off-list, I checked the lock upgrade behavior
during the table swap phase. I observed that if another transaction holds a
conflicting lock on the table when the swap is attempted, it can lead to
“transient table” data loss during a manual or timeout abort.
When REPACK (concurrently) waits for a conflicting lock to be released and
eventually hits lock_timeout (or is cancelled via ctrl+c), the transaction
aborts. During this abort, the cleanup process triggers
smgrDoPendingDeletes. This results in the removal of all transient table
relfiles and decoder worker files created during the process.
This effectively wipes out the work done to build the transient table
before the swap could complete. It happens because during transient table
creation we add the table to the PendingRelDelete list:


rebuild_relation->make_new_heap->heap_create_with_catalog->heap_create->table_relation_set_new_filelocator->RelationCreateStorage

	/*
	 * Add the relation to the list of stuff to delete at abort, if we are
	 * asked to do so.
	 */
	if (register_delete)
	{
		PendingRelDelete *pending;

		pending = (PendingRelDelete *)
			MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
		pending->rlocator = rlocator;
		pending->procNumber = procNumber;
		pending->atCommit = false;	/* delete if abort */
		pending->nestLevel = GetCurrentTransactionNestLevel();
		pending->next = pendingDeletes;
		pendingDeletes = pending;
	}

The base/5/pgsql_tmp/ files also get unlinked during the decoding worker
cleanup. I think this cleanup of the transient table relfiles and decoder
files makes sense, because we don’t have any resume operation that could
re-use the transient table’s files.
Please correct me if I am not getting your point here.
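
As an illustration only, the abort-time behavior described above can be
modeled with a small standalone C program. This is a toy sketch, not
PostgreSQL source: the names ToyPendingDelete, toy_register_delete and
toy_do_pending_deletes are hypothetical, the struct is a simplified
stand-in for PendingRelDelete (no nestLevel, no procNumber), and nothing is
actually unlinked from disk. It only shows why an entry registered with
atCommit = false is acted on when the transaction aborts and ignored when
it commits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Simplified, hypothetical stand-in for PendingRelDelete. */
typedef struct ToyPendingDelete
{
	unsigned int relfilenumber;	/* which file would be dropped */
	bool		atCommit;		/* true: drop at commit; false: drop at abort */
	struct ToyPendingDelete *next;
} ToyPendingDelete;

/* Head of the pending list, newest entry first. */
static ToyPendingDelete *toy_pending = NULL;

/* Register a newly created file so that it is dropped if we abort. */
static void
toy_register_delete(unsigned int relfilenumber)
{
	ToyPendingDelete *p = malloc(sizeof(ToyPendingDelete));

	assert(p != NULL);			/* toy code: just bail out on OOM */
	p->relfilenumber = relfilenumber;
	p->atCommit = false;		/* delete if abort */
	p->next = toy_pending;
	toy_pending = p;
}

/*
 * Walk the list at transaction end.  Entries whose atCommit flag matches
 * the outcome are "dropped" (here: only reported); all entries are then
 * forgotten.  Returns the number of files that would have been dropped.
 */
static int
toy_do_pending_deletes(bool isCommit)
{
	int			ndropped = 0;
	ToyPendingDelete *p = toy_pending;

	toy_pending = NULL;
	while (p != NULL)
	{
		ToyPendingDelete *next = p->next;

		if (p->atCommit == isCommit)
		{
			printf("would unlink relfile %u\n", p->relfilenumber);
			ndropped++;
		}
		free(p);
		p = next;
	}
	return ndropped;
}
```

Registering two files and then calling toy_do_pending_deletes(false)
reports both as dropped, while toy_do_pending_deletes(true) drops nothing.
That mirrors what the lock_timeout abort does to the transient table's
files in the scenario below.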

test scenario:

session 1 (with lock_timeout = 5s):
postgres=# repack (concurrently) stress_victim;
-- A breakpoint was set in rebuild_relation_finish_concurrent at
-- LockRelationOid(old_table_oid, AccessExclusiveLock), i.e. just before
-- the exclusive lock is acquired.

session 2:
postgres=# BEGIN;
SELECT * FROM stress_victim LIMIT 1;
-- left it open
BEGIN
 id  | balance |                                                                                             payload
-----+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 170 |      65 | d12f400c4d0d3c49818f88597e16cf29d12f400c4d0d3c49818f88597e16cf29d12f400c4d0d3c49818f88597e16cf29d12f400c4d0d3c49818f88597e16cf29d12f400c4d0d3c49818f88597e16cf29d12f400c4d0d3c49818f88597e16cf29
(1 row)
-- This takes a conflicting lock (AccessShareLock) on the same table that
-- REPACK (concurrently) is running on.

session 1:
Release the breakpoint; the backend now waits for the conflicting lock
to be released.
If lock_timeout expires in the meantime, the transaction aborts:
postgres=# repack (concurrently) stress_victim;
ERROR:  canceling statement due to lock timeout
CONTEXT:  waiting for AccessExclusiveLock on relation 16637 of database 5

Now we can see the transient table relfiles and decoder worker files
getting cleaned up.

[1] - https://www.postgresql.org/message-id/CAFC%2Bb6qk3-DQTi43QMqvVLP%2BsudPV4vsLQm5iHfcCeObrNaVyA%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/flat/4703.1774250534%40localhost
[3] - https://www.postgresql.org/message-id/CAFC%2Bb6o2yzA80YmfEhmMO9puN8qvGRvr-15BBLn3UmJxPfpr2w%40mail.gmail.com

-- 
Thanks,
Srinath Reddy Sadipiralla
EDB: https://www.enterprisedb.com/


--000000000000f6971a064dd06c59--