Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w5EQs-0032Mg-2b for pgsql-hackers@arkaria.postgresql.org; Wed, 25 Mar 2026 02:57:22 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1w5EQq-00AmK5-36 for pgsql-hackers@arkaria.postgresql.org; Wed, 25 Mar 2026 02:57:21 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w5EQq-00AmJx-1W for pgsql-hackers@lists.postgresql.org; Wed, 25 Mar 2026 02:57:21 +0000 Received: from mail-ua1-x92f.google.com ([2607:f8b0:4864:20::92f]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1w5EQo-00000000tTy-3CTU for pgsql-hackers@lists.postgresql.org; Wed, 25 Mar 2026 02:57:20 +0000 Received: by mail-ua1-x92f.google.com with SMTP id a1e0cc1a2514c-94dd7178d63so3351849241.3 for ; Tue, 24 Mar 2026 19:57:18 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1774407437; cv=none; d=google.com; s=arc-20240605; b=RitFubpRtmbYGzNGH7Pv8DNiVlRDRGWoXi62fppKVLe60uBAWhqJsU3Vls8pp4ZNDV KVz9gxkg0JxtJhy0MhnrL3QCc2uRqR/ZfcMwckufxqVW9ZfuTSGUcwLWEWV8QqJ/oGl9 i9VXJjOJG7M7/WUBOLoslo3z5K6YxGVJZDZleCTaFEYDGvHQniwJHitaTjji/JknwUAU rfv77nPzrg8YBI1ofFNQo+cQS3Z2MG+s5juTCy2VVw54PvQ/JD2C/XXhYXm3TSBwmyZq LUDjqiFOt39UScdfbcSWcA+EHPUT0Ozn6oVCHBdj+hOEK09ROCW8iEt8jlg+lXUeuak2 dIqQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:dkim-signature; bh=GRtp+kv1XZ1rGzqFaKZqrwQlhR9UScCTWShMt2dn/Tc=; fh=bRYQZExFA7GL37u250kfA6woEDmq7jrNDm5U4jlRkKM=; b=dfMo5B3HFKQff8+hTbxwl6DyFl2UMx6F75HVZLEJigLMEcgBm0JzQGqBSeR3cGyjWU j5JreJ9SUEpb3+paM3B2y7qT8wCUopEeyPPIk3Cj8Y8Z6E7ZREfyZstJCeqRniaCtrWQ z2hKBGFBl0TGkQZheDOYzrRdcT16kyuZ7galJlqgMyy9ckBUccHKx5QgHggDVOKXH2av yMF5yLxYQoB0z7u9d7Q00jC0F2fBAjHL2aJflSNveJPpRBDPoA/xHIWyiyhg/d1k9Lb8 icCsWZFIhmXJ/5jQ9N0nuq6UM3GJ0pc37mGXZac5h1yXaMeithgUbnB4+a3OpnlL1QhS rPRw==; darn=lists.postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1774407437; x=1775012237; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=GRtp+kv1XZ1rGzqFaKZqrwQlhR9UScCTWShMt2dn/Tc=; b=AQHzLrdkvi1QkD4TDG7XI9JY3KMN2Pm+4/siVNDiWlyIXnTucyBW5NSEyXkenmApBb J3KVmLu2gCg/E+M9orVJ6hqdGVRUtC2vyNPFO+v5WzdkzvpUUaHb+Jh2VeDntCGiSXEi xgmpQv1pXvDqbZ2RYmWcP8vyb96PLpuM/L3gB8KS1uIp6d1+F/TgfBB6s4akiwr71IrK NAdOfrDkgdrxN6beadwp/R/Gx6Jajd97E9rwcpeWbUJTgnv7NtS2YZk66bTUYPaCoHcM iok4fQCXr1cVyvhslpbE+LtglFcuPbzTWYJzfvSvUx2UYmH6oZ2ui9gL6I3P+sCNieed UGOw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1774407437; x=1775012237; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-gg:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=GRtp+kv1XZ1rGzqFaKZqrwQlhR9UScCTWShMt2dn/Tc=; b=U8XTSTSaYzKOuQQEfIwp72qz9MXThI7KcpPF1gK1+cWhZZXZQvv6AULeoA/0W88+pY dgcGFpYpkGOQLq6owEvBAsVFmFYQzRGCOTwD/7qgqy3Nh4uFjn1RGF5Zqpx16hO2+MaL BzFFnJmiVUZo0IeFA+5IrmtjIkE22zc6kD+xo54jTcywKHNPHT3fVmxUOtHidLfL9dkt ziEyBgSd1ZLFYe88PwT4nQV0vH8Ljk2fhAkjUXVfFSWserIFqdkFKzwAOCK143Eozmft VDy5q9jfvr4h3Wn8DggaCwm5TAMC3rxXBkhXDGhmYrUf8GzLv8o4gSjDJKD0miINGcb3 xaJw== X-Forwarded-Encrypted: i=1; AJvYcCX0Vn6CQdzvwHzVo0gt6LlgYp3/lHBB4014MPVlpyZFI+1kt2ey94GAXCGUmUPIMck0dcyw/DLV+x5kWr+x@lists.postgresql.org X-Gm-Message-State: AOJu0Yz5v6t0zIQDVR9r3/Cal9AVZp1sIDDFVmLUb6S+MtL0Og3H7Pzd PMeeWgRxbxPTA97p9JMm3VtLYPTLsggG5jmjq+nSPWhKrxg4JfMb3iYrPA5Q8abaY4OI8/Nqfpm 4dls8zZ7M4a4ae9JqM1bcKMRGGNI+iG0= X-Gm-Gg: ATEYQzzWthkX9d+wtZEMIbVlyhTdBsMl9g0AZnnkeGds8hS0HwQIpHOa/4zv5/mh75Y OdbLSdwrtjH6WdzF9GPHIzfQ9M7PG+eZDeiJR6FuthkDPIpDQD+BmnNqtA0c8iOxiZpn8TgS6Rh QErJufIyK1mN6mO5ay9SdlfU+UTKzpBgTcWiN2bdCGhmRq60f8A9xs9SGmQ3UIWkAIPoPR4jByc FvkctFHR3VS+aaYjtzQmX2+YJz+/Q62q3YwpX1ve4UovAWTFDs3k2Jswn19aleuj2f9HXV6HzyZ lSw72JzJTY6Vf4ShE+3/aaaGbw7/HSF27HYloT2cpc4gkZ9zXukcWmWFo4wjO6ICCdXV4IQbZ96 LsF4JYpxmwlyUtC1HPYV+l2K+7A== X-Received: by 2002:a05:6102:2041:b0:601:f79e:b1a1 with SMTP id ada2fe7eead31-603872ef2femr735072137.32.1774407436898; Tue, 24 Mar 2026 19:57:16 -0700 (PDT) MIME-Version: 1.0 References: <202603242222.5i7awkn7jpdr@alvherre.pgsql> In-Reply-To: <202603242222.5i7awkn7jpdr@alvherre.pgsql> From: Srinath Reddy Sadipiralla Date: Wed, 25 Mar 2026 08:27:04 +0530 X-Gm-Features: AQROBzCCwK87GO7uIk5ImBBvhYXmyC47CAjSP_Bwiv7r-51_zxs5btBuYJsA3Qw Message-ID: Subject: Re: Adding REPACK [concurrently] To: Alvaro Herrera Cc: Mihail Nikalayeu , Antonin Houska , Matthias van de Meent , Pg Hackers , Robert Treat Content-Type: multipart/alternative; boundary="000000000000f6971a064dd06c59" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000f6971a064dd06c59 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Hallo Alvaro, On Wed, Mar 25, 2026 at 4:02=E2=80=AFAM Alvaro Herrera wrote: > > Many thanks for the review. I have applied fixes for these, so here's > v44. > Thanks for the patches. - 0004 is Antonin's bugfix from the crash reported by Srinath. > I think it's "0004 is Srinath's bugfix from the crash reported by Srinath." ;-) after i provided the analysis and fix for the crash[1], Antonin tried to reproduce this crash using isolation tester , for this he even proposed changes to isolation tester (so cool ... btw i reviewed it) [2] . i have done another round of stress testing on V43 , this time with more tests, as i did previously [3] did concurrency test - went well, crash test: after crash i observed that repack worker files are cleaned during server restart using RemovePgTempFiles but the transient table relation files remains not cleaned up, maybe we can do cleanup for this as well during server restart, I will think about this more. physical replication test where I did REPACK (concurrently) on primary and checked if data is intact using the 4 verifications I did here [3] on replica - went well. Then as suggested by Alvaro off-list I checked the lock upgrade behavior during the table swap phase. I observed that if another transaction holds a conflicting lock on the table when the swap is attempted, it can lead to =E2=80=9Ctransient table=E2=80=9D data loss during a manual or timeout abor= t. when a REPACK (concurrent) waits for a conflicting lock to be released and eventually hits a lock_timeout (or is cancelled via ctrl+c), the transaction aborts. During this abort, the cleanup process triggers smgrDoPendingDeletes. This results in the removal of all transient table relfiles and decoder worker files created during the process. This effectively wipes out the work done by the transient table creation before the swap could successfully complete, this happens because during transient table creation we add the table to the PendingRelDelete list. rebuild_relation=E2=86=92make_new_heap->heap_create_with_catalog=E2=86=92he= ap_create=E2=86=92table_relation_set_new_filelocator=E2=86=92RelationCreate= Storage /* * Add the relation to the list of stuff to delete at abort, if we are * asked to do so. */ if (register_delete) { PendingRelDelete *pending; pending =3D (PendingRelDelete *) MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); pending->rlocator =3D rlocator; pending->procNumber =3D procNumber; pending->atCommit =3D false; /* delete if abort */ pending->nestLevel =3D GetCurrentTransactionNestLevel(); pending->next =3D pendingDeletes; pendingDeletes =3D pending; } and the base/5/pgsql_tmp/ files also gets unlinked during the decoding worker cleanup, I think this cleanup of transient table relfiles and decoder files makes sense because we don=E2=80=99t have any resume operation in which we can re-use the trans= ient table=E2=80=99s files, please correct me if I am not getting your point here. test scenario: session 1: postgres=3D# repack (concurrently) stress_victim; had a breakpoint rebuild_relation_finish_concurrent-> LockRelationOid(old_table_oid, AccessExclusiveLock); just before getting the exclusive lock. with lock_timeout =3D 5s session 2: postgres=3D# BEGIN; SELECT * FROM stress_victim LIMIT 1; -- left it open BEGIN id | balance | payload -----+---------+--------------------------------- ------------------------------------------------- ------------------------------------------------- ------------------------------------------------- -------------- 170 | 65 | d12f400c4d0d3c49818f88597e16cf29 d12f400c4d0d3c49818f88597e16cf29d12f400c4d0d3c498 18f88597e16cf29d12f400c4d0d3c49818f88597e16cf29d1 2f400c4d0d3c49818f88597e16cf29d12f400c4d0d3c49818 f88597e16cf29 (1 row) -- this gets us a conflicting lock (AccessShareLock) on the same table, REPACK (concurrently) is running on. session 1: release the breakpoint and now the backend waits for the conflicting lock to be released. in between if lock_timeout occurs then transaction aborts. postgres=3D# repack (concurrently) stress_victim; ERROR: canceling statement due to lock timeout CONTEXT: waiting for AccessExclusiveLock on relation 16637 of database 5 Now we can see the transient table relfiles and decoder worker files getting cleaned up. [1] - https://www.postgresql.org/message-id/CAFC%2Bb6qk3-DQTi43QMqvVLP%2BsudPV4vs= LQm5iHfcCeObrNaVyA%40mail.gmail.com [2] - https://www.postgresql.org/message-id/flat/4703.1774250534%40localhos= t [3] - https://www.postgresql.org/message-id/CAFC%2Bb6o2yzA80YmfEhmMO9puN8qvGRvr-1= 5BBLn3UmJxPfpr2w%40mail.gmail.com --=20 Thanks, Srinath Reddy Sadipiralla EDB: https://www.enterprisedb.com/ --000000000000f6971a064dd06c59 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hallo Alvaro,

On Wed= , Mar 25, 2026 at 4:02=E2=80=AFAM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Many thanks for the review.=C2=A0 I have applied fixes for these, so here&#= 39;s
v44.=C2=A0

Thanks for the patches.

- 0004 is Antonin's bugfix from the crash reported by Srinath.

I think it's=C2=A0"0004 is Srinath's bu= gfix from the crash reported by Srinath." ;-)
after i provided the = analysis and fix for the crash[1], Antonin tried to reproduce
this crash= using isolation tester , for this he even proposed changes to
isolation= tester (so cool ... btw i reviewed it) [2] .

i have done another ro= und of stress testing on V43 , this time with more tests,
as i did previ= ously [3] did concurrency test - went well,

crash test:=C2=A0
af= ter crash i observed=C2=A0that repack worker files are cleaned during serve= r restart
using=C2=A0RemovePgTempFiles but the transient table relation = files remains
not cleaned up, maybe we can do cleanup for this as well d= uring server restart,
I will think about this more.

physic= al replication test where I did REPACK (concurrently) on primary and
che= cked if data is intact using the 4 verifications I did here [3] on replica = - went well.

Then as suggested by Alvaro off-list I checked the lock= upgrade behavior
during the table swap phase. I observed that if anothe= r transaction holds a
conflicting lock on the table when the swap is att= empted, it can lead to
=E2=80=9Ctransient table=E2=80=9D data loss durin= g a manual or timeout abort.
when a REPACK (concurrent) waits for a conf= licting lock to be released and eventually hits a
lock_timeout (or is ca= ncelled via ctrl+c), the transaction aborts. During this abort,
the clea= nup process triggers smgrDoPendingDeletes. This results in the removal
o= f all transient table relfiles and decoder worker files created during the = process.
This effectively wipes out the work done by the transient table= creation before
the swap could successfully complete, this happens beca= use during transient
table creation we add the table to the PendingRelDe= lete list.


rebuild_relation=E2=86=92make_new_heap->heap_creat= e_with_catalog=E2=86=92heap_create=E2=86=92table_relation_set_new_filelocat= or=E2=86=92RelationCreateStorage
/*
* Add the relation to the list= of stuff to delete at abort, if we are
* asked to do so.
*/
= if (register_delete)
{
PendingRelDelete *pending;
pending =3D= (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(Pen= dingRelDelete));
pending->rlocator =3D rlocator;
pending->p= rocNumber =3D procNumber;
pending->atCommit =3D false; /* delete if= abort */
pending->nestLevel =3D GetCurrentTransactionNestLevel();<= br> pending->next =3D pendingDeletes;
pendingDeletes =3D pending;<= br> }
and the base/5/pgsql_tmp/ files also gets unlinked during the deco= ding worker cleanup,
I think this cleanup of transient table relfiles an= d decoder files makes sense because
we don=E2=80=99t have any resume ope= ration in which we can re-use the transient table=E2=80=99s files,
pleas= e correct me if I am not getting your point here.

test scenario:
=
session 1:
postgres=3D# repack (concurrently) stress_victim;
had = a breakpoint rebuild_relation_finish_concurrent-> LockRelationOid(old_ta= ble_oid, AccessExclusiveLock); just before getting the exclusive lock.
w= ith lock_timeout =3D 5s

session 2:
postgres=3D# BEGIN;
SELECT = * FROM stress_victim LIMIT 1;
-- left it open
BEGIN
=C2=A0id =C2= =A0| balance | =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0
=C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0payload =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0
-----+--= -------+---------------------------------
------------------------------= -------------------
--------------------------------------------------------------------------------------------------
--------------
=C2= =A0170 | =C2=A0 =C2=A0 =C2=A065 | d12f400c4d0d3c49818f88597e16cf29
d12f4= 00c4d0d3c49818f88597e16cf29d12f400c4d0d3c498
18f88597e16cf29d12f400c4d0d= 3c49818f88597e16cf29d1
2f400c4d0d3c49818f88597e16cf29d12f400c4d0d3c49818=
f88597e16cf29
(1 row)
-- this gets us a conflicting lock (AccessS= hareLock) on the same table, REPACK (concurrently) is running on.

se= ssion 1:
release the breakpoint and now the backend waits for the confli= cting lock to be released.
in between if lock_timeout occurs then transa= ction aborts.
postgres=3D# repack (concurrently) stress_victim;
ERROR= : =C2=A0canceling statement due to lock timeout
CONTEXT: =C2=A0waiting f= or AccessExclusiveLock on relation 16637 of database 5

Now we can se= e the transient table relfiles and decoder worker files getting cleaned up.=

[1] -=C2=A0https://= www.postgresql.org/message-id/CAFC%2Bb6qk3-DQTi43QMqvVLP%2BsudPV4vsLQm5iHfc= CeObrNaVyA%40mail.gmail.com
[2] -=C2=A0https://www.postgresq= l.org/message-id/flat/4703.1774250534%40localhost
[3] -=C2=A0https://www.postgresql.org/message-i= d/CAFC%2Bb6o2yzA80YmfEhmMO9puN8qvGRvr-15BBLn3UmJxPfpr2w%40mail.gmail.com

--
=
--000000000000f6971a064dd06c59--