Adding REPACK [concurrently]

public inbox for [email protected]  

106+ messages / 13 participants

* Adding REPACK [concurrently]
@ 2025-07-26 21:56  Alvaro Herrera <[email protected]>
  0 siblings, 7 replies; 106+ messages in thread

From: Alvaro Herrera @ 2025-07-26 21:56 UTC (permalink / raw)
  To: Pg Hackers <[email protected]>; +Cc: Antonin Houska <[email protected]>

Hello,

Here's a patch to add REPACK and eventually the CONCURRENTLY flag to it.
This is coming from [1].  The ultimate goal is to have an in-core tool
to allow concurrent table rewrite to get rid of bloat; right now, VACUUM
FULL does that, but it's not concurrent.  Users have resorted to using
the pg_repack third-party tool, which is ancient and uses a weird
internal implementation, as well as pg_squeeze, which uses logical
decoding to capture changes that occur during the table rewrite.  The
patch submitted here, largely by Antonin Houska with some changes by me,
is based on the pg_squeeze code which he authored, and first
introduces a new command called REPACK to absorb both VACUUM FULL and
CLUSTER, followed by addition of a CONCURRENTLY flag to allow some forms
of REPACK to operate online using logical decoding.

Essentially, this first patch just reshuffles the CLUSTER code to create
the REPACK command.

I made a few changes from Antonin's original at [2].  First, I modified
the grammar to support "REPACK [tab] USING INDEX" without specifying the
index name.  With this change, all possibilities of the old commands are
covered, which gives us the chance to flag them as obsolete.  (This is
good, because having VACUUM FULL do something completely different from
regular VACUUM confuses users all the time; and on the other hand,
having a command called CLUSTER which is at odds with what most people
think of as a "database cluster" is also confusing.)

Here's a list of existing commands, and how to write them in the current
patch's proposal for REPACK:

-- re-clusters all tables that have a clustered index set
CLUSTER                     -> REPACK USING INDEX

-- clusters the given table using the given index
CLUSTER tab USING idx       -> REPACK tab USING INDEX idx

-- clusters this table using a clustered index; error if no index clustered
CLUSTER tab                 -> REPACK tab USING INDEX

-- vacuum-full all tables
VACUUM FULL                 -> REPACK

-- vacuum-full the specified table
VACUUM FULL tab             -> REPACK tab

My other change to Antonin's patch is that I made REPACK USING INDEX set
the 'indisclustered' flag to the index being used, so REPACK behaves
identically to CLUSTER.  We can discuss whether we really want this.
For instance we could add an option so that by default REPACK omits
persisting the clustered index, and instead it only does that when you
give it some special option, say something like
  "REPACK (persist_clustered_index=true) tab USING INDEX idx"
Overall I'm not sure this is terribly interesting, since clustered
indexes are not very useful for most users anyway.

I made a few other minor changes not worthy of individual mention, and
there are a few others pending, such as updates to the
pg_stat_progress_repack view infrastructure, as well as phasing out
pg_stat_progress_cluster (maybe the latter would offer a subset of the
former; not yet sure about this.)  Also, I'd like to work on adding a
`repackdb` command for completeness.

On repackdb: I think it is going to be very similar to vacuumdb, mostly in
that it is going to need to be able to run tasks in parallel; but there
are things it doesn't have to deal with, such as analyze-in-stages,
which I think is a large burden.  I estimate about 1k LOC there,
extremely similar to vacuumdb.  Maybe it makes sense to share the source
code and make the new executable a symlink instead, with some additional
code to support the two different modes.  Again, I'm not sure about
this -- I like the idea, but I'd have to see the implementation.

I'll be rebasing the rest of Antonin's patch series afterwards,
including the logical decoding changes necessary for CONCURRENTLY.  In
the meantime, if people want to review those, which would be very
valuable, they can go back to branch master from around the time he
submitted it and apply the old patches there.

[1] https://postgr.es/m/76278.1724760050@antos
[2] https://postgr.es/m/152010.1751307725@localhost

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/

* Re: Adding REPACK [concurrently]
@ 2025-07-27 01:59  Robert Treat <[email protected]>
  parent: Alvaro Herrera <[email protected]>
  6 siblings, 1 reply; 106+ messages in thread

From: Robert Treat @ 2025-07-27 01:59 UTC (permalink / raw)
  To: Alvaro Herrera <[email protected]>; +Cc: Pg Hackers <[email protected]>; Antonin Houska <[email protected]>

On Sat, Jul 26, 2025 at 5:56 PM Alvaro Herrera <[email protected]> wrote:
>
> Hello,
>
> Here's a patch to add REPACK and eventually the CONCURRENTLY flag to it.
> This is coming from [1].  The ultimate goal is to have an in-core tool
> to allow concurrent table rewrite to get rid of bloat; right now, VACUUM
> FULL does that, but it's not concurrent.  Users have resorted to using
> the pg_repack third-party tool, which is ancient and uses a weird
> internal implementation, as well as pg_squeeze, which uses logical
> decoding to capture changes that occur during the table rewrite.  The
> patch submitted here, largely by Antonin Houska with some changes by me,
> is based on the pg_squeeze code which he authored, and first
> introduces a new command called REPACK to absorb both VACUUM FULL and
> CLUSTER, followed by addition of a CONCURRENTLY flag to allow some forms
> of REPACK to operate online using logical decoding.
>
> Essentially, this first patch just reshuffles the CLUSTER code to create
> the REPACK command.
>

Thanks for keeping this ball rolling.

>
> My other change to Antonin's patch is that I made REPACK USING INDEX set
> the 'indisclustered' flag to the index being used, so REPACK behaves
> identically to CLUSTER.  We can discuss whether we really want this.
> For instance we could add an option so that by default REPACK omits
> persisting the clustered index, and instead it only does that when you
> give it some special option, say something like
>   "REPACK (persist_clustered_index=true) tab USING INDEX idx"
> Overall I'm not sure this is terribly interesting, since clustered
> indexes are not very useful for most users anyway.
>

I think I would lean towards having it work like CLUSTER (preserve the
index), since that helps people making the transition, and it doesn't
feel terribly useful to invent new syntax for a feature that I would
agree isn't very useful for most people.

> I made a few other minor changes not worthy of individual mention, and
> there are a few others pending, such as updates to the
> pg_stat_progress_repack view infrastructure, as well as phasing out
> pg_stat_progress_cluster (maybe the latter would offer a subset of the
> former; not yet sure about this.)  Also, I'd like to work on adding a
> `repackdb` command for completeness.
>
> On repackdb: I think it is going to be very similar to vacuumdb, mostly in
> that it is going to need to be able to run tasks in parallel; but there
> are things it doesn't have to deal with, such as analyze-in-stages,
> which I think is a large burden.  I estimate about 1k LOC there,
> extremely similar to vacuumdb.  Maybe it makes sense to share the source
> code and make the new executable a symlink instead, with some additional
> code to support the two different modes.  Again, I'm not sure about
> this -- I like the idea, but I'd have to see the implementation.
>
> I'll be rebasing the rest of Antonin's patch series afterwards,
> including the logical decoding changes necessary for CONCURRENTLY.  In
> the meantime, if people want to review those, which would be very
> valuable, they can go back to branch master from around the time he
> submitted it and apply the old patches there.
>

For clarity, are you intending to commit this patch before having the
other parts ready? (If that sounds like an objection, it isn't.) After
a first pass, I think there are some confusing bits in the new docs that
could use straightening out, but they're likely going to overlap with changes
once concurrently is brought in, so it might make sense to hold off on
those. Either way I definitely want to dive into this a bit deeper
with some fresh eyes, there's a lot to digest... speaking of, for this
bit in src/backend/commands/cluster.c

+    switch (cmd)
+    {
+        case REPACK_COMMAND_REPACK:
+            return "REPACK";
+        case REPACK_COMMAND_VACUUMFULL:
+            return "VACUUM";
+        case REPACK_COMMAND_CLUSTER:
+            return "VACUUM";
+    }
+    return "???";

The last one should return "CLUSTER" no?


Robert Treat
https://xzilla.net

* Re: Adding REPACK [concurrently]
@ 2025-07-27 06:00  Fujii Masao <[email protected]>
  parent: Alvaro Herrera <[email protected]>
  6 siblings, 0 replies; 106+ messages in thread

From: Fujii Masao @ 2025-07-27 06:00 UTC (permalink / raw)
  To: Alvaro Herrera <[email protected]>; +Cc: Pg Hackers <[email protected]>; Antonin Houska <[email protected]>

On Sun, Jul 27, 2025 at 6:56 AM Alvaro Herrera <[email protected]> wrote:
>
> Hello,
>
> Here's a patch to add REPACK and eventually the CONCURRENTLY flag to it.
> This is coming from [1].  The ultimate goal is to have an in-core tool
> to allow concurrent table rewrite to get rid of bloat;

+1


> right now, VACUUM
> FULL does that, but it's not concurrent.  Users have resorted to using
> the pg_repack third-party tool, which is ancient and uses a weird
> internal implementation, as well as pg_squeeze, which uses logical
> decoding to capture changes that occur during the table rewrite.  The
> patch submitted here, largely by Antonin Houska with some changes by me,
> is based on the pg_squeeze code which he authored, and first
> introduces a new command called REPACK to absorb both VACUUM FULL and
> CLUSTER, followed by addition of a CONCURRENTLY flag to allow some forms
> of REPACK to operate online using logical decoding.

Does this mean REPACK CONCURRENTLY requires wal_level = logical,
while plain REPACK (without CONCURRENTLY) works with any wal_level
setting? If we eventually deprecate VACUUM FULL and CLUSTER,
I think plain REPACK should still be allowed with wal_level = minimal
or replica, so users with those settings can perform equivalent
processing.


+ if (!cluster_is_permitted_for_relation(tableOid, userid,
+    CLUSTER_COMMAND_CLUSTER))

As for the patch you attached, it seems to be an early WIP and
might not be ready for review yet?? BTW, I got the following
compilation failure, and CLUSTER_COMMAND_CLUSTER in the call
above should probably be GetUserId().

-----------------
cluster.c:455:14: error: use of undeclared identifier 'CLUSTER_COMMAND_CLUSTER'
  455 |                     CLUSTER_COMMAND_CLUSTER))
      |                     ^
1 error generated.
-----------------

Regards,

-- 
Fujii Masao

* Re: Adding REPACK [concurrently]
@ 2025-07-31 16:50  Alvaro Herrera <[email protected]>
  parent: Robert Treat <[email protected]>
  0 siblings, 1 reply; 106+ messages in thread

From: Alvaro Herrera @ 2025-07-31 16:50 UTC (permalink / raw)
  To: Robert Treat <[email protected]>; Fujii Masao <[email protected]>; +Cc: Pg Hackers <[email protected]>; Antonin Houska <[email protected]>

On 2025-Jul-26, Robert Treat wrote:

> For clarity, are you intending to commit this patch before having the
> other parts ready? (If that sounds like an objection, it isn't.) After
> a first pass, I think there are some confusing bits in the new docs that
> could use straightening out, but they're likely going to overlap with changes
> once concurrently is brought in, so it might make sense to hold off on
> those.

I'm aiming at getting 0001 committed during the September commitfest,
and the CONCURRENTLY flag addition later in the pg19 cycle.  But I'd
rather have good-enough docs at every step of the way.  They don't have
to be *perfect* if we want to get everything in pg19, but I'd rather not
leave anything openly confusing even transiently.

That said, I did not review the docs this time around, so here they are,
unchanged from the previous post.  But if you want to suggest
changes for the docs in 0001, please do.  Just don't get too carried
away.

> speaking of, for this bit in src/backend/commands/cluster.c
> 
> +    switch (cmd)
> +    {
> +        case REPACK_COMMAND_REPACK:
> +            return "REPACK";
> +        case REPACK_COMMAND_VACUUMFULL:
> +            return "VACUUM";
> +        case REPACK_COMMAND_CLUSTER:
> +            return "VACUUM";
> +    }
> +    return "???";
> 
> The last one should return "CLUSTER" no?

Absolutely -- my blunder.
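
A minimal, self-contained sketch of the mapping with that fix applied.  The
three enum constants are the ones in the quoted hunk; the enum declaration
and the function name repack_command_name are assumptions made so the
fragment compiles on its own, not the patch's actual definitions.

```c
/* Command kinds, named as in the quoted hunk; this declaration is assumed. */
typedef enum RepackCommand
{
    REPACK_COMMAND_REPACK,
    REPACK_COMMAND_VACUUMFULL,
    REPACK_COMMAND_CLUSTER
} RepackCommand;

/*
 * Return the user-visible command name for a command kind.
 * (Hypothetical function name, for illustration only.)
 */
static const char *
repack_command_name(RepackCommand cmd)
{
    switch (cmd)
    {
        case REPACK_COMMAND_REPACK:
            return "REPACK";
        case REPACK_COMMAND_VACUUMFULL:
            return "VACUUM";
        case REPACK_COMMAND_CLUSTER:
            return "CLUSTER";   /* was "VACUUM" in the posted hunk */
    }
    return "???";               /* unreachable for valid enum values */
}
```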

On 2025-Jul-27, Fujii Masao wrote:

> > The patch submitted here, largely by Antonin Houska with some
> > changes by me, is based on the pg_squeeze code which he
> > authored, and first introduces a new command called REPACK to absorb
> > both VACUUM FULL and CLUSTER, followed by addition of a CONCURRENTLY
> > flag to allow some forms of REPACK to operate online using logical
> > decoding.
> 
> Does this mean REPACK CONCURRENTLY requires wal_level = logical, while
> plain REPACK (without CONCURRENTLY) works with any wal_level setting?
> If we eventually deprecate VACUUM FULL and CLUSTER, I think plain
> REPACK should still be allowed with wal_level = minimal or replica, so
> users with those settings can perform equivalent processing.

Absolutely.

One of the later patches in the series, which I have not included yet,
intends to implement the idea of transiently enabling wal_level=logical
for the table being repacked concurrently, so that you can still use
the concurrent mode if you have a non-logical-wal_level instance.
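
To make the rule agreed on above concrete, here is a toy check.  It assumes
only what was said in this exchange: plain REPACK works at any wal_level,
and the concurrent mode needs logical decoding.  ToyWalLevel and
toy_repack_allowed are hypothetical names, not PostgreSQL symbols, and the
sketch describes the situation without the later patch that would enable
logical WAL transiently.

```c
#include <stdbool.h>

/* Toy model of the wal_level settings; not PostgreSQL's own definitions. */
typedef enum ToyWalLevel
{
    TOY_WAL_MINIMAL,
    TOY_WAL_REPLICA,
    TOY_WAL_LOGICAL
} ToyWalLevel;

/*
 * Plain REPACK is allowed at every level; the concurrent mode relies on
 * logical decoding and therefore, in this model, on wal_level = logical.
 */
static bool
toy_repack_allowed(bool concurrently, ToyWalLevel level)
{
    if (!concurrently)
        return true;
    return level == TOY_WAL_LOGICAL;
}
```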

> + if (!cluster_is_permitted_for_relation(tableOid, userid,
> +    CLUSTER_COMMAND_CLUSTER))
> 
> As for the patch you attached, it seems to be an early WIP and
> might not be ready for review yet?? BTW, I got the following
> compilation failure, and CLUSTER_COMMAND_CLUSTER in the call
> above should probably be GetUserId().

This was a silly merge mistake, caused by my squashing Antonin's 0004
(trivial code restructuring) into 0001 at the last minute and failing to
"git add" the compile fixes before doing git-format-patch.

Here's v17.  (I decided that calling my previous one "v1" after Antonin
had gone all the way to v15 was stupid on my part.)  The important part
here is that I rebased Antonin's 0004, that is, the addition of the
CONCURRENTLY flag, plus 0005 regression tests.

The only interesting change here is that I decided to not mess with the
grammar by allowing an unparenthesized CONCURRENTLY keyword; if you want
concurrent, you have to say "REPACK (CONCURRENTLY)".  This is at odds
with the way we use the keyword in other commands, but ISTM we don't
_need_ to support that legacy syntax.  Anyway, this is easy to put back
afterwards, if enough people find it not useless.

I've not reviewed 0003 in depth yet, just rebased it.  But it works to
the point that CI is happy with it.

I've not yet included Antonin's 0006 and 0007.

TODO list for 0001:

- addition of src/bin/scripts/repackdb
- clean up the progress report infrastructure
- doc review

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
Thou shalt check the array bounds of all strings (indeed, all arrays), for
surely where thou typest "foo" someone someday shall type
"supercalifragilisticexpialidocious" (5th Commandment for C programmers)

* Re: Adding REPACK [concurrently]
@ 2025-08-01 11:07  Fujii Masao <[email protected]>
  parent: Alvaro Herrera <[email protected]>
  0 siblings, 1 reply; 106+ messages in thread

From: Fujii Masao @ 2025-08-01 11:07 UTC (permalink / raw)
  To: Alvaro Herrera <[email protected]>; +Cc: Robert Treat <[email protected]>; Pg Hackers <[email protected]>; Antonin Houska <[email protected]>

On Fri, Aug 1, 2025 at 1:50 AM Alvaro Herrera <[email protected]> wrote:
> One of the later patches in the series, which I have not included yet,
> intends to implement the idea of transiently enabling wal_level=logical
> for the table being repacked concurrently, so that you can still use
> the concurrent mode if you have a non-logical-wal_level instance.

Sounds good to me!


> Here's v17.

I just tried REPACK command and observed a few things:

When I repeatedly ran REPACK on the regression database
while make installcheck was running, I got the following error:

        ERROR:  StartTransactionCommand: unexpected state STARTED

"REPACK (VERBOSE);" failed with the following error.

        ERROR:  syntax error at or near ";"

REPACK (CONCURRENTLY) USING INDEX failed with the following error,
while the same command without CONCURRENTLY completed successfully:

        =# REPACK (CONCURRENTLY) parallel_vacuum_table using index regular_sized_index ;
        ERROR:  cannot process relation "parallel_vacuum_table"
        HINT:  Relation "parallel_vacuum_table" has no identity index.

When I ran REPACK (CONCURRENTLY) on a table that's also a logical
replication target, I saw the following log messages. Is this expected?

        =# REPACK (CONCURRENTLY) t;
        LOG:  logical decoding found consistent point at 1/00021F20
        DETAIL:  There are no running transactions.
        STATEMENT:  REPACK (CONCURRENTLY) t;

Regards,

-- 
Fujii Masao

* Re: Adding REPACK [concurrently]
@ 2025-08-04 23:21  Mihail Nikalayeu <[email protected]>
  parent: Fujii Masao <[email protected]>
  0 siblings, 1 reply; 106+ messages in thread

From: Mihail Nikalayeu @ 2025-08-04 23:21 UTC (permalink / raw)
  To: Fujii Masao <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Robert Treat <[email protected]>; Pg Hackers <[email protected]>; Antonin Houska <[email protected]>

Hello Álvaro,

Should we skip non-ready indexes in build_new_indexes?

Also, in the current implementation, concurrent mode is marked as
non-MVCC safe. From my point of view, this is a significant limitation
for practical use.
Should we consider an option to trade the non-MVCC issues for a short
exclusive lock on ProcArrayLock plus cancellation of some transactions
with an older xmin?

Best regards,
Mikhail

* Re: Adding REPACK [concurrently]
@ 2025-08-05 08:58  Antonin Houska <[email protected]>
  parent: Alvaro Herrera <[email protected]>
  6 siblings, 1 reply; 106+ messages in thread

From: Antonin Houska @ 2025-08-05 08:58 UTC (permalink / raw)
  To: Alvaro Herrera <[email protected]>; +Cc: Pg Hackers <[email protected]>

Alvaro Herrera <[email protected]> wrote:

> I made a few changes from Antonin's original at [2].  First, I modified
> the grammar to support "REPACK [tab] USING INDEX" without specifying the
> index name.  With this change, all possibilities of the old commands are
> covered,

...

> Here's a list of existing commands, and how to write them in the current
> patch's proposal for REPACK:
> 
> -- re-clusters all tables that have a clustered index set
> CLUSTER                     -> REPACK USING INDEX
> 
> -- clusters the given table using the given index
> CLUSTER tab USING idx       -> REPACK tab USING INDEX idx
> 
> -- clusters this table using a clustered index; error if no index clustered
> CLUSTER tab                 -> REPACK tab USING INDEX
> 
> -- vacuum-full all tables
> VACUUM FULL                 -> REPACK
> 
> -- vacuum-full the specified table
> VACUUM FULL tab             -> REPACK tab
> 

Now that we want to cover the CLUSTER/VACUUM FULL completely, I've checked the
options of VACUUM FULL. I found two items not supported by REPACK (but also
not supported by CLUSTER): ANALYZE and SKIP_DATABASE_STATS. Maybe let's just
mention that in the user documentation of REPACK?

(Besides that, VACUUM FULL accepts TRUNCATE and INDEX_CLEANUP options, but I
think these have no effect.)

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

* Re: Adding REPACK [concurrently]
@ 2025-08-09 12:55  Mihail Nikalayeu <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  0 siblings, 2 replies; 106+ messages in thread

From: Mihail Nikalayeu @ 2025-08-09 12:55 UTC (permalink / raw)
  To: Fujii Masao <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Robert Treat <[email protected]>; Pg Hackers <[email protected]>; Antonin Houska <[email protected]>

Hello!

One more thing - I think build_new_indexes and
index_concurrently_create_copy are very close in semantics, so it
might be a good idea to refactor them a bit.

I’m still concerned about MVCC-related issues. For multiple
applications, this is a dealbreaker, because in some cases correctness
is a higher priority than availability.

Possible options:

1) Terminate connections with old snapshots.

Add a flag to terminate all connections with snapshots during the
ExclusiveLock period for the swap. From the application’s perspective,
this is not a big deal - it's similar to a primary switch. We would
also need to prevent new snapshots from being taken during the swap
transaction, so a short exclusive lock on ProcArrayLock would also be
required.

2) MVCC-safe two-phase approach (inspired by CREATE INDEX).

- copy the data from T1 to the new table T2.
- apply the log.
- take a table-exclusive lock on T1
- apply the log again.
- instead of swapping, mark the T2 as a kind of shadow table - any
transaction applying changes to T1 must also apply them to T2, while
reads still use T1 as the source of truth.
- commit (and record the transaction ID as XID1).
- at this point, all changes are applied to both tables with the same
XIDs because of the "shadow table" mechanism.
- wait until older snapshots no longer treat XID1 as uncommitted.
- now the tables are identical from the MVCC perspective.
- take an exclusive lock on both T1 and T2.
- perform the swap and drop T1.
- commit.

This is more complex and would require implementing some sort of
"shadow table" mechanism, so it might not be worth the effort. Option
1 feels more appealing to me.
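
The step "wait until older snapshots no longer treat XID1 as uncommitted"
in option 2 can be illustrated with a toy model.  This is only a sketch
under simplifying assumptions: XIDs are plain increasing integers (no
wraparound, no subtransactions), and ToySnapshot, toy_xid_in_progress and
toy_swap_may_proceed are hypothetical names, not PostgreSQL structures or
functions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ToyXid;

/* Simplified snapshot: which transactions were still running when taken. */
typedef struct ToySnapshot
{
    ToyXid        xmin;  /* every XID below this had already finished */
    ToyXid        xmax;  /* every XID at or above this is treated as running */
    const ToyXid *xip;   /* XIDs in [xmin, xmax) that were still running */
    size_t        nxip;
} ToySnapshot;

/* Does this snapshot still treat xid as in progress (i.e. uncommitted)? */
static bool
toy_xid_in_progress(const ToySnapshot *snap, ToyXid xid)
{
    if (xid < snap->xmin)
        return false;
    if (xid >= snap->xmax)
        return true;
    for (size_t i = 0; i < snap->nxip; i++)
    {
        if (snap->xip[i] == xid)
            return true;
    }
    return false;
}

/*
 * The final swap may proceed only once no live snapshot treats XID1
 * (the transaction that set up the shadow table) as in progress.
 */
static bool
toy_swap_may_proceed(const ToySnapshot *snaps, size_t nsnaps, ToyXid xid1)
{
    for (size_t i = 0; i < nsnaps; i++)
    {
        if (toy_xid_in_progress(&snaps[i], xid1))
            return false;
    }
    return true;
}
```

In this model a snapshot taken while XID1 was running blocks the swap until
it is released; snapshots taken after XID1 committed never do.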

If others think this is a good idea, I might try implementing a proof
of concept.

Best regards,
Mikhail

* Re: Adding REPACK [concurrently]
@ 2025-08-09 13:33  Alvaro Herrera <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  1 sibling, 1 reply; 106+ messages in thread

From: Alvaro Herrera @ 2025-08-09 13:33 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Fujii Masao <[email protected]>; Robert Treat <[email protected]>; Pg Hackers <[email protected]>; Antonin Houska <[email protected]>

On 2025-Aug-09, Mihail Nikalayeu wrote:

> Hello!
> 
> One more thing - I think build_new_indexes and
> index_concurrently_create_copy are very close in semantics, so it
> might be a good idea to refactor them a bit.
> 
> I’m still concerned about MVCC-related issues. For multiple
> applications, this is a dealbreaker, because in some cases correctness
> is a higher priority than availability.

Please note that Antonin already implemented this.  See his patches
here:
https://www.postgresql.org/message-id/77690.1725610115%40antos
I proposed to leave this part out initially, which is why it hasn't been
reposted.  We can review and discuss after the initial patches are in.
Because having an MVCC-safe mode has drawbacks, IMO we should make it
optional.

But you're welcome to review that part specifically if you're so
inclined, and offer feedback on it.  (I suggest rewinding your
checked-out tree to branch master at the time that patch was posted, for
easy application.  We can deal with a rebase later.)

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
Officer Krupke, what are we to do?
Gee, officer Krupke, Krup you! (West Side Story, "Gee, Officer Krupke")

* Re: Adding REPACK [concurrently]
@ 2025-08-11 14:22  Antonin Houska <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  1 sibling, 1 reply; 106+ messages in thread

From: Antonin Houska @ 2025-08-11 14:22 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Fujii Masao <[email protected]>; Alvaro Herrera <[email protected]>; Robert Treat <[email protected]>; Pg Hackers <[email protected]>

Mihail Nikalayeu <[email protected]> wrote:

> One more thing - I think build_new_indexes and
> index_concurrently_create_copy are very close in semantics, so it
> might be a good idea to refactor them a bit.

You're right. I think I even used the latter for reference when writing the
first.

0002 in the attached series tries to fix that. build_new_indexes() (in 0004)
is simpler now.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

* Re: Adding REPACK [concurrently]
@ 2025-08-15 12:32  Antonin Houska <[email protected]>
  parent: Antonin Houska <[email protected]>
  0 siblings, 1 reply; 106+ messages in thread

From: Antonin Houska @ 2025-08-15 12:32 UTC (permalink / raw)
  To: Alvaro Herrera <[email protected]>; +Cc: Mihail Nikalayeu <[email protected]>; Fujii Masao <[email protected]>; Robert Treat <[email protected]>; Pg Hackers <[email protected]>

Antonin Houska <[email protected]> wrote:

> Mihail Nikalayeu <[email protected]> wrote:
> 
> > One more thing - I think build_new_indexes and
> > index_concurrently_create_copy are very close in semantics, so it
> > might be a good idea to refactor them a bit.
> 
> You're right. I think I even used the latter for reference when writing the
> first.
> 
> 0002 in the attached series tries to fix that. build_new_indexes() (in 0004)
> is simpler now.

This is v18 again. Parts 0001 through 0004 are unchanged, however 0005 is
added. It implements a new client application pg_repackdb. (If I posted 0005
alone its regression tests would not work. I wonder if the cfbot handles the
repeated occurrence of the 'v18-' prefix correctly.)

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

* Re: Adding REPACK [concurrently]
@ 2025-08-15 12:48  Alvaro Herrera <[email protected]>
  parent: Antonin Houska <[email protected]>
  0 siblings, 0 replies; 106+ messages in thread

From: Alvaro Herrera @ 2025-08-15 12:48 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Mihail Nikalayeu <[email protected]>; Fujii Masao <[email protected]>; Robert Treat <[email protected]>; Pg Hackers <[email protected]>

On 2025-Aug-15, Antonin Houska wrote:

> This is v18 again.

Thanks for this!

> Parts 0001 through 0004 are unchanged, however 0005 is added. It
> implements a new client application pg_repackdb. (If I posted 0005
> alone its regression tests would not work. I wonder if the cfbot
> handles the repeated occurrence of the 'v18-' prefix correctly.)

Yeah, the cfbot is just going to take the attachments from the latest
email in the thread that has any, and assume they are the whole that
make up the patch.  It wouldn't work to post just v18-0005 and assume
that the bot is going to grab patches 0001 through 0004 from a previous
email, if that's what you're thinking.  In short, what you did is
correct and necessary.
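
As a toy illustration of the selection rule just described (this is not the
cfbot's code; ToyMessage and toy_latest_with_attachments are hypothetical
names), picking "the latest email that has any attachments" looks like this:

```c
/* One email in the thread, reduced to its attachment count. */
typedef struct ToyMessage
{
    int n_attachments;
} ToyMessage;

/*
 * Return the index of the newest message that carries attachments,
 * or -1 if no message has any.  Messages are ordered oldest first.
 */
static int
toy_latest_with_attachments(const ToyMessage *msgs, int nmsgs)
{
    for (int i = nmsgs - 1; i >= 0; i--)
    {
        if (msgs[i].n_attachments > 0)
            return i;
    }
    return -1;
}
```

Under this rule, a follow-up carrying only 0005 would be the message
selected, and the earlier 0001 through 0004 would not be picked up.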

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/

* Re: Adding REPACK [concurrently]
@ 2025-08-16 13:41  Robert Treat <[email protected]>
  parent: Antonin Houska <[email protected]>
  0 siblings, 2 replies; 106+ messages in thread

From: Robert Treat @ 2025-08-16 13:41 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Pg Hackers <[email protected]>

On Tue, Aug 5, 2025 at 4:59 AM Antonin Houska <[email protected]> wrote:
>
> Alvaro Herrera <[email protected]> wrote:
>
> > I made a few changes from Antonin's original at [2].  First, I modified
> > the grammar to support "REPACK [tab] USING INDEX" without specifying the
> > index name.  With this change, all possibilities of the old commands are
> > covered,
>
> ...
>
> > Here's a list of existing commands, and how to write them in the current
> > patch's proposal for REPACK:
> >
> > -- re-clusters all tables that have a clustered index set
> > CLUSTER                     -> REPACK USING INDEX
> >
> > -- clusters the given table using the given index
> > CLUSTER tab USING idx       -> REPACK tab USING INDEX idx
> >
> > -- clusters this table using a clustered index; error if no index clustered
> > CLUSTER tab                 -> REPACK tab USING INDEX
> >

In the v18 patch, the docs say that repack doesn't remember the index,
but it seems we are still calling mark_index_clustered, so I think the
above is true but we need to update the docs(?).

> > -- vacuum-full all tables
> > VACUUM FULL                 -> REPACK
> >
> > -- vacuum-full the specified table
> > VACUUM FULL tab             -> REPACK tab
> >
>
> Now that we want to cover the CLUSTER/VACUUM FULL completely, I've checked the
> options of VACUUM FULL. I found two items not supported by REPACK (but also
> not supported by CLUSTER): ANALYZE and SKIP_DATABASE_STATS. Maybe let's just
> mention that in the user documentation of REPACK?
>

I would note that both pg_repack and pg_squeeze analyze by default,
and running "vacuum full analyze" is the recommended behavior, so not
having analyze included is a step backwards.

> (Besides that, VACUUM FULL accepts TRUNCATE and INDEX_CLEANUP options, but I
> think these have no effect.)
>

Yeah, these seem safe to ignore.

Robert Treat
https://xzilla.net

* Re: Adding REPACK [concurrently]
@ 2025-08-19 12:22  Alvaro Herrera <[email protected]>
  parent: Robert Treat <[email protected]>
  1 sibling, 1 reply; 106+ messages in thread

From: Alvaro Herrera @ 2025-08-19 12:22 UTC (permalink / raw)
  To: Robert Treat <[email protected]>; +Cc: Antonin Houska <[email protected]>; Pg Hackers <[email protected]>

On 2025-Aug-16, Robert Treat wrote:

> On Tue, Aug 5, 2025 at 4:59 AM Antonin Houska <[email protected]> wrote:

> > Now that we want to cover the CLUSTER/VACUUM FULL completely, I've checked the
> > options of VACUUM FULL. I found two items not supported by REPACK (but also
> > not supported by CLUSTER): ANALYZE and SKIP_DATABASE_STATS. Maybe let's just
> > mention that in the user documentation of REPACK?
> 
> I would note that both pg_repack and pg_squeeze analyze by default,
> and running "vacuum full analyze" is the recommended behavior, so not
> having analyze included is a step backwards.

Makes sense to add ANALYZE as an option to repack, yeah.

So if I repack a single table with
  REPACK (ANALYZE) table USING INDEX;

then do you expect that this would first cluster the table under
AccessExclusiveLock, then release the lock to do the analyze step, or
would the analyze be done under the same lock?  This is significant for
a query that starts while repack is running, because if we release the
AEL then the query is planned when there are no stats for the table,
which might be bad.

I think the time to run the analyze step should be considerably shorter
than the time to run the repacking step, so running both together under
the same lock should be okay.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Computing is too important to be left to men." (Karen Spärck Jones)

^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-08-19 12:23  Alvaro Herrera <[email protected]>
  parent: Robert Treat <[email protected]>
  1 sibling, 0 replies; 106+ messages in thread

From: Alvaro Herrera @ 2025-08-19 12:23 UTC (permalink / raw)
  To: Robert Treat <[email protected]>; +Cc: Antonin Houska <[email protected]>; Pg Hackers <[email protected]>

On 2025-Aug-16, Robert Treat wrote:

> In the v18 patch, the docs say that repack doesn't remember the index,
> but it seems we are still calling mark_index_clustered, so I think the
> above is true but we need to update the docs(?).

Yes, the docs are obsolete on this point, I'm in the process of updating
them.  Thanks for pointing this out.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"La victoria es para quien se atreve a estar solo"





^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-08-19 18:53  Álvaro Herrera <[email protected]>
  parent: Alvaro Herrera <[email protected]>
  6 siblings, 2 replies; 106+ messages in thread

From: Álvaro Herrera @ 2025-08-19 18:53 UTC (permalink / raw)
  To: Pg Hackers <[email protected]>; Antonin Houska <[email protected]>; +Cc: Robert Treat <[email protected]>; Fujii Masao <[email protected]>; Mihail Nikalayeu <[email protected]>

Hello,

Here's a second cut of the initial REPACK work.  Antonin added an
implementation of pg_repackdb, and there are also fixes for a couple of
bugs that were reported in the thread.  I also added support for the ANALYZE
option as noted by Robert Treat, though it only works if you specify a
single non-partitioned table.  Adding it for the multi-table case is likely
easy, but I didn't try.

I purposefully do not include the CONCURRENTLY work yet -- I want to get
this part committable-clean first, then we can continue with the logical
decoding work on top of that.

Note the choice of shell command name: though none of the other programs in
src/bin/scripts use the "pg_" prefix, this one does.  We thought it made no
sense to follow the old programs as precedent, because people seem to
lament their lack of a pg_ prefix, and they only keep their names because
of their long history.  This one has no history.

Still on pg_repackdb, the implementation here is to install a symlink
called pg_repackdb which points to vacuumdb, and make the program behave
differently when called in this way.  The amount of additional code for
this is relatively small, so I think this is a worthy technique --
assuming it works.  If it doesn't, Antonin proposed a separate binary
that just calls some functions from vacuumdb.  Or maybe we could have a
common source file that both utilities call.
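
To illustrate the argv[0] dispatch idea in isolation, here is a minimal
standalone sketch.  It is not taken from the patch: base_name() and
invoked_as_repackdb() are hypothetical helpers, and the real vacuumdb code
may well obtain the program name differently.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the last path component of "path".  Both '/' and '\\' are
 * treated as separators so that Windows-style paths work too.
 */
const char *
base_name(const char *path)
{
	const char *slash = strrchr(path, '/');
	const char *bslash = strrchr(path, '\\');
	const char *sep = slash;

	/* Pick whichever separator occurs last in the string. */
	if (bslash != NULL && (sep == NULL || bslash > sep))
		sep = bslash;

	return (sep != NULL) ? sep + 1 : path;
}

/*
 * Hypothetical check: was the program started under the name
 * "pg_repackdb" (with or without an ".exe" suffix)?
 */
bool
invoked_as_repackdb(const char *argv0)
{
	const char *name = base_name(argv0);

	return strcmp(name, "pg_repackdb") == 0 ||
		strcmp(name, "pg_repackdb.exe") == 0;
}
```

A main() would presumably call invoked_as_repackdb(argv[0]) once at startup
and pick the option table and help text accordingly.  The check depends only
on the name, so it behaves the same whether the file is a symlink or a copy.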

I edited the docs a bit, limiting the exposure of CLUSTER and VACUUM
FULL, and instead redirecting the user to the REPACK docs.  In the
REPACK docs I modified things for additional clarity.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Selbst das größte Genie würde nicht weit kommen, wenn es
alles seinem eigenen Innern verdanken wollte." (Johann Wolfgang von Goethe)
               Ni aún el genio más grande llegaría muy lejos si
                    quisiera sacarlo todo de su propio interior.

^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-08-20 08:33  Antonin Houska <[email protected]>
  parent: Alvaro Herrera <[email protected]>
  0 siblings, 0 replies; 106+ messages in thread

From: Antonin Houska @ 2025-08-20 08:33 UTC (permalink / raw)
  To: Alvaro Herrera <[email protected]>; +Cc: Robert Treat <[email protected]>; Pg Hackers <[email protected]>

Alvaro Herrera <[email protected]> wrote:

> On 2025-Aug-16, Robert Treat wrote:
> 
> > On Tue, Aug 5, 2025 at 4:59 AM Antonin Houska <[email protected]> wrote:
> 
> > > Now that we want to cover the CLUSTER/VACUUM FULL completely, I've checked the
> > > options of VACUUM FULL. I found two items not supported by REPACK (but also
> > > not supported by CLUSTER): ANALYZE and SKIP_DATABASE_STATS. Maybe just
> > > let's mention that in the user documentation of REPACK?
> > 
> > I would note that both pg_repack and pg_squeeze analyze by default,
> > and running "vacuum full analyze" is the recommended behavior, so not
> > having analyze included is a step backwards.
> 
> Makes sense to add ANALYZE as an option to REPACK, yeah.
> 
> So if I repack a single table with
>   REPACK (ANALYZE) table USING INDEX;
> 
> then do you expect that this would first cluster the table under
> AccessExclusiveLock, then release the lock to do the analyze step, or
> would the analyze be done under the same lock?  This is significant for
> a query that starts while repack is running, because if we release the
> AEL then the query is planned when there are no stats for the table,
> which might be bad.
> 
> I think the time to run the analyze step should be considerably shorter
> than the time to run the repacking step, so running both together under
> the same lock should be okay.

AFAICS, VACUUM FULL first releases the AEL, then it analyzes the table. If
users have not complained so far, I'd assume that vacuum_rel() (effectively
cluster_rel() in the FULL case) does not change the stats that much.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com





^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-08-20 08:53  Antonin Houska <[email protected]>
  parent: Álvaro Herrera <[email protected]>
  1 sibling, 1 reply; 106+ messages in thread

From: Antonin Houska @ 2025-08-20 08:53 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]; Robert Treat <[email protected]>; Fujii Masao <[email protected]>; Mihail Nikalayeu <[email protected]>

Álvaro Herrera <[email protected]> wrote:

> Still on pg_repackdb, the implementation here is to install a symlink
> called pg_repackdb which points to vacuumdb, and make the program behave
> differently when called in this way.  The amount of additional code for
> this is relatively small, so I think this is a worthy technique --
> assuming it works.  If it doesn't, Antonin proposed a separate binary
> that just calls some functions from vacuumdb.  Or maybe we could have a
> common source file that both utilities call.

There's an issue with the symlink, maybe some meson expert can help. In
particular, the CI on Windows ends up with the following error:

ERROR: Tried to install symlink to missing file C:/cirrus/build/tmp_install/usr/local/pgsql/bin/vacuumdb

(The reason it does not happen on other platforms might be that the build is
slower on Windows, and thus it's more prone to some specific race conditions.)

It appears that the 'point_to' argument of the 'install_symlink()' function
[1] is only a string rather than a "real target" [2]. That's likely the reason
the function does not wait for the creation of the 'vacuumdb' executable.

I could not find another symlink of this kind in the tree. (AFAICS, the
postmaster->postgres symlink was removed before Meson was introduced.)

Does anyone happen to have a clue? Thanks.

[1] https://mesonbuild.com/Reference-manual_functions.html#install_symlink
[2] https://mesonbuild.com/Reference-manual_returned_tgt.html

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com





^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-08-20 12:07  Álvaro Herrera <[email protected]>
  parent: Antonin Houska <[email protected]>
  0 siblings, 1 reply; 106+ messages in thread

From: Álvaro Herrera @ 2025-08-20 12:07 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: [email protected]; Robert Treat <[email protected]>; Fujii Masao <[email protected]>; Mihail Nikalayeu <[email protected]>

On 2025-Aug-20, Antonin Houska wrote:

> There's an issue with the symlink, maybe some meson expert can help. In
> particular, the CI on Windows ends up with the following error:
> 
> ERROR: Tried to install symlink to missing file C:/cirrus/build/tmp_install/usr/local/pgsql/bin/vacuumdb

Hmm, that's not the problem I see in the CI run from the commitfest app:

https://cirrus-ci.com/task/5608274336153600

[19:11:00.642] FAILED: [code=2] src/bin/scripts/vacuumdb.exe.p/vacuumdb.c.obj 
[19:11:00.642] "cl" "-Isrc\bin\scripts\vacuumdb.exe.p" "-Isrc\include" "-I..\src\include" "-Ic:\openssl\1.1\include" "-I..\src\include\port\win32" "-I..\src\include\port\win32_msvc" "-Isrc/interfaces/libpq" "-I..\src\interfaces\libpq" "/MDd" "/nologo" "/showIncludes" "/utf-8" "/W2" "/Od" "/Zi" "/Zc:preprocessor" "/DWIN32" "/DWINDOWS" "/D__WINDOWS__" "/D__WIN32__" "/D_CRT_SECURE_NO_DEPRECATE" "/D_CRT_NONSTDC_NO_DEPRECATE" "/wd4018" "/wd4244" "/wd4273" "/wd4101" "/wd4102" "/wd4090" "/wd4267" "/Fdsrc\bin\scripts\vacuumdb.exe.p\vacuumdb.c.pdb" /Fosrc/bin/scripts/vacuumdb.exe.p/vacuumdb.c.obj "/c" ../src/bin/scripts/vacuumdb.c
[19:11:00.642] ../src/bin/scripts/vacuumdb.c(186): error C2059: syntax error: '}'
[19:11:00.642] ../src/bin/scripts/vacuumdb.c(197): warning C4034: sizeof returns 0

The real problem here seems to be the empty long_options_repack array.
I removed it and started a new run to see what happens.  Running now:
https://cirrus-ci.com/build/4961902171783168
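
For context, a sketch of why an empty array definition can break the MSVC
build while compiling fine elsewhere.  As far as is known here, an empty
brace initializer is not valid C before C23 and is accepted by GCC and Clang
only as an extension, which would be consistent with the C2059 error and the
"sizeof returns 0" warning above.  The names below (demo_option,
demo_repack_options, count_demo_options) are made up for illustration, and
the sketch assumes the array was meant to hold getopt_long()-style entries.
One portable alternative to removing the array is to keep a lone terminator
entry:

```c
#include <stddef.h>

/* Hypothetical stand-in for getopt_long()'s struct option. */
struct demo_option
{
	const char *name;			/* NULL marks the end of the list */
	int			has_arg;
	int			val;
};

/*
 * No mode-specific options yet.  Writing "= {};" here is the form that
 * stricter compilers reject; a single terminator entry keeps the array
 * non-empty, so sizeof() is never zero.
 */
const struct demo_option demo_repack_options[] = {
	{NULL, 0, 0}				/* terminator */
};

/* Count the entries that precede the terminator. */
size_t
count_demo_options(const struct demo_option *opts)
{
	size_t		n = 0;

	while (opts[n].name != NULL)
		n++;
	return n;
}
```

Code that walks the list up to the terminator handles the "no options" case
without any special casing.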

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/





^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-08-20 14:22  Antonin Houska <[email protected]>
  parent: Álvaro Herrera <[email protected]>
  0 siblings, 1 reply; 106+ messages in thread

From: Antonin Houska @ 2025-08-20 14:22 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]; Robert Treat <[email protected]>; Fujii Masao <[email protected]>; Mihail Nikalayeu <[email protected]>

Álvaro Herrera <[email protected]> wrote:

> On 2025-Aug-20, Antonin Houska wrote:
> 
> > There's an issue with the symlink, maybe some meson expert can help. In
> > particular, the CI on Windows ends up with the following error:
> > 
> > ERROR: Tried to install symlink to missing file C:/cirrus/build/tmp_install/usr/local/pgsql/bin/vacuumdb
> 
> Hmm, that's not the problem I see in the CI run from the commitfest app:
> 
> https://cirrus-ci.com/task/5608274336153600

I was referring to the other build that you shared off-list (probably
independent from cfbot):

https://cirrus-ci.com/build/4726227505774592

> [19:11:00.642] FAILED: [code=2] src/bin/scripts/vacuumdb.exe.p/vacuumdb.c.obj 
> [19:11:00.642] "cl" "-Isrc\bin\scripts\vacuumdb.exe.p" "-Isrc\include" "-I..\src\include" "-Ic:\openssl\1.1\include" "-I..\src\include\port\win32" "-I..\src\include\port\win32_msvc" "-Isrc/interfaces/libpq" "-I..\src\interfaces\libpq" "/MDd" "/nologo" "/showIncludes" "/utf-8" "/W2" "/Od" "/Zi" "/Zc:preprocessor" "/DWIN32" "/DWINDOWS" "/D__WINDOWS__" "/D__WIN32__" "/D_CRT_SECURE_NO_DEPRECATE" "/D_CRT_NONSTDC_NO_DEPRECATE" "/wd4018" "/wd4244" "/wd4273" "/wd4101" "/wd4102" "/wd4090" "/wd4267" "/Fdsrc\bin\scripts\vacuumdb.exe.p\vacuumdb.c.pdb" /Fosrc/bin/scripts/vacuumdb.exe.p/vacuumdb.c.obj "/c" ../src/bin/scripts/vacuumdb.c
> [19:11:00.642] ../src/bin/scripts/vacuumdb.c(186): error C2059: syntax error: '}'
> [19:11:00.642] ../src/bin/scripts/vacuumdb.c(197): warning C4034: sizeof returns 0
> 
> The real problem here seems to be the empty long_options_repack array.
> I removed it and started a new run to see what happens.  Running now:
> https://cirrus-ci.com/build/4961902171783168

The symlink issue occurred at "Windows - Server 2019, MinGW64 - Meson", where
the code compiled well. The compilation failure mentioned above comes from
"Windows - Server 2019, VS 2019 - Meson & ninja". I think it's still possible
that the symlink issue will occur there once the compilation is fixed.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com





^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-08-20 16:11  Andres Freund <[email protected]>
  parent: Antonin Houska <[email protected]>
  0 siblings, 1 reply; 106+ messages in thread

From: Andres Freund @ 2025-08-20 16:11 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: [email protected]; [email protected]; Robert Treat <[email protected]>; Fujii Masao <[email protected]>; Mihail Nikalayeu <[email protected]>

Hi,

On 2025-08-20 16:22:41 +0200, Antonin Houska wrote:
> Álvaro Herrera <[email protected]> wrote:
>
> > On 2025-Aug-20, Antonin Houska wrote:
> >
> > > There's an issue with the symlink, maybe some meson expert can help. In
> > > particular, the CI on Windows ends up with the following error:
> > >
> > > ERROR: Tried to install symlink to missing file C:/cirrus/build/tmp_install/usr/local/pgsql/bin/vacuumdb
> >
> > Hmm, that's not the problem I see in the CI run from the commitfest app:
> >
> > https://cirrus-ci.com/task/5608274336153600
>
> I was referring to the other build that you shared off-list (probably
> independent from cfbot):
>
> https://cirrus-ci.com/build/4726227505774592
>
> > [19:11:00.642] FAILED: [code=2] src/bin/scripts/vacuumdb.exe.p/vacuumdb.c.obj
> > [19:11:00.642] "cl" "-Isrc\bin\scripts\vacuumdb.exe.p" "-Isrc\include" "-I..\src\include" "-Ic:\openssl\1.1\include" "-I..\src\include\port\win32" "-I..\src\include\port\win32_msvc" "-Isrc/interfaces/libpq" "-I..\src\interfaces\libpq" "/MDd" "/nologo" "/showIncludes" "/utf-8" "/W2" "/Od" "/Zi" "/Zc:preprocessor" "/DWIN32" "/DWINDOWS" "/D__WINDOWS__" "/D__WIN32__" "/D_CRT_SECURE_NO_DEPRECATE" "/D_CRT_NONSTDC_NO_DEPRECATE" "/wd4018" "/wd4244" "/wd4273" "/wd4101" "/wd4102" "/wd4090" "/wd4267" "/Fdsrc\bin\scripts\vacuumdb.exe.p\vacuumdb.c.pdb" /Fosrc/bin/scripts/vacuumdb.exe.p/vacuumdb.c.obj "/c" ../src/bin/scripts/vacuumdb.c
> > [19:11:00.642] ../src/bin/scripts/vacuumdb.c(186): error C2059: syntax error: '}'
> > [19:11:00.642] ../src/bin/scripts/vacuumdb.c(197): warning C4034: sizeof returns 0
> >
> > The real problem here seems to be the empty long_options_repack array.
> > I removed it and started a new run to see what happens.  Running now:
> > https://cirrus-ci.com/build/4961902171783168
>
> The symlink issue occurred at "Windows - Server 2019, MinGW64 - Meson", where
> the code compiled well. The compilation failure mentioned above comes from
> "Windows - Server 2019, VS 2019 - Meson & ninja". I think it's still possible
> that the symlink issue will occur there once the compilation is fixed.

FWIW, I don't think it's particularly wise to rely on symlinks on Windows -
IIRC they will often not be enabled outside of development environments.

Greetings,

Andres Freund





^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-08-20 23:44  Mihail Nikalayeu <[email protected]>
  parent: Alvaro Herrera <[email protected]>
  0 siblings, 2 replies; 106+ messages in thread

From: Mihail Nikalayeu @ 2025-08-20 23:44 UTC (permalink / raw)
  To: Alvaro Herrera <[email protected]>; +Cc: Fujii Masao <[email protected]>; Robert Treat <[email protected]>; Pg Hackers <[email protected]>; Antonin Houska <[email protected]>

Hello everyone!

Alvaro Herrera <[email protected]>:
> Please note that Antonin already implemented this.  See his patches
> here:
> https://www.postgresql.org/message-id/77690.1725610115%40antos
> I proposed to leave this part out initially, which is why it hasn't been
> reposted.  We can review and discuss after the initial patches are in.

I think it is worth pushing it at least in the same release cycle.

> But you're welcome to review that part specifically if you're so
> inclined, and offer feedback on it.  (I suggest to rewind back your
> checked-out tree to branch master at the time that patch was posted, for
> easy application.  We can deal with a rebase later.)

I have rebased that on top of v18 (attached).

Also, I think I found an issue (or lost something during rebase): we
must preserve xmin and cmin during the initial copy, to make sure that the
data will be visible to the snapshots of the concurrent changes applied
later:

static void
reform_and_rewrite_tuple(......)
.....
      /*
       * It is also crucial to stamp the new record with the exact same
       * xid and cid, because the tuple must be visible to the snapshot
       * of the applied concurrent change later.
       */
      CommandId      cid = HeapTupleHeaderGetRawCommandId(tuple->t_data);
      TransactionId   xid = HeapTupleHeaderGetXmin(tuple->t_data);

      heap_insert(NewHeap, copiedTuple, xid, cid, HEAP_INSERT_NO_LOGICAL, NULL);

I'll try to polish that part a little bit.
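
The reasoning can be shown with a deliberately oversimplified toy model.
Nothing below is PostgreSQL code: ToyXid, ToySnapshot, ToyTuple and both
functions are invented for illustration.  The "snapshot" simply treats every
XID below its xmax as visible, ignoring commit status, in-progress lists,
command IDs and XID wraparound.  Under those assumptions, a copy that keeps
the original xmin stays visible to an older snapshot, while a copy restamped
with the copying transaction's (newer) XID does not:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t ToyXid;

/* Toy snapshot: every XID below xmax counts as visible. */
typedef struct ToySnapshot
{
	ToyXid		xmax;
} ToySnapshot;

/* Toy tuple: the inserting XID plus a payload. */
typedef struct ToyTuple
{
	ToyXid		xmin;
	int			value;
} ToyTuple;

/* Visibility test under the toy rules described above. */
bool
toy_tuple_visible(const ToyTuple *tup, const ToySnapshot *snap)
{
	return tup->xmin < snap->xmax;
}

/*
 * Copy a tuple into the "new heap".  With preserve_xmin the original
 * inserter's XID is kept; otherwise the copy is stamped with copier_xid.
 */
ToyTuple
toy_copy_tuple(const ToyTuple *src, bool preserve_xmin, ToyXid copier_xid)
{
	ToyTuple	dst = *src;

	if (!preserve_xmin)
		dst.xmin = copier_xid;
	return dst;
}
```

In this model, a snapshot with xmax 150 sees a tuple inserted by XID 100 only
if the copy keeps xmin = 100; restamping it with a copier XID of 200 hides it.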

> Because having an MVCC-safe mode has drawbacks, IMO we should make it
> optional.
Do you mean some option for the command? Like REPACK (CONCURRENTLY, SAFE)?

Best regards,
Mikhail.


Attachments:

  [application/octet-stream] v18-0006-Preserve-visibility-information-of-the-concurren.patch (56.7K, 2-v18-0006-Preserve-visibility-information-of-the-concurren.patch)
  download | inline diff:
From b2f4d126e04d0396f8ad69f5113ed405a4efd723 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Thu, 21 Aug 2025 00:19:24 +0200
Subject: [PATCH v18 6/6] Preserve visibility information of the concurrent
 data  changes.

As explained in the commit message of the preceding patch of the series, the
data changes done by applications while REPACK CONCURRENTLY is copying the
table contents to a new file are decoded from WAL and eventually also applied
to the new file. To reduce the complexity a little bit, the preceding patch
uses the current transaction (i.e. transaction opened by the REPACK command)
to execute those INSERT, UPDATE and DELETE commands.

However, REPACK is not expected to change visibility of tuples. Therefore,
this patch fixes the handling of the "concurrent data changes". It ensures
that tuples written into the new table have the same XID and command ID (CID)
as they had in the old table.

To "replay" an UPDATE or DELETE command on the new table, we need the
appropriate snapshot to find the previous tuple version in the new table. The
(historic) snapshot we used to decode the UPDATE / DELETE should (by
definition) see the state of the catalog prior to that UPDATE / DELETE. Thus
we can use the same snapshot to find the "old tuple" for UPDATE / DELETE in
the new table if:

1) REPACK CONCURRENTLY preserves visibility information of all tuples - that's
the purpose of this part of the patch series.

2) The table being REPACKed is treated as a system catalog by all transactions
that modify its data. This ensures that reorderbuffer.c generates a new
snapshot for each data change in the table.

We ensure 2) by maintaining a shared hashtable of tables being REPACKed
CONCURRENTLY and by adjusting the RelationIsAccessibleInLogicalDecoding()
macro so it checks this hashtable. (The corresponding flag is also added to
the relation cache, so that the shared hashtable does not have to be accessed
too often.) It's essential that after adding an entry to the hashtable we wait
for completion of all the transactions that might have started to modify our
table before our entry was added. We achieve that by upgrading our lock on
the table to ShareLock temporarily: as soon as we acquire it, no DML command
should be running on the table. (This lock upgrade shouldn't cause any
deadlock because we take care not to hold a lock on other objects at the same
time.)

As long as we preserve the tuple visibility information (which includes XID),
it's important to avoid logical decoding of the WAL generated by DMLs on the
new table: the logical decoding subsystem probably does not expect that the
incoming WAL records contain XIDs of already decoded transactions. (And of
course, repeated decoding would be wasted effort.)

Author: Antonin Houska <[email protected]> with small changes from Mikhail Nikalayeu <[email protected]>
---
 src/backend/access/common/toast_internals.c   |   3 +-
 src/backend/access/heap/heapam.c              |  54 ++-
 src/backend/access/heap/heapam_handler.c      |  23 +-
 src/backend/access/transam/xact.c             |  52 +++
 src/backend/commands/cluster.c                | 400 ++++++++++++++++--
 src/backend/replication/logical/decode.c      |  28 +-
 src/backend/replication/logical/snapbuild.c   |  22 +-
 .../pgoutput_repack/pgoutput_repack.c         |  68 ++-
 src/backend/storage/ipc/ipci.c                |   2 +
 .../utils/activity/wait_event_names.txt       |   1 +
 src/backend/utils/cache/inval.c               |  21 +
 src/backend/utils/cache/relcache.c            |   4 +
 src/include/access/heapam.h                   |  12 +-
 src/include/access/xact.h                     |   2 +
 src/include/commands/cluster.h                |  22 +
 src/include/storage/lwlocklist.h              |   1 +
 src/include/utils/inval.h                     |   2 +
 src/include/utils/rel.h                       |   7 +-
 src/include/utils/snapshot.h                  |   3 +
 .../injection_points/specs/repack.spec        |   4 -
 src/tools/pgindent/typedefs.list              |   1 +
 21 files changed, 636 insertions(+), 96 deletions(-)

diff --git a/src/backend/access/common/toast_internals.c b/src/backend/access/common/toast_internals.c
index a1d0eed8953..586eb42a137 100644
--- a/src/backend/access/common/toast_internals.c
+++ b/src/backend/access/common/toast_internals.c
@@ -320,7 +320,8 @@ toast_save_datum(Relation rel, Datum value,
 		memcpy(VARDATA(&chunk_data), data_p, chunk_size);
 		toasttup = heap_form_tuple(toasttupDesc, t_values, t_isnull);
 
-		heap_insert(toastrel, toasttup, mycid, options, NULL);
+		heap_insert(toastrel, toasttup, GetCurrentTransactionId(), mycid,
+					options, NULL);
 
 		/*
 		 * Create the index entry.  We cheat a little here by not using
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 4fdb3e880e4..e7b9f7b6374 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2059,7 +2059,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
 /*
  *	heap_insert		- insert tuple into a heap
  *
- * The new tuple is stamped with current transaction ID and the specified
+ * The new tuple is stamped with the specified transaction ID and the specified
  * command ID.
  *
  * See table_tuple_insert for comments about most of the input flags, except
@@ -2075,15 +2075,16 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
  * reflected into *tup.
  */
 void
-heap_insert(Relation relation, HeapTuple tup, CommandId cid,
-			int options, BulkInsertState bistate)
+heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+			CommandId cid, int options, BulkInsertState bistate)
 {
-	TransactionId xid = GetCurrentTransactionId();
 	HeapTuple	heaptup;
 	Buffer		buffer;
 	Buffer		vmbuffer = InvalidBuffer;
 	bool		all_visible_cleared = false;
 
+	Assert(TransactionIdIsValid(xid));
+
 	/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
 	Assert(HeapTupleHeaderGetNatts(tup->t_data) <=
 		   RelationGetNumberOfAttributes(relation));
@@ -2165,8 +2166,15 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		/*
 		 * If this is a catalog, we need to transmit combo CIDs to properly
 		 * decode, so log that as well.
+		 *
+		 * HEAP_INSERT_NO_LOGICAL should be set when applying data changes
+		 * done by other transactions during REPACK CONCURRENTLY. In such a
+		 * case, the insertion should not be decoded at all - see
+		 * heap_decode(). (It's also set by raw_heap_insert() for TOAST, but
+		 * TOAST does not pass this test anyway.)
 		 */
-		if (RelationIsAccessibleInLogicalDecoding(relation))
+		if ((options & HEAP_INSERT_NO_LOGICAL) == 0 &&
+			RelationIsAccessibleInLogicalDecoding(relation))
 			log_heap_new_cid(relation, heaptup);
 
 		/*
@@ -2712,7 +2720,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 void
 simple_heap_insert(Relation relation, HeapTuple tup)
 {
-	heap_insert(relation, tup, GetCurrentCommandId(true), 0, NULL);
+	heap_insert(relation, tup, GetCurrentTransactionId(),
+				GetCurrentCommandId(true), 0, NULL);
 }
 
 /*
@@ -2769,11 +2778,11 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
  */
 TM_Result
 heap_delete(Relation relation, ItemPointer tid,
-			CommandId cid, Snapshot crosscheck, bool wait,
-			TM_FailureData *tmfd, bool changingPart, bool wal_logical)
+			TransactionId xid, CommandId cid, Snapshot crosscheck, bool wait,
+			TM_FailureData *tmfd, bool changingPart,
+			bool wal_logical)
 {
 	TM_Result	result;
-	TransactionId xid = GetCurrentTransactionId();
 	ItemId		lp;
 	HeapTupleData tp;
 	Page		page;
@@ -2790,6 +2799,7 @@ heap_delete(Relation relation, ItemPointer tid,
 	bool		old_key_copied = false;
 
 	Assert(ItemPointerIsValid(tid));
+	Assert(TransactionIdIsValid(xid));
 
 	AssertHasSnapshotForToast(relation);
 
@@ -3086,8 +3096,12 @@ l1:
 		/*
 		 * For logical decode we need combo CIDs to properly decode the
 		 * catalog
+		 *
+		 * Like in heap_insert(), visibility is unchanged when called from
+		 * VACUUM FULL / CLUSTER.
 		 */
-		if (RelationIsAccessibleInLogicalDecoding(relation))
+		if (wal_logical &&
+			RelationIsAccessibleInLogicalDecoding(relation))
 			log_heap_new_cid(relation, &tp);
 
 		xlrec.flags = 0;
@@ -3206,11 +3220,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
 	TM_Result	result;
 	TM_FailureData tmfd;
 
-	result = heap_delete(relation, tid,
+	result = heap_delete(relation, tid, GetCurrentTransactionId(),
 						 GetCurrentCommandId(true), InvalidSnapshot,
 						 true /* wait for commit */ ,
-						 &tmfd, false, /* changingPart */
-						 true /* wal_logical */);
+						 &tmfd, false,	/* changingPart */
+						 true /* wal_logical */ );
 	switch (result)
 	{
 		case TM_SelfModified:
@@ -3249,12 +3263,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
  */
 TM_Result
 heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
-			CommandId cid, Snapshot crosscheck, bool wait,
-			TM_FailureData *tmfd, LockTupleMode *lockmode,
+			TransactionId xid, CommandId cid, Snapshot crosscheck,
+			bool wait, TM_FailureData *tmfd, LockTupleMode *lockmode,
 			TU_UpdateIndexes *update_indexes, bool wal_logical)
 {
 	TM_Result	result;
-	TransactionId xid = GetCurrentTransactionId();
 	Bitmapset  *hot_attrs;
 	Bitmapset  *sum_attrs;
 	Bitmapset  *key_attrs;
@@ -3294,6 +3307,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
 				infomask2_new_tuple;
 
 	Assert(ItemPointerIsValid(otid));
+	Assert(TransactionIdIsValid(xid));
 
 	/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
 	Assert(HeapTupleHeaderGetNatts(newtup->t_data) <=
@@ -4133,8 +4147,12 @@ l2:
 		/*
 		 * For logical decoding we need combo CIDs to properly decode the
 		 * catalog.
+		 *
+		 * Like in heap_insert(), visibility is unchanged when called from
+		 * VACUUM FULL / CLUSTER.
 		 */
-		if (RelationIsAccessibleInLogicalDecoding(relation))
+		if (wal_logical &&
+			RelationIsAccessibleInLogicalDecoding(relation))
 		{
 			log_heap_new_cid(relation, &oldtup);
 			log_heap_new_cid(relation, heaptup);
@@ -4500,7 +4518,7 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
 	TM_FailureData tmfd;
 	LockTupleMode lockmode;
 
-	result = heap_update(relation, otid, tup,
+	result = heap_update(relation, otid, tup, GetCurrentTransactionId(),
 						 GetCurrentCommandId(true), InvalidSnapshot,
 						 true /* wait for commit */ ,
 						 &tmfd, &lockmode, update_indexes,
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index c829c06f769..c42a1bd55ee 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -253,7 +253,8 @@ heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
 	tuple->t_tableOid = slot->tts_tableOid;
 
 	/* Perform the insertion, and copy the resulting ItemPointer */
-	heap_insert(relation, tuple, cid, options, bistate);
+	heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+				bistate);
 	ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
 
 	if (shouldFree)
@@ -276,7 +277,8 @@ heapam_tuple_insert_speculative(Relation relation, TupleTableSlot *slot,
 	options |= HEAP_INSERT_SPECULATIVE;
 
 	/* Perform the insertion, and copy the resulting ItemPointer */
-	heap_insert(relation, tuple, cid, options, bistate);
+	heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+				bistate);
 	ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
 
 	if (shouldFree)
@@ -310,8 +312,8 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
 	 * the storage itself is cleaning the dead tuples by itself, it is the
 	 * time to call the index tuple deletion also.
 	 */
-	return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart,
-					   true);
+	return heap_delete(relation, tid, GetCurrentTransactionId(), cid,
+					   crosscheck, wait, tmfd, changingPart, true);
 }
 
 
@@ -329,7 +331,8 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 	slot->tts_tableOid = RelationGetRelid(relation);
 	tuple->t_tableOid = slot->tts_tableOid;
 
-	result = heap_update(relation, otid, tuple, cid, crosscheck, wait,
+	result = heap_update(relation, otid, tuple, GetCurrentTransactionId(),
+						 cid, crosscheck, wait,
 						 tmfd, lockmode, update_indexes, true);
 	ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
 
@@ -2477,9 +2480,15 @@ reform_and_rewrite_tuple(HeapTuple tuple,
 		 * flag to skip logical decoding: as soon as REPACK CONCURRENTLY swaps
 		 * the relation files, it drops this relation, so no logical
 		 * replication subscription should need the data.
+		 *
+		* It is also crucial to stamp the new record with the exact same xid and cid,
+		* because the tuple must be visible to the snapshot of the applied concurrent
+		* change later.
 		 */
-		heap_insert(NewHeap, copiedTuple, GetCurrentCommandId(true),
-					HEAP_INSERT_NO_LOGICAL, NULL);
+		CommandId		cid = HeapTupleHeaderGetRawCommandId(tuple->t_data);
+		TransactionId	xid = HeapTupleHeaderGetXmin(tuple->t_data);
+
+		heap_insert(NewHeap, copiedTuple, xid, cid, HEAP_INSERT_NO_LOGICAL, NULL);
 	}
 
 	heap_freetuple(copiedTuple);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 5670f2bfbde..e913594fc07 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -126,6 +126,18 @@ static FullTransactionId XactTopFullTransactionId = {InvalidTransactionId};
 static int	nParallelCurrentXids = 0;
 static TransactionId *ParallelCurrentXids;
 
+/*
+ * Another case that requires TransactionIdIsCurrentTransactionId() to behave
+ * specially is when REPACK CONCURRENTLY is processing data changes made in
+ * the old storage of a table by other transactions. When applying the changes
+ * to the new storage, the backend executing the REPACK command needs to act
+ * on behalf of those other transactions. The transactions responsible for the
+ * changes in the old storage are stored in this array, sorted by
+ * xidComparator.
+ */
+static int	nRepackCurrentXids = 0;
+static TransactionId *RepackCurrentXids = NULL;
+
 /*
  * Miscellaneous flag bits to record events which occur on the top level
  * transaction. These flags are only persisted in MyXactFlags and are intended
@@ -973,6 +985,8 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
 		int			low,
 					high;
 
+		Assert(nRepackCurrentXids == 0);
+
 		low = 0;
 		high = nParallelCurrentXids - 1;
 		while (low <= high)
@@ -992,6 +1006,21 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
 		return false;
 	}
 
+	/*
+	 * When executing REPACK CONCURRENTLY, the array of current transactions
+	 * is set by SetRepackCurrentXids().
+	 */
+	if (nRepackCurrentXids > 0)
+	{
+		Assert(nParallelCurrentXids == 0);
+
+		return bsearch(&xid,
+					   RepackCurrentXids,
+					   nRepackCurrentXids,
+					   sizeof(TransactionId),
+					   xidComparator) != NULL;
+	}
+
 	/*
 	 * We will return true for the Xid of the current subtransaction, any of
 	 * its subcommitted children, any of its parents, or any of their
@@ -5661,6 +5690,29 @@ EndParallelWorkerTransaction(void)
 	CurrentTransactionState->blockState = TBLOCK_DEFAULT;
 }
 
+/*
+ * SetRepackCurrentXids
+ *		Set the XID array that TransactionIdIsCurrentTransactionId() should
+ *		use.
+ */
+void
+SetRepackCurrentXids(TransactionId *xip, int xcnt)
+{
+	RepackCurrentXids = xip;
+	nRepackCurrentXids = xcnt;
+}
+
+/*
+ * ResetRepackCurrentXids
+ *		Undo the effect of SetRepackCurrentXids().
+ */
+void
+ResetRepackCurrentXids(void)
+{
+	RepackCurrentXids = NULL;
+	nRepackCurrentXids = 0;
+}
+
 /*
  * ShowTransactionState
  *		Debug support
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index aa3ae85bcee..a4d4d37d211 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -82,6 +82,11 @@ typedef struct
  * The following definitions are used for concurrent processing.
  */
 
+/*
+ * OID of the table being repacked by this backend.
+ */
+static Oid	repacked_rel = InvalidOid;
+
 /*
  * The locators are used to avoid logical decoding of data that we do not need
  * for our table.
@@ -125,8 +130,10 @@ static List *get_tables_to_repack_partitioned(RepackCommand cmd,
 static bool cluster_is_permitted_for_relation(RepackCommand cmd,
 											  Oid relid, Oid userid);
 
-static void begin_concurrent_repack(Relation rel);
-static void end_concurrent_repack(void);
+static void begin_concurrent_repack(Relation rel, Relation *index_p,
+									bool *entered_p);
+static void end_concurrent_repack(bool error);
+static void cluster_before_shmem_exit_callback(int code, Datum arg);
 static LogicalDecodingContext *setup_logical_decoding(Oid relid,
 													  const char *slotname,
 													  TupleDesc tupdesc);
@@ -146,6 +153,7 @@ static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
 									ConcurrentChange *change);
 static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
 								   HeapTuple tup_key,
+								   Snapshot snapshot,
 								   IndexInsertState *iistate,
 								   TupleTableSlot *ident_slot,
 								   IndexScanDesc *scan_p);
@@ -437,6 +445,8 @@ cluster_rel(RepackCommand cmd, bool usingindex,
 	bool		verbose = ((params->options & CLUOPT_VERBOSE) != 0);
 	bool		recheck = ((params->options & CLUOPT_RECHECK) != 0);
 	bool		concurrent = ((params->options & CLUOPT_CONCURRENT) != 0);
+	bool		entered,
+				success;
 
 	/*
 	 * Check that the correct lock is held. The lock mode is
@@ -607,23 +617,30 @@ cluster_rel(RepackCommand cmd, bool usingindex,
 		TransferPredicateLocksToHeapRelation(OldHeap);
 
 	/* rebuild_relation does all the dirty work */
+	entered = false;
+	success = false;
 	PG_TRY();
 	{
 		/*
-		 * For concurrent processing, make sure that our logical decoding
-		 * ignores data changes of other tables than the one we are
-		 * processing.
+		 * For concurrent processing, make sure that
+		 *
+		 * 1) our logical decoding ignores data changes of other tables than
+		 * the one we are processing.
+		 *
+		 * 2) other transactions treat this table as if it were a system / user
+		 * catalog, and WAL-log the relevant additional information.
 		 */
 		if (concurrent)
-			begin_concurrent_repack(OldHeap);
+			begin_concurrent_repack(OldHeap, &index, &entered);
 
 		rebuild_relation(cmd, usingindex, OldHeap, index, save_userid,
 						 verbose, concurrent);
+		success = true;
 	}
 	PG_FINALLY();
 	{
-		if (concurrent)
-			end_concurrent_repack();
+		if (concurrent && entered)
+			end_concurrent_repack(!success);
 	}
 	PG_END_TRY();
 
@@ -2383,6 +2400,47 @@ determine_clustered_index(Relation rel, bool usingindex, const char *indexname)
 }
 
 
+/*
+ * Each relation being processed by REPACK CONCURRENTLY must be in the
+ * RepackedRelsHash hashtable.
+ */
+typedef struct RepackedRel
+{
+	Oid			relid;
+	Oid			dbid;
+} RepackedRel;
+
+static HTAB *RepackedRelsHash = NULL;
+
+/*
+ * Maximum number of entries in the hashtable.
+ *
+ * A replication slot is needed for the processing, so use this GUC to
+ * allocate memory for the hashtable.
+ */
+#define	MAX_REPACKED_RELS	(max_replication_slots)
+
+Size
+RepackShmemSize(void)
+{
+	return hash_estimate_size(MAX_REPACKED_RELS, sizeof(RepackedRel));
+}
+
+void
+RepackShmemInit(void)
+{
+	HASHCTL		info;
+
+	info.keysize = sizeof(RepackedRel);
+	info.entrysize = info.keysize;
+
+	RepackedRelsHash = ShmemInitHash("Repacked Relations",
+									 MAX_REPACKED_RELS,
+									 MAX_REPACKED_RELS,
+									 &info,
+									 HASH_ELEM | HASH_BLOBS);
+}
+
 /*
  * Call this function before REPACK CONCURRENTLY starts to setup logical
  * decoding. It makes sure that other users of the table put enough
@@ -2397,11 +2455,119 @@ determine_clustered_index(Relation rel, bool usingindex, const char *indexname)
  *
  * Note that TOAST table needs no attention here as it's not scanned using
  * historic snapshot.
+ *
+ * 'index_p' is an in/out argument because the function unlocks the index
+ * temporarily.
+ *
+ * 'entered_p' receives a bool value telling whether the relation OID was
+ * entered into RepackedRelsHash or not.
  */
 static void
-begin_concurrent_repack(Relation rel)
+begin_concurrent_repack(Relation rel, Relation *index_p, bool *entered_p)
 {
-	Oid			toastrelid;
+	Oid			relid,
+				toastrelid;
+	Relation	index = NULL;
+	Oid			indexid = InvalidOid;
+	RepackedRel key,
+			   *entry;
+	bool		found;
+	static bool before_shmem_exit_callback_setup = false;
+
+	relid = RelationGetRelid(rel);
+	index = index_p ? *index_p : NULL;
+
+	/*
+	 * Make sure that we do not leave an entry in RepackedRelsHash if exiting
+	 * due to FATAL.
+	 */
+	if (!before_shmem_exit_callback_setup)
+	{
+		before_shmem_exit(cluster_before_shmem_exit_callback, 0);
+		before_shmem_exit_callback_setup = true;
+	}
+
+	memset(&key, 0, sizeof(key));
+	key.relid = relid;
+	key.dbid = MyDatabaseId;
+
+	*entered_p = false;
+	LWLockAcquire(RepackedRelsLock, LW_EXCLUSIVE);
+	entry = (RepackedRel *)
+		hash_search(RepackedRelsHash, &key, HASH_ENTER_NULL, &found);
+	if (found)
+	{
+		/*
+		 * Since REPACK CONCURRENTLY takes ShareRowExclusiveLock, a conflict
+		 * should occur much earlier. However that lock may be released
+		 * temporarily, see below.  Anyway, we should complain whatever the
+		 * reason for the conflict might be.
+		 */
+		ereport(ERROR,
+				(errmsg("relation \"%s\" is already being processed by REPACK CONCURRENTLY",
+						RelationGetRelationName(rel))));
+	}
+	if (entry == NULL)
+		ereport(ERROR,
+				(errmsg("too many requests for REPACK CONCURRENTLY at a time"),
+				 errhint("Please consider increasing the \"max_replication_slots\" configuration parameter.")));
+
+	/*
+	 * Even if anything fails below, the caller has to do cleanup in the
+	 * shared memory.
+	 */
+	*entered_p = true;
+
+	/*
+	 * Enable the callback to remove the entry in case of exit. We should not
+	 * do this earlier, otherwise an attempt to insert an already existing
+	 * entry could make us remove that entry (inserted by another backend)
+	 * during ERROR handling.
+	 */
+	Assert(!OidIsValid(repacked_rel));
+	repacked_rel = relid;
+
+	LWLockRelease(RepackedRelsLock);
+
+	/*
+	 * Make sure that other backends are aware of the new hash entry as soon
+	 * as they open our table.
+	 */
+	CacheInvalidateRelcacheImmediate(relid);
+
+	/*
+	 * Also make sure that the existing users of the table update their
+	 * relcache entry as soon as they try to run DML commands on it.
+	 *
+	 * ShareLock is the weakest lock that conflicts with DMLs. If any backend
+	 * has a lower lock, we assume it'll accept our invalidation message when
+	 * it changes the lock mode.
+	 *
+	 * Before upgrading the lock on the relation, close the index temporarily
+	 * to avoid a deadlock if another backend running DML already has its lock
+	 * (ShareLock) on the table and waits for the lock on the index.
+	 */
+	if (index)
+	{
+		indexid = RelationGetRelid(index);
+		index_close(index, ShareUpdateExclusiveLock);
+	}
+	LockRelationOid(relid, ShareLock);
+	UnlockRelationOid(relid, ShareLock);
+	if (OidIsValid(indexid))
+	{
+		/*
+		 * Re-open the index and check that it hasn't changed while unlocked.
+		 */
+		check_index_is_clusterable(rel, indexid, ShareUpdateExclusiveLock);
+
+		/*
+		 * Return the new relcache entry to the caller. (It's been locked by
+		 * the call above.)
+		 */
+		index = index_open(indexid, NoLock);
+		*index_p = index;
+	}
 
 	/* Avoid logical decoding of other relations by this backend. */
 	repacked_rel_locator = rel->rd_locator;
@@ -2419,15 +2585,122 @@ begin_concurrent_repack(Relation rel)
 
 /*
  * Call this when done with REPACK CONCURRENTLY.
+ *
+ * 'error' tells whether the function is being called in order to handle an
+ * error.
  */
 static void
-end_concurrent_repack(void)
+end_concurrent_repack(bool error)
 {
+	RepackedRel key;
+	RepackedRel *entry = NULL;
+	Oid			relid = repacked_rel;
+
+	/* Remove the relation from the hash if we managed to insert one. */
+	if (OidIsValid(repacked_rel))
+	{
+		memset(&key, 0, sizeof(key));
+		key.relid = repacked_rel;
+		key.dbid = MyDatabaseId;
+		LWLockAcquire(RepackedRelsLock, LW_EXCLUSIVE);
+		entry = hash_search(RepackedRelsHash, &key, HASH_REMOVE, NULL);
+		LWLockRelease(RepackedRelsLock);
+
+		/*
+		 * Make others refresh their information whether they should still
+		 * treat the table as catalog from the perspective of writing WAL.
+		 *
+		 * XXX Unlike entering the entry into the hashtable, we do not bother
+		 * with locking and unlocking the table here:
+		 *
+		 * 1) On normal completion (and sometimes even on ERROR), the caller
+		 * is already holding AccessExclusiveLock on the table, so there
+		 * should be no relcache reference unaware of this change.
+		 *
+		 * 2) In the other cases, the worst scenario is that the other
+		 * backends will write unnecessary information to WAL until they close
+		 * the relation.
+		 *
+		 * Should we use ShareLock mode to fix 2) at least for the non-FATAL
+		 * errors? (Our before_shmem_exit callback is in charge of FATAL, and
+		 * that probably should not try to acquire any lock.)
+		 */
+		CacheInvalidateRelcacheImmediate(repacked_rel);
+
+		/*
+		 * By clearing this variable we also disable
+		 * cluster_before_shmem_exit_callback().
+		 */
+		repacked_rel = InvalidOid;
+	}
+
 	/*
 	 * Restore normal function of (future) logical decoding for this backend.
 	 */
 	repacked_rel_locator.relNumber = InvalidOid;
 	repacked_rel_toast_locator.relNumber = InvalidOid;
+
+	/*
+	 * On normal completion (!error), we should not really fail to remove the
+	 * entry. But if it wasn't there for any reason, raise ERROR to make sure
+	 * the transaction is aborted: if other transactions, while changing the
+	 * contents of the relation, didn't know that REPACK CONCURRENTLY was in
+	 * progress, they could have failed to WAL-log enough information, and
+	 * thus we could have produced inconsistent table contents.
+	 *
+	 * On the other hand, if we are already handling an error, there's no
+	 * reason to worry about inconsistent contents of the new storage because
+	 * the transaction is going to be rolled back anyway. Furthermore, by
+	 * raising ERROR here we'd shadow the original error.
+	 */
+	if (!error)
+	{
+		char	   *relname;
+
+		if (OidIsValid(relid) && entry == NULL)
+		{
+			relname = get_rel_name(relid);
+			if (!relname)
+				ereport(ERROR,
+						(errmsg("cache lookup failed for relation %u",
+								relid)));
+
+			ereport(ERROR,
+					(errmsg("relation \"%s\" not found among repacked relations",
+							relname)));
+		}
+	}
+}
+
+/*
+ * A wrapper to call end_concurrent_repack() as a before_shmem_exit callback.
+ */
+static void
+cluster_before_shmem_exit_callback(int code, Datum arg)
+{
+	if (OidIsValid(repacked_rel))
+		end_concurrent_repack(true);
+}
+
+/*
+ * Check if relation is currently being processed by REPACK CONCURRENTLY.
+ */
+bool
+is_concurrent_repack_in_progress(Oid relid)
+{
+	RepackedRel key,
+			   *entry;
+
+	memset(&key, 0, sizeof(key));
+	key.relid = relid;
+	key.dbid = MyDatabaseId;
+
+	LWLockAcquire(RepackedRelsLock, LW_SHARED);
+	entry = (RepackedRel *)
+		hash_search(RepackedRelsHash, &key, HASH_FIND, NULL);
+	LWLockRelease(RepackedRelsLock);
+
+	return entry != NULL;
 }
 
 /*
@@ -2489,6 +2762,9 @@ setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
 	dstate->relid = relid;
 	dstate->tstore = tuplestore_begin_heap(false, false,
 										   maintenance_work_mem);
+#ifdef USE_ASSERT_CHECKING
+	dstate->last_change_xid = InvalidTransactionId;
+#endif
 
 	dstate->tupdesc = tupdesc;
 
@@ -2636,6 +2912,7 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
 		char	   *change_raw,
 				   *src;
 		ConcurrentChange change;
+		Snapshot	snapshot;
 		bool		isnull[1];
 		Datum		values[1];
 
@@ -2704,8 +2981,30 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
 
 			/*
 			 * Find the tuple to be updated or deleted.
+			 *
+			 * As the table being REPACKed concurrently is treated like a
+			 * catalog, new CID is WAL-logged and decoded. And since we use
+			 * the same XID that the original DMLs did, the snapshot used for
+			 * the logical decoding (by now converted to a non-historic MVCC
+			 * snapshot) should see the tuples inserted previously into the
+			 * new heap and/or updated there.
 			 */
-			tup_exist = find_target_tuple(rel, key, nkeys, tup_key,
+			snapshot = change.snapshot;
+
+			/*
+			 * Set what should be considered current transaction (and
+			 * subtransactions) during visibility check.
+			 *
+			 * Note that this snapshot was created from a historic snapshot
+			 * using SnapBuildMVCCFromHistoric(), which does not touch
+			 * 'subxip'. Thus, unlike in a regular MVCC snapshot, the array
+			 * only contains the transactions whose data changes we are
+			 * applying, and their subtransactions. That's exactly what we
+			 * need to check if a particular xact is a "current transaction".
+			 */
+			SetRepackCurrentXids(snapshot->subxip, snapshot->subxcnt);
+
+			tup_exist = find_target_tuple(rel, key, nkeys, tup_key, snapshot,
 										  iistate, ident_slot, &ind_scan);
 			if (tup_exist == NULL)
 				elog(ERROR, "Failed to find target tuple");
@@ -2716,6 +3015,8 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
 			else
 				apply_concurrent_delete(rel, tup_exist, &change);
 
+			ResetRepackCurrentXids();
+
 			if (tup_old != NULL)
 			{
 				pfree(tup_old);
@@ -2728,14 +3029,14 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
 		else
 			elog(ERROR, "Unrecognized kind of change: %d", change.kind);
 
-		/*
-		 * If a change was applied now, increment CID for next writes and
-		 * update the snapshot so it sees the changes we've applied so far.
-		 */
-		if (change.kind != CHANGE_UPDATE_OLD)
+		/* Free the snapshot if this is the last change that needed it. */
+		Assert(change.snapshot->active_count > 0);
+		change.snapshot->active_count--;
+		if (change.snapshot->active_count == 0)
 		{
-			CommandCounterIncrement();
-			UpdateActiveSnapshotCommandId();
+			if (change.snapshot == dstate->snapshot)
+				dstate->snapshot = NULL;
+			FreeSnapshot(change.snapshot);
 		}
 
 		/* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
@@ -2755,16 +3056,35 @@ static void
 apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
 						IndexInsertState *iistate, TupleTableSlot *index_slot)
 {
+	Snapshot	snapshot = change->snapshot;
 	List	   *recheck;
 
+	/*
+	 * For INSERT, the visibility information is not important, but we use the
+	 * snapshot to get CID. Index functions might need the whole snapshot
+	 * anyway.
+	 */
+	SetRepackCurrentXids(snapshot->subxip, snapshot->subxcnt);
+
+	/*
+	 * Write the tuple into the new heap.
+	 *
+	 * The snapshot is the one we used to decode the insert (though converted
+	 * to "non-historic" MVCC snapshot), i.e. the snapshot's curcid is the
+	 * tuple CID incremented by one (due to the "new CID" WAL record that got
+	 * written along with the INSERT record). Thus if we want to use the
+	 * original CID, we need to subtract 1 from curcid.
+	 */
+	Assert(snapshot->curcid != InvalidCommandId &&
+		   snapshot->curcid > FirstCommandId);
 
 	/*
 	 * Like simple_heap_insert(), but make sure that the INSERT is not
 	 * logically decoded - see reform_and_rewrite_tuple() for more
 	 * information.
 	 */
-	heap_insert(rel, tup, GetCurrentCommandId(true), HEAP_INSERT_NO_LOGICAL,
-				NULL);
+	heap_insert(rel, tup, change->xid, snapshot->curcid - 1,
+				HEAP_INSERT_NO_LOGICAL, NULL);
 
 	/*
 	 * Update indexes.
@@ -2772,6 +3092,7 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
 	 * In case functions in the index need the active snapshot and caller
 	 * hasn't set one.
 	 */
+	PushActiveSnapshot(snapshot);
 	ExecStoreHeapTuple(tup, index_slot, false);
 	recheck = ExecInsertIndexTuples(iistate->rri,
 									index_slot,
@@ -2782,6 +3103,8 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
 									NIL,	/* arbiterIndexes */
 									false	/* onlySummarizing */
 		);
+	PopActiveSnapshot();
+	ResetRepackCurrentXids();
 
 	/*
 	 * If recheck is required, it must have been preformed on the source
@@ -2803,6 +3126,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
 	TU_UpdateIndexes update_indexes;
 	TM_Result	res;
 	List	   *recheck;
+	Snapshot	snapshot = change->snapshot;
 
 	/*
 	 * Write the new tuple into the new heap. ('tup' gets the TID assigned
@@ -2810,13 +3134,19 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
 	 *
 	 * Do it like in simple_heap_update(), except for 'wal_logical' (and
 	 * except for 'wait').
+	 *
+	 * Regarding CID, see the comment in apply_concurrent_insert().
 	 */
+	Assert(snapshot->curcid != InvalidCommandId &&
+		   snapshot->curcid > FirstCommandId);
+
 	res = heap_update(rel, &tup_target->t_self, tup,
-					  GetCurrentCommandId(true),
+					  change->xid, snapshot->curcid - 1,
 					  InvalidSnapshot,
 					  false,	/* no wait - only we are doing changes */
 					  &tmfd, &lockmode, &update_indexes,
-					  false /* wal_logical */ );
+	/* wal_logical */
+					  false);
 	if (res != TM_Ok)
 		ereport(ERROR, (errmsg("failed to apply concurrent UPDATE")));
 
@@ -2824,6 +3154,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
 
 	if (update_indexes != TU_None)
 	{
+		PushActiveSnapshot(snapshot);
 		recheck = ExecInsertIndexTuples(iistate->rri,
 										index_slot,
 										iistate->estate,
@@ -2833,6 +3164,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
 										NIL,	/* arbiterIndexes */
 		/* onlySummarizing */
 										update_indexes == TU_Summarizing);
+		PopActiveSnapshot();
 		list_free(recheck);
 	}
 
@@ -2845,6 +3177,12 @@ apply_concurrent_delete(Relation rel, HeapTuple tup_target,
 {
 	TM_Result	res;
 	TM_FailureData tmfd;
+	Snapshot	snapshot = change->snapshot;
+
+
+	/* Regarding CID, see the comment in apply_concurrent_insert(). */
+	Assert(snapshot->curcid != InvalidCommandId &&
+		   snapshot->curcid > FirstCommandId);
 
 	/*
 	 * Delete tuple from the new heap.
@@ -2852,11 +3190,11 @@ apply_concurrent_delete(Relation rel, HeapTuple tup_target,
 	 * Do it like in simple_heap_delete(), except for 'wal_logical' (and
 	 * except for 'wait').
 	 */
-	res = heap_delete(rel, &tup_target->t_self, GetCurrentCommandId(true),
-					  InvalidSnapshot, false,
-					  &tmfd,
-					  false,	/* no wait - only we are doing changes */
-					  false /* wal_logical */ );
+	res = heap_delete(rel, &tup_target->t_self, change->xid,
+					  snapshot->curcid - 1, InvalidSnapshot, false,
+					  &tmfd, false,
+	/* wal_logical */
+					  false);
 
 	if (res != TM_Ok)
 		ereport(ERROR, (errmsg("failed to apply concurrent DELETE")));
@@ -2877,7 +3215,7 @@ apply_concurrent_delete(Relation rel, HeapTuple tup_target,
  */
 static HeapTuple
 find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
-				  IndexInsertState *iistate,
+				  Snapshot snapshot, IndexInsertState *iistate,
 				  TupleTableSlot *ident_slot, IndexScanDesc *scan_p)
 {
 	IndexScanDesc scan;
@@ -2886,7 +3224,7 @@ find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
 	HeapTuple	result = NULL;
 
 	/* XXX no instrumentation for now */
-	scan = index_beginscan(rel, iistate->ident_index, GetActiveSnapshot(),
+	scan = index_beginscan(rel, iistate->ident_index, snapshot,
 						   NULL, nkeys, 0);
 	*scan_p = scan;
 	index_rescan(scan, key, nkeys, NULL, 0);
@@ -2958,6 +3296,8 @@ process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
 	}
 	PG_FINALLY();
 	{
+		ResetRepackCurrentXids();
+
 		if (rel_src)
 			rel_dst->rd_toastoid = InvalidOid;
 	}
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5dc4ae58ffe..9fefcffd8b3 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -475,9 +475,14 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	/*
 	 * If the change is not intended for logical decoding, do not even
-	 * establish transaction for it - REPACK CONCURRENTLY is the typical use
-	 * case.
-	 *
+	 * establish transaction for it. This is particularly important if the
+	 * record was generated by REPACK CONCURRENTLY because this command uses
+	 * the original XID when doing changes in the new storage. The decoding
+	 * system probably does not expect to see the same transaction multiple
+	 * times.
+	 */
+
+	/*
 	 * First, check if REPACK CONCURRENTLY is being performed by this backend.
 	 * If so, only decode data changes of the table that it is processing, and
 	 * the changes of its TOAST relation.
@@ -504,11 +509,11 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	 * Second, skip records which do not contain sufficient information for
 	 * the decoding.
 	 *
-	 * The problem we solve here is that REPACK CONCURRENTLY generates WAL
-	 * when doing changes in the new table. Those changes should not be useful
-	 * for any other user (such as logical replication subscription) because
-	 * the new table will eventually be dropped (after REPACK CONCURRENTLY has
-	 * assigned its file to the "old table").
+	 * One particular problem we solve here is that REPACK CONCURRENTLY
+	 * generates WAL when doing changes in the new table. Those changes should
+	 * not be decoded because reorderbuffer.c considers their XID already
+	 * committed. (REPACK CONCURRENTLY deliberately generates WAL records in
+	 * such a way that they are skipped here.)
 	 */
 	switch (info)
 	{
@@ -995,13 +1000,6 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	xlrec = (xl_heap_insert *) XLogRecGetData(r);
 
-	/*
-	 * Ignore insert records without new tuples (this does happen when
-	 * raw_heap_insert marks the TOAST record as HEAP_INSERT_NO_LOGICAL).
-	 */
-	if (!(xlrec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE))
-		return;
-
 	/* only interested in our database */
 	XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
 	if (target_locator.dbOid != ctx->slot->data.database)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 8e5116a9cab..72a38074a7b 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -155,7 +155,7 @@ static bool ExportInProgress = false;
 static void SnapBuildPurgeOlderTxn(SnapBuild *builder);
 
 /* snapshot building/manipulation/distribution functions */
-static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder);
+static Snapshot SnapBuildBuildSnapshot(SnapBuild *builder, XLogRecPtr lsn);
 
 static void SnapBuildFreeSnapshot(Snapshot snap);
 
@@ -352,12 +352,17 @@ SnapBuildSnapDecRefcount(Snapshot snap)
  * Build a new snapshot, based on currently committed catalog-modifying
  * transactions.
  *
+ * 'lsn' is the location of the commit record (of a catalog-changing
+ * transaction) that triggered creation of the snapshot. Pass
+ * InvalidXLogRecPtr for the transaction base snapshot or if the user of the
+ * snapshot does not need the LSN.
+ *
  * In-progress transactions with catalog access are *not* allowed to modify
  * these snapshots; they have to copy them and fill in appropriate ->curcid
  * and ->subxip/subxcnt values.
  */
 static Snapshot
-SnapBuildBuildSnapshot(SnapBuild *builder)
+SnapBuildBuildSnapshot(SnapBuild *builder, XLogRecPtr lsn)
 {
 	Snapshot	snapshot;
 	Size		ssize;
@@ -425,6 +430,7 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
 	snapshot->active_count = 0;
 	snapshot->regd_count = 0;
 	snapshot->snapXactCompletionCount = 0;
+	snapshot->lsn = lsn;
 
 	return snapshot;
 }
@@ -461,7 +467,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 	if (TransactionIdIsValid(MyProc->xmin))
 		elog(ERROR, "cannot build an initial slot snapshot when MyProc->xmin already is valid");
 
-	snap = SnapBuildBuildSnapshot(builder);
+	snap = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
 
 	/*
 	 * We know that snap->xmin is alive, enforced by the logical xmin
@@ -502,7 +508,7 @@ SnapBuildInitialSnapshotForRepack(SnapBuild *builder)
 
 	Assert(builder->state == SNAPBUILD_CONSISTENT);
 
-	snap = SnapBuildBuildSnapshot(builder);
+	snap = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
 	return SnapBuildMVCCFromHistoric(snap, false);
 }
 
@@ -636,7 +642,7 @@ SnapBuildGetOrBuildSnapshot(SnapBuild *builder)
 	/* only build a new snapshot if we don't have a prebuilt one */
 	if (builder->snapshot == NULL)
 	{
-		builder->snapshot = SnapBuildBuildSnapshot(builder);
+		builder->snapshot = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
 		/* increase refcount for the snapshot builder */
 		SnapBuildSnapIncRefcount(builder->snapshot);
 	}
@@ -716,7 +722,7 @@ SnapBuildProcessChange(SnapBuild *builder, TransactionId xid, XLogRecPtr lsn)
 		/* only build a new snapshot if we don't have a prebuilt one */
 		if (builder->snapshot == NULL)
 		{
-			builder->snapshot = SnapBuildBuildSnapshot(builder);
+			builder->snapshot = SnapBuildBuildSnapshot(builder, lsn);
 			/* increase refcount for the snapshot builder */
 			SnapBuildSnapIncRefcount(builder->snapshot);
 		}
@@ -1130,7 +1136,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		if (builder->snapshot)
 			SnapBuildSnapDecRefcount(builder->snapshot);
 
-		builder->snapshot = SnapBuildBuildSnapshot(builder);
+		builder->snapshot = SnapBuildBuildSnapshot(builder, lsn);
 
 		/* we might need to execute invalidations, add snapshot */
 		if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, xid))
@@ -1958,7 +1964,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 	{
 		SnapBuildSnapDecRefcount(builder->snapshot);
 	}
-	builder->snapshot = SnapBuildBuildSnapshot(builder);
+	builder->snapshot = SnapBuildBuildSnapshot(builder, InvalidXLogRecPtr);
 	SnapBuildSnapIncRefcount(builder->snapshot);
 
 	ReorderBufferSetRestartPoint(builder->reorder, lsn);
diff --git a/src/backend/replication/pgoutput_repack/pgoutput_repack.c b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
index 687fbbc59bb..28bd16f9cc7 100644
--- a/src/backend/replication/pgoutput_repack/pgoutput_repack.c
+++ b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
@@ -32,7 +32,8 @@ static void plugin_truncate(struct LogicalDecodingContext *ctx,
 							Relation relations[],
 							ReorderBufferChange *change);
 static void store_change(LogicalDecodingContext *ctx,
-						 ConcurrentChangeKind kind, HeapTuple tuple);
+						 ConcurrentChangeKind kind, HeapTuple tuple,
+						 TransactionId xid);
 
 void
 _PG_output_plugin_init(OutputPluginCallbacks *cb)
@@ -100,6 +101,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 			  Relation relation, ReorderBufferChange *change)
 {
 	RepackDecodingState *dstate;
+	Snapshot	snapshot;
 
 	dstate = (RepackDecodingState *) ctx->output_writer_private;
 
@@ -107,6 +109,48 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	if (relation->rd_id != dstate->relid)
 		return;
 
+	/*
+	 * Catalog snapshot is fine because the table we are processing is
+	 * temporarily considered a user catalog table.
+	 */
+	snapshot = GetCatalogSnapshot(InvalidOid);
+	Assert(snapshot->snapshot_type == SNAPSHOT_HISTORIC_MVCC);
+	Assert(!snapshot->suboverflowed);
+
+	/*
+	 * This should not happen, but if we don't have enough information to
+	 * apply a new snapshot, the consequences would be bad. Thus prefer ERROR
+	 * to Assert().
+	 */
+	if (XLogRecPtrIsInvalid(snapshot->lsn))
+		ereport(ERROR, (errmsg("snapshot has invalid LSN")));
+
+	/*
+	 * reorderbuffer.c changes the catalog snapshot as soon as it sees a new
+	 * CID or a commit record of a catalog-changing transaction.
+	 */
+	if (dstate->snapshot == NULL || snapshot->lsn != dstate->snapshot_lsn ||
+		snapshot->curcid != dstate->snapshot->curcid)
+	{
+		/* CID should not go backwards. */
+		Assert(dstate->snapshot == NULL ||
+			   snapshot->curcid >= dstate->snapshot->curcid ||
+			   change->txn->xid != dstate->last_change_xid);
+
+		/*
+		 * XXX Is it a problem that the copy is created in
+		 * TopTransactionContext?
+		 *
+		 * XXX Wouldn't it be o.k. for SnapBuildMVCCFromHistoric() to set xcnt
+		 * to 0 instead of converting xip in this case? The point is that
+		 * transactions which are still in progress from the perspective of
+		 * reorderbuffer.c could not be replayed yet, so we do not need to
+		 * examine their XIDs.
+		 */
+		dstate->snapshot = SnapBuildMVCCFromHistoric(snapshot, false);
+		dstate->snapshot_lsn = snapshot->lsn;
+	}
+
 	/* Decode entry depending on its type */
 	switch (change->action)
 	{
@@ -124,7 +168,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				if (newtuple == NULL)
 					elog(ERROR, "Incomplete insert info.");
 
-				store_change(ctx, CHANGE_INSERT, newtuple);
+				store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid);
 			}
 			break;
 		case REORDER_BUFFER_CHANGE_UPDATE:
@@ -141,9 +185,11 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					elog(ERROR, "Incomplete update info.");
 
 				if (oldtuple != NULL)
-					store_change(ctx, CHANGE_UPDATE_OLD, oldtuple);
+					store_change(ctx, CHANGE_UPDATE_OLD, oldtuple,
+								 change->txn->xid);
 
-				store_change(ctx, CHANGE_UPDATE_NEW, newtuple);
+				store_change(ctx, CHANGE_UPDATE_NEW, newtuple,
+							 change->txn->xid);
 			}
 			break;
 		case REORDER_BUFFER_CHANGE_DELETE:
@@ -156,7 +202,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				if (oldtuple == NULL)
 					elog(ERROR, "Incomplete delete info.");
 
-				store_change(ctx, CHANGE_DELETE, oldtuple);
+				store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid);
 			}
 			break;
 		default:
@@ -190,13 +236,13 @@ plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	if (i == nrelations)
 		return;
 
-	store_change(ctx, CHANGE_TRUNCATE, NULL);
+	store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId);
 }
 
 /* Store concurrent data change. */
 static void
 store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
-			 HeapTuple tuple)
+			 HeapTuple tuple, TransactionId xid)
 {
 	RepackDecodingState *dstate;
 	char	   *change_raw;
@@ -266,6 +312,11 @@ store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
 	dst = dst_start + SizeOfConcurrentChange;
 	memcpy(dst, tuple->t_data, tuple->t_len);
 
+	/* Initialize the other fields. */
+	change.xid = xid;
+	change.snapshot = dstate->snapshot;
+	dstate->snapshot->active_count++;
+
 	/* The data has been copied. */
 	if (flattened)
 		pfree(tuple);
@@ -279,6 +330,9 @@ store:
 	isnull[0] = false;
 	tuplestore_putvalues(dstate->tstore, dstate->tupdesc_change,
 						 values, isnull);
+#ifdef USE_ASSERT_CHECKING
+	dstate->last_change_xid = xid;
+#endif
 
 	/* Accounting. */
 	dstate->nchanges++;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index e9ddf39500c..e24e1795aa9 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -151,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
 	size = add_size(size, InjectionPointShmemSize());
 	size = add_size(size, SlotSyncShmemSize());
 	size = add_size(size, AioShmemSize());
+	size = add_size(size, RepackShmemSize());
 
 	/* include additional requested shmem from preload libraries */
 	size = add_size(size, total_addin_request);
@@ -344,6 +345,7 @@ CreateOrAttachShmemStructs(void)
 	WaitEventCustomShmemInit();
 	InjectionPointShmemInit();
 	AioShmemInit();
+	RepackShmemInit();
 }
 
 /*
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 0be307d2ca0..f546c2e12ca 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -352,6 +352,7 @@ DSMRegistry	"Waiting to read or update the dynamic shared memory registry."
 InjectionPoint	"Waiting to read or update information related to injection points."
 SerialControl	"Waiting to read or update shared <filename>pg_serial</filename> state."
 AioWorkerSubmissionQueue	"Waiting to access AIO worker submission queue."
+RepackedRels	"Waiting to access the hash table of relations being repacked."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 02505c88b8e..ecaa2283c2a 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -1643,6 +1643,27 @@ CacheInvalidateRelcache(Relation relation)
 								 databaseId, relationId);
 }
 
+/*
+ * CacheInvalidateRelcacheImmediate
+ *		Send invalidation message for the specified relation's relcache entry.
+ *
+ * Currently this is used in REPACK CONCURRENTLY, to make sure that other
+ * backends are aware that the command is being executed for the relation.
+ */
+void
+CacheInvalidateRelcacheImmediate(Oid relid)
+{
+	SharedInvalidationMessage msg;
+
+	msg.rc.id = SHAREDINVALRELCACHE_ID;
+	msg.rc.dbId = MyDatabaseId;
+	msg.rc.relId = relid;
+	/* check AddCatcacheInvalidationMessage() for an explanation */
+	VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg));
+
+	SendSharedInvalidMessages(&msg, 1);
+}
+
 /*
  * CacheInvalidateRelcacheAll
  *		Register invalidation of the whole relcache at the end of command.
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index d27a4c30548..ea565b5b053 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1279,6 +1279,10 @@ retry:
 	/* make sure relation is marked as having no open file yet */
 	relation->rd_smgr = NULL;
 
+	/* Is REPACK CONCURRENTLY in progress? */
+	relation->rd_repack_concurrent =
+		is_concurrent_repack_in_progress(targetRelId);
+
 	/*
 	 * now we can free the memory allocated for pg_class_tuple
 	 */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b82dd17a966..981425f23b6 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -316,22 +316,24 @@ extern BulkInsertState GetBulkInsertState(void);
 extern void FreeBulkInsertState(BulkInsertState);
 extern void ReleaseBulkInsertStatePin(BulkInsertState bistate);
 
-extern void heap_insert(Relation relation, HeapTuple tup, CommandId cid,
-						int options, BulkInsertState bistate);
+extern void heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+						CommandId cid, int options, BulkInsertState bistate);
 extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots,
 							  int ntuples, CommandId cid, int options,
 							  BulkInsertState bistate);
 extern TM_Result heap_delete(Relation relation, ItemPointer tid,
-							 CommandId cid, Snapshot crosscheck, bool wait,
+							 TransactionId xid, CommandId cid,
+							 Snapshot crosscheck, bool wait,
 							 struct TM_FailureData *tmfd, bool changingPart,
 							 bool wal_logical);
 extern void heap_finish_speculative(Relation relation, ItemPointer tid);
 extern void heap_abort_speculative(Relation relation, ItemPointer tid);
 extern TM_Result heap_update(Relation relation, ItemPointer otid,
-							 HeapTuple newtup,
+							 HeapTuple newtup, TransactionId xid,
 							 CommandId cid, Snapshot crosscheck, bool wait,
 							 struct TM_FailureData *tmfd, LockTupleMode *lockmode,
-							 TU_UpdateIndexes *update_indexes, bool wal_logical);
+							 TU_UpdateIndexes *update_indexes,
+							 bool wal_logical);
 extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
 								 CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
 								 bool follow_updates,
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index b2bc10ee041..fbb66d559b6 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -482,6 +482,8 @@ extern Size EstimateTransactionStateSpace(void);
 extern void SerializeTransactionState(Size maxsize, char *start_address);
 extern void StartParallelWorkerTransaction(char *tstatespace);
 extern void EndParallelWorkerTransaction(void);
+extern void SetRepackCurrentXids(TransactionId *xip, int xcnt);
+extern void ResetRepackCurrentXids(void);
 extern bool IsTransactionBlock(void);
 extern bool IsTransactionOrTransactionBlock(void);
 extern char TransactionBlockStatusCode(void);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 532ffa7208d..00ae6fc18e8 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -59,6 +59,14 @@ typedef struct ConcurrentChange
 	/* See the enum above. */
 	ConcurrentChangeKind kind;
 
+	/* Transaction that changes the data. */
+	TransactionId xid;
+
+	/*
+	 * Historic catalog snapshot that was used to decode this change.
+	 */
+	Snapshot	snapshot;
+
 	/*
 	 * The actual tuple.
 	 *
@@ -90,6 +98,8 @@ typedef struct RepackDecodingState
 	 * tuplestore does this transparently.
 	 */
 	Tuplestorestate *tstore;
+	/* XID of the last change added to tstore. */
+	TransactionId last_change_xid PG_USED_FOR_ASSERTS_ONLY;
 
 	/* The current number of changes in tstore. */
 	double		nchanges;
@@ -110,6 +120,14 @@ typedef struct RepackDecodingState
 	/* Slot to retrieve data from tstore. */
 	TupleTableSlot *tsslot;
 
+	/*
+	 * Historic catalog snapshot that was used to decode the most recent
+	 * change.
+	 */
+	Snapshot	snapshot;
+	/* LSN of the record  */
+	XLogRecPtr	snapshot_lsn;
+
 	ResourceOwner resowner;
 } RepackDecodingState;
 
@@ -139,4 +157,8 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
 							 MultiXactId cutoffMulti,
 							 char newrelpersistence);
 
+extern Size RepackShmemSize(void);
+extern void RepackShmemInit(void);
+extern bool is_concurrent_repack_in_progress(Oid relid);
+
 #endif							/* CLUSTER_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 208d2e3a8ed..869d9da7337 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -85,6 +85,7 @@ PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
 PG_LWLOCK(53, AioWorkerSubmissionQueue)
+PG_LWLOCK(54, RepackedRels)
 
 /*
  * There also exist several built-in LWLock tranches.  As with the predefined
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index 9b871caef62..ae9dee394dc 100644
--- a/src/include/utils/inval.h
+++ b/src/include/utils/inval.h
@@ -50,6 +50,8 @@ extern void CacheInvalidateCatalog(Oid catalogId);
 
 extern void CacheInvalidateRelcache(Relation relation);
 
+extern void CacheInvalidateRelcacheImmediate(Oid relid);
+
 extern void CacheInvalidateRelcacheAll(void);
 
 extern void CacheInvalidateRelcacheByTuple(HeapTuple classTuple);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index b552359915f..66de3bc0c29 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -253,6 +253,9 @@ typedef struct RelationData
 	bool		pgstat_enabled; /* should relation stats be counted */
 	/* use "struct" here to avoid needing to include pgstat.h: */
 	struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+
+	/* Is REPACK CONCURRENTLY being performed on this relation? */
+	bool		rd_repack_concurrent;
 } RelationData;
 
 
@@ -695,7 +698,9 @@ RelationCloseSmgr(Relation relation)
 #define RelationIsAccessibleInLogicalDecoding(relation) \
 	(XLogLogicalInfoActive() && \
 	 RelationNeedsWAL(relation) && \
-	 (IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation)))
+	 (IsCatalogRelation(relation) || \
+	  RelationIsUsedAsCatalogTable(relation) || \
+	  (relation)->rd_repack_concurrent))
 
 /*
  * RelationIsLogicallyLogged
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 0e546ec1497..014f27db7d7 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -13,6 +13,7 @@
 #ifndef SNAPSHOT_H
 #define SNAPSHOT_H
 
+#include "access/xlogdefs.h"
 #include "lib/pairingheap.h"
 
 
@@ -201,6 +202,8 @@ typedef struct SnapshotData
 	uint32		regd_count;		/* refcount on RegisteredSnapshots */
 	pairingheap_node ph_node;	/* link in the RegisteredSnapshots heap */
 
+	XLogRecPtr	lsn;			/* position in the WAL stream when taken */
+
 	/*
 	 * The transaction completion count at the time GetSnapshotData() built
 	 * this snapshot. Allows to avoid re-computing static snapshots when no
diff --git a/src/test/modules/injection_points/specs/repack.spec b/src/test/modules/injection_points/specs/repack.spec
index 75850334986..3711a7c92b9 100644
--- a/src/test/modules/injection_points/specs/repack.spec
+++ b/src/test/modules/injection_points/specs/repack.spec
@@ -86,9 +86,6 @@ step change_new
 # When applying concurrent data changes, we should see the effects of an
 # in-progress subtransaction.
 #
-# XXX Not sure this test is useful now - it was designed for the patch that
-# preserves tuple visibility and which therefore modifies
-# TransactionIdIsCurrentTransactionId().
 step change_subxact1
 {
 	BEGIN;
@@ -103,7 +100,6 @@ step change_subxact1
 # When applying concurrent data changes, we should not see the effects of a
 # rolled back subtransaction.
 #
-# XXX Is this test useful? See above.
 step change_subxact2
 {
 	BEGIN;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 480ce894978..e42bb5209c0 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2540,6 +2540,7 @@ ReorderBufferTupleCidKey
 ReorderBufferUpdateProgressTxnCB
 ReorderTuple
 RepOriginId
+RepackedRel
 RepackCommand
 RepackDecodingState
 RepackStmt
-- 
2.43.0




* Re: Adding REPACK [concurrently]
@ 2025-08-21 18:07  Antonin Houska <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  1 sibling, 1 reply; 106+ messages in thread

From: Antonin Houska @ 2025-08-21 18:07 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Fujii Masao <[email protected]>; Robert Treat <[email protected]>; Pg Hackers <[email protected]>

Mihail Nikalayeu <[email protected]> wrote:

> Also, I think I found an issue (or lost something during rebase): we
> must preserve xmin,cmin during initial copy
> to make sure that data is going to be visible by snapshots of
> concurrent changes later:
> 
> static void
> reform_and_rewrite_tuple(......)
> .....
>       /*It is also crucial to stamp the new record with the exact same
> xid and cid,
>       * because the tuple must be visible to the snapshot of the
> applied concurrent
>       * change later.
>       */
>       CommandId      cid = HeapTupleHeaderGetRawCommandId(tuple->t_data);
>       TransactionId   xid = HeapTupleHeaderGetXmin(tuple->t_data);
> 
>       heap_insert(NewHeap, copiedTuple, xid, cid, HEAP_INSERT_NO_LOGICAL, NULL);

When posting version 12 of the patch [1] I raised a concern that MVCC
safety is too expensive when it comes to logical decoding. Therefore, I
abandoned the concept for now, and v13 [2] uses plain heap_insert(). Once we
implement the MVCC safety, we simply rewrite the tuple like v12 did - that's
the simplest way to preserve fields like xmin, cmin, ...

[1] https://www.postgresql.org/message-id/178741.1743514291%40localhost
[2] https://www.postgresql.org/message-id/97795.1744363522%40localhost

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com






* Re: Adding REPACK [concurrently]
@ 2025-08-21 18:14  Antonin Houska <[email protected]>
  parent: Andres Freund <[email protected]>
  0 siblings, 1 reply; 106+ messages in thread

From: Antonin Houska @ 2025-08-21 18:14 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: [email protected]; [email protected]; Robert Treat <[email protected]>; Fujii Masao <[email protected]>; Mihail Nikalayeu <[email protected]>

Andres Freund <[email protected]> wrote:

> Hi,
> 
> On 2025-08-20 16:22:41 +0200, Antonin Houska wrote:
> > Álvaro Herrera <[email protected]> wrote:
> >
> > > On 2025-Aug-20, Antonin Houska wrote:
> > >
> > > > There's an issue with the symlink, maybe some meson expert can help. In
> > > > particular, the CI on Windows ends up with the following error:
> > > >
> > > > ERROR: Tried to install symlink to missing file C:/cirrus/build/tmp_install/usr/local/pgsql/bin/vacuumdb
> > >
> > > Hmm, that's not the problem I see in the CI run from the commitfest app:
> > >
> > > https://cirrus-ci.com/task/5608274336153600
> >
> > I was referring to the other build that you shared off-list (probably
> > independent from cfbot):
> >
> > https://cirrus-ci.com/build/4726227505774592
> >
> > > [19:11:00.642] FAILED: [code=2] src/bin/scripts/vacuumdb.exe.p/vacuumdb.c.obj
> > > [19:11:00.642] "cl" "-Isrc\bin\scripts\vacuumdb.exe.p" "-Isrc\include" "-I..\src\include" "-Ic:\openssl\1.1\include" "-I..\src\include\port\win32" "-I..\src\include\port\win32_msvc" "-Isrc/interfaces/libpq" "-I..\src\interfaces\libpq" "/MDd" "/nologo" "/showIncludes" "/utf-8" "/W2" "/Od" "/Zi" "/Zc:preprocessor" "/DWIN32" "/DWINDOWS" "/D__WINDOWS__" "/D__WIN32__" "/D_CRT_SECURE_NO_DEPRECATE" "/D_CRT_NONSTDC_NO_DEPRECATE" "/wd4018" "/wd4244" "/wd4273" "/wd4101" "/wd4102" "/wd4090" "/wd4267" "/Fdsrc\bin\scripts\vacuumdb.exe.p\vacuumdb.c.pdb" /Fosrc/bin/scripts/vacuumdb.exe.p/vacuumdb.c.obj "/c" ../src/bin/scripts/vacuumdb.c
> > > [19:11:00.642] ../src/bin/scripts/vacuumdb.c(186): error C2059: syntax error: '}'
> > > [19:11:00.642] ../src/bin/scripts/vacuumdb.c(197): warning C4034: sizeof returns 0
> > >
> > > The real problem here seems to be the empty long_options_repack array.
> > > I removed it and started a new run to see what happens.  Running now:
> > > https://cirrus-ci.com/build/4961902171783168
> >
> > The symlink issue occurred at "Windows - Server 2019, MinGW64 - Meson", where
> > the code compiled well. The compilation failure mentioned above comes from
> > "Windows - Server 2019, VS 2019 - Meson & ninja". I think it's still possible
> > that the symlink issue will occur there once the compilation is fixed.
> 
> FWIW, I don't think it's particularly wise to rely on symlinks on windows -
> IIRC they will often not be enabled outside of development environments.

OK, installing a copy of the same executable under a different name seems more
reliable. At least that's how the postmaster->postgres link used to be
handled, if I read the Makefile correctly. Thanks.
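
To illustrate the name-based dispatch discussed here (a sketch only, not
code from the patch): a program installed under two names, whether via a
symlink or a plain copy, can inspect argv[0] to decide how to behave. The
helper names below are hypothetical; PostgreSQL frontend programs normally
obtain the program name with get_progname().

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: return the last path component of argv0. */
static const char *
sketch_base_name(const char *argv0)
{
	const char *p;
	const char *base = argv0;

	assert(argv0 != NULL);

	/* Accept both '/' and '\\' so Windows-style paths work too. */
	for (p = argv0; *p != '\0'; p++)
	{
		if (*p == '/' || *p == '\\')
			base = p + 1;
	}
	return base;
}

/*
 * Hypothetical helper: true if the program was started under the name
 * "pg_repackdb", with an optional ".exe" suffix.  The comparison is
 * case-sensitive; a real Windows build would probably want a
 * case-insensitive one.
 */
static bool
sketch_invoked_as_repackdb(const char *argv0)
{
	const char *base = sketch_base_name(argv0);
	const char *name = "pg_repackdb";
	size_t		len = strlen(name);

	if (strncmp(base, name, len) != 0)
		return false;
	return base[len] == '\0' || strcmp(base + len, ".exe") == 0;
}
```

main() would call sketch_invoked_as_repackdb(argv[0]) once and pick the
option table accordingly. The same check works for a copied executable,
which is why a copy can stand in for the symlink on platforms where
symlinks are unreliable.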

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com






* Re: Adding REPACK [concurrently]
@ 2025-08-21 18:16  Andres Freund <[email protected]>
  parent: Antonin Houska <[email protected]>
  0 siblings, 0 replies; 106+ messages in thread

From: Andres Freund @ 2025-08-21 18:16 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: [email protected]; [email protected]; Robert Treat <[email protected]>; Fujii Masao <[email protected]>; Mihail Nikalayeu <[email protected]>

Hi,

On 2025-08-21 20:14:14 +0200, Antonin Houska wrote:
> OK, installing a copy of the same executable under a different name seems more
> reliable. At least that's how the postmaster->postgres link used to be
> handled, if I read the Makefile correctly. Thanks.

I have not followed this thread, but I don't think the whole thing of having a
single executable with multiple names is worth doing. Just make whatever an
option, instead of having multiple "executables".

Greetings,

Andres






* Re: Adding REPACK [concurrently]
@ 2025-08-21 22:06  Robert Treat <[email protected]>
  parent: Álvaro Herrera <[email protected]>
  1 sibling, 1 reply; 106+ messages in thread

From: Robert Treat @ 2025-08-21 22:06 UTC (permalink / raw)
  To: Álvaro Herrera <[email protected]>; +Cc: Pg Hackers <[email protected]>; Antonin Houska <[email protected]>; Fujii Masao <[email protected]>; Mihail Nikalayeu <[email protected]>

On Tue, Aug 19, 2025 at 2:53 PM Álvaro Herrera <[email protected]> wrote:
> Note choice of shell command name: though all the other programs in
> src/bin/scripts do not use the "pg_" prefix, this one does; we thought
> it made no sense to follow the old programs as precedent because there
> seems to be a lament for the lack of pg_ prefix in those, and we only
> keep what they are because of their long history.  This one has no
> history.
>
> Still on pg_repackdb, the implementation here is to install a symlink
> called pg_repackdb which points to vacuumdb, and make the program behave
> differently when called in this way.  The amount of additional code for
> this is relatively small, so I think this is a worthy technique --
> assuming it works.  If it doesn't, Antonin proposed a separate binary
> that just calls some functions from vacuumdb.  Or maybe we could have a
> common source file that both utilities call.
>

What's the plan for clusterdb? It seems like we'd ideally create a
stand alone pg_repackdb which replaces clusterdb and also allows us to
remove the FULL options from vacuumdb.


Robert Treat
https://xzilla.net






* Re: Adding REPACK [concurrently]
@ 2025-08-22 09:40  Álvaro Herrera <[email protected]>
  parent: Robert Treat <[email protected]>
  0 siblings, 1 reply; 106+ messages in thread

From: Álvaro Herrera @ 2025-08-22 09:40 UTC (permalink / raw)
  To: Robert Treat <[email protected]>; +Cc: Pg Hackers <[email protected]>; Antonin Houska <[email protected]>; Fujii Masao <[email protected]>; Mihail Nikalayeu <[email protected]>

On 2025-Aug-21, Robert Treat wrote:

> What's the plan for clusterdb? It seems like we'd ideally create a
> stand alone pg_repackdb which replaces clusterdb and also allows us to
> remove the FULL options from vacuumdb.

I don't think we should remove clusterdb, to avoid breaking any scripts
that work today.  As you say, I would create the standalone pg_repackdb
to do what we need it to do (namely: run the REPACK commands) and leave
vacuumdb and clusterdb alone.  Removing the obsolete commands and
options can be done in a few years.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/


* Re: Adding REPACK [concurrently]
@ 2025-08-22 20:32  Euler Taveira <[email protected]>
  parent: Álvaro Herrera <[email protected]>
  0 siblings, 1 reply; 106+ messages in thread

From: Euler Taveira @ 2025-08-22 20:32 UTC (permalink / raw)
  To: Álvaro Herrera <[email protected]>; Robert Treat <[email protected]>; +Cc: Pg Hackers <[email protected]>; Antonin Houska <[email protected]>; Fujii Masao <[email protected]>; Mihail Nikalayeu <[email protected]>

On Fri, Aug 22, 2025, at 6:40 AM, Álvaro Herrera wrote:
> On 2025-Aug-21, Robert Treat wrote:
>
>> What's the plan for clusterdb? It seems like we'd ideally create a
>> stand alone pg_repackdb which replaces clusterdb and also allows us to
>> remove the FULL options from vacuumdb.
>
> I don't think we should remove clusterdb, to avoid breaking any scripts
> that work today.  As you say, I would create the standalone pg_repackdb
> to do what we need it to do (namely: run the REPACK commands) and leave
> vacuumdb and clusterdb alone.  Removing the obsolete commands and
> options can be done in a few years.
>

I would say that we need to plan the removal of these binaries (clusterdb and
vacuumdb). We can start by adding a warning to clusterdb saying that users
should use pg_repackdb. In a few years, we can remove clusterdb. There were
complaints about binary names without a pg_ prefix in the past [1].

I don't think we need to keep vacuumdb. Packagers can keep a symlink (vacuumdb)
to pg_repackdb. We can add a similar warning message saying they should use
pg_repackdb if the symlink is used.


[1] https://www.postgresql.org/message-id/CAJgfmqXYYKXR%2BQUhEa3cq6pc8dV0Hu7QvOUccm7R0TkC%3DT-%2B%3DA%40...


-- 
Euler Taveira
EDB   https://www.enterprisedb.com/






* Re: Adding REPACK [concurrently]
@ 2025-08-23 06:56  Michael Banck <[email protected]>
  parent: Euler Taveira <[email protected]>
  0 siblings, 1 reply; 106+ messages in thread

From: Michael Banck @ 2025-08-23 06:56 UTC (permalink / raw)
  To: Euler Taveira <[email protected]>; +Cc: Álvaro Herrera <[email protected]>; Robert Treat <[email protected]>; Pg Hackers <[email protected]>; Antonin Houska <[email protected]>; Fujii Masao <[email protected]>; Mihail Nikalayeu <[email protected]>

Hi,

On Fri, Aug 22, 2025 at 05:32:34PM -0300, Euler Taveira wrote:
> On Fri, Aug 22, 2025, at 6:40 AM, Álvaro Herrera wrote:
> > On 2025-Aug-21, Robert Treat wrote:
> >> What's the plan for clusterdb? It seems like we'd ideally create a
> >> stand alone pg_repackdb which replaces clusterdb and also allows us to
> >> remove the FULL options from vacuumdb.
> >
> > I don't think we should remove clusterdb, to avoid breaking any scripts
> > that work today.  As you say, I would create the standalone pg_repackdb
> > to do what we need it to do (namely: run the REPACK commands) and leave
> > vacuumdb and clusterdb alone.  Removing the obsolete commands and
> > options can be done in a few years.
> 
> I would say that we need to plan the removal of these binaries (clusterdb and
> vacuumdb). We can start by adding a warning to clusterdb saying that users
> should use pg_repackdb. In a few years, we can remove clusterdb. There were
> complaints about binary names without a pg_ prefix in the past [1].

Yeah.
 
> I don't think we need to keep vacuumdb. Packagers can keep a symlink (vacuumdb)
> to pg_repackdb. We can add a similar warning message saying they should use
> pg_repackdb if the symlink is used.

Unless pg_repackdb has the same (or a superset of) CLI and behaviour as
vacuumdb (I haven't checked, but doubt it?), I think replacing vacuumdb
with a symlink to pg_repackdb will lead to much more breakage in existing
scripts/automation than removing clusterdb would, which I guess is used
orders of magnitude less frequently than vacuumdb.


Michael






* Re: Adding REPACK [concurrently]
@ 2025-08-23 14:22  Álvaro Herrera <[email protected]>
  parent: Michael Banck <[email protected]>
  0 siblings, 1 reply; 106+ messages in thread

From: Álvaro Herrera @ 2025-08-23 14:22 UTC (permalink / raw)
  To: Michael Banck <[email protected]>; Euler Taveira <[email protected]>; +Cc: Robert Treat <[email protected]>; pgsql-hackers <[email protected]>; Antonin Houska <[email protected]>; Fujii Masao <[email protected]>; Mihail Nikalayeu <[email protected]>

On 2025-08-23, Michael Banck wrote:

> On Fri, Aug 22, 2025 at 05:32:34PM -0300, Euler Taveira wrote:

>> I don't think we need to keep vacuumdb. Packagers can keep a symlink (vacuumdb)
>> to pg_repackdb. We can add a similar warning message saying they should use
>> pg_repackdb if the symlink is used.
>
> Unless pg_repackdb has the same (or a superset of) CLI and behaviour as
> vacuumdb (I haven't checked, but doubt it?), I think replacing vacuumdb
> with a symlink to pg_repackdb will lead to much more breakage in existing
> scripts/automation than removing clusterdb would, which I guess is used
> orders of magnitude less frequently than vacuumdb.

Yeah, I completely disagree with the idea of getting rid of vacuumdb. We can, maybe, in a distant future, get rid of the --full option to vacuumdb.  But the rest of the vacuumdb behavior must stay, I think, because REPACK is not VACUUM — it is only VACUUM FULL. And we want to make that distinction very clear.

We can also, in a few years, get rid of clusterdb.  But I don't think we need to deprecate it just yet.

-- 
Álvaro Herrera






* Re: Adding REPACK [concurrently]
@ 2025-08-24 16:52  Mihail Nikalayeu <[email protected]>
  parent: Antonin Houska <[email protected]>
  0 siblings, 1 reply; 106+ messages in thread

From: Mihail Nikalayeu @ 2025-08-24 16:52 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Fujii Masao <[email protected]>; Robert Treat <[email protected]>; Pg Hackers <[email protected]>

Hello, Antonin!

> When posting version 12 of the patch [1] I raised a concern that MVCC
> safety is too expensive when it comes to logical decoding. Therefore, I
> abandoned the concept for now, and v13 [2] uses plain heap_insert(). Once we
> implement the MVCC safety, we simply rewrite the tuple like v12 did - that's
> the simplest way to preserve fields like xmin, cmin, ...

Thanks for the explanation.

I was looking into catalog-related logical decoding features, and it
seems like they are clearly overkill for the repack case.
We don't need CID tracking or even a snapshot for each commit if we’re
okay with passing xmin/xmax as arguments.

What do you think about the following approach for replaying:
* use the extracted XID as the value for xmin/xmax.
* use SnapshotSelf to find the tuple for update/delete operations.

SnapshotSelf seems like a good fit here:
* it sees the last "existing" version.
* any XID set as xmin/xmax in the repacked version is already
committed - so each update/insert is effectively "committed" once
written.
* it works with multiple updates of the same tuple within a single
transaction - SnapshotSelf sees the last version.
* all updates are ordered and replayed sequentially - so the last
version is always the one we want.

If I'm not missing anything, this looks like something worth including
in the patch set.
If so, I can try implementing a test version.

Best regards,
Mikhail


* Re: Adding REPACK [concurrently]
@ 2025-08-25 13:09  Antonin Houska <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  0 siblings, 1 reply; 106+ messages in thread

From: Antonin Houska @ 2025-08-25 13:09 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Fujii Masao <[email protected]>; Robert Treat <[email protected]>; Pg Hackers <[email protected]>

Mihail Nikalayeu <[email protected]> wrote:

> I was looking into catalog-related logical decoding features, and it
> seems like they are clearly overkill for the repack case.
> We don't need CID tracking or even a snapshot for each commit if we’re
> okay with passing xmin/xmax as arguments.

I assume you are concerned with part 0005 of the v12 patch
("Preserve visibility information of the concurrent data changes."), aren't
you?

> What do you think about the following approach for replaying:
> * use the extracted XID as the value for xmin/xmax.
> * use SnapshotSelf to find the tuple for update/delete operations.
> 
> SnapshotSelf seems like a good fit here:
> * it sees the last "existing" version.
> * any XID set as xmin/xmax in the repacked version is already
> committed - so each update/insert is effectively "committed" once
> written.
> * it works with multiple updates of the same tuple within a single
> transaction - SnapshotSelf sees the last version.
> * all updates are ordered and replayed sequentially - so the last
> version is always the one we want.
> 
> If I'm not missing anything, this looks like something worth including
> in the patch set.
> If so, I can try implementing a test version.

I'm not sure I understand all the details, but I don't think SnapshotSelf is the
correct snapshot. Note that HeapTupleSatisfiesSelf() does not use its
'snapshot' argument at all. Instead, it considers the set of running
transactions as it is at the time the function is called.

One particular problem I imagine is replaying an UPDATE to a row that some
later transaction will eventually delete, but the transaction that ran the
UPDATE obviously had to see it. When looking for the old version during the
replay, HeapTupleSatisfiesMVCC() will find the old version as long as we pass
the correct snapshot to it.

However, at the time we're replaying the UPDATE in the new table, the tuple
may have been already deleted from the old table, and the deleting transaction
may already have committed. In such a case, HeapTupleSatisfiesSelf() will
conclude that the old version is invisible, and we'll fail to replay the UPDATE.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com


* Re: Adding REPACK [concurrently]
@ 2025-08-25 14:15  Mihail Nikalayeu <[email protected]>
  parent: Antonin Houska <[email protected]>
  0 siblings, 2 replies; 106+ messages in thread

From: Mihail Nikalayeu @ 2025-08-25 14:15 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Fujii Masao <[email protected]>; Robert Treat <[email protected]>; Pg Hackers <[email protected]>

Hi, Antonin!

> I assume you are concerned with part 0005 of the v12 patch
> ("Preserve visibility information of the concurrent data changes."), aren't
> you?

Yes, of course. I got an idea while trying to find a way to optimize it.

> I'm not sure I understand all the details, but I don't think SnapshotSelf is the
> correct snapshot. Note that HeapTupleSatisfiesSelf() does not use its
> 'snapshot' argument at all. Instead, it considers the set of running
> transactions as it is at the time the function is called.

Yes, and it is almost the same behavior as when a typical MVCC snapshot
encounters a tuple created by its own transaction.

So, here is how it works in the non-MVCC-safe case (current patch behaviour):

1) we have a whole initial table snapshot with all xmin = repack XID
2) the applying transaction sees ALL the self-alive (no xmax) tuples in it
because all tuples were created/deleted by the transaction itself
3) each update/delete during the replay selects the last existing
tuple version, updates its xmax and inserts a new one
4) so, there is no real MVCC involved - just find the latest
version and create a new version
5) and it works correctly because all ordering issues were resolved by
locking mechanisms on the original table or by the reorder buffer

How it maps to MVCC-safe case (SnapshotSelf):

1) we have a whole initial table snapshot with all xmin copied from
the original table. All such xmin are committed.
2) the applying transaction sees ALL the self-alive (no xmax) tuples in it
because their xmin/xmax are committed and SnapshotSelf is happy with it
3) each update/delete during the replay selects the last existing
tuple version, updates its xmax=original xid and inserts a new one,
keeping xmin=original xid
4) --//--
5) --//--
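
(Illustration only, not code from the patch: the MVCC-safe steps above can be modeled with a toy in-memory table. All Toy* names are hypothetical, the visibility test of step 2 is reduced to "xmax is unset", and real code would of course operate on heap tuples.)

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ToyXid;            /* stand-in for TransactionId */
#define TOY_INVALID_XID ((ToyXid) 0)

/* One tuple version in the toy "new table". */
typedef struct ToyTuple
{
    int     key;
    int     value;
    ToyXid  xmin;   /* XID copied from the original table */
    ToyXid  xmax;   /* TOY_INVALID_XID while the version is live */
} ToyTuple;

#define TOY_TABLE_CAPACITY 16

typedef struct ToyTable
{
    ToyTuple tuples[TOY_TABLE_CAPACITY];
    size_t   ntuples;
} ToyTable;

/* Step 2: the live version of a key is the one with no xmax. */
static ToyTuple *
toy_find_live(ToyTable *tab, int key)
{
    for (size_t i = 0; i < tab->ntuples; i++)
    {
        ToyTuple *tup = &tab->tuples[i];

        if (tup->key == key && tup->xmax == TOY_INVALID_XID)
            return tup;
    }
    return NULL;
}

/*
 * Initial copy or replayed INSERT: the new version keeps the original xmin.
 * Returns 0 on success, -1 if the toy table is full.
 */
static int
toy_insert(ToyTable *tab, int key, int value, ToyXid orig_xid)
{
    if (tab->ntuples >= TOY_TABLE_CAPACITY)
        return -1;
    tab->tuples[tab->ntuples++] =
        (ToyTuple) {key, value, orig_xid, TOY_INVALID_XID};
    return 0;
}

/*
 * Step 3: replay an UPDATE. Stamp the live version with xmax = original
 * XID, then insert the new version with xmin = original XID.
 * Returns -1 if no live version exists or the toy table is full.
 */
static int
toy_replay_update(ToyTable *tab, int key, int new_value, ToyXid orig_xid)
{
    ToyTuple *old = toy_find_live(tab, key);

    assert(orig_xid != TOY_INVALID_XID);
    if (old == NULL || tab->ntuples >= TOY_TABLE_CAPACITY)
        return -1;
    old->xmax = orig_xid;
    return toy_insert(tab, key, new_value, orig_xid);
}
```

In this model, replaying an UPDATE of key 1 by original XID 105 leaves the old version with xmax=105 and adds a version with xmin=105, so the next replayed change for that key finds the new version.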

> However, at the time we're replaying the UPDATE in the new table, the tuple
> may have been already deleted from the old table, and the deleting transaction
> may already have committed. In such a case, HeapTupleSatisfiesSelf() will
> conclude the old version invisible and then we'll fail to replay the UPDATE.

No, it will see it - because its xmax will be empty in the repacked
version of the table.

^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-08-25 15:42  Antonin Houska <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  1 sibling, 1 reply; 106+ messages in thread

From: Antonin Houska @ 2025-08-25 15:42 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Fujii Masao <[email protected]>; Robert Treat <[email protected]>; Pg Hackers <[email protected]>

Mihail Nikalayeu <[email protected]> wrote:

> > Not sure I understand in all details, but I don't think SnapshotSelf is the
> > correct snapshot. Note that HeapTupleSatisfiesSelf() does not use its
> > 'snapshot' argument at all. Instead, it considers the set of running
> > transactions as it is at the time the function is called.
> 
> Yes, and it is almost the same behavior when a typical MVCC snapshot
> encounters a tuple created by its own transaction.
> 
> So, how it works in the non MVCC-safe case (current patch behaviour):
> 
> 1) we have a whole initial table snapshot with all the xmin = repack XID
> 2) the applying transaction sees ALL the self-alive (no xmax) tuples in it
> because all tuples were created/deleted by the transaction itself
> 3) each update/delete during the replay selects the last existing
> tuple version, updates its xmax and inserts a new one
> 4) so, there is no real MVCC involved - just find the latest
> version and create a new version
> 5) and it works correctly because all ordering issues were resolved by
> locking mechanisms on the original table or by reordering buffer

ok

> How it maps to MVCC-safe case (SnapshotSelf):
> 
> 1) we have a whole initial table snapshot with all xmin copied from
> the original table. All such xmin are committed.
> 2) the applying transaction sees ALL the self-alive (no xmax) tuples in it
> because their xmin/xmax are committed and SnapshotSelf is happy with it

How does HeapTupleSatisfiesSelf() recognize the status of any XID w/o using a
snapshot? Do you mean by checking the commit log (TransactionIdDidCommit) ?

> 3) each update/delete during the replay selects the last existing
> tuple version, updates its xmax=original xid and inserts a new one,
> keeping xmin=original xid
> 4) --//--
> 5) --//--

> > However, at the time we're replaying the UPDATE in the new table, the tuple
> > may have been already deleted from the old table, and the deleting transaction
> > may already have committed. In such a case, HeapTupleSatisfiesSelf() will
> > conclude the old version invisible and then we'll fail to replay the UPDATE.
> 
> No, it will see it - because its xmax will be empty in the repacked
> version of the table.

You're right, it'll be empty in the new table.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-08-25 16:03  Robert Treat <[email protected]>
  parent: Álvaro Herrera <[email protected]>
  0 siblings, 0 replies; 106+ messages in thread

From: Robert Treat @ 2025-08-25 16:03 UTC (permalink / raw)
  To: Álvaro Herrera <[email protected]>; +Cc: Michael Banck <[email protected]>; Euler Taveira <[email protected]>; pgsql-hackers <[email protected]>; Antonin Houska <[email protected]>; Fujii Masao <[email protected]>; Mihail Nikalayeu <[email protected]>

On Sat, Aug 23, 2025 at 10:23 AM Álvaro Herrera <[email protected]> wrote:
> On 2025-08-23, Michael Banck wrote:
> > On Fri, Aug 22, 2025 at 05:32:34PM -0300, Euler Taveira wrote:
>
> >> I don't think we need to keep vacuumdb. Packagers can keep a symlink (vacuumdb)
> >> to pg_repackdb. We can add a similar warning message saying they should use
> >> pg_repackdb if the symlink is used.
> >
> > Unless pg_repack has the same (or a superset of) CLI and behaviour as
> > vacuumdb (I haven't checked, but doubt it?), I think replacing vacuumdb
> > with a symlink to pg_repack will lead to much more breakage in existing
> > scripts/automation than clusterdb, which I guess is used orders of
> > magnitude less frequently than vacuumdb.
>
> Yeah, I completely disagree with the idea of getting rid of vacuumdb. We can, maybe, in a distant future, get rid of the --full option to vacuumdb.  But the rest of the vacuumdb behavior must stay, I think, because REPACK is not VACUUM — it is only VACUUM FULL. And we want to make that distinction very clear.
>

Or to put it the other way, VACUUM FULL is not really VACUUM either,
it is really a form of "repack".

> We can also, in a few years, get rid of clusterdb.  But I don't think we need to deprecate it just yet.
>

Yeah, ISTM the long term goal should be two binaries, one of which
manages aspects of clustering/repacking type of activities, and one
which manages vacuum type activities. I don't think that's different
than what Alvaro is proposing. FWIW, my original question was about
confirming that was the end goal, but also trying to understand the
coordination of when these changes would take place, because the
changes to the code, changes to the SQL commands and their docs, and
changes to the command line tools, seem to be working at different
cadences. Which can be fine if it's on purpose, but maybe needs to be
tightened up if not; for example, the current patchset doesn't make
any changes to clusterdb, which one might expect to emit a warning
about being deprecated in favor of pg_repackdb, if not just a complete
punting to use pg_repackdb instead.

Robert Treat
https://xzilla.net

^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-08-25 16:23  Mihail Nikalayeu <[email protected]>
  parent: Antonin Houska <[email protected]>
  0 siblings, 1 reply; 106+ messages in thread

From: Mihail Nikalayeu @ 2025-08-25 16:23 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Fujii Masao <[email protected]>; Robert Treat <[email protected]>; Pg Hackers <[email protected]>

Hi, Antonin

> How does HeapTupleSatisfiesSelf() recognize the status of any XID w/o using a
> snapshot? Do you mean by checking the commit log (TransactionIdDidCommit) ?

Yes, TransactionIdDidCommit. Another option is to just invent a new
snapshot type - SnapshotBelieveEverythingCommitted - for that
particular case it should work - because all xmin/xmax written into
the new table are committed by design.

^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-08-25 16:36  Robert Treat <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  1 sibling, 1 reply; 106+ messages in thread

From: Robert Treat @ 2025-08-25 16:36 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Antonin Houska <[email protected]>; Alvaro Herrera <[email protected]>; Fujii Masao <[email protected]>; Pg Hackers <[email protected]>

On Mon, Aug 25, 2025 at 10:15 AM Mihail Nikalayeu
<[email protected]> wrote:
> 1) we have a whole initial table snapshot with all xmin copied from
> the original table. All such xmin are committed.
> 2) the applying transaction sees ALL the self-alive (no xmax) tuples in it
> because their xmin/xmax are committed and SnapshotSelf is happy with it
> 3) each update/delete during the replay selects the last existing
> tuple version, updates its xmax=original xid and inserts a new one,
> keeping xmin=original xid
> 4) --//--
> 5) --//--
>

Advancing the table's min xid to at least repack XID is a pretty big
feature, but the above scenario sounds like it would result in any
non-modified pre-existing tuples ending up with their original xmin
rather than repack XID, which seems like it could lead to weird
side-effects. Maybe I am mis-thinking it though?

Robert Treat
https://xzilla.net

^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-08-25 16:54  Antonin Houska <[email protected]>
  parent: Robert Treat <[email protected]>
  0 siblings, 0 replies; 106+ messages in thread

From: Antonin Houska @ 2025-08-25 16:54 UTC (permalink / raw)
  To: Robert Treat <[email protected]>; +Cc: Mihail Nikalayeu <[email protected]>; Alvaro Herrera <[email protected]>; Fujii Masao <[email protected]>; Pg Hackers <[email protected]>

Robert Treat <[email protected]> wrote:

> On Mon, Aug 25, 2025 at 10:15 AM Mihail Nikalayeu
> <[email protected]> wrote:
> > 1) we have a whole initial table snapshot with all xmin copied from
> > the original table. All such xmin are committed.
> > 2) the applying transaction sees ALL the self-alive (no xmax) tuples in it
> > because their xmin/xmax are committed and SnapshotSelf is happy with it
> > 3) each update/delete during the replay selects the last existing
> > tuple version, updates its xmax=original xid and inserts a new one,
> > keeping xmin=original xid
> > 4) --//--
> > 5) --//--
> >
> 
> Advancing the table's min xid to at least repack XID is a pretty big
> feature, but the above scenario sounds like it would result in any
> non-modified pre-existing tuples ending up with their original xmin
> rather than repack XID, which seems like it could lead to weird
> side-effects. Maybe I am mis-thinking it though?

What we are discussing here is how to keep visibility information of tuples (xmin,
xmax, ...) unchanged. Both CLUSTER and VACUUM FULL already do that. However
it's not trivial to ensure that REPACK with the CONCURRENTLY option does as
well.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-08-25 17:22  Antonin Houska <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  0 siblings, 1 reply; 106+ messages in thread

From: Antonin Houska @ 2025-08-25 17:22 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Fujii Masao <[email protected]>; Robert Treat <[email protected]>; Pg Hackers <[email protected]>

Mihail Nikalayeu <[email protected]> wrote:

> Hi, Antonin
> 
> > How does HeapTupleSatisfiesSelf() recognize the status of any XID w/o using a
> > snapshot? Do you mean by checking the commit log (TransactionIdDidCommit) ?
> 
> Yes, TransactionIdDidCommit.

I think the problem is that HeapTupleSatisfiesSelf() uses
TransactionIdIsInProgress() instead of checking the snapshot:

        ...
        else if (TransactionIdIsInProgress(HeapTupleHeaderGetRawXmin(tuple)))
                return false;
        else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple)))
        ...

When decoding (and replaying) data changes, you deal with the database state
as it was (far) in the past. However TransactionIdIsInProgress() is not
suitable for this purpose.

And since CommitTransaction() updates the commit log before removing the
transaction from ProcArray, I can even imagine race conditions: if a
transaction is committed and decoded fast enough, TransactionIdIsInProgress()
might still return true. In such a case, HeapTupleSatisfiesSelf() returns
false instead of calling TransactionIdDidCommit().
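
To make that window concrete, here is a small standalone model (a sketch only, not PostgreSQL code; the Toy* names and the two booleans are assumptions standing in for the proc array and the commit log). It keeps the same order of checks as the fragment above.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy transaction state. Both flags are true in the window between the
 * commit-log update and the removal from the proc array.
 */
typedef struct ToyXact
{
    bool in_proc_array;     /* what an "is in progress" check sees */
    bool clog_committed;    /* what a "did commit" check sees */
} ToyXact;

/* Commit, first half: record the commit in the toy commit log. */
static void
toy_commit_record_in_clog(ToyXact *xact)
{
    assert(xact->in_proc_array);
    xact->clog_committed = true;
}

/* Commit, second half: remove the transaction from the toy proc array. */
static void
toy_commit_leave_proc_array(ToyXact *xact)
{
    xact->in_proc_array = false;
}

/*
 * Visibility of a tuple's xmin, with the "in progress" test placed before
 * the "did commit" test, as in the quoted fragment.
 */
static bool
toy_xmin_visible_to_self(const ToyXact *xmin_xact)
{
    if (xmin_xact->in_proc_array)
        return false;           /* treated as still running */
    else if (xmin_xact->clog_committed)
        return true;
    return false;               /* neither running nor committed */
}
```

Between the two halves of the commit the check still answers "not visible" even though the commit log already says "committed", which is the case a fast decoder could hit.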

> Another option is to just invent a new
> snapshot type - SnapshotBelieveEverythingCommitted - for that
> particular case it should work - because all xmin/xmax written into
> the new table are committed by design.

I'd prefer optimization of the logical decoding for REPACK CONCURRENTLY, and
using the MVCC snapshots.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-08-25 18:18  Mihail Nikalayeu <[email protected]>
  parent: Antonin Houska <[email protected]>
  0 siblings, 1 reply; 106+ messages in thread

From: Mihail Nikalayeu @ 2025-08-25 18:18 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Fujii Masao <[email protected]>; Robert Treat <[email protected]>; Pg Hackers <[email protected]>

Antonin Houska <[email protected]>:
> I think the problem is that HeapTupleSatisfiesSelf() uses
> TransactionIdIsInProgress() instead of checking the snapshot:

Yes, some issues might be possible for SnapshotSelf.
A possible solution is to override TransactionIdIsCurrentTransactionId
to true (like you did with nParallelCurrentXids but just return true).
IIUC, in that case all checks are going to behave the same way as in v5 version.

> I'd prefer optimization of the logical decoding for REPACK CONCURRENTLY, and
> using the MVCC snapshots.

It is also possible, but it is much more complex and feels like overkill to me.
We just need a way to find the latest version of a row in the world of
all-committed transactions without any concurrent writers - I am
pretty sure it is possible to achieve in a simpler and more effective
way.

^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-08-26 08:46  Antonin Houska <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  0 siblings, 1 reply; 106+ messages in thread

From: Antonin Houska @ 2025-08-26 08:46 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Fujii Masao <[email protected]>; Robert Treat <[email protected]>; Pg Hackers <[email protected]>

Mihail Nikalayeu <[email protected]> wrote:

> Antonin Houska <[email protected]>:
> > I think the problem is that HeapTupleSatisfiesSelf() uses
> > TransactionIdIsInProgress() instead of checking the snapshot:
> 
> Yes, some issues might be possible for SnapshotSelf.
> A possible solution is to override TransactionIdIsCurrentTransactionId
> to true (like you did with nParallelCurrentXids but just return true).
> IIUC, in that case all checks are going to behave the same way as in v5 version.

I assume you mean v12-0005. Yes, that modifies
TransactionIdIsCurrentTransactionId(), so that the transaction being
replayed recognizes if it (or its subtransaction) performed particular change
itself.

Although it could work, I think it'd be confusing to consider the transactions
being replayed as "current" from the point of view of the backend that
executes REPACK CONCURRENTLY.

But the primary issue is that in v12-0005,
TransactionIdIsCurrentTransactionId() gets the information on "current
transactions" from snapshots - see the calls of SetRepackCurrentXids() before
each scan. It's probably not what you want.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-08-26 09:02  Mihail Nikalayeu <[email protected]>
  parent: Antonin Houska <[email protected]>
  0 siblings, 1 reply; 106+ messages in thread

From: Mihail Nikalayeu @ 2025-08-26 09:02 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Fujii Masao <[email protected]>; Robert Treat <[email protected]>; Pg Hackers <[email protected]>

Antonin Houska <[email protected]>:

> Although it could work, I think it'd be confusing to consider the transactions
> being replayed as "current" from the point of view of the backend that
> executes REPACK CONCURRENTLY.

Just realized SnapshotDirty is the thing that fits into the role - it
respects not-yet committed transactions, giving enough information to
wait for them.
It is already used in a similar pattern in
check_exclusion_or_unique_constraint and RelationFindReplTupleByIndex.

So, it is easy to detect the case of the race you described previously
and retry + there is no sense to hack around
TransactionIdIsCurrentTransactionId.

BTW, btree + SnapshotDirty has issue [0], but it is a different story
and happens only with concurrent updates which are not present in the
current scope.

[0]: https://www.postgresql.org/message-id/flat/CADzfLwXGhH_qD6RGqPyEeKdmHgr-HpA-tASYdi5onP%2BRyP5TCw%40m...

^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-08-26 13:31  Antonin Houska <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  0 siblings, 1 reply; 106+ messages in thread

From: Antonin Houska @ 2025-08-26 13:31 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Fujii Masao <[email protected]>; Robert Treat <[email protected]>; Pg Hackers <[email protected]>

Mihail Nikalayeu <[email protected]> wrote:

> Antonin Houska <[email protected]>:
> 
> > Although it could work, I think it'd be confusing to consider the transactions
> > being replayed as "current" from the point of view of the backend that
> > executes REPACK CONCURRENTLY.
> 
> Just realized SnapshotDirty is the thing that fits into the role - it
> respects not-yet committed transactions, giving enough information to
> wait for them.
> It is already used in a similar pattern in
> check_exclusion_or_unique_constraint and RelationFindReplTupleByIndex.
> 
> So, it is easy to detect the case of the race you described previously
> and retry + there is no sense to hack around
> TransactionIdIsCurrentTransactionId.

Where exactly should HeapTupleSatisfiesDirty() conclude that the tuple is
visible? TransactionIdIsCurrentTransactionId() will not do w/o the
modifications that you proposed earlier [1] and TransactionIdIsInProgress() is
not suitable as I explained in [2].

I understand your idea that there are no transaction aborts in the new table,
which makes things simpler. I cannot judge if it's worth inventing a new kind
of snapshot. Anyway, I think you'd then also need to hack
HeapTupleSatisfiesUpdate(). Isn't that too invasive?

[1] https://www.postgresql.org/message-id/CADzfLwUqyOmpkLmciecBy4aBN1sohQVZ2Hgc6m-tjSUqDRHwyQ%40mail.gma...
[2] https://www.postgresql.org/message-id/24483.1756142534%40localhost

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-08-27 00:38  Mihail Nikalayeu <[email protected]>
  parent: Antonin Houska <[email protected]>
  0 siblings, 1 reply; 106+ messages in thread

From: Mihail Nikalayeu @ 2025-08-27 00:38 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Fujii Masao <[email protected]>; Robert Treat <[email protected]>; Pg Hackers <[email protected]>

Hello, Antonin!

Antonin Houska <[email protected]>:
>
> Where exactly should HeapTupleSatisfiesDirty() conclude that the tuple is
> visible? TransactionIdIsCurrentTransactionId() will not do w/o the
> modifications that you proposed earlier [1] and TransactionIdIsInProgress() is
> not suitable as I explained in [2].

HeapTupleSatisfiesDirty is designed to respect even not-yet-committed
transactions and provides additional related information.

    else if (TransactionIdIsInProgress(HeapTupleHeaderGetRawXmin(tuple)))
    {
       /*
        * Return the speculative token to caller.  Caller can worry about
        * xmax, since it requires a conclusively locked row version, and
        * a concurrent update to this tuple is a conflict of its
        * purposes.
        */
       if (HeapTupleHeaderIsSpeculative(tuple))
       {
          snapshot->speculativeToken =
             HeapTupleHeaderGetSpeculativeToken(tuple);

          Assert(snapshot->speculativeToken != 0);
       }

       snapshot->xmin = HeapTupleHeaderGetRawXmin(tuple);
       /* XXX shouldn't we fall through to look at xmax? */
       return true;      /* in insertion by other */
    }

So, it returns true when TransactionIdIsInProgress is true.
However, that alone is not sufficient to trust the result in the common case.

You may check check_exclusion_or_unique_constraint or
RelationFindReplTupleByIndex for that pattern:
if xmin is set in the snapshot (a special hack in SnapshotDirty to
provide additional information from the check), we wait for the
ongoing transaction (or one that is actually committed but not yet
properly reflected in the proc array), and then retry the entire tuple
search.

So, the race condition you explained in [2] will be resolved by a
retry, and the changes to TransactionIdIsInProgress described in [1]
are not necessary.
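
A minimal standalone sketch of that wait-and-retry pattern, under stated assumptions: ToyXact, toy_satisfies_dirty() and toy_wait_for_xact() are hypothetical stand-ins (the last one for XactLockTableWait()), and the model has a single tuple and a single inserting transaction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t ToyXid;            /* stand-in for TransactionId */
#define TOY_INVALID_XID ((ToyXid) 0)

/* Toy state of the transaction that inserted the tuple. */
typedef struct ToyXact
{
    ToyXid xid;
    bool   in_proc_array;   /* still listed as running */
    bool   clog_committed;  /* commit already recorded */
} ToyXact;

/* Toy dirty snapshot: xmin is an output field set by the check. */
typedef struct ToyDirtySnapshot
{
    ToyXid xmin;
} ToyDirtySnapshot;

/*
 * Dirty-style check: a tuple whose inserter is still listed as running is
 * reported as visible, and the inserter's XID is handed back in snap->xmin
 * so that the caller can wait for it.
 */
static bool
toy_satisfies_dirty(const ToyXact *inserter, ToyDirtySnapshot *snap)
{
    snap->xmin = TOY_INVALID_XID;
    if (inserter->in_proc_array)
    {
        snap->xmin = inserter->xid;
        return true;
    }
    return inserter->clog_committed;
}

/* Stand-in for waiting on the transaction: on return it is no longer running. */
static void
toy_wait_for_xact(ToyXact *xact)
{
    xact->in_proc_array = false;
}

/*
 * Look the tuple up, waiting and retrying while the check reports an
 * in-progress inserter. Returns the number of attempts; *found tells
 * whether the tuple was finally visible.
 */
static int
toy_find_tuple_with_retry(ToyXact *inserter, bool *found)
{
    int attempts = 0;

    for (;;)
    {
        ToyDirtySnapshot snap;
        bool visible = toy_satisfies_dirty(inserter, &snap);

        attempts++;
        if (visible && snap.xmin != TOY_INVALID_XID)
        {
            toy_wait_for_xact(inserter);
            continue;           /* retry the whole search */
        }
        *found = visible;
        return attempts;
    }
}
```

For a transaction that is committed in the commit log but still listed as running, the first attempt reports snap.xmin, the caller waits, and the second attempt sees a plainly committed tuple.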

> I understand your idea that there are no transaction aborts in the new table,
> which makes things simpler. I cannot judge if it's worth inventing a new kind
> of snapshot. Anyway, I think you'd then also need to hack
> HeapTupleSatisfiesUpdate(). Isn't that too invasive?

It seems that HeapTupleSatisfiesUpdate is also fine as it currently
exists (we'll see the committed version after retry).

The solution appears to be non-invasive:
* uses the existing snapshot type
* follows the existing usage pattern
* leaves TransactionIdIsInProgress and HeapTupleSatisfiesUpdate unchanged

The main change is that xmin/xmax values are forced from the arguments
- but that seems unavoidable in any case.

I'll try to make some kind of prototype this weekend + cover the race
condition you mentioned in specs.
Maybe some corner cases will appear.

By the way, there's one more optimization we could apply in both
MVCC-safe and non-MVCC-safe cases: setting the HEAP_XMIN_COMMITTED /
HEAP_XMAX_COMMITTED bit in the new table:
* in the MVCC-safe approach, the transaction is already committed.
* in the non-MVCC-safe case, it isn’t committed yet - but no one will
examine that bit before it commits (though this approach does feel
more fragile).

This could help avoid potential storms of full-page writes caused by
SetHintBit after the table switch.
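
A toy sketch of the idea, for illustration only: the Toy* names are hypothetical and the flag values are arbitrary for this model, not taken from the PostgreSQL headers. The writer sets the "committed" hints while filling the new table, so a later reader finds them already in place and has no hint to set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t ToyXid;            /* stand-in for TransactionId */
#define TOY_INVALID_XID ((ToyXid) 0)

/* Toy infomask bits; the values are arbitrary for this model. */
#define TOY_XMIN_COMMITTED  0x01
#define TOY_XMAX_COMMITTED  0x02

typedef struct ToyTupleHeader
{
    ToyXid   xmin;
    ToyXid   xmax;
    uint16_t infomask;
} ToyTupleHeader;

/*
 * Called while writing a tuple into the new table: every XID stored there
 * is assumed to be committed, so the hints can be set right away.
 */
static void
toy_set_commit_hints(ToyTupleHeader *tup)
{
    assert(tup->xmin != TOY_INVALID_XID);
    tup->infomask |= TOY_XMIN_COMMITTED;
    if (tup->xmax != TOY_INVALID_XID)
        tup->infomask |= TOY_XMAX_COMMITTED;
}

/*
 * A later reader has to set a hint itself (and dirty the page) only when
 * a relevant bit is still missing.
 */
static bool
toy_reader_must_set_hint(const ToyTupleHeader *tup)
{
    if ((tup->infomask & TOY_XMIN_COMMITTED) == 0)
        return true;
    if (tup->xmax != TOY_INVALID_XID &&
        (tup->infomask & TOY_XMAX_COMMITTED) == 0)
        return true;
    return false;
}
```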

Best regards,
Mikhail.

^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-08-27 06:16  Antonin Houska <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  0 siblings, 1 reply; 106+ messages in thread

From: Antonin Houska @ 2025-08-27 06:16 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Fujii Masao <[email protected]>; Robert Treat <[email protected]>; Pg Hackers <[email protected]>

Mihail Nikalayeu <[email protected]> wrote:

> Hello, Antonin!
> 
> Antonin Houska <[email protected]>:
> >
> > Where exactly should HeapTupleSatisfiesDirty() conclude that the tuple is
> > visible? TransactionIdIsCurrentTransactionId() will not do w/o the
> > modifications that you proposed earlier [1] and TransactionIdIsInProgress() is
> > not suitable as I explained in [2].
> 
> HeapTupleSatisfiesDirty is designed to respect even not-yet-committed
> transactions and provides additional related information.
> 
>     else if (TransactionIdIsInProgress(HeapTupleHeaderGetRawXmin(tuple)))
>     {
>        /*
>         * Return the speculative token to caller.  Caller can worry about
>         * xmax, since it requires a conclusively locked row version, and
>         * a concurrent update to this tuple is a conflict of its
>         * purposes.
>         */
>        if (HeapTupleHeaderIsSpeculative(tuple))
>        {
>           snapshot->speculativeToken =
>              HeapTupleHeaderGetSpeculativeToken(tuple);
> 
>           Assert(snapshot->speculativeToken != 0);
>        }
> 
>        snapshot->xmin = HeapTupleHeaderGetRawXmin(tuple);
>        /* XXX shouldn't we fall through to look at xmax? */
>        return true;      /* in insertion by other */
>     }
> 
> So, it returns true when TransactionIdIsInProgress is true.
> However, that alone is not sufficient to trust the result in the common case.
> 
> You may check check_exclusion_or_unique_constraint or
> RelationFindReplTupleByIndex for that pattern:
> if xmin is set in the snapshot (a special hack in SnapshotDirty to
> provide additional information from the check), we wait for the
> ongoing transaction (or one that is actually committed but not yet
> properly reflected in the proc array), and then retry the entire tuple
> search.
> 
> So, the race condition you explained in [2] will be resolved by a
> retry, and the changes to TransactionIdIsInProgress described in [1]
> are not necessary.

I insist that this is a misuse of TransactionIdIsInProgress(). When dealing
with logical decoding, only WAL should tell whether particular transaction is
still running. AFAICS this is how reorderbuffer.c works.

A new kind of snapshot seems like a (much) cleaner solution at the moment.

> I'll try to make some kind of prototype this weekend + cover the race
> condition you mentioned in specs.
> Maybe some corner cases will appear.

No rush. First, the MVCC safety is not likely to be included in v19
[1]. Second, I think it's good to let others propose their ideas before
writing code.

> By the way, there's one more optimization we could apply in both
> MVCC-safe and non-MVCC-safe cases: setting the HEAP_XMIN_COMMITTED /
> HEAP_XMAX_COMMITTED bit in the new table:
> * in the MVCC-safe approach, the transaction is already committed.
> * in the non-MVCC-safe case, it isn’t committed yet - but no one will
> examine that bit before it commits (though this approach does feel
> more fragile).
> 
> This could help avoid potential storms of full-page writes caused by
> SetHintBit after the table switch.

Good idea, thanks.

[1] https://www.postgresql.org/message-id/202504040733.ysuy5gad55md%40alvherre.pgsql

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-08-27 08:22  Mihail Nikalayeu <[email protected]>
  parent: Antonin Houska <[email protected]>
  0 siblings, 2 replies; 106+ messages in thread

From: Mihail Nikalayeu @ 2025-08-27 08:22 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Fujii Masao <[email protected]>; Robert Treat <[email protected]>; Pg Hackers <[email protected]>

Antonin Houska <[email protected]>:

> I insist that this is a misuse of TransactionIdIsInProgress(). When dealing
> with logical decoding, only WAL should tell whether particular transaction is
> still running. AFAICS this is how reorderbuffer.c works.

Hm... Maybe, but at the same time we already have SnapshotDirty used
in that way and it even deals with the same race....
But I agree - a special kind of snapshot is a more accurate solution.

> A new kind of snapshot seems like a (much) cleaner solution at the moment.

Do you mean some kind of snapshot which only uses
TransactionIdDidCommit/Abort ignoring
TransactionIdIsCurrentTransactionId/TransactionIdIsInProgress?
Actually it behaves like SnapshotBelieveEverythingCommitted in that
particular case, but TransactionIdDidCommit/Abort may be used as some
kind of assert/error source to be sure everything is going as
designed.
And, yes, for the new snapshot we need HeapTupleSatisfiesUpdate
to be modified.

Also, to deal with that particular race we may just use
XactLockTableWait(xid, NULL, NULL, XLTW_None) before starting
transaction replay.

> No rush. First, the MVCC safety is not likely to be included in v19 [1].

That worries me - it is not the behaviour someone expects from a
database by default. At least the warning should be much more visible
and obvious.
I think most users will expect the same guarantees as [CREATE|RE]
INDEX CONCURRENTLY provides.

^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-08-27 10:11  Antonin Houska <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  1 sibling, 1 reply; 106+ messages in thread

From: Antonin Houska @ 2025-08-27 10:11 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Fujii Masao <[email protected]>; Robert Treat <[email protected]>; Pg Hackers <[email protected]>

Mihail Nikalayeu <[email protected]> wrote:

> > A new kind of snapshot seems like a (much) cleaner solution at the moment.
> 
> Do you mean some kind of snapshot which only uses
> TransactionIdDidCommit/Abort ignoring
> TransactionIdIsCurrentTransactionId/TransactionIdIsInProgress?
> Actually it behaves like SnapshotBelieveEverythingCommitted in that
> particular case, but TransactionIdDidCommit/Abort may be used as some
> kind of assert/error source to be sure everything is going as
> designed.

Given that there should be no (sub)transaction aborts in the new table, I
think you only need to check that the tuple has valid xmin and invalid xmax.

I think the XID should be in the commit log at the time the transaction is
being replayed, so it should be legal to use TransactionIdDidCommit/Abort in
Assert() statements. (And as long as REPACK CONCURRENTLY will use
ShareUpdateExclusiveLock, which conflicts with VACUUM, pg_class(relfrozenxid)
for given table should not advance during the processing, and therefore the
replayed XIDs should not be truncated from the commit log while REPACK
CONCURRENTLY is running.)
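
A sketch of what such a check could look like, purely as an illustration: ToyTupleHeader, toy_did_commit() and toy_satisfies_all_committed() are hypothetical names, and the commit-log lookup is only used as a sanity check inside assert().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t ToyXid;            /* stand-in for TransactionId */
#define TOY_INVALID_XID ((ToyXid) 0)

typedef struct ToyTupleHeader
{
    ToyXid xmin;
    ToyXid xmax;
} ToyTupleHeader;

/*
 * Stand-in for a commit-log lookup. In this model every valid XID that
 * reached the new table counts as committed.
 */
static bool
toy_did_commit(ToyXid xid)
{
    return xid != TOY_INVALID_XID;
}

/*
 * Visibility in a table that contains no aborted (sub)transactions:
 * a tuple is visible if it has a valid xmin and no xmax. The commit log
 * is consulted only to verify that the design assumption holds.
 */
static bool
toy_satisfies_all_committed(const ToyTupleHeader *tup)
{
    if (tup->xmin == TOY_INVALID_XID)
        return false;
    assert(toy_did_commit(tup->xmin));

    if (tup->xmax != TOY_INVALID_XID)
    {
        assert(toy_did_commit(tup->xmax));
        return false;           /* deleted or superseded version */
    }
    return true;
}
```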

> And, yes, for the new snapshot we need HeapTupleSatisfiesUpdate
> to be modified.
> 
> Also, to deal with that particular race we may just use
> XactLockTableWait(xid, NULL, NULL, XLTW_None) before starting
> transaction replay.

Do you mean the race related to TransactionIdIsInProgress()? Not sure I
understand, as you suggested above that you no longer need the function.

> > No rush. First, the MVCC safety is not likely to be included in v19 [1].
> 
> That worries me - it is not the behaviour someone expects from a
> database by default. At least the warning should be much more visible
> and obvious.
> I think most users will expect the same guarantees as [CREATE|RE]
> INDEX CONCURRENTLY provides.

It does not really worry me. The pg_squeeze extension is not MVCC-safe and I
remember there were only 1 or 2 related complaints throughout its
existence. (pg_repack isn't MVCC-safe either, but I don't keep track of its
issues.)

Of course, user documentation should warn about the problem, in a way it does
for other commands (typically ALTER TABLE).

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-08-27 10:55  Mihail Nikalayeu <[email protected]>
  parent: Antonin Houska <[email protected]>
  0 siblings, 0 replies; 106+ messages in thread

From: Mihail Nikalayeu @ 2025-08-27 10:55 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Fujii Masao <[email protected]>; Robert Treat <[email protected]>; Pg Hackers <[email protected]>

Antonin Houska <[email protected]>:

> Do you mean the race related to TransactionIdIsInProgress()? Not sure I
> understand, as you suggested above that you no longer need the function.

The "lightweight" approaches I see so far:
* XactLockTableWait before replay + SnapshotSelf(GetLatestSnapshot?)
* SnapshotDirty + retry logic
* SnapshotBelieveEverythingCommitted + modification of
HeapTupleSatisfiesUpdate (because it is called by heap_update and looks
into TransactionIdIsInProgress)

> It does not really worry me. The pg_squeeze extension is not MVCC-safe and I
> remember there were only 1 or 2 related complaints throughout its
> existence. (pg_repack isn't MVCC-safe either, but I don't keep track of its
> issues.)

But pg_squeeze and pg_repack are extensions. If we are moving those
mechanics into core, I'd expect some improvements over pg_squeeze.
MVCC-safety of REINDEX CONCURRENTLY makes it possible to run it on a
regular basis as some kind of background job. It would be nice to have
something like this for the heap.

I agree the initial approach is too invasive, complex and
performance-heavy to push forward now.
But any of the "lightweight" approaches feels like a good candidate to
be shipped with the feature itself: relatively easy and non-invasive.

Best regards,
Mikhail.

^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-08-28 21:39  Alvaro Herrera <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  1 sibling, 1 reply; 106+ messages in thread

From: Alvaro Herrera @ 2025-08-28 21:39 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Fujii Masao <[email protected]>; Robert Treat <[email protected]>; Pg Hackers <[email protected]>; Antonin Houska <[email protected]>

On 2025-Aug-21, Mihail Nikalayeu wrote:

> Alvaro Herrera <[email protected]>:

> > I proposed to leave this part out initially, which is why it hasn't been
> > reposted.  We can review and discuss after the initial patches are in.
> 
> I think it is worth pushing it at least in the same release cycle.

If others are motivated enough to certify it, maybe we can consider it.
But I don't think I'm going to have time to get this part reviewed and
committed in time for 19, so you might need another committer.

> > Because having an MVCC-safe mode has drawbacks, IMO we should make it
> > optional.
>
> Do you mean some option for the command? Like REPACK (CONCURRENTLY, SAFE)?

Yes, exactly that.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Greatness is a transitory experience.  It is never consistent.
It depends in large part on the myth-making imagination of humankind"
(Irulan)





^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-08-29 00:32  Mihail Nikalayeu <[email protected]>
  parent: Alvaro Herrera <[email protected]>
  0 siblings, 1 reply; 106+ messages in thread

From: Mihail Nikalayeu @ 2025-08-29 00:32 UTC (permalink / raw)
  To: Alvaro Herrera <[email protected]>; +Cc: Fujii Masao <[email protected]>; Robert Treat <[email protected]>; Pg Hackers <[email protected]>; Antonin Houska <[email protected]>

Hello, Álvaro!

> If others are motivated enough to certify it, maybe we can consider it.
> But I don't think I'm going to have time to get this part reviewed and
> committed in time for 19, so you might need another committer.

I don't think it is realistic to involve another committer - it is
just a well-known curse of all non-committers :)

> > > Because having an MVCC-safe mode has drawbacks, IMO we should make it
> > > optional.

As far as I can see, the proposed "lightweight" solutions don't
introduce any drawbacks - unless something has been overlooked.

> > Do you mean some option for the command? Like REPACK (CONCURRENTLY, SAFE)?
> Yes, exactly that.

To be honest, that approach feels a little bit strange to me. I work
in the database-consumer (not database-developer) industry and 90% of
DevOps engineers (or similar roles who deal with database maintenance
now) have no clue what MVCC is - and it is industry standard nowadays.



^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-08-29 07:41  Antonin Houska <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  0 siblings, 0 replies; 106+ messages in thread

From: Antonin Houska @ 2025-08-29 07:41 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Fujii Masao <[email protected]>; Robert Treat <[email protected]>; Pg Hackers <[email protected]>

Mihail Nikalayeu <[email protected]> wrote:

> In case of some incident related to that (in a large well-known
> company) the typical takeaway for readers of tech blogs will simply be
> "some command in Postgres is broken".

For _responsible_ users, the message will rather be that "some tech bloggers
do not bother to read user documentation".

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com





^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-08-30 17:50  Alvaro Herrera <[email protected]>
  parent: Alvaro Herrera <[email protected]>
  6 siblings, 2 replies; 106+ messages in thread

From: Alvaro Herrera @ 2025-08-30 17:50 UTC (permalink / raw)
  To: Pg Hackers <[email protected]>; Antonin Houska <[email protected]>; +Cc: Robert Treat <[email protected]>; Fujii Masao <[email protected]>; Mihail Nikalayeu <[email protected]>

Hello,

Here's v19 of this patchset.  This is mostly Antonin's v18.  I added a
preparatory v19-0001 commit, which splits vacuumdb.c to create a new
file, vacuuming.c (and its header file vacuuming.h).  If you look at it
under 'git show --color-moved=zebra' you should notice that most of it
is just code movement; there are hardly any code changes.

v19-0002 has absorbed Antonin's v18-0005 (the pg_repackdb binary)
together with the introduction of the REPACK command proper; but instead
of using a symlink, I just created a separate pg_repackdb.c source file
for it and we compile that small new source file with vacuuming.c to
create a regular binary.  BTW the meson.build changes look somewhat
duplicative; maybe there's a less dumb way to go about this.  (For
instance, maybe just have libscripts.a include vacuuming.o, though it's
not used by any of the other programs in that subdir.)

I'm not wedded to the name "vacuuming.c"; happy to take suggestions.

After 0002, the pg_repackdb utility should be ready to take clusterdb's
place, and also vacuumdb --full, with one gotcha: if you try to use
pg_repackdb with an older server version, it will fail, claiming that
REPACK is not supported.  This is not ideal.  Instead, we should make it
run VACUUM FULL (or CLUSTER); so if you have a fleet including older
servers you can use the new utils there too.

All the logic for vacuumdb to select tables to operate on has been moved
to vacuuming.c verbatim.  This means this logic applies to pg_repackdb
as well.  As long as you stick to repacking a single table this is okay
(read: it won't be used at all), but if you want to use parallel mode
(say to process multiple schemas), we might need to change it.  For the
same reason, I think we should add an option to it (--index[=indexname])
to select whether to use the USING INDEX clause or not, and optionally
indicate which index to use; right now there's no way to select which
logic (cluster's or vacuum full's) to use.
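
The fallback for older servers and the proposed --index[=indexname]
switch could be sketched roughly as below.  This is only an illustration,
not code from the patch: the function name build_rewrite_command and the
version cutoff FIRST_REPACK_VERSION (190000) are assumptions made for the
example, and table and index names are assumed to be already quoted by
the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* ASSUMPTION: first server version number that understands REPACK. */
#define FIRST_REPACK_VERSION 190000

/*
 * Hypothetical helper: write the SQL command for one table into buf.
 *
 * use_index = false              -> plain rewrite (VACUUM FULL behaviour)
 * use_index = true, index = NULL -> cluster on the index already marked
 *                                   for the table
 * use_index = true, index given  -> cluster on the named index
 *
 * Returns whatever snprintf() returns.
 */
int
build_rewrite_command(int server_version, const char *table,
					  bool use_index, const char *index,
					  char *buf, size_t buflen)
{
	assert(table != NULL && strlen(table) > 0);
	assert(buf != NULL && buflen > 0);

	if (server_version >= FIRST_REPACK_VERSION)
	{
		/* New servers: everything is spelled REPACK. */
		if (!use_index)
			return snprintf(buf, buflen, "REPACK %s", table);
		if (index == NULL)
			return snprintf(buf, buflen, "REPACK %s USING INDEX", table);
		return snprintf(buf, buflen, "REPACK %s USING INDEX %s",
						table, index);
	}

	/* Older servers: fall back to the commands REPACK absorbs. */
	if (!use_index)
		return snprintf(buf, buflen, "VACUUM FULL %s", table);
	if (index == NULL)
		return snprintf(buf, buflen, "CLUSTER %s", table);
	return snprintf(buf, buflen, "CLUSTER %s USING %s", table, index);
}
```

With such a helper, the choice between "cluster's logic" and "vacuum
full's logic" depends only on whether the index switch was given, and the
same invocation would work against a mixed fleet.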

Then v19-0003 through v19-0005 are Antonin's subsequent patches to add
the CONCURRENTLY option; I have not reviewed these at all, so I'm
including them here just for completeness.  I also included v18-0006 as
posted by Mihail previously, though I have little faith that we're going
to include it in this release.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Thinking that the spectre we see is illusory does not strip it of its
horror; it only adds the new terror of madness" (Perelandra, C.S. Lewis)

^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-08-31 12:09  Alvaro Herrera <[email protected]>
  parent: Alvaro Herrera <[email protected]>
  1 sibling, 2 replies; 106+ messages in thread

From: Alvaro Herrera @ 2025-08-31 12:09 UTC (permalink / raw)
  To: Pg Hackers <[email protected]>; Antonin Houska <[email protected]>; +Cc: Robert Treat <[email protected]>; Fujii Masao <[email protected]>; Mihail Nikalayeu <[email protected]>

Apparently I mismerged src/bin/scripts/meson.build.  This v20 is
identical to v19, except that the mistake has been corrected.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
In the beginning there was UNIX, and UNIX spoke and said: "Hello world\n".
It did not say "Hello New Jersey\n", nor "Hello USA\n".


^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-08-31 15:29  Mihail Nikalayeu <[email protected]>
  parent: Alvaro Herrera <[email protected]>
  1 sibling, 2 replies; 106+ messages in thread

From: Mihail Nikalayeu @ 2025-08-31 15:29 UTC (permalink / raw)
  To: Alvaro Herrera <[email protected]>; +Cc: Pg Hackers <[email protected]>; Antonin Houska <[email protected]>; Robert Treat <[email protected]>; Fujii Masao <[email protected]>

Hello!

I started an attempt to make a "lightweight" MVCC-safe prototype and
got stuck on an "it is not working" issue.
After some debugging I realized Antonin's variant (catalog-mode based)
seems to be broken as well...

And after a few more hours I realized non-MVCC is broken as well :)

This is a patch with a test to reproduce the issue related to repack +
concurrent modifications.
Seems like some updates may be lost.

I hope the patch logic is clear - but feel free to ask if not.

Best regards,
Mikhail.


Attachments:

  [application/octet-stream] v22-0002-Add-stress-tests-for-concurrent-index-builds.patch (9.1K, 2-v22-0002-Add-stress-tests-for-concurrent-index-builds.patch)
  download | inline diff:
From c7424f44a086433d2eff6153476e0fd0c6b5b576 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v22 02/12] Add stress tests for concurrent index builds

Introduce stress tests for concurrent index operations:
- test concurrent inserts/updates during CREATE/REINDEX INDEX CONCURRENTLY
- cover various index types (btree, gin, gist, brin, hash, spgist)
- test unique and non-unique indexes
- test with expressions and predicates
- test both parallel and non-parallel operations

These tests verify the behavior of the following commits.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 223 ++++++++++++++++++++++++++++++++
 2 files changed, 224 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 316ea0d40b8..7df15435fbb 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..2aad0e8daa8
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,223 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp,
+								ia int4[], p point)));
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC in different options concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => q(
+			SET debug_parallel_query = off; -- this is because predicate_stable implementation
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for unique BTREE',
+	{
+		'concurrent_ops_unique_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY new_idx ON tbl(i);
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIN with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIN/GIST/BRIN/HASH/SPGIST',
+	{
+		'concurrent_ops_gin_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIN (ia);
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIST/BRIN/HASH/SPGIST index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIN/GIST/BRIN/HASH/SPGIST',
+	{
+		'concurrent_ops_other_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\set variant random(0, 3)
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIST (p);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING BRIN (updated_at);
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING HASH (updated_at);
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING SPGIST (p);
+					\endif
+					\sleep 10 ms
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->stop;
+done_testing();
\ No newline at end of file
-- 
2.43.0



^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-08-31 17:43  Alvaro Herrera <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  1 sibling, 1 reply; 106+ messages in thread

From: Alvaro Herrera @ 2025-08-31 17:43 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Pg Hackers <[email protected]>; Antonin Houska <[email protected]>; Robert Treat <[email protected]>; Fujii Masao <[email protected]>

On 2025-Aug-31, Mihail Nikalayeu wrote:

> I started an attempt to make a "lightweight" MVCC-safe prototype and
> got stuck on an "it is not working" issue.
> After some debugging I realized Antonin's variant (catalog-mode based)
> seems to be broken as well...
> 
> And after a few more hours I realized non-MVCC is broken as well :)

Ugh.  Well, obviously we need to get this fixed if we want CONCURRENTLY
at all :-)

Please don't post patches that aren't the commitfest item's main patch
as attachment with .patch extension.  This confuses the CFbot into
thinking your patch is the patch-of-record (which it isn't) and reports
that the patch fails CI.  See here:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/cf%2F5117
(For the same reason, it isn't useful to number them as if they were
part of the patch series).

If you want to post secondary patches, please rename them to end in
something like .txt or .nocfbot or whatever.  See here:
https://wiki.postgresql.org/wiki/Cfbot#Which_attachments_are_considered_to_be_patches?

Thanks for your interest in this topic,

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/

^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-09-01 00:16  Michael Paquier <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  1 sibling, 0 replies; 106+ messages in thread

From: Michael Paquier @ 2025-09-01 00:16 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Antonin Houska <[email protected]>; Alvaro Herrera <[email protected]>; Fujii Masao <[email protected]>; Robert Treat <[email protected]>; Pg Hackers <[email protected]>

On Wed, Aug 27, 2025 at 10:22:24AM +0200, Mihail Nikalayeu wrote:
> That worries me - it is not the behaviour someone expects from a
> database by default. At least the warning should be much more visible
> and obvious.
> I think most users will expect the same guarantees as [CREATE|RE]
> INDEX CONCURRENTLY provides.

Having a unified path for the handling of the waits and the locking
sounds to me like a pretty good argument in favor of a basic
implementation.

In my experience, users do not really care about the time it takes to
complete an operation involving CONCURRENTLY if we allow concurrent
reads and writes in parallel with it.  I have not looked at the proposal
in detail, but before trying a more folkloric MVCC approach, relying
on basics that we know have been working for some time seems like a
good and sufficient initial step in terms of handling the waits and
the locks with table AMs (aka heap or something else).

Just my 2c.
--
Michael

Attachments:

  [application/pgp-signature] signature.asc (833B, 2-signature.asc)
  download

^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-09-01 05:12  Antonin Houska <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  1 sibling, 1 reply; 106+ messages in thread

From: Antonin Houska @ 2025-09-01 05:12 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Pg Hackers <[email protected]>; Robert Treat <[email protected]>; Fujii Masao <[email protected]>

Mihail Nikalayeu <[email protected]> wrote:

> Hello!
> 
> I started an attempt to make a "lightweight" MVCC-safe prototype and
> got stuck on an "it is not working" issue.
> After some debugging I realized Antonin's variant (catalog-mode based)
> seems to be broken as well...
> 
> And after a few more hours I realized non-MVCC is broken as well :)
> 
> This is a patch with a test to reproduce the issue related to repack +
> concurrent modifications.
> Seems like some updates may be lost.
> 
> I hope the patch logic is clear - but feel free to ask if not.

Are you sure the test is complete? I see no occurrence of the REPACK command
in it.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com





^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-09-01 09:06  Mihail Nikalayeu <[email protected]>
  parent: Antonin Houska <[email protected]>
  0 siblings, 1 reply; 106+ messages in thread

From: Mihail Nikalayeu @ 2025-09-01 09:06 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Pg Hackers <[email protected]>; Robert Treat <[email protected]>; Fujii Masao <[email protected]>

Hello!

Antonin Houska <[email protected]>:
> Are you sure the test is complete? I see no occurrence of the REPACK command
> in it.
Oops, I sent the wrong file. The correct one is attached.


Attachments:

  [application/octet-stream] Add_test_for_REPACK_CONCURRENTLY_with_concurrent_modifications.patch_ (3.8K, 2-Add_test_for_REPACK_CONCURRENTLY_with_concurrent_modifications.patch_)
  download

^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-09-01 13:00  Mihail Nikalayeu <[email protected]>
  parent: Alvaro Herrera <[email protected]>
  0 siblings, 0 replies; 106+ messages in thread

From: Mihail Nikalayeu @ 2025-09-01 13:00 UTC (permalink / raw)
  To: Alvaro Herrera <[email protected]>; +Cc: Pg Hackers <[email protected]>; Antonin Houska <[email protected]>; Robert Treat <[email protected]>; Fujii Masao <[email protected]>

Hello, Álvaro!

Alvaro Herrera <[email protected]>:
> If you want to post secondary patches, please rename them to end in
> something like .txt or .nocfbot or whatever.  See here:
> https://wiki.postgresql.org/wiki/Cfbot#Which_attachments_are_considered_to_be_patches?

Sorry, I missed that.
But now it is possible to send ".patch" without changing the extension [0].

> It also ignores any files that start with "nocfbot".

[0]: https://discord.com/channels/1258108670710124574/1328362897189113867/1412021226528051250





^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-09-01 15:30  Antonin Houska <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  0 siblings, 1 reply; 106+ messages in thread

From: Antonin Houska @ 2025-09-01 15:30 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Pg Hackers <[email protected]>; Robert Treat <[email protected]>; Fujii Masao <[email protected]>

Mihail Nikalayeu <[email protected]> wrote:

> Antonin Houska <[email protected]>:
> > Are you sure the test is complete? I see no occurrence of the REPACK command
> > in it.
> Oops, I sent the wrong file. The correct one is attached.

Thanks!

The problem was that when removing the original "preserve visibility patch"
v12-0005 [1] from the series, I forgot to change the value of
'need_full_snapshot' argument of CreateInitDecodingContext().

v12 and earlier treated the repacked table like a system catalog, so it was
o.k. to pass need_full_snapshot=false. However, it must be true now, otherwise
the snapshot created for the initial copy does not see commits of transactions
that do not change regular catalogs.

The fix is as simple as

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index f481a3cec6d..7866ac01278 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -502,6 +502,7 @@ SnapBuildInitialSnapshotForRepack(SnapBuild *builder)
        StringInfo      buf = makeStringInfo();
 
        Assert(builder->state == SNAPBUILD_CONSISTENT);
+       Assert(builder->building_full_snapshot);
 
        snap = SnapBuildBuildSnapshot(builder);
 

I'll apply it to the next version of the "Add CONCURRENTLY option to REPACK
command" patch.
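
To illustrate why the flag matters, here is a deliberately simplified toy
model; it is not the snapbuild.c code.  All names in it (ToySnapBuild,
ToyCommit, toy_record_commit, toy_snapshot_sees) are hypothetical.  The
only idea it tries to capture is the one described above: a builder that
is not building a full snapshot keeps track only of transactions that
changed catalogs, so a snapshot derived from it does not see the commits
of ordinary DML transactions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_COMMITS 16

/* A committed transaction as seen by the toy decoder. */
typedef struct ToyCommit
{
	unsigned	xid;
	bool		changed_catalog;
} ToyCommit;

/* Toy stand-in for the snapshot builder state. */
typedef struct ToySnapBuild
{
	bool		building_full_snapshot;
	unsigned	committed[TOY_MAX_COMMITS];
	size_t		ncommitted;
} ToySnapBuild;

/*
 * Remember a commit.  Without building_full_snapshot, transactions that
 * did not touch catalogs are simply not tracked.
 */
void
toy_record_commit(ToySnapBuild *builder, ToyCommit commit)
{
	if (!builder->building_full_snapshot && !commit.changed_catalog)
		return;

	assert(builder->ncommitted < TOY_MAX_COMMITS);
	builder->committed[builder->ncommitted++] = commit.xid;
}

/* Would a snapshot built right now consider xid committed? */
bool
toy_snapshot_sees(const ToySnapBuild *builder, unsigned xid)
{
	for (size_t i = 0; i < builder->ncommitted; i++)
	{
		if (builder->committed[i] == xid)
			return true;
	}
	return false;
}
```

In this model, an UPDATE committed by xid 100 shortly before the initial
copy is invisible to a builder created with building_full_snapshot =
false, which is the kind of lost update the test exposed.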


[1] https://www.postgresql.org/message-id/flat/CAFj8pRDK89FtY_yyGw7-MW-zTaHOCY4m6qfLRittdoPocz+dMQ@mail....

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com





^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-09-02 10:44  Mihail Nikalayeu <[email protected]>
  parent: Antonin Houska <[email protected]>
  0 siblings, 1 reply; 106+ messages in thread

From: Mihail Nikalayeu @ 2025-09-02 10:44 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Pg Hackers <[email protected]>; Robert Treat <[email protected]>; Fujii Masao <[email protected]>

Hello!

Antonin Houska <[email protected]>:
> I'll apply it to the next version of the "Add CONCURRENTLY option to REPACK
> command" patch.
I have added it to the v21 patchset.

Also, I’ve updated the MVCC-safe patch:
* it uses the "XactLockTableWait before replay + SnapshotSelf" approach from [0]
* it includes a TAP test to ensure MVCC safety - not intended to be
committed in its current form (too heavy)
* documentation has been updated.

It's now much simpler and does not negatively impact performance. It
is less aggressive in tuple freezing, but can be updated to match the
non-MVCC-safe version if needed.
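
As a rough picture of the replay step only (the XactLockTableWait part is
not modelled), here is a toy sketch.  It is not the patch code, and every
name in it (ToyHeap, ToyTuple, toy_find_live, toy_replay_*) is
hypothetical.  It assumes a single apply process and that every replayed
change comes from a committed transaction, so at most one version per key
is live at any time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_TUPLES 32

/* One tuple version; xmax == 0 means "not deleted or updated yet". */
typedef struct ToyTuple
{
	int			key;
	int			value;
	unsigned	xmin;
	unsigned	xmax;
} ToyTuple;

typedef struct ToyHeap
{
	ToyTuple	tuples[TOY_MAX_TUPLES];
	size_t		ntuples;
} ToyHeap;

/* SnapshotSelf-like lookup: the live version of key, or NULL. */
ToyTuple *
toy_find_live(ToyHeap *heap, int key)
{
	for (size_t i = 0; i < heap->ntuples; i++)
	{
		if (heap->tuples[i].key == key && heap->tuples[i].xmax == 0)
			return &heap->tuples[i];
	}
	return NULL;
}

/* Replay an INSERT, stamping the tuple with the original XID. */
bool
toy_replay_insert(ToyHeap *heap, int key, int value, unsigned orig_xid)
{
	assert(orig_xid != 0);
	if (heap->ntuples >= TOY_MAX_TUPLES)
		return false;
	heap->tuples[heap->ntuples++] = (ToyTuple) {key, value, orig_xid, 0};
	return true;
}

/* Replay a DELETE: mark the live version as deleted by orig_xid. */
bool
toy_replay_delete(ToyHeap *heap, int key, unsigned orig_xid)
{
	ToyTuple   *old = toy_find_live(heap, key);

	assert(orig_xid != 0);
	if (old == NULL)
		return false;
	old->xmax = orig_xid;
	return true;
}

/* Replay an UPDATE: close the live version, add a new one, same XID. */
bool
toy_replay_update(ToyHeap *heap, int key, int new_value, unsigned orig_xid)
{
	if (heap->ntuples >= TOY_MAX_TUPLES)
		return false;
	if (!toy_replay_delete(heap, key, orig_xid))
		return false;
	return toy_replay_insert(heap, key, new_value, orig_xid);
}
```

The point of the sketch is that the new heap ends up with the same
xmin/xmax stamps the original transactions produced, rather than the XID
of the REPACK transaction.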

While testing the MVCC-safe version with the stress test
007_repack_concurrently_mvcc.pl, I encountered some random crashes with
logs like this:

2025-09-02 12:24:40.039 CEST client backend[261907] 007_repack_concurrently_mvcc.pl ERROR:  relcache reference 0x7715b9f394a8 is not owned by resource owner TopTransaction
2025-09-02 12:24:40.039 CEST client backend[261907] 007_repack_concurrently_mvcc.pl STATEMENT:  REPACK (CONCURRENTLY) tbl1 USING INDEX tbl1_pkey;
TRAP: failed Assert("rel->rd_refcnt > 0"), File: "../src/backend/utils/cache/relcache.c", Line: 6992, PID: 261907
postgres: CIC_test: nkey postgres [local] REPACK(ExceptionalCondition+0xbe)[0x5b7ac41d79f9]
postgres: CIC_test: nkey postgres [local] REPACK(+0x852d2e)[0x5b7ac41cbd2e]
postgres: CIC_test: nkey postgres [local] REPACK(+0x8aa4a6)[0x5b7ac42234a6]
postgres: CIC_test: nkey postgres [local] REPACK(+0x8aad3b)[0x5b7ac4223d3b]
postgres: CIC_test: nkey postgres [local] REPACK(+0x8aac69)[0x5b7ac4223c69]
postgres: CIC_test: nkey postgres [local] REPACK(ResourceOwnerRelease+0x32)[0x5b7ac4223c26]
postgres: CIC_test: nkey postgres [local] REPACK(+0x1f43bf)[0x5b7ac3b6d3bf]
postgres: CIC_test: nkey postgres [local] REPACK(+0x1f4dfa)[0x5b7ac3b6ddfa]
postgres: CIC_test: nkey postgres [local] REPACK(AbortCurrentTransaction+0xe)[0x5b7ac3b6dd6b]
postgres: CIC_test: nkey postgres [local] REPACK(PostgresMain+0x57d)[0x5b7ac3fd7238]
postgres: CIC_test: nkey postgres [local] REPACK(+0x654102)[0x5b7ac3fcd102]
postgres: CIC_test: nkey postgres [local] REPACK(postmaster_child_launch+0x191)[0x5b7ac3eceb7a]
postgres: CIC_test: nkey postgres [local] REPACK(+0x55c8c1)[0x5b7ac3ed58c1]
postgres: CIC_test: nkey postgres [local] REPACK(+0x559d1e)[0x5b7ac3ed2d1e]
postgres: CIC_test: nkey postgres [local] REPACK(PostmasterMain+0x168a)[0x5b7ac3ed25f8]
postgres: CIC_test: nkey postgres [local] REPACK(main+0x3a1)[0x5b7ac3da2bd6]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x7715b962a1ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x7715b962a28b]

This time I was clever and first tried to reproduce the issue on the
non-MVCC-safe version - and it is reproducible.

Just comment out \if :p_t1 != :p_t2 (and its internals, because they
catch non-MVCC behaviour, which is expected without the 0006 patch), and
set
'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=25000'

It takes about a minute on my PC to get the crash.

[0]: https://www.postgresql.org/message-id/flat/CADzfLwXCTXNdxK-XGTKmObvT%3D_QnaCviwgrcGtG9chsj5sYzrg%40m...

Best regards,
Mikhail.


Attachments:

  [application/octet-stream] v21-0006-Preserve-visibility-information-of-the-concurren.patch (30.5K, 2-v21-0006-Preserve-visibility-information-of-the-concurren.patch)
  download | inline diff:
From 946862e2a4dbfd91ac6802c2e8da104dce81c43a Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 2 Sep 2025 11:30:55 +0200
Subject: [PATCH v21 6/6] Preserve visibility information of the concurrent 
 data  changes.

As explained in the commit message of the preceding patch of the series, the
data changes done by applications while REPACK CONCURRENTLY is copying the
table contents to a new file are decoded from WAL and eventually also applied
to the new file. To reduce the complexity a little bit, the preceding patch
uses the current transaction (i.e. transaction opened by the REPACK command)
to execute those INSERT, UPDATE and DELETE commands.

However, REPACK is not expected to change visibility of tuples. Therefore,
this patch fixes the handling of the "concurrent data changes". It ensures
that tuples written into the new table have the same XID and command ID (CID)
as they had in the old table.

To "replay" an UPDATE or DELETE command on the new table, we use SnapshotSelf
to find the last live version of the tuple and stamp the change with the XID
of the original transaction. It is safe because:
* all transactions we are replaying are committed
* the apply worker runs without any concurrent modifiers of the table

As long as we preserve the tuple visibility information (which includes XID),
it's important to avoid logical decoding of the WAL generated by DMLs on the
new table: the logical decoding subsystem probably does not expect that the
incoming WAL records contain XIDs of already decoded transactions. (And of
course, repeated decoding would be wasted effort.)

Author: Antonin Houska <[email protected]> with changes from Mikhail Nikalayeu <[email protected]>
---
 contrib/amcheck/meson.build                   |   1 +
 .../amcheck/t/007_repack_concurrently_mvcc.pl | 113 ++++++++++++++++++
 doc/src/sgml/mvcc.sgml                        |  12 +-
 doc/src/sgml/ref/repack.sgml                  |   9 --
 src/backend/access/common/toast_internals.c   |   3 +-
 src/backend/access/heap/heapam.c              |  46 ++++---
 src/backend/access/heap/heapam_handler.c      |  24 ++--
 src/backend/commands/cluster.c                |  85 +++++++++----
 .../pgoutput_repack/pgoutput_repack.c         |  18 +--
 src/include/access/heapam.h                   |  12 +-
 src/include/commands/cluster.h                |   2 +
 .../injection_points/specs/repack.spec        |   4 -
 12 files changed, 249 insertions(+), 80 deletions(-)
 create mode 100644 contrib/amcheck/t/007_repack_concurrently_mvcc.pl

diff --git a/contrib/amcheck/meson.build b/contrib/amcheck/meson.build
index 1f0c347ed54..d07d6ed3f0c 100644
--- a/contrib/amcheck/meson.build
+++ b/contrib/amcheck/meson.build
@@ -50,6 +50,7 @@ tests += {
       't/004_verify_nbtree_unique.pl',
       't/005_pitr.pl',
       't/006_verify_gin.pl',
+      't/007_repack_concurrently_mvcc.pl',
     ],
   },
 }
diff --git a/contrib/amcheck/t/007_repack_concurrently_mvcc.pl b/contrib/amcheck/t/007_repack_concurrently_mvcc.pl
new file mode 100644
index 00000000000..a83fd5b8141
--- /dev/null
+++ b/contrib/amcheck/t/007_repack_concurrently_mvcc.pl
@@ -0,0 +1,113 @@
+
+# Copyright (c) 2021-2025, PostgreSQL Global Development Group
+
+# Test REPACK CONCURRENTLY with concurrent modifications
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+
+use Test::More;
+
+my $node;
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('CIC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf(
+	'postgresql.conf', qq(
+wal_level = logical
+));
+$node->start;
+$node->safe_psql('postgres', q(CREATE TABLE tbl1(i int PRIMARY KEY, j int)));
+$node->safe_psql('postgres', q(CREATE TABLE tbl2(i int PRIMARY KEY, j int)));
+
+
+# Insert 100 rows into tbl1
+$node->safe_psql('postgres', q(
+    INSERT INTO tbl1 SELECT i, i % 100 FROM generate_series(1,100) i
+));
+
+# Insert 100 rows into tbl2
+$node->safe_psql('postgres', q(
+    INSERT INTO tbl2 SELECT i, i % 100 FROM generate_series(1,100) i
+));
+
+
+# Create a helper function that reports mismatched rows via NOTICE
+$node->safe_psql('postgres', q(
+	CREATE OR REPLACE FUNCTION log_raise(i int, j1 int, j2 int) RETURNS VOID AS $$
+	BEGIN
+	  RAISE NOTICE 'ERROR i=% j1=% j2=%', i, j1, j2;
+	END;$$ LANGUAGE plpgsql;
+));
+
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+
+$node->pgbench(
+'--no-vacuum --client=10 --jobs=4 --exit-on-abort --transactions=2500',
+0,
+[qr{actually processed}],
+[qr{^$}],
+'concurrent operations with REPACK CONCURRENTLY',
+{
+	'concurrent_ops' => q(
+		SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+		\if :gotlock
+			SELECT nextval('in_row_rebuild') AS last_value \gset
+			\if :last_value = 2
+				REPACK (CONCURRENTLY) tbl1 USING INDEX tbl1_pkey;
+				\sleep 10 ms
+				REPACK (CONCURRENTLY) tbl2 USING INDEX tbl2_pkey;
+				\sleep 10 ms
+			\endif
+			SELECT pg_advisory_unlock(42);
+		\else
+			\set num random(1, 100)
+			BEGIN;
+			UPDATE tbl1 SET j = j + 1 WHERE i = :num;
+			\sleep 1 ms
+			UPDATE tbl1 SET j = j + 2 WHERE i = :num;
+			\sleep 1 ms
+			UPDATE tbl1 SET j = j + 3 WHERE i = :num;
+			\sleep 1 ms
+			UPDATE tbl1 SET j = j + 4 WHERE i = :num;
+			\sleep 1 ms
+
+			UPDATE tbl2 SET j = j + 1 WHERE i = :num;
+			\sleep 1 ms
+			UPDATE tbl2 SET j = j + 2 WHERE i = :num;
+			\sleep 1 ms
+			UPDATE tbl2 SET j = j + 3 WHERE i = :num;
+			\sleep 1 ms
+			UPDATE tbl2 SET j = j + 4 WHERE i = :num;
+
+			COMMIT;
+			SELECT setval('in_row_rebuild', 1);
+
+			BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+			SELECT COALESCE(SUM(j), 0) AS t1 FROM tbl1 WHERE i = :num \gset p_
+			\sleep 10 ms
+			SELECT COALESCE(SUM(j), 0) AS t2 FROM tbl2 WHERE i = :num \gset p_
+			\if :p_t1 != :p_t2
+				COMMIT;
+				SELECT log_raise(tbl1.i, tbl1.j, tbl2.j) FROM tbl1 LEFT OUTER JOIN tbl2 ON tbl1.i = tbl2.i WHERE tbl1.j != tbl2.j;
+				\sleep 10 ms
+				SELECT log_raise(tbl1.i, tbl1.j, tbl2.j) FROM tbl1 LEFT OUTER JOIN tbl2 ON tbl1.i = tbl2.i WHERE tbl1.j != tbl2.j;
+				SELECT (:p_t1 + :p_t2) / 0;
+			\endif
+
+			COMMIT;
+		\endif
+	)
+});
+
+$node->stop;
+done_testing();
diff --git a/doc/src/sgml/mvcc.sgml b/doc/src/sgml/mvcc.sgml
index 0f5c34af542..049ee75a4ba 100644
--- a/doc/src/sgml/mvcc.sgml
+++ b/doc/src/sgml/mvcc.sgml
@@ -1833,17 +1833,15 @@ SELECT pg_advisory_lock(q.id) FROM
    <title>Caveats</title>
 
    <para>
-    Some commands, currently only <link linkend="sql-truncate"><command>TRUNCATE</command></link>, the
-    table-rewriting forms of <link linkend="sql-altertable"><command>ALTER
-    TABLE</command></link> and <command>REPACK</command> with
-    the <literal>CONCURRENTLY</literal> option, are not
+    Some DDL commands, currently only <link linkend="sql-truncate"><command>TRUNCATE</command></link> and the
+    table-rewriting forms of <link linkend="sql-altertable"><command>ALTER TABLE</command></link>, are not
     MVCC-safe.  This means that after the truncation or rewrite commits, the
     table will appear empty to concurrent transactions, if they are using a
-    snapshot taken before the command committed.  This will only be an
+    snapshot taken before the DDL command committed.  This will only be an
     issue for a transaction that did not access the table in question
-    before the command started &mdash; any transaction that has done so
+    before the DDL command started &mdash; any transaction that has done so
     would hold at least an <literal>ACCESS SHARE</literal> table lock,
-    which would block the truncating or rewriting command until that transaction completes.
+    which would block the DDL command until that transaction completes.
     So these commands will not cause any apparent inconsistency in the
     table contents for successive queries on the target table, but they
     could cause visible inconsistency between the contents of the target
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
index ff5ce48de55..271923a5a60 100644
--- a/doc/src/sgml/ref/repack.sgml
+++ b/doc/src/sgml/ref/repack.sgml
@@ -292,15 +292,6 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
        </listitem>
       </itemizedlist>
      </para>
-
-     <warning>
-      <para>
-       <command>REPACK</command> with the <literal>CONCURRENTLY</literal>
-       option is not MVCC-safe, see <xref linkend="mvcc-caveats"/> for
-       details.
-      </para>
-     </warning>
-
     </listitem>
    </varlistentry>
 
diff --git a/src/backend/access/common/toast_internals.c b/src/backend/access/common/toast_internals.c
index a1d0eed8953..586eb42a137 100644
--- a/src/backend/access/common/toast_internals.c
+++ b/src/backend/access/common/toast_internals.c
@@ -320,7 +320,8 @@ toast_save_datum(Relation rel, Datum value,
 		memcpy(VARDATA(&chunk_data), data_p, chunk_size);
 		toasttup = heap_form_tuple(toasttupDesc, t_values, t_isnull);
 
-		heap_insert(toastrel, toasttup, mycid, options, NULL);
+		heap_insert(toastrel, toasttup, GetCurrentTransactionId(), mycid,
+					options, NULL);
 
 		/*
 		 * Create the index entry.  We cheat a little here by not using
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f9a4fe3faed..45da5902de0 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2070,7 +2070,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
 /*
  *	heap_insert		- insert tuple into a heap
  *
- * The new tuple is stamped with current transaction ID and the specified
+ * The new tuple is stamped with the specified transaction ID and the specified
  * command ID.
  *
  * See table_tuple_insert for comments about most of the input flags, except
@@ -2086,15 +2086,16 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
  * reflected into *tup.
  */
 void
-heap_insert(Relation relation, HeapTuple tup, CommandId cid,
-			int options, BulkInsertState bistate)
+heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+			CommandId cid, int options, BulkInsertState bistate)
 {
-	TransactionId xid = GetCurrentTransactionId();
 	HeapTuple	heaptup;
 	Buffer		buffer;
 	Buffer		vmbuffer = InvalidBuffer;
 	bool		all_visible_cleared = false;
 
+	Assert(TransactionIdIsValid(xid));
+
 	/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
 	Assert(HeapTupleHeaderGetNatts(tup->t_data) <=
 		   RelationGetNumberOfAttributes(relation));
@@ -2176,8 +2177,15 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		/*
 		 * If this is a catalog, we need to transmit combo CIDs to properly
 		 * decode, so log that as well.
+		 *
+		 * HEAP_INSERT_NO_LOGICAL should be set when applying data changes
+		 * done by other transactions during REPACK CONCURRENTLY. In such a
+		 * case, the insertion should not be decoded at all - see
+		 * heap_decode(). (It's also set by raw_heap_insert() for TOAST, but
+		 * TOAST does not pass this test anyway.)
 		 */
-		if (RelationIsAccessibleInLogicalDecoding(relation))
+		if ((options & HEAP_INSERT_NO_LOGICAL) == 0 &&
+			RelationIsAccessibleInLogicalDecoding(relation))
 			log_heap_new_cid(relation, heaptup);
 
 		/*
@@ -2723,7 +2731,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 void
 simple_heap_insert(Relation relation, HeapTuple tup)
 {
-	heap_insert(relation, tup, GetCurrentCommandId(true), 0, NULL);
+	heap_insert(relation, tup, GetCurrentTransactionId(),
+				GetCurrentCommandId(true), 0, NULL);
 }
 
 /*
@@ -2780,11 +2789,11 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
  */
 TM_Result
 heap_delete(Relation relation, ItemPointer tid,
-			CommandId cid, Snapshot crosscheck, bool wait,
-			TM_FailureData *tmfd, bool changingPart, bool wal_logical)
+			TransactionId xid, CommandId cid, Snapshot crosscheck, bool wait,
+			TM_FailureData *tmfd, bool changingPart,
+			bool wal_logical)
 {
 	TM_Result	result;
-	TransactionId xid = GetCurrentTransactionId();
 	ItemId		lp;
 	HeapTupleData tp;
 	Page		page;
@@ -2801,6 +2810,7 @@ heap_delete(Relation relation, ItemPointer tid,
 	bool		old_key_copied = false;
 
 	Assert(ItemPointerIsValid(tid));
+	Assert(TransactionIdIsValid(xid));
 
 	AssertHasSnapshotForToast(relation);
 
@@ -3217,7 +3227,7 @@ simple_heap_delete(Relation relation, ItemPointer tid)
 	TM_Result	result;
 	TM_FailureData tmfd;
 
-	result = heap_delete(relation, tid,
+	result = heap_delete(relation, tid, GetCurrentTransactionId(),
 						 GetCurrentCommandId(true), InvalidSnapshot,
 						 true /* wait for commit */ ,
 						 &tmfd, false,	/* changingPart */
@@ -3260,12 +3270,11 @@ simple_heap_delete(Relation relation, ItemPointer tid)
  */
 TM_Result
 heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
-			CommandId cid, Snapshot crosscheck, bool wait,
-			TM_FailureData *tmfd, LockTupleMode *lockmode,
+			TransactionId xid, CommandId cid, Snapshot crosscheck,
+			bool wait, TM_FailureData *tmfd, LockTupleMode *lockmode,
 			TU_UpdateIndexes *update_indexes, bool wal_logical)
 {
 	TM_Result	result;
-	TransactionId xid = GetCurrentTransactionId();
 	Bitmapset  *hot_attrs;
 	Bitmapset  *sum_attrs;
 	Bitmapset  *key_attrs;
@@ -3305,6 +3314,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
 				infomask2_new_tuple;
 
 	Assert(ItemPointerIsValid(otid));
+	Assert(TransactionIdIsValid(xid));
 
 	/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
 	Assert(HeapTupleHeaderGetNatts(newtup->t_data) <=
@@ -4144,8 +4154,12 @@ l2:
 		/*
 		 * For logical decoding we need combo CIDs to properly decode the
 		 * catalog.
+		 *
+		 * Like in heap_insert(), visibility is unchanged when called from
+		 * VACUUM FULL / CLUSTER.
 		 */
-		if (RelationIsAccessibleInLogicalDecoding(relation))
+		if (wal_logical &&
+			RelationIsAccessibleInLogicalDecoding(relation))
 		{
 			log_heap_new_cid(relation, &oldtup);
 			log_heap_new_cid(relation, heaptup);
@@ -4511,7 +4525,7 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
 	TM_FailureData tmfd;
 	LockTupleMode lockmode;
 
-	result = heap_update(relation, otid, tup,
+	result = heap_update(relation, otid, tup, GetCurrentTransactionId(),
 						 GetCurrentCommandId(true), InvalidSnapshot,
 						 true /* wait for commit */ ,
 						 &tmfd, &lockmode, update_indexes,
@@ -5351,8 +5365,6 @@ compute_new_xmax_infomask(TransactionId xmax, uint16 old_infomask,
 	uint16		new_infomask,
 				new_infomask2;
 
-	Assert(TransactionIdIsCurrentTransactionId(add_to_xmax));
-
 l5:
 	new_infomask = 0;
 	new_infomask2 = 0;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index d03084768e0..6733e5fdda6 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -253,7 +253,8 @@ heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
 	tuple->t_tableOid = slot->tts_tableOid;
 
 	/* Perform the insertion, and copy the resulting ItemPointer */
-	heap_insert(relation, tuple, cid, options, bistate);
+	heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+				bistate);
 	ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
 
 	if (shouldFree)
@@ -276,7 +277,8 @@ heapam_tuple_insert_speculative(Relation relation, TupleTableSlot *slot,
 	options |= HEAP_INSERT_SPECULATIVE;
 
 	/* Perform the insertion, and copy the resulting ItemPointer */
-	heap_insert(relation, tuple, cid, options, bistate);
+	heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+				bistate);
 	ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
 
 	if (shouldFree)
@@ -310,8 +312,8 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
 	 * the storage itself is cleaning the dead tuples by itself, it is the
 	 * time to call the index tuple deletion also.
 	 */
-	return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart,
-					   true);
+	return heap_delete(relation, tid, GetCurrentTransactionId(), cid,
+					   crosscheck, wait, tmfd, changingPart, true);
 }
 
 
@@ -329,7 +331,8 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 	slot->tts_tableOid = RelationGetRelid(relation);
 	tuple->t_tableOid = slot->tts_tableOid;
 
-	result = heap_update(relation, otid, tuple, cid, crosscheck, wait,
+	result = heap_update(relation, otid, tuple, GetCurrentTransactionId(),
+						 cid, crosscheck, wait,
 						 tmfd, lockmode, update_indexes, true);
 	ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
 
@@ -2477,9 +2480,16 @@ reform_and_rewrite_tuple(HeapTuple tuple,
 		 * flag to skip logical decoding: as soon as REPACK CONCURRENTLY swaps
 		 * the relation files, it drops this relation, so no logical
 		 * replication subscription should need the data.
+		 *
+		 * It is also crucial to stamp the new record with the exact same xid
+		 * and cid, because the tuple must be visible to the snapshots of the
+		 * concurrent transactions later.
 		 */
-		heap_insert(NewHeap, copiedTuple, GetCurrentCommandId(true),
-					HEAP_INSERT_NO_LOGICAL, NULL);
+		// TODO: looks like cid is not required
+		CommandId	cid = HeapTupleHeaderGetRawCommandId(tuple->t_data);
+		TransactionId xid = HeapTupleHeaderGetXmin(tuple->t_data);
+
+		heap_insert(NewHeap, copiedTuple, xid, cid, HEAP_INSERT_NO_LOGICAL, NULL);
 	}
 
 	heap_freetuple(copiedTuple);
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 61224a3adf2..936cb0ae429 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -55,6 +55,7 @@
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -146,6 +147,7 @@ static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
 									ConcurrentChange *change);
 static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
 								   HeapTuple tup_key,
+								   Snapshot snapshot,
 								   IndexInsertState *iistate,
 								   TupleTableSlot *ident_slot,
 								   IndexScanDesc *scan_p);
@@ -1008,7 +1010,14 @@ rebuild_relation(RepackCommand cmd, bool usingindex,
 
 	/* The historic snapshot won't be needed anymore. */
 	if (snapshot)
+	{
+		TransactionId xmin = snapshot->xmin;
 		PopActiveSnapshot();
+		Assert(concurrent);
+		// TODO: seems like it is not required: need to check SnapBuildInitialSnapshotForRepack
+		WaitForOlderSnapshots(xmin, false);
+	}
+
 
 	if (concurrent)
 	{
@@ -1299,30 +1308,35 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
 	 * not to be aggressive about this.
 	 */
 	memset(&params, 0, sizeof(VacuumParams));
-	vacuum_get_cutoffs(OldHeap, params, &cutoffs);
-
-	/*
-	 * FreezeXid will become the table's new relfrozenxid, and that mustn't go
-	 * backwards, so take the max.
-	 */
+	if (!concurrent)
 	{
 		TransactionId relfrozenxid = OldHeap->rd_rel->relfrozenxid;
+		MultiXactId relminmxid = OldHeap->rd_rel->relminmxid;
 
+		vacuum_get_cutoffs(OldHeap, params, &cutoffs);
+		/*
+		 * FreezeXid will become the table's new relfrozenxid, and that mustn't go
+		 * backwards, so take the max.
+		 */
 		if (TransactionIdIsValid(relfrozenxid) &&
 			TransactionIdPrecedes(cutoffs.FreezeLimit, relfrozenxid))
 			cutoffs.FreezeLimit = relfrozenxid;
-	}
-
-	/*
-	 * MultiXactCutoff, similarly, shouldn't go backwards either.
-	 */
-	{
-		MultiXactId relminmxid = OldHeap->rd_rel->relminmxid;
-
+		/*
+		 * MultiXactCutoff, similarly, shouldn't go backwards either.
+		 */
 		if (MultiXactIdIsValid(relminmxid) &&
 			MultiXactIdPrecedes(cutoffs.MultiXactCutoff, relminmxid))
 			cutoffs.MultiXactCutoff = relminmxid;
 	}
+	else
+	{
+		/*
+		 * In concurrent mode we reuse all the xmin/xmax,
+		 * so just use current values for simplicity.
+		 */
+		cutoffs.FreezeLimit = OldHeap->rd_rel->relfrozenxid;
+		cutoffs.MultiXactCutoff = OldHeap->rd_rel->relminmxid;
+	}
 
 	/*
 	 * Decide whether to use an indexscan or seqscan-and-optional-sort to scan
@@ -2675,6 +2689,16 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
 			continue;
 		}
 
+		if (TransactionIdIsInProgress(change.xid))
+		{
+			/* xid is certainly committed because we got this change from the reorderbuffer,
+			 * but the procarray may not be updated yet, so the current backend can still see
+			 * it as in-progress. Wait for the procarray to be updated. */
+			XactLockTableWait(change.xid, NULL, NULL, XLTW_None);
+			Assert(!TransactionIdIsInProgress(change.xid));
+			Assert(TransactionIdDidCommit(change.xid));
+		}
+
 		/*
 		 * Extract the tuple from the change. The tuple is copied here because
 		 * it might be assigned to 'tup_old', in which case it needs to
@@ -2712,9 +2736,13 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
 			}
 
 			/*
-			 * Find the tuple to be updated or deleted.
+			 * Find the tuple to be updated or deleted using SnapshotSelf.
+			 * That way we get the last live version in case of a HOT chain.
+			 * There can be no updated-but-not-yet-committed version, because
+			 * we are replaying only committed transactions, with no concurrency
+			 * involved.
 			 */
-			tup_exist = find_target_tuple(rel, key, nkeys, tup_key,
+			tup_exist = find_target_tuple(rel, key, nkeys, tup_key, SnapshotSelf,
 										  iistate, ident_slot, &ind_scan);
 			if (tup_exist == NULL)
 				elog(ERROR, "Failed to find target tuple");
@@ -2743,6 +2771,7 @@ apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
 		 */
 		if (change.kind != CHANGE_UPDATE_OLD)
 		{
+			// TODO: not sure it is required at all: we are replaying committed transactions stamping them with committed XID
 			CommandCounterIncrement();
 			UpdateActiveSnapshotCommandId();
 		}
@@ -2771,9 +2800,11 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
 	 * Like simple_heap_insert(), but make sure that the INSERT is not
 	 * logically decoded - see reform_and_rewrite_tuple() for more
 	 * information.
+	 *
+	 * Use already committed xid to stamp the tuple.
 	 */
-	heap_insert(rel, tup, GetCurrentCommandId(true), HEAP_INSERT_NO_LOGICAL,
-				NULL);
+	heap_insert(rel, tup, change->xid, GetCurrentCommandId(true),
+				HEAP_INSERT_NO_LOGICAL, NULL);
 
 	/*
 	 * Update indexes.
@@ -2781,6 +2812,7 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
 	 * In case functions in the index need the active snapshot and caller
 	 * hasn't set one.
 	 */
+	PushActiveSnapshot(GetLatestSnapshot());
 	ExecStoreHeapTuple(tup, index_slot, false);
 	recheck = ExecInsertIndexTuples(iistate->rri,
 									index_slot,
@@ -2791,6 +2823,7 @@ apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
 									NIL,	/* arbiterIndexes */
 									false	/* onlySummarizing */
 		);
+	PopActiveSnapshot();
 
 	/*
 	 * If recheck is required, it must have been preformed on the source
@@ -2819,9 +2852,11 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
 	 *
 	 * Do it like in simple_heap_update(), except for 'wal_logical' (and
 	 * except for 'wait').
+	 *
+	 * Use already committed xid to stamp the tuple.
 	 */
 	res = heap_update(rel, &tup_target->t_self, tup,
-					  GetCurrentCommandId(true),
+					  change->xid, GetCurrentCommandId(true),
 					  InvalidSnapshot,
 					  false,	/* no wait - only we are doing changes */
 					  &tmfd, &lockmode, &update_indexes,
@@ -2833,6 +2868,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
 
 	if (update_indexes != TU_None)
 	{
+		PushActiveSnapshot(GetLatestSnapshot());
 		recheck = ExecInsertIndexTuples(iistate->rri,
 										index_slot,
 										iistate->estate,
@@ -2842,6 +2878,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
 										NIL,	/* arbiterIndexes */
 		/* onlySummarizing */
 										update_indexes == TU_Summarizing);
+		PopActiveSnapshot();
 		list_free(recheck);
 	}
 
@@ -2860,9 +2897,11 @@ apply_concurrent_delete(Relation rel, HeapTuple tup_target,
 	 *
 	 * Do it like in simple_heap_delete(), except for 'wal_logical' (and
 	 * except for 'wait').
+	 *
+	 * Use already committed xid to stamp the tuple.
 	 */
-	res = heap_delete(rel, &tup_target->t_self, GetCurrentCommandId(true),
-					  InvalidSnapshot, false,
+	res = heap_delete(rel, &tup_target->t_self, change->xid,
+					  GetCurrentCommandId(true), InvalidSnapshot, false,
 					  &tmfd,
 					  false,	/* no wait - only we are doing changes */
 					  false /* wal_logical */ );
@@ -2886,7 +2925,7 @@ apply_concurrent_delete(Relation rel, HeapTuple tup_target,
  */
 static HeapTuple
 find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
-				  IndexInsertState *iistate,
+				  Snapshot snapshot, IndexInsertState *iistate,
 				  TupleTableSlot *ident_slot, IndexScanDesc *scan_p)
 {
 	IndexScanDesc scan;
@@ -2895,7 +2934,7 @@ find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
 	HeapTuple	result = NULL;
 
 	/* XXX no instrumentation for now */
-	scan = index_beginscan(rel, iistate->ident_index, GetActiveSnapshot(),
+	scan = index_beginscan(rel, iistate->ident_index, snapshot,
 						   NULL, nkeys, 0);
 	*scan_p = scan;
 	index_rescan(scan, key, nkeys, NULL, 0);
diff --git a/src/backend/replication/pgoutput_repack/pgoutput_repack.c b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
index 687fbbc59bb..020ff7b7c80 100644
--- a/src/backend/replication/pgoutput_repack/pgoutput_repack.c
+++ b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
@@ -32,7 +32,8 @@ static void plugin_truncate(struct LogicalDecodingContext *ctx,
 							Relation relations[],
 							ReorderBufferChange *change);
 static void store_change(LogicalDecodingContext *ctx,
-						 ConcurrentChangeKind kind, HeapTuple tuple);
+						 ConcurrentChangeKind kind, HeapTuple tuple,
+						 TransactionId xid);
 
 void
 _PG_output_plugin_init(OutputPluginCallbacks *cb)
@@ -124,7 +125,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				if (newtuple == NULL)
 					elog(ERROR, "Incomplete insert info.");
 
-				store_change(ctx, CHANGE_INSERT, newtuple);
+				store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid);
 			}
 			break;
 		case REORDER_BUFFER_CHANGE_UPDATE:
@@ -141,9 +142,11 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					elog(ERROR, "Incomplete update info.");
 
 				if (oldtuple != NULL)
-					store_change(ctx, CHANGE_UPDATE_OLD, oldtuple);
+					store_change(ctx, CHANGE_UPDATE_OLD, oldtuple,
+								 change->txn->xid);
 
-				store_change(ctx, CHANGE_UPDATE_NEW, newtuple);
+				store_change(ctx, CHANGE_UPDATE_NEW, newtuple,
+							 change->txn->xid);
 			}
 			break;
 		case REORDER_BUFFER_CHANGE_DELETE:
@@ -156,7 +159,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				if (oldtuple == NULL)
 					elog(ERROR, "Incomplete delete info.");
 
-				store_change(ctx, CHANGE_DELETE, oldtuple);
+				store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid);
 			}
 			break;
 		default:
@@ -190,13 +193,13 @@ plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 	if (i == nrelations)
 		return;
 
-	store_change(ctx, CHANGE_TRUNCATE, NULL);
+	store_change(ctx, CHANGE_TRUNCATE, NULL, InvalidTransactionId);
 }
 
 /* Store concurrent data change. */
 static void
 store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
-			 HeapTuple tuple)
+			 HeapTuple tuple, TransactionId xid)
 {
 	RepackDecodingState *dstate;
 	char	   *change_raw;
@@ -266,6 +269,7 @@ store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
 	dst = dst_start + SizeOfConcurrentChange;
 	memcpy(dst, tuple->t_data, tuple->t_len);
 
+	change.xid = xid;
 	/* The data has been copied. */
 	if (flattened)
 		pfree(tuple);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b82dd17a966..981425f23b6 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -316,22 +316,24 @@ extern BulkInsertState GetBulkInsertState(void);
 extern void FreeBulkInsertState(BulkInsertState);
 extern void ReleaseBulkInsertStatePin(BulkInsertState bistate);
 
-extern void heap_insert(Relation relation, HeapTuple tup, CommandId cid,
-						int options, BulkInsertState bistate);
+extern void heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+						CommandId cid, int options, BulkInsertState bistate);
 extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots,
 							  int ntuples, CommandId cid, int options,
 							  BulkInsertState bistate);
 extern TM_Result heap_delete(Relation relation, ItemPointer tid,
-							 CommandId cid, Snapshot crosscheck, bool wait,
+							 TransactionId xid, CommandId cid,
+							 Snapshot crosscheck, bool wait,
 							 struct TM_FailureData *tmfd, bool changingPart,
 							 bool wal_logical);
 extern void heap_finish_speculative(Relation relation, ItemPointer tid);
 extern void heap_abort_speculative(Relation relation, ItemPointer tid);
 extern TM_Result heap_update(Relation relation, ItemPointer otid,
-							 HeapTuple newtup,
+							 HeapTuple newtup, TransactionId xid,
 							 CommandId cid, Snapshot crosscheck, bool wait,
 							 struct TM_FailureData *tmfd, LockTupleMode *lockmode,
-							 TU_UpdateIndexes *update_indexes, bool wal_logical);
+							 TU_UpdateIndexes *update_indexes,
+							 bool wal_logical);
 extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
 								 CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
 								 bool follow_updates,
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 4a508c57a50..242f8da770a 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -61,6 +61,8 @@ typedef struct ConcurrentChange
 	/* See the enum above. */
 	ConcurrentChangeKind kind;
 
+	/* Transaction that changes the data. */
+	TransactionId xid;
 	/*
 	 * The actual tuple.
 	 *
diff --git a/src/test/modules/injection_points/specs/repack.spec b/src/test/modules/injection_points/specs/repack.spec
index 75850334986..3711a7c92b9 100644
--- a/src/test/modules/injection_points/specs/repack.spec
+++ b/src/test/modules/injection_points/specs/repack.spec
@@ -86,9 +86,6 @@ step change_new
 # When applying concurrent data changes, we should see the effects of an
 # in-progress subtransaction.
 #
-# XXX Not sure this test is useful now - it was designed for the patch that
-# preserves tuple visibility and which therefore modifies
-# TransactionIdIsCurrentTransactionId().
 step change_subxact1
 {
 	BEGIN;
@@ -103,7 +100,6 @@ step change_subxact1
 # When applying concurrent data changes, we should not see the effects of a
 # rolled back subtransaction.
 #
-# XXX Is this test useful? See above.
 step change_subxact2
 {
 	BEGIN;
-- 
2.43.0



  [application/octet-stream] v21-0003-Refactor-index_concurrently_create_copy-for-use-.patch (4.1K, 3-v21-0003-Refactor-index_concurrently_create_copy-for-use-.patch)
  download | inline diff:
From 896f4fc90d128f0a8625f47b82b08eb0da145be7 Mon Sep 17 00:00:00 2001
From: Antonin Houska <[email protected]>
Date: Mon, 11 Aug 2025 15:31:34 +0200
Subject: [PATCH v21 3/6] Refactor index_concurrently_create_copy() for use
 with REPACK (CONCURRENTLY).

This patch moves the code to index_create_copy() and adds a "concurrently"
parameter so it can be used by REPACK (CONCURRENTLY).

With the CONCURRENTLY option, REPACK cannot simply swap the heap file and
rebuild its indexes. Instead, it needs to build a separate set of indexes
(including system catalog entries) *before* the actual swap, to reduce the
time AccessExclusiveLock needs to be held for.
---
 src/backend/catalog/index.c | 36 ++++++++++++++++++++++++++++--------
 src/include/catalog/index.h |  3 +++
 2 files changed, 31 insertions(+), 8 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 3063abff9a5..0dee1b1a9d8 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1290,15 +1290,31 @@ index_create(Relation heapRelation,
 /*
  * index_concurrently_create_copy
  *
- * Create concurrently an index based on the definition of the one provided by
- * caller.  The index is inserted into catalogs and needs to be built later
- * on.  This is called during concurrent reindex processing.
- *
- * "tablespaceOid" is the tablespace to use for this index.
+ * Variant of index_create_copy(), called during concurrent reindex
+ * processing.
  */
 Oid
 index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							   Oid tablespaceOid, const char *newName)
+{
+	return index_create_copy(heapRelation, oldIndexId, tablespaceOid, newName,
+							 true);
+}
+
+/*
+ * index_create_copy
+ *
+ * Create an index based on the definition of the one provided by caller.  The
+ * index is inserted into catalogs and needs to be built later on.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ *
+ * The actual implementation of index_concurrently_create_copy(), reusable for
+ * other purposes.
+ */
+Oid
+index_create_copy(Relation heapRelation, Oid oldIndexId, Oid tablespaceOid,
+				  const char *newName, bool concurrently)
 {
 	Relation	indexRelation;
 	IndexInfo  *oldInfo,
@@ -1317,6 +1333,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	List	   *indexColNames = NIL;
 	List	   *indexExprs = NIL;
 	List	   *indexPreds = NIL;
+	int			flags = 0;
 
 	indexRelation = index_open(oldIndexId, RowExclusiveLock);
 
@@ -1325,9 +1342,9 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 
 	/*
 	 * Concurrent build of an index with exclusion constraints is not
-	 * supported.
+	 * supported. If !concurrently, ii_ExclusionOps is currently not needed.
 	 */
-	if (oldInfo->ii_ExclusionOps != NULL)
+	if (oldInfo->ii_ExclusionOps != NULL && concurrently)
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("concurrent index creation for exclusion constraints is not supported")));
@@ -1435,6 +1452,9 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 		stattargets[i].isnull = isnull;
 	}
 
+	if (concurrently)
+		flags = INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT;
+
 	/*
 	 * Now create the new index.
 	 *
@@ -1458,7 +1478,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  indcoloptions->values,
 							  stattargets,
 							  reloptionsDatum,
-							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT,
+							  flags,
 							  0,
 							  true, /* allow table to be a system catalog? */
 							  false,	/* is_internal? */
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4daa8bef5ee..063a891351a 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -99,6 +99,9 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid oldIndexId,
 										   Oid tablespaceOid,
 										   const char *newName);
+extern Oid	index_create_copy(Relation heapRelation, Oid oldIndexId,
+							  Oid tablespaceOid, const char *newName,
+							  bool concurrently);
 
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
-- 
2.43.0



  [application/octet-stream] v21-0005-Add-CONCURRENTLY-option-to-REPACK-command.patch (147.2K, 4-v21-0005-Add-CONCURRENTLY-option-to-REPACK-command.patch)
From a9411b077bc121215b230556be5a114d5effd847 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=81lvaro=20Herrera?= <[email protected]>
Date: Sat, 30 Aug 2025 19:13:38 +0200
Subject: [PATCH v21 5/6] Add CONCURRENTLY option to REPACK command.

The REPACK command copies the relation data into a new file, creates new
indexes and eventually swaps the files. To make sure that the old file does
not change during the copying, the relation is locked in an exclusive mode,
which prevents applications from both reading and writing. (To keep the data
consistent, we'd only need to prevent the applications from writing, but even
reading needs to be blocked before we can swap the files - otherwise some
applications could continue using the old file. Since we should not request a
stronger lock without releasing the weaker one first, we acquire the exclusive
lock in the beginning and keep it till the end of the processing.)

This patch introduces an alternative workflow, which only requires the
exclusive lock when the relation (and index) files are being swapped.
(Supposedly, the swapping should be pretty fast.) On the other hand, when we
copy the data to the new file, we allow applications to read from the relation
and even to write to it.

First, we scan the relation using a "historic snapshot", and insert all the
tuples satisfying this snapshot into the new file.

Second, logical decoding is used to capture the data changes done by
applications during the copying (i.e. changes that do not satisfy the historic
snapshot mentioned above), and those are applied to the new file before we
acquire the exclusive lock that we need to swap the files. (Of course, more
data changes can take place while we are waiting for the lock - these will be
applied to the new file after we have acquired the lock, before we swap the
files.)

Since the logical decoding system, during its startup, waits until all the
transactions which already have XID assigned have finished, there is a risk of
deadlock if a transaction that already changed anything in the database tries
to acquire a conflicting lock on the table REPACK CONCURRENTLY is working
on. As an example, consider a transaction running the CREATE INDEX command on
the table that is being REPACKed CONCURRENTLY. On the other hand, DML commands
(INSERT, UPDATE, DELETE) are not a problem as their lock does not conflict
with REPACK CONCURRENTLY.

The current approach is that we accept the risk. If we tried to avoid it, it'd
be necessary to unlock the table before logical decoding is set up and lock
it again afterwards. Such temporary unlocking would imply re-checking if the
table still meets all the requirements for REPACK CONCURRENTLY.

Like the existing implementation of REPACK, the variant with the CONCURRENTLY
option also requires extra space for the new relation and index files
(which coexist with the old files for some time). In addition, the
CONCURRENTLY option might introduce a lag in releasing WAL segments for
archiving / recycling. This is due to the decoding of the data changes done by
applications concurrently. When copying the table contents into the new file,
we check the lag periodically. If it exceeds the size of a WAL segment, we
decode all the available WAL before resuming the copying. (Of course, the
changes are not applied until the entire table contents have been copied.) A
background worker might be a better approach for the decoding - let's consider
implementing it in the future.

The WAL records produced by running DML commands on the new relation do not
contain enough information to be processed by the logical decoding system. All
we need from the new relation is the file (relfilenode), while the actual
relation is eventually dropped. Thus there is no point in replaying the DMLs
anywhere.

Author: Antonin Houska <[email protected]>
---
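Reviewer's note (not part of the patch): the periodic lag check described in
the commit message can be sketched in standalone C as below. The type
SketchLsn and the function sketch_should_decode() are hypothetical names; the
strict ">" comparison follows the wording "exceeds the size of a WAL segment"
and is an assumption about the actual code, which may compare differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for a WAL position, modelled as a byte offset. */
typedef uint64_t SketchLsn;

/*
 * Returns true when the WAL written since the last decoding pass exceeds
 * one segment, i.e. when the copy loop should pause and decode the
 * available WAL so that the replication slot can advance.
 */
static bool
sketch_should_decode(SketchLsn end_of_wal, SketchLsn end_of_wal_prev,
					 uint64_t segment_size)
{
	assert(segment_size > 0);
	assert(end_of_wal >= end_of_wal_prev);

	return end_of_wal - end_of_wal_prev > segment_size;
}
```

In the copy loop the caller would remember the new position after each
decoding pass, so the lag is always measured from the last pass rather than
from the start of the copy.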
 doc/src/sgml/monitoring.sgml                  |   37 +-
 doc/src/sgml/mvcc.sgml                        |   12 +-
 doc/src/sgml/ref/repack.sgml                  |  129 +-
 src/Makefile                                  |    1 +
 src/backend/access/heap/heapam.c              |   34 +-
 src/backend/access/heap/heapam_handler.c      |  219 ++-
 src/backend/access/heap/rewriteheap.c         |    6 +-
 src/backend/access/transam/xact.c             |   11 +-
 src/backend/catalog/system_views.sql          |   30 +-
 src/backend/commands/cluster.c                | 1677 +++++++++++++++--
 src/backend/commands/matview.c                |    2 +-
 src/backend/commands/tablecmds.c              |    1 +
 src/backend/commands/vacuum.c                 |   12 +-
 src/backend/meson.build                       |    1 +
 src/backend/replication/logical/decode.c      |   83 +
 src/backend/replication/logical/snapbuild.c   |   21 +
 .../replication/pgoutput_repack/Makefile      |   32 +
 .../replication/pgoutput_repack/meson.build   |   18 +
 .../pgoutput_repack/pgoutput_repack.c         |  288 +++
 src/backend/storage/ipc/ipci.c                |    1 +
 .../storage/lmgr/generate-lwlocknames.pl      |    2 +-
 src/backend/utils/cache/relcache.c            |    1 +
 src/backend/utils/time/snapmgr.c              |    3 +-
 src/bin/psql/tab-complete.in.c                |   25 +-
 src/include/access/heapam.h                   |    9 +-
 src/include/access/heapam_xlog.h              |    2 +
 src/include/access/tableam.h                  |   10 +
 src/include/commands/cluster.h                |   91 +-
 src/include/commands/progress.h               |   23 +-
 src/include/replication/snapbuild.h           |    1 +
 src/include/storage/lockdefs.h                |    4 +-
 src/include/utils/snapmgr.h                   |    2 +
 src/test/modules/injection_points/Makefile    |    5 +-
 .../injection_points/expected/repack.out      |  113 ++
 .../modules/injection_points/logical.conf     |    1 +
 src/test/modules/injection_points/meson.build |    4 +
 .../injection_points/specs/repack.spec        |  143 ++
 src/test/regress/expected/rules.out           |   29 +-
 src/tools/pgindent/typedefs.list              |    4 +
 39 files changed, 2816 insertions(+), 271 deletions(-)
 create mode 100644 src/backend/replication/pgoutput_repack/Makefile
 create mode 100644 src/backend/replication/pgoutput_repack/meson.build
 create mode 100644 src/backend/replication/pgoutput_repack/pgoutput_repack.c
 create mode 100644 src/test/modules/injection_points/expected/repack.out
 create mode 100644 src/test/modules/injection_points/logical.conf
 create mode 100644 src/test/modules/injection_points/specs/repack.spec

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 12e103d319d..61c0197555f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6074,14 +6074,35 @@ FROM pg_stat_get_backend_idset() AS backendid;
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
-       <structfield>heap_tuples_written</structfield> <type>bigint</type>
+       <structfield>heap_tuples_inserted</structfield> <type>bigint</type>
       </para>
       <para>
-       Number of heap tuples written.
+       Number of heap tuples inserted.
        This counter only advances when the phase is
        <literal>seq scanning heap</literal>,
-       <literal>index scanning heap</literal>
-       or <literal>writing new heap</literal>.
+       <literal>index scanning heap</literal>,
+       <literal>writing new heap</literal>
+       or <literal>catch-up</literal>.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>heap_tuples_updated</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of heap tuples updated.
+       This counter only advances when the phase is <literal>catch-up</literal>.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>heap_tuples_deleted</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of heap tuples deleted.
+       This counter only advances when the phase is <literal>catch-up</literal>.
       </para></entry>
      </row>
 
@@ -6162,6 +6183,14 @@ FROM pg_stat_get_backend_idset() AS backendid;
        <command>REPACK</command> is currently writing the new heap.
      </entry>
     </row>
+    <row>
+     <entry><literal>catch-up</literal></entry>
+     <entry>
+       <command>REPACK CONCURRENTLY</command> is currently processing the DML
+       commands that other transactions executed during any of the preceding
+       phases.
+     </entry>
+    </row>
     <row>
      <entry><literal>swapping relation files</literal></entry>
      <entry>
diff --git a/doc/src/sgml/mvcc.sgml b/doc/src/sgml/mvcc.sgml
index 049ee75a4ba..0f5c34af542 100644
--- a/doc/src/sgml/mvcc.sgml
+++ b/doc/src/sgml/mvcc.sgml
@@ -1833,15 +1833,17 @@ SELECT pg_advisory_lock(q.id) FROM
    <title>Caveats</title>
 
    <para>
-    Some DDL commands, currently only <link linkend="sql-truncate"><command>TRUNCATE</command></link> and the
-    table-rewriting forms of <link linkend="sql-altertable"><command>ALTER TABLE</command></link>, are not
+    Some commands, currently only <link linkend="sql-truncate"><command>TRUNCATE</command></link>, the
+    table-rewriting forms of <link linkend="sql-altertable"><command>ALTER
+    TABLE</command></link> and <command>REPACK</command> with
+    the <literal>CONCURRENTLY</literal> option, are not
     MVCC-safe.  This means that after the truncation or rewrite commits, the
     table will appear empty to concurrent transactions, if they are using a
-    snapshot taken before the DDL command committed.  This will only be an
+    snapshot taken before the command committed.  This will only be an
     issue for a transaction that did not access the table in question
-    before the DDL command started &mdash; any transaction that has done so
+    before the command started &mdash; any transaction that has done so
     would hold at least an <literal>ACCESS SHARE</literal> table lock,
-    which would block the DDL command until that transaction completes.
+    which would block the truncating or rewriting command until that transaction completes.
     So these commands will not cause any apparent inconsistency in the
     table contents for successive queries on the target table, but they
     could cause visible inconsistency between the contents of the target
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
index fd9d89f8aaa..ff5ce48de55 100644
--- a/doc/src/sgml/ref/repack.sgml
+++ b/doc/src/sgml/ref/repack.sgml
@@ -27,6 +27,7 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
 
     VERBOSE [ <replaceable class="parameter">boolean</replaceable> ]
     ANALYSE | ANALYZE
+    CONCURRENTLY
 </synopsis>
  </refsynopsisdiv>
 
@@ -49,7 +50,8 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
    processes every table and materialized view in the current database that
    the current user has the <literal>MAINTAIN</literal> privilege on. This
    form of <command>REPACK</command> cannot be executed inside a transaction
-   block.
+   block.  Also, this form is not allowed if
+   the <literal>CONCURRENTLY</literal> option is used.
   </para>
 
   <para>
@@ -62,7 +64,8 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
    When a table is being repacked, an <literal>ACCESS EXCLUSIVE</literal> lock
    is acquired on it. This prevents any other database operations (both reads
    and writes) from operating on the table until the <command>REPACK</command>
-   is finished.
+   is finished. If you want to keep the table accessible during the repacking,
+   consider using the <literal>CONCURRENTLY</literal> option.
   </para>
 
   <refsect2 id="sql-repack-notes-on-clustering" xreflabel="Notes on Clustering">
@@ -179,6 +182,128 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term><literal>CONCURRENTLY</literal></term>
+    <listitem>
+     <para>
+      Allow other transactions to use the table while it is being repacked.
+     </para>
+
+     <para>
+      Internally, <command>REPACK</command> copies the contents of the table
+      (ignoring dead tuples) into a new file, sorted by the specified index,
+      and also creates a new file for each index. Then it swaps the old and
+      new files for the table and all the indexes, and deletes the old
+      files. The <literal>ACCESS EXCLUSIVE</literal> lock is needed to make
+      sure that the old files do not change during the processing because the
+      changes would get lost due to the swap.
+     </para>
+
+     <para>
+      With the <literal>CONCURRENTLY</literal> option, the <literal>ACCESS
+      EXCLUSIVE</literal> lock is only acquired to swap the table and index
+      files. The data changes that took place during the creation of the new
+      table and index files are captured using logical decoding
+      (<xref linkend="logicaldecoding"/>) and applied before
+      the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
+      is typically held only for the time needed to swap the files, which
+      should be pretty short. However, the time might still be noticeable if
+      too many data changes have been done to the table while
+      <command>REPACK</command> was waiting for the lock: those changes must
+      be processed just before the files are swapped, while the
+      <literal>ACCESS EXCLUSIVE</literal> lock is being held.
+     </para>
+
+     <para>
+      Note that <command>REPACK</command> with
+      the <literal>CONCURRENTLY</literal> option does not try to order the
+      rows inserted into the table after the repacking started. Also
+      note that <command>REPACK</command> might fail to complete due to DDL
+      commands executed on the table by other transactions during the
+      repacking.
+     </para>
+
+     <note>
+      <para>
+       In addition to the temporary space requirements explained in
+       <xref linkend="sql-repack-notes-on-resources"/>,
+       the <literal>CONCURRENTLY</literal> option can increase the usage of
+       temporary space somewhat. The reason is that other transactions can
+       perform DML operations which cannot be applied to the new file until
+       <command>REPACK</command> has copied all the tuples from the old
+       file. Thus the tuples inserted into the old file during the copying are
+       also stored separately in a temporary file, so they can eventually be
+       applied to the new file.
+      </para>
+
+      <para>
+       Furthermore, the data changes performed during the copying are
+       extracted from the <link linkend="wal">write-ahead log</link> (WAL), and
+       this extraction (decoding) only takes place when a certain amount of WAL
+       has been written. Therefore, WAL removal can be delayed by this
+       threshold. Currently the threshold is equal to the value of
+       the <link linkend="guc-wal-segment-size"><varname>wal_segment_size</varname></link>
+       configuration parameter.
+      </para>
+     </note>
+
+     <para>
+      The <literal>CONCURRENTLY</literal> option cannot be used in the
+      following cases:
+
+      <itemizedlist>
+       <listitem>
+        <para>
+          The table is <literal>UNLOGGED</literal>.
+        </para>
+       </listitem>
+
+       <listitem>
+        <para>
+          The table is partitioned.
+        </para>
+       </listitem>
+
+       <listitem>
+        <para>
+          The table is a system catalog or a <acronym>TOAST</acronym> table.
+        </para>
+       </listitem>
+
+       <listitem>
+        <para>
+         <command>REPACK</command> is executed inside a transaction block.
+        </para>
+       </listitem>
+
+       <listitem>
+        <para>
+          The <link linkend="guc-wal-level"><varname>wal_level</varname></link>
+          configuration parameter is less than <literal>logical</literal>.
+        </para>
+       </listitem>
+
+       <listitem>
+        <para>
+         The <link linkend="guc-max-replication-slots"><varname>max_replication_slots</varname></link>
+         configuration parameter does not allow for creation of an additional
+         replication slot.
+        </para>
+       </listitem>
+      </itemizedlist>
+     </para>
+
+     <warning>
+      <para>
+       <command>REPACK</command> with the <literal>CONCURRENTLY</literal>
+       option is not MVCC-safe, see <xref linkend="mvcc-caveats"/> for
+       details.
+      </para>
+     </warning>
+
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term><literal>VERBOSE</literal></term>
     <listitem>
diff --git a/src/Makefile b/src/Makefile
index 2f31a2f20a7..b18c9a14ffa 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -23,6 +23,7 @@ SUBDIRS = \
 	interfaces \
 	backend/replication/libpqwalreceiver \
 	backend/replication/pgoutput \
+	backend/replication/pgoutput_repack \
 	fe_utils \
 	bin \
 	pl \
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e3e7307ef5f..f9a4fe3faed 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -60,7 +60,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
 static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
 								  Buffer newbuf, HeapTuple oldtup,
 								  HeapTuple newtup, HeapTuple old_key_tuple,
-								  bool all_visible_cleared, bool new_all_visible_cleared);
+								  bool all_visible_cleared, bool new_all_visible_cleared,
+								  bool wal_logical);
 #ifdef USE_ASSERT_CHECKING
 static void check_lock_if_inplace_updateable_rel(Relation relation,
 												 ItemPointer otid,
@@ -2780,7 +2781,7 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
 TM_Result
 heap_delete(Relation relation, ItemPointer tid,
 			CommandId cid, Snapshot crosscheck, bool wait,
-			TM_FailureData *tmfd, bool changingPart)
+			TM_FailureData *tmfd, bool changingPart, bool wal_logical)
 {
 	TM_Result	result;
 	TransactionId xid = GetCurrentTransactionId();
@@ -3027,7 +3028,8 @@ l1:
 	 * Compute replica identity tuple before entering the critical section so
 	 * we don't PANIC upon a memory allocation failure.
 	 */
-	old_key_tuple = ExtractReplicaIdentity(relation, &tp, true, &old_key_copied);
+	old_key_tuple = wal_logical ?
+		ExtractReplicaIdentity(relation, &tp, true, &old_key_copied) : NULL;
 
 	/*
 	 * If this is the first possibly-multixact-able operation in the current
@@ -3117,6 +3119,15 @@ l1:
 				xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
 		}
 
+		/*
+		 * Unlike UPDATE, DELETE is decoded even if there is no old key, so it
+		 * does not help to clear both XLH_DELETE_CONTAINS_OLD_TUPLE and
+		 * XLH_DELETE_CONTAINS_OLD_KEY. Thus we need an extra flag. TODO
+		 * Consider not decoding tuples w/o the old tuple/key instead.
+		 */
+		if (!wal_logical)
+			xlrec.flags |= XLH_DELETE_NO_LOGICAL;
+
 		XLogBeginInsert();
 		XLogRegisterData(&xlrec, SizeOfHeapDelete);
 
@@ -3209,7 +3220,8 @@ simple_heap_delete(Relation relation, ItemPointer tid)
 	result = heap_delete(relation, tid,
 						 GetCurrentCommandId(true), InvalidSnapshot,
 						 true /* wait for commit */ ,
-						 &tmfd, false /* changingPart */ );
+						 &tmfd, false,	/* changingPart */
+						 true /* wal_logical */ );
 	switch (result)
 	{
 		case TM_SelfModified:
@@ -3250,7 +3262,7 @@ TM_Result
 heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
 			CommandId cid, Snapshot crosscheck, bool wait,
 			TM_FailureData *tmfd, LockTupleMode *lockmode,
-			TU_UpdateIndexes *update_indexes)
+			TU_UpdateIndexes *update_indexes, bool wal_logical)
 {
 	TM_Result	result;
 	TransactionId xid = GetCurrentTransactionId();
@@ -4143,7 +4155,8 @@ l2:
 								 newbuf, &oldtup, heaptup,
 								 old_key_tuple,
 								 all_visible_cleared,
-								 all_visible_cleared_new);
+								 all_visible_cleared_new,
+								 wal_logical);
 		if (newbuf != buffer)
 		{
 			PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -4501,7 +4514,8 @@ simple_heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
 	result = heap_update(relation, otid, tup,
 						 GetCurrentCommandId(true), InvalidSnapshot,
 						 true /* wait for commit */ ,
-						 &tmfd, &lockmode, update_indexes);
+						 &tmfd, &lockmode, update_indexes,
+						 true /* wal_logical */ );
 	switch (result)
 	{
 		case TM_SelfModified:
@@ -8842,7 +8856,8 @@ static XLogRecPtr
 log_heap_update(Relation reln, Buffer oldbuf,
 				Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
 				HeapTuple old_key_tuple,
-				bool all_visible_cleared, bool new_all_visible_cleared)
+				bool all_visible_cleared, bool new_all_visible_cleared,
+				bool wal_logical)
 {
 	xl_heap_update xlrec;
 	xl_heap_header xlhdr;
@@ -8853,7 +8868,8 @@ log_heap_update(Relation reln, Buffer oldbuf,
 				suffixlen = 0;
 	XLogRecPtr	recptr;
 	Page		page = BufferGetPage(newbuf);
-	bool		need_tuple_data = RelationIsLogicallyLogged(reln);
+	bool		need_tuple_data = RelationIsLogicallyLogged(reln) &&
+		wal_logical;
 	bool		init;
 	int			bufflags;
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 79f9de5d760..d03084768e0 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -33,6 +33,7 @@
 #include "catalog/index.h"
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
+#include "commands/cluster.h"
 #include "commands/progress.h"
 #include "executor/executor.h"
 #include "miscadmin.h"
@@ -309,7 +310,8 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
 	 * the storage itself is cleaning the dead tuples by itself, it is the
 	 * time to call the index tuple deletion also.
 	 */
-	return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart);
+	return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart,
+					   true);
 }
 
 
@@ -328,7 +330,7 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 	tuple->t_tableOid = slot->tts_tableOid;
 
 	result = heap_update(relation, otid, tuple, cid, crosscheck, wait,
-						 tmfd, lockmode, update_indexes);
+						 tmfd, lockmode, update_indexes, true);
 	ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
 
 	/*
@@ -685,13 +687,15 @@ static void
 heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 								 Relation OldIndex, bool use_sort,
 								 TransactionId OldestXmin,
+								 Snapshot snapshot,
+								 LogicalDecodingContext *decoding_ctx,
 								 TransactionId *xid_cutoff,
 								 MultiXactId *multi_cutoff,
 								 double *num_tuples,
 								 double *tups_vacuumed,
 								 double *tups_recently_dead)
 {
-	RewriteState rwstate;
+	RewriteState rwstate = NULL;
 	IndexScanDesc indexScan;
 	TableScanDesc tableScan;
 	HeapScanDesc heapScan;
@@ -705,6 +709,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 	bool	   *isnull;
 	BufferHeapTupleTableSlot *hslot;
 	BlockNumber prev_cblock = InvalidBlockNumber;
+	bool		concurrent = snapshot != NULL;
+	XLogRecPtr	end_of_wal_prev = GetFlushRecPtr(NULL);
 
 	/* Remember if it's a system catalog */
 	is_system_catalog = IsSystemRelation(OldHeap);
@@ -720,9 +726,12 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 	values = (Datum *) palloc(natts * sizeof(Datum));
 	isnull = (bool *) palloc(natts * sizeof(bool));
 
-	/* Initialize the rewrite operation */
-	rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
-								 *multi_cutoff);
+	/*
+	 * Initialize the rewrite operation.
+	 */
+	if (!concurrent)
+		rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin,
+									 *xid_cutoff, *multi_cutoff);
 
 
 	/* Set up sorting if wanted */
@@ -737,6 +746,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 	 * Prepare to scan the OldHeap.  To ensure we see recently-dead tuples
 	 * that still need to be copied, we scan with SnapshotAny and use
 	 * HeapTupleSatisfiesVacuum for the visibility test.
+	 *
+	 * In the CONCURRENTLY case, we do regular MVCC visibility tests, using
+	 * the snapshot passed by the caller.
 	 */
 	if (OldIndex != NULL && !use_sort)
 	{
@@ -753,7 +765,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
 		tableScan = NULL;
 		heapScan = NULL;
-		indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, NULL, 0, 0);
+		indexScan = index_beginscan(OldHeap, OldIndex,
+									snapshot ? snapshot : SnapshotAny,
+									NULL, 0, 0);
 		index_rescan(indexScan, NULL, 0, NULL, 0);
 	}
 	else
@@ -762,7 +776,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 		pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
 									 PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP);
 
-		tableScan = table_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
+		tableScan = table_beginscan(OldHeap,
+									snapshot ? snapshot : SnapshotAny,
+									0, (ScanKey) NULL);
 		heapScan = (HeapScanDesc) tableScan;
 		indexScan = NULL;
 
@@ -785,6 +801,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 		HeapTuple	tuple;
 		Buffer		buf;
 		bool		isdead;
+		HTSV_Result vis;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -837,70 +854,84 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 		tuple = ExecFetchSlotHeapTuple(slot, false, NULL);
 		buf = hslot->buffer;
 
-		LockBuffer(buf, BUFFER_LOCK_SHARE);
-
-		switch (HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf))
+		/*
+		 * Regarding CONCURRENTLY, see the comments on MVCC snapshot above.
+		 */
+		if (!concurrent)
 		{
-			case HEAPTUPLE_DEAD:
-				/* Definitely dead */
-				isdead = true;
-				break;
-			case HEAPTUPLE_RECENTLY_DEAD:
-				*tups_recently_dead += 1;
-				/* fall through */
-			case HEAPTUPLE_LIVE:
-				/* Live or recently dead, must copy it */
-				isdead = false;
-				break;
-			case HEAPTUPLE_INSERT_IN_PROGRESS:
+			LockBuffer(buf, BUFFER_LOCK_SHARE);
 
-				/*
-				 * Since we hold exclusive lock on the relation, normally the
-				 * only way to see this is if it was inserted earlier in our
-				 * own transaction.  However, it can happen in system
-				 * catalogs, since we tend to release write lock before commit
-				 * there.  Give a warning if neither case applies; but in any
-				 * case we had better copy it.
-				 */
-				if (!is_system_catalog &&
-					!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple->t_data)))
-					elog(WARNING, "concurrent insert in progress within table \"%s\"",
-						 RelationGetRelationName(OldHeap));
-				/* treat as live */
-				isdead = false;
-				break;
-			case HEAPTUPLE_DELETE_IN_PROGRESS:
+			switch ((vis = HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf)))
+			{
+				case HEAPTUPLE_DEAD:
+					/* Definitely dead */
+					isdead = true;
+					break;
+				case HEAPTUPLE_RECENTLY_DEAD:
+					*tups_recently_dead += 1;
+					/* fall through */
+				case HEAPTUPLE_LIVE:
+					/* Live or recently dead, must copy it */
+					isdead = false;
+					break;
+				case HEAPTUPLE_INSERT_IN_PROGRESS:
 
-				/*
-				 * Similar situation to INSERT_IN_PROGRESS case.
-				 */
-				if (!is_system_catalog &&
-					!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(tuple->t_data)))
-					elog(WARNING, "concurrent delete in progress within table \"%s\"",
-						 RelationGetRelationName(OldHeap));
-				/* treat as recently dead */
-				*tups_recently_dead += 1;
-				isdead = false;
-				break;
-			default:
-				elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
-				isdead = false; /* keep compiler quiet */
-				break;
-		}
+					/*
+					 * As long as we hold exclusive lock on the relation,
+					 * normally the only way to see this is if it was inserted
+					 * earlier in our own transaction.  However, it can happen
+					 * in system catalogs, since we tend to release write lock
+					 * before commit there. Also, there's no exclusive lock
+					 * during concurrent processing. Give a warning if neither
+					 * case applies; but in any case we had better copy it.
+					 */
+					if (!is_system_catalog && !concurrent &&
+						!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple->t_data)))
+						elog(WARNING, "concurrent insert in progress within table \"%s\"",
+							 RelationGetRelationName(OldHeap));
+					/* treat as live */
+					isdead = false;
+					break;
+				case HEAPTUPLE_DELETE_IN_PROGRESS:
 
-		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+					/*
+					 * Similar situation to INSERT_IN_PROGRESS case.
+					 */
+					if (!is_system_catalog && !concurrent &&
+						!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(tuple->t_data)))
+						elog(WARNING, "concurrent delete in progress within table \"%s\"",
+							 RelationGetRelationName(OldHeap));
+					/* treat as recently dead */
+					*tups_recently_dead += 1;
+					isdead = false;
+					break;
+				default:
+					elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
+					isdead = false; /* keep compiler quiet */
+					break;
+			}
 
-		if (isdead)
-		{
-			*tups_vacuumed += 1;
-			/* heap rewrite module still needs to see it... */
-			if (rewrite_heap_dead_tuple(rwstate, tuple))
+			if (isdead)
 			{
-				/* A previous recently-dead tuple is now known dead */
 				*tups_vacuumed += 1;
-				*tups_recently_dead -= 1;
+				/* heap rewrite module still needs to see it... */
+				if (rewrite_heap_dead_tuple(rwstate, tuple))
+				{
+					/* A previous recently-dead tuple is now known dead */
+					*tups_vacuumed += 1;
+					*tups_recently_dead -= 1;
+				}
+
+				LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+				continue;
 			}
-			continue;
+
+			/*
+			 * In the concurrent case, we have a copy of the tuple, so we
+			 * don't worry whether the source tuple will be deleted / updated
+			 * after we release the lock.
+			 */
+			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 		}
 
 		*num_tuples += 1;
@@ -919,7 +950,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 		{
 			const int	ct_index[] = {
 				PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
-				PROGRESS_REPACK_HEAP_TUPLES_WRITTEN
+				PROGRESS_REPACK_HEAP_TUPLES_INSERTED
 			};
 			int64		ct_val[2];
 
@@ -934,6 +965,31 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 			ct_val[1] = *num_tuples;
 			pgstat_progress_update_multi_param(2, ct_index, ct_val);
 		}
+
+		/*
+		 * Process the WAL produced by the load, as well as by other
+		 * transactions, so that the replication slot can advance and WAL does
+		 * not pile up. Use wal_segment_size as a threshold so that we do not
+		 * introduce the decoding overhead too often.
+		 *
+		 * Of course, we must not apply the changes until the initial load has
+		 * completed.
+		 *
+		 * Note that our insertions into the new table should not be decoded
+		 * as we (intentionally) do not write the logical decoding specific
+		 * information to WAL.
+		 */
+		if (concurrent)
+		{
+			XLogRecPtr	end_of_wal;
+
+			end_of_wal = GetFlushRecPtr(NULL);
+			if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
+			{
+				repack_decode_concurrent_changes(decoding_ctx, end_of_wal);
+				end_of_wal_prev = end_of_wal;
+			}
+		}
 	}
 
 	if (indexScan != NULL)
@@ -977,7 +1033,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 									 values, isnull,
 									 rwstate);
 			/* Report n_tuples */
-			pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_WRITTEN,
+			pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_INSERTED,
 										 n_tuples);
 		}
 
@@ -985,7 +1041,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 	}
 
 	/* Write out any remaining tuples, and fsync if needed */
-	end_heap_rewrite(rwstate);
+	if (rwstate)
+		end_heap_rewrite(rwstate);
 
 	/* Clean up */
 	pfree(values);
@@ -2376,6 +2433,10 @@ heapam_scan_sample_next_tuple(TableScanDesc scan, SampleScanState *scanstate,
  * SET WITHOUT OIDS.
  *
  * So, we must reconstruct the tuple from component Datums.
+ *
+ * If rwstate=NULL, use simple_heap_insert() instead of rewriting - in that
+ * case we still need to deform/form the tuple. TODO Shouldn't we rename the
+ * function, as it might not do any rewrite?
  */
 static void
 reform_and_rewrite_tuple(HeapTuple tuple,
@@ -2398,8 +2459,28 @@ reform_and_rewrite_tuple(HeapTuple tuple,
 
 	copiedTuple = heap_form_tuple(newTupDesc, values, isnull);
 
-	/* The heap rewrite module does the rest */
-	rewrite_heap_tuple(rwstate, tuple, copiedTuple);
+	if (rwstate)
+		/* The heap rewrite module does the rest */
+		rewrite_heap_tuple(rwstate, tuple, copiedTuple);
+	else
+	{
+		/*
+		 * Insert tuple when processing REPACK CONCURRENTLY.
+		 *
+		 * rewriteheap.c is not used in the CONCURRENTLY case because it'd be
+		 * difficult to do the same in the catch-up phase (as the logical
+		 * decoding does not provide us with sufficient visibility
+		 * information). Thus we must use heap_insert() both during the
+		 * catch-up and here.
+		 *
+		 * The following is like simple_heap_insert() except that we pass the
+		 * flag to skip logical decoding: as soon as REPACK CONCURRENTLY swaps
+		 * the relation files, it drops this relation, so no logical
+		 * replication subscription should need the data.
+		 */
+		heap_insert(NewHeap, copiedTuple, GetCurrentCommandId(true),
+					HEAP_INSERT_NO_LOGICAL, NULL);
+	}
 
 	heap_freetuple(copiedTuple);
 }
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index e6d2b5fced1..6aa2ed214f2 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -617,9 +617,9 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
 		int			options = HEAP_INSERT_SKIP_FSM;
 
 		/*
-		 * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
-		 * for the TOAST table are not logically decoded.  The main heap is
-		 * WAL-logged as XLOG FPI records, which are not logically decoded.
+		 * While rewriting the heap for REPACK, make sure data for the TOAST
+		 * table are not logically decoded.  The main heap is WAL-logged as
+		 * XLOG FPI records, which are not logically decoded.
 		 */
 		options |= HEAP_INSERT_NO_LOGICAL;
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b46e7e9c2a6..5670f2bfbde 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -215,6 +215,7 @@ typedef struct TransactionStateData
 	bool		parallelChildXact;	/* is any parent transaction parallel? */
 	bool		chain;			/* start a new block after this one */
 	bool		topXidLogged;	/* for a subxact: is top-level XID logged? */
+	bool		internal;		/* for a subxact: launched internally? */
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -4735,6 +4736,7 @@ BeginInternalSubTransaction(const char *name)
 			/* Normal subtransaction start */
 			PushTransaction();
 			s = CurrentTransactionState;	/* changed by push */
+			s->internal = true;
 
 			/*
 			 * Savepoint names, like the TransactionState block itself, live
@@ -5251,7 +5253,13 @@ AbortSubTransaction(void)
 	LWLockReleaseAll();
 
 	pgstat_report_wait_end();
-	pgstat_progress_end_command();
+
+	/*
+	 * An internal subtransaction may be used by a user command, in which case
+	 * the command outlives the subtransaction.
+	 */
+	if (!s->internal)
+		pgstat_progress_end_command();
 
 	pgaio_error_cleanup();
 
@@ -5468,6 +5476,7 @@ PushTransaction(void)
 	s->parallelModeLevel = 0;
 	s->parallelChildXact = (p->parallelModeLevel != 0 || p->parallelChildXact);
 	s->topXidLogged = false;
+	s->internal = false;
 
 	CurrentTransactionState = s;
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b2b7b10c2be..a92ac78ad9e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1266,16 +1266,17 @@ CREATE VIEW pg_stat_progress_cluster AS
                       WHEN 2 THEN 'index scanning heap'
                       WHEN 3 THEN 'sorting tuples'
                       WHEN 4 THEN 'writing new heap'
-                      WHEN 5 THEN 'swapping relation files'
-                      WHEN 6 THEN 'rebuilding index'
-                      WHEN 7 THEN 'performing final cleanup'
+                      -- 5 is 'catch-up', but that should not appear here.
+                      WHEN 6 THEN 'swapping relation files'
+                      WHEN 7 THEN 'rebuilding index'
+                      WHEN 8 THEN 'performing final cleanup'
                       END AS phase,
         CAST(S.param3 AS oid) AS cluster_index_relid,
         S.param4 AS heap_tuples_scanned,
         S.param5 AS heap_tuples_written,
-        S.param6 AS heap_blks_total,
-        S.param7 AS heap_blks_scanned,
-        S.param8 AS index_rebuild_count
+        S.param8 AS heap_blks_total,
+        S.param9 AS heap_blks_scanned,
+        S.param10 AS index_rebuild_count
     FROM pg_stat_get_progress_info('CLUSTER') AS S
         LEFT JOIN pg_database D ON S.datid = D.oid;
 
@@ -1291,16 +1292,19 @@ CREATE VIEW pg_stat_progress_repack AS
                       WHEN 2 THEN 'index scanning heap'
                       WHEN 3 THEN 'sorting tuples'
                       WHEN 4 THEN 'writing new heap'
-                      WHEN 5 THEN 'swapping relation files'
-                      WHEN 6 THEN 'rebuilding index'
-                      WHEN 7 THEN 'performing final cleanup'
+                      WHEN 5 THEN 'catch-up'
+                      WHEN 6 THEN 'swapping relation files'
+                      WHEN 7 THEN 'rebuilding index'
+                      WHEN 8 THEN 'performing final cleanup'
                       END AS phase,
         CAST(S.param3 AS oid) AS repack_index_relid,
         S.param4 AS heap_tuples_scanned,
-        S.param5 AS heap_tuples_written,
-        S.param6 AS heap_blks_total,
-        S.param7 AS heap_blks_scanned,
-        S.param8 AS index_rebuild_count
+        S.param5 AS heap_tuples_inserted,
+        S.param6 AS heap_tuples_updated,
+        S.param7 AS heap_tuples_deleted,
+        S.param8 AS heap_blks_total,
+        S.param9 AS heap_blks_scanned,
+        S.param10 AS index_rebuild_count
     FROM pg_stat_get_progress_info('REPACK') AS S
         LEFT JOIN pg_database D ON S.datid = D.oid;
 
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 8b64f9e6795..61224a3adf2 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -25,6 +25,10 @@
 #include "access/toast_internals.h"
 #include "access/transam.h"
 #include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
 #include "catalog/heap.h"
@@ -32,6 +36,7 @@
 #include "catalog/namespace.h"
 #include "catalog/objectaccess.h"
 #include "catalog/pg_am.h"
+#include "catalog/pg_control.h"
 #include "catalog/pg_inherits.h"
 #include "catalog/toasting.h"
 #include "commands/cluster.h"
@@ -39,15 +44,21 @@
 #include "commands/progress.h"
 #include "commands/tablecmds.h"
 #include "commands/vacuum.h"
+#include "executor/executor.h"
 #include "miscadmin.h"
 #include "optimizer/optimizer.h"
 #include "pgstat.h"
+#include "replication/decode.h"
+#include "replication/logical.h"
+#include "replication/snapbuild.h"
 #include "storage/bufmgr.h"
+#include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
 #include "utils/acl.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
+#include "utils/injection_point.h"
 #include "utils/inval.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -67,13 +78,45 @@ typedef struct
 	Oid			indexOid;
 } RelToCluster;
 
+/*
+ * The following definitions are used for concurrent processing.
+ */
+
+/*
+ * The locators are used to avoid logical decoding of data that we do not need
+ * for our table.
+ */
+RelFileLocator repacked_rel_locator = {.relNumber = InvalidOid};
+RelFileLocator repacked_rel_toast_locator = {.relNumber = InvalidOid};
+
+/*
+ * Everything we need to call ExecInsertIndexTuples().
+ */
+typedef struct IndexInsertState
+{
+	ResultRelInfo *rri;
+	EState	   *estate;
+
+	Relation	ident_index;
+} IndexInsertState;
+
+/* The WAL segment being decoded. */
+static XLogSegNo repack_current_segment = 0;
+
+
 static bool cluster_rel_recheck(RepackCommand cmd, Relation OldHeap,
-								Oid indexOid, Oid userid, int options);
+								Oid indexOid, Oid userid, LOCKMODE lmode,
+								int options);
+static void check_repack_concurrently_requirements(Relation rel);
 static void rebuild_relation(RepackCommand cmd, bool usingindex,
-							 Relation OldHeap, Relation index, bool verbose);
+							 Relation OldHeap, Relation index, Oid userid,
+							 bool verbose, bool concurrent);
 static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
-							bool verbose, bool *pSwapToastByContent,
-							TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
+							Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
+							bool verbose,
+							bool *pSwapToastByContent,
+							TransactionId *pFreezeXid,
+							MultiXactId *pCutoffMulti);
 static List *get_tables_to_repack(RepackCommand cmd, bool usingindex,
 								  MemoryContext permcxt);
 static List *get_tables_to_repack_partitioned(RepackCommand cmd,
@@ -81,12 +124,61 @@ static List *get_tables_to_repack_partitioned(RepackCommand cmd,
 											  Oid relid, bool rel_is_index);
 static bool cluster_is_permitted_for_relation(RepackCommand cmd,
 											  Oid relid, Oid userid);
+
+static void begin_concurrent_repack(Relation rel);
+static void end_concurrent_repack(void);
+static LogicalDecodingContext *setup_logical_decoding(Oid relid,
+													  const char *slotname,
+													  TupleDesc tupdesc);
+static HeapTuple get_changed_tuple(char *change);
+static void apply_concurrent_changes(RepackDecodingState *dstate,
+									 Relation rel, ScanKey key, int nkeys,
+									 IndexInsertState *iistate);
+static void apply_concurrent_insert(Relation rel, ConcurrentChange *change,
+									HeapTuple tup, IndexInsertState *iistate,
+									TupleTableSlot *index_slot);
+static void apply_concurrent_update(Relation rel, HeapTuple tup,
+									HeapTuple tup_target,
+									ConcurrentChange *change,
+									IndexInsertState *iistate,
+									TupleTableSlot *index_slot);
+static void apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+									ConcurrentChange *change);
+static HeapTuple find_target_tuple(Relation rel, ScanKey key, int nkeys,
+								   HeapTuple tup_key,
+								   IndexInsertState *iistate,
+								   TupleTableSlot *ident_slot,
+								   IndexScanDesc *scan_p);
+static void process_concurrent_changes(LogicalDecodingContext *ctx,
+									   XLogRecPtr end_of_wal,
+									   Relation rel_dst,
+									   Relation rel_src,
+									   ScanKey ident_key,
+									   int ident_key_nentries,
+									   IndexInsertState *iistate);
+static IndexInsertState *get_index_insert_state(Relation relation,
+												Oid ident_index_id);
+static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
+								  int *nentries);
+static void free_index_insert_state(IndexInsertState *iistate);
+static void cleanup_logical_decoding(LogicalDecodingContext *ctx);
+static void rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+											   Relation cl_index,
+											   LogicalDecodingContext *ctx,
+											   bool swap_toast_by_content,
+											   TransactionId frozenXid,
+											   MultiXactId cutoffMulti);
+static List *build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes);
 static Relation process_single_relation(RepackStmt *stmt,
+										LOCKMODE lockmode,
+										bool isTopLevel,
 										ClusterParams *params);
 static Oid	determine_clustered_index(Relation rel, bool usingindex,
 									  const char *indexname);
 
 
+#define REPL_PLUGIN_NAME   "pgoutput_repack"
+
 static const char *
 RepackCommandAsString(RepackCommand cmd)
 {
@@ -95,7 +187,7 @@ RepackCommandAsString(RepackCommand cmd)
 		case REPACK_COMMAND_REPACK:
 			return "REPACK";
 		case REPACK_COMMAND_VACUUMFULL:
-			return "VACUUM";
+			return "VACUUM (FULL)";
 		case REPACK_COMMAND_CLUSTER:
 			return "CLUSTER";
 	}
@@ -132,6 +224,7 @@ ExecRepack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
 	ClusterParams params = {0};
 	Relation	rel = NULL;
 	MemoryContext repack_context;
+	LOCKMODE	lockmode;
 	List	   *rtcs;
 
 	/* Parse option list */
@@ -142,6 +235,16 @@ ExecRepack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
 		else if (strcmp(opt->defname, "analyze") == 0 ||
 				 strcmp(opt->defname, "analyse") == 0)
 			params.options |= defGetBoolean(opt) ? CLUOPT_ANALYZE : 0;
+		else if (strcmp(opt->defname, "concurrently") == 0 &&
+				 defGetBoolean(opt))
+		{
+			if (stmt->command != REPACK_COMMAND_REPACK)
+				ereport(ERROR,
+						errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("CONCURRENTLY option not supported for %s",
+							   RepackCommandAsString(stmt->command)));
+			params.options |= CLUOPT_CONCURRENT;
+		}
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -151,13 +254,25 @@ ExecRepack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
 					 parser_errposition(pstate, opt->location)));
 	}
 
+	/*
+	 * Determine the lock mode expected by cluster_rel().
+	 *
+	 * In the exclusive case, we obtain AccessExclusiveLock right away to
+	 * avoid lock-upgrade hazard in the single-transaction case. In the
+	 * CONCURRENTLY case, the AccessExclusiveLock will only be used at the end
+	 * of processing, supposedly for a very short time. Until then, we'll have
+	 * to unlock the relation temporarily, so there's no lock-upgrade hazard.
+	 */
+	lockmode = (params.options & CLUOPT_CONCURRENT) == 0 ?
+		AccessExclusiveLock : ShareUpdateExclusiveLock;
+
 	/*
 	 * If a single relation is specified, process it and we're done ... unless
 	 * the relation is a partitioned table, in which case we fall through.
 	 */
 	if (stmt->relation != NULL)
 	{
-		rel = process_single_relation(stmt, &params);
+		rel = process_single_relation(stmt, lockmode, isTopLevel, &params);
 		if (rel == NULL)
 			return;
 	}
@@ -169,10 +284,29 @@ ExecRepack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
 				errmsg("cannot ANALYZE multiple tables"));
 
 	/*
-	 * By here, we know we are in a multi-table situation.  In order to avoid
-	 * holding locks for too long, we want to process each table in its own
-	 * transaction.  This forces us to disallow running inside a user
-	 * transaction block.
+	 * By here, we know we are in a multi-table situation.
+	 *
+	 * Concurrent processing is currently considered rather special (e.g. in
+	 * terms of resources consumed) so it is not performed in bulk.
+	 */
+	if (params.options & CLUOPT_CONCURRENT)
+	{
+		if (rel != NULL)
+		{
+			Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+			ereport(ERROR,
+					errmsg("REPACK CONCURRENTLY not supported for partitioned tables"),
+					errhint("Consider running the command for individual partitions."));
+		}
+		else
+			ereport(ERROR,
+					errmsg("REPACK CONCURRENTLY requires explicit table name"));
+	}
+
+	/*
+	 * In order to avoid holding locks for too long, we want to process each
+	 * table in its own transaction.  This forces us to disallow running
+	 * inside a user transaction block.
 	 */
 	PreventInTransactionBlock(isTopLevel, RepackCommandAsString(stmt->command));
 
@@ -252,7 +386,7 @@ ExecRepack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
 		 * Open the target table, coping with the case where it has been
 		 * dropped.
 		 */
-		rel = try_table_open(rtc->tableOid, AccessExclusiveLock);
+		rel = try_table_open(rtc->tableOid, lockmode);
 		if (rel == NULL)
 		{
 			CommitTransactionCommand();
@@ -264,7 +398,7 @@ ExecRepack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
 
 		/* Process this table */
 		cluster_rel(stmt->command, stmt->usingindex,
-					rel, rtc->indexOid, &params);
+					rel, rtc->indexOid, &params, isTopLevel);
 		/* cluster_rel closes the relation, but keeps lock */
 
 		PopActiveSnapshot();
@@ -293,22 +427,55 @@ ExecRepack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
  * If indexOid is InvalidOid, the table will be rewritten in physical order
  * instead of index order.
  *
+ * Note that, in the concurrent case, the function releases the lock at some
+ * point, in order to get AccessExclusiveLock for the final steps (i.e. to
+ * swap the relation files). To make things simpler, the caller should expect
+ * OldHeap to be closed on return, regardless of CLUOPT_CONCURRENT. (The
+ * AccessExclusiveLock is kept till the end of the transaction.)
+ *
  * 'cmd' indicates which command is being executed, to be used for error
  * messages.
  */
 void
 cluster_rel(RepackCommand cmd, bool usingindex,
-			Relation OldHeap, Oid indexOid, ClusterParams *params)
+			Relation OldHeap, Oid indexOid, ClusterParams *params,
+			bool isTopLevel)
 {
 	Oid			tableOid = RelationGetRelid(OldHeap);
+	Relation	index;
+	LOCKMODE	lmode;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
 	bool		verbose = ((params->options & CLUOPT_VERBOSE) != 0);
 	bool		recheck = ((params->options & CLUOPT_RECHECK) != 0);
-	Relation	index;
+	bool		concurrent = ((params->options & CLUOPT_CONCURRENT) != 0);
 
-	Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false));
+	/*
+	 * Check that the correct lock is held. The lock mode is
+	 * AccessExclusiveLock for normal processing and ShareUpdateExclusiveLock
+	 * for concurrent processing (so that SELECT, INSERT, UPDATE and DELETE
+	 * commands work, but cluster_rel() cannot be called concurrently for the
+	 * same relation).
+	 */
+	lmode = !concurrent ? AccessExclusiveLock : ShareUpdateExclusiveLock;
+
+	/* There are specific requirements on concurrent processing. */
+	if (concurrent)
+	{
+		/*
+		 * Make sure we have no XID assigned, otherwise the call to
+		 * setup_logical_decoding() can cause a deadlock.
+		 *
+		 * The existence of a transaction block does not actually imply that
+		 * an XID was already assigned, but it very likely was. We might want
+		 * to check the result of GetCurrentTransactionIdIfAny() instead, but
+		 * that would be less clear from the user's perspective.
+		 */
+		PreventInTransactionBlock(isTopLevel, "REPACK (CONCURRENTLY)");
+
+		check_repack_concurrently_requirements(OldHeap);
+	}
 
 	/* Check for user-requested abort. */
 	CHECK_FOR_INTERRUPTS();
@@ -351,11 +518,13 @@ cluster_rel(RepackCommand cmd, bool usingindex,
 	 * If this is a single-transaction CLUSTER, we can skip these tests. We
 	 * *must* skip the one on indisclustered since it would reject an attempt
 	 * to cluster a not-previously-clustered index.
+	 *
+	 * XXX move [some of] these comments to where the RECHECK flag is
+	 * determined?
 	 */
-	if (recheck)
-		if (!cluster_rel_recheck(cmd, OldHeap, indexOid, save_userid,
-								 params->options))
-			goto out;
+	if (recheck && !cluster_rel_recheck(cmd, OldHeap, indexOid, save_userid,
+										lmode, params->options))
+		goto out;
 
 	/*
 	 * We allow repacking shared catalogs only when not using an index. It
@@ -369,6 +538,12 @@ cluster_rel(RepackCommand cmd, bool usingindex,
 				 errmsg("cannot run \"%s\" on a shared catalog",
 						RepackCommandAsString(cmd))));
 
+	/*
+	 * The CONCURRENTLY case should have been rejected earlier because it does
+	 * not support system catalogs.
+	 */
+	Assert(!(OldHeap->rd_rel->relisshared && concurrent));
+
 	/*
 	 * Don't process temp tables of other backends ... their local buffer
 	 * manager is not going to cope.
@@ -404,7 +579,7 @@ cluster_rel(RepackCommand cmd, bool usingindex,
 	if (OidIsValid(indexOid))
 	{
 		/* verify the index is good and lock it */
-		check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock);
+		check_index_is_clusterable(OldHeap, indexOid, lmode);
 		/* also open it */
 		index = index_open(indexOid, NoLock);
 	}
@@ -421,7 +596,9 @@ cluster_rel(RepackCommand cmd, bool usingindex,
 	if (OldHeap->rd_rel->relkind == RELKIND_MATVIEW &&
 		!RelationIsPopulated(OldHeap))
 	{
-		relation_close(OldHeap, AccessExclusiveLock);
+		if (index)
+			index_close(index, lmode);
+		relation_close(OldHeap, lmode);
 		goto out;
 	}
 
@@ -434,11 +611,35 @@ cluster_rel(RepackCommand cmd, bool usingindex,
 	 * invalid, because we move tuples around.  Promote them to relation
 	 * locks.  Predicate locks on indexes will be promoted when they are
 	 * reindexed.
+	 *
+	 * During concurrent processing, the heap as well as its indexes stay in
+	 * operation, so we postpone this step until they are locked using
+	 * AccessExclusiveLock near the end of the processing.
 	 */
-	TransferPredicateLocksToHeapRelation(OldHeap);
+	if (!concurrent)
+		TransferPredicateLocksToHeapRelation(OldHeap);
 
 	/* rebuild_relation does all the dirty work */
-	rebuild_relation(cmd, usingindex, OldHeap, index, verbose);
+	PG_TRY();
+	{
+		/*
+		 * For concurrent processing, make sure that our logical decoding
+		 * ignores data changes to tables other than the one we are
+		 * processing.
+		 */
+		if (concurrent)
+			begin_concurrent_repack(OldHeap);
+
+		rebuild_relation(cmd, usingindex, OldHeap, index, save_userid,
+						 verbose, concurrent);
+	}
+	PG_FINALLY();
+	{
+		if (concurrent)
+			end_concurrent_repack();
+	}
+	PG_END_TRY();
+
 	/* rebuild_relation closes OldHeap, and index if valid */
 
 out:
@@ -457,14 +658,14 @@ out:
  */
 static bool
 cluster_rel_recheck(RepackCommand cmd, Relation OldHeap, Oid indexOid,
-					Oid userid, int options)
+					Oid userid, LOCKMODE lmode, int options)
 {
 	Oid			tableOid = RelationGetRelid(OldHeap);
 
 	/* Check that the user still has privileges for the relation */
 	if (!cluster_is_permitted_for_relation(cmd, tableOid, userid))
 	{
-		relation_close(OldHeap, AccessExclusiveLock);
+		relation_close(OldHeap, lmode);
 		return false;
 	}
 
@@ -478,7 +679,7 @@ cluster_rel_recheck(RepackCommand cmd, Relation OldHeap, Oid indexOid,
 	 */
 	if (RELATION_IS_OTHER_TEMP(OldHeap))
 	{
-		relation_close(OldHeap, AccessExclusiveLock);
+		relation_close(OldHeap, lmode);
 		return false;
 	}
 
@@ -489,7 +690,7 @@ cluster_rel_recheck(RepackCommand cmd, Relation OldHeap, Oid indexOid,
 		 */
 		if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexOid)))
 		{
-			relation_close(OldHeap, AccessExclusiveLock);
+			relation_close(OldHeap, lmode);
 			return false;
 		}
 
@@ -500,7 +701,7 @@ cluster_rel_recheck(RepackCommand cmd, Relation OldHeap, Oid indexOid,
 		if ((options & CLUOPT_RECHECK_ISCLUSTERED) != 0 &&
 			!get_index_isclustered(indexOid))
 		{
-			relation_close(OldHeap, AccessExclusiveLock);
+			relation_close(OldHeap, lmode);
 			return false;
 		}
 	}
@@ -641,19 +842,89 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
 	table_close(pg_index, RowExclusiveLock);
 }
 
+/*
+ * Check if the CONCURRENTLY option is legal for the relation.
+ */
+static void
+check_repack_concurrently_requirements(Relation rel)
+{
+	char		relpersistence,
+				replident;
+	Oid			ident_idx;
+
+	/* Data changes in system relations are not logically decoded. */
+	if (IsCatalogRelation(rel))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot repack relation \"%s\"",
+						RelationGetRelationName(rel)),
+				 errhint("REPACK CONCURRENTLY is not supported for catalog relations.")));
+
+	/*
+	 * reorderbuffer.c does not seem to handle processing of TOAST relation
+	 * alone.
+	 */
+	if (IsToastRelation(rel))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot repack relation \"%s\"",
+						RelationGetRelationName(rel)),
+				 errhint("REPACK CONCURRENTLY is not supported for TOAST relations, unless the main relation is repacked too.")));
+
+	relpersistence = rel->rd_rel->relpersistence;
+	if (relpersistence != RELPERSISTENCE_PERMANENT)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("cannot repack relation \"%s\"",
+						RelationGetRelationName(rel)),
+				 errhint("REPACK CONCURRENTLY is only allowed for permanent relations.")));
+
+	/* With NOTHING, WAL does not contain the old tuple. */
+	replident = rel->rd_rel->relreplident;
+	if (replident == REPLICA_IDENTITY_NOTHING)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("cannot repack relation \"%s\"",
+						RelationGetRelationName(rel)),
+				 errhint("Relation \"%s\" has insufficient replication identity.",
+						 RelationGetRelationName(rel))));
+
+	/*
+	 * Identity index is not set if the replica identity is FULL, but PK might
+	 * exist in such a case.
+	 */
+	ident_idx = RelationGetReplicaIndex(rel);
+	if (!OidIsValid(ident_idx) && OidIsValid(rel->rd_pkindex))
+		ident_idx = rel->rd_pkindex;
+	if (!OidIsValid(ident_idx))
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("cannot process relation \"%s\"",
+						RelationGetRelationName(rel)),
+				 (errhint("Relation \"%s\" has no identity index.",
+						  RelationGetRelationName(rel)))));
+}
+
+
 /*
  * rebuild_relation: rebuild an existing relation in index or physical order
  *
- * OldHeap: table to rebuild.
+ * OldHeap: table to rebuild.  See cluster_rel() for comments on the required
+ * lock strength.
+ *
  * index: index to cluster by, or NULL to rewrite in physical order.
  *
- * On entry, heap and index (if one is given) must be open, and
- * AccessExclusiveLock held on them.
- * On exit, they are closed, but locks on them are not released.
+ * On entry, heap and index (if one is given) must be open, and the
+ * appropriate lock held on them -- AccessExclusiveLock for exclusive
+ * processing and ShareUpdateExclusiveLock for concurrent processing.
+ *
+ * On exit, they are closed, but still locked with AccessExclusiveLock.  (The
+ * function handles the lock upgrade if 'concurrent' is true.)
  */
 static void
 rebuild_relation(RepackCommand cmd, bool usingindex,
-				 Relation OldHeap, Relation index, bool verbose)
+				 Relation OldHeap, Relation index, Oid userid,
+				 bool verbose, bool concurrent)
 {
 	Oid			tableOid = RelationGetRelid(OldHeap);
 	Oid			accessMethod = OldHeap->rd_rel->relam;
@@ -661,13 +932,55 @@ rebuild_relation(RepackCommand cmd, bool usingindex,
 	Oid			OIDNewHeap;
 	Relation	NewHeap;
 	char		relpersistence;
-	bool		is_system_catalog;
 	bool		swap_toast_by_content;
 	TransactionId frozenXid;
 	MultiXactId cutoffMulti;
+	NameData	slotname;
+	LogicalDecodingContext *ctx = NULL;
+	Snapshot	snapshot = NULL;
+#ifdef USE_ASSERT_CHECKING
+	LOCKMODE	lmode;
 
-	Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false) &&
-		   (index == NULL || CheckRelationLockedByMe(index, AccessExclusiveLock, false)));
+	lmode = concurrent ? ShareUpdateExclusiveLock : AccessExclusiveLock;
+
+	Assert(CheckRelationLockedByMe(OldHeap, lmode, false));
+	Assert(!usingindex || CheckRelationLockedByMe(index, lmode, false));
+#endif
+
+	if (concurrent)
+	{
+		TupleDesc	tupdesc;
+
+		/*
+		 * REPACK CONCURRENTLY is not allowed in a transaction block, so this
+		 * should never fire.
+		 */
+		Assert(GetTopTransactionIdIfAny() == InvalidTransactionId);
+
+		/*
+		 * A single backend should not execute multiple REPACK commands at a
+		 * time, so use PID to make the slot unique.
+		 */
+		snprintf(NameStr(slotname), NAMEDATALEN, "repack_%d", MyProcPid);
+
+		tupdesc = CreateTupleDescCopy(RelationGetDescr(OldHeap));
+
+		/*
+		 * Prepare to capture the concurrent data changes.
+		 *
+		 * Note that this call waits for all transactions with XID already
+		 * assigned to finish. If one of those transactions is waiting for a
+		 * lock conflicting with ShareUpdateExclusiveLock on our table (e.g.
+		 * it runs CREATE INDEX), we can end up in a deadlock. Not sure this
+		 * risk is worth unlocking/locking the table (and its clustering
+		 * index) and checking again if it's still eligible for REPACK
+		 * CONCURRENTLY.
+		 */
+		ctx = setup_logical_decoding(tableOid, NameStr(slotname), tupdesc);
+
+		snapshot = SnapBuildInitialSnapshotForRepack(ctx->snapshot_builder);
+		PushActiveSnapshot(snapshot);
+	}
 
 	/* for CLUSTER or REPACK USING INDEX, mark the index as the one to use */
 	if (usingindex)
@@ -675,7 +988,6 @@ rebuild_relation(RepackCommand cmd, bool usingindex,
 
 	/* Remember info about rel before closing OldHeap */
 	relpersistence = OldHeap->rd_rel->relpersistence;
-	is_system_catalog = IsSystemRelation(OldHeap);
 
 	/*
 	 * Create the transient table that will receive the re-ordered data.
@@ -691,30 +1003,67 @@ rebuild_relation(RepackCommand cmd, bool usingindex,
 	NewHeap = table_open(OIDNewHeap, NoLock);
 
 	/* Copy the heap data into the new table in the desired order */
-	copy_table_data(NewHeap, OldHeap, index, verbose,
+	copy_table_data(NewHeap, OldHeap, index, snapshot, ctx, verbose,
 					&swap_toast_by_content, &frozenXid, &cutoffMulti);
 
+	/* The historic snapshot won't be needed anymore. */
+	if (snapshot)
+		PopActiveSnapshot();
 
-	/* Close relcache entries, but keep lock until transaction commit */
-	table_close(OldHeap, NoLock);
-	if (index)
-		index_close(index, NoLock);
-
-	/*
-	 * Close the new relation so it can be dropped as soon as the storage is
-	 * swapped. The relation is not visible to others, so no need to unlock it
-	 * explicitly.
-	 */
-	table_close(NewHeap, NoLock);
-
-	/*
-	 * Swap the physical files of the target and transient tables, then
-	 * rebuild the target's indexes and throw away the transient table.
-	 */
-	finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
-					 swap_toast_by_content, false, true,
-					 frozenXid, cutoffMulti,
-					 relpersistence);
+	if (concurrent)
+	{
+		/*
+		 * Push a snapshot that we will use to find old versions of rows when
+		 * processing concurrent UPDATE and DELETE commands. (That snapshot
+		 * should also be used by index expressions.)
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		/*
+		 * Make sure we can find the tuples just inserted when applying DML
+		 * commands on top of those.
+		 */
+		CommandCounterIncrement();
+		UpdateActiveSnapshotCommandId();
+
+		rebuild_relation_finish_concurrent(NewHeap, OldHeap, index,
+										   ctx, swap_toast_by_content,
+										   frozenXid, cutoffMulti);
+		PopActiveSnapshot();
+
+		pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+									 PROGRESS_REPACK_PHASE_FINAL_CLEANUP);
+
+		/* Done with decoding. */
+		cleanup_logical_decoding(ctx);
+		ReplicationSlotRelease();
+		ReplicationSlotDrop(NameStr(slotname), false);
+	}
+	else
+	{
+		bool		is_system_catalog = IsSystemRelation(OldHeap);
+
+		/* Close relcache entries, but keep lock until transaction commit */
+		table_close(OldHeap, NoLock);
+		if (index)
+			index_close(index, NoLock);
+
+		/*
+		 * Close the new relation so it can be dropped as soon as the storage
+		 * is swapped. The relation is not visible to others, so no need to
+		 * unlock it explicitly.
+		 */
+		table_close(NewHeap, NoLock);
+
+		/*
+		 * Swap the physical files of the target and transient tables, then
+		 * rebuild the target's indexes and throw away the transient table.
+		 */
+		finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
+						 swap_toast_by_content, false, true, true,
+						 frozenXid, cutoffMulti,
+						 relpersistence);
+	}
 }
 
 
@@ -849,15 +1198,19 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
 /*
  * Do the physical copying of table data.
  *
+ * 'snapshot' and 'decoding_ctx': see table_relation_copy_for_cluster(). Pass
+ * iff concurrent processing is required.
+ *
  * There are three output parameters:
  * *pSwapToastByContent is set true if toast tables must be swapped by content.
  * *pFreezeXid receives the TransactionId used as freeze cutoff point.
  * *pCutoffMulti receives the MultiXactId used as a cutoff point.
  */
 static void
-copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verbose,
-				bool *pSwapToastByContent, TransactionId *pFreezeXid,
-				MultiXactId *pCutoffMulti)
+copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
+				Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
+				bool verbose, bool *pSwapToastByContent,
+				TransactionId *pFreezeXid, MultiXactId *pCutoffMulti)
 {
 	Relation	relRelation;
 	HeapTuple	reltup;
@@ -875,6 +1228,8 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
 	PGRUsage	ru0;
 	char	   *nspname;
 
+	bool		concurrent = snapshot != NULL;
+
 	pg_rusage_init(&ru0);
 
 	/* Store a copy of the namespace name for logging purposes */
@@ -977,8 +1332,48 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
 	 * provided, else plain seqscan.
 	 */
 	if (OldIndex != NULL && OldIndex->rd_rel->relam == BTREE_AM_OID)
+	{
+		ResourceOwner oldowner = NULL;
+		ResourceOwner resowner = NULL;
+
+		/*
+		 * In the CONCURRENTLY case, use a dedicated resource owner so we don't
+		 * leave any additional locks behind us that we cannot release easily.
+		 */
+		if (concurrent)
+		{
+			Assert(CheckRelationLockedByMe(OldHeap, ShareUpdateExclusiveLock,
+										   false));
+			Assert(CheckRelationLockedByMe(OldIndex, ShareUpdateExclusiveLock,
+										   false));
+
+			resowner = ResourceOwnerCreate(CurrentResourceOwner,
+										   "plan_cluster_use_sort");
+			oldowner = CurrentResourceOwner;
+			CurrentResourceOwner = resowner;
+		}
+
 		use_sort = plan_cluster_use_sort(RelationGetRelid(OldHeap),
 										 RelationGetRelid(OldIndex));
+
+		if (concurrent)
+		{
+			CurrentResourceOwner = oldowner;
+
+			/*
+			 * We are primarily concerned about locks, but if the planner
+			 * happened to allocate any other resources, we should release
+			 * them too because we're going to delete the whole resowner.
+			 */
+			ResourceOwnerRelease(resowner, RESOURCE_RELEASE_BEFORE_LOCKS,
+								 false, false);
+			ResourceOwnerRelease(resowner, RESOURCE_RELEASE_LOCKS,
+								 false, false);
+			ResourceOwnerRelease(resowner, RESOURCE_RELEASE_AFTER_LOCKS,
+								 false, false);
+			ResourceOwnerDelete(resowner);
+		}
+	}
 	else
 		use_sort = false;
 
@@ -1007,7 +1402,9 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
 	 * values (e.g. because the AM doesn't use freezing).
 	 */
 	table_relation_copy_for_cluster(OldHeap, NewHeap, OldIndex, use_sort,
-									cutoffs.OldestXmin, &cutoffs.FreezeLimit,
+									cutoffs.OldestXmin, snapshot,
+									decoding_ctx,
+									&cutoffs.FreezeLimit,
 									&cutoffs.MultiXactCutoff,
 									&num_tuples, &tups_vacuumed,
 									&tups_recently_dead);
@@ -1016,7 +1413,11 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
 	*pFreezeXid = cutoffs.FreezeLimit;
 	*pCutoffMulti = cutoffs.MultiXactCutoff;
 
-	/* Reset rd_toastoid just to be tidy --- it shouldn't be looked at again */
+	/*
+	 * Reset rd_toastoid just to be tidy --- it shouldn't be looked at again.
+	 * In the CONCURRENTLY case, we need to set it again before applying the
+	 * concurrent changes.
+	 */
 	NewHeap->rd_toastoid = InvalidOid;
 
 	num_pages = RelationGetNumberOfBlocks(NewHeap);
@@ -1474,14 +1875,13 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
 				 bool swap_toast_by_content,
 				 bool check_constraints,
 				 bool is_internal,
+				 bool reindex,
 				 TransactionId frozenXid,
 				 MultiXactId cutoffMulti,
 				 char newrelpersistence)
 {
 	ObjectAddress object;
 	Oid			mapped_tables[4];
-	int			reindex_flags;
-	ReindexParams reindex_params = {0};
 	int			i;
 
 	/* Report that we are now swapping relation files */
@@ -1507,39 +1907,47 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
 	if (is_system_catalog)
 		CacheInvalidateCatalog(OIDOldHeap);
 
-	/*
-	 * Rebuild each index on the relation (but not the toast table, which is
-	 * all-new at this point).  It is important to do this before the DROP
-	 * step because if we are processing a system catalog that will be used
-	 * during DROP, we want to have its indexes available.  There is no
-	 * advantage to the other order anyway because this is all transactional,
-	 * so no chance to reclaim disk space before commit.  We do not need a
-	 * final CommandCounterIncrement() because reindex_relation does it.
-	 *
-	 * Note: because index_build is called via reindex_relation, it will never
-	 * set indcheckxmin true for the indexes.  This is OK even though in some
-	 * sense we are building new indexes rather than rebuilding existing ones,
-	 * because the new heap won't contain any HOT chains at all, let alone
-	 * broken ones, so it can't be necessary to set indcheckxmin.
-	 */
-	reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
-	if (check_constraints)
-		reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
+	if (reindex)
+	{
+		int			reindex_flags;
+		ReindexParams reindex_params = {0};
 
-	/*
-	 * Ensure that the indexes have the same persistence as the parent
-	 * relation.
-	 */
-	if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
-		reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
-	else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
-		reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+		/*
+		 * Rebuild each index on the relation (but not the toast table, which
+		 * is all-new at this point).  It is important to do this before the
+		 * DROP step because if we are processing a system catalog that will
+		 * be used during DROP, we want to have its indexes available.  There
+		 * is no advantage to the other order anyway because this is all
+		 * transactional, so no chance to reclaim disk space before commit. We
+		 * do not need a final CommandCounterIncrement() because
+		 * reindex_relation does it.
+		 *
+		 * Note: because index_build is called via reindex_relation, it will
+		 * never set indcheckxmin true for the indexes.  This is OK even
+		 * though in some sense we are building new indexes rather than
+		 * rebuilding existing ones, because the new heap won't contain any
+		 * HOT chains at all, let alone broken ones, so it can't be necessary
+		 * to set indcheckxmin.
+		 */
+		reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
+		if (check_constraints)
+			reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
 
-	/* Report that we are now reindexing relations */
-	pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
-								 PROGRESS_REPACK_PHASE_REBUILD_INDEX);
+		/*
+		 * Ensure that the indexes have the same persistence as the parent
+		 * relation.
+		 */
+		if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
+			reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+		else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
+			reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
 
-	reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+		/* Report that we are now reindexing relations */
+		pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+									 PROGRESS_REPACK_PHASE_REBUILD_INDEX);
+
+		reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+	}
 
 	/* Report that we are now doing clean up */
 	pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
@@ -1881,7 +2289,8 @@ cluster_is_permitted_for_relation(RepackCommand cmd, Oid relid, Oid userid)
  * resolve in this case.
  */
 static Relation
-process_single_relation(RepackStmt *stmt, ClusterParams *params)
+process_single_relation(RepackStmt *stmt, LOCKMODE lockmode, bool isTopLevel,
+						ClusterParams *params)
 {
 	Relation	rel;
 	Oid			tableOid;
@@ -1890,13 +2299,9 @@ process_single_relation(RepackStmt *stmt, ClusterParams *params)
 	Assert(stmt->command == REPACK_COMMAND_CLUSTER ||
 		   stmt->command == REPACK_COMMAND_REPACK);
 
-	/*
-	 * Find, lock, and check permissions on the table.  We obtain
-	 * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
-	 * single-transaction case.
-	 */
+	/* Find, lock, and check permissions on the table. */
 	tableOid = RangeVarGetRelidExtended(stmt->relation,
-										AccessExclusiveLock,
+										lockmode,
 										0,
 										RangeVarCallbackMaintainsTable,
 										NULL);
@@ -1922,26 +2327,17 @@ process_single_relation(RepackStmt *stmt, ClusterParams *params)
 		return rel;
 	else
 	{
-		Oid			indexOid;
+		Oid			indexOid = InvalidOid;
 
-		indexOid = determine_clustered_index(rel, stmt->usingindex,
-											 stmt->indexname);
-		if (OidIsValid(indexOid))
-			check_index_is_clusterable(rel, indexOid, AccessExclusiveLock);
-		cluster_rel(stmt->command, stmt->usingindex, rel, indexOid, params);
-
-		/* Do an analyze, if requested */
-		if (params->options & CLUOPT_ANALYZE)
+		if (stmt->usingindex)
 		{
-			VacuumParams vac_params = {0};
-
-			vac_params.options |= VACOPT_ANALYZE;
-			if (params->options & CLUOPT_VERBOSE)
-				vac_params.options |= VACOPT_VERBOSE;
-			analyze_rel(RelationGetRelid(rel), NULL, vac_params, NIL, true,
-						NULL);
+			indexOid = determine_clustered_index(rel, stmt->usingindex,
+												 stmt->indexname);
+			check_index_is_clusterable(rel, indexOid, lockmode);
 		}
 
+		cluster_rel(stmt->command, stmt->usingindex, rel, indexOid,
+					params, isTopLevel);
 		return NULL;
 	}
 }
@@ -1998,3 +2394,1048 @@ determine_clustered_index(Relation rel, bool usingindex, const char *indexname)
 
 	return indexOid;
 }
+
+
+/*
+ * Call this function before REPACK CONCURRENTLY starts to set up logical
+ * decoding. It makes sure that other users of the table put enough
+ * information into WAL.
+ *
+ * The point is that at various places we expect that the table we're
+ * processing is treated like a system catalog. For example, we need to be
+ * able to scan it using a "historic snapshot" anytime during the processing
+ * (as opposed to scanning only at the start point of the decoding, as logical
+ * replication does during initial table synchronization), in order to apply
+ * concurrent UPDATE / DELETE commands.
+ *
+ * Note that the TOAST table needs no attention here as it's not scanned
+ * using a historic snapshot.
+ */
+static void
+begin_concurrent_repack(Relation rel)
+{
+	Oid			toastrelid;
+
+	/* Avoid logical decoding of other relations by this backend. */
+	repacked_rel_locator = rel->rd_locator;
+	toastrelid = rel->rd_rel->reltoastrelid;
+	if (OidIsValid(toastrelid))
+	{
+		Relation	toastrel;
+
+		/* Avoid logical decoding of other TOAST relations. */
+		toastrel = table_open(toastrelid, AccessShareLock);
+		repacked_rel_toast_locator = toastrel->rd_locator;
+		table_close(toastrel, AccessShareLock);
+	}
+}
+
+/*
+ * Call this when done with REPACK CONCURRENTLY.
+ */
+static void
+end_concurrent_repack(void)
+{
+	/*
+	 * Restore normal function of (future) logical decoding for this backend.
+	 */
+	repacked_rel_locator.relNumber = InvalidOid;
+	repacked_rel_toast_locator.relNumber = InvalidOid;
+}
+
+/*
+ * This function is much like pg_create_logical_replication_slot() except that
+ * the new slot is neither released (if anyone else could read changes from
+ * our slot, we could miss changes other backends make while we copy the
+ * existing data into the temporary table), nor persisted (it's easier to
+ * handle a crash by restarting all the work from scratch).
+ */
+static LogicalDecodingContext *
+setup_logical_decoding(Oid relid, const char *slotname, TupleDesc tupdesc)
+{
+	LogicalDecodingContext *ctx;
+	RepackDecodingState *dstate;
+
+	/*
+	 * Check if we can use logical decoding.
+	 */
+	CheckSlotPermissions();
+	CheckLogicalDecodingRequirements();
+
+	/* RS_TEMPORARY so that the slot gets cleaned up on ERROR. */
+	ReplicationSlotCreate(slotname, true, RS_TEMPORARY, false, false, false);
+
+	/*
+	 * Neither prepare_write nor do_write callback nor update_progress is
+	 * useful for us.
+	 */
+	ctx = CreateInitDecodingContext(REPL_PLUGIN_NAME,
+									NIL,
+									true,
+									InvalidXLogRecPtr,
+									XL_ROUTINE(.page_read = read_local_xlog_page,
+											   .segment_open = wal_segment_open,
+											   .segment_close = wal_segment_close),
+									NULL, NULL, NULL);
+
+	/*
+	 * We don't have control over setting fast_forward, so at least check it.
+	 */
+	Assert(!ctx->fast_forward);
+
+	DecodingContextFindStartpoint(ctx);
+
+	/* Some WAL records should have been read. */
+	Assert(ctx->reader->EndRecPtr != InvalidXLogRecPtr);
+
+	XLByteToSeg(ctx->reader->EndRecPtr, repack_current_segment,
+				wal_segment_size);
+
+	/*
+	 * Set up structures to store decoded changes.
+	 */
+	dstate = palloc0(sizeof(RepackDecodingState));
+	dstate->relid = relid;
+	dstate->tstore = tuplestore_begin_heap(false, false,
+										   maintenance_work_mem);
+
+	dstate->tupdesc = tupdesc;
+
+	/* Initialize the descriptor to store the changes ... */
+	dstate->tupdesc_change = CreateTemplateTupleDesc(1);
+
+	TupleDescInitEntry(dstate->tupdesc_change, 1, NULL, BYTEAOID, -1, 0);
+	/* ... as well as the corresponding slot. */
+	dstate->tsslot = MakeSingleTupleTableSlot(dstate->tupdesc_change,
+											  &TTSOpsMinimalTuple);
+
+	dstate->resowner = ResourceOwnerCreate(CurrentResourceOwner,
+										   "logical decoding");
+
+	ctx->output_writer_private = dstate;
+	return ctx;
+}
+
+/*
+ * Retrieve tuple from ConcurrentChange structure.
+ *
+ * The input data starts with the structure but it might not be appropriately
+ * aligned.
+ */
+static HeapTuple
+get_changed_tuple(char *change)
+{
+	HeapTupleData tup_data;
+	HeapTuple	result;
+	char	   *src;
+
+	/*
+	 * Ensure alignment before accessing the fields. (This is why we can't use
+	 * heap_copytuple() instead of this function.)
+	 */
+	src = change + offsetof(ConcurrentChange, tup_data);
+	memcpy(&tup_data, src, sizeof(HeapTupleData));
+
+	result = (HeapTuple) palloc(HEAPTUPLESIZE + tup_data.t_len);
+	memcpy(result, &tup_data, sizeof(HeapTupleData));
+	result->t_data = (HeapTupleHeader) ((char *) result + HEAPTUPLESIZE);
+	src = change + SizeOfConcurrentChange;
+	memcpy(result->t_data, src, result->t_len);
+
+	return result;
+}
+
+/*
+ * Decode logical changes from the WAL sequence up to end_of_wal.
+ */
+void
+repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
+								 XLogRecPtr end_of_wal)
+{
+	RepackDecodingState *dstate;
+	ResourceOwner resowner_old;
+
+	/*
+	 * Invalidate the "present" cache before moving to "(recent) history".
+	 */
+	InvalidateSystemCaches();
+
+	dstate = (RepackDecodingState *) ctx->output_writer_private;
+	resowner_old = CurrentResourceOwner;
+	CurrentResourceOwner = dstate->resowner;
+
+	PG_TRY();
+	{
+		while (ctx->reader->EndRecPtr < end_of_wal)
+		{
+			XLogRecord *record;
+			XLogSegNo	segno_new;
+			char	   *errm = NULL;
+			XLogRecPtr	end_lsn;
+
+			record = XLogReadRecord(ctx->reader, &errm);
+			if (errm)
+				elog(ERROR, "%s", errm);
+
+			if (record != NULL)
+				LogicalDecodingProcessRecord(ctx, ctx->reader);
+
+			/*
+			 * If WAL segment boundary has been crossed, inform the decoding
+			 * system that the catalog_xmin can advance. (We can confirm more
+			 * often, but filling a single WAL segment should not take much
+			 * time.)
+			 */
+			end_lsn = ctx->reader->EndRecPtr;
+			XLByteToSeg(end_lsn, segno_new, wal_segment_size);
+			if (segno_new != repack_current_segment)
+			{
+				LogicalConfirmReceivedLocation(end_lsn);
+				elog(DEBUG1, "REPACK: confirmed receive location %X/%X",
+					 LSN_FORMAT_ARGS(end_lsn));
+				repack_current_segment = segno_new;
+			}
+
+			CHECK_FOR_INTERRUPTS();
+		}
+		InvalidateSystemCaches();
+		CurrentResourceOwner = resowner_old;
+	}
+	PG_CATCH();
+	{
+		/* clear all timetravel entries */
+		InvalidateSystemCaches();
+		CurrentResourceOwner = resowner_old;
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+}
+
+/*
+ * Apply changes that happened during the initial load.
+ *
+ * Scan key is passed by caller, so it does not have to be constructed
+ * multiple times. Key entries have all fields initialized, except for
+ * sk_argument.
+ */
+static void
+apply_concurrent_changes(RepackDecodingState *dstate, Relation rel,
+						 ScanKey key, int nkeys, IndexInsertState *iistate)
+{
+	TupleTableSlot *index_slot,
+			   *ident_slot;
+	HeapTuple	tup_old = NULL;
+
+	if (dstate->nchanges == 0)
+		return;
+
+	/* TupleTableSlot is needed to pass the tuple to ExecInsertIndexTuples(). */
+	index_slot = MakeSingleTupleTableSlot(dstate->tupdesc, &TTSOpsHeapTuple);
+
+	/* A slot to fetch tuples from identity index. */
+	ident_slot = table_slot_create(rel, NULL);
+
+	while (tuplestore_gettupleslot(dstate->tstore, true, false,
+								   dstate->tsslot))
+	{
+		bool		shouldFree;
+		HeapTuple	tup_change,
+					tup,
+					tup_exist;
+		char	   *change_raw,
+				   *src;
+		ConcurrentChange change;
+		bool		isnull[1];
+		Datum		values[1];
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* Get the change from the single-column tuple. */
+		tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
+		heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
+		Assert(!isnull[0]);
+
+		/* Make sure we access aligned data. */
+		change_raw = (char *) DatumGetByteaP(values[0]);
+		src = (char *) VARDATA(change_raw);
+		memcpy(&change, src, SizeOfConcurrentChange);
+
+		/* TRUNCATE change contains no tuple, so process it separately. */
+		if (change.kind == CHANGE_TRUNCATE)
+		{
+			/*
+			 * All the things that ExecuteTruncateGuts() does (such as firing
+			 * triggers or handling the DROP_CASCADE behavior) should have
+			 * taken place on the source relation. Thus we only do the actual
+			 * truncation of the new relation (and its indexes).
+			 */
+			heap_truncate_one_rel(rel);
+
+			pfree(tup_change);
+			continue;
+		}
+
+		/*
+		 * Extract the tuple from the change. The tuple is copied here because
+		 * it might be assigned to 'tup_old', in which case it needs to
+		 * survive into the next iteration.
+		 */
+		tup = get_changed_tuple(src);
+
+		if (change.kind == CHANGE_UPDATE_OLD)
+		{
+			Assert(tup_old == NULL);
+			tup_old = tup;
+		}
+		else if (change.kind == CHANGE_INSERT)
+		{
+			Assert(tup_old == NULL);
+
+			apply_concurrent_insert(rel, &change, tup, iistate, index_slot);
+
+			pfree(tup);
+		}
+		else if (change.kind == CHANGE_UPDATE_NEW ||
+				 change.kind == CHANGE_DELETE)
+		{
+			IndexScanDesc ind_scan = NULL;
+			HeapTuple	tup_key;
+
+			if (change.kind == CHANGE_UPDATE_NEW)
+			{
+				tup_key = tup_old != NULL ? tup_old : tup;
+			}
+			else
+			{
+				Assert(tup_old == NULL);
+				tup_key = tup;
+			}
+
+			/*
+			 * Find the tuple to be updated or deleted.
+			 */
+			tup_exist = find_target_tuple(rel, key, nkeys, tup_key,
+										  iistate, ident_slot, &ind_scan);
+			if (tup_exist == NULL)
+				elog(ERROR, "Failed to find target tuple");
+
+			if (change.kind == CHANGE_UPDATE_NEW)
+				apply_concurrent_update(rel, tup, tup_exist, &change, iistate,
+										index_slot);
+			else
+				apply_concurrent_delete(rel, tup_exist, &change);
+
+			if (tup_old != NULL)
+			{
+				pfree(tup_old);
+				tup_old = NULL;
+			}
+
+			pfree(tup);
+			index_endscan(ind_scan);
+		}
+		else
+			elog(ERROR, "Unrecognized kind of change: %d", change.kind);
+
+		/*
+		 * If a change was applied now, increment CID for next writes and
+		 * update the snapshot so it sees the changes we've applied so far.
+		 */
+		if (change.kind != CHANGE_UPDATE_OLD)
+		{
+			CommandCounterIncrement();
+			UpdateActiveSnapshotCommandId();
+		}
+
+		/* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
+		Assert(shouldFree);
+		pfree(tup_change);
+	}
+
+	tuplestore_clear(dstate->tstore);
+	dstate->nchanges = 0;
+
+	/* Cleanup. */
+	ExecDropSingleTupleTableSlot(index_slot);
+	ExecDropSingleTupleTableSlot(ident_slot);
+}
+
+static void
+apply_concurrent_insert(Relation rel, ConcurrentChange *change, HeapTuple tup,
+						IndexInsertState *iistate, TupleTableSlot *index_slot)
+{
+	List	   *recheck;
+
+
+	/*
+	 * Like simple_heap_insert(), but make sure that the INSERT is not
+	 * logically decoded - see reform_and_rewrite_tuple() for more
+	 * information.
+	 */
+	heap_insert(rel, tup, GetCurrentCommandId(true), HEAP_INSERT_NO_LOGICAL,
+				NULL);
+
+	/*
+	 * Update indexes.
+	 *
+	 * Functions in index expressions may need an active snapshot; the caller
+	 * is expected to have pushed one.
+	 */
+	ExecStoreHeapTuple(tup, index_slot, false);
+	recheck = ExecInsertIndexTuples(iistate->rri,
+									index_slot,
+									iistate->estate,
+									false,	/* update */
+									false,	/* noDupErr */
+									NULL,	/* specConflict */
+									NIL,	/* arbiterIndexes */
+									false	/* onlySummarizing */
+		);
+
+	/*
+	 * If recheck is required, it must have been performed on the source
+	 * relation by now. (All the logical changes we process here are already
+	 * committed.)
+	 */
+	list_free(recheck);
+
+	pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_INSERTED, 1);
+}
+
+static void
+apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
+						ConcurrentChange *change, IndexInsertState *iistate,
+						TupleTableSlot *index_slot)
+{
+	LockTupleMode lockmode;
+	TM_FailureData tmfd;
+	TU_UpdateIndexes update_indexes;
+	TM_Result	res;
+	List	   *recheck;
+
+	/*
+	 * Write the new tuple into the new heap. ('tup' gets the TID assigned
+	 * here.)
+	 *
+	 * Do it like in simple_heap_update(), except for 'wal_logical' (and
+	 * except for 'wait').
+	 */
+	res = heap_update(rel, &tup_target->t_self, tup,
+					  GetCurrentCommandId(true),
+					  InvalidSnapshot,
+					  false,	/* no wait - only we are doing changes */
+					  &tmfd, &lockmode, &update_indexes,
+					  false /* wal_logical */ );
+	if (res != TM_Ok)
+		ereport(ERROR, (errmsg("failed to apply concurrent UPDATE")));
+
+	ExecStoreHeapTuple(tup, index_slot, false);
+
+	if (update_indexes != TU_None)
+	{
+		recheck = ExecInsertIndexTuples(iistate->rri,
+										index_slot,
+										iistate->estate,
+										true,	/* update */
+										false,	/* noDupErr */
+										NULL,	/* specConflict */
+										NIL,	/* arbiterIndexes */
+		/* onlySummarizing */
+										update_indexes == TU_Summarizing);
+		list_free(recheck);
+	}
+
+	pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_UPDATED, 1);
+}
+
+static void
+apply_concurrent_delete(Relation rel, HeapTuple tup_target,
+						ConcurrentChange *change)
+{
+	TM_Result	res;
+	TM_FailureData tmfd;
+
+	/*
+	 * Delete tuple from the new heap.
+	 *
+	 * Do it like in simple_heap_delete(), except for 'wal_logical' (and
+	 * except for 'wait').
+	 */
+	res = heap_delete(rel, &tup_target->t_self, GetCurrentCommandId(true),
+					  InvalidSnapshot, false,
+					  &tmfd,
+					  false,	/* no wait - only we are doing changes */
+					  false /* wal_logical */ );
+
+	if (res != TM_Ok)
+		ereport(ERROR, (errmsg("failed to apply concurrent DELETE")));
+
+	pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_DELETED, 1);
+}
+
+/*
+ * Find the tuple to be updated or deleted.
+ *
+ * 'key' is a pre-initialized scan key, into which the function will put the
+ * key values.
+ *
+ * 'tup_key' is a tuple containing the key values for the scan.
+ *
+ * On exit, '*scan_p' contains the scan descriptor used. The caller must close
+ * it once the tuple returned is no longer needed.
+ */
+static HeapTuple
+find_target_tuple(Relation rel, ScanKey key, int nkeys, HeapTuple tup_key,
+				  IndexInsertState *iistate,
+				  TupleTableSlot *ident_slot, IndexScanDesc *scan_p)
+{
+	IndexScanDesc scan;
+	Form_pg_index ident_form;
+	int2vector *ident_indkey;
+	HeapTuple	result = NULL;
+
+	/* XXX no instrumentation for now */
+	scan = index_beginscan(rel, iistate->ident_index, GetActiveSnapshot(),
+						   NULL, nkeys, 0);
+	*scan_p = scan;
+	index_rescan(scan, key, nkeys, NULL, 0);
+
+	/* Info needed to retrieve key values from heap tuple. */
+	ident_form = iistate->ident_index->rd_index;
+	ident_indkey = &ident_form->indkey;
+
+	/* Use the incoming tuple to finalize the scan key. */
+	for (int i = 0; i < scan->numberOfKeys; i++)
+	{
+		ScanKey		entry;
+		bool		isnull;
+		int16		attno_heap;
+
+		entry = &scan->keyData[i];
+		attno_heap = ident_indkey->values[i];
+		entry->sk_argument = heap_getattr(tup_key,
+										  attno_heap,
+										  rel->rd_att,
+										  &isnull);
+		Assert(!isnull);
+	}
+	if (index_getnext_slot(scan, ForwardScanDirection, ident_slot))
+	{
+		bool		shouldFree;
+
+		result = ExecFetchSlotHeapTuple(ident_slot, false, &shouldFree);
+		/* TTSOpsBufferHeapTuple has .get_heap_tuple != NULL. */
+		Assert(!shouldFree);
+	}
+
+	return result;
+}
+
+/*
+ * Decode and apply concurrent changes.
+ *
+ * Pass rel_src iff its reltoastrelid is needed.
+ */
+static void
+process_concurrent_changes(LogicalDecodingContext *ctx, XLogRecPtr end_of_wal,
+						   Relation rel_dst, Relation rel_src, ScanKey ident_key,
+						   int ident_key_nentries, IndexInsertState *iistate)
+{
+	RepackDecodingState *dstate;
+
+	pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+								 PROGRESS_REPACK_PHASE_CATCH_UP);
+
+	dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+	repack_decode_concurrent_changes(ctx, end_of_wal);
+
+	if (dstate->nchanges == 0)
+		return;
+
+	PG_TRY();
+	{
+		/*
+		 * Make sure that TOAST values can eventually be accessed via the old
+		 * relation - see comment in copy_table_data().
+		 */
+		if (rel_src)
+			rel_dst->rd_toastoid = rel_src->rd_rel->reltoastrelid;
+
+		apply_concurrent_changes(dstate, rel_dst, ident_key,
+								 ident_key_nentries, iistate);
+	}
+	PG_FINALLY();
+	{
+		if (rel_src)
+			rel_dst->rd_toastoid = InvalidOid;
+	}
+	PG_END_TRY();
+}
+
+static IndexInsertState *
+get_index_insert_state(Relation relation, Oid ident_index_id)
+{
+	EState	   *estate;
+	int			i;
+	IndexInsertState *result;
+
+	result = (IndexInsertState *) palloc0(sizeof(IndexInsertState));
+	estate = CreateExecutorState();
+
+	result->rri = (ResultRelInfo *) palloc(sizeof(ResultRelInfo));
+	InitResultRelInfo(result->rri, relation, 0, 0, 0);
+	ExecOpenIndices(result->rri, false);
+
+	/*
+	 * Find the relcache entry of the identity index so that we spend no extra
+	 * effort to open / close it.
+	 */
+	for (i = 0; i < result->rri->ri_NumIndices; i++)
+	{
+		Relation	ind_rel;
+
+		ind_rel = result->rri->ri_IndexRelationDescs[i];
+		if (ind_rel->rd_id == ident_index_id)
+			result->ident_index = ind_rel;
+	}
+	if (result->ident_index == NULL)
+		elog(ERROR, "Failed to open identity index");
+
+	/* Only initialize fields needed by ExecInsertIndexTuples(). */
+	result->estate = estate;
+
+	return result;
+}
+
+/*
+ * Build scan key to process logical changes.
+ */
+static ScanKey
+build_identity_key(Oid ident_idx_oid, Relation rel_src, int *nentries)
+{
+	Relation	ident_idx_rel;
+	Form_pg_index ident_idx;
+	int			n,
+				i;
+	ScanKey		result;
+
+	Assert(OidIsValid(ident_idx_oid));
+	ident_idx_rel = index_open(ident_idx_oid, AccessShareLock);
+	ident_idx = ident_idx_rel->rd_index;
+	n = ident_idx->indnatts;
+	result = (ScanKey) palloc(sizeof(ScanKeyData) * n);
+	for (i = 0; i < n; i++)
+	{
+		ScanKey		entry;
+		int16		relattno;
+		Form_pg_attribute att;
+		Oid			opfamily,
+					opcintype,
+					opno,
+					opcode;
+
+		entry = &result[i];
+		relattno = ident_idx->indkey.values[i];
+		if (relattno >= 1)
+		{
+			TupleDesc	desc;
+
+			desc = rel_src->rd_att;
+			att = TupleDescAttr(desc, relattno - 1);
+		}
+		else
+			elog(ERROR, "Unexpected attribute number %d in index", relattno);
+
+		opfamily = ident_idx_rel->rd_opfamily[i];
+		opcintype = ident_idx_rel->rd_opcintype[i];
+		opno = get_opfamily_member(opfamily, opcintype, opcintype,
+								   BTEqualStrategyNumber);
+
+		if (!OidIsValid(opno))
+			elog(ERROR, "Failed to find = operator for type %u", opcintype);
+
+		opcode = get_opcode(opno);
+		if (!OidIsValid(opcode))
+			elog(ERROR, "Failed to find function for operator %u", opno);
+
+		/* Initialize everything but argument. */
+		ScanKeyInit(entry,
+					i + 1,
+					BTEqualStrategyNumber, opcode,
+					(Datum) NULL);
+		entry->sk_collation = att->attcollation;
+	}
+	index_close(ident_idx_rel, AccessShareLock);
+
+	*nentries = n;
+	return result;
+}
+
+static void
+free_index_insert_state(IndexInsertState *iistate)
+{
+	ExecCloseIndices(iistate->rri);
+	FreeExecutorState(iistate->estate);
+	pfree(iistate->rri);
+	pfree(iistate);
+}
+
+static void
+cleanup_logical_decoding(LogicalDecodingContext *ctx)
+{
+	RepackDecodingState *dstate;
+
+	dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+	ExecDropSingleTupleTableSlot(dstate->tsslot);
+	FreeTupleDesc(dstate->tupdesc_change);
+	FreeTupleDesc(dstate->tupdesc);
+	tuplestore_end(dstate->tstore);
+
+	FreeDecodingContext(ctx);
+}
+
+/*
+ * The final steps of rebuild_relation() for concurrent processing.
+ *
+ * On entry, NewHeap is locked in AccessExclusiveLock mode. OldHeap and its
+ * clustering index (if one is passed) are still locked in a mode that allows
+ * concurrent data changes. On exit, both tables and their indexes are closed,
+ * but locked in AccessExclusiveLock mode.
+ */
+static void
+rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+								   Relation cl_index,
+								   LogicalDecodingContext *ctx,
+								   bool swap_toast_by_content,
+								   TransactionId frozenXid,
+								   MultiXactId cutoffMulti)
+{
+	LOCKMODE	lockmode_old PG_USED_FOR_ASSERTS_ONLY;
+	List	   *ind_oids_new;
+	Oid			old_table_oid = RelationGetRelid(OldHeap);
+	Oid			new_table_oid = RelationGetRelid(NewHeap);
+	List	   *ind_oids_old = RelationGetIndexList(OldHeap);
+	ListCell   *lc,
+			   *lc2;
+	char		relpersistence;
+	bool		is_system_catalog;
+	Oid			ident_idx_old,
+				ident_idx_new;
+	IndexInsertState *iistate;
+	ScanKey		ident_key;
+	int			ident_key_nentries;
+	XLogRecPtr	wal_insert_ptr,
+				end_of_wal;
+	char		dummy_rec_data = '\0';
+	Relation   *ind_refs,
+			   *ind_refs_p;
+	int			nind;
+
+	/* Like in cluster_rel(). */
+	lockmode_old = ShareUpdateExclusiveLock;
+	Assert(CheckRelationLockedByMe(OldHeap, lockmode_old, false));
+	Assert(cl_index == NULL ||
+		   CheckRelationLockedByMe(cl_index, lockmode_old, false));
+	/* This is expected from the caller. */
+	Assert(CheckRelationLockedByMe(NewHeap, AccessExclusiveLock, false));
+
+	ident_idx_old = RelationGetReplicaIndex(OldHeap);
+
+	/*
+	 * Unlike the exclusive case, we build new indexes for the new relation
+	 * rather than swapping the storage and reindexing the old relation. The
+	 * point is that the index build can take some time, so we do it before we
+	 * get AccessExclusiveLock on the old heap and therefore we cannot swap
+	 * the heap storage yet.
+	 *
+	 * index_create() will lock the new indexes using AccessExclusiveLock - no
+	 * need to change that.
+	 *
+	 * We assume that ShareUpdateExclusiveLock on the table prevents anyone
+	 * from dropping the existing indexes or adding new ones, so the lists of
+	 * old and new indexes should match at the swap time. On the other hand we
+	 * do not block ALTER INDEX commands that do not require table lock (e.g.
+	 * ALTER INDEX ... SET ...).
+	 *
+	 * XXX Should we check at the end of our work if another transaction
+	 * executed such a command and issue a NOTICE that we might have discarded
+	 * its effects? (For example, someone changes storage parameter after we
+	 * have created the new index, the new value of that parameter is lost.)
+	 * Alternatively, we can lock all the indexes now in a mode that blocks
+	 * all the ALTER INDEX commands (ShareUpdateExclusiveLock ?), and keep
+	 * them locked till the end of the transaction. That might increase the
+	 * risk of deadlock during the lock upgrade below, however SELECT / DML
+	 * queries should not be involved in such a deadlock.
+	 */
+	ind_oids_new = build_new_indexes(NewHeap, OldHeap, ind_oids_old);
+
+	/*
+	 * Processing shouldn't start without a valid identity index.
+	 */
+	Assert(OidIsValid(ident_idx_old));
+
+	/* Find "identity index" on the new relation. */
+	ident_idx_new = InvalidOid;
+	forboth(lc, ind_oids_old, lc2, ind_oids_new)
+	{
+		Oid			ind_old = lfirst_oid(lc);
+		Oid			ind_new = lfirst_oid(lc2);
+
+		if (ident_idx_old == ind_old)
+		{
+			ident_idx_new = ind_new;
+			break;
+		}
+	}
+	if (!OidIsValid(ident_idx_new))
+
+		/*
+		 * Should not happen, given our lock on the old relation.
+		 */
+		ereport(ERROR,
+				(errmsg("Identity index missing on the new relation")));
+
+	/* Executor state to update indexes. */
+	iistate = get_index_insert_state(NewHeap, ident_idx_new);
+
+	/*
+	 * Build scan key that we'll use to look for rows to be updated / deleted
+	 * during logical decoding.
+	 */
+	ident_key = build_identity_key(ident_idx_new, OldHeap, &ident_key_nentries);
+
+	/*
+	 * During testing, wait for another backend to perform concurrent data
+	 * changes which we will process below.
+	 */
+	INJECTION_POINT("repack-concurrently-before-lock", NULL);
+
+	/*
+	 * Flush all WAL records inserted so far (possibly except for the last
+	 * incomplete page, see GetInsertRecPtr), to minimize the amount of data
+	 * we need to flush while holding exclusive lock on the source table.
+	 */
+	wal_insert_ptr = GetInsertRecPtr();
+	XLogFlush(wal_insert_ptr);
+	end_of_wal = GetFlushRecPtr(NULL);
+
+	/*
+	 * Apply concurrent changes for the first time, to minimize the time we
+	 * need to hold AccessExclusiveLock. (Quite some amount of WAL could have
+	 * been written during the data copying and index creation.)
+	 */
+	process_concurrent_changes(ctx, end_of_wal, NewHeap,
+							   swap_toast_by_content ? OldHeap : NULL,
+							   ident_key, ident_key_nentries, iistate);
+
+	/*
+	 * Acquire AccessExclusiveLock on the table, its TOAST relation (if there
+	 * is one), all its indexes, so that we can swap the files.
+	 *
+	 * Before that, unlock the index temporarily to avoid deadlock in case
+	 * another transaction is trying to lock it while holding the lock on the
+	 * table.
+	 */
+	if (cl_index)
+	{
+		index_close(cl_index, ShareUpdateExclusiveLock);
+		cl_index = NULL;
+	}
+	/* Lock the TOAST relation (if there is one) before the table itself. */
+	if (OldHeap->rd_rel->reltoastrelid)
+		LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
+	/* Finally lock the table */
+	LockRelationOid(old_table_oid, AccessExclusiveLock);
+
+	/*
+	 * Lock all indexes now, not only the clustering one: all indexes need to
+	 * have their files swapped. While doing that, store their relation
+	 * references in an array, to handle predicate locks below.
+	 */
+	ind_refs_p = ind_refs = palloc_array(Relation, list_length(ind_oids_old));
+	nind = 0;
+	foreach(lc, ind_oids_old)
+	{
+		Oid			ind_oid;
+		Relation	index;
+
+		ind_oid = lfirst_oid(lc);
+		index = index_open(ind_oid, AccessExclusiveLock);
+
+		/*
+		 * TODO 1) Do we need to check if ALTER INDEX was executed since the
+		 * new index was created in build_new_indexes()? 2) Specifically for
+		 * the clustering index, should check_index_is_clusterable() be called
+		 * here? (Not sure about the latter: ShareUpdateExclusiveLock on the
+		 * table probably blocks all commands that affect the result of
+		 * check_index_is_clusterable().)
+		 */
+		*ind_refs_p = index;
+		ind_refs_p++;
+		nind++;
+	}
+
+	/*
+	 * In addition, lock the OldHeap's TOAST relation exclusively - again, the
+	 * lock is needed to swap the files.
+	 */
+	if (OidIsValid(OldHeap->rd_rel->reltoastrelid))
+		LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
+
+	/*
+	 * Tuples and pages of the old heap will be gone, but the heap will stay.
+	 */
+	TransferPredicateLocksToHeapRelation(OldHeap);
+	/* The same for indexes. */
+	for (int i = 0; i < nind; i++)
+	{
+		Relation	index = ind_refs[i];
+
+		TransferPredicateLocksToHeapRelation(index);
+
+		/*
+		 * References to indexes on the old relation are not needed anymore,
+		 * however locks stay till the end of the transaction.
+		 */
+		index_close(index, NoLock);
+	}
+	pfree(ind_refs);
+
+	/*
+	 * Flush anything we see in WAL, to make sure that all changes committed
+	 * while we were waiting for the exclusive lock are available for
+	 * decoding. This should not be necessary if all backends had
+	 * synchronous_commit set, but we can't rely on this setting.
+	 *
+	 * Unfortunately, GetInsertRecPtr() may lag behind the actual insert
+	 * position, and GetLastImportantRecPtr() points at the start of the last
+	 * record rather than at the end. Thus the simplest way to determine the
+	 * insert position is to insert a dummy record and use its LSN.
+	 *
+	 * XXX Consider using GetLastImportantRecPtr() and adding the size of the
+	 * last record (plus the total size of all the page headers the record
+	 * spans)?
+	 */
+	XLogBeginInsert();
+	XLogRegisterData(&dummy_rec_data, 1);
+	wal_insert_ptr = XLogInsert(RM_XLOG_ID, XLOG_NOOP);
+	XLogFlush(wal_insert_ptr);
+	end_of_wal = GetFlushRecPtr(NULL);
+
+	/* Apply the concurrent changes again. */
+	process_concurrent_changes(ctx, end_of_wal, NewHeap,
+							   swap_toast_by_content ? OldHeap : NULL,
+							   ident_key, ident_key_nentries, iistate);
+
+	/* Remember info about rel before closing OldHeap */
+	relpersistence = OldHeap->rd_rel->relpersistence;
+	is_system_catalog = IsSystemRelation(OldHeap);
+
+	pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+								 PROGRESS_REPACK_PHASE_SWAP_REL_FILES);
+
+	/*
+	 * Even ShareUpdateExclusiveLock should have prevented others from
+	 * creating / dropping indexes (even using the CONCURRENTLY option), so we
+	 * do not need to check whether the lists match.
+	 */
+	forboth(lc, ind_oids_old, lc2, ind_oids_new)
+	{
+		Oid			ind_old = lfirst_oid(lc);
+		Oid			ind_new = lfirst_oid(lc2);
+		Oid			mapped_tables[4];
+
+		/* Zero out possible results from swap_relation_files */
+		memset(mapped_tables, 0, sizeof(mapped_tables));
+
+		swap_relation_files(ind_old, ind_new,
+							(old_table_oid == RelationRelationId),
+							swap_toast_by_content,
+							true,
+							InvalidTransactionId,
+							InvalidMultiXactId,
+							mapped_tables);
+
+#ifdef USE_ASSERT_CHECKING
+
+		/*
+		 * Concurrent processing is not supported for system relations, so
+		 * there should be no mapped tables.
+		 */
+		for (int i = 0; i < 4; i++)
+			Assert(mapped_tables[i] == 0);
+#endif
+	}
+
+	/* The new indexes must be visible for deletion. */
+	CommandCounterIncrement();
+
+	/* Close the old heap but keep lock until transaction commit. */
+	table_close(OldHeap, NoLock);
+	/* Close the new heap. (We didn't have to open its indexes). */
+	table_close(NewHeap, NoLock);
+
+	/* Cleanup what we don't need anymore. (And close the identity index.) */
+	pfree(ident_key);
+	free_index_insert_state(iistate);
+
+	/*
+	 * Swap the relations and their TOAST relations and TOAST indexes. This
+	 * also drops the new relation and its indexes.
+	 *
+	 * (System catalogs are currently not supported.)
+	 */
+	Assert(!is_system_catalog);
+	finish_heap_swap(old_table_oid, new_table_oid,
+					 is_system_catalog,
+					 swap_toast_by_content,
+					 false, true, false,
+					 frozenXid, cutoffMulti,
+					 relpersistence);
+}
+
+/*
+ * Build indexes on NewHeap according to those on OldHeap.
+ *
+ * OldIndexes is the list of index OIDs on OldHeap.
+ *
+ * A list of OIDs of the corresponding indexes created on NewHeap is
+ * returned. The order of items does match, so we can use these lists to swap
+ * index storage.
+ */
+static List *
+build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes)
+{
+	ListCell   *lc;
+	List	   *result = NIL;
+
+	pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+								 PROGRESS_REPACK_PHASE_REBUILD_INDEX);
+
+	foreach(lc, OldIndexes)
+	{
+		Oid			ind_oid,
+					ind_oid_new;
+		char	   *newName;
+		Relation	ind;
+
+		ind_oid = lfirst_oid(lc);
+		ind = index_open(ind_oid, AccessShareLock);
+
+		newName = ChooseRelationName(get_rel_name(ind_oid),
+									 NULL,
+									 "repacknew",
+									 get_rel_namespace(ind->rd_index->indrelid),
+									 false);
+		ind_oid_new = index_create_copy(NewHeap, ind_oid,
+										ind->rd_rel->reltablespace, newName,
+										false);
+		result = lappend_oid(result, ind_oid_new);
+
+		index_close(ind, AccessShareLock);
+	}
+
+	return result;
+}
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 188e26f0e6e..71b73c21ebf 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -904,7 +904,7 @@ refresh_by_match_merge(Oid matviewOid, Oid tempOid, Oid relowner,
 static void
 refresh_by_heap_swap(Oid matviewOid, Oid OIDNewHeap, char relpersistence)
 {
-	finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true,
+	finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true, true,
 					 RecentXmin, ReadNextMultiXactId(), relpersistence);
 }
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 082a3575d62..c79f5b1dc0f 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5989,6 +5989,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
 			finish_heap_swap(tab->relid, OIDNewHeap,
 							 false, false, true,
 							 !OidIsValid(tab->newTableSpace),
+							 true,
 							 RecentXmin,
 							 ReadNextMultiXactId(),
 							 persistence);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 8863ad0e8bd..6de9d0ba39d 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -125,7 +125,7 @@ static void vac_truncate_clog(TransactionId frozenXID,
 							  TransactionId lastSaneFrozenXid,
 							  MultiXactId lastSaneMinMulti);
 static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams params,
-					   BufferAccessStrategy bstrategy);
+					   BufferAccessStrategy bstrategy, bool isTopLevel);
 static double compute_parallel_delay(void);
 static VacOptValue get_vacoptval_from_boolean(DefElem *def);
 static bool vac_tid_reaped(ItemPointer itemptr, void *state);
@@ -633,7 +633,8 @@ vacuum(List *relations, const VacuumParams params, BufferAccessStrategy bstrateg
 
 			if (params.options & VACOPT_VACUUM)
 			{
-				if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy))
+				if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy,
+								isTopLevel))
 					continue;
 			}
 
@@ -1997,7 +1998,7 @@ vac_truncate_clog(TransactionId frozenXID,
  */
 static bool
 vacuum_rel(Oid relid, RangeVar *relation, VacuumParams params,
-		   BufferAccessStrategy bstrategy)
+		   BufferAccessStrategy bstrategy, bool isTopLevel)
 {
 	LOCKMODE	lmode;
 	Relation	rel;
@@ -2288,7 +2289,7 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams params,
 
 			/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
 			cluster_rel(REPACK_COMMAND_VACUUMFULL, false, rel, InvalidOid,
-						&cluster_params);
+						&cluster_params, isTopLevel);
 			/* cluster_rel closes the relation, but keeps lock */
 
 			rel = NULL;
@@ -2331,7 +2332,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams params,
 		toast_vacuum_params.options |= VACOPT_PROCESS_MAIN;
 		toast_vacuum_params.toast_parent = relid;
 
-		vacuum_rel(toast_relid, NULL, toast_vacuum_params, bstrategy);
+		vacuum_rel(toast_relid, NULL, toast_vacuum_params, bstrategy,
+				   isTopLevel);
 	}
 
 	/*
diff --git a/src/backend/meson.build b/src/backend/meson.build
index b831a541652..5c148131217 100644
--- a/src/backend/meson.build
+++ b/src/backend/meson.build
@@ -194,5 +194,6 @@ pg_test_mod_args = pg_mod_args + {
 subdir('jit/llvm')
 subdir('replication/libpqwalreceiver')
 subdir('replication/pgoutput')
+subdir('replication/pgoutput_repack')
 subdir('snowball')
 subdir('utils/mb/conversion_procs')
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index cc03f0706e9..5dc4ae58ffe 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -33,6 +33,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecord.h"
 #include "catalog/pg_control.h"
+#include "commands/cluster.h"
 #include "replication/decode.h"
 #include "replication/logical.h"
 #include "replication/message.h"
@@ -472,6 +473,88 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	TransactionId xid = XLogRecGetXid(buf->record);
 	SnapBuild  *builder = ctx->snapshot_builder;
 
+	/*
+	 * If the change is not intended for logical decoding, do not even
+	 * establish transaction for it - REPACK CONCURRENTLY is the typical use
+	 * case.
+	 *
+	 * First, check if REPACK CONCURRENTLY is being performed by this backend.
+	 * If so, only decode data changes of the table that it is processing, and
+	 * the changes of its TOAST relation.
+	 *
+	 * (TOAST locator should not be set unless the main is.)
+	 */
+	Assert(!OidIsValid(repacked_rel_toast_locator.relNumber) ||
+		   OidIsValid(repacked_rel_locator.relNumber));
+
+	if (OidIsValid(repacked_rel_locator.relNumber))
+	{
+		XLogReaderState *r = buf->record;
+		RelFileLocator locator;
+
+		/* Not all records contain the block. */
+		if (XLogRecGetBlockTagExtended(r, 0, &locator, NULL, NULL, NULL) &&
+			!RelFileLocatorEquals(locator, repacked_rel_locator) &&
+			(!OidIsValid(repacked_rel_toast_locator.relNumber) ||
+			 !RelFileLocatorEquals(locator, repacked_rel_toast_locator)))
+			return;
+	}
+
+	/*
+	 * Second, skip records which do not contain sufficient information for
+	 * the decoding.
+	 *
+	 * The problem we solve here is that REPACK CONCURRENTLY generates WAL
+	 * when doing changes in the new table. Those changes should not be useful
+	 * for any other user (such as logical replication subscription) because
+	 * the new table will eventually be dropped (after REPACK CONCURRENTLY has
+	 * assigned its file to the "old table").
+	 */
+	switch (info)
+	{
+		case XLOG_HEAP_INSERT:
+			{
+				xl_heap_insert *rec;
+
+				rec = (xl_heap_insert *) XLogRecGetData(buf->record);
+
+				/*
+				 * This does happen when 1) raw_heap_insert marks the TOAST
+				 * record as HEAP_INSERT_NO_LOGICAL, 2) REPACK CONCURRENTLY
+				 * replays inserts performed by other backends.
+				 */
+				if ((rec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE) == 0)
+					return;
+
+				break;
+			}
+
+		case XLOG_HEAP_HOT_UPDATE:
+		case XLOG_HEAP_UPDATE:
+			{
+				xl_heap_update *rec;
+
+				rec = (xl_heap_update *) XLogRecGetData(buf->record);
+				if ((rec->flags &
+					 (XLH_UPDATE_CONTAINS_NEW_TUPLE |
+					  XLH_UPDATE_CONTAINS_OLD_TUPLE |
+					  XLH_UPDATE_CONTAINS_OLD_KEY)) == 0)
+					return;
+
+				break;
+			}
+
+		case XLOG_HEAP_DELETE:
+			{
+				xl_heap_delete *rec;
+
+				rec = (xl_heap_delete *) XLogRecGetData(buf->record);
+				if (rec->flags & XLH_DELETE_NO_LOGICAL)
+					return;
+				break;
+			}
+	}
+
 	ReorderBufferProcessXid(ctx->reorder, xid, buf->origptr);
 
 	/*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index a2f1803622c..d69229905a2 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -486,6 +486,27 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 	return SnapBuildMVCCFromHistoric(snap, true);
 }
 
+/*
+ * Build an MVCC snapshot for the initial data load performed by REPACK
+ * CONCURRENTLY command.
+ *
+ * The snapshot will only be used to scan one particular relation, which is
+ * treated like a catalog (therefore ->building_full_snapshot is not
+ * important), and the caller should already have a replication slot setup (so
+ * we do not set MyProc->xmin). XXX Do we need to add further restrictions?
+ */
+Snapshot
+SnapBuildInitialSnapshotForRepack(SnapBuild *builder)
+{
+	Snapshot	snap;
+
+	Assert(builder->state == SNAPBUILD_CONSISTENT);
+	Assert(builder->building_full_snapshot);
+
+	snap = SnapBuildBuildSnapshot(builder);
+	return SnapBuildMVCCFromHistoric(snap, false);
+}
+
 /*
  * Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
  *
diff --git a/src/backend/replication/pgoutput_repack/Makefile b/src/backend/replication/pgoutput_repack/Makefile
new file mode 100644
index 00000000000..4efeb713b70
--- /dev/null
+++ b/src/backend/replication/pgoutput_repack/Makefile
@@ -0,0 +1,32 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for src/backend/replication/pgoutput_repack
+#
+# IDENTIFICATION
+#    src/backend/replication/pgoutput_repack
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/replication/pgoutput_repack
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	$(WIN32RES) \
+	pgoutput_repack.o
+PGFILEDESC = "pgoutput_repack - logical replication output plugin for REPACK command"
+NAME = pgoutput_repack
+
+all: all-shared-lib
+
+include $(top_srcdir)/src/Makefile.shlib
+
+install: all installdirs install-lib
+
+installdirs: installdirs-lib
+
+uninstall: uninstall-lib
+
+clean distclean: clean-lib
+	rm -f $(OBJS)
diff --git a/src/backend/replication/pgoutput_repack/meson.build b/src/backend/replication/pgoutput_repack/meson.build
new file mode 100644
index 00000000000..133e865a4a0
--- /dev/null
+++ b/src/backend/replication/pgoutput_repack/meson.build
@@ -0,0 +1,18 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+pgoutput_repack_sources = files(
+  'pgoutput_repack.c',
+)
+
+if host_system == 'windows'
+  pgoutput_repack_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'pgoutput_repack',
+    '--FILEDESC', 'pgoutput_repack - logical replication output plugin for REPACK command',])
+endif
+
+pgoutput_repack = shared_module('pgoutput_repack',
+  pgoutput_repack_sources,
+  kwargs: pg_mod_args,
+)
+
+backend_targets += pgoutput_repack
diff --git a/src/backend/replication/pgoutput_repack/pgoutput_repack.c b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
new file mode 100644
index 00000000000..687fbbc59bb
--- /dev/null
+++ b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
@@ -0,0 +1,288 @@
+/*-------------------------------------------------------------------------
+ *
+ * pgoutput_repack.c
+ *		Logical Replication output plugin for REPACK command
+ *
+ * Copyright (c) 2012-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		  src/backend/replication/pgoutput_repack/pgoutput_repack.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heaptoast.h"
+#include "commands/cluster.h"
+#include "replication/snapbuild.h"
+
+PG_MODULE_MAGIC;
+
+static void plugin_startup(LogicalDecodingContext *ctx,
+						   OutputPluginOptions *opt, bool is_init);
+static void plugin_shutdown(LogicalDecodingContext *ctx);
+static void plugin_begin_txn(LogicalDecodingContext *ctx,
+							 ReorderBufferTXN *txn);
+static void plugin_commit_txn(LogicalDecodingContext *ctx,
+							  ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  Relation rel, ReorderBufferChange *change);
+static void plugin_truncate(struct LogicalDecodingContext *ctx,
+							ReorderBufferTXN *txn, int nrelations,
+							Relation relations[],
+							ReorderBufferChange *change);
+static void store_change(LogicalDecodingContext *ctx,
+						 ConcurrentChangeKind kind, HeapTuple tuple);
+
+void
+_PG_output_plugin_init(OutputPluginCallbacks *cb)
+{
+	AssertVariableIsOfType(&_PG_output_plugin_init, LogicalOutputPluginInit);
+
+	cb->startup_cb = plugin_startup;
+	cb->begin_cb = plugin_begin_txn;
+	cb->change_cb = plugin_change;
+	cb->truncate_cb = plugin_truncate;
+	cb->commit_cb = plugin_commit_txn;
+	cb->shutdown_cb = plugin_shutdown;
+}
+
+
+/* initialize this plugin */
+static void
+plugin_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
+			   bool is_init)
+{
+	ctx->output_plugin_private = NULL;
+
+	/* Probably unnecessary, as we don't use the SQL interface ... */
+	opt->output_type = OUTPUT_PLUGIN_BINARY_OUTPUT;
+
+	if (ctx->output_plugin_options != NIL)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("This plugin does not expect any options")));
+	}
+}
+
+static void
+plugin_shutdown(LogicalDecodingContext *ctx)
+{
+}
+
+/*
+ * As we don't release the slot during processing of a particular table,
+ * there's no room for the SQL interface, even for debugging purposes.
+ * Therefore we need neither OutputPluginPrepareWrite() nor OutputPluginWrite()
+ * in the plugin callbacks. (Although we might want to write custom callbacks,
+ * this API seems to be unnecessarily generic for our purposes.)
+ */
+
+/* BEGIN callback */
+static void
+plugin_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+}
+
+/* COMMIT callback */
+static void
+plugin_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+				  XLogRecPtr commit_lsn)
+{
+}
+
+/*
+ * Callback for individual changed tuples
+ */
+static void
+plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+			  Relation relation, ReorderBufferChange *change)
+{
+	RepackDecodingState *dstate;
+
+	dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+	/* Only interested in one particular relation. */
+	if (relation->rd_id != dstate->relid)
+		return;
+
+	/* Decode entry depending on its type */
+	switch (change->action)
+	{
+		case REORDER_BUFFER_CHANGE_INSERT:
+			{
+				HeapTuple	newtuple;
+
+				newtuple = change->data.tp.newtuple != NULL ?
+					change->data.tp.newtuple : NULL;
+
+				/*
+				 * Identity checks in the main function should have made this
+				 * impossible.
+				 */
+				if (newtuple == NULL)
+					elog(ERROR, "Incomplete insert info.");
+
+				store_change(ctx, CHANGE_INSERT, newtuple);
+			}
+			break;
+		case REORDER_BUFFER_CHANGE_UPDATE:
+			{
+				HeapTuple	oldtuple,
+							newtuple;
+
+				oldtuple = change->data.tp.oldtuple != NULL ?
+					change->data.tp.oldtuple : NULL;
+				newtuple = change->data.tp.newtuple != NULL ?
+					change->data.tp.newtuple : NULL;
+
+				if (newtuple == NULL)
+					elog(ERROR, "Incomplete update info.");
+
+				if (oldtuple != NULL)
+					store_change(ctx, CHANGE_UPDATE_OLD, oldtuple);
+
+				store_change(ctx, CHANGE_UPDATE_NEW, newtuple);
+			}
+			break;
+		case REORDER_BUFFER_CHANGE_DELETE:
+			{
+				HeapTuple	oldtuple;
+
+				oldtuple = change->data.tp.oldtuple ?
+					change->data.tp.oldtuple : NULL;
+
+				if (oldtuple == NULL)
+					elog(ERROR, "Incomplete delete info.");
+
+				store_change(ctx, CHANGE_DELETE, oldtuple);
+			}
+			break;
+		default:
+			/* Should not come here */
+			Assert(false);
+			break;
+	}
+}
+
+static void
+plugin_truncate(struct LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+				int nrelations, Relation relations[],
+				ReorderBufferChange *change)
+{
+	RepackDecodingState *dstate;
+	int			i;
+	Relation	relation = NULL;
+
+	dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+	/* Find the relation we are processing. */
+	for (i = 0; i < nrelations; i++)
+	{
+		relation = relations[i];
+
+		if (RelationGetRelid(relation) == dstate->relid)
+			break;
+	}
+
+	/* Is this truncation of another relation? */
+	if (i == nrelations)
+		return;
+
+	store_change(ctx, CHANGE_TRUNCATE, NULL);
+}
+
+/* Store concurrent data change. */
+static void
+store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
+			 HeapTuple tuple)
+{
+	RepackDecodingState *dstate;
+	char	   *change_raw;
+	ConcurrentChange change;
+	bool		flattened = false;
+	Size		size;
+	Datum		values[1];
+	bool		isnull[1];
+	char	   *dst,
+			   *dst_start;
+
+	dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+	size = MAXALIGN(VARHDRSZ) + SizeOfConcurrentChange;
+
+	if (tuple)
+	{
+		/*
+		 * ReorderBufferCommit() stores the TOAST chunks in its private memory
+		 * context and frees them after having called apply_change().
+		 * Therefore we need a flat copy (including TOAST) that we eventually
+		 * copy into the memory context which is available to
+		 * decode_concurrent_changes().
+		 */
+		if (HeapTupleHasExternal(tuple))
+		{
+			/*
+			 * toast_flatten_tuple_to_datum() might be more convenient but we
+			 * don't want the decompression it does.
+			 */
+			tuple = toast_flatten_tuple(tuple, dstate->tupdesc);
+			flattened = true;
+		}
+
+		size += tuple->t_len;
+	}
+
+	/* XXX Isn't there any function / macro to do this? */
+	if (size >= 0x3FFFFFFF)
+		elog(ERROR, "Change is too big.");
+
+	/* Construct the change. */
+	change_raw = (char *) palloc0(size);
+	SET_VARSIZE(change_raw, size);
+
+	/*
+	 * Since the varlena alignment might not be sufficient for the structure,
+	 * set the fields in a local instance and remember where it should
+	 * eventually be copied.
+	 */
+	change.kind = kind;
+	dst_start = (char *) VARDATA(change_raw);
+
+	/* No other information is needed for TRUNCATE. */
+	if (change.kind == CHANGE_TRUNCATE)
+	{
+		memcpy(dst_start, &change, SizeOfConcurrentChange);
+		goto store;
+	}
+
+	/*
+	 * Copy the tuple.
+	 *
+	 * CAUTION: change->tup_data.t_data must be fixed on retrieval!
+	 */
+	memcpy(&change.tup_data, tuple, sizeof(HeapTupleData));
+	dst = dst_start + SizeOfConcurrentChange;
+	memcpy(dst, tuple->t_data, tuple->t_len);
+
+	/* The data has been copied. */
+	if (flattened)
+		pfree(tuple);
+
+store:
+	/* Copy the structure so it can be stored. */
+	memcpy(dst_start, &change, SizeOfConcurrentChange);
+
+	/* Store as tuple of 1 bytea column. */
+	values[0] = PointerGetDatum(change_raw);
+	isnull[0] = false;
+	tuplestore_putvalues(dstate->tstore, dstate->tupdesc_change,
+						 values, isnull);
+
+	/* Accounting. */
+	dstate->nchanges++;
+
+	/* Cleanup. */
+	pfree(change_raw);
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..e9ddf39500c 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -25,6 +25,7 @@
 #include "access/xlogprefetcher.h"
 #include "access/xlogrecovery.h"
 #include "commands/async.h"
+#include "commands/cluster.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
diff --git a/src/backend/storage/lmgr/generate-lwlocknames.pl b/src/backend/storage/lmgr/generate-lwlocknames.pl
index cd3e43c448a..519f3953638 100644
--- a/src/backend/storage/lmgr/generate-lwlocknames.pl
+++ b/src/backend/storage/lmgr/generate-lwlocknames.pl
@@ -162,7 +162,7 @@ while (<$lwlocklist>)
 
 die
   "$wait_event_lwlocks[$lwlock_count] defined in wait_event_names.txt but "
-  . " missing from lwlocklist.h"
+  . "missing from lwlocklist.h"
   if $lwlock_count < scalar @wait_event_lwlocks;
 
 die
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 6fe268a8eec..d27a4c30548 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -64,6 +64,7 @@
 #include "catalog/pg_type.h"
 #include "catalog/schemapg.h"
 #include "catalog/storage.h"
+#include "commands/cluster.h"
 #include "commands/policy.h"
 #include "commands/publicationcmds.h"
 #include "commands/trigger.h"
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index bc7840052fe..6d46537cbe8 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -213,7 +213,6 @@ static List *exportedSnapshots = NIL;
 
 /* Prototypes for local functions */
 static void UnregisterSnapshotNoOwner(Snapshot snapshot);
-static void FreeSnapshot(Snapshot snapshot);
 static void SnapshotResetXmin(void);
 
 /* ResourceOwner callbacks to track snapshot references */
@@ -657,7 +656,7 @@ CopySnapshot(Snapshot snapshot)
  * FreeSnapshot
  *		Free the memory associated with a snapshot.
  */
-static void
+void
 FreeSnapshot(Snapshot snapshot)
 {
 	Assert(snapshot->regd_count == 0);
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 59ff6e0923b..528fb08154a 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -4998,18 +4998,27 @@ match_previous_words(int pattern_id,
 	}
 
 /* REPACK */
-	else if (Matches("REPACK"))
+	else if (Matches("REPACK") || Matches("REPACK", "(*)"))
+		COMPLETE_WITH_SCHEMA_QUERY_PLUS(Query_for_list_of_clusterables,
+										"CONCURRENTLY");
+	else if (Matches("REPACK", "CONCURRENTLY"))
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
-	else if (Matches("REPACK", "(*)"))
+	else if (Matches("REPACK", "(*)", "CONCURRENTLY"))
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
-	/* If we have REPACK <sth>, then add "USING INDEX" */
-	else if (Matches("REPACK", MatchAnyExcept("(")))
+	/* If we have REPACK [ CONCURRENTLY ] <sth>, then add "USING INDEX" */
+	else if (Matches("REPACK", MatchAnyExcept("(|CONCURRENTLY")) ||
+			 Matches("REPACK", "CONCURRENTLY", MatchAnyExcept("(")))
 		COMPLETE_WITH("USING INDEX");
-	/* If we have REPACK (*) <sth>, then add "USING INDEX" */
-	else if (Matches("REPACK", "(*)", MatchAny))
+	/* If we have REPACK (*) [ CONCURRENTLY ] <sth>, then add "USING INDEX" */
+	else if (Matches("REPACK", "(*)", MatchAnyExcept("CONCURRENTLY")) ||
+			 Matches("REPACK", "(*)", "CONCURRENTLY", MatchAnyExcept("(")))
 		COMPLETE_WITH("USING INDEX");
-	/* If we have REPACK <sth> USING, then add the index as well */
-	else if (Matches("REPACK", MatchAny, "USING", "INDEX"))
+
+	/*
+	 * Complete ... [ (*) ] [ CONCURRENTLY ] <sth> USING INDEX, with a list of
+	 * indexes for <sth>.
+	 */
+	else if (TailMatches(MatchAnyExcept("(|CONCURRENTLY"), "USING", "INDEX"))
 	{
 		set_completion_reference(prev3_wd);
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_index_of_table);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index a2bd5a897f8..b82dd17a966 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -323,14 +323,15 @@ extern void heap_multi_insert(Relation relation, struct TupleTableSlot **slots,
 							  BulkInsertState bistate);
 extern TM_Result heap_delete(Relation relation, ItemPointer tid,
 							 CommandId cid, Snapshot crosscheck, bool wait,
-							 struct TM_FailureData *tmfd, bool changingPart);
+							 struct TM_FailureData *tmfd, bool changingPart,
+							 bool wal_logical);
 extern void heap_finish_speculative(Relation relation, ItemPointer tid);
 extern void heap_abort_speculative(Relation relation, ItemPointer tid);
 extern TM_Result heap_update(Relation relation, ItemPointer otid,
 							 HeapTuple newtup,
 							 CommandId cid, Snapshot crosscheck, bool wait,
 							 struct TM_FailureData *tmfd, LockTupleMode *lockmode,
-							 TU_UpdateIndexes *update_indexes);
+							 TU_UpdateIndexes *update_indexes, bool wal_logical);
 extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
 								 CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
 								 bool follow_updates,
@@ -411,6 +412,10 @@ extern HTSV_Result HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer
 												   TransactionId *dead_after);
 extern void HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
 								 uint16 infomask, TransactionId xid);
+extern bool HeapTupleMVCCInserted(HeapTuple htup, Snapshot snapshot,
+								  Buffer buffer);
+extern bool HeapTupleMVCCNotDeleted(HeapTuple htup, Snapshot snapshot,
+									Buffer buffer);
 extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
 extern bool HeapTupleIsSurelyDead(HeapTuple htup,
 								  struct GlobalVisState *vistest);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 277df6b3cf0..8d4af07f840 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -104,6 +104,8 @@
 #define XLH_DELETE_CONTAINS_OLD_KEY				(1<<2)
 #define XLH_DELETE_IS_SUPER						(1<<3)
 #define XLH_DELETE_IS_PARTITION_MOVE			(1<<4)
+/* See heap_delete() */
+#define XLH_DELETE_NO_LOGICAL					(1<<5)
 
 /* convenience macro for checking whether any form of old tuple was logged */
 #define XLH_DELETE_CONTAINS_OLD						\
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 1c9e802a6b1..289b64edfd9 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -22,6 +22,7 @@
 #include "access/xact.h"
 #include "commands/vacuum.h"
 #include "executor/tuptable.h"
+#include "replication/logical.h"
 #include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -623,6 +624,8 @@ typedef struct TableAmRoutine
 											  Relation OldIndex,
 											  bool use_sort,
 											  TransactionId OldestXmin,
+											  Snapshot snapshot,
+											  LogicalDecodingContext *decoding_ctx,
 											  TransactionId *xid_cutoff,
 											  MultiXactId *multi_cutoff,
 											  double *num_tuples,
@@ -1627,6 +1630,10 @@ table_relation_copy_data(Relation rel, const RelFileLocator *newrlocator)
  *   not needed for the relation's AM
  * - *xid_cutoff - ditto
  * - *multi_cutoff - ditto
+ * - snapshot - if != NULL, ignore data changes done by transactions that this
+ *	 (MVCC) snapshot considers still in-progress or in the future.
+ * - decoding_ctx - logical decoding context, to capture concurrent data
+ *   changes.
  *
  * Output parameters:
  * - *xid_cutoff - rel's new relfrozenxid value, may be invalid
@@ -1639,6 +1646,8 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
 								Relation OldIndex,
 								bool use_sort,
 								TransactionId OldestXmin,
+								Snapshot snapshot,
+								LogicalDecodingContext *decoding_ctx,
 								TransactionId *xid_cutoff,
 								MultiXactId *multi_cutoff,
 								double *num_tuples,
@@ -1647,6 +1656,7 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
 {
 	OldTable->rd_tableam->relation_copy_for_cluster(OldTable, NewTable, OldIndex,
 													use_sort, OldestXmin,
+													snapshot, decoding_ctx,
 													xid_cutoff, multi_cutoff,
 													num_tuples, tups_vacuumed,
 													tups_recently_dead);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 890998d84bb..4a508c57a50 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -13,10 +13,15 @@
 #ifndef CLUSTER_H
 #define CLUSTER_H
 
+#include "nodes/execnodes.h"
 #include "nodes/parsenodes.h"
 #include "parser/parse_node.h"
+#include "replication/logical.h"
 #include "storage/lock.h"
+#include "storage/relfilelocator.h"
 #include "utils/relcache.h"
+#include "utils/resowner.h"
+#include "utils/tuplestore.h"
 
 
 /* flag bits for ClusterParams->options */
@@ -25,6 +30,8 @@
 #define CLUOPT_RECHECK_ISCLUSTERED 0x04 /* recheck relation state for
 										 * indisclustered */
 #define CLUOPT_ANALYZE 0x08		/* do an ANALYZE */
+#define CLUOPT_CONCURRENT 0x10	/* allow concurrent data changes */
+
 
 /* options for CLUSTER */
 typedef struct ClusterParams
@@ -33,14 +40,95 @@ typedef struct ClusterParams
 } ClusterParams;
 
 
+/*
+ * The following definitions are used by REPACK CONCURRENTLY.
+ */
+
+extern RelFileLocator repacked_rel_locator;
+extern RelFileLocator repacked_rel_toast_locator;
+
+typedef enum
+{
+	CHANGE_INSERT,
+	CHANGE_UPDATE_OLD,
+	CHANGE_UPDATE_NEW,
+	CHANGE_DELETE,
+	CHANGE_TRUNCATE
+} ConcurrentChangeKind;
+
+typedef struct ConcurrentChange
+{
+	/* See the enum above. */
+	ConcurrentChangeKind kind;
+
+	/*
+	 * The actual tuple.
+	 *
+	 * The tuple data follows the ConcurrentChange structure. Before use make
+	 * sure the tuple is correctly aligned (ConcurrentChange can be stored as
+	 * bytea) and that tuple->t_data is fixed.
+	 */
+	HeapTupleData tup_data;
+} ConcurrentChange;
+
+#define SizeOfConcurrentChange (offsetof(ConcurrentChange, tup_data) + \
+								sizeof(HeapTupleData))
+
+/*
+ * Logical decoding state.
+ *
+ * Here we store the data changes that we decode from WAL while the table
+ * contents are being copied to new storage. The metadata needed to apply
+ * these changes to the table is stored here as well.
+ */
+typedef struct RepackDecodingState
+{
+	/* The relation whose changes we're decoding. */
+	Oid			relid;
+
+	/*
+	 * Decoded changes are stored here. Although we try to avoid excessive
+	 * batches, it can happen that the changes need to be stored to disk. The
+	 * tuplestore does this transparently.
+	 */
+	Tuplestorestate *tstore;
+
+	/* The current number of changes in tstore. */
+	double		nchanges;
+
+	/*
+	 * Descriptor to store the ConcurrentChange structure serialized (bytea).
+	 * We can't store the tuple directly because tuplestore only supports
+	 * minimal tuples and we may need to transfer the OID system column from
+	 * the output plugin. We also need to transfer the change kind, so it's
+	 * better to put everything in one structure than to use 2 tuplestores
+	 * "in parallel".
+	 */
+	TupleDesc	tupdesc_change;
+
+	/* Tuple descriptor needed to update indexes. */
+	TupleDesc	tupdesc;
+
+	/* Slot to retrieve data from tstore. */
+	TupleTableSlot *tsslot;
+
+	ResourceOwner resowner;
+} RepackDecodingState;
+
+
+
 extern void ExecRepack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel);
 
 extern void cluster_rel(RepackCommand command, bool usingindex,
-						Relation OldHeap, Oid indexOid, ClusterParams *params);
+						Relation OldHeap, Oid indexOid, ClusterParams *params,
+						bool isTopLevel);
 extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
 									   LOCKMODE lockmode);
 extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
 
+extern void repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
+											 XLogRecPtr end_of_wal);
+
 extern Oid	make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
 						  char relpersistence, LOCKMODE lockmode);
 extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
@@ -48,6 +136,7 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
 							 bool swap_toast_by_content,
 							 bool check_constraints,
 							 bool is_internal,
+							 bool reindex,
 							 TransactionId frozenXid,
 							 MultiXactId cutoffMulti,
 							 char newrelpersistence);
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 5b6639c114c..93917ad5544 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -59,18 +59,20 @@
 /*
  * Progress parameters for REPACK.
  *
- * Note: Since REPACK shares some code with CLUSTER, these values are also
- * used by CLUSTER. (CLUSTER is now deprecated, so it makes little sense to
- * introduce a separate set of constants.)
+ * Note: Since REPACK shares some code with CLUSTER, (some of) these values
+ * are also used by CLUSTER. (CLUSTER is now deprecated, so it makes little
+ * sense to introduce a separate set of constants.)
  */
 #define PROGRESS_REPACK_COMMAND					0
 #define PROGRESS_REPACK_PHASE					1
 #define PROGRESS_REPACK_INDEX_RELID				2
 #define PROGRESS_REPACK_HEAP_TUPLES_SCANNED		3
-#define PROGRESS_REPACK_HEAP_TUPLES_WRITTEN		4
-#define PROGRESS_REPACK_TOTAL_HEAP_BLKS			5
-#define PROGRESS_REPACK_HEAP_BLKS_SCANNED		6
-#define PROGRESS_REPACK_INDEX_REBUILD_COUNT		7
+#define PROGRESS_REPACK_HEAP_TUPLES_INSERTED	4
+#define PROGRESS_REPACK_HEAP_TUPLES_UPDATED		5
+#define PROGRESS_REPACK_HEAP_TUPLES_DELETED		6
+#define PROGRESS_REPACK_TOTAL_HEAP_BLKS			7
+#define PROGRESS_REPACK_HEAP_BLKS_SCANNED		8
+#define PROGRESS_REPACK_INDEX_REBUILD_COUNT		9
 
 /*
  * Phases of repack (as advertised via PROGRESS_REPACK_PHASE).
@@ -79,9 +81,10 @@
 #define PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP	2
 #define PROGRESS_REPACK_PHASE_SORT_TUPLES		3
 #define PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP	4
-#define PROGRESS_REPACK_PHASE_SWAP_REL_FILES	5
-#define PROGRESS_REPACK_PHASE_REBUILD_INDEX		6
-#define PROGRESS_REPACK_PHASE_FINAL_CLEANUP		7
+#define PROGRESS_REPACK_PHASE_CATCH_UP			5
+#define PROGRESS_REPACK_PHASE_SWAP_REL_FILES	6
+#define PROGRESS_REPACK_PHASE_REBUILD_INDEX		7
+#define PROGRESS_REPACK_PHASE_FINAL_CLEANUP		8
 
 /*
  * Commands of PROGRESS_REPACK
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 6d4d2d1814c..802fc4b0823 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -73,6 +73,7 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
 
 extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildInitialSnapshotForRepack(SnapBuild *builder);
 extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
 extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
 extern void SnapBuildClearExportedSnapshot(void);
diff --git a/src/include/storage/lockdefs.h b/src/include/storage/lockdefs.h
index 7f3ba0352f6..2739327b0da 100644
--- a/src/include/storage/lockdefs.h
+++ b/src/include/storage/lockdefs.h
@@ -36,8 +36,8 @@ typedef int LOCKMODE;
 #define AccessShareLock			1	/* SELECT */
 #define RowShareLock			2	/* SELECT FOR UPDATE/FOR SHARE */
 #define RowExclusiveLock		3	/* INSERT, UPDATE, DELETE */
-#define ShareUpdateExclusiveLock 4	/* VACUUM (non-FULL), ANALYZE, CREATE
-									 * INDEX CONCURRENTLY */
+#define ShareUpdateExclusiveLock 4	/* VACUUM (non-exclusive), ANALYZE, CREATE
+									 * INDEX CONCURRENTLY, REPACK CONCURRENTLY */
 #define ShareLock				5	/* CREATE INDEX (WITHOUT CONCURRENTLY) */
 #define ShareRowExclusiveLock	6	/* like EXCLUSIVE MODE, but allows ROW
 									 * SHARE */
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index f65f83c85cd..1f821fd2ccd 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -64,6 +64,8 @@ extern Snapshot GetLatestSnapshot(void);
 extern void SnapshotSetCommandId(CommandId curcid);
 
 extern Snapshot CopySnapshot(Snapshot snapshot);
+extern void FreeSnapshot(Snapshot snapshot);
+
 extern Snapshot GetCatalogSnapshot(Oid relid);
 extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
 extern void InvalidateCatalogSnapshot(void);
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index fc82cd67f6c..f16422175f8 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -11,10 +11,11 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points hashagg reindex_conc vacuum
+# REGRESS = injection_points hashagg reindex_conc vacuum
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
-ISOLATION = basic inplace syscache-update-pruned
+ISOLATION = basic inplace syscache-update-pruned repack
+ISOLATION_OPTS = --temp-config $(top_srcdir)/src/test/modules/injection_points/logical.conf
 
 TAP_TESTS = 1
 
diff --git a/src/test/modules/injection_points/expected/repack.out b/src/test/modules/injection_points/expected/repack.out
new file mode 100644
index 00000000000..b575e9052ee
--- /dev/null
+++ b/src/test/modules/injection_points/expected/repack.out
@@ -0,0 +1,113 @@
+Parsed test spec with 2 sessions
+
+starting permutation: wait_before_lock change_existing change_new change_subxact1 change_subxact2 check2 wakeup_before_lock check1
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step wait_before_lock: 
+	REPACK (CONCURRENTLY) repack_test USING INDEX repack_test_pkey;
+ <waiting ...>
+step change_existing: 
+	UPDATE repack_test SET i=10 where i=1;
+	UPDATE repack_test SET j=20 where i=2;
+	UPDATE repack_test SET i=30 where i=3;
+	UPDATE repack_test SET i=40 where i=30;
+	DELETE FROM repack_test WHERE i=4;
+
+step change_new: 
+	INSERT INTO repack_test(i, j) VALUES (5, 5), (6, 6), (7, 7), (8, 8);
+	UPDATE repack_test SET i=50 where i=5;
+	UPDATE repack_test SET j=60 where i=6;
+	DELETE FROM repack_test WHERE i=7;
+
+step change_subxact1: 
+	BEGIN;
+	INSERT INTO repack_test(i, j) VALUES (100, 100);
+	SAVEPOINT s1;
+	UPDATE repack_test SET i=101 where i=100;
+	SAVEPOINT s2;
+	UPDATE repack_test SET i=102 where i=101;
+	COMMIT;
+
+step change_subxact2: 
+	BEGIN;
+	SAVEPOINT s1;
+	INSERT INTO repack_test(i, j) VALUES (110, 110);
+	ROLLBACK TO SAVEPOINT s1;
+	INSERT INTO repack_test(i, j) VALUES (110, 111);
+	COMMIT;
+
+step check2: 
+	INSERT INTO relfilenodes(node)
+	SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+	SELECT i, j FROM repack_test ORDER BY i, j;
+
+	INSERT INTO data_s2(i, j)
+	SELECT i, j FROM repack_test;
+
+  i|  j
+---+---
+  2| 20
+  6| 60
+  8|  8
+ 10|  1
+ 40|  3
+ 50|  5
+102|100
+110|111
+(8 rows)
+
+step wakeup_before_lock: 
+	SELECT injection_points_wakeup('repack-concurrently-before-lock');
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step wait_before_lock: <... completed>
+step check1: 
+	INSERT INTO relfilenodes(node)
+	SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+	SELECT count(DISTINCT node) FROM relfilenodes;
+
+	SELECT i, j FROM repack_test ORDER BY i, j;
+
+	INSERT INTO data_s1(i, j)
+	SELECT i, j FROM repack_test;
+
+	SELECT count(*)
+	FROM data_s1 d1 FULL JOIN data_s2 d2 USING (i, j)
+	WHERE d1.i ISNULL OR d2.i ISNULL;
+
+count
+-----
+    2
+(1 row)
+
+  i|  j
+---+---
+  2| 20
+  6| 60
+  8|  8
+ 10|  1
+ 40|  3
+ 50|  5
+102|100
+110|111
+(8 rows)
+
+count
+-----
+    0
+(1 row)
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
diff --git a/src/test/modules/injection_points/logical.conf b/src/test/modules/injection_points/logical.conf
new file mode 100644
index 00000000000..c8f264bc6cb
--- /dev/null
+++ b/src/test/modules/injection_points/logical.conf
@@ -0,0 +1 @@
+wal_level = logical
\ No newline at end of file
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 20390d6b4bf..29561103bbf 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -47,9 +47,13 @@ tests += {
     'specs': [
       'basic',
       'inplace',
+      'repack',
       'syscache-update-pruned',
     ],
     'runningcheck': false, # see syscache-update-pruned
+    # 'repack' requires wal_level = 'logical'.
+    'regress_args': ['--temp-config', files('logical.conf')],
+
   },
   'tap': {
     'env': {
diff --git a/src/test/modules/injection_points/specs/repack.spec b/src/test/modules/injection_points/specs/repack.spec
new file mode 100644
index 00000000000..75850334986
--- /dev/null
+++ b/src/test/modules/injection_points/specs/repack.spec
@@ -0,0 +1,143 @@
+# Prefix the system columns with underscore as they are not allowed as column
+# names.
+setup
+{
+	CREATE EXTENSION injection_points;
+
+	CREATE TABLE repack_test(i int PRIMARY KEY, j int);
+	INSERT INTO repack_test(i, j) VALUES (1, 1), (2, 2), (3, 3), (4, 4);
+
+	CREATE TABLE relfilenodes(node oid);
+
+	CREATE TABLE data_s1(i int, j int);
+	CREATE TABLE data_s2(i int, j int);
+}
+
+teardown
+{
+	DROP TABLE repack_test;
+	DROP EXTENSION injection_points;
+
+	DROP TABLE relfilenodes;
+	DROP TABLE data_s1;
+	DROP TABLE data_s2;
+}
+
+session s1
+setup
+{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('repack-concurrently-before-lock', 'wait');
+}
+# Perform the initial load and wait for s2 to do some data changes.
+step wait_before_lock
+{
+	REPACK (CONCURRENTLY) repack_test USING INDEX repack_test_pkey;
+}
+# Check the table from the perspective of s1.
+#
+# Besides the contents, we also check that relfilenode has changed.
+
+# Have each session write the contents into a table and use FULL JOIN to check
+# if the outputs are identical.
+step check1
+{
+	INSERT INTO relfilenodes(node)
+	SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+	SELECT count(DISTINCT node) FROM relfilenodes;
+
+	SELECT i, j FROM repack_test ORDER BY i, j;
+
+	INSERT INTO data_s1(i, j)
+	SELECT i, j FROM repack_test;
+
+	SELECT count(*)
+	FROM data_s1 d1 FULL JOIN data_s2 d2 USING (i, j)
+	WHERE d1.i ISNULL OR d2.i ISNULL;
+}
+teardown
+{
+    SELECT injection_points_detach('repack-concurrently-before-lock');
+}
+
+session s2
+# Change the existing data. UPDATE changes both key and non-key columns. Also
+# update one row twice to test whether the tuple version generated by this
+# session can be found.
+step change_existing
+{
+	UPDATE repack_test SET i=10 where i=1;
+	UPDATE repack_test SET j=20 where i=2;
+	UPDATE repack_test SET i=30 where i=3;
+	UPDATE repack_test SET i=40 where i=30;
+	DELETE FROM repack_test WHERE i=4;
+}
+# Insert new rows and UPDATE / DELETE some of them. Again, update both key and
+# non-key columns.
+step change_new
+{
+	INSERT INTO repack_test(i, j) VALUES (5, 5), (6, 6), (7, 7), (8, 8);
+	UPDATE repack_test SET i=50 where i=5;
+	UPDATE repack_test SET j=60 where i=6;
+	DELETE FROM repack_test WHERE i=7;
+}
+
+# When applying concurrent data changes, we should see the effects of an
+# in-progress subtransaction.
+#
+# XXX Not sure this test is useful now - it was designed for the patch that
+# preserves tuple visibility and which therefore modifies
+# TransactionIdIsCurrentTransactionId().
+step change_subxact1
+{
+	BEGIN;
+	INSERT INTO repack_test(i, j) VALUES (100, 100);
+	SAVEPOINT s1;
+	UPDATE repack_test SET i=101 where i=100;
+	SAVEPOINT s2;
+	UPDATE repack_test SET i=102 where i=101;
+	COMMIT;
+}
+
+# When applying concurrent data changes, we should not see the effects of a
+# rolled back subtransaction.
+#
+# XXX Is this test useful? See above.
+step change_subxact2
+{
+	BEGIN;
+	SAVEPOINT s1;
+	INSERT INTO repack_test(i, j) VALUES (110, 110);
+	ROLLBACK TO SAVEPOINT s1;
+	INSERT INTO repack_test(i, j) VALUES (110, 111);
+	COMMIT;
+}
+
+# Check the table from the perspective of s2.
+step check2
+{
+	INSERT INTO relfilenodes(node)
+	SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+	SELECT i, j FROM repack_test ORDER BY i, j;
+
+	INSERT INTO data_s2(i, j)
+	SELECT i, j FROM repack_test;
+}
+step wakeup_before_lock
+{
+	SELECT injection_points_wakeup('repack-concurrently-before-lock');
+}
+
+# Test if data changes introduced while one session is performing REPACK
+# CONCURRENTLY find their way into the table.
+permutation
+	wait_before_lock
+	change_existing
+	change_new
+	change_subxact1
+	change_subxact2
+	check2
+	wakeup_before_lock
+	check1
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 3a1d1d28282..fe227bd8a30 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1999,17 +1999,17 @@ pg_stat_progress_cluster| SELECT s.pid,
             WHEN 2 THEN 'index scanning heap'::text
             WHEN 3 THEN 'sorting tuples'::text
             WHEN 4 THEN 'writing new heap'::text
-            WHEN 5 THEN 'swapping relation files'::text
-            WHEN 6 THEN 'rebuilding index'::text
-            WHEN 7 THEN 'performing final cleanup'::text
+            WHEN 6 THEN 'swapping relation files'::text
+            WHEN 7 THEN 'rebuilding index'::text
+            WHEN 8 THEN 'performing final cleanup'::text
             ELSE NULL::text
         END AS phase,
     (s.param3)::oid AS cluster_index_relid,
     s.param4 AS heap_tuples_scanned,
     s.param5 AS heap_tuples_written,
-    s.param6 AS heap_blks_total,
-    s.param7 AS heap_blks_scanned,
-    s.param8 AS index_rebuild_count
+    s.param8 AS heap_blks_total,
+    s.param9 AS heap_blks_scanned,
+    s.param10 AS index_rebuild_count
    FROM (pg_stat_get_progress_info('CLUSTER'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
      LEFT JOIN pg_database d ON ((s.datid = d.oid)));
 pg_stat_progress_copy| SELECT s.pid,
@@ -2081,17 +2081,20 @@ pg_stat_progress_repack| SELECT s.pid,
             WHEN 2 THEN 'index scanning heap'::text
             WHEN 3 THEN 'sorting tuples'::text
             WHEN 4 THEN 'writing new heap'::text
-            WHEN 5 THEN 'swapping relation files'::text
-            WHEN 6 THEN 'rebuilding index'::text
-            WHEN 7 THEN 'performing final cleanup'::text
+            WHEN 5 THEN 'catch-up'::text
+            WHEN 6 THEN 'swapping relation files'::text
+            WHEN 7 THEN 'rebuilding index'::text
+            WHEN 8 THEN 'performing final cleanup'::text
             ELSE NULL::text
         END AS phase,
     (s.param3)::oid AS repack_index_relid,
     s.param4 AS heap_tuples_scanned,
-    s.param5 AS heap_tuples_written,
-    s.param6 AS heap_blks_total,
-    s.param7 AS heap_blks_scanned,
-    s.param8 AS index_rebuild_count
+    s.param5 AS heap_tuples_inserted,
+    s.param6 AS heap_tuples_updated,
+    s.param7 AS heap_tuples_deleted,
+    s.param8 AS heap_blks_total,
+    s.param9 AS heap_blks_scanned,
+    s.param10 AS index_rebuild_count
    FROM (pg_stat_get_progress_info('REPACK'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
      LEFT JOIN pg_database d ON ((s.datid = d.oid)));
 pg_stat_progress_vacuum| SELECT s.pid,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 98242e25432..b64ab8dfab4 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -485,6 +485,8 @@ CompressFileHandle
 CompressionLocation
 CompressorState
 ComputeXidHorizonsResult
+ConcurrentChange
+ConcurrentChangeKind
 ConditionVariable
 ConditionVariableMinimallyPadded
 ConditionalStack
@@ -1257,6 +1259,7 @@ IndexElem
 IndexFetchHeapData
 IndexFetchTableData
 IndexInfo
+IndexInsertState
 IndexList
 IndexOnlyScan
 IndexOnlyScanState
@@ -2538,6 +2541,7 @@ ReorderBufferUpdateProgressTxnCB
 ReorderTuple
 RepOriginId
 RepackCommand
+RepackDecodingState
 RepackStmt
 ReparameterizeForeignPathByChild_function
 ReplaceVarsFromTargetList_context
-- 
2.43.0
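
A note on the ConcurrentChange layout declared in commands/cluster.h above: the
comment says that the tuple data follows the structure, and that before use the
caller must make sure the tuple is correctly aligned (the change can be stored
as bytea) and that t_data is fixed. The following is a minimal, self-contained
C sketch of that general pattern, not the patch's code. DemoChange, DemoTuple,
demo_change_serialize and demo_change_restore are hypothetical names invented
for illustration; they do not use HeapTupleData or any PostgreSQL API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins; NOT the PostgreSQL definitions. */
typedef enum
{
	DEMO_INSERT,
	DEMO_UPDATE_OLD,
	DEMO_UPDATE_NEW,
	DEMO_DELETE,
	DEMO_TRUNCATE
} DemoChangeKind;

typedef struct DemoTuple
{
	uint32_t	len;			/* payload length in bytes */
	char	   *data;			/* payload pointer; stale after a byte copy */
} DemoTuple;

typedef struct DemoChange
{
	DemoChangeKind kind;
	DemoTuple	tup;			/* payload bytes follow the structure */
} DemoChange;

#define SIZE_OF_DEMO_CHANGE (offsetof(DemoChange, tup) + sizeof(DemoTuple))

/* Flatten header plus payload into one malloc'd buffer of *size bytes. */
unsigned char *
demo_change_serialize(DemoChangeKind kind, const char *payload,
					  uint32_t len, size_t *size)
{
	DemoChange	hdr;
	unsigned char *buf;

	assert(payload != NULL && size != NULL);

	memset(&hdr, 0, sizeof(hdr));
	hdr.kind = kind;
	hdr.tup.len = len;
	hdr.tup.data = NULL;		/* a pointer has no meaning in flat form */

	*size = SIZE_OF_DEMO_CHANGE + len;
	buf = malloc(*size);
	if (buf == NULL)
		return NULL;
	memcpy(buf, &hdr, SIZE_OF_DEMO_CHANGE);
	memcpy(buf + SIZE_OF_DEMO_CHANGE, payload, len);
	return buf;
}

/*
 * Rebuild a usable change from raw bytes that may sit at an unaligned
 * address: copy into freshly malloc'd (suitably aligned) memory, then
 * point tup.data at the payload that follows the header.
 */
DemoChange *
demo_change_restore(const unsigned char *raw, size_t size)
{
	DemoChange *change;

	if (raw == NULL || size < SIZE_OF_DEMO_CHANGE)
		return NULL;
	change = malloc(size);
	if (change == NULL)
		return NULL;
	memcpy(change, raw, size);
	if (size - SIZE_OF_DEMO_CHANGE != change->tup.len)
	{
		free(change);			/* length field disagrees with buffer size */
		return NULL;
	}
	change->tup.data = (char *) change + SIZE_OF_DEMO_CHANGE;
	return change;
}
```

The point of the copy in demo_change_restore is that the header fields are only
read after the bytes have been moved to aligned memory; reading them in place
from a byte array at an arbitrary offset would not be safe on every platform.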



  [application/octet-stream] v21-0002-Add-REPACK-command.patch (133.3K, 5-v21-0002-Add-REPACK-command.patch)
From 40965dfef0f26a92249cda7a956bd03c9358a026 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=81lvaro=20Herrera?= <[email protected]>
Date: Sat, 26 Jul 2025 19:57:26 +0200
Subject: [PATCH v21 2/6] Add REPACK command

REPACK absorbs the functionality of VACUUM FULL and CLUSTER in a single
command.  Because this functionality is completely different from
regular VACUUM, having it separate from VACUUM makes it easier for users
to understand; as for CLUSTER, the term is heavily overloaded in the
IT world and even in Postgres itself, so it's good that we can avoid it.

This also adds pg_repackdb, a new utility that can invoke the new
commands.  This is heavily based on vacuumdb.  We may still change the
implementation, depending on how Windows likes this one.

Author: Antonin Houska <[email protected]>
Reviewed-by: To fill in
Discussion: https://postgr.es/m/82651.1720540558@antos
Discussion: https://postgr.es/m/[email protected]
---
 doc/src/sgml/monitoring.sgml             | 223 ++++++-
 doc/src/sgml/ref/allfiles.sgml           |   2 +
 doc/src/sgml/ref/cluster.sgml            |  97 +--
 doc/src/sgml/ref/clusterdb.sgml          |   5 +
 doc/src/sgml/ref/pg_repackdb.sgml        | 479 ++++++++++++++
 doc/src/sgml/ref/repack.sgml             | 284 +++++++++
 doc/src/sgml/ref/vacuum.sgml             |  33 +-
 doc/src/sgml/reference.sgml              |   2 +
 src/backend/access/heap/heapam_handler.c |  32 +-
 src/backend/catalog/index.c              |   2 +-
 src/backend/catalog/system_views.sql     |  26 +
 src/backend/commands/cluster.c           | 758 +++++++++++++++--------
 src/backend/commands/vacuum.c            |   3 +-
 src/backend/parser/gram.y                |  88 ++-
 src/backend/tcop/utility.c               |  20 +-
 src/backend/utils/adt/pgstatfuncs.c      |   2 +
 src/bin/psql/tab-complete.in.c           |  33 +-
 src/bin/scripts/Makefile                 |   4 +-
 src/bin/scripts/meson.build              |   2 +
 src/bin/scripts/pg_repackdb.c            | 226 +++++++
 src/bin/scripts/t/103_repackdb.pl        |  24 +
 src/bin/scripts/vacuuming.c              |  60 +-
 src/bin/scripts/vacuuming.h              |  11 +-
 src/include/commands/cluster.h           |   8 +-
 src/include/commands/progress.h          |  61 +-
 src/include/nodes/parsenodes.h           |  20 +-
 src/include/parser/kwlist.h              |   1 +
 src/include/tcop/cmdtaglist.h            |   1 +
 src/include/utils/backend_progress.h     |   1 +
 src/test/regress/expected/cluster.out    | 125 +++-
 src/test/regress/expected/rules.out      |  23 +
 src/test/regress/sql/cluster.sql         |  59 ++
 src/tools/pgindent/typedefs.list         |   3 +
 33 files changed, 2271 insertions(+), 447 deletions(-)
 create mode 100644 doc/src/sgml/ref/pg_repackdb.sgml
 create mode 100644 doc/src/sgml/ref/repack.sgml
 create mode 100644 src/bin/scripts/pg_repackdb.c
 create mode 100644 src/bin/scripts/t/103_repackdb.pl
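
For readers comparing phase numbers across the series: the CONCURRENTLY patch
earlier in this message inserts a catch-up phase at value 5 and shifts the
three later phases up by one. Its rules.out hunk shows pg_stat_progress_cluster
skipping 5 (so it falls through to NULL), while pg_stat_progress_repack reports
'catch-up'. A rough C sketch of that mapping follows, assuming only the phase
values 2 to 8 visible in those hunks; demo_phase_name is a hypothetical helper
written for illustration, not a function in the patch.

```c
#include <stddef.h>

/*
 * Hypothetical helper for illustration only; not part of the patch.
 * Phase values and names are copied from the progress.h and rules.out
 * hunks; phases below 2 are not shown there and are omitted here.
 */
const char *
demo_phase_name(int phase, int repack_view)
{
	switch (phase)
	{
		case 2:
			return "index scanning heap";
		case 3:
			return "sorting tuples";
		case 4:
			return "writing new heap";
		case 5:
			/* Only the REPACK view names the new phase. */
			return repack_view ? "catch-up" : NULL;
		case 6:
			return "swapping relation files";
		case 7:
			return "rebuilding index";
		case 8:
			return "performing final cleanup";
		default:
			return NULL;
	}
}
```

Monitoring queries that match on the numeric phase rather than the text would
need the same shift, which is presumably why both views' CASE expressions are
touched in that hunk.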

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3f4a27a736e..12e103d319d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -405,6 +405,14 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_progress_repack</structname><indexterm><primary>pg_stat_progress_repack</primary></indexterm></entry>
+      <entry>One row for each backend running
+       <command>REPACK</command>, showing current progress.  See
+       <xref linkend="repack-progress-reporting"/>.
+      </entry>
+     </row>
+
      <row>
       <entry><structname>pg_stat_progress_basebackup</structname><indexterm><primary>pg_stat_progress_basebackup</primary></indexterm></entry>
       <entry>One row for each WAL sender process streaming a base backup,
@@ -5506,7 +5514,8 @@ FROM pg_stat_get_backend_idset() AS backendid;
    certain commands during command execution.  Currently, the only commands
    which support progress reporting are <command>ANALYZE</command>,
    <command>CLUSTER</command>,
-   <command>CREATE INDEX</command>, <command>VACUUM</command>,
+   <command>CREATE INDEX</command>, <command>REPACK</command>,
+   <command>VACUUM</command>,
    <command>COPY</command>,
    and <xref linkend="protocol-replication-base-backup"/> (i.e., replication
    command that <xref linkend="app-pgbasebackup"/> issues to take
@@ -5965,6 +5974,218 @@ FROM pg_stat_get_backend_idset() AS backendid;
   </table>
  </sect2>
 
+ <sect2 id="repack-progress-reporting">
+  <title>REPACK Progress Reporting</title>
+
+  <indexterm>
+   <primary>pg_stat_progress_repack</primary>
+  </indexterm>
+
+  <para>
+   Whenever <command>REPACK</command> is running,
+   the <structname>pg_stat_progress_repack</structname> view will contain a
+   row for each backend that is currently running the command.  The tables
+   below describe the information that will be reported and provide
+   information about how to interpret it.
+  </para>
+
+  <table id="pg-stat-progress-repack-view" xreflabel="pg_stat_progress_repack">
+   <title><structname>pg_stat_progress_repack</structname> View</title>
+   <tgroup cols="1">
+    <thead>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       Column Type
+      </para>
+      <para>
+       Description
+      </para></entry>
+     </row>
+    </thead>
+
+    <tbody>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>pid</structfield> <type>integer</type>
+      </para>
+      <para>
+       Process ID of backend.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>datid</structfield> <type>oid</type>
+      </para>
+      <para>
+       OID of the database to which this backend is connected.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>datname</structfield> <type>name</type>
+      </para>
+      <para>
+       Name of the database to which this backend is connected.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>relid</structfield> <type>oid</type>
+      </para>
+      <para>
+       OID of the table being repacked.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>phase</structfield> <type>text</type>
+      </para>
+      <para>
+       Current processing phase. See <xref linkend="repack-phases"/>.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>repack_index_relid</structfield> <type>oid</type>
+      </para>
+      <para>
+       If the table is being scanned using an index, this is the OID of the
+       index being used; otherwise, it is zero.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>heap_tuples_scanned</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of heap tuples scanned.
+       This counter only advances when the phase is
+       <literal>seq scanning heap</literal>,
+       <literal>index scanning heap</literal>
+       or <literal>writing new heap</literal>.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>heap_tuples_written</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of heap tuples written.
+       This counter only advances when the phase is
+       <literal>seq scanning heap</literal>,
+       <literal>index scanning heap</literal>
+       or <literal>writing new heap</literal>.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>heap_blks_total</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Total number of heap blocks in the table.  This number is reported
+       as of the beginning of <literal>seq scanning heap</literal>.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>heap_blks_scanned</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of heap blocks scanned.  This counter only advances when the
+       phase is <literal>seq scanning heap</literal>.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>index_rebuild_count</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of indexes rebuilt.  This counter only advances when the phase
+       is <literal>rebuilding index</literal>.
+      </para></entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <table id="repack-phases">
+   <title>REPACK Phases</title>
+   <tgroup cols="2">
+    <colspec colname="col1" colwidth="1*"/>
+    <colspec colname="col2" colwidth="2*"/>
+    <thead>
+     <row>
+      <entry>Phase</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+   <tbody>
+    <row>
+     <entry><literal>initializing</literal></entry>
+     <entry>
+       The command is preparing to begin scanning the heap.  This phase is
+       expected to be very brief.
+     </entry>
+    </row>
+    <row>
+     <entry><literal>seq scanning heap</literal></entry>
+     <entry>
+       The command is currently scanning the table using a sequential scan.
+     </entry>
+    </row>
+    <row>
+     <entry><literal>index scanning heap</literal></entry>
+     <entry>
+       <command>REPACK</command> is currently scanning the table using an index scan.
+     </entry>
+    </row>
+    <row>
+     <entry><literal>sorting tuples</literal></entry>
+     <entry>
+       <command>REPACK</command> is currently sorting tuples.
+     </entry>
+    </row>
+    <row>
+     <entry><literal>writing new heap</literal></entry>
+     <entry>
+       <command>REPACK</command> is currently writing the new heap.
+     </entry>
+    </row>
+    <row>
+     <entry><literal>swapping relation files</literal></entry>
+     <entry>
+       The command is currently swapping newly-built files into place.
+     </entry>
+    </row>
+    <row>
+     <entry><literal>rebuilding index</literal></entry>
+     <entry>
+       The command is currently rebuilding an index.
+     </entry>
+    </row>
+    <row>
+     <entry><literal>performing final cleanup</literal></entry>
+     <entry>
+       The command is performing final cleanup.  When this phase is
+       completed, <command>REPACK</command> will end.
+     </entry>
+    </row>
+   </tbody>
+   </tgroup>
+  </table>
+ </sect2>
+
  <sect2 id="copy-progress-reporting">
   <title>COPY Progress Reporting</title>
 
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index f5be638867a..eabf92e3536 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -167,6 +167,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY refreshMaterializedView SYSTEM "refresh_materialized_view.sgml">
 <!ENTITY reindex            SYSTEM "reindex.sgml">
 <!ENTITY releaseSavepoint   SYSTEM "release_savepoint.sgml">
+<!ENTITY repack             SYSTEM "repack.sgml">
 <!ENTITY reset              SYSTEM "reset.sgml">
 <!ENTITY revoke             SYSTEM "revoke.sgml">
 <!ENTITY rollback           SYSTEM "rollback.sgml">
@@ -212,6 +213,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY pgIsready          SYSTEM "pg_isready.sgml">
 <!ENTITY pgReceivewal       SYSTEM "pg_receivewal.sgml">
 <!ENTITY pgRecvlogical      SYSTEM "pg_recvlogical.sgml">
+<!ENTITY pgRepackdb         SYSTEM "pg_repackdb.sgml">
 <!ENTITY pgResetwal         SYSTEM "pg_resetwal.sgml">
 <!ENTITY pgRestore          SYSTEM "pg_restore.sgml">
 <!ENTITY pgRewind           SYSTEM "pg_rewind.sgml">
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index 8811f169ea0..cfcfb65e349 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -33,51 +33,13 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
   <title>Description</title>
 
   <para>
-   <command>CLUSTER</command> instructs <productname>PostgreSQL</productname>
-   to cluster the table specified
-   by <replaceable class="parameter">table_name</replaceable>
-   based on the index specified by
-   <replaceable class="parameter">index_name</replaceable>. The index must
-   already have been defined on
-   <replaceable class="parameter">table_name</replaceable>.
+   The <command>CLUSTER</command> command is equivalent to
+   <xref linkend="sql-repack"/> with a <literal>USING INDEX</literal>
+   clause.  See there for more details.
   </para>
 
-  <para>
-   When a table is clustered, it is physically reordered
-   based on the index information. Clustering is a one-time operation:
-   when the table is subsequently updated, the changes are
-   not clustered.  That is, no attempt is made to store new or
-   updated rows according to their index order.  (If one wishes, one can
-   periodically recluster by issuing the command again.  Also, setting
-   the table's <literal>fillfactor</literal> storage parameter to less than
-   100% can aid in preserving cluster ordering during updates, since updated
-   rows are kept on the same page if enough space is available there.)
-  </para>
-
-  <para>
-   When a table is clustered, <productname>PostgreSQL</productname>
-   remembers which index it was clustered by.  The form
-   <command>CLUSTER <replaceable class="parameter">table_name</replaceable></command>
-   reclusters the table using the same index as before.  You can also
-   use the <literal>CLUSTER</literal> or <literal>SET WITHOUT CLUSTER</literal>
-   forms of <link linkend="sql-altertable"><command>ALTER TABLE</command></link> to set the index to be used for
-   future cluster operations, or to clear any previous setting.
-  </para>
+<!-- Do we need to describe exactly which options map to what?  They seem obvious to me. -->
 
-  <para>
-   <command>CLUSTER</command> without a
-   <replaceable class="parameter">table_name</replaceable> reclusters all the
-   previously-clustered tables in the current database that the calling user
-   has privileges for.  This form of <command>CLUSTER</command> cannot be
-   executed inside a transaction block.
-  </para>
-
-  <para>
-   When a table is being clustered, an <literal>ACCESS
-   EXCLUSIVE</literal> lock is acquired on it. This prevents any other
-   database operations (both reads and writes) from operating on the
-   table until the <command>CLUSTER</command> is finished.
-  </para>
  </refsect1>
 
  <refsect1>
@@ -136,63 +98,12 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
     on the table.
    </para>
 
-   <para>
-    In cases where you are accessing single rows randomly
-    within a table, the actual order of the data in the
-    table is unimportant. However, if you tend to access some
-    data more than others, and there is an index that groups
-    them together, you will benefit from using <command>CLUSTER</command>.
-    If you are requesting a range of indexed values from a table, or a
-    single indexed value that has multiple rows that match,
-    <command>CLUSTER</command> will help because once the index identifies the
-    table page for the first row that matches, all other rows
-    that match are probably already on the same table page,
-    and so you save disk accesses and speed up the query.
-   </para>
-
-   <para>
-    <command>CLUSTER</command> can re-sort the table using either an index scan
-    on the specified index, or (if the index is a b-tree) a sequential
-    scan followed by sorting.  It will attempt to choose the method that
-    will be faster, based on planner cost parameters and available statistical
-    information.
-   </para>
-
    <para>
     While <command>CLUSTER</command> is running, the <xref
     linkend="guc-search-path"/> is temporarily changed to <literal>pg_catalog,
     pg_temp</literal>.
    </para>
 
-   <para>
-    When an index scan is used, a temporary copy of the table is created that
-    contains the table data in the index order.  Temporary copies of each
-    index on the table are created as well.  Therefore, you need free space on
-    disk at least equal to the sum of the table size and the index sizes.
-   </para>
-
-   <para>
-    When a sequential scan and sort is used, a temporary sort file is
-    also created, so that the peak temporary space requirement is as much
-    as double the table size, plus the index sizes.  This method is often
-    faster than the index scan method, but if the disk space requirement is
-    intolerable, you can disable this choice by temporarily setting <xref
-    linkend="guc-enable-sort"/> to <literal>off</literal>.
-   </para>
-
-   <para>
-    It is advisable to set <xref linkend="guc-maintenance-work-mem"/> to
-    a reasonably large value (but not more than the amount of RAM you can
-    dedicate to the <command>CLUSTER</command> operation) before clustering.
-   </para>
-
-   <para>
-    Because the planner records statistics about the ordering of
-    tables, it is advisable to run <link linkend="sql-analyze"><command>ANALYZE</command></link>
-    on the newly clustered table.
-    Otherwise, the planner might make poor choices of query plans.
-   </para>
-
    <para>
     Because <command>CLUSTER</command> remembers which indexes are clustered,
     one can cluster the tables one wants clustered manually the first time,
diff --git a/doc/src/sgml/ref/clusterdb.sgml b/doc/src/sgml/ref/clusterdb.sgml
index 0d2051bf6f1..546c1289c31 100644
--- a/doc/src/sgml/ref/clusterdb.sgml
+++ b/doc/src/sgml/ref/clusterdb.sgml
@@ -64,6 +64,11 @@ PostgreSQL documentation
    this utility and via other methods for accessing the server.
   </para>
 
+  <para>
+   <application>clusterdb</application> has been superseded by
+   <application>pg_repackdb</application>.
+  </para>
+
  </refsect1>
 
 
diff --git a/doc/src/sgml/ref/pg_repackdb.sgml b/doc/src/sgml/ref/pg_repackdb.sgml
new file mode 100644
index 00000000000..32570d071cb
--- /dev/null
+++ b/doc/src/sgml/ref/pg_repackdb.sgml
@@ -0,0 +1,479 @@
+<!--
+doc/src/sgml/ref/pg_repackdb.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgrepackdb">
+ <indexterm zone="app-pgrepackdb">
+  <primary>pg_repackdb</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle><application>pg_repackdb</application></refentrytitle>
+  <manvolnum>1</manvolnum>
+  <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>pg_repackdb</refname>
+  <refpurpose>repack and analyze a <productname>PostgreSQL</productname>
+  database</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+  <cmdsynopsis>
+   <command>pg_repackdb</command>
+   <arg rep="repeat"><replaceable>connection-option</replaceable></arg>
+   <arg rep="repeat"><replaceable>option</replaceable></arg>
+
+   <arg choice="plain" rep="repeat">
+    <arg choice="opt">
+     <group choice="plain">
+      <arg choice="plain"><option>-t</option></arg>
+      <arg choice="plain"><option>--table</option></arg>
+     </group>
+     <replaceable>table</replaceable>
+     <arg choice="opt">( <replaceable class="parameter">column</replaceable> [,...] )</arg>
+    </arg>
+   </arg>
+
+   <arg choice="opt">
+    <group choice="plain">
+     <arg choice="plain"><replaceable>dbname</replaceable></arg>
+     <arg choice="plain"><option>-a</option></arg>
+     <arg choice="plain"><option>--all</option></arg>
+    </group>
+   </arg>
+  </cmdsynopsis>
+
+  <cmdsynopsis>
+   <command>pg_repackdb</command>
+   <arg rep="repeat"><replaceable>connection-option</replaceable></arg>
+   <arg rep="repeat"><replaceable>option</replaceable></arg>
+
+   <arg choice="plain" rep="repeat">
+    <arg choice="opt">
+     <group choice="plain">
+      <arg choice="plain"><option>-n</option></arg>
+      <arg choice="plain"><option>--schema</option></arg>
+     </group>
+     <replaceable>schema</replaceable>
+    </arg>
+   </arg>
+
+   <arg choice="opt">
+    <group choice="plain">
+     <arg choice="plain"><replaceable>dbname</replaceable></arg>
+     <arg choice="plain"><option>-a</option></arg>
+     <arg choice="plain"><option>--all</option></arg>
+    </group>
+   </arg>
+  </cmdsynopsis>
+
+  <cmdsynopsis>
+   <command>pg_repackdb</command>
+   <arg rep="repeat"><replaceable>connection-option</replaceable></arg>
+   <arg rep="repeat"><replaceable>option</replaceable></arg>
+
+   <arg choice="plain" rep="repeat">
+    <arg choice="opt">
+     <group choice="plain">
+      <arg choice="plain"><option>-N</option></arg>
+      <arg choice="plain"><option>--exclude-schema</option></arg>
+     </group>
+     <replaceable>schema</replaceable>
+    </arg>
+   </arg>
+
+   <arg choice="opt">
+    <group choice="plain">
+     <arg choice="plain"><replaceable>dbname</replaceable></arg>
+     <arg choice="plain"><option>-a</option></arg>
+     <arg choice="plain"><option>--all</option></arg>
+    </group>
+   </arg>
+  </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+   <application>pg_repackdb</application> is a utility for repacking a
+   <productname>PostgreSQL</productname> database.
+   <application>pg_repackdb</application> can also generate internal
+   statistics used by the <productname>PostgreSQL</productname> query
+   optimizer (see <option>--analyze</option>).
+  </para>
+
+  <para>
+   <application>pg_repackdb</application> is a wrapper around the SQL
+   command <link linkend="sql-repack"><command>REPACK</command></link>.  There
+   is no effective difference between repacking and analyzing databases via
+   this utility and via other methods for accessing the server.
+  </para>
+
+ </refsect1>
+
+
+ <refsect1>
+  <title>Options</title>
+
+   <para>
+    <application>pg_repackdb</application> accepts the following command-line arguments:
+    <variablelist>
+     <varlistentry>
+      <term><option>-a</option></term>
+      <term><option>--all</option></term>
+      <listitem>
+       <para>
+        Repack all databases.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
+      <term><option><optional>-d</optional> <replaceable class="parameter">dbname</replaceable></option></term>
+      <term><option><optional>--dbname=</optional><replaceable class="parameter">dbname</replaceable></option></term>
+      <listitem>
+       <para>
+        Specifies the name of the database to be repacked or analyzed,
+        when <option>-a</option>/<option>--all</option> is not used.  If this
+        is not specified, the database name is read from the environment
+        variable <envar>PGDATABASE</envar>.  If that is not set, the user name
+        specified for the connection is used.
+        The <replaceable>dbname</replaceable> can be
+        a <link linkend="libpq-connstring">connection string</link>.  If so,
+        connection string parameters will override any conflicting command
+        line options.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
+      <term><option>-e</option></term>
+      <term><option>--echo</option></term>
+      <listitem>
+       <para>
+        Echo the commands that <application>pg_repackdb</application>
+        generates and sends to the server.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
+      <term><option>-j <replaceable class="parameter">njobs</replaceable></option></term>
+      <term><option>--jobs=<replaceable class="parameter">njobs</replaceable></option></term>
+      <listitem>
+       <para>
+        Execute the repack or analyze commands in parallel by running
+        <replaceable class="parameter">njobs</replaceable>
+        commands simultaneously.  This option may reduce the processing time
+        but it also increases the load on the database server.
+       </para>
+       <para>
+        <application>pg_repackdb</application> will open
+        <replaceable class="parameter">njobs</replaceable> connections to the
+        database, so make sure your <xref linkend="guc-max-connections"/>
+        setting is high enough to accommodate all connections.
+       </para>
+       <para>
+        Note that using this mode might cause deadlock failures if certain
+        system catalogs are processed in parallel.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
+      <term><option>-n <replaceable class="parameter">schema</replaceable></option></term>
+      <term><option>--schema=<replaceable class="parameter">schema</replaceable></option></term>
+      <listitem>
+       <para>
+        Repack or analyze all tables in
+        <replaceable class="parameter">schema</replaceable> only.  Multiple
+        schemas can be repacked by writing multiple <option>-n</option>
+        switches.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
+      <term><option>-N <replaceable class="parameter">schema</replaceable></option></term>
+      <term><option>--exclude-schema=<replaceable class="parameter">schema</replaceable></option></term>
+      <listitem>
+       <para>
+        Do not repack or analyze any tables in
+        <replaceable class="parameter">schema</replaceable>.  Multiple schemas
+        can be excluded by writing multiple <option>-N</option> switches.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
+      <term><option>-q</option></term>
+      <term><option>--quiet</option></term>
+      <listitem>
+       <para>
+        Do not display progress messages.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
+      <term><option>-t <replaceable class="parameter">table</replaceable> [ (<replaceable class="parameter">column</replaceable> [,...]) ]</option></term>
+      <term><option>--table=<replaceable class="parameter">table</replaceable> [ (<replaceable class="parameter">column</replaceable> [,...]) ]</option></term>
+      <listitem>
+       <para>
+        Repack or analyze <replaceable class="parameter">table</replaceable>
+        only.  Column names can be specified only in conjunction with
+        the <option>--analyze</option> option.  Multiple tables can be
+        repacked by writing multiple
+        <option>-t</option> switches.
+       </para>
+       <tip>
+        <para>
+         If you specify columns, you probably have to escape the parentheses
+         from the shell.  (See examples below.)
+        </para>
+       </tip>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
+      <term><option>-v</option></term>
+      <term><option>--verbose</option></term>
+      <listitem>
+       <para>
+        Print detailed information during processing.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
+       <term><option>-V</option></term>
+       <term><option>--version</option></term>
+       <listitem>
+       <para>
+       Print the <application>pg_repackdb</application> version and exit.
+       </para>
+       </listitem>
+     </varlistentry>
+
+     <varlistentry>
+      <term><option>-z</option></term>
+      <term><option>--analyze</option></term>
+      <listitem>
+       <para>
+        Also calculate statistics for use by the optimizer.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
+       <term><option>-?</option></term>
+       <term><option>--help</option></term>
+       <listitem>
+       <para>
+       Show help about <application>pg_repackdb</application> command line
+       arguments, and exit.
+       </para>
+       </listitem>
+     </varlistentry>
+
+    </variablelist>
+   </para>
+
+   <para>
+    <application>pg_repackdb</application> also accepts
+    the following command-line arguments for connection parameters:
+    <variablelist>
+     <varlistentry>
+      <term><option>-h <replaceable class="parameter">host</replaceable></option></term>
+      <term><option>--host=<replaceable class="parameter">host</replaceable></option></term>
+      <listitem>
+       <para>
+        Specifies the host name of the machine on which the server
+        is running.  If the value begins with a slash, it is used
+        as the directory for the Unix domain socket.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
+      <term><option>-p <replaceable class="parameter">port</replaceable></option></term>
+      <term><option>--port=<replaceable class="parameter">port</replaceable></option></term>
+      <listitem>
+       <para>
+        Specifies the TCP port or local Unix domain socket file
+        extension on which the server
+        is listening for connections.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
+      <term><option>-U <replaceable class="parameter">username</replaceable></option></term>
+      <term><option>--username=<replaceable class="parameter">username</replaceable></option></term>
+      <listitem>
+       <para>
+        User name to connect as.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
+      <term><option>-w</option></term>
+      <term><option>--no-password</option></term>
+      <listitem>
+       <para>
+        Never issue a password prompt.  If the server requires
+        password authentication and a password is not available by
+        other means such as a <filename>.pgpass</filename> file, the
+        connection attempt will fail.  This option can be useful in
+        batch jobs and scripts where no user is present to enter a
+        password.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
+      <term><option>-W</option></term>
+      <term><option>--password</option></term>
+      <listitem>
+       <para>
+        Force <application>pg_repackdb</application> to prompt for a
+        password before connecting to a database.
+       </para>
+
+       <para>
+        This option is never essential, since
+        <application>pg_repackdb</application> will automatically prompt
+        for a password if the server demands password authentication.
+        However, <application>pg_repackdb</application> will waste a
+        connection attempt finding out that the server wants a password.
+        In some cases it is worth typing <option>-W</option> to avoid the extra
+        connection attempt.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry>
+      <term><option>--maintenance-db=<replaceable class="parameter">dbname</replaceable></option></term>
+      <listitem>
+       <para>
+        When <option>-a</option>/<option>--all</option> is used, connect
+        to this database to gather the list of databases to repack.
+        If not specified, the <literal>postgres</literal> database will be used,
+        or if that does not exist, <literal>template1</literal> will be used.
+        This can be a <link linkend="libpq-connstring">connection
+        string</link>.  If so, connection string parameters will override any
+        conflicting command line options.  Also, connection string parameters
+        other than the database name itself will be re-used when connecting
+        to other databases.
+       </para>
+      </listitem>
+     </varlistentry>
+    </variablelist>
+   </para>
+ </refsect1>
+
+
+ <refsect1>
+  <title>Environment</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><envar>PGDATABASE</envar></term>
+    <term><envar>PGHOST</envar></term>
+    <term><envar>PGPORT</envar></term>
+    <term><envar>PGUSER</envar></term>
+
+    <listitem>
+     <para>
+      Default connection parameters
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><envar>PG_COLOR</envar></term>
+    <listitem>
+     <para>
+      Specifies whether to use color in diagnostic messages. Possible values
+      are <literal>always</literal>, <literal>auto</literal> and
+      <literal>never</literal>.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+
+  <para>
+   This utility, like most other <productname>PostgreSQL</productname> utilities,
+   also uses the environment variables supported by <application>libpq</application>
+   (see <xref linkend="libpq-envars"/>).
+  </para>
+
+ </refsect1>
+
+
+ <refsect1>
+  <title>Diagnostics</title>
+
+  <para>
+   In case of difficulty, see
+   <xref linkend="sql-repack"/> and <xref linkend="app-psql"/> for
+   discussions of potential problems and error messages.
+   The database server must be running at the
+   targeted host.  Also, any default connection settings and environment
+   variables used by the <application>libpq</application> front-end
+   library will apply.
+  </para>
+
+ </refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+   <para>
+    To repack the database <literal>test</literal>:
+<screen>
+<prompt>$ </prompt><userinput>pg_repackdb test</userinput>
+</screen>
+   </para>
+
+   <para>
+    To repack and analyze for the optimizer a database named
+    <literal>bigdb</literal>:
+<screen>
+<prompt>$ </prompt><userinput>pg_repackdb --analyze bigdb</userinput>
+</screen>
+   </para>
+
+   <para>
+    To repack a single table
+    <literal>foo</literal> in a database named
+    <literal>xyzzy</literal>, and analyze a single column
+    <literal>bar</literal> of the table for the optimizer:
+<screen>
+<prompt>$ </prompt><userinput>pg_repackdb --analyze --verbose --table='foo(bar)' xyzzy</userinput>
+</screen></para>
+
+   <para>
+    To repack all tables in the <literal>foo</literal> and <literal>bar</literal> schemas
+    in a database named <literal>xyzzy</literal>:
+<screen>
+<prompt>$ </prompt><userinput>pg_repackdb --schema='foo' --schema='bar' xyzzy</userinput>
+</screen></para>
+
+
+ </refsect1>
+
+ <refsect1>
+  <title>See Also</title>
+
+  <simplelist type="inline">
+   <member><xref linkend="sql-repack"/></member>
+  </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
new file mode 100644
index 00000000000..fd9d89f8aaa
--- /dev/null
+++ b/doc/src/sgml/ref/repack.sgml
@@ -0,0 +1,284 @@
+<!--
+doc/src/sgml/ref/repack.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-repack">
+ <indexterm zone="sql-repack">
+  <primary>REPACK</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle>REPACK</refentrytitle>
+  <manvolnum>7</manvolnum>
+  <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>REPACK</refname>
+  <refpurpose>rewrite a table to reclaim disk space</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <replaceable class="parameter">table_name</replaceable> [ USING INDEX [ <replaceable class="parameter">index_name</replaceable> ] ] ]
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be one of:</phrase>
+
+    VERBOSE [ <replaceable class="parameter">boolean</replaceable> ]
+    ANALYSE | ANALYZE
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+   <command>REPACK</command> reclaims storage occupied by dead
+   tuples. Unlike <command>VACUUM</command>, it does so by rewriting the
+   entire contents of the table specified
+   by <replaceable class="parameter">table_name</replaceable> into a new disk
+   file with no extra space (except for the space guaranteed by
+   the <literal>fillfactor</literal> storage parameter), allowing unused space
+   to be returned to the operating system.
+  </para>
+
+  <para>
+   Without
+   a <replaceable class="parameter">table_name</replaceable>, <command>REPACK</command>
+   processes every table and materialized view in the current database that
+   the current user has the <literal>MAINTAIN</literal> privilege on. This
+   form of <command>REPACK</command> cannot be executed inside a transaction
+   block.
+  </para>
+
+  <para>
+   If a <literal>USING INDEX</literal> clause is specified, the rows are
+   physically reordered based on information from an index.  Please see the
+   notes on clustering below.
+  </para>
+
+  <para>
+   When a table is being repacked, an <literal>ACCESS EXCLUSIVE</literal> lock
+   is acquired on it. This prevents any other database operations (both reads
+   and writes) from operating on the table until the <command>REPACK</command>
+   is finished.
+  </para>
+
+  <refsect2 id="sql-repack-notes-on-clustering" xreflabel="Notes on Clustering">
+   <title>Notes on Clustering</title>
+
+   <para>
+    If the <literal>USING INDEX</literal> clause is specified, the rows in
+    the table are physically reordered following an index: if an index name
+    is specified in the command, then that index is used; if no index name
+    is specified, then the index that has been configured as the index to
+    cluster on is used.  If no index has been configured in this way, an
+    error is thrown.  An index named in the <literal>USING INDEX</literal>
+    clause, like one given to the <command>CLUSTER</command> command, is
+    recorded as the index to cluster on.  The index can also be set
+    manually using <command>ALTER TABLE ... CLUSTER ON</command>, and reset
+    with <command>ALTER TABLE ... SET WITHOUT CLUSTER</command>.
+   </para>
+
+   <para>
+    If no table name is specified in <command>REPACK USING INDEX</command>,
+    all tables which have a clustering index defined and which the calling
+    user has privileges for are processed.
+   </para>
+
+   <para>
+    Clustering is a one-time operation: when the table is
+    subsequently updated, the changes are not clustered.  That is, no attempt
+    is made to store new or updated rows according to their index order.  (If
+    one wishes, one can periodically recluster by issuing the command again.
+    Also, setting the table's <literal>fillfactor</literal> storage parameter
+    to less than 100% can aid in preserving cluster ordering during updates,
+    since updated rows are kept on the same page if enough space is available
+    there.)
+   </para>
+
+   <para>
+    In cases where you are accessing single rows randomly within a table, the
+    actual order of the data in the table is unimportant. However, if you tend
+    to access some data more than others, and there is an index that groups
+    them together, you will benefit from using clustering.  If
+    you are requesting a range of indexed values from a table, or a single
+    indexed value that has multiple rows that match,
+    <command>REPACK</command> will help because once the index identifies the
+    table page for the first row that matches, all other rows that match are
+    probably already on the same table page, and so you save disk accesses and
+    speed up the query.
+   </para>
+
+   <para>
+    <command>REPACK</command> can re-sort the table using either an index scan
+    on the specified index (if the index is a b-tree), or a sequential scan
+    followed by sorting.  It will attempt to choose the method that will be
+    faster, based on planner cost parameters and available statistical
+    information.
+   </para>
+
+   <para>
+    Because the planner records statistics about the ordering of tables, it is
+    advisable to
+    run <link linkend="sql-analyze"><command>ANALYZE</command></link> on the
+    newly repacked table.  Otherwise, the planner might make poor choices of
+    query plans.
+   </para>
+  </refsect2>
+
+  <refsect2 id="sql-repack-notes-on-resources" xreflabel="Notes on Resources">
+   <title>Notes on Resources</title>
+
+   <para>
+    When an index scan or a sequential scan without sort is used, a temporary
+    copy of the table is created (in index order, if an index scan is used).
+    Temporary copies of each index on the table are created as well.
+    Therefore, you need free space on disk at least equal to the sum of the
+    table size and the index sizes.
+   </para>
+
+   <para>
+    When a sequential scan and sort is used, a temporary sort file is also
+    created, so that the peak temporary space requirement is as much as double
+    the table size, plus the index sizes.  This method is often faster than
+    the index scan method, but if the disk space requirement is intolerable,
+    you can disable this choice by temporarily setting
+    <xref linkend="guc-enable-sort"/> to <literal>off</literal>.
+   </para>
+
+   <para>
+    It is advisable to set <xref linkend="guc-maintenance-work-mem"/> to a
+    reasonably large value (but not more than the amount of RAM you can
+    dedicate to the <command>REPACK</command> operation) before repacking.
+   </para>
+  </refsect2>
+
+ </refsect1>
+
+ <refsect1>
+  <title>Parameters</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><replaceable class="parameter">table_name</replaceable></term>
+    <listitem>
+     <para>
+      The name (possibly schema-qualified) of a table.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><replaceable class="parameter">index_name</replaceable></term>
+    <listitem>
+     <para>
+      The name of an index.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>VERBOSE</literal></term>
+    <listitem>
+     <para>
+      Prints a progress report at <literal>INFO</literal> level as each table
+      is repacked.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>ANALYZE</literal></term>
+    <term><literal>ANALYSE</literal></term>
+    <listitem>
+     <para>
+      Runs <xref linkend="sql-analyze"/> on the table after repacking.  This is
+      currently only supported when a single (non-partitioned) table is specified.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><replaceable class="parameter">boolean</replaceable></term>
+    <listitem>
+     <para>
+      Specifies whether the selected option should be turned on or off.
+      You can write <literal>TRUE</literal>, <literal>ON</literal>, or
+      <literal>1</literal> to enable the option, and <literal>FALSE</literal>,
+      <literal>OFF</literal>, or <literal>0</literal> to disable it.  The
+      <replaceable class="parameter">boolean</replaceable> value can also
+      be omitted, in which case <literal>TRUE</literal> is assumed.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Notes</title>
+
+   <para>
+    To repack a table, one must have the <literal>MAINTAIN</literal> privilege
+    on the table.
+   </para>
+
+   <para>
+    While <command>REPACK</command> is running, the <xref
+    linkend="guc-search-path"/> is temporarily changed to <literal>pg_catalog,
+    pg_temp</literal>.
+   </para>
+
+   <para>
+    Each backend running <command>REPACK</command> will report its progress
+    in the <structname>pg_stat_progress_repack</structname> view. See
+    <xref linkend="repack-progress-reporting"/> for details.
+   </para>
+
+   <para>
+    Repacking a partitioned table repacks each of its partitions. If an index
+    is specified, each partition is repacked using the partition of that
+    index. <command>REPACK</command> on a partitioned table cannot be executed
+    inside a transaction block.
+   </para>
+
+ </refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+  <para>
+   Repack the table <literal>employees</literal>:
+<programlisting>
+REPACK employees;
+</programlisting>
+  </para>
+
+  <para>
+   Repack the table <literal>employees</literal> on the basis of its
+   index <literal>employees_ind</literal> (since an index is used here, this
+   is effectively clustering):
+<programlisting>
+REPACK employees USING INDEX employees_ind;
+</programlisting>
+  </para>
+
+  <para>
+   Repack all tables in the database on which you have
+   the <literal>MAINTAIN</literal> privilege:
+<programlisting>
+REPACK;
+</programlisting></para>
+ </refsect1>
+
+ <refsect1>
+  <title>Compatibility</title>
+
+  <para>
+   There is no <command>REPACK</command> statement in the SQL standard.
+  </para>
+
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index bd5dcaf86a5..062b658cfcd 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -25,7 +25,6 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
 
 <phrase>where <replaceable class="parameter">option</replaceable> can be one of:</phrase>
 
-    FULL [ <replaceable class="parameter">boolean</replaceable> ]
     FREEZE [ <replaceable class="parameter">boolean</replaceable> ]
     VERBOSE [ <replaceable class="parameter">boolean</replaceable> ]
     ANALYZE [ <replaceable class="parameter">boolean</replaceable> ]
@@ -39,6 +38,7 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
     SKIP_DATABASE_STATS [ <replaceable class="parameter">boolean</replaceable> ]
     ONLY_DATABASE_STATS [ <replaceable class="parameter">boolean</replaceable> ]
     BUFFER_USAGE_LIMIT <replaceable class="parameter">size</replaceable>
+    FULL [ <replaceable class="parameter">boolean</replaceable> ]
 
 <phrase>and <replaceable class="parameter">table_and_columns</replaceable> is:</phrase>
 
@@ -95,20 +95,6 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
   <title>Parameters</title>
 
   <variablelist>
-   <varlistentry>
-    <term><literal>FULL</literal></term>
-    <listitem>
-     <para>
-      Selects <quote>full</quote> vacuum, which can reclaim more
-      space, but takes much longer and exclusively locks the table.
-      This method also requires extra disk space, since it writes a
-      new copy of the table and doesn't release the old copy until
-      the operation is complete.  Usually this should only be used when a
-      significant amount of space needs to be reclaimed from within the table.
-     </para>
-    </listitem>
-   </varlistentry>
-
    <varlistentry>
     <term><literal>FREEZE</literal></term>
     <listitem>
@@ -362,6 +348,23 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term><literal>FULL</literal></term>
+    <listitem>
+     <para>
+      This option, which is deprecated, makes <command>VACUUM</command>
+      behave like <command>REPACK</command> without a
+      <literal>USING INDEX</literal> clause.
+      This method of compacting the table takes much longer than plain
+      <command>VACUUM</command> and exclusively locks the table.
+      This method also requires extra disk space, since it writes a
+      new copy of the table and doesn't release the old copy until
+      the operation is complete.  Usually this should only be used when a
+      significant amount of space needs to be reclaimed from within the table.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term><replaceable class="parameter">boolean</replaceable></term>
     <listitem>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ff85ace83fc..2ee08e21f41 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -195,6 +195,7 @@
    &refreshMaterializedView;
    &reindex;
    &releaseSavepoint;
+   &repack;
    &reset;
    &revoke;
    &rollback;
@@ -257,6 +258,7 @@
    &pgIsready;
    &pgReceivewal;
    &pgRecvlogical;
+   &pgRepackdb;
    &pgRestore;
    &pgVerifyBackup;
    &psqlRef;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index bcbac844bb6..79f9de5d760 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -741,13 +741,13 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 	if (OldIndex != NULL && !use_sort)
 	{
 		const int	ci_index[] = {
-			PROGRESS_CLUSTER_PHASE,
-			PROGRESS_CLUSTER_INDEX_RELID
+			PROGRESS_REPACK_PHASE,
+			PROGRESS_REPACK_INDEX_RELID
 		};
 		int64		ci_val[2];
 
 		/* Set phase and OIDOldIndex to columns */
-		ci_val[0] = PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP;
+		ci_val[0] = PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP;
 		ci_val[1] = RelationGetRelid(OldIndex);
 		pgstat_progress_update_multi_param(2, ci_index, ci_val);
 
@@ -759,15 +759,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 	else
 	{
 		/* In scan-and-sort mode and also VACUUM FULL, set phase */
-		pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
-									 PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP);
+		pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+									 PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP);
 
 		tableScan = table_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
 		heapScan = (HeapScanDesc) tableScan;
 		indexScan = NULL;
 
 		/* Set total heap blocks */
-		pgstat_progress_update_param(PROGRESS_CLUSTER_TOTAL_HEAP_BLKS,
+		pgstat_progress_update_param(PROGRESS_REPACK_TOTAL_HEAP_BLKS,
 									 heapScan->rs_nblocks);
 	}
 
@@ -809,7 +809,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 				 * is manually updated to the correct value when the table
 				 * scan finishes.
 				 */
-				pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_BLKS_SCANNED,
+				pgstat_progress_update_param(PROGRESS_REPACK_HEAP_BLKS_SCANNED,
 											 heapScan->rs_nblocks);
 				break;
 			}
@@ -825,7 +825,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 			 */
 			if (prev_cblock != heapScan->rs_cblock)
 			{
-				pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_BLKS_SCANNED,
+				pgstat_progress_update_param(PROGRESS_REPACK_HEAP_BLKS_SCANNED,
 											 (heapScan->rs_cblock +
 											  heapScan->rs_nblocks -
 											  heapScan->rs_startblock
@@ -912,14 +912,14 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 			 * In scan-and-sort mode, report increase in number of tuples
 			 * scanned
 			 */
-			pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
+			pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
 										 *num_tuples);
 		}
 		else
 		{
 			const int	ct_index[] = {
-				PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
-				PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN
+				PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
+				PROGRESS_REPACK_HEAP_TUPLES_WRITTEN
 			};
 			int64		ct_val[2];
 
@@ -952,14 +952,14 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 		double		n_tuples = 0;
 
 		/* Report that we are now sorting tuples */
-		pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
-									 PROGRESS_CLUSTER_PHASE_SORT_TUPLES);
+		pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+									 PROGRESS_REPACK_PHASE_SORT_TUPLES);
 
 		tuplesort_performsort(tuplesort);
 
 		/* Report that we are now writing new heap */
-		pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
-									 PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP);
+		pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+									 PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP);
 
 		for (;;)
 		{
@@ -977,7 +977,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 									 values, isnull,
 									 rwstate);
 			/* Report n_tuples */
-			pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN,
+			pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_WRITTEN,
 										 n_tuples);
 		}
 
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index c4029a4f3d3..3063abff9a5 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -4079,7 +4079,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 		Assert(!ReindexIsProcessingIndex(indexOid));
 
 		/* Set index rebuild count */
-		pgstat_progress_update_param(PROGRESS_CLUSTER_INDEX_REBUILD_COUNT,
+		pgstat_progress_update_param(PROGRESS_REPACK_INDEX_REBUILD_COUNT,
 									 i);
 		i++;
 	}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 1b3c5a55882..b2b7b10c2be 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1279,6 +1279,32 @@ CREATE VIEW pg_stat_progress_cluster AS
     FROM pg_stat_get_progress_info('CLUSTER') AS S
         LEFT JOIN pg_database D ON S.datid = D.oid;
 
+CREATE VIEW pg_stat_progress_repack AS
+    SELECT
+        S.pid AS pid,
+        S.datid AS datid,
+        D.datname AS datname,
+        S.relid AS relid,
+        -- param1 is currently unused
+        CASE S.param2 WHEN 0 THEN 'initializing'
+                      WHEN 1 THEN 'seq scanning heap'
+                      WHEN 2 THEN 'index scanning heap'
+                      WHEN 3 THEN 'sorting tuples'
+                      WHEN 4 THEN 'writing new heap'
+                      WHEN 5 THEN 'swapping relation files'
+                      WHEN 6 THEN 'rebuilding index'
+                      WHEN 7 THEN 'performing final cleanup'
+                      END AS phase,
+        CAST(S.param3 AS oid) AS repack_index_relid,
+        S.param4 AS heap_tuples_scanned,
+        S.param5 AS heap_tuples_written,
+        S.param6 AS heap_blks_total,
+        S.param7 AS heap_blks_scanned,
+        S.param8 AS index_rebuild_count
+    FROM pg_stat_get_progress_info('REPACK') AS S
+        LEFT JOIN pg_database D ON S.datid = D.oid;
+
+
 CREATE VIEW pg_stat_progress_create_index AS
     SELECT
         S.pid AS pid, S.datid AS datid, D.datname AS datname,
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index b55221d44cd..8b64f9e6795 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -67,18 +67,41 @@ typedef struct
 	Oid			indexOid;
 } RelToCluster;
 
-
-static void cluster_multiple_rels(List *rtcs, ClusterParams *params);
-static void rebuild_relation(Relation OldHeap, Relation index, bool verbose);
+static bool cluster_rel_recheck(RepackCommand cmd, Relation OldHeap,
+								Oid indexOid, Oid userid, int options);
+static void rebuild_relation(RepackCommand cmd, bool usingindex,
+							 Relation OldHeap, Relation index, bool verbose);
 static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
 							bool verbose, bool *pSwapToastByContent,
 							TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
-static List *get_tables_to_cluster(MemoryContext cluster_context);
-static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
-											   Oid indexOid);
-static bool cluster_is_permitted_for_relation(Oid relid, Oid userid);
+static List *get_tables_to_repack(RepackCommand cmd, bool usingindex,
+								  MemoryContext permcxt);
+static List *get_tables_to_repack_partitioned(RepackCommand cmd,
+											  MemoryContext cluster_context,
+											  Oid relid, bool rel_is_index);
+static bool cluster_is_permitted_for_relation(RepackCommand cmd,
+											  Oid relid, Oid userid);
+static Relation process_single_relation(RepackStmt *stmt,
+										ClusterParams *params);
+static Oid	determine_clustered_index(Relation rel, bool usingindex,
+									  const char *indexname);
 
 
+static const char *
+RepackCommandAsString(RepackCommand cmd)
+{
+	switch (cmd)
+	{
+		case REPACK_COMMAND_REPACK:
+			return "REPACK";
+		case REPACK_COMMAND_VACUUMFULL:
+			return "VACUUM";
+		case REPACK_COMMAND_CLUSTER:
+			return "CLUSTER";
+	}
+	return "???";
+}
+
 /*---------------------------------------------------------------------------
  * This cluster code allows for clustering multiple tables at once. Because
  * of this, we cannot just run everything on a single transaction, or we
@@ -104,191 +127,155 @@ static bool cluster_is_permitted_for_relation(Oid relid, Oid userid);
  *---------------------------------------------------------------------------
  */
 void
-cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
+ExecRepack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
 {
-	ListCell   *lc;
 	ClusterParams params = {0};
-	bool		verbose = false;
 	Relation	rel = NULL;
-	Oid			indexOid = InvalidOid;
-	MemoryContext cluster_context;
+	MemoryContext repack_context;
 	List	   *rtcs;
 
 	/* Parse option list */
-	foreach(lc, stmt->params)
+	foreach_node(DefElem, opt, stmt->params)
 	{
-		DefElem    *opt = (DefElem *) lfirst(lc);
-
 		if (strcmp(opt->defname, "verbose") == 0)
-			verbose = defGetBoolean(opt);
+			params.options |= defGetBoolean(opt) ? CLUOPT_VERBOSE : 0;
+		else if (strcmp(opt->defname, "analyze") == 0 ||
+				 strcmp(opt->defname, "analyse") == 0)
+			params.options |= defGetBoolean(opt) ? CLUOPT_ANALYZE : 0;
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("unrecognized CLUSTER option \"%s\"",
+					 errmsg("unrecognized %s option \"%s\"",
+							RepackCommandAsString(stmt->command),
 							opt->defname),
 					 parser_errposition(pstate, opt->location)));
 	}
 
-	params.options = (verbose ? CLUOPT_VERBOSE : 0);
-
+	/*
+	 * If a single relation is specified, process it and we're done ... unless
+	 * the relation is a partitioned table, in which case we fall through.
+	 */
 	if (stmt->relation != NULL)
 	{
-		/* This is the single-relation case. */
-		Oid			tableOid;
-
-		/*
-		 * Find, lock, and check permissions on the table.  We obtain
-		 * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
-		 * single-transaction case.
-		 */
-		tableOid = RangeVarGetRelidExtended(stmt->relation,
-											AccessExclusiveLock,
-											0,
-											RangeVarCallbackMaintainsTable,
-											NULL);
-		rel = table_open(tableOid, NoLock);
-
-		/*
-		 * Reject clustering a remote temp table ... their local buffer
-		 * manager is not going to cope.
-		 */
-		if (RELATION_IS_OTHER_TEMP(rel))
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("cannot cluster temporary tables of other sessions")));
-
-		if (stmt->indexname == NULL)
-		{
-			ListCell   *index;
-
-			/* We need to find the index that has indisclustered set. */
-			foreach(index, RelationGetIndexList(rel))
-			{
-				indexOid = lfirst_oid(index);
-				if (get_index_isclustered(indexOid))
-					break;
-				indexOid = InvalidOid;
-			}
-
-			if (!OidIsValid(indexOid))
-				ereport(ERROR,
-						(errcode(ERRCODE_UNDEFINED_OBJECT),
-						 errmsg("there is no previously clustered index for table \"%s\"",
-								stmt->relation->relname)));
-		}
-		else
-		{
-			/*
-			 * The index is expected to be in the same namespace as the
-			 * relation.
-			 */
-			indexOid = get_relname_relid(stmt->indexname,
-										 rel->rd_rel->relnamespace);
-			if (!OidIsValid(indexOid))
-				ereport(ERROR,
-						(errcode(ERRCODE_UNDEFINED_OBJECT),
-						 errmsg("index \"%s\" for table \"%s\" does not exist",
-								stmt->indexname, stmt->relation->relname)));
-		}
-
-		/* For non-partitioned tables, do what we came here to do. */
-		if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
-		{
-			cluster_rel(rel, indexOid, &params);
-			/* cluster_rel closes the relation, but keeps lock */
-
+		rel = process_single_relation(stmt, &params);
+		if (rel == NULL)
 			return;
-		}
 	}
 
+	/* Don't allow this for now.  Maybe we can add support for this later */
+	if (params.options & CLUOPT_ANALYZE)
+		ereport(ERROR,
+				errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("cannot ANALYZE multiple tables"));
+
 	/*
 	 * By here, we know we are in a multi-table situation.  In order to avoid
 	 * holding locks for too long, we want to process each table in its own
 	 * transaction.  This forces us to disallow running inside a user
 	 * transaction block.
 	 */
-	PreventInTransactionBlock(isTopLevel, "CLUSTER");
+	PreventInTransactionBlock(isTopLevel, RepackCommandAsString(stmt->command));
 
 	/* Also, we need a memory context to hold our list of relations */
-	cluster_context = AllocSetContextCreate(PortalContext,
-											"Cluster",
-											ALLOCSET_DEFAULT_SIZES);
+	repack_context = AllocSetContextCreate(PortalContext,
+										   "Repack",
+										   ALLOCSET_DEFAULT_SIZES);
 
-	/*
-	 * Either we're processing a partitioned table, or we were not given any
-	 * table name at all.  In either case, obtain a list of relations to
-	 * process.
-	 *
-	 * In the former case, an index name must have been given, so we don't
-	 * need to recheck its "indisclustered" bit, but we have to check that it
-	 * is an index that we can cluster on.  In the latter case, we set the
-	 * option bit to have indisclustered verified.
-	 *
-	 * Rechecking the relation itself is necessary here in all cases.
-	 */
 	params.options |= CLUOPT_RECHECK;
-	if (rel != NULL)
+
+	/*
+	 * If we don't have a relation yet, determine a relation list.  If we do,
+	 * then it must be a partitioned table, and we want to process its
+	 * partitions.
+	 */
+	if (rel == NULL)
 	{
+		Assert(stmt->indexname == NULL);
+		rtcs = get_tables_to_repack(stmt->command, stmt->usingindex,
+									repack_context);
+	}
+	else
+	{
+		Oid			relid;
+		bool		rel_is_index;
+
 		Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
-		check_index_is_clusterable(rel, indexOid, AccessShareLock);
-		rtcs = get_tables_to_cluster_partitioned(cluster_context, indexOid);
 
-		/* close relation, releasing lock on parent table */
+		/*
+		 * If an index name was specified, resolve it now and pass it down.
+		 */
+		if (stmt->usingindex)
+		{
+			/*
+			 * XXX how should this behave?  Passing no index to a partitioned
+			 * table could be useful to have certain partitions clustered by
+			 * some index, and other partitions by a different index.
+			 */
+			if (!stmt->indexname)
+				ereport(ERROR,
+						errmsg("there is no previously clustered index for table \"%s\"",
+							   RelationGetRelationName(rel)));
+
+			relid = determine_clustered_index(rel, true, stmt->indexname);
+			if (!OidIsValid(relid))
+				elog(ERROR, "unable to determine index to cluster on");
+			/* XXX is this the right place for this check? */
+			check_index_is_clusterable(rel, relid, AccessExclusiveLock);
+			rel_is_index = true;
+		}
+		else
+		{
+			relid = RelationGetRelid(rel);
+			rel_is_index = false;
+		}
+
+		rtcs = get_tables_to_repack_partitioned(stmt->command, repack_context,
+												relid, rel_is_index);
+
+		/* close parent relation, releasing lock on it */
 		table_close(rel, AccessExclusiveLock);
+		rel = NULL;
 	}
-	else
-	{
-		rtcs = get_tables_to_cluster(cluster_context);
-		params.options |= CLUOPT_RECHECK_ISCLUSTERED;
-	}
-
-	/* Do the job. */
-	cluster_multiple_rels(rtcs, &params);
-
-	/* Start a new transaction for the cleanup work. */
-	StartTransactionCommand();
-
-	/* Clean up working storage */
-	MemoryContextDelete(cluster_context);
-}
-
-/*
- * Given a list of relations to cluster, process each of them in a separate
- * transaction.
- *
- * We expect to be in a transaction at start, but there isn't one when we
- * return.
- */
-static void
-cluster_multiple_rels(List *rtcs, ClusterParams *params)
-{
-	ListCell   *lc;
 
 	/* Commit to get out of starting transaction */
 	PopActiveSnapshot();
 	CommitTransactionCommand();
 
 	/* Cluster the tables, each in a separate transaction */
-	foreach(lc, rtcs)
+	Assert(rel == NULL);
+	foreach_ptr(RelToCluster, rtc, rtcs)
 	{
-		RelToCluster *rtc = (RelToCluster *) lfirst(lc);
-		Relation	rel;
-
 		/* Start a new transaction for each relation. */
 		StartTransactionCommand();
 
+		/*
+		 * Open the target table, coping with the case where it has been
+		 * dropped.
+		 */
+		rel = try_table_open(rtc->tableOid, AccessExclusiveLock);
+		if (rel == NULL)
+		{
+			CommitTransactionCommand();
+			continue;
+		}
+
 		/* functions in indexes may want a snapshot set */
 		PushActiveSnapshot(GetTransactionSnapshot());
 
-		rel = table_open(rtc->tableOid, AccessExclusiveLock);
-
 		/* Process this table */
-		cluster_rel(rel, rtc->indexOid, params);
+		cluster_rel(stmt->command, stmt->usingindex,
+					rel, rtc->indexOid, &params);
 		/* cluster_rel closes the relation, but keeps lock */
 
 		PopActiveSnapshot();
 		CommitTransactionCommand();
 	}
+
+	/* Start a new transaction for the cleanup work. */
+	StartTransactionCommand();
+
+	/* Clean up working storage */
+	MemoryContextDelete(repack_context);
 }
 
 /*
@@ -304,11 +291,14 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
  * them incrementally while we load the table.
  *
  * If indexOid is InvalidOid, the table will be rewritten in physical order
- * instead of index order.  This is the new implementation of VACUUM FULL,
- * and error messages should refer to the operation as VACUUM not CLUSTER.
+ * instead of index order.
+ *
+ * 'cmd' indicates which command is being executed, to be used for error
+ * messages.
  */
 void
-cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
+cluster_rel(RepackCommand cmd, bool usingindex,
+			Relation OldHeap, Oid indexOid, ClusterParams *params)
 {
 	Oid			tableOid = RelationGetRelid(OldHeap);
 	Oid			save_userid;
@@ -323,13 +313,25 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
 	/* Check for user-requested abort. */
 	CHECK_FOR_INTERRUPTS();
 
-	pgstat_progress_start_command(PROGRESS_COMMAND_CLUSTER, tableOid);
-	if (OidIsValid(indexOid))
-		pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
+	if (cmd == REPACK_COMMAND_REPACK)
+		pgstat_progress_start_command(PROGRESS_COMMAND_REPACK, tableOid);
+	else
+		pgstat_progress_start_command(PROGRESS_COMMAND_CLUSTER, tableOid);
+
+	if (cmd == REPACK_COMMAND_REPACK)
+		pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
+									 PROGRESS_REPACK_COMMAND_REPACK);
+	else if (cmd == REPACK_COMMAND_CLUSTER)
+	{
+		pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
 									 PROGRESS_CLUSTER_COMMAND_CLUSTER);
+	}
 	else
-		pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
+	{
+		Assert(cmd == REPACK_COMMAND_VACUUMFULL);
+		pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
 									 PROGRESS_CLUSTER_COMMAND_VACUUM_FULL);
+	}
 
 	/*
 	 * Switch to the table owner's userid, so that any index functions are run
@@ -351,63 +353,21 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
 	 * to cluster a not-previously-clustered index.
 	 */
 	if (recheck)
-	{
-		/* Check that the user still has privileges for the relation */
-		if (!cluster_is_permitted_for_relation(tableOid, save_userid))
-		{
-			relation_close(OldHeap, AccessExclusiveLock);
+		if (!cluster_rel_recheck(cmd, OldHeap, indexOid, save_userid,
+								 params->options))
 			goto out;
-		}
-
-		/*
-		 * Silently skip a temp table for a remote session.  Only doing this
-		 * check in the "recheck" case is appropriate (which currently means
-		 * somebody is executing a database-wide CLUSTER or on a partitioned
-		 * table), because there is another check in cluster() which will stop
-		 * any attempt to cluster remote temp tables by name.  There is
-		 * another check in cluster_rel which is redundant, but we leave it
-		 * for extra safety.
-		 */
-		if (RELATION_IS_OTHER_TEMP(OldHeap))
-		{
-			relation_close(OldHeap, AccessExclusiveLock);
-			goto out;
-		}
-
-		if (OidIsValid(indexOid))
-		{
-			/*
-			 * Check that the index still exists
-			 */
-			if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexOid)))
-			{
-				relation_close(OldHeap, AccessExclusiveLock);
-				goto out;
-			}
-
-			/*
-			 * Check that the index is still the one with indisclustered set,
-			 * if needed.
-			 */
-			if ((params->options & CLUOPT_RECHECK_ISCLUSTERED) != 0 &&
-				!get_index_isclustered(indexOid))
-			{
-				relation_close(OldHeap, AccessExclusiveLock);
-				goto out;
-			}
-		}
-	}
 
 	/*
-	 * We allow VACUUM FULL, but not CLUSTER, on shared catalogs.  CLUSTER
-	 * would work in most respects, but the index would only get marked as
-	 * indisclustered in the current database, leading to unexpected behavior
-	 * if CLUSTER were later invoked in another database.
+	 * We allow repacking shared catalogs only when not using an index. It
+	 * would work to use an index in most respects, but the index would only
+	 * get marked as indisclustered in the current database, leading to
+	 * unexpected behavior if CLUSTER were later invoked in another database.
 	 */
-	if (OidIsValid(indexOid) && OldHeap->rd_rel->relisshared)
+	if (usingindex && OldHeap->rd_rel->relisshared)
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-				 errmsg("cannot cluster a shared catalog")));
+				 errmsg("cannot run \"%s\" on a shared catalog",
+						RepackCommandAsString(cmd))));
 
 	/*
 	 * Don't process temp tables of other backends ... their local buffer
@@ -415,21 +375,30 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
 	 */
 	if (RELATION_IS_OTHER_TEMP(OldHeap))
 	{
-		if (OidIsValid(indexOid))
+		if (cmd == REPACK_COMMAND_CLUSTER)
 			ereport(ERROR,
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("cannot cluster temporary tables of other sessions")));
+		else if (cmd == REPACK_COMMAND_REPACK)
+		{
+			ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("cannot repack temporary tables of other sessions")));
+		}
 		else
+		{
+			Assert(cmd == REPACK_COMMAND_VACUUMFULL);
 			ereport(ERROR,
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("cannot vacuum temporary tables of other sessions")));
+		}
 	}
 
 	/*
 	 * Also check for active uses of the relation in the current transaction,
 	 * including open scans and pending AFTER trigger events.
 	 */
-	CheckTableNotInUse(OldHeap, OidIsValid(indexOid) ? "CLUSTER" : "VACUUM");
+	CheckTableNotInUse(OldHeap, RepackCommandAsString(cmd));
 
 	/* Check heap and index are valid to cluster on */
 	if (OidIsValid(indexOid))
@@ -469,7 +438,7 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
 	TransferPredicateLocksToHeapRelation(OldHeap);
 
 	/* rebuild_relation does all the dirty work */
-	rebuild_relation(OldHeap, index, verbose);
+	rebuild_relation(cmd, usingindex, OldHeap, index, verbose);
 	/* rebuild_relation closes OldHeap, and index if valid */
 
 out:
@@ -482,6 +451,63 @@ out:
 	pgstat_progress_end_command();
 }
 
+/*
+ * Check if the table (and its index) still meets the requirements of
+ * cluster_rel().
+ */
+static bool
+cluster_rel_recheck(RepackCommand cmd, Relation OldHeap, Oid indexOid,
+					Oid userid, int options)
+{
+	Oid			tableOid = RelationGetRelid(OldHeap);
+
+	/* Check that the user still has privileges for the relation */
+	if (!cluster_is_permitted_for_relation(cmd, tableOid, userid))
+	{
+		relation_close(OldHeap, AccessExclusiveLock);
+		return false;
+	}
+
+	/*
+	 * Silently skip a temp table for a remote session.  Only doing this check
+	 * in the "recheck" case is appropriate (which currently means somebody is
+	 * executing a database-wide CLUSTER or on a partitioned table), because
+	 * there is another check in cluster() which will stop any attempt to
+	 * cluster remote temp tables by name.  There is another check in
+	 * cluster_rel which is redundant, but we leave it for extra safety.
+	 */
+	if (RELATION_IS_OTHER_TEMP(OldHeap))
+	{
+		relation_close(OldHeap, AccessExclusiveLock);
+		return false;
+	}
+
+	if (OidIsValid(indexOid))
+	{
+		/*
+		 * Check that the index still exists
+		 */
+		if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexOid)))
+		{
+			relation_close(OldHeap, AccessExclusiveLock);
+			return false;
+		}
+
+		/*
+		 * Check that the index is still the one with indisclustered set, if
+		 * needed.
+		 */
+		if ((options & CLUOPT_RECHECK_ISCLUSTERED) != 0 &&
+			!get_index_isclustered(indexOid))
+		{
+			relation_close(OldHeap, AccessExclusiveLock);
+			return false;
+		}
+	}
+
+	return true;
+}
+
 /*
  * Verify that the specified heap and index are valid to cluster on
  *
@@ -626,7 +652,8 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
  * On exit, they are closed, but locks on them are not released.
  */
 static void
-rebuild_relation(Relation OldHeap, Relation index, bool verbose)
+rebuild_relation(RepackCommand cmd, bool usingindex,
+				 Relation OldHeap, Relation index, bool verbose)
 {
 	Oid			tableOid = RelationGetRelid(OldHeap);
 	Oid			accessMethod = OldHeap->rd_rel->relam;
@@ -642,8 +669,8 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
 	Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false) &&
 		   (index == NULL || CheckRelationLockedByMe(index, AccessExclusiveLock, false)));
 
-	if (index)
-		/* Mark the correct index as clustered */
+	/* for CLUSTER or REPACK USING INDEX, mark the index as the one to use */
+	if (usingindex)
 		mark_index_clustered(OldHeap, RelationGetRelid(index), true);
 
 	/* Remember info about rel before closing OldHeap */
@@ -1458,8 +1485,8 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
 	int			i;
 
 	/* Report that we are now swapping relation files */
-	pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
-								 PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES);
+	pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+								 PROGRESS_REPACK_PHASE_SWAP_REL_FILES);
 
 	/* Zero out possible results from swapped_relation_files */
 	memset(mapped_tables, 0, sizeof(mapped_tables));
@@ -1509,14 +1536,14 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
 		reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
 
 	/* Report that we are now reindexing relations */
-	pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
-								 PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+	pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+								 PROGRESS_REPACK_PHASE_REBUILD_INDEX);
 
 	reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
 
 	/* Report that we are now doing clean up */
-	pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
-								 PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP);
+	pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+								 PROGRESS_REPACK_PHASE_FINAL_CLEANUP);
 
 	/*
 	 * If the relation being rebuilt is pg_class, swap_relation_files()
@@ -1632,69 +1659,137 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
 	}
 }
 
-
 /*
- * Get a list of tables that the current user has privileges on and
- * have indisclustered set.  Return the list in a List * of RelToCluster
- * (stored in the specified memory context), each one giving the tableOid
- * and the indexOid on which the table is already clustered.
+ * Determine which relations to process, when REPACK/CLUSTER is called
+ * without specifying a table name.  The exact process depends on whether
+ * USING INDEX was given or not, and in any case we only return tables and
+ * materialized views that the current user has privileges to repack/cluster.
+ *
+ * If USING INDEX was given, we scan pg_index to find the indexes that have
+ * indisclustered set; if it was not given, we scan pg_class and return all
+ * plain tables and materialized views.
+ *
+ * Return it as a list of RelToCluster in the given memory context.
  */
 static List *
-get_tables_to_cluster(MemoryContext cluster_context)
+get_tables_to_repack(RepackCommand command, bool usingindex,
+					 MemoryContext permcxt)
 {
-	Relation	indRelation;
+	Relation	catalog;
 	TableScanDesc scan;
-	ScanKeyData entry;
-	HeapTuple	indexTuple;
-	Form_pg_index index;
+	HeapTuple	tuple;
 	MemoryContext old_context;
 	List	   *rtcs = NIL;
 
-	/*
-	 * Get all indexes that have indisclustered set and that the current user
-	 * has the appropriate privileges for.
-	 */
-	indRelation = table_open(IndexRelationId, AccessShareLock);
-	ScanKeyInit(&entry,
-				Anum_pg_index_indisclustered,
-				BTEqualStrategyNumber, F_BOOLEQ,
-				BoolGetDatum(true));
-	scan = table_beginscan_catalog(indRelation, 1, &entry);
-	while ((indexTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	if (usingindex)
 	{
-		RelToCluster *rtc;
+		ScanKeyData entry;
 
-		index = (Form_pg_index) GETSTRUCT(indexTuple);
+		catalog = table_open(IndexRelationId, AccessShareLock);
+		ScanKeyInit(&entry,
+					Anum_pg_index_indisclustered,
+					BTEqualStrategyNumber, F_BOOLEQ,
+					BoolGetDatum(true));
+		scan = table_beginscan_catalog(catalog, 1, &entry);
+		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+		{
+			RelToCluster *rtc;
+			Form_pg_index index;
 
-		if (!cluster_is_permitted_for_relation(index->indrelid, GetUserId()))
-			continue;
+			index = (Form_pg_index) GETSTRUCT(tuple);
 
-		/* Use a permanent memory context for the result list */
-		old_context = MemoryContextSwitchTo(cluster_context);
+			/*
+			 * XXX I think the only reason there's no test failure here is
+			 * that we seldom have clustered indexes that would be affected by
+			 * concurrency.  Maybe we should also do the
+			 * ConditionalLockRelationOid+SearchSysCacheExists dance that we
+			 * do below.
+			 */
+			if (!cluster_is_permitted_for_relation(command, index->indrelid,
+												   GetUserId()))
+				continue;
 
-		rtc = (RelToCluster *) palloc(sizeof(RelToCluster));
-		rtc->tableOid = index->indrelid;
-		rtc->indexOid = index->indexrelid;
-		rtcs = lappend(rtcs, rtc);
+			/* Use a permanent memory context for the result list */
+			old_context = MemoryContextSwitchTo(permcxt);
 
-		MemoryContextSwitchTo(old_context);
+			rtc = (RelToCluster *) palloc(sizeof(RelToCluster));
+			rtc->tableOid = index->indrelid;
+			rtc->indexOid = index->indexrelid;
+			rtcs = lappend(rtcs, rtc);
+
+			MemoryContextSwitchTo(old_context);
+		}
 	}
+	else
+	{
+		catalog = table_open(RelationRelationId, AccessShareLock);
+		scan = table_beginscan_catalog(catalog, 0, NULL);
+
+		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+		{
+			RelToCluster *rtc;
+			Form_pg_class class;
+
+			class = (Form_pg_class) GETSTRUCT(tuple);
+
+			/*
+			 * Try to obtain a light lock on the table, to ensure it doesn't
+			 * go away while we collect the list.  If we cannot, just
+			 * disregard the table.  XXX we could release at the bottom of the
+			 * loop, but for now just hold it until this transaction is
+			 * finished.
+			 */
+			if (!ConditionalLockRelationOid(class->oid, AccessShareLock))
+				continue;
+
+			/* Verify that the table still exists. */
+			if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(class->oid)))
+			{
+				/* Release useless lock */
+				UnlockRelationOid(class->oid, AccessShareLock);
+				continue;
+			}
+
+			/* Can only process plain tables and matviews */
+			if (class->relkind != RELKIND_RELATION &&
+				class->relkind != RELKIND_MATVIEW)
+				continue;
+
+			if (!cluster_is_permitted_for_relation(command, class->oid,
+												   GetUserId()))
+				continue;
+
+			/* Use a permanent memory context for the result list */
+			old_context = MemoryContextSwitchTo(permcxt);
+
+			rtc = (RelToCluster *) palloc(sizeof(RelToCluster));
+			rtc->tableOid = class->oid;
+			rtc->indexOid = InvalidOid;
+			rtcs = lappend(rtcs, rtc);
+
+			MemoryContextSwitchTo(old_context);
+		}
+	}
+
 	table_endscan(scan);
-
-	relation_close(indRelation, AccessShareLock);
+	relation_close(catalog, AccessShareLock);
 
 	return rtcs;
 }
 
 /*
- * Given an index on a partitioned table, return a list of RelToCluster for
+ * Given a partitioned table or its index, return a list of RelToCluster for
  * all the children leaves tables/indexes.
  *
  * Like expand_vacuum_rel, but here caller must hold AccessExclusiveLock
  * on the table containing the index.
+ *
+ * 'rel_is_index' tells whether 'relid' is that of an index (true) or of the
+ * owning relation.
  */
 static List *
-get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
+get_tables_to_repack_partitioned(RepackCommand cmd, MemoryContext cluster_context,
+								 Oid relid, bool rel_is_index)
 {
 	List	   *inhoids;
 	ListCell   *lc;
@@ -1702,17 +1797,33 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
 	MemoryContext old_context;
 
 	/* Do not lock the children until they're processed */
-	inhoids = find_all_inheritors(indexOid, NoLock, NULL);
+	inhoids = find_all_inheritors(relid, NoLock, NULL);
 
 	foreach(lc, inhoids)
 	{
-		Oid			indexrelid = lfirst_oid(lc);
-		Oid			relid = IndexGetRelation(indexrelid, false);
+		Oid			inhoid = lfirst_oid(lc);
+		Oid			inhrelid,
+					inhindid;
 		RelToCluster *rtc;
 
-		/* consider only leaf indexes */
-		if (get_rel_relkind(indexrelid) != RELKIND_INDEX)
-			continue;
+		if (rel_is_index)
+		{
+			/* consider only leaf indexes */
+			if (get_rel_relkind(inhoid) != RELKIND_INDEX)
+				continue;
+
+			inhrelid = IndexGetRelation(inhoid, false);
+			inhindid = inhoid;
+		}
+		else
+		{
+			/* consider only leaf relations */
+			if (get_rel_relkind(inhoid) != RELKIND_RELATION)
+				continue;
+
+			inhrelid = inhoid;
+			inhindid = InvalidOid;
+		}
 
 		/*
 		 * It's possible that the user does not have privileges to CLUSTER the
@@ -1720,15 +1831,15 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
 		 * table.  We skip any partitions which the user is not permitted to
 		 * CLUSTER.
 		 */
-		if (!cluster_is_permitted_for_relation(relid, GetUserId()))
+		if (!cluster_is_permitted_for_relation(cmd, inhrelid, GetUserId()))
 			continue;
 
 		/* Use a permanent memory context for the result list */
 		old_context = MemoryContextSwitchTo(cluster_context);
 
 		rtc = (RelToCluster *) palloc(sizeof(RelToCluster));
-		rtc->tableOid = relid;
-		rtc->indexOid = indexrelid;
+		rtc->tableOid = inhrelid;
+		rtc->indexOid = inhindid;
 		rtcs = lappend(rtcs, rtc);
 
 		MemoryContextSwitchTo(old_context);
@@ -1742,13 +1853,148 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
  * function emits a WARNING.
  */
 static bool
-cluster_is_permitted_for_relation(Oid relid, Oid userid)
+cluster_is_permitted_for_relation(RepackCommand cmd, Oid relid, Oid userid)
 {
 	if (pg_class_aclcheck(relid, userid, ACL_MAINTAIN) == ACLCHECK_OK)
 		return true;
 
+	Assert(cmd == REPACK_COMMAND_CLUSTER || cmd == REPACK_COMMAND_REPACK);
 	ereport(WARNING,
-			(errmsg("permission denied to cluster \"%s\", skipping it",
-					get_rel_name(relid))));
+			errmsg("permission denied to execute %s on \"%s\", skipping it",
+				   cmd == REPACK_COMMAND_CLUSTER ? "CLUSTER" : "REPACK",
+				   get_rel_name(relid)));
+
 	return false;
 }
+
+
+/*
+ * Given a RepackStmt with an indicated relation name, resolve the relation
+ * name, obtain lock on it, then determine what to do based on the relation
+ * type: if it's not a partitioned table, repack it as indicated (using an
+ * existing clustered index, or following the indicated index), and return
+ * NULL.
+ *
+ * On the other hand, if the table is partitioned, do nothing further and
+ * instead return the opened relcache entry, so that caller can process the
+ * partitions using the multiple-table handling code.  The index name is not
+ * resolved in this case.
+ */
+static Relation
+process_single_relation(RepackStmt *stmt, ClusterParams *params)
+{
+	Relation	rel;
+	Oid			tableOid;
+
+	Assert(stmt->relation != NULL);
+	Assert(stmt->command == REPACK_COMMAND_CLUSTER ||
+		   stmt->command == REPACK_COMMAND_REPACK);
+
+	/*
+	 * Find, lock, and check permissions on the table.  We obtain
+	 * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
+	 * single-transaction case.
+	 */
+	tableOid = RangeVarGetRelidExtended(stmt->relation,
+										AccessExclusiveLock,
+										0,
+										RangeVarCallbackMaintainsTable,
+										NULL);
+	rel = table_open(tableOid, NoLock);
+
+	/*
+	 * Reject clustering a remote temp table ... their local buffer manager is
+	 * not going to cope.
+	 */
+	if (RELATION_IS_OTHER_TEMP(rel))
+	{
+		ereport(ERROR,
+				errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("cannot execute %s on temporary tables of other sessions",
+					   RepackCommandAsString(stmt->command)));
+	}
+
+	/*
+	 * For partitioned tables, let caller handle this.  Otherwise, process it
+	 * here and we're done.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		return rel;
+	else
+	{
+		Oid			indexOid;
+
+		indexOid = determine_clustered_index(rel, stmt->usingindex,
+											 stmt->indexname);
+		if (OidIsValid(indexOid))
+			check_index_is_clusterable(rel, indexOid, AccessExclusiveLock);
+		cluster_rel(stmt->command, stmt->usingindex, rel, indexOid, params);
+
+		/* Do an analyze, if requested */
+		if (params->options & CLUOPT_ANALYZE)
+		{
+			VacuumParams vac_params = {0};
+
+			vac_params.options |= VACOPT_ANALYZE;
+			if (params->options & CLUOPT_VERBOSE)
+				vac_params.options |= VACOPT_VERBOSE;
+			analyze_rel(RelationGetRelid(rel), NULL, vac_params, NIL, true,
+						NULL);
+		}
+
+		return NULL;
+	}
+}
+
+/*
+ * Given a relation and the usingindex/indexname options in a
+ * REPACK USING INDEX or CLUSTER command, return the OID of the index to use
+ * for clustering the table.
+ *
+ * Caller must hold lock on the relation so that the set of indexes doesn't
+ * change, and must call check_index_is_clusterable.
+ */
+static Oid
+determine_clustered_index(Relation rel, bool usingindex, const char *indexname)
+{
+	Oid			indexOid = InvalidOid;
+
+	if (indexname == NULL && usingindex)
+	{
+		ListCell   *lc;
+
+		/* Find an index with indisclustered set, or report error */
+		foreach(lc, RelationGetIndexList(rel))
+		{
+			indexOid = lfirst_oid(lc);
+
+			if (get_index_isclustered(indexOid))
+				break;
+			indexOid = InvalidOid;
+		}
+
+		if (!OidIsValid(indexOid))
+			ereport(ERROR,
+					errcode(ERRCODE_UNDEFINED_OBJECT),
+					errmsg("there is no previously clustered index for table \"%s\"",
+						   RelationGetRelationName(rel)));
+	}
+	else if (indexname != NULL)
+	{
+		/*
+		 * An index was specified; figure out its OID.  It must be in the same
+		 * namespace as the relation.
+		 */
+		indexOid = get_relname_relid(indexname,
+									 rel->rd_rel->relnamespace);
+		if (!OidIsValid(indexOid))
+			ereport(ERROR,
+					errcode(ERRCODE_UNDEFINED_OBJECT),
+					errmsg("index \"%s\" for table \"%s\" does not exist",
+						   indexname, RelationGetRelationName(rel)));
+	}
+	else
+		indexOid = InvalidOid;
+
+	return indexOid;
+}
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 733ef40ae7c..8863ad0e8bd 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2287,7 +2287,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams params,
 				cluster_params.options |= CLUOPT_VERBOSE;
 
 			/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
-			cluster_rel(rel, InvalidOid, &cluster_params);
+			cluster_rel(REPACK_COMMAND_VACUUMFULL, false, rel, InvalidOid,
+						&cluster_params);
 			/* cluster_rel closes the relation, but keeps lock */
 
 			rel = NULL;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index db43034b9db..f9152728021 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -280,7 +280,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		AlterCompositeTypeStmt AlterUserMappingStmt
 		AlterRoleStmt AlterRoleSetStmt AlterPolicyStmt AlterStatsStmt
 		AlterDefaultPrivilegesStmt DefACLAction
-		AnalyzeStmt CallStmt ClosePortalStmt ClusterStmt CommentStmt
+		AnalyzeStmt CallStmt ClosePortalStmt CommentStmt
 		ConstraintsSetStmt CopyStmt CreateAsStmt CreateCastStmt
 		CreateDomainStmt CreateExtensionStmt CreateGroupStmt CreateOpClassStmt
 		CreateOpFamilyStmt AlterOpFamilyStmt CreatePLangStmt
@@ -297,7 +297,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		GrantStmt GrantRoleStmt ImportForeignSchemaStmt IndexStmt InsertStmt
 		ListenStmt LoadStmt LockStmt MergeStmt NotifyStmt ExplainableStmt PreparableStmt
 		CreateFunctionStmt AlterFunctionStmt ReindexStmt RemoveAggrStmt
-		RemoveFuncStmt RemoveOperStmt RenameStmt ReturnStmt RevokeStmt RevokeRoleStmt
+		RemoveFuncStmt RemoveOperStmt RenameStmt RepackStmt ReturnStmt RevokeStmt RevokeRoleStmt
 		RuleActionStmt RuleActionStmtOrEmpty RuleStmt
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
@@ -316,7 +316,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 %type <str>			opt_single_name
 %type <list>		opt_qualified_name
-%type <boolean>		opt_concurrently
+%type <boolean>		opt_concurrently opt_usingindex
 %type <dbehavior>	opt_drop_behavior
 %type <list>		opt_utility_option_list
 %type <list>		utility_option_list
@@ -763,7 +763,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	QUOTE QUOTES
 
 	RANGE READ REAL REASSIGN RECURSIVE REF_P REFERENCES REFERENCING
-	REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLICA
+	REFRESH REINDEX RELATIVE_P RELEASE RENAME REPACK REPEATABLE REPLACE REPLICA
 	RESET RESTART RESTRICT RETURN RETURNING RETURNS REVOKE RIGHT ROLE ROLLBACK ROLLUP
 	ROUTINE ROUTINES ROW ROWS RULE
 
@@ -1025,7 +1025,6 @@ stmt:
 			| CallStmt
 			| CheckPointStmt
 			| ClosePortalStmt
-			| ClusterStmt
 			| CommentStmt
 			| ConstraintsSetStmt
 			| CopyStmt
@@ -1099,6 +1098,7 @@ stmt:
 			| RemoveFuncStmt
 			| RemoveOperStmt
 			| RenameStmt
+			| RepackStmt
 			| RevokeStmt
 			| RevokeRoleStmt
 			| RuleStmt
@@ -1135,6 +1135,11 @@ opt_concurrently:
 			| /*EMPTY*/						{ $$ = false; }
 		;
 
+opt_usingindex:
+			USING INDEX						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+		;
+
 opt_drop_behavior:
 			CASCADE							{ $$ = DROP_CASCADE; }
 			| RESTRICT						{ $$ = DROP_RESTRICT; }
@@ -11912,38 +11917,91 @@ CreateConversionStmt:
 /*****************************************************************************
  *
  *		QUERY:
+ *				REPACK [ (options) ] [ <qualified_name> [ USING INDEX [ <index_name> ] ] ]
+ *
+ *			obsolete variants:
  *				CLUSTER (options) [ <qualified_name> [ USING <index_name> ] ]
  *				CLUSTER [VERBOSE] [ <qualified_name> [ USING <index_name> ] ]
  *				CLUSTER [VERBOSE] <index_name> ON <qualified_name> (for pre-8.3)
  *
  *****************************************************************************/
 
-ClusterStmt:
-			CLUSTER '(' utility_option_list ')' qualified_name cluster_index_specification
+RepackStmt:
+			REPACK opt_utility_option_list qualified_name USING INDEX name
 				{
-					ClusterStmt *n = makeNode(ClusterStmt);
+					RepackStmt *n = makeNode(RepackStmt);
 
+					n->command = REPACK_COMMAND_REPACK;
+					n->relation = $3;
+					n->indexname = $6;
+					n->usingindex = true;
+					n->params = $2;
+					$$ = (Node *) n;
+				}
+			| REPACK opt_utility_option_list qualified_name opt_usingindex
+				{
+					RepackStmt *n = makeNode(RepackStmt);
+
+					n->command = REPACK_COMMAND_REPACK;
+					n->relation = $3;
+					n->indexname = NULL;
+					n->usingindex = $4;
+					n->params = $2;
+					$$ = (Node *) n;
+				}
+			| REPACK '(' utility_option_list ')'
+				{
+					RepackStmt *n = makeNode(RepackStmt);
+
+					n->command = REPACK_COMMAND_REPACK;
+					n->relation = NULL;
+					n->indexname = NULL;
+					n->usingindex = false;
+					n->params = $3;
+					$$ = (Node *) n;
+				}
+			| REPACK opt_usingindex
+				{
+					RepackStmt *n = makeNode(RepackStmt);
+
+					n->command = REPACK_COMMAND_REPACK;
+					n->relation = NULL;
+					n->indexname = NULL;
+					n->usingindex = $2;
+					n->params = NIL;
+					$$ = (Node *) n;
+				}
+			| CLUSTER '(' utility_option_list ')' qualified_name cluster_index_specification
+				{
+					RepackStmt *n = makeNode(RepackStmt);
+
+					n->command = REPACK_COMMAND_CLUSTER;
 					n->relation = $5;
 					n->indexname = $6;
+					n->usingindex = true;
 					n->params = $3;
 					$$ = (Node *) n;
 				}
 			| CLUSTER opt_utility_option_list
 				{
-					ClusterStmt *n = makeNode(ClusterStmt);
+					RepackStmt *n = makeNode(RepackStmt);
 
+					n->command = REPACK_COMMAND_CLUSTER;
 					n->relation = NULL;
 					n->indexname = NULL;
+					n->usingindex = true;
 					n->params = $2;
 					$$ = (Node *) n;
 				}
 			/* unparenthesized VERBOSE kept for pre-14 compatibility */
 			| CLUSTER opt_verbose qualified_name cluster_index_specification
 				{
-					ClusterStmt *n = makeNode(ClusterStmt);
+					RepackStmt *n = makeNode(RepackStmt);
 
+					n->command = REPACK_COMMAND_CLUSTER;
 					n->relation = $3;
 					n->indexname = $4;
+					n->usingindex = true;
 					if ($2)
 						n->params = list_make1(makeDefElem("verbose", NULL, @2));
 					$$ = (Node *) n;
@@ -11951,20 +12009,24 @@ ClusterStmt:
 			/* unparenthesized VERBOSE kept for pre-17 compatibility */
 			| CLUSTER VERBOSE
 				{
-					ClusterStmt *n = makeNode(ClusterStmt);
+					RepackStmt *n = makeNode(RepackStmt);
 
+					n->command = REPACK_COMMAND_CLUSTER;
 					n->relation = NULL;
 					n->indexname = NULL;
+					n->usingindex = true;
 					n->params = list_make1(makeDefElem("verbose", NULL, @2));
 					$$ = (Node *) n;
 				}
 			/* kept for pre-8.3 compatibility */
 			| CLUSTER opt_verbose name ON qualified_name
 				{
-					ClusterStmt *n = makeNode(ClusterStmt);
+					RepackStmt *n = makeNode(RepackStmt);
 
+					n->command = REPACK_COMMAND_CLUSTER;
 					n->relation = $5;
 					n->indexname = $3;
+					n->usingindex = true;
 					if ($2)
 						n->params = list_make1(makeDefElem("verbose", NULL, @2));
 					$$ = (Node *) n;
@@ -17960,6 +18022,7 @@ unreserved_keyword:
 			| RELATIVE_P
 			| RELEASE
 			| RENAME
+			| REPACK
 			| REPEATABLE
 			| REPLACE
 			| REPLICA
@@ -18592,6 +18655,7 @@ bare_label_keyword:
 			| RELATIVE_P
 			| RELEASE
 			| RENAME
+			| REPACK
 			| REPEATABLE
 			| REPLACE
 			| REPLICA
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 5f442bc3bd4..cf6db581007 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -277,9 +277,9 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 				return COMMAND_OK_IN_RECOVERY | COMMAND_OK_IN_READ_ONLY_TXN;
 			}
 
-		case T_ClusterStmt:
 		case T_ReindexStmt:
 		case T_VacuumStmt:
+		case T_RepackStmt:
 			{
 				/*
 				 * These commands write WAL, so they're not strictly
@@ -854,14 +854,14 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 			ExecuteCallStmt(castNode(CallStmt, parsetree), params, isAtomicContext, dest);
 			break;
 
-		case T_ClusterStmt:
-			cluster(pstate, (ClusterStmt *) parsetree, isTopLevel);
-			break;
-
 		case T_VacuumStmt:
 			ExecVacuum(pstate, (VacuumStmt *) parsetree, isTopLevel);
 			break;
 
+		case T_RepackStmt:
+			ExecRepack(pstate, (RepackStmt *) parsetree, isTopLevel);
+			break;
+
 		case T_ExplainStmt:
 			ExplainQuery(pstate, (ExplainStmt *) parsetree, params, dest);
 			break;
@@ -2851,10 +2851,6 @@ CreateCommandTag(Node *parsetree)
 			tag = CMDTAG_CALL;
 			break;
 
-		case T_ClusterStmt:
-			tag = CMDTAG_CLUSTER;
-			break;
-
 		case T_VacuumStmt:
 			if (((VacuumStmt *) parsetree)->is_vacuumcmd)
 				tag = CMDTAG_VACUUM;
@@ -2862,6 +2858,10 @@ CreateCommandTag(Node *parsetree)
 				tag = CMDTAG_ANALYZE;
 			break;
 
+		case T_RepackStmt:
+			tag = CMDTAG_REPACK;
+			break;
+
 		case T_ExplainStmt:
 			tag = CMDTAG_EXPLAIN;
 			break;
@@ -3499,7 +3499,7 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_ALL;
 			break;
 
-		case T_ClusterStmt:
+		case T_RepackStmt:
 			lev = LOGSTMT_DDL;
 			break;
 
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index c756c2bebaa..a1e10e8c2f6 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -268,6 +268,8 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
 		cmdtype = PROGRESS_COMMAND_ANALYZE;
 	else if (pg_strcasecmp(cmd, "CLUSTER") == 0)
 		cmdtype = PROGRESS_COMMAND_CLUSTER;
+	else if (pg_strcasecmp(cmd, "REPACK") == 0)
+		cmdtype = PROGRESS_COMMAND_REPACK;
 	else if (pg_strcasecmp(cmd, "CREATE INDEX") == 0)
 		cmdtype = PROGRESS_COMMAND_CREATE_INDEX;
 	else if (pg_strcasecmp(cmd, "BASEBACKUP") == 0)
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 8b10f2313f3..59ff6e0923b 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -1247,7 +1247,7 @@ static const char *const sql_commands[] = {
 	"DELETE FROM", "DISCARD", "DO", "DROP", "END", "EXECUTE", "EXPLAIN",
 	"FETCH", "GRANT", "IMPORT FOREIGN SCHEMA", "INSERT INTO", "LISTEN", "LOAD", "LOCK",
 	"MERGE INTO", "MOVE", "NOTIFY", "PREPARE",
-	"REASSIGN", "REFRESH MATERIALIZED VIEW", "REINDEX", "RELEASE",
+	"REASSIGN", "REFRESH MATERIALIZED VIEW", "REINDEX", "RELEASE", "REPACK",
 	"RESET", "REVOKE", "ROLLBACK",
 	"SAVEPOINT", "SECURITY LABEL", "SELECT", "SET", "SHOW", "START",
 	"TABLE", "TRUNCATE", "UNLISTEN", "UPDATE", "VACUUM", "VALUES", "WITH",
@@ -4997,6 +4997,37 @@ match_previous_words(int pattern_id,
 			COMPLETE_WITH_QUERY(Query_for_list_of_tablespaces);
 	}
 
+/* REPACK */
+	else if (Matches("REPACK"))
+		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
+	else if (Matches("REPACK", "(*)"))
+		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_clusterables);
+	/* If we have REPACK <sth>, then add "USING INDEX" */
+	else if (Matches("REPACK", MatchAnyExcept("(")))
+		COMPLETE_WITH("USING INDEX");
+	/* If we have REPACK (*) <sth>, then add "USING INDEX" */
+	else if (Matches("REPACK", "(*)", MatchAny))
+		COMPLETE_WITH("USING INDEX");
+	/* If we have REPACK <sth> USING INDEX, then add the index as well */
+	else if (Matches("REPACK", MatchAny, "USING", "INDEX"))
+	{
+		set_completion_reference(prev3_wd);
+		COMPLETE_WITH_SCHEMA_QUERY(Query_for_index_of_table);
+	}
+	else if (HeadMatches("REPACK", "(*") &&
+			 !HeadMatches("REPACK", "(*)"))
+	{
+		/*
+		 * This fires if we're in an unfinished parenthesized option list.
+		 * get_previous_words treats a completed parenthesized option list as
+		 * one word, so the above test is correct.
+		 */
+		if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
+			COMPLETE_WITH("VERBOSE");
+		else if (TailMatches("VERBOSE"))
+			COMPLETE_WITH("ON", "OFF");
+	}
+
 /* SECURITY LABEL */
 	else if (Matches("SECURITY"))
 		COMPLETE_WITH("LABEL");
diff --git a/src/bin/scripts/Makefile b/src/bin/scripts/Makefile
index 019ca06455d..f0c1bd4175c 100644
--- a/src/bin/scripts/Makefile
+++ b/src/bin/scripts/Makefile
@@ -16,7 +16,7 @@ subdir = src/bin/scripts
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-PROGRAMS = createdb createuser dropdb dropuser clusterdb vacuumdb reindexdb pg_isready
+PROGRAMS = createdb createuser dropdb dropuser clusterdb vacuumdb reindexdb pg_isready pg_repackdb
 
 override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
 LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
@@ -31,6 +31,7 @@ clusterdb: clusterdb.o common.o $(WIN32RES) | submake-libpq submake-libpgport su
 vacuumdb: vacuumdb.o vacuuming.o common.o $(WIN32RES) | submake-libpq submake-libpgport submake-libpgfeutils
 reindexdb: reindexdb.o common.o $(WIN32RES) | submake-libpq submake-libpgport submake-libpgfeutils
 pg_isready: pg_isready.o common.o $(WIN32RES) | submake-libpq submake-libpgport submake-libpgfeutils
+pg_repackdb: pg_repackdb.o vacuuming.o common.o $(WIN32RES) | submake-libpq submake-libpgport submake-libpgfeutils
 
 install: all installdirs
 	$(INSTALL_PROGRAM) createdb$(X)   '$(DESTDIR)$(bindir)'/createdb$(X)
@@ -41,6 +42,7 @@ install: all installdirs
 	$(INSTALL_PROGRAM) vacuumdb$(X)   '$(DESTDIR)$(bindir)'/vacuumdb$(X)
 	$(INSTALL_PROGRAM) reindexdb$(X)  '$(DESTDIR)$(bindir)'/reindexdb$(X)
 	$(INSTALL_PROGRAM) pg_isready$(X) '$(DESTDIR)$(bindir)'/pg_isready$(X)
+	$(INSTALL_PROGRAM) pg_repackdb$(X) '$(DESTDIR)$(bindir)'/pg_repackdb$(X)
 
 installdirs:
 	$(MKDIR_P) '$(DESTDIR)$(bindir)'
diff --git a/src/bin/scripts/meson.build b/src/bin/scripts/meson.build
index a4fed59d1c9..18410fb80dd 100644
--- a/src/bin/scripts/meson.build
+++ b/src/bin/scripts/meson.build
@@ -42,6 +42,7 @@ vacuuming_common = static_library('libvacuuming_common',
 
 binaries = [
   'vacuumdb',
+  'pg_repackdb',
 ]
 foreach binary : binaries
   binary_sources = files('@[email protected]'.format(binary))
@@ -80,6 +81,7 @@ tests += {
       't/100_vacuumdb.pl',
       't/101_vacuumdb_all.pl',
       't/102_vacuumdb_stages.pl',
+      't/103_repackdb.pl',
       't/200_connstr.pl',
     ],
   },
diff --git a/src/bin/scripts/pg_repackdb.c b/src/bin/scripts/pg_repackdb.c
new file mode 100644
index 00000000000..23326372a77
--- /dev/null
+++ b/src/bin/scripts/pg_repackdb.c
@@ -0,0 +1,226 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_repackdb
+ *		A utility to run REPACK
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * FIXME: this is missing a way to specify the index to use to repack one
+ * table, or whether to pass a USING INDEX clause when multiple tables are
+ * used.  Something like --index[=indexname].  Adding that bleeds into
+ * vacuuming.c as well.
+ *
+ * src/bin/scripts/pg_repackdb.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <limits.h>
+
+#include "common.h"
+#include "common/logging.h"
+#include "fe_utils/option_utils.h"
+#include "vacuuming.h"
+
+static void help(const char *progname);
+void		check_objfilter(void);
+
+int
+main(int argc, char *argv[])
+{
+	static struct option long_options[] = {
+		{"host", required_argument, NULL, 'h'},
+		{"port", required_argument, NULL, 'p'},
+		{"username", required_argument, NULL, 'U'},
+		{"no-password", no_argument, NULL, 'w'},
+		{"password", no_argument, NULL, 'W'},
+		{"echo", no_argument, NULL, 'e'},
+		{"quiet", no_argument, NULL, 'q'},
+		{"dbname", required_argument, NULL, 'd'},
+		{"all", no_argument, NULL, 'a'},
+		{"table", required_argument, NULL, 't'},
+		{"verbose", no_argument, NULL, 'v'},
+		{"jobs", required_argument, NULL, 'j'},
+		{"schema", required_argument, NULL, 'n'},
+		{"exclude-schema", required_argument, NULL, 'N'},
+		{"maintenance-db", required_argument, NULL, 2},
+		{NULL, 0, NULL, 0}
+	};
+
+	const char *progname;
+	int			optindex;
+	int			c;
+	const char *dbname = NULL;
+	const char *maintenance_db = NULL;
+	ConnParams	cparams;
+	bool		echo = false;
+	bool		quiet = false;
+	vacuumingOptions vacopts;
+	SimpleStringList objects = {NULL, NULL};
+	int			concurrentCons = 1;
+	int			tbl_count = 0;
+
+	/* initialize options */
+	memset(&vacopts, 0, sizeof(vacopts));
+	vacopts.mode = MODE_REPACK;
+
+	/* the same for connection parameters */
+	memset(&cparams, 0, sizeof(cparams));
+	cparams.prompt_password = TRI_DEFAULT;
+
+	pg_logging_init(argv[0]);
+	progname = get_progname(argv[0]);
+	set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("pgscripts"));
+
+	handle_help_version_opts(argc, argv, progname, help);
+
+	while ((c = getopt_long(argc, argv, "ad:eh:j:n:N:p:qt:U:vwW",
+							long_options, &optindex)) != -1)
+	{
+		switch (c)
+		{
+			case 'a':
+				objfilter |= OBJFILTER_ALL_DBS;
+				break;
+			case 'd':
+				objfilter |= OBJFILTER_DATABASE;
+				dbname = pg_strdup(optarg);
+				break;
+			case 'e':
+				echo = true;
+				break;
+			case 'h':
+				cparams.pghost = pg_strdup(optarg);
+				break;
+			case 'j':
+				if (!option_parse_int(optarg, "-j/--jobs", 1, INT_MAX,
+									  &concurrentCons))
+					exit(1);
+				break;
+			case 'n':
+				objfilter |= OBJFILTER_SCHEMA;
+				simple_string_list_append(&objects, optarg);
+				break;
+			case 'N':
+				objfilter |= OBJFILTER_SCHEMA_EXCLUDE;
+				simple_string_list_append(&objects, optarg);
+				break;
+			case 'p':
+				cparams.pgport = pg_strdup(optarg);
+				break;
+			case 'q':
+				quiet = true;
+				break;
+			case 't':
+				objfilter |= OBJFILTER_TABLE;
+				simple_string_list_append(&objects, optarg);
+				tbl_count++;
+				break;
+			case 'U':
+				cparams.pguser = pg_strdup(optarg);
+				break;
+			case 'v':
+				vacopts.verbose = true;
+				break;
+			case 'w':
+				cparams.prompt_password = TRI_NO;
+				break;
+			case 'W':
+				cparams.prompt_password = TRI_YES;
+				break;
+			case 2:
+				maintenance_db = pg_strdup(optarg);
+				break;
+			default:
+				/* getopt_long already emitted a complaint */
+				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+				exit(1);
+		}
+	}
+
+	/*
+	 * Non-option argument specifies database name as long as it wasn't
+	 * already specified with -d / --dbname
+	 */
+	if (optind < argc && dbname == NULL)
+	{
+		objfilter |= OBJFILTER_DATABASE;
+		dbname = argv[optind];
+		optind++;
+	}
+
+	if (optind < argc)
+	{
+		pg_log_error("too many command-line arguments (first is \"%s\")",
+					 argv[optind]);
+		pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+		exit(1);
+	}
+
+	/*
+	 * Validate the combination of filters specified in the command-line
+	 * options.
+	 */
+	check_objfilter();
+
+	vacuuming_main(&cparams, dbname, maintenance_db, &vacopts, &objects,
+				   false, tbl_count, concurrentCons,
+				   progname, echo, quiet);
+	exit(0);
+}
+
+/*
+ * Verify that the filters used at command line are compatible.
+ */
+void
+check_objfilter(void)
+{
+	if ((objfilter & OBJFILTER_ALL_DBS) &&
+		(objfilter & OBJFILTER_DATABASE))
+		pg_fatal("cannot repack all databases and a specific one at the same time");
+
+	if ((objfilter & OBJFILTER_TABLE) &&
+		(objfilter & OBJFILTER_SCHEMA))
+		pg_fatal("cannot repack all tables in schema(s) and specific table(s) at the same time");
+
+	if ((objfilter & OBJFILTER_TABLE) &&
+		(objfilter & OBJFILTER_SCHEMA_EXCLUDE))
+		pg_fatal("cannot repack specific table(s) and exclude schema(s) at the same time");
+
+	if ((objfilter & OBJFILTER_SCHEMA) &&
+		(objfilter & OBJFILTER_SCHEMA_EXCLUDE))
+		pg_fatal("cannot repack all tables in schema(s) and exclude schema(s) at the same time");
+}
+
+static void
+help(const char *progname)
+{
+	printf(_("%s repacks a PostgreSQL database.\n\n"), progname);
+	printf(_("Usage:\n"));
+	printf(_("  %s [OPTION]... [DBNAME]\n"), progname);
+	printf(_("\nOptions:\n"));
+	printf(_("  -a, --all                       repack all databases\n"));
+	printf(_("  -d, --dbname=DBNAME             database to repack\n"));
+	printf(_("  -e, --echo                      show the commands being sent to the server\n"));
+	printf(_("  -j, --jobs=NUM                  use this many concurrent connections to repack\n"));
+	printf(_("  -n, --schema=SCHEMA             repack tables in the specified schema(s) only\n"));
+	printf(_("  -N, --exclude-schema=SCHEMA     do not repack tables in the specified schema(s)\n"));
+	printf(_("  -q, --quiet                     don't write any messages\n"));
+	printf(_("  -t, --table='TABLE'             repack specific table(s) only\n"));
+	printf(_("  -v, --verbose                   write a lot of output\n"));
+	printf(_("  -V, --version                   output version information, then exit\n"));
+	printf(_("  -?, --help                      show this help, then exit\n"));
+	printf(_("\nConnection options:\n"));
+	printf(_("  -h, --host=HOSTNAME       database server host or socket directory\n"));
+	printf(_("  -p, --port=PORT           database server port\n"));
+	printf(_("  -U, --username=USERNAME   user name to connect as\n"));
+	printf(_("  -w, --no-password         never prompt for password\n"));
+	printf(_("  -W, --password            force password prompt\n"));
+	printf(_("  --maintenance-db=DBNAME   alternate maintenance database\n"));
+	printf(_("\nRead the description of the SQL command REPACK for details.\n"));
+	printf(_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
+	printf(_("%s home page: <%s>\n"), PACKAGE_NAME, PACKAGE_URL);
+}
diff --git a/src/bin/scripts/t/103_repackdb.pl b/src/bin/scripts/t/103_repackdb.pl
new file mode 100644
index 00000000000..51de4d7ab34
--- /dev/null
+++ b/src/bin/scripts/t/103_repackdb.pl
@@ -0,0 +1,24 @@
+# Copyright (c) 2021-2025, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+program_help_ok('pg_repackdb');
+program_version_ok('pg_repackdb');
+program_options_handling_ok('pg_repackdb');
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+$node->init;
+$node->start;
+
+$node->issues_sql_like(
+	[ 'pg_repackdb', 'postgres' ],
+	qr/statement: REPACK.*;/,
+	'SQL REPACK run');
+
+
+done_testing();
diff --git a/src/bin/scripts/vacuuming.c b/src/bin/scripts/vacuuming.c
index 9be37fcc45a..e07071c38ee 100644
--- a/src/bin/scripts/vacuuming.c
+++ b/src/bin/scripts/vacuuming.c
@@ -1,6 +1,6 @@
 /*-------------------------------------------------------------------------
  * vacuuming.c
- *		Common routines for vacuumdb
+ *		Common routines for vacuumdb and pg_repackdb
  *
  * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -166,6 +166,14 @@ vacuum_one_database(ConnParams *cparams,
 
 	conn = connectDatabase(cparams, progname, echo, false, true);
 
+	if (vacopts->mode == MODE_REPACK && PQserverVersion(conn) < 190000)
+	{
+		/* XXX arguably, here we should use VACUUM FULL instead of failing */
+		PQfinish(conn);
+		pg_fatal("cannot use the \"%s\" command on server versions older than PostgreSQL %s",
+				 "REPACK", "19");
+	}
+
 	if (vacopts->disable_page_skipping && PQserverVersion(conn) < 90600)
 	{
 		PQfinish(conn);
@@ -258,9 +266,15 @@ vacuum_one_database(ConnParams *cparams,
 		if (stage != ANALYZE_NO_STAGE)
 			printf(_("%s: processing database \"%s\": %s\n"),
 				   progname, PQdb(conn), _(stage_messages[stage]));
-		else
+		else if (vacopts->mode == MODE_VACUUM)
 			printf(_("%s: vacuuming database \"%s\"\n"),
 				   progname, PQdb(conn));
+		else
+		{
+			Assert(vacopts->mode == MODE_REPACK);
+			printf(_("%s: repacking database \"%s\"\n"),
+				   progname, PQdb(conn));
+		}
 		fflush(stdout);
 	}
 
@@ -350,7 +364,7 @@ vacuum_one_database(ConnParams *cparams,
 		 * through ParallelSlotsGetIdle.
 		 */
 		ParallelSlotSetHandler(free_slot, TableCommandResultHandler, NULL);
-		run_vacuum_command(free_slot->connection, sql.data,
+		run_vacuum_command(free_slot->connection, vacopts, sql.data,
 						   echo, tabname);
 
 		cell = cell->next;
@@ -363,7 +377,7 @@ vacuum_one_database(ConnParams *cparams,
 	}
 
 	/* If we used SKIP_DATABASE_STATS, mop up with ONLY_DATABASE_STATS */
-	if (vacopts->skip_database_stats &&
+	if (vacopts->mode == MODE_VACUUM && vacopts->skip_database_stats &&
 		stage == ANALYZE_NO_STAGE)
 	{
 		const char *cmd = "VACUUM (ONLY_DATABASE_STATS);";
@@ -376,7 +390,7 @@ vacuum_one_database(ConnParams *cparams,
 		}
 
 		ParallelSlotSetHandler(free_slot, TableCommandResultHandler, NULL);
-		run_vacuum_command(free_slot->connection, cmd, echo, NULL);
+		run_vacuum_command(free_slot->connection, vacopts, cmd, echo, NULL);
 
 		if (!ParallelSlotsWaitCompletion(sa))
 			failed = true;
@@ -708,6 +722,12 @@ vacuum_all_databases(ConnParams *cparams,
 	int			i;
 
 	conn = connectMaintenanceDatabase(cparams, progname, echo);
+	if (vacopts->mode == MODE_REPACK && PQserverVersion(conn) < 190000)
+	{
+		PQfinish(conn);
+		pg_fatal("cannot use the \"%s\" command on server versions older than PostgreSQL %s",
+				 "REPACK", "19");
+	}
 	result = executeQuery(conn,
 						  "SELECT datname FROM pg_database WHERE datallowconn AND datconnlimit <> -2 ORDER BY 1;",
 						  echo);
@@ -761,7 +781,7 @@ vacuum_all_databases(ConnParams *cparams,
 }
 
 /*
- * Construct a vacuum/analyze command to run based on the given
+ * Construct a vacuum/analyze/repack command to run based on the given
  * options, in the given string buffer, which may contain previous garbage.
  *
  * The table name used must be already properly quoted.  The command generated
@@ -777,7 +797,13 @@ prepare_vacuum_command(PQExpBuffer sql, int serverVersion,
 
 	resetPQExpBuffer(sql);
 
-	if (vacopts->analyze_only)
+	if (vacopts->mode == MODE_REPACK)
+	{
+		appendPQExpBufferStr(sql, "REPACK");
+		if (vacopts->verbose)
+			appendPQExpBufferStr(sql, " (VERBOSE)");
+	}
+	else if (vacopts->analyze_only)
 	{
 		appendPQExpBufferStr(sql, "ANALYZE");
 
@@ -938,8 +964,8 @@ prepare_vacuum_command(PQExpBuffer sql, int serverVersion,
  * Any errors during command execution are reported to stderr.
  */
 void
-run_vacuum_command(PGconn *conn, const char *sql, bool echo,
-				   const char *table)
+run_vacuum_command(PGconn *conn, vacuumingOptions *vacopts,
+				   const char *sql, bool echo, const char *table)
 {
 	bool		status;
 
@@ -952,13 +978,21 @@ run_vacuum_command(PGconn *conn, const char *sql, bool echo,
 	{
 		if (table)
 		{
-			pg_log_error("vacuuming of table \"%s\" in database \"%s\" failed: %s",
-						 table, PQdb(conn), PQerrorMessage(conn));
+			if (vacopts->mode == MODE_VACUUM)
+				pg_log_error("vacuuming of table \"%s\" in database \"%s\" failed: %s",
+							 table, PQdb(conn), PQerrorMessage(conn));
+			else
+				pg_log_error("repacking of table \"%s\" in database \"%s\" failed: %s",
+							 table, PQdb(conn), PQerrorMessage(conn));
 		}
 		else
 		{
-			pg_log_error("vacuuming of database \"%s\" failed: %s",
-						 PQdb(conn), PQerrorMessage(conn));
+			if (vacopts->mode == MODE_VACUUM)
+				pg_log_error("vacuuming of database \"%s\" failed: %s",
+							 PQdb(conn), PQerrorMessage(conn));
+			else
+				pg_log_error("repacking of database \"%s\" failed: %s",
+							 PQdb(conn), PQerrorMessage(conn));
 		}
 	}
 }
diff --git a/src/bin/scripts/vacuuming.h b/src/bin/scripts/vacuuming.h
index d3f000840fa..154bc9925c0 100644
--- a/src/bin/scripts/vacuuming.h
+++ b/src/bin/scripts/vacuuming.h
@@ -17,6 +17,12 @@
 #include "fe_utils/connect_utils.h"
 #include "fe_utils/simple_list.h"
 
+typedef enum
+{
+	MODE_VACUUM,
+	MODE_REPACK
+} RunMode;
+
 /* For analyze-in-stages mode */
 #define ANALYZE_NO_STAGE	-1
 #define ANALYZE_NUM_STAGES	3
@@ -24,6 +30,7 @@
 /* vacuum options controlled by user flags */
 typedef struct vacuumingOptions
 {
+	RunMode		mode;
 	bool		analyze_only;
 	bool		verbose;
 	bool		and_analyze;
@@ -87,8 +94,8 @@ extern void vacuum_all_databases(ConnParams *cparams,
 extern void prepare_vacuum_command(PQExpBuffer sql, int serverVersion,
 								   vacuumingOptions *vacopts, const char *table);
 
-extern void run_vacuum_command(PGconn *conn, const char *sql, bool echo,
-							   const char *table);
+extern void run_vacuum_command(PGconn *conn, vacuumingOptions *vacopts,
+							   const char *sql, bool echo, const char *table);
 
 extern char *escape_quotes(const char *src);
 
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 60088a64cbb..890998d84bb 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -24,6 +24,7 @@
 #define CLUOPT_RECHECK 0x02		/* recheck relation state */
 #define CLUOPT_RECHECK_ISCLUSTERED 0x04 /* recheck relation state for
 										 * indisclustered */
+#define CLUOPT_ANALYZE 0x08		/* do an ANALYZE */
 
 /* options for CLUSTER */
 typedef struct ClusterParams
@@ -31,8 +32,11 @@ typedef struct ClusterParams
 	bits32		options;		/* bitmask of CLUOPT_* */
 } ClusterParams;
 
-extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params);
+
+extern void ExecRepack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel);
+
+extern void cluster_rel(RepackCommand command, bool usingindex,
+						Relation OldHeap, Oid indexOid, ClusterParams *params);
 extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
 									   LOCKMODE lockmode);
 extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 1cde4bd9bcf..5b6639c114c 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -56,24 +56,51 @@
 #define PROGRESS_ANALYZE_PHASE_COMPUTE_EXT_STATS		4
 #define PROGRESS_ANALYZE_PHASE_FINALIZE_ANALYZE			5
 
-/* Progress parameters for cluster */
-#define PROGRESS_CLUSTER_COMMAND				0
-#define PROGRESS_CLUSTER_PHASE					1
-#define PROGRESS_CLUSTER_INDEX_RELID			2
-#define PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED	3
-#define PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN	4
-#define PROGRESS_CLUSTER_TOTAL_HEAP_BLKS		5
-#define PROGRESS_CLUSTER_HEAP_BLKS_SCANNED		6
-#define PROGRESS_CLUSTER_INDEX_REBUILD_COUNT	7
+/*
+ * Progress parameters for REPACK.
+ *
+ * Note: Since REPACK shares some code with CLUSTER, these values are also
+ * used by CLUSTER. (CLUSTER is now deprecated, so it makes little sense to
+ * introduce a separate set of constants.)
+ */
+#define PROGRESS_REPACK_COMMAND					0
+#define PROGRESS_REPACK_PHASE					1
+#define PROGRESS_REPACK_INDEX_RELID				2
+#define PROGRESS_REPACK_HEAP_TUPLES_SCANNED		3
+#define PROGRESS_REPACK_HEAP_TUPLES_WRITTEN		4
+#define PROGRESS_REPACK_TOTAL_HEAP_BLKS			5
+#define PROGRESS_REPACK_HEAP_BLKS_SCANNED		6
+#define PROGRESS_REPACK_INDEX_REBUILD_COUNT		7
 
-/* Phases of cluster (as advertised via PROGRESS_CLUSTER_PHASE) */
-#define PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP	1
-#define PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP	2
-#define PROGRESS_CLUSTER_PHASE_SORT_TUPLES		3
-#define PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP	4
-#define PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES	5
-#define PROGRESS_CLUSTER_PHASE_REBUILD_INDEX	6
-#define PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP	7
+/*
+ * Phases of repack (as advertised via PROGRESS_REPACK_PHASE).
+ */
+#define PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP		1
+#define PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP	2
+#define PROGRESS_REPACK_PHASE_SORT_TUPLES		3
+#define PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP	4
+#define PROGRESS_REPACK_PHASE_SWAP_REL_FILES	5
+#define PROGRESS_REPACK_PHASE_REBUILD_INDEX		6
+#define PROGRESS_REPACK_PHASE_FINAL_CLEANUP		7
+
+/*
+ * Commands of PROGRESS_REPACK
+ *
+ * Currently we only have one command, so the PROGRESS_REPACK_COMMAND
+ * parameter is not necessary. However, it makes cluster.c simpler if we have
+ * the same set of parameters for CLUSTER and REPACK - see the note on REPACK
+ * parameters above.
+ */
+#define PROGRESS_REPACK_COMMAND_REPACK			1
+
+/*
+ * Progress parameters for cluster.
+ *
+ * Although we need to report REPACK and CLUSTER in separate views, the
+ * parameters and phases of CLUSTER are a subset of those of REPACK. Therefore
+ * we just use the appropriate values defined for REPACK above instead of
+ * defining a separate set of constants here.
+ */
 
 /* Commands of PROGRESS_CLUSTER */
 #define PROGRESS_CLUSTER_COMMAND_CLUSTER		1
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 86a236bd58b..fcc25a0c592 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3949,16 +3949,26 @@ typedef struct AlterSystemStmt
 } AlterSystemStmt;
 
 /* ----------------------
- *		Cluster Statement (support pbrown's cluster index implementation)
+ *		Repack Statement
  * ----------------------
  */
-typedef struct ClusterStmt
+typedef enum RepackCommand
+{
+	REPACK_COMMAND_CLUSTER,
+	REPACK_COMMAND_REPACK,
+	REPACK_COMMAND_VACUUMFULL,
+} RepackCommand;
+
+typedef struct RepackStmt
 {
 	NodeTag		type;
-	RangeVar   *relation;		/* relation being indexed, or NULL if all */
-	char	   *indexname;		/* original index defined */
+	RepackCommand command;		/* type of command being run */
+	RangeVar   *relation;		/* relation being repacked */
+	char	   *indexname;		/* order tuples by this index */
+	bool		usingindex;		/* whether USING INDEX is specified */
 	List	   *params;			/* list of DefElem nodes */
-} ClusterStmt;
+} RepackStmt;
+
 
 /* ----------------------
  *		Vacuum and Analyze Statements
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index a4af3f717a1..22559369e2c 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -374,6 +374,7 @@ PG_KEYWORD("reindex", REINDEX, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("relative", RELATIVE_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("release", RELEASE, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("rename", RENAME, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("repack", REPACK, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("repeatable", REPEATABLE, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("replace", REPLACE, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("replica", REPLICA, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index d250a714d59..cceb312f2b3 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -196,6 +196,7 @@ PG_CMDTAG(CMDTAG_REASSIGN_OWNED, "REASSIGN OWNED", false, false, false)
 PG_CMDTAG(CMDTAG_REFRESH_MATERIALIZED_VIEW, "REFRESH MATERIALIZED VIEW", true, false, false)
 PG_CMDTAG(CMDTAG_REINDEX, "REINDEX", true, false, false)
 PG_CMDTAG(CMDTAG_RELEASE, "RELEASE", false, false, false)
+PG_CMDTAG(CMDTAG_REPACK, "REPACK", false, false, false)
 PG_CMDTAG(CMDTAG_RESET, "RESET", false, false, false)
 PG_CMDTAG(CMDTAG_REVOKE, "REVOKE", true, false, false)
 PG_CMDTAG(CMDTAG_REVOKE_ROLE, "REVOKE ROLE", false, false, false)
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index dda813ab407..e69e366dcdc 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -28,6 +28,7 @@ typedef enum ProgressCommandType
 	PROGRESS_COMMAND_CREATE_INDEX,
 	PROGRESS_COMMAND_BASEBACKUP,
 	PROGRESS_COMMAND_COPY,
+	PROGRESS_COMMAND_REPACK,
 } ProgressCommandType;
 
 #define PGSTAT_NUM_PROGRESS_PARAM	20
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 4d40a6809ab..5256628b51d 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -254,6 +254,63 @@ ORDER BY 1;
  clstr_tst_pkey
 (3 rows)
 
+-- REPACK handles individual tables identically to CLUSTER, but it's worth
+-- checking if it handles table hierarchies identically as well.
+REPACK clstr_tst USING INDEX clstr_tst_c;
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 2');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+ a  |  b  |        c         |           substring            | length 
+----+-----+------------------+--------------------------------+--------
+ 10 |  14 | catorce          |                                |       
+ 18 |   5 | cinco            |                                |       
+  9 |   4 | cuatro           |                                |       
+ 26 |  19 | diecinueve       |                                |       
+ 12 |  18 | dieciocho        |                                |       
+ 30 |  16 | dieciseis        |                                |       
+ 24 |  17 | diecisiete       |                                |       
+  2 |  10 | diez             |                                |       
+ 23 |  12 | doce             |                                |       
+ 11 |   2 | dos              |                                |       
+ 25 |   9 | nueve            |                                |       
+ 31 |   8 | ocho             |                                |       
+  1 |  11 | once             |                                |       
+ 28 |  15 | quince           |                                |       
+ 32 |   6 | seis             | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 |   7 | siete            |                                |       
+ 15 |  13 | trece            |                                |       
+ 22 |  30 | treinta          |                                |       
+ 17 |  32 | treinta y dos    |                                |       
+  3 |  31 | treinta y uno    |                                |       
+  5 |   3 | tres             |                                |       
+ 20 |   1 | uno              |                                |       
+  6 |  20 | veinte           |                                |       
+ 14 |  25 | veinticinco      |                                |       
+ 21 |  24 | veinticuatro     |                                |       
+  4 |  22 | veintidos        |                                |       
+ 19 |  29 | veintinueve      |                                |       
+ 16 |  28 | veintiocho       |                                |       
+ 27 |  26 | veintiseis       |                                |       
+ 13 |  27 | veintisiete      |                                |       
+  7 |  23 | veintitres       |                                |       
+  8 |  21 | veintiuno        |                                |       
+  0 | 100 | in child table   |                                |       
+  0 | 100 | in child table 2 |                                |       
+(34 rows)
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+ERROR:  insert or update on table "clstr_tst" violates foreign key constraint "clstr_tst_con"
+DETAIL:  Key (b)=(1111) is not present in table "clstr_tst_s".
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+       conname        
+----------------------
+ clstr_tst_a_not_null
+ clstr_tst_con
+ clstr_tst_pkey
+(3 rows)
+
 SELECT relname, relkind,
     EXISTS(SELECT 1 FROM pg_class WHERE oid = c.reltoastrelid) AS hastoast
 FROM pg_class c WHERE relname LIKE 'clstr_tst%' ORDER BY relname;
@@ -381,6 +438,35 @@ SELECT * FROM clstr_1;
  2
 (2 rows)
 
+-- REPACK w/o argument performs no ordering, so we can only check which tables
+-- have the relfilenode changed.
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_old AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+SET SESSION AUTHORIZATION regress_clstr_user;
+SET client_min_messages = ERROR;  -- order of "skipping" warnings may vary
+REPACK;
+RESET client_min_messages;
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_new AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+-- Do the actual comparison. Unlike CLUSTER, clstr_3 should have been
+-- processed because there is nothing like clustering index here.
+SELECT o.relname FROM relnodes_old o
+JOIN relnodes_new n ON o.relname = n.relname
+WHERE o.relfilenode <> n.relfilenode
+ORDER BY o.relname;
+ relname 
+---------
+ clstr_1
+ clstr_3
+(2 rows)
+
+SET SESSION AUTHORIZATION regress_clstr_user;
 -- Test MVCC-safety of cluster. There isn't much we can do to verify the
 -- results with a single backend...
 CREATE TABLE clustertest (key int PRIMARY KEY);
@@ -495,6 +581,43 @@ ALTER TABLE clstrpart SET WITHOUT CLUSTER;
 ERROR:  cannot mark index clustered in partitioned table
 ALTER TABLE clstrpart CLUSTER ON clstrpart_idx;
 ERROR:  cannot mark index clustered in partitioned table
+-- Check that REPACK sets new relfilenodes: it should process exactly the same
+-- tables as CLUSTER did.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart USING INDEX clstrpart_idx;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+   relname   | level | relkind | ?column? 
+-------------+-------+---------+----------
+ clstrpart   |     0 | p       | t
+ clstrpart1  |     1 | p       | t
+ clstrpart11 |     2 | r       | f
+ clstrpart12 |     2 | p       | t
+ clstrpart2  |     1 | r       | f
+ clstrpart3  |     1 | p       | t
+ clstrpart33 |     2 | r       | f
+(7 rows)
+
+-- And finally the same for REPACK w/o index.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+   relname   | level | relkind | ?column? 
+-------------+-------+---------+----------
+ clstrpart   |     0 | p       | t
+ clstrpart1  |     1 | p       | t
+ clstrpart11 |     2 | r       | f
+ clstrpart12 |     2 | p       | t
+ clstrpart2  |     1 | r       | f
+ clstrpart3  |     1 | p       | t
+ clstrpart33 |     2 | r       | f
+(7 rows)
+
 DROP TABLE clstrpart;
 -- Ownership of partitions is checked
 CREATE TABLE ptnowner(i int unique) PARTITION BY LIST (i);
@@ -513,7 +636,7 @@ CREATE TEMP TABLE ptnowner_oldnodes AS
   JOIN pg_class AS c ON c.oid=tree.relid;
 SET SESSION AUTHORIZATION regress_ptnowner;
 CLUSTER ptnowner USING ptnowner_i_idx;
-WARNING:  permission denied to cluster "ptnowner2", skipping it
+WARNING:  permission denied to execute CLUSTER on "ptnowner2", skipping it
 RESET SESSION AUTHORIZATION;
 SELECT a.relname, a.relfilenode=b.relfilenode FROM pg_class a
   JOIN ptnowner_oldnodes b USING (oid) ORDER BY a.relname COLLATE "C";
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 35e8aad7701..3a1d1d28282 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2071,6 +2071,29 @@ pg_stat_progress_create_index| SELECT s.pid,
     s.param15 AS partitions_done
    FROM (pg_stat_get_progress_info('CREATE INDEX'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
      LEFT JOIN pg_database d ON ((s.datid = d.oid)));
+pg_stat_progress_repack| SELECT s.pid,
+    s.datid,
+    d.datname,
+    s.relid,
+        CASE s.param2
+            WHEN 0 THEN 'initializing'::text
+            WHEN 1 THEN 'seq scanning heap'::text
+            WHEN 2 THEN 'index scanning heap'::text
+            WHEN 3 THEN 'sorting tuples'::text
+            WHEN 4 THEN 'writing new heap'::text
+            WHEN 5 THEN 'swapping relation files'::text
+            WHEN 6 THEN 'rebuilding index'::text
+            WHEN 7 THEN 'performing final cleanup'::text
+            ELSE NULL::text
+        END AS phase,
+    (s.param3)::oid AS repack_index_relid,
+    s.param4 AS heap_tuples_scanned,
+    s.param5 AS heap_tuples_written,
+    s.param6 AS heap_blks_total,
+    s.param7 AS heap_blks_scanned,
+    s.param8 AS index_rebuild_count
+   FROM (pg_stat_get_progress_info('REPACK'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
+     LEFT JOIN pg_database d ON ((s.datid = d.oid)));
 pg_stat_progress_vacuum| SELECT s.pid,
     s.datid,
     d.datname,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index b7115f86104..cfcc3dc9761 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -76,6 +76,19 @@ INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
 SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
 ORDER BY 1;
 
+-- REPACK handles individual tables identically to CLUSTER, but it's worth
+-- checking if it handles table hierarchies identically as well.
+REPACK clstr_tst USING INDEX clstr_tst_c;
+
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 2');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
 
 SELECT relname, relkind,
     EXISTS(SELECT 1 FROM pg_class WHERE oid = c.reltoastrelid) AS hastoast
@@ -159,6 +172,34 @@ INSERT INTO clstr_1 VALUES (1);
 CLUSTER clstr_1;
 SELECT * FROM clstr_1;
 
+-- REPACK w/o argument performs no ordering, so we can only check which tables
+-- have the relfilenode changed.
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_old AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+
+SET SESSION AUTHORIZATION regress_clstr_user;
+SET client_min_messages = ERROR;  -- order of "skipping" warnings may vary
+REPACK;
+RESET client_min_messages;
+
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_new AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+
+-- Do the actual comparison. Unlike CLUSTER, clstr_3 should have been
+-- processed because there is nothing like clustering index here.
+SELECT o.relname FROM relnodes_old o
+JOIN relnodes_new n ON o.relname = n.relname
+WHERE o.relfilenode <> n.relfilenode
+ORDER BY o.relname;
+
+SET SESSION AUTHORIZATION regress_clstr_user;
+
 -- Test MVCC-safety of cluster. There isn't much we can do to verify the
 -- results with a single backend...
 
@@ -229,6 +270,24 @@ SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM o
 CLUSTER clstrpart;
 ALTER TABLE clstrpart SET WITHOUT CLUSTER;
 ALTER TABLE clstrpart CLUSTER ON clstrpart_idx;
+
+-- Check that REPACK sets new relfilenodes: it should process exactly the same
+-- tables as CLUSTER did.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart USING INDEX clstrpart_idx;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+
+-- And finally the same for REPACK w/o index.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+
 DROP TABLE clstrpart;
 
 -- Ownership of partitions is checked
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a13e8162890..98242e25432 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2537,6 +2537,8 @@ ReorderBufferTupleCidKey
 ReorderBufferUpdateProgressTxnCB
 ReorderTuple
 RepOriginId
+RepackCommand
+RepackStmt
 ReparameterizeForeignPathByChild_function
 ReplaceVarsFromTargetList_context
 ReplaceVarsNoMatchOption
@@ -2603,6 +2605,7 @@ RtlNtStatusToDosError_t
 RuleInfo
 RuleLock
 RuleStmt
+RunMode
 RunningTransactions
 RunningTransactionsData
 SASLStatus
-- 
2.43.0



  [application/octet-stream] v21-0004-Move-conversion-of-a-historic-to-MVCC-snapshot-t.patch (5.4K, 6-v21-0004-Move-conversion-of-a-historic-to-MVCC-snapshot-t.patch)
From b9384aa62c96c94d45bb7e97a56acda5590f0c5f Mon Sep 17 00:00:00 2001
From: Antonin Houska <[email protected]>
Date: Mon, 11 Aug 2025 15:23:05 +0200
Subject: [PATCH v21 4/6] Move conversion of a "historic" to MVCC snapshot to a
 separate function.

The conversion is now handled by SnapBuildMVCCFromHistoric(). REPACK
CONCURRENTLY will also need it.
---
 src/backend/replication/logical/snapbuild.c | 51 +++++++++++++++++----
 src/backend/utils/time/snapmgr.c            |  3 +-
 src/include/replication/snapbuild.h         |  1 +
 src/include/utils/snapmgr.h                 |  1 +
 4 files changed, 45 insertions(+), 11 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 98ddee20929..a2f1803622c 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -440,10 +440,7 @@ Snapshot
 SnapBuildInitialSnapshot(SnapBuild *builder)
 {
 	Snapshot	snap;
-	TransactionId xid;
 	TransactionId safeXid;
-	TransactionId *newxip;
-	int			newxcnt = 0;
 
 	Assert(XactIsoLevel == XACT_REPEATABLE_READ);
 	Assert(builder->building_full_snapshot);
@@ -485,6 +482,31 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 
 	MyProc->xmin = snap->xmin;
 
+	/* Convert the historic snapshot to MVCC snapshot. */
+	return SnapBuildMVCCFromHistoric(snap, true);
+}
+
+/*
+ * Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
+ *
+ * Unlike a regular (non-historic) MVCC snapshot, the xip array of this
+ * snapshot contains not only running main transactions, but also their
+ * subtransactions. This difference has no impact on XidInMVCCSnapshot().
+ *
+ * Pass true for 'in_place' if you don't care about modifying the source
+ * snapshot. If you need a new instance, and one that was allocated as a
+ * single chunk of memory, pass false.
+ */
+Snapshot
+SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place)
+{
+	TransactionId xid;
+	TransactionId *oldxip = snapshot->xip;
+	uint32		oldxcnt = snapshot->xcnt;
+	TransactionId *newxip;
+	int			newxcnt = 0;
+	Snapshot	result;
+
 	/* allocate in transaction context */
 	newxip = (TransactionId *)
 		palloc(sizeof(TransactionId) * GetMaxSnapshotXidCount());
@@ -495,7 +517,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 	 * classical snapshot by marking all non-committed transactions as
 	 * in-progress. This can be expensive.
 	 */
-	for (xid = snap->xmin; NormalTransactionIdPrecedes(xid, snap->xmax);)
+	for (xid = snapshot->xmin; NormalTransactionIdPrecedes(xid, snapshot->xmax);)
 	{
 		void	   *test;
 
@@ -503,7 +525,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 		 * Check whether transaction committed using the decoding snapshot
 		 * meaning of ->xip.
 		 */
-		test = bsearch(&xid, snap->xip, snap->xcnt,
+		test = bsearch(&xid, snapshot->xip, snapshot->xcnt,
 					   sizeof(TransactionId), xidComparator);
 
 		if (test == NULL)
@@ -520,11 +542,22 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 	}
 
 	/* adjust remaining snapshot fields as needed */
-	snap->snapshot_type = SNAPSHOT_MVCC;
-	snap->xcnt = newxcnt;
-	snap->xip = newxip;
+	snapshot->xcnt = newxcnt;
+	snapshot->xip = newxip;
 
-	return snap;
+	if (in_place)
+		result = snapshot;
+	else
+	{
+		result = CopySnapshot(snapshot);
+
+		/* Restore the original values so the source is intact. */
+		snapshot->xip = oldxip;
+		snapshot->xcnt = oldxcnt;
+	}
+	result->snapshot_type = SNAPSHOT_MVCC;
+
+	return result;
 }
 
 /*
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 65561cc6bc3..bc7840052fe 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -212,7 +212,6 @@ typedef struct ExportedSnapshot
 static List *exportedSnapshots = NIL;
 
 /* Prototypes for local functions */
-static Snapshot CopySnapshot(Snapshot snapshot);
 static void UnregisterSnapshotNoOwner(Snapshot snapshot);
 static void FreeSnapshot(Snapshot snapshot);
 static void SnapshotResetXmin(void);
@@ -602,7 +601,7 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
  * The copy is palloc'd in TopTransactionContext and has initial refcounts set
  * to 0.  The returned snapshot has the copied flag set.
  */
-static Snapshot
+Snapshot
 CopySnapshot(Snapshot snapshot)
 {
 	Snapshot	newsnap;
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 44031dcf6e3..6d4d2d1814c 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -73,6 +73,7 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
 
 extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
 extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
 extern void SnapBuildClearExportedSnapshot(void);
 extern void SnapBuildResetExportedSnapshotState(void);
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 604c1f90216..f65f83c85cd 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -63,6 +63,7 @@ extern Snapshot GetTransactionSnapshot(void);
 extern Snapshot GetLatestSnapshot(void);
 extern void SnapshotSetCommandId(CommandId curcid);
 
+extern Snapshot CopySnapshot(Snapshot snapshot);
 extern Snapshot GetCatalogSnapshot(Oid relid);
 extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
 extern void InvalidateCatalogSnapshot(void);
-- 
2.43.0
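
For readers following 0004, the in_place contract of SnapBuildMVCCFromHistoric() can be illustrated with a standalone toy model. This is only a sketch under stated assumptions: ToySnapshot, toy_copy_snapshot() and toy_mvcc_from_historic() are hypothetical names that do not exist in PostgreSQL, plain malloc() stands in for palloc(), and XID wraparound (which the real loop handles through NormalTransactionIdPrecedes()) is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for TransactionId and SnapshotData. */
typedef uint32_t ToyXid;

typedef enum
{
	TOY_HISTORIC,
	TOY_MVCC
} ToySnapshotType;

typedef struct ToySnapshot
{
	ToySnapshotType type;
	ToyXid		xmin;
	ToyXid		xmax;
	ToyXid	   *xip;	/* historic: committed xids; MVCC: running xids */
	uint32_t	xcnt;
} ToySnapshot;

/* Comparator for bsearch(); the xip array is assumed to be sorted. */
static int
toy_xid_cmp(const void *a, const void *b)
{
	ToyXid		x = *(const ToyXid *) a;
	ToyXid		y = *(const ToyXid *) b;

	return (x > y) - (x < y);
}

/* Deep copy, playing the role that CopySnapshot() plays in the patch. */
static ToySnapshot *
toy_copy_snapshot(const ToySnapshot *src)
{
	ToySnapshot *dst = malloc(sizeof(ToySnapshot));

	*dst = *src;
	dst->xip = malloc(sizeof(ToyXid) * (src->xcnt > 0 ? src->xcnt : 1));
	memcpy(dst->xip, src->xip, sizeof(ToyXid) * src->xcnt);
	return dst;
}

/*
 * Every xid in [xmin, xmax) that is not listed as committed is treated as
 * in progress.  With in_place = false the source snapshot is restored
 * before returning, mirroring the structure of the patched function.
 */
static ToySnapshot *
toy_mvcc_from_historic(ToySnapshot *snap, bool in_place)
{
	ToyXid	   *oldxip = snap->xip;
	uint32_t	oldxcnt = snap->xcnt;
	uint32_t	range = snap->xmax - snap->xmin;
	ToyXid	   *newxip = malloc(sizeof(ToyXid) * (range > 0 ? range : 1));
	uint32_t	newxcnt = 0;
	ToySnapshot *result;

	for (ToyXid xid = snap->xmin; xid < snap->xmax; xid++)
	{
		if (bsearch(&xid, oldxip, oldxcnt,
					sizeof(ToyXid), toy_xid_cmp) == NULL)
			newxip[newxcnt++] = xid;
	}

	snap->xip = newxip;
	snap->xcnt = newxcnt;

	if (in_place)
		result = snap;		/* the old array is left to the caller */
	else
	{
		result = toy_copy_snapshot(snap);

		/* Restore the original values so the source is intact. */
		snap->xip = oldxip;
		snap->xcnt = oldxcnt;
		free(newxip);		/* the copy owns its own array */
	}
	result->type = TOY_MVCC;

	return result;
}
```

In this model, calling toy_mvcc_from_historic(&snap, false) returns a separate TOY_MVCC snapshot and leaves snap untouched, while passing true converts snap itself and returns the same pointer. Whether a caller such as REPACK CONCURRENTLY needs the copying variant depends on whether it wants to keep using the historic snapshot afterwards.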



  [application/octet-stream] v21-0001-Split-vacuumdb-to-create-vacuuming.c-h.patch (69.3K, 7-v21-0001-Split-vacuumdb-to-create-vacuuming.c-h.patch)
From 2206b215a8855cf8a9c29889f5feab4a0bd8a7e0 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=81lvaro=20Herrera?= <[email protected]>
Date: Sat, 30 Aug 2025 14:39:49 +0200
Subject: [PATCH v21 1/6] Split vacuumdb to create vacuuming.c/h

---
 src/bin/scripts/Makefile    |    4 +-
 src/bin/scripts/meson.build |   28 +-
 src/bin/scripts/vacuumdb.c  | 1048 +----------------------------------
 src/bin/scripts/vacuuming.c |  978 ++++++++++++++++++++++++++++++++
 src/bin/scripts/vacuuming.h |   95 ++++
 5 files changed, 1119 insertions(+), 1034 deletions(-)
 create mode 100644 src/bin/scripts/vacuuming.c
 create mode 100644 src/bin/scripts/vacuuming.h

diff --git a/src/bin/scripts/Makefile b/src/bin/scripts/Makefile
index f6b4d40810b..019ca06455d 100644
--- a/src/bin/scripts/Makefile
+++ b/src/bin/scripts/Makefile
@@ -28,7 +28,7 @@ createuser: createuser.o common.o $(WIN32RES) | submake-libpq submake-libpgport
 dropdb: dropdb.o common.o $(WIN32RES) | submake-libpq submake-libpgport submake-libpgfeutils
 dropuser: dropuser.o common.o $(WIN32RES) | submake-libpq submake-libpgport submake-libpgfeutils
 clusterdb: clusterdb.o common.o $(WIN32RES) | submake-libpq submake-libpgport submake-libpgfeutils
-vacuumdb: vacuumdb.o common.o $(WIN32RES) | submake-libpq submake-libpgport submake-libpgfeutils
+vacuumdb: vacuumdb.o vacuuming.o common.o $(WIN32RES) | submake-libpq submake-libpgport submake-libpgfeutils
 reindexdb: reindexdb.o common.o $(WIN32RES) | submake-libpq submake-libpgport submake-libpgfeutils
 pg_isready: pg_isready.o common.o $(WIN32RES) | submake-libpq submake-libpgport submake-libpgfeutils
 
@@ -50,7 +50,7 @@ uninstall:
 
 clean distclean:
 	rm -f $(addsuffix $(X), $(PROGRAMS)) $(addsuffix .o, $(PROGRAMS))
-	rm -f common.o $(WIN32RES)
+	rm -f common.o vacuuming.o $(WIN32RES)
 	rm -rf tmp_check
 
 export with_icu
diff --git a/src/bin/scripts/meson.build b/src/bin/scripts/meson.build
index 80df7c33257..a4fed59d1c9 100644
--- a/src/bin/scripts/meson.build
+++ b/src/bin/scripts/meson.build
@@ -12,7 +12,6 @@ binaries = [
   'createuser',
   'dropuser',
   'clusterdb',
-  'vacuumdb',
   'reindexdb',
   'pg_isready',
 ]
@@ -35,6 +34,33 @@ foreach binary : binaries
   bin_targets += binary
 endforeach
 
+vacuuming_common = static_library('libvacuuming_common',
+  files('common.c', 'vacuuming.c'),
+  dependencies: [frontend_code, libpq],
+  kwargs: internal_lib_args,
+)
+
+binaries = [
+  'vacuumdb',
+]
+foreach binary : binaries
+  binary_sources = files('@[email protected]'.format(binary))
+
+  if host_system == 'windows'
+    binary_sources += rc_bin_gen.process(win32ver_rc, extra_args: [
+      '--NAME', binary,
+      '--FILEDESC', '@0@ - PostgreSQL utility'.format(binary),])
+  endif
+
+  binary = executable(binary,
+    binary_sources,
+    link_with: [vacuuming_common],
+    dependencies: [frontend_code, libpq],
+    kwargs: default_bin_args,
+  )
+  bin_targets += binary
+endforeach
+
 tests += {
   'name': 'scripts',
   'sd': meson.current_source_dir(),
diff --git a/src/bin/scripts/vacuumdb.c b/src/bin/scripts/vacuumdb.c
index fd236087e90..b1be61ddf25 100644
--- a/src/bin/scripts/vacuumdb.c
+++ b/src/bin/scripts/vacuumdb.c
@@ -14,92 +14,13 @@
 
 #include <limits.h>
 
-#include "catalog/pg_attribute_d.h"
-#include "catalog/pg_class_d.h"
 #include "common.h"
-#include "common/connect.h"
 #include "common/logging.h"
-#include "fe_utils/cancel.h"
 #include "fe_utils/option_utils.h"
-#include "fe_utils/parallel_slot.h"
-#include "fe_utils/query_utils.h"
-#include "fe_utils/simple_list.h"
-#include "fe_utils/string_utils.h"
-
-
-/* vacuum options controlled by user flags */
-typedef struct vacuumingOptions
-{
-	bool		analyze_only;
-	bool		verbose;
-	bool		and_analyze;
-	bool		full;
-	bool		freeze;
-	bool		disable_page_skipping;
-	bool		skip_locked;
-	int			min_xid_age;
-	int			min_mxid_age;
-	int			parallel_workers;	/* >= 0 indicates user specified the
-									 * parallel degree, otherwise -1 */
-	bool		no_index_cleanup;
-	bool		force_index_cleanup;
-	bool		do_truncate;
-	bool		process_main;
-	bool		process_toast;
-	bool		skip_database_stats;
-	char	   *buffer_usage_limit;
-	bool		missing_stats_only;
-} vacuumingOptions;
-
-/* object filter options */
-typedef enum
-{
-	OBJFILTER_NONE = 0,			/* no filter used */
-	OBJFILTER_ALL_DBS = (1 << 0),	/* -a | --all */
-	OBJFILTER_DATABASE = (1 << 1),	/* -d | --dbname */
-	OBJFILTER_TABLE = (1 << 2), /* -t | --table */
-	OBJFILTER_SCHEMA = (1 << 3),	/* -n | --schema */
-	OBJFILTER_SCHEMA_EXCLUDE = (1 << 4),	/* -N | --exclude-schema */
-} VacObjFilter;
-
-static VacObjFilter objfilter = OBJFILTER_NONE;
-
-static SimpleStringList *retrieve_objects(PGconn *conn,
-										  vacuumingOptions *vacopts,
-										  SimpleStringList *objects,
-										  bool echo);
-
-static void vacuum_one_database(ConnParams *cparams,
-								vacuumingOptions *vacopts,
-								int stage,
-								SimpleStringList *objects,
-								SimpleStringList **found_objs,
-								int concurrentCons,
-								const char *progname, bool echo, bool quiet);
-
-static void vacuum_all_databases(ConnParams *cparams,
-								 vacuumingOptions *vacopts,
-								 bool analyze_in_stages,
-								 SimpleStringList *objects,
-								 int concurrentCons,
-								 const char *progname, bool echo, bool quiet);
-
-static void prepare_vacuum_command(PQExpBuffer sql, int serverVersion,
-								   vacuumingOptions *vacopts, const char *table);
-
-static void run_vacuum_command(PGconn *conn, const char *sql, bool echo,
-							   const char *table);
+#include "vacuuming.h"
 
 static void help(const char *progname);
-
-void		check_objfilter(void);
-
-static char *escape_quotes(const char *src);
-
-/* For analyze-in-stages mode */
-#define ANALYZE_NO_STAGE	-1
-#define ANALYZE_NUM_STAGES	3
-
+static void check_objfilter(void);
 
 int
 main(int argc, char *argv[])
@@ -145,10 +66,6 @@ main(int argc, char *argv[])
 	int			c;
 	const char *dbname = NULL;
 	const char *maintenance_db = NULL;
-	char	   *host = NULL;
-	char	   *port = NULL;
-	char	   *username = NULL;
-	enum trivalue prompt_password = TRI_DEFAULT;
 	ConnParams	cparams;
 	bool		echo = false;
 	bool		quiet = false;
@@ -168,13 +85,18 @@ main(int argc, char *argv[])
 	vacopts.process_main = true;
 	vacopts.process_toast = true;
 
+	/* the same for connection parameters */
+	memset(&cparams, 0, sizeof(cparams));
+	cparams.prompt_password = TRI_DEFAULT;
+
 	pg_logging_init(argv[0]);
 	progname = get_progname(argv[0]);
 	set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("pgscripts"));
 
-	handle_help_version_opts(argc, argv, "vacuumdb", help);
+	handle_help_version_opts(argc, argv, progname, help);
 
-	while ((c = getopt_long(argc, argv, "ad:efFh:j:n:N:p:P:qt:U:vwWzZ", long_options, &optindex)) != -1)
+	while ((c = getopt_long(argc, argv, "ad:efFh:j:n:N:p:P:qt:U:vwWzZ",
+							long_options, &optindex)) != -1)
 	{
 		switch (c)
 		{
@@ -195,7 +117,7 @@ main(int argc, char *argv[])
 				vacopts.freeze = true;
 				break;
 			case 'h':
-				host = pg_strdup(optarg);
+				cparams.pghost = pg_strdup(optarg);
 				break;
 			case 'j':
 				if (!option_parse_int(optarg, "-j/--jobs", 1, INT_MAX,
@@ -211,7 +133,7 @@ main(int argc, char *argv[])
 				simple_string_list_append(&objects, optarg);
 				break;
 			case 'p':
-				port = pg_strdup(optarg);
+				cparams.pgport = pg_strdup(optarg);
 				break;
 			case 'P':
 				if (!option_parse_int(optarg, "-P/--parallel", 0, INT_MAX,
@@ -227,16 +149,16 @@ main(int argc, char *argv[])
 				tbl_count++;
 				break;
 			case 'U':
-				username = pg_strdup(optarg);
+				cparams.pguser = pg_strdup(optarg);
 				break;
 			case 'v':
 				vacopts.verbose = true;
 				break;
 			case 'w':
-				prompt_password = TRI_NO;
+				cparams.prompt_password = TRI_NO;
 				break;
 			case 'W':
-				prompt_password = TRI_YES;
+				cparams.prompt_password = TRI_YES;
 				break;
 			case 'z':
 				vacopts.and_analyze = true;
@@ -380,66 +302,9 @@ main(int argc, char *argv[])
 		pg_fatal("cannot use the \"%s\" option without \"%s\" or \"%s\"",
 				 "missing-stats-only", "analyze-only", "analyze-in-stages");
 
-	/* fill cparams except for dbname, which is set below */
-	cparams.pghost = host;
-	cparams.pgport = port;
-	cparams.pguser = username;
-	cparams.prompt_password = prompt_password;
-	cparams.override_dbname = NULL;
-
-	setup_cancel_handler(NULL);
-
-	/* Avoid opening extra connections. */
-	if (tbl_count && (concurrentCons > tbl_count))
-		concurrentCons = tbl_count;
-
-	if (objfilter & OBJFILTER_ALL_DBS)
-	{
-		cparams.dbname = maintenance_db;
-
-		vacuum_all_databases(&cparams, &vacopts,
-							 analyze_in_stages,
-							 &objects,
-							 concurrentCons,
-							 progname, echo, quiet);
-	}
-	else
-	{
-		if (dbname == NULL)
-		{
-			if (getenv("PGDATABASE"))
-				dbname = getenv("PGDATABASE");
-			else if (getenv("PGUSER"))
-				dbname = getenv("PGUSER");
-			else
-				dbname = get_user_name_or_exit(progname);
-		}
-
-		cparams.dbname = dbname;
-
-		if (analyze_in_stages)
-		{
-			int			stage;
-			SimpleStringList *found_objs = NULL;
-
-			for (stage = 0; stage < ANALYZE_NUM_STAGES; stage++)
-			{
-				vacuum_one_database(&cparams, &vacopts,
-									stage,
-									&objects,
-									vacopts.missing_stats_only ? &found_objs : NULL,
-									concurrentCons,
-									progname, echo, quiet);
-			}
-		}
-		else
-			vacuum_one_database(&cparams, &vacopts,
-								ANALYZE_NO_STAGE,
-								&objects, NULL,
-								concurrentCons,
-								progname, echo, quiet);
-	}
-
+	vacuuming_main(&cparams, dbname, maintenance_db, &vacopts, &objects,
+				   analyze_in_stages, tbl_count, concurrentCons,
+				   progname, echo, quiet);
 	exit(0);
 }
 
@@ -466,885 +331,6 @@ check_objfilter(void)
 		pg_fatal("cannot vacuum all tables in schema(s) and exclude schema(s) at the same time");
 }
 
-/*
- * Returns a newly malloc'd version of 'src' with escaped single quotes and
- * backslashes.
- */
-static char *
-escape_quotes(const char *src)
-{
-	char	   *result = escape_single_quotes_ascii(src);
-
-	if (!result)
-		pg_fatal("out of memory");
-	return result;
-}
-
-/*
- * vacuum_one_database
- *
- * Process tables in the given database.
- *
- * There are two ways to specify the list of objects to process:
- *
- * 1) The "found_objs" parameter is a double pointer to a fully qualified list
- *    of objects to process, as returned by a previous call to
- *    vacuum_one_database().
- *
- *     a) If both "found_objs" (the double pointer) and "*found_objs" (the
- *        once-dereferenced double pointer) are not NULL, this list takes
- *        priority, and anything specified in "objects" is ignored.
- *
- *     b) If "found_objs" (the double pointer) is not NULL but "*found_objs"
- *        (the once-dereferenced double pointer) _is_ NULL, the "objects"
- *        parameter takes priority, and the results of the catalog query
- *        described in (2) are stored in "found_objs".
- *
- *     c) If "found_objs" (the double pointer) is NULL, the "objects"
- *        parameter again takes priority, and the results of the catalog query
- *        are not saved.
- *
- * 2) The "objects" parameter is a user-specified list of objects to process.
- *    When (1b) or (1c) applies, this function performs a catalog query to
- *    retrieve a fully qualified list of objects to process, as described
- *    below.
- *
- *     a) If "objects" is not NULL, the catalog query gathers only the objects
- *        listed in "objects".
- *
- *     b) If "objects" is NULL, all tables in the database are gathered.
- *
- * Note that this function is only concerned with running exactly one stage
- * when in analyze-in-stages mode; caller must iterate on us if necessary.
- *
- * If concurrentCons is > 1, multiple connections are used to vacuum tables
- * in parallel.
- */
-static void
-vacuum_one_database(ConnParams *cparams,
-					vacuumingOptions *vacopts,
-					int stage,
-					SimpleStringList *objects,
-					SimpleStringList **found_objs,
-					int concurrentCons,
-					const char *progname, bool echo, bool quiet)
-{
-	PQExpBufferData sql;
-	PGconn	   *conn;
-	SimpleStringListCell *cell;
-	ParallelSlotArray *sa;
-	int			ntups = 0;
-	bool		failed = false;
-	const char *initcmd;
-	SimpleStringList *ret = NULL;
-	const char *stage_commands[] = {
-		"SET default_statistics_target=1; SET vacuum_cost_delay=0;",
-		"SET default_statistics_target=10; RESET vacuum_cost_delay;",
-		"RESET default_statistics_target;"
-	};
-	const char *stage_messages[] = {
-		gettext_noop("Generating minimal optimizer statistics (1 target)"),
-		gettext_noop("Generating medium optimizer statistics (10 targets)"),
-		gettext_noop("Generating default (full) optimizer statistics")
-	};
-
-	Assert(stage == ANALYZE_NO_STAGE ||
-		   (stage >= 0 && stage < ANALYZE_NUM_STAGES));
-
-	conn = connectDatabase(cparams, progname, echo, false, true);
-
-	if (vacopts->disable_page_skipping && PQserverVersion(conn) < 90600)
-	{
-		PQfinish(conn);
-		pg_fatal("cannot use the \"%s\" option on server versions older than PostgreSQL %s",
-				 "disable-page-skipping", "9.6");
-	}
-
-	if (vacopts->no_index_cleanup && PQserverVersion(conn) < 120000)
-	{
-		PQfinish(conn);
-		pg_fatal("cannot use the \"%s\" option on server versions older than PostgreSQL %s",
-				 "no-index-cleanup", "12");
-	}
-
-	if (vacopts->force_index_cleanup && PQserverVersion(conn) < 120000)
-	{
-		PQfinish(conn);
-		pg_fatal("cannot use the \"%s\" option on server versions older than PostgreSQL %s",
-				 "force-index-cleanup", "12");
-	}
-
-	if (!vacopts->do_truncate && PQserverVersion(conn) < 120000)
-	{
-		PQfinish(conn);
-		pg_fatal("cannot use the \"%s\" option on server versions older than PostgreSQL %s",
-				 "no-truncate", "12");
-	}
-
-	if (!vacopts->process_main && PQserverVersion(conn) < 160000)
-	{
-		PQfinish(conn);
-		pg_fatal("cannot use the \"%s\" option on server versions older than PostgreSQL %s",
-				 "no-process-main", "16");
-	}
-
-	if (!vacopts->process_toast && PQserverVersion(conn) < 140000)
-	{
-		PQfinish(conn);
-		pg_fatal("cannot use the \"%s\" option on server versions older than PostgreSQL %s",
-				 "no-process-toast", "14");
-	}
-
-	if (vacopts->skip_locked && PQserverVersion(conn) < 120000)
-	{
-		PQfinish(conn);
-		pg_fatal("cannot use the \"%s\" option on server versions older than PostgreSQL %s",
-				 "skip-locked", "12");
-	}
-
-	if (vacopts->min_xid_age != 0 && PQserverVersion(conn) < 90600)
-	{
-		PQfinish(conn);
-		pg_fatal("cannot use the \"%s\" option on server versions older than PostgreSQL %s",
-				 "--min-xid-age", "9.6");
-	}
-
-	if (vacopts->min_mxid_age != 0 && PQserverVersion(conn) < 90600)
-	{
-		PQfinish(conn);
-		pg_fatal("cannot use the \"%s\" option on server versions older than PostgreSQL %s",
-				 "--min-mxid-age", "9.6");
-	}
-
-	if (vacopts->parallel_workers >= 0 && PQserverVersion(conn) < 130000)
-	{
-		PQfinish(conn);
-		pg_fatal("cannot use the \"%s\" option on server versions older than PostgreSQL %s",
-				 "--parallel", "13");
-	}
-
-	if (vacopts->buffer_usage_limit && PQserverVersion(conn) < 160000)
-	{
-		PQfinish(conn);
-		pg_fatal("cannot use the \"%s\" option on server versions older than PostgreSQL %s",
-				 "--buffer-usage-limit", "16");
-	}
-
-	if (vacopts->missing_stats_only && PQserverVersion(conn) < 150000)
-	{
-		PQfinish(conn);
-		pg_fatal("cannot use the \"%s\" option on server versions older than PostgreSQL %s",
-				 "--missing-stats-only", "15");
-	}
-
-	/* skip_database_stats is used automatically if server supports it */
-	vacopts->skip_database_stats = (PQserverVersion(conn) >= 160000);
-
-	if (!quiet)
-	{
-		if (stage != ANALYZE_NO_STAGE)
-			printf(_("%s: processing database \"%s\": %s\n"),
-				   progname, PQdb(conn), _(stage_messages[stage]));
-		else
-			printf(_("%s: vacuuming database \"%s\"\n"),
-				   progname, PQdb(conn));
-		fflush(stdout);
-	}
-
-	/*
-	 * If the caller provided the results of a previous catalog query, just
-	 * use that.  Otherwise, run the catalog query ourselves and set the
-	 * return variable if provided.
-	 */
-	if (found_objs && *found_objs)
-		ret = *found_objs;
-	else
-	{
-		ret = retrieve_objects(conn, vacopts, objects, echo);
-		if (found_objs)
-			*found_objs = ret;
-	}
-
-	/*
-	 * Count the number of objects in the catalog query result.  If there are
-	 * none, we are done.
-	 */
-	for (cell = ret ? ret->head : NULL; cell; cell = cell->next)
-		ntups++;
-
-	if (ntups == 0)
-	{
-		PQfinish(conn);
-		return;
-	}
-
-	/*
-	 * Ensure concurrentCons is sane.  If there are more connections than
-	 * vacuumable relations, we don't need to use them all.
-	 */
-	if (concurrentCons > ntups)
-		concurrentCons = ntups;
-	if (concurrentCons <= 0)
-		concurrentCons = 1;
-
-	/*
-	 * All slots need to be prepared to run the appropriate analyze stage, if
-	 * caller requested that mode.  We have to prepare the initial connection
-	 * ourselves before setting up the slots.
-	 */
-	if (stage == ANALYZE_NO_STAGE)
-		initcmd = NULL;
-	else
-	{
-		initcmd = stage_commands[stage];
-		executeCommand(conn, initcmd, echo);
-	}
-
-	/*
-	 * Setup the database connections. We reuse the connection we already have
-	 * for the first slot.  If not in parallel mode, the first slot in the
-	 * array contains the connection.
-	 */
-	sa = ParallelSlotsSetup(concurrentCons, cparams, progname, echo, initcmd);
-	ParallelSlotsAdoptConn(sa, conn);
-
-	initPQExpBuffer(&sql);
-
-	cell = ret->head;
-	do
-	{
-		const char *tabname = cell->val;
-		ParallelSlot *free_slot;
-
-		if (CancelRequested)
-		{
-			failed = true;
-			goto finish;
-		}
-
-		free_slot = ParallelSlotsGetIdle(sa, NULL);
-		if (!free_slot)
-		{
-			failed = true;
-			goto finish;
-		}
-
-		prepare_vacuum_command(&sql, PQserverVersion(free_slot->connection),
-							   vacopts, tabname);
-
-		/*
-		 * Execute the vacuum.  All errors are handled in processQueryResult
-		 * through ParallelSlotsGetIdle.
-		 */
-		ParallelSlotSetHandler(free_slot, TableCommandResultHandler, NULL);
-		run_vacuum_command(free_slot->connection, sql.data,
-						   echo, tabname);
-
-		cell = cell->next;
-	} while (cell != NULL);
-
-	if (!ParallelSlotsWaitCompletion(sa))
-	{
-		failed = true;
-		goto finish;
-	}
-
-	/* If we used SKIP_DATABASE_STATS, mop up with ONLY_DATABASE_STATS */
-	if (vacopts->skip_database_stats && stage == ANALYZE_NO_STAGE)
-	{
-		const char *cmd = "VACUUM (ONLY_DATABASE_STATS);";
-		ParallelSlot *free_slot = ParallelSlotsGetIdle(sa, NULL);
-
-		if (!free_slot)
-		{
-			failed = true;
-			goto finish;
-		}
-
-		ParallelSlotSetHandler(free_slot, TableCommandResultHandler, NULL);
-		run_vacuum_command(free_slot->connection, cmd, echo, NULL);
-
-		if (!ParallelSlotsWaitCompletion(sa))
-			failed = true;
-	}
-
-finish:
-	ParallelSlotsTerminate(sa);
-	pg_free(sa);
-
-	termPQExpBuffer(&sql);
-
-	if (failed)
-		exit(1);
-}
-
-/*
- * Prepare the list of tables to process by querying the catalogs.
- *
- * Since we execute the constructed query with the default search_path (which
- * could be unsafe), everything in this query MUST be fully qualified.
- *
- * First, build a WITH clause for the catalog query if any tables were
- * specified, with a set of values made of relation names and their optional
- * set of columns.  This is used to match any provided column lists with the
- * generated qualified identifiers and to filter for the tables provided via
- * --table.  If a listed table does not exist, the catalog query will fail.
- */
-static SimpleStringList *
-retrieve_objects(PGconn *conn, vacuumingOptions *vacopts,
-				 SimpleStringList *objects, bool echo)
-{
-	PQExpBufferData buf;
-	PQExpBufferData catalog_query;
-	PGresult   *res;
-	SimpleStringListCell *cell;
-	SimpleStringList *found_objs = palloc0(sizeof(SimpleStringList));
-	bool		objects_listed = false;
-
-	initPQExpBuffer(&catalog_query);
-	for (cell = objects ? objects->head : NULL; cell; cell = cell->next)
-	{
-		char	   *just_table = NULL;
-		const char *just_columns = NULL;
-
-		if (!objects_listed)
-		{
-			appendPQExpBufferStr(&catalog_query,
-								 "WITH listed_objects (object_oid, column_list) "
-								 "AS (\n  VALUES (");
-			objects_listed = true;
-		}
-		else
-			appendPQExpBufferStr(&catalog_query, ",\n  (");
-
-		if (objfilter & (OBJFILTER_SCHEMA | OBJFILTER_SCHEMA_EXCLUDE))
-		{
-			appendStringLiteralConn(&catalog_query, cell->val, conn);
-			appendPQExpBufferStr(&catalog_query, "::pg_catalog.regnamespace, ");
-		}
-
-		if (objfilter & OBJFILTER_TABLE)
-		{
-			/*
-			 * Split relation and column names given by the user, this is used
-			 * to feed the CTE with values on which are performed pre-run
-			 * validity checks as well.  For now these happen only on the
-			 * relation name.
-			 */
-			splitTableColumnsSpec(cell->val, PQclientEncoding(conn),
-								  &just_table, &just_columns);
-
-			appendStringLiteralConn(&catalog_query, just_table, conn);
-			appendPQExpBufferStr(&catalog_query, "::pg_catalog.regclass, ");
-		}
-
-		if (just_columns && just_columns[0] != '\0')
-			appendStringLiteralConn(&catalog_query, just_columns, conn);
-		else
-			appendPQExpBufferStr(&catalog_query, "NULL");
-
-		appendPQExpBufferStr(&catalog_query, "::pg_catalog.text)");
-
-		pg_free(just_table);
-	}
-
-	/* Finish formatting the CTE */
-	if (objects_listed)
-		appendPQExpBufferStr(&catalog_query, "\n)\n");
-
-	appendPQExpBufferStr(&catalog_query, "SELECT c.relname, ns.nspname");
-
-	if (objects_listed)
-		appendPQExpBufferStr(&catalog_query, ", listed_objects.column_list");
-
-	appendPQExpBufferStr(&catalog_query,
-						 " FROM pg_catalog.pg_class c\n"
-						 " JOIN pg_catalog.pg_namespace ns"
-						 " ON c.relnamespace OPERATOR(pg_catalog.=) ns.oid\n"
-						 " CROSS JOIN LATERAL (SELECT c.relkind IN ("
-						 CppAsString2(RELKIND_PARTITIONED_TABLE) ", "
-						 CppAsString2(RELKIND_PARTITIONED_INDEX) ")) as p (inherited)\n"
-						 " LEFT JOIN pg_catalog.pg_class t"
-						 " ON c.reltoastrelid OPERATOR(pg_catalog.=) t.oid\n");
-
-	/*
-	 * Used to match the tables or schemas listed by the user, completing the
-	 * JOIN clause.
-	 */
-	if (objects_listed)
-	{
-		appendPQExpBufferStr(&catalog_query, " LEFT JOIN listed_objects"
-							 " ON listed_objects.object_oid"
-							 " OPERATOR(pg_catalog.=) ");
-
-		if (objfilter & OBJFILTER_TABLE)
-			appendPQExpBufferStr(&catalog_query, "c.oid\n");
-		else
-			appendPQExpBufferStr(&catalog_query, "ns.oid\n");
-	}
-
-	/*
-	 * Exclude temporary tables, beginning the WHERE clause.
-	 */
-	appendPQExpBufferStr(&catalog_query,
-						 " WHERE c.relpersistence OPERATOR(pg_catalog.!=) "
-						 CppAsString2(RELPERSISTENCE_TEMP) "\n");
-
-	/*
-	 * Used to match the tables or schemas listed by the user, for the WHERE
-	 * clause.
-	 */
-	if (objects_listed)
-	{
-		if (objfilter & OBJFILTER_SCHEMA_EXCLUDE)
-			appendPQExpBufferStr(&catalog_query,
-								 " AND listed_objects.object_oid IS NULL\n");
-		else
-			appendPQExpBufferStr(&catalog_query,
-								 " AND listed_objects.object_oid IS NOT NULL\n");
-	}
-
-	/*
-	 * If no tables were listed, filter for the relevant relation types.  If
-	 * tables were given via --table, don't bother filtering by relation type.
-	 * Instead, let the server decide whether a given relation can be
-	 * processed in which case the user will know about it.
-	 */
-	if ((objfilter & OBJFILTER_TABLE) == 0)
-	{
-		/*
-		 * vacuumdb should generally follow the behavior of the underlying
-		 * VACUUM and ANALYZE commands. If analyze_only is true, process
-		 * regular tables, materialized views, and partitioned tables, just
-		 * like ANALYZE (with no specific target tables) does. Otherwise,
-		 * process only regular tables and materialized views, since VACUUM
-		 * skips partitioned tables when no target tables are specified.
-		 */
-		if (vacopts->analyze_only)
-			appendPQExpBufferStr(&catalog_query,
-								 " AND c.relkind OPERATOR(pg_catalog.=) ANY (array["
-								 CppAsString2(RELKIND_RELATION) ", "
-								 CppAsString2(RELKIND_MATVIEW) ", "
-								 CppAsString2(RELKIND_PARTITIONED_TABLE) "])\n");
-		else
-			appendPQExpBufferStr(&catalog_query,
-								 " AND c.relkind OPERATOR(pg_catalog.=) ANY (array["
-								 CppAsString2(RELKIND_RELATION) ", "
-								 CppAsString2(RELKIND_MATVIEW) "])\n");
-
-	}
-
-	/*
-	 * For --min-xid-age and --min-mxid-age, the age of the relation is the
-	 * greatest of the ages of the main relation and its associated TOAST
-	 * table.  The commands generated by vacuumdb will also process the TOAST
-	 * table for the relation if necessary, so it does not need to be
-	 * considered separately.
-	 */
-	if (vacopts->min_xid_age != 0)
-	{
-		appendPQExpBuffer(&catalog_query,
-						  " AND GREATEST(pg_catalog.age(c.relfrozenxid),"
-						  " pg_catalog.age(t.relfrozenxid)) "
-						  " OPERATOR(pg_catalog.>=) '%d'::pg_catalog.int4\n"
-						  " AND c.relfrozenxid OPERATOR(pg_catalog.!=)"
-						  " '0'::pg_catalog.xid\n",
-						  vacopts->min_xid_age);
-	}
-
-	if (vacopts->min_mxid_age != 0)
-	{
-		appendPQExpBuffer(&catalog_query,
-						  " AND GREATEST(pg_catalog.mxid_age(c.relminmxid),"
-						  " pg_catalog.mxid_age(t.relminmxid)) OPERATOR(pg_catalog.>=)"
-						  " '%d'::pg_catalog.int4\n"
-						  " AND c.relminmxid OPERATOR(pg_catalog.!=)"
-						  " '0'::pg_catalog.xid\n",
-						  vacopts->min_mxid_age);
-	}
-
-	if (vacopts->missing_stats_only)
-	{
-		appendPQExpBufferStr(&catalog_query, " AND (\n");
-
-		/* regular stats */
-		appendPQExpBufferStr(&catalog_query,
-							 " EXISTS (SELECT NULL FROM pg_catalog.pg_attribute a\n"
-							 " WHERE a.attrelid OPERATOR(pg_catalog.=) c.oid\n"
-							 " AND a.attnum OPERATOR(pg_catalog.>) 0::pg_catalog.int2\n"
-							 " AND NOT a.attisdropped\n"
-							 " AND a.attstattarget IS DISTINCT FROM 0::pg_catalog.int2\n"
-							 " AND a.attgenerated OPERATOR(pg_catalog.<>) "
-							 CppAsString2(ATTRIBUTE_GENERATED_VIRTUAL) "\n"
-							 " AND NOT EXISTS (SELECT NULL FROM pg_catalog.pg_statistic s\n"
-							 " WHERE s.starelid OPERATOR(pg_catalog.=) a.attrelid\n"
-							 " AND s.staattnum OPERATOR(pg_catalog.=) a.attnum\n"
-							 " AND s.stainherit OPERATOR(pg_catalog.=) p.inherited))\n");
-
-		/* extended stats */
-		appendPQExpBufferStr(&catalog_query,
-							 " OR EXISTS (SELECT NULL FROM pg_catalog.pg_statistic_ext e\n"
-							 " WHERE e.stxrelid OPERATOR(pg_catalog.=) c.oid\n"
-							 " AND e.stxstattarget IS DISTINCT FROM 0::pg_catalog.int2\n"
-							 " AND NOT EXISTS (SELECT NULL FROM pg_catalog.pg_statistic_ext_data d\n"
-							 " WHERE d.stxoid OPERATOR(pg_catalog.=) e.oid\n"
-							 " AND d.stxdinherit OPERATOR(pg_catalog.=) p.inherited))\n");
-
-		/* expression indexes */
-		appendPQExpBufferStr(&catalog_query,
-							 " OR EXISTS (SELECT NULL FROM pg_catalog.pg_attribute a\n"
-							 " JOIN pg_catalog.pg_index i"
-							 " ON i.indexrelid OPERATOR(pg_catalog.=) a.attrelid\n"
-							 " WHERE i.indrelid OPERATOR(pg_catalog.=) c.oid\n"
-							 " AND i.indkey[a.attnum OPERATOR(pg_catalog.-) 1::pg_catalog.int2]"
-							 " OPERATOR(pg_catalog.=) 0::pg_catalog.int2\n"
-							 " AND a.attnum OPERATOR(pg_catalog.>) 0::pg_catalog.int2\n"
-							 " AND NOT a.attisdropped\n"
-							 " AND a.attstattarget IS DISTINCT FROM 0::pg_catalog.int2\n"
-							 " AND NOT EXISTS (SELECT NULL FROM pg_catalog.pg_statistic s\n"
-							 " WHERE s.starelid OPERATOR(pg_catalog.=) a.attrelid\n"
-							 " AND s.staattnum OPERATOR(pg_catalog.=) a.attnum\n"
-							 " AND s.stainherit OPERATOR(pg_catalog.=) p.inherited))\n");
-
-		/* inheritance and regular stats */
-		appendPQExpBufferStr(&catalog_query,
-							 " OR EXISTS (SELECT NULL FROM pg_catalog.pg_attribute a\n"
-							 " WHERE a.attrelid OPERATOR(pg_catalog.=) c.oid\n"
-							 " AND a.attnum OPERATOR(pg_catalog.>) 0::pg_catalog.int2\n"
-							 " AND NOT a.attisdropped\n"
-							 " AND a.attstattarget IS DISTINCT FROM 0::pg_catalog.int2\n"
-							 " AND a.attgenerated OPERATOR(pg_catalog.<>) "
-							 CppAsString2(ATTRIBUTE_GENERATED_VIRTUAL) "\n"
-							 " AND c.relhassubclass\n"
-							 " AND NOT p.inherited\n"
-							 " AND EXISTS (SELECT NULL FROM pg_catalog.pg_inherits h\n"
-							 " WHERE h.inhparent OPERATOR(pg_catalog.=) c.oid)\n"
-							 " AND NOT EXISTS (SELECT NULL FROM pg_catalog.pg_statistic s\n"
-							 " WHERE s.starelid OPERATOR(pg_catalog.=) a.attrelid\n"
-							 " AND s.staattnum OPERATOR(pg_catalog.=) a.attnum\n"
-							 " AND s.stainherit))\n");
-
-		/* inheritance and extended stats */
-		appendPQExpBufferStr(&catalog_query,
-							 " OR EXISTS (SELECT NULL FROM pg_catalog.pg_statistic_ext e\n"
-							 " WHERE e.stxrelid OPERATOR(pg_catalog.=) c.oid\n"
-							 " AND e.stxstattarget IS DISTINCT FROM 0::pg_catalog.int2\n"
-							 " AND c.relhassubclass\n"
-							 " AND NOT p.inherited\n"
-							 " AND EXISTS (SELECT NULL FROM pg_catalog.pg_inherits h\n"
-							 " WHERE h.inhparent OPERATOR(pg_catalog.=) c.oid)\n"
-							 " AND NOT EXISTS (SELECT NULL FROM pg_catalog.pg_statistic_ext_data d\n"
-							 " WHERE d.stxoid OPERATOR(pg_catalog.=) e.oid\n"
-							 " AND d.stxdinherit))\n");
-
-		appendPQExpBufferStr(&catalog_query, " )\n");
-	}
-
-	/*
-	 * Execute the catalog query.  We use the default search_path for this
-	 * query for consistency with table lookups done elsewhere by the user.
-	 */
-	appendPQExpBufferStr(&catalog_query, " ORDER BY c.relpages DESC;");
-	executeCommand(conn, "RESET search_path;", echo);
-	res = executeQuery(conn, catalog_query.data, echo);
-	termPQExpBuffer(&catalog_query);
-	PQclear(executeQuery(conn, ALWAYS_SECURE_SEARCH_PATH_SQL, echo));
-
-	/*
-	 * Build qualified identifiers for each table, including the column list
-	 * if given.
-	 */
-	initPQExpBuffer(&buf);
-	for (int i = 0; i < PQntuples(res); i++)
-	{
-		appendPQExpBufferStr(&buf,
-							 fmtQualifiedIdEnc(PQgetvalue(res, i, 1),
-											   PQgetvalue(res, i, 0),
-											   PQclientEncoding(conn)));
-
-		if (objects_listed && !PQgetisnull(res, i, 2))
-			appendPQExpBufferStr(&buf, PQgetvalue(res, i, 2));
-
-		simple_string_list_append(found_objs, buf.data);
-		resetPQExpBuffer(&buf);
-	}
-	termPQExpBuffer(&buf);
-	PQclear(res);
-
-	return found_objs;
-}
-
-/*
- * Vacuum/analyze all connectable databases.
- *
- * In analyze-in-stages mode, we process all databases in one stage before
- * moving on to the next stage.  That ensure minimal stats are available
- * quickly everywhere before generating more detailed ones.
- */
-static void
-vacuum_all_databases(ConnParams *cparams,
-					 vacuumingOptions *vacopts,
-					 bool analyze_in_stages,
-					 SimpleStringList *objects,
-					 int concurrentCons,
-					 const char *progname, bool echo, bool quiet)
-{
-	PGconn	   *conn;
-	PGresult   *result;
-	int			stage;
-	int			i;
-
-	conn = connectMaintenanceDatabase(cparams, progname, echo);
-	result = executeQuery(conn,
-						  "SELECT datname FROM pg_database WHERE datallowconn AND datconnlimit <> -2 ORDER BY 1;",
-						  echo);
-	PQfinish(conn);
-
-	if (analyze_in_stages)
-	{
-		SimpleStringList **found_objs = NULL;
-
-		if (vacopts->missing_stats_only)
-			found_objs = palloc0(PQntuples(result) * sizeof(SimpleStringList *));
-
-		/*
-		 * When analyzing all databases in stages, we analyze them all in the
-		 * fastest stage first, so that initial statistics become available
-		 * for all of them as soon as possible.
-		 *
-		 * This means we establish several times as many connections, but
-		 * that's a secondary consideration.
-		 */
-		for (stage = 0; stage < ANALYZE_NUM_STAGES; stage++)
-		{
-			for (i = 0; i < PQntuples(result); i++)
-			{
-				cparams->override_dbname = PQgetvalue(result, i, 0);
-
-				vacuum_one_database(cparams, vacopts,
-									stage,
-									objects,
-									vacopts->missing_stats_only ? &found_objs[i] : NULL,
-									concurrentCons,
-									progname, echo, quiet);
-			}
-		}
-	}
-	else
-	{
-		for (i = 0; i < PQntuples(result); i++)
-		{
-			cparams->override_dbname = PQgetvalue(result, i, 0);
-
-			vacuum_one_database(cparams, vacopts,
-								ANALYZE_NO_STAGE,
-								objects, NULL,
-								concurrentCons,
-								progname, echo, quiet);
-		}
-	}
-
-	PQclear(result);
-}
-
-/*
- * Construct a vacuum/analyze command to run based on the given options, in the
- * given string buffer, which may contain previous garbage.
- *
- * The table name used must be already properly quoted.  The command generated
- * depends on the server version involved and it is semicolon-terminated.
- */
-static void
-prepare_vacuum_command(PQExpBuffer sql, int serverVersion,
-					   vacuumingOptions *vacopts, const char *table)
-{
-	const char *paren = " (";
-	const char *comma = ", ";
-	const char *sep = paren;
-
-	resetPQExpBuffer(sql);
-
-	if (vacopts->analyze_only)
-	{
-		appendPQExpBufferStr(sql, "ANALYZE");
-
-		/* parenthesized grammar of ANALYZE is supported since v11 */
-		if (serverVersion >= 110000)
-		{
-			if (vacopts->skip_locked)
-			{
-				/* SKIP_LOCKED is supported since v12 */
-				Assert(serverVersion >= 120000);
-				appendPQExpBuffer(sql, "%sSKIP_LOCKED", sep);
-				sep = comma;
-			}
-			if (vacopts->verbose)
-			{
-				appendPQExpBuffer(sql, "%sVERBOSE", sep);
-				sep = comma;
-			}
-			if (vacopts->buffer_usage_limit)
-			{
-				Assert(serverVersion >= 160000);
-				appendPQExpBuffer(sql, "%sBUFFER_USAGE_LIMIT '%s'", sep,
-								  vacopts->buffer_usage_limit);
-				sep = comma;
-			}
-			if (sep != paren)
-				appendPQExpBufferChar(sql, ')');
-		}
-		else
-		{
-			if (vacopts->verbose)
-				appendPQExpBufferStr(sql, " VERBOSE");
-		}
-	}
-	else
-	{
-		appendPQExpBufferStr(sql, "VACUUM");
-
-		/* parenthesized grammar of VACUUM is supported since v9.0 */
-		if (serverVersion >= 90000)
-		{
-			if (vacopts->disable_page_skipping)
-			{
-				/* DISABLE_PAGE_SKIPPING is supported since v9.6 */
-				Assert(serverVersion >= 90600);
-				appendPQExpBuffer(sql, "%sDISABLE_PAGE_SKIPPING", sep);
-				sep = comma;
-			}
-			if (vacopts->no_index_cleanup)
-			{
-				/* "INDEX_CLEANUP FALSE" has been supported since v12 */
-				Assert(serverVersion >= 120000);
-				Assert(!vacopts->force_index_cleanup);
-				appendPQExpBuffer(sql, "%sINDEX_CLEANUP FALSE", sep);
-				sep = comma;
-			}
-			if (vacopts->force_index_cleanup)
-			{
-				/* "INDEX_CLEANUP TRUE" has been supported since v12 */
-				Assert(serverVersion >= 120000);
-				Assert(!vacopts->no_index_cleanup);
-				appendPQExpBuffer(sql, "%sINDEX_CLEANUP TRUE", sep);
-				sep = comma;
-			}
-			if (!vacopts->do_truncate)
-			{
-				/* TRUNCATE is supported since v12 */
-				Assert(serverVersion >= 120000);
-				appendPQExpBuffer(sql, "%sTRUNCATE FALSE", sep);
-				sep = comma;
-			}
-			if (!vacopts->process_main)
-			{
-				/* PROCESS_MAIN is supported since v16 */
-				Assert(serverVersion >= 160000);
-				appendPQExpBuffer(sql, "%sPROCESS_MAIN FALSE", sep);
-				sep = comma;
-			}
-			if (!vacopts->process_toast)
-			{
-				/* PROCESS_TOAST is supported since v14 */
-				Assert(serverVersion >= 140000);
-				appendPQExpBuffer(sql, "%sPROCESS_TOAST FALSE", sep);
-				sep = comma;
-			}
-			if (vacopts->skip_database_stats)
-			{
-				/* SKIP_DATABASE_STATS is supported since v16 */
-				Assert(serverVersion >= 160000);
-				appendPQExpBuffer(sql, "%sSKIP_DATABASE_STATS", sep);
-				sep = comma;
-			}
-			if (vacopts->skip_locked)
-			{
-				/* SKIP_LOCKED is supported since v12 */
-				Assert(serverVersion >= 120000);
-				appendPQExpBuffer(sql, "%sSKIP_LOCKED", sep);
-				sep = comma;
-			}
-			if (vacopts->full)
-			{
-				appendPQExpBuffer(sql, "%sFULL", sep);
-				sep = comma;
-			}
-			if (vacopts->freeze)
-			{
-				appendPQExpBuffer(sql, "%sFREEZE", sep);
-				sep = comma;
-			}
-			if (vacopts->verbose)
-			{
-				appendPQExpBuffer(sql, "%sVERBOSE", sep);
-				sep = comma;
-			}
-			if (vacopts->and_analyze)
-			{
-				appendPQExpBuffer(sql, "%sANALYZE", sep);
-				sep = comma;
-			}
-			if (vacopts->parallel_workers >= 0)
-			{
-				/* PARALLEL is supported since v13 */
-				Assert(serverVersion >= 130000);
-				appendPQExpBuffer(sql, "%sPARALLEL %d", sep,
-								  vacopts->parallel_workers);
-				sep = comma;
-			}
-			if (vacopts->buffer_usage_limit)
-			{
-				Assert(serverVersion >= 160000);
-				appendPQExpBuffer(sql, "%sBUFFER_USAGE_LIMIT '%s'", sep,
-								  vacopts->buffer_usage_limit);
-				sep = comma;
-			}
-			if (sep != paren)
-				appendPQExpBufferChar(sql, ')');
-		}
-		else
-		{
-			if (vacopts->full)
-				appendPQExpBufferStr(sql, " FULL");
-			if (vacopts->freeze)
-				appendPQExpBufferStr(sql, " FREEZE");
-			if (vacopts->verbose)
-				appendPQExpBufferStr(sql, " VERBOSE");
-			if (vacopts->and_analyze)
-				appendPQExpBufferStr(sql, " ANALYZE");
-		}
-	}
-
-	appendPQExpBuffer(sql, " %s;", table);
-}
-
-/*
- * Send a vacuum/analyze command to the server, returning after sending the
- * command.
- *
- * Any errors during command execution are reported to stderr.
- */
-static void
-run_vacuum_command(PGconn *conn, const char *sql, bool echo,
-				   const char *table)
-{
-	bool		status;
-
-	if (echo)
-		printf("%s\n", sql);
-
-	status = PQsendQuery(conn, sql) == 1;
-
-	if (!status)
-	{
-		if (table)
-			pg_log_error("vacuuming of table \"%s\" in database \"%s\" failed: %s",
-						 table, PQdb(conn), PQerrorMessage(conn));
-		else
-			pg_log_error("vacuuming of database \"%s\" failed: %s",
-						 PQdb(conn), PQerrorMessage(conn));
-	}
-}
 
 static void
 help(const char *progname)
diff --git a/src/bin/scripts/vacuuming.c b/src/bin/scripts/vacuuming.c
new file mode 100644
index 00000000000..9be37fcc45a
--- /dev/null
+++ b/src/bin/scripts/vacuuming.c
@@ -0,0 +1,978 @@
+/*-------------------------------------------------------------------------
+ * vacuuming.c
+ *		Common routines for vacuumdb
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/scripts/vacuuming.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres_fe.h"
+
+#include <limits.h>
+
+#include "catalog/pg_attribute_d.h"
+#include "catalog/pg_class_d.h"
+#include "common/connect.h"
+#include "common/logging.h"
+#include "fe_utils/cancel.h"
+#include "fe_utils/option_utils.h"
+#include "fe_utils/parallel_slot.h"
+#include "fe_utils/query_utils.h"
+#include "fe_utils/string_utils.h"
+#include "vacuuming.h"
+
+VacObjFilter objfilter = OBJFILTER_NONE;
+
+
+/*
+ * Executes vacuum/analyze as indicated, or dies in case of failure.
+ */
+void
+vacuuming_main(ConnParams *cparams, const char *dbname,
+			   const char *maintenance_db, vacuumingOptions *vacopts,
+			   SimpleStringList *objects, bool analyze_in_stages,
+			   int tbl_count, int concurrentCons,
+			   const char *progname, bool echo, bool quiet)
+{
+	setup_cancel_handler(NULL);
+
+	/* Avoid opening extra connections. */
+	if (tbl_count && (concurrentCons > tbl_count))
+		concurrentCons = tbl_count;
+
+	if (objfilter & OBJFILTER_ALL_DBS)
+	{
+		cparams->dbname = maintenance_db;
+
+		vacuum_all_databases(cparams, vacopts,
+							 analyze_in_stages,
+							 objects,
+							 concurrentCons,
+							 progname, echo, quiet);
+	}
+	else
+	{
+		if (dbname == NULL)
+		{
+			if (getenv("PGDATABASE"))
+				dbname = getenv("PGDATABASE");
+			else if (getenv("PGUSER"))
+				dbname = getenv("PGUSER");
+			else
+				dbname = get_user_name_or_exit(progname);
+		}
+
+		cparams->dbname = dbname;
+
+		if (analyze_in_stages)
+		{
+			int			stage;
+			SimpleStringList *found_objs = NULL;
+
+			for (stage = 0; stage < ANALYZE_NUM_STAGES; stage++)
+			{
+				vacuum_one_database(cparams, vacopts,
+									stage,
+									objects,
+									vacopts->missing_stats_only ? &found_objs : NULL,
+									concurrentCons,
+									progname, echo, quiet);
+			}
+		}
+		else
+			vacuum_one_database(cparams, vacopts,
+								ANALYZE_NO_STAGE,
+								objects, NULL,
+								concurrentCons,
+								progname, echo, quiet);
+	}
+}
+
+
+/*
+ * vacuum_one_database
+ *
+ * Process tables in the given database.
+ *
+ * There are two ways to specify the list of objects to process:
+ *
+ * 1) The "found_objs" parameter is a double pointer to a fully qualified list
+ *    of objects to process, as returned by a previous call to
+ *    vacuum_one_database().
+ *
+ *     a) If both "found_objs" (the double pointer) and "*found_objs" (the
+ *        once-dereferenced double pointer) are not NULL, this list takes
+ *        priority, and anything specified in "objects" is ignored.
+ *
+ *     b) If "found_objs" (the double pointer) is not NULL but "*found_objs"
+ *        (the once-dereferenced double pointer) _is_ NULL, the "objects"
+ *        parameter takes priority, and the results of the catalog query
+ *        described in (2) are stored in "found_objs".
+ *
+ *     c) If "found_objs" (the double pointer) is NULL, the "objects"
+ *        parameter again takes priority, and the results of the catalog query
+ *        are not saved.
+ *
+ * 2) The "objects" parameter is a user-specified list of objects to process.
+ *    When (1b) or (1c) applies, this function performs a catalog query to
+ *    retrieve a fully qualified list of objects to process, as described
+ *    below.
+ *
+ *     a) If "objects" is not NULL, the catalog query gathers only the objects
+ *        listed in "objects".
+ *
+ *     b) If "objects" is NULL, all tables in the database are gathered.
+ *
+ * Note that this function is only concerned with running exactly one stage
+ * when in analyze-in-stages mode; caller must iterate on us if necessary.
+ *
+ * If concurrentCons is > 1, multiple connections are used to vacuum tables
+ * in parallel.
+ */
+void
+vacuum_one_database(ConnParams *cparams,
+					vacuumingOptions *vacopts,
+					int stage,
+					SimpleStringList *objects,
+					SimpleStringList **found_objs,
+					int concurrentCons,
+					const char *progname, bool echo, bool quiet)
+{
+	PQExpBufferData sql;
+	PGconn	   *conn;
+	SimpleStringListCell *cell;
+	ParallelSlotArray *sa;
+	int			ntups = 0;
+	bool		failed = false;
+	const char *initcmd;
+	SimpleStringList *ret = NULL;
+	const char *stage_commands[] = {
+		"SET default_statistics_target=1; SET vacuum_cost_delay=0;",
+		"SET default_statistics_target=10; RESET vacuum_cost_delay;",
+		"RESET default_statistics_target;"
+	};
+	const char *stage_messages[] = {
+		gettext_noop("Generating minimal optimizer statistics (1 target)"),
+		gettext_noop("Generating medium optimizer statistics (10 targets)"),
+		gettext_noop("Generating default (full) optimizer statistics")
+	};
+
+	Assert(stage == ANALYZE_NO_STAGE ||
+		   (stage >= 0 && stage < ANALYZE_NUM_STAGES));
+
+	conn = connectDatabase(cparams, progname, echo, false, true);
+
+	if (vacopts->disable_page_skipping && PQserverVersion(conn) < 90600)
+	{
+		PQfinish(conn);
+		pg_fatal("cannot use the \"%s\" option on server versions older than PostgreSQL %s",
+				 "disable-page-skipping", "9.6");
+	}
+
+	if (vacopts->no_index_cleanup && PQserverVersion(conn) < 120000)
+	{
+		PQfinish(conn);
+		pg_fatal("cannot use the \"%s\" option on server versions older than PostgreSQL %s",
+				 "no-index-cleanup", "12");
+	}
+
+	if (vacopts->force_index_cleanup && PQserverVersion(conn) < 120000)
+	{
+		PQfinish(conn);
+		pg_fatal("cannot use the \"%s\" option on server versions older than PostgreSQL %s",
+				 "force-index-cleanup", "12");
+	}
+
+	if (!vacopts->do_truncate && PQserverVersion(conn) < 120000)
+	{
+		PQfinish(conn);
+		pg_fatal("cannot use the \"%s\" option on server versions older than PostgreSQL %s",
+				 "no-truncate", "12");
+	}
+
+	if (!vacopts->process_main && PQserverVersion(conn) < 160000)
+	{
+		PQfinish(conn);
+		pg_fatal("cannot use the \"%s\" option on server versions older than PostgreSQL %s",
+				 "no-process-main", "16");
+	}
+
+	if (!vacopts->process_toast && PQserverVersion(conn) < 140000)
+	{
+		PQfinish(conn);
+		pg_fatal("cannot use the \"%s\" option on server versions older than PostgreSQL %s",
+				 "no-process-toast", "14");
+	}
+
+	if (vacopts->skip_locked && PQserverVersion(conn) < 120000)
+	{
+		PQfinish(conn);
+		pg_fatal("cannot use the \"%s\" option on server versions older than PostgreSQL %s",
+				 "skip-locked", "12");
+	}
+
+	if (vacopts->min_xid_age != 0 && PQserverVersion(conn) < 90600)
+	{
+		PQfinish(conn);
+		pg_fatal("cannot use the \"%s\" option on server versions older than PostgreSQL %s",
+				 "--min-xid-age", "9.6");
+	}
+
+	if (vacopts->min_mxid_age != 0 && PQserverVersion(conn) < 90600)
+	{
+		PQfinish(conn);
+		pg_fatal("cannot use the \"%s\" option on server versions older than PostgreSQL %s",
+				 "--min-mxid-age", "9.6");
+	}
+
+	if (vacopts->parallel_workers >= 0 && PQserverVersion(conn) < 130000)
+	{
+		PQfinish(conn);
+		pg_fatal("cannot use the \"%s\" option on server versions older than PostgreSQL %s",
+				 "--parallel", "13");
+	}
+
+	if (vacopts->buffer_usage_limit && PQserverVersion(conn) < 160000)
+	{
+		PQfinish(conn);
+		pg_fatal("cannot use the \"%s\" option on server versions older than PostgreSQL %s",
+				 "--buffer-usage-limit", "16");
+	}
+
+	if (vacopts->missing_stats_only && PQserverVersion(conn) < 150000)
+	{
+		PQfinish(conn);
+		pg_fatal("cannot use the \"%s\" option on server versions older than PostgreSQL %s",
+				 "--missing-stats-only", "15");
+	}
+
+	/* skip_database_stats is used automatically if server supports it */
+	vacopts->skip_database_stats = (PQserverVersion(conn) >= 160000);
+
+	if (!quiet)
+	{
+		if (stage != ANALYZE_NO_STAGE)
+			printf(_("%s: processing database \"%s\": %s\n"),
+				   progname, PQdb(conn), _(stage_messages[stage]));
+		else
+			printf(_("%s: vacuuming database \"%s\"\n"),
+				   progname, PQdb(conn));
+		fflush(stdout);
+	}
+
+	/*
+	 * If the caller provided the results of a previous catalog query, just
+	 * use that.  Otherwise, run the catalog query ourselves and set the
+	 * return variable if provided.
+	 */
+	if (found_objs && *found_objs)
+		ret = *found_objs;
+	else
+	{
+		ret = retrieve_objects(conn, vacopts, objects, echo);
+		if (found_objs)
+			*found_objs = ret;
+	}
+
+	/*
+	 * Count the number of objects in the catalog query result.  If there are
+	 * none, we are done.
+	 */
+	for (cell = ret ? ret->head : NULL; cell; cell = cell->next)
+		ntups++;
+
+	if (ntups == 0)
+	{
+		PQfinish(conn);
+		return;
+	}
+
+	/*
+	 * Ensure concurrentCons is sane.  If there are more connections than
+	 * vacuumable relations, we don't need to use them all.
+	 */
+	if (concurrentCons > ntups)
+		concurrentCons = ntups;
+	if (concurrentCons <= 0)
+		concurrentCons = 1;
+
+	/*
+	 * All slots need to be prepared to run the appropriate analyze stage, if
+	 * caller requested that mode.  We have to prepare the initial connection
+	 * ourselves before setting up the slots.
+	 */
+	if (stage == ANALYZE_NO_STAGE)
+		initcmd = NULL;
+	else
+	{
+		initcmd = stage_commands[stage];
+		executeCommand(conn, initcmd, echo);
+	}
+
+	/*
+	 * Set up the database connections. We reuse the connection we already have
+	 * for the first slot.  If not in parallel mode, the first slot in the
+	 * array contains the connection.
+	 */
+	sa = ParallelSlotsSetup(concurrentCons, cparams, progname, echo, initcmd);
+	ParallelSlotsAdoptConn(sa, conn);
+
+	initPQExpBuffer(&sql);
+
+	cell = ret->head;
+	do
+	{
+		const char *tabname = cell->val;
+		ParallelSlot *free_slot;
+
+		if (CancelRequested)
+		{
+			failed = true;
+			goto finish;
+		}
+
+		free_slot = ParallelSlotsGetIdle(sa, NULL);
+		if (!free_slot)
+		{
+			failed = true;
+			goto finish;
+		}
+
+		prepare_vacuum_command(&sql, PQserverVersion(free_slot->connection),
+							   vacopts, tabname);
+
+		/*
+		 * Execute the vacuum.  All errors are handled in processQueryResult
+		 * through ParallelSlotsGetIdle.
+		 */
+		ParallelSlotSetHandler(free_slot, TableCommandResultHandler, NULL);
+		run_vacuum_command(free_slot->connection, sql.data,
+						   echo, tabname);
+
+		cell = cell->next;
+	} while (cell != NULL);
+
+	if (!ParallelSlotsWaitCompletion(sa))
+	{
+		failed = true;
+		goto finish;
+	}
+
+	/* If we used SKIP_DATABASE_STATS, mop up with ONLY_DATABASE_STATS */
+	if (vacopts->skip_database_stats &&
+		stage == ANALYZE_NO_STAGE)
+	{
+		const char *cmd = "VACUUM (ONLY_DATABASE_STATS);";
+		ParallelSlot *free_slot = ParallelSlotsGetIdle(sa, NULL);
+
+		if (!free_slot)
+		{
+			failed = true;
+			goto finish;
+		}
+
+		ParallelSlotSetHandler(free_slot, TableCommandResultHandler, NULL);
+		run_vacuum_command(free_slot->connection, cmd, echo, NULL);
+
+		if (!ParallelSlotsWaitCompletion(sa))
+			failed = true;
+	}
+
+finish:
+	ParallelSlotsTerminate(sa);
+	pg_free(sa);
+
+	termPQExpBuffer(&sql);
+
+	if (failed)
+		exit(1);
+}
+
+/*
+ * Prepare the list of tables to process by querying the catalogs.
+ *
+ * Since we execute the constructed query with the default search_path (which
+ * could be unsafe), everything in this query MUST be fully qualified.
+ *
+ * First, build a WITH clause for the catalog query if any tables were
+ * specified, with a set of values made of relation names and their optional
+ * set of columns.  This is used to match any provided column lists with the
+ * generated qualified identifiers and to filter for the tables provided via
+ * --table.  If a listed table does not exist, the catalog query will fail.
+ */
+SimpleStringList *
+retrieve_objects(PGconn *conn, vacuumingOptions *vacopts,
+				 SimpleStringList *objects, bool echo)
+{
+	PQExpBufferData buf;
+	PQExpBufferData catalog_query;
+	PGresult   *res;
+	SimpleStringListCell *cell;
+	SimpleStringList *found_objs = palloc0(sizeof(SimpleStringList));
+	bool		objects_listed = false;
+
+	initPQExpBuffer(&catalog_query);
+	for (cell = objects ? objects->head : NULL; cell; cell = cell->next)
+	{
+		char	   *just_table = NULL;
+		const char *just_columns = NULL;
+
+		if (!objects_listed)
+		{
+			appendPQExpBufferStr(&catalog_query,
+								 "WITH listed_objects (object_oid, column_list) AS (\n"
+								 "  VALUES (");
+			objects_listed = true;
+		}
+		else
+			appendPQExpBufferStr(&catalog_query, ",\n  (");
+
+		if (objfilter & (OBJFILTER_SCHEMA | OBJFILTER_SCHEMA_EXCLUDE))
+		{
+			appendStringLiteralConn(&catalog_query, cell->val, conn);
+			appendPQExpBufferStr(&catalog_query, "::pg_catalog.regnamespace, ");
+		}
+
+		if (objfilter & OBJFILTER_TABLE)
+		{
+			/*
+			 * Split the relation and column names given by the user.  These
+			 * feed the CTE with values on which pre-run validity checks are
+			 * also performed.  For now, these checks happen only on the
+			 * relation name.
+			 */
+			splitTableColumnsSpec(cell->val, PQclientEncoding(conn),
+								  &just_table, &just_columns);
+
+			appendStringLiteralConn(&catalog_query, just_table, conn);
+			appendPQExpBufferStr(&catalog_query, "::pg_catalog.regclass, ");
+		}
+
+		if (just_columns && just_columns[0] != '\0')
+			appendStringLiteralConn(&catalog_query, just_columns, conn);
+		else
+			appendPQExpBufferStr(&catalog_query, "NULL");
+
+		appendPQExpBufferStr(&catalog_query, "::pg_catalog.text)");
+
+		pg_free(just_table);
+	}
+
+	/* Finish formatting the CTE */
+	if (objects_listed)
+		appendPQExpBufferStr(&catalog_query, "\n)\n");
+
+	appendPQExpBufferStr(&catalog_query, "SELECT c.relname, ns.nspname");
+
+	if (objects_listed)
+		appendPQExpBufferStr(&catalog_query, ", listed_objects.column_list");
+
+	appendPQExpBufferStr(&catalog_query,
+						 " FROM pg_catalog.pg_class c\n"
+						 " JOIN pg_catalog.pg_namespace ns"
+						 " ON c.relnamespace OPERATOR(pg_catalog.=) ns.oid\n"
+						 " CROSS JOIN LATERAL (SELECT c.relkind IN ("
+						 CppAsString2(RELKIND_PARTITIONED_TABLE) ", "
+						 CppAsString2(RELKIND_PARTITIONED_INDEX) ")) as p (inherited)\n"
+						 " LEFT JOIN pg_catalog.pg_class t"
+						 " ON c.reltoastrelid OPERATOR(pg_catalog.=) t.oid\n");
+
+	/*
+	 * Used to match the tables or schemas listed by the user, completing the
+	 * JOIN clause.
+	 */
+	if (objects_listed)
+	{
+		appendPQExpBufferStr(&catalog_query, " LEFT JOIN listed_objects"
+							 " ON listed_objects.object_oid"
+							 " OPERATOR(pg_catalog.=) ");
+
+		if (objfilter & OBJFILTER_TABLE)
+			appendPQExpBufferStr(&catalog_query, "c.oid\n");
+		else
+			appendPQExpBufferStr(&catalog_query, "ns.oid\n");
+	}
+
+	/*
+	 * Exclude temporary tables, beginning the WHERE clause.
+	 */
+	appendPQExpBufferStr(&catalog_query,
+						 " WHERE c.relpersistence OPERATOR(pg_catalog.!=) "
+						 CppAsString2(RELPERSISTENCE_TEMP) "\n");
+
+	/*
+	 * Used to match the tables or schemas listed by the user, for the WHERE
+	 * clause.
+	 */
+	if (objects_listed)
+	{
+		if (objfilter & OBJFILTER_SCHEMA_EXCLUDE)
+			appendPQExpBufferStr(&catalog_query,
+								 " AND listed_objects.object_oid IS NULL\n");
+		else
+			appendPQExpBufferStr(&catalog_query,
+								 " AND listed_objects.object_oid IS NOT NULL\n");
+	}
+
+	/*
+	 * If no tables were listed, filter for the relevant relation types.  If
+	 * tables were given via --table, don't bother filtering by relation type.
+	 * Instead, let the server decide whether a given relation can be
+	 * processed; if it cannot, the server will tell the user about it.
+	 */
+	if ((objfilter & OBJFILTER_TABLE) == 0)
+	{
+		/*
+		 * vacuumdb should generally follow the behavior of the underlying
+		 * VACUUM and ANALYZE commands. If analyze_only is true, process
+		 * regular tables, materialized views, and partitioned tables, just
+		 * like ANALYZE (with no specific target tables) does. Otherwise,
+		 * process only regular tables and materialized views, since VACUUM
+		 * skips partitioned tables when no target tables are specified.
+		 */
+		if (vacopts->analyze_only)
+			appendPQExpBufferStr(&catalog_query,
+								 " AND c.relkind OPERATOR(pg_catalog.=) ANY (array["
+								 CppAsString2(RELKIND_RELATION) ", "
+								 CppAsString2(RELKIND_MATVIEW) ", "
+								 CppAsString2(RELKIND_PARTITIONED_TABLE) "])\n");
+		else
+			appendPQExpBufferStr(&catalog_query,
+								 " AND c.relkind OPERATOR(pg_catalog.=) ANY (array["
+								 CppAsString2(RELKIND_RELATION) ", "
+								 CppAsString2(RELKIND_MATVIEW) "])\n");
+	}
+
+	/*
+	 * For --min-xid-age and --min-mxid-age, the age of the relation is the
+	 * greatest of the ages of the main relation and its associated TOAST
+	 * table.  The commands generated by vacuumdb will also process the TOAST
+	 * table for the relation if necessary, so it does not need to be
+	 * considered separately.
+	 */
+	if (vacopts->min_xid_age != 0)
+	{
+		appendPQExpBuffer(&catalog_query,
+						  " AND GREATEST(pg_catalog.age(c.relfrozenxid),"
+						  " pg_catalog.age(t.relfrozenxid)) "
+						  " OPERATOR(pg_catalog.>=) '%d'::pg_catalog.int4\n"
+						  " AND c.relfrozenxid OPERATOR(pg_catalog.!=)"
+						  " '0'::pg_catalog.xid\n",
+						  vacopts->min_xid_age);
+	}
+
+	if (vacopts->min_mxid_age != 0)
+	{
+		appendPQExpBuffer(&catalog_query,
+						  " AND GREATEST(pg_catalog.mxid_age(c.relminmxid),"
+						  " pg_catalog.mxid_age(t.relminmxid)) OPERATOR(pg_catalog.>=)"
+						  " '%d'::pg_catalog.int4\n"
+						  " AND c.relminmxid OPERATOR(pg_catalog.!=)"
+						  " '0'::pg_catalog.xid\n",
+						  vacopts->min_mxid_age);
+	}
+
+	if (vacopts->missing_stats_only)
+	{
+		appendPQExpBufferStr(&catalog_query, " AND (\n");
+
+		/* regular stats */
+		appendPQExpBufferStr(&catalog_query,
+							 " EXISTS (SELECT NULL FROM pg_catalog.pg_attribute a\n"
+							 " WHERE a.attrelid OPERATOR(pg_catalog.=) c.oid\n"
+							 " AND a.attnum OPERATOR(pg_catalog.>) 0::pg_catalog.int2\n"
+							 " AND NOT a.attisdropped\n"
+							 " AND a.attstattarget IS DISTINCT FROM 0::pg_catalog.int2\n"
+							 " AND a.attgenerated OPERATOR(pg_catalog.<>) "
+							 CppAsString2(ATTRIBUTE_GENERATED_VIRTUAL) "\n"
+							 " AND NOT EXISTS (SELECT NULL FROM pg_catalog.pg_statistic s\n"
+							 " WHERE s.starelid OPERATOR(pg_catalog.=) a.attrelid\n"
+							 " AND s.staattnum OPERATOR(pg_catalog.=) a.attnum\n"
+							 " AND s.stainherit OPERATOR(pg_catalog.=) p.inherited))\n");
+
+		/* extended stats */
+		appendPQExpBufferStr(&catalog_query,
+							 " OR EXISTS (SELECT NULL FROM pg_catalog.pg_statistic_ext e\n"
+							 " WHERE e.stxrelid OPERATOR(pg_catalog.=) c.oid\n"
+							 " AND e.stxstattarget IS DISTINCT FROM 0::pg_catalog.int2\n"
+							 " AND NOT EXISTS (SELECT NULL FROM pg_catalog.pg_statistic_ext_data d\n"
+							 " WHERE d.stxoid OPERATOR(pg_catalog.=) e.oid\n"
+							 " AND d.stxdinherit OPERATOR(pg_catalog.=) p.inherited))\n");
+
+		/* expression indexes */
+		appendPQExpBufferStr(&catalog_query,
+							 " OR EXISTS (SELECT NULL FROM pg_catalog.pg_attribute a\n"
+							 " JOIN pg_catalog.pg_index i"
+							 " ON i.indexrelid OPERATOR(pg_catalog.=) a.attrelid\n"
+							 " WHERE i.indrelid OPERATOR(pg_catalog.=) c.oid\n"
+							 " AND i.indkey[a.attnum OPERATOR(pg_catalog.-) 1::pg_catalog.int2]"
+							 " OPERATOR(pg_catalog.=) 0::pg_catalog.int2\n"
+							 " AND a.attnum OPERATOR(pg_catalog.>) 0::pg_catalog.int2\n"
+							 " AND NOT a.attisdropped\n"
+							 " AND a.attstattarget IS DISTINCT FROM 0::pg_catalog.int2\n"
+							 " AND NOT EXISTS (SELECT NULL FROM pg_catalog.pg_statistic s\n"
+							 " WHERE s.starelid OPERATOR(pg_catalog.=) a.attrelid\n"
+							 " AND s.staattnum OPERATOR(pg_catalog.=) a.attnum\n"
+							 " AND s.stainherit OPERATOR(pg_catalog.=) p.inherited))\n");
+
+		/* inheritance and regular stats */
+		appendPQExpBufferStr(&catalog_query,
+							 " OR EXISTS (SELECT NULL FROM pg_catalog.pg_attribute a\n"
+							 " WHERE a.attrelid OPERATOR(pg_catalog.=) c.oid\n"
+							 " AND a.attnum OPERATOR(pg_catalog.>) 0::pg_catalog.int2\n"
+							 " AND NOT a.attisdropped\n"
+							 " AND a.attstattarget IS DISTINCT FROM 0::pg_catalog.int2\n"
+							 " AND a.attgenerated OPERATOR(pg_catalog.<>) "
+							 CppAsString2(ATTRIBUTE_GENERATED_VIRTUAL) "\n"
+							 " AND c.relhassubclass\n"
+							 " AND NOT p.inherited\n"
+							 " AND EXISTS (SELECT NULL FROM pg_catalog.pg_inherits h\n"
+							 " WHERE h.inhparent OPERATOR(pg_catalog.=) c.oid)\n"
+							 " AND NOT EXISTS (SELECT NULL FROM pg_catalog.pg_statistic s\n"
+							 " WHERE s.starelid OPERATOR(pg_catalog.=) a.attrelid\n"
+							 " AND s.staattnum OPERATOR(pg_catalog.=) a.attnum\n"
+							 " AND s.stainherit))\n");
+
+		/* inheritance and extended stats */
+		appendPQExpBufferStr(&catalog_query,
+							 " OR EXISTS (SELECT NULL FROM pg_catalog.pg_statistic_ext e\n"
+							 " WHERE e.stxrelid OPERATOR(pg_catalog.=) c.oid\n"
+							 " AND e.stxstattarget IS DISTINCT FROM 0::pg_catalog.int2\n"
+							 " AND c.relhassubclass\n"
+							 " AND NOT p.inherited\n"
+							 " AND EXISTS (SELECT NULL FROM pg_catalog.pg_inherits h\n"
+							 " WHERE h.inhparent OPERATOR(pg_catalog.=) c.oid)\n"
+							 " AND NOT EXISTS (SELECT NULL FROM pg_catalog.pg_statistic_ext_data d\n"
+							 " WHERE d.stxoid OPERATOR(pg_catalog.=) e.oid\n"
+							 " AND d.stxdinherit))\n");
+
+		appendPQExpBufferStr(&catalog_query, " )\n");
+	}
+
+	/*
+	 * Execute the catalog query.  We use the default search_path for this
+	 * query for consistency with table lookups done elsewhere by the user.
+	 */
+	appendPQExpBufferStr(&catalog_query, " ORDER BY c.relpages DESC;");
+	executeCommand(conn, "RESET search_path;", echo);
+	res = executeQuery(conn, catalog_query.data, echo);
+	termPQExpBuffer(&catalog_query);
+	PQclear(executeQuery(conn, ALWAYS_SECURE_SEARCH_PATH_SQL, echo));
+
+	/*
+	 * Build qualified identifiers for each table, including the column list
+	 * if given.
+	 */
+	initPQExpBuffer(&buf);
+	for (int i = 0; i < PQntuples(res); i++)
+	{
+		appendPQExpBufferStr(&buf,
+							 fmtQualifiedIdEnc(PQgetvalue(res, i, 1),
+											   PQgetvalue(res, i, 0),
+											   PQclientEncoding(conn)));
+
+		if (objects_listed && !PQgetisnull(res, i, 2))
+			appendPQExpBufferStr(&buf, PQgetvalue(res, i, 2));
+
+		simple_string_list_append(found_objs, buf.data);
+		resetPQExpBuffer(&buf);
+	}
+	termPQExpBuffer(&buf);
+	PQclear(res);
+
+	return found_objs;
+}
+
+/*
+ * Vacuum/analyze all connectable databases.
+ *
+ * In analyze-in-stages mode, we process all databases in one stage before
+ * moving on to the next stage.  That ensures minimal stats are available
+ * quickly everywhere before generating more detailed ones.
+ */
+void
+vacuum_all_databases(ConnParams *cparams,
+					 vacuumingOptions *vacopts,
+					 bool analyze_in_stages,
+					 SimpleStringList *objects,
+					 int concurrentCons,
+					 const char *progname, bool echo, bool quiet)
+{
+	PGconn	   *conn;
+	PGresult   *result;
+	int			stage;
+	int			i;
+
+	conn = connectMaintenanceDatabase(cparams, progname, echo);
+	result = executeQuery(conn,
+						  "SELECT datname FROM pg_database WHERE datallowconn AND datconnlimit <> -2 ORDER BY 1;",
+						  echo);
+	PQfinish(conn);
+
+	if (analyze_in_stages)
+	{
+		SimpleStringList **found_objs = NULL;
+
+		if (vacopts->missing_stats_only)
+			found_objs = palloc0(PQntuples(result) * sizeof(SimpleStringList *));
+
+		/*
+		 * When analyzing all databases in stages, we analyze them all in the
+		 * fastest stage first, so that initial statistics become available
+		 * for all of them as soon as possible.
+		 *
+		 * This means we establish several times as many connections, but
+		 * that's a secondary consideration.
+		 */
+		for (stage = 0; stage < ANALYZE_NUM_STAGES; stage++)
+		{
+			for (i = 0; i < PQntuples(result); i++)
+			{
+				cparams->override_dbname = PQgetvalue(result, i, 0);
+
+				vacuum_one_database(cparams, vacopts,
+									stage,
+									objects,
+									vacopts->missing_stats_only ? &found_objs[i] : NULL,
+									concurrentCons,
+									progname, echo, quiet);
+			}
+		}
+	}
+	else
+	{
+		for (i = 0; i < PQntuples(result); i++)
+		{
+			cparams->override_dbname = PQgetvalue(result, i, 0);
+
+			vacuum_one_database(cparams, vacopts,
+								ANALYZE_NO_STAGE,
+								objects, NULL,
+								concurrentCons,
+								progname, echo, quiet);
+		}
+	}
+
+	PQclear(result);
+}
+
+/*
+ * Construct a vacuum/analyze command to run based on the given
+ * options, in the given string buffer, which may contain previous garbage.
+ *
+ * The table name used must be already properly quoted.  The command generated
+ * depends on the server version involved and it is semicolon-terminated.
+ */
+void
+prepare_vacuum_command(PQExpBuffer sql, int serverVersion,
+					   vacuumingOptions *vacopts, const char *table)
+{
+	const char *paren = " (";
+	const char *comma = ", ";
+	const char *sep = paren;
+
+	resetPQExpBuffer(sql);
+
+	if (vacopts->analyze_only)
+	{
+		appendPQExpBufferStr(sql, "ANALYZE");
+
+		/* parenthesized grammar of ANALYZE is supported since v11 */
+		if (serverVersion >= 110000)
+		{
+			if (vacopts->skip_locked)
+			{
+				/* SKIP_LOCKED is supported since v12 */
+				Assert(serverVersion >= 120000);
+				appendPQExpBuffer(sql, "%sSKIP_LOCKED", sep);
+				sep = comma;
+			}
+			if (vacopts->verbose)
+			{
+				appendPQExpBuffer(sql, "%sVERBOSE", sep);
+				sep = comma;
+			}
+			if (vacopts->buffer_usage_limit)
+			{
+				Assert(serverVersion >= 160000);
+				appendPQExpBuffer(sql, "%sBUFFER_USAGE_LIMIT '%s'", sep,
+								  vacopts->buffer_usage_limit);
+				sep = comma;
+			}
+			if (sep != paren)
+				appendPQExpBufferChar(sql, ')');
+		}
+		else
+		{
+			if (vacopts->verbose)
+				appendPQExpBufferStr(sql, " VERBOSE");
+		}
+	}
+	else
+	{
+		appendPQExpBufferStr(sql, "VACUUM");
+
+		/* parenthesized grammar of VACUUM is supported since v9.0 */
+		if (serverVersion >= 90000)
+		{
+			if (vacopts->disable_page_skipping)
+			{
+				/* DISABLE_PAGE_SKIPPING is supported since v9.6 */
+				Assert(serverVersion >= 90600);
+				appendPQExpBuffer(sql, "%sDISABLE_PAGE_SKIPPING", sep);
+				sep = comma;
+			}
+			if (vacopts->no_index_cleanup)
+			{
+				/* "INDEX_CLEANUP FALSE" has been supported since v12 */
+				Assert(serverVersion >= 120000);
+				Assert(!vacopts->force_index_cleanup);
+				appendPQExpBuffer(sql, "%sINDEX_CLEANUP FALSE", sep);
+				sep = comma;
+			}
+			if (vacopts->force_index_cleanup)
+			{
+				/* "INDEX_CLEANUP TRUE" has been supported since v12 */
+				Assert(serverVersion >= 120000);
+				Assert(!vacopts->no_index_cleanup);
+				appendPQExpBuffer(sql, "%sINDEX_CLEANUP TRUE", sep);
+				sep = comma;
+			}
+			if (!vacopts->do_truncate)
+			{
+				/* TRUNCATE is supported since v12 */
+				Assert(serverVersion >= 120000);
+				appendPQExpBuffer(sql, "%sTRUNCATE FALSE", sep);
+				sep = comma;
+			}
+			if (!vacopts->process_main)
+			{
+				/* PROCESS_MAIN is supported since v16 */
+				Assert(serverVersion >= 160000);
+				appendPQExpBuffer(sql, "%sPROCESS_MAIN FALSE", sep);
+				sep = comma;
+			}
+			if (!vacopts->process_toast)
+			{
+				/* PROCESS_TOAST is supported since v14 */
+				Assert(serverVersion >= 140000);
+				appendPQExpBuffer(sql, "%sPROCESS_TOAST FALSE", sep);
+				sep = comma;
+			}
+			if (vacopts->skip_database_stats)
+			{
+				/* SKIP_DATABASE_STATS is supported since v16 */
+				Assert(serverVersion >= 160000);
+				appendPQExpBuffer(sql, "%sSKIP_DATABASE_STATS", sep);
+				sep = comma;
+			}
+			if (vacopts->skip_locked)
+			{
+				/* SKIP_LOCKED is supported since v12 */
+				Assert(serverVersion >= 120000);
+				appendPQExpBuffer(sql, "%sSKIP_LOCKED", sep);
+				sep = comma;
+			}
+			if (vacopts->full)
+			{
+				appendPQExpBuffer(sql, "%sFULL", sep);
+				sep = comma;
+			}
+			if (vacopts->freeze)
+			{
+				appendPQExpBuffer(sql, "%sFREEZE", sep);
+				sep = comma;
+			}
+			if (vacopts->verbose)
+			{
+				appendPQExpBuffer(sql, "%sVERBOSE", sep);
+				sep = comma;
+			}
+			if (vacopts->and_analyze)
+			{
+				appendPQExpBuffer(sql, "%sANALYZE", sep);
+				sep = comma;
+			}
+			if (vacopts->parallel_workers >= 0)
+			{
+				/* PARALLEL is supported since v13 */
+				Assert(serverVersion >= 130000);
+				appendPQExpBuffer(sql, "%sPARALLEL %d", sep,
+								  vacopts->parallel_workers);
+				sep = comma;
+			}
+			if (vacopts->buffer_usage_limit)
+			{
+				Assert(serverVersion >= 160000);
+				appendPQExpBuffer(sql, "%sBUFFER_USAGE_LIMIT '%s'", sep,
+								  vacopts->buffer_usage_limit);
+				sep = comma;
+			}
+			if (sep != paren)
+				appendPQExpBufferChar(sql, ')');
+		}
+		else
+		{
+			if (vacopts->full)
+				appendPQExpBufferStr(sql, " FULL");
+			if (vacopts->freeze)
+				appendPQExpBufferStr(sql, " FREEZE");
+			if (vacopts->verbose)
+				appendPQExpBufferStr(sql, " VERBOSE");
+			if (vacopts->and_analyze)
+				appendPQExpBufferStr(sql, " ANALYZE");
+		}
+	}
+
+	appendPQExpBuffer(sql, " %s;", table);
+}
+
+/*
+ * Send a vacuum/analyze command to the server, returning after sending the
+ * command.
+ *
+ * Any errors during command execution are reported to stderr.
+ */
+void
+run_vacuum_command(PGconn *conn, const char *sql, bool echo,
+				   const char *table)
+{
+	bool		status;
+
+	if (echo)
+		printf("%s\n", sql);
+
+	status = PQsendQuery(conn, sql) == 1;
+
+	if (!status)
+	{
+		if (table)
+		{
+			pg_log_error("vacuuming of table \"%s\" in database \"%s\" failed: %s",
+						 table, PQdb(conn), PQerrorMessage(conn));
+		}
+		else
+		{
+			pg_log_error("vacuuming of database \"%s\" failed: %s",
+						 PQdb(conn), PQerrorMessage(conn));
+		}
+	}
+}
+
+/*
+ * Returns a newly malloc'd version of 'src' with escaped single quotes and
+ * backslashes.
+ */
+char *
+escape_quotes(const char *src)
+{
+	char	   *result = escape_single_quotes_ascii(src);
+
+	if (!result)
+		pg_fatal("out of memory");
+	return result;
+}
diff --git a/src/bin/scripts/vacuuming.h b/src/bin/scripts/vacuuming.h
new file mode 100644
index 00000000000..d3f000840fa
--- /dev/null
+++ b/src/bin/scripts/vacuuming.h
@@ -0,0 +1,95 @@
+/*-------------------------------------------------------------------------
+ *
+ * vacuuming.h
+ *		Common declarations for vacuuming.c
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/bin/scripts/vacuuming.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef VACUUMING_H
+#define VACUUMING_H
+
+#include "common.h"
+#include "fe_utils/connect_utils.h"
+#include "fe_utils/simple_list.h"
+
+/* For analyze-in-stages mode */
+#define ANALYZE_NO_STAGE	-1
+#define ANALYZE_NUM_STAGES	3
+
+/* vacuum options controlled by user flags */
+typedef struct vacuumingOptions
+{
+	bool		analyze_only;
+	bool		verbose;
+	bool		and_analyze;
+	bool		full;
+	bool		freeze;
+	bool		disable_page_skipping;
+	bool		skip_locked;
+	int			min_xid_age;
+	int			min_mxid_age;
+	int			parallel_workers;	/* >= 0 indicates user specified the
+									 * parallel degree, otherwise -1 */
+	bool		no_index_cleanup;
+	bool		force_index_cleanup;
+	bool		do_truncate;
+	bool		process_main;
+	bool		process_toast;
+	bool		skip_database_stats;
+	char	   *buffer_usage_limit;
+	bool		missing_stats_only;
+} vacuumingOptions;
+
+/* object filter options */
+typedef enum
+{
+	OBJFILTER_NONE = 0,			/* no filter used */
+	OBJFILTER_ALL_DBS = (1 << 0),	/* -a | --all */
+	OBJFILTER_DATABASE = (1 << 1),	/* -d | --dbname */
+	OBJFILTER_TABLE = (1 << 2), /* -t | --table */
+	OBJFILTER_SCHEMA = (1 << 3),	/* -n | --schema */
+	OBJFILTER_SCHEMA_EXCLUDE = (1 << 4),	/* -N | --exclude-schema */
+} VacObjFilter;
+
+extern VacObjFilter objfilter;
+
+extern void vacuuming_main(ConnParams *cparams, const char *dbname,
+						   const char *maintenance_db, vacuumingOptions *vacopts,
+						   SimpleStringList *objects, bool analyze_in_stages,
+						   int tbl_count, int concurrentCons,
+						   const char *progname, bool echo, bool quiet);
+
+extern SimpleStringList *retrieve_objects(PGconn *conn,
+										  vacuumingOptions *vacopts,
+										  SimpleStringList *objects,
+										  bool echo);
+
+extern void vacuum_one_database(ConnParams *cparams,
+								vacuumingOptions *vacopts,
+								int stage,
+								SimpleStringList *objects,
+								SimpleStringList **found_objs,
+								int concurrentCons,
+								const char *progname, bool echo, bool quiet);
+
+extern void vacuum_all_databases(ConnParams *cparams,
+								 vacuumingOptions *vacopts,
+								 bool analyze_in_stages,
+								 SimpleStringList *objects,
+								 int concurrentCons,
+								 const char *progname, bool echo, bool quiet);
+
+extern void prepare_vacuum_command(PQExpBuffer sql, int serverVersion,
+								   vacuumingOptions *vacopts, const char *table);
+
+extern void run_vacuum_command(PGconn *conn, const char *sql, bool echo,
+							   const char *table);
+
+extern char *escape_quotes(const char *src);
+
+#endif							/* VACUUMING_H */
-- 
2.43.0
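
As an aside on prepare_vacuum_command() in the patch above: the option
list is built with a moving separator, which starts out as " (" and
becomes ", " once the first option has been written, so the closing ")"
is needed only if the separator changed.  Below is a minimal standalone
sketch of that idea, assuming a plain char buffer in place of
PQExpBuffer.  demo_options, demo_append() and demo_prepare_command() are
made-up names for illustration and are not part of the patch.

```c
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, cut-down option set; NOT the patch's vacuumingOptions. */
typedef struct demo_options
{
    bool        full;
    bool        verbose;
    bool        and_analyze;
} demo_options;

/* Append formatted text to a NUL-terminated buffer, truncating if full. */
static void
demo_append(char *buf, size_t buflen, const char *fmt, ...)
{
    size_t      len = strlen(buf);
    va_list     ap;

    if (len + 1 >= buflen)
        return;
    va_start(ap, fmt);
    vsnprintf(buf + len, buflen - len, fmt, ap);
    va_end(ap);
}

/*
 * Build "VACUUM [(opt, ...)] table;".  The first option is preceded by
 * " (" and each later one by ", ".
 */
static void
demo_prepare_command(char *buf, size_t buflen,
                     const demo_options *opts, const char *table)
{
    const char *paren = " (";
    const char *comma = ", ";
    const char *sep = paren;

    buf[0] = '\0';              /* caller must pass buflen > 0 */
    demo_append(buf, buflen, "VACUUM");

    if (opts->full)
    {
        demo_append(buf, buflen, "%sFULL", sep);
        sep = comma;
    }
    if (opts->verbose)
    {
        demo_append(buf, buflen, "%sVERBOSE", sep);
        sep = comma;
    }
    if (opts->and_analyze)
    {
        demo_append(buf, buflen, "%sANALYZE", sep);
        sep = comma;
    }

    /* Close the list only if at least one option was emitted. */
    if (sep != paren)
        demo_append(buf, buflen, ")");

    demo_append(buf, buflen, " %s;", table);
}
```

Comparing the pointer sep against paren, rather than keeping a separate
"did we emit anything" flag, is what lets each option block stay a
uniform three lines.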




* Re: Adding REPACK [concurrently]
@ 2025-09-03 09:55  Antonin Houska <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  0 siblings, 0 replies; 106+ messages in thread

From: Antonin Houska @ 2025-09-03 09:55 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Pg Hackers <[email protected]>; Robert Treat <[email protected]>; Fujii Masao <[email protected]>

Mihail Nikalayeu <[email protected]> wrote:

> While testing the MVCC-safe version with the stress test
> 007_repack_concurrently_mvcc.pl I encountered some random crashes with
> logs like this:
> 
> 25-09-02 12:24:40.039 CEST client backend[261907]
> 007_repack_concurrently_mvcc.pl ERROR:  relcache reference
> 0x7715b9f394a8 is not owned by resource owner TopTransaction
> ...
> This time I was clever and first tried to reproduce the issue on the
> non-MVCC-safe version, and it is reproducible.

Thanks again for the thorough testing!

I think this should be fixed separately [1].

[1] https://www.postgresql.org/message-id/119497.1756892972%40localhost

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com






* Re: Adding REPACK [concurrently]
@ 2025-09-23 15:51  Alvaro Herrera <[email protected]>
  parent: Alvaro Herrera <[email protected]>
  1 sibling, 0 replies; 106+ messages in thread

From: Alvaro Herrera @ 2025-09-23 15:51 UTC (permalink / raw)
  To: Pg Hackers <[email protected]>; Antonin Houska <[email protected]>; +Cc: Robert Treat <[email protected]>; Fujii Masao <[email protected]>; Mihail Nikalayeu <[email protected]>

Hello,

Barring further commentary, I intend to get 0001 committed tomorrow, and
0002 some time later -- perhaps by end of this week, or sometime next
week.

Regards

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/






* Re: Adding REPACK [concurrently]
@ 2025-09-25 18:12  Álvaro Herrera <[email protected]>
  parent: Alvaro Herrera <[email protected]>
  1 sibling, 3 replies; 106+ messages in thread

From: Álvaro Herrera @ 2025-09-25 18:12 UTC (permalink / raw)
  To: Pg Hackers <[email protected]>; Antonin Houska <[email protected]>; +Cc: Robert Treat <[email protected]>; Fujii Masao <[email protected]>; Mihail Nikalayeu <[email protected]>

After looking at this some more, I realized that 0001 had been written a
bit too hastily and that it could do with some more cleanup -- in
particular, we don't need to export most of the function prototypes
other than vacuuming_main() (and the trivial escape_quotes helper).  I
made the other functions static.  Also, prepare_vacuum_command()
needs the encoding in order to do fmtIdEnc() on a given index name (for
`pg_repackdb -t table --index=foobar`), so I changed it to take the
PGconn instead of just the serverVersion.  I realized that it makes no
sense for objfilter to be a global variable instead of living inside
`main` and being passed as an argument where needed.  (Heck, maybe it
should be inside vacuumingOptions.)  Lastly, it seemed weird coding that
the functions would sometimes exit(1) instead of returning a result
code, so I made them return one and have the callers react
appropriately.  These are all fairly straightforward changes.
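
To illustrate the last point, here is a minimal sketch of the
return-a-result-code convention, where a worker function reports failure
to its caller instead of calling exit(1) itself.  demo_db,
demo_process_database() and demo_process_all() are hypothetical
stand-ins, not the functions in vacuuming.c.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for "a database we try to process". */
typedef struct demo_db
{
    const char *name;
    bool        reachable;
} demo_db;

/* Returns 0 on success, 1 on failure; never calls exit(). */
static int
demo_process_database(const demo_db *db)
{
    if (!db->reachable)
    {
        fprintf(stderr, "could not process database \"%s\"\n", db->name);
        return 1;
    }
    return 0;
}

/* Stops at the first failure and hands its result code to the caller. */
static int
demo_process_all(const demo_db *dbs, int ndbs)
{
    for (int i = 0; i < ndbs; i++)
    {
        int         rc = demo_process_database(&dbs[i]);

        if (rc != 0)
            return rc;
    }
    return 0;
}
```

With this shape only main() needs to turn the result into an exit
status, so cleanup code in the callers still gets a chance to run.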

So here's v22 with those and rebased to current sources.  Only the first
two patches this time, which are the ones I would be glad to receive
input on.

I also wonder if analyze_only and analyze_in_stages should be new values
in RunMode rather than separate booleans ... I think that might make the
code simpler.  I didn't try though.
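
A rough sketch of what that could look like is below.  The actual
RunMode values are not shown in this message, so the enum here
(DemoRunMode and its members) is entirely made up; only the stage count
of 3 mirrors ANALYZE_NUM_STAGES from the patch.

```c
#include <stddef.h>

#define DEMO_ANALYZE_NUM_STAGES 3   /* mirrors ANALYZE_NUM_STAGES */

/* Hypothetical run modes; the real RunMode members may differ. */
typedef enum DemoRunMode
{
    DEMO_MODE_VACUUM,
    DEMO_MODE_REPACK,
    DEMO_MODE_ANALYZE_ONLY,
    DEMO_MODE_ANALYZE_IN_STAGES
} DemoRunMode;

/* SQL keyword the command starts with in each mode. */
static const char *
demo_command_keyword(DemoRunMode mode)
{
    switch (mode)
    {
        case DEMO_MODE_VACUUM:
            return "VACUUM";
        case DEMO_MODE_REPACK:
            return "REPACK";
        case DEMO_MODE_ANALYZE_ONLY:
        case DEMO_MODE_ANALYZE_IN_STAGES:
            return "ANALYZE";
    }
    return NULL;                /* not reached for valid modes */
}

/* Number of passes over the tables: several only for staged analyze. */
static int
demo_num_passes(DemoRunMode mode)
{
    return (mode == DEMO_MODE_ANALYZE_IN_STAGES) ? DEMO_ANALYZE_NUM_STAGES : 1;
}
```

A single mode value rules out contradictory combinations (such as
"analyze only" together with "repack") that separate booleans allow.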

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"The gods do not protect fools.  Fools are protected by
more capable fools" (Louis Wu, Ringworld)


* Re: Adding REPACK [concurrently]
@ 2025-09-25 20:20  Marcos Pegoraro <[email protected]>
  parent: Álvaro Herrera <[email protected]>
  2 siblings, 1 reply; 106+ messages in thread

From: Marcos Pegoraro @ 2025-09-25 20:20 UTC (permalink / raw)
  To: Álvaro Herrera <[email protected]>; +Cc: Pg Hackers <[email protected]>; Antonin Houska <[email protected]>; Robert Treat <[email protected]>; Fujii Masao <[email protected]>; Mihail Nikalayeu <[email protected]>

On Thu, Sep 25, 2025 at 15:12, Álvaro Herrera <[email protected]>
wrote:

Some typos I found in the usage output of pg_repackdb.

+ printf(_("  -n, --schema=SCHEMA             repack tables in the specified schema(s) only\n"));
+ printf(_("  -N, --exclude-schema=SCHEMA     do not repack tables in the specified schema(s)\n"));
Each option names a single schema, so "(s)" should be removed:
"in the specified schema(s)" should be "in the specified schema".

The same applies to this one, which should say table, not table(s):
+ printf(_("  -t, --table='TABLE'             repack specific table(s) only\n"));

regards
Marcos



* Re: Adding REPACK [concurrently]
@ 2025-09-25 21:31  Robert Treat <[email protected]>
  parent: Marcos Pegoraro <[email protected]>
  0 siblings, 1 reply; 106+ messages in thread

From: Robert Treat @ 2025-09-25 21:31 UTC (permalink / raw)
  To: Marcos Pegoraro <[email protected]>; +Cc: Álvaro Herrera <[email protected]>; Pg Hackers <[email protected]>; Antonin Houska <[email protected]>; Fujii Masao <[email protected]>; Mihail Nikalayeu <[email protected]>

On Thu, Sep 25, 2025 at 4:21 PM Marcos Pegoraro <[email protected]> wrote:
>
> On Thu, Sep 25, 2025 at 15:12, Álvaro Herrera <[email protected]> wrote:
>
> Some typos I've found on usage of pg_repackdb.
>
> + printf(_("  -n, --schema=SCHEMA             repack tables in the specified schema(s) only\n"));
> + printf(_("  -N, --exclude-schema=SCHEMA     do not repack tables in the specified schema(s)\n"));
> both options can point to a single schema, so "(s)" should be removed.
> "in the specified schema(s)" should be "in the specified schema"
>
> Same occurs on this one, which should be table, not table(s)
> + printf(_("  -t, --table='TABLE'             repack specific table(s) only\n"));
>

This pattern is used because you can pass more than one argument, for
example, something like

pg_repackdb -d pagila -v -n public -n legacy

While I agree that the wording is a little awkward (I'd prefer "repack
tables only in the specified schema(s)"), this follows the same
pattern as pg_dump and friends.

Robert Treat
https://xzilla.net






* Re: Adding REPACK [concurrently]
@ 2025-09-25 21:46  Marcos Pegoraro <[email protected]>
  parent: Robert Treat <[email protected]>
  0 siblings, 0 replies; 106+ messages in thread

From: Marcos Pegoraro @ 2025-09-25 21:46 UTC (permalink / raw)
  To: Robert Treat <[email protected]>; +Cc: Álvaro Herrera <[email protected]>; Pg Hackers <[email protected]>; Antonin Houska <[email protected]>; Fujii Masao <[email protected]>; Mihail Nikalayeu <[email protected]>

On Thu, Sep 25, 2025 at 18:31, Robert Treat <[email protected]>
wrote:

> This pattern is used because you can pass more than one argument, for
> example, something like


I know that

>
> While I agree that the wording is a little awkward, this follows the same
> pattern as pg_dump and friends.
>

Well, I think pg_dump looks wrong too.  If the docs explain that the
option takes a single table or a single schema, why write it in the
plural in the usage text?
+        Repack or analyze all tables in
+        <replaceable class="parameter">schema</replaceable> only.  Multiple
+        schemas can be repacked by writing multiple <option>-n</option>
+        switches.

Instead of
+ printf(_("  -n, --schema=SCHEMA             repack tables in the specified schema(s) only\n"));
maybe this?
+ printf(_("  -n, --schema=SCHEMA             repack tables in the specified schema, can be used several times\n"));

regards
Marcos



* Re: Adding REPACK [concurrently]
@ 2025-09-26 14:27  Mihail Nikalayeu <[email protected]>
  parent: Álvaro Herrera <[email protected]>
  2 siblings, 1 reply; 106+ messages in thread

From: Mihail Nikalayeu @ 2025-09-26 14:27 UTC (permalink / raw)
  To: Álvaro Herrera <[email protected]>; +Cc: Pg Hackers <[email protected]>; Antonin Houska <[email protected]>; Robert Treat <[email protected]>; Fujii Masao <[email protected]>

Hello!

Álvaro Herrera <[email protected]>:
> So here's v22 with those and rebased to current sources.  Only the first
> two patches this time, which are the ones I would be glad to receive
> input on.

> get_tables_to_repack_partitioned(RepackCommand cmd, MemoryContext cluster_context,
>                          Oid relid, bool rel_is_index)

Should we rename cluster_context to repack_context, to be aligned with
the calling side?

---------
The parameter is named 'cmd' in the prototype

> static List *get_tables_to_repack(RepackCommand cmd, bool usingindex,
>                           MemoryContext permcxt);

but 'command' in the definition

> get_tables_to_repack(RepackCommand command, bool usingindex,
>                 MemoryContext permcxt)

---------

> cmd == REPACK_COMMAND_CLUSTER ? "CLUSTER" : "REPACK",

This could be changed to use RepackCommandAsString.

-----------

if (cmd == REPACK_COMMAND_REPACK)
    pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
                          PROGRESS_REPACK_COMMAND_REPACK);
else if (cmd == REPACK_COMMAND_CLUSTER)
{
    pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
                          PROGRESS_CLUSTER_COMMAND_CLUSTER);
} else ....

The '{' and '}' around only the second branch look a little bit weird.

--------
The pg_repackdb documentation mentions "analyze" a lot, and even an
"--analyze" parameter, but I can't see anything related in the code.


Best regards,
Mikhail.






* Re: Adding REPACK [concurrently]
@ 2025-09-26 17:30  Robert Treat <[email protected]>
  parent: Álvaro Herrera <[email protected]>
  2 siblings, 0 replies; 106+ messages in thread

From: Robert Treat @ 2025-09-26 17:30 UTC (permalink / raw)
  To: Álvaro Herrera <[email protected]>; +Cc: Pg Hackers <[email protected]>; Antonin Houska <[email protected]>; Fujii Masao <[email protected]>; Mihail Nikalayeu <[email protected]>

On Thu, Sep 25, 2025 at 2:12 PM Álvaro Herrera <[email protected]> wrote:
> So here's v22 with those and rebased to current sources.  Only the first
> two patches this time, which are the ones I would be glad to receive
> input on.
>

I noticed a number of small issues.  I don't know that they all need
addressing right now, but it seems worth asking the questions...

#1
"pg_repackdb --help" does not mention the --index option, although the
flag is accepted. I'm not sure if this is meant to match clusterdb,
but since we need the index option to invoke the clustering behavior,
I think it needs to be there.

#2
[xzilla@zebes] pgsql/bin/pg_repackdb -d pagila -v -t customer
--index=idx_last_name
pg_repackdb: repacking database "pagila"
INFO:  clustering "public.customer" using sequential scan and sort

[xzilla@zebes] pgsql/bin/pg_repackdb -d pagila -v -t customer
pg_repackdb: repacking database "pagila"
INFO:  vacuuming "public.customer"

This was less confusing once I figured out we could pass the --index
option, but even with that it is a little confusing, I think mostly
because it looks like we are "vacuuming" the table, which in a world
of repack and vacuum (ie. no vacuum full) doesn't make sense. I think
the right thing to do here would be to modify it to be "repacking %s"
in both cases, with the "using sequential scan and sort" as the means
to understand which version of repack is being executed.

#3
pg_repackdb does not offer an --analyze option, which istm it should
to match the REPACK command

#4
SQL level REPACK help shows:

where option can be one of:
    VERBOSE [ boolean ]
    ANALYSE | ANALYZE

but SQL level VACUUM does
    VERBOSE [ boolean ]
    ANALYZE [ boolean ]

These operate the same way, so I would expect it to match the language
in vacuum.

#5
[xzilla@zebes] pgsql/bin/pg_repackdb -d pagila -v -t film --index
pg_repackdb: repacking database "pagila"

In the above scenario, I am repacking without having previously
specified an index. At the SQL level this would throw an error, at the
command line it gives me a heart attack. :-)
It's actually not that bad, because we don't actually do anything, but
maybe we should throw an error?

#6
On the individual command pages (like sql-repack.html), I think there
should be more cross-linking, ie. repack should probably say "see also
cluster" and vice versa. Likely similarly with vacuum and repack.

#7
Is there some reason you chose to intermingle the repack regression
tests with the existing tests? I feel like it'd be easier to
differentiate potential regressions and new functionality if these
were separated.

Robert Treat
https://xzilla.net


* Re: Adding REPACK [concurrently]
@ 2025-10-07 14:05  Álvaro Herrera <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  0 siblings, 2 replies; 106+ messages in thread

From: Álvaro Herrera @ 2025-10-07 14:05 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; Robert Treat <[email protected]>; +Cc: Pg Hackers <[email protected]>; Antonin Houska <[email protected]>; Fujii Masao <[email protected]>

On 2025-Sep-26, Mihail Nikalayeu wrote:

> Should we rename it to repack_context to be aligned with the calling side?

Sure, done.

> > cmd == REPACK_COMMAND_CLUSTER ? "CLUSTER" : "REPACK",
> 
> May be changed to RepackCommandAsString

Oh, of course.

> Documentation of pg_repackdb contains a lot of "analyze" and even
> "--analyze" parameter - but I can't see anything related in the code.

Hmm, yeah, that was missing.  I added it.  In doing so I noticed that
since vacuumdb allows a column list to be given, we should do
likewise here, both in pg_repackdb and in the REPACK command, so I added
support for that.  This changed the grammar a little bit.  Note that we
still don't allow multiple tables to be given to the SQL command REPACK,
so if you want to repack multiple tables, you need to call it without
giving a name or give the name of a partitioned table.  The pg_repackdb
utility allows you to give multiple -t switches, and in that case it
calls REPACK once for each name.

Also, if you give a column list to pg_repackdb, then you must pass -z.
This is consistent with vacuumdb via VACUUM ANALYZE.
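
The two rules above (one REPACK per -t switch, and a column list only
together with -z) can be sketched as follows.  This is an illustration
under assumptions: the message does not say whether the check happens in
pg_repackdb or on the server, and demo_has_column_list() and
demo_plan_commands() are hypothetical names.  The sketch also treats any
"(" in the -t argument as the start of a column list, which ignores
quoted identifiers.

```c
#include <stdbool.h>
#include <string.h>

/* True if a -t argument carries a column list, e.g. "tab(col1, col2)". */
static bool
demo_has_column_list(const char *spec)
{
    return strchr(spec, '(') != NULL;
}

/*
 * Returns the number of REPACK commands to issue (one per -t switch), or
 * -1 if some table names a column list while analyze was not requested.
 */
static int
demo_plan_commands(const char *const *specs, int nspecs, bool analyze)
{
    for (int i = 0; i < nspecs; i++)
    {
        if (demo_has_column_list(specs[i]) && !analyze)
            return -1;
    }
    return nspecs;
}
```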

On 2025-Sep-26, Robert Treat wrote:

> #1
> "pg_repackdb --help" does not mention the --index option, although the
> flag is accepted. I'm not sure if this is meant to match clusterdb,
> but since we need the index option to invoke the clustering behavior,
> I think it needs to be there.

Oops, yes, added.

> #2
> [xzilla@zebes] pgsql/bin/pg_repackdb -d pagila -v -t customer
> --index=idx_last_name
> pg_repackdb: repacking database "pagila"
> INFO:  clustering "public.customer" using sequential scan and sort
> 
> [xzilla@zebes] pgsql/bin/pg_repackdb -d pagila -v -t customer
> pg_repackdb: repacking database "pagila"
> INFO:  vacuuming "public.customer"
> 
> This was less confusing once I figured out we could pass the --index
> option, but even with that it is a little confusing, I think mostly
> because it looks like we are "vacuuming" the table, which in a world
> of repack and vacuum (ie. no vacuum full) doesn't make sense. I think
> the right thing to do here would be to modify it to be "repacking %s"
> in both cases, with the "using sequential scan and sort" as the means
> to understand which version of repack is being executed.

I changed these messages to always say "repacking", followed by
"using sequential scan and sort", "using index", or "following
physical order", depending on the method used.

That said, on this topic, I've always been bothered by our usage of
command names as verbs, because they are (IMO) horrible for translation.
For instance, in this version of the patch I am making this change:

    if (OidIsValid(indexOid) && OldHeap->rd_rel->relisshared)
        ereport(ERROR,
-               (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-                errmsg("cannot cluster a shared catalog")));
+               errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+               errmsg("cannot run %s on a shared catalog",
+                      RepackCommandAsString(cmd)));

In the old version, the message is not very translatable because you
have to find a native word to say "to cluster" or "to vacuum", and that
doesn't always work very well in a direct translation.  For instance, in
the Spanish message catalog you find this sort of thing:

msgid "vacuuming \"%s.%s.%s\""
msgstr "haciendo vacuum a «%s.%s.%s»"

which is pretty clear ... but the reason it works, is that I have turned
the phrase around before translating it.  I would struggle if I had to
find a Spanish verb that means "to repack" without contorting the
message or saying something absurd and/or against Spanish language
rules, such as "ejecutando repack en table XYZ" or "repaqueando tabl
XYZ" (that's not a word!) or "reempaquetando tabla XYZ" (this is
correct, but far enough from "repack" that it's annoying and potentially
confusing).  So I would rather the original used "running REPACK on
table using method XYZ", which is very very easy to translate, and then
the translator doesn't have to editorialize.
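
A minimal standalone sketch of that message shape follows. It is not the
code in the patch: the enum members other than REPACK_COMMAND_CLUSTER, the
strings returned, and both function names are assumptions made up for
illustration.  The point is only that the command name is substituted as an
untranslated token, so the translatable part contains no command-as-verb.

```c
#include <stdio.h>

/*
 * Hypothetical stand-in for the parse-node enum.  Only the CLUSTER member
 * name appears in this thread; the other two are assumed.
 */
typedef enum RepackCommand
{
    REPACK_COMMAND_CLUSTER = 1,
    REPACK_COMMAND_REPACK,
    REPACK_COMMAND_VACUUMFULL
} RepackCommand;

/* Return the SQL keyword for a command; never translated. */
const char *
repack_command_as_string(RepackCommand cmd)
{
    switch (cmd)
    {
        case REPACK_COMMAND_CLUSTER:
            return "CLUSTER";
        case REPACK_COMMAND_REPACK:
            return "REPACK";
        case REPACK_COMMAND_VACUUMFULL:
            return "VACUUM FULL";
    }
    return "???";
}

/*
 * Build the error text.  Only the format string would go through the
 * message catalog; the keyword is passed in as a %s argument.
 */
int
format_shared_catalog_error(char *buf, size_t len, RepackCommand cmd)
{
    return snprintf(buf, len, "cannot run %s on a shared catalog",
                    repack_command_as_string(cmd));
}
```

With this shape a translator only has to render "cannot run %s on a shared
catalog" once, regardless of which command triggered it.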

> #3
> pg_repackdb does not offer an --analyze option, which istm it should
> to match the REPACK command

Added, as mentioned above.

> #4

Fixed.

> #5
> [xzilla@zebes] pgsql/bin/pg_repackdb -d pagila -v -t film --index
> pg_repackdb: repacking database "pagila"
> 
> In the above scenario, I am repacking without having previously
> specified an index. At the SQL level this would throw an error, at the
> command line it gives me a heart attack. :-)
> It's actually not that bad, because we don't actually do anything, but
> maybe we should throw an error?

Yeah, I think this is confusing.  I think we should make pg_repackdb
explicitly indicate what has been done, in all cases, without requiring
-v.  Otherwise it's too confusing, particularly for the using-index mode
which determines which tables to process based on the existence of an
index marked indisclustered.

> #6
> On the individual command pages (like sql-repack.html), I think there
> should be more cross-linking, ie. repack should probably say "see also
> cluster" and vice versa. Likely similarly with vacuum and repack.

Hmm, I don't necessarily agree -- I think the sql-cluster page should be
mostly empty and reference the sql-repack page.  We don't need any
incoming links to sql-cluster, I think.  All the useful info should be
in the sql-repack page only.  The same applies for VACUUM FULL: an
outgoing link in sql-vacuum to sql-repack is good to have, but we don't
need links from sql-repack to sql-vacuum.

> #7
> Is there some reason you chose to intermingle the repack regression
> tests with the existing tests? I feel like it'd be easier to
> differentiate potential regressions and new functionality if these
> were separated.

I admit I haven't paid too much attention to these tests.  I think I
would rather create a separate src/test/regress/sql/repack.sql file with
the tests for this command.  Let's consider this part a WIP for now --
clearly more tests are needed both for the SQL command CLUSTER and for
pg_repackdb.

In the meantime, this version has been rebased to current sources.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/


* Re: Adding REPACK [concurrently]
@ 2025-10-09 06:38  Antonin Houska <[email protected]>
  parent: Álvaro Herrera <[email protected]>
  1 sibling, 1 reply; 106+ messages in thread

From: Antonin Houska @ 2025-10-09 06:38 UTC (permalink / raw)
  To: [email protected]; +Cc: Mihail Nikalayeu <[email protected]>; Robert Treat <[email protected]>; Pg Hackers <[email protected]>; Fujii Masao <[email protected]>

Álvaro Herrera <[email protected]> wrote:

> On 2025-Sep-26, Mihail Nikalayeu wrote:
> 
> > Should we rename it to repack_context to be aligned with the calling side?
> 
> Sure, done.
> 
> > > cmd == REPACK_COMMAND_CLUSTER ? "CLUSTER" : "REPACK",
> > 
> > May be changed to RepackCommandAsString
> 
> Oh, of course.
> 
> > Documentation of pg_repackdb contains a lot of "analyze" and even
> > "--analyze" parameter - but I can't see anything related in the code.
> 
> Hmm, yeah, that was missing.  I added it.  In doing so I noticed that
> because vacuumdb allows a column list to be given, then we should do
> likewise here, both in pg_repackdb and in the REPACK command, so I added
> support for that.

+	/*
+	 * Make sure ANALYZE is specified if a column list is present.
+	 */
+	if ((params->options & CLUOPT_ANALYZE) == 0 && stmt->relation->va_cols != NIL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("ANALYZE option must be specified when a column list is provided")));

Shouldn't the user documentation mention this restriction?

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com






* Re: Adding REPACK [concurrently]
@ 2025-10-09 11:49  Álvaro Herrera <[email protected]>
  parent: Antonin Houska <[email protected]>
  0 siblings, 0 replies; 106+ messages in thread

From: Álvaro Herrera @ 2025-10-09 11:49 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Mihail Nikalayeu <[email protected]>; Robert Treat <[email protected]>; Pg Hackers <[email protected]>; Fujii Masao <[email protected]>

On 2025-Oct-09, Antonin Houska wrote:

> +	/*
> +	 * Make sure ANALYZE is specified if a column list is present.
> +	 */
> +	if ((params->options & CLUOPT_ANALYZE) == 0 && stmt->relation->va_cols != NIL)
> +		ereport(ERROR,
> +				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +				 errmsg("ANALYZE option must be specified when a column list is provided")));
> 
> Shouldn't the user documentation mention this restriction?

Hmm, yeah, I guess it should.  Will add.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"¿Cómo puedes confiar en algo que pagas y que no ves,
y no confiar en algo que te dan y te lo muestran?" (Germán Poo)






* Re: Adding REPACK [concurrently]
@ 2025-10-10 14:11  Alvaro Herrera <[email protected]>
  parent: Alvaro Herrera <[email protected]>
  6 siblings, 0 replies; 106+ messages in thread

From: Alvaro Herrera @ 2025-10-10 14:11 UTC (permalink / raw)
  To: Pg Hackers <[email protected]>; +Cc: Antonin Houska <[email protected]>

Hello,

Here's patch v24.  I was hoping to push this today, but I think there
were too many changes from v23 for that.  Here's what I did:

- pg_stat_progress_cluster is no longer a view on top of the low-level
  pg_stat_get_progress_info() function.  Instead, it's a view on top of
  pg_stat_progress_repack.  The only change it applies on top of that
  one is to change the command from REPACK to one of VACUUM FULL or
  CLUSTER, depending on whether an index is being used or not.  This
  should keep the behavior identical to previous versions.
  Alternatively we could just hide rows where the command is REPACK, but
  I don't think that would be any better.  This way, we maintain
  compatibility with tools reading pg_stat_progress_cluster.  Maybe this
  is useless and we should just drop the view, not sure, we can discuss
  separately.

- pg_stat_progress_repack itself now shows the command.  Also I got rid
  of the separate enum values for the command, and instead used the
  values from the parse node (RepackCommand); this removes about a dozen
  lines of C code.  To forestall potentially bogus usage of value 0, I
  made the enum start from 1.

- I noticed that you can do "CLUSTER pg_class ON some_index" and it will
  happily modify pg_index.indisclustered, which is a bit weird
  considering that allow_system_table_mods is off -- if you later try
  ALTER TABLE .. SET WITHOUT CLUSTER, it won't let you.  I think this is
  bogus and we should change it so that CLUSTER refuses to change the
  clustered index on a system catalog, unless allow_system_table_mods is
  on.  However, that would be a change from longstanding behavior which
  is specifically tested for in regression tests, so I didn't do it.
  We can discuss such a change separately.  But I did make REPACK refuse
  to do that, because we don't need to propagate bogus historical
  behavior.  So REPACK will fail if you try to change the indisclustered
  index, but it will work fine if you repack based on the same index as
  before, or repack with no index.

- pg_repackdb: if you try with a non-superuser without specifying a
  table name, it will fail as soon as it hits the first catalog table or
  whatever with "ERROR: cannot lock this table".  This is sorta fine for
  vacuumdb, but only because VACUUM itself will instead say "WARNING:
  cannot lock table XYZ, skipping", so it's not an error and vacuumdb
  keeps running.  IMO this is bogus: vacuumdb should not try to process
  tables that it doesn't have privileges to.  However, not wanting to
  change longstanding behavior, I left that alone.  For pg_repackdb, I
  added a condition in the WHERE clause there to only fetch tables that
  the current user has MAINTAIN privilege over.  Then you can do a
  "pg_repackdb -U foobar" and it will nicely process the tables that
  that user is allowed to process.  We can discuss changing the vacuumdb
  behavior separately.

- Added some additional tests for pg_repackdb and REPACK.

- Updated the docs.
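
As a rough illustration of the first two items, here is a standalone C
sketch of the command mapping the compatibility view performs, plus the
"enum starts at 1" guard.  The enum member names other than
REPACK_COMMAND_CLUSTER and both function names are hypothetical; the real
view does this in SQL, not in C.

```c
#include <stdbool.h>

/* Hypothetical copy of the parse-node enum; 0 is deliberately unused. */
typedef enum RepackCommand
{
    REPACK_COMMAND_CLUSTER = 1,
    REPACK_COMMAND_REPACK,
    REPACK_COMMAND_VACUUMFULL
} RepackCommand;

/* A zeroed (uninitialized) value is rejected because no member is 0. */
bool
repack_command_is_valid(int value)
{
    return value >= REPACK_COMMAND_CLUSTER &&
        value <= REPACK_COMMAND_VACUUMFULL;
}

/*
 * Command name as the legacy progress view would report it: REPACK is
 * shown as CLUSTER when an index is in use and as VACUUM FULL otherwise,
 * so existing monitoring tools keep seeing the old two names.
 */
const char *
legacy_progress_command(RepackCommand cmd, bool index_in_use)
{
    if (cmd == REPACK_COMMAND_REPACK)
        return index_in_use ? "CLUSTER" : "VACUUM FULL";
    if (cmd == REPACK_COMMAND_CLUSTER)
        return "CLUSTER";
    return "VACUUM FULL";
}
```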

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



* Re: Adding REPACK [concurrently]
@ 2025-10-13 00:03  Robert Treat <[email protected]>
  parent: Álvaro Herrera <[email protected]>
  1 sibling, 0 replies; 106+ messages in thread

From: Robert Treat @ 2025-10-13 00:03 UTC (permalink / raw)
  To: Álvaro Herrera <[email protected]>; +Cc: Mihail Nikalayeu <[email protected]>; Pg Hackers <[email protected]>; Antonin Houska <[email protected]>; Fujii Masao <[email protected]>

On Tue, Oct 7, 2025 at 10:05 AM Álvaro Herrera <[email protected]> wrote:
> On 2025-Sep-26, Robert Treat wrote:
<snip>
> That said, on this topic, I've always been bothered by our usage of
> command names as verbs, because they are (IMO) horrible for translation.
> For instance, in this version of the patch I am making this change:
>
>     if (OidIsValid(indexOid) && OldHeap->rd_rel->relisshared)
>         ereport(ERROR,
> -               (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> -                errmsg("cannot cluster a shared catalog")));
> +               errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +               errmsg("cannot run %s on a shared catalog",
> +                      RepackCommandAsString(cmd)));
>
> In the old version, the message is not very translatable because you
> have to find a native word to say "to cluster" or "to vacuum", and that
> doesn't always work very well in a direct translation.  For instance, in
> the Spanish message catalog you find this sort of thing:
>
> msgid "vacuuming \"%s.%s.%s\""
> msgstr "haciendo vacuum a «%s.%s.%s»"
>
> which is pretty clear ... but the reason it works, is that I have turned
> the phrase around before translating it.  I would struggle if I had to
> find a Spanish verb that means "to repack" without contorting the
> message or saying something absurd and/or against Spanish language
> rules, such as "ejecutando repack en tabla XYZ" or "repaqueando tabla
> XYZ" (that's not a word!) or "reempaquetando tabla XYZ" (this is
> correct, but far enough from "repack" that it's annoying and potentially
> confusing).  So I would rather the original used "running REPACK on
> table using method XYZ", which is very very easy to translate, and then
> the translator doesn't have to editorialize.
>

I see you didn't do this in the current patch, but +1 for this idea
from me. And if you think it'd help, I'm also +1 on the idea for the
main docs as well, for example doing something like

+  <para>
-   <application>pg_repackdb</application> is a utility for repacking a
+   <application>pg_repackdb</application> is a utility for running REPACK on a
+   <productname>PostgreSQL</productname> database.

I'd be inclined to leave the internal comments alone though, since
they aren't translated.

> > #5
> > [xzilla@zebes] pgsql/bin/pg_repackdb -d pagila -v -t film --index
> > pg_repackdb: repacking database "pagila"
> >
> > In the above scenario, I am repacking without having previously
> > specified an index. At the SQL level this would throw an error, at the
> > command line it gives me a heart attack. :-)
> > It's actually not that bad, because we don't actually do anything, but
> > maybe we should throw an error?
>
> Yeah, I think this is confusing.  I think we should make pg_repackdb
> explicitly indicate what has been done, in all cases, without requiring
> -v.  Otherwise it's too confusing, particularly for the using-index mode
> which determines which tables to process based on the existence of an
> index marked indisclustered.
>

At the moment, clusterdb runs silently, but vacuumdb emits output, so
there is an argument for either way as default behavior. That said, I
think the current behavior of vacuum, which is what we are currently
following in pg_repackdb, is the worst of the two:

 [xzilla@zebes]  pgsql/bin/vacuumdb -t actor  pagila
vacuumdb: vacuuming database "pagila"

Without any additional information, the information we do give is
misleading; I would rather not say anything. We could of course try to
make this more verbose, but I think clusterdb actually gets this
right...
- say nothing by default (follow the "rule of silence.")
- if we want to see commands, pass -e
- if we want to see the details, pass -v
- if we do something that causes an error, return the error
- if we don't want errors, pass -q

This is also how reindexdb works, and I think most of the other
utilities, and I'd argue this is how vacuumdb should work... to the
extent I almost consider it a bug that it doesn't (I leave a little
room since I am not sure why it doesn't operate like the other
utilities). vacuum is a bit outside the purview of what we are doing
here, but I do think following clusterdb/reindexdb is the behavior we
should follow for pg_repackdb.
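
The output policy listed above can be written down as a small decision
function.  This is only a sketch of the rules as stated in this mail, not
code from clusterdb or reindexdb; the type and function names are invented
for the example.

```c
#include <stdbool.h>

/* Kinds of output a client utility might produce (hypothetical). */
typedef enum MessageKind
{
    MSG_COMMAND,                /* the SQL being sent, shown with -e */
    MSG_DETAIL,                 /* per-table details, shown with -v */
    MSG_ERROR                   /* errors, hidden with -q */
} MessageKind;

/* Flags as they would be set from the command line (hypothetical). */
typedef struct OutputOptions
{
    bool        echo;           /* -e */
    bool        verbose;        /* -v */
    bool        quiet;          /* -q */
} OutputOptions;

/* Rule of silence: with no flags, only errors are printed. */
bool
should_print(MessageKind kind, const OutputOptions *opts)
{
    switch (kind)
    {
        case MSG_COMMAND:
            return opts->echo;
        case MSG_DETAIL:
            return opts->verbose;
        case MSG_ERROR:
            return !opts->quiet;
    }
    return false;
}
```

Under this model a bare "pg_repackdb -t film pagila" prints nothing on
success, which avoids the misleading one-line banner shown above.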

> I admit I haven't paid too much attention to these tests.  I think I
> would rather create a separate src/test/regress/sql/repack.sql file with
> the tests for this command.  Let's consider this part a WIP for now --
> clearly more tests are needed both for the SQL command CLUSTER and for
> pg_repackdb.

Yeah, istm as long as we have all 3 commands (repack, cluster, vacuum
full) we need regression tests for all 3.

> - pg_stat_progress_cluster is no longer a view on top of the low-level
>   pg_stat_get_progress_info() function.  Instead, it's a view on top of
>   pg_stat_progress_repack.  The only change it applies on top of that
>   one is change the command from REPACK to one of VACUUM FULL or
>   CLUSTER, depending on whether an index is being used or not.  This
>   should keep the behavior identical to previous versions.
>   Alternatively we could just hide rows where the command is REPACK, but
>   I don't think that would be any better.  This way, we maintain
>   compatibility with tools reading pg_stat_progress_cluster.  Maybe this
>   is useless and we should just drop the view, not sure, we can discuss
>   separately.
>

I think this mostly depends on how aggressive you want to be in moving
people away from cluster and toward repack. If we remove
_progress_cluster, it will force people to update monitoring which
probably encourages people to switch to pg_repackdb. We probably need
to have at least one "bridge" release though, and I think you've got
the right balance for that.

> - I noticed that you can do "CLUSTER pg_class ON some_index" and it will
>   happily modify pg_index.indisclustered, which is a bit weird
>   considering that allow_system_table_mods is off -- if you later try
>   ALTER TABLE .. SET WITHOUT CLUSTER, it won't let you.  I think this is
>   bogus and we should change it so that CLUSTER refuses to change the
>   clustered index on a system catalog, unless allow_system_table_mods is
>   on.  However, that would be a change from longstanding behavior which
>   is specifically tested for in regression tests, so I didn't do it.
>   We can discuss such a change separately.  But I did make REPACK refuse
>   to do that, because we don't need to propagate bogus historical
>   behavior.  So REPACK will fail if you try to change the indisclustered
>   index, but it will work fine if you repack based on the same index as
>   before, or repack with no index.
>

Since cluster will presumably be deprecated with this release, I'd
leave the existing behavior and move forward with repack as you've
laid out.

> - pg_repackdb: if you try with a non-superuser without specifying a
>   table name, it will fail as soon as it hits the first catalog table or
>   whatever with "ERROR: cannot lock this table".  This is sorta fine for
>   vacuumdb, but only because VACUUM itself will instead say "WARNING:
>   cannot lock table XYZ, skipping", so it's not an error and vacuumdb
>   keeps running.  IMO this is bogus: vacuumdb should not try to process
>   tables that it doesn't have privileges to.  However, not wanting to
>   change longstanding behavior, I left that alone.  For pg_repackdb, I
>   added a condition in the WHERE clause there to only fetch tables that
>   the current user has MAINTAIN privilege over.  Then you can do a
>   "pg_repackdb -U foobar" and it will nicely process the tables that
>   that user is allowed to process.  We can discuss changing the vacuumdb
>   behavior separately.

Again, vacuumdb seems to be a good example of what not to do, but I'll
leave that for another thread. In general I like this idea, but it
does make for a weird corner case where if I specify a table with -t
that I don't have permission to repack, repack returns silently whilst
doing nothing. I suppose one way to handle that would be to check if
the table passed in -t is found in the list of tables with MAINTAIN
privileges, and if not to issue a WARNING like "%s not found. Make
sure that the table exists and that you have MAINTAIN privileges".
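
A sketch of that check, under the assumption that pg_repackdb already holds
the list of tables returned by its catalog query (the ones the user has
MAINTAIN on).  The function name and the plain string arrays are
hypothetical stand-ins; the real utility would work with its own list
structures.

```c
#include <stdio.h>
#include <string.h>

/*
 * Warn about every table requested with -t that is absent from the list
 * of tables the catalog query returned.  Returns the number of warnings
 * written to "out".
 */
int
warn_unprocessable_tables(const char **requested, int nrequested,
                          const char **permitted, int npermitted,
                          FILE *out)
{
    int         nwarnings = 0;

    for (int i = 0; i < nrequested; i++)
    {
        int         found = 0;

        for (int j = 0; j < npermitted; j++)
        {
            if (strcmp(requested[i], permitted[j]) == 0)
            {
                found = 1;
                break;
            }
        }

        if (!found)
        {
            fprintf(out,
                    "WARNING: %s not found. Make sure that the table "
                    "exists and that you have MAINTAIN privileges\n",
                    requested[i]);
            nwarnings++;
        }
    }

    return nwarnings;
}
```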

Robert Treat
https://xzilla.net






* Re: Adding REPACK [concurrently]
@ 2025-10-30 23:17  Alvaro Herrera <[email protected]>
  parent: Alvaro Herrera <[email protected]>
  6 siblings, 4 replies; 106+ messages in thread

From: Alvaro Herrera @ 2025-10-30 23:17 UTC (permalink / raw)
  To: Pg Hackers <[email protected]>; +Cc: Antonin Houska <[email protected]>; Mihail Nikalayeu <[email protected]>; Robert Treat <[email protected]>

Hello,

Here's a new installment of this series, v25, including the CONCURRENTLY
part, which required some conflict fixes on top of the much-changed
v24-0001 patch.

After the talk on this subject for PGConf.EU, there were some
reservations about this whole project, and if I understand correctly,
they can be summarized in these three points:

1. Would the spill files for reorderbuffers occupy as much disk space as
it takes to copy the initial contents of the table, for each active
logical decoding replication slot?  Antonin claims (I haven't verified
this) that there are some hacks in place to avoid this problem, or that
it is easy to install some -- and if so, then this patch would already
be better than pg_repack.  This perhaps merits more testing.

2. Is the concurrent REPACK operation MVCC-safe?  At the moment, with
the present implementation, no it is not.  There are discussions on
getting this fixed, and Mihail has proposed some patches which at least
are quite short, though their safety is something we need to assess in
more depth.

3. Would the xmin horizon remain stuck at the spot where REPACK started,
thereby preventing VACUUM from cleaning up recently-dead rows in other
tables?  As I understand, with the current implementation, yes it would,
and we cannot easily apply hacks such as PROC_IN_VACUUM to prevent it,
because it would introduce the same problems it did for CREATE INDEX
CONCURRENTLY that was fixed in pg14 (commit 042b584c7f7d62).  Mihail and
Antonin have discussed possible ways to ease this, but we don't have
code for that yet.  This is, again, no worse than VACUUM FULL or
CLUSTER, so lack of this wouldn't be a killer for this project, though
of course it would be much better to do better.

I have not yet addressed Robert Treat's feedback from October 12th.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
Officer Krupke, what are we to do?
Gee, officer Krupke, Krup you! (West Side Story, "Gee, Officer Krupke")


* Re: Adding REPACK [concurrently]
@ 2025-11-01 12:42  jian he <[email protected]>
  parent: Alvaro Herrera <[email protected]>
  3 siblings, 2 replies; 106+ messages in thread

From: jian he @ 2025-11-01 12:42 UTC (permalink / raw)
  To: Alvaro Herrera <[email protected]>; +Cc: Pg Hackers <[email protected]>; Antonin Houska <[email protected]>; Mihail Nikalayeu <[email protected]>; Robert Treat <[email protected]>

On Fri, Oct 31, 2025 at 7:17 AM Alvaro Herrera <[email protected]> wrote:
>
> Hello,
>
> Here's a new installment of this series, v25, including the CONCURRENTLY
> part, which required some conflict fixes on top of the much-changed
> v24-0001 patch.
>
hi.

    if (params.options & CLUOPT_ANALYZE)
        ereport(ERROR,
                errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                errmsg("cannot %s multiple tables", "REPACK (ANALYZE)"));
For this error case, would it be better to add a simple test case?

+	/* Do an analyze, if requested */
+	if (params->options & CLUOPT_ANALYZE)
+	{
+		VacuumParams vac_params = {0};
+
+		vac_params.options |= VACOPT_ANALYZE;
+		if (params->options & CLUOPT_VERBOSE)
+			vac_params.options |= VACOPT_VERBOSE;
+		analyze_rel(RelationGetRelid(rel), NULL, vac_params,
+					stmt->relation->va_cols, true, NULL);
+	}

Looking at the comments in struct VacuumParams, some fields have nonzero default
values — for example, log_vacuum_min_duration.
Do we need to explicitly set these fields to their default values?
(see ExecVacuum)
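
To make the concern concrete, here is a tiny standalone sketch.  DemoParams
is NOT the real VacuumParams; it is a made-up two-field miniature, assuming
only that one field uses -1 as its "use the configured default" sentinel,
as the struct comments suggest.

```c
/* Hypothetical miniature of a parameter struct with a sentinel field. */
typedef struct DemoParams
{
    unsigned int options;
    int         log_min_duration;   /* -1 means "use the default" */
} DemoParams;

#define DEMO_OPT_ANALYZE    (1u << 0)

/* What "= {0}" gives: the sentinel field silently becomes 0, not -1. */
DemoParams
demo_params_zeroed(void)
{
    DemoParams  p = {0};

    p.options |= DEMO_OPT_ANALYZE;
    return p;
}

/* What an explicit initialization would give instead. */
DemoParams
demo_params_with_defaults(void)
{
    DemoParams  p = {0};

    p.options |= DEMO_OPT_ANALYZE;
    p.log_min_duration = -1;
    return p;
}
```

If 0 means "log everything" for such a field, the zeroed variant changes
behavior compared to a plain ANALYZE, which is why the question matters.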

repack.sgml can also add a
<refsect1> <title>See Also</title>
similar to analyze.sgml, vacuum.sgml

doc/src/sgml/ref/repack.sgml
synopsis section missing syntax:
REPACK USING INDEX

I am wondering, can we also support
REPACK opt_utility_option_list USING INDEX

MATERIALIZED VIEW:
create materialized view a_________ as select * from t2;

repack (verbose);
INFO:  repacking "public.a_________" in physical order
INFO:  "public.a_________": found 0 removable, 10 nonremovable row
versions in 1 pages
DETAIL:  0 dead row versions cannot be removed yet.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.

cluster (verbose);
won't touch materialized view a_________

but materialized views don't have bloat, nothing can be removed.
So are we wasting cycles here by repacking a materialized view?






* Re: Adding REPACK [concurrently]
@ 2025-11-01 12:53  Sergei Kornilov <[email protected]>
  parent: jian he <[email protected]>
  1 sibling, 0 replies; 106+ messages in thread

From: Sergei Kornilov @ 2025-11-01 12:53 UTC (permalink / raw)
  To: jian he <[email protected]>; +Cc: Pg Hackers <[email protected]>; Antonin Houska <[email protected]>; Mihail Nikalayeu <[email protected]>; Robert Treat <[email protected]>; Alvaro Herrera <[email protected]>

Hello!

> but materialized views don't have bloat, nothing can be removed.

REFRESH MATERIALIZED VIEW CONCURRENTLY does not replace relation completely but updates the relation using insert and delete queries (refresh_by_match_merge in src/backend/commands/matview.c) - so there may be bloat.

regards, Sergei






* Re: Adding REPACK [concurrently]
@ 2025-11-01 18:16  Mihail Nikalayeu <[email protected]>
  parent: Alvaro Herrera <[email protected]>
  3 siblings, 1 reply; 106+ messages in thread

From: Mihail Nikalayeu @ 2025-11-01 18:16 UTC (permalink / raw)
  To: Alvaro Herrera <[email protected]>; +Cc: Pg Hackers <[email protected]>; Antonin Houska <[email protected]>; Robert Treat <[email protected]>

Hello!

On Fri, Oct 31, 2025 at 12:17 AM Alvaro Herrera <[email protected]> wrote:
> Here's a new installment of this series, v25, including the CONCURRENTLY
> part, which required some conflict fixes on top of the much-changed
> v24-0001 patch.

 > * cluster.c
 > *      CLUSTER a table on an index.  This is now also used for VACUUM FULL.

 Should we add something about repack here?

 > ii_ExclusinOps
 typo here.

 >  * index is inserted into catalogs and needs to be built later on.
 Now it is only in case concurrently == true

>   * Build the index information for the new index.  Note that rebuild of
>   * indexes with exclusion constraints is not supported, hence there is no
>   * need to fill all the ii_Exclusion* fields.

Now the function supports it in !concurrently mode. Should we fill
ii_Exclusion? Also, it says

> If !concurrently, ii_ExclusinOps is currently not needed.

But it is not clear - why not?

>   newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
>                           oldInfo->ii_NumIndexKeyAttrs,
>                           oldInfo->ii_Am,
>                           indexExprs,
>                           indexPreds,
>                           oldInfo->ii_Unique,
>                           oldInfo->ii_NullsNotDistinct,
>                           false,  /* not ready for inserts */
>                           true,
>                           indexRelation->rd_indam->amsummarizing,
>                           oldInfo->ii_WithoutOverlaps);

Is it ok we pass isready == false if !concurrent?
Also, we pass concurrent == true even if concurrently == false - feels
strange and probably wrong.
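
One possible reading of what the call should look like, sketched with
stand-in types: derive both booleans from the mode instead of hard-coding
them.  This is an assumption about the intended fix, not something taken
from the patch; DemoIndexFlags and demo_index_flags() are invented names,
and the real makeIndexInfo() takes these values positionally.

```c
#include <stdbool.h>

/* Stand-in for the two boolean arguments discussed above. */
typedef struct DemoIndexFlags
{
    bool        isready;        /* index may receive inserts right away */
    bool        concurrent;     /* index is built in a concurrent fashion */
} DemoIndexFlags;

/*
 * Assumed mapping: a non-concurrent rebuild produces an index that is
 * ready immediately; a concurrent one is created first and built later.
 */
DemoIndexFlags
demo_index_flags(bool concurrently)
{
    DemoIndexFlags flags;

    flags.isready = !concurrently;
    flags.concurrent = concurrently;
    return flags;
}
```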


> This difference does has no impact on XidInMVCCSnapshot().
Should it be "This difference has no impact"?

> * pgoutput_cluster.c
> *       src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
 it is pgoutput_repack.c :)

Best regards,
Mikhail.






* Re: Adding REPACK [concurrently]
@ 2025-11-03 07:56  Antonin Houska <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  0 siblings, 1 reply; 106+ messages in thread

From: Antonin Houska @ 2025-11-03 07:56 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Pg Hackers <[email protected]>; Robert Treat <[email protected]>

Mihail Nikalayeu <[email protected]> wrote:

> Hello!
> 
> On Fri, Oct 31, 2025 at 12:17 AM Alvaro Herrera <[email protected]> wrote:
> > Here's a new installment of this series, v25, including the CONCURRENTLY
> > part, which required some conflict fixes on top of the much-changed
> > v24-0001 patch.
> 
>  > * cluster.c
>  > *      CLUSTER a table on an index.  This is now also used for VACUUM FULL.

ok

>  Should we add something about repack here?
> 
>  > ii_ExclusinOps
>  typo here.

ok

>  >  * index is inserted into catalogs and needs to be built later on.
>  Now it is only in case concurrently == true

ok

> >   * Build the index information for the new index.  Note that rebuild of
> >   * indexes with exclusion constraints is not supported, hence there is no
> >   * need to fill all the ii_Exclusion* fields.
> 
> Now the function supports it in !concurrently mode. Should we fill
> ii_Exclusion? Also, it says
> 
> > If !concurrently, ii_ExclusinOps is currently not needed.

> But it is not clear - why not?

Right, makeIndexInfo() needs to be adjusted.

> 
> >   newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
> >                           oldInfo->ii_NumIndexKeyAttrs,
> >                           oldInfo->ii_Am,
> >                           indexExprs,
> >                           indexPreds,
> >                           oldInfo->ii_Unique,
> >                           oldInfo->ii_NullsNotDistinct,
> >                           false,  /* not ready for inserts */
> >                           true,
> >                           indexRelation->rd_indam->amsummarizing,
> >                           oldInfo->ii_WithoutOverlaps);
> 
> Is it ok we pass isready == false if !concurrent?
> Also, we pass concurrent == true even if concurrently == false - feels
> strange and probably wrong.

You're right, both arguments are wrong.

> > This difference does has no impact on XidInMVCCSnapshot().
> Should it be "This difference has no impact"?

ok

> > * pgoutput_cluster.c
> > *       src/backend/replication/pgoutput_cluster/pgoutput_cluster.c
>  it is pgoutput_repack.c :)

ok

I'll fix all the problems in the next version. Thanks!

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com






* Re: Adding REPACK [concurrently]
@ 2025-11-05 02:48  jian he <[email protected]>
  parent: Alvaro Herrera <[email protected]>
  3 siblings, 1 reply; 106+ messages in thread

From: jian he @ 2025-11-05 02:48 UTC (permalink / raw)
  To: Alvaro Herrera <[email protected]>; +Cc: Pg Hackers <[email protected]>; Antonin Houska <[email protected]>; Mihail Nikalayeu <[email protected]>; Robert Treat <[email protected]>

On Fri, Oct 31, 2025 at 7:17 AM Alvaro Herrera <[email protected]> wrote:
>
> Hello,
>
> Here's a new installment of this series, v25, including the CONCURRENTLY
> part, which required some conflict fixes on top of the much-changed
> v24-0001 patch.
>

 <refnamediv>
  <refname>pg_repackdb</refname>
  <refpurpose>repack and analyze a <productname>PostgreSQL</productname>
  database</refpurpose>
 </refnamediv>

But with the --all option specified, it repacks the whole cluster
(more than one database).
I am not fully sure this description is OK.


I think pg_repackdb Synopsis section:
pg_repackdb [connection-option...] [option...] [ -t | --table table [(
column [,...] )] ] ... [ dbname | -a | --all ]
pg_repackdb [connection-option...] [option...] [ -n | --schema schema
] ... [ dbname | -a | --all ]
pg_repackdb [connection-option...] [option...] [ -N | --exclude-schema
schema ] ... [ dbname | -a | --all ]

can be simplified the same way as pg_dump:

pg_repackdb [connection-option...] [option...]  [ dbname | -a | --all ]

------------------------
[-d] dbname
[--dbname=]dbname

What do you think about expanding it as below:
dbname
-d dbname
--dbname=dbname
--------------------

+ printf(_("      --index[=INDEX]             repack following an index\n"));
should it be
+ printf(_("--index[=INDEX]                   repack following an index\n"));
?


similar to pg_dump:
    printf(_("\nIf no database name is supplied, then the PGDATABASE
environment\n"
             "variable value is used.\n\n"));

in pg_repackdb help section, we can mention:
    printf(_("\nIf no database name is supplied and --all option not
specified then the PGDATABASE environment\n"
             "variable value is used.\n\n"));
Do you think it's necessary?


What is the expected behavior of
pg_repackdb --index=index_name? The doc is not very helpful.

pg_repackdb --analyze --index=zz --verbose
pg_repackdb: repacking database "src3"
pg_repackdb: error: processing of database "src3" failed: ERROR:  "zz"
is not an index for table "tenk1"

select pg_get_indexdef ('zz'::regclass);
                  pg_get_indexdef
---------------------------------------------------
 CREATE INDEX zz ON public.tenk2 USING btree (two)

------
jian he
EDB: http://www.enterprisedb.com






* Re: Adding REPACK [concurrently]
@ 2025-11-05 05:10  Robert Treat <[email protected]>
  parent: jian he <[email protected]>
  0 siblings, 2 replies; 106+ messages in thread

From: Robert Treat @ 2025-11-05 05:10 UTC (permalink / raw)
  To: jian he <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Pg Hackers <[email protected]>; Antonin Houska <[email protected]>; Mihail Nikalayeu <[email protected]>

On Tue, Nov 4, 2025 at 9:48 PM jian he <[email protected]> wrote:
> On Fri, Oct 31, 2025 at 7:17 AM Alvaro Herrera <[email protected]> wrote:
> >
> > Hello,
> >
> > Here's a new installment of this series, v25, including the CONCURRENTLY
> > part, which required some conflict fixes on top of the much-changed
> > v24-0001 patch.
> >
>
>  <refnamediv>
>   <refname>pg_repackdb</refname>
>   <refpurpose>repack and analyze a <productname>PostgreSQL</productname>
>   database</refpurpose>
>  </refnamediv>
>
> but with --all option specified, it's doing repack whole cluster.
> (more than one database).
> I am not fully sure this description is OK.
>

This wording came from vacuumdb, which operates the same way, and I
don't think it's led to confusion. And while I don't think we need to
take away the option, I see no reason to encourage the idea that
people should be doing cluster-wide full database repacks. On that
note, I'd take the "and analyze" from the refpurpose as well; the more
I look at it, I see pg_repackdb as a replacement for clusterdb, with
selected bells and whistles from vacuum full or external repack-type
tooling, but at the end of the day that's a simpler model for
operators, and helps draw a distinction for which features we DON'T
need to include, like -Z (ie. analyze only; if you want that, use
vacuumdb, not pg_repackdb).

>
> I think pg_repackdb Synopsis section:
> pg_repackdb [connection-option...] [option...] [ -t | --table table [(
> column [,...] )] ] ... [ dbname | -a | --all ]
> pg_repackdb [connection-option...] [option...] [ -n | --schema schema
> ] ... [ dbname | -a | --all ]
> pg_repackdb [connection-option...] [option...] [ -N | --exclude-schema
> schema ] ... [ dbname | -a | --all ]
>
> can be simplified the same way as as pg_dump:
>
> pg_repackdb [connection-option...] [option...]  [ dbname | -a | --all ]
>

I think it's laid out that way in vacuumdb to indicate that those
options are exclusive of one another. I'm not sure how convincing that
is, but the above would need to do more to make the switch imo.

> ------------------------
> [-d] dbname
> [--dbname=]dbname
>
> what do you think to expand it as below:
> dbname
> -d dbname
> --dbname=dbname

Not sure I am following this one, but the brackets are the standard
way we show that items are optional, which in either case they are.

> --------------------
>
> + printf(_("      --index[=INDEX]             repack following an index\n"));
> should it be
> + printf(_("--index[=INDEX]                   repack following an index\n"));
> ?
>

I believe this is included for alignment, since this option has no
shorthand version.

>
> similar to pg_dump:
>     printf(_("\nIf no database name is supplied, then the PGDATABASE
> environment\n"
>              "variable value is used.\n\n"));
>
> in pg_repackdb help section, we can mention:
>     printf(_("\nIf no database name is supplied and --all option not
> specified then the PGDATABASE environment\n"
>              "variable value is used.\n\n"));
> Do you think it's necessary?
>

No. (Again, looking first at clusterdb, and also vacuumdb, neither of
which has it.)

>
> what the expectation of
> pg_repackdb --index=index_name, the doc is not very helpful.
>
> pg_repackdb --analyze --index=zz --verbose
> pg_repackdb: repacking database "src3"
> pg_repackdb: error: processing of database "src3" failed: ERROR:  "zz"
> is not an index for table "tenk1"
>
> select pg_get_indexdef ('zz'::regclass);
>                   pg_get_indexdef
> ---------------------------------------------------
>  CREATE INDEX zz ON public.tenk2 USING btree (two)
>

Hmm... yes, this is a bit confusing. I didn't verify it in the code,
but from memory I think the --index option is meant to be used only in
conjunction with --table, in which case it would repack the table
using the specified index. I could be overlooking something though.

Robert Treat
https://xzilla.net

^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-11-05 07:12  Antonin Houska <[email protected]>
  parent: Robert Treat <[email protected]>
  1 sibling, 1 reply; 106+ messages in thread

From: Antonin Houska @ 2025-11-05 07:12 UTC (permalink / raw)
  To: Robert Treat <[email protected]>; +Cc: jian he <[email protected]>; Alvaro Herrera <[email protected]>; Pg Hackers <[email protected]>; Mihail Nikalayeu <[email protected]>

Robert Treat <[email protected]> wrote:

> On Tue, Nov 4, 2025 at 9:48 PM jian he <[email protected]> wrote:

> > what the expectation of
> > pg_repackdb --index=index_name, the doc is not very helpful.
> >
> > pg_repackdb --analyze --index=zz --verbose
> > pg_repackdb: repacking database "src3"
> > pg_repackdb: error: processing of database "src3" failed: ERROR:  "zz"
> > is not an index for table "tenk1"
> >
> > select pg_get_indexdef ('zz'::regclass);
> >                   pg_get_indexdef
> > ---------------------------------------------------
> >  CREATE INDEX zz ON public.tenk2 USING btree (two)
> >
> 
> Hmm... yes, this is a bit confusing. I didn't verify it in the code,
> but from memory I think the --index option is meant to be used only in
> conjunction with --table, in which case it would repack the table
> using the specified index. I could be overlooking something though.

The corresponding code is:

+	/*
+	 * In REPACK mode, if the 'using_index' option was given but no index
+	 * name, filter only tables that have an index with indisclustered set.
+	 * (If an index name is given, we trust the user to pass a reasonable list
+	 * of tables.)
+	 *
+	 * XXX it may be worth printing an error if an index name is given with no
+	 * list of tables.
+	 */
+	if (vacopts->mode == MODE_REPACK &&
+		vacopts->using_index && !vacopts->indexname)
+	{
+		appendPQExpBufferStr(&catalog_query,
+							 " AND EXISTS (SELECT 1 FROM pg_catalog.pg_index\n"
+							 "    WHERE indrelid = c.oid AND indisclustered)\n");
+	}

I'm not sure if it's worth allowing the --index option to have an
argument. Since the user can specify multiple tables, he should also be able
to specify multiple indexes. And then the question would be: what should
happen if the user forgot to specify (or just mistyped) the index name for a
table which does not yet have the clustering index set? Skip that table (and
print out a warning)? Or consider it an error?
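
To make the cases that the condition above distinguishes easier to see,
here is a minimal standalone sketch. It is only an illustration under
assumptions: SketchOpts, RunMode and filter_on_clustered_index() are
hypothetical stand-ins invented for this example, not names from the
patch, and only the three fields tested by the quoted condition are
modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the mode field tested in the quoted code. */
typedef enum
{
	MODE_VACUUM,
	MODE_REPACK
} RunMode;

/* Hypothetical stand-in for the options struct; not the patch's type. */
typedef struct
{
	RunMode		mode;
	bool		using_index;	/* --index was given */
	const char *indexname;		/* argument of --index, or NULL */
} SketchOpts;

/*
 * Returns true when the catalog query should keep only tables that have
 * an index with indisclustered set, i.e. REPACK mode, --index given, and
 * no index name supplied.
 */
static bool
filter_on_clustered_index(const SketchOpts *opts)
{
	assert(opts != NULL);

	return opts->mode == MODE_REPACK &&
		opts->using_index &&
		opts->indexname == NULL;
}
```

With an index name supplied the filter is skipped, which is why a name
that does not belong to one of the selected tables is only reported once
the server processes the command.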

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com





^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-11-05 08:46  jian he <[email protected]>
  parent: Robert Treat <[email protected]>
  1 sibling, 0 replies; 106+ messages in thread

From: jian he @ 2025-11-05 08:46 UTC (permalink / raw)
  To: Robert Treat <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Pg Hackers <[email protected]>; Antonin Houska <[email protected]>; Mihail Nikalayeu <[email protected]>

On Wed, Nov 5, 2025 at 1:11 PM Robert Treat <[email protected]> wrote:
> > --------------------
> >
> > + printf(_("      --index[=INDEX]             repack following an index\n"));
> > should it be
> > + printf(_("--index[=INDEX]                   repack following an index\n"));
> > ?
> >
>
> I believe this is included for alignment, since this option has no
> shorthand version.
>

if you compare pg_dump --help, pg_repackdb --help
then you will see the inconsistency.

This is legacy behavior, but can we move some of the error checks in
do_analyze_rel to an earlier point?
We call cluster_rel before analyze_rel, and cluster_rel is far more
time-consuming, so a failure in analyze_rel means all the previous work
(cluster_rel) is wasted.

+ else if (HeadMatches("REPACK", "(*") &&
+ !HeadMatches("REPACK", "(*)"))
+ {
+ /*
+ * This fires if we're in an unfinished parenthesized option list.
+ * get_previous_words treats a completed parenthesized option list as
+ * one word, so the above test is correct.
+ */
+ if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
+ COMPLETE_WITH("VERBOSE");
+ else if (TailMatches("VERBOSE"))
+ COMPLETE_WITH("ON", "OFF");
+ }
this part can also support the ANALYZE option?

ClusterStmt
should be removed from src/tools/pgindent/typedefs.list?

doc/src/sgml/ref/clusterdb.sgml
  <para>
   <application>clusterdb</application> has been superceded by
   <application>pg_repackdb</application>.
  </para>
google told me, "superceded" should be "superseded"

--
jian he
EDB: http://www.enterprisedb.com

^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-11-09 22:13  Robert Treat <[email protected]>
  parent: Antonin Houska <[email protected]>
  0 siblings, 0 replies; 106+ messages in thread

From: Robert Treat @ 2025-11-09 22:13 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: jian he <[email protected]>; Alvaro Herrera <[email protected]>; Pg Hackers <[email protected]>; Mihail Nikalayeu <[email protected]>

On Wed, Nov 5, 2025 at 2:12 AM Antonin Houska <[email protected]> wrote:
> Robert Treat <[email protected]> wrote:
> > On Tue, Nov 4, 2025 at 9:48 PM jian he <[email protected]> wrote:
>
> > > what the expectation of
> > > pg_repackdb --index=index_name, the doc is not very helpful.
> > >
> > > pg_repackdb --analyze --index=zz --verbose
> > > pg_repackdb: repacking database "src3"
> > > pg_repackdb: error: processing of database "src3" failed: ERROR:  "zz"
> > > is not an index for table "tenk1"
> > >
> > > select pg_get_indexdef ('zz'::regclass);
> > >                   pg_get_indexdef
> > > ---------------------------------------------------
> > >  CREATE INDEX zz ON public.tenk2 USING btree (two)
> > >
> >
> > Hmm... yes, this is a bit confusing. I didn't verify it in the code,
> > but from memory I think the --index option is meant to be used only in
> > conjunction with --table, in which case it would repack the table
> > using the specified index. I could be overlooking something though.
>
> The corresponding code is:
>
> +       /*
> +        * In REPACK mode, if the 'using_index' option was given but no index
> +        * name, filter only tables that have an index with indisclustered set.
> +        * (If an index name is given, we trust the user to pass a reasonable list
> +        * of tables.)
> +        *
> +        * XXX it may be worth printing an error if an index name is given with no
> +        * list of tables.
> +        */
> +       if (vacopts->mode == MODE_REPACK &&
> +               vacopts->using_index && !vacopts->indexname)
> +       {
> +               appendPQExpBufferStr(&catalog_query,
> +                                                        " AND EXISTS (SELECT 1 FROM pg_catalog.pg_index\n"
> +                                                        "    WHERE indrelid = c.oid AND indisclustered)\n");
> +       }
>
> I'm not sure if it's worth allowing the --index option to have an
> argument. Since the user can specify multiple tables, he should also be able
> to specify multiple indexes. And then the question would be: what should
> happen if the user forgot to specify (or just mistyped) the index name for a
> table which does not yet have the clustering index set? Skip that table (and
> print out a warning)? Or consider it an error?
>

Ah, yes, this is something completely different. So, we do need a way
to differentiate between "vacuum full" vs "cluster" all tables... as
well as "vacuum full" vs "cluster" of a specific table (including the
idea of "vacuum full" of a previously clustered table), and the
existing code handles all that (though I might quibble with the option
name).

As for having an --index= option, I'd love to hear the use case;
something like partitions or maybe some client per schema situation
comes to mind, but ISTM in all those cases the user would also know
(or be expected to know) the table name, so I agree with Antonin that
the extra complexity doesn't seem worth supporting to me. (It's even
worse the more you think about it, what if some table has the index
named above, but is clustered on a different index, then what should
we do?)

As for the use case I was thinking of, specifying a table and index in
order to repack using that index (and setting indisclustered if not
already); while I feel like that would be a useful option, if it isn't
currently supported I don't see a strong argument for adding it now.


Robert Treat
https://xzilla.net





^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-12-02 00:50  Mihail Nikalayeu <[email protected]>
  parent: Antonin Houska <[email protected]>
  0 siblings, 1 reply; 106+ messages in thread

From: Mihail Nikalayeu @ 2025-12-02 00:50 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Pg Hackers <[email protected]>; Robert Treat <[email protected]>

Hello, Antonin!

On Mon, Nov 3, 2025 at 8:56 AM Antonin Houska <[email protected]> wrote:
> I'll fix all the problems in the next version. Thanks!

A few more things I noticed:

> switch ((vis = HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf)))
vis is unused, so the double parentheses are not needed either.

>       LockBuffer(buf, BUFFER_LOCK_UNLOCK);
>       continue;
>    }

>    /*
>     * In the concurrent case, we have a copy of the tuple, so we
>     * don't worry whether the source tuple will be deleted / updated
>     * after we release the lock.
>     */
>    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
>}

I think the locking and comments are a little bit confusing here.
I think we could use a single LockBuffer(buf, BUFFER_LOCK_UNLOCK); before
`if (isdead)`, as it was before.
Also, I am not sure "we have a copy" is the right justification here; I
think the motivation is mostly the same as in
heapam_index_build_range_scan.

Also, I think it is a good idea to add tests for index-based and
sort-based repack.

Also, for sort-based repack I think we need to call
repack_decode_concurrent_changes during the insertion phase as well.

> is_system_catalog && !concurrent
2 places, always true, feels strange.

Best regards,
Mikhail.





^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-12-02 16:14  Antonin Houska <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  0 siblings, 1 reply; 106+ messages in thread

From: Antonin Houska @ 2025-12-02 16:14 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Pg Hackers <[email protected]>; Robert Treat <[email protected]>

Mihail Nikalayeu <[email protected]> wrote:

> Hello, Antonin!
> 
> On Mon, Nov 3, 2025 at 8:56 AM Antonin Houska <[email protected]> wrote:
> > I'll fix all the problems in the next version. Thanks!
> 
> A few more moments I mentioned:
> 
> > switch ((vis = HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf)))
> vis is unused, also to double braces.
> 
> >       LockBuffer(buf, BUFFER_LOCK_UNLOCK);
> >       continue;
> >    }
> 
> >    /*
> >     * In the concurrent case, we have a copy of the tuple, so we
> >     * don't worry whether the source tuple will be deleted / updated
> >     * after we release the lock.
> >     */
> >    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
> >}
> 
> I think locking and comments are a little bit confusing here.
> I think we may use single LockBuffer(buf, BUFFER_LOCK_UNLOCK); before
> `if (isdead)` as it was before.
> Also, I am not sure "we have a copy" is the correct point here, I
> think motivation is mostly the same as in
> heapam_index_build_range_scan.

All these problems are due to incorrect separation of the "preserve
visibility" part of the patch series. Will be fixed in the next version.

> Also, I think it is a good idea to add tests for index-based and
> sort-based repack.

Not sure, cluster.sql already seems to do the same.

> Also, for sort-based I think we need to also call
> repack_decode_concurrent_changes during insertion phase

I miss the point. The current coding is such that this part

	if (concurrent)
	{
		XLogRecPtr	end_of_wal;

		end_of_wal = GetFlushRecPtr(NULL);
		if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
		{
			repack_decode_concurrent_changes(decoding_ctx, end_of_wal);
			end_of_wal_prev = end_of_wal;
		}
	}

gets called regardless of the value of 'tuplesort' above.

> > is_system_catalog && !concurrent
> 2 places, always true, feels strange.

ok

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com





^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-12-02 16:22  Mihail Nikalayeu <[email protected]>
  parent: Antonin Houska <[email protected]>
  0 siblings, 1 reply; 106+ messages in thread

From: Mihail Nikalayeu @ 2025-12-02 16:22 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Pg Hackers <[email protected]>; Robert Treat <[email protected]>

Hi!

On Tue, Dec 2, 2025 at 5:14 PM Antonin Houska <[email protected]> wrote:
> Not sure, cluster.sql already seems to do the same.
I think in the case of CONCURRENTLY it may behave a little bit
differently, but I'm not sure.

> I miss the point. The current coding is such that this part
I mean call it periodically in both loops: scan loop and insertion loop.

Best greetings,
Mikhail.





^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-12-03 07:56  Antonin Houska <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  0 siblings, 0 replies; 106+ messages in thread

From: Antonin Houska @ 2025-12-03 07:56 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Pg Hackers <[email protected]>; Robert Treat <[email protected]>

Mihail Nikalayeu <[email protected]> wrote:

> On Tue, Dec 2, 2025 at 5:14 PM Antonin Houska <[email protected]> wrote:
> > Not sure, cluster.sql already seems to do the same.
> I think in the case of CONCURRENTLY it may behave a little bit
> different, but not sure.
> 
> > I miss the point. The current coding is such that this part
> I mean call it periodically in both loops: scan loop and insertion loop.

ok, that makes sense. I'll add that to the next version. Thanks.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com





^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-12-04 13:36  Antonin Houska <[email protected]>
  parent: jian he <[email protected]>
  1 sibling, 0 replies; 106+ messages in thread

From: Antonin Houska @ 2025-12-04 13:36 UTC (permalink / raw)
  To: jian he <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Pg Hackers <[email protected]>; Mihail Nikalayeu <[email protected]>; Robert Treat <[email protected]>

jian he <[email protected]> wrote:

>     if (params.options & CLUOPT_ANALYZE)
>         ereport(ERROR,
>                 errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>                 errmsg("cannot %s multiple tables", "REPACK (ANALYZE)"));
> for this error case, adding a simple test case would be better?

More options should probably be tested; currently we have only a very
basic regression test for pg_repackdb. TBD

> + /* Do an analyze, if requested */
> + if (params->options & CLUOPT_ANALYZE)
> + {
> + VacuumParams vac_params = {0};
> +
> + vac_params.options |= VACOPT_ANALYZE;
> + if (params->options & CLUOPT_VERBOSE)
> + vac_params.options |= VACOPT_VERBOSE;
> + analyze_rel(RelationGetRelid(rel), NULL, vac_params,
> + stmt->relation->va_cols, true, NULL);
> + }
> 
> Looking at the comments in struct VacuumParams, some fields have nonzero default
> values — for example, log_vacuum_min_duration.
> Do we need to explicitly set these fields to their default values?
> (see ExecVacuum)

Perhaps, TBD.

> repack.sgml can also add a
> <refsect1> <title>See Also</title>
> similar to analyze.sgml, vacuum.sgml

ok, added this in v26 (to be posted today):

 <refsect1>
  <title>See Also</title>

  <simplelist type="inline">
   <member><xref linkend="app-pgrepackdb"/></member>
   <member><xref linkend="repack-progress-reporting"/></member>
  </simplelist>
 </refsect1>

(I intentionally did not add references to VACUUM FULL and CLUSTER:
whoever uses REPACK should not need them.)

> doc/src/sgml/ref/repack.sgml
> synopsis section missing syntax:
> REPACK USING INDEX

ok, added in v26.

> I am wondering, can we also support
> REPACK opt_utility_option_list USING INDEX

I agree, added that in v26. (Hopefully I haven't broken anything; the
syntax is not trivial anymore.)

> MATERIALIZED VIEW:
> create materialized view a_________ as select * from t2;
> 
> repack (verbose);
> INFO:  repacking "public.a_________" in physical order
> INFO:  "public.a_________": found 0 removable, 10 nonremovable row
> versions in 1 pages
> DETAIL:  0 dead row versions cannot be removed yet.
> CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
> 
> cluster (verbose);
> won't touch materialized view a_________
> 
> but materialized views don't have bloat, nothing can be removed.
> So here we are waste cycles to repack materialized view?

Answered in https://www.postgresql.org/message-id/3436011762001613%40a7af8471-b1b8-48c2-9ff7-631187067407

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com





^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-12-04 17:43  Antonin Houska <[email protected]>
  parent: Alvaro Herrera <[email protected]>
  3 siblings, 1 reply; 106+ messages in thread

From: Antonin Houska @ 2025-12-04 17:43 UTC (permalink / raw)
  To: Alvaro Herrera <[email protected]>; +Cc: Pg Hackers <[email protected]>; Mihail Nikalayeu <[email protected]>; Robert Treat <[email protected]>

Alvaro Herrera <[email protected]> wrote:

> Here's a new installment of this series, v25, including the CONCURRENTLY
> part, which required some conflict fixes on top of the much-changed
> v24-0001 patch.

v26 attached here. It's been rebased and reflects most of the feedback.

A few incomplete items are marked as TBD here [1] and [2] is another thing
that needs discussion.

Besides that, I've done some refactoring in 0004: 1) moved more code to
setup_logical_decoding(), and 2) reduced the number of arguments of
process_concurrent_changes() by using a new structure. Both these changes
are preparation for a background worker that will perform the logical
decoding, but they seem to be useful on their own. (I have a PoC of the
worker but will post it later; it doesn't seem to be the priority for now.)

I've also removed support for decoding TRUNCATE because I realized that this
command uses AccessExclusiveLock, so it cannot be executed on a table that
REPACK (CONCURRENTLY) is currently processing.

Also I tried to fix TAB completion in psql.

> I have not yet addressed Robert Treat's feedback from October 12th.

These are still pending.

[1] https://www.postgresql.org/message-id/23631.1764855372%40localhost
[2] https://www.postgresql.org/message-id/CAJSLCQ2_jX8WmNOC4eu6hL5QyNHceOkgPbGhKHFw2X5onVEKDQ%40mail.gma...
-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-12-05 00:03  Mihail Nikalayeu <[email protected]>
  parent: Antonin Houska <[email protected]>
  0 siblings, 2 replies; 106+ messages in thread

From: Mihail Nikalayeu @ 2025-12-05 00:03 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Pg Hackers <[email protected]>; Robert Treat <[email protected]>

Hello, Antonin!

On Thu, Dec 4, 2025 at 6:43 PM Antonin Houska <[email protected]> wrote:
> v26 attached here. It's been rebased and reflects most of the feedback.

Some comments on 0001-0002:
1)

> cluster_rel(stmt->command, rel, indexOid, params);
cluster_rel closes the relation, yet the relation is dereferenced a few
lines later. Technically it may be correct, but it feels a little bit strange.

2)

> if (vacopts->mode == MODE_VACUUM)
I think for better compatibility it is better to handle the new value in
the if, i.e. (vacopts->mode == MODE_REPACK), to keep the old cases unchanged.

3)

> case T_RepackStmt:
>    tag = CMDTAG_REPACK;
>    break;

should we use instead:

case T_RepackStmt:
    if (((RepackStmt *) parsetree)->command == REPACK_COMMAND_CLUSTER)
       tag = CMDTAG_CLUSTER;
    else
       tag = CMDTAG_REPACK;
    break;

or delete CMDTAG_CLUSTER, since it is not used anymore
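
The first alternative can be sketched standalone as follows. This is only
an illustration under assumptions: the enums and sketch_command_tag() are
hypothetical stand-ins for RepackStmt->command and the CMDTAG_* values,
and only the two outcomes named in the suggestion are modeled.

```c
#include <assert.h>

/* Hypothetical stand-in for the command field of RepackStmt. */
typedef enum
{
	SKETCH_COMMAND_CLUSTER,		/* statement was written as CLUSTER */
	SKETCH_COMMAND_REPACK		/* statement was written as REPACK */
} SketchRepackCommand;

/* Hypothetical stand-in for the CMDTAG_* values. */
typedef enum
{
	SKETCH_CMDTAG_CLUSTER,
	SKETCH_CMDTAG_REPACK
} SketchCommandTag;

/*
 * Choose the command tag from the command the user actually typed, so a
 * legacy CLUSTER statement keeps reporting the CLUSTER tag.
 */
static SketchCommandTag
sketch_command_tag(SketchRepackCommand command)
{
	assert(command == SKETCH_COMMAND_CLUSTER ||
		   command == SKETCH_COMMAND_REPACK);

	if (command == SKETCH_COMMAND_CLUSTER)
		return SKETCH_CMDTAG_CLUSTER;
	return SKETCH_CMDTAG_REPACK;
}
```

The point of this variant is that clients which inspect the completion
tag of CLUSTER keep seeing the tag they saw before.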

4)
"has been superceded by"
typo

Best regards,
Mikhail.





^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-12-06 18:16  Mihail Nikalayeu <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  1 sibling, 2 replies; 106+ messages in thread

From: Mihail Nikalayeu @ 2025-12-06 18:16 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Pg Hackers <[email protected]>; Robert Treat <[email protected]>

Hello, Antonin!

Some comments for 0003:

 > /* allocate in transaction context */
It may be any context now, because it is a function now.

> result = CopySnapshot(snapshot);

> /* Restore the original values so the source is intact. */
> snapshot->xip = oldxip;
> snapshot->xcnt = oldxcnt;

I think it is worth calling pfree(newxip) here.
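
A standalone sketch of the pattern, with the suggested free in place.
Everything here is an assumption made for illustration: SketchSnapshot
models only xip/xcnt, sketch_copy() stands in for a deep copy like
CopySnapshot, malloc/free replace palloc/pfree, and the idea that the
temporary array holds the old xids plus one extra xid is a guess at what
newxip contains, not something taken from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t SketchXid;

/* Hypothetical reduced snapshot: only the fields touched above. */
typedef struct
{
	SketchXid  *xip;
	uint32_t	xcnt;
} SketchSnapshot;

/* Deep copy; stand-in for a CopySnapshot-like function. */
static SketchSnapshot *
sketch_copy(const SketchSnapshot *src)
{
	SketchSnapshot *dst = malloc(sizeof(*dst));

	assert(dst != NULL);
	dst->xcnt = src->xcnt;
	dst->xip = malloc(sizeof(SketchXid) * (src->xcnt > 0 ? src->xcnt : 1));
	assert(dst->xip != NULL);
	if (src->xcnt > 0)
		memcpy(dst->xip, src->xip, sizeof(SketchXid) * src->xcnt);
	return dst;
}

/*
 * Return a copy of 'snapshot' whose xip array also contains 'extra',
 * leaving the source snapshot intact.
 */
static SketchSnapshot *
sketch_copy_with_extra_xid(SketchSnapshot *snapshot, SketchXid extra)
{
	SketchXid  *oldxip = snapshot->xip;
	uint32_t	oldxcnt = snapshot->xcnt;
	SketchXid  *newxip = malloc(sizeof(SketchXid) * (oldxcnt + 1));
	SketchSnapshot *result;

	assert(newxip != NULL);
	if (oldxcnt > 0)
		memcpy(newxip, oldxip, sizeof(SketchXid) * oldxcnt);
	newxip[oldxcnt] = extra;

	/* Temporarily point the source at the extended array and copy it. */
	snapshot->xip = newxip;
	snapshot->xcnt = oldxcnt + 1;
	result = sketch_copy(snapshot);

	/* Restore the original values so the source is intact. */
	snapshot->xip = oldxip;
	snapshot->xcnt = oldxcnt;

	/* The copy owns its own array, so the temporary one can go. */
	free(newxip);
	return result;
}
```

Since the copy is deep, nothing references the temporary array after the
source fields are restored, which is what makes freeing it safe.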

> "This difference does has no impact"

should be "This difference has no impact"?


Best regards,
Mikhail.





^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-12-07 16:03  Mihail Nikalayeu <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  1 sibling, 1 reply; 106+ messages in thread

From: Mihail Nikalayeu @ 2025-12-07 16:03 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Pg Hackers <[email protected]>; Robert Treat <[email protected]>

Hello, comments so far on 0004:

---
> ind_oids_new = build_new_indexes(NewHeap, OldHeap, ind_oids_old);

I think the biggest issue we have so far is that
repack_decode_concurrent_changes is not called while new indexes are
built (the build itself creates a huge amount of WAL and sometimes takes
days). This looks like a path to catastrophic scenarios :)

Some small parts of it may be related to reset snapshots tech in CIC case:
1) if we build new indexes concurrently in REPACK case
2) and reset snapshots every so often
3) we may use the same callback to also process WAL every so often
4) but it still does not apply to some phases of index building (batch
insertion phase, for example)

Or should we move repack_decode_concurrent_changes calls into some
kind of worker instead?

---
> if (OldHeap->rd_rel->reltoastrelid)
>    LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);

I think we should pass mode from rebuild_relation here - because
AccessExclusiveLock will break "CONCURRENTLY" totally.
And also upgrade before swap probably.

---
> cluster_is_permitted_for_relation(RepackCommand cmd, Oid relid, Oid userid)

Should we call CheckSlotPermissions() here? Also, maybe it is worth
mentioning in the docs.

---
> REPACK (CONCURRENTLY) repack_test USING INDEX repack_test_pkey;

Some paths (without index) are not covered in any way in tests at the moment.
Also, I think some TOAST-related scenarios too.

> * Alternatively, we can lock all the indexes now in a mode that blocks
> * all the ALTER INDEX commands (ShareUpdateExclusiveLock ?), and keep

I think it's better to lock.

---
> rebuild_relation(RepackCommand cmd, Relation OldHeap, Relation index,

"cmd" is not used.

---
> apply_concurrent_update
> apply_concurrent_delete
> apply_concurrent_insert

"change" is not used, but I think it is intentionally for the MVCC-safe case.

---
> rebuild_relation(RepackCommand cmd, Relation OldHeap, Relation index,
>              bool verbose, bool concurrent)

"concurrent" is "concurrently" in definition.

---

> TM_FailureData *tmfd, bool changingPart,
> bool wal_logical);
Maybe "walLogical" to keep it aligned with "changingPart"?

---
> subtransacion
typo

---
> Should we check a the end

"a" is "at"?

---
> Note that <command>REPACK</command> with the
> the <literal>CONCURRENTLY</literal> option does not try to order the

double "the"

---
> if (size >= 0x3FFFFFFF)
if (size >= MaxAllocSize)

---
> extern bool HeapTupleMVCCInserted(HeapTuple htup, Snapshot snapshot,
>                           Buffer buffer);
> extern bool HeapTupleMVCCNotDeleted(HeapTuple htup, Snapshot snapshot,
>                            Buffer buffer);

Looks like this is from another patch.

---
src/backend/utils/cache/relcache.c
> #include "commands/cluster.h"

may be removed

---
> during any of the preceding
> phase.

"phases"

---
> # Prefix the system columns with underscore as they are not allowed as column
> # names.

Should it be removed?

---
> "Failed to find target tuple"

This and multiple other new error messages should start with lowercase

---
> Copyright (c) 2012-2024, PostgreSQL Global Development Group

in pgoutput_repack - maybe it is time to adjust.

---
src/test/modules/injection_points/logical.conf

Better to add newline

---
> SELECT injection_points_detach('repack-concurrently-before-lock');

Uses spaces; these need to be tabs.


The next step in my plan is to rebase the MVCC-safe commit and run some
stress tests on it.

Best regards,
Mikhail.





^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-12-08 07:35  Antonin Houska <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  1 sibling, 0 replies; 106+ messages in thread

From: Antonin Houska @ 2025-12-08 07:35 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Pg Hackers <[email protected]>; Robert Treat <[email protected]>

Mihail Nikalayeu <[email protected]> wrote:

> On Thu, Dec 4, 2025 at 6:43 PM Antonin Houska <[email protected]> wrote:
> > v26 attached here. It's been rebased and reflects most of the feedback.
> 
> Some comments on 0001-0002:
> 1)
> 
> > cluster_rel(stmt->command, rel, indexOid, params);
> cluster_rel closes relation, and after it is dereferenced a few lines after.
> Technically it may be correct, but feels a little bit strange.

ok, will be fixed in the next version (supposedly later today).

> 2)
> 
> > if (vacopts->mode == MODE_VACUUM)
> I think for better compatibility it is better to handle new value in
> if - (vacopts->mode == MODE_REPACK) to keep old cases unchanged

I suppose you mean vacuuming.c. We're considering removal of pg_repackdb from
the patchset, so let's decide on this later.

> 3)
> 
> > case T_RepackStmt:
> >    tag = CMDTAG_REPACK;
> >    break;
> 
> should we use instead:
> 
> case T_RepackStmt:
>     if (((RepackStmt *) parsetree)->command == REPACK_COMMAND_CLUSTER)
>        tag = CMDTAG_CLUSTER;
>     else
>        tag = CMDTAG_REPACK;
>     break;
> 
> or delete CMDTAG_CLUSTER - since it not used anymore

LGTM, will include it in the next version.

> 4)
> "has been superceded by"
> typo

ok. (This may also be removed, as it's specific to pg_repackdb.)

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com





^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-12-08 09:51  Antonin Houska <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  1 sibling, 0 replies; 106+ messages in thread

From: Antonin Houska @ 2025-12-08 09:51 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Pg Hackers <[email protected]>; Robert Treat <[email protected]>

Mihail Nikalayeu <[email protected]> wrote:

> Some comments for 0003:
> 
>  > /* allocate in transaction context */
> It may be any context now, because it is a function now.

This inaccuracy was not introduced by REPACK, but I think it's o.k. if the
next version of this patch removes the comment.

> > result = CopySnapshot(snapshot);
> 
> > /* Restore the original values so the source is intact. */
> > snapshot->xip = oldxip;
> > snapshot->xcnt = oldxcnt;
> 
> I think it is worth to call pfree(newxip) here.

ok

> > "This difference does has no impact"
> 
> should be "This difference has no impact"?

Right, thanks.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com





^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-12-09 18:52  Antonin Houska <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  0 siblings, 2 replies; 106+ messages in thread

From: Antonin Houska @ 2025-12-09 18:52 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Pg Hackers <[email protected]>; Robert Treat <[email protected]>

Mihail Nikalayeu <[email protected]> wrote:

> Hello, comments so far on 0004:
> 
> ---
> > ind_oids_new = build_new_indexes(NewHeap, OldHeap, ind_oids_old);
> 
> I think the biggest issue we have so far -
> repack_decode_concurrent_changes is not called while new indexes are
> built (the build itself creates a huge amount of WAL and takes days
> sometimes). Looks like a way to catastrophic scenarios :)

Indeed, that may be a problem.

> Some small parts of it may be related to reset snapshots tech in CIC case:
> 1) if we build new indexes concurrently in REPACK case
> 2) and reset snapshots every so often
> 3) we may use the same callback to also process WAL every so often
> 4) but it still does not apply to some phases of index building (batch
> insertion phase, for example)

I prefer not to depend on other improvements.

> Or should we move repack_decode_concurrent_changes calls into some
> kind of worker instead?

Worker makes more sense to me - the initial implementation is in 0005.

> ---
> > if (OldHeap->rd_rel->reltoastrelid)
> >    LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
> 
> I think we should pass mode from rebuild_relation here - because
> AccessExclusiveLock will break "CONCURRENTLY" totally.

Good point, I missed this.

> And also upgrade before swap probably.

rebuild_relation_finish_concurrent() already does that.

> ---
> > cluster_is_permitted_for_relation(RepackCommand cmd, Oid relid, Oid userid)
> 
> Should we call CheckSlotPermissions() here? Also, maybe it is worth
> mentioning in docs.

setup_logical_decoding() does that, but I'm not sure if we should really
require the REPLICATION user attribute for REPACK. I need to think about this,
perhaps ACL_MAINTAIN is enough.

> ---
> > REPACK (CONCURRENTLY) repack_test USING INDEX repack_test_pkey;
> 
> Some paths (without index) are not covered in any way in tests at the moment.
> Also, I think some TOAST-related scenarios too.

I added a test for TOAST to "injection_points" and hit a serious problem: when
applying concurrent changes to the new table, REPACK tried to delete rows from
the new one. The point is that the "swap TOAST by content" technique cannot be
used here. Fixed, thanks for this suggestion!

> > * Alternatively, we can lock all the indexes now in a mode that blocks
> > * all the ALTER INDEX commands (ShareUpdateExclusiveLock ?), and keep
> 
> I think it's better to lock.

ok, changed

> ---
> > rebuild_relation(RepackCommand cmd, Relation OldHeap, Relation index,
> 
> "cmd" is not used.

Fixed (not specific to 0004).

> ---
> > apply_concurrent_update
> > apply_concurrent_delete
> > apply_concurrent_insert
> 
> "change" is not used, but I think it is intentional for the MVCC-safe case.

Not sure if it's necessary for the MVCC-safe case, I consider it leftover from
some previous version. Removed.

> ---
> > rebuild_relation(RepackCommand cmd, Relation OldHeap, Relation index,
> >              bool verbose, bool concurrent)
> 
> "concurrent" is "concurrently" in definition.

Fixed.

> ---
> 
> > TM_FailureData *tmfd, bool changingPart,
> > bool wal_logical);
> Maybe "walLogical" to keep it aligned with "changingPart"?

ok

> ---
> > subtransacion
> typo
> 

I removed the related code. It was a workaround for plan_cluster_use_sort()
not to leave locks behind. However, as REPACK (CONCURRENTLY) does not unlock
the relation anymore, this is not needed either.

> ---
> > Should we check a the end
> 
> "a" is "at"?

Removed when addressing one of the previous comments.

> 
> ---
> > Note that <command>REPACK</command> with the
> > the <literal>CONCURRENTLY</literal> option does not try to order the
> 
> double "the"

Fixed.

> ---
> > if (size >= 0x3FFFFFFF)
> if (size >= MaxAllocSize)

Fixed.

> ---
> > extern bool HeapTupleMVCCInserted(HeapTuple htup, Snapshot snapshot,
> >                           Buffer buffer);
> > extern bool HeapTupleMVCCNotDeleted(HeapTuple htup, Snapshot snapshot,
> >                            Buffer buffer);
> 
> Looks like this is from another patch.

Right, this is from the "MVCC safety part".

> ---
> src/backend/utils/cache/relcache.c
> > #include "commands/cluster.h"
> 
> may be removed

Yes, this belongs to some of the following patches of the series.

> ---
> > during any of the preceding
> > phase.
> 
> "phases"

Fixed.

> ---
> > # Prefix the system columns with underscore as they are not allowed as column
> > # names.
> 
> Should it be removed?

Done. (Belongs to the "MVCC-safety" part, where the test checks xmin, xmax,
...)

> ---
> > "Failed to find target tuple"
> 
> This and multiple other new error messages should start with lowercase

Fixed.

> ---
> > Copyright (c) 2012-2024, PostgreSQL Global Development Group
> 
> in pgoutput_repack - maybe it is time to adjust.

Done.

> ---
> src/test/modules/injection_points/logical.conf
> 
> Better to add newline

Done.

> ---
> > SELECT injection_points_detach('repack-concurrently-before-lock');
> 
> Uses spaces, need to be tabs.

ok

Thanks for the review!

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




* Re: Adding REPACK [concurrently]
@ 2025-12-09 19:22  Alvaro Herrera <[email protected]>
  parent: Antonin Houska <[email protected]>
  1 sibling, 1 reply; 106+ messages in thread

From: Alvaro Herrera @ 2025-12-09 19:22 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Mihail Nikalayeu <[email protected]>; Pg Hackers <[email protected]>; Robert Treat <[email protected]>

Hello, many thanks for the new version.  Here's a very quick proposal
for a new top-of-file comment on cluster.c,

 * cluster.c
 *		Implementation of REPACK [CONCURRENTLY], also known as CLUSTER and
 *		VACUUM FULL.
 *
 * There are two somewhat different ways to rewrite a table.  In non-
 * concurrent mode, it's easy: take AccessExclusiveLock, create a new
 * transient relation, copy the tuples over to the relfilenode of the
 * new relation, swap the relfilenodes, then drop the old relation.
 *
 * In concurrent mode, we lock the table with only ShareUpdateExclusiveLock,
 * then do an initial copy as above.  However, while the tuples are being
 * copied, concurrent transactions could modify the table, and to cope
 * with those changes, we rely on logical decoding to obtain them from WAL.
 * A bgworker consumes WAL while the initial copy is ongoing (to prevent
 * excessive WAL from being reserved), and accumulates the changes in
 * a tuplestore.  Once the initial copy is complete, we read the changes
 * from the tuplestore and re-apply them on the new heap.  Then we
 * upgrade our ShareUpdateExclusiveLock to AccessExclusiveLock and swap
 * the relfilenodes.  This way, the time we hold a strong lock on the
 * table is much reduced, and the bloat is greatly reduced.

I haven't read rebuild_relation_finish_concurrent() yet to understand how
exactly we do the lock upgrade, which I think is an important point
we should address in this comment.  Also not addressed is how exactly we
handle indexes.  Feel free to correct this, reword it or include any
additional details that you think are important.
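
To make the flow in the proposed comment concrete, here is a toy sketch of
the two data-moving steps: the initial copy, followed by replay of the
changes accumulated meanwhile. Everything in it (ToyHeap, ToyChange,
toy_initial_copy, toy_apply_changes) is hypothetical and stands in for the
real heap, tuplestore and apply code; locking, WAL decoding and the
relfilenode swap are left out entirely.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_ROWS 4

/* Hypothetical change record, standing in for a decoded WAL change. */
typedef enum { TOY_INSERT, TOY_UPDATE, TOY_DELETE } ToyKind;
typedef struct { ToyKind kind; int key; int value; } ToyChange;

/* Hypothetical heap: one slot per key, with a presence flag. */
typedef struct { int present[TOY_ROWS]; int value[TOY_ROWS]; } ToyHeap;

/* Step 1: copy the old heap as it looked when the copy started. */
static void
toy_initial_copy(const ToyHeap *old_heap, ToyHeap *new_heap)
{
	*new_heap = *old_heap;
}

/*
 * Step 2: replay, in order, the changes that accumulated while the
 * initial copy was running.  Keys are assumed to be < TOY_ROWS.
 */
static void
toy_apply_changes(ToyHeap *new_heap, const ToyChange *changes, size_t n)
{
	for (size_t i = 0; i < n; i++)
	{
		const ToyChange *c = &changes[i];

		switch (c->kind)
		{
			case TOY_INSERT:
			case TOY_UPDATE:
				new_heap->present[c->key] = 1;
				new_heap->value[c->key] = c->value;
				break;
			case TOY_DELETE:
				new_heap->present[c->key] = 0;
				break;
		}
	}
}
```

Only after the replay has caught up would the real code take the stronger
lock and swap the files, which is why the exclusive-lock window stays short.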

(At this point we could just as well rename the file to repack.c, since
very little of the original remains.  But let's discuss that later.)

Thanks,

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Doing what he did amounts to sticking his fingers under the hood of the
implementation; if he gets his fingers burnt, it's his problem."  (Tom Lane)


* Re: Adding REPACK [concurrently]
@ 2025-12-11 20:38  Mihail Nikalayeu <[email protected]>
  parent: Antonin Houska <[email protected]>
  1 sibling, 2 replies; 106+ messages in thread

From: Mihail Nikalayeu @ 2025-12-11 20:38 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Pg Hackers <[email protected]>; Robert Treat <[email protected]>

Hello, Antonin!

On Tue, Dec 9, 2025 at 7:52 PM Antonin Houska <[email protected]> wrote:
> Worker makes more sense to me - the initial implementation is in 0005.

Comments for 0005, so far:

---
> export_initial_snapshot

Hm, should we use ExportSnapshot instead? And ImportSnapshot to import it.

---
> get_initial_snapshot

Should we check if a worker is still alive while waiting? Also in
"process_concurrent_changes".

And AFAIU RegisterDynamicBackgroundWorker does not guarantee new
workers to be started (in case of some fork-related issues).

---
> Assert(res = SHM_MQ_DETACHED);

Should be "==".
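
A small standalone sketch of why the single "=" matters. The enum below is
a hypothetical stand-in for the shm_mq result type, and the sketch assumes
the "detached" constant is nonzero: the assignment form then always passes
and also overwrites the variable, while the comparison form tests it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the message queue result codes. */
typedef enum { MQ_SUCCESS = 0, MQ_WOULD_BLOCK = 1, MQ_DETACHED = 2 } mq_result;

/*
 * Buggy form: "=" assigns, so the condition is the (nonzero) constant
 * and *res is silently overwritten.
 */
static bool
check_with_assignment(mq_result *res)
{
	return (*res = MQ_DETACHED) != 0;
}

/* Intended form: "==" compares and leaves *res untouched. */
static bool
check_with_comparison(const mq_result *res)
{
	return *res == MQ_DETACHED;
}
```

With res set to MQ_SUCCESS, the first function returns true and leaves res
equal to MQ_DETACHED; the second returns false and leaves res unchanged.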

---
> /* Wait a bit before we retry reading WAL. */
> (void) WaitLatch(MyLatch,
>              WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
>              1000L,
>              WAIT_EVENT_REPACK_WORKER_MAIN);

Looks like we need ResetLatch(MyLatch); here.
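
To illustrate the concern with a toy model (ToyLatch and the toy_*
functions are hypothetical, not the real latch API): a latch is a sticky
flag, so once it has been set, every later wait returns at once until
somebody resets it, and the loop stops sleeping.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy latch: a sticky flag, loosely modeled on set/wait/reset. */
typedef struct { bool is_set; } ToyLatch;

static void toy_set_latch(ToyLatch *l)   { l->is_set = true; }
static void toy_reset_latch(ToyLatch *l) { l->is_set = false; }

/*
 * True if a wait would return immediately (latch already set),
 * false if it would sleep until the timeout.
 */
static bool
toy_wait_returns_immediately(const ToyLatch *l)
{
	return l->is_set;
}

/*
 * Run 'iterations' wait cycles after a single wakeup and count how
 * many of them return without sleeping.
 */
static int
count_immediate_returns(int iterations, bool reset_after_wait)
{
	ToyLatch	latch = {false};
	int			immediate = 0;

	toy_set_latch(&latch);		/* one wakeup arrives */
	for (int i = 0; i < iterations; i++)
	{
		if (toy_wait_returns_immediately(&latch))
			immediate++;
		if (reset_after_wait)
			toy_reset_latch(&latch);
	}
	return immediate;
}
```

In this model, five iterations without a reset give five immediate returns
(a busy loop), while resetting after each wait gives just one.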

---
> * - decoding_ctx - logical decoding context, to capture concurrent data

Need to be removed together with parameters.

---
> hpm_context = AllocSetContextCreate(TopMemoryContext,
>                            "ProcessParallelMessages",
>                            ALLOCSET_DEFAULT_SIZES);

"ProcessRepacklMessages"

---
> if (XLogRecPtrIsInvalid(lsn_upto))
> {
>    SpinLockAcquire(&shared->mutex);
>    lsn_upto = shared->lsn_upto;
>    /* 'done' should be set at the same time as 'lsn_upto' */
>    done = shared->done;
>    SpinLockRelease(&shared->mutex);
>
>    /* Check if the work happens to be complete. */
>    continue;
> }

May be moved to the start of the loop to avoid duplication.

---
> SpinLockAcquire(&shared->mutex);
> valid = shared->sfs_valid;
> SpinLockRelease(&shared->mutex);

Better to remember last_exported here to avoid any races/misses.

---
> shared->lsn_upto = InvalidXLogRecPtr;

I think it is better to clear it once it is read (after removing duplication).

---
> bool       done;

bool exit_after_lsn_upto?

---
> bool       sfs_valid;

Do we really need it? I think it is better to leave only last_exported
and in process_concurrent_changes add an argument
(last_processed_file) and wait for last_exported to become higher.

---
What if we reverse roles of leader-worker?

Leader gets a snapshot, transfers it to workers (multiple probably for
parallel scan) using already ready mechanics - workers are processing
the scan of the table in parallel. Leader decodes the WAL.

Also, workers may be assigned with a list of indexes they need to build.

Feels like it reuses more from current infrastructure and also needs
less different synchronization logic. But I'm not sure about the
indexes phase - maybe it is not so easy to do.

---
Also, should we add some kind of back pressure between building
indexes/new heap and the amount of WAL we have?
But probably it is out of scope of the patch.

---
To build N indexes we need to scan the table N times. What about
building multiple indexes during a single heap scan?

---
Just a gentle reminder about the XMIN_COMMITTED flag and WAL storm
after the switch.

Best regards,
Mikhail.


* Re: Adding REPACK [concurrently]
@ 2025-12-13 18:45  Mihail Nikalayeu <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  1 sibling, 2 replies; 106+ messages in thread

From: Mihail Nikalayeu @ 2025-12-13 18:45 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Pg Hackers <[email protected]>; Robert Treat <[email protected]>

Hello, everyone.

Stress tests for REPACK CONCURRENTLY are attached.
So far I can't break anything (except MVCC of course).

A rebased version of the MVCC-safe "light" version with its own stress
test is attached also.

Best regards,
Mikhail.

From 457235c743a2dec2c1917fbdfa7f5a48d305c63e Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 13 Dec 2025 19:42:52 +0100
Subject: [PATCH vnocfbot] Preserve visibility information of the concurrent 
 data  changes.

As explained in the commit message of the preceding patch of the series, the
data changes done by applications while REPACK CONCURRENTLY is copying the
table contents to a new file are decoded from WAL and eventually also applied
to the new file. To reduce the complexity a little bit, the preceding patch
uses the current transaction (i.e. transaction opened by the REPACK command)
to execute those INSERT, UPDATE and DELETE commands.

However, REPACK is not expected to change visibility of tuples. Therefore,
this patch fixes the handling of the "concurrent data changes". It ensures
that tuples written into the new table have the same XID and command ID (CID)
as they had in the old table.

To "replay" an UPDATE or DELETE command on the new table, we use SnapshotSelf
to find the last live version of the tuple and stamp the change with the XID
of the original transaction. This is safe because:
* all transactions we are replaying are committed
* the apply worker runs without any concurrent modifiers of the table

As long as we preserve the tuple visibility information (which includes XID),
it's important to avoid logical decoding of the WAL generated by DMLs on the
new table: the logical decoding subsystem probably does not expect that the
incoming WAL records contain XIDs of already decoded transactions. (And of
course, repeated decoding would be wasted effort.)
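
As a rough illustration of why the original XID has to be kept, here is a
deliberately simplified visibility check. ToyXid, ToySnapshot and
toy_insert_visible are hypothetical; the sketch ignores XID wraparound,
command IDs, deletions and hint bits, and assumes the inserting transaction
has committed by the time the check runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ToyXid;

/*
 * Toy snapshot: XIDs >= xmax had not started when it was taken;
 * XIDs listed in xip[] were still running.
 */
typedef struct
{
	ToyXid		xmax;
	const ToyXid *xip;
	size_t		xcnt;
} ToySnapshot;

/*
 * A tuple inserted by 'tuple_xmin' is visible only if that
 * transaction had already finished when the snapshot was taken.
 */
static bool
toy_insert_visible(ToyXid tuple_xmin, const ToySnapshot *snap)
{
	if (tuple_xmin >= snap->xmax)
		return false;
	for (size_t i = 0; i < snap->xcnt; i++)
		if (snap->xip[i] == tuple_xmin)
			return false;
	return true;
}
```

Take a snapshot with xmax 105 and XID 103 still running. A row whose
original inserter was XID 100 is visible to it, but the same row re-stamped
with a later XID (say 110, as if written by the REPACK transaction) is not,
so the table would look empty to that snapshot.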

Author: Antonin Houska <[email protected]> with changes from Mikhail Nikalayeu <[email protected]>
---
 contrib/amcheck/meson.build                   |   1 +
 .../amcheck/t/009_repack_concurrently_mvcc.pl | 113 ++++++++++++++++++
 doc/src/sgml/mvcc.sgml                        |  12 +-
 doc/src/sgml/ref/repack.sgml                  |   9 --
 src/backend/access/common/toast_internals.c   |   3 +-
 src/backend/access/heap/heapam.c              |  29 +++--
 src/backend/access/heap/heapam_handler.c      |  24 ++--
 src/backend/commands/cluster.c                | 107 ++++++++++++-----
 .../pgoutput_repack/pgoutput_repack.c         |  16 ++-
 src/include/access/heapam.h                   |   6 +-
 .../injection_points/specs/repack.spec        |   4 -
 11 files changed, 243 insertions(+), 81 deletions(-)
 create mode 100644 contrib/amcheck/t/009_repack_concurrently_mvcc.pl

diff --git a/contrib/amcheck/meson.build b/contrib/amcheck/meson.build
index f7c70735989..6946c684259 100644
--- a/contrib/amcheck/meson.build
+++ b/contrib/amcheck/meson.build
@@ -52,6 +52,7 @@ tests += {
       't/006_verify_gin.pl',
       't/007_repack_concurrently.pl',
       't/008_repack_concurrently.pl',
+      't/009_repack_concurrently_mvcc.pl',
     ],
   },
 }
diff --git a/contrib/amcheck/t/009_repack_concurrently_mvcc.pl b/contrib/amcheck/t/009_repack_concurrently_mvcc.pl
new file mode 100644
index 00000000000..a83fd5b8141
--- /dev/null
+++ b/contrib/amcheck/t/009_repack_concurrently_mvcc.pl
@@ -0,0 +1,113 @@
+
+# Copyright (c) 2021-2025, PostgreSQL Global Development Group
+
+# Test REPACK CONCURRENTLY with concurrent modifications
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+
+use Test::More;
+
+my $node;
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('CIC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf(
+	'postgresql.conf', qq(
+wal_level = logical
+));
+$node->start;
+$node->safe_psql('postgres', q(CREATE TABLE tbl1(i int PRIMARY KEY, j int)));
+$node->safe_psql('postgres', q(CREATE TABLE tbl2(i int PRIMARY KEY, j int)));
+
+
+# Insert 100 rows into tbl1
+$node->safe_psql('postgres', q(
+    INSERT INTO tbl1 SELECT i, i % 100 FROM generate_series(1,100) i
+));
+
+# Insert 100 rows into tbl2
+$node->safe_psql('postgres', q(
+    INSERT INTO tbl2 SELECT i, i % 100 FROM generate_series(1,100) i
+));
+
+
+# Create a helper function that reports mismatched rows
+$node->safe_psql('postgres', q(
+	CREATE OR REPLACE FUNCTION log_raise(i int, j1 int, j2 int) RETURNS VOID AS $$
+	BEGIN
+	  RAISE NOTICE 'ERROR i=% j1=% j2=%', i, j1, j2;
+	END;$$ LANGUAGE plpgsql;
+));
+
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+
+$node->pgbench(
+'--no-vacuum --client=10 --jobs=4 --exit-on-abort --transactions=2500',
+0,
+[qr{actually processed}],
+[qr{^$}],
+'concurrent operations with REPACK CONCURRENTLY',
+{
+	'concurrent_ops' => q(
+		SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+		\if :gotlock
+			SELECT nextval('in_row_rebuild') AS last_value \gset
+			\if :last_value = 2
+				REPACK (CONCURRENTLY) tbl1 USING INDEX tbl1_pkey;
+				\sleep 10 ms
+				REPACK (CONCURRENTLY) tbl2 USING INDEX tbl2_pkey;
+				\sleep 10 ms
+			\endif
+			SELECT pg_advisory_unlock(42);
+		\else
+			\set num random(1, 100)
+			BEGIN;
+			UPDATE tbl1 SET j = j + 1 WHERE i = :num;
+			\sleep 1 ms
+			UPDATE tbl1 SET j = j + 2 WHERE i = :num;
+			\sleep 1 ms
+			UPDATE tbl1 SET j = j + 3 WHERE i = :num;
+			\sleep 1 ms
+			UPDATE tbl1 SET j = j + 4 WHERE i = :num;
+			\sleep 1 ms
+
+			UPDATE tbl2 SET j = j + 1 WHERE i = :num;
+			\sleep 1 ms
+			UPDATE tbl2 SET j = j + 2 WHERE i = :num;
+			\sleep 1 ms
+			UPDATE tbl2 SET j = j + 3 WHERE i = :num;
+			\sleep 1 ms
+			UPDATE tbl2 SET j = j + 4 WHERE i = :num;
+
+			COMMIT;
+			SELECT setval('in_row_rebuild', 1);
+
+			BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+			SELECT COALESCE(SUM(j), 0) AS t1 FROM tbl1 WHERE i = :num \gset p_
+			\sleep 10 ms
+			SELECT COALESCE(SUM(j), 0) AS t2 FROM tbl2 WHERE i = :num \gset p_
+			\if :p_t1 != :p_t2
+				COMMIT;
+				SELECT log_raise(tbl1.i, tbl1.j, tbl2.j) FROM tbl1 LEFT OUTER JOIN tbl2 ON tbl1.i = tbl2.i WHERE tbl1.j != tbl2.j;
+				\sleep 10 ms
+				SELECT log_raise(tbl1.i, tbl1.j, tbl2.j) FROM tbl1 LEFT OUTER JOIN tbl2 ON tbl1.i = tbl2.i WHERE tbl1.j != tbl2.j;
+				SELECT (:p_t1 + :p_t2) / 0;
+			\endif
+
+			COMMIT;
+		\endif
+	)
+});
+
+$node->stop;
+done_testing();
diff --git a/doc/src/sgml/mvcc.sgml b/doc/src/sgml/mvcc.sgml
index 0f5c34af542..049ee75a4ba 100644
--- a/doc/src/sgml/mvcc.sgml
+++ b/doc/src/sgml/mvcc.sgml
@@ -1833,17 +1833,15 @@ SELECT pg_advisory_lock(q.id) FROM
    <title>Caveats</title>
 
    <para>
-    Some commands, currently only <link linkend="sql-truncate"><command>TRUNCATE</command></link>, the
-    table-rewriting forms of <link linkend="sql-altertable"><command>ALTER
-    TABLE</command></link> and <command>REPACK</command> with
-    the <literal>CONCURRENTLY</literal> option, are not
+    Some DDL commands, currently only <link linkend="sql-truncate"><command>TRUNCATE</command></link> and the
+    table-rewriting forms of <link linkend="sql-altertable"><command>ALTER TABLE</command></link>, are not
     MVCC-safe.  This means that after the truncation or rewrite commits, the
     table will appear empty to concurrent transactions, if they are using a
-    snapshot taken before the command committed.  This will only be an
+    snapshot taken before the DDL command committed.  This will only be an
     issue for a transaction that did not access the table in question
-    before the command started &mdash; any transaction that has done so
+    before the DDL command started &mdash; any transaction that has done so
     would hold at least an <literal>ACCESS SHARE</literal> table lock,
-    which would block the truncating or rewriting command until that transaction completes.
+    which would block the DDL command until that transaction completes.
     So these commands will not cause any apparent inconsistency in the
     table contents for successive queries on the target table, but they
     could cause visible inconsistency between the contents of the target
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
index 30c43c49069..9796a923597 100644
--- a/doc/src/sgml/ref/repack.sgml
+++ b/doc/src/sgml/ref/repack.sgml
@@ -308,15 +308,6 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] USING
        </listitem>
       </itemizedlist>
      </para>
-
-     <warning>
-      <para>
-       <command>REPACK</command> with the <literal>CONCURRENTLY</literal>
-       option is not MVCC-safe, see <xref linkend="mvcc-caveats"/> for
-       details.
-      </para>
-     </warning>
-
     </listitem>
    </varlistentry>
 
diff --git a/src/backend/access/common/toast_internals.c b/src/backend/access/common/toast_internals.c
index 63b848473f8..91119da5cd5 100644
--- a/src/backend/access/common/toast_internals.c
+++ b/src/backend/access/common/toast_internals.c
@@ -311,7 +311,8 @@ toast_save_datum(Relation rel, Datum value,
 
 		toasttup = heap_form_tuple(toasttupDesc, t_values, t_isnull);
 
-		heap_insert(toastrel, toasttup, mycid, options, NULL);
+		heap_insert(toastrel, toasttup, GetCurrentTransactionId(), mycid,
+					options, NULL);
 
 		/*
 		 * Create the index entry.  We cheat a little here by not using
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e11833f01b4..94ca07e4b55 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2085,7 +2085,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
 /*
  *	heap_insert		- insert tuple into a heap
  *
- * The new tuple is stamped with current transaction ID and the specified
+ * The new tuple is stamped with specified transaction ID and the specified
  * command ID.
  *
  * See table_tuple_insert for comments about most of the input flags, except
@@ -2101,15 +2101,16 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
  * reflected into *tup.
  */
 void
-heap_insert(Relation relation, HeapTuple tup, CommandId cid,
-			int options, BulkInsertState bistate)
+heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+			CommandId cid, int options, BulkInsertState bistate)
 {
-	TransactionId xid = GetCurrentTransactionId();
 	HeapTuple	heaptup;
 	Buffer		buffer;
 	Buffer		vmbuffer = InvalidBuffer;
 	bool		all_visible_cleared = false;
 
+	Assert(TransactionIdIsValid(xid));
+
 	/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
 	Assert(HeapTupleHeaderGetNatts(tup->t_data) <=
 		   RelationGetNumberOfAttributes(relation));
@@ -2375,7 +2376,6 @@ void
 heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 				  CommandId cid, int options, BulkInsertState bistate)
 {
-	TransactionId xid = GetCurrentTransactionId();
 	HeapTuple  *heaptuples;
 	int			i;
 	int			ndone;
@@ -2408,7 +2408,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		tuple = ExecFetchSlotHeapTuple(slots[i], true, NULL);
 		slots[i]->tts_tableOid = RelationGetRelid(relation);
 		tuple->t_tableOid = slots[i]->tts_tableOid;
-		heaptuples[i] = heap_prepare_insert(relation, tuple, xid, cid,
+		heaptuples[i] = heap_prepare_insert(relation, tuple, GetCurrentTransactionId(), cid,
 											options);
 	}
 
@@ -2746,7 +2746,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 void
 simple_heap_insert(Relation relation, HeapTuple tup)
 {
-	heap_insert(relation, tup, GetCurrentCommandId(true), 0, NULL);
+	heap_insert(relation, tup, GetCurrentTransactionId(),
+				GetCurrentCommandId(true), 0, NULL);
 }
 
 /*
@@ -2803,11 +2804,10 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
  */
 TM_Result
 heap_delete(Relation relation, const ItemPointerData *tid,
-			CommandId cid, Snapshot crosscheck, bool wait,
+			TransactionId xid, CommandId cid, Snapshot crosscheck, bool wait,
 			TM_FailureData *tmfd, bool changingPart, bool walLogical)
 {
 	TM_Result	result;
-	TransactionId xid = GetCurrentTransactionId();
 	ItemId		lp;
 	HeapTupleData tp;
 	Page		page;
@@ -2824,6 +2824,7 @@ heap_delete(Relation relation, const ItemPointerData *tid,
 	bool		old_key_copied = false;
 
 	Assert(ItemPointerIsValid(tid));
+	Assert(TransactionIdIsValid(xid));
 
 	AssertHasSnapshotForToast(relation);
 
@@ -3240,7 +3241,7 @@ simple_heap_delete(Relation relation, const ItemPointerData *tid)
 	TM_Result	result;
 	TM_FailureData tmfd;
 
-	result = heap_delete(relation, tid,
+	result = heap_delete(relation, tid, GetCurrentTransactionId(),
 						 GetCurrentCommandId(true), InvalidSnapshot,
 						 true /* wait for commit */ ,
 						 &tmfd, false,	/* changingPart */
@@ -3283,12 +3284,11 @@ simple_heap_delete(Relation relation, const ItemPointerData *tid)
  */
 TM_Result
 heap_update(Relation relation, const ItemPointerData *otid, HeapTuple newtup,
-			CommandId cid, Snapshot crosscheck, bool wait,
+			TransactionId xid, CommandId cid, Snapshot crosscheck, bool wait,
 			TM_FailureData *tmfd, LockTupleMode *lockmode,
 			TU_UpdateIndexes *update_indexes, bool walLogical)
 {
 	TM_Result	result;
-	TransactionId xid = GetCurrentTransactionId();
 	Bitmapset  *hot_attrs;
 	Bitmapset  *sum_attrs;
 	Bitmapset  *key_attrs;
@@ -3328,6 +3328,7 @@ heap_update(Relation relation, const ItemPointerData *otid, HeapTuple newtup,
 				infomask2_new_tuple;
 
 	Assert(ItemPointerIsValid(otid));
+	Assert(TransactionIdIsValid(xid));
 
 	/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
 	Assert(HeapTupleHeaderGetNatts(newtup->t_data) <=
@@ -4534,7 +4535,7 @@ simple_heap_update(Relation relation, const ItemPointerData *otid, HeapTuple tup
 	TM_FailureData tmfd;
 	LockTupleMode lockmode;
 
-	result = heap_update(relation, otid, tup,
+	result = heap_update(relation, otid, tup, GetCurrentTransactionId(),
 						 GetCurrentCommandId(true), InvalidSnapshot,
 						 true /* wait for commit */ ,
 						 &tmfd, &lockmode, update_indexes,
@@ -5373,8 +5374,6 @@ compute_new_xmax_infomask(TransactionId xmax, uint16 old_infomask,
 	uint16		new_infomask,
 				new_infomask2;
 
-	Assert(TransactionIdIsCurrentTransactionId(add_to_xmax));
-
 l5:
 	new_infomask = 0;
 	new_infomask2 = 0;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index e6d630fa2f7..b49f9add5bb 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -252,7 +252,8 @@ heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
 	tuple->t_tableOid = slot->tts_tableOid;
 
 	/* Perform the insertion, and copy the resulting ItemPointer */
-	heap_insert(relation, tuple, cid, options, bistate);
+	heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+				bistate);
 	ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
 
 	if (shouldFree)
@@ -275,7 +276,8 @@ heapam_tuple_insert_speculative(Relation relation, TupleTableSlot *slot,
 	options |= HEAP_INSERT_SPECULATIVE;
 
 	/* Perform the insertion, and copy the resulting ItemPointer */
-	heap_insert(relation, tuple, cid, options, bistate);
+	heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+				bistate);
 	ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
 
 	if (shouldFree)
@@ -309,8 +311,8 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
 	 * the storage itself is cleaning the dead tuples by itself, it is the
 	 * time to call the index tuple deletion also.
 	 */
-	return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart,
-					   true);
+	return heap_delete(relation, tid, GetCurrentTransactionId(), cid,
+					   crosscheck, wait, tmfd, changingPart, true);
 }
 
 
@@ -328,7 +330,8 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 	slot->tts_tableOid = RelationGetRelid(relation);
 	tuple->t_tableOid = slot->tts_tableOid;
 
-	result = heap_update(relation, otid, tuple, cid, crosscheck, wait,
+	result = heap_update(relation, otid, tuple, GetCurrentTransactionId(),
+						 cid, crosscheck, wait,
 						 tmfd, lockmode, update_indexes, true);
 	ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
 
@@ -2441,9 +2444,16 @@ reform_and_rewrite_tuple(HeapTuple tuple,
 		 * flag to skip logical decoding: as soon as REPACK CONCURRENTLY swaps
 		 * the relation files, it drops this relation, so no logical
 		 * replication subscription should need the data.
+		 *
+		 * It is also crucial to stamp the new record with the exact same xid
+		 * and cid, because the tuple must be visible to the snapshots of the
+		 * concurrent transactions later.
 		 */
-		heap_insert(NewHeap, copiedTuple, GetCurrentCommandId(true),
-					HEAP_INSERT_NO_LOGICAL, NULL);
+		// TODO: looks like cid is not required
+		CommandId	cid = HeapTupleHeaderGetRawCommandId(tuple->t_data);
+		TransactionId xid = HeapTupleHeaderGetXmin(tuple->t_data);
+
+		heap_insert(NewHeap, copiedTuple, xid, cid, HEAP_INSERT_NO_LOGICAL, NULL);
 	}
 
 	heap_freetuple(copiedTuple);
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index f2a2ec6d3e5..1b1928ce300 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -58,6 +58,7 @@
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/procarray.h"
 #include "storage/procsignal.h"
 #include "tcop/tcopprot.h"
 #include "utils/acl.h"
@@ -249,15 +250,20 @@ static bool decode_concurrent_changes(LogicalDecodingContext *ctx,
 									  DecodingWorkerShared *shared);
 static void apply_concurrent_changes(BufFile *file, ChangeDest *dest);
 static void apply_concurrent_insert(Relation rel, HeapTuple tup,
+									TransactionId xid,
 									IndexInsertState *iistate,
 									TupleTableSlot *index_slot);
 static void apply_concurrent_update(Relation rel, HeapTuple tup,
 									HeapTuple tup_target,
+									TransactionId xid,
 									IndexInsertState *iistate,
 									TupleTableSlot *index_slot);
-static void apply_concurrent_delete(Relation rel, HeapTuple tup_target);
+static void apply_concurrent_delete(Relation rel,
+									TransactionId xid,
+									HeapTuple tup_target);
 static HeapTuple find_target_tuple(Relation rel, ChangeDest *dest,
 								   HeapTuple tup_key,
+								   Snapshot snapshot,
 								   TupleTableSlot *ident_slot);
 static void process_concurrent_changes(XLogRecPtr end_of_wal,
 									   ChangeDest *dest,
@@ -1091,7 +1097,14 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose, bool concurrent
 
 	/* The historic snapshot won't be needed anymore. */
 	if (snapshot)
+	{
+		TransactionId xmin = snapshot->xmin;
 		PopActiveSnapshot();
+		Assert(concurrent);
+		// TODO: seems like it is not required: need to check SnapBuildInitialSnapshotForRepack
+		WaitForOlderSnapshots(xmin, false);
+	}
+
 
 	if (concurrent)
 	{
@@ -1382,30 +1395,35 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
 	 * not to be aggressive about this.
 	 */
 	memset(&params, 0, sizeof(VacuumParams));
-	vacuum_get_cutoffs(OldHeap, params, &cutoffs);
-
-	/*
-	 * FreezeXid will become the table's new relfrozenxid, and that mustn't go
-	 * backwards, so take the max.
-	 */
+	if (!concurrent)
 	{
 		TransactionId relfrozenxid = OldHeap->rd_rel->relfrozenxid;
+		MultiXactId relminmxid = OldHeap->rd_rel->relminmxid;
 
+		vacuum_get_cutoffs(OldHeap, params, &cutoffs);
+		/*
+		 * FreezeXid will become the table's new relfrozenxid, and that mustn't go
+		 * backwards, so take the max.
+		 */
 		if (TransactionIdIsValid(relfrozenxid) &&
 			TransactionIdPrecedes(cutoffs.FreezeLimit, relfrozenxid))
 			cutoffs.FreezeLimit = relfrozenxid;
-	}
-
-	/*
-	 * MultiXactCutoff, similarly, shouldn't go backwards either.
-	 */
-	{
-		MultiXactId relminmxid = OldHeap->rd_rel->relminmxid;
-
+		/*
+		 * MultiXactCutoff, similarly, shouldn't go backwards either.
+		 */
 		if (MultiXactIdIsValid(relminmxid) &&
 			MultiXactIdPrecedes(cutoffs.MultiXactCutoff, relminmxid))
 			cutoffs.MultiXactCutoff = relminmxid;
 	}
+	else
+	{
+		/*
+		 * In concurrent mode we reuse all the xmin/xmax,
+		 * so just use current values for simplicity.
+		 */
+		cutoffs.FreezeLimit = OldHeap->rd_rel->relfrozenxid;
+		cutoffs.MultiXactCutoff = OldHeap->rd_rel->relminmxid;
+	}
 
 	/*
 	 * Decide whether to use an indexscan or seqscan-and-optional-sort to scan
@@ -2745,6 +2763,7 @@ apply_concurrent_changes(BufFile *file, ChangeDest *dest)
 		size_t		nread;
 		HeapTuple	tup,
 					tup_exist;
+		TransactionId xid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -2761,6 +2780,17 @@ apply_concurrent_changes(BufFile *file, ChangeDest *dest)
 		tup->t_len = t_len;
 		ItemPointerSetInvalid(&tup->t_self);
 		tup->t_tableOid = RelationGetRelid(dest->rel);
+		BufFileReadExact(file, &xid, sizeof(TransactionId));
+
+		if (TransactionIdIsValid(xid) && TransactionIdIsInProgress(xid))
+		{
+			/* The xid is certainly committed, because we got this change from the
+			 * reorderbuffer, but the procarray may not have been updated yet, so this
+			 * backend can still see it as in-progress. Wait for the procarray update. */
+			XactLockTableWait(xid, NULL, NULL, XLTW_None);
+			Assert(!TransactionIdIsInProgress(xid));
+			Assert(TransactionIdDidCommit(xid));
+		}
 
 		if (kind == CHANGE_UPDATE_OLD)
 		{
@@ -2771,7 +2801,7 @@ apply_concurrent_changes(BufFile *file, ChangeDest *dest)
 		{
 			Assert(tup_old == NULL);
 
-			apply_concurrent_insert(rel, tup, dest->iistate, index_slot);
+			apply_concurrent_insert(rel, tup, xid, dest->iistate, index_slot);
 
 			pfree(tup);
 		}
@@ -2790,17 +2820,21 @@ apply_concurrent_changes(BufFile *file, ChangeDest *dest)
 			}
 
 			/*
-			 * Find the tuple to be updated or deleted.
+			 * Find the tuple to be updated or deleted using SnapshotSelf.
+			 * That way we get the last live version in case of a HOT chain.
+			 * There can be no updated-but-not-yet-committed version, because
+			 * here we replay only committed transactions, with no concurrency
+			 * involved.
 			 */
-			tup_exist = find_target_tuple(rel, dest, tup_key, ident_slot);
+			tup_exist = find_target_tuple(rel, dest, tup_key, SnapshotSelf, ident_slot);
 			if (tup_exist == NULL)
 				elog(ERROR, "failed to find target tuple");
 
 			if (kind == CHANGE_UPDATE_NEW)
-				apply_concurrent_update(rel, tup, tup_exist, dest->iistate,
+				apply_concurrent_update(rel, tup, tup_exist, xid, dest->iistate,
 										index_slot);
 			else
-				apply_concurrent_delete(rel, tup_exist);
+				apply_concurrent_delete(rel, xid, tup_exist);
 
 			if (tup_old != NULL)
 			{
@@ -2819,6 +2853,7 @@ apply_concurrent_changes(BufFile *file, ChangeDest *dest)
 		 */
 		if (kind != CHANGE_UPDATE_OLD)
 		{
+			// TODO: not sure this is required at all: we are replaying committed transactions and stamping the tuples with their committed XIDs
 			CommandCounterIncrement();
 			UpdateActiveSnapshotCommandId();
 		}
@@ -2830,7 +2865,7 @@ apply_concurrent_changes(BufFile *file, ChangeDest *dest)
 }
 
 static void
-apply_concurrent_insert(Relation rel, HeapTuple tup, IndexInsertState *iistate,
+apply_concurrent_insert(Relation rel, HeapTuple tup, TransactionId xid, IndexInsertState *iistate,
 						TupleTableSlot *index_slot)
 {
 	List	   *recheck;
@@ -2840,9 +2875,12 @@ apply_concurrent_insert(Relation rel, HeapTuple tup, IndexInsertState *iistate,
 	 * Like simple_heap_insert(), but make sure that the INSERT is not
 	 * logically decoded - see reform_and_rewrite_tuple() for more
 	 * information.
+	 *
+	 * Use already committed xid to stamp the tuple.
 	 */
-	heap_insert(rel, tup, GetCurrentCommandId(true), HEAP_INSERT_NO_LOGICAL,
-				NULL);
+	Assert(TransactionIdIsValid(xid));
+	heap_insert(rel, tup, xid, GetCurrentCommandId(true),
+				HEAP_INSERT_NO_LOGICAL, NULL);
 
 	/*
 	 * Update indexes.
@@ -2850,6 +2888,7 @@ apply_concurrent_insert(Relation rel, HeapTuple tup, IndexInsertState *iistate,
 	 * In case functions in the index need the active snapshot and caller
 	 * hasn't set one.
 	 */
+	PushActiveSnapshot(GetLatestSnapshot());
 	ExecStoreHeapTuple(tup, index_slot, false);
 	recheck = ExecInsertIndexTuples(iistate->rri,
 									index_slot,
@@ -2860,6 +2899,7 @@ apply_concurrent_insert(Relation rel, HeapTuple tup, IndexInsertState *iistate,
 									NIL,	/* arbiterIndexes */
 									false	/* onlySummarizing */
 		);
+	PopActiveSnapshot();
 
 	/*
 	 * If recheck is required, it must have been preformed on the source
@@ -2873,6 +2913,7 @@ apply_concurrent_insert(Relation rel, HeapTuple tup, IndexInsertState *iistate,
 
 static void
 apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
+						TransactionId xid,
 						IndexInsertState *iistate, TupleTableSlot *index_slot)
 {
 	LockTupleMode lockmode;
@@ -2887,9 +2928,12 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
 	 *
 	 * Do it like in simple_heap_update(), except for 'wal_logical' (and
 	 * except for 'wait').
+	 *
+	 * Use already committed xid to stamp the tuple.
 	 */
+	Assert(TransactionIdIsValid(xid));
 	res = heap_update(rel, &tup_target->t_self, tup,
-					  GetCurrentCommandId(true),
+					  xid, GetCurrentCommandId(true),
 					  InvalidSnapshot,
 					  false,	/* no wait - only we are doing changes */
 					  &tmfd, &lockmode, &update_indexes,
@@ -2901,6 +2945,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
 
 	if (update_indexes != TU_None)
 	{
+		PushActiveSnapshot(GetLatestSnapshot());
 		recheck = ExecInsertIndexTuples(iistate->rri,
 										index_slot,
 										iistate->estate,
@@ -2910,6 +2955,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
 										NIL,	/* arbiterIndexes */
 		/* onlySummarizing */
 										update_indexes == TU_Summarizing);
+		PopActiveSnapshot();
 		list_free(recheck);
 	}
 
@@ -2917,7 +2963,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
 }
 
 static void
-apply_concurrent_delete(Relation rel, HeapTuple tup_target)
+apply_concurrent_delete(Relation rel, TransactionId xid, HeapTuple tup_target)
 {
 	TM_Result	res;
 	TM_FailureData tmfd;
@@ -2927,9 +2973,12 @@ apply_concurrent_delete(Relation rel, HeapTuple tup_target)
 	 *
 	 * Do it like in simple_heap_delete(), except for 'wal_logical' (and
 	 * except for 'wait').
+	 *
+	 * Use already committed xid to stamp the tuple.
 	 */
-	res = heap_delete(rel, &tup_target->t_self, GetCurrentCommandId(true),
-					  InvalidSnapshot, false,
+	Assert(TransactionIdIsValid(xid));
+	res = heap_delete(rel, &tup_target->t_self, xid,
+					  GetCurrentCommandId(true), InvalidSnapshot, false,
 					  &tmfd,
 					  false,	/* no wait - only we are doing changes */
 					  false /* wal_logical */ );
@@ -2950,7 +2999,7 @@ apply_concurrent_delete(Relation rel, HeapTuple tup_target)
  */
 static HeapTuple
 find_target_tuple(Relation rel, ChangeDest *dest, HeapTuple tup_key,
-				  TupleTableSlot *ident_slot)
+				  Snapshot snapshot, TupleTableSlot *ident_slot)
 {
 	Relation	ident_index = dest->ident_index;
 	IndexScanDesc scan;
@@ -2959,7 +3008,7 @@ find_target_tuple(Relation rel, ChangeDest *dest, HeapTuple tup_key,
 	HeapTuple	result = NULL;
 
 	/* XXX no instrumentation for now */
-	scan = index_beginscan(rel, ident_index, GetActiveSnapshot(),
+	scan = index_beginscan(rel, ident_index, snapshot,
 						   NULL, dest->ident_key_nentries, 0);
 
 	/*
diff --git a/src/backend/replication/pgoutput_repack/pgoutput_repack.c b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
index fb9956d392d..8d796e0a684 100644
--- a/src/backend/replication/pgoutput_repack/pgoutput_repack.c
+++ b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
@@ -29,7 +29,8 @@ static void plugin_commit_txn(LogicalDecodingContext *ctx,
 static void plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 						  Relation rel, ReorderBufferChange *change);
 static void store_change(LogicalDecodingContext *ctx,
-						 ConcurrentChangeKind kind, HeapTuple tuple);
+						 ConcurrentChangeKind kind, HeapTuple tuple,
+						 TransactionId xid);
 
 void
 _PG_output_plugin_init(OutputPluginCallbacks *cb)
@@ -120,7 +121,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				if (newtuple == NULL)
 					elog(ERROR, "Incomplete insert info.");
 
-				store_change(ctx, CHANGE_INSERT, newtuple);
+				store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid);
 			}
 			break;
 		case REORDER_BUFFER_CHANGE_UPDATE:
@@ -137,9 +138,11 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					elog(ERROR, "Incomplete update info.");
 
 				if (oldtuple != NULL)
-					store_change(ctx, CHANGE_UPDATE_OLD, oldtuple);
+					store_change(ctx, CHANGE_UPDATE_OLD, oldtuple,
+								 change->txn->xid);
 
-				store_change(ctx, CHANGE_UPDATE_NEW, newtuple);
+				store_change(ctx, CHANGE_UPDATE_NEW, newtuple,
+							 change->txn->xid);
 			}
 			break;
 		case REORDER_BUFFER_CHANGE_DELETE:
@@ -152,7 +155,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				if (oldtuple == NULL)
 					elog(ERROR, "Incomplete delete info.");
 
-				store_change(ctx, CHANGE_DELETE, oldtuple);
+				store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid);
 			}
 			break;
 		default:
@@ -165,7 +168,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 /* Store concurrent data change. */
 static void
 store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
-			 HeapTuple tuple)
+			 HeapTuple tuple, TransactionId xid)
 {
 	RepackDecodingState *dstate;
 	char		kind_byte = (char) kind;
@@ -195,6 +198,7 @@ store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
 	BufFileWrite(dstate->file, &tuple->t_len, sizeof(tuple->t_len));
 	/* ... and the tuple itself. */
 	BufFileWrite(dstate->file, tuple->t_data, tuple->t_len);
+	BufFileWrite(dstate->file, &xid, sizeof(TransactionId));
 
 	/* Free the flat copy if created above. */
 	if (flattened)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b7cd25896f6..d9776f61a0d 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -354,20 +354,20 @@ extern BulkInsertState GetBulkInsertState(void);
 extern void FreeBulkInsertState(BulkInsertState);
 extern void ReleaseBulkInsertStatePin(BulkInsertState bistate);
 
-extern void heap_insert(Relation relation, HeapTuple tup, CommandId cid,
+extern void heap_insert(Relation relation, HeapTuple tup, TransactionId xid, CommandId cid,
 						int options, BulkInsertState bistate);
 extern void heap_multi_insert(Relation relation, TupleTableSlot **slots,
 							  int ntuples, CommandId cid, int options,
 							  BulkInsertState bistate);
 extern TM_Result heap_delete(Relation relation, const ItemPointerData *tid,
-							 CommandId cid, Snapshot crosscheck, bool wait,
+							 TransactionId xid, CommandId cid, Snapshot crosscheck, bool wait,
 							 TM_FailureData *tmfd, bool changingPart,
 							 bool wal_logical);
 extern void heap_finish_speculative(Relation relation, const ItemPointerData *tid);
 extern void heap_abort_speculative(Relation relation, const ItemPointerData *tid);
 extern TM_Result heap_update(Relation relation, const ItemPointerData *otid,
 							 HeapTuple newtup,
-							 CommandId cid, Snapshot crosscheck, bool wait,
+							 TransactionId xid, CommandId cid, Snapshot crosscheck, bool wait,
 							 TM_FailureData *tmfd, LockTupleMode *lockmode,
 							 TU_UpdateIndexes *update_indexes, bool wal_logical);
 extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
diff --git a/src/test/modules/injection_points/specs/repack.spec b/src/test/modules/injection_points/specs/repack.spec
index d727a9b056b..accd42d78aa 100644
--- a/src/test/modules/injection_points/specs/repack.spec
+++ b/src/test/modules/injection_points/specs/repack.spec
@@ -85,9 +85,6 @@ step change_new
 # When applying concurrent data changes, we should see the effects of an
 # in-progress subtransaction.
 #
-# XXX Not sure this test is useful now - it was designed for the patch that
-# preserves tuple visibility and which therefore modifies
-# TransactionIdIsCurrentTransactionId().
 step change_subxact1
 {
 	BEGIN;
@@ -102,7 +99,6 @@ step change_subxact1
 # When applying concurrent data changes, we should not see the effects of a
 # rolled back subtransaction.
 #
-# XXX Is this test useful? See above.
 step change_subxact2
 {
 	BEGIN;
-- 
2.43.0
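
For readers following the pgoutput_repack.c and cluster.c hunks above: the
record that store_change() appends to the file, and that
apply_concurrent_changes() reads back, now ends with the XID of the
originating transaction. Below is a minimal standalone sketch of that layout
using plain stdio instead of BufFile. The names write_change(), read_change()
and ChangeKind are hypothetical, the TransactionId typedef is a stand-in, and
the sketch assumes the kind byte is written first (that part is not visible in
the hunks shown). It is only meant to illustrate the field order, not the
actual implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for the PostgreSQL typedef; an assumption of this sketch. */
typedef uint32_t TransactionId;

/* Hypothetical change kinds, mirroring the names used in the patch. */
typedef enum
{
	CHANGE_INSERT,
	CHANGE_UPDATE_OLD,
	CHANGE_UPDATE_NEW,
	CHANGE_DELETE
} ChangeKind;

/*
 * Write one record: kind byte, payload length, payload, originating XID.
 * Returns 0 on success, -1 on a short write.
 */
static int
write_change(FILE *f, ChangeKind kind, const void *data, uint32_t len,
			 TransactionId xid)
{
	char		kind_byte = (char) kind;

	if (fwrite(&kind_byte, 1, 1, f) != 1)
		return -1;
	if (fwrite(&len, sizeof(len), 1, f) != 1)
		return -1;
	if (len > 0 && fwrite(data, 1, len, f) != len)
		return -1;
	/* The XID goes last, after the tuple data, as in the patch. */
	if (fwrite(&xid, sizeof(xid), 1, f) != 1)
		return -1;
	return 0;
}

/*
 * Read one record back in the same order.
 * Returns 0 on success, -1 on EOF, a short read, or a too-small buffer.
 */
static int
read_change(FILE *f, ChangeKind *kind, void *buf, uint32_t bufsize,
			uint32_t *len, TransactionId *xid)
{
	char		kind_byte;

	assert(f != NULL && kind != NULL && buf != NULL && len != NULL && xid != NULL);

	if (fread(&kind_byte, 1, 1, f) != 1)
		return -1;
	if (fread(len, sizeof(*len), 1, f) != 1)
		return -1;
	if (*len > bufsize)
		return -1;
	if (*len > 0 && fread(buf, 1, *len, f) != *len)
		return -1;
	if (fread(xid, sizeof(*xid), 1, f) != 1)
		return -1;
	*kind = (ChangeKind) kind_byte;
	return 0;
}
```

Since the XID trails the tuple data, a reader has to consume the whole tuple
before it knows which transaction to stamp it with; this matches the order in
which the patch calls BufFileReadExact() for the xid.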



Attachments:

  [text/plain] nocfbot-0006-Preserve-visibility-information-of-the-conc.patch (31.8K, 2-nocfbot-0006-Preserve-visibility-information-of-the-conc.patch)
  download | inline diff:
From 457235c743a2dec2c1917fbdfa7f5a48d305c63e Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 13 Dec 2025 19:42:52 +0100
Subject: [PATCH vnocfbot] Preserve visibility information of the concurrent
 data changes.

As explained in the commit message of the preceding patch of the series, the
data changes done by applications while REPACK CONCURRENTLY is copying the
table contents to a new file are decoded from WAL and eventually also applied
to the new file. To reduce the complexity a little bit, the preceding patch
uses the current transaction (i.e. transaction opened by the REPACK command)
to execute those INSERT, UPDATE and DELETE commands.

However, REPACK is not expected to change visibility of tuples. Therefore,
this patch fixes the handling of the "concurrent data changes". It ensures
that tuples written into the new table have the same XID and command ID (CID)
as they had in the old table.

To "replay" an UPDATE or DELETE command on the new table, we use SnapshotSelf
to find the last live version of the tuple and stamp the change with the XID
of the original transaction. This is safe because:
* all transactions being replayed are committed
* the apply worker runs without any concurrent modifiers of the table

As long as we preserve the tuple visibility information (which includes XID),
it's important to avoid logical decoding of the WAL generated by DMLs on the
new table: the logical decoding subsystem probably does not expect that the
incoming WAL records contain XIDs of already decoded transactions. (And of
course, repeated decoding would be wasted effort.)

Author: Antonin Houska <[email protected]> with changes from Mikhail Nikalayeu <[email protected]>
---
 contrib/amcheck/meson.build                   |   1 +
 .../amcheck/t/009_repack_concurrently_mvcc.pl | 113 ++++++++++++++++++
 doc/src/sgml/mvcc.sgml                        |  12 +-
 doc/src/sgml/ref/repack.sgml                  |   9 --
 src/backend/access/common/toast_internals.c   |   3 +-
 src/backend/access/heap/heapam.c              |  29 +++--
 src/backend/access/heap/heapam_handler.c      |  24 ++--
 src/backend/commands/cluster.c                | 107 ++++++++++++-----
 .../pgoutput_repack/pgoutput_repack.c         |  16 ++-
 src/include/access/heapam.h                   |   6 +-
 .../injection_points/specs/repack.spec        |   4 -
 11 files changed, 243 insertions(+), 81 deletions(-)
 create mode 100644 contrib/amcheck/t/009_repack_concurrently_mvcc.pl

diff --git a/contrib/amcheck/meson.build b/contrib/amcheck/meson.build
index f7c70735989..6946c684259 100644
--- a/contrib/amcheck/meson.build
+++ b/contrib/amcheck/meson.build
@@ -52,6 +52,7 @@ tests += {
       't/006_verify_gin.pl',
       't/007_repack_concurrently.pl',
       't/008_repack_concurrently.pl',
+      't/009_repack_concurrently_mvcc.pl',
     ],
   },
 }
diff --git a/contrib/amcheck/t/009_repack_concurrently_mvcc.pl b/contrib/amcheck/t/009_repack_concurrently_mvcc.pl
new file mode 100644
index 00000000000..a83fd5b8141
--- /dev/null
+++ b/contrib/amcheck/t/009_repack_concurrently_mvcc.pl
@@ -0,0 +1,113 @@
+
+# Copyright (c) 2021-2025, PostgreSQL Global Development Group
+
+# Test REPACK CONCURRENTLY with concurrent modifications
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+
+use Test::More;
+
+my $node;
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('CIC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf(
+	'postgresql.conf', qq(
+wal_level = logical
+));
+$node->start;
+$node->safe_psql('postgres', q(CREATE TABLE tbl1(i int PRIMARY KEY, j int)));
+$node->safe_psql('postgres', q(CREATE TABLE tbl2(i int PRIMARY KEY, j int)));
+
+
+# Insert 100 rows into tbl1
+$node->safe_psql('postgres', q(
+    INSERT INTO tbl1 SELECT i, i % 100 FROM generate_series(1,100) i
+));
+
+# Insert 100 rows into tbl2
+$node->safe_psql('postgres', q(
+    INSERT INTO tbl2 SELECT i, i % 100 FROM generate_series(1,100) i
+));
+
+
+# Create a helper function that reports mismatching rows
+$node->safe_psql('postgres', q(
+	CREATE OR REPLACE FUNCTION log_raise(i int, j1 int, j2 int) RETURNS VOID AS $$
+	BEGIN
+	  RAISE NOTICE 'ERROR i=% j1=% j2=%', i, j1, j2;
+	END;$$ LANGUAGE plpgsql;
+));
+
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+
+$node->pgbench(
+'--no-vacuum --client=10 --jobs=4 --exit-on-abort --transactions=2500',
+0,
+[qr{actually processed}],
+[qr{^$}],
+'concurrent operations with REPACK CONCURRENTLY',
+{
+	'concurrent_ops' => q(
+		SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+		\if :gotlock
+			SELECT nextval('in_row_rebuild') AS last_value \gset
+			\if :last_value = 2
+				REPACK (CONCURRENTLY) tbl1 USING INDEX tbl1_pkey;
+				\sleep 10 ms
+				REPACK (CONCURRENTLY) tbl2 USING INDEX tbl2_pkey;
+				\sleep 10 ms
+			\endif
+			SELECT pg_advisory_unlock(42);
+		\else
+			\set num random(1, 100)
+			BEGIN;
+			UPDATE tbl1 SET j = j + 1 WHERE i = :num;
+			\sleep 1 ms
+			UPDATE tbl1 SET j = j + 2 WHERE i = :num;
+			\sleep 1 ms
+			UPDATE tbl1 SET j = j + 3 WHERE i = :num;
+			\sleep 1 ms
+			UPDATE tbl1 SET j = j + 4 WHERE i = :num;
+			\sleep 1 ms
+
+			UPDATE tbl2 SET j = j + 1 WHERE i = :num;
+			\sleep 1 ms
+			UPDATE tbl2 SET j = j + 2 WHERE i = :num;
+			\sleep 1 ms
+			UPDATE tbl2 SET j = j + 3 WHERE i = :num;
+			\sleep 1 ms
+			UPDATE tbl2 SET j = j + 4 WHERE i = :num;
+
+			COMMIT;
+			SELECT setval('in_row_rebuild', 1);
+
+			BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+			SELECT COALESCE(SUM(j), 0) AS t1 FROM tbl1 WHERE i = :num \gset p_
+			\sleep 10 ms
+			SELECT COALESCE(SUM(j), 0) AS t2 FROM tbl2 WHERE i = :num \gset p_
+			\if :p_t1 != :p_t2
+				COMMIT;
+				SELECT log_raise(tbl1.i, tbl1.j, tbl2.j) FROM tbl1 LEFT OUTER JOIN tbl2 ON tbl1.i = tbl2.i WHERE tbl1.j != tbl2.j;
+				\sleep 10 ms
+				SELECT log_raise(tbl1.i, tbl1.j, tbl2.j) FROM tbl1 LEFT OUTER JOIN tbl2 ON tbl1.i = tbl2.i WHERE tbl1.j != tbl2.j;
+				SELECT (:p_t1 + :p_t2) / 0;
+			\endif
+
+			COMMIT;
+		\endif
+	)
+});
+
+$node->stop;
+done_testing();
diff --git a/doc/src/sgml/mvcc.sgml b/doc/src/sgml/mvcc.sgml
index 0f5c34af542..049ee75a4ba 100644
--- a/doc/src/sgml/mvcc.sgml
+++ b/doc/src/sgml/mvcc.sgml
@@ -1833,17 +1833,15 @@ SELECT pg_advisory_lock(q.id) FROM
    <title>Caveats</title>
 
    <para>
-    Some commands, currently only <link linkend="sql-truncate"><command>TRUNCATE</command></link>, the
-    table-rewriting forms of <link linkend="sql-altertable"><command>ALTER
-    TABLE</command></link> and <command>REPACK</command> with
-    the <literal>CONCURRENTLY</literal> option, are not
+    Some DDL commands, currently only <link linkend="sql-truncate"><command>TRUNCATE</command></link> and the
+    table-rewriting forms of <link linkend="sql-altertable"><command>ALTER TABLE</command></link>, are not
     MVCC-safe.  This means that after the truncation or rewrite commits, the
     table will appear empty to concurrent transactions, if they are using a
-    snapshot taken before the command committed.  This will only be an
+    snapshot taken before the DDL command committed.  This will only be an
     issue for a transaction that did not access the table in question
-    before the command started &mdash; any transaction that has done so
+    before the DDL command started &mdash; any transaction that has done so
     would hold at least an <literal>ACCESS SHARE</literal> table lock,
-    which would block the truncating or rewriting command until that transaction completes.
+    which would block the DDL command until that transaction completes.
     So these commands will not cause any apparent inconsistency in the
     table contents for successive queries on the target table, but they
     could cause visible inconsistency between the contents of the target
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
index 30c43c49069..9796a923597 100644
--- a/doc/src/sgml/ref/repack.sgml
+++ b/doc/src/sgml/ref/repack.sgml
@@ -308,15 +308,6 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] USING
        </listitem>
       </itemizedlist>
      </para>
-
-     <warning>
-      <para>
-       <command>REPACK</command> with the <literal>CONCURRENTLY</literal>
-       option is not MVCC-safe, see <xref linkend="mvcc-caveats"/> for
-       details.
-      </para>
-     </warning>
-
     </listitem>
    </varlistentry>
 
diff --git a/src/backend/access/common/toast_internals.c b/src/backend/access/common/toast_internals.c
index 63b848473f8..91119da5cd5 100644
--- a/src/backend/access/common/toast_internals.c
+++ b/src/backend/access/common/toast_internals.c
@@ -311,7 +311,8 @@ toast_save_datum(Relation rel, Datum value,
 
 		toasttup = heap_form_tuple(toasttupDesc, t_values, t_isnull);
 
-		heap_insert(toastrel, toasttup, mycid, options, NULL);
+		heap_insert(toastrel, toasttup, GetCurrentTransactionId(), mycid,
+					options, NULL);
 
 		/*
 		 * Create the index entry.  We cheat a little here by not using
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e11833f01b4..94ca07e4b55 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2085,7 +2085,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
 /*
  *	heap_insert		- insert tuple into a heap
  *
- * The new tuple is stamped with current transaction ID and the specified
+ * The new tuple is stamped with the specified transaction ID and the specified
  * command ID.
  *
  * See table_tuple_insert for comments about most of the input flags, except
@@ -2101,15 +2101,16 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
  * reflected into *tup.
  */
 void
-heap_insert(Relation relation, HeapTuple tup, CommandId cid,
-			int options, BulkInsertState bistate)
+heap_insert(Relation relation, HeapTuple tup, TransactionId xid,
+			CommandId cid, int options, BulkInsertState bistate)
 {
-	TransactionId xid = GetCurrentTransactionId();
 	HeapTuple	heaptup;
 	Buffer		buffer;
 	Buffer		vmbuffer = InvalidBuffer;
 	bool		all_visible_cleared = false;
 
+	Assert(TransactionIdIsValid(xid));
+
 	/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
 	Assert(HeapTupleHeaderGetNatts(tup->t_data) <=
 		   RelationGetNumberOfAttributes(relation));
@@ -2375,7 +2376,6 @@ void
 heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 				  CommandId cid, int options, BulkInsertState bistate)
 {
-	TransactionId xid = GetCurrentTransactionId();
 	HeapTuple  *heaptuples;
 	int			i;
 	int			ndone;
@@ -2408,7 +2408,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		tuple = ExecFetchSlotHeapTuple(slots[i], true, NULL);
 		slots[i]->tts_tableOid = RelationGetRelid(relation);
 		tuple->t_tableOid = slots[i]->tts_tableOid;
-		heaptuples[i] = heap_prepare_insert(relation, tuple, xid, cid,
+		heaptuples[i] = heap_prepare_insert(relation, tuple, GetCurrentTransactionId(), cid,
 											options);
 	}
 
@@ -2746,7 +2746,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 void
 simple_heap_insert(Relation relation, HeapTuple tup)
 {
-	heap_insert(relation, tup, GetCurrentCommandId(true), 0, NULL);
+	heap_insert(relation, tup, GetCurrentTransactionId(),
+				GetCurrentCommandId(true), 0, NULL);
 }
 
 /*
@@ -2803,11 +2804,10 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
  */
 TM_Result
 heap_delete(Relation relation, const ItemPointerData *tid,
-			CommandId cid, Snapshot crosscheck, bool wait,
+			TransactionId xid, CommandId cid, Snapshot crosscheck, bool wait,
 			TM_FailureData *tmfd, bool changingPart, bool walLogical)
 {
 	TM_Result	result;
-	TransactionId xid = GetCurrentTransactionId();
 	ItemId		lp;
 	HeapTupleData tp;
 	Page		page;
@@ -2824,6 +2824,7 @@ heap_delete(Relation relation, const ItemPointerData *tid,
 	bool		old_key_copied = false;
 
 	Assert(ItemPointerIsValid(tid));
+	Assert(TransactionIdIsValid(xid));
 
 	AssertHasSnapshotForToast(relation);
 
@@ -3240,7 +3241,7 @@ simple_heap_delete(Relation relation, const ItemPointerData *tid)
 	TM_Result	result;
 	TM_FailureData tmfd;
 
-	result = heap_delete(relation, tid,
+	result = heap_delete(relation, tid, GetCurrentTransactionId(),
 						 GetCurrentCommandId(true), InvalidSnapshot,
 						 true /* wait for commit */ ,
 						 &tmfd, false,	/* changingPart */
@@ -3283,12 +3284,11 @@ simple_heap_delete(Relation relation, const ItemPointerData *tid)
  */
 TM_Result
 heap_update(Relation relation, const ItemPointerData *otid, HeapTuple newtup,
-			CommandId cid, Snapshot crosscheck, bool wait,
+			TransactionId xid, CommandId cid, Snapshot crosscheck, bool wait,
 			TM_FailureData *tmfd, LockTupleMode *lockmode,
 			TU_UpdateIndexes *update_indexes, bool walLogical)
 {
 	TM_Result	result;
-	TransactionId xid = GetCurrentTransactionId();
 	Bitmapset  *hot_attrs;
 	Bitmapset  *sum_attrs;
 	Bitmapset  *key_attrs;
@@ -3328,6 +3328,7 @@ heap_update(Relation relation, const ItemPointerData *otid, HeapTuple newtup,
 				infomask2_new_tuple;
 
 	Assert(ItemPointerIsValid(otid));
+	Assert(TransactionIdIsValid(xid));
 
 	/* Cheap, simplistic check that the tuple matches the rel's rowtype. */
 	Assert(HeapTupleHeaderGetNatts(newtup->t_data) <=
@@ -4534,7 +4535,7 @@ simple_heap_update(Relation relation, const ItemPointerData *otid, HeapTuple tup
 	TM_FailureData tmfd;
 	LockTupleMode lockmode;
 
-	result = heap_update(relation, otid, tup,
+	result = heap_update(relation, otid, tup, GetCurrentTransactionId(),
 						 GetCurrentCommandId(true), InvalidSnapshot,
 						 true /* wait for commit */ ,
 						 &tmfd, &lockmode, update_indexes,
@@ -5373,8 +5374,6 @@ compute_new_xmax_infomask(TransactionId xmax, uint16 old_infomask,
 	uint16		new_infomask,
 				new_infomask2;
 
-	Assert(TransactionIdIsCurrentTransactionId(add_to_xmax));
-
 l5:
 	new_infomask = 0;
 	new_infomask2 = 0;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index e6d630fa2f7..b49f9add5bb 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -252,7 +252,8 @@ heapam_tuple_insert(Relation relation, TupleTableSlot *slot, CommandId cid,
 	tuple->t_tableOid = slot->tts_tableOid;
 
 	/* Perform the insertion, and copy the resulting ItemPointer */
-	heap_insert(relation, tuple, cid, options, bistate);
+	heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+				bistate);
 	ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
 
 	if (shouldFree)
@@ -275,7 +276,8 @@ heapam_tuple_insert_speculative(Relation relation, TupleTableSlot *slot,
 	options |= HEAP_INSERT_SPECULATIVE;
 
 	/* Perform the insertion, and copy the resulting ItemPointer */
-	heap_insert(relation, tuple, cid, options, bistate);
+	heap_insert(relation, tuple, GetCurrentTransactionId(), cid, options,
+				bistate);
 	ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
 
 	if (shouldFree)
@@ -309,8 +311,8 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
 	 * the storage itself is cleaning the dead tuples by itself, it is the
 	 * time to call the index tuple deletion also.
 	 */
-	return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart,
-					   true);
+	return heap_delete(relation, tid, GetCurrentTransactionId(), cid,
+					   crosscheck, wait, tmfd, changingPart, true);
 }
 
 
@@ -328,7 +330,8 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 	slot->tts_tableOid = RelationGetRelid(relation);
 	tuple->t_tableOid = slot->tts_tableOid;
 
-	result = heap_update(relation, otid, tuple, cid, crosscheck, wait,
+	result = heap_update(relation, otid, tuple, GetCurrentTransactionId(),
+						 cid, crosscheck, wait,
 						 tmfd, lockmode, update_indexes, true);
 	ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
 
@@ -2441,9 +2444,16 @@ reform_and_rewrite_tuple(HeapTuple tuple,
 		 * flag to skip logical decoding: as soon as REPACK CONCURRENTLY swaps
 		 * the relation files, it drops this relation, so no logical
 		 * replication subscription should need the data.
+		 *
+		 * It is also crucial to stamp the new record with the exact same xid
+		 * and cid, because the tuple must be visible to the snapshots of the
+		 * concurrent transactions later.
 		 */
-		heap_insert(NewHeap, copiedTuple, GetCurrentCommandId(true),
-					HEAP_INSERT_NO_LOGICAL, NULL);
+		// TODO: looks like cid is not required
+		CommandId	cid = HeapTupleHeaderGetRawCommandId(tuple->t_data);
+		TransactionId xid = HeapTupleHeaderGetXmin(tuple->t_data);
+
+		heap_insert(NewHeap, copiedTuple, xid, cid, HEAP_INSERT_NO_LOGICAL, NULL);
 	}
 
 	heap_freetuple(copiedTuple);
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index f2a2ec6d3e5..1b1928ce300 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -58,6 +58,7 @@
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/procarray.h"
 #include "storage/procsignal.h"
 #include "tcop/tcopprot.h"
 #include "utils/acl.h"
@@ -249,15 +250,20 @@ static bool decode_concurrent_changes(LogicalDecodingContext *ctx,
 									  DecodingWorkerShared *shared);
 static void apply_concurrent_changes(BufFile *file, ChangeDest *dest);
 static void apply_concurrent_insert(Relation rel, HeapTuple tup,
+									TransactionId xid,
 									IndexInsertState *iistate,
 									TupleTableSlot *index_slot);
 static void apply_concurrent_update(Relation rel, HeapTuple tup,
 									HeapTuple tup_target,
+									TransactionId xid,
 									IndexInsertState *iistate,
 									TupleTableSlot *index_slot);
-static void apply_concurrent_delete(Relation rel, HeapTuple tup_target);
+static void apply_concurrent_delete(Relation rel,
+									TransactionId xid,
+									HeapTuple tup_target);
 static HeapTuple find_target_tuple(Relation rel, ChangeDest *dest,
 								   HeapTuple tup_key,
+								   Snapshot snapshot,
 								   TupleTableSlot *ident_slot);
 static void process_concurrent_changes(XLogRecPtr end_of_wal,
 									   ChangeDest *dest,
@@ -1091,7 +1097,14 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose, bool concurrent
 
 	/* The historic snapshot won't be needed anymore. */
 	if (snapshot)
+	{
+		TransactionId xmin = snapshot->xmin;
 		PopActiveSnapshot();
+		Assert(concurrent);
+		// TODO: this wait may not be required; need to check SnapBuildInitialSnapshotForRepack
+		WaitForOlderSnapshots(xmin, false);
+	}
+
 
 	if (concurrent)
 	{
@@ -1382,30 +1395,35 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
 	 * not to be aggressive about this.
 	 */
 	memset(&params, 0, sizeof(VacuumParams));
-	vacuum_get_cutoffs(OldHeap, params, &cutoffs);
-
-	/*
-	 * FreezeXid will become the table's new relfrozenxid, and that mustn't go
-	 * backwards, so take the max.
-	 */
+	if (!concurrent)
 	{
 		TransactionId relfrozenxid = OldHeap->rd_rel->relfrozenxid;
+		MultiXactId relminmxid = OldHeap->rd_rel->relminmxid;
 
+		vacuum_get_cutoffs(OldHeap, params, &cutoffs);
+		/*
+		 * FreezeXid will become the table's new relfrozenxid, and that mustn't go
+		 * backwards, so take the max.
+		 */
 		if (TransactionIdIsValid(relfrozenxid) &&
 			TransactionIdPrecedes(cutoffs.FreezeLimit, relfrozenxid))
 			cutoffs.FreezeLimit = relfrozenxid;
-	}
-
-	/*
-	 * MultiXactCutoff, similarly, shouldn't go backwards either.
-	 */
-	{
-		MultiXactId relminmxid = OldHeap->rd_rel->relminmxid;
-
+		/*
+		 * MultiXactCutoff, similarly, shouldn't go backwards either.
+		 */
 		if (MultiXactIdIsValid(relminmxid) &&
 			MultiXactIdPrecedes(cutoffs.MultiXactCutoff, relminmxid))
 			cutoffs.MultiXactCutoff = relminmxid;
 	}
+	else
+	{
+		/*
+		 * In concurrent mode we reuse all the xmin/xmax,
+		 * so just use current values for simplicity.
+		 */
+		cutoffs.FreezeLimit = OldHeap->rd_rel->relfrozenxid;
+		cutoffs.MultiXactCutoff = OldHeap->rd_rel->relminmxid;
+	}
 
 	/*
 	 * Decide whether to use an indexscan or seqscan-and-optional-sort to scan
@@ -2745,6 +2763,7 @@ apply_concurrent_changes(BufFile *file, ChangeDest *dest)
 		size_t		nread;
 		HeapTuple	tup,
 					tup_exist;
+		TransactionId xid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -2761,6 +2780,17 @@ apply_concurrent_changes(BufFile *file, ChangeDest *dest)
 		tup->t_len = t_len;
 		ItemPointerSetInvalid(&tup->t_self);
 		tup->t_tableOid = RelationGetRelid(dest->rel);
+		BufFileReadExact(file, &xid, sizeof(TransactionId));
+
+		if (TransactionIdIsValid(xid) && TransactionIdIsInProgress(xid))
+		{
+			/* This xid is certainly committed, because we got the change from reorderbuffer,
+			 * but procarray may not be updated yet, so the current backend can still see it as
+			 * in-progress. Wait for procarray to be updated. */
+			XactLockTableWait(xid, NULL, NULL, XLTW_None);
+			Assert(!TransactionIdIsInProgress(xid));
+			Assert(TransactionIdDidCommit(xid));
+		}
 
 		if (kind == CHANGE_UPDATE_OLD)
 		{
@@ -2771,7 +2801,7 @@ apply_concurrent_changes(BufFile *file, ChangeDest *dest)
 		{
 			Assert(tup_old == NULL);
 
-			apply_concurrent_insert(rel, tup, dest->iistate, index_slot);
+			apply_concurrent_insert(rel, tup, xid, dest->iistate, index_slot);
 
 			pfree(tup);
 		}
@@ -2790,17 +2820,21 @@ apply_concurrent_changes(BufFile *file, ChangeDest *dest)
 			}
 
 			/*
-			 * Find the tuple to be updated or deleted.
+			 * Find the tuple to be updated or deleted using SnapshotSelf.
+			 * That way we get the last live version in case of a HOT chain.
+			 * There can be no updated but not-yet-committed version, because here
+			 * we replay only committed transactions, without any concurrency
+			 * involved.
 			 */
-			tup_exist = find_target_tuple(rel, dest, tup_key, ident_slot);
+			tup_exist = find_target_tuple(rel, dest, tup_key, SnapshotSelf, ident_slot);
 			if (tup_exist == NULL)
 				elog(ERROR, "failed to find target tuple");
 
 			if (kind == CHANGE_UPDATE_NEW)
-				apply_concurrent_update(rel, tup, tup_exist, dest->iistate,
+				apply_concurrent_update(rel, tup, tup_exist, xid, dest->iistate,
 										index_slot);
 			else
-				apply_concurrent_delete(rel, tup_exist);
+				apply_concurrent_delete(rel, xid, tup_exist);
 
 			if (tup_old != NULL)
 			{
@@ -2819,6 +2853,7 @@ apply_concurrent_changes(BufFile *file, ChangeDest *dest)
 		 */
 		if (kind != CHANGE_UPDATE_OLD)
 		{
+			// TODO: not sure this is required at all: we are replaying committed transactions, stamping them with the committed XID
 			CommandCounterIncrement();
 			UpdateActiveSnapshotCommandId();
 		}
@@ -2830,7 +2865,7 @@ apply_concurrent_changes(BufFile *file, ChangeDest *dest)
 }
 
 static void
-apply_concurrent_insert(Relation rel, HeapTuple tup, IndexInsertState *iistate,
+apply_concurrent_insert(Relation rel, HeapTuple tup, TransactionId xid, IndexInsertState *iistate,
 						TupleTableSlot *index_slot)
 {
 	List	   *recheck;
@@ -2840,9 +2875,12 @@ apply_concurrent_insert(Relation rel, HeapTuple tup, IndexInsertState *iistate,
 	 * Like simple_heap_insert(), but make sure that the INSERT is not
 	 * logically decoded - see reform_and_rewrite_tuple() for more
 	 * information.
+	 *
+	 * Use already committed xid to stamp the tuple.
 	 */
-	heap_insert(rel, tup, GetCurrentCommandId(true), HEAP_INSERT_NO_LOGICAL,
-				NULL);
+	Assert(TransactionIdIsValid(xid));
+	heap_insert(rel, tup, xid, GetCurrentCommandId(true),
+				HEAP_INSERT_NO_LOGICAL, NULL);
 
 	/*
 	 * Update indexes.
@@ -2850,6 +2888,7 @@ apply_concurrent_insert(Relation rel, HeapTuple tup, IndexInsertState *iistate,
 	 * In case functions in the index need the active snapshot and caller
 	 * hasn't set one.
 	 */
+	PushActiveSnapshot(GetLatestSnapshot());
 	ExecStoreHeapTuple(tup, index_slot, false);
 	recheck = ExecInsertIndexTuples(iistate->rri,
 									index_slot,
@@ -2860,6 +2899,7 @@ apply_concurrent_insert(Relation rel, HeapTuple tup, IndexInsertState *iistate,
 									NIL,	/* arbiterIndexes */
 									false	/* onlySummarizing */
 		);
+	PopActiveSnapshot();
 
 	/*
 	 * If recheck is required, it must have been preformed on the source
@@ -2873,6 +2913,7 @@ apply_concurrent_insert(Relation rel, HeapTuple tup, IndexInsertState *iistate,
 
 static void
 apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
+						TransactionId xid,
 						IndexInsertState *iistate, TupleTableSlot *index_slot)
 {
 	LockTupleMode lockmode;
@@ -2887,9 +2928,12 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
 	 *
 	 * Do it like in simple_heap_update(), except for 'wal_logical' (and
 	 * except for 'wait').
+	 *
+	 * Use already committed xid to stamp the tuple.
 	 */
+	Assert(TransactionIdIsValid(xid));
 	res = heap_update(rel, &tup_target->t_self, tup,
-					  GetCurrentCommandId(true),
+					  xid, GetCurrentCommandId(true),
 					  InvalidSnapshot,
 					  false,	/* no wait - only we are doing changes */
 					  &tmfd, &lockmode, &update_indexes,
@@ -2901,6 +2945,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
 
 	if (update_indexes != TU_None)
 	{
+		PushActiveSnapshot(GetLatestSnapshot());
 		recheck = ExecInsertIndexTuples(iistate->rri,
 										index_slot,
 										iistate->estate,
@@ -2910,6 +2955,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
 										NIL,	/* arbiterIndexes */
 		/* onlySummarizing */
 										update_indexes == TU_Summarizing);
+		PopActiveSnapshot();
 		list_free(recheck);
 	}
 
@@ -2917,7 +2963,7 @@ apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
 }
 
 static void
-apply_concurrent_delete(Relation rel, HeapTuple tup_target)
+apply_concurrent_delete(Relation rel, TransactionId xid, HeapTuple tup_target)
 {
 	TM_Result	res;
 	TM_FailureData tmfd;
@@ -2927,9 +2973,12 @@ apply_concurrent_delete(Relation rel, HeapTuple tup_target)
 	 *
 	 * Do it like in simple_heap_delete(), except for 'wal_logical' (and
 	 * except for 'wait').
+	 *
+	 * Use already committed xid to stamp the tuple.
 	 */
-	res = heap_delete(rel, &tup_target->t_self, GetCurrentCommandId(true),
-					  InvalidSnapshot, false,
+	Assert(TransactionIdIsValid(xid));
+	res = heap_delete(rel, &tup_target->t_self, xid,
+					  GetCurrentCommandId(true), InvalidSnapshot, false,
 					  &tmfd,
 					  false,	/* no wait - only we are doing changes */
 					  false /* wal_logical */ );
@@ -2950,7 +2999,7 @@ apply_concurrent_delete(Relation rel, HeapTuple tup_target)
  */
 static HeapTuple
 find_target_tuple(Relation rel, ChangeDest *dest, HeapTuple tup_key,
-				  TupleTableSlot *ident_slot)
+				  Snapshot snapshot, TupleTableSlot *ident_slot)
 {
 	Relation	ident_index = dest->ident_index;
 	IndexScanDesc scan;
@@ -2959,7 +3008,7 @@ find_target_tuple(Relation rel, ChangeDest *dest, HeapTuple tup_key,
 	HeapTuple	result = NULL;
 
 	/* XXX no instrumentation for now */
-	scan = index_beginscan(rel, ident_index, GetActiveSnapshot(),
+	scan = index_beginscan(rel, ident_index, snapshot,
 						   NULL, dest->ident_key_nentries, 0);
 
 	/*
diff --git a/src/backend/replication/pgoutput_repack/pgoutput_repack.c b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
index fb9956d392d..8d796e0a684 100644
--- a/src/backend/replication/pgoutput_repack/pgoutput_repack.c
+++ b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
@@ -29,7 +29,8 @@ static void plugin_commit_txn(LogicalDecodingContext *ctx,
 static void plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 						  Relation rel, ReorderBufferChange *change);
 static void store_change(LogicalDecodingContext *ctx,
-						 ConcurrentChangeKind kind, HeapTuple tuple);
+						 ConcurrentChangeKind kind, HeapTuple tuple,
+						 TransactionId xid);
 
 void
 _PG_output_plugin_init(OutputPluginCallbacks *cb)
@@ -120,7 +121,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				if (newtuple == NULL)
 					elog(ERROR, "Incomplete insert info.");
 
-				store_change(ctx, CHANGE_INSERT, newtuple);
+				store_change(ctx, CHANGE_INSERT, newtuple, change->txn->xid);
 			}
 			break;
 		case REORDER_BUFFER_CHANGE_UPDATE:
@@ -137,9 +138,11 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 					elog(ERROR, "Incomplete update info.");
 
 				if (oldtuple != NULL)
-					store_change(ctx, CHANGE_UPDATE_OLD, oldtuple);
+					store_change(ctx, CHANGE_UPDATE_OLD, oldtuple,
+								 change->txn->xid);
 
-				store_change(ctx, CHANGE_UPDATE_NEW, newtuple);
+				store_change(ctx, CHANGE_UPDATE_NEW, newtuple,
+							 change->txn->xid);
 			}
 			break;
 		case REORDER_BUFFER_CHANGE_DELETE:
@@ -152,7 +155,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 				if (oldtuple == NULL)
 					elog(ERROR, "Incomplete delete info.");
 
-				store_change(ctx, CHANGE_DELETE, oldtuple);
+				store_change(ctx, CHANGE_DELETE, oldtuple, change->txn->xid);
 			}
 			break;
 		default:
@@ -165,7 +168,7 @@ plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 /* Store concurrent data change. */
 static void
 store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
-			 HeapTuple tuple)
+			 HeapTuple tuple, TransactionId xid)
 {
 	RepackDecodingState *dstate;
 	char		kind_byte = (char) kind;
@@ -195,6 +198,7 @@ store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
 	BufFileWrite(dstate->file, &tuple->t_len, sizeof(tuple->t_len));
 	/* ... and the tuple itself. */
 	BufFileWrite(dstate->file, tuple->t_data, tuple->t_len);
+	BufFileWrite(dstate->file, &xid, sizeof(TransactionId));
 
 	/* Free the flat copy if created above. */
 	if (flattened)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b7cd25896f6..d9776f61a0d 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -354,20 +354,20 @@ extern BulkInsertState GetBulkInsertState(void);
 extern void FreeBulkInsertState(BulkInsertState);
 extern void ReleaseBulkInsertStatePin(BulkInsertState bistate);
 
-extern void heap_insert(Relation relation, HeapTuple tup, CommandId cid,
+extern void heap_insert(Relation relation, HeapTuple tup, TransactionId xid, CommandId cid,
 						int options, BulkInsertState bistate);
 extern void heap_multi_insert(Relation relation, TupleTableSlot **slots,
 							  int ntuples, CommandId cid, int options,
 							  BulkInsertState bistate);
 extern TM_Result heap_delete(Relation relation, const ItemPointerData *tid,
-							 CommandId cid, Snapshot crosscheck, bool wait,
+							 TransactionId xid, CommandId cid, Snapshot crosscheck, bool wait,
 							 TM_FailureData *tmfd, bool changingPart,
 							 bool wal_logical);
 extern void heap_finish_speculative(Relation relation, const ItemPointerData *tid);
 extern void heap_abort_speculative(Relation relation, const ItemPointerData *tid);
 extern TM_Result heap_update(Relation relation, const ItemPointerData *otid,
 							 HeapTuple newtup,
-							 CommandId cid, Snapshot crosscheck, bool wait,
+							 TransactionId xid, CommandId cid, Snapshot crosscheck, bool wait,
 							 TM_FailureData *tmfd, LockTupleMode *lockmode,
 							 TU_UpdateIndexes *update_indexes, bool wal_logical);
 extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
diff --git a/src/test/modules/injection_points/specs/repack.spec b/src/test/modules/injection_points/specs/repack.spec
index d727a9b056b..accd42d78aa 100644
--- a/src/test/modules/injection_points/specs/repack.spec
+++ b/src/test/modules/injection_points/specs/repack.spec
@@ -85,9 +85,6 @@ step change_new
 # When applying concurrent data changes, we should see the effects of an
 # in-progress subtransaction.
 #
-# XXX Not sure this test is useful now - it was designed for the patch that
-# preserves tuple visibility and which therefore modifies
-# TransactionIdIsCurrentTransactionId().
 step change_subxact1
 {
 	BEGIN;
@@ -102,7 +99,6 @@ step change_subxact1
 # When applying concurrent data changes, we should not see the effects of a
 # rolled back subtransaction.
 #
-# XXX Is this test useful? See above.
 step change_subxact2
 {
 	BEGIN;
-- 
2.43.0



  [application/x-patch] nocfbot-0002-one-more-stress-test-for-repack-concurrentl.patch (3.5K, 3-nocfbot-0002-one-more-stress-test-for-repack-concurrentl.patch)
  download | inline diff:
From 25fc848068b28e5b2ae099bdecae35fdf8cb6240 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 13 Dec 2025 18:46:46 +0100
Subject: [PATCH vnocfbot 2/2] one more stress test for repack concurrently

---
 contrib/amcheck/meson.build                  |   1 +
 contrib/amcheck/t/008_repack_concurrently.pl | 101 +++++++++++++++++++
 2 files changed, 102 insertions(+)
 create mode 100644 contrib/amcheck/t/008_repack_concurrently.pl

diff --git a/contrib/amcheck/meson.build b/contrib/amcheck/meson.build
index 2b69081d3bf..f7c70735989 100644
--- a/contrib/amcheck/meson.build
+++ b/contrib/amcheck/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/005_pitr.pl',
       't/006_verify_gin.pl',
       't/007_repack_concurrently.pl',
+      't/008_repack_concurrently.pl',
     ],
   },
 }
diff --git a/contrib/amcheck/t/008_repack_concurrently.pl b/contrib/amcheck/t/008_repack_concurrently.pl
new file mode 100644
index 00000000000..220524d41b3
--- /dev/null
+++ b/contrib/amcheck/t/008_repack_concurrently.pl
@@ -0,0 +1,101 @@
+
+# Copyright (c) 2021-2025, PostgreSQL Global Development Group
+
+# Test REPACK CONCURRENTLY with concurrent modifications
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+
+use Test::More;
+
+my $node;
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('CIC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf(
+	'postgresql.conf', qq(
+wal_level = logical
+));
+
+my $no_hot = int(rand(2));
+
+$node->start;
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i SERIAL PRIMARY KEY, j int)));
+if ($no_hot)
+{
+	$node->safe_psql('postgres', q(CREATE INDEX test_idx ON tbl(j);));
+}
+else
+{
+	$node->safe_psql('postgres', q(CREATE INDEX test_idx ON tbl(i);));
+}
+
+# Load amcheck
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+
+my $sum = $node->safe_psql('postgres', q(
+	SELECT SUM(j) AS sum FROM tbl
+));
+
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE last_j START 1 INCREMENT 1;));
+
+
+$node->pgbench(
+'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=1000',
+0,
+[qr{actually processed}],
+[qr{^$}],
+'concurrent operations with REPACK (CONCURRENTLY)',
+{
+	'concurrent_ops' => qq(
+		SELECT pg_try_advisory_lock(42)::integer AS gotlock \\gset
+		\\if :gotlock
+			REPACK (CONCURRENTLY) tbl USING INDEX tbl_pkey;
+			SELECT bt_index_parent_check('tbl_pkey', heapallindexed => true);
+			SELECT bt_index_parent_check('test_idx', heapallindexed => true);
+			\\sleep 10 ms
+
+			REPACK (CONCURRENTLY) tbl USING INDEX test_idx;
+			SELECT bt_index_parent_check('tbl_pkey', heapallindexed => true);
+			SELECT bt_index_parent_check('test_idx', heapallindexed => true);
+			\\sleep 10 ms
+
+			REPACK (CONCURRENTLY) tbl;
+			SELECT bt_index_parent_check('tbl_pkey', heapallindexed => true);
+			SELECT bt_index_parent_check('test_idx', heapallindexed => true);
+			\\sleep 10 ms
+
+			SELECT pg_advisory_unlock(42);
+		\\else
+			SELECT pg_advisory_lock(43);
+				BEGIN;
+				INSERT INTO tbl(j) VALUES (nextval('last_j')) RETURNING j \\gset p_
+				COMMIT;
+			SELECT pg_advisory_unlock(43);
+			\\sleep 1 ms
+
+			BEGIN
+			--TRANSACTION ISOLATION LEVEL REPEATABLE READ
+			;
+			SELECT 1;
+			\\sleep 1 ms
+			SELECT COUNT(*) AS count FROM tbl WHERE j <= :p_j \\gset p_
+			\\if :p_count != :p_j
+				COMMIT;
+				SELECT (:p_count) / 0;
+			\\endif
+
+			COMMIT;
+		\\endif
+	)
+});
+
+$node->stop;
+done_testing();
-- 
2.43.0



  [application/x-patch] nocfbot-0001-stress-test-for-repack-concurrently.patch (3.6K, 4-nocfbot-0001-stress-test-for-repack-concurrently.patch)
  download | inline diff:
From db84bbad9d10ffacffc763dbf0ed4bb481f42399 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 13 Dec 2025 18:13:37 +0100
Subject: [PATCH vnocfbot 1/2] stress test for repack concurrently

---
 contrib/amcheck/meson.build                  |   1 +
 contrib/amcheck/t/007_repack_concurrently.pl | 110 +++++++++++++++++++
 2 files changed, 111 insertions(+)
 create mode 100644 contrib/amcheck/t/007_repack_concurrently.pl

diff --git a/contrib/amcheck/meson.build b/contrib/amcheck/meson.build
index 1f0c347ed54..2b69081d3bf 100644
--- a/contrib/amcheck/meson.build
+++ b/contrib/amcheck/meson.build
@@ -50,6 +50,7 @@ tests += {
       't/004_verify_nbtree_unique.pl',
       't/005_pitr.pl',
       't/006_verify_gin.pl',
+      't/007_repack_concurrently.pl',
     ],
   },
 }
diff --git a/contrib/amcheck/t/007_repack_concurrently.pl b/contrib/amcheck/t/007_repack_concurrently.pl
new file mode 100644
index 00000000000..a47cebb347b
--- /dev/null
+++ b/contrib/amcheck/t/007_repack_concurrently.pl
@@ -0,0 +1,110 @@
+
+# Copyright (c) 2021-2025, PostgreSQL Global Development Group
+
+# Test REPACK CONCURRENTLY with concurrent modifications
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+
+use Test::More;
+
+my $node;
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('CIC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf(
+	'postgresql.conf', qq(
+wal_level = logical
+));
+
+my $n=1000;
+my $no_hot = int(rand(2));
+
+$node->start;
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int PRIMARY KEY, j int)));
+
+if ($no_hot)
+{
+	$node->safe_psql('postgres', q(CREATE INDEX test_idx ON tbl(j);));
+}
+else
+{
+	$node->safe_psql('postgres', q(CREATE INDEX test_idx ON tbl(i);));
+}
+
+
+# Load amcheck
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+
+# Insert $n rows into tbl
+$node->safe_psql('postgres', qq(
+	INSERT INTO tbl SELECT i, i FROM generate_series(1,$n) i
+));
+
+my $sum = $node->safe_psql('postgres', q(
+	SELECT SUM(j) AS sum FROM tbl
+));
+
+
+$node->pgbench(
+'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=5000',
+0,
+[qr{actually processed}],
+[qr{^$}],
+'concurrent operations with REPACK (CONCURRENTLY)',
+{
+	'concurrent_ops' => qq(
+		SELECT pg_try_advisory_lock(42)::integer AS gotlock \\gset
+		\\if :gotlock
+			REPACK (CONCURRENTLY) tbl USING INDEX tbl_pkey;
+			SELECT bt_index_parent_check('tbl_pkey', heapallindexed => true);
+			SELECT bt_index_parent_check('test_idx', heapallindexed => true);
+			\\sleep 10 ms
+
+			REPACK (CONCURRENTLY) tbl USING INDEX test_idx;
+			SELECT bt_index_parent_check('tbl_pkey', heapallindexed => true);
+			SELECT bt_index_parent_check('test_idx', heapallindexed => true);
+			\\sleep 10 ms
+
+			REPACK (CONCURRENTLY) tbl;
+			SELECT bt_index_parent_check('tbl_pkey', heapallindexed => true);
+			SELECT bt_index_parent_check('test_idx', heapallindexed => true);
+			\\sleep 10 ms
+
+			SELECT pg_advisory_unlock(42);
+		\\else
+			\\set num_a random(1, $n)
+			\\set num_b random(1, $n)
+			\\set diff random(1, 10000)
+			BEGIN;
+			UPDATE tbl SET j = j + :diff WHERE i = :num_a;
+			\\sleep 1 ms
+			UPDATE tbl SET j = j - :diff WHERE i = :num_b;
+			\\sleep 1 ms
+			COMMIT;
+
+			BEGIN
+			--TRANSACTION ISOLATION LEVEL REPEATABLE READ
+			;
+			SELECT 1;
+			\\sleep 1 ms
+			SELECT COALESCE(SUM(j), 0) AS sum FROM tbl \\gset p_
+			\\if :p_sum != $sum
+				COMMIT;
+				SELECT (:p_sum) / 0;
+			\\endif
+
+			COMMIT;
+		\\endif
+	)
+});
+
+$node->stop;
+done_testing();
-- 
2.43.0



^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-12-13 18:48  Antonin Houska <[email protected]>
  parent: Alvaro Herrera <[email protected]>
  0 siblings, 2 replies; 106+ messages in thread

From: Antonin Houska @ 2025-12-13 18:48 UTC (permalink / raw)
  To: Alvaro Herrera <[email protected]>; +Cc: Mihail Nikalayeu <[email protected]>; Pg Hackers <[email protected]>; Robert Treat <[email protected]>

Alvaro Herrera <[email protected]> wrote:

> Hello, many thanks for the new version.  Here's a very quick proposal
> for a new top-of-file comment on cluster.c,

The comment matches 0005, but I had to adjust it for 0004 (no background
worker there). Also, the worker writes the changes to a file rather than a
tuplestore (storage/sharedfileset.h seems to me an easier way to pass the data
from one process to another). Besides that, I made the following changes:

"bloat is greatly reduced" -> "bloat is eliminated"

and

"table, and to cope with" -> "table. To cope with"

> I haven't read build_relation_finish_concurrent() yet to understand how
> exactly do we do the lock upgrade, which I think is an important point
> we should address in this comment.  Also not addressed is how exactly we
> handle indexes.  Feel free to correct this, reword it or include any
> additional details that you think are important.

ok, I'll get back to the earlier parts of the set, including this, in the
beginning of January. Regarding indexes, one thing I've noticed recently is
that they get locked in build_new_indexes(), but maybe it should happen earlier.

> (At this point we could just as well rename the file to repack.c, since
> very little of the original remains.  But let's discuss that later.)

ok. Do you mean only the file or the functions as well? (I'm not going to do
that now, w/o that discussion.)

Attached here is a new version of the patch set. It's rebased and extended one
more time: 0006 is a PoC of the "snapshot resetting" technique, as discussed
elsewhere with Mihail Nikalayeu and Matthias van de Meent. The way snapshots
are generated here is different though: we need the snapshots from logical
replication's snapbuild.c, not those from procarray.c. More information is in
the commit message.

I do not insist that this should go to PG 19; I just needed some confidence that
it's doable, as well as some feedback. There are no tests for this yet, but
I've played with it for a while and checked the behavior using a debugger. I'm
curious to hear if the design is sound.

While working on that, I fixed some problems in 0004 and 0005 too. It
shouldn't be difficult to identify them using git, if needed.

Even if 0005 and 0006 won't land in PG19, these parts show that some
refactoring may be needed regarding the AM callback
table_relation_copy_for_cluster(). The parts 0004, 0005 and 0006 each change
the argument list. It wouldn't be perfect if both PG 19 and 20 changed the
API. I think we should reconsider which arguments are generic and which are
rather AM-specific. Maybe we should then add an opaque pointer (void *) for
the AM-specific information. REPACK could then use it to pass the
CONCURRENTLY-specific information.
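
A minimal standalone sketch of the opaque-pointer idea, under stated
assumptions: CopyGenericArgs, HeapRepackPrivate, copy_for_cluster_fn and
heap_copy_for_cluster are hypothetical names invented for illustration, not
the actual table AM API or anything from the patch set. It only shows how a
callback signature could stay stable while AM-specific data travels behind a
void pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical generic arguments: meaningful to any access method. */
typedef struct CopyGenericArgs
{
	bool		use_sort;		/* input: sort instead of index scan */
	double		num_tuples;		/* output: tuples copied */
} CopyGenericArgs;

/* Hypothetical heap-specific payload that REPACK (CONCURRENTLY) might pass. */
typedef struct HeapRepackPrivate
{
	bool		concurrent;
	int			snapshot_resets;
} HeapRepackPrivate;

/* The signature stays the same; only the AM interprets am_private. */
typedef void (*copy_for_cluster_fn) (CopyGenericArgs *args, void *am_private);

static void
heap_copy_for_cluster(CopyGenericArgs *args, void *am_private)
{
	HeapRepackPrivate *priv = (HeapRepackPrivate *) am_private;

	assert(args != NULL);
	args->num_tuples = 0;

	/* NULL means "no AM-specific information", e.g. a non-concurrent run. */
	if (priv != NULL && priv->concurrent)
		priv->snapshot_resets++;
}
```

With a layout like this, adding a new CONCURRENTLY-specific field would only
touch the private struct, not the callback's argument list.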

I'm now going to prioritize the parts <= 0004.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-12-13 18:59  Mihail Nikalayeu <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  1 sibling, 0 replies; 106+ messages in thread

From: Mihail Nikalayeu @ 2025-12-13 18:59 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Pg Hackers <[email protected]>; Robert Treat <[email protected]>

Once it was sent, I realized the MVCC-safe version fails
007_repack_concurrently.pl with TRANSACTION ISOLATION LEVEL REPEATABLE
READ uncommented.

Don't know why it fails - but happy it fails :)

On Sat, Dec 13, 2025 at 7:45 PM Mihail Nikalayeu
<[email protected]> wrote:
>
> Hello, everyone.
>
> Stress tests for REPACK concurrently in attachment.
> So far I can't break anything (except MVCC of course).
>
> A rebased version of the MVCC-safe "light" version with its own stress
> test is attached also.
>
> Best regards,
> Mikhail.





^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-12-13 19:01  Mihail Nikalayeu <[email protected]>
  parent: Antonin Houska <[email protected]>
  1 sibling, 0 replies; 106+ messages in thread

From: Mihail Nikalayeu @ 2025-12-13 19:01 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Pg Hackers <[email protected]>; Robert Treat <[email protected]>

Hello, Antonin!

On Sat, Dec 13, 2025 at 7:48 PM Antonin Houska <[email protected]> wrote:
> Attached here is a new version of the patch set. It's rebased and extended one
> more time: 0006 is a PoC of the "snapshot resetting" technique, as discussed
> elsewhere with Mihail Nikalayeu and Matthias van de Meent. The way snapshots
> are generated here is different though: we need the snapshots from logical
> replication's snapbuild.c, not those from procarray.c. More information is in
> the commit message.

Have you seen my feedback for 0004? Do you plan to check it? Asking to
understand if it is worth reviewing now or later.

Best regards,
Mikhail.





^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-12-13 19:39  Antonin Houska <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  1 sibling, 1 reply; 106+ messages in thread

From: Antonin Houska @ 2025-12-13 19:39 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Pg Hackers <[email protected]>; Robert Treat <[email protected]>

Mihail Nikalayeu <[email protected]> wrote:

> On Tue, Dec 9, 2025 at 7:52 PM Antonin Houska <[email protected]> wrote:
> > Worker makes more sense to me - the initial implementation is in 0005.
> 
> Comments for 0005, so far:

Thanks!

> ---
> > export_initial_snapshot
> 
> Hm, should we use ExportSnapshot instead? And ImportSnapshot to import it.

There is at least one thing that I don't want: ImportSnapshot calls
SetTransactionSnapshot() at the end. I chose the way the leader process uses to
serialize and pass the snapshot to background workers.

> ---
> > get_initial_snapshot
> 
> Should we check if a worker is still alive while waiting? Also in
> "process_concurrent_changes".

ConditionVariableSleep() should handle that - see the WL_EXIT_ON_PM_DEATH flag
in ConditionVariableTimedSleep().

> And AFAIU RegisterDynamicBackgroundWorker does not guarantee new
> workers to be started (in case of some fork-related issues).

Yes, the user will get an ERROR in such a case. This is different from parallel
workers in query processing: if a parallel worker cannot be started, the leader
(AFAICS) still executes the query. I'm not sure though if we should implement
REPACK (CONCURRENTLY) in such a way that it works even w/o the worker. The
code would be more complex and the behaviour quite different (I mean the
possibly huge amount of unprocessed WAL that you pointed out earlier.)

> ---
> > Assert(res = SHM_MQ_DETACHED);
> 
> ==

Thanks!
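
A minimal standalone sketch of why the `=` vs `==` distinction matters in that
Assert. The enum mq_result and both helper functions are hypothetical
stand-ins chosen for illustration; they are not the shm_mq API. With `=`, the
expression assigns and then tests the assigned value, so the check can never
fail for a nonzero constant, and the variable is overwritten as a side effect.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a message-queue result code. */
typedef enum
{
	MQ_SUCCESS,
	MQ_WOULD_BLOCK,
	MQ_DETACHED
} mq_result;

/* What "Assert(res = MQ_DETACHED)" evaluates: assign, then test the value. */
static bool
check_with_assignment(mq_result *res)
{
	assert(res != NULL);
	return (*res = MQ_DETACHED) != 0;	/* overwrites *res; always true here */
}

/* What was intended: a comparison that leaves *res untouched. */
static bool
check_with_comparison(const mq_result *res)
{
	assert(res != NULL);
	return *res == MQ_DETACHED;
}
```

Starting from MQ_SUCCESS, the comparison reports a mismatch, while the
assignment form reports success and silently changes the value.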

> ---
> > /* Wait a bit before we retry reading WAL. */
> > (void) WaitLatch(MyLatch,
> >              WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
> >              1000L,
> >              WAIT_EVENT_REPACK_WORKER_MAIN);
> 
> Looks like we need ResetLatch(MyLatch); here.

You seem to be right.
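
A toy model of the latch behaviour behind this point, as a sketch only:
ToyLatch, toy_wait_latch, toy_reset_latch and count_sleeps are hypothetical
names, and the model deliberately ignores timeouts and signals. It assumes
only that a latch, once set, stays set until it is explicitly reset, so a wait
loop that never resets it stops sleeping after the first wakeup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy latch: once set, it stays set until explicitly reset. */
typedef struct ToyLatch
{
	bool		is_set;
} ToyLatch;

/* Returns true if the caller would actually have slept (latch not set). */
static bool
toy_wait_latch(const ToyLatch *latch)
{
	return !latch->is_set;
}

static void
toy_reset_latch(ToyLatch *latch)
{
	latch->is_set = false;
}

/* Run a wait loop and return how many iterations really slept. */
static int
count_sleeps(ToyLatch *latch, int iterations, bool reset_after_wait)
{
	int			sleeps = 0;

	assert(latch != NULL);
	for (int i = 0; i < iterations; i++)
	{
		if (toy_wait_latch(latch))
			sleeps++;
		if (reset_after_wait)
			toy_reset_latch(latch);
	}
	return sleeps;
}
```

In this model, a loop over an already-set latch never sleeps without the
reset, which corresponds to the retry loop spinning instead of waiting.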

> ---
> > * - decoding_ctx - logical decoding context, to capture concurrent data
> 
> Need to be removed together with parameters.

Do you mean in 0005? (It'd help if you pasted the hunk headers.) This should
be fixed in v28 [1]

> ---
> > hpm_context = AllocSetContextCreate(TopMemoryContext,
> >                            "ProcessParallelMessages",
> >                            ALLOCSET_DEFAULT_SIZES);
> 
> "ProcessRepackMessages"

ok, the copy and pasting is a problem that needs to be addressed (mentioned in
the last paragraph of the commit message of 0005).

> ---
> > if (XLogRecPtrIsInvalid(lsn_upto))
> > {
> >    SpinLockAcquire(&shared->mutex);
> >    lsn_upto = shared->lsn_upto;
> >    /* 'done' should be set at the same time as 'lsn_upto' */
> >    done = shared->done;
> >    SpinLockRelease(&shared->mutex);
> >
> >    /* Check if the work happens to be complete. */
> >    continue;
> > }
> 
> May be moved to the start of the loop to avoid duplication.

I found more problems in this part when working on v28, maybe check that.

> ---
> > SpinLockAcquire(&shared->mutex);
> > valid = shared->sfs_valid;
> > SpinLockRelease(&shared->mutex);
> 
> Better to remember last_exported here to avoid any races/misses.

What races/misses exactly?

> ---
> > shared->lsn_upto = InvalidXLogRecPtr;
> 
> I think it is better to clear it once it is read (after removing duplication).

Maybe, I'll think about it.

> ---
> > bool       done;
> 
> bool exit_after_lsn_upto?

Not sure.

> ---
> > bool       sfs_valid;
> 
> Do we really need it? I think it is better to leave only last_exported,
> add an argument (last_processed_file) to process_concurrent_changes, and
> wait for last_exported to become higher.

I'll consider that (The variable is replaced in the 0006 part of v28, but the
idea should still be applicable.)

> ---
> What if we reverse roles of leader-worker?
> 
> Leader gets a snapshot, transfers it to workers (multiple probably for
> parallel scan) using already ready mechanics - workers are processing
> the scan of the table in parallel. Leader decodes the WAL.

Insertion into a table by multiple workers is a special thing, but maybe it'd
be doable in this case, but ...

> Also, workers may be assigned with a list of indexes they need to build.
> 
> Feels like it reuses more from current infrastructure and also needs
> less different synchronization logic. But I'm not sure about the
> indexes phase - maybe it is not so easy to do.

... my feelings were the opposite, i.e. I thought it would require a larger
amount of code rearrangement. Moreover, the part 0006 of v28 (snapshot
switching) would be trickier. It processes one range of blocks after another,
and parallelism would make it more difficult.

> ---
> Also, should we add some kind of back pressure between building
> indexes/new heap and the amount of WAL we have?
> But probably it is out of scope of the patch.

Do you mean that the decoding worker should be less active if the amount of
WAL doesn't grow too fast?

> ---
> To build N indexes we need to scan the table N times. What about
> building multiple indexes during a single heap scan?

That sounds like a separate feature, and about as difficult as enhancing
CREATE INDEX so it can create multiple indexes at a time.

> --
> Just a gentle reminder about the XMIN_COMMITTED flag and WAL storm
> after the switch.

ok, I have it in my notes, moved it more to the top :-)

[1] https://www.postgresql.org/message-id/210036.1765651719%40localhost

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

^ permalink  raw  reply  [nested|flat] 106+ messages in thread

* Re: Adding REPACK [concurrently]
@ 2025-12-15 14:25  Alvaro Herrera <[email protected]>
  parent: Antonin Houska <[email protected]>
  1 sibling, 0 replies; 106+ messages in thread

From: Alvaro Herrera @ 2025-12-15 14:25 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Mihail Nikalayeu <[email protected]>; Pg Hackers <[email protected]>; Robert Treat <[email protected]>

On 2025-Dec-13, Antonin Houska wrote:

> From 6279394135f2b693b6fffd174822509e0a067cbf Mon Sep 17 00:00:00 2001
> From: Antonin Houska <[email protected]>
> Date: Sat, 13 Dec 2025 19:27:18 +0100
> Subject: [PATCH 4/6] Add CONCURRENTLY option to REPACK command.

> diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
> index cc03f0706e9..a956892f42f 100644
> --- a/src/backend/replication/logical/decode.c
> +++ b/src/backend/replication/logical/decode.c
> @@ -472,6 +473,88 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)

> +	/*
> +	 * Second, skip records which do not contain sufficient information for
> +	 * the decoding.
> +	 *
> +	 * The problem we solve here is that REPACK CONCURRENTLY generates WAL
> +	 * when doing changes in the new table. Those changes should not be useful
> +	 * for any other user (such as logical replication subscription) because
> +	 * the new table will eventually be dropped (after REPACK CONCURRENTLY has
> +	 * assigned its file to the "old table").
> +	 */
> +	switch (info)
> +	{
> +		case XLOG_HEAP_INSERT:
> +			{
> +				xl_heap_insert *rec;
> +
> +				rec = (xl_heap_insert *) XLogRecGetData(buf->record);
> +
> +				/*
> +				 * This does happen when 1) raw_heap_insert marks the TOAST
> +				 * record as HEAP_INSERT_NO_LOGICAL, 2) REPACK CONCURRENTLY
> +				 * replays inserts performed by other backends.
> +				 */
> +				if ((rec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE) == 0)
> +					return;
> +
> +				break;
> +			}
> +
> +		case XLOG_HEAP_HOT_UPDATE:
> +		case XLOG_HEAP_UPDATE:
> +			{
> +				xl_heap_update *rec;
> +
> +				rec = (xl_heap_update *) XLogRecGetData(buf->record);
> +				if ((rec->flags &
> +					 (XLH_UPDATE_CONTAINS_NEW_TUPLE |
> +					  XLH_UPDATE_CONTAINS_OLD_TUPLE |
> +					  XLH_UPDATE_CONTAINS_OLD_KEY)) == 0)
> +					return;
> +
> +				break;
> +			}
> +
> +		case XLOG_HEAP_DELETE:
> +			{
> +				xl_heap_delete *rec;
> +
> +				rec = (xl_heap_delete *) XLogRecGetData(buf->record);
> +				if (rec->flags & XLH_DELETE_NO_LOGICAL)
> +					return;
> +				break;
> +			}
> +	}

I'm confused as to the purpose of this addition.  I took this whole
block out, and no tests seem to fail.  Moreover, some of the cases that
are skipped because of it would already be skipped by code in
DecodeInsert / DecodeUpdate anyway.  The case for XLOG_HEAP_DELETE seems
to have no effect (that is, the "return" there is never reached in any
test, as far as I can tell).

The reason I ask is that the line immediately below does this:

>  	ReorderBufferProcessXid(ctx->reorder, xid, buf->origptr);

which means the Xid is tracked for snapshot building purposes.  Which is
probably important, because of what the comment right below it says:

	/*
	 * If we don't have snapshot or we are just fast-forwarding, there is no
	 * point in decoding data changes. However, it's crucial to build the base
	 * snapshot during fast-forward mode (as is done in
	 * SnapBuildProcessChange()) because we require the snapshot's xmin when
	 * determining the candidate catalog_xmin for the replication slot. See
	 * SnapBuildProcessRunningXacts().
	 */

So what happens here is that we would skip processing the Xid of an xlog
record during snapshot-building, on the grounds that it doesn't contain
logical changes.  I'm not sure this is okay.  If we do indeed need this,
then perhaps it should be done after ReorderBufferProcessXid().

Or did you intend to make this conditional on the backend running
REPACK?
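
To make the ordering concern concrete, here is a toy model, not the actual
decode.c code.  All names (ToyRecord, ToyDecoder, toy_decode_*) are
hypothetical stand-ins: xids_seen models the effect of
ReorderBufferProcessXid(), and has_logical_data models flag checks such as
XLH_INSERT_CONTAINS_NEW_TUPLE.  It only shows that filtering before the Xid
is tracked hides the Xid, while filtering afterwards does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a WAL record; not a PostgreSQL type. */
typedef struct ToyRecord
{
	uint32_t	xid;				/* transaction that wrote the record */
	bool		has_logical_data;	/* false: nothing to decode logically */
} ToyRecord;

/* Hypothetical stand-in for the decoding state. */
typedef struct ToyDecoder
{
	int			xids_seen;			/* models ReorderBufferProcessXid() */
	int			changes_decoded;	/* models DecodeInsert() and friends */
} ToyDecoder;

/* Variant A: filter first, as in the quoted hunk; the xid is never tracked. */
static void
toy_decode_filter_first(ToyDecoder *dec, const ToyRecord *rec)
{
	assert(dec != NULL && rec != NULL);
	if (!rec->has_logical_data)
		return;
	dec->xids_seen++;
	dec->changes_decoded++;
}

/* Variant B: track the xid first, then skip records without logical data. */
static void
toy_decode_track_first(ToyDecoder *dec, const ToyRecord *rec)
{
	assert(dec != NULL && rec != NULL);
	dec->xids_seen++;
	if (!rec->has_logical_data)
		return;
	dec->changes_decoded++;
}
```

For a record without logical data, variant A leaves xids_seen untouched while
variant B still counts the Xid; neither decodes a change.  Whether the real
snapshot builder needs that Xid is exactly the open question above.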

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/






* Re: Adding REPACK [concurrently]
@ 2025-12-18 01:47  Mihail Nikalayeu <[email protected]>
  parent: Mihail Nikalayeu <[email protected]>
  1 sibling, 0 replies; 106+ messages in thread

From: Mihail Nikalayeu @ 2025-12-18 01:47 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Pg Hackers <[email protected]>; Robert Treat <[email protected]>

Hello!

On Sat, Dec 13, 2025 at 7:45 PM Mihail Nikalayeu
<[email protected]> wrote:
> Stress tests for REPACK concurrently in attachment.

To run:
ninja && meson test --suite setup && meson test --print-errorlogs --suite amcheck *007*
ninja && meson test --suite setup && meson test --print-errorlogs --suite amcheck *008*

Results for v28:

Up to "v28-0005-Use-background-worker-to-do-logical-decoding.patch":

Technically it passes, but sometimes I saw 0% CPU usage for long
periods with stacks like these (it seems to happen more often for 0008):

epoll_wait 0x000078b99512a037
WaitEventSetWaitBlock waiteventset.c:1192
WaitEventSetWait waiteventset.c:1140
WaitLatch latch.c:196
decode_concurrent_changes cluster.c:2702
repack_worker_internal cluster.c:3777
RepackWorkerMain cluster.c:3725
BackgroundWorkerMain bgworker.c:850
postmaster_child_launch launch_backend.c:268
StartBackgroundWorker postmaster.c:4168
maybe_start_bgworkers postmaster.c:4334
LaunchMissingBackgroundProcesses postmaster.c:3408
ServerLoop postmaster.c:1728
PostmasterMain postmaster.c:1403
main main.c:231

epoll_wait 0x000078b99512a037
WaitEventSetWaitBlock waiteventset.c:1192
WaitEventSetWait waiteventset.c:1140
WaitLatch latch.c:196
ConditionVariableTimedSleep condition_variable.c:165
ConditionVariableSleep condition_variable.c:100
process_concurrent_changes cluster.c:3042
rebuild_relation_finish_concurrent cluster.c:3303
rebuild_relation cluster.c:1121
cluster_rel cluster.c:731
process_single_relation cluster.c:2405
ExecRepack cluster.c:391
standard_ProcessUtility utility.c:864
ProcessUtility utility.c:525
PortalRunUtility pquery.c:1148
PortalRunMulti pquery.c:1306
PortalRun pquery.c:783
exec_simple_query postgres.c:1280
PostgresMain postgres.c:4779
BackendMain backend_startup.c:124
postmaster_child_launch launch_backend.c:268
BackendStartup postmaster.c:3598
ServerLoop postmaster.c:1713
PostmasterMain postmaster.c:1403
main main.c:231

Probably it is because
> 100000L,    /* XXX Tune the delay. */

100 seconds is clearly too much.
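
The arithmetic behind that figure, as a small sketch.  It assumes the quoted
literal is passed as a millisecond timeout (WaitLatch() takes its timeout in
milliseconds); the macro and function names are hypothetical and do not come
from the patch.

```c
#include <assert.h>

/* Hypothetical name for the literal quoted above. */
#define TOY_REPACK_WAIT_MS 100000L

/* Convert a millisecond timeout to whole seconds. */
static long
toy_ms_to_seconds(long timeout_ms)
{
	assert(timeout_ms >= 0);
	return timeout_ms / 1000L;
}
```

With 1000 ms per second, toy_ms_to_seconds(TOY_REPACK_WAIT_MS) gives 100.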

For "v28-0006-Use-multiple-snapshots-to-copy-the-data.patch":

0007: crash with

TRAP: failed Assert("portal->portalSnapshot == GetActiveSnapshot()"), File: "../src/backend/tcop/pquery.c", Line: 1169, PID: 178414
postgres: CIC_test: nkey postgres [local] REPACK(ExceptionalCondition+0xbe)[0x5743f9a955bb]
postgres: CIC_test: nkey postgres [local] REPACK(+0x67fac4)[0x5743f98a7ac4]
postgres: CIC_test: nkey postgres [local] REPACK(+0x67fced)[0x5743f98a7ced]
postgres: CIC_test: nkey postgres [local] REPACK(PortalRun+0x346)[0x5743f98a7107]
postgres: CIC_test: nkey postgres [local] REPACK(+0x6773bb)[0x5743f989f3bb]
postgres: CIC_test: nkey postgres [local] REPACK(PostgresMain+0xc1c)[0x5743f98a4f58]
postgres: CIC_test: nkey postgres [local] REPACK(+0x6726c6)[0x5743f989a6c6]
postgres: CIC_test: nkey postgres [local] REPACK(postmaster_child_launch+0x191)[0x5743f979678c]
postgres: CIC_test: nkey postgres [local] REPACK(+0x5755ca)[0x5743f979d5ca]
postgres: CIC_test: nkey postgres [local] REPACK(+0x572972)[0x5743f979a972]
postgres: CIC_test: nkey postgres [local] REPACK(PostmasterMain+0x168a)[0x5743f979a225]
postgres: CIC_test: nkey postgres [local] REPACK(main+0x3a1)[0x5743f9662176]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x77f80402a1ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x77f80402a28b]
postgres: CIC_test: nkey postgres [local] REPACK(_start+0x25)[0x5743f9311eb5]

0008: pass

Best regards,
Mikhail.






* Re: Adding REPACK [concurrently]
@ 2025-12-18 02:05  Mihail Nikalayeu <[email protected]>
  parent: Antonin Houska <[email protected]>
  0 siblings, 0 replies; 106+ messages in thread

From: Mihail Nikalayeu @ 2025-12-18 02:05 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Alvaro Herrera <[email protected]>; Pg Hackers <[email protected]>; Robert Treat <[email protected]>

Hello, Antonin!

On Sat, Dec 13, 2025 at 8:39 PM Antonin Houska <[email protected]> wrote:
> > ---
> > > SpinLockAcquire(&shared->mutex);
> > > valid = shared->sfs_valid;
> > > SpinLockRelease(&shared->mutex);
> >
> > Better to remember last_exported here to avoid any races/misses.
>
> What races/misses exactly?

Just as a way to reduce the number of potential scenarios/states
between parallel actors.
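
A minimal sketch of what I mean, as a toy model rather than the patch code.
It assumes a last_exported counter lives next to sfs_valid in the shared
struct; the ToyShared/ToyView types and the atomic_flag lock are hypothetical
stand-ins for the real shared state and slock_t.  Both fields are copied
under one lock acquisition, so the reader always works with a consistent
pair.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the shared state; not the patch's struct. */
typedef struct ToyShared
{
	atomic_flag	mutex;			/* stand-in for a spinlock */
	bool		sfs_valid;		/* is an exported snapshot file valid? */
	uint64_t	last_exported;	/* assumed: id of the latest export */
} ToyShared;

/* Local copy of both fields, taken at a single point in time. */
typedef struct ToyView
{
	bool		valid;
	uint64_t	last_exported;
} ToyView;

static void
toy_shared_init(ToyShared *shared, bool valid, uint64_t last_exported)
{
	assert(shared != NULL);
	atomic_flag_clear(&shared->mutex);
	shared->sfs_valid = valid;
	shared->last_exported = last_exported;
}

/* Read both fields under the same lock so they cannot disagree. */
static ToyView
toy_read_state(ToyShared *shared)
{
	ToyView		view;

	while (atomic_flag_test_and_set(&shared->mutex))
		;						/* spin until the lock is free */
	view.valid = shared->sfs_valid;
	view.last_exported = shared->last_exported;
	atomic_flag_clear(&shared->mutex);
	return view;
}
```

The caller can then compare view.last_exported with the value it saw on the
previous iteration instead of relying on the flag alone.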

> > ---
> > > bool       done;
> >
> > bool exit_after_lsn_upto?
>
> Not sure.

I think it should be named in some way to signal it is a request, not a report.

> > Also, should we add some kind of back pressure between building
> > indexes/new heap and num of WAL we have?
> > But probably it is out of scope of the patch.
>
> Do you mean that the decoding worker should be less active if the amount of
> WAL doesn't grow too fast?

In the previous version (without the background worker) we had some kind
of back-pressure during the scan part (if too much WAL was delayed
because of us, we processed it).
But that is no longer true with a background worker. At the same time,
it was never there during the index-building phase...
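
Roughly what I mean by back-pressure, as a toy model only.  The struct,
the function and the threshold are hypothetical, not taken from any version
of the patch: after copying each block, the scan checks how much WAL is
waiting and processes it once a limit is exceeded.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one REPACK run; not the patch's struct. */
typedef struct ToyRepack
{
	long		pending_wal;	/* WAL bytes not yet decoded and applied */
	long		threshold;		/* assumed limit before we stop to catch up */
	long		blocks_copied;	/* progress of the heap scan */
	int			drains;			/* how often the scan paused to apply WAL */
} ToyRepack;

/*
 * Copy one block, then apply back-pressure: if concurrent activity has
 * produced more WAL than the threshold, process it before scanning on.
 */
static void
toy_copy_block(ToyRepack *st, long wal_generated)
{
	assert(st != NULL && wal_generated >= 0);
	st->blocks_copied++;
	st->pending_wal += wal_generated;
	if (st->pending_wal > st->threshold)
	{
		st->pending_wal = 0;	/* models decoding and applying the backlog */
		st->drains++;
	}
}
```

With a threshold of 100 and 60 bytes of WAL per block, the second block
triggers one drain and the third leaves 60 bytes pending.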

Best regards,
Mikhail.


end of thread, other threads:[~2025-12-18 02:05 UTC | newest]

Thread overview: 106+ messages (download: mbox mbox.gz follow: Atom feed)
2025-07-26 21:56 Adding REPACK [concurrently] Alvaro Herrera <[email protected]>
2025-07-27 01:59 ` Robert Treat <[email protected]>
2025-07-31 16:50   ` Alvaro Herrera <[email protected]>
2025-08-01 11:07     ` Fujii Masao <[email protected]>
2025-08-04 23:21       ` Mihail Nikalayeu <[email protected]>
2025-08-09 12:55         ` Mihail Nikalayeu <[email protected]>
2025-08-09 13:33           ` Alvaro Herrera <[email protected]>
2025-08-20 23:44             ` Mihail Nikalayeu <[email protected]>
2025-08-21 18:07               ` Antonin Houska <[email protected]>
2025-08-24 16:52                 ` Mihail Nikalayeu <[email protected]>
2025-08-25 13:09                   ` Antonin Houska <[email protected]>
2025-08-25 14:15                     ` Mihail Nikalayeu <[email protected]>
2025-08-25 15:42                       ` Antonin Houska <[email protected]>
2025-08-25 16:23                         ` Mihail Nikalayeu <[email protected]>
2025-08-25 17:22                           ` Antonin Houska <[email protected]>
2025-08-25 18:18                             ` Mihail Nikalayeu <[email protected]>
2025-08-26 08:46                               ` Antonin Houska <[email protected]>
2025-08-26 09:02                                 ` Mihail Nikalayeu <[email protected]>
2025-08-26 13:31                                   ` Antonin Houska <[email protected]>
2025-08-27 00:38                                     ` Mihail Nikalayeu <[email protected]>
2025-08-27 06:16                                       ` Antonin Houska <[email protected]>
2025-08-27 08:22                                         ` Mihail Nikalayeu <[email protected]>
2025-08-27 10:11                                           ` Antonin Houska <[email protected]>
2025-08-27 10:55                                             ` Mihail Nikalayeu <[email protected]>
2025-09-01 00:16                                           ` Michael Paquier <[email protected]>
2025-08-25 16:36                       ` Robert Treat <[email protected]>
2025-08-25 16:54                         ` Antonin Houska <[email protected]>
2025-08-28 21:39               ` Alvaro Herrera <[email protected]>
2025-08-29 00:32                 ` Mihail Nikalayeu <[email protected]>
2025-08-29 07:41                   ` Antonin Houska <[email protected]>
2025-08-11 14:22           ` Antonin Houska <[email protected]>
2025-08-15 12:32             ` Antonin Houska <[email protected]>
2025-08-15 12:48               ` Alvaro Herrera <[email protected]>
2025-07-27 06:00 ` Fujii Masao <[email protected]>
2025-08-05 08:58 ` Antonin Houska <[email protected]>
2025-08-16 13:41   ` Robert Treat <[email protected]>
2025-08-19 12:22     ` Alvaro Herrera <[email protected]>
2025-08-20 08:33       ` Antonin Houska <[email protected]>
2025-08-19 12:23     ` Alvaro Herrera <[email protected]>
2025-08-19 18:53 ` Álvaro Herrera <[email protected]>
2025-08-20 08:53   ` Antonin Houska <[email protected]>
2025-08-20 12:07     ` Álvaro Herrera <[email protected]>
2025-08-20 14:22       ` Antonin Houska <[email protected]>
2025-08-20 16:11         ` Andres Freund <[email protected]>
2025-08-21 18:14           ` Antonin Houska <[email protected]>
2025-08-21 18:16             ` Andres Freund <[email protected]>
2025-08-21 22:06   ` Robert Treat <[email protected]>
2025-08-22 09:40     ` Álvaro Herrera <[email protected]>
2025-08-22 20:32       ` Euler Taveira <[email protected]>
2025-08-23 06:56         ` Michael Banck <[email protected]>
2025-08-23 14:22           ` Álvaro Herrera <[email protected]>
2025-08-25 16:03             ` Robert Treat <[email protected]>
2025-08-30 17:50 ` Alvaro Herrera <[email protected]>
2025-08-31 12:09   ` Alvaro Herrera <[email protected]>
2025-08-31 15:29     ` Mihail Nikalayeu <[email protected]>
2025-08-31 17:43       ` Alvaro Herrera <[email protected]>
2025-09-01 13:00         ` Mihail Nikalayeu <[email protected]>
2025-09-01 05:12       ` Antonin Houska <[email protected]>
2025-09-01 09:06         ` Mihail Nikalayeu <[email protected]>
2025-09-01 15:30           ` Antonin Houska <[email protected]>
2025-09-02 10:44             ` Mihail Nikalayeu <[email protected]>
2025-09-03 09:55               ` Antonin Houska <[email protected]>
2025-09-23 15:51     ` Alvaro Herrera <[email protected]>
2025-09-25 18:12   ` Álvaro Herrera <[email protected]>
2025-09-25 20:20     ` Marcos Pegoraro <[email protected]>
2025-09-25 21:31       ` Robert Treat <[email protected]>
2025-09-25 21:46         ` Marcos Pegoraro <[email protected]>
2025-09-26 14:27     ` Mihail Nikalayeu <[email protected]>
2025-10-07 14:05       ` Álvaro Herrera <[email protected]>
2025-10-09 06:38         ` Antonin Houska <[email protected]>
2025-10-09 11:49           ` Álvaro Herrera <[email protected]>
2025-10-13 00:03         ` Robert Treat <[email protected]>
2025-09-26 17:30     ` Robert Treat <[email protected]>
2025-10-10 14:11 ` Alvaro Herrera <[email protected]>
2025-10-30 23:17 ` Alvaro Herrera <[email protected]>
2025-11-01 12:42   ` jian he <[email protected]>
2025-11-01 12:53     ` Sergei Kornilov <[email protected]>
2025-12-04 13:36     ` Antonin Houska <[email protected]>
2025-11-01 18:16   ` Mihail Nikalayeu <[email protected]>
2025-11-03 07:56     ` Antonin Houska <[email protected]>
2025-12-02 00:50       ` Mihail Nikalayeu <[email protected]>
2025-12-02 16:14         ` Antonin Houska <[email protected]>
2025-12-02 16:22           ` Mihail Nikalayeu <[email protected]>
2025-12-03 07:56             ` Antonin Houska <[email protected]>
2025-11-05 02:48   ` jian he <[email protected]>
2025-11-05 05:10     ` Robert Treat <[email protected]>
2025-11-05 07:12       ` Antonin Houska <[email protected]>
2025-11-09 22:13         ` Robert Treat <[email protected]>
2025-11-05 08:46       ` jian he <[email protected]>
2025-12-04 17:43   ` Antonin Houska <[email protected]>
2025-12-05 00:03     ` Mihail Nikalayeu <[email protected]>
2025-12-06 18:16       ` Mihail Nikalayeu <[email protected]>
2025-12-07 16:03         ` Mihail Nikalayeu <[email protected]>
2025-12-09 18:52           ` Antonin Houska <[email protected]>
2025-12-09 19:22             ` Alvaro Herrera <[email protected]>
2025-12-13 18:48               ` Antonin Houska <[email protected]>
2025-12-13 19:01                 ` Mihail Nikalayeu <[email protected]>
2025-12-15 14:25                 ` Alvaro Herrera <[email protected]>
2025-12-11 20:38             ` Mihail Nikalayeu <[email protected]>
2025-12-13 18:45               ` Mihail Nikalayeu <[email protected]>
2025-12-13 18:59                 ` Mihail Nikalayeu <[email protected]>
2025-12-18 01:47                 ` Mihail Nikalayeu <[email protected]>
2025-12-13 19:39               ` Antonin Houska <[email protected]>
2025-12-18 02:05                 ` Mihail Nikalayeu <[email protected]>
2025-12-08 09:51         ` Antonin Houska <[email protected]>
2025-12-08 07:35       ` Antonin Houska <[email protected]>
