From: Alvaro Herrera <[email protected]>
To: Antonin Houska <[email protected]>
Cc: Mihail Nikalayeu <[email protected]>
Cc: Pg Hackers <[email protected]>
Cc: Robert Treat <[email protected]>
Subject: Re: Adding REPACK [concurrently]
Date: Fri, 6 Feb 2026 16:08:27 +0100
Message-ID: <[email protected]> (raw)
In-Reply-To: <88003.1769511456@localhost>

Here's a v33, where pg_repackdb has been removed, per multiple
discussions off-list.  This is not a statement that we will never have
such a tool; just that for the time being we should not let ourselves be
distracted by it.  If we have time to get something done about it for
v19, then it's fine to bring it back; but I kinda doubt it.  I think
getting bits done including the addition of the CONCURRENTLY option
trumps that.  We can add it in v20 if we decide to; no great loss.

I didn't include Antonin's 0006 "Use multiple snapshots to copy the
data" either.  It still seems a bit too experimental.  I think it would be
good to have it in v19 also, but it seems less critical than the rest.

I haven't looked at Mihail's patch downthread either.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Los cuentos de hadas no dan al niño su primera idea sobre los monstruos.
Lo que le dan es su primera idea de la posible derrota del monstruo."
                                                   (G. K. Chesterton)


Attachments:

  [text/x-diff] v33-0001-Add-REPACK-command.patch (109.3K, 2-v33-0001-Add-REPACK-command.patch)
From 09eace7f3f495dd19471d41825dc6b070e1fae6f Mon Sep 17 00:00:00 2001
From: Antonin Houska <[email protected]>
Date: Tue, 27 Jan 2026 11:48:40 +0100
Subject: [PATCH v33 1/5] Add REPACK command
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

REPACK absorbs the functionality of VACUUM FULL and CLUSTER in a single
command.  Because this functionality is completely different from
regular VACUUM, having it separate from VACUUM makes it easier for users
to understand; as for CLUSTER, the term is heavily overloaded in the
IT world and even in Postgres itself, so it's good that we can avoid it.

Author: Antonin Houska <[email protected]>
Co-authored-by: Álvaro Herrera <[email protected]>
Reviewed-by: Mihail Nikalayeu <[email protected]>
Reviewed-by: Robert Treat <[email protected]>
Reviewed-by: Euler Taveira <[email protected]>
Reviewed-by: Matheus Alcantara <[email protected]>
Reviewed-by: Junwang Zhao <[email protected]>
Reviewed-by: jian he <[email protected]>
Discussion: https://postgr.es/m/82651.1720540558@antos
Discussion: https://postgr.es/m/[email protected]
---
 doc/src/sgml/monitoring.sgml             | 223 +++++-
 doc/src/sgml/ref/allfiles.sgml           |   1 +
 doc/src/sgml/ref/cluster.sgml            |  97 +--
 doc/src/sgml/ref/repack.sgml             | 328 +++++++++
 doc/src/sgml/ref/vacuum.sgml             |  33 +-
 doc/src/sgml/reference.sgml              |   1 +
 src/backend/access/heap/heapam_handler.c |  32 +-
 src/backend/catalog/index.c              |   2 +-
 src/backend/catalog/system_views.sql     |  29 +-
 src/backend/commands/cluster.c           | 848 +++++++++++++++--------
 src/backend/commands/vacuum.c            |   6 +-
 src/backend/parser/gram.y                |  86 ++-
 src/backend/tcop/utility.c               |  23 +-
 src/backend/utils/adt/pgstatfuncs.c      |   4 +-
 src/bin/psql/tab-complete.in.c           |  42 +-
 src/include/commands/cluster.h           |   8 +-
 src/include/commands/progress.h          |  48 +-
 src/include/nodes/parsenodes.h           |  35 +-
 src/include/parser/kwlist.h              |   1 +
 src/include/tcop/cmdtaglist.h            |   1 +
 src/include/utils/backend_progress.h     |   2 +-
 src/test/regress/expected/cluster.out    | 134 +++-
 src/test/regress/expected/rules.out      |  72 +-
 src/test/regress/sql/cluster.sql         |  70 +-
 src/tools/pgindent/typedefs.list         |   2 +
 25 files changed, 1598 insertions(+), 530 deletions(-)
 create mode 100644 doc/src/sgml/ref/repack.sgml

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index b77d189a500..71c92ed53ef 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -405,6 +405,14 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_progress_repack</structname><indexterm><primary>pg_stat_progress_repack</primary></indexterm></entry>
+      <entry>One row for each backend running
+       <command>REPACK</command>, showing current progress.  See
+       <xref linkend="repack-progress-reporting"/>.
+      </entry>
+     </row>
+
      <row>
       <entry><structname>pg_stat_progress_basebackup</structname><indexterm><primary>pg_stat_progress_basebackup</primary></indexterm></entry>
       <entry>One row for each WAL sender process streaming a base backup,
@@ -5646,7 +5654,8 @@ FROM pg_stat_get_backend_idset() AS backendid;
    certain commands during command execution.  Currently, the only commands
    which support progress reporting are <command>ANALYZE</command>,
    <command>CLUSTER</command>,
-   <command>CREATE INDEX</command>, <command>VACUUM</command>,
+   <command>CREATE INDEX</command>, <command>REPACK</command>,
+   <command>VACUUM</command>,
    <command>COPY</command>,
    and <xref linkend="protocol-replication-base-backup"/> (i.e., replication
    command that <xref linkend="app-pgbasebackup"/> issues to take
@@ -6130,6 +6139,218 @@ FROM pg_stat_get_backend_idset() AS backendid;
   </table>
  </sect2>
 
+ <sect2 id="repack-progress-reporting">
+  <title>REPACK Progress Reporting</title>
+
+  <indexterm>
+   <primary>pg_stat_progress_repack</primary>
+  </indexterm>
+
+  <para>
+   Whenever <command>REPACK</command> is running,
+   the <structname>pg_stat_progress_repack</structname> view will contain a
+   row for each backend that is currently running the command.  The tables
+   below describe the information that will be reported and provide
+   information about how to interpret it.
+  </para>
+
+  <table id="pg-stat-progress-repack-view" xreflabel="pg_stat_progress_repack">
+   <title><structname>pg_stat_progress_repack</structname> View</title>
+   <tgroup cols="1">
+    <thead>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       Column Type
+      </para>
+      <para>
+       Description
+      </para></entry>
+     </row>
+    </thead>
+
+    <tbody>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>pid</structfield> <type>integer</type>
+      </para>
+      <para>
+       Process ID of backend.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>datid</structfield> <type>oid</type>
+      </para>
+      <para>
+       OID of the database to which this backend is connected.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>datname</structfield> <type>name</type>
+      </para>
+      <para>
+       Name of the database to which this backend is connected.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>relid</structfield> <type>oid</type>
+      </para>
+      <para>
+       OID of the table being repacked.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>phase</structfield> <type>text</type>
+      </para>
+      <para>
+       Current processing phase. See <xref linkend="repack-phases"/>.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>repack_index_relid</structfield> <type>oid</type>
+      </para>
+      <para>
+       If the table is being scanned using an index, this is the OID of the
+       index being used; otherwise, it is zero.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>heap_tuples_scanned</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of heap tuples scanned.
+       This counter only advances when the phase is
+       <literal>seq scanning heap</literal>,
+       <literal>index scanning heap</literal>
+       or <literal>writing new heap</literal>.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>heap_tuples_written</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of heap tuples written.
+       This counter only advances when the phase is
+       <literal>seq scanning heap</literal>,
+       <literal>index scanning heap</literal>
+       or <literal>writing new heap</literal>.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>heap_blks_total</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Total number of heap blocks in the table.  This number is reported
+       as of the beginning of <literal>seq scanning heap</literal>.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>heap_blks_scanned</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of heap blocks scanned.  This counter only advances when the
+       phase is <literal>seq scanning heap</literal>.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>index_rebuild_count</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of indexes rebuilt.  This counter only advances when the phase
+       is <literal>rebuilding index</literal>.
+      </para></entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <table id="repack-phases">
+   <title>REPACK Phases</title>
+   <tgroup cols="2">
+    <colspec colname="col1" colwidth="1*"/>
+    <colspec colname="col2" colwidth="2*"/>
+    <thead>
+     <row>
+      <entry>Phase</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+   <tbody>
+    <row>
+     <entry><literal>initializing</literal></entry>
+     <entry>
+       The command is preparing to begin scanning the heap.  This phase is
+       expected to be very brief.
+     </entry>
+    </row>
+    <row>
+     <entry><literal>seq scanning heap</literal></entry>
+     <entry>
+       The command is currently scanning the table using a sequential scan.
+     </entry>
+    </row>
+    <row>
+     <entry><literal>index scanning heap</literal></entry>
+     <entry>
+       <command>REPACK</command> is currently scanning the table using an index scan.
+     </entry>
+    </row>
+    <row>
+     <entry><literal>sorting tuples</literal></entry>
+     <entry>
+       <command>REPACK</command> is currently sorting tuples.
+     </entry>
+    </row>
+    <row>
+     <entry><literal>writing new heap</literal></entry>
+     <entry>
+       <command>REPACK</command> is currently writing the new heap.
+     </entry>
+    </row>
+    <row>
+     <entry><literal>swapping relation files</literal></entry>
+     <entry>
+       The command is currently swapping newly-built files into place.
+     </entry>
+    </row>
+    <row>
+     <entry><literal>rebuilding index</literal></entry>
+     <entry>
+       The command is currently rebuilding an index.
+     </entry>
+    </row>
+    <row>
+     <entry><literal>performing final cleanup</literal></entry>
+     <entry>
+       The command is performing final cleanup.  When this phase is
+       completed, <command>REPACK</command> will end.
+     </entry>
+    </row>
+   </tbody>
+   </tgroup>
+  </table>
+ </sect2>
+
  <sect2 id="copy-progress-reporting">
   <title>COPY Progress Reporting</title>
 
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index e167406c744..141ada9c50a 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -167,6 +167,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY refreshMaterializedView SYSTEM "refresh_materialized_view.sgml">
 <!ENTITY reindex            SYSTEM "reindex.sgml">
 <!ENTITY releaseSavepoint   SYSTEM "release_savepoint.sgml">
+<!ENTITY repack             SYSTEM "repack.sgml">
 <!ENTITY reset              SYSTEM "reset.sgml">
 <!ENTITY revoke             SYSTEM "revoke.sgml">
 <!ENTITY rollback           SYSTEM "rollback.sgml">
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index 0b47460080b..2cda711bc9f 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -33,51 +33,13 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
   <title>Description</title>
 
   <para>
-   <command>CLUSTER</command> instructs <productname>PostgreSQL</productname>
-   to cluster the table specified
-   by <replaceable class="parameter">table_name</replaceable>
-   based on the index specified by
-   <replaceable class="parameter">index_name</replaceable>. The index must
-   already have been defined on
-   <replaceable class="parameter">table_name</replaceable>.
+   The <command>CLUSTER</command> command is equivalent to
+   <xref linkend="sql-repack"/> with a <literal>USING INDEX</literal>
+   clause.  See there for more details.
   </para>
 
-  <para>
-   When a table is clustered, it is physically reordered
-   based on the index information. Clustering is a one-time operation:
-   when the table is subsequently updated, the changes are
-   not clustered.  That is, no attempt is made to store new or
-   updated rows according to their index order.  (If one wishes, one can
-   periodically recluster by issuing the command again.  Also, setting
-   the table's <literal>fillfactor</literal> storage parameter to less than
-   100% can aid in preserving cluster ordering during updates, since updated
-   rows are kept on the same page if enough space is available there.)
-  </para>
+<!-- Do we need to describe exactly which options map to what?  They seem obvious to me. -->
 
-  <para>
-   When a table is clustered, <productname>PostgreSQL</productname>
-   remembers which index it was clustered by.  The form
-   <command>CLUSTER <replaceable class="parameter">table_name</replaceable></command>
-   reclusters the table using the same index as before.  You can also
-   use the <literal>CLUSTER</literal> or <literal>SET WITHOUT CLUSTER</literal>
-   forms of <link linkend="sql-altertable"><command>ALTER TABLE</command></link> to set the index to be used for
-   future cluster operations, or to clear any previous setting.
-  </para>
-
-  <para>
-   <command>CLUSTER</command> without a
-   <replaceable class="parameter">table_name</replaceable> reclusters all the
-   previously-clustered tables in the current database that the calling user
-   has privileges for.  This form of <command>CLUSTER</command> cannot be
-   executed inside a transaction block.
-  </para>
-
-  <para>
-   When a table is being clustered, an <literal>ACCESS
-   EXCLUSIVE</literal> lock is acquired on it. This prevents any other
-   database operations (both reads and writes) from operating on the
-   table until the <command>CLUSTER</command> is finished.
-  </para>
  </refsect1>
 
  <refsect1>
@@ -136,63 +98,12 @@ CLUSTER [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <r
     on the table.
    </para>
 
-   <para>
-    In cases where you are accessing single rows randomly
-    within a table, the actual order of the data in the
-    table is unimportant. However, if you tend to access some
-    data more than others, and there is an index that groups
-    them together, you will benefit from using <command>CLUSTER</command>.
-    If you are requesting a range of indexed values from a table, or a
-    single indexed value that has multiple rows that match,
-    <command>CLUSTER</command> will help because once the index identifies the
-    table page for the first row that matches, all other rows
-    that match are probably already on the same table page,
-    and so you save disk accesses and speed up the query.
-   </para>
-
-   <para>
-    <command>CLUSTER</command> can re-sort the table using either an index scan
-    on the specified index, or (if the index is a b-tree) a sequential
-    scan followed by sorting.  It will attempt to choose the method that
-    will be faster, based on planner cost parameters and available statistical
-    information.
-   </para>
-
    <para>
     While <command>CLUSTER</command> is running, the <xref
     linkend="guc-search-path"/> is temporarily changed to <literal>pg_catalog,
     pg_temp</literal>.
    </para>
 
-   <para>
-    When an index scan is used, a temporary copy of the table is created that
-    contains the table data in the index order.  Temporary copies of each
-    index on the table are created as well.  Therefore, you need free space on
-    disk at least equal to the sum of the table size and the index sizes.
-   </para>
-
-   <para>
-    When a sequential scan and sort is used, a temporary sort file is
-    also created, so that the peak temporary space requirement is as much
-    as double the table size, plus the index sizes.  This method is often
-    faster than the index scan method, but if the disk space requirement is
-    intolerable, you can disable this choice by temporarily setting <xref
-    linkend="guc-enable-sort"/> to <literal>off</literal>.
-   </para>
-
-   <para>
-    It is advisable to set <xref linkend="guc-maintenance-work-mem"/> to
-    a reasonably large value (but not more than the amount of RAM you can
-    dedicate to the <command>CLUSTER</command> operation) before clustering.
-   </para>
-
-   <para>
-    Because the planner records statistics about the ordering of
-    tables, it is advisable to run <link linkend="sql-analyze"><command>ANALYZE</command></link>
-    on the newly clustered table.
-    Otherwise, the planner might make poor choices of query plans.
-   </para>
-
    <para>
     Because <command>CLUSTER</command> remembers which indexes are clustered,
     one can cluster the tables one wants clustered manually the first time,
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
new file mode 100644
index 00000000000..61d5c2cdef1
--- /dev/null
+++ b/doc/src/sgml/ref/repack.sgml
@@ -0,0 +1,328 @@
+<!--
+doc/src/sgml/ref/repack.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="sql-repack">
+ <indexterm zone="sql-repack">
+  <primary>REPACK</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle>REPACK</refentrytitle>
+  <manvolnum>7</manvolnum>
+  <refmiscinfo>SQL - Language Statements</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>REPACK</refname>
+  <refpurpose>rewrite a table to reclaim disk space</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+<synopsis>
+REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <replaceable class="parameter">table_and_columns</replaceable> [ USING INDEX [ <replaceable class="parameter">index_name</replaceable> ] ] ]
+REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] USING INDEX
+
+<phrase>where <replaceable class="parameter">option</replaceable> can be one of:</phrase>
+
+    VERBOSE [ <replaceable class="parameter">boolean</replaceable> ]
+    ANALYZE [ <replaceable class="parameter">boolean</replaceable> ]
+
+<phrase>and <replaceable class="parameter">table_and_columns</replaceable> is:</phrase>
+
+    <replaceable class="parameter">table_name</replaceable> [ ( <replaceable class="parameter">column_name</replaceable> [, ...] ) ]
+</synopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+   <command>REPACK</command> reclaims storage occupied by dead
+   tuples. Unlike <command>VACUUM</command>, it does so by rewriting the
+   entire contents of the table specified
+   by <replaceable class="parameter">table_name</replaceable> into a new disk
+   file with no extra space (except for the space guaranteed by
+   the <literal>fillfactor</literal> storage parameter), allowing unused space
+   to be returned to the operating system.
+  </para>
+
+  <para>
+   Without
+   a <replaceable class="parameter">table_name</replaceable>, <command>REPACK</command>
+   processes every table and materialized view in the current database that
+   the current user has the <literal>MAINTAIN</literal> privilege on. This
+   form of <command>REPACK</command> cannot be executed inside a transaction
+   block.
+  </para>
+
+  <para>
+   If a <literal>USING INDEX</literal> clause is specified, the rows are
+   physically reordered based on information from an index.  Please see the
+   notes on clustering below.
+  </para>
+
+  <para>
+   When a table is being repacked, an <literal>ACCESS EXCLUSIVE</literal> lock
+   is acquired on it. This prevents any other database operations (both reads
+   and writes) from operating on the table until the <command>REPACK</command>
+   is finished.
+  </para>
+
+  <refsect2 id="sql-repack-notes-on-clustering" xreflabel="Notes on Clustering">
+   <title>Notes on Clustering</title>
+
+   <para>
+    If the <literal>USING INDEX</literal> clause is specified, the rows in
+    the table are physically reordered following an index: if an index name
+    is specified in the command, then that index is used; if no index name
+    is specified, then the index that has been configured as the index to
+    cluster on is used.  If no index has been configured in this way, an
+    error is thrown.  An index given in the <literal>USING INDEX</literal>
+    clause, like one given to the <command>CLUSTER</command> command, is
+    remembered as the index to cluster on.  The index can also be set
+    manually using <command>ALTER TABLE ... CLUSTER ON</command>, and reset
+    with <command>ALTER TABLE ... SET WITHOUT CLUSTER</command>.
+   </para>
+
+   <para>
+    If no table name is specified in <command>REPACK USING INDEX</command>,
+    all tables which have a clustering index defined and which the calling
+    user has privileges for are processed.
+   </para>
+
+   <para>
+    Clustering is a one-time operation: when the table is
+    subsequently updated, the changes are not clustered.  That is, no attempt
+    is made to store new or updated rows according to their index order.  (If
+    one wishes, one can periodically recluster by issuing the command again.
+    Also, setting the table's <literal>fillfactor</literal> storage parameter
+    to less than 100% can aid in preserving cluster ordering during updates,
+    since updated rows are kept on the same page if enough space is available
+    there.)
+   </para>
+
+   <para>
+    In cases where you are accessing single rows randomly within a table, the
+    actual order of the data in the table is unimportant. However, if you tend
+    to access some data more than others, and there is an index that groups
+    them together, you will benefit from using clustering.  If
+    you are requesting a range of indexed values from a table, or a single
+    indexed value that has multiple rows that match,
+    <command>REPACK</command> will help because once the index identifies the
+    table page for the first row that matches, all other rows that match are
+    probably already on the same table page, and so you save disk accesses and
+    speed up the query.
+   </para>
+
+   <para>
+    <command>REPACK</command> can re-sort the table using either an index scan
+    on the specified index, or (if the index is a b-tree) a sequential scan
+    followed by sorting.  It will attempt to choose the method that will be
+    faster, based on planner cost parameters and available statistical
+    information.
+   </para>
+
+   <para>
+    Because the planner records statistics about the ordering of tables, it is
+    advisable to
+    run <link linkend="sql-analyze"><command>ANALYZE</command></link> on the
+    newly repacked table.  Otherwise, the planner might make poor choices of
+    query plans.
+   </para>
+  </refsect2>
+
+  <refsect2 id="sql-repack-notes-on-resources" xreflabel="Notes on Resources">
+   <title>Notes on Resources</title>
+
+   <para>
+    When an index scan or a sequential scan without sort is used, a temporary
+    copy of the table is created that contains the table data in the index
+    order.  Temporary copies of each index on the table are created as well.
+    Therefore, you need free space on disk at least equal to the sum of the
+    table size and the index sizes.
+   </para>
+
+   <para>
+    When a sequential scan and sort is used, a temporary sort file is also
+    created, so that the peak temporary space requirement is as much as double
+    the table size, plus the index sizes.  This method is often faster than
+    the index scan method, but if the disk space requirement is intolerable,
+    you can disable this choice by temporarily setting
+    <xref linkend="guc-enable-sort"/> to <literal>off</literal>.
+   </para>
+
+   <para>
+    It is advisable to set <xref linkend="guc-maintenance-work-mem"/> to a
+    reasonably large value (but not more than the amount of RAM you can
+    dedicate to the <command>REPACK</command> operation) before repacking.
+   </para>
+  </refsect2>
+
+ </refsect1>
+
+ <refsect1>
+  <title>Parameters</title>
+
+  <variablelist>
+   <varlistentry>
+    <term><replaceable class="parameter">table_name</replaceable></term>
+    <listitem>
+     <para>
+      The name (possibly schema-qualified) of a table.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><replaceable class="parameter">column_name</replaceable></term>
+    <listitem>
+     <para>
+      The name of a specific column to analyze. Defaults to all columns.
+      If a column list is specified, <literal>ANALYZE</literal> must also
+      be specified.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><replaceable class="parameter">index_name</replaceable></term>
+    <listitem>
+     <para>
+      The name of an index.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>VERBOSE</literal></term>
+    <listitem>
+     <para>
+      Prints a progress report as each table is repacked
+      at <literal>INFO</literal> level.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>ANALYZE</literal></term>
+    <term><literal>ANALYSE</literal></term>
+    <listitem>
+     <para>
+      Applies <xref linkend="sql-analyze"/> on the table after repacking.  This is
+      currently only supported when a single (non-partitioned) table is specified.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><replaceable class="parameter">boolean</replaceable></term>
+    <listitem>
+     <para>
+      Specifies whether the selected option should be turned on or off.
+      You can write <literal>TRUE</literal>, <literal>ON</literal>, or
+      <literal>1</literal> to enable the option, and <literal>FALSE</literal>,
+      <literal>OFF</literal>, or <literal>0</literal> to disable it.  The
+      <replaceable class="parameter">boolean</replaceable> value can also
+      be omitted, in which case <literal>TRUE</literal> is assumed.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </refsect1>
+
+ <refsect1>
+  <title>Notes</title>
+
+   <para>
+    To repack a table, one must have the <literal>MAINTAIN</literal> privilege
+    on the table.
+   </para>
+
+   <para>
+    While <command>REPACK</command> is running, the <xref
+    linkend="guc-search-path"/> is temporarily changed to <literal>pg_catalog,
+    pg_temp</literal>.
+   </para>
+
+   <para>
+    Each backend running <command>REPACK</command> will report its progress
+    in the <structname>pg_stat_progress_repack</structname> view. See
+    <xref linkend="repack-progress-reporting"/> for details.
+   </para>
+
+   <para>
+    Repacking a partitioned table repacks each of its partitions. If an index
+    is specified, each partition is repacked using the partition of that
+    index. <command>REPACK</command> on a partitioned table cannot be executed
+    inside a transaction block.
+   </para>
+
+ </refsect1>
+
+ <refsect1>
+  <title>Examples</title>
+
+  <para>
+   Repack the table <literal>employees</literal>:
+<programlisting>
+REPACK employees;
+</programlisting>
+  </para>
+
+  <para>
+   Repack the table <literal>employees</literal> on the basis of its
+   index <literal>employees_ind</literal> (since an index is used here, this
+   is effectively clustering):
+<programlisting>
+REPACK employees USING INDEX employees_ind;
+</programlisting>
+  </para>
+
+  <para>
+   Repack the table <literal>cases</literal> on physical ordering,
+   running an <command>ANALYZE</command> on the given columns once
+   repacking is done, showing informational messages:
+<programlisting>
+REPACK (ANALYZE, VERBOSE) cases (district, case_nr);
+</programlisting>
+  </para>
+
+  <para>
+   Repack all tables in the database on which you have
+   the <literal>MAINTAIN</literal> privilege:
+<programlisting>
+REPACK;
+</programlisting>
+  </para>
+
+  <para>
+   Repack all tables for which a clustering index has previously been
+   configured and on which you have the <literal>MAINTAIN</literal> privilege,
+   showing informational messages:
+<programlisting>
+REPACK (VERBOSE) USING INDEX;
+</programlisting>
+  </para>
+
+ </refsect1>
+
+ <refsect1>
+  <title>Compatibility</title>
+
+  <para>
+   There is no <command>REPACK</command> statement in the SQL standard.
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>See Also</title>
+
+  <simplelist type="inline">
+   <member><xref linkend="app-pgrepackdb"/></member>
+   <member><xref linkend="repack-progress-reporting"/></member>
+  </simplelist>
+ </refsect1>
+
+</refentry>
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 6d0fdd43cfb..ac5d083d468 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -25,7 +25,6 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
 
 <phrase>where <replaceable class="parameter">option</replaceable> can be one of:</phrase>
 
-    FULL [ <replaceable class="parameter">boolean</replaceable> ]
     FREEZE [ <replaceable class="parameter">boolean</replaceable> ]
     VERBOSE [ <replaceable class="parameter">boolean</replaceable> ]
     ANALYZE [ <replaceable class="parameter">boolean</replaceable> ]
@@ -39,6 +38,7 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
     SKIP_DATABASE_STATS [ <replaceable class="parameter">boolean</replaceable> ]
     ONLY_DATABASE_STATS [ <replaceable class="parameter">boolean</replaceable> ]
     BUFFER_USAGE_LIMIT <replaceable class="parameter">size</replaceable>
+    FULL [ <replaceable class="parameter">boolean</replaceable> ]
 
 <phrase>and <replaceable class="parameter">table_and_columns</replaceable> is:</phrase>
 
@@ -95,20 +95,6 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
   <title>Parameters</title>
 
   <variablelist>
-   <varlistentry>
-    <term><literal>FULL</literal></term>
-    <listitem>
-     <para>
-      Selects <quote>full</quote> vacuum, which can reclaim more
-      space, but takes much longer and exclusively locks the table.
-      This method also requires extra disk space, since it writes a
-      new copy of the table and doesn't release the old copy until
-      the operation is complete.  Usually this should only be used when a
-      significant amount of space needs to be reclaimed from within the table.
-     </para>
-    </listitem>
-   </varlistentry>
-
    <varlistentry>
     <term><literal>FREEZE</literal></term>
     <listitem>
@@ -362,6 +348,23 @@ VACUUM [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] [ <re
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term><literal>FULL</literal></term>
+    <listitem>
+     <para>
+      This option, which is deprecated, makes <command>VACUUM</command>
+      behave like <command>REPACK</command> without a
+      <literal>USING INDEX</literal> clause.
+      This method of compacting the table takes much longer than plain
+      <command>VACUUM</command> and exclusively locks the table.
+      This method also requires extra disk space, since it writes a
+      new copy of the table and doesn't release the old copy until
+      the operation is complete.  Usually this should only be used when a
+      significant amount of space needs to be reclaimed from within the table.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term><replaceable class="parameter">boolean</replaceable></term>
     <listitem>
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index 2cf02c37b17..d9fdbb5d254 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -195,6 +195,7 @@
    &refreshMaterializedView;
    &reindex;
    &releaseSavepoint;
+   &repack;
    &reset;
    &revoke;
    &rollback;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index cbef73e5d4b..7d4b48e5a97 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -741,13 +741,13 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 	if (OldIndex != NULL && !use_sort)
 	{
 		const int	ci_index[] = {
-			PROGRESS_CLUSTER_PHASE,
-			PROGRESS_CLUSTER_INDEX_RELID
+			PROGRESS_REPACK_PHASE,
+			PROGRESS_REPACK_INDEX_RELID
 		};
 		int64		ci_val[2];
 
 		/* Set phase and OIDOldIndex to columns */
-		ci_val[0] = PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP;
+		ci_val[0] = PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP;
 		ci_val[1] = RelationGetRelid(OldIndex);
 		pgstat_progress_update_multi_param(2, ci_index, ci_val);
 
@@ -759,15 +759,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 	else
 	{
 		/* In scan-and-sort mode and also VACUUM FULL, set phase */
-		pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
-									 PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP);
+		pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+									 PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP);
 
 		tableScan = table_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
 		heapScan = (HeapScanDesc) tableScan;
 		indexScan = NULL;
 
 		/* Set total heap blocks */
-		pgstat_progress_update_param(PROGRESS_CLUSTER_TOTAL_HEAP_BLKS,
+		pgstat_progress_update_param(PROGRESS_REPACK_TOTAL_HEAP_BLKS,
 									 heapScan->rs_nblocks);
 	}
 
@@ -809,7 +809,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 				 * is manually updated to the correct value when the table
 				 * scan finishes.
 				 */
-				pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_BLKS_SCANNED,
+				pgstat_progress_update_param(PROGRESS_REPACK_HEAP_BLKS_SCANNED,
 											 heapScan->rs_nblocks);
 				break;
 			}
@@ -825,7 +825,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 			 */
 			if (prev_cblock != heapScan->rs_cblock)
 			{
-				pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_BLKS_SCANNED,
+				pgstat_progress_update_param(PROGRESS_REPACK_HEAP_BLKS_SCANNED,
 											 (heapScan->rs_cblock +
 											  heapScan->rs_nblocks -
 											  heapScan->rs_startblock
@@ -926,14 +926,14 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 			 * In scan-and-sort mode, report increase in number of tuples
 			 * scanned
 			 */
-			pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
+			pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
 										 *num_tuples);
 		}
 		else
 		{
 			const int	ct_index[] = {
-				PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED,
-				PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN
+				PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
+				PROGRESS_REPACK_HEAP_TUPLES_WRITTEN
 			};
 			int64		ct_val[2];
 
@@ -966,14 +966,14 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 		double		n_tuples = 0;
 
 		/* Report that we are now sorting tuples */
-		pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
-									 PROGRESS_CLUSTER_PHASE_SORT_TUPLES);
+		pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+									 PROGRESS_REPACK_PHASE_SORT_TUPLES);
 
 		tuplesort_performsort(tuplesort);
 
 		/* Report that we are now writing new heap */
-		pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
-									 PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP);
+		pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+									 PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP);
 
 		for (;;)
 		{
@@ -991,7 +991,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 									 values, isnull,
 									 rwstate);
 			/* Report n_tuples */
-			pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN,
+			pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_WRITTEN,
 										 n_tuples);
 		}
 
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 43de42ce39e..5ee6389d39c 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -4077,7 +4077,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 		Assert(!ReindexIsProcessingIndex(indexOid));
 
 		/* Set index rebuild count */
-		pgstat_progress_update_param(PROGRESS_CLUSTER_INDEX_REBUILD_COUNT,
+		pgstat_progress_update_param(PROGRESS_REPACK_INDEX_REBUILD_COUNT,
 									 i);
 		i++;
 	}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 7553f31fef0..3f05ba3083a 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1283,14 +1283,15 @@ CREATE VIEW pg_stat_progress_vacuum AS
     FROM pg_stat_get_progress_info('VACUUM') AS S
         LEFT JOIN pg_database D ON S.datid = D.oid;
 
-CREATE VIEW pg_stat_progress_cluster AS
+CREATE VIEW pg_stat_progress_repack AS
     SELECT
         S.pid AS pid,
         S.datid AS datid,
         D.datname AS datname,
         S.relid AS relid,
         CASE S.param1 WHEN 1 THEN 'CLUSTER'
-                      WHEN 2 THEN 'VACUUM FULL'
+                      WHEN 2 THEN 'REPACK'
+                      WHEN 3 THEN 'VACUUM FULL'
                       END AS command,
         CASE S.param2 WHEN 0 THEN 'initializing'
                       WHEN 1 THEN 'seq scanning heap'
@@ -1301,15 +1302,35 @@ CREATE VIEW pg_stat_progress_cluster AS
                       WHEN 6 THEN 'rebuilding index'
                       WHEN 7 THEN 'performing final cleanup'
                       END AS phase,
-        CAST(S.param3 AS oid) AS cluster_index_relid,
+        CAST(S.param3 AS oid) AS repack_index_relid,
         S.param4 AS heap_tuples_scanned,
         S.param5 AS heap_tuples_written,
         S.param6 AS heap_blks_total,
         S.param7 AS heap_blks_scanned,
         S.param8 AS index_rebuild_count
-    FROM pg_stat_get_progress_info('CLUSTER') AS S
+    FROM pg_stat_get_progress_info('REPACK') AS S
         LEFT JOIN pg_database D ON S.datid = D.oid;
 
+-- This view is the same as the one above, except that it renames a column and
+-- avoids reporting 'REPACK' as a command name.
+CREATE VIEW pg_stat_progress_cluster AS
+    SELECT
+        pid,
+        datid,
+        datname,
+        relid,
+        CASE WHEN command IN ('CLUSTER', 'VACUUM FULL') THEN command
+             WHEN repack_index_relid = 0 THEN 'VACUUM FULL'
+             ELSE 'CLUSTER' END AS command,
+        phase,
+        repack_index_relid AS cluster_index_relid,
+        heap_tuples_scanned,
+        heap_tuples_written,
+        heap_blks_total,
+        heap_blks_scanned,
+        index_rebuild_count
+    FROM pg_stat_progress_repack;
+
 CREATE VIEW pg_stat_progress_create_index AS
     SELECT
         S.pid AS pid, S.datid AS datid, D.datname AS datname,
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 60a4617a585..e19675a6d05 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1,7 +1,8 @@
 /*-------------------------------------------------------------------------
  *
  * cluster.c
- *	  CLUSTER a table on an index.  This is now also used for VACUUM FULL.
+ *	  CLUSTER a table on an index.  This is now also used for VACUUM FULL and
+ *	  REPACK.
  *
  * There is hardly anything left of Paul Brown's original implementation...
  *
@@ -67,27 +68,35 @@ typedef struct
 	Oid			indexOid;
 } RelToCluster;
 
-
-static void cluster_multiple_rels(List *rtcs, ClusterParams *params);
+static bool cluster_rel_recheck(RepackCommand cmd, Relation OldHeap,
+								Oid indexOid, Oid userid, int options);
 static void rebuild_relation(Relation OldHeap, Relation index, bool verbose);
 static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
 							bool verbose, bool *pSwapToastByContent,
 							TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
-static List *get_tables_to_cluster(MemoryContext cluster_context);
-static List *get_tables_to_cluster_partitioned(MemoryContext cluster_context,
-											   Oid indexOid);
-static bool cluster_is_permitted_for_relation(Oid relid, Oid userid);
+static List *get_tables_to_repack(RepackCommand cmd, bool usingindex,
+								  MemoryContext permcxt);
+static List *get_tables_to_repack_partitioned(RepackCommand cmd,
+											  Oid relid, bool rel_is_index,
+											  MemoryContext permcxt);
+static bool cluster_is_permitted_for_relation(RepackCommand cmd,
+											  Oid relid, Oid userid);
+static Relation process_single_relation(RepackStmt *stmt,
+										ClusterParams *params);
+static Oid	determine_clustered_index(Relation rel, bool usingindex,
+									  const char *indexname);
+static const char *RepackCommandAsString(RepackCommand cmd);
 
 
-/*---------------------------------------------------------------------------
- * This cluster code allows for clustering multiple tables at once. Because
+/*
+ * The repack code allows for processing multiple tables at once. Because
  * of this, we cannot just run everything on a single transaction, or we
  * would be forced to acquire exclusive locks on all the tables being
  * clustered, simultaneously --- very likely leading to deadlock.
  *
- * To solve this we follow a similar strategy to VACUUM code,
- * clustering each relation in a separate transaction. For this to work,
- * we need to:
+ * To solve this we follow a strategy similar to the VACUUM code, processing
+ * each relation in a separate transaction. For this to work, we need to:
+ *
  *	- provide a separate memory context so that we can pass information in
  *	  a way that survives across transactions
  *	- start a new transaction every time a new relation is clustered
@@ -98,197 +107,166 @@ static bool cluster_is_permitted_for_relation(Oid relid, Oid userid);
  *
  * The single-relation case does not have any such overhead.
  *
- * We also allow a relation to be specified without index.  In that case,
- * the indisclustered bit will be looked up, and an ERROR will be thrown
- * if there is no index with the bit set.
- *---------------------------------------------------------------------------
+ * We also allow a relation to be repacked following an index, but without
+ * naming a specific one.  In that case, the indisclustered bit will be
+ * looked up, and an ERROR will be thrown if no so-marked index is found.
  */
 void
-cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
+ExecRepack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
 {
-	ListCell   *lc;
 	ClusterParams params = {0};
-	bool		verbose = false;
 	Relation	rel = NULL;
-	Oid			indexOid = InvalidOid;
-	MemoryContext cluster_context;
+	MemoryContext repack_context;
 	List	   *rtcs;
 
 	/* Parse option list */
-	foreach(lc, stmt->params)
+	foreach_node(DefElem, opt, stmt->params)
 	{
-		DefElem    *opt = (DefElem *) lfirst(lc);
-
 		if (strcmp(opt->defname, "verbose") == 0)
-			verbose = defGetBoolean(opt);
+			params.options |= defGetBoolean(opt) ? CLUOPT_VERBOSE : 0;
+		else if (strcmp(opt->defname, "analyze") == 0 ||
+				 strcmp(opt->defname, "analyse") == 0)
+			params.options |= defGetBoolean(opt) ? CLUOPT_ANALYZE : 0;
 		else
 			ereport(ERROR,
-					(errcode(ERRCODE_SYNTAX_ERROR),
-					 errmsg("unrecognized %s option \"%s\"",
-							"CLUSTER", opt->defname),
-					 parser_errposition(pstate, opt->location)));
+					errcode(ERRCODE_SYNTAX_ERROR),
+					errmsg("unrecognized %s option \"%s\"",
+						   RepackCommandAsString(stmt->command),
+						   opt->defname),
+					parser_errposition(pstate, opt->location));
 	}
 
-	params.options = (verbose ? CLUOPT_VERBOSE : 0);
-
+	/*
+	 * If a single relation is specified, process it and we're done ... unless
+	 * the relation is a partitioned table, in which case we fall through.
+	 */
 	if (stmt->relation != NULL)
 	{
-		/* This is the single-relation case. */
-		Oid			tableOid;
-
-		/*
-		 * Find, lock, and check permissions on the table.  We obtain
-		 * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
-		 * single-transaction case.
-		 */
-		tableOid = RangeVarGetRelidExtended(stmt->relation,
-											AccessExclusiveLock,
-											0,
-											RangeVarCallbackMaintainsTable,
-											NULL);
-		rel = table_open(tableOid, NoLock);
-
-		/*
-		 * Reject clustering a remote temp table ... their local buffer
-		 * manager is not going to cope.
-		 */
-		if (RELATION_IS_OTHER_TEMP(rel))
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("cannot cluster temporary tables of other sessions")));
-
-		if (stmt->indexname == NULL)
-		{
-			ListCell   *index;
-
-			/* We need to find the index that has indisclustered set. */
-			foreach(index, RelationGetIndexList(rel))
-			{
-				indexOid = lfirst_oid(index);
-				if (get_index_isclustered(indexOid))
-					break;
-				indexOid = InvalidOid;
-			}
-
-			if (!OidIsValid(indexOid))
-				ereport(ERROR,
-						(errcode(ERRCODE_UNDEFINED_OBJECT),
-						 errmsg("there is no previously clustered index for table \"%s\"",
-								stmt->relation->relname)));
-		}
-		else
-		{
-			/*
-			 * The index is expected to be in the same namespace as the
-			 * relation.
-			 */
-			indexOid = get_relname_relid(stmt->indexname,
-										 rel->rd_rel->relnamespace);
-			if (!OidIsValid(indexOid))
-				ereport(ERROR,
-						(errcode(ERRCODE_UNDEFINED_OBJECT),
-						 errmsg("index \"%s\" for table \"%s\" does not exist",
-								stmt->indexname, stmt->relation->relname)));
-		}
-
-		/* For non-partitioned tables, do what we came here to do. */
-		if (rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
-		{
-			cluster_rel(rel, indexOid, &params);
-			/* cluster_rel closes the relation, but keeps lock */
-
-			return;
-		}
+		rel = process_single_relation(stmt, &params);
+		if (rel == NULL)
+			return;				/* all done */
 	}
 
+	/*
+	 * Don't allow ANALYZE in the multiple-relation case for now.  Maybe we
+	 * can add support for this later.
+	 */
+	if (params.options & CLUOPT_ANALYZE)
+		ereport(ERROR,
+				errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("cannot %s multiple tables", "REPACK (ANALYZE)"));
+
 	/*
 	 * By here, we know we are in a multi-table situation.  In order to avoid
 	 * holding locks for too long, we want to process each table in its own
 	 * transaction.  This forces us to disallow running inside a user
 	 * transaction block.
 	 */
-	PreventInTransactionBlock(isTopLevel, "CLUSTER");
+	PreventInTransactionBlock(isTopLevel, RepackCommandAsString(stmt->command));
 
 	/* Also, we need a memory context to hold our list of relations */
-	cluster_context = AllocSetContextCreate(PortalContext,
-											"Cluster",
-											ALLOCSET_DEFAULT_SIZES);
+	repack_context = AllocSetContextCreate(PortalContext,
+										   "Repack",
+										   ALLOCSET_DEFAULT_SIZES);
+
+	params.options |= CLUOPT_RECHECK;
 
 	/*
-	 * Either we're processing a partitioned table, or we were not given any
-	 * table name at all.  In either case, obtain a list of relations to
-	 * process.
-	 *
-	 * In the former case, an index name must have been given, so we don't
-	 * need to recheck its "indisclustered" bit, but we have to check that it
-	 * is an index that we can cluster on.  In the latter case, we set the
-	 * option bit to have indisclustered verified.
-	 *
-	 * Rechecking the relation itself is necessary here in all cases.
+	 * If we don't have a relation yet, determine a relation list.  If we do,
+	 * then it must be a partitioned table, and we want to process its
+	 * partitions.
 	 */
-	params.options |= CLUOPT_RECHECK;
-	if (rel != NULL)
+	if (rel == NULL)
 	{
-		Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
-		check_index_is_clusterable(rel, indexOid, AccessShareLock);
-		rtcs = get_tables_to_cluster_partitioned(cluster_context, indexOid);
-
-		/* close relation, releasing lock on parent table */
-		table_close(rel, AccessExclusiveLock);
+		Assert(stmt->indexname == NULL);
+		rtcs = get_tables_to_repack(stmt->command, stmt->usingindex,
+									repack_context);
+		params.options |= CLUOPT_RECHECK_ISCLUSTERED;
 	}
 	else
 	{
-		rtcs = get_tables_to_cluster(cluster_context);
-		params.options |= CLUOPT_RECHECK_ISCLUSTERED;
+		Oid			relid;
+		bool		rel_is_index;
+
+		Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+
+		/*
+		 * If USING INDEX was specified, resolve the index name now and pass
+		 * it down.
+		 */
+		if (stmt->usingindex)
+		{
+			/*
+			 * If no index name was specified when repacking a partitioned
+			 * table, punt for now.  Maybe we can improve this later.
+			 */
+			if (!stmt->indexname)
+				ereport(ERROR,
+						errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						errmsg("there is no previously clustered index for table \"%s\"",
+							   RelationGetRelationName(rel)));
+
+			relid = determine_clustered_index(rel, stmt->usingindex,
+											  stmt->indexname);
+			if (!OidIsValid(relid))
+				elog(ERROR, "unable to determine index to cluster on");
+			/* XXX is this the right place for this check? */
+			check_index_is_clusterable(rel, relid, AccessExclusiveLock);
+			rel_is_index = true;
+		}
+		else
+		{
+			relid = RelationGetRelid(rel);
+			rel_is_index = false;
+		}
+
+		rtcs = get_tables_to_repack_partitioned(stmt->command,
+												relid, rel_is_index,
+												repack_context);
+
+		/* close parent relation, releasing lock on it */
+		table_close(rel, AccessExclusiveLock);
+		rel = NULL;
 	}
 
-	/* Do the job. */
-	cluster_multiple_rels(rtcs, &params);
-
-	/* Start a new transaction for the cleanup work. */
-	StartTransactionCommand();
-
-	/* Clean up working storage */
-	MemoryContextDelete(cluster_context);
-}
-
-/*
- * Given a list of relations to cluster, process each of them in a separate
- * transaction.
- *
- * We expect to be in a transaction at start, but there isn't one when we
- * return.
- */
-static void
-cluster_multiple_rels(List *rtcs, ClusterParams *params)
-{
-	ListCell   *lc;
-
 	/* Commit to get out of starting transaction */
 	PopActiveSnapshot();
 	CommitTransactionCommand();
 
 	/* Cluster the tables, each in a separate transaction */
-	foreach(lc, rtcs)
+	Assert(rel == NULL);
+	foreach_ptr(RelToCluster, rtc, rtcs)
 	{
-		RelToCluster *rtc = (RelToCluster *) lfirst(lc);
-		Relation	rel;
-
 		/* Start a new transaction for each relation. */
 		StartTransactionCommand();
 
+		/*
+		 * Open the target table, coping with the case where it has been
+		 * dropped.
+		 */
+		rel = try_table_open(rtc->tableOid, AccessExclusiveLock);
+		if (rel == NULL)
+		{
+			CommitTransactionCommand();
+			continue;
+		}
+
 		/* functions in indexes may want a snapshot set */
 		PushActiveSnapshot(GetTransactionSnapshot());
 
-		rel = table_open(rtc->tableOid, AccessExclusiveLock);
-
 		/* Process this table */
-		cluster_rel(rel, rtc->indexOid, params);
+		cluster_rel(stmt->command, rel, rtc->indexOid, &params);
 		/* cluster_rel closes the relation, but keeps lock */
 
 		PopActiveSnapshot();
 		CommitTransactionCommand();
 	}
+
+	/* Start a new transaction for the cleanup work. */
+	StartTransactionCommand();
+
+	/* Clean up working storage */
+	MemoryContextDelete(repack_context);
 }
 
 /*
@@ -304,11 +282,14 @@ cluster_multiple_rels(List *rtcs, ClusterParams *params)
  * them incrementally while we load the table.
  *
  * If indexOid is InvalidOid, the table will be rewritten in physical order
- * instead of index order.  This is the new implementation of VACUUM FULL,
- * and error messages should refer to the operation as VACUUM not CLUSTER.
+ * instead of index order.
+ *
+ * 'cmd' indicates which command is being executed, to be used for error
+ * messages.
  */
 void
-cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
+cluster_rel(RepackCommand cmd, Relation OldHeap, Oid indexOid,
+			ClusterParams *params)
 {
 	Oid			tableOid = RelationGetRelid(OldHeap);
 	Oid			save_userid;
@@ -323,13 +304,8 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
 	/* Check for user-requested abort. */
 	CHECK_FOR_INTERRUPTS();
 
-	pgstat_progress_start_command(PROGRESS_COMMAND_CLUSTER, tableOid);
-	if (OidIsValid(indexOid))
-		pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
-									 PROGRESS_CLUSTER_COMMAND_CLUSTER);
-	else
-		pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
-									 PROGRESS_CLUSTER_COMMAND_VACUUM_FULL);
+	pgstat_progress_start_command(PROGRESS_COMMAND_REPACK, tableOid);
+	pgstat_progress_update_param(PROGRESS_REPACK_COMMAND, cmd);
 
 	/*
 	 * Switch to the table owner's userid, so that any index functions are run
@@ -350,86 +326,38 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
 	 * *must* skip the one on indisclustered since it would reject an attempt
 	 * to cluster a not-previously-clustered index.
 	 */
-	if (recheck)
-	{
-		/* Check that the user still has privileges for the relation */
-		if (!cluster_is_permitted_for_relation(tableOid, save_userid))
-		{
-			relation_close(OldHeap, AccessExclusiveLock);
-			goto out;
-		}
-
-		/*
-		 * Silently skip a temp table for a remote session.  Only doing this
-		 * check in the "recheck" case is appropriate (which currently means
-		 * somebody is executing a database-wide CLUSTER or on a partitioned
-		 * table), because there is another check in cluster() which will stop
-		 * any attempt to cluster remote temp tables by name.  There is
-		 * another check in cluster_rel which is redundant, but we leave it
-		 * for extra safety.
-		 */
-		if (RELATION_IS_OTHER_TEMP(OldHeap))
-		{
-			relation_close(OldHeap, AccessExclusiveLock);
-			goto out;
-		}
-
-		if (OidIsValid(indexOid))
-		{
-			/*
-			 * Check that the index still exists
-			 */
-			if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexOid)))
-			{
-				relation_close(OldHeap, AccessExclusiveLock);
-				goto out;
-			}
-
-			/*
-			 * Check that the index is still the one with indisclustered set,
-			 * if needed.
-			 */
-			if ((params->options & CLUOPT_RECHECK_ISCLUSTERED) != 0 &&
-				!get_index_isclustered(indexOid))
-			{
-				relation_close(OldHeap, AccessExclusiveLock);
-				goto out;
-			}
-		}
-	}
+	if (recheck &&
+		!cluster_rel_recheck(cmd, OldHeap, indexOid, save_userid,
+							 params->options))
+		goto out;
 
 	/*
-	 * We allow VACUUM FULL, but not CLUSTER, on shared catalogs.  CLUSTER
-	 * would work in most respects, but the index would only get marked as
-	 * indisclustered in the current database, leading to unexpected behavior
-	 * if CLUSTER were later invoked in another database.
+	 * We allow repacking shared catalogs only when not using an index. It
+	 * would work to use an index in most respects, but the index would only
+	 * get marked as indisclustered in the current database, leading to
+	 * unexpected behavior if CLUSTER were later invoked in another database.
 	 */
 	if (OidIsValid(indexOid) && OldHeap->rd_rel->relisshared)
 		ereport(ERROR,
-				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-				 errmsg("cannot cluster a shared catalog")));
+				errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("cannot run %s on a shared catalog",
+					   RepackCommandAsString(cmd)));
 
 	/*
 	 * Don't process temp tables of other backends ... their local buffer
 	 * manager is not going to cope.
 	 */
 	if (RELATION_IS_OTHER_TEMP(OldHeap))
-	{
-		if (OidIsValid(indexOid))
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("cannot cluster temporary tables of other sessions")));
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("cannot vacuum temporary tables of other sessions")));
-	}
+		ereport(ERROR,
+				errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("cannot run %s on temporary tables of other sessions",
+					   RepackCommandAsString(cmd)));
 
 	/*
 	 * Also check for active uses of the relation in the current transaction,
 	 * including open scans and pending AFTER trigger events.
 	 */
-	CheckTableNotInUse(OldHeap, OidIsValid(indexOid) ? "CLUSTER" : "VACUUM");
+	CheckTableNotInUse(OldHeap, RepackCommandAsString(cmd));
 
 	/* Check heap and index are valid to cluster on */
 	if (OidIsValid(indexOid))
@@ -442,6 +370,24 @@ cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params)
 	else
 		index = NULL;
 
+	/*
+	 * When allow_system_table_mods is turned off, we disallow repacking a
+	 * catalog on a particular index unless that's already the clustered index
+	 * for that catalog.
+	 *
+	 * XXX We don't check for this in CLUSTER, because it's historically been
+	 * allowed.
+	 */
+	if (cmd != REPACK_COMMAND_CLUSTER &&
+		!allowSystemTableMods && OidIsValid(indexOid) &&
+		IsCatalogRelation(OldHeap) && !index->rd_index->indisclustered)
+		ereport(ERROR,
+				errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				errmsg("permission denied: \"%s\" is a system catalog",
+					   RelationGetRelationName(OldHeap)),
+				errdetail("System catalogs can only be clustered by the index they're already clustered on, if any, unless \"%s\" is enabled.",
+						  "allow_system_table_mods"));
+
 	/*
 	 * Quietly ignore the request if this is a materialized view which has not
 	 * been populated from its query. No harm is done because there is no data
@@ -482,6 +428,63 @@ out:
 	pgstat_progress_end_command();
 }
 
+/*
+ * Check if the table (and its index) still meets the requirements of
+ * cluster_rel().
+ */
+static bool
+cluster_rel_recheck(RepackCommand cmd, Relation OldHeap, Oid indexOid,
+					Oid userid, int options)
+{
+	Oid			tableOid = RelationGetRelid(OldHeap);
+
+	/* Check that the user still has privileges for the relation */
+	if (!cluster_is_permitted_for_relation(cmd, tableOid, userid))
+	{
+		relation_close(OldHeap, AccessExclusiveLock);
+		return false;
+	}
+
+	/*
+	 * Silently skip a temp table for a remote session.  Only doing this check
+	 * in the "recheck" case is appropriate (which currently means somebody is
+	 * executing a database-wide CLUSTER or on a partitioned table), because
+	 * the single-relation path of ExecRepack() already stops any attempt to
+	 * cluster remote temp tables by name.  There is another check in
+	 * cluster_rel which is redundant, but we leave it for extra safety.
+	 */
+	if (RELATION_IS_OTHER_TEMP(OldHeap))
+	{
+		relation_close(OldHeap, AccessExclusiveLock);
+		return false;
+	}
+
+	if (OidIsValid(indexOid))
+	{
+		/*
+		 * Check that the index still exists
+		 */
+		if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexOid)))
+		{
+			relation_close(OldHeap, AccessExclusiveLock);
+			return false;
+		}
+
+		/*
+		 * Check that the index is still the one with indisclustered set, if
+		 * needed.
+		 */
+		if ((options & CLUOPT_RECHECK_ISCLUSTERED) != 0 &&
+			!get_index_isclustered(indexOid))
+		{
+			relation_close(OldHeap, AccessExclusiveLock);
+			return false;
+		}
+	}
+
+	return true;
+}
+
 /*
  * Verify that the specified heap and index are valid to cluster on
  *
@@ -642,8 +645,8 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
 	Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false) &&
 		   (index == NULL || CheckRelationLockedByMe(index, AccessExclusiveLock, false)));
 
-	if (index)
-		/* Mark the correct index as clustered */
+	/* for CLUSTER or REPACK USING INDEX, mark the index as the one to use */
+	if (index != NULL)
 		mark_index_clustered(OldHeap, RelationGetRelid(index), true);
 
 	/* Remember info about rel before closing OldHeap */
@@ -958,20 +961,20 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
 	/* Log what we're doing */
 	if (OldIndex != NULL && !use_sort)
 		ereport(elevel,
-				(errmsg("clustering \"%s.%s\" using index scan on \"%s\"",
-						nspname,
-						RelationGetRelationName(OldHeap),
-						RelationGetRelationName(OldIndex))));
+				errmsg("repacking \"%s.%s\" using index scan on \"%s\"",
+					   nspname,
+					   RelationGetRelationName(OldHeap),
+					   RelationGetRelationName(OldIndex)));
 	else if (use_sort)
 		ereport(elevel,
-				(errmsg("clustering \"%s.%s\" using sequential scan and sort",
-						nspname,
-						RelationGetRelationName(OldHeap))));
+				errmsg("repacking \"%s.%s\" using sequential scan and sort",
+					   nspname,
+					   RelationGetRelationName(OldHeap)));
 	else
 		ereport(elevel,
-				(errmsg("vacuuming \"%s.%s\"",
-						nspname,
-						RelationGetRelationName(OldHeap))));
+				errmsg("repacking \"%s.%s\" in physical order",
+					   nspname,
+					   RelationGetRelationName(OldHeap)));
 
 	/*
 	 * Hand off the actual copying to AM specific function, the generic code
@@ -1458,8 +1461,8 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
 	int			i;
 
 	/* Report that we are now swapping relation files */
-	pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
-								 PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES);
+	pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+								 PROGRESS_REPACK_PHASE_SWAP_REL_FILES);
 
 	/* Zero out possible results from swapped_relation_files */
 	memset(mapped_tables, 0, sizeof(mapped_tables));
@@ -1509,14 +1512,14 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
 		reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
 
 	/* Report that we are now reindexing relations */
-	pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
-								 PROGRESS_CLUSTER_PHASE_REBUILD_INDEX);
+	pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+								 PROGRESS_REPACK_PHASE_REBUILD_INDEX);
 
 	reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
 
 	/* Report that we are now doing clean up */
-	pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
-								 PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP);
+	pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+								 PROGRESS_REPACK_PHASE_FINAL_CLEANUP);
 
 	/*
 	 * If the relation being rebuilt is pg_class, swap_relation_files()
@@ -1632,106 +1635,191 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
 	}
 }
 
-
 /*
- * Get a list of tables that the current user has privileges on and
- * have indisclustered set.  Return the list in a List * of RelToCluster
- * (stored in the specified memory context), each one giving the tableOid
- * and the indexOid on which the table is already clustered.
+ * Determine which relations to process, when REPACK/CLUSTER is called
+ * without specifying a table name.  The exact process depends on whether
+ * USING INDEX was given or not, and in any case we only return tables and
+ * materialized views that the current user has privileges to repack/cluster.
+ *
+ * If USING INDEX was given, we scan pg_index to find the indexes that have
+ * indisclustered set; if it was not given, we scan pg_class and return all
+ * tables and materialized views.
+ *
+ * Return them as a list of RelToCluster in the given memory context.
  */
 static List *
-get_tables_to_cluster(MemoryContext cluster_context)
+get_tables_to_repack(RepackCommand cmd, bool usingindex, MemoryContext permcxt)
 {
-	Relation	indRelation;
+	Relation	catalog;
 	TableScanDesc scan;
-	ScanKeyData entry;
-	HeapTuple	indexTuple;
-	Form_pg_index index;
-	MemoryContext old_context;
+	HeapTuple	tuple;
 	List	   *rtcs = NIL;
 
-	/*
-	 * Get all indexes that have indisclustered set and that the current user
-	 * has the appropriate privileges for.
-	 */
-	indRelation = table_open(IndexRelationId, AccessShareLock);
-	ScanKeyInit(&entry,
-				Anum_pg_index_indisclustered,
-				BTEqualStrategyNumber, F_BOOLEQ,
-				BoolGetDatum(true));
-	scan = table_beginscan_catalog(indRelation, 1, &entry);
-	while ((indexTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	if (usingindex)
 	{
-		RelToCluster *rtc;
+		ScanKeyData entry;
 
-		index = (Form_pg_index) GETSTRUCT(indexTuple);
+		catalog = table_open(IndexRelationId, AccessShareLock);
+		ScanKeyInit(&entry,
+					Anum_pg_index_indisclustered,
+					BTEqualStrategyNumber, F_BOOLEQ,
+					BoolGetDatum(true));
+		scan = table_beginscan_catalog(catalog, 1, &entry);
+		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+		{
+			RelToCluster *rtc;
+			Form_pg_index index;
+			MemoryContext oldcxt;
 
-		if (!cluster_is_permitted_for_relation(index->indrelid, GetUserId()))
-			continue;
+			index = (Form_pg_index) GETSTRUCT(tuple);
 
-		/* Use a permanent memory context for the result list */
-		old_context = MemoryContextSwitchTo(cluster_context);
+			/*
+			 * Try to obtain a light lock on the index's table, to ensure it
+			 * doesn't go away while we collect the list.  If we cannot, just
+			 * disregard it.
+			 */
+			if (!ConditionalLockRelationOid(index->indrelid, AccessShareLock))
+				continue;
 
-		rtc = palloc_object(RelToCluster);
-		rtc->tableOid = index->indrelid;
-		rtc->indexOid = index->indexrelid;
-		rtcs = lappend(rtcs, rtc);
+			/* Verify that the table still exists */
+			if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(index->indrelid)))
+			{
+				/* Release useless lock */
+				UnlockRelationOid(index->indrelid, AccessShareLock);
+				continue;
+			}
 
-		MemoryContextSwitchTo(old_context);
+			if (!cluster_is_permitted_for_relation(cmd, index->indrelid,
+												   GetUserId()))
+				continue;
+
+			/* Use a permanent memory context for the result list */
+			oldcxt = MemoryContextSwitchTo(permcxt);
+			rtc = palloc_object(RelToCluster);
+			rtc->tableOid = index->indrelid;
+			rtc->indexOid = index->indexrelid;
+			rtcs = lappend(rtcs, rtc);
+			MemoryContextSwitchTo(oldcxt);
+		}
 	}
-	table_endscan(scan);
+	else
+	{
+		catalog = table_open(RelationRelationId, AccessShareLock);
+		scan = table_beginscan_catalog(catalog, 0, NULL);
 
-	relation_close(indRelation, AccessShareLock);
+		while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+		{
+			RelToCluster *rtc;
+			Form_pg_class class;
+			MemoryContext oldcxt;
+
+			class = (Form_pg_class) GETSTRUCT(tuple);
+
+			/*
+			 * Try to obtain a light lock on the table, to ensure it doesn't
+			 * go away while we collect the list.  If we cannot, just
+			 * disregard the table.
+			 */
+			if (!ConditionalLockRelationOid(class->oid, AccessShareLock))
+				continue;
+
+			/* Verify that the table still exists */
+			if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(class->oid)))
+			{
+				/* Release useless lock */
+				UnlockRelationOid(class->oid, AccessShareLock);
+				continue;
+			}
+
+			/* Can only process plain tables and matviews */
+			if (class->relkind != RELKIND_RELATION &&
+				class->relkind != RELKIND_MATVIEW)
+				continue;
+
+			/* noisily skip rels which the user can't process */
+			if (!cluster_is_permitted_for_relation(cmd, class->oid,
+												   GetUserId()))
+				continue;
+
+			/* Use a permanent memory context for the result list */
+			oldcxt = MemoryContextSwitchTo(permcxt);
+			rtc = palloc_object(RelToCluster);
+			rtc->tableOid = class->oid;
+			rtc->indexOid = InvalidOid;
+			rtcs = lappend(rtcs, rtc);
+			MemoryContextSwitchTo(oldcxt);
+		}
+	}
+
+	table_endscan(scan);
+	relation_close(catalog, AccessShareLock);
 
 	return rtcs;
 }
 
 /*
- * Given an index on a partitioned table, return a list of RelToCluster for
+ * Given a partitioned table or its index, return a list of RelToCluster for
  * all the children leaves tables/indexes.
  *
  * Like expand_vacuum_rel, but here caller must hold AccessExclusiveLock
  * on the table containing the index.
+ *
+ * 'rel_is_index' tells whether 'relid' is that of an index (true) or of the
+ * owning relation.
  */
 static List *
-get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
+get_tables_to_repack_partitioned(RepackCommand cmd, Oid relid,
+								 bool rel_is_index, MemoryContext permcxt)
 {
 	List	   *inhoids;
-	ListCell   *lc;
 	List	   *rtcs = NIL;
-	MemoryContext old_context;
 
-	/* Do not lock the children until they're processed */
-	inhoids = find_all_inheritors(indexOid, NoLock, NULL);
-
-	foreach(lc, inhoids)
+	/*
+	 * Do not lock the children until they're processed.  Note that we do hold
+	 * a lock on the parent partitioned table.
+	 */
+	inhoids = find_all_inheritors(relid, NoLock, NULL);
+	foreach_oid(child_oid, inhoids)
 	{
-		Oid			indexrelid = lfirst_oid(lc);
-		Oid			relid = IndexGetRelation(indexrelid, false);
+		Oid			table_oid,
+					index_oid;
 		RelToCluster *rtc;
+		MemoryContext oldcxt;
 
-		/* consider only leaf indexes */
-		if (get_rel_relkind(indexrelid) != RELKIND_INDEX)
-			continue;
+		if (rel_is_index)
+		{
+			/* consider only leaf indexes */
+			if (get_rel_relkind(child_oid) != RELKIND_INDEX)
+				continue;
+
+			table_oid = IndexGetRelation(child_oid, false);
+			index_oid = child_oid;
+		}
+		else
+		{
+			/* consider only leaf relations */
+			if (get_rel_relkind(child_oid) != RELKIND_RELATION)
+				continue;
+
+			table_oid = child_oid;
+			index_oid = InvalidOid;
+		}
 
 		/*
 		 * It's possible that the user does not have privileges to CLUSTER the
-		 * leaf partition despite having such privileges on the partitioned
-		 * table.  We skip any partitions which the user is not permitted to
-		 * CLUSTER.
+		 * leaf partition despite having them on the partitioned table.  Skip
+		 * if so.
 		 */
-		if (!cluster_is_permitted_for_relation(relid, GetUserId()))
+		if (!cluster_is_permitted_for_relation(cmd, table_oid, GetUserId()))
 			continue;
 
 		/* Use a permanent memory context for the result list */
-		old_context = MemoryContextSwitchTo(cluster_context);
-
+		oldcxt = MemoryContextSwitchTo(permcxt);
 		rtc = palloc_object(RelToCluster);
-		rtc->tableOid = relid;
-		rtc->indexOid = indexrelid;
+		rtc->tableOid = table_oid;
+		rtc->indexOid = index_oid;
 		rtcs = lappend(rtcs, rtc);
-
-		MemoryContextSwitchTo(old_context);
+		MemoryContextSwitchTo(oldcxt);
 	}
 
 	return rtcs;
@@ -1742,13 +1830,167 @@ get_tables_to_cluster_partitioned(MemoryContext cluster_context, Oid indexOid)
  * function emits a WARNING.
  */
 static bool
-cluster_is_permitted_for_relation(Oid relid, Oid userid)
+cluster_is_permitted_for_relation(RepackCommand cmd, Oid relid, Oid userid)
 {
+	Assert(cmd == REPACK_COMMAND_CLUSTER || cmd == REPACK_COMMAND_REPACK);
+
 	if (pg_class_aclcheck(relid, userid, ACL_MAINTAIN) == ACLCHECK_OK)
 		return true;
 
 	ereport(WARNING,
-			(errmsg("permission denied to cluster \"%s\", skipping it",
-					get_rel_name(relid))));
+			errmsg("permission denied to execute %s on \"%s\", skipping it",
+				   RepackCommandAsString(cmd),
+				   get_rel_name(relid)));
+
 	return false;
 }
+
+
+/*
+ * Given a RepackStmt with an indicated relation name, resolve the relation
+ * name, obtain lock on it, then determine what to do based on the relation
+ * type: if it's a table and not partitioned, repack it as indicated (using an
+ * existing clustered index, or following the given one), and return NULL.
+ *
+ * On the other hand, if the table is partitioned, do nothing further and
+ * instead return the opened and locked relcache entry, so that caller can
+ * process the partitions using the multiple-table handling code.  In this
+ * case, if an index name is given, it's up to the caller to resolve it.
+ */
+static Relation
+process_single_relation(RepackStmt *stmt, ClusterParams *params)
+{
+	Relation	rel;
+	Oid			tableOid;
+
+	Assert(stmt->relation != NULL);
+	Assert(stmt->command == REPACK_COMMAND_CLUSTER ||
+		   stmt->command == REPACK_COMMAND_REPACK);
+
+	/*
+	 * Find, lock, and check permissions on the table.  We obtain
+	 * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
+	 * single-transaction case.
+	 */
+	tableOid = RangeVarGetRelidExtended(stmt->relation->relation,
+										AccessExclusiveLock,
+										0,
+										RangeVarCallbackMaintainsTable,
+										NULL);
+	rel = table_open(tableOid, NoLock);
+
+	/*
+	 * Reject temporary tables of other sessions: their contents live in the
+	 * owning backend's local buffers, which we cannot access.
+	 */
+	if (RELATION_IS_OTHER_TEMP(rel))
+		ereport(ERROR,
+				errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("cannot execute %s on temporary tables of other sessions",
+					   RepackCommandAsString(stmt->command)));
+
+	/*
+	 * Make sure ANALYZE is specified if a column list is present.
+	 */
+	if ((params->options & CLUOPT_ANALYZE) == 0 && stmt->relation->va_cols != NIL)
+		ereport(ERROR,
+				errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("ANALYZE option must be specified when a column list is provided"));
+
+	/*
+	 * For partitioned tables, let caller handle this.  Otherwise, process it
+	 * here and we're done.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		return rel;
+	else
+	{
+		Oid			indexOid;
+
+		indexOid = determine_clustered_index(rel, stmt->usingindex,
+											 stmt->indexname);
+		if (OidIsValid(indexOid))
+			check_index_is_clusterable(rel, indexOid, AccessExclusiveLock);
+		cluster_rel(stmt->command, rel, indexOid, params);
+
+		/* Do an analyze, if requested */
+		if (params->options & CLUOPT_ANALYZE)
+		{
+			VacuumParams vac_params = {0};
+
+			vac_params.options |= VACOPT_ANALYZE;
+			if (params->options & CLUOPT_VERBOSE)
+				vac_params.options |= VACOPT_VERBOSE;
+			analyze_rel(tableOid, NULL, vac_params,
+						stmt->relation->va_cols, true, NULL);
+		}
+
+		return NULL;
+	}
+}
+
+/*
+ * Given a relation and the usingindex/indexname options in a
+ * REPACK USING INDEX or CLUSTER command, return the OID of the
+ * index to use for clustering the table.
+ *
+ * Caller must hold lock on the relation so that the set of indexes
+ * doesn't change, and must call check_index_is_clusterable.
+ */
+static Oid
+determine_clustered_index(Relation rel, bool usingindex, const char *indexname)
+{
+	Oid			indexOid;
+
+	if (indexname == NULL && usingindex)
+	{
+		/*
+		 * If USING INDEX with no name is given, find a clustered index, or
+		 * error out if none.
+		 */
+		indexOid = InvalidOid;
+		foreach_oid(idxoid, RelationGetIndexList(rel))
+		{
+			if (get_index_isclustered(idxoid))
+			{
+				indexOid = idxoid;
+				break;
+			}
+		}
+
+		if (!OidIsValid(indexOid))
+			ereport(ERROR,
+					errcode(ERRCODE_UNDEFINED_OBJECT),
+					errmsg("there is no previously clustered index for table \"%s\"",
+						   RelationGetRelationName(rel)));
+	}
+	else if (indexname != NULL)
+	{
+		/* An index was specified; obtain its OID. */
+		indexOid = get_relname_relid(indexname, rel->rd_rel->relnamespace);
+		if (!OidIsValid(indexOid))
+			ereport(ERROR,
+					errcode(ERRCODE_UNDEFINED_OBJECT),
+					errmsg("index \"%s\" for table \"%s\" does not exist",
+						   indexname, RelationGetRelationName(rel)));
+	}
+	else
+		indexOid = InvalidOid;
+
+	return indexOid;
+}
+
+static const char *
+RepackCommandAsString(RepackCommand cmd)
+{
+	switch (cmd)
+	{
+		case REPACK_COMMAND_REPACK:
+			return "REPACK";
+		case REPACK_COMMAND_VACUUMFULL:
+			return "VACUUM";
+		case REPACK_COMMAND_CLUSTER:
+			return "CLUSTER";
+	}
+	return "???";	/* keep compiler quiet */
+}
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 03932f45c8a..aea998260e1 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -351,7 +351,6 @@ ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel)
 		}
 	}
 
-
 	/*
 	 * Sanity check DISABLE_PAGE_SKIPPING option.
 	 */
@@ -2289,8 +2288,9 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams params,
 			if ((params.options & VACOPT_VERBOSE) != 0)
 				cluster_params.options |= CLUOPT_VERBOSE;
 
-			/* VACUUM FULL is now a variant of CLUSTER; see cluster.c */
-			cluster_rel(rel, InvalidOid, &cluster_params);
+			/* VACUUM FULL is a variant of REPACK; see cluster.c */
+			cluster_rel(REPACK_COMMAND_VACUUMFULL, rel, InvalidOid,
+						&cluster_params);
 			/* cluster_rel closes the relation, but keeps lock */
 
 			rel = NULL;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 713ee5c10a2..54d37c10447 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -287,7 +287,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		AlterCompositeTypeStmt AlterUserMappingStmt
 		AlterRoleStmt AlterRoleSetStmt AlterPolicyStmt AlterStatsStmt
 		AlterDefaultPrivilegesStmt DefACLAction
-		AnalyzeStmt CallStmt ClosePortalStmt ClusterStmt CommentStmt
+		AnalyzeStmt CallStmt ClosePortalStmt CommentStmt
 		ConstraintsSetStmt CopyStmt CreateAsStmt CreateCastStmt
 		CreateDomainStmt CreateExtensionStmt CreateGroupStmt CreateOpClassStmt
 		CreateOpFamilyStmt AlterOpFamilyStmt CreatePLangStmt
@@ -304,7 +304,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 		GrantStmt GrantRoleStmt ImportForeignSchemaStmt IndexStmt InsertStmt
 		ListenStmt LoadStmt LockStmt MergeStmt NotifyStmt ExplainableStmt PreparableStmt
 		CreateFunctionStmt AlterFunctionStmt ReindexStmt RemoveAggrStmt
-		RemoveFuncStmt RemoveOperStmt RenameStmt ReturnStmt RevokeStmt RevokeRoleStmt
+		RemoveFuncStmt RemoveOperStmt RenameStmt RepackStmt ReturnStmt RevokeStmt RevokeRoleStmt
 		RuleActionStmt RuleActionStmtOrEmpty RuleStmt
 		SecLabelStmt SelectStmt TransactionStmt TransactionStmtLegacy TruncateStmt
 		UnlistenStmt UpdateStmt VacuumStmt
@@ -323,7 +323,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 
 %type <str>			opt_single_name
 %type <list>		opt_qualified_name
-%type <boolean>		opt_concurrently
+%type <boolean>		opt_concurrently opt_usingindex
 %type <dbehavior>	opt_drop_behavior
 %type <list>		opt_utility_option_list
 %type <list>		opt_wait_with_clause
@@ -773,7 +773,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	QUOTE QUOTES
 
 	RANGE READ REAL REASSIGN RECURSIVE REF_P REFERENCES REFERENCING
-	REFRESH REINDEX RELATIVE_P RELEASE RENAME REPEATABLE REPLACE REPLICA
+	REFRESH REINDEX RELATIVE_P RELEASE RENAME REPACK REPEATABLE REPLACE REPLICA
 	RESET RESPECT_P RESTART RESTRICT RETURN RETURNING RETURNS REVOKE RIGHT ROLE ROLLBACK ROLLUP
 	ROUTINE ROUTINES ROW ROWS RULE
 
@@ -1035,7 +1035,6 @@ stmt:
 			| CallStmt
 			| CheckPointStmt
 			| ClosePortalStmt
-			| ClusterStmt
 			| CommentStmt
 			| ConstraintsSetStmt
 			| CopyStmt
@@ -1109,6 +1108,7 @@ stmt:
 			| RemoveFuncStmt
 			| RemoveOperStmt
 			| RenameStmt
+			| RepackStmt
 			| RevokeStmt
 			| RevokeRoleStmt
 			| RuleStmt
@@ -1146,6 +1146,11 @@ opt_concurrently:
 			| /*EMPTY*/						{ $$ = false; }
 		;
 
+opt_usingindex:
+			USING INDEX						{ $$ = true; }
+			| /* EMPTY */					{ $$ = false; }
+		;
+
 opt_drop_behavior:
 			CASCADE							{ $$ = DROP_CASCADE; }
 			| RESTRICT						{ $$ = DROP_RESTRICT; }
@@ -12036,38 +12041,82 @@ CreateConversionStmt:
 /*****************************************************************************
  *
  *		QUERY:
+ *				REPACK [ (options) ] [ <qualified_name> [ <name_list> ] [ USING INDEX <index_name> ] ]
+ *
+ *			obsolete variants:
  *				CLUSTER (options) [ <qualified_name> [ USING <index_name> ] ]
  *				CLUSTER [VERBOSE] [ <qualified_name> [ USING <index_name> ] ]
  *				CLUSTER [VERBOSE] <index_name> ON <qualified_name> (for pre-8.3)
  *
  *****************************************************************************/
 
-ClusterStmt:
-			CLUSTER '(' utility_option_list ')' qualified_name cluster_index_specification
+RepackStmt:
+			REPACK opt_utility_option_list vacuum_relation USING INDEX name
 				{
-					ClusterStmt *n = makeNode(ClusterStmt);
+					RepackStmt *n = makeNode(RepackStmt);
 
-					n->relation = $5;
+					n->command = REPACK_COMMAND_REPACK;
+					n->relation = (VacuumRelation *) $3;
 					n->indexname = $6;
+					n->usingindex = true;
+					n->params = $2;
+					$$ = (Node *) n;
+				}
+			| REPACK opt_utility_option_list vacuum_relation opt_usingindex
+				{
+					RepackStmt *n = makeNode(RepackStmt);
+
+					n->command = REPACK_COMMAND_REPACK;
+					n->relation = (VacuumRelation *) $3;
+					n->indexname = NULL;
+					n->usingindex = $4;
+					n->params = $2;
+					$$ = (Node *) n;
+				}
+			| REPACK opt_utility_option_list opt_usingindex
+				{
+					RepackStmt *n = makeNode(RepackStmt);
+
+					n->command = REPACK_COMMAND_REPACK;
+					n->relation = NULL;
+					n->indexname = NULL;
+					n->usingindex = $3;
+					n->params = $2;
+					$$ = (Node *) n;
+				}
+			| CLUSTER '(' utility_option_list ')' qualified_name cluster_index_specification
+				{
+					RepackStmt *n = makeNode(RepackStmt);
+
+					n->command = REPACK_COMMAND_CLUSTER;
+					n->relation = makeNode(VacuumRelation);
+					n->relation->relation = $5;
+					n->indexname = $6;
+					n->usingindex = true;
 					n->params = $3;
 					$$ = (Node *) n;
 				}
 			| CLUSTER opt_utility_option_list
 				{
-					ClusterStmt *n = makeNode(ClusterStmt);
+					RepackStmt *n = makeNode(RepackStmt);
 
+					n->command = REPACK_COMMAND_CLUSTER;
 					n->relation = NULL;
 					n->indexname = NULL;
+					n->usingindex = true;
 					n->params = $2;
 					$$ = (Node *) n;
 				}
 			/* unparenthesized VERBOSE kept for pre-14 compatibility */
 			| CLUSTER opt_verbose qualified_name cluster_index_specification
 				{
-					ClusterStmt *n = makeNode(ClusterStmt);
+					RepackStmt *n = makeNode(RepackStmt);
 
-					n->relation = $3;
+					n->command = REPACK_COMMAND_CLUSTER;
+					n->relation = makeNode(VacuumRelation);
+					n->relation->relation = $3;
 					n->indexname = $4;
+					n->usingindex = true;
 					if ($2)
 						n->params = list_make1(makeDefElem("verbose", NULL, @2));
 					$$ = (Node *) n;
@@ -12075,20 +12124,25 @@ ClusterStmt:
 			/* unparenthesized VERBOSE kept for pre-17 compatibility */
 			| CLUSTER VERBOSE
 				{
-					ClusterStmt *n = makeNode(ClusterStmt);
+					RepackStmt *n = makeNode(RepackStmt);
 
+					n->command = REPACK_COMMAND_CLUSTER;
 					n->relation = NULL;
 					n->indexname = NULL;
+					n->usingindex = true;
 					n->params = list_make1(makeDefElem("verbose", NULL, @2));
 					$$ = (Node *) n;
 				}
 			/* kept for pre-8.3 compatibility */
 			| CLUSTER opt_verbose name ON qualified_name
 				{
-					ClusterStmt *n = makeNode(ClusterStmt);
+					RepackStmt *n = makeNode(RepackStmt);
 
-					n->relation = $5;
+					n->command = REPACK_COMMAND_CLUSTER;
+					n->relation = makeNode(VacuumRelation);
+					n->relation->relation = $5;
 					n->indexname = $3;
+					n->usingindex = true;
 					if ($2)
 						n->params = list_make1(makeDefElem("verbose", NULL, @2));
 					$$ = (Node *) n;
@@ -18127,6 +18181,7 @@ unreserved_keyword:
 			| RELATIVE_P
 			| RELEASE
 			| RENAME
+			| REPACK
 			| REPEATABLE
 			| REPLACE
 			| REPLICA
@@ -18764,6 +18819,7 @@ bare_label_keyword:
 			| RELATIVE_P
 			| RELEASE
 			| RENAME
+			| REPACK
 			| REPEATABLE
 			| REPLACE
 			| REPLICA
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 34dd6e18df5..ca737b05115 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -279,9 +279,9 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 				return COMMAND_OK_IN_RECOVERY | COMMAND_OK_IN_READ_ONLY_TXN;
 			}
 
-		case T_ClusterStmt:
 		case T_ReindexStmt:
 		case T_VacuumStmt:
+		case T_RepackStmt:
 			{
 				/*
 				 * These commands write WAL, so they're not strictly
@@ -856,14 +856,14 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 			ExecuteCallStmt(castNode(CallStmt, parsetree), params, isAtomicContext, dest);
 			break;
 
-		case T_ClusterStmt:
-			cluster(pstate, (ClusterStmt *) parsetree, isTopLevel);
-			break;
-
 		case T_VacuumStmt:
 			ExecVacuum(pstate, (VacuumStmt *) parsetree, isTopLevel);
 			break;
 
+		case T_RepackStmt:
+			ExecRepack(pstate, (RepackStmt *) parsetree, isTopLevel);
+			break;
+
 		case T_ExplainStmt:
 			ExplainQuery(pstate, (ExplainStmt *) parsetree, params, dest);
 			break;
@@ -2865,10 +2865,6 @@ CreateCommandTag(Node *parsetree)
 			tag = CMDTAG_CALL;
 			break;
 
-		case T_ClusterStmt:
-			tag = CMDTAG_CLUSTER;
-			break;
-
 		case T_VacuumStmt:
 			if (((VacuumStmt *) parsetree)->is_vacuumcmd)
 				tag = CMDTAG_VACUUM;
@@ -2876,6 +2872,13 @@ CreateCommandTag(Node *parsetree)
 				tag = CMDTAG_ANALYZE;
 			break;
 
+		case T_RepackStmt:
+			if (((RepackStmt *) parsetree)->command == REPACK_COMMAND_CLUSTER)
+				tag = CMDTAG_CLUSTER;
+			else
+				tag = CMDTAG_REPACK;
+			break;
+
 		case T_ExplainStmt:
 			tag = CMDTAG_EXPLAIN;
 			break;
@@ -3517,7 +3520,7 @@ GetCommandLogLevel(Node *parsetree)
 			lev = LOGSTMT_ALL;
 			break;
 
-		case T_ClusterStmt:
+		case T_RepackStmt:
 			lev = LOGSTMT_DDL;
 			break;
 
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 73ca0bb0b7f..55a69bf681d 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -287,8 +287,8 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
 		cmdtype = PROGRESS_COMMAND_VACUUM;
 	else if (pg_strcasecmp(cmd, "ANALYZE") == 0)
 		cmdtype = PROGRESS_COMMAND_ANALYZE;
-	else if (pg_strcasecmp(cmd, "CLUSTER") == 0)
-		cmdtype = PROGRESS_COMMAND_CLUSTER;
+	else if (pg_strcasecmp(cmd, "REPACK") == 0)
+		cmdtype = PROGRESS_COMMAND_REPACK;
 	else if (pg_strcasecmp(cmd, "CREATE INDEX") == 0)
 		cmdtype = PROGRESS_COMMAND_CREATE_INDEX;
 	else if (pg_strcasecmp(cmd, "BASEBACKUP") == 0)
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 8b91bc00062..2a1bb47ff03 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -1267,7 +1267,7 @@ static const char *const sql_commands[] = {
 	"DELETE FROM", "DISCARD", "DO", "DROP", "END", "EXECUTE", "EXPLAIN",
 	"FETCH", "GRANT", "IMPORT FOREIGN SCHEMA", "INSERT INTO", "LISTEN", "LOAD", "LOCK",
 	"MERGE INTO", "MOVE", "NOTIFY", "PREPARE",
-	"REASSIGN", "REFRESH MATERIALIZED VIEW", "REINDEX", "RELEASE",
+	"REASSIGN", "REFRESH MATERIALIZED VIEW", "REINDEX", "RELEASE", "REPACK",
 	"RESET", "REVOKE", "ROLLBACK",
 	"SAVEPOINT", "SECURITY LABEL", "SELECT", "SET", "SHOW", "START",
 	"TABLE", "TRUNCATE", "UNLISTEN", "UPDATE", "VACUUM", "VALUES",
@@ -5086,6 +5086,46 @@ match_previous_words(int pattern_id,
 			COMPLETE_WITH_QUERY(Query_for_list_of_tablespaces);
 	}
 
+/* REPACK */
+	else if (Matches("REPACK"))
+		COMPLETE_WITH_SCHEMA_QUERY_PLUS(Query_for_list_of_clusterables,
+										"(", "USING INDEX");
+	else if (Matches("REPACK", "(*)"))
+		COMPLETE_WITH_SCHEMA_QUERY_PLUS(Query_for_list_of_clusterables,
+										"USING INDEX");
+	else if (Matches("REPACK", MatchAnyExcept("(")))
+		COMPLETE_WITH("USING INDEX");
+	else if (Matches("REPACK", "(*)", MatchAnyExcept("(")))
+		COMPLETE_WITH("USING INDEX");
+	else if (Matches("REPACK", MatchAny, "USING", "INDEX") ||
+			 Matches("REPACK", "(*)", MatchAny, "USING", "INDEX"))
+	{
+		set_completion_reference(prev3_wd);
+		COMPLETE_WITH_SCHEMA_QUERY(Query_for_index_of_table);
+	}
+	/*
+	 * Complete ... [ (*) ] <sth> USING INDEX, with a list of indexes for
+	 * <sth>.
+	 */
+	else if (TailMatches(MatchAny, "USING", "INDEX"))
+	{
+		set_completion_reference(prev3_wd);
+		COMPLETE_WITH_SCHEMA_QUERY(Query_for_index_of_table);
+	}
+	else if (HeadMatches("REPACK", "(*") &&
+			 !HeadMatches("REPACK", "(*)"))
+	{
+		/*
+		 * This fires if we're in an unfinished parenthesized option list.
+		 * get_previous_words treats a completed parenthesized option list as
+		 * one word, so the above test is correct.
+		 */
+		if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
+			COMPLETE_WITH("ANALYZE", "VERBOSE");
+		else if (TailMatches("ANALYZE|VERBOSE"))
+			COMPLETE_WITH("ON", "OFF");
+	}
+
 /* SECURITY LABEL */
 	else if (Matches("SECURITY"))
 		COMPLETE_WITH("LABEL");
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 8ea81622f9d..28741988478 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -24,6 +24,7 @@
 #define CLUOPT_RECHECK 0x02		/* recheck relation state */
 #define CLUOPT_RECHECK_ISCLUSTERED 0x04 /* recheck relation state for
 										 * indisclustered */
+#define CLUOPT_ANALYZE 0x08		/* do an ANALYZE */
 
 /* options for CLUSTER */
 typedef struct ClusterParams
@@ -31,8 +32,11 @@ typedef struct ClusterParams
 	bits32		options;		/* bitmask of CLUOPT_* */
 } ClusterParams;
 
-extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Relation OldHeap, Oid indexOid, ClusterParams *params);
+
+extern void ExecRepack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel);
+
+extern void cluster_rel(RepackCommand command, Relation OldHeap, Oid indexOid,
+						ClusterParams *params);
 extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
 									   LOCKMODE lockmode);
 extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 359221dc296..f00e39b937d 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -73,28 +73,34 @@
 #define PROGRESS_ANALYZE_STARTED_BY_MANUAL			1
 #define PROGRESS_ANALYZE_STARTED_BY_AUTOVACUUM		2
 
-/* Progress parameters for cluster */
-#define PROGRESS_CLUSTER_COMMAND				0
-#define PROGRESS_CLUSTER_PHASE					1
-#define PROGRESS_CLUSTER_INDEX_RELID			2
-#define PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED	3
-#define PROGRESS_CLUSTER_HEAP_TUPLES_WRITTEN	4
-#define PROGRESS_CLUSTER_TOTAL_HEAP_BLKS		5
-#define PROGRESS_CLUSTER_HEAP_BLKS_SCANNED		6
-#define PROGRESS_CLUSTER_INDEX_REBUILD_COUNT	7
+/*
+ * Progress parameters for REPACK.
+ *
+ * Values for PROGRESS_REPACK_COMMAND are defined as in RepackCommand.
+ *
+ * Note: Since REPACK shares code with CLUSTER, these values are also
+ * used by CLUSTER. (CLUSTER is now deprecated, so it makes little sense to
+ * introduce a separate set of constants.)
+ */
+#define PROGRESS_REPACK_COMMAND					0
+#define PROGRESS_REPACK_PHASE					1
+#define PROGRESS_REPACK_INDEX_RELID				2
+#define PROGRESS_REPACK_HEAP_TUPLES_SCANNED		3
+#define PROGRESS_REPACK_HEAP_TUPLES_WRITTEN		4
+#define PROGRESS_REPACK_TOTAL_HEAP_BLKS			5
+#define PROGRESS_REPACK_HEAP_BLKS_SCANNED		6
+#define PROGRESS_REPACK_INDEX_REBUILD_COUNT		7
 
-/* Phases of cluster (as advertised via PROGRESS_CLUSTER_PHASE) */
-#define PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP	1
-#define PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP	2
-#define PROGRESS_CLUSTER_PHASE_SORT_TUPLES		3
-#define PROGRESS_CLUSTER_PHASE_WRITE_NEW_HEAP	4
-#define PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES	5
-#define PROGRESS_CLUSTER_PHASE_REBUILD_INDEX	6
-#define PROGRESS_CLUSTER_PHASE_FINAL_CLEANUP	7
-
-/* Commands of PROGRESS_CLUSTER */
-#define PROGRESS_CLUSTER_COMMAND_CLUSTER		1
-#define PROGRESS_CLUSTER_COMMAND_VACUUM_FULL	2
+/*
+ * Phases of repack (as advertised via PROGRESS_REPACK_PHASE).
+ */
+#define PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP		1
+#define PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP	2
+#define PROGRESS_REPACK_PHASE_SORT_TUPLES		3
+#define PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP	4
+#define PROGRESS_REPACK_PHASE_SWAP_REL_FILES	5
+#define PROGRESS_REPACK_PHASE_REBUILD_INDEX		6
+#define PROGRESS_REPACK_PHASE_FINAL_CLEANUP		7
 
 /* Progress parameters for CREATE INDEX */
 /* 3, 4 and 5 reserved for "waitfor" metrics */
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 646d6ced763..c15ba5e6f29 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -3980,18 +3980,6 @@ typedef struct AlterSystemStmt
 	VariableSetStmt *setstmt;	/* SET subcommand */
 } AlterSystemStmt;
 
-/* ----------------------
- *		Cluster Statement (support pbrown's cluster index implementation)
- * ----------------------
- */
-typedef struct ClusterStmt
-{
-	NodeTag		type;
-	RangeVar   *relation;		/* relation being indexed, or NULL if all */
-	char	   *indexname;		/* original index defined */
-	List	   *params;			/* list of DefElem nodes */
-} ClusterStmt;
-
 /* ----------------------
  *		Vacuum and Analyze Statements
  *
@@ -4004,7 +3992,7 @@ typedef struct VacuumStmt
 	NodeTag		type;
 	List	   *options;		/* list of DefElem nodes */
 	List	   *rels;			/* list of VacuumRelation, or NIL for all */
-	bool		is_vacuumcmd;	/* true for VACUUM, false for ANALYZE */
+	bool		is_vacuumcmd;	/* true for VACUUM, false otherwise */
 } VacuumStmt;
 
 /*
@@ -4022,6 +4010,27 @@ typedef struct VacuumRelation
 	List	   *va_cols;		/* list of column names, or NIL for all */
 } VacuumRelation;
 
+/* ----------------------
+ *		Repack Statement
+ * ----------------------
+ */
+typedef enum RepackCommand
+{
+	REPACK_COMMAND_CLUSTER = 1,
+	REPACK_COMMAND_REPACK,
+	REPACK_COMMAND_VACUUMFULL,
+} RepackCommand;
+
+typedef struct RepackStmt
+{
+	NodeTag		type;
+	RepackCommand command;		/* type of command being run */
+	VacuumRelation *relation;	/* relation to repack; NULL if all */
+	char	   *indexname;		/* order tuples by this index, if any */
+	bool		usingindex;		/* whether USING INDEX is specified */
+	List	   *params;			/* list of DefElem nodes */
+} RepackStmt;
+
 /* ----------------------
  *		Explain Statement
  *
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index f7753c5c8a8..6f74a8c05c7 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -377,6 +377,7 @@ PG_KEYWORD("reindex", REINDEX, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("relative", RELATIVE_P, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("release", RELEASE, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("rename", RENAME, UNRESERVED_KEYWORD, BARE_LABEL)
+PG_KEYWORD("repack", REPACK, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("repeatable", REPEATABLE, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("replace", REPLACE, UNRESERVED_KEYWORD, BARE_LABEL)
 PG_KEYWORD("replica", REPLICA, UNRESERVED_KEYWORD, BARE_LABEL)
diff --git a/src/include/tcop/cmdtaglist.h b/src/include/tcop/cmdtaglist.h
index 1290c9bab68..652dc61b834 100644
--- a/src/include/tcop/cmdtaglist.h
+++ b/src/include/tcop/cmdtaglist.h
@@ -196,6 +196,7 @@ PG_CMDTAG(CMDTAG_REASSIGN_OWNED, "REASSIGN OWNED", false, false, false)
 PG_CMDTAG(CMDTAG_REFRESH_MATERIALIZED_VIEW, "REFRESH MATERIALIZED VIEW", true, false, false)
 PG_CMDTAG(CMDTAG_REINDEX, "REINDEX", true, false, false)
 PG_CMDTAG(CMDTAG_RELEASE, "RELEASE", false, false, false)
+PG_CMDTAG(CMDTAG_REPACK, "REPACK", false, false, false)
 PG_CMDTAG(CMDTAG_RESET, "RESET", false, false, false)
 PG_CMDTAG(CMDTAG_REVOKE, "REVOKE", true, false, false)
 PG_CMDTAG(CMDTAG_REVOKE_ROLE, "REVOKE ROLE", false, false, false)
diff --git a/src/include/utils/backend_progress.h b/src/include/utils/backend_progress.h
index 19f63b41431..6300dbd15d5 100644
--- a/src/include/utils/backend_progress.h
+++ b/src/include/utils/backend_progress.h
@@ -24,10 +24,10 @@ typedef enum ProgressCommandType
 	PROGRESS_COMMAND_INVALID,
 	PROGRESS_COMMAND_VACUUM,
 	PROGRESS_COMMAND_ANALYZE,
-	PROGRESS_COMMAND_CLUSTER,
 	PROGRESS_COMMAND_CREATE_INDEX,
 	PROGRESS_COMMAND_BASEBACKUP,
 	PROGRESS_COMMAND_COPY,
+	PROGRESS_COMMAND_REPACK,
 } ProgressCommandType;
 
 #define PGSTAT_NUM_PROGRESS_PARAM	20
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 4d40a6809ab..277854418fa 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -495,6 +495,43 @@ ALTER TABLE clstrpart SET WITHOUT CLUSTER;
 ERROR:  cannot mark index clustered in partitioned table
 ALTER TABLE clstrpart CLUSTER ON clstrpart_idx;
 ERROR:  cannot mark index clustered in partitioned table
+-- Check that REPACK sets new relfilenodes: it should process exactly the same
+-- tables as CLUSTER did.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart USING INDEX clstrpart_idx;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+   relname   | level | relkind | ?column? 
+-------------+-------+---------+----------
+ clstrpart   |     0 | p       | t
+ clstrpart1  |     1 | p       | t
+ clstrpart11 |     2 | r       | f
+ clstrpart12 |     2 | p       | t
+ clstrpart2  |     1 | r       | f
+ clstrpart3  |     1 | p       | t
+ clstrpart33 |     2 | r       | f
+(7 rows)
+
+-- And finally the same for REPACK w/o index.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+   relname   | level | relkind | ?column? 
+-------------+-------+---------+----------
+ clstrpart   |     0 | p       | t
+ clstrpart1  |     1 | p       | t
+ clstrpart11 |     2 | r       | f
+ clstrpart12 |     2 | p       | t
+ clstrpart2  |     1 | r       | f
+ clstrpart3  |     1 | p       | t
+ clstrpart33 |     2 | r       | f
+(7 rows)
+
 DROP TABLE clstrpart;
 -- Ownership of partitions is checked
 CREATE TABLE ptnowner(i int unique) PARTITION BY LIST (i);
@@ -513,7 +550,7 @@ CREATE TEMP TABLE ptnowner_oldnodes AS
   JOIN pg_class AS c ON c.oid=tree.relid;
 SET SESSION AUTHORIZATION regress_ptnowner;
 CLUSTER ptnowner USING ptnowner_i_idx;
-WARNING:  permission denied to cluster "ptnowner2", skipping it
+WARNING:  permission denied to execute CLUSTER on "ptnowner2", skipping it
 RESET SESSION AUTHORIZATION;
 SELECT a.relname, a.relfilenode=b.relfilenode FROM pg_class a
   JOIN ptnowner_oldnodes b USING (oid) ORDER BY a.relname COLLATE "C";
@@ -665,6 +702,101 @@ SELECT * FROM clstr_expression WHERE -a = -3 ORDER BY -a, b;
 (4 rows)
 
 COMMIT;
+----------------------------------------------------------------------
+--
+-- REPACK
+--
+----------------------------------------------------------------------
+-- REPACK handles individual tables identically to CLUSTER, but it's worth
+-- checking if it handles table hierarchies identically as well.
+REPACK clstr_tst USING INDEX clstr_tst_c;
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 2');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+ a  |  b  |        c         |           substring            | length 
+----+-----+------------------+--------------------------------+--------
+ 10 |  14 | catorce          |                                |       
+ 18 |   5 | cinco            |                                |       
+  9 |   4 | cuatro           |                                |       
+ 26 |  19 | diecinueve       |                                |       
+ 12 |  18 | dieciocho        |                                |       
+ 30 |  16 | dieciseis        |                                |       
+ 24 |  17 | diecisiete       |                                |       
+  2 |  10 | diez             |                                |       
+ 23 |  12 | doce             |                                |       
+ 11 |   2 | dos              |                                |       
+ 25 |   9 | nueve            |                                |       
+ 31 |   8 | ocho             |                                |       
+  1 |  11 | once             |                                |       
+ 28 |  15 | quince           |                                |       
+ 32 |   6 | seis             | xyzzyxyzzyxyzzyxyzzyxyzzyxyzzy | 500000
+ 29 |   7 | siete            |                                |       
+ 15 |  13 | trece            |                                |       
+ 22 |  30 | treinta          |                                |       
+ 17 |  32 | treinta y dos    |                                |       
+  3 |  31 | treinta y uno    |                                |       
+  5 |   3 | tres             |                                |       
+ 20 |   1 | uno              |                                |       
+  6 |  20 | veinte           |                                |       
+ 14 |  25 | veinticinco      |                                |       
+ 21 |  24 | veinticuatro     |                                |       
+  4 |  22 | veintidos        |                                |       
+ 19 |  29 | veintinueve      |                                |       
+ 16 |  28 | veintiocho       |                                |       
+ 27 |  26 | veintiseis       |                                |       
+ 13 |  27 | veintisiete      |                                |       
+  7 |  23 | veintitres       |                                |       
+  8 |  21 | veintiuno        |                                |       
+  0 | 100 | in child table   |                                |       
+  0 | 100 | in child table 2 |                                |       
+(34 rows)
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+ERROR:  insert or update on table "clstr_tst" violates foreign key constraint "clstr_tst_con"
+DETAIL:  Key (b)=(1111) is not present in table "clstr_tst_s".
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+       conname        
+----------------------
+ clstr_tst_a_not_null
+ clstr_tst_con
+ clstr_tst_pkey
+(3 rows)
+
+-- Verify partial analyze works
+REPACK (ANALYZE) clstr_tst (a);
+REPACK (ANALYZE) clstr_tst;
+REPACK (VERBOSE) clstr_tst (a);
+ERROR:  ANALYZE option must be specified when a column list is provided
+-- REPACK w/o argument performs no ordering, so we can only check which tables
+-- have the relfilenode changed.
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_old AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+SET SESSION AUTHORIZATION regress_clstr_user;
+SET client_min_messages = ERROR;  -- order of "skipping" warnings may vary
+REPACK;
+RESET client_min_messages;
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_new AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+-- Do the actual comparison. Unlike CLUSTER, clstr_3 should have been
+-- processed because there is nothing like clustering index here.
+SELECT o.relname FROM relnodes_old o
+JOIN relnodes_new n ON o.relname = n.relname
+WHERE o.relfilenode <> n.relfilenode
+ORDER BY o.relname;
+ relname 
+---------
+ clstr_1
+ clstr_3
+(2 rows)
+
 -- clean up
 DROP TABLE clustertest;
 DROP TABLE clstr_1;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index f4ee2bd7459..48461550636 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2002,34 +2002,23 @@ pg_stat_progress_basebackup| SELECT pid,
             ELSE NULL::text
         END AS backup_type
    FROM pg_stat_get_progress_info('BASEBACKUP'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20);
-pg_stat_progress_cluster| SELECT s.pid,
-    s.datid,
-    d.datname,
-    s.relid,
-        CASE s.param1
-            WHEN 1 THEN 'CLUSTER'::text
-            WHEN 2 THEN 'VACUUM FULL'::text
-            ELSE NULL::text
+pg_stat_progress_cluster| SELECT pid,
+    datid,
+    datname,
+    relid,
+        CASE
+            WHEN (command = ANY (ARRAY['CLUSTER'::text, 'VACUUM FULL'::text])) THEN command
+            WHEN (repack_index_relid = (0)::oid) THEN 'VACUUM FULL'::text
+            ELSE 'CLUSTER'::text
         END AS command,
-        CASE s.param2
-            WHEN 0 THEN 'initializing'::text
-            WHEN 1 THEN 'seq scanning heap'::text
-            WHEN 2 THEN 'index scanning heap'::text
-            WHEN 3 THEN 'sorting tuples'::text
-            WHEN 4 THEN 'writing new heap'::text
-            WHEN 5 THEN 'swapping relation files'::text
-            WHEN 6 THEN 'rebuilding index'::text
-            WHEN 7 THEN 'performing final cleanup'::text
-            ELSE NULL::text
-        END AS phase,
-    (s.param3)::oid AS cluster_index_relid,
-    s.param4 AS heap_tuples_scanned,
-    s.param5 AS heap_tuples_written,
-    s.param6 AS heap_blks_total,
-    s.param7 AS heap_blks_scanned,
-    s.param8 AS index_rebuild_count
-   FROM (pg_stat_get_progress_info('CLUSTER'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
-     LEFT JOIN pg_database d ON ((s.datid = d.oid)));
+    phase,
+    repack_index_relid AS cluster_index_relid,
+    heap_tuples_scanned,
+    heap_tuples_written,
+    heap_blks_total,
+    heap_blks_scanned,
+    index_rebuild_count
+   FROM pg_stat_progress_repack;
 pg_stat_progress_copy| SELECT s.pid,
     s.datid,
     d.datname,
@@ -2089,6 +2078,35 @@ pg_stat_progress_create_index| SELECT s.pid,
     s.param15 AS partitions_done
    FROM (pg_stat_get_progress_info('CREATE INDEX'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
      LEFT JOIN pg_database d ON ((s.datid = d.oid)));
+pg_stat_progress_repack| SELECT s.pid,
+    s.datid,
+    d.datname,
+    s.relid,
+        CASE s.param1
+            WHEN 1 THEN 'CLUSTER'::text
+            WHEN 2 THEN 'REPACK'::text
+            WHEN 3 THEN 'VACUUM FULL'::text
+            ELSE NULL::text
+        END AS command,
+        CASE s.param2
+            WHEN 0 THEN 'initializing'::text
+            WHEN 1 THEN 'seq scanning heap'::text
+            WHEN 2 THEN 'index scanning heap'::text
+            WHEN 3 THEN 'sorting tuples'::text
+            WHEN 4 THEN 'writing new heap'::text
+            WHEN 5 THEN 'swapping relation files'::text
+            WHEN 6 THEN 'rebuilding index'::text
+            WHEN 7 THEN 'performing final cleanup'::text
+            ELSE NULL::text
+        END AS phase,
+    (s.param3)::oid AS repack_index_relid,
+    s.param4 AS heap_tuples_scanned,
+    s.param5 AS heap_tuples_written,
+    s.param6 AS heap_blks_total,
+    s.param7 AS heap_blks_scanned,
+    s.param8 AS index_rebuild_count
+   FROM (pg_stat_get_progress_info('REPACK'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
+     LEFT JOIN pg_database d ON ((s.datid = d.oid)));
 pg_stat_progress_vacuum| SELECT s.pid,
     s.datid,
     d.datname,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index b7115f86104..c976823a3cb 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -76,7 +76,6 @@ INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
 SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
 ORDER BY 1;
 
-
 SELECT relname, relkind,
     EXISTS(SELECT 1 FROM pg_class WHERE oid = c.reltoastrelid) AS hastoast
 FROM pg_class c WHERE relname LIKE 'clstr_tst%' ORDER BY relname;
@@ -229,6 +228,24 @@ SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM o
 CLUSTER clstrpart;
 ALTER TABLE clstrpart SET WITHOUT CLUSTER;
 ALTER TABLE clstrpart CLUSTER ON clstrpart_idx;
+
+-- Check that REPACK sets new relfilenodes: it should process exactly the same
+-- tables as CLUSTER did.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart USING INDEX clstrpart_idx;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+
+-- And finally the same for REPACK w/o index.
+DROP TABLE old_cluster_info;
+DROP TABLE new_cluster_info;
+CREATE TEMP TABLE old_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+REPACK clstrpart;
+CREATE TEMP TABLE new_cluster_info AS SELECT relname, level, relfilenode, relkind FROM pg_partition_tree('clstrpart'::regclass) AS tree JOIN pg_class c ON c.oid=tree.relid ;
+SELECT relname, old.level, old.relkind, old.relfilenode = new.relfilenode FROM old_cluster_info AS old JOIN new_cluster_info AS new USING (relname) ORDER BY relname COLLATE "C";
+
 DROP TABLE clstrpart;
 
 -- Ownership of partitions is checked
@@ -313,6 +330,57 @@ EXPLAIN (COSTS OFF) SELECT * FROM clstr_expression WHERE -a = -3 ORDER BY -a, b;
 SELECT * FROM clstr_expression WHERE -a = -3 ORDER BY -a, b;
 COMMIT;
 
+----------------------------------------------------------------------
+--
+-- REPACK
+--
+----------------------------------------------------------------------
+
+-- REPACK handles individual tables identically to CLUSTER, but it's worth
+-- checking if it handles table hierarchies identically as well.
+REPACK clstr_tst USING INDEX clstr_tst_c;
+
+-- Verify that inheritance link still works
+INSERT INTO clstr_tst_inh VALUES (0, 100, 'in child table 2');
+SELECT a,b,c,substring(d for 30), length(d) from clstr_tst;
+
+-- Verify that foreign key link still works
+INSERT INTO clstr_tst (b, c) VALUES (1111, 'this should fail');
+
+SELECT conname FROM pg_constraint WHERE conrelid = 'clstr_tst'::regclass
+ORDER BY 1;
+
+-- Verify partial analyze works
+REPACK (ANALYZE) clstr_tst (a);
+REPACK (ANALYZE) clstr_tst;
+REPACK (VERBOSE) clstr_tst (a);
+
+-- REPACK w/o argument performs no ordering, so we can only check which tables
+-- have the relfilenode changed.
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_old AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+
+SET SESSION AUTHORIZATION regress_clstr_user;
+SET client_min_messages = ERROR;  -- order of "skipping" warnings may vary
+REPACK;
+RESET client_min_messages;
+
+RESET SESSION AUTHORIZATION;
+CREATE TEMP TABLE relnodes_new AS
+(SELECT relname, relfilenode
+FROM pg_class
+WHERE relname IN ('clstr_1', 'clstr_2', 'clstr_3'));
+
+-- Do the actual comparison. Unlike CLUSTER, clstr_3 should have been
+-- processed because there is nothing like clustering index here.
+SELECT o.relname FROM relnodes_old o
+JOIN relnodes_new n ON o.relname = n.relname
+WHERE o.relfilenode <> n.relfilenode
+ORDER BY o.relname;
+
 -- clean up
 DROP TABLE clustertest;
 DROP TABLE clstr_1;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9f5ee8fd482..6c4af1c210d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2575,6 +2575,8 @@ ReorderBufferTupleCidEnt
 ReorderBufferTupleCidKey
 ReorderBufferUpdateProgressTxnCB
 ReorderTuple
+RepackCommand
+RepackStmt
 ReparameterizeForeignPathByChild_function
 ReplOriginId
 ReplOriginXactState
-- 
2.47.3



  [text/x-diff] v33-0002-Refactor-index_concurrently_create_copy-for-use-.patch (8.7K, 3-v33-0002-Refactor-index_concurrently_create_copy-for-use-.patch)
From 51af72c80c987888360e4c3263451c31337bda79 Mon Sep 17 00:00:00 2001
From: Antonin Houska <[email protected]>
Date: Tue, 27 Jan 2026 11:48:40 +0100
Subject: [PATCH v33 2/5] Refactor index_concurrently_create_copy() for use
 with REPACK (CONCURRENTLY).

This patch moves the code to index_create_copy() and adds a "concurrently"
parameter so it can be used by REPACK (CONCURRENTLY).

With the CONCURRENTLY option, REPACK cannot simply swap the heap file and
rebuild its indexes. Instead, it needs to build a separate set of indexes
(including system catalog entries) *before* the actual swap, to shorten the
time for which AccessExclusiveLock must be held.
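
For illustration only, here is a minimal standalone sketch of how the new
"concurrently" argument selects the values passed on by index_create_copy().
It is a toy model, not backend code: the Toy* names and the flag bit values
are placeholders invented for this example, and the real function raises an
error via ereport() where this sketch merely sets a field.

```c
#include <stdbool.h>

/* Placeholder bit values; the real INDEX_CREATE_* flags live in catalog/index.h. */
#define TOY_INDEX_CREATE_SKIP_BUILD	(1 << 0)
#define TOY_INDEX_CREATE_CONCURRENT	(1 << 1)

/* Hypothetical summary of the decisions driven by 'concurrently'. */
typedef struct ToyIndexCopyPlan
{
	bool		isready;		/* index accepts inserts right away */
	bool		concurrent;		/* index is built in a later step */
	int			flags;			/* flags handed to the index creation call */
	bool		rejected;		/* exclusion constraint + concurrent build */
} ToyIndexCopyPlan;

ToyIndexCopyPlan
toy_index_copy_plan(bool concurrently, bool has_exclusion_ops)
{
	ToyIndexCopyPlan plan;

	/* Only the concurrent path refuses indexes with exclusion constraints. */
	plan.rejected = has_exclusion_ops && concurrently;

	/* Mirrors "!concurrently" / "concurrently" in the makeIndexInfo() call. */
	plan.isready = !concurrently;
	plan.concurrent = concurrently;

	/* A non-concurrent copy passes no flags, so the index is built at once. */
	plan.flags = concurrently
		? (TOY_INDEX_CREATE_SKIP_BUILD | TOY_INDEX_CREATE_CONCURRENT)
		: 0;

	return plan;
}
```

In this model, index_concurrently_create_copy() corresponds to calling
toy_index_copy_plan(true, ...), which is the behavior it had before the
refactoring.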
---
 src/backend/catalog/index.c      | 54 +++++++++++++++++++++++---------
 src/backend/commands/indexcmds.c |  6 ++--
 src/backend/nodes/makefuncs.c    |  9 +++---
 src/include/catalog/index.h      |  3 ++
 src/include/nodes/makefuncs.h    |  4 ++-
 5 files changed, 54 insertions(+), 22 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 5ee6389d39c..f8e6c3d804e 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1288,15 +1288,32 @@ index_create(Relation heapRelation,
 /*
  * index_concurrently_create_copy
  *
- * Create concurrently an index based on the definition of the one provided by
- * caller.  The index is inserted into catalogs and needs to be built later
- * on.  This is called during concurrent reindex processing.
- *
- * "tablespaceOid" is the tablespace to use for this index.
+ * Variant of index_create_copy(), called during concurrent reindex
+ * processing.
  */
 Oid
 index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							   Oid tablespaceOid, const char *newName)
+{
+	return index_create_copy(heapRelation, oldIndexId, tablespaceOid, newName,
+							 true);
+}
+
+/*
+ * index_create_copy
+ *
+ * Create an index based on the definition of the one provided by caller.  The
+ * index is inserted into catalogs. If 'concurrently' is TRUE, it needs to be
+ * built later on, otherwise it's built immediately.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ *
+ * The actual implementation of index_concurrently_create_copy(), reusable for
+ * other purposes.
+ */
+Oid
+index_create_copy(Relation heapRelation, Oid oldIndexId, Oid tablespaceOid,
+				  const char *newName, bool concurrently)
 {
 	Relation	indexRelation;
 	IndexInfo  *oldInfo,
@@ -1315,6 +1332,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	List	   *indexColNames = NIL;
 	List	   *indexExprs = NIL;
 	List	   *indexPreds = NIL;
+	int			flags = 0;
 
 	indexRelation = index_open(oldIndexId, RowExclusiveLock);
 
@@ -1325,7 +1343,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	 * Concurrent build of an index with exclusion constraints is not
 	 * supported.
 	 */
-	if (oldInfo->ii_ExclusionOps != NULL)
+	if (oldInfo->ii_ExclusionOps != NULL && concurrently)
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("concurrent index creation for exclusion constraints is not supported")));
@@ -1381,9 +1399,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	}
 
 	/*
-	 * Build the index information for the new index.  Note that rebuild of
-	 * indexes with exclusion constraints is not supported, hence there is no
-	 * need to fill all the ii_Exclusion* fields.
+	 * Build the index information for the new index.
 	 */
 	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
 							oldInfo->ii_NumIndexKeyAttrs,
@@ -1392,10 +1408,13 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							indexPreds,
 							oldInfo->ii_Unique,
 							oldInfo->ii_NullsNotDistinct,
-							false,	/* not ready for inserts */
-							true,
+							!concurrently,	/* isready */
+							concurrently,	/* concurrent */
 							indexRelation->rd_indam->amsummarizing,
-							oldInfo->ii_WithoutOverlaps);
+							oldInfo->ii_WithoutOverlaps,
+							oldInfo->ii_ExclusionOps,
+							oldInfo->ii_ExclusionProcs,
+							oldInfo->ii_ExclusionStrats);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1433,6 +1452,9 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 		stattargets[i].isnull = isnull;
 	}
 
+	if (concurrently)
+		flags = INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT;
+
 	/*
 	 * Now create the new index.
 	 *
@@ -1456,7 +1478,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  indcoloptions->values,
 							  stattargets,
 							  reloptionsDatum,
-							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT,
+							  flags,
 							  0,
 							  true, /* allow table to be a system catalog? */
 							  false,	/* is_internal? */
@@ -2450,7 +2472,8 @@ BuildIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   NULL, NULL, NULL);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2510,7 +2533,8 @@ BuildDummyIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   NULL, NULL, NULL);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 635679cc1f2..34209bd1393 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -243,7 +243,8 @@ CheckIndexCompatible(Oid oldId,
 	 */
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
-							  false, false, amsummarizing, isWithoutOverlaps);
+							  false, false, amsummarizing, isWithoutOverlaps,
+							  NULL, NULL, NULL);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -930,7 +931,8 @@ DefineIndex(ParseState *pstate,
 							  !concurrent,
 							  concurrent,
 							  amissummarizing,
-							  stmt->iswithoutoverlaps);
+							  stmt->iswithoutoverlaps,
+							  NULL, NULL, NULL);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 2caec621d73..ca7e21e8349 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,8 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps)
+			  bool withoutoverlaps, Oid *exclusion_ops, Oid *exclusion_procs,
+			  uint16 *exclusion_strats)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -863,9 +864,9 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_PredicateState = NULL;
 
 	/* exclusion constraints */
-	n->ii_ExclusionOps = NULL;
-	n->ii_ExclusionProcs = NULL;
-	n->ii_ExclusionStrats = NULL;
+	n->ii_ExclusionOps = exclusion_ops;
+	n->ii_ExclusionProcs = exclusion_procs;
+	n->ii_ExclusionStrats = exclusion_strats;
 
 	/* speculative inserts */
 	n->ii_UniqueOps = NULL;
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index b259c4141ed..3426087b445 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -99,6 +99,9 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid oldIndexId,
 										   Oid tablespaceOid,
 										   const char *newName);
+extern Oid	index_create_copy(Relation heapRelation, Oid oldIndexId,
+							  Oid tablespaceOid, const char *newName,
+							  bool concurrently);
 
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 982ec25ae14..dcea148ae1a 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -99,7 +99,9 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								List *expressions, List *predicates,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
-								bool summarizing, bool withoutoverlaps);
+								bool summarizing, bool withoutoverlaps,
+								Oid *exclusion_ops, Oid *exclusion_procs,
+								uint16 *exclusion_strats);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
-- 
2.47.3



  [text/x-diff] v33-0003-Move-conversion-of-a-historic-to-MVCC-snapshot-t.patch (5.6K, 4-v33-0003-Move-conversion-of-a-historic-to-MVCC-snapshot-t.patch)
From fedf60087c884553bbb5fe2c54c9fad52757551c Mon Sep 17 00:00:00 2001
From: Antonin Houska <[email protected]>
Date: Tue, 27 Jan 2026 11:48:40 +0100
Subject: [PATCH v33 3/5] Move conversion of a "historic" to MVCC snapshot to a
 separate function.

The conversion is now handled by SnapBuildMVCCFromHistoric(), which REPACK
(CONCURRENTLY) will also need.
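
As a rough illustration of what the conversion does, the sketch below rebuilds
the "in progress" list of an ordinary MVCC snapshot from the sorted list of
committed XIDs that a historic snapshot keeps in xip. It is a simplified model
under stated assumptions, not the backend code: ToyXid and the toy_* functions
are invented for this example, XID wraparound and special XIDs are ignored,
and the caller supplies the output buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t ToyXid;		/* stand-in for TransactionId, no wraparound */

/* Three-way comparison of two XIDs, for use with bsearch(). */
static int
toy_xid_cmp(const void *a, const void *b)
{
	ToyXid		xa = *(const ToyXid *) a;
	ToyXid		xb = *(const ToyXid *) b;

	return (xa > xb) - (xa < xb);
}

/*
 * Store in 'running' every XID in [xmin, xmax) that is absent from the
 * sorted 'committed' array, and return the number of XIDs stored.
 * 'committed' must hold at least one element.
 */
size_t
toy_mvcc_xip_from_historic(ToyXid xmin, ToyXid xmax,
						   const ToyXid *committed, size_t ncommitted,
						   ToyXid *running, size_t capacity)
{
	size_t		nrunning = 0;

	for (ToyXid xid = xmin; xid < xmax; xid++)
	{
		/* Anything not known to be committed is treated as in progress. */
		if (bsearch(&xid, committed, ncommitted,
					sizeof(ToyXid), toy_xid_cmp) == NULL)
		{
			assert(nrunning < capacity);
			running[nrunning++] = xid;
		}
	}

	return nrunning;
}
```

With xmin = 100, xmax = 105 and committed = {101, 103}, the function stores
100, 102 and 104. The loop visits every XID in the range, which is why the
code comment in snapbuild.c notes that the conversion can be expensive.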
---
 src/backend/replication/logical/snapbuild.c | 59 +++++++++++++++++----
 src/backend/utils/time/snapmgr.c            |  3 +-
 src/include/replication/snapbuild.h         |  1 +
 src/include/utils/snapmgr.h                 |  1 +
 4 files changed, 52 insertions(+), 12 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 7f79621b57e..a738ad8a864 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -440,10 +440,7 @@ Snapshot
 SnapBuildInitialSnapshot(SnapBuild *builder)
 {
 	Snapshot	snap;
-	TransactionId xid;
 	TransactionId safeXid;
-	TransactionId *newxip;
-	int			newxcnt = 0;
 
 	Assert(XactIsoLevel == XACT_REPEATABLE_READ);
 	Assert(builder->building_full_snapshot);
@@ -485,7 +482,35 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 
 	MyProc->xmin = snap->xmin;
 
-	/* allocate in transaction context */
+	/* Convert the historic snapshot to MVCC snapshot. */
+	return SnapBuildMVCCFromHistoric(snap, true);
+}
+
+/*
+ * Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
+ *
+ * Unlike a regular (non-historic) MVCC snapshot, the 'xip' array of this
+ * snapshot contains not only running main transactions, but also their
+ * subtransactions. On the other hand, 'subxip' will usually be empty. This
+ * difference does not affect the result of XidInMVCCSnapshot() because it
+ * searches both in 'xip' and 'subxip'.
+ *
+ * Pass true for 'in_place' if you don't care about modifying the source
+ * snapshot. If you need a new instance, and one that was allocated as a
+ * single chunk of memory, pass false.
+ */
+Snapshot
+SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place)
+{
+	TransactionId xid;
+	TransactionId *oldxip = snapshot->xip;
+	uint32		oldxcnt = snapshot->xcnt;
+	TransactionId *newxip;
+	int			newxcnt = 0;
+	Snapshot	result;
+
+	Assert(snapshot->snapshot_type == SNAPSHOT_HISTORIC_MVCC);
+
 	newxip = palloc_array(TransactionId, GetMaxSnapshotXidCount());
 
 	/*
@@ -494,7 +519,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 	 * classical snapshot by marking all non-committed transactions as
 	 * in-progress. This can be expensive.
 	 */
-	for (xid = snap->xmin; NormalTransactionIdPrecedes(xid, snap->xmax);)
+	for (xid = snapshot->xmin; NormalTransactionIdPrecedes(xid, snapshot->xmax);)
 	{
 		void	   *test;
 
@@ -502,7 +527,7 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 		 * Check whether transaction committed using the decoding snapshot
 		 * meaning of ->xip.
 		 */
-		test = bsearch(&xid, snap->xip, snap->xcnt,
+		test = bsearch(&xid, snapshot->xip, snapshot->xcnt,
 					   sizeof(TransactionId), xidComparator);
 
 		if (test == NULL)
@@ -519,11 +544,25 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 	}
 
 	/* adjust remaining snapshot fields as needed */
-	snap->snapshot_type = SNAPSHOT_MVCC;
-	snap->xcnt = newxcnt;
-	snap->xip = newxip;
+	snapshot->xcnt = newxcnt;
+	snapshot->xip = newxip;
 
-	return snap;
+	if (in_place)
+		result = snapshot;
+	else
+	{
+		result = CopySnapshot(snapshot);
+
+		/* Restore the original values so the source is intact. */
+		snapshot->xip = oldxip;
+		snapshot->xcnt = oldxcnt;
+
+		/* newxip has been copied */
+		pfree(newxip);
+	}
+	result->snapshot_type = SNAPSHOT_MVCC;
+
+	return result;
 }
 
 /*
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 2e6197f5f35..3af1b366adf 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -213,7 +213,6 @@ typedef struct ExportedSnapshot
 static List *exportedSnapshots = NIL;
 
 /* Prototypes for local functions */
-static Snapshot CopySnapshot(Snapshot snapshot);
 static void UnregisterSnapshotNoOwner(Snapshot snapshot);
 static void FreeSnapshot(Snapshot snapshot);
 static void SnapshotResetXmin(void);
@@ -604,7 +603,7 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
  * The copy is palloc'd in TopTransactionContext and has initial refcounts set
  * to 0.  The returned snapshot has the copied flag set.
  */
-static Snapshot
+Snapshot
 CopySnapshot(Snapshot snapshot)
 {
 	Snapshot	newsnap;
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index ccded021433..34383dea776 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -73,6 +73,7 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
 
 extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
 extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
 extern void SnapBuildClearExportedSnapshot(void);
 extern void SnapBuildResetExportedSnapshotState(void);
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b8c01a291a1..de824945f0b 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -63,6 +63,7 @@ extern Snapshot GetTransactionSnapshot(void);
 extern Snapshot GetLatestSnapshot(void);
 extern void SnapshotSetCommandId(CommandId curcid);
 
+extern Snapshot CopySnapshot(Snapshot snapshot);
 extern Snapshot GetCatalogSnapshot(Oid relid);
 extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
 extern void InvalidateCatalogSnapshot(void);
-- 
2.47.3



  [text/x-diff] v33-0004-Add-CONCURRENTLY-option-to-REPACK-command.patch (146.0K, 5-v33-0004-Add-CONCURRENTLY-option-to-REPACK-command.patch)
  download | inline diff:
From 7a0d30d4468d5806fb2e6a7093689b1a550f9147 Mon Sep 17 00:00:00 2001
From: Antonin Houska <[email protected]>
Date: Tue, 27 Jan 2026 11:48:40 +0100
Subject: [PATCH v33 4/5] Add CONCURRENTLY option to REPACK command.

The REPACK command copies the relation data into a new file, creates new
indexes and eventually swaps the files. To make sure that the old file does
not change during the copying, the relation is locked in an exclusive mode,
which prevents applications from both reading and writing. (To keep the data
consistent, we'd only need to prevent the applications from writing, but even
reading needs to be blocked before we can swap the files - otherwise some
applications could continue using the old file. Since we should not request a
stronger lock without releasing the weaker one first, we acquire the exclusive
lock in the beginning and keep it till the end of the processing.)

This patch introduces an alternative workflow, which only requires the
exclusive lock when the relation (and index) files are being swapped.
(Supposedly, the swapping should be pretty fast.) On the other hand, when we
copy the data to the new file, we allow applications to read from the relation
and even to write to it.

First, we scan the relation using a "historic snapshot", and insert all the
tuples satisfying this snapshot into the new file.

Second, logical decoding is used to capture the data changes done by
applications during the copying (i.e. changes that do not satisfy the historic
snapshot mentioned above), and those are applied to the new file before we
acquire the exclusive lock that we need to swap the files. (Of course, more
data changes can take place while we are waiting for the lock - these will be
applied to the new file after we have acquired the lock, before we swap the
files.)

Since the logical decoding system, during its startup, waits until all the
transactions which already have XID assigned have finished, there is a risk of
deadlock if a transaction that already changed anything in the database tries
to acquire a conflicting lock on the table REPACK CONCURRENTLY is working
on. As an example, consider a transaction running the CREATE INDEX command on the
table that is being REPACKed CONCURRENTLY. On the other hand, DML commands
(INSERT, UPDATE, DELETE) are not a problem as their lock does not conflict
with REPACK CONCURRENTLY.

The current approach is that we accept the risk. If we tried to avoid it, it'd
be necessary to unlock the table before the logical decoding is set up and lock
it again afterwards. Such temporary unlocking would imply re-checking if the
table still meets all the requirements for REPACK CONCURRENTLY.

Like the existing implementation of REPACK, the variant with the CONCURRENTLY
option also requires extra space for the new relation and index files
(which coexist with the old files for some time). In addition, the
CONCURRENTLY option might introduce a lag in releasing WAL segments for
archiving / recycling. This is due to the decoding of the data changes done by
applications concurrently. When copying the table contents into the new file,
we check the lag periodically. If it exceeds the size of a WAL segment, we
decode all the available WAL before resuming the copying. (Of course, the
changes are not applied until the whole table contents have been copied.) A
background worker might be a better approach for the decoding - let's consider
implementing it in the future.

The WAL records produced by running DML commands on the new relation do not
contain enough information to be processed by the logical decoding system. All
we need from the new relation is the file (relfilenode), while the actual
relation is eventually dropped. Thus there is no point in replaying the DMLs
anywhere.
---
 doc/src/sgml/monitoring.sgml                  |   37 +-
 doc/src/sgml/mvcc.sgml                        |   12 +-
 doc/src/sgml/ref/repack.sgml                  |  129 +-
 src/Makefile                                  |    1 +
 src/backend/access/heap/heapam.c              |   34 +-
 src/backend/access/heap/heapam_handler.c      |  259 ++-
 src/backend/access/heap/rewriteheap.c         |    6 +-
 src/backend/catalog/system_views.sql          |   19 +-
 src/backend/commands/cluster.c                | 1641 +++++++++++++++--
 src/backend/commands/matview.c                |    2 +-
 src/backend/commands/tablecmds.c              |    1 +
 src/backend/commands/vacuum.c                 |   12 +-
 src/backend/meson.build                       |    1 +
 src/backend/replication/logical/decode.c      |   37 +-
 src/backend/replication/logical/snapbuild.c   |   21 +
 .../replication/pgoutput_repack/Makefile      |   32 +
 .../replication/pgoutput_repack/meson.build   |   18 +
 .../pgoutput_repack/pgoutput_repack.c         |  239 +++
 .../storage/lmgr/generate-lwlocknames.pl      |    2 +-
 src/backend/utils/time/snapmgr.c              |    3 +-
 src/bin/psql/tab-complete.in.c                |    4 +-
 src/include/access/heapam.h                   |    5 +-
 src/include/access/heapam_xlog.h              |    2 +
 src/include/access/tableam.h                  |   10 +
 src/include/commands/cluster.h                |   88 +-
 src/include/commands/progress.h               |   17 +-
 src/include/replication/snapbuild.h           |    1 +
 src/include/storage/lockdefs.h                |    4 +-
 src/include/utils/snapmgr.h                   |    2 +
 src/test/modules/injection_points/Makefile    |    2 +
 .../injection_points/expected/repack.out      |  113 ++
 .../expected/repack_toast.out                 |   64 +
 src/test/modules/injection_points/meson.build |    2 +
 .../injection_points/specs/repack.spec        |  142 ++
 .../injection_points/specs/repack_toast.spec  |  105 ++
 src/test/regress/expected/rules.out           |   19 +-
 src/tools/pgindent/typedefs.list              |    5 +
 37 files changed, 2829 insertions(+), 262 deletions(-)
 create mode 100644 src/backend/replication/pgoutput_repack/Makefile
 create mode 100644 src/backend/replication/pgoutput_repack/meson.build
 create mode 100644 src/backend/replication/pgoutput_repack/pgoutput_repack.c
 create mode 100644 src/test/modules/injection_points/expected/repack.out
 create mode 100644 src/test/modules/injection_points/expected/repack_toast.out
 create mode 100644 src/test/modules/injection_points/specs/repack.spec
 create mode 100644 src/test/modules/injection_points/specs/repack_toast.spec
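
The WAL lag handling described in the commit message (decode once more than
one WAL segment has accumulated since the last decoding pass) can be sketched
in isolation. This is a minimal standalone illustration and not code from the
patch: WalPos, LagTracker and the lag_tracker_* functions are hypothetical
names, and the threshold field merely plays the role of wal_segment_size. Only
the "difference greater than threshold" comparison mirrors the check in
heapam_relation_copy_for_cluster().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit WAL position (XLogRecPtr in the patch). */
typedef uint64_t WalPos;

/* Hypothetical bookkeeping for the periodic lag check. */
typedef struct LagTracker
{
	WalPos		last_decoded;	/* WAL position of the last decoding pass */
	uint64_t	threshold;		/* plays the role of wal_segment_size */
	int			decode_calls;	/* number of decoding passes triggered */
} LagTracker;

static void
lag_tracker_init(LagTracker *t, WalPos start, uint64_t threshold)
{
	t->last_decoded = start;
	t->threshold = threshold;
	t->decode_calls = 0;
}

/*
 * Called once per copied tuple in this sketch.  Returns 1 if the lag
 * exceeded the threshold and a decoding pass was (notionally) run, else 0.
 */
static int
lag_tracker_check(LagTracker *t, WalPos end_of_wal)
{
	/* WAL positions only move forward. */
	assert(end_of_wal >= t->last_decoded);

	if (end_of_wal - t->last_decoded > t->threshold)
	{
		/* The patch calls repack_decode_concurrent_changes() here. */
		t->decode_calls++;
		t->last_decoded = end_of_wal;
		return 1;
	}
	return 0;
}
```

Using a strict "greater than" comparison means a lag of exactly one segment
does not yet trigger decoding, which matches the condition used in the patch.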

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 71c92ed53ef..ec97197461a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6239,14 +6239,35 @@ FROM pg_stat_get_backend_idset() AS backendid;
 
      <row>
       <entry role="catalog_table_entry"><para role="column_definition">
-       <structfield>heap_tuples_written</structfield> <type>bigint</type>
+       <structfield>heap_tuples_inserted</structfield> <type>bigint</type>
       </para>
       <para>
-       Number of heap tuples written.
+       Number of heap tuples inserted.
        This counter only advances when the phase is
        <literal>seq scanning heap</literal>,
-       <literal>index scanning heap</literal>
-       or <literal>writing new heap</literal>.
+       <literal>index scanning heap</literal>,
+       <literal>writing new heap</literal>
+       or <literal>catch-up</literal>.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>heap_tuples_updated</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of heap tuples updated.
+       This counter only advances when the phase is <literal>catch-up</literal>.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>heap_tuples_deleted</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of heap tuples deleted.
+       This counter only advances when the phase is <literal>catch-up</literal>.
       </para></entry>
      </row>
 
@@ -6327,6 +6348,14 @@ FROM pg_stat_get_backend_idset() AS backendid;
        <command>REPACK</command> is currently writing the new heap.
      </entry>
     </row>
+    <row>
+     <entry><literal>catch-up</literal></entry>
+     <entry>
+       <command>REPACK CONCURRENTLY</command> is currently processing the DML
+       commands that other transactions executed during any of the preceding
+       phases.
+     </entry>
+    </row>
     <row>
      <entry><literal>swapping relation files</literal></entry>
      <entry>
diff --git a/doc/src/sgml/mvcc.sgml b/doc/src/sgml/mvcc.sgml
index 049ee75a4ba..0f5c34af542 100644
--- a/doc/src/sgml/mvcc.sgml
+++ b/doc/src/sgml/mvcc.sgml
@@ -1833,15 +1833,17 @@ SELECT pg_advisory_lock(q.id) FROM
    <title>Caveats</title>
 
    <para>
-    Some DDL commands, currently only <link linkend="sql-truncate"><command>TRUNCATE</command></link> and the
-    table-rewriting forms of <link linkend="sql-altertable"><command>ALTER TABLE</command></link>, are not
+    Some commands, currently only <link linkend="sql-truncate"><command>TRUNCATE</command></link>, the
+    table-rewriting forms of <link linkend="sql-altertable"><command>ALTER
+    TABLE</command></link> and <command>REPACK</command> with
+    the <literal>CONCURRENTLY</literal> option, are not
     MVCC-safe.  This means that after the truncation or rewrite commits, the
     table will appear empty to concurrent transactions, if they are using a
-    snapshot taken before the DDL command committed.  This will only be an
+    snapshot taken before the command committed.  This will only be an
     issue for a transaction that did not access the table in question
-    before the DDL command started &mdash; any transaction that has done so
+    before the command started &mdash; any transaction that has done so
     would hold at least an <literal>ACCESS SHARE</literal> table lock,
-    which would block the DDL command until that transaction completes.
+    which would block the truncating or rewriting command until that transaction completes.
     So these commands will not cause any apparent inconsistency in the
     table contents for successive queries on the target table, but they
     could cause visible inconsistency between the contents of the target
diff --git a/doc/src/sgml/ref/repack.sgml b/doc/src/sgml/ref/repack.sgml
index 61d5c2cdef1..30c43c49069 100644
--- a/doc/src/sgml/ref/repack.sgml
+++ b/doc/src/sgml/ref/repack.sgml
@@ -28,6 +28,7 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] USING
 
     VERBOSE [ <replaceable class="parameter">boolean</replaceable> ]
     ANALYZE [ <replaceable class="parameter">boolean</replaceable> ]
+    CONCURRENTLY [ <replaceable class="parameter">boolean</replaceable> ]
 
 <phrase>and <replaceable class="parameter">table_and_columns</replaceable> is:</phrase>
 
@@ -54,7 +55,8 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] USING
    processes every table and materialized view in the current database that
    the current user has the <literal>MAINTAIN</literal> privilege on. This
    form of <command>REPACK</command> cannot be executed inside a transaction
-   block.
+   block.  Also, this form is not allowed if
+   the <literal>CONCURRENTLY</literal> option is used.
   </para>
 
   <para>
@@ -67,7 +69,8 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] USING
    When a table is being repacked, an <literal>ACCESS EXCLUSIVE</literal> lock
    is acquired on it. This prevents any other database operations (both reads
    and writes) from operating on the table until the <command>REPACK</command>
-   is finished.
+   is finished. If you want to keep the table accessible during the repacking,
+   consider using the <literal>CONCURRENTLY</literal> option.
   </para>
 
   <refsect2 id="sql-repack-notes-on-clustering" xreflabel="Notes on Clustering">
@@ -195,6 +198,128 @@ REPACK [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] USING
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term><literal>CONCURRENTLY</literal></term>
+    <listitem>
+     <para>
+      Allow other transactions to use the table while it is being repacked.
+     </para>
+
+     <para>
+      Internally, <command>REPACK</command> copies the contents of the table
+      (ignoring dead tuples) into a new file, sorted by the specified index,
+      and also creates a new file for each index. Then it swaps the old and
+      new files for the table and all the indexes, and deletes the old
+      files. The <literal>ACCESS EXCLUSIVE</literal> lock is needed to make
+      sure that the old files do not change during the processing because the
+      changes would get lost due to the swap.
+     </para>
+
+     <para>
+      With the <literal>CONCURRENTLY</literal> option, the <literal>ACCESS
+      EXCLUSIVE</literal> lock is only acquired to swap the table and index
+      files. The data changes that took place during the creation of the new
+      table and index files are captured using logical decoding
+      (<xref linkend="logicaldecoding"/>) and applied before
+      the <literal>ACCESS EXCLUSIVE</literal> lock is requested. Thus the lock
+      is typically held only for the time needed to swap the files, which
+      should be pretty short. However, the time might still be noticeable if
+      too many data changes have been done to the table while
+      <command>REPACK</command> was waiting for the lock: those changes must
+      be processed just before the files are swapped, while the
+      <literal>ACCESS EXCLUSIVE</literal> lock is being held.
+     </para>
+
+     <para>
+      Note that <command>REPACK</command> with the
+      <literal>CONCURRENTLY</literal> option does not try to order the rows
+      inserted into the table after the repacking started. Also
+      note that <command>REPACK</command> might fail to complete due to DDL
+      commands executed on the table by other transactions during the
+      repacking.
+     </para>
+
+     <note>
+      <para>
+       In addition to the temporary space requirements explained in
+       <xref linkend="sql-repack-notes-on-resources"/>,
+       the <literal>CONCURRENTLY</literal> option can increase the usage of
+       temporary space further. The reason is that other transactions can
+       perform DML operations which cannot be applied to the new file until
+       <command>REPACK</command> has copied all the tuples from the old
+       file. Thus the tuples inserted into the old file during the copying are
+       also stored separately in a temporary file, so they can eventually be
+       applied to the new file.
+      </para>
+
+      <para>
+       Furthermore, the data changes performed during the copying are
+       extracted from the <link linkend="wal">write-ahead log</link> (WAL), and
+       this extraction (decoding) only takes place when a certain amount of WAL
+       has been written. Therefore, WAL removal can be delayed by this
+       threshold. Currently the threshold is equal to the value of
+       the <link linkend="guc-wal-segment-size"><varname>wal_segment_size</varname></link>
+       configuration parameter.
+      </para>
+     </note>
+
+     <para>
+      The <literal>CONCURRENTLY</literal> option cannot be used in the
+      following cases:
+
+      <itemizedlist>
+       <listitem>
+        <para>
+          The table is <literal>UNLOGGED</literal>.
+        </para>
+       </listitem>
+
+       <listitem>
+        <para>
+          The table is partitioned.
+        </para>
+       </listitem>
+
+       <listitem>
+        <para>
+          The table is a system catalog or a <acronym>TOAST</acronym> table.
+        </para>
+       </listitem>
+
+       <listitem>
+        <para>
+         <command>REPACK</command> is executed inside a transaction block.
+        </para>
+       </listitem>
+
+       <listitem>
+        <para>
+          The <link linkend="guc-wal-level"><varname>wal_level</varname></link>
+          configuration parameter is less than <literal>logical</literal>.
+        </para>
+       </listitem>
+
+       <listitem>
+        <para>
+         The <link linkend="guc-max-replication-slots"><varname>max_replication_slots</varname></link>
+         configuration parameter does not allow for creation of an additional
+         replication slot.
+        </para>
+       </listitem>
+      </itemizedlist>
+     </para>
+
+     <warning>
+      <para>
+       <command>REPACK</command> with the <literal>CONCURRENTLY</literal>
+       option is not MVCC-safe; see <xref linkend="mvcc-caveats"/> for
+       details.
+      </para>
+     </warning>
+
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term><literal>VERBOSE</literal></term>
     <listitem>
diff --git a/src/Makefile b/src/Makefile
index 2f31a2f20a7..b18c9a14ffa 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -23,6 +23,7 @@ SUBDIRS = \
 	interfaces \
 	backend/replication/libpqwalreceiver \
 	backend/replication/pgoutput \
+	backend/replication/pgoutput_repack \
 	fe_utils \
 	bin \
 	pl \
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 3004964ab7f..a89a456135d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -60,7 +60,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
 static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
 								  Buffer newbuf, HeapTuple oldtup,
 								  HeapTuple newtup, HeapTuple old_key_tuple,
-								  bool all_visible_cleared, bool new_all_visible_cleared);
+								  bool all_visible_cleared, bool new_all_visible_cleared,
+								  bool walLogical);
 #ifdef USE_ASSERT_CHECKING
 static void check_lock_if_inplace_updateable_rel(Relation relation,
 												 const ItemPointerData *otid,
@@ -2841,7 +2842,7 @@ xmax_infomask_changed(uint16 new_infomask, uint16 old_infomask)
 TM_Result
 heap_delete(Relation relation, const ItemPointerData *tid,
 			CommandId cid, Snapshot crosscheck, bool wait,
-			TM_FailureData *tmfd, bool changingPart)
+			TM_FailureData *tmfd, bool changingPart, bool walLogical)
 {
 	TM_Result	result;
 	TransactionId xid = GetCurrentTransactionId();
@@ -3088,7 +3089,8 @@ l1:
 	 * Compute replica identity tuple before entering the critical section so
 	 * we don't PANIC upon a memory allocation failure.
 	 */
-	old_key_tuple = ExtractReplicaIdentity(relation, &tp, true, &old_key_copied);
+	old_key_tuple = walLogical ?
+		ExtractReplicaIdentity(relation, &tp, true, &old_key_copied) : NULL;
 
 	/*
 	 * If this is the first possibly-multixact-able operation in the current
@@ -3178,6 +3180,15 @@ l1:
 				xlrec.flags |= XLH_DELETE_CONTAINS_OLD_KEY;
 		}
 
+		/*
+		 * Unlike UPDATE, DELETE is decoded even if there is no old key, so it
+		 * does not help to clear both XLH_DELETE_CONTAINS_OLD_TUPLE and
+		 * XLH_DELETE_CONTAINS_OLD_KEY. Thus we need an extra flag. TODO
+		 * Consider not decoding tuples w/o the old tuple/key instead.
+		 */
+		if (!walLogical)
+			xlrec.flags |= XLH_DELETE_NO_LOGICAL;
+
 		XLogBeginInsert();
 		XLogRegisterData(&xlrec, SizeOfHeapDelete);
 
@@ -3270,7 +3281,8 @@ simple_heap_delete(Relation relation, const ItemPointerData *tid)
 	result = heap_delete(relation, tid,
 						 GetCurrentCommandId(true), InvalidSnapshot,
 						 true /* wait for commit */ ,
-						 &tmfd, false /* changingPart */ );
+						 &tmfd, false,	/* changingPart */
+						 true /* walLogical */ );
 	switch (result)
 	{
 		case TM_SelfModified:
@@ -3311,7 +3323,7 @@ TM_Result
 heap_update(Relation relation, const ItemPointerData *otid, HeapTuple newtup,
 			CommandId cid, Snapshot crosscheck, bool wait,
 			TM_FailureData *tmfd, LockTupleMode *lockmode,
-			TU_UpdateIndexes *update_indexes)
+			TU_UpdateIndexes *update_indexes, bool walLogical)
 {
 	TM_Result	result;
 	TransactionId xid = GetCurrentTransactionId();
@@ -4204,7 +4216,8 @@ l2:
 								 newbuf, &oldtup, heaptup,
 								 old_key_tuple,
 								 all_visible_cleared,
-								 all_visible_cleared_new);
+								 all_visible_cleared_new,
+								 walLogical);
 		if (newbuf != buffer)
 		{
 			PageSetLSN(BufferGetPage(newbuf), recptr);
@@ -4562,7 +4575,8 @@ simple_heap_update(Relation relation, const ItemPointerData *otid, HeapTuple tup
 	result = heap_update(relation, otid, tup,
 						 GetCurrentCommandId(true), InvalidSnapshot,
 						 true /* wait for commit */ ,
-						 &tmfd, &lockmode, update_indexes);
+						 &tmfd, &lockmode, update_indexes,
+						 true /* walLogical */ );
 	switch (result)
 	{
 		case TM_SelfModified:
@@ -8918,7 +8932,8 @@ static XLogRecPtr
 log_heap_update(Relation reln, Buffer oldbuf,
 				Buffer newbuf, HeapTuple oldtup, HeapTuple newtup,
 				HeapTuple old_key_tuple,
-				bool all_visible_cleared, bool new_all_visible_cleared)
+				bool all_visible_cleared, bool new_all_visible_cleared,
+				bool walLogical)
 {
 	xl_heap_update xlrec;
 	xl_heap_header xlhdr;
@@ -8929,7 +8944,8 @@ log_heap_update(Relation reln, Buffer oldbuf,
 				suffixlen = 0;
 	XLogRecPtr	recptr;
 	Page		page = BufferGetPage(newbuf);
-	bool		need_tuple_data = RelationIsLogicallyLogged(reln);
+	bool		need_tuple_data = RelationIsLogicallyLogged(reln) &&
+		walLogical;
 	bool		init;
 	int			bufflags;
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 7d4b48e5a97..908f1ef66c6 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -33,6 +33,7 @@
 #include "catalog/index.h"
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
+#include "commands/cluster.h"
 #include "commands/progress.h"
 #include "executor/executor.h"
 #include "miscadmin.h"
@@ -309,7 +310,8 @@ heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
 	 * the storage itself is cleaning the dead tuples by itself, it is the
 	 * time to call the index tuple deletion also.
 	 */
-	return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart);
+	return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart,
+					   true);
 }
 
 
@@ -328,7 +330,7 @@ heapam_tuple_update(Relation relation, ItemPointer otid, TupleTableSlot *slot,
 	tuple->t_tableOid = slot->tts_tableOid;
 
 	result = heap_update(relation, otid, tuple, cid, crosscheck, wait,
-						 tmfd, lockmode, update_indexes);
+						 tmfd, lockmode, update_indexes, true);
 	ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
 
 	/*
@@ -685,13 +687,15 @@ static void
 heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 								 Relation OldIndex, bool use_sort,
 								 TransactionId OldestXmin,
+								 Snapshot snapshot,
+								 LogicalDecodingContext *decoding_ctx,
 								 TransactionId *xid_cutoff,
 								 MultiXactId *multi_cutoff,
 								 double *num_tuples,
 								 double *tups_vacuumed,
 								 double *tups_recently_dead)
 {
-	RewriteState rwstate;
+	RewriteState rwstate = NULL;
 	IndexScanDesc indexScan;
 	TableScanDesc tableScan;
 	HeapScanDesc heapScan;
@@ -705,6 +709,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 	bool	   *isnull;
 	BufferHeapTupleTableSlot *hslot;
 	BlockNumber prev_cblock = InvalidBlockNumber;
+	bool		concurrent = snapshot != NULL;
+	XLogRecPtr	end_of_wal_prev = GetFlushRecPtr(NULL);
 
 	/* Remember if it's a system catalog */
 	is_system_catalog = IsSystemRelation(OldHeap);
@@ -720,9 +726,12 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 	values = palloc_array(Datum, natts);
 	isnull = palloc_array(bool, natts);
 
-	/* Initialize the rewrite operation */
-	rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
-								 *multi_cutoff);
+	/*
+	 * Initialize the rewrite operation.
+	 */
+	if (!concurrent)
+		rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin,
+									 *xid_cutoff, *multi_cutoff);
 
 
 	/* Set up sorting if wanted */
@@ -737,6 +746,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 	 * Prepare to scan the OldHeap.  To ensure we see recently-dead tuples
 	 * that still need to be copied, we scan with SnapshotAny and use
 	 * HeapTupleSatisfiesVacuum for the visibility test.
+	 *
+	 * In the CONCURRENTLY case, we do regular MVCC visibility tests, using
+	 * the snapshot passed by the caller.
 	 */
 	if (OldIndex != NULL && !use_sort)
 	{
@@ -753,7 +765,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
 		tableScan = NULL;
 		heapScan = NULL;
-		indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, NULL, 0, 0);
+		indexScan = index_beginscan(OldHeap, OldIndex,
+									snapshot ? snapshot : SnapshotAny,
+									NULL, 0, 0);
 		index_rescan(indexScan, NULL, 0, NULL, 0);
 	}
 	else
@@ -762,7 +776,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 		pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
 									 PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP);
 
-		tableScan = table_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
+		tableScan = table_beginscan(OldHeap,
+									snapshot ? snapshot : SnapshotAny,
+									0, (ScanKey) NULL);
 		heapScan = (HeapScanDesc) tableScan;
 		indexScan = NULL;
 
@@ -838,83 +854,90 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 		buf = hslot->buffer;
 
 		/*
-		 * To be able to guarantee that we can set the hint bit, acquire an
-		 * exclusive lock on the old buffer. We need the hint bits, set in
-		 * heapam_relation_copy_for_cluster() -> HeapTupleSatisfiesVacuum(),
-		 * to be set, as otherwise reform_and_rewrite_tuple() ->
-		 * rewrite_heap_tuple() will get confused. Specifically,
-		 * rewrite_heap_tuple() checks for HEAP_XMAX_INVALID in the old tuple
-		 * to determine whether to check the old-to-new mapping hash table.
-		 *
-		 * It'd be better if we somehow could avoid setting hint bits on the
-		 * old page. One reason to use VACUUM FULL are very bloated tables -
-		 * rewriting most of the old table during VACUUM FULL doesn't exactly
-		 * help...
+		 * Regarding CONCURRENTLY, see the comments on MVCC snapshot above.
 		 */
-		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
-
-		switch (HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf))
+		if (!concurrent)
 		{
-			case HEAPTUPLE_DEAD:
-				/* Definitely dead */
-				isdead = true;
-				break;
-			case HEAPTUPLE_RECENTLY_DEAD:
-				*tups_recently_dead += 1;
-				/* fall through */
-			case HEAPTUPLE_LIVE:
-				/* Live or recently dead, must copy it */
-				isdead = false;
-				break;
-			case HEAPTUPLE_INSERT_IN_PROGRESS:
+			/*
+			 * To be able to guarantee that we can set the hint bit, acquire an
+			 * exclusive lock on the old buffer. We need the hint bits, set in
+			 * heapam_relation_copy_for_cluster() -> HeapTupleSatisfiesVacuum(),
+			 * to be set, as otherwise reform_and_rewrite_tuple() ->
+			 * rewrite_heap_tuple() will get confused. Specifically,
+			 * rewrite_heap_tuple() checks for HEAP_XMAX_INVALID in the old tuple
+			 * to determine whether to check the old-to-new mapping hash table.
+			 *
+			 * It'd be better if we somehow could avoid setting hint bits on the
+			 * old page. One reason to use VACUUM FULL are very bloated tables -
+			 * rewriting most of the old table during VACUUM FULL doesn't exactly
+			 * help...
+			 */
+			LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
-				/*
-				 * Since we hold exclusive lock on the relation, normally the
-				 * only way to see this is if it was inserted earlier in our
-				 * own transaction.  However, it can happen in system
-				 * catalogs, since we tend to release write lock before commit
-				 * there.  Give a warning if neither case applies; but in any
-				 * case we had better copy it.
-				 */
-				if (!is_system_catalog &&
-					!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple->t_data)))
-					elog(WARNING, "concurrent insert in progress within table \"%s\"",
-						 RelationGetRelationName(OldHeap));
-				/* treat as live */
-				isdead = false;
-				break;
-			case HEAPTUPLE_DELETE_IN_PROGRESS:
-
-				/*
-				 * Similar situation to INSERT_IN_PROGRESS case.
-				 */
-				if (!is_system_catalog &&
-					!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(tuple->t_data)))
-					elog(WARNING, "concurrent delete in progress within table \"%s\"",
-						 RelationGetRelationName(OldHeap));
-				/* treat as recently dead */
-				*tups_recently_dead += 1;
-				isdead = false;
-				break;
-			default:
-				elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
-				isdead = false; /* keep compiler quiet */
-				break;
-		}
-
-		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-
-		if (isdead)
-		{
-			*tups_vacuumed += 1;
-			/* heap rewrite module still needs to see it... */
-			if (rewrite_heap_dead_tuple(rwstate, tuple))
+			switch (HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf))
 			{
-				/* A previous recently-dead tuple is now known dead */
-				*tups_vacuumed += 1;
-				*tups_recently_dead -= 1;
+				case HEAPTUPLE_DEAD:
+					/* Definitely dead */
+					isdead = true;
+					break;
+				case HEAPTUPLE_RECENTLY_DEAD:
+					*tups_recently_dead += 1;
+					/* fall through */
+				case HEAPTUPLE_LIVE:
+					/* Live or recently dead, must copy it */
+					isdead = false;
+					break;
+				case HEAPTUPLE_INSERT_IN_PROGRESS:
+
+					/*
+					 * As long as we hold exclusive lock on the relation,
+					 * normally the only way to see this is if it was inserted
+					 * earlier in our own transaction.  However, it can happen
+					 * in system catalogs, since we tend to release write lock
+					 * before commit there. Give a warning if neither case
+					 * applies; but in any case we had better copy it.
+					 */
+					if (!is_system_catalog &&
+						!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple->t_data)))
+						elog(WARNING, "concurrent insert in progress within table \"%s\"",
+							 RelationGetRelationName(OldHeap));
+					/* treat as live */
+					isdead = false;
+					break;
+				case HEAPTUPLE_DELETE_IN_PROGRESS:
+
+					/*
+					 * Similar situation to INSERT_IN_PROGRESS case.
+					 */
+					if (!is_system_catalog &&
+						!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(tuple->t_data)))
+						elog(WARNING, "concurrent delete in progress within table \"%s\"",
+							 RelationGetRelationName(OldHeap));
+					/* treat as recently dead */
+					*tups_recently_dead += 1;
+					isdead = false;
+					break;
+				default:
+					elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
+					isdead = false; /* keep compiler quiet */
+					break;
+			}
+
+			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+			if (isdead)
+			{
+				*tups_vacuumed += 1;
+				/* heap rewrite module still needs to see it... */
+				if (rewrite_heap_dead_tuple(rwstate, tuple))
+				{
+					/* A previous recently-dead tuple is now known dead */
+					*tups_vacuumed += 1;
+					*tups_recently_dead -= 1;
+				}
+
+				continue;
 			}
-			continue;
 		}
 
 		*num_tuples += 1;
@@ -933,7 +956,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 		{
 			const int	ct_index[] = {
 				PROGRESS_REPACK_HEAP_TUPLES_SCANNED,
-				PROGRESS_REPACK_HEAP_TUPLES_WRITTEN
+				PROGRESS_REPACK_HEAP_TUPLES_INSERTED
 			};
 			int64		ct_val[2];
 
@@ -948,6 +971,31 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 			ct_val[1] = *num_tuples;
 			pgstat_progress_update_multi_param(2, ct_index, ct_val);
 		}
+
+		/*
+		 * Process the WAL produced by the load, as well as by other
+		 * transactions, so that the replication slot can advance and WAL does
+		 * not pile up. Use wal_segment_size as a threshold so that we do not
+		 * introduce the decoding overhead too often.
+		 *
+		 * Of course, we must not apply the changes until the initial load has
+		 * completed.
+		 *
+		 * Note that our insertions into the new table should not be decoded
+		 * as we (intentionally) do not write the logical decoding specific
+		 * information to WAL.
+		 */
+		if (concurrent)
+		{
+			XLogRecPtr	end_of_wal;
+
+			end_of_wal = GetFlushRecPtr(NULL);
+			if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
+			{
+				repack_decode_concurrent_changes(decoding_ctx, end_of_wal);
+				end_of_wal_prev = end_of_wal;
+			}
+		}
 	}
 
 	if (indexScan != NULL)
@@ -991,15 +1039,32 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 									 values, isnull,
 									 rwstate);
 			/* Report n_tuples */
-			pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_WRITTEN,
+			pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_INSERTED,
 										 n_tuples);
+
+			/*
+			 * Try to keep the amount of not-yet-decoded WAL small, like
+			 * above.
+			 */
+			if (concurrent)
+			{
+				XLogRecPtr	end_of_wal;
+
+				end_of_wal = GetFlushRecPtr(NULL);
+				if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
+				{
+					repack_decode_concurrent_changes(decoding_ctx, end_of_wal);
+					end_of_wal_prev = end_of_wal;
+				}
+			}
 		}
 
 		tuplesort_end(tuplesort);
 	}
 
 	/* Write out any remaining tuples, and fsync if needed */
-	end_heap_rewrite(rwstate);
+	if (rwstate)
+		end_heap_rewrite(rwstate);
 
 	/* Clean up */
 	pfree(values);
@@ -2390,6 +2455,10 @@ heapam_scan_sample_next_tuple(TableScanDesc scan, SampleScanState *scanstate,
  * SET WITHOUT OIDS.
  *
  * So, we must reconstruct the tuple from component Datums.
+ *
+ * If rwstate=NULL, use simple_heap_insert() instead of rewriting - in that
+ * case we still need to deform/form the tuple. TODO Shouldn't we rename the
+ * function, as it might not do any rewrite?
  */
 static void
 reform_and_rewrite_tuple(HeapTuple tuple,
@@ -2412,8 +2481,28 @@ reform_and_rewrite_tuple(HeapTuple tuple,
 
 	copiedTuple = heap_form_tuple(newTupDesc, values, isnull);
 
-	/* The heap rewrite module does the rest */
-	rewrite_heap_tuple(rwstate, tuple, copiedTuple);
+	if (rwstate)
+		/* The heap rewrite module does the rest */
+		rewrite_heap_tuple(rwstate, tuple, copiedTuple);
+	else
+	{
+		/*
+		 * Insert tuple when processing REPACK CONCURRENTLY.
+		 *
+		 * rewriteheap.c is not used in the CONCURRENTLY case because it'd be
+		 * difficult to do the same in the catch-up phase (as the logical
+		 * decoding does not provide us with sufficient visibility
+		 * information). Thus we must use heap_insert() both during the
+		 * catch-up and here.
+		 *
+		 * The following is like simple_heap_insert() except that we pass the
+		 * flag to skip logical decoding: as soon as REPACK CONCURRENTLY swaps
+		 * the relation files, it drops this relation, so no logical
+		 * replication subscription should need the data.
+		 */
+		heap_insert(NewHeap, copiedTuple, GetCurrentCommandId(true),
+					HEAP_INSERT_NO_LOGICAL, NULL);
+	}
 
 	heap_freetuple(copiedTuple);
 }
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 77fd48eb59e..96be7684660 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -620,9 +620,9 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
 		int			options = HEAP_INSERT_SKIP_FSM;
 
 		/*
-		 * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
-		 * for the TOAST table are not logically decoded.  The main heap is
-		 * WAL-logged as XLOG FPI records, which are not logically decoded.
+		 * While rewriting the heap for REPACK, make sure data for the TOAST
+		 * table are not logically decoded.  The main heap is WAL-logged as
+		 * XLOG FPI records, which are not logically decoded.
 		 */
 		options |= HEAP_INSERT_NO_LOGICAL;
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 3f05ba3083a..d79eab5670c 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1298,16 +1298,19 @@ CREATE VIEW pg_stat_progress_repack AS
                       WHEN 2 THEN 'index scanning heap'
                       WHEN 3 THEN 'sorting tuples'
                       WHEN 4 THEN 'writing new heap'
-                      WHEN 5 THEN 'swapping relation files'
-                      WHEN 6 THEN 'rebuilding index'
-                      WHEN 7 THEN 'performing final cleanup'
+                      WHEN 5 THEN 'catch-up'
+                      WHEN 6 THEN 'swapping relation files'
+                      WHEN 7 THEN 'rebuilding index'
+                      WHEN 8 THEN 'performing final cleanup'
                       END AS phase,
         CAST(S.param3 AS oid) AS repack_index_relid,
         S.param4 AS heap_tuples_scanned,
-        S.param5 AS heap_tuples_written,
-        S.param6 AS heap_blks_total,
-        S.param7 AS heap_blks_scanned,
-        S.param8 AS index_rebuild_count
+        S.param5 AS heap_tuples_inserted,
+        S.param6 AS heap_tuples_updated,
+        S.param7 AS heap_tuples_deleted,
+        S.param8 AS heap_blks_total,
+        S.param9 AS heap_blks_scanned,
+        S.param10 AS index_rebuild_count
     FROM pg_stat_get_progress_info('REPACK') AS S
         LEFT JOIN pg_database D ON S.datid = D.oid;
 
@@ -1325,7 +1328,7 @@ CREATE VIEW pg_stat_progress_cluster AS
         phase,
         repack_index_relid AS cluster_index_relid,
         heap_tuples_scanned,
-        heap_tuples_written,
+        heap_tuples_inserted + heap_tuples_updated AS heap_tuples_written,
         heap_blks_total,
         heap_blks_scanned,
         index_rebuild_count
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index e19675a6d05..03ccf10b782 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1,8 +1,23 @@
 /*-------------------------------------------------------------------------
  *
  * cluster.c
- *	  CLUSTER a table on an index.  This is now also used for VACUUM FULL and
- *	  REPACK.
+ *		Implementation of REPACK [CONCURRENTLY], also known as CLUSTER and
+ *		VACUUM FULL.
+ *
+ * There are two somewhat different ways to rewrite a table.  In non-
+ * concurrent mode, it's easy: take AccessExclusiveLock, create a new
+ * transient relation, copy the tuples over to the relfilenode of the new
+ * relation, swap the relfilenodes, then drop the old relation.
+ *
+ * In concurrent mode, we lock the table with only ShareUpdateExclusiveLock,
+ * then do an initial copy as above.  However, while the tuples are being
+ * copied, concurrent transactions could modify the table. To cope with those
+ * changes, we rely on logical decoding to obtain them from WAL.  The changes
+ * are accumulated in a tuplestore.  Once the initial copy is complete, we
+ * read the changes from the tuplestore and re-apply them on the new heap.
+ * Then we upgrade our ShareUpdateExclusiveLock to AccessExclusiveLock and
+ * swap the relfilenodes.  This way, the time we hold a strong lock on the
+ * table is much reduced, and the bloat is eliminated.
  *
  * There is hardly anything left of Paul Brown's original implementation...
  *
@@ -26,6 +41,10 @@
 #include "access/toast_internals.h"
 #include "access/transam.h"
 #include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
 #include "catalog/heap.h"
@@ -33,6 +52,7 @@
 #include "catalog/namespace.h"
 #include "catalog/objectaccess.h"
 #include "catalog/pg_am.h"
+#include "catalog/pg_control.h"
 #include "catalog/pg_inherits.h"
 #include "catalog/toasting.h"
 #include "commands/cluster.h"
@@ -40,15 +60,21 @@
 #include "commands/progress.h"
 #include "commands/tablecmds.h"
 #include "commands/vacuum.h"
+#include "executor/executor.h"
 #include "miscadmin.h"
 #include "optimizer/optimizer.h"
 #include "pgstat.h"
+#include "replication/decode.h"
+#include "replication/logical.h"
+#include "replication/snapbuild.h"
 #include "storage/bufmgr.h"
+#include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
 #include "utils/acl.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
+#include "utils/injection_point.h"
 #include "utils/inval.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -68,12 +94,62 @@ typedef struct
 	Oid			indexOid;
 } RelToCluster;
 
+/*
+ * The following definitions are used for concurrent processing.
+ */
+
+/*
+ * The locators are used to avoid logical decoding of data that we do not need
+ * for our table.
+ */
+static RelFileLocator repacked_rel_locator = {.relNumber = InvalidOid};
+static RelFileLocator repacked_rel_toast_locator = {.relNumber = InvalidOid};
+
+/*
+ * Everything we need to call ExecInsertIndexTuples().
+ */
+typedef struct IndexInsertState
+{
+	ResultRelInfo *rri;
+	EState	   *estate;
+} IndexInsertState;
+
+/* The WAL segment being decoded. */
+static XLogSegNo repack_current_segment = 0;
+
+/*
+ * Information needed to apply concurrent data changes.
+ */
+typedef struct ChangeDest
+{
+	/* The relation the changes are applied to. */
+	Relation	rel;
+
+	/*
+	 * The following is needed to find the existing tuple if the change is
+	 * UPDATE or DELETE. 'ident_key' should have all the fields except for
+	 * 'sk_argument' initialized.
+	 */
+	Relation	ident_index;
+	ScanKey		ident_key;
+	int			ident_key_nentries;
+
+	/* Needed to update indexes of 'rel'. */
+	IndexInsertState *iistate;
+} ChangeDest;
+
 static bool cluster_rel_recheck(RepackCommand cmd, Relation OldHeap,
-								Oid indexOid, Oid userid, int options);
-static void rebuild_relation(Relation OldHeap, Relation index, bool verbose);
+								Oid indexOid, Oid userid, LOCKMODE lmode,
+								int options);
+static void check_repack_concurrently_requirements(Relation rel);
+static void rebuild_relation(Relation OldHeap, Relation index, bool verbose,
+							 bool concurrent);
 static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
-							bool verbose, bool *pSwapToastByContent,
-							TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
+							Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
+							bool verbose,
+							bool *pSwapToastByContent,
+							TransactionId *pFreezeXid,
+							MultiXactId *pCutoffMulti);
 static List *get_tables_to_repack(RepackCommand cmd, bool usingindex,
 								  MemoryContext permcxt);
 static List *get_tables_to_repack_partitioned(RepackCommand cmd,
@@ -81,13 +157,50 @@ static List *get_tables_to_repack_partitioned(RepackCommand cmd,
 											  MemoryContext permcxt);
 static bool cluster_is_permitted_for_relation(RepackCommand cmd,
 											  Oid relid, Oid userid);
+
+static void begin_concurrent_repack(Relation rel);
+static void end_concurrent_repack(void);
+static LogicalDecodingContext *setup_logical_decoding(Oid relid);
+static HeapTuple get_changed_tuple(char *change);
+static void apply_concurrent_changes(RepackDecodingState *dstate,
+									 ChangeDest *dest);
+static void apply_concurrent_insert(Relation rel, HeapTuple tup,
+									IndexInsertState *iistate,
+									TupleTableSlot *index_slot);
+static void apply_concurrent_update(Relation rel, HeapTuple tup,
+									HeapTuple tup_target,
+									IndexInsertState *iistate,
+									TupleTableSlot *index_slot);
+static void apply_concurrent_delete(Relation rel, HeapTuple tup_target);
+static HeapTuple find_target_tuple(Relation rel, ChangeDest *dest,
+								   HeapTuple tup_key,
+								   TupleTableSlot *ident_slot);
+static void process_concurrent_changes(LogicalDecodingContext *decoding_ctx,
+									   XLogRecPtr end_of_wal,
+									   ChangeDest *dest);
+static IndexInsertState *get_index_insert_state(Relation relation,
+												Oid ident_index_id,
+												Relation *ident_index_p);
+static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
+								  int *nentries);
+static void free_index_insert_state(IndexInsertState *iistate);
+static void cleanup_logical_decoding(LogicalDecodingContext *ctx);
+static void rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+											   LogicalDecodingContext *decoding_ctx,
+											   TransactionId frozenXid,
+											   MultiXactId cutoffMulti);
+static List *build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes);
 static Relation process_single_relation(RepackStmt *stmt,
+										LOCKMODE lockmode,
+										bool isTopLevel,
 										ClusterParams *params);
 static Oid	determine_clustered_index(Relation rel, bool usingindex,
 									  const char *indexname);
 static const char *RepackCommandAsString(RepackCommand cmd);
 
 
+#define REPL_PLUGIN_NAME   "pgoutput_repack"
+
 /*
  * The repack code allows for processing multiple tables at once. Because
  * of this, we cannot just run everything on a single transaction, or we
@@ -117,6 +230,7 @@ ExecRepack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
 	ClusterParams params = {0};
 	Relation	rel = NULL;
 	MemoryContext repack_context;
+	LOCKMODE	lockmode;
 	List	   *rtcs;
 
 	/* Parse option list */
@@ -127,6 +241,16 @@ ExecRepack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
 		else if (strcmp(opt->defname, "analyze") == 0 ||
 				 strcmp(opt->defname, "analyse") == 0)
 			params.options |= defGetBoolean(opt) ? CLUOPT_ANALYZE : 0;
+		else if (strcmp(opt->defname, "concurrently") == 0 &&
+				 defGetBoolean(opt))
+		{
+			if (stmt->command != REPACK_COMMAND_REPACK)
+				ereport(ERROR,
+						errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("CONCURRENTLY option not supported for %s",
+							   RepackCommandAsString(stmt->command)));
+			params.options |= CLUOPT_CONCURRENT;
+		}
 		else
 			ereport(ERROR,
 					errcode(ERRCODE_SYNTAX_ERROR),
@@ -136,13 +260,25 @@ ExecRepack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
 					parser_errposition(pstate, opt->location));
 	}
 
+	/*
+	 * Determine the lock mode expected by cluster_rel().
+	 *
+	 * In the exclusive case, we obtain AccessExclusiveLock right away to
+	 * avoid lock-upgrade hazard in the single-transaction case. In the
+	 * CONCURRENTLY case, the AccessExclusiveLock will only be used at the end
+	 * of processing, supposedly for a very short time. Until then, we'll have
+	 * to unlock the relation temporarily, so there's no lock-upgrade hazard.
+	 */
+	lockmode = (params.options & CLUOPT_CONCURRENT) == 0 ?
+		AccessExclusiveLock : ShareUpdateExclusiveLock;
+
 	/*
 	 * If a single relation is specified, process it and we're done ... unless
 	 * the relation is a partitioned table, in which case we fall through.
 	 */
 	if (stmt->relation != NULL)
 	{
-		rel = process_single_relation(stmt, &params);
+		rel = process_single_relation(stmt, lockmode, isTopLevel, &params);
 		if (rel == NULL)
 			return;				/* all done */
 	}
@@ -157,10 +293,29 @@ ExecRepack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
 				errmsg("cannot %s multiple tables", "REPACK (ANALYZE)"));
 
 	/*
-	 * By here, we know we are in a multi-table situation.  In order to avoid
-	 * holding locks for too long, we want to process each table in its own
-	 * transaction.  This forces us to disallow running inside a user
-	 * transaction block.
+	 * By here, we know we are in a multi-table situation.
+	 *
+	 * Concurrent processing is currently considered rather special (e.g. in
+	 * terms of resources consumed) so it is not performed in bulk.
+	 */
+	if (params.options & CLUOPT_CONCURRENT)
+	{
+		if (rel != NULL)
+		{
+			Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
+			ereport(ERROR,
+					errmsg("REPACK CONCURRENTLY not supported for partitioned tables"),
+					errhint("Consider running the command for individual partitions."));
+		}
+		else
+			ereport(ERROR,
+					errmsg("REPACK CONCURRENTLY requires explicit table name"));
+	}
+
+	/*
+	 * In order to avoid holding locks for too long, we want to process each
+	 * table in its own transaction.  This forces us to disallow running
+	 * inside a user transaction block.
 	 */
 	PreventInTransactionBlock(isTopLevel, RepackCommandAsString(stmt->command));
 
@@ -244,7 +399,7 @@ ExecRepack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
 		 * Open the target table, coping with the case where it has been
 		 * dropped.
 		 */
-		rel = try_table_open(rtc->tableOid, AccessExclusiveLock);
+		rel = try_table_open(rtc->tableOid, lockmode);
 		if (rel == NULL)
 		{
 			CommitTransactionCommand();
@@ -255,7 +410,7 @@ ExecRepack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
 		PushActiveSnapshot(GetTransactionSnapshot());
 
 		/* Process this table */
-		cluster_rel(stmt->command, rel, rtc->indexOid, &params);
+		cluster_rel(stmt->command, rel, rtc->indexOid, &params, isTopLevel);
 		/* cluster_rel closes the relation, but keeps lock */
 
 		PopActiveSnapshot();
@@ -284,22 +439,53 @@ ExecRepack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel)
  * If indexOid is InvalidOid, the table will be rewritten in physical order
  * instead of index order.
  *
+ * Note that, in the concurrent case, the function releases the lock at some
+ * point, in order to get AccessExclusiveLock for the final steps (i.e. to
+ * swap the relation files). To make things simpler, the caller should expect
+ * OldHeap to be closed on return, regardless of CLUOPT_CONCURRENT. (The
+ * AccessExclusiveLock is kept till the end of the transaction.)
+ *
  * 'cmd' indicates which command is being executed, to be used for error
  * messages.
  */
 void
 cluster_rel(RepackCommand cmd, Relation OldHeap, Oid indexOid,
-			ClusterParams *params)
+			ClusterParams *params, bool isTopLevel)
 {
 	Oid			tableOid = RelationGetRelid(OldHeap);
+	Relation	index;
+	LOCKMODE	lmode;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
 	bool		verbose = ((params->options & CLUOPT_VERBOSE) != 0);
 	bool		recheck = ((params->options & CLUOPT_RECHECK) != 0);
-	Relation	index;
+	bool		concurrent = ((params->options & CLUOPT_CONCURRENT) != 0);
 
-	Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false));
+	/*
+	 * The lock mode is AccessExclusiveLock for normal processing and
+	 * ShareUpdateExclusiveLock for concurrent processing (so that SELECT,
+	 * INSERT, UPDATE and DELETE commands work, but cluster_rel() cannot be
+	 * called concurrently for the same relation).
+	 */
+	lmode = !concurrent ? AccessExclusiveLock : ShareUpdateExclusiveLock;
+
+	/* There are specific requirements on concurrent processing. */
+	if (concurrent)
+	{
+		/*
+		 * Make sure we have no XID assigned, otherwise the call to
+		 * setup_logical_decoding() can cause a deadlock.
+		 *
+		 * Being inside a transaction block does not actually imply that an
+		 * XID was already assigned, but it very likely was. We might want to
+		 * check the result of GetCurrentTransactionIdIfAny() instead, but
+		 * that would be less clear from the user's perspective.
+		 */
+		PreventInTransactionBlock(isTopLevel, "REPACK (CONCURRENTLY)");
+
+		check_repack_concurrently_requirements(OldHeap);
+	}
 
 	/* Check for user-requested abort. */
 	CHECK_FOR_INTERRUPTS();
@@ -325,10 +511,13 @@ cluster_rel(RepackCommand cmd, Relation OldHeap, Oid indexOid,
 	 * If this is a single-transaction CLUSTER, we can skip these tests. We
 	 * *must* skip the one on indisclustered since it would reject an attempt
 	 * to cluster a not-previously-clustered index.
+	 *
+	 * XXX move [some of] these comments to where the RECHECK flag is
+	 * determined?
 	 */
 	if (recheck &&
 		!cluster_rel_recheck(cmd, OldHeap, indexOid, save_userid,
-							 params->options))
+							 lmode, params->options))
 		goto out;
 
 	/*
@@ -343,6 +532,12 @@ cluster_rel(RepackCommand cmd, Relation OldHeap, Oid indexOid,
 				errmsg("cannot run %s on a shared catalog",
 					   RepackCommandAsString(cmd)));
 
+	/*
+	 * The CONCURRENTLY case should have been rejected earlier because it does
+	 * not support system catalogs.
+	 */
+	Assert(!(OldHeap->rd_rel->relisshared && concurrent));
+
 	/*
 	 * Don't process temp tables of other backends ... their local buffer
 	 * manager is not going to cope.
@@ -363,7 +558,7 @@ cluster_rel(RepackCommand cmd, Relation OldHeap, Oid indexOid,
 	if (OidIsValid(indexOid))
 	{
 		/* verify the index is good and lock it */
-		check_index_is_clusterable(OldHeap, indexOid, AccessExclusiveLock);
+		check_index_is_clusterable(OldHeap, indexOid, lmode);
 		/* also open it */
 		index = index_open(indexOid, NoLock);
 	}
@@ -398,7 +593,9 @@ cluster_rel(RepackCommand cmd, Relation OldHeap, Oid indexOid,
 	if (OldHeap->rd_rel->relkind == RELKIND_MATVIEW &&
 		!RelationIsPopulated(OldHeap))
 	{
-		relation_close(OldHeap, AccessExclusiveLock);
+		if (index)
+			index_close(index, lmode);
+		relation_close(OldHeap, lmode);
 		goto out;
 	}
 
@@ -411,11 +608,34 @@ cluster_rel(RepackCommand cmd, Relation OldHeap, Oid indexOid,
 	 * invalid, because we move tuples around.  Promote them to relation
 	 * locks.  Predicate locks on indexes will be promoted when they are
 	 * reindexed.
+	 *
+	 * During concurrent processing, the heap as well as its indexes stay in
+	 * operation, so we postpone this step until they are locked using
+	 * AccessExclusiveLock near the end of the processing.
 	 */
-	TransferPredicateLocksToHeapRelation(OldHeap);
+	if (!concurrent)
+		TransferPredicateLocksToHeapRelation(OldHeap);
 
 	/* rebuild_relation does all the dirty work */
-	rebuild_relation(OldHeap, index, verbose);
+	PG_TRY();
+	{
+		/*
+		 * For concurrent processing, make sure that our logical decoding
+		 * ignores data changes of tables other than the one we are
+		 * processing.
+		 */
+		if (concurrent)
+			begin_concurrent_repack(OldHeap);
+
+		rebuild_relation(OldHeap, index, verbose, concurrent);
+	}
+	PG_FINALLY();
+	{
+		if (concurrent)
+			end_concurrent_repack();
+	}
+	PG_END_TRY();
+
 	/* rebuild_relation closes OldHeap, and index if valid */
 
 out:
@@ -434,14 +654,14 @@ out:
  */
 static bool
 cluster_rel_recheck(RepackCommand cmd, Relation OldHeap, Oid indexOid,
-					Oid userid, int options)
+					Oid userid, LOCKMODE lmode, int options)
 {
 	Oid			tableOid = RelationGetRelid(OldHeap);
 
 	/* Check that the user still has privileges for the relation */
 	if (!cluster_is_permitted_for_relation(cmd, tableOid, userid))
 	{
-		relation_close(OldHeap, AccessExclusiveLock);
+		relation_close(OldHeap, lmode);
 		return false;
 	}
 
@@ -455,7 +675,7 @@ cluster_rel_recheck(RepackCommand cmd, Relation OldHeap, Oid indexOid,
 	 */
 	if (RELATION_IS_OTHER_TEMP(OldHeap))
 	{
-		relation_close(OldHeap, AccessExclusiveLock);
+		relation_close(OldHeap, lmode);
 		return false;
 	}
 
@@ -466,7 +686,7 @@ cluster_rel_recheck(RepackCommand cmd, Relation OldHeap, Oid indexOid,
 		 */
 		if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(indexOid)))
 		{
-			relation_close(OldHeap, AccessExclusiveLock);
+			relation_close(OldHeap, lmode);
 			return false;
 		}
 
@@ -477,7 +697,7 @@ cluster_rel_recheck(RepackCommand cmd, Relation OldHeap, Oid indexOid,
 		if ((options & CLUOPT_RECHECK_ISCLUSTERED) != 0 &&
 			!get_index_isclustered(indexOid))
 		{
-			relation_close(OldHeap, AccessExclusiveLock);
+			relation_close(OldHeap, lmode);
 			return false;
 		}
 	}
@@ -489,7 +709,7 @@ cluster_rel_recheck(RepackCommand cmd, Relation OldHeap, Oid indexOid,
  * Verify that the specified heap and index are valid to cluster on
  *
  * Side effect: obtains lock on the index.  The caller may
- * in some cases already have AccessExclusiveLock on the table, but
+ * in some cases already have a lock of the same strength on the table, but
  * not in all cases so we can't rely on the table-level lock for
  * protection here.
  */
@@ -619,17 +839,86 @@ mark_index_clustered(Relation rel, Oid indexOid, bool is_internal)
 }
 
 /*
- * rebuild_relation: rebuild an existing relation in index or physical order
- *
- * OldHeap: table to rebuild.
- * index: index to cluster by, or NULL to rewrite in physical order.
- *
- * On entry, heap and index (if one is given) must be open, and
- * AccessExclusiveLock held on them.
- * On exit, they are closed, but locks on them are not released.
+ * Check if the CONCURRENTLY option is legal for the relation.
  */
 static void
-rebuild_relation(Relation OldHeap, Relation index, bool verbose)
+check_repack_concurrently_requirements(Relation rel)
+{
+	char		relpersistence,
+				replident;
+	Oid			ident_idx;
+
+	/* Data changes in system relations are not logically decoded. */
+	if (IsCatalogRelation(rel))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot repack relation \"%s\"",
+						RelationGetRelationName(rel)),
+				 errhint("REPACK CONCURRENTLY is not supported for catalog relations.")));
+
+	/*
+	 * reorderbuffer.c does not seem to handle processing of a TOAST relation
+	 * alone.
+	 */
+	if (IsToastRelation(rel))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot repack relation \"%s\"",
+						RelationGetRelationName(rel)),
+				 errhint("REPACK CONCURRENTLY is not supported for TOAST relations, unless the main relation is repacked too.")));
+
+	relpersistence = rel->rd_rel->relpersistence;
+	if (relpersistence != RELPERSISTENCE_PERMANENT)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("cannot repack relation \"%s\"",
+						RelationGetRelationName(rel)),
+				 errhint("REPACK CONCURRENTLY is only allowed for permanent relations.")));
+
+	/* With NOTHING, WAL does not contain the old tuple. */
+	replident = rel->rd_rel->relreplident;
+	if (replident == REPLICA_IDENTITY_NOTHING)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("cannot repack relation \"%s\"",
+						RelationGetRelationName(rel)),
+				 errhint("Relation \"%s\" has insufficient replication identity.",
+						 RelationGetRelationName(rel))));
+
+	/*
+	 * Identity index is not set if the replica identity is FULL, but PK might
+	 * exist in such a case.
+	 */
+	ident_idx = RelationGetReplicaIndex(rel);
+	if (!OidIsValid(ident_idx) && OidIsValid(rel->rd_pkindex))
+		ident_idx = rel->rd_pkindex;
+	if (!OidIsValid(ident_idx))
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("cannot process relation \"%s\"",
+						RelationGetRelationName(rel)),
+				 errhint("Relation \"%s\" has no identity index.",
+						 RelationGetRelationName(rel))));
+}
+
+
+/*
+ * rebuild_relation: rebuild an existing relation in index or physical order
+ *
+ * OldHeap: table to rebuild.  See cluster_rel() for comments on the required
+ * lock strength.
+ *
+ * index: index to cluster by, or NULL to rewrite in physical order.
+ *
+ * On entry, heap and index (if one is given) must be open, and the
+ * appropriate lock held on them -- AccessExclusiveLock for exclusive
+ * processing and ShareUpdateExclusiveLock for concurrent processing.
+ *
+ * On exit, they are closed, but still locked with AccessExclusiveLock.
+ * (The function handles the lock upgrade if 'concurrent' is true.)
+ */
+static void
+rebuild_relation(Relation OldHeap, Relation index, bool verbose, bool concurrent)
 {
 	Oid			tableOid = RelationGetRelid(OldHeap);
 	Oid			accessMethod = OldHeap->rd_rel->relam;
@@ -637,13 +926,38 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
 	Oid			OIDNewHeap;
 	Relation	NewHeap;
 	char		relpersistence;
-	bool		is_system_catalog;
 	bool		swap_toast_by_content;
 	TransactionId frozenXid;
 	MultiXactId cutoffMulti;
+	LogicalDecodingContext *decoding_ctx = NULL;
+	Snapshot	snapshot = NULL;
+#ifdef USE_ASSERT_CHECKING
+	LOCKMODE	lmode;
 
-	Assert(CheckRelationLockedByMe(OldHeap, AccessExclusiveLock, false) &&
-		   (index == NULL || CheckRelationLockedByMe(index, AccessExclusiveLock, false)));
+	lmode = concurrent ? ShareUpdateExclusiveLock : AccessExclusiveLock;
+
+	Assert(CheckRelationLockedByMe(OldHeap, lmode, false));
+	Assert(index == NULL || CheckRelationLockedByMe(index, lmode, false));
+#endif
+
+	if (concurrent)
+	{
+		/*
+		 * Prepare to capture the concurrent data changes.
+		 *
+		 * Note that this call waits for all transactions with XID already
+		 * assigned to finish. If one of those transactions is waiting for a
+		 * lock conflicting with ShareUpdateExclusiveLock on our table (e.g.
+		 * it runs CREATE INDEX), we can end up in a deadlock. Not sure this
+		 * risk is worth unlocking/locking the table (and its clustering
+		 * index) and checking again if it's still eligible for REPACK
+		 * CONCURRENTLY.
+		 */
+		decoding_ctx = setup_logical_decoding(tableOid);
+
+		snapshot = SnapBuildInitialSnapshotForRepack(decoding_ctx->snapshot_builder);
+		PushActiveSnapshot(snapshot);
+	}
 
 	/* for CLUSTER or REPACK USING INDEX, mark the index as the one to use */
 	if (index != NULL)
@@ -651,7 +965,6 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
 
 	/* Remember info about rel before closing OldHeap */
 	relpersistence = OldHeap->rd_rel->relpersistence;
-	is_system_catalog = IsSystemRelation(OldHeap);
 
 	/*
 	 * Create the transient table that will receive the re-ordered data.
@@ -667,30 +980,61 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose)
 	NewHeap = table_open(OIDNewHeap, NoLock);
 
 	/* Copy the heap data into the new table in the desired order */
-	copy_table_data(NewHeap, OldHeap, index, verbose,
+	copy_table_data(NewHeap, OldHeap, index, snapshot, decoding_ctx, verbose,
 					&swap_toast_by_content, &frozenXid, &cutoffMulti);
 
+	/* The historic snapshot won't be needed anymore. */
+	if (snapshot)
+	{
+		PopActiveSnapshot();
+		UpdateActiveSnapshotCommandId();
+	}
 
-	/* Close relcache entries, but keep lock until transaction commit */
-	table_close(OldHeap, NoLock);
-	if (index)
-		index_close(index, NoLock);
+	if (concurrent)
+	{
+		Assert(!swap_toast_by_content);
 
-	/*
-	 * Close the new relation so it can be dropped as soon as the storage is
-	 * swapped. The relation is not visible to others, so no need to unlock it
-	 * explicitly.
-	 */
-	table_close(NewHeap, NoLock);
+		/*
+		 * Close the index, but keep the lock. Both heaps will be closed by
+		 * the following call.
+		 */
+		if (index)
+			index_close(index, NoLock);
 
-	/*
-	 * Swap the physical files of the target and transient tables, then
-	 * rebuild the target's indexes and throw away the transient table.
-	 */
-	finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
-					 swap_toast_by_content, false, true,
-					 frozenXid, cutoffMulti,
-					 relpersistence);
+		rebuild_relation_finish_concurrent(NewHeap, OldHeap, decoding_ctx,
+										   frozenXid, cutoffMulti);
+
+		pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+									 PROGRESS_REPACK_PHASE_FINAL_CLEANUP);
+
+		/* Done with decoding. */
+		cleanup_logical_decoding(decoding_ctx);
+	}
+	else
+	{
+		bool		is_system_catalog = IsSystemRelation(OldHeap);
+
+		/* Close relcache entries, but keep lock until transaction commit */
+		table_close(OldHeap, NoLock);
+		if (index)
+			index_close(index, NoLock);
+
+		/*
+		 * Close the new relation so it can be dropped as soon as the storage
+		 * is swapped. The relation is not visible to others, so no need to
+		 * unlock it explicitly.
+		 */
+		table_close(NewHeap, NoLock);
+
+		/*
+		 * Swap the physical files of the target and transient tables, then
+		 * rebuild the target's indexes and throw away the transient table.
+		 */
+		finish_heap_swap(tableOid, OIDNewHeap, is_system_catalog,
+						 swap_toast_by_content, false, true, true,
+						 frozenXid, cutoffMulti,
+						 relpersistence);
+	}
 }
 
 
@@ -825,15 +1169,19 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
 /*
  * Do the physical copying of table data.
  *
+ * 'snapshot' and 'decoding_ctx': see table_relation_copy_for_cluster(). Pass
+ * iff concurrent processing is required.
+ *
  * There are three output parameters:
  * *pSwapToastByContent is set true if toast tables must be swapped by content.
  * *pFreezeXid receives the TransactionId used as freeze cutoff point.
  * *pCutoffMulti receives the MultiXactId used as a cutoff point.
  */
 static void
-copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verbose,
-				bool *pSwapToastByContent, TransactionId *pFreezeXid,
-				MultiXactId *pCutoffMulti)
+copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
+				Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
+				bool verbose, bool *pSwapToastByContent,
+				TransactionId *pFreezeXid, MultiXactId *pCutoffMulti)
 {
 	Relation	relRelation;
 	HeapTuple	reltup;
@@ -850,6 +1198,10 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
 	int			elevel = verbose ? INFO : DEBUG2;
 	PGRUsage	ru0;
 	char	   *nspname;
+	bool		concurrent = snapshot != NULL;
+	LOCKMODE	lmode;
+
+	lmode = concurrent ? ShareUpdateExclusiveLock : AccessExclusiveLock;
 
 	pg_rusage_init(&ru0);
 
@@ -878,7 +1230,7 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
 	 * will be held till end of transaction.
 	 */
 	if (OldHeap->rd_rel->reltoastrelid)
-		LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
+		LockRelationOid(OldHeap->rd_rel->reltoastrelid, lmode);
 
 	/*
 	 * If both tables have TOAST tables, perform toast swap by content.  It is
@@ -887,7 +1239,8 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
 	 * swap by links.  This is okay because swap by content is only essential
 	 * for system catalogs, and we don't support schema changes for them.
 	 */
-	if (OldHeap->rd_rel->reltoastrelid && NewHeap->rd_rel->reltoastrelid)
+	if (OldHeap->rd_rel->reltoastrelid && NewHeap->rd_rel->reltoastrelid &&
+		!concurrent)
 	{
 		*pSwapToastByContent = true;
 
@@ -908,6 +1261,10 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
 		 * follow the toast pointers to the wrong place.  (It would actually
 		 * work for values copied over from the old toast table, but not for
 		 * any values that we toast which were previously not toasted.)
+		 *
+		 * This would not work with CONCURRENTLY because we may need to delete
+		 * TOASTed tuples from the new heap. With this hack, we'd delete them
+		 * from the old heap.
 		 */
 		NewHeap->rd_toastoid = OldHeap->rd_rel->reltoastrelid;
 	}
@@ -983,7 +1340,9 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
 	 * values (e.g. because the AM doesn't use freezing).
 	 */
 	table_relation_copy_for_cluster(OldHeap, NewHeap, OldIndex, use_sort,
-									cutoffs.OldestXmin, &cutoffs.FreezeLimit,
+									cutoffs.OldestXmin, snapshot,
+									decoding_ctx,
+									&cutoffs.FreezeLimit,
 									&cutoffs.MultiXactCutoff,
 									&num_tuples, &tups_vacuumed,
 									&tups_recently_dead);
@@ -992,7 +1351,11 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
 	*pFreezeXid = cutoffs.FreezeLimit;
 	*pCutoffMulti = cutoffs.MultiXactCutoff;
 
-	/* Reset rd_toastoid just to be tidy --- it shouldn't be looked at again */
+	/*
+	 * Reset rd_toastoid just to be tidy --- it shouldn't be looked at again.
+	 * In the CONCURRENTLY case, we need to set it again before applying the
+	 * concurrent changes.
+	 */
 	NewHeap->rd_toastoid = InvalidOid;
 
 	num_pages = RelationGetNumberOfBlocks(NewHeap);
@@ -1450,14 +1813,13 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
 				 bool swap_toast_by_content,
 				 bool check_constraints,
 				 bool is_internal,
+				 bool reindex,
 				 TransactionId frozenXid,
 				 MultiXactId cutoffMulti,
 				 char newrelpersistence)
 {
 	ObjectAddress object;
 	Oid			mapped_tables[4];
-	int			reindex_flags;
-	ReindexParams reindex_params = {0};
 	int			i;
 
 	/* Report that we are now swapping relation files */
@@ -1483,39 +1845,47 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
 	if (is_system_catalog)
 		CacheInvalidateCatalog(OIDOldHeap);
 
-	/*
-	 * Rebuild each index on the relation (but not the toast table, which is
-	 * all-new at this point).  It is important to do this before the DROP
-	 * step because if we are processing a system catalog that will be used
-	 * during DROP, we want to have its indexes available.  There is no
-	 * advantage to the other order anyway because this is all transactional,
-	 * so no chance to reclaim disk space before commit.  We do not need a
-	 * final CommandCounterIncrement() because reindex_relation does it.
-	 *
-	 * Note: because index_build is called via reindex_relation, it will never
-	 * set indcheckxmin true for the indexes.  This is OK even though in some
-	 * sense we are building new indexes rather than rebuilding existing ones,
-	 * because the new heap won't contain any HOT chains at all, let alone
-	 * broken ones, so it can't be necessary to set indcheckxmin.
-	 */
-	reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
-	if (check_constraints)
-		reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
+	if (reindex)
+	{
+		int			reindex_flags;
+		ReindexParams reindex_params = {0};
 
-	/*
-	 * Ensure that the indexes have the same persistence as the parent
-	 * relation.
-	 */
-	if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
-		reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
-	else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
-		reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+		/*
+		 * Rebuild each index on the relation (but not the toast table, which
+		 * is all-new at this point).  It is important to do this before the
+		 * DROP step because if we are processing a system catalog that will
+		 * be used during DROP, we want to have its indexes available.  There
+		 * is no advantage to the other order anyway because this is all
+		 * transactional, so no chance to reclaim disk space before commit. We
+		 * do not need a final CommandCounterIncrement() because
+		 * reindex_relation does it.
+		 *
+		 * Note: because index_build is called via reindex_relation, it will
+		 * never set indcheckxmin true for the indexes.  This is OK even
+		 * though in some sense we are building new indexes rather than
+		 * rebuilding existing ones, because the new heap won't contain any
+		 * HOT chains at all, let alone broken ones, so it can't be necessary
+		 * to set indcheckxmin.
+		 */
+		reindex_flags = REINDEX_REL_SUPPRESS_INDEX_USE;
+		if (check_constraints)
+			reindex_flags |= REINDEX_REL_CHECK_CONSTRAINTS;
 
-	/* Report that we are now reindexing relations */
-	pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
-								 PROGRESS_REPACK_PHASE_REBUILD_INDEX);
+		/*
+		 * Ensure that the indexes have the same persistence as the parent
+		 * relation.
+		 */
+		if (newrelpersistence == RELPERSISTENCE_UNLOGGED)
+			reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+		else if (newrelpersistence == RELPERSISTENCE_PERMANENT)
+			reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
 
-	reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+		/* Report that we are now reindexing relations */
+		pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+									 PROGRESS_REPACK_PHASE_REBUILD_INDEX);
+
+		reindex_relation(NULL, OIDOldHeap, reindex_flags, &reindex_params);
+	}
 
 	/* Report that we are now doing clean up */
 	pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
@@ -1559,6 +1929,17 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
 	object.objectId = OIDNewHeap;
 	object.objectSubId = 0;
 
+	if (!reindex)
+	{
+		/*
+		 * Make sure the changes in pg_class are visible. This is especially
+		 * important if !swap_toast_by_content, so that the correct TOAST
+		 * relation is dropped. (reindex_relation() above did not help in this
+		 * case.)
+		 */
+		CommandCounterIncrement();
+	}
+
 	/*
 	 * The new relation is local to our transaction and we know nothing
 	 * depends on it, so DROP_RESTRICT should be OK.
@@ -1598,7 +1979,7 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
 
 			/* Get the associated valid index to be renamed */
 			toastidx = toast_get_valid_index(newrel->rd_rel->reltoastrelid,
-											 NoLock);
+											 AccessExclusiveLock);
 
 			/* rename the toast table ... */
 			snprintf(NewToastName, NAMEDATALEN, "pg_toast_%u",
@@ -1858,7 +2239,8 @@ cluster_is_permitted_for_relation(RepackCommand cmd, Oid relid, Oid userid)
  * case, if an index name is given, it's up to the caller to resolve it.
  */
 static Relation
-process_single_relation(RepackStmt *stmt, ClusterParams *params)
+process_single_relation(RepackStmt *stmt, LOCKMODE lockmode, bool isTopLevel,
+						ClusterParams *params)
 {
 	Relation	rel;
 	Oid			tableOid;
@@ -1867,13 +2249,9 @@ process_single_relation(RepackStmt *stmt, ClusterParams *params)
 	Assert(stmt->command == REPACK_COMMAND_CLUSTER ||
 		   stmt->command == REPACK_COMMAND_REPACK);
 
-	/*
-	 * Find, lock, and check permissions on the table.  We obtain
-	 * AccessExclusiveLock right away to avoid lock-upgrade hazard in the
-	 * single-transaction case.
-	 */
+	/* Find, lock, and check permissions on the table. */
 	tableOid = RangeVarGetRelidExtended(stmt->relation->relation,
-										AccessExclusiveLock,
+										lockmode,
 										0,
 										RangeVarCallbackMaintainsTable,
 										NULL);
@@ -1905,13 +2283,14 @@ process_single_relation(RepackStmt *stmt, ClusterParams *params)
 		return rel;
 	else
 	{
-		Oid			indexOid;
+		Oid			indexOid = InvalidOid;
 
 		indexOid = determine_clustered_index(rel, stmt->usingindex,
 											 stmt->indexname);
 		if (OidIsValid(indexOid))
-			check_index_is_clusterable(rel, indexOid, AccessExclusiveLock);
-		cluster_rel(stmt->command, rel, indexOid, params);
+			check_index_is_clusterable(rel, indexOid, lockmode);
+
+		cluster_rel(stmt->command, rel, indexOid, params, isTopLevel);
 
 		/* Do an analyze, if requested */
 		if (params->options & CLUOPT_ANALYZE)
@@ -1994,3 +2373,1047 @@ RepackCommandAsString(RepackCommand cmd)
 	}
 	return "???";	/* keep compiler quiet */
 }
+
+
+/*
+ * Call this function before REPACK CONCURRENTLY starts to set up logical
+ * decoding. It makes sure that other users of the table put enough
+ * information into WAL.
+ *
+ * The point is that at various places we expect that the table we're
+ * processing is treated like a system catalog. For example, we need to be
+ * able to scan it using a "historic snapshot" anytime during the processing
+ * (as opposed to scanning only at the start point of the decoding, as logical
+ * replication does during initial table synchronization), in order to apply
+ * concurrent UPDATE / DELETE commands.
+ *
+ * Note that the TOAST table needs no attention here as it's not scanned
+ * using a historic snapshot.
+ */
+static void
+begin_concurrent_repack(Relation rel)
+{
+	Oid			toastrelid;
+
+	/*
+	 * Avoid logical decoding of other relations by this backend. The lock we
+	 * have guarantees that the actual locator cannot be changed concurrently:
+	 * TRUNCATE needs AccessExclusiveLock.
+	 */
+	Assert(CheckRelationLockedByMe(rel, ShareUpdateExclusiveLock, false));
+	repacked_rel_locator = rel->rd_locator;
+	toastrelid = rel->rd_rel->reltoastrelid;
+	if (OidIsValid(toastrelid))
+	{
+		Relation	toastrel;
+
+		/* Avoid logical decoding of other TOAST relations. */
+		toastrel = table_open(toastrelid, AccessShareLock);
+		repacked_rel_toast_locator = toastrel->rd_locator;
+		table_close(toastrel, AccessShareLock);
+	}
+}
+
+/*
+ * Call this when done with REPACK CONCURRENTLY.
+ */
+static void
+end_concurrent_repack(void)
+{
+	/*
+	 * Restore normal function of (future) logical decoding for this backend.
+	 */
+	repacked_rel_locator.relNumber = InvalidOid;
+	repacked_rel_toast_locator.relNumber = InvalidOid;
+}
+
+/*
+ * Is this backend performing logical decoding on behalf of REPACK
+ * (CONCURRENTLY)?
+ */
+bool
+am_decoding_for_repack(void)
+{
+	return OidIsValid(repacked_rel_locator.relNumber);
+}
+
+/*
+ * Does the WAL record contain a data change that this backend does not need
+ * to decode on behalf of REPACK (CONCURRENTLY)?
+ */
+bool
+change_useless_for_repack(XLogRecordBuffer *buf)
+{
+	XLogReaderState *r = buf->record;
+	RelFileLocator locator;
+
+	/* TOAST locator should not be set unless the main is. */
+	Assert(!OidIsValid(repacked_rel_toast_locator.relNumber) ||
+		   OidIsValid(repacked_rel_locator.relNumber));
+
+	/*
+	 * Backends not involved in REPACK (CONCURRENTLY) should not do the
+	 * filtering.
+	 */
+	if (!am_decoding_for_repack())
+		return false;
+
+	/*
+	 * If the record does not contain block 0, it's probably not INSERT /
+	 * UPDATE / DELETE. In any case, we do not have enough information to
+	 * filter the change out.
+	 */
+	if (!XLogRecGetBlockTagExtended(r, 0, &locator, NULL, NULL, NULL))
+		return false;
+
+	/*
+	 * Decode the change if it belongs to the table we are repacking, or if it
+	 * belongs to its TOAST relation.
+	 */
+	if (RelFileLocatorEquals(locator, repacked_rel_locator))
+		return false;
+	if (OidIsValid(repacked_rel_toast_locator.relNumber) &&
+		RelFileLocatorEquals(locator, repacked_rel_toast_locator))
+		return false;
+
+	/* Filter out changes of other tables. */
+	return true;
+}
+
+/*
+ * This function is much like pg_create_logical_replication_slot() except that
+ * the new slot is neither released (if anyone else could read changes from
+ * our slot, we could miss changes other backends do while we copy the
+ * existing data into a temporary table), nor persisted (it's easier to handle
+ * a crash by restarting all the work from scratch).
+ */
+static LogicalDecodingContext *
+setup_logical_decoding(Oid relid)
+{
+	Relation	rel;
+	TupleDesc	tupdesc;
+	LogicalDecodingContext *ctx;
+	RepackDecodingState *dstate = palloc0_object(RepackDecodingState);
+
+	/*
+	 * REPACK CONCURRENTLY is not allowed in a transaction block, so this
+	 * should never fire.
+	 */
+	Assert(!TransactionIdIsValid(GetTopTransactionIdIfAny()));
+
+	/*
+	 * A single backend should not execute multiple REPACK commands at a time,
+	 * so use PID to make the slot unique.
+	 */
+	snprintf(NameStr(dstate->slotname), NAMEDATALEN, "repack_%d", MyProcPid);
+
+	/*
+	 * Make sure we can use logical decoding.
+	 */
+	CheckSlotPermissions();
+	CheckLogicalDecodingRequirements();
+	/* RS_TEMPORARY so that the slot gets cleaned up on ERROR. */
+	ReplicationSlotCreate(NameStr(dstate->slotname), true, RS_TEMPORARY,
+						  false, false, false);
+	EnsureLogicalDecodingEnabled();
+
+	/*
+	 * Neither prepare_write nor do_write callback nor update_progress is
+	 * useful for us.
+	 */
+	ctx = CreateInitDecodingContext(REPL_PLUGIN_NAME,
+									NIL,
+									true,
+									InvalidXLogRecPtr,
+									XL_ROUTINE(.page_read = read_local_xlog_page,
+											   .segment_open = wal_segment_open,
+											   .segment_close = wal_segment_close),
+									NULL, NULL, NULL);
+
+	/*
+	 * We don't have control over setting fast_forward, so at least check it.
+	 */
+	Assert(!ctx->fast_forward);
+
+	DecodingContextFindStartpoint(ctx);
+
+	/* Some WAL records should have been read. */
+	Assert(ctx->reader->EndRecPtr != InvalidXLogRecPtr);
+
+	XLByteToSeg(ctx->reader->EndRecPtr, repack_current_segment,
+				wal_segment_size);
+
+	/*
+	 * Set up structures to store decoded changes.
+	 */
+	dstate->relid = relid;
+	dstate->tstore = tuplestore_begin_heap(false, false,
+										   maintenance_work_mem);
+
+	/* Caller should already have the table locked. */
+	rel = table_open(relid, NoLock);
+	tupdesc = CreateTupleDescCopy(RelationGetDescr(rel));
+	dstate->tupdesc = tupdesc;
+	table_close(rel, NoLock);
+
+	/* Initialize the descriptor to store the changes ... */
+	dstate->tupdesc_change = CreateTemplateTupleDesc(1);
+
+	TupleDescInitEntry(dstate->tupdesc_change, 1, NULL, BYTEAOID, -1, 0);
+	/* ... as well as the corresponding slot. */
+	dstate->tsslot = MakeSingleTupleTableSlot(dstate->tupdesc_change,
+											  &TTSOpsMinimalTuple);
+
+	dstate->resowner = ResourceOwnerCreate(CurrentResourceOwner,
+										   "logical decoding");
+
+	ctx->output_writer_private = dstate;
+	return ctx;
+}
+
+/*
+ * Retrieve tuple from ConcurrentChange structure.
+ *
+ * The input data starts with the structure but it might not be appropriately
+ * aligned.
+ */
+static HeapTuple
+get_changed_tuple(char *change)
+{
+	HeapTupleData tup_data;
+	HeapTuple	result;
+	char	   *src;
+
+	/*
+	 * Ensure alignment before accessing the fields. (This is why we can't use
+	 * heap_copytuple() instead of this function.)
+	 */
+	src = change + offsetof(ConcurrentChange, tup_data);
+	memcpy(&tup_data, src, sizeof(HeapTupleData));
+
+	result = (HeapTuple) palloc(HEAPTUPLESIZE + tup_data.t_len);
+	memcpy(result, &tup_data, sizeof(HeapTupleData));
+	result->t_data = (HeapTupleHeader) ((char *) result + HEAPTUPLESIZE);
+	src = change + SizeOfConcurrentChange;
+	memcpy(result->t_data, src, result->t_len);
+
+	return result;
+}
+
+/*
+ * Decode logical changes from the WAL sequence up to end_of_wal.
+ */
+void
+repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
+								 XLogRecPtr end_of_wal)
+{
+	RepackDecodingState *dstate;
+	ResourceOwner resowner_old;
+
+	dstate = (RepackDecodingState *) ctx->output_writer_private;
+	resowner_old = CurrentResourceOwner;
+	CurrentResourceOwner = dstate->resowner;
+
+	PG_TRY();
+	{
+		while (ctx->reader->EndRecPtr < end_of_wal)
+		{
+			XLogRecord *record;
+			XLogSegNo	segno_new;
+			char	   *errm = NULL;
+			XLogRecPtr	end_lsn;
+
+			record = XLogReadRecord(ctx->reader, &errm);
+			if (errm)
+				elog(ERROR, "%s", errm);
+
+			if (record != NULL)
+				LogicalDecodingProcessRecord(ctx, ctx->reader);
+
+			/*
+			 * If WAL segment boundary has been crossed, inform the decoding
+			 * system that the catalog_xmin can advance. (We can confirm more
+			 * often, but filling a single WAL segment should not take much
+			 * time.)
+			 */
+			end_lsn = ctx->reader->EndRecPtr;
+			XLByteToSeg(end_lsn, segno_new, wal_segment_size);
+			if (segno_new != repack_current_segment)
+			{
+				LogicalConfirmReceivedLocation(end_lsn);
+				elog(DEBUG1, "REPACK: confirmed receive location %X/%X",
+					 LSN_FORMAT_ARGS(end_lsn));
+				repack_current_segment = segno_new;
+			}
+
+			CHECK_FOR_INTERRUPTS();
+		}
+		InvalidateSystemCaches();
+		CurrentResourceOwner = resowner_old;
+	}
+	PG_CATCH();
+	{
+		/* clear all timetravel entries */
+		InvalidateSystemCaches();
+		CurrentResourceOwner = resowner_old;
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+}
+
+/*
+ * Apply changes stored in the tuplestore of 'dstate' to 'dest'.
+ */
+static void
+apply_concurrent_changes(RepackDecodingState *dstate, ChangeDest *dest)
+{
+	Relation	rel = dest->rel;
+	TupleTableSlot *index_slot,
+			   *ident_slot;
+	HeapTuple	tup_old = NULL;
+
+	if (dstate->nchanges == 0)
+		return;
+
+	/* TupleTableSlot is needed to pass the tuple to ExecInsertIndexTuples(). */
+	index_slot = MakeSingleTupleTableSlot(dstate->tupdesc, &TTSOpsHeapTuple);
+
+	/* A slot to fetch tuples from identity index. */
+	ident_slot = table_slot_create(rel, NULL);
+
+	while (tuplestore_gettupleslot(dstate->tstore, true, false,
+								   dstate->tsslot))
+	{
+		bool		shouldFree;
+		HeapTuple	tup_change,
+					tup,
+					tup_exist;
+		char	   *change_raw,
+				   *src;
+		ConcurrentChange change;
+		bool		isnull[1];
+		Datum		values[1];
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* Get the change from the single-column tuple. */
+		tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
+		heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
+		Assert(!isnull[0]);
+
+		/* Make sure we access aligned data. */
+		change_raw = (char *) DatumGetByteaP(values[0]);
+		src = (char *) VARDATA(change_raw);
+		memcpy(&change, src, SizeOfConcurrentChange);
+
+		/*
+		 * Extract the tuple from the change. The tuple is copied here because
+		 * it might be assigned to 'tup_old', in which case it needs to
+		 * survive into the next iteration.
+		 */
+		tup = get_changed_tuple(src);
+
+		if (change.kind == CHANGE_UPDATE_OLD)
+		{
+			Assert(tup_old == NULL);
+			tup_old = tup;
+		}
+		else if (change.kind == CHANGE_INSERT)
+		{
+			Assert(tup_old == NULL);
+
+			apply_concurrent_insert(rel, tup, dest->iistate, index_slot);
+
+			pfree(tup);
+		}
+		else if (change.kind == CHANGE_UPDATE_NEW ||
+				 change.kind == CHANGE_DELETE)
+		{
+			HeapTuple	tup_key;
+
+			if (change.kind == CHANGE_UPDATE_NEW)
+			{
+				tup_key = tup_old != NULL ? tup_old : tup;
+			}
+			else
+			{
+				Assert(tup_old == NULL);
+				tup_key = tup;
+			}
+
+			/*
+			 * Find the tuple to be updated or deleted.
+			 */
+			tup_exist = find_target_tuple(rel, dest, tup_key, ident_slot);
+			if (tup_exist == NULL)
+				elog(ERROR, "failed to find target tuple");
+
+			if (change.kind == CHANGE_UPDATE_NEW)
+				apply_concurrent_update(rel, tup, tup_exist, dest->iistate,
+										index_slot);
+			else
+				apply_concurrent_delete(rel, tup_exist);
+
+			if (tup_old != NULL)
+			{
+				pfree(tup_old);
+				tup_old = NULL;
+			}
+
+			pfree(tup);
+		}
+		else
+			elog(ERROR, "unrecognized kind of change: %d", change.kind);
+
+		/*
+		 * If a change was applied now, increment CID for next writes and
+		 * update the snapshot so it sees the changes we've applied so far.
+		 */
+		if (change.kind != CHANGE_UPDATE_OLD)
+		{
+			CommandCounterIncrement();
+			UpdateActiveSnapshotCommandId();
+		}
+
+		/* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
+		Assert(shouldFree);
+		pfree(tup_change);
+	}
+
+	tuplestore_clear(dstate->tstore);
+	dstate->nchanges = 0;
+
+	/* Cleanup. */
+	ExecDropSingleTupleTableSlot(index_slot);
+	ExecDropSingleTupleTableSlot(ident_slot);
+}
+
+static void
+apply_concurrent_insert(Relation rel, HeapTuple tup, IndexInsertState *iistate,
+						TupleTableSlot *index_slot)
+{
+	List	   *recheck;
+
+	/*
+	 * Like simple_heap_insert(), but make sure that the INSERT is not
+	 * logically decoded - see reform_and_rewrite_tuple() for more
+	 * information.
+	 */
+	heap_insert(rel, tup, GetCurrentCommandId(true), HEAP_INSERT_NO_LOGICAL,
+				NULL);
+
+	/*
+	 * Update indexes.
+	 *
+	 * In case functions in the index need the active snapshot and caller
+	 * hasn't set one.
+	 */
+	ExecStoreHeapTuple(tup, index_slot, false);
+	recheck = ExecInsertIndexTuples(iistate->rri,
+									index_slot,
+									iistate->estate,
+									false,	/* update */
+									false,	/* noDupErr */
+									NULL,	/* specConflict */
+									NIL,	/* arbiterIndexes */
+									false	/* onlySummarizing */
+		);
+
+	/*
+	 * If recheck is required, it must have been performed on the source
+	 * relation by now. (All the logical changes we process here are already
+	 * committed.)
+	 */
+	list_free(recheck);
+
+	pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_INSERTED, 1);
+}
+
+static void
+apply_concurrent_update(Relation rel, HeapTuple tup, HeapTuple tup_target,
+						IndexInsertState *iistate, TupleTableSlot *index_slot)
+{
+	LockTupleMode lockmode;
+	TM_FailureData tmfd;
+	TU_UpdateIndexes update_indexes;
+	TM_Result	res;
+	List	   *recheck;
+
+	/*
+	 * Write the new tuple into the new heap. ('tup' gets the TID assigned
+	 * here.)
+	 *
+	 * Do it like in simple_heap_update(), except for 'wal_logical' (and
+	 * except for 'wait').
+	 */
+	res = heap_update(rel, &tup_target->t_self, tup,
+					  GetCurrentCommandId(true),
+					  InvalidSnapshot,
+					  false,	/* no wait - only we are doing changes */
+					  &tmfd, &lockmode, &update_indexes,
+					  false /* wal_logical */ );
+	if (res != TM_Ok)
+		ereport(ERROR, (errmsg("failed to apply concurrent UPDATE")));
+
+	ExecStoreHeapTuple(tup, index_slot, false);
+
+	if (update_indexes != TU_None)
+	{
+		recheck = ExecInsertIndexTuples(iistate->rri,
+										index_slot,
+										iistate->estate,
+										true,	/* update */
+										false,	/* noDupErr */
+										NULL,	/* specConflict */
+										NIL,	/* arbiterIndexes */
+		/* onlySummarizing */
+										update_indexes == TU_Summarizing);
+		list_free(recheck);
+	}
+
+	pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_UPDATED, 1);
+}
+
+static void
+apply_concurrent_delete(Relation rel, HeapTuple tup_target)
+{
+	TM_Result	res;
+	TM_FailureData tmfd;
+
+	/*
+	 * Delete tuple from the new heap.
+	 *
+	 * Do it like in simple_heap_delete(), except for 'wal_logical' (and
+	 * except for 'wait').
+	 */
+	res = heap_delete(rel, &tup_target->t_self, GetCurrentCommandId(true),
+					  InvalidSnapshot, false,
+					  &tmfd,
+					  false,	/* no wait - only we are doing changes */
+					  false /* wal_logical */ );
+
+	if (res != TM_Ok)
+		ereport(ERROR, (errmsg("failed to apply concurrent DELETE")));
+
+	pgstat_progress_incr_param(PROGRESS_REPACK_HEAP_TUPLES_DELETED, 1);
+}
+
+/*
+ * Find the tuple to be updated or deleted.
+ *
+ * 'tup_key' is a tuple containing the key values for the scan.
+ */
+static HeapTuple
+find_target_tuple(Relation rel, ChangeDest *dest, HeapTuple tup_key,
+				  TupleTableSlot *ident_slot)
+{
+	Relation	ident_index = dest->ident_index;
+	IndexScanDesc scan;
+	Form_pg_index ident_form;
+	int2vector *ident_indkey;
+	HeapTuple	result = NULL;
+
+	/* XXX no instrumentation for now */
+	scan = index_beginscan(rel, ident_index, GetActiveSnapshot(),
+						   NULL, dest->ident_key_nentries, 0);
+
+	/*
+	 * Scan key is passed by caller, so it does not have to be constructed
+	 * multiple times. Key entries have all fields initialized, except for
+	 * sk_argument.
+	 */
+	index_rescan(scan, dest->ident_key, dest->ident_key_nentries, NULL, 0);
+
+	/* Info needed to retrieve key values from heap tuple. */
+	ident_form = ident_index->rd_index;
+	ident_indkey = &ident_form->indkey;
+
+	/* Use the incoming tuple to finalize the scan key. */
+	for (int i = 0; i < scan->numberOfKeys; i++)
+	{
+		ScanKey		entry;
+		bool		isnull;
+		int16		attno_heap;
+
+		entry = &scan->keyData[i];
+		attno_heap = ident_indkey->values[i];
+		entry->sk_argument = heap_getattr(tup_key,
+										  attno_heap,
+										  rel->rd_att,
+										  &isnull);
+		Assert(!isnull);
+	}
+	if (index_getnext_slot(scan, ForwardScanDirection, ident_slot))
+	{
+		bool		shouldFree;
+
+		result = ExecFetchSlotHeapTuple(ident_slot, false, &shouldFree);
+		/* TTSOpsBufferHeapTuple has .get_heap_tuple != NULL. */
+		Assert(!shouldFree);
+	}
+	index_endscan(scan);
+
+	return result;
+}
+
+/*
+ * Decode and apply concurrent changes.
+ */
+static void
+process_concurrent_changes(LogicalDecodingContext *decoding_ctx,
+						   XLogRecPtr end_of_wal, ChangeDest *dest)
+{
+	RepackDecodingState *dstate;
+
+	pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+								 PROGRESS_REPACK_PHASE_CATCH_UP);
+
+	dstate = (RepackDecodingState *) decoding_ctx->output_writer_private;
+
+	repack_decode_concurrent_changes(decoding_ctx, end_of_wal);
+
+	if (dstate->nchanges == 0)
+		return;
+
+	apply_concurrent_changes(dstate, dest);
+}
+
+/*
+ * Initialize IndexInsertState for index specified by ident_index_id.
+ *
+ * While doing that, also return the identity index in *ident_index_p.
+ */
+static IndexInsertState *
+get_index_insert_state(Relation relation, Oid ident_index_id,
+					   Relation *ident_index_p)
+{
+	EState	   *estate;
+	int			i;
+	IndexInsertState *result;
+	Relation	ident_index = NULL;
+
+	result = (IndexInsertState *) palloc0(sizeof(IndexInsertState));
+	estate = CreateExecutorState();
+
+	result->rri = (ResultRelInfo *) palloc(sizeof(ResultRelInfo));
+	InitResultRelInfo(result->rri, relation, 0, 0, 0);
+	ExecOpenIndices(result->rri, false);
+
+	/*
+	 * Find the relcache entry of the identity index so that we spend no extra
+	 * effort to open / close it.
+	 */
+	for (i = 0; i < result->rri->ri_NumIndices; i++)
+	{
+		Relation	ind_rel;
+
+		ind_rel = result->rri->ri_IndexRelationDescs[i];
+		if (ind_rel->rd_id == ident_index_id)
+			ident_index = ind_rel;
+	}
+	if (ident_index == NULL)
+		elog(ERROR, "failed to open identity index");
+
+	/* Only initialize fields needed by ExecInsertIndexTuples(). */
+	result->estate = estate;
+
+	*ident_index_p = ident_index;
+	return result;
+}
+
+/*
+ * Build scan key to process logical changes.
+ */
+static ScanKey
+build_identity_key(Oid ident_idx_oid, Relation rel_src, int *nentries)
+{
+	Relation	ident_idx_rel;
+	Form_pg_index ident_idx;
+	int			n,
+				i;
+	ScanKey		result;
+
+	Assert(OidIsValid(ident_idx_oid));
+	ident_idx_rel = index_open(ident_idx_oid, AccessShareLock);
+	ident_idx = ident_idx_rel->rd_index;
+	n = ident_idx->indnatts;
+	result = (ScanKey) palloc(sizeof(ScanKeyData) * n);
+	for (i = 0; i < n; i++)
+	{
+		ScanKey		entry;
+		int16		relattno;
+		Form_pg_attribute att;
+		Oid			opfamily,
+					opcintype,
+					opno,
+					opcode;
+
+		entry = &result[i];
+		relattno = ident_idx->indkey.values[i];
+		if (relattno >= 1)
+		{
+			TupleDesc	desc;
+
+			desc = rel_src->rd_att;
+			att = TupleDescAttr(desc, relattno - 1);
+		}
+		else
+			elog(ERROR, "unexpected attribute number %d in index", relattno);
+
+		opfamily = ident_idx_rel->rd_opfamily[i];
+		opcintype = ident_idx_rel->rd_opcintype[i];
+		opno = get_opfamily_member(opfamily, opcintype, opcintype,
+								   BTEqualStrategyNumber);
+
+		if (!OidIsValid(opno))
+			elog(ERROR, "failed to find = operator for type %u", opcintype);
+
+		opcode = get_opcode(opno);
+		if (!OidIsValid(opcode))
+			elog(ERROR, "failed to find function for operator %u", opno);
+
+		/* Initialize everything but argument. */
+		ScanKeyInit(entry,
+					i + 1,
+					BTEqualStrategyNumber, opcode,
+					(Datum) 0);
+		entry->sk_collation = att->attcollation;
+	}
+	index_close(ident_idx_rel, AccessShareLock);
+
+	*nentries = n;
+	return result;
+}
+
+static void
+free_index_insert_state(IndexInsertState *iistate)
+{
+	ExecCloseIndices(iistate->rri);
+	FreeExecutorState(iistate->estate);
+	pfree(iistate->rri);
+	pfree(iistate);
+}
+
+static void
+cleanup_logical_decoding(LogicalDecodingContext *ctx)
+{
+	RepackDecodingState *dstate;
+
+	dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+	ExecDropSingleTupleTableSlot(dstate->tsslot);
+	FreeTupleDesc(dstate->tupdesc_change);
+	FreeTupleDesc(dstate->tupdesc);
+	tuplestore_end(dstate->tstore);
+
+	FreeDecodingContext(ctx);
+
+	ReplicationSlotRelease();
+	ReplicationSlotDrop(NameStr(dstate->slotname), false);
+	pfree(dstate);
+}
+
+/*
+ * The final steps of rebuild_relation() for concurrent processing.
+ *
+ * On entry, NewHeap is locked in AccessExclusiveLock mode. OldHeap and its
+ * clustering index (if one is passed) are still locked in a mode that allows
+ * concurrent data changes. On exit, both tables and their indexes are closed,
+ * but locked in AccessExclusiveLock mode.
+ */
+static void
+rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
+								   LogicalDecodingContext *decoding_ctx,
+								   TransactionId frozenXid,
+								   MultiXactId cutoffMulti)
+{
+	LOCKMODE	lockmode_old PG_USED_FOR_ASSERTS_ONLY;
+	List	   *ind_oids_new;
+	Oid			old_table_oid = RelationGetRelid(OldHeap);
+	Oid			new_table_oid = RelationGetRelid(NewHeap);
+	List	   *ind_oids_old = RelationGetIndexList(OldHeap);
+	ListCell   *lc,
+			   *lc2;
+	char		relpersistence;
+	bool		is_system_catalog;
+	Oid			ident_idx_old,
+				ident_idx_new;
+	XLogRecPtr	wal_insert_ptr,
+				end_of_wal;
+	char		dummy_rec_data = '\0';
+	Relation   *ind_refs,
+			   *ind_refs_p;
+	int			nind;
+	ChangeDest	chgdst;
+
+	/* Like in cluster_rel(). */
+	lockmode_old = ShareUpdateExclusiveLock;
+	Assert(CheckRelationLockedByMe(OldHeap, lockmode_old, false));
+	/* This is expected from the caller. */
+	Assert(CheckRelationLockedByMe(NewHeap, AccessExclusiveLock, false));
+
+	ident_idx_old = RelationGetReplicaIndex(OldHeap);
+
+	/*
+	 * Unlike the exclusive case, we build new indexes for the new relation
+	 * rather than swapping the storage and reindexing the old relation. The
+	 * point is that the index build can take some time, so we do it before we
+	 * get AccessExclusiveLock on the old heap and therefore we cannot swap
+	 * the heap storage yet.
+	 *
+	 * index_create() will lock the new indexes using AccessExclusiveLock - no
+	 * need to change that. At the same time, we use ShareUpdateExclusiveLock
+	 * to lock the existing indexes - that should be enough to prevent others
+	 * from changing them while we're repacking the relation. The lock on the
+	 * table should prevent others from changing the index column list, but
+	 * might not be enough for commands like ALTER INDEX ... SET ... (Those
+	 * are not necessarily dangerous, but can confuse users if the changes
+	 * they make get lost due to REPACK.)
+	 */
+	ind_oids_new = build_new_indexes(NewHeap, OldHeap, ind_oids_old);
+
+	/*
+	 * Processing shouldn't start without a valid identity index.
+	 */
+	Assert(OidIsValid(ident_idx_old));
+
+	/* Find "identity index" on the new relation. */
+	ident_idx_new = InvalidOid;
+	forboth(lc, ind_oids_old, lc2, ind_oids_new)
+	{
+		Oid			ind_old = lfirst_oid(lc);
+		Oid			ind_new = lfirst_oid(lc2);
+
+		if (ident_idx_old == ind_old)
+		{
+			ident_idx_new = ind_new;
+			break;
+		}
+	}
+	if (!OidIsValid(ident_idx_new))
+
+		/*
+		 * Should not happen, given our lock on the old relation.
+		 */
+		ereport(ERROR,
+				(errmsg("identity index missing on the new relation")));
+
+	/* Gather information to apply concurrent changes. */
+	chgdst.rel = NewHeap;
+	chgdst.iistate = get_index_insert_state(NewHeap, ident_idx_new,
+											&chgdst.ident_index);
+	chgdst.ident_key = build_identity_key(ident_idx_new, OldHeap,
+										  &chgdst.ident_key_nentries);
+
+	/*
+	 * During testing, wait for another backend to perform concurrent data
+	 * changes which we will process below.
+	 */
+	INJECTION_POINT("repack-concurrently-before-lock", NULL);
+
+	/*
+	 * Flush all WAL records inserted so far (possibly except for the last
+	 * incomplete page, see GetInsertRecPtr), to minimize the amount of data
+	 * we need to flush while holding exclusive lock on the source table.
+	 */
+	wal_insert_ptr = GetInsertRecPtr();
+	XLogFlush(wal_insert_ptr);
+	end_of_wal = GetFlushRecPtr(NULL);
+
+	/*
+	 * Apply concurrent changes for the first time, to minimize the time we
+	 * need to hold AccessExclusiveLock. (Quite some amount of WAL could have
+	 * been written during the data copying and index creation.)
+	 */
+	process_concurrent_changes(decoding_ctx, end_of_wal, &chgdst);
+
+	/*
+	 * Acquire AccessExclusiveLock on the table, its TOAST relation (if there
+	 * is one), and all its indexes, so that we can swap the files.
+	 */
+	LockRelationOid(old_table_oid, AccessExclusiveLock);
+
+	/*
+	 * Lock all indexes now, not only the clustering one: all indexes need to
+	 * have their files swapped. While doing that, store their relation
+	 * references in an array, to handle predicate locks below.
+	 */
+	ind_refs_p = ind_refs = palloc_array(Relation, list_length(ind_oids_old));
+	nind = 0;
+	foreach_oid(ind_oid, ind_oids_old)
+	{
+		Relation	index;
+
+		index = index_open(ind_oid, AccessExclusiveLock);
+
+		/*
+		 * TODO 1) Do we need to check if ALTER INDEX was executed since the
+		 * new index was created in build_new_indexes()? 2) Specifically for
+		 * the clustering index, should check_index_is_clusterable() be called
+		 * here? (Not sure about the latter: ShareUpdateExclusiveLock on the
+		 * table probably blocks all commands that affect the result of
+		 * check_index_is_clusterable().)
+		 */
+		*ind_refs_p = index;
+		ind_refs_p++;
+		nind++;
+	}
+
+	/*
+	 * Lock the OldHeap's TOAST relation exclusively - again, the lock is
+	 * needed to swap the files.
+	 */
+	if (OidIsValid(OldHeap->rd_rel->reltoastrelid))
+		LockRelationOid(OldHeap->rd_rel->reltoastrelid, AccessExclusiveLock);
+
+	/*
+	 * Tuples and pages of the old heap will be gone, but the heap will stay.
+	 */
+	TransferPredicateLocksToHeapRelation(OldHeap);
+	/* The same for indexes. */
+	for (int i = 0; i < nind; i++)
+	{
+		Relation	index = ind_refs[i];
+
+		TransferPredicateLocksToHeapRelation(index);
+
+		/*
+		 * References to indexes on the old relation are not needed anymore,
+		 * however locks stay till the end of the transaction.
+		 */
+		index_close(index, NoLock);
+	}
+	pfree(ind_refs);
+
+	/*
+	 * Flush anything we see in WAL, to make sure that all changes committed
+	 * while we were waiting for the exclusive lock are available for
+	 * decoding. This should not be necessary if all backends had
+	 * synchronous_commit set, but we can't rely on this setting.
+	 *
+	 * Unfortunately, GetInsertRecPtr() may lag behind the actual insert
+	 * position, and GetLastImportantRecPtr() points at the start of the last
+	 * record rather than at the end. Thus the simplest way to determine the
+	 * insert position is to insert a dummy record and use its LSN.
+	 *
+	 * XXX Consider using GetLastImportantRecPtr() and adding the size of the
+	 * last record (plus the total size of all the page headers the record
+	 * spans)?
+	 */
+	XLogBeginInsert();
+	XLogRegisterData(&dummy_rec_data, 1);
+	wal_insert_ptr = XLogInsert(RM_XLOG_ID, XLOG_NOOP);
+	XLogFlush(wal_insert_ptr);
+	end_of_wal = GetFlushRecPtr(NULL);
+
+	/* Apply the concurrent changes again. */
+	process_concurrent_changes(decoding_ctx, end_of_wal, &chgdst);
+
+	/* Remember info about rel before closing OldHeap */
+	relpersistence = OldHeap->rd_rel->relpersistence;
+	is_system_catalog = IsSystemRelation(OldHeap);
+
+	pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+								 PROGRESS_REPACK_PHASE_SWAP_REL_FILES);
+
+	/*
+	 * Even ShareUpdateExclusiveLock should have prevented others from
+	 * creating / dropping indexes (even using the CONCURRENTLY option), so we
+	 * do not need to check whether the lists match.
+	 */
+	forboth(lc, ind_oids_old, lc2, ind_oids_new)
+	{
+		Oid			ind_old = lfirst_oid(lc);
+		Oid			ind_new = lfirst_oid(lc2);
+		Oid			mapped_tables[4];
+
+		/* Zero out possible results from swap_relation_files */
+		memset(mapped_tables, 0, sizeof(mapped_tables));
+
+		swap_relation_files(ind_old, ind_new,
+							(old_table_oid == RelationRelationId),
+							false,	/* swap_toast_by_content */
+							true,
+							InvalidTransactionId,
+							InvalidMultiXactId,
+							mapped_tables);
+
+#ifdef USE_ASSERT_CHECKING
+
+		/*
+		 * Concurrent processing is not supported for system relations, so
+		 * there should be no mapped tables.
+		 */
+		for (int i = 0; i < 4; i++)
+			Assert(mapped_tables[i] == 0);
+#endif
+	}
+
+	/* The new indexes must be visible for deletion. */
+	CommandCounterIncrement();
+
+	/* Close the old heap but keep lock until transaction commit. */
+	table_close(OldHeap, NoLock);
+	/* Close the new heap. (We didn't have to open its indexes). */
+	table_close(NewHeap, NoLock);
+
+	/* Cleanup what we don't need anymore. (And close the identity index.) */
+	pfree(chgdst.ident_key);
+	free_index_insert_state(chgdst.iistate);
+
+	/*
+	 * Swap the relations and their TOAST relations and TOAST indexes. This
+	 * also drops the new relation and its indexes.
+	 *
+	 * (System catalogs are currently not supported.)
+	 */
+	Assert(!is_system_catalog);
+	finish_heap_swap(old_table_oid, new_table_oid,
+					 is_system_catalog,
+					 false,		/* swap_toast_by_content */
+					 false, true, false,
+					 frozenXid, cutoffMulti,
+					 relpersistence);
+}
+
+/*
+ * Build indexes on NewHeap according to those on OldHeap.
+ *
+ * OldIndexes is the list of index OIDs on OldHeap. The contained indexes end
+ * up locked using ShareUpdateExclusiveLock.
+ *
+ * A list of OIDs of the corresponding indexes created on NewHeap is
+ * returned. The order of items matches, so we can use these lists to swap
+ * index storage.
+ */
+static List *
+build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes)
+{
+	List	   *result = NIL;
+
+	pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
+								 PROGRESS_REPACK_PHASE_REBUILD_INDEX);
+
+	foreach_oid(ind_oid, OldIndexes)
+	{
+		Oid			ind_oid_new;
+		char	   *newName;
+		Relation	ind;
+
+		ind = index_open(ind_oid, ShareUpdateExclusiveLock);
+
+		newName = ChooseRelationName(get_rel_name(ind_oid),
+									 NULL,
+									 "repacknew",
+									 get_rel_namespace(ind->rd_index->indrelid),
+									 false);
+		ind_oid_new = index_create_copy(NewHeap, ind_oid,
+										ind->rd_rel->reltablespace, newName,
+										false);
+		result = lappend_oid(result, ind_oid_new);
+
+		index_close(ind, NoLock);
+	}
+
+	return result;
+}
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 81a55a33ef2..ebc70f5bead 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -892,7 +892,7 @@ refresh_by_match_merge(Oid matviewOid, Oid tempOid, Oid relowner,
 static void
 refresh_by_heap_swap(Oid matviewOid, Oid OIDNewHeap, char relpersistence)
 {
-	finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true,
+	finish_heap_swap(matviewOid, OIDNewHeap, false, false, true, true, true,
 					 RecentXmin, ReadNextMultiXactId(), relpersistence);
 }
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index f976c0e5c7e..296387c7889 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -6025,6 +6025,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
 			finish_heap_swap(tab->relid, OIDNewHeap,
 							 false, false, true,
 							 !OidIsValid(tab->newTableSpace),
+							 true,
 							 RecentXmin,
 							 ReadNextMultiXactId(),
 							 persistence);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index aea998260e1..bce24d0f804 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -126,7 +126,7 @@ static void vac_truncate_clog(TransactionId frozenXID,
 							  TransactionId lastSaneFrozenXid,
 							  MultiXactId lastSaneMinMulti);
 static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams params,
-					   BufferAccessStrategy bstrategy);
+					   BufferAccessStrategy bstrategy, bool isTopLevel);
 static double compute_parallel_delay(void);
 static VacOptValue get_vacoptval_from_boolean(DefElem *def);
 static bool vac_tid_reaped(ItemPointer itemptr, void *state);
@@ -629,7 +629,8 @@ vacuum(List *relations, const VacuumParams params, BufferAccessStrategy bstrateg
 
 			if (params.options & VACOPT_VACUUM)
 			{
-				if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy))
+				if (!vacuum_rel(vrel->oid, vrel->relation, params, bstrategy,
+								isTopLevel))
 					continue;
 			}
 
@@ -1999,7 +2000,7 @@ vac_truncate_clog(TransactionId frozenXID,
  */
 static bool
 vacuum_rel(Oid relid, RangeVar *relation, VacuumParams params,
-		   BufferAccessStrategy bstrategy)
+		   BufferAccessStrategy bstrategy, bool isTopLevel)
 {
 	LOCKMODE	lmode;
 	Relation	rel;
@@ -2290,7 +2291,7 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams params,
 
 			/* VACUUM FULL is a variant of REPACK; see cluster.c */
 			cluster_rel(REPACK_COMMAND_VACUUMFULL, rel, InvalidOid,
-						&cluster_params);
+						&cluster_params, isTopLevel);
 			/* cluster_rel closes the relation, but keeps lock */
 
 			rel = NULL;
@@ -2333,7 +2334,8 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams params,
 		toast_vacuum_params.options |= VACOPT_PROCESS_MAIN;
 		toast_vacuum_params.toast_parent = relid;
 
-		vacuum_rel(toast_relid, NULL, toast_vacuum_params, bstrategy);
+		vacuum_rel(toast_relid, NULL, toast_vacuum_params, bstrategy,
+				   isTopLevel);
 	}
 
 	/*
diff --git a/src/backend/meson.build b/src/backend/meson.build
index 712a857cdb4..3e43edf48a0 100644
--- a/src/backend/meson.build
+++ b/src/backend/meson.build
@@ -194,5 +194,6 @@ pg_test_mod_args = pg_mod_args + {
 subdir('jit/llvm')
 subdir('replication/libpqwalreceiver')
 subdir('replication/pgoutput')
+subdir('replication/pgoutput_repack')
 subdir('snowball')
 subdir('utils/mb/conversion_procs')
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 32af1249610..887873c93ac 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -33,6 +33,7 @@
 #include "access/xlogreader.h"
 #include "access/xlogrecord.h"
 #include "catalog/pg_control.h"
+#include "commands/cluster.h"
 #include "replication/decode.h"
 #include "replication/logical.h"
 #include "replication/message.h"
@@ -420,7 +421,8 @@ heap2_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	{
 		case XLOG_HEAP2_MULTI_INSERT:
 			if (SnapBuildProcessChange(builder, xid, buf->origptr) &&
-				!ctx->fast_forward)
+				!ctx->fast_forward &&
+				!change_useless_for_repack(buf))
 				DecodeMultiInsert(ctx, buf);
 			break;
 		case XLOG_HEAP2_NEW_CID:
@@ -467,6 +469,15 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	TransactionId xid = XLogRecGetXid(buf->record);
 	SnapBuild  *builder = ctx->snapshot_builder;
 
+	/*
+	 * XXX Should we return here if change_useless_for_repack() returns true,
+	 * instead of calling the function below? Unlike the fast-forward case, we
+	 * shouldn't need the base snapshot for the containing transaction until
+	 * we receive a change that belongs to the table being REPACKed. Thus it
+	 * should be fine to skip SnapBuildProcessChange(), and therefore
+	 * reorderbuffer.c can create the transaction later.
+	 */
+
 	ReorderBufferProcessXid(ctx->reorder, xid, buf->origptr);
 
 	/*
@@ -484,7 +495,8 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	{
 		case XLOG_HEAP_INSERT:
 			if (SnapBuildProcessChange(builder, xid, buf->origptr) &&
-				!ctx->fast_forward)
+				!ctx->fast_forward &&
+				!change_useless_for_repack(buf))
 				DecodeInsert(ctx, buf);
 			break;
 
@@ -496,19 +508,22 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_HEAP_HOT_UPDATE:
 		case XLOG_HEAP_UPDATE:
 			if (SnapBuildProcessChange(builder, xid, buf->origptr) &&
-				!ctx->fast_forward)
+				!ctx->fast_forward &&
+				!change_useless_for_repack(buf))
 				DecodeUpdate(ctx, buf);
 			break;
 
 		case XLOG_HEAP_DELETE:
 			if (SnapBuildProcessChange(builder, xid, buf->origptr) &&
-				!ctx->fast_forward)
+				!ctx->fast_forward &&
+				!change_useless_for_repack(buf))
 				DecodeDelete(ctx, buf);
 			break;
 
 		case XLOG_HEAP_TRUNCATE:
 			if (SnapBuildProcessChange(builder, xid, buf->origptr) &&
-				!ctx->fast_forward)
+				!ctx->fast_forward &&
+				!change_useless_for_repack(buf))
 				DecodeTruncate(ctx, buf);
 			break;
 
@@ -524,7 +539,8 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 		case XLOG_HEAP_CONFIRM:
 			if (SnapBuildProcessChange(builder, xid, buf->origptr) &&
-				!ctx->fast_forward)
+				!ctx->fast_forward &&
+				!change_useless_for_repack(buf))
 				DecodeSpecConfirm(ctx, buf);
 			break;
 
@@ -1021,6 +1037,15 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	xlrec = (xl_heap_delete *) XLogRecGetData(r);
 
+	/*
+	 * Ignore changes which are considered useless for logical
+	 * decoding. Currently such changes are created by REPACK (CONCURRENTLY)
+	 * when it replays DELETE commands on the new table (which is not yet
+	 * visible to other transactions).
+	 */
+	if (xlrec->flags & XLH_DELETE_NO_LOGICAL)
+		return;
+
 	/* only interested in our database */
 	XLogRecGetBlockTag(r, 0, &target_locator, NULL, NULL);
 	if (target_locator.dbOid != ctx->slot->data.database)
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index a738ad8a864..ffe6aa7f7bb 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -486,6 +486,27 @@ SnapBuildInitialSnapshot(SnapBuild *builder)
 	return SnapBuildMVCCFromHistoric(snap, true);
 }
 
+/*
+ * Build an MVCC snapshot for the initial data load performed by the REPACK
+ * (CONCURRENTLY) command.
+ *
+ * The snapshot will only be used to scan one particular relation, which is
+ * treated like a catalog (therefore ->building_full_snapshot is not
+ * important), and the caller should already have a replication slot set up
+ * (so we do not set MyProc->xmin). XXX Do we still need to add restrictions?
+ */
+Snapshot
+SnapBuildInitialSnapshotForRepack(SnapBuild *builder)
+{
+	Snapshot	snap;
+
+	Assert(builder->state == SNAPBUILD_CONSISTENT);
+	Assert(builder->building_full_snapshot);
+
+	snap = SnapBuildBuildSnapshot(builder);
+	return SnapBuildMVCCFromHistoric(snap, false);
+}
+
 /*
  * Turn a historic MVCC snapshot into an ordinary MVCC snapshot.
  *
diff --git a/src/backend/replication/pgoutput_repack/Makefile b/src/backend/replication/pgoutput_repack/Makefile
new file mode 100644
index 00000000000..4efeb713b70
--- /dev/null
+++ b/src/backend/replication/pgoutput_repack/Makefile
@@ -0,0 +1,32 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for src/backend/replication/pgoutput_repack
+#
+# IDENTIFICATION
+#    src/backend/replication/pgoutput_repack
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/replication/pgoutput_repack
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	$(WIN32RES) \
+	pgoutput_repack.o
+PGFILEDESC = "pgoutput_repack - logical replication output plugin for REPACK command"
+NAME = pgoutput_repack
+
+all: all-shared-lib
+
+include $(top_srcdir)/src/Makefile.shlib
+
+install: all installdirs install-lib
+
+installdirs: installdirs-lib
+
+uninstall: uninstall-lib
+
+clean distclean: clean-lib
+	rm -f $(OBJS)
diff --git a/src/backend/replication/pgoutput_repack/meson.build b/src/backend/replication/pgoutput_repack/meson.build
new file mode 100644
index 00000000000..133e865a4a0
--- /dev/null
+++ b/src/backend/replication/pgoutput_repack/meson.build
@@ -0,0 +1,18 @@
+# Copyright (c) 2022-2026, PostgreSQL Global Development Group
+
+pgoutput_repack_sources = files(
+  'pgoutput_repack.c',
+)
+
+if host_system == 'windows'
+  pgoutput_repack_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'pgoutput_repack',
+    '--FILEDESC', 'pgoutput_repack - logical replication output plugin for REPACK command',])
+endif
+
+pgoutput_repack = shared_module('pgoutput_repack',
+  pgoutput_repack_sources,
+  kwargs: pg_mod_args,
+)
+
+backend_targets += pgoutput_repack
diff --git a/src/backend/replication/pgoutput_repack/pgoutput_repack.c b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
new file mode 100644
index 00000000000..6b54ea040ac
--- /dev/null
+++ b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
@@ -0,0 +1,239 @@
+/*-------------------------------------------------------------------------
+ *
+ * pgoutput_repack.c
+ *		Logical Replication output plugin for REPACK command
+ *
+ * Copyright (c) 2012-2026, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		  src/backend/replication/pgoutput_repack/pgoutput_repack.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/heaptoast.h"
+#include "commands/cluster.h"
+#include "replication/snapbuild.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC;
+
+static void plugin_startup(LogicalDecodingContext *ctx,
+						   OutputPluginOptions *opt, bool is_init);
+static void plugin_shutdown(LogicalDecodingContext *ctx);
+static void plugin_begin_txn(LogicalDecodingContext *ctx,
+							 ReorderBufferTXN *txn);
+static void plugin_commit_txn(LogicalDecodingContext *ctx,
+							  ReorderBufferTXN *txn, XLogRecPtr commit_lsn);
+static void plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+						  Relation relation, ReorderBufferChange *change);
+static void store_change(LogicalDecodingContext *ctx,
+						 ConcurrentChangeKind kind, HeapTuple tuple);
+
+void
+_PG_output_plugin_init(OutputPluginCallbacks *cb)
+{
+	cb->startup_cb = plugin_startup;
+	cb->begin_cb = plugin_begin_txn;
+	cb->change_cb = plugin_change;
+	cb->commit_cb = plugin_commit_txn;
+	cb->shutdown_cb = plugin_shutdown;
+}
+
+
+/* initialize this plugin */
+static void
+plugin_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
+			   bool is_init)
+{
+	ctx->output_plugin_private = NULL;
+
+	/* Probably unnecessary, as we don't use the SQL interface ... */
+	opt->output_type = OUTPUT_PLUGIN_BINARY_OUTPUT;
+
+	if (ctx->output_plugin_options != NIL)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("this plugin does not expect any options")));
+	}
+}
+
+static void
+plugin_shutdown(LogicalDecodingContext *ctx)
+{
+}
+
+/*
+ * As we don't release the slot while processing a particular table, there's
+ * no room for the SQL interface, even for debugging purposes. Therefore we
+ * need neither OutputPluginPrepareWrite() nor OutputPluginWrite() in the
+ * plugin callbacks. (Although we might want to write custom callbacks, this
+ * API seems to be unnecessarily generic for our purposes.)
+ */
+
+/* BEGIN callback */
+static void
+plugin_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
+{
+}
+
+/* COMMIT callback */
+static void
+plugin_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+				  XLogRecPtr commit_lsn)
+{
+}
+
+/*
+ * Callback for individual changed tuples
+ */
+static void
+plugin_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+			  Relation relation, ReorderBufferChange *change)
+{
+	RepackDecodingState *dstate;
+
+	dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+	/* Only interested in one particular relation. */
+	if (relation->rd_id != dstate->relid)
+		return;
+
+	/* Decode entry depending on its type */
+	switch (change->action)
+	{
+		case REORDER_BUFFER_CHANGE_INSERT:
+			{
+				HeapTuple	newtuple;
+
+				newtuple = change->data.tp.newtuple;
+
+				/*
+				 * Identity checks in the main function should have made this
+				 * impossible.
+				 */
+				if (newtuple == NULL)
+					elog(ERROR, "incomplete insert info");
+
+				store_change(ctx, CHANGE_INSERT, newtuple);
+			}
+			break;
+		case REORDER_BUFFER_CHANGE_UPDATE:
+			{
+				HeapTuple	oldtuple,
+							newtuple;
+
+				oldtuple = change->data.tp.oldtuple;
+				newtuple = change->data.tp.newtuple;
+
+				if (newtuple == NULL)
+					elog(ERROR, "incomplete update info");
+
+				if (oldtuple != NULL)
+					store_change(ctx, CHANGE_UPDATE_OLD, oldtuple);
+
+				store_change(ctx, CHANGE_UPDATE_NEW, newtuple);
+			}
+			break;
+		case REORDER_BUFFER_CHANGE_DELETE:
+			{
+				HeapTuple	oldtuple;
+
+				oldtuple = change->data.tp.oldtuple;
+
+				if (oldtuple == NULL)
+					elog(ERROR, "incomplete delete info");
+
+				store_change(ctx, CHANGE_DELETE, oldtuple);
+			}
+			break;
+		default:
+			/*
+			 * Should not come here. This includes TRUNCATE of the table being
+			 * processed. heap_decode() cannot check the file locator easily,
+			 * but we assume that TRUNCATE uses AccessExclusiveLock on the
+			 * table so it should not occur during REPACK (CONCURRENTLY).
+			 */
+			Assert(false);
+			break;
+	}
+}
+
+/* Store concurrent data change. */
+static void
+store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
+			 HeapTuple tuple)
+{
+	RepackDecodingState *dstate;
+	char	   *change_raw;
+	ConcurrentChange change;
+	bool		flattened = false;
+	Size		size;
+	Datum		values[1];
+	bool		isnull[1];
+	char	   *dst;
+
+	dstate = (RepackDecodingState *) ctx->output_writer_private;
+
+	size = VARHDRSZ + SizeOfConcurrentChange;
+
+	/*
+	 * ReorderBufferCommit() stores the TOAST chunks in its private memory
+	 * context and frees them after having called apply_change().  Therefore
+	 * we need a flat copy (including TOAST) that we eventually copy into the
+	 * memory context which is available to decode_concurrent_changes().
+	 */
+	if (HeapTupleHasExternal(tuple))
+	{
+		/*
+		 * toast_flatten_tuple_to_datum() might be more convenient but we
+		 * don't want the decompression it does.
+		 */
+		tuple = toast_flatten_tuple(tuple, dstate->tupdesc);
+		flattened = true;
+	}
+
+	size += tuple->t_len;
+	if (size >= MaxAllocSize)
+		elog(ERROR, "change is too big");
+
+	/* Construct the change. */
+	change_raw = (char *) palloc0(size);
+	SET_VARSIZE(change_raw, size);
+
+	/*
+	 * Since the varlena alignment might not be sufficient for the structure,
+	 * set the fields in a local instance and remember where it should
+	 * eventually be copied.
+	 */
+	change.kind = kind;
+	dst = (char *) VARDATA(change_raw);
+
+	/*
+	 * Copy the tuple.
+	 *
+	 * Note: change->tup_data.t_data must be fixed on retrieval!
+	 */
+	memcpy(&change.tup_data, tuple, sizeof(HeapTupleData));
+	memcpy(dst, &change, SizeOfConcurrentChange);
+	dst += SizeOfConcurrentChange;
+	memcpy(dst, tuple->t_data, tuple->t_len);
+
+	/* The data has been copied. */
+	if (flattened)
+		pfree(tuple);
+
+	/* Store as tuple of 1 bytea column. */
+	values[0] = PointerGetDatum(change_raw);
+	isnull[0] = false;
+	tuplestore_putvalues(dstate->tstore, dstate->tupdesc_change,
+						 values, isnull);
+
+	/* Accounting. */
+	dstate->nchanges++;
+
+	/* Cleanup. */
+	pfree(change_raw);
+}
diff --git a/src/backend/storage/lmgr/generate-lwlocknames.pl b/src/backend/storage/lmgr/generate-lwlocknames.pl
index b49007167b0..2e7f1054e62 100644
--- a/src/backend/storage/lmgr/generate-lwlocknames.pl
+++ b/src/backend/storage/lmgr/generate-lwlocknames.pl
@@ -162,7 +162,7 @@ while (<$lwlocklist>)
 
 die
   "$wait_event_lwlocks[$lwlock_count] defined in wait_event_names.txt but "
-  . " missing from lwlocklist.h"
+  . "missing from lwlocklist.h"
   if $lwlock_count < scalar @wait_event_lwlocks;
 
 die
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 3af1b366adf..fdf3427b43f 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -214,7 +214,6 @@ static List *exportedSnapshots = NIL;
 
 /* Prototypes for local functions */
 static void UnregisterSnapshotNoOwner(Snapshot snapshot);
-static void FreeSnapshot(Snapshot snapshot);
 static void SnapshotResetXmin(void);
 
 /* ResourceOwner callbacks to track snapshot references */
@@ -659,7 +658,7 @@ CopySnapshot(Snapshot snapshot)
  * FreeSnapshot
  *		Free the memory associated with a snapshot.
  */
-static void
+void
 FreeSnapshot(Snapshot snapshot)
 {
 	Assert(snapshot->regd_count == 0);
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index 2a1bb47ff03..0ec0f4c4790 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -5121,8 +5121,8 @@ match_previous_words(int pattern_id,
 		 * one word, so the above test is correct.
 		 */
 		if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
-			COMPLETE_WITH("ANALYZE", "VERBOSE");
-		else if (TailMatches("ANALYZE", "VERBOSE"))
+			COMPLETE_WITH("ANALYZE", "CONCURRENTLY", "VERBOSE");
+		else if (TailMatches("ANALYZE|CONCURRENTLY|VERBOSE"))
 			COMPLETE_WITH("ON", "OFF");
 	}
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 3c0961ab36b..f3cf4e1f487 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -361,14 +361,15 @@ extern void heap_multi_insert(Relation relation, TupleTableSlot **slots,
 							  BulkInsertState bistate);
 extern TM_Result heap_delete(Relation relation, const ItemPointerData *tid,
 							 CommandId cid, Snapshot crosscheck, bool wait,
-							 TM_FailureData *tmfd, bool changingPart);
+							 TM_FailureData *tmfd, bool changingPart,
+							 bool wal_logical);
 extern void heap_finish_speculative(Relation relation, const ItemPointerData *tid);
 extern void heap_abort_speculative(Relation relation, const ItemPointerData *tid);
 extern TM_Result heap_update(Relation relation, const ItemPointerData *otid,
 							 HeapTuple newtup,
 							 CommandId cid, Snapshot crosscheck, bool wait,
 							 TM_FailureData *tmfd, LockTupleMode *lockmode,
-							 TU_UpdateIndexes *update_indexes);
+							 TU_UpdateIndexes *update_indexes, bool wal_logical);
 extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
 								 CommandId cid, LockTupleMode mode, LockWaitPolicy wait_policy,
 								 bool follow_updates,
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index ce3566ba949..f1f5495556b 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -104,6 +104,8 @@
 #define XLH_DELETE_CONTAINS_OLD_KEY				(1<<2)
 #define XLH_DELETE_IS_SUPER						(1<<3)
 #define XLH_DELETE_IS_PARTITION_MOVE			(1<<4)
+/* See heap_delete() */
+#define XLH_DELETE_NO_LOGICAL					(1<<5)
 
 /* convenience macro for checking whether any form of old tuple was logged */
 #define XLH_DELETE_CONTAINS_OLD						\
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 7260b7b3d52..14928cd04a1 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -22,6 +22,7 @@
 #include "access/xact.h"
 #include "commands/vacuum.h"
 #include "executor/tuptable.h"
+#include "replication/logical.h"
 #include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -629,6 +630,8 @@ typedef struct TableAmRoutine
 											  Relation OldIndex,
 											  bool use_sort,
 											  TransactionId OldestXmin,
+											  Snapshot snapshot,
+											  LogicalDecodingContext *decoding_ctx,
 											  TransactionId *xid_cutoff,
 											  MultiXactId *multi_cutoff,
 											  double *num_tuples,
@@ -1658,6 +1661,10 @@ table_relation_copy_data(Relation rel, const RelFileLocator *newrlocator)
  *   not needed for the relation's AM
  * - *xid_cutoff - ditto
  * - *multi_cutoff - ditto
+ * - snapshot - if != NULL, ignore data changes done by transactions that this
+ *	 (MVCC) snapshot considers still in-progress or in the future.
+ * - decoding_ctx - logical decoding context, to capture concurrent data
+ *   changes.
  *
  * Output parameters:
  * - *xid_cutoff - rel's new relfrozenxid value, may be invalid
@@ -1670,6 +1677,8 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
 								Relation OldIndex,
 								bool use_sort,
 								TransactionId OldestXmin,
+								Snapshot snapshot,
+								LogicalDecodingContext *decoding_ctx,
 								TransactionId *xid_cutoff,
 								MultiXactId *multi_cutoff,
 								double *num_tuples,
@@ -1678,6 +1687,7 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
 {
 	OldTable->rd_tableam->relation_copy_for_cluster(OldTable, NewTable, OldIndex,
 													use_sort, OldestXmin,
+													snapshot, decoding_ctx,
 													xid_cutoff, multi_cutoff,
 													num_tuples, tups_vacuumed,
 													tups_recently_dead);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 28741988478..6a5c476294a 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -13,10 +13,15 @@
 #ifndef CLUSTER_H
 #define CLUSTER_H
 
+#include "nodes/execnodes.h"
 #include "nodes/parsenodes.h"
 #include "parser/parse_node.h"
+#include "replication/decode.h"
+#include "replication/logical.h"
 #include "storage/lock.h"
 #include "utils/relcache.h"
+#include "utils/resowner.h"
+#include "utils/tuplestore.h"
 
 
 /* flag bits for ClusterParams->options */
@@ -25,6 +30,8 @@
 #define CLUOPT_RECHECK_ISCLUSTERED 0x04 /* recheck relation state for
 										 * indisclustered */
 #define CLUOPT_ANALYZE 0x08		/* do an ANALYZE */
+#define CLUOPT_CONCURRENT 0x10	/* allow concurrent data changes */
+
 
 /* options for CLUSTER */
 typedef struct ClusterParams
@@ -33,10 +40,84 @@ typedef struct ClusterParams
 } ClusterParams;
 
 
+/*
+ * The following definitions are used by REPACK CONCURRENTLY.
+ */
+
+typedef enum
+{
+	CHANGE_INSERT,
+	CHANGE_UPDATE_OLD,
+	CHANGE_UPDATE_NEW,
+	CHANGE_DELETE
+} ConcurrentChangeKind;
+
+typedef struct ConcurrentChange
+{
+	/* See the enum above. */
+	ConcurrentChangeKind kind;
+
+	/*
+	 * The actual tuple.
+	 *
+	 * The tuple data follows the ConcurrentChange structure. Before use make
+	 * sure the tuple is correctly aligned (ConcurrentChange can be stored as
+	 * bytea) and that tuple->t_data is fixed.
+	 */
+	HeapTupleData tup_data;
+} ConcurrentChange;
+
+#define SizeOfConcurrentChange (offsetof(ConcurrentChange, tup_data) + \
+								sizeof(HeapTupleData))
+
+/*
+ * Logical decoding state.
+ *
+ * Here we store the data changes that we decode from WAL while the table
+ * contents are being copied to new storage. The metadata needed to apply
+ * these changes to the table is stored here as well.
+ */
+typedef struct RepackDecodingState
+{
+	/* The relation whose changes we're decoding. */
+	Oid			relid;
+
+	/* Replication slot name. */
+	NameData	slotname;
+
+	/*
+	 * Decoded changes are stored here. Although we try to avoid excessive
+	 * batches, it can happen that the changes need to be stored to disk. The
+	 * tuplestore does this transparently.
+	 */
+	Tuplestorestate *tstore;
+
+	/* The current number of changes in tstore. */
+	double		nchanges;
+
+	/*
+	 * Descriptor to store the ConcurrentChange structure serialized (bytea).
+	 * We can't store the tuple directly because tuplestore only supports
+	 * minimal tuples and we may need to transfer OID system column from the
+	 * output plugin. Also we need to transfer the change kind, so it's better
+	 * to put everything in the structure than to use 2 tuplestores "in
+	 * parallel".
+	 */
+	TupleDesc	tupdesc_change;
+
+	/* Tuple descriptor needed to update indexes. */
+	TupleDesc	tupdesc;
+
+	/* Slot to retrieve data from tstore. */
+	TupleTableSlot *tsslot;
+
+	ResourceOwner resowner;
+} RepackDecodingState;
+
 extern void ExecRepack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel);
 
 extern void cluster_rel(RepackCommand command, Relation OldHeap, Oid indexOid,
-						ClusterParams *params);
+						ClusterParams *params, bool isTopLevel);
 extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
 									   LOCKMODE lockmode);
 extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
@@ -48,8 +129,13 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
 							 bool swap_toast_by_content,
 							 bool check_constraints,
 							 bool is_internal,
+							 bool reindex,
 							 TransactionId frozenXid,
 							 MultiXactId cutoffMulti,
 							 char newrelpersistence);
 
+extern bool am_decoding_for_repack(void);
+extern bool change_useless_for_repack(XLogRecordBuffer *buf);
+extern void repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
+											 XLogRecPtr end_of_wal);
 #endif							/* CLUSTER_H */
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index f00e39b937d..4445724a463 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -86,10 +86,12 @@
 #define PROGRESS_REPACK_PHASE					1
 #define PROGRESS_REPACK_INDEX_RELID				2
 #define PROGRESS_REPACK_HEAP_TUPLES_SCANNED		3
-#define PROGRESS_REPACK_HEAP_TUPLES_WRITTEN		4
-#define PROGRESS_REPACK_TOTAL_HEAP_BLKS			5
-#define PROGRESS_REPACK_HEAP_BLKS_SCANNED		6
-#define PROGRESS_REPACK_INDEX_REBUILD_COUNT		7
+#define PROGRESS_REPACK_HEAP_TUPLES_INSERTED	4
+#define PROGRESS_REPACK_HEAP_TUPLES_UPDATED		5
+#define PROGRESS_REPACK_HEAP_TUPLES_DELETED		6
+#define PROGRESS_REPACK_TOTAL_HEAP_BLKS			7
+#define PROGRESS_REPACK_HEAP_BLKS_SCANNED		8
+#define PROGRESS_REPACK_INDEX_REBUILD_COUNT		9
 
 /*
  * Phases of repack (as advertised via PROGRESS_REPACK_PHASE).
@@ -98,9 +100,10 @@
 #define PROGRESS_REPACK_PHASE_INDEX_SCAN_HEAP	2
 #define PROGRESS_REPACK_PHASE_SORT_TUPLES		3
 #define PROGRESS_REPACK_PHASE_WRITE_NEW_HEAP	4
-#define PROGRESS_REPACK_PHASE_SWAP_REL_FILES	5
-#define PROGRESS_REPACK_PHASE_REBUILD_INDEX		6
-#define PROGRESS_REPACK_PHASE_FINAL_CLEANUP		7
+#define PROGRESS_REPACK_PHASE_CATCH_UP			5
+#define PROGRESS_REPACK_PHASE_SWAP_REL_FILES	6
+#define PROGRESS_REPACK_PHASE_REBUILD_INDEX		7
+#define PROGRESS_REPACK_PHASE_FINAL_CLEANUP		8
 
 /* Progress parameters for CREATE INDEX */
 /* 3, 4 and 5 reserved for "waitfor" metrics */
diff --git a/src/include/replication/snapbuild.h b/src/include/replication/snapbuild.h
index 34383dea776..5ee267d1c90 100644
--- a/src/include/replication/snapbuild.h
+++ b/src/include/replication/snapbuild.h
@@ -73,6 +73,7 @@ extern void FreeSnapshotBuilder(SnapBuild *builder);
 extern void SnapBuildSnapDecRefcount(Snapshot snap);
 
 extern Snapshot SnapBuildInitialSnapshot(SnapBuild *builder);
+extern Snapshot SnapBuildInitialSnapshotForRepack(SnapBuild *builder);
 extern Snapshot SnapBuildMVCCFromHistoric(Snapshot snapshot, bool in_place);
 extern const char *SnapBuildExportSnapshot(SnapBuild *builder);
 extern void SnapBuildClearExportedSnapshot(void);
diff --git a/src/include/storage/lockdefs.h b/src/include/storage/lockdefs.h
index b73bb5618e6..3785b009808 100644
--- a/src/include/storage/lockdefs.h
+++ b/src/include/storage/lockdefs.h
@@ -36,8 +36,8 @@ typedef int LOCKMODE;
 #define AccessShareLock			1	/* SELECT */
 #define RowShareLock			2	/* SELECT FOR UPDATE/FOR SHARE */
 #define RowExclusiveLock		3	/* INSERT, UPDATE, DELETE */
-#define ShareUpdateExclusiveLock 4	/* VACUUM (non-FULL), ANALYZE, CREATE
-									 * INDEX CONCURRENTLY */
+#define ShareUpdateExclusiveLock 4	/* VACUUM (non-exclusive), ANALYZE, CREATE
+									 * INDEX CONCURRENTLY, REPACK CONCURRENTLY */
 #define ShareLock				5	/* CREATE INDEX (WITHOUT CONCURRENTLY) */
 #define ShareRowExclusiveLock	6	/* like EXCLUSIVE MODE, but allows ROW
 									 * SHARE */
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index de824945f0b..0eb8ced76d3 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -64,6 +64,8 @@ extern Snapshot GetLatestSnapshot(void);
 extern void SnapshotSetCommandId(CommandId curcid);
 
 extern Snapshot CopySnapshot(Snapshot snapshot);
+extern void FreeSnapshot(Snapshot snapshot);
+
 extern Snapshot GetCatalogSnapshot(Oid relid);
 extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
 extern void InvalidateCatalogSnapshot(void);
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index a41d781f8c9..2cd7d87c533 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -14,6 +14,8 @@ REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic \
 	    inplace \
+	    repack \
+	    repack_toast \
 	    syscache-update-pruned \
 	    heap_lock_update
 
diff --git a/src/test/modules/injection_points/expected/repack.out b/src/test/modules/injection_points/expected/repack.out
new file mode 100644
index 00000000000..b575e9052ee
--- /dev/null
+++ b/src/test/modules/injection_points/expected/repack.out
@@ -0,0 +1,113 @@
+Parsed test spec with 2 sessions
+
+starting permutation: wait_before_lock change_existing change_new change_subxact1 change_subxact2 check2 wakeup_before_lock check1
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step wait_before_lock: 
+	REPACK (CONCURRENTLY) repack_test USING INDEX repack_test_pkey;
+ <waiting ...>
+step change_existing: 
+	UPDATE repack_test SET i=10 where i=1;
+	UPDATE repack_test SET j=20 where i=2;
+	UPDATE repack_test SET i=30 where i=3;
+	UPDATE repack_test SET i=40 where i=30;
+	DELETE FROM repack_test WHERE i=4;
+
+step change_new: 
+	INSERT INTO repack_test(i, j) VALUES (5, 5), (6, 6), (7, 7), (8, 8);
+	UPDATE repack_test SET i=50 where i=5;
+	UPDATE repack_test SET j=60 where i=6;
+	DELETE FROM repack_test WHERE i=7;
+
+step change_subxact1: 
+	BEGIN;
+	INSERT INTO repack_test(i, j) VALUES (100, 100);
+	SAVEPOINT s1;
+	UPDATE repack_test SET i=101 where i=100;
+	SAVEPOINT s2;
+	UPDATE repack_test SET i=102 where i=101;
+	COMMIT;
+
+step change_subxact2: 
+	BEGIN;
+	SAVEPOINT s1;
+	INSERT INTO repack_test(i, j) VALUES (110, 110);
+	ROLLBACK TO SAVEPOINT s1;
+	INSERT INTO repack_test(i, j) VALUES (110, 111);
+	COMMIT;
+
+step check2: 
+	INSERT INTO relfilenodes(node)
+	SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+	SELECT i, j FROM repack_test ORDER BY i, j;
+
+	INSERT INTO data_s2(i, j)
+	SELECT i, j FROM repack_test;
+
+  i|  j
+---+---
+  2| 20
+  6| 60
+  8|  8
+ 10|  1
+ 40|  3
+ 50|  5
+102|100
+110|111
+(8 rows)
+
+step wakeup_before_lock: 
+	SELECT injection_points_wakeup('repack-concurrently-before-lock');
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step wait_before_lock: <... completed>
+step check1: 
+	INSERT INTO relfilenodes(node)
+	SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+	SELECT count(DISTINCT node) FROM relfilenodes;
+
+	SELECT i, j FROM repack_test ORDER BY i, j;
+
+	INSERT INTO data_s1(i, j)
+	SELECT i, j FROM repack_test;
+
+	SELECT count(*)
+	FROM data_s1 d1 FULL JOIN data_s2 d2 USING (i, j)
+	WHERE d1.i ISNULL OR d2.i ISNULL;
+
+count
+-----
+    2
+(1 row)
+
+  i|  j
+---+---
+  2| 20
+  6| 60
+  8|  8
+ 10|  1
+ 40|  3
+ 50|  5
+102|100
+110|111
+(8 rows)
+
+count
+-----
+    0
+(1 row)
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
diff --git a/src/test/modules/injection_points/expected/repack_toast.out b/src/test/modules/injection_points/expected/repack_toast.out
new file mode 100644
index 00000000000..4f866a74e32
--- /dev/null
+++ b/src/test/modules/injection_points/expected/repack_toast.out
@@ -0,0 +1,64 @@
+Parsed test spec with 2 sessions
+
+starting permutation: wait_before_lock change check2 wakeup_before_lock check1
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step wait_before_lock: 
+	REPACK (CONCURRENTLY) repack_test;
+ <waiting ...>
+step change: 
+	UPDATE repack_test SET j=get_long_string() where i=2;
+	DELETE FROM repack_test WHERE i=3;
+	INSERT INTO repack_test(i, j) VALUES (4, get_long_string());
+
+step check2: 
+	INSERT INTO relfilenodes(node)
+	SELECT c2.relfilenode
+	FROM pg_class c1 JOIN pg_class c2 ON c2.oid = c1.oid OR c2.oid = c1.reltoastrelid
+	WHERE c1.relname='repack_test';
+
+	INSERT INTO data_s2(i, j)
+	SELECT i, j FROM repack_test;
+
+step wakeup_before_lock: 
+	SELECT injection_points_wakeup('repack-concurrently-before-lock');
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step wait_before_lock: <... completed>
+step check1: 
+	INSERT INTO relfilenodes(node)
+	SELECT c2.relfilenode
+	FROM pg_class c1 JOIN pg_class c2 ON c2.oid = c1.oid OR c2.oid = c1.reltoastrelid
+	WHERE c1.relname='repack_test';
+
+	SELECT count(DISTINCT node) FROM relfilenodes;
+
+	INSERT INTO data_s1(i, j)
+	SELECT i, j FROM repack_test;
+
+	SELECT count(*)
+	FROM data_s1 d1 FULL JOIN data_s2 d2 USING (i, j)
+	WHERE d1.i ISNULL OR d2.i ISNULL;
+
+count
+-----
+    4
+(1 row)
+
+count
+-----
+    0
+(1 row)
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index fcc85414515..a414abb924b 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -45,6 +45,8 @@ tests += {
     'specs': [
       'basic',
       'inplace',
+      'repack',
+      'repack_toast',
       'syscache-update-pruned',
       'heap_lock_update',
     ],
diff --git a/src/test/modules/injection_points/specs/repack.spec b/src/test/modules/injection_points/specs/repack.spec
new file mode 100644
index 00000000000..d727a9b056b
--- /dev/null
+++ b/src/test/modules/injection_points/specs/repack.spec
@@ -0,0 +1,142 @@
+# REPACK (CONCURRENTLY) ... USING INDEX ...;
+setup
+{
+	CREATE EXTENSION injection_points;
+
+	CREATE TABLE repack_test(i int PRIMARY KEY, j int);
+	INSERT INTO repack_test(i, j) VALUES (1, 1), (2, 2), (3, 3), (4, 4);
+
+	CREATE TABLE relfilenodes(node oid);
+
+	CREATE TABLE data_s1(i int, j int);
+	CREATE TABLE data_s2(i int, j int);
+}
+
+teardown
+{
+	DROP TABLE repack_test;
+	DROP EXTENSION injection_points;
+
+	DROP TABLE relfilenodes;
+	DROP TABLE data_s1;
+	DROP TABLE data_s2;
+}
+
+session s1
+setup
+{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('repack-concurrently-before-lock', 'wait');
+}
+# Perform the initial load and wait for s2 to do some data changes.
+step wait_before_lock
+{
+	REPACK (CONCURRENTLY) repack_test USING INDEX repack_test_pkey;
+}
+# Check the table from the perspective of s1.
+#
+# Besides the contents, we also check that relfilenode has changed.
+
+# Have each session write the contents into a table and use FULL JOIN to check
+# if the outputs are identical.
+step check1
+{
+	INSERT INTO relfilenodes(node)
+	SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+	SELECT count(DISTINCT node) FROM relfilenodes;
+
+	SELECT i, j FROM repack_test ORDER BY i, j;
+
+	INSERT INTO data_s1(i, j)
+	SELECT i, j FROM repack_test;
+
+	SELECT count(*)
+	FROM data_s1 d1 FULL JOIN data_s2 d2 USING (i, j)
+	WHERE d1.i ISNULL OR d2.i ISNULL;
+}
+teardown
+{
+	SELECT injection_points_detach('repack-concurrently-before-lock');
+}
+
+session s2
+# Change the existing data. UPDATE changes both key and non-key columns. Also
+# update one row twice to test whether the tuple version generated by this
+# session can be found.
+step change_existing
+{
+	UPDATE repack_test SET i=10 where i=1;
+	UPDATE repack_test SET j=20 where i=2;
+	UPDATE repack_test SET i=30 where i=3;
+	UPDATE repack_test SET i=40 where i=30;
+	DELETE FROM repack_test WHERE i=4;
+}
+# Insert new rows and UPDATE / DELETE some of them. Again, update both key and
+# non-key columns.
+step change_new
+{
+	INSERT INTO repack_test(i, j) VALUES (5, 5), (6, 6), (7, 7), (8, 8);
+	UPDATE repack_test SET i=50 where i=5;
+	UPDATE repack_test SET j=60 where i=6;
+	DELETE FROM repack_test WHERE i=7;
+}
+
+# When applying concurrent data changes, we should see the effects of an
+# in-progress subtransaction.
+#
+# XXX Not sure this test is useful now - it was designed for the patch that
+# preserves tuple visibility and which therefore modifies
+# TransactionIdIsCurrentTransactionId().
+step change_subxact1
+{
+	BEGIN;
+	INSERT INTO repack_test(i, j) VALUES (100, 100);
+	SAVEPOINT s1;
+	UPDATE repack_test SET i=101 where i=100;
+	SAVEPOINT s2;
+	UPDATE repack_test SET i=102 where i=101;
+	COMMIT;
+}
+
+# When applying concurrent data changes, we should not see the effects of a
+# rolled back subtransaction.
+#
+# XXX Is this test useful? See above.
+step change_subxact2
+{
+	BEGIN;
+	SAVEPOINT s1;
+	INSERT INTO repack_test(i, j) VALUES (110, 110);
+	ROLLBACK TO SAVEPOINT s1;
+	INSERT INTO repack_test(i, j) VALUES (110, 111);
+	COMMIT;
+}
+
+# Check the table from the perspective of s2.
+step check2
+{
+	INSERT INTO relfilenodes(node)
+	SELECT relfilenode FROM pg_class WHERE relname='repack_test';
+
+	SELECT i, j FROM repack_test ORDER BY i, j;
+
+	INSERT INTO data_s2(i, j)
+	SELECT i, j FROM repack_test;
+}
+step wakeup_before_lock
+{
+	SELECT injection_points_wakeup('repack-concurrently-before-lock');
+}
+
+# Test if data changes introduced while one session is performing REPACK
+# CONCURRENTLY find their way into the table.
+permutation
+	wait_before_lock
+	change_existing
+	change_new
+	change_subxact1
+	change_subxact2
+	check2
+	wakeup_before_lock
+	check1
diff --git a/src/test/modules/injection_points/specs/repack_toast.spec b/src/test/modules/injection_points/specs/repack_toast.spec
new file mode 100644
index 00000000000..b48abf21450
--- /dev/null
+++ b/src/test/modules/injection_points/specs/repack_toast.spec
@@ -0,0 +1,105 @@
+# REPACK (CONCURRENTLY);
+#
+# Test handling of TOAST. At the same time, no tuplesort.
+setup
+{
+	CREATE EXTENSION injection_points;
+
+	-- Return a string that needs to be TOASTed.
+	CREATE FUNCTION get_long_string()
+	RETURNS text
+	LANGUAGE sql as $$
+		SELECT string_agg(chr(65 + trunc(25 * random())::int), '')
+		FROM generate_series(1, 2048) s(x);
+	$$;
+
+	CREATE TABLE repack_test(i int PRIMARY KEY, j text);
+	INSERT INTO repack_test(i, j) VALUES (1, get_long_string()),
+		(2, get_long_string()), (3, get_long_string());
+
+	CREATE TABLE relfilenodes(node oid);
+
+	CREATE TABLE data_s1(i int, j text);
+	CREATE TABLE data_s2(i int, j text);
+}
+
+teardown
+{
+	DROP TABLE repack_test;
+	DROP EXTENSION injection_points;
+	DROP FUNCTION get_long_string();
+
+	DROP TABLE relfilenodes;
+	DROP TABLE data_s1;
+	DROP TABLE data_s2;
+}
+
+session s1
+setup
+{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('repack-concurrently-before-lock', 'wait');
+}
+# Perform the initial load and wait for s2 to do some data changes.
+step wait_before_lock
+{
+	REPACK (CONCURRENTLY) repack_test;
+}
+# Check the table from the perspective of s1.
+#
+# Besides the contents, we also check that relfilenode has changed.
+
+# Have each session write the contents into a table and use FULL JOIN to check
+# if the outputs are identical.
+step check1
+{
+	INSERT INTO relfilenodes(node)
+	SELECT c2.relfilenode
+	FROM pg_class c1 JOIN pg_class c2 ON c2.oid = c1.oid OR c2.oid = c1.reltoastrelid
+	WHERE c1.relname='repack_test';
+
+	SELECT count(DISTINCT node) FROM relfilenodes;
+
+	INSERT INTO data_s1(i, j)
+	SELECT i, j FROM repack_test;
+
+	SELECT count(*)
+	FROM data_s1 d1 FULL JOIN data_s2 d2 USING (i, j)
+	WHERE d1.i ISNULL OR d2.i ISNULL;
+}
+teardown
+{
+    SELECT injection_points_detach('repack-concurrently-before-lock');
+}
+
+session s2
+step change
+{
+	UPDATE repack_test SET j=get_long_string() where i=2;
+	DELETE FROM repack_test WHERE i=3;
+	INSERT INTO repack_test(i, j) VALUES (4, get_long_string());
+}
+# Check the table from the perspective of s2.
+step check2
+{
+	INSERT INTO relfilenodes(node)
+	SELECT c2.relfilenode
+	FROM pg_class c1 JOIN pg_class c2 ON c2.oid = c1.oid OR c2.oid = c1.reltoastrelid
+	WHERE c1.relname='repack_test';
+
+	INSERT INTO data_s2(i, j)
+	SELECT i, j FROM repack_test;
+}
+step wakeup_before_lock
+{
+	SELECT injection_points_wakeup('repack-concurrently-before-lock');
+}
+
+# Test if data changes introduced while one session is performing REPACK
+# CONCURRENTLY find their way into the table.
+permutation
+	wait_before_lock
+	change
+	check2
+	wakeup_before_lock
+	check1
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 48461550636..470920f0d16 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2014,7 +2014,7 @@ pg_stat_progress_cluster| SELECT pid,
     phase,
     repack_index_relid AS cluster_index_relid,
     heap_tuples_scanned,
-    heap_tuples_written,
+    (heap_tuples_inserted + heap_tuples_updated) AS heap_tuples_written,
     heap_blks_total,
     heap_blks_scanned,
     index_rebuild_count
@@ -2094,17 +2094,20 @@ pg_stat_progress_repack| SELECT s.pid,
             WHEN 2 THEN 'index scanning heap'::text
             WHEN 3 THEN 'sorting tuples'::text
             WHEN 4 THEN 'writing new heap'::text
-            WHEN 5 THEN 'swapping relation files'::text
-            WHEN 6 THEN 'rebuilding index'::text
-            WHEN 7 THEN 'performing final cleanup'::text
+            WHEN 5 THEN 'catch-up'::text
+            WHEN 6 THEN 'swapping relation files'::text
+            WHEN 7 THEN 'rebuilding index'::text
+            WHEN 8 THEN 'performing final cleanup'::text
             ELSE NULL::text
         END AS phase,
     (s.param3)::oid AS repack_index_relid,
     s.param4 AS heap_tuples_scanned,
-    s.param5 AS heap_tuples_written,
-    s.param6 AS heap_blks_total,
-    s.param7 AS heap_blks_scanned,
-    s.param8 AS index_rebuild_count
+    s.param5 AS heap_tuples_inserted,
+    s.param6 AS heap_tuples_updated,
+    s.param7 AS heap_tuples_deleted,
+    s.param8 AS heap_blks_total,
+    s.param9 AS heap_blks_scanned,
+    s.param10 AS index_rebuild_count
    FROM (pg_stat_get_progress_info('REPACK'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
      LEFT JOIN pg_database d ON ((s.datid = d.oid)));
 pg_stat_progress_vacuum| SELECT s.pid,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 6c4af1c210d..2676823d2a7 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -420,6 +420,7 @@ CatCacheHeader
 CatalogId
 CatalogIdMapEntry
 CatalogIndexState
+ChangeDest
 ChangeVarNodes_callback
 ChangeVarNodes_context
 ChannelName
@@ -497,6 +498,8 @@ CompressFileHandle
 CompressionLocation
 CompressorState
 ComputeXidHorizonsResult
+ConcurrentChange
+ConcurrentChangeKind
 ConditionVariable
 ConditionVariableMinimallyPadded
 ConditionalStack
@@ -1278,6 +1281,7 @@ IndexElem
 IndexFetchHeapData
 IndexFetchTableData
 IndexInfo
+IndexInsertState
 IndexList
 IndexOnlyScan
 IndexOnlyScanState
@@ -2576,6 +2580,7 @@ ReorderBufferTupleCidKey
 ReorderBufferUpdateProgressTxnCB
 ReorderTuple
 RepackCommand
+RepackDecodingState
 RepackStmt
 ReparameterizeForeignPathByChild_function
 ReplOriginId
-- 
2.47.3



  [text/x-diff] v33-0005-Use-background-worker-to-do-logical-decoding.patch (65.2K, 6-v33-0005-Use-background-worker-to-do-logical-decoding.patch)
  download | inline diff:
From bef2f440ce946e6c59223f51857396e10fa1b4d0 Mon Sep 17 00:00:00 2001
From: Antonin Houska <[email protected]>
Date: Tue, 27 Jan 2026 11:48:40 +0100
Subject: [PATCH v33 5/5] Use background worker to do logical decoding.

If the backend performing REPACK (CONCURRENTLY) does both data copying and
logical decoding, it has to "travel in time" back and forth and therefore it
has to invalidate system caches quite a few times. (The copying and the
decoding work with different catalog snapshots.) As the decoding worker has
separate caches, the switching is not necessary.

Without the worker, it'd also be difficult to switch between potentially long
running tasks like index build and WAL decoding. (If no decoding takes place
during that time, archiving / recycling of WAL segments can be suspended for
some time, which in turn may result in a full disk.)

Another problem is that, after having acquired AccessExclusiveLock (in order
to swap the files), the backend needs to both decode and apply the data
changes that took place while it was waiting for the lock. With the decoding
worker, the decoding runs all the time, so the backend only needs to apply the
changes. This can reduce the time the exclusive lock is held for.

Note that the code added in order to handle ERRORs in the background worker
almost duplicates the existing code that does the same for other types of
workers (see ProcessParallelMessages() and
ProcessParallelApplyMessages()). Refactoring of the existing code might be
useful, to reduce the duplication.
---
 src/backend/access/heap/heapam_handler.c      |   44 -
 src/backend/commands/cluster.c                | 1183 +++++++++++++----
 src/backend/libpq/pqmq.c                      |    5 +
 src/backend/postmaster/bgworker.c             |    5 +
 src/backend/replication/logical/logical.c     |    6 +-
 .../pgoutput_repack/pgoutput_repack.c         |   54 +-
 src/backend/storage/ipc/procsignal.c          |    4 +
 src/backend/tcop/postgres.c                   |    4 +
 .../utils/activity/wait_event_names.txt       |    1 +
 src/include/access/tableam.h                  |    7 +-
 src/include/commands/cluster.h                |   71 +-
 src/include/storage/procsignal.h              |    1 +
 src/tools/pgindent/typedefs.list              |    3 +-
 13 files changed, 984 insertions(+), 404 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 908f1ef66c6..8589b3c940e 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -33,7 +33,6 @@
 #include "catalog/index.h"
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
-#include "commands/cluster.h"
 #include "commands/progress.h"
 #include "executor/executor.h"
 #include "miscadmin.h"
@@ -688,7 +687,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 								 Relation OldIndex, bool use_sort,
 								 TransactionId OldestXmin,
 								 Snapshot snapshot,
-								 LogicalDecodingContext *decoding_ctx,
 								 TransactionId *xid_cutoff,
 								 MultiXactId *multi_cutoff,
 								 double *num_tuples,
@@ -710,7 +708,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 	BufferHeapTupleTableSlot *hslot;
 	BlockNumber prev_cblock = InvalidBlockNumber;
 	bool		concurrent = snapshot != NULL;
-	XLogRecPtr	end_of_wal_prev = GetFlushRecPtr(NULL);
 
 	/* Remember if it's a system catalog */
 	is_system_catalog = IsSystemRelation(OldHeap);
@@ -971,31 +968,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 			ct_val[1] = *num_tuples;
 			pgstat_progress_update_multi_param(2, ct_index, ct_val);
 		}
-
-		/*
-		 * Process the WAL produced by the load, as well as by other
-		 * transactions, so that the replication slot can advance and WAL does
-		 * not pile up. Use wal_segment_size as a threshold so that we do not
-		 * introduce the decoding overhead too often.
-		 *
-		 * Of course, we must not apply the changes until the initial load has
-		 * completed.
-		 *
-		 * Note that our insertions into the new table should not be decoded
-		 * as we (intentionally) do not write the logical decoding specific
-		 * information to WAL.
-		 */
-		if (concurrent)
-		{
-			XLogRecPtr	end_of_wal;
-
-			end_of_wal = GetFlushRecPtr(NULL);
-			if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
-			{
-				repack_decode_concurrent_changes(decoding_ctx, end_of_wal);
-				end_of_wal_prev = end_of_wal;
-			}
-		}
 	}
 
 	if (indexScan != NULL)
@@ -1041,22 +1013,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 			/* Report n_tuples */
 			pgstat_progress_update_param(PROGRESS_REPACK_HEAP_TUPLES_INSERTED,
 										 n_tuples);
-
-			/*
-			 * Try to keep the amount of not-yet-decoded WAL small, like
-			 * above.
-			 */
-			if (concurrent)
-			{
-				XLogRecPtr	end_of_wal;
-
-				end_of_wal = GetFlushRecPtr(NULL);
-				if ((end_of_wal - end_of_wal_prev) > wal_segment_size)
-				{
-					repack_decode_concurrent_changes(decoding_ctx, end_of_wal);
-					end_of_wal_prev = end_of_wal;
-				}
-			}
 		}
 
 		tuplesort_end(tuplesort);
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 03ccf10b782..e988a7a7296 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -12,12 +12,13 @@
  * In concurrent mode, we lock the table with only ShareUpdateExclusiveLock,
  * then do an initial copy as above.  However, while the tuples are being
  * copied, concurrent transactions could modify the table. To cope with those
- * changes, we rely on logical decoding to obtain them from WAL.  The changes
- * are accumulated in a tuplestore.  Once the initial copy is complete, we
- * read the changes from the tuplestore and re-apply them on the new heap.
- * Then we upgrade our ShareUpdateExclusiveLock to AccessExclusiveLock and
- * swap the relfilenodes.  This way, the time we hold a strong lock on the
- * table is much reduced, and the bloat is eliminated.
+ * changes, we rely on logical decoding to obtain them from WAL.  A bgworker
+ * consumes WAL while the initial copy is ongoing (to prevent excessive WAL
+ * from being reserved), and accumulates the changes in a file.  Once the
+ * initial copy is complete, we read the changes from the file and re-apply
+ * them on the new heap.  Then we upgrade our ShareUpdateExclusiveLock to
+ * AccessExclusiveLock and swap the relfilenodes.  This way, the time we hold
+ * a strong lock on the table is much reduced, and the bloat is eliminated.
  *
  * There is hardly anything left of Paul Brown's original implementation...
  *
@@ -45,6 +46,7 @@
 #include "access/xlog_internal.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "access/xlogwait.h"
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
 #include "catalog/heap.h"
@@ -61,6 +63,8 @@
 #include "commands/tablecmds.h"
 #include "commands/vacuum.h"
 #include "executor/executor.h"
+#include "libpq/pqformat.h"
+#include "libpq/pqmq.h"
 #include "miscadmin.h"
 #include "optimizer/optimizer.h"
 #include "pgstat.h"
@@ -71,6 +75,8 @@
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/procsignal.h"
+#include "tcop/tcopprot.h"
 #include "utils/acl.h"
 #include "utils/fmgroids.h"
 #include "utils/guc.h"
@@ -117,6 +123,12 @@ typedef struct IndexInsertState
 /* The WAL segment being decoded. */
 static XLogSegNo repack_current_segment = 0;
 
+/*
+ * The first file exported by the decoding worker must contain a snapshot, the
+ * following ones contain the data changes.
+ */
+#define WORKER_FILE_SNAPSHOT	0
+
 /*
  * Information needed to apply concurrent data changes.
  */
@@ -136,8 +148,113 @@ typedef struct ChangeDest
 
 	/* Needed to update indexes of rel_dst. */
 	IndexInsertState *iistate;
+
+	/*
+	 * Sequential number of the file containing the changes.
+	 *
+	 * TODO This field makes the structure name less descriptive. Should we
+	 * rename it, e.g. to ChangeApplyInfo?
+	 */
+	int			file_seq;
 } ChangeDest;
 
+/*
+ * Layout of shared memory used for communication between backend and the
+ * worker that performs logical decoding of data changes
+ */
+typedef struct DecodingWorkerShared
+{
+	/* Is the decoding initialized? */
+	bool		initialized;
+
+	/*
+	 * Once the worker has reached this LSN, it should close the current
+	 * output file and either create a new one or exit, according to the field
+	 * 'done'. If the value is InvalidXLogRecPtr, the worker should decode all
+	 * the WAL available and keep checking this field. It is ok if the worker
+	 * had already decoded records whose LSN is >= lsn_upto before this field
+	 * has been set.
+	 */
+	XLogRecPtr	lsn_upto;
+
+	/* Exit after closing the current file? */
+	bool		done;
+
+	/* The output is stored here. */
+	SharedFileSet sfs;
+
+	/* Number of the last file exported by the worker. */
+	int			last_exported;
+
+	/* Synchronize access to the fields above. */
+	slock_t		mutex;
+
+	/* Database to connect to. */
+	Oid			dbid;
+
+	/* Role to connect as. */
+	Oid			roleid;
+
+	/* Decode data changes of this relation. */
+	Oid			relid;
+
+	/* The backend uses this to wait for the worker. */
+	ConditionVariable cv;
+
+	/* Info to signal the backend. */
+	PGPROC	   *backend_proc;
+	pid_t		backend_pid;
+	ProcNumber	backend_proc_number;
+
+	/* Error queue. */
+	shm_mq	   *error_mq;
+
+	/*
+	 * Memory the queue is located in.
+	 *
+	 * For considerations on the value see the comments of
+	 * PARALLEL_ERROR_QUEUE_SIZE.
+	 */
+#define REPACK_ERROR_QUEUE_SIZE			16384
+	char		error_queue[FLEXIBLE_ARRAY_MEMBER];
+} DecodingWorkerShared;
+
+/*
+ * Generate worker's output file name. If relations of the same 'relid' happen
+ * to be processed at the same time, they must be from different databases and
+ * therefore different backends must be involved. (PID is already present in
+ * the fileset name.)
+ */
+static inline void
+DecodingWorkerFileName(char *fname, Oid relid, uint32 seq)
+{
+	snprintf(fname, MAXPGPATH, "%u-%u", relid, seq);
+}
+
+/*
+ * Backend-local information to control the decoding worker.
+ */
+typedef struct DecodingWorker
+{
+	/* The worker. */
+	BackgroundWorkerHandle *handle;
+
+	/* DecodingWorkerShared is in this segment. */
+	dsm_segment *seg;
+
+	/* Handle of the error queue. */
+	shm_mq_handle *error_mqh;
+} DecodingWorker;
+
+/* Pointer to currently running decoding worker. */
+static DecodingWorker *decoding_worker = NULL;
+
+/*
+ * Is there a message sent by a repack worker that the backend needs to
+ * receive?
+ */
+volatile sig_atomic_t RepackMessagePending = false;
+
 static bool cluster_rel_recheck(RepackCommand cmd, Relation OldHeap,
 								Oid indexOid, Oid userid, LOCKMODE lmode,
 								int options);
@@ -145,7 +262,7 @@ static void check_repack_concurrently_requirements(Relation rel);
 static void rebuild_relation(Relation OldHeap, Relation index, bool verbose,
 							 bool concurrent);
 static void copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
-							Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
+							Snapshot snapshot,
 							bool verbose,
 							bool *pSwapToastByContent,
 							TransactionId *pFreezeXid,
@@ -158,12 +275,10 @@ static List *get_tables_to_repack_partitioned(RepackCommand cmd,
 static bool cluster_is_permitted_for_relation(RepackCommand cmd,
 											  Oid relid, Oid userid);
 
-static void begin_concurrent_repack(Relation rel);
-static void end_concurrent_repack(void);
 static LogicalDecodingContext *setup_logical_decoding(Oid relid);
-static HeapTuple get_changed_tuple(char *change);
-static void apply_concurrent_changes(RepackDecodingState *dstate,
-									 ChangeDest *dest);
+static bool decode_concurrent_changes(LogicalDecodingContext *ctx,
+									  DecodingWorkerShared *shared);
+static void apply_concurrent_changes(BufFile *file, ChangeDest *dest);
 static void apply_concurrent_insert(Relation rel, HeapTuple tup,
 									IndexInsertState *iistate,
 									TupleTableSlot *index_slot);
@@ -175,9 +290,9 @@ static void apply_concurrent_delete(Relation rel, HeapTuple tup_target);
 static HeapTuple find_target_tuple(Relation rel, ChangeDest *dest,
 								   HeapTuple tup_key,
 								   TupleTableSlot *ident_slot);
-static void process_concurrent_changes(LogicalDecodingContext *decoding_ctx,
-									   XLogRecPtr end_of_wal,
-									   ChangeDest *dest);
+static void process_concurrent_changes(XLogRecPtr end_of_wal,
+									   ChangeDest *dest,
+									   bool done);
 static IndexInsertState *get_index_insert_state(Relation relation,
 												Oid ident_index_id,
 												Relation *ident_index_p);
@@ -186,7 +301,6 @@ static ScanKey build_identity_key(Oid ident_idx_oid, Relation rel_src,
 static void free_index_insert_state(IndexInsertState *iistate);
 static void cleanup_logical_decoding(LogicalDecodingContext *ctx);
 static void rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
-											   LogicalDecodingContext *decoding_ctx,
 											   TransactionId frozenXid,
 											   MultiXactId cutoffMulti);
 static List *build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes);
@@ -196,6 +310,13 @@ static Relation process_single_relation(RepackStmt *stmt,
 										ClusterParams *params);
 static Oid	determine_clustered_index(Relation rel, bool usingindex,
 									  const char *indexname);
+static void start_decoding_worker(Oid relid);
+static void stop_decoding_worker(void);
+static void repack_worker_internal(dsm_segment *seg);
+static void export_initial_snapshot(Snapshot snapshot,
+									DecodingWorkerShared *shared);
+static Snapshot get_initial_snapshot(DecodingWorker *worker);
+static void ProcessRepackMessage(StringInfo msg);
 static const char *RepackCommandAsString(RepackCommand cmd);
 
 
@@ -619,20 +740,20 @@ cluster_rel(RepackCommand cmd, Relation OldHeap, Oid indexOid,
 	/* rebuild_relation does all the dirty work */
 	PG_TRY();
 	{
-		/*
-		 * For concurrent processing, make sure that our logical decoding
-		 * ignores data changes of other tables than the one we are
-		 * processing.
-		 */
-		if (concurrent)
-			begin_concurrent_repack(OldHeap);
-
 		rebuild_relation(OldHeap, index, verbose, concurrent);
 	}
 	PG_FINALLY();
 	{
 		if (concurrent)
-			end_concurrent_repack();
+		{
+			/*
+			 * Since during normal operation the worker was already asked to
+			 * exit, stopping it explicitly is especially important on ERROR.
+			 * However, it still seems good practice to make sure that the
+			 * worker never survives the REPACK command.
+			 */
+			stop_decoding_worker();
+		}
 	}
 	PG_END_TRY();
 
@@ -929,7 +1050,6 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose, bool concurrent
 	bool		swap_toast_by_content;
 	TransactionId frozenXid;
 	MultiXactId cutoffMulti;
-	LogicalDecodingContext *decoding_ctx = NULL;
 	Snapshot	snapshot = NULL;
 #if USE_ASSERT_CHECKING
 	LOCKMODE	lmode;
@@ -943,19 +1063,36 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose, bool concurrent
 	if (concurrent)
 	{
 		/*
-		 * Prepare to capture the concurrent data changes.
+		 * The worker needs to be a member of the lock group we're the leader
+		 * of. We ought to become the leader before the worker starts. The
+		 * worker will join the group as soon as it starts.
 		 *
-		 * Note that this call waits for all transactions with XID already
-		 * assigned to finish. If some of those transactions is waiting for a
-		 * lock conflicting with ShareUpdateExclusiveLock on our table (e.g.
-		 * it runs CREATE INDEX), we can end up in a deadlock. Not sure this
-		 * risk is worth unlocking/locking the table (and its clustering
-		 * index) and checking again if its still eligible for REPACK
-		 * CONCURRENTLY.
+		 * This is to make sure that the deadlock described below is
+		 * detectable by deadlock.c: if the worker waits for a transaction to
+		 * complete and we are waiting for the worker output, then effectively
+		 * we (i.e. this backend) are waiting for that transaction.
 		 */
-		decoding_ctx = setup_logical_decoding(tableOid);
+		BecomeLockGroupLeader();
+
+		/*
+		 * Start the worker that decodes data changes applied while we're
+		 * copying the table contents.
+		 *
+		 * Note that the worker has to wait for all transactions with XID
+		 * already assigned to finish. If any of those transactions is
+		 * waiting for a lock conflicting with ShareUpdateExclusiveLock on our
+		 * table (e.g. it runs CREATE INDEX), we can end up in a deadlock.
+		 * Not sure this risk is worth unlocking/locking the table (and its
+		 * clustering index) and checking again if it's still eligible for
+		 * REPACK CONCURRENTLY.
+		 */
+		start_decoding_worker(tableOid);
+
+		/*
+		 * Wait until the worker has the initial snapshot and retrieve it.
+		 */
+		snapshot = get_initial_snapshot(decoding_worker);
 
-		snapshot = SnapBuildInitialSnapshotForRepack(decoding_ctx->snapshot_builder);
 		PushActiveSnapshot(snapshot);
 	}
 
@@ -980,7 +1117,7 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose, bool concurrent
 	NewHeap = table_open(OIDNewHeap, NoLock);
 
 	/* Copy the heap data into the new table in the desired order */
-	copy_table_data(NewHeap, OldHeap, index, snapshot, decoding_ctx, verbose,
+	copy_table_data(NewHeap, OldHeap, index, snapshot, verbose,
 					&swap_toast_by_content, &frozenXid, &cutoffMulti);
 
 	/* The historic snapshot won't be needed anymore. */
@@ -1001,14 +1138,11 @@ rebuild_relation(Relation OldHeap, Relation index, bool verbose, bool concurrent
 		if (index)
 			index_close(index, NoLock);
 
-		rebuild_relation_finish_concurrent(NewHeap, OldHeap, decoding_ctx,
-										   frozenXid, cutoffMulti);
+		rebuild_relation_finish_concurrent(NewHeap, OldHeap, frozenXid,
+										   cutoffMulti);
 
 		pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
 									 PROGRESS_REPACK_PHASE_FINAL_CLEANUP);
-
-		/* Done with decoding. */
-		cleanup_logical_decoding(decoding_ctx);
 	}
 	else
 	{
@@ -1179,8 +1313,7 @@ make_new_heap(Oid OIDOldHeap, Oid NewTableSpace, Oid NewAccessMethod,
  */
 static void
 copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
-				Snapshot snapshot, LogicalDecodingContext *decoding_ctx,
-				bool verbose, bool *pSwapToastByContent,
+				Snapshot snapshot, bool verbose, bool *pSwapToastByContent,
 				TransactionId *pFreezeXid, MultiXactId *pCutoffMulti)
 {
 	Relation	relRelation;
@@ -1341,7 +1474,6 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex,
 	 */
 	table_relation_copy_for_cluster(OldHeap, NewHeap, OldIndex, use_sort,
 									cutoffs.OldestXmin, snapshot,
-									decoding_ctx,
 									&cutoffs.FreezeLimit,
 									&cutoffs.MultiXactCutoff,
 									&num_tuples, &tups_vacuumed,
@@ -2371,62 +2503,10 @@ RepackCommandAsString(RepackCommand cmd)
 		case REPACK_COMMAND_CLUSTER:
 			return "CLUSTER";
 	}
-	return "???";	/* keep compiler quiet */
+	return "???";				/* keep compiler quiet */
 }
 
 
-/*
- * Call this function before REPACK CONCURRENTLY starts to setup logical
- * decoding. It makes sure that other users of the table put enough
- * information into WAL.
- *
- * The point is that at various places we expect that the table we're
- * processing is treated like a system catalog. For example, we need to be
- * able to scan it using a "historic snapshot" anytime during the processing
- * (as opposed to scanning only at the start point of the decoding, as logical
- * replication does during initial table synchronization), in order to apply
- * concurrent UPDATE / DELETE commands.
- *
- * Note that TOAST table needs no attention here as it's not scanned using
- * historic snapshot.
- */
-static void
-begin_concurrent_repack(Relation rel)
-{
-	Oid			toastrelid;
-
-	/*
-	 * Avoid logical decoding of other relations by this backend. The lock we
-	 * have guarantees that the actual locator cannot be changed concurrently:
-	 * TRUNCATE needs AccessExclusiveLock.
-	 */
-	Assert(CheckRelationLockedByMe(rel, ShareUpdateExclusiveLock, false));
-	repacked_rel_locator = rel->rd_locator;
-	toastrelid = rel->rd_rel->reltoastrelid;
-	if (OidIsValid(toastrelid))
-	{
-		Relation	toastrel;
-
-		/* Avoid logical decoding of other TOAST relations. */
-		toastrel = table_open(toastrelid, AccessShareLock);
-		repacked_rel_toast_locator = toastrel->rd_locator;
-		table_close(toastrel, AccessShareLock);
-	}
-}
-
-/*
- * Call this when done with REPACK CONCURRENTLY.
- */
-static void
-end_concurrent_repack(void)
-{
-	/*
-	 * Restore normal function of (future) logical decoding for this backend.
-	 */
-	repacked_rel_locator.relNumber = InvalidOid;
-	repacked_rel_toast_locator.relNumber = InvalidOid;
-}
-
 /*
  * Is this backend performing logical decoding on behalf of REPACK
  * (CONCURRENTLY) ?
@@ -2491,9 +2571,10 @@ static LogicalDecodingContext *
 setup_logical_decoding(Oid relid)
 {
 	Relation	rel;
-	TupleDesc	tupdesc;
+	Oid			toastrelid;
 	LogicalDecodingContext *ctx;
-	RepackDecodingState *dstate = palloc0_object(RepackDecodingState);
+	NameData	slotname;
+	RepackDecodingState *dstate;
 
 	/*
 	 * REPACK CONCURRENTLY is not allowed in a transaction block, so this
@@ -2501,20 +2582,21 @@ setup_logical_decoding(Oid relid)
 	 */
 	Assert(!TransactionIdIsValid(GetTopTransactionIdIfAny()));
 
-	/*
-	 * A single backend should not execute multiple REPACK commands at a time,
-	 * so use PID to make the slot unique.
-	 */
-	snprintf(NameStr(dstate->slotname), NAMEDATALEN, "repack_%d", MyProcPid);
-
 	/*
 	 * Make sure we can use logical decoding.
 	 */
 	CheckSlotPermissions();
 	CheckLogicalDecodingRequirements();
-	/* RS_TEMPORARY so that the slot gets cleaned up on ERROR. */
-	ReplicationSlotCreate(NameStr(dstate->slotname), true, RS_TEMPORARY,
-						  false, false, false);
+	/*
+	 * A single backend should not execute multiple REPACK commands at a time,
+	 * so use PID to make the slot unique.
+	 *
+	 * RS_TEMPORARY so that the slot gets cleaned up on ERROR.
+	 */
+	snprintf(NameStr(slotname), NAMEDATALEN, "repack_%d", MyProcPid);
+	ReplicationSlotCreate(NameStr(slotname), true, RS_TEMPORARY, false, false,
+						  false);
+
 	EnsureLogicalDecodingEnabled();
 
 	/*
@@ -2537,104 +2619,109 @@ setup_logical_decoding(Oid relid)
 
 	DecodingContextFindStartpoint(ctx);
 
+	/*
+	 * decode_concurrent_changes() needs a non-blocking page-read callback.
+	 */
+	ctx->reader->routine.page_read = read_local_xlog_page_no_wait;
+
+	/*
+	 * read_local_xlog_page_no_wait() needs to be able to indicate the end of
+	 * WAL.
+	 */
+	ctx->reader->private_data = MemoryContextAllocZero(ctx->context,
+													   sizeof(ReadLocalXLogPageNoWaitPrivate));
+
+
 	/* Some WAL records should have been read. */
 	Assert(ctx->reader->EndRecPtr != InvalidXLogRecPtr);
 
+	/*
+	 * Initialize repack_current_segment so that we can notice WAL segment
+	 * boundaries.
+	 */
 	XLByteToSeg(ctx->reader->EndRecPtr, repack_current_segment,
 				wal_segment_size);
 
-	/*
-	 * Setup structures to store decoded changes.
-	 */
+	dstate = palloc0_object(RepackDecodingState);
 	dstate->relid = relid;
-	dstate->tstore = tuplestore_begin_heap(false, false,
-										   maintenance_work_mem);
 
-	/* Caller should already have the table locked. */
-	rel = table_open(relid, NoLock);
-	tupdesc = CreateTupleDescCopy(RelationGetDescr(rel));
-	dstate->tupdesc = tupdesc;
-	table_close(rel, NoLock);
+	/*
+	 * The tuple descriptor may be needed to flatten a tuple before we write
+	 * it to a file. A copy is needed because the decoding worker invalidates
+	 * system caches before it starts to do the actual work.
+	 */
+	rel = table_open(relid, AccessShareLock);
+	dstate->tupdesc = CreateTupleDescCopy(RelationGetDescr(rel));
 
-	/* Initialize the descriptor to store the changes ... */
-	dstate->tupdesc_change = CreateTemplateTupleDesc(1);
+	/* Avoid logical decoding of other relations. */
+	repacked_rel_locator = rel->rd_locator;
+	toastrelid = rel->rd_rel->reltoastrelid;
+	if (OidIsValid(toastrelid))
+	{
+		Relation	toastrel;
 
-	TupleDescInitEntry(dstate->tupdesc_change, 1, NULL, BYTEAOID, -1, 0);
-	/* ... as well as the corresponding slot. */
-	dstate->tsslot = MakeSingleTupleTableSlot(dstate->tupdesc_change,
-											  &TTSOpsMinimalTuple);
+		/* Avoid logical decoding of other TOAST relations. */
+		toastrel = table_open(toastrelid, AccessShareLock);
+		repacked_rel_toast_locator = toastrel->rd_locator;
+		table_close(toastrel, AccessShareLock);
+	}
+	table_close(rel, AccessShareLock);
 
-	dstate->resowner = ResourceOwnerCreate(CurrentResourceOwner,
-										   "logical decoding");
+	/* The file will be set as soon as we have it opened. */
+	dstate->file = NULL;
 
 	ctx->output_writer_private = dstate;
+
 	return ctx;
 }
 
 /*
- * Retrieve tuple from ConcurrentChange structure.
+ * Decode logical changes from the WAL sequence and store them to a file.
  *
- * The input data starts with the structure but it might not be appropriately
- * aligned.
+ * If true is returned, there is no more work for the worker.
  */
-static HeapTuple
-get_changed_tuple(char *change)
-{
-	HeapTupleData tup_data;
-	HeapTuple	result;
-	char	   *src;
-
-	/*
-	 * Ensure alignment before accessing the fields. (This is why we can't use
-	 * heap_copytuple() instead of this function.)
-	 */
-	src = change + offsetof(ConcurrentChange, tup_data);
-	memcpy(&tup_data, src, sizeof(HeapTupleData));
-
-	result = (HeapTuple) palloc(HEAPTUPLESIZE + tup_data.t_len);
-	memcpy(result, &tup_data, sizeof(HeapTupleData));
-	result->t_data = (HeapTupleHeader) ((char *) result + HEAPTUPLESIZE);
-	src = change + SizeOfConcurrentChange;
-	memcpy(result->t_data, src, result->t_len);
-
-	return result;
-}
-
-/*
- * Decode logical changes from the WAL sequence up to end_of_wal.
- */
-void
-repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
-								 XLogRecPtr end_of_wal)
+static bool
+decode_concurrent_changes(LogicalDecodingContext *ctx,
+						  DecodingWorkerShared *shared)
 {
 	RepackDecodingState *dstate;
-	ResourceOwner resowner_old;
+	XLogRecPtr	lsn_upto;
+	bool		done;
+	char		fname[MAXPGPATH];
 
 	dstate = (RepackDecodingState *) ctx->output_writer_private;
-	resowner_old = CurrentResourceOwner;
-	CurrentResourceOwner = dstate->resowner;
 
-	PG_TRY();
+	/* Open the output file. */
+	DecodingWorkerFileName(fname, shared->relid, shared->last_exported + 1);
+	dstate->file = BufFileCreateFileSet(&shared->sfs.fs, fname);
+
+	SpinLockAcquire(&shared->mutex);
+	lsn_upto = shared->lsn_upto;
+	done = shared->done;
+	SpinLockRelease(&shared->mutex);
+
+	while (true)
 	{
-		while (ctx->reader->EndRecPtr < end_of_wal)
+		XLogRecord *record;
+		XLogSegNo	segno_new;
+		char	   *errm = NULL;
+		XLogRecPtr	end_lsn;
+
+		CHECK_FOR_INTERRUPTS();
+
+		record = XLogReadRecord(ctx->reader, &errm);
+		if (record)
 		{
-			XLogRecord *record;
-			XLogSegNo	segno_new;
-			char	   *errm = NULL;
-			XLogRecPtr	end_lsn;
-
-			record = XLogReadRecord(ctx->reader, &errm);
-			if (errm)
-				elog(ERROR, "%s", errm);
-
-			if (record != NULL)
-				LogicalDecodingProcessRecord(ctx, ctx->reader);
+			LogicalDecodingProcessRecord(ctx, ctx->reader);
 
 			/*
 			 * If WAL segment boundary has been crossed, inform the decoding
-			 * system that the catalog_xmin can advance. (We can confirm more
-			 * often, but a filling a single WAL segment should not take much
-			 * time.)
+			 * system that the catalog_xmin can advance.
+			 *
+			 * TODO Does it make sense to confirm more often? Segment size
+			 * seems appropriate for restart_lsn (because less than a segment
+			 * cannot be recycled anyway), however more frequent checks might
+			 * be beneficial for catalog_xmin.
 			 */
 			end_lsn = ctx->reader->EndRecPtr;
 			XLByteToSeg(end_lsn, segno_new, wal_segment_size);
@@ -2645,80 +2732,137 @@ repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
 					 (uint32) (end_lsn >> 32), (uint32) end_lsn);
 				repack_current_segment = segno_new;
 			}
-
-			CHECK_FOR_INTERRUPTS();
 		}
-		InvalidateSystemCaches();
-		CurrentResourceOwner = resowner_old;
+		else
+		{
+			ReadLocalXLogPageNoWaitPrivate *priv;
+
+			if (errm)
+				ereport(ERROR, (errmsg_internal("%s", errm)));
+
+			/*
+			 * In the decoding loop we do not want to get blocked when there
+			 * is no more WAL available, otherwise the loop would become
+			 * uninterruptible.
+			 */
+			priv = (ReadLocalXLogPageNoWaitPrivate *)
+				ctx->reader->private_data;
+			if (priv->end_of_wal)
+				/* Do not miss the end of WAL condition next time. */
+				priv->end_of_wal = false;
+			else
+				ereport(ERROR, (errmsg("could not read WAL record")));
+		}
+
+		/*
+		 * Whether or not we could read a new record, keep checking if
+		 * 'lsn_upto' was specified.
+		 */
+		if (XLogRecPtrIsInvalid(lsn_upto))
+		{
+			SpinLockAcquire(&shared->mutex);
+			lsn_upto = shared->lsn_upto;
+			/* 'done' should be set at the same time as 'lsn_upto' */
+			done = shared->done;
+			SpinLockRelease(&shared->mutex);
+		}
+		if (!XLogRecPtrIsInvalid(lsn_upto) &&
+			ctx->reader->EndRecPtr >= lsn_upto)
+			break;
+
+		if (record == NULL)
+		{
+			int64		timeout = 0;
+			WaitLSNResult res;
+
+			/*
+			 * Before we retry reading, wait until new WAL is flushed.
+			 *
+			 * There is a race condition such that the backend executing
+			 * REPACK determines 'lsn_upto', but before it sets the shared
+			 * variable, we reach the end of WAL. In that case we'd need to
+			 * wait until the next WAL flush (unrelated to REPACK). Although
+			 * that should not be a problem in a busy system, it might be
+			 * noticeable in other cases, including regression tests (which
+			 * are not necessarily executed in parallel). Therefore it makes
+			 * sense to use a timeout.
+			 *
+			 * If lsn_upto is valid, WAL records having LSN lower than that
+			 * should already have been flushed to disk.
+			 */
+			if (XLogRecPtrIsInvalid(lsn_upto))
+				timeout = 100L;
+			res = WaitForLSN(WAIT_LSN_TYPE_PRIMARY_FLUSH,
+							 ctx->reader->EndRecPtr + 1,
+							 timeout);
+			if (res != WAIT_LSN_RESULT_SUCCESS &&
+				res != WAIT_LSN_RESULT_TIMEOUT)
+				ereport(ERROR, (errmsg("waiting for WAL failed")));
+		}
 	}
-	PG_CATCH();
-	{
-		/* clear all timetravel entries */
-		InvalidateSystemCaches();
-		CurrentResourceOwner = resowner_old;
-		PG_RE_THROW();
-	}
-	PG_END_TRY();
+
+	/*
+	 * Close the file so we can make it available to the backend.
+	 */
+	BufFileClose(dstate->file);
+	dstate->file = NULL;
+	SpinLockAcquire(&shared->mutex);
+	shared->lsn_upto = InvalidXLogRecPtr;
+	shared->last_exported++;
+	SpinLockRelease(&shared->mutex);
+	ConditionVariableSignal(&shared->cv);
+
+	return done;
 }
 
 /*
  * Apply changes stored in 'file'.
  */
 static void
-apply_concurrent_changes(RepackDecodingState *dstate, ChangeDest *dest)
+apply_concurrent_changes(BufFile *file, ChangeDest *dest)
 {
+	char		kind;
+	uint32		t_len;
 	Relation	rel = dest->rel;
 	TupleTableSlot *index_slot,
 			   *ident_slot;
 	HeapTuple	tup_old = NULL;
 
-	if (dstate->nchanges == 0)
-		return;
-
 	/* TupleTableSlot is needed to pass the tuple to ExecInsertIndexTuples(). */
-	index_slot = MakeSingleTupleTableSlot(dstate->tupdesc, &TTSOpsHeapTuple);
+	index_slot = MakeSingleTupleTableSlot(RelationGetDescr(rel),
+										  &TTSOpsHeapTuple);
 
 	/* A slot to fetch tuples from identity index. */
 	ident_slot = table_slot_create(rel, NULL);
 
-	while (tuplestore_gettupleslot(dstate->tstore, true, false,
-								   dstate->tsslot))
+	while (true)
 	{
-		bool		shouldFree;
-		HeapTuple	tup_change,
-					tup,
+		size_t		nread;
+		HeapTuple	tup,
 					tup_exist;
-		char	   *change_raw,
-				   *src;
-		ConcurrentChange change;
-		bool		isnull[1];
-		Datum		values[1];
 
 		CHECK_FOR_INTERRUPTS();
 
-		/* Get the change from the single-column tuple. */
-		tup_change = ExecFetchSlotHeapTuple(dstate->tsslot, false, &shouldFree);
-		heap_deform_tuple(tup_change, dstate->tupdesc_change, values, isnull);
-		Assert(!isnull[0]);
+		nread = BufFileReadMaybeEOF(file, &kind, 1, true);
+		/* Are we done with the file? */
+		if (nread == 0)
+			break;
 
-		/* Make sure we access aligned data. */
-		change_raw = (char *) DatumGetByteaP(values[0]);
-		src = (char *) VARDATA(change_raw);
-		memcpy(&change, src, SizeOfConcurrentChange);
+		/* Read the tuple. */
+		BufFileReadExact(file, &t_len, sizeof(t_len));
+		tup = (HeapTuple) palloc(HEAPTUPLESIZE + t_len);
+		tup->t_data = (HeapTupleHeader) ((char *) tup + HEAPTUPLESIZE);
+		BufFileReadExact(file, tup->t_data, t_len);
+		tup->t_len = t_len;
+		ItemPointerSetInvalid(&tup->t_self);
+		tup->t_tableOid = RelationGetRelid(dest->rel);
 
-		/*
-		 * Extract the tuple from the change. The tuple is copied here because
-		 * it might be assigned to 'tup_old', in which case it needs to
-		 * survive into the next iteration.
-		 */
-		tup = get_changed_tuple(src);
-
-		if (change.kind == CHANGE_UPDATE_OLD)
+		if (kind == CHANGE_UPDATE_OLD)
 		{
 			Assert(tup_old == NULL);
 			tup_old = tup;
 		}
-		else if (change.kind == CHANGE_INSERT)
+		else if (kind == CHANGE_INSERT)
 		{
 			Assert(tup_old == NULL);
 
@@ -2726,12 +2870,11 @@ apply_concurrent_changes(RepackDecodingState *dstate, ChangeDest *dest)
 
 			pfree(tup);
 		}
-		else if (change.kind == CHANGE_UPDATE_NEW ||
-				 change.kind == CHANGE_DELETE)
+		else if (kind == CHANGE_UPDATE_NEW || kind == CHANGE_DELETE)
 		{
 			HeapTuple	tup_key;
 
-			if (change.kind == CHANGE_UPDATE_NEW)
+			if (kind == CHANGE_UPDATE_NEW)
 			{
 				tup_key = tup_old != NULL ? tup_old : tup;
 			}
@@ -2748,7 +2891,7 @@ apply_concurrent_changes(RepackDecodingState *dstate, ChangeDest *dest)
 			if (tup_exist == NULL)
 				elog(ERROR, "failed to find target tuple");
 
-			if (change.kind == CHANGE_UPDATE_NEW)
+			if (kind == CHANGE_UPDATE_NEW)
 				apply_concurrent_update(rel, tup, tup_exist, dest->iistate,
 										index_slot);
 			else
@@ -2763,26 +2906,19 @@ apply_concurrent_changes(RepackDecodingState *dstate, ChangeDest *dest)
 			pfree(tup);
 		}
 		else
-			elog(ERROR, "unrecognized kind of change: %d", change.kind);
+			elog(ERROR, "unrecognized kind of change: %d", kind);
 
 		/*
 		 * If a change was applied now, increment CID for next writes and
 		 * update the snapshot so it sees the changes we've applied so far.
 		 */
-		if (change.kind != CHANGE_UPDATE_OLD)
+		if (kind != CHANGE_UPDATE_OLD)
 		{
 			CommandCounterIncrement();
 			UpdateActiveSnapshotCommandId();
 		}
-
-		/* TTSOpsMinimalTuple has .get_heap_tuple==NULL. */
-		Assert(shouldFree);
-		pfree(tup_change);
 	}
 
-	tuplestore_clear(dstate->tstore);
-	dstate->nchanges = 0;
-
 	/* Cleanup. */
 	ExecDropSingleTupleTableSlot(index_slot);
 	ExecDropSingleTupleTableSlot(ident_slot);
@@ -2957,25 +3093,59 @@ find_target_tuple(Relation rel, ChangeDest *dest, HeapTuple tup_key,
 }
 
 /*
- * Decode and apply concurrent changes.
+ * Decode and apply concurrent changes, up to (and including) the record whose
+ * LSN is 'end_of_wal'.
  */
 static void
-process_concurrent_changes(LogicalDecodingContext *decoding_ctx,
-						   XLogRecPtr end_of_wal, ChangeDest *dest)
+process_concurrent_changes(XLogRecPtr end_of_wal, ChangeDest *dest, bool done)
 {
-	RepackDecodingState *dstate;
+	DecodingWorkerShared *shared;
+	char		fname[MAXPGPATH];
+	BufFile    *file;
 
 	pgstat_progress_update_param(PROGRESS_REPACK_PHASE,
 								 PROGRESS_REPACK_PHASE_CATCH_UP);
 
-	dstate = (RepackDecodingState *) decoding_ctx->output_writer_private;
+	/* Ask the worker for the file. */
+	shared = (DecodingWorkerShared *) dsm_segment_address(decoding_worker->seg);
+	SpinLockAcquire(&shared->mutex);
+	shared->lsn_upto = end_of_wal;
+	shared->done = done;
+	SpinLockRelease(&shared->mutex);
 
-	repack_decode_concurrent_changes(decoding_ctx, end_of_wal);
+	/*
+	 * The worker needs to finish processing of the current WAL record. Even
+	 * if it's idle, it'll need to close the output file. Thus we're likely to
+	 * wait, so prepare for sleep.
+	 */
+	ConditionVariablePrepareToSleep(&shared->cv);
+	for (;;)
+	{
+		int			last_exported;
 
-	if (dstate->nchanges == 0)
-		return;
+		SpinLockAcquire(&shared->mutex);
+		last_exported = shared->last_exported;
+		SpinLockRelease(&shared->mutex);
 
-	apply_concurrent_changes(dstate, dest);
+		/*
+		 * Has the worker exported the file we are waiting for?
+		 */
+		if (last_exported == dest->file_seq)
+			break;
+
+		ConditionVariableSleep(&shared->cv, WAIT_EVENT_REPACK_WORKER_EXPORT);
+	}
+	ConditionVariableCancelSleep();
+
+	/* Open the file. */
+	DecodingWorkerFileName(fname, shared->relid, dest->file_seq);
+	file = BufFileOpenFileSet(&shared->sfs.fs, fname, O_RDONLY, false);
+	apply_concurrent_changes(file, dest);
+
+	BufFileClose(file);
+
+	/* Get ready for the next file. */
+	dest->file_seq++;
 }
 
 /*
@@ -3101,15 +3271,10 @@ cleanup_logical_decoding(LogicalDecodingContext *ctx)
 
 	dstate = (RepackDecodingState *) ctx->output_writer_private;
 
-	ExecDropSingleTupleTableSlot(dstate->tsslot);
-	FreeTupleDesc(dstate->tupdesc_change);
 	FreeTupleDesc(dstate->tupdesc);
-	tuplestore_end(dstate->tstore);
-
 	FreeDecodingContext(ctx);
 
-	ReplicationSlotRelease();
-	ReplicationSlotDrop(NameStr(dstate->slotname), false);
+	ReplicationSlotDropAcquired();
 	pfree(dstate);
 }
 
@@ -3123,7 +3288,6 @@ cleanup_logical_decoding(LogicalDecodingContext *ctx)
  */
 static void
 rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
-								   LogicalDecodingContext *decoding_ctx,
 								   TransactionId frozenXid,
 								   MultiXactId cutoffMulti)
 {
@@ -3204,6 +3368,7 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
 											&chgdst.ident_index);
 	chgdst.ident_key = build_identity_key(ident_idx_new, OldHeap,
 										  &chgdst.ident_key_nentries);
+	chgdst.file_seq = WORKER_FILE_SNAPSHOT + 1;
 
 	/*
 	 * During testing, wait for another backend to perform concurrent data
@@ -3225,7 +3390,7 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
 	 * hold AccessExclusiveLock. (Quite some amount of WAL could have been
 	 * written during the data copying and index creation.)
 	 */
-	process_concurrent_changes(decoding_ctx, end_of_wal, &chgdst);
+	process_concurrent_changes(end_of_wal, &chgdst, false);
 
 	/*
 	 * Acquire AccessExclusiveLock on the table, its TOAST relation (if there
@@ -3306,8 +3471,11 @@ rebuild_relation_finish_concurrent(Relation NewHeap, Relation OldHeap,
 	XLogFlush(wal_insert_ptr);
 	end_of_wal = GetFlushRecPtr(NULL);
 
-	/* Apply the concurrent changes again. */
-	process_concurrent_changes(decoding_ctx, end_of_wal, &chgdst);
+	/*
+	 * Apply the concurrent changes again. Indicate that the decoding worker
+	 * won't be needed anymore.
+	 */
+	process_concurrent_changes(end_of_wal, &chgdst, true);
 
 	/* Remember info about rel before closing OldHeap */
 	relpersistence = OldHeap->rd_rel->relpersistence;
@@ -3417,3 +3585,510 @@ build_new_indexes(Relation NewHeap, Relation OldHeap, List *OldIndexes)
 
 	return result;
 }
+
+/*
+ * Try to start a background worker to perform logical decoding of data
+ * changes applied to the relation while REPACK CONCURRENTLY is copying its
+ * contents to a new table.
+ */
+static void
+start_decoding_worker(Oid relid)
+{
+	Size		size;
+	dsm_segment *seg;
+	DecodingWorkerShared *shared;
+	shm_mq	   *mq;
+	shm_mq_handle *mqh;
+	BackgroundWorker bgw;
+
+	/* Setup shared memory. */
+	size = BUFFERALIGN(offsetof(DecodingWorkerShared, error_queue)) +
+		BUFFERALIGN(REPACK_ERROR_QUEUE_SIZE);
+	seg = dsm_create(size, 0);
+	shared = (DecodingWorkerShared *) dsm_segment_address(seg);
+	shared->lsn_upto = InvalidXLogRecPtr;
+	shared->done = false;
+	SharedFileSetInit(&shared->sfs, seg);
+	shared->last_exported = -1;
+	SpinLockInit(&shared->mutex);
+	shared->dbid = MyDatabaseId;
+
+	/*
+	 * This is the UserId set in cluster_rel(). Security context shouldn't be
+	 * needed for the decoding worker.
+	 */
+	shared->roleid = GetUserId();
+	shared->relid = relid;
+	ConditionVariableInit(&shared->cv);
+	shared->backend_proc = MyProc;
+	shared->backend_pid = MyProcPid;
+	shared->backend_proc_number = MyProcNumber;
+
+	mq = shm_mq_create((char *) BUFFERALIGN(shared->error_queue),
+					   REPACK_ERROR_QUEUE_SIZE);
+	shm_mq_set_receiver(mq, MyProc);
+	mqh = shm_mq_attach(mq, seg, NULL);
+
+	memset(&bgw, 0, sizeof(bgw));
+	snprintf(bgw.bgw_name, BGW_MAXLEN,
+			 "REPACK decoding worker for relation \"%s\"",
+			 get_rel_name(relid));
+	snprintf(bgw.bgw_type, BGW_MAXLEN, "REPACK decoding worker");
+	bgw.bgw_flags = BGWORKER_SHMEM_ACCESS |
+		BGWORKER_BACKEND_DATABASE_CONNECTION;
+	bgw.bgw_start_time = BgWorkerStart_RecoveryFinished;
+	bgw.bgw_restart_time = BGW_NEVER_RESTART;
+	snprintf(bgw.bgw_library_name, MAXPGPATH, "postgres");
+	snprintf(bgw.bgw_function_name, BGW_MAXLEN, "RepackWorkerMain");
+	bgw.bgw_main_arg = UInt32GetDatum(dsm_segment_handle(seg));
+	bgw.bgw_notify_pid = MyProcPid;
+
+	decoding_worker = palloc0_object(DecodingWorker);
+	if (!RegisterDynamicBackgroundWorker(&bgw, &decoding_worker->handle))
+		ereport(ERROR,
+				(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+				 errmsg("out of background worker slots"),
+				 errhint("You might need to increase \"%s\".", "max_worker_processes")));
+
+	decoding_worker->seg = seg;
+	decoding_worker->error_mqh = mqh;
+
+	/*
+	 * The decoding setup must be done before the caller can have XID assigned
+	 * for any reason, otherwise the worker might end up in a deadlock,
+	 * waiting for the caller's transaction to end. Therefore wait here until
+	 * the worker indicates that it has the logical decoding initialized.
+	 */
+	ConditionVariablePrepareToSleep(&shared->cv);
+	for (;;)
+	{
+		bool		initialized;
+
+		SpinLockAcquire(&shared->mutex);
+		initialized = shared->initialized;
+		SpinLockRelease(&shared->mutex);
+
+		if (initialized)
+			break;
+
+		ConditionVariableSleep(&shared->cv, WAIT_EVENT_REPACK_WORKER_EXPORT);
+	}
+	ConditionVariableCancelSleep();
+}
+
+/*
+ * Stop the decoding worker and clean up the related resources.
+ *
+ * The worker stops on its own when it knows there is no more work to do, but
+ * we need to stop it explicitly at least on ERROR in the launching backend.
+ */
+static void
+stop_decoding_worker(void)
+{
+	BgwHandleStatus status;
+
+	/* Haven't reached the worker startup? */
+	if (decoding_worker == NULL)
+		return;
+
+	/* Could not register the worker? */
+	if (decoding_worker->handle == NULL)
+		return;
+
+	TerminateBackgroundWorker(decoding_worker->handle);
+	/* The worker should really exit before the REPACK command does. */
+	HOLD_INTERRUPTS();
+	status = WaitForBackgroundWorkerShutdown(decoding_worker->handle);
+	RESUME_INTERRUPTS();
+
+	if (status == BGWH_POSTMASTER_DIED)
+		ereport(FATAL,
+				(errcode(ERRCODE_ADMIN_SHUTDOWN),
+				 errmsg("postmaster exited during REPACK command")));
+
+	shm_mq_detach(decoding_worker->error_mqh);
+
+	/*
+	 * If we could not cancel the current sleep due to ERROR, do that before
+	 * we detach from the shared memory the condition variable is located in.
+	 * If we did not, the bgworker ERROR handling code would try and fail
+	 * badly.
+	 */
+	ConditionVariableCancelSleep();
+
+	dsm_detach(decoding_worker->seg);
+	pfree(decoding_worker);
+	decoding_worker = NULL;
+}
+
+/* Is this process a REPACK worker? */
+static bool is_repack_worker = false;
+
+static pid_t backend_pid;
+static ProcNumber backend_proc_number;
+
+/*
+ * See ParallelWorkerShutdown for details.
+ */
+static void
+RepackWorkerShutdown(int code, Datum arg)
+{
+	SendProcSignal(backend_pid,
+				   PROCSIG_REPACK_MESSAGE,
+				   backend_proc_number);
+
+	dsm_detach((dsm_segment *) DatumGetPointer(arg));
+}
+
+/* REPACK decoding worker entry point */
+void
+RepackWorkerMain(Datum main_arg)
+{
+	dsm_segment *seg;
+	DecodingWorkerShared *shared;
+	shm_mq	   *mq;
+	shm_mq_handle *mqh;
+
+	is_repack_worker = true;
+
+	/*
+	 * Override the default bgworker_die() with die() so we can use
+	 * CHECK_FOR_INTERRUPTS().
+	 */
+	pqsignal(SIGTERM, die);
+	BackgroundWorkerUnblockSignals();
+
+	seg = dsm_attach(DatumGetUInt32(main_arg));
+	if (seg == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("could not map dynamic shared memory segment")));
+
+	shared = (DecodingWorkerShared *) dsm_segment_address(seg);
+
+	/* Arrange to signal the leader if we exit. */
+	backend_pid = shared->backend_pid;
+	backend_proc_number = shared->backend_proc_number;
+	before_shmem_exit(RepackWorkerShutdown, PointerGetDatum(seg));
+
+	/*
+	 * Join locking group - see the comments around the call of
+	 * start_decoding_worker().
+	 */
+	if (!BecomeLockGroupMember(shared->backend_proc, backend_pid))
+		/* The leader is not running anymore. */
+		return;
+
+	/*
+	 * Set up a queue to send error messages to the backend that launched this
+	 * worker.
+	 */
+	mq = (shm_mq *) (char *) BUFFERALIGN(shared->error_queue);
+	shm_mq_set_sender(mq, MyProc);
+	mqh = shm_mq_attach(mq, seg, NULL);
+	pq_redirect_to_shm_mq(seg, mqh);
+	pq_set_parallel_leader(shared->backend_pid,
+						   shared->backend_proc_number);
+
+	/* Connect to the database. */
+	BackgroundWorkerInitializeConnectionByOid(shared->dbid, shared->roleid, 0);
+
+	repack_worker_internal(seg);
+}
+
+static void
+repack_worker_internal(dsm_segment *seg)
+{
+	DecodingWorkerShared *shared;
+	LogicalDecodingContext *decoding_ctx;
+	SharedFileSet *sfs;
+	Snapshot	snapshot;
+
+	/*
+	 * Transaction is needed to open relation, and it also provides us with a
+	 * resource owner.
+	 */
+	StartTransactionCommand();
+
+	shared = (DecodingWorkerShared *) dsm_segment_address(seg);
+
+	/*
+	 * Not sure the spinlock is needed here - the backend should not change
+	 * anything in the shared memory until we have serialized the snapshot.
+	 */
+	SpinLockAcquire(&shared->mutex);
+	Assert(XLogRecPtrIsInvalid(shared->lsn_upto));
+	sfs = &shared->sfs;
+	SpinLockRelease(&shared->mutex);
+
+	SharedFileSetAttach(sfs, seg);
+
+	/*
+	 * Prepare to capture the concurrent data changes ourselves.
+	 */
+	decoding_ctx = setup_logical_decoding(shared->relid);
+
+	/* Announce that we're ready. */
+	SpinLockAcquire(&shared->mutex);
+	shared->initialized = true;
+	SpinLockRelease(&shared->mutex);
+	ConditionVariableSignal(&shared->cv);
+
+	/* Build the initial snapshot and export it. */
+	snapshot = SnapBuildInitialSnapshotForRepack(decoding_ctx->snapshot_builder);
+	export_initial_snapshot(snapshot, shared);
+
+	/*
+	 * Only historic snapshots should be used now. Do not let us restrict the
+	 * progress of xmin horizon.
+	 */
+	InvalidateCatalogSnapshot();
+
+	while (!decode_concurrent_changes(decoding_ctx, shared))
+		;
+
+	/* Cleanup. */
+	cleanup_logical_decoding(decoding_ctx);
+	CommitTransactionCommand();
+}
+
+/*
+ * Make snapshot available to the backend that launched the decoding worker.
+ */
+static void
+export_initial_snapshot(Snapshot snapshot, DecodingWorkerShared *shared)
+{
+	char		fname[MAXPGPATH];
+	BufFile    *file;
+	Size		snap_size;
+	char	   *snap_space;
+
+	snap_size = EstimateSnapshotSpace(snapshot);
+	snap_space = (char *) palloc(snap_size);
+	SerializeSnapshot(snapshot, snap_space);
+	FreeSnapshot(snapshot);
+
+	DecodingWorkerFileName(fname, shared->relid, shared->last_exported + 1);
+	file = BufFileCreateFileSet(&shared->sfs.fs, fname);
+	/* To make restoration easier, write the snapshot size first. */
+	BufFileWrite(file, &snap_size, sizeof(snap_size));
+	BufFileWrite(file, snap_space, snap_size);
+	pfree(snap_space);
+	BufFileClose(file);
+
+	/* Increase the counter to tell the backend that the file is available. */
+	SpinLockAcquire(&shared->mutex);
+	shared->last_exported++;
+	SpinLockRelease(&shared->mutex);
+	ConditionVariableSignal(&shared->cv);
+}
+
+/*
+ * Get the initial snapshot from the decoding worker.
+ */
+static Snapshot
+get_initial_snapshot(DecodingWorker *worker)
+{
+	DecodingWorkerShared *shared;
+	char		fname[MAXPGPATH];
+	BufFile    *file;
+	Size		snap_size;
+	char	   *snap_space;
+	Snapshot	snapshot;
+
+	shared = (DecodingWorkerShared *) dsm_segment_address(worker->seg);
+
+	/*
+	 * The worker needs to initialize the logical decoding, which usually
+	 * takes some time. Therefore it makes sense to prepare for the sleep
+	 * first.
+	 */
+	ConditionVariablePrepareToSleep(&shared->cv);
+	for (;;)
+	{
+		int			last_exported;
+
+		SpinLockAcquire(&shared->mutex);
+		last_exported = shared->last_exported;
+		SpinLockRelease(&shared->mutex);
+
+		/*
+		 * Has the worker exported the file we are waiting for?
+		 */
+		if (last_exported == WORKER_FILE_SNAPSHOT)
+			break;
+
+		ConditionVariableSleep(&shared->cv, WAIT_EVENT_REPACK_WORKER_EXPORT);
+	}
+	ConditionVariableCancelSleep();
+
+	/* Read the snapshot from a file. */
+	DecodingWorkerFileName(fname, shared->relid, WORKER_FILE_SNAPSHOT);
+	file = BufFileOpenFileSet(&shared->sfs.fs, fname, O_RDONLY, false);
+	BufFileReadExact(file, &snap_size, sizeof(snap_size));
+	snap_space = (char *) palloc(snap_size);
+	BufFileReadExact(file, snap_space, snap_size);
+	BufFileClose(file);
+
+	/* Restore it. */
+	snapshot = RestoreSnapshot(snap_space);
+	pfree(snap_space);
+
+	return snapshot;
+}
+
+bool
+IsRepackWorker(void)
+{
+	return is_repack_worker;
+}
+
+/*
+ * Handle receipt of an interrupt indicating a repack worker message.
+ *
+ * Note: this is called within a signal handler!  All we can do is set
+ * a flag that will cause the next CHECK_FOR_INTERRUPTS() to invoke
+ * ProcessRepackMessages().
+ */
+void
+HandleRepackMessageInterrupt(void)
+{
+	InterruptPending = true;
+	RepackMessagePending = true;
+	SetLatch(MyLatch);
+}
+
+/*
+ * Process any queued protocol messages received from the decoding worker.
+ */
+void
+ProcessRepackMessages(void)
+{
+	MemoryContext oldcontext;
+
+	static MemoryContext hpm_context = NULL;
+
+	/*
+	 * Nothing to do if we haven't launched the worker yet or have already
+	 * terminated it.
+	 */
+	if (decoding_worker == NULL)
+		return;
+
+	/*
+	 * This is invoked from ProcessInterrupts(), and since some of the
+	 * functions it calls contain CHECK_FOR_INTERRUPTS(), there is a potential
+	 * for recursive calls if more signals are received while this runs.  It's
+	 * unclear that recursive entry would be safe, and it doesn't seem useful
+	 * even if it is safe, so let's block interrupts until done.
+	 */
+	HOLD_INTERRUPTS();
+
+	/*
+	 * Moreover, CurrentMemoryContext might be pointing almost anywhere.  We
+	 * don't want to risk leaking data into long-lived contexts, so let's do
+	 * our work here in a private context that we can reset on each use.
+	 */
+	if (hpm_context == NULL)	/* first time through? */
+		hpm_context = AllocSetContextCreate(TopMemoryContext,
+											"ProcessRepackMessages",
+											ALLOCSET_DEFAULT_SIZES);
+	else
+		MemoryContextReset(hpm_context);
+
+	oldcontext = MemoryContextSwitchTo(hpm_context);
+
+	/* OK to process messages.  Reset the flag saying there are more to do. */
+	RepackMessagePending = false;
+
+	/*
+	 * Read as many messages as we can from the decoding worker, but stop when
+	 * no more messages can be read without blocking.
+	 */
+	while (true)
+	{
+		shm_mq_result res;
+		Size		nbytes;
+		void	   *data;
+
+		res = shm_mq_receive(decoding_worker->error_mqh, &nbytes,
+							 &data, true);
+		if (res == SHM_MQ_WOULD_BLOCK)
+			break;
+		else if (res == SHM_MQ_SUCCESS)
+		{
+			StringInfoData msg;
+
+			initStringInfo(&msg);
+			appendBinaryStringInfo(&msg, data, nbytes);
+			ProcessRepackMessage(&msg);
+			pfree(msg.data);
+		}
+		else
+		{
+			/*
+			 * The decoding worker is special in that it exits as soon as it
+			 * has its work done. Thus the DETACHED result code is fine.
+			 */
+			Assert(res == SHM_MQ_DETACHED);
+
+			break;
+		}
+	}
+
+	MemoryContextSwitchTo(oldcontext);
+
+	/* Might as well clear the context on our way out */
+	MemoryContextReset(hpm_context);
+
+	RESUME_INTERRUPTS();
+}
+
+/*
+ * Process a single protocol message received from the decoding worker.
+ */
+static void
+ProcessRepackMessage(StringInfo msg)
+{
+	char		msgtype;
+
+	msgtype = pq_getmsgbyte(msg);
+
+	switch (msgtype)
+	{
+		case PqMsg_ErrorResponse:
+		case PqMsg_NoticeResponse:
+			{
+				ErrorData	edata;
+
+				/* Parse ErrorResponse or NoticeResponse. */
+				pq_parse_errornotice(msg, &edata);
+
+				/* Death of a worker isn't enough justification for suicide. */
+				edata.elevel = Min(edata.elevel, ERROR);
+
+				/*
+				 * Add a context line to show that this is a message
+				 * propagated from the decoding worker.  Otherwise, it can
+				 * sometimes be confusing to understand what actually
+				 * happened.
+				 */
+				if (edata.context)
+					edata.context = psprintf("%s\n%s", edata.context,
+											 _("decoding worker"));
+				else
+					edata.context = pstrdup(_("decoding worker"));
+
+				/* Rethrow error or print notice. */
+				ThrowErrorData(&edata);
+
+				break;
+			}
+
+		default:
+			{
+				elog(ERROR, "unrecognized message type received from decoding worker: %c (message length %d bytes)",
+					 msgtype, msg->len);
+			}
+	}
+}
diff --git a/src/backend/libpq/pqmq.c b/src/backend/libpq/pqmq.c
index 6e4bbfb5aa1..42f6fa472c5 100644
--- a/src/backend/libpq/pqmq.c
+++ b/src/backend/libpq/pqmq.c
@@ -14,6 +14,7 @@
 #include "postgres.h"
 
 #include "access/parallel.h"
+#include "commands/cluster.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
 #include "libpq/pqmq.h"
@@ -175,6 +176,10 @@ mq_putmessage(char msgtype, const char *s, size_t len)
 				SendProcSignal(pq_mq_parallel_leader_pid,
 							   PROCSIG_PARALLEL_APPLY_MESSAGE,
 							   pq_mq_parallel_leader_proc_number);
+			else if (IsRepackWorker())
+				SendProcSignal(pq_mq_parallel_leader_pid,
+							   PROCSIG_REPACK_MESSAGE,
+							   pq_mq_parallel_leader_proc_number);
 			else
 			{
 				Assert(IsParallelWorker());
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index 261ccd3f59c..09c58371e8e 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -13,6 +13,7 @@
 #include "postgres.h"
 
 #include "access/parallel.h"
+#include "commands/cluster.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -142,6 +143,10 @@ static const struct
 	{
 		.fn_name = "SequenceSyncWorkerMain",
 		.fn_addr = SequenceSyncWorkerMain
+	},
+	{
+		.fn_name = "RepackWorkerMain",
+		.fn_addr = RepackWorkerMain
 	}
 };
 
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 603a2b94d05..7651b187418 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -194,7 +194,11 @@ StartupDecodingContext(List *output_plugin_options,
 
 	ctx->slot = slot;
 
-	ctx->reader = XLogReaderAllocate(wal_segment_size, NULL, xl_routine, ctx);
+	/*
+	 * TODO A separate patch for PG core, unless there's really a reason to
+	 * pass ctx for private_data (might extensions expect ctx?).
+	 */
+	ctx->reader = XLogReaderAllocate(wal_segment_size, NULL, xl_routine, NULL);
 	if (!ctx->reader)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
diff --git a/src/backend/replication/pgoutput_repack/pgoutput_repack.c b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
index 6b54ea040ac..3a109dbfaff 100644
--- a/src/backend/replication/pgoutput_repack/pgoutput_repack.c
+++ b/src/backend/replication/pgoutput_repack/pgoutput_repack.c
@@ -167,17 +167,13 @@ store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
 			 HeapTuple tuple)
 {
 	RepackDecodingState *dstate;
-	char	   *change_raw;
-	ConcurrentChange change;
+	char		kind_byte = (char) kind;
 	bool		flattened = false;
-	Size		size;
-	Datum		values[1];
-	bool		isnull[1];
-	char	   *dst;
 
 	dstate = (RepackDecodingState *) ctx->output_writer_private;
 
-	size = VARHDRSZ + SizeOfConcurrentChange;
+	/* Store the change kind. */
+	BufFileWrite(dstate->file, &kind_byte, 1);
 
 	/*
 	 * ReorderBufferCommit() stores the TOAST chunks in its private memory
@@ -194,46 +190,12 @@ store_change(LogicalDecodingContext *ctx, ConcurrentChangeKind kind,
 		tuple = toast_flatten_tuple(tuple, dstate->tupdesc);
 		flattened = true;
 	}
+	/* Store the tuple size ... */
+	BufFileWrite(dstate->file, &tuple->t_len, sizeof(tuple->t_len));
+	/* ... and the tuple itself. */
+	BufFileWrite(dstate->file, tuple->t_data, tuple->t_len);
 
-	size += tuple->t_len;
-	if (size >= MaxAllocSize)
-		elog(ERROR, "Change is too big.");
-
-	/* Construct the change. */
-	change_raw = (char *) palloc0(size);
-	SET_VARSIZE(change_raw, size);
-
-	/*
-	 * Since the varlena alignment might not be sufficient for the structure,
-	 * set the fields in a local instance and remember where it should
-	 * eventually be copied.
-	 */
-	change.kind = kind;
-	dst = (char *) VARDATA(change_raw);
-
-	/*
-	 * Copy the tuple.
-	 *
-	 * Note: change->tup_data.t_data must be fixed on retrieval!
-	 */
-	memcpy(&change.tup_data, tuple, sizeof(HeapTupleData));
-	memcpy(dst, &change, SizeOfConcurrentChange);
-	dst += SizeOfConcurrentChange;
-	memcpy(dst, tuple->t_data, tuple->t_len);
-
-	/* The data has been copied. */
+	/* Free the flat copy if created above. */
 	if (flattened)
 		pfree(tuple);
-
-	/* Store as tuple of 1 bytea column. */
-	values[0] = PointerGetDatum(change_raw);
-	isnull[0] = false;
-	tuplestore_putvalues(dstate->tstore, dstate->tupdesc_change,
-						 values, isnull);
-
-	/* Accounting. */
-	dstate->nchanges++;
-
-	/* Cleanup. */
-	pfree(change_raw);
 }
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 8e56922dcea..6f9e7a7aab7 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -19,6 +19,7 @@
 
 #include "access/parallel.h"
 #include "commands/async.h"
+#include "commands/cluster.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/pg_bitutils.h"
@@ -697,6 +698,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_PARALLEL_APPLY_MESSAGE))
 		HandleParallelApplyMessageInterrupt();
 
+	if (CheckProcSignal(PROCSIG_REPACK_MESSAGE))
+		HandleRepackMessageInterrupt();
+
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_DATABASE))
 		HandleRecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_DATABASE);
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 02e9aaa6bca..e08bf56f11e 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -36,6 +36,7 @@
 #include "access/xact.h"
 #include "catalog/pg_type.h"
 #include "commands/async.h"
+#include "commands/cluster.h"
 #include "commands/event_trigger.h"
 #include "commands/explain_state.h"
 #include "commands/prepare.h"
@@ -3520,6 +3521,9 @@ ProcessInterrupts(void)
 
 	if (ParallelApplyMessagePending)
 		ProcessParallelApplyMessages();
+
+	if (RepackMessagePending)
+		ProcessRepackMessages();
 }
 
 /*
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 4aa864fe3c3..b00bd794759 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -154,6 +154,7 @@ RECOVERY_CONFLICT_SNAPSHOT	"Waiting for recovery conflict resolution for a vacuu
 RECOVERY_CONFLICT_TABLESPACE	"Waiting for recovery conflict resolution for dropping a tablespace."
 RECOVERY_END_COMMAND	"Waiting for <xref linkend="guc-recovery-end-command"/> to complete."
 RECOVERY_PAUSE	"Waiting for recovery to be resumed."
+REPACK_WORKER_EXPORT	"Waiting for decoding worker to export a new output file."
 REPLICATION_ORIGIN_DROP	"Waiting for a replication origin to become inactive so it can be dropped."
 REPLICATION_SLOT_DROP	"Waiting for a replication slot to become inactive so it can be dropped."
 RESTORE_COMMAND	"Waiting for <xref linkend="guc-restore-command"/> to complete."
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 14928cd04a1..f1005afde9d 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -22,7 +22,6 @@
 #include "access/xact.h"
 #include "commands/vacuum.h"
 #include "executor/tuptable.h"
-#include "replication/logical.h"
 #include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
@@ -631,7 +630,6 @@ typedef struct TableAmRoutine
 											  bool use_sort,
 											  TransactionId OldestXmin,
 											  Snapshot snapshot,
-											  LogicalDecodingContext *decoding_ctx,
 											  TransactionId *xid_cutoff,
 											  MultiXactId *multi_cutoff,
 											  double *num_tuples,
@@ -1663,8 +1661,6 @@ table_relation_copy_data(Relation rel, const RelFileLocator *newrlocator)
  * - *multi_cutoff - ditto
  * - snapshot - if != NULL, ignore data changes done by transactions that this
  *	 (MVCC) snapshot considers still in-progress or in the future.
- * - decoding_ctx - logical decoding context, to capture concurrent data
- *   changes.
  *
  * Output parameters:
  * - *xid_cutoff - rel's new relfrozenxid value, may be invalid
@@ -1678,7 +1674,6 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
 								bool use_sort,
 								TransactionId OldestXmin,
 								Snapshot snapshot,
-								LogicalDecodingContext *decoding_ctx,
 								TransactionId *xid_cutoff,
 								MultiXactId *multi_cutoff,
 								double *num_tuples,
@@ -1687,7 +1682,7 @@ table_relation_copy_for_cluster(Relation OldTable, Relation NewTable,
 {
 	OldTable->rd_tableam->relation_copy_for_cluster(OldTable, NewTable, OldIndex,
 													use_sort, OldestXmin,
-													snapshot, decoding_ctx,
+													snapshot,
 													xid_cutoff, multi_cutoff,
 													num_tuples, tups_vacuumed,
 													tups_recently_dead);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 6a5c476294a..1b05d5d418b 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -17,11 +17,13 @@
 #include "nodes/parsenodes.h"
 #include "parser/parse_node.h"
 #include "replication/decode.h"
+#include "postmaster/bgworker.h"
 #include "replication/logical.h"
+#include "storage/buffile.h"
 #include "storage/lock.h"
+#include "storage/shm_mq.h"
 #include "utils/relcache.h"
 #include "utils/resowner.h"
-#include "utils/tuplestore.h"
 
 
 /* flag bits for ClusterParams->options */
@@ -44,6 +46,9 @@ typedef struct ClusterParams
  * The following definitions are used by REPACK CONCURRENTLY.
  */
 
+/*
+ * Stored as a single byte in the output file.
+ */
 typedef enum
 {
 	CHANGE_INSERT,
@@ -52,68 +57,30 @@ typedef enum
 	CHANGE_DELETE
 } ConcurrentChangeKind;
 
-typedef struct ConcurrentChange
-{
-	/* See the enum above. */
-	ConcurrentChangeKind kind;
-
-	/*
-	 * The actual tuple.
-	 *
-	 * The tuple data follows the ConcurrentChange structure. Before use make
-	 * sure the tuple is correctly aligned (ConcurrentChange can be stored as
-	 * bytea) and that tuple->t_data is fixed.
-	 */
-	HeapTupleData tup_data;
-} ConcurrentChange;
-
-#define SizeOfConcurrentChange (offsetof(ConcurrentChange, tup_data) + \
-								sizeof(HeapTupleData))
-
 /*
  * Logical decoding state.
  *
- * Here we store the data changes that we decode from WAL while the table
- * contents is being copied to a new storage. Also the necessary metadata
- * needed to apply these changes to the table is stored here.
+ * The output plugin uses it to store the data changes that it decodes from
+ * WAL while the table contents are being copied to a new storage.
  */
 typedef struct RepackDecodingState
 {
 	/* The relation whose changes we're decoding. */
 	Oid			relid;
 
-	/* Replication slot name. */
-	NameData	slotname;
-
-	/*
-	 * Decoded changes are stored here. Although we try to avoid excessive
-	 * batches, it can happen that the changes need to be stored to disk. The
-	 * tuplestore does this transparently.
-	 */
-	Tuplestorestate *tstore;
-
-	/* The current number of changes in tstore. */
-	double		nchanges;
-
-	/*
-	 * Descriptor to store the ConcurrentChange structure serialized (bytea).
-	 * We can't store the tuple directly because tuplestore only supports
-	 * minimum tuple and we may need to transfer OID system column from the
-	 * output plugin. Also we need to transfer the change kind, so it's better
-	 * to put everything in the structure than to use 2 tuplestores "in
-	 * parallel".
-	 */
-	TupleDesc	tupdesc_change;
-
-	/* Tuple descriptor needed to update indexes. */
+	/* Tuple descriptor of the relation being processed. */
 	TupleDesc	tupdesc;
 
-	/* Slot to retrieve data from tstore. */
-	TupleTableSlot *tsslot;
-
-	ResourceOwner resowner;
+	/* The current output file. */
+	BufFile    *file;
 } RepackDecodingState;
 
+extern PGDLLIMPORT volatile sig_atomic_t RepackMessagePending;
+
+extern bool IsRepackWorker(void);
+extern void HandleRepackMessageInterrupt(void);
+extern void ProcessRepackMessages(void);
+
 extern void ExecRepack(ParseState *pstate, RepackStmt *stmt, bool isTopLevel);
 
 extern void cluster_rel(RepackCommand command, Relation OldHeap, Oid indexOid,
@@ -136,6 +103,6 @@ extern void finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap,
 
 extern bool am_decoding_for_repack(void);
 extern bool change_useless_for_repack(XLogRecordBuffer *buf);
-extern void repack_decode_concurrent_changes(LogicalDecodingContext *ctx,
-											 XLogRecPtr end_of_wal);
+
+extern void RepackWorkerMain(Datum main_arg);
 #endif							/* CLUSTER_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index e52b8eb7697..3ef35ca6b80 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -36,6 +36,7 @@ typedef enum
 	PROCSIG_BARRIER,			/* global barrier interrupt  */
 	PROCSIG_LOG_MEMORY_CONTEXT, /* ask backend to log the memory contexts */
 	PROCSIG_PARALLEL_APPLY_MESSAGE, /* Message from parallel apply workers */
+	PROCSIG_REPACK_MESSAGE,		/* Message from repack worker */
 
 	/* Recovery conflict reasons */
 	PROCSIG_RECOVERY_CONFLICT_FIRST,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 2676823d2a7..84c7b680dc5 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -498,7 +498,6 @@ CompressFileHandle
 CompressionLocation
 CompressorState
 ComputeXidHorizonsResult
-ConcurrentChange
 ConcurrentChangeKind
 ConditionVariable
 ConditionVariableMinimallyPadded
@@ -638,6 +637,8 @@ DeclareCursorStmt
 DecodedBkpBlock
 DecodedXLogRecord
 DecodingOutputState
+DecodingWorker
+DecodingWorkerShared
 DefElem
 DefElemAction
 DefaultACLInfo
-- 
2.47.3
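
For readers following the store_change() hunk above: each decoded change is
now appended to the output file as a one-byte change kind, then the tuple
length, then the raw tuple bytes. The following is only a rough standalone
sketch of that record layout, using plain stdio instead of BufFile so that it
compiles outside the PostgreSQL tree. The names DemoChangeKind,
demo_write_change() and demo_read_change() are hypothetical and do not appear
in the patch, and the reader side is an assumption about how such a file could
be consumed, not the patch's actual apply code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for ConcurrentChangeKind; values are illustrative. */
typedef enum
{
	DEMO_INSERT,
	DEMO_UPDATE,
	DEMO_DELETE
} DemoChangeKind;

/*
 * Append one record: kind byte, payload length, payload bytes.
 * Returns 0 on success, -1 on a short write.
 */
static int
demo_write_change(FILE *f, DemoChangeKind kind, const void *data, uint32_t len)
{
	char		kind_byte = (char) kind;

	assert(f != NULL);

	if (fwrite(&kind_byte, 1, 1, f) != 1)
		return -1;
	if (fwrite(&len, sizeof(len), 1, f) != 1)
		return -1;
	if (len > 0 && fwrite(data, 1, len, f) != len)
		return -1;
	return 0;
}

/*
 * Read one record written by demo_write_change().
 * Returns 1 on success, 0 on clean end of file, -1 on a truncated record.
 * On success the caller must free(*data).
 */
static int
demo_read_change(FILE *f, DemoChangeKind *kind, void **data, uint32_t *len)
{
	char		kind_byte;

	assert(f != NULL);

	if (fread(&kind_byte, 1, 1, f) != 1)
		return feof(f) ? 0 : -1;
	if (fread(len, sizeof(*len), 1, f) != 1)
		return -1;

	/* Allocate at least one byte so that a zero-length payload still works. */
	*data = malloc(*len > 0 ? *len : 1);
	if (*data == NULL)
		return -1;
	if (*len > 0 && fread(*data, 1, *len, f) != *len)
	{
		free(*data);
		*data = NULL;
		return -1;
	}

	*kind = (DemoChangeKind) kind_byte;
	return 1;
}
```

Writing the length before the payload lets a reader allocate the buffer up
front and detect a truncated record, which is presumably why the patch stores
t_len ahead of t_data.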


