Re: Patch: dumping tables data in multiple chunks in pg_dump

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Hannu Krosing <[email protected]>
To: Michael Banck <[email protected]>
Cc: David Rowley <[email protected]>
Cc: Ashutosh Bapat <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Cc: Nathan Bossart <[email protected]>
Subject: Re: Patch: dumping tables data in multiple chunks in pg_dump
Date: Sun, 29 Mar 2026 23:49:38 +0200
Message-ID: <CAMT0RQSfEvhZB_3_UZamm+Oi9O7+4+XWi-OOL0NnctyvLZRS0g@mail.gmail.com> (raw)
In-Reply-To: <CAMT0RQTe4Zr=rdcKMJj-=c7CH0PJh=ZPk=xOU98+M7p9-D+Yew@mail.gmail.com>
References: <CAMT0RQT_0qVxcTT6ycM20QUN-pEQ6iMLbz6gLWgLpeF0NmNOUA@mail.gmail.com>
	<CAExHW5t54GPKFbW3KLzintJ6jMMRYwb-t2Fjm4JTxEcZbGDomA@mail.gmail.com>
	<CAMT0RQTHoL8S7OonFWC_aDSC-2oX7BGBBLAQ+OOBhRPcxV2eiw@mail.gmail.com>
	<CAMT0RQQAH1a8kY-mx7B07Uzn3T_zeaU9detqFFtW36_k67Su+A@mail.gmail.com>
	<CAMT0RQQr7KtPAY903+F42csiHc1EPHo70Xji-znkxEhwdoKa6w@mail.gmail.com>
	<CAMT0RQSNHFffbCmDNxQogVBD8H5gTDJNwhUR2btCVE+Lq1sGGw@mail.gmail.com>
	<CAMT0RQTEFGctCfgVx3u2XgVRCAj_QURV2tfdzL0HOQi=u0sV2A@mail.gmail.com>
	<CAApHDvr8ay+31Wd0TptDGp8cAg2-NOnWddx8csnUE3R03EbvZw@mail.gmail.com>
	<[email protected]>
	<CAMT0RQRtLwi_CrOcD7KxYL0Gm1nGXb-HWmerVg=ajEs6JP7m+w@mail.gmail.com>
	<CAMT0RQTe4Zr=rdcKMJj-=c7CH0PJh=ZPk=xOU98+M7p9-D+Yew@mail.gmail.com>

Fixing a off-by-one error in copying over dependencies


On Sat, Mar 28, 2026 at 4:33 PM Hannu Krosing <[email protected]> wrote:
>
> The above
>
> "Or it can be almost 200 GB if the page has just pointers to 1GB TOAST items."
>
> should read
>
> "Or it can be almost 200 GB *for a single page* if the page has just
> pointers to 1GB TOAST items."
>
>
> On Sat, Mar 28, 2026 at 4:32 PM Hannu Krosing <[email protected]> wrote:
> >
> > The issue is that currently the value is given in "main table pages"
> > and it would be somewhat deceptive, or at least confusing, to try to
> > express this in any other unit.
> >
> > As I explained in the commit message:
> >
> > ---------8<-------------------8<-------------------8<----------------
> > This --max-table-segment-pages number specifically applies to main table
> > pages which does not guarantee anything about output size.
> > The output could be empty if there are no live tuples in the page range.
> > Or it can be almost 200 GB if the page has just pointers to 1GB TOAST items.
> > ---------8<-------------------8<-------------------8<----------------
> >
> > And I can think of no cheap and reliable way to change that equation.
> >
> > I'll be very happy if you have any good ideas for either improving the
> > flag name, or even propose a way to better estimate the resulting dump
> > file size so we could give the chunk size in better units
> >
> > ---
> > Hannu
> >
> >
> >
> >
> >
> > On Sat, Mar 28, 2026 at 12:26 PM Michael Banck <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > On Tue, Jan 13, 2026 at 03:27:25PM +1300, David Rowley wrote:
> > > > Perhaps --max-table-segment-pages is a better name than
> > > > --huge-table-chunk-pages as it's quite subjective what the minimum
> > > > number of pages required to make a table "huge".
> > >
> > > I'm not sure that's better - without looking at the documentation,
> > > people might confuse segment here with the 1GB split of tables into
> > > segments. As pg_dump is a very common and basic user tool, I don't think
> > > implementation details like pages/page sizes and blocks should be part
> > > of its UX.
> > >
> > > Can't we just make it a storage size, like '10GB' and then rename it to
> > > --table-parallel-threshold or something? I agree it's bikeshedding, but
> > > I personally don't like either --max-table-segment-pages or
> > > --huge-table-chunk-pages.
> > >
> > >
> > > Michael


Attachments:

  [application/x-patch] v15-0001-Add-max-table-segment-pages-option-to-pg.patch (27.9K, 2-v15-0001-Add-max-table-segment-pages-option-to-pg.patch)
  download | inline diff:
From d9442eb6476ba27e0f3dee085e48de2efbb445d6 Mon Sep 17 00:00:00 2001
From: Hannu Krosing <[email protected]>
Date: Sat, 28 Mar 2026 11:53:39 +0100
Subject: [PATCH v14] SUBJECT: Add --max-table-segment-pages option to pg_dump
 for parallel table dumping.

This patch introduces the ability to split large heap tables into segments
based on a specified number of pages. These segments can then be dumped in
parallel using the existing jobs infrastructure, significantly reducing
the time required to dump very large tables.

This --max-table-segment-pages number specifically applies to main table
pages which does not guarantee anything about output size.
The output could be empty if there are no live tuples in the page range.
Or it can be almost 200 GB if the page has just pointers to 1GB TOAST items.

The implementation uses ctid-based range queries (e.g., WHERE ctid >=
'(startPage,1)' AND ctid < '(endPage+1,0)') to extract specific chunks of
the relation.

This is only effectively supported for PostgreSQL version 14+ though it does
work inefficiently on earlier versions

The patch only supports "heap" access method as others may not even have the
ctid column
---
 doc/src/sgml/ref/pg_dump.sgml             |  24 +++
 src/bin/pg_dump/pg_backup.h               |   2 +
 src/bin/pg_dump/pg_backup_archiver.c      |  84 +++++++++-
 src/bin/pg_dump/pg_backup_archiver.h      |  12 +-
 src/bin/pg_dump/pg_dump.c                 | 177 +++++++++++++++++-----
 src/bin/pg_dump/pg_dump.h                 |  22 ++-
 src/bin/pg_dump/t/004_pg_dump_parallel.pl |  31 ++++
 src/fe_utils/option_utils.c               |  55 +++++++
 src/include/fe_utils/option_utils.h       |   3 +
 9 files changed, 364 insertions(+), 46 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 7f538e90194..5f056bb4af6 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1066,6 +1066,30 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--max-table-segment-pages=<replaceable class="parameter">npages</replaceable></option></term>
+      <listitem>
+       <para>
+        Dump data in segments based on number of pages in the main relation.
+        If the number of data pages in the relation is more than <replaceable class="parameter">npages</replaceable> 
+        the data is split into segments based on that number of pages.
+        Individual segments can be dumped in parallel.
+       </para>
+
+       <note>
+        <para>
+         The option <option>--max-table-segment-pages</option> is applied to only pages
+         in the main heap and if the table has a large TOASTed part this has to be
+         taken into account when deciding on the number of pages to use.
+         In the extreme case a single 8kB heap page can have ~200 toast pointers each 
+         corresponding to 1GB of data. If this data is also non-compressible then a 
+         single-page segment can dump as 200GB file.
+        </para>
+       </note>
+
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--no-comments</option></term>
       <listitem>
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index fda912ba0a9..11863a1915f 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -27,6 +27,7 @@
 #include "common/file_utils.h"
 #include "fe_utils/simple_list.h"
 #include "libpq-fe.h"
+#include "storage/block.h"
 
 
 typedef enum trivalue
@@ -179,6 +180,7 @@ typedef struct _dumpOptions
 	bool		aclsSkip;
 	const char *lockWaitTimeout;
 	int			dump_inserts;	/* 0 = COPY, otherwise rows per INSERT */
+	BlockNumber	max_table_segment_pages; /* chunk when relpages is above this */
 
 	/* flags for various command-line long options */
 	int			disable_dollar_quoting;
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 271a2c3e481..384add0713b 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -44,6 +44,7 @@
 #include "pg_backup_archiver.h"
 #include "pg_backup_db.h"
 #include "pg_backup_utils.h"
+#include "storage/block.h"
 
 #define TEXT_DUMP_HEADER "--\n-- PostgreSQL database dump\n--\n\n"
 #define TEXT_DUMPALL_HEADER "--\n-- PostgreSQL database cluster dump\n--\n\n"
@@ -154,6 +155,7 @@ InitDumpOptions(DumpOptions *opts)
 	opts->dumpSchema = true;
 	opts->dumpData = true;
 	opts->dumpStatistics = false;
+	opts->max_table_segment_pages = InvalidBlockNumber;
 }
 
 /*
@@ -1995,6 +1997,28 @@ _moveBefore(TocEntry *pos, TocEntry *te)
 	pos->prev = te;
 }
 
+/*
+ * Add a dependency id to a DependencyList object
+ * This is currently used for collecting reverse 
+ * dependencies for chunked data dump 
+ *
+ * Note: duplicate dependencies are currently not eliminated
+ */
+void
+addStandaloneDependency(DependencyList *dobj, DumpId refId)
+{
+	pg_log_warning("Adding dep: list %p + dep %u", (void *) dobj->dependencies, refId);
+	if (dobj->nDeps >= dobj->allocDeps)
+	{
+		dobj->allocDeps = (dobj->allocDeps <= 0) ? 16 : dobj->allocDeps * 2;
+		dobj->dependencies = pg_realloc_array(dobj->dependencies,
+											  DumpId, dobj->allocDeps);
+		pg_log_warning("Realloced list %p to size %d", (void *) dobj->dependencies, dobj->allocDeps);
+	}
+	pg_log_warning("Added dep: list %p + dep %u", (void *) dobj->dependencies, refId);
+	dobj->dependencies[dobj->nDeps++] = refId;
+}
+
 /*
  * Build index arrays for the TOC list
  *
@@ -2014,6 +2038,7 @@ buildTocEntryArrays(ArchiveHandle *AH)
 
 	AH->tocsByDumpId = pg_malloc0_array(TocEntry *, (maxDumpId + 1));
 	AH->tableDataId = pg_malloc0_array(DumpId, (maxDumpId + 1));
+	AH->tableDataChunkIds = pg_malloc0_array(DependencyList, (maxDumpId + 1));
 
 	for (te = AH->toc->next; te != AH->toc; te = te->next)
 	{
@@ -2029,8 +2054,12 @@ buildTocEntryArrays(ArchiveHandle *AH)
 		 * TOC entry that has a DATA item.  We compute this by reversing the
 		 * TABLE DATA item's dependency, knowing that a TABLE DATA item has
 		 * just one dependency and it is the TABLE item.
+		 *
+		 * For chunked table data, the TABLE DATA item has a description like
+		 * "TABLE DATA (pages 100:199)", and we collect all such items as
+		 * reverse dependencies for the parent table's entry in tableDataChunkIds.
 		 */
-		if (strcmp(te->desc, "TABLE DATA") == 0 && te->nDeps > 0)
+		if (strncmp(te->desc, "TABLE DATA", 10) == 0 && te->nDeps > 0)
 		{
 			DumpId		tableId = te->dependencies[0];
 
@@ -2042,7 +2071,14 @@ buildTocEntryArrays(ArchiveHandle *AH)
 			if (tableId <= 0 || tableId > maxDumpId)
 				pg_fatal("bad table dumpId for TABLE DATA item");
 
-			AH->tableDataId[tableId] = te->dumpId;
+			if (te->desc[10] == '\0') /* te->desc == "TABLE DATA" */
+				AH->tableDataId[tableId] = te->dumpId;
+			else
+			{
+				/* Chunked table data, the description is "TABLE DATA (pages %u:%u)" */
+				addStandaloneDependency(&(AH->tableDataChunkIds[tableId]), te->dumpId);
+				pg_log_debug("Added chunked table data dependency: tableId %u + chunkId %u",
+							 tableId, te->dumpId);}
 		}
 	}
 }
@@ -5017,6 +5053,12 @@ fix_dependencies(ArchiveHandle *AH)
  * that parallel restore will prioritize larger jobs (index builds, FK
  * constraint checks, etc) over smaller ones, avoiding situations where we
  * end a restore with only one active job working on a large table.
+ *
+ * In case of chunked dumps, we change the depenency on table with depedency
+ * on the first chunk of data and add the remaingi chunk ids, if any, to the 
+ * end of depencency list
+ * we also calculate the fullDataLength as the sum of the lengths of chunk
+ * data items and use that to set the item's dataLength.
  */
 static void
 repoint_table_dependencies(ArchiveHandle *AH)
@@ -5032,8 +5074,9 @@ repoint_table_dependencies(ArchiveHandle *AH)
 		for (i = 0; i < te->nDeps; i++)
 		{
 			olddep = te->dependencies[i];
-			if (olddep <= AH->maxDumpId &&
-				AH->tableDataId[olddep] != 0)
+			if (olddep > AH->maxDumpId)
+				continue;
+			if (AH->tableDataId[olddep] != 0)
 			{
 				DumpId		tabledataid = AH->tableDataId[olddep];
 				TocEntry   *tabledatate = AH->tocsByDumpId[tabledataid];
@@ -5043,6 +5086,39 @@ repoint_table_dependencies(ArchiveHandle *AH)
 				pg_log_debug("transferring dependency %d -> %d to %d",
 							 te->dumpId, olddep, tabledataid);
 			}
+			else if (AH->tableDataChunkIds[olddep].nDeps > 0)
+			{
+				int			j;
+				DumpId		chunkdataid;
+				uint64		fullDataLength;
+				DependencyList *deplist = &AH->tableDataChunkIds[olddep];
+
+				/* first in list replaces the dependency on table */
+				chunkdataid = deplist->dependencies[0];
+				te->dependencies[i] = chunkdataid;
+				fullDataLength = AH->tocsByDumpId[chunkdataid]->dataLength;
+				pg_log_debug("transferring chunk list %d -> %d to %d",
+							 te->dumpId, olddep, chunkdataid);
+
+				if (deplist->nDeps > 1)
+				{
+					/* make space */
+					te->dependencies = pg_realloc_array(te->dependencies,
+												  DumpId,
+												  te->nDeps + deplist->nDeps - 1);
+
+					/* the rest are appended to dependencies */
+					for (j = 1; j < deplist->nDeps; j++)
+					{
+						chunkdataid = deplist->dependencies[j];
+						te->dependencies[te->nDeps++] = chunkdataid;
+						fullDataLength += AH->tocsByDumpId[chunkdataid]->dataLength;
+						pg_log_debug("adding chunk list %d -> %d to %d",
+									te->dumpId, olddep, chunkdataid);
+					}
+				}
+				te->dataLength = Max(te->dataLength, fullDataLength);
+			}
 		}
 	}
 }
diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h
index 365073b3eae..cfa3ea1bbd6 100644
--- a/src/bin/pg_dump/pg_backup_archiver.h
+++ b/src/bin/pg_dump/pg_backup_archiver.h
@@ -179,6 +179,13 @@ typedef enum
 	OUTPUT_OTHERDATA,			/* writing data as INSERT commands */
 } ArchiverOutput;
 
+typedef struct _DependencyList
+{
+	DumpId	   *dependencies;	/* dumpIds of objects this one depends on */
+	int			nDeps;			/* number of valid dependencies */
+	int			allocDeps;		/* allocated size of dependencies[] */
+} DependencyList;
+
 /*
  * For historical reasons, ACL items are interspersed with everything else in
  * a dump file's TOC; typically they're right after the object they're for.
@@ -311,6 +318,7 @@ struct _archiveHandle
 	/* arrays created after the TOC list is complete: */
 	struct _tocEntry **tocsByDumpId;	/* TOCs indexed by dumpId */
 	DumpId	   *tableDataId;	/* TABLE DATA ids, indexed by table dumpId */
+	DependencyList *tableDataChunkIds; /* dependencies indexed by dumpId */
 
 	struct _tocEntry *currToc;	/* Used when dumping data */
 	pg_compress_specification compression_spec; /* Requested specification for
@@ -377,7 +385,7 @@ struct _tocEntry
 	size_t		defnLen;		/* length of dumped definition */
 
 	/* working state while dumping/restoring */
-	pgoff_t		dataLength;		/* item's data size; 0 if none or unknown */
+	uint64		dataLength;		/* item's data size; 0 if none or unknown */
 	int			reqs;			/* do we need schema and/or data of object
 								 * (REQ_* bit mask) */
 	bool		created;		/* set for DATA member if TABLE was created */
@@ -437,6 +445,8 @@ extern int	TocIDRequired(ArchiveHandle *AH, DumpId id);
 TocEntry   *getTocEntryByDumpId(ArchiveHandle *AH, DumpId id);
 extern bool checkSeek(FILE *fp);
 
+extern void addStandaloneDependency(DependencyList *dobj, DumpId refId);
+
 #define appendStringLiteralAHX(buf,str,AH) \
 	appendStringLiteral(buf, str, (AH)->public.encoding, (AH)->public.std_strings)
 
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 5d1f7682f11..1e7d9a3f7f3 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -535,6 +535,7 @@ main(int argc, char **argv)
 		{"exclude-extension", required_argument, NULL, 17},
 		{"sequence-data", no_argument, &dopt.sequence_data, 1},
 		{"restrict-key", required_argument, NULL, 25},
+		{"max-table-segment-pages", required_argument, NULL, 26},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -799,6 +800,12 @@ main(int argc, char **argv)
 				dopt.restrict_key = pg_strdup(optarg);
 				break;
 
+			case 26:
+				if (!option_parse_uint32(optarg, "--max-table-segment-pages", 1, MaxBlockNumber,
+									  &dopt.max_table_segment_pages))
+					exit_nicely(1);
+				break;
+
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -1344,6 +1351,9 @@ help(const char *progname)
 	printf(_("  --extra-float-digits=NUM     override default setting for extra_float_digits\n"));
 	printf(_("  --filter=FILENAME            include or exclude objects and data from dump\n"
 			 "                               based on expressions in FILENAME\n"));
+	printf(_("  --max-table-segment-pages=NUMPAGES\n"
+		     "                               number of main table pages above which data is \n"
+			 "                               copied out in chunks, also determines the chunk size\n"));
 	printf(_("  --if-exists                  use IF EXISTS when dropping objects\n"));
 	printf(_("  --include-foreign-data=PATTERN\n"
 			 "                               include data of foreign tables on foreign\n"
@@ -2396,7 +2406,7 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 	 * dumping an old pg_largeobject_metadata defined WITH OIDS.  For other
 	 * cases a simple COPY suffices.
 	 */
-	if (tdinfo->filtercond || tbinfo->relkind == RELKIND_FOREIGN_TABLE ||
+	if (tdinfo->filtercond || is_segment(tdinfo) || tbinfo->relkind == RELKIND_FOREIGN_TABLE ||
 		(fout->dopt->binary_upgrade && fout->remoteVersion < 120000 &&
 		 tbinfo->dobj.catId.oid == LargeObjectMetadataRelationId))
 	{
@@ -2414,9 +2424,37 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 		else
 			appendPQExpBufferStr(q, "* ");
 
-		appendPQExpBuffer(q, "FROM %s %s) TO stdout;",
+		appendPQExpBuffer(q, "FROM %s %s",
 						  fmtQualifiedDumpable(tbinfo),
 						  tdinfo->filtercond ? tdinfo->filtercond : "");
+		/* If it's a segment, we need to add a filter condition to select the
+		 * right page range 
+		 * - for first segment we add "ctid < (endPage+1, 0)" 
+		 *   first segment is the one with startPage == 0
+		 * - for last segment we add "ctid >= (startPage, 1)"
+		 *   last segment is the one with endPage == InvalidBlockNumber
+		 *   we leave to upper bound open for the case where more pages 
+		 *   were added after we measured 
+		 * - for middle segments we add 
+		 *   "ctid >= (startPage, 1) AND ctid < (endPage+1, 0)"
+		 *
+		 * "ctid < (endPage+1, 0)" instead of "ctid <= (endPage, maxtuple)"
+		 * was chosen as range end so that we do not have to estimate the maxtuple
+		 * 
+		 */
+		if (is_segment(tdinfo))
+		{
+			appendPQExpBufferStr(q, tdinfo->filtercond?" AND ":" WHERE ");
+			if(tdinfo->startPage == 0)
+				appendPQExpBuffer(q, "ctid < '(%u,0)'", tdinfo->endPage+1);			
+			else if(tdinfo->endPage != InvalidBlockNumber)
+				appendPQExpBuffer(q, "ctid >= '(%u,1)' AND ctid < '(%u,0)'",
+								 tdinfo->startPage, tdinfo->endPage+1);
+			else
+				appendPQExpBuffer(q, "ctid >= '(%u,1)'", tdinfo->startPage);
+		}
+
+		appendPQExpBuffer(q, ") TO stdout;");
 	}
 	else
 	{
@@ -2424,6 +2462,10 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 						  fmtQualifiedDumpable(tbinfo),
 						  column_list);
 	}
+
+	if (is_segment(tdinfo))
+		pg_log_debug("CHUNKING: data query: %s", q->data);
+	
 	res = ExecuteSqlQuery(fout, q->data, PGRES_COPY_OUT);
 	PQclear(res);
 	destroyPQExpBuffer(clistBuf);
@@ -2919,42 +2961,89 @@ dumpTableData(Archive *fout, const TableDataInfo *tdinfo)
 	{
 		TocEntry   *te;
 
-		te = ArchiveEntry(fout, tdinfo->dobj.catId, tdinfo->dobj.dumpId,
-						  ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
-									   .namespace = tbinfo->dobj.namespace->dobj.name,
-									   .owner = tbinfo->rolname,
-									   .description = "TABLE DATA",
-									   .section = SECTION_DATA,
-									   .createStmt = tdDefn,
-									   .copyStmt = copyStmt,
-									   .deps = &(tbinfo->dobj.dumpId),
-									   .nDeps = 1,
-									   .dumpFn = dumpFn,
-									   .dumpArg = tdinfo));
-
-		/*
-		 * Set the TocEntry's dataLength in case we are doing a parallel dump
-		 * and want to order dump jobs by table size.  We choose to measure
-		 * dataLength in table pages (including TOAST pages) during dump, so
-		 * no scaling is needed.
-		 *
-		 * However, relpages is declared as "integer" in pg_class, and hence
-		 * also in TableInfo, but it's really BlockNumber a/k/a unsigned int.
-		 * Cast so that we get the right interpretation of table sizes
-		 * exceeding INT_MAX pages.
+		/* data chunking works off relpages, which are computed exactly using
+		 * pg_relation_size() when --max-table-segment-pages was set
+		 * 
+		 * We also don't chunk if table access method is not "heap"
+		 * TODO: we may add chunking for other access methods later, maybe 
+		 * based on primary key tranges
 		 */
-		te->dataLength = (BlockNumber) tbinfo->relpages;
-		te->dataLength += (BlockNumber) tbinfo->toastpages;
+		if (tbinfo->relpages <= dopt->max_table_segment_pages || 
+			strcmp(tbinfo->amname, "heap") != 0)
+		{
+			te = ArchiveEntry(fout, tdinfo->dobj.catId, tdinfo->dobj.dumpId,
+							ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
+										.namespace = tbinfo->dobj.namespace->dobj.name,
+										.owner = tbinfo->rolname,
+										.description = "TABLE DATA",
+										.section = SECTION_DATA,
+										.createStmt = tdDefn,
+										.copyStmt = copyStmt,
+										.deps = &(tbinfo->dobj.dumpId),
+										.nDeps = 1,
+										.dumpFn = dumpFn,
+										.dumpArg = tdinfo));
 
-		/*
-		 * If pgoff_t is only 32 bits wide, the above refinement is useless,
-		 * and instead we'd better worry about integer overflow.  Clamp to
-		 * INT_MAX if the correct result exceeds that.
-		 */
-		if (sizeof(te->dataLength) == 4 &&
-			(tbinfo->relpages < 0 || tbinfo->toastpages < 0 ||
-			 te->dataLength < 0))
-			te->dataLength = INT_MAX;
+			/*
+			 * Set the TocEntry's dataLength in case we are doing a parallel dump
+			 * and want to order dump jobs by table size.  We choose to measure
+			 * dataLength in table pages (including TOAST pages) during dump, so
+			 * no scaling is needed.
+			 *
+			 * While pg_class.relpages which stores BlockNumber, a/k/a unsigned int,
+			 * is declared as "integer" we convert it back and store it as 
+			 * BlockNumber in TableInfo.
+			 * And dataLenght is pgoff_t (long int) so does now overflow for
+			 * 2 x UINT32_MAX 
+			 */
+			te->dataLength = tbinfo->relpages;
+			te->dataLength += tbinfo->toastpages;
+		}
+		else
+		{
+			uint64 current_chunk_start = 0;
+			PQExpBuffer chunk_desc = createPQExpBuffer();
+
+			while (current_chunk_start < tbinfo->relpages)
+			{
+				TableDataInfo *chunk_tdinfo = (TableDataInfo *) pg_malloc(sizeof(TableDataInfo));
+
+				memcpy(chunk_tdinfo, tdinfo, sizeof(TableDataInfo));
+				AssignDumpId(&chunk_tdinfo->dobj);
+				addObjectDependency(&chunk_tdinfo->dobj, tbinfo->dobj.dumpId);
+				chunk_tdinfo->startPage = (BlockNumber) current_chunk_start;
+				chunk_tdinfo->endPage = chunk_tdinfo->startPage + dopt->max_table_segment_pages - 1;
+				
+				current_chunk_start += dopt->max_table_segment_pages;
+				if (current_chunk_start >= tbinfo->relpages)
+					chunk_tdinfo->endPage = InvalidBlockNumber; /* last chunk is for "all the rest" */
+
+				printfPQExpBuffer(chunk_desc, "TABLE DATA (pages %u:%u)", chunk_tdinfo->startPage, chunk_tdinfo->endPage);
+
+				te = ArchiveEntry(fout, chunk_tdinfo->dobj.catId, chunk_tdinfo->dobj.dumpId,
+							ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
+										.namespace = tbinfo->dobj.namespace->dobj.name,
+										.owner = tbinfo->rolname,
+										.description = chunk_desc->data,
+										.section = SECTION_DATA,
+										.createStmt = tdDefn,
+										.copyStmt = copyStmt,
+										.deps = &(tbinfo->dobj.dumpId),
+										.nDeps = 1,
+										.dumpFn = dumpFn,
+										.dumpArg = chunk_tdinfo));
+
+				if(chunk_tdinfo->endPage == InvalidBlockNumber)
+					te->dataLength = tbinfo->relpages - chunk_tdinfo->startPage;
+				else
+					te->dataLength = dopt->max_table_segment_pages;
+				/* let's assume toast pages distribute evenly among chunks */
+				if(tbinfo->relpages)
+					te->dataLength += te->dataLength * tbinfo->toastpages / tbinfo->relpages;
+			}
+
+			destroyPQExpBuffer(chunk_desc);
+		}
 	}
 
 	destroyPQExpBuffer(copyBuf);
@@ -3081,6 +3170,8 @@ makeTableDataInfo(DumpOptions *dopt, TableInfo *tbinfo)
 	tdinfo->dobj.namespace = tbinfo->dobj.namespace;
 	tdinfo->tdtable = tbinfo;
 	tdinfo->filtercond = NULL;	/* might get set later */
+	tdinfo->startPage = InvalidBlockNumber; /* we use this as indication that no chunking is needed */
+	tdinfo->endPage = InvalidBlockNumber;
 	addObjectDependency(&tdinfo->dobj, tbinfo->dobj.dumpId);
 
 	/* A TableDataInfo contains data, of course */
@@ -7347,8 +7438,16 @@ getTables(Archive *fout, int *numTables)
 						 "c.relnamespace, c.relkind, c.reltype, "
 						 "c.relowner, "
 						 "c.relchecks, "
-						 "c.relhasindex, c.relhasrules, c.relpages, "
-						 "c.reltuples, c.relallvisible, ");
+						 "c.relhasindex, c.relhasrules, ");
+
+	/* fetch current relation size if chunking is requested */
+	if(dopt->max_table_segment_pages != InvalidBlockNumber)
+		appendPQExpBufferStr(query, "pg_relation_size(c.oid)/current_setting('block_size')::int AS relpages, ");
+	else
+		/* pg_class.relpages stores BlockNumber (uint32) in an int field, convert to oid to get unsigned int out */
+		appendPQExpBufferStr(query, "c.relpages::oid, ");
+
+	appendPQExpBufferStr(query, "c.reltuples, c.relallvisible, ");
 
 	if (fout->remoteVersion >= 180000)
 		appendPQExpBufferStr(query, "c.relallfrozen, ");
@@ -7589,7 +7688,7 @@ getTables(Archive *fout, int *numTables)
 		tblinfo[i].ncheck = atoi(PQgetvalue(res, i, i_relchecks));
 		tblinfo[i].hasindex = (strcmp(PQgetvalue(res, i, i_relhasindex), "t") == 0);
 		tblinfo[i].hasrules = (strcmp(PQgetvalue(res, i, i_relhasrules), "t") == 0);
-		tblinfo[i].relpages = atoi(PQgetvalue(res, i, i_relpages));
+		tblinfo[i].relpages = strtoul(PQgetvalue(res, i, i_relpages), NULL, 10);
 		if (PQgetisnull(res, i, i_toastpages))
 			tblinfo[i].toastpages = 0;
 		else
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 5a6726d8b12..84e682d585f 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -16,6 +16,7 @@
 
 #include "pg_backup.h"
 #include "catalog/pg_publication_d.h"
+#include "storage/block.h"
 
 
 #define oidcmp(x,y) ( ((x) < (y) ? -1 : ((x) > (y)) ?  1 : 0) )
@@ -335,7 +336,11 @@ typedef struct _tableInfo
 	Oid			owning_tab;		/* OID of table owning sequence */
 	int			owning_col;		/* attr # of column owning sequence */
 	bool		is_identity_sequence;
-	int32		relpages;		/* table's size in pages (from pg_class) */
+	BlockNumber	relpages;		/* table's size in pages (from pg_class) 
+	                             * converted to unsigned integer
+								 * when --max-table-segment-pages is set
+								 * the computed from pg_relation_size()
+	                             */
 	int			toastpages;		/* toast table's size in pages, if any */
 
 	bool		interesting;	/* true if need to collect more data */
@@ -413,8 +418,21 @@ typedef struct _tableDataInfo
 	DumpableObject dobj;
 	TableInfo  *tdtable;		/* link to table to dump */
 	char	   *filtercond;		/* WHERE condition to limit rows dumped */
+	/* startPage and endPage to support segmented dump */
+	BlockNumber	startPage;		/* As we always know the lowest segment page
+								 * number we can use InvalidBlockNumber here
+								 * to recognize no segmenting case.
+								 * When 0 for the first page of first
+								 * segment we can omit in range query */
+	BlockNumber	endPage;		/* last page in segment for page-range dump,
+	                    		 * startPage+max_table_segment_pages-1 for 
+								 * most segments, but InvalidBlockNumber for
+								 * the last one to indicate open range
+								 */
 } TableDataInfo;
 
+#define is_segment(tdiptr) ((tdiptr)->startPage != InvalidBlockNumber)
+
 typedef struct _indxInfo
 {
 	DumpableObject dobj;
@@ -449,7 +467,7 @@ typedef struct _relStatsInfo
 {
 	DumpableObject dobj;
 	Oid			relid;
-	int32		relpages;
+	BlockNumber	relpages;
 	char	   *reltuples;
 	int32		relallvisible;
 	int32		relallfrozen;
diff --git a/src/bin/pg_dump/t/004_pg_dump_parallel.pl b/src/bin/pg_dump/t/004_pg_dump_parallel.pl
index 738f34b1c1b..4f35aeed9b9 100644
--- a/src/bin/pg_dump/t/004_pg_dump_parallel.pl
+++ b/src/bin/pg_dump/t/004_pg_dump_parallel.pl
@@ -11,6 +11,7 @@ use Test::More;
 my $dbname1 = 'regression_src';
 my $dbname2 = 'regression_dest1';
 my $dbname3 = 'regression_dest2';
+my $dbname4 = 'regression_dest3';
 
 my $node = PostgreSQL::Test::Cluster->new('main');
 $node->init;
@@ -21,6 +22,7 @@ my $backupdir = $node->backup_dir;
 $node->run_log([ 'createdb', $dbname1 ]);
 $node->run_log([ 'createdb', $dbname2 ]);
 $node->run_log([ 'createdb', $dbname3 ]);
+$node->run_log([ 'createdb', $dbname4 ]);
 
 $node->safe_psql(
 	$dbname1,
@@ -87,4 +89,33 @@ $node->command_ok(
 	],
 	'parallel restore as inserts');
 
+$node->command_ok(
+	[
+		'pg_dump',
+		'--format' => 'directory',
+		'--max-table-segment-pages' => 2,
+		'--no-sync',
+		'--jobs' => 2,
+		'--file' => "$backupdir/dump3",
+		$node->connstr($dbname1),
+	],
+	'parallel dump with chunks of two heap pages');
+
+$node->command_ok(
+	[
+		'pg_restore', '--verbose',
+		'--dbname' => $node->connstr($dbname4),
+		'--jobs' => 3,
+		"$backupdir/dump3",
+	],
+	'parallel restore with chunks of two heap pages');
+
+my $table = 'tplain';
+my $tablehash_query = "SELECT '$table', sum(hashtext(t::text)), count(*) FROM $table AS t";
+
+my $result_1 = $node->safe_psql($dbname1, $tablehash_query);
+my $result_4 = $node->safe_psql($dbname4, $tablehash_query);
+
+is($result_4, $result_1, "Hash check for $table: restored db ($result_4) vs original db ($result_1)");
+
 done_testing();
diff --git a/src/fe_utils/option_utils.c b/src/fe_utils/option_utils.c
index 8d0659c1164..a516d8c86a9 100644
--- a/src/fe_utils/option_utils.c
+++ b/src/fe_utils/option_utils.c
@@ -83,6 +83,61 @@ option_parse_int(const char *optarg, const char *optname,
 	return true;
 }
 
+/*
+ * option_parse_uint32
+ *
+ * Parse unsigned integer value for an option.  If the parsing is successful,
+ * returns true and stores the result in *result if that's given;
+ * if parsing fails, returns false.
+ */
+bool
+option_parse_uint32(const char *optarg, const char *optname,
+				 uint32 min_range, uint32 max_range,
+				 uint32 *result)
+{
+	char	   		*endptr;
+	unsigned long	val;
+
+	/* Fail if there is a minus sign at the start of value */
+	while(isspace((unsigned char) *optarg))
+		optarg++;
+	if(*optarg == '-')
+	{
+		pg_log_error("value \"%s\" for option %s can not be negative",
+					optarg, optname);
+		return false;
+	}
+
+	errno = 0;
+	val = strtoul(optarg, &endptr, 10);
+
+	/*
+	 * Skip any trailing whitespace; if anything but whitespace remains before
+	 * the terminating character, fail.
+	 */
+	while (*endptr != '\0' && isspace((unsigned char) *endptr))
+		endptr++;
+
+	if (*endptr != '\0')
+	{
+		pg_log_error("invalid value \"%s\" for option %s",
+					 optarg, optname);
+		return false;
+	}
+
+	/* as min_range and max_range are uint32 then the range check will
+	 * catch the case where unsigned long val is outside 32 bit range */
+	if (errno == ERANGE || val < min_range || val > max_range)
+	{
+		pg_log_error("%s not in range %u..%u", optname, min_range, max_range);
+		return false;
+	}
+
+	if (result)
+		*result = (uint32) val;
+	return true;
+}
+
 /*
  * Provide strictly harmonized handling of the --sync-method option.
  */
diff --git a/src/include/fe_utils/option_utils.h b/src/include/fe_utils/option_utils.h
index d975db77af2..67fd3650d7a 100644
--- a/src/include/fe_utils/option_utils.h
+++ b/src/include/fe_utils/option_utils.h
@@ -22,6 +22,9 @@ extern void handle_help_version_opts(int argc, char *argv[],
 extern bool option_parse_int(const char *optarg, const char *optname,
 							 int min_range, int max_range,
 							 int *result);
+extern bool option_parse_uint32(const char *optarg, const char *optname,
+							 uint32 min_range, uint32 max_range,
+							 uint32 *result);
 extern bool parse_sync_method(const char *optarg,
 							  DataDirSyncMethod *sync_method);
 extern void check_mut_excl_opts_internal(int n,...);
-- 
2.43.0

view thread (34+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
  Subject: Re: Patch: dumping tables data in multiple chunks in pg_dump
  In-Reply-To: <CAMT0RQSfEvhZB_3_UZamm+Oi9O7+4+XWi-OOL0NnctyvLZRS0g@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox