public inbox for [email protected]  
Re: Patch: dumping tables data in multiple chunks in pg_dump
24+ messages / 4 participants

* Re: Patch: dumping tables data in multiple chunks in pg_dump
@ 2026-01-13 02:27 David Rowley <[email protected]>
  2026-01-14 10:52 ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-19 19:01 ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-03-28 15:32 ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  0 siblings, 3 replies; 24+ messages in thread

From: David Rowley @ 2026-01-13 02:27 UTC (permalink / raw)
  To: Hannu Krosing <[email protected]>; +Cc: Ashutosh Bapat <[email protected]>; PostgreSQL Hackers <[email protected]>; Nathan Bossart <[email protected]>

On Fri, 14 Nov 2025 at 09:34, Hannu Krosing <[email protected]> wrote:
> Added to https://commitfest.postgresql.org/patch/6219/

I think this could be useful, but I think you'll need to find a way to
not do this for non-heap tables. Per the comments in TableAmRoutine,
both scan_set_tidrange and scan_getnextslot_tidrange are optional
callback functions and the planner won't produce TIDRangePaths if
either of those doesn't exist. Maybe that means you need to consult
pg_class.relam to ensure the amname is 'heap' or at least the relam =
2. On testing Citus's columnar AM, I get:

postgres=# select * from t where ctid between '(0,1)' and '(10,0)';
ERROR:  UPDATE and CTID scans not supported for ColumnarScan

1. For the patch, I think you should tighten the new option up to mean
the maximum segment size that a table will be dumped in. I see you
have comments like:

/* TODO: add hysteresis here, maybe < 1.1 * huge_table_chunk_pages */

You *have* to put the cutoff *somewhere*, so I think it very much
should be exactly the specified threshold. If anyone is unhappy that
some segments consist of a single page, then that's on them to adjust
the parameter accordingly. Otherwise, someone complains that they got
a 1-page segment when the table was 10.0001% bigger than the cutoff
and then we're tempted to add a new setting to control the 1.1 factor,
which is just silly. If there's a 1-page segment, so what? It's not a
big deal.
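
To make the exact-cutoff behaviour concrete, here is a minimal sketch of
the arithmetic. It is not taken from the patch; the function names are
made up for illustration, and BlockNumber is redefined locally to mirror
PostgreSQL's 32-bit unsigned type.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;   /* local stand-in for PostgreSQL's typedef */

/* Number of segments when the cutoff is exactly max_pages (ceiling division). */
static uint64_t
segment_count(BlockNumber relpages, BlockNumber max_pages)
{
    assert(max_pages > 0);
    return ((uint64_t) relpages + max_pages - 1) / max_pages;
}

/* Pages in the final segment; 0 for an empty table. */
static BlockNumber
last_segment_pages(BlockNumber relpages, BlockNumber max_pages)
{
    BlockNumber rem;

    assert(max_pages > 0);
    if (relpages == 0)
        return 0;
    rem = relpages % max_pages;
    return (rem == 0) ? max_pages : rem;
}
```

With a limit of 1000 pages, a 1001-page table yields two segments and
the second one holds a single page, which is the harmless case
described above.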

Perhaps --max-table-segment-pages is a better name than
--huge-table-chunk-pages, as it's quite subjective what the minimum
number of pages required to make a table "huge" is.

2. I'm not sure if you're going to get away with using relpages for
this. Is it really that bad to query pg_relation_size() when this
option is set? If it really is a problem, then maybe let the user
choose with another option. I understand we're using relpages for
sorting table sizes so we prefer dumping larger tables first, but that
just seems way less important if it's not perfectly accurate.

3. You should be able to simplify the code in dumpTableData() so
you're not adding any extra cases. You could use InvalidBlockNumber to
indicate an unbounded ctid range and only add ctid qual to the WHERE
clause when you have a bounded range (i.e. not InvalidBlockNumber).
That way the first segment will need WHERE ctid <= '...' and the final
one will need WHERE ctid >= '...'.  Everything in between will have an
upper and lower bound. That results in no ctid quals being added when
both ranges are set to InvalidBlockNumber, which you should use for
all tables not large enough to be segmented, thus no special case.

TID Range scans are perfectly capable of working when only bounded at one side.
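
A rough sketch of how point 3 could look, assuming a hypothetical helper
build_ctid_qual() that is not part of the patch. It treats
InvalidBlockNumber as "unbounded" on either side. As an illustrative
choice, it uses an exclusive upper bound (ctid < '(N,0)') rather than
the <= form mentioned above, so that no maximum offset number has to be
guessed.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint32_t BlockNumber;                           /* local stand-in */
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)   /* same value as in PostgreSQL */

/*
 * Write the ctid qual for pages [start_page, end_page) into buf.
 * InvalidBlockNumber on a side means that side is unbounded; if both
 * sides are unbounded, buf is set to the empty string.
 * Returns the snprintf-style length of the qual.
 */
static int
build_ctid_qual(char *buf, size_t buflen,
                BlockNumber start_page, BlockNumber end_page)
{
    assert(buf != NULL && buflen > 0);

    if (start_page == InvalidBlockNumber && end_page == InvalidBlockNumber)
    {
        buf[0] = '\0';          /* whole table: no qual at all */
        return 0;
    }
    if (start_page == InvalidBlockNumber)   /* first segment */
        return snprintf(buf, buflen, "ctid < '(%" PRIu32 ",0)'", end_page);
    if (end_page == InvalidBlockNumber)     /* final segment */
        return snprintf(buf, buflen, "ctid >= '(%" PRIu32 ",0)'", start_page);

    /* a middle segment is bounded on both sides */
    return snprintf(buf, buflen,
                    "ctid >= '(%" PRIu32 ",0)' AND ctid < '(%" PRIu32 ",0)'",
                    start_page, end_page);
}
```

The caller would prepend " WHERE " or " AND " only when the returned
length is non-zero, so tables that are not segmented need no special
case.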

4. I think using "int" here is a future complaint waiting to happen.

+ if (!option_parse_int(optarg, "--huge-table-chunk-pages", 1, INT32_MAX,
+   &dopt.huge_table_chunk_pages))

I bet we'll eventually see a complaint that someone can't make the
segment size larger than 16TB. I think option_parse_uint32() might be
called for.
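
The 16TB figure follows from INT32_MAX pages at the default 8kB block
size. A small sketch of that arithmetic, with a made-up function name
and the block size passed in as an assumption rather than read from a
server:

```c
#include <assert.h>
#include <stdint.h>

/* Largest segment, in bytes, for a given page limit and block size. */
static uint64_t
max_segment_bytes(uint64_t max_pages, uint32_t block_size)
{
    assert(block_size > 0);
    return max_pages * block_size;   /* 64-bit, so 2^32 pages * 32kB still fits */
}
```

With 8192-byte pages, INT32_MAX pages comes to just under 2^44 bytes
(16TB), while UINT32_MAX pages comes to just under 2^45 bytes (32TB).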

David

* Re: Patch: dumping tables data in multiple chunks in pg_dump
  2026-01-13 02:27 Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
@ 2026-01-14 10:52 ` Hannu Krosing <[email protected]>
  2026-01-14 21:10   ` Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
  2 siblings, 1 reply; 24+ messages in thread

From: Hannu Krosing @ 2026-01-14 10:52 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Ashutosh Bapat <[email protected]>; PostgreSQL Hackers <[email protected]>; Nathan Bossart <[email protected]>

On Tue, Jan 13, 2026 at 3:27 AM David Rowley <[email protected]> wrote:
>
> On Fri, 14 Nov 2025 at 09:34, Hannu Krosing <[email protected]> wrote:
> > Added to https://commitfest.postgresql.org/patch/6219/
>
> I think this could be useful, but I think you'll need to find a way to
> not do this for non-heap tables. Per the comments in TableAmRoutine,
> both scan_set_tidrange and scan_getnextslot_tidrange are optional
> callback functions and the planner won't produce TIDRangePaths if
> either of those doesn't exist. Maybe that means you need to consult
> pg_class.relam to ensure the amname is 'heap' or at least the relam =
> 2.

Makes sense, will add.

> On testing Citus's columnar AM, I get:
> postgres=# select * from t where ctid between '(0,1)' and '(10,0)';
> ERROR:  UPDATE and CTID scans not supported for ColumnarScan

Should we just silently not chunk tables that have some storage
architecture that does not have tids, or should pg_dump just error out
in this case?

I imagine the Citus columnar is often used with huge tables where
chunking would be most useful.

Later it likely makes sense to have another option for chunking other
types of tables, or maybe even add something to the TableAM for
chunking support.

> 1. For the patch, I think you should tighten the new option up to mean
> the maximum segment size that a table will be dumped in. I see you
> have comments like:
>
> /* TODO: add hysteresis here, maybe < 1.1 * huge_table_chunk_pages */
>
> You *have* to put the cutoff *somewhere*, so I think it very much
> should be exactly the specified threshold. If anyone is unhappy that
> some segments consist of a single page, then that's on them to adjust
> the parameter accordingly. Otherwise, someone complains that they got
> a 1-page segment when the table was 10.0001% bigger than the cutoff
> and then we're tempted to add a new setting to control the 1.1 factor,
> which is just silly. If there's a 1-page segment, so what? It's not a
> big deal.

Agreed, will drop the TODO

> Perhaps --max-table-segment-pages is a better name than
> --huge-table-chunk-pages, as it's quite subjective what the minimum
> number of pages required to make a table "huge" is.

I agree. My initial thinking was that it is mainly useful for huge
tables, but indeed that does not need to be reflected in the flag name

> 2. I'm not sure if you're going to get away with using relpages for
> this. Is it really that bad to query pg_relation_size() when this
> option is set? If it really is a problem, then maybe let the user
> choose with another option. I understand we're using relpages for
> sorting table sizes so we prefer dumping larger tables first, but that
> just seems way less important if it's not perfectly accurate.

Yeah, I had thought of pg_relation_size() myself.

Another option would be something more complex that tries to estimate
the dump file sizes by accounting for TOAST in each chunk. The thing
that makes this really complex is the possibly uneven distribution of
TOAST data and the need to take into account both the compression of
TOAST AND the compression of the resulting dump file.

> 3. You should be able to simplify the code in dumpTableData() so
> you're not adding any extra cases. You could use InvalidBlockNumber to
> indicate an unbounded ctid range and only add ctid qual to the WHERE
> clause when you have a bounded range (i.e. not InvalidBlockNumber).
> That way the first segment will need WHERE ctid <= '...' and the final
> one will need WHERE ctid >= '...'.  Everything in between will have an
> upper and lower bound. That results in no ctid quals being added when
> both ranges are set to InvalidBlockNumber, which you should use for
> all tables not large enough to be segmented, thus no special case.

Makes sense, will look into it.

> TID Range scans are perfectly capable of working when only bounded at one side.
>
> 4. I think using "int" here is a future complaint waiting to happen.
>
> + if (!option_parse_int(optarg, "--huge-table-chunk-pages", 1, INT32_MAX,
> +   &dopt.huge_table_chunk_pages))
>
> I bet we'll eventually see a complaint that someone can't make the
> segment size larger than 16TB. I think option_parse_uint32() might be
> called for.

There can be no more than 2 * INT32_MAX pages anyway.
I thought half of the maximum possible size should be enough.
Do you really think that somebody would want that?

> David

* Re: Patch: dumping tables data in multiple chunks in pg_dump
  2026-01-13 02:27 Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
  2026-01-14 10:52 ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
@ 2026-01-14 21:10   ` David Rowley <[email protected]>
  0 siblings, 0 replies; 24+ messages in thread

From: David Rowley @ 2026-01-14 21:10 UTC (permalink / raw)
  To: Hannu Krosing <[email protected]>; +Cc: Ashutosh Bapat <[email protected]>; PostgreSQL Hackers <[email protected]>; Nathan Bossart <[email protected]>

On Wed, 14 Jan 2026 at 23:52, Hannu Krosing <[email protected]> wrote:
>
> On Tue, Jan 13, 2026 at 3:27 AM David Rowley <[email protected]> wrote:
> > On testing Citus's columnar AM, I get:
> > postgres=# select * from t where ctid between '(0,1)' and '(10,0)';
> > ERROR:  UPDATE and CTID scans not supported for ColumnarScan
>
> Should we just silently not chunk tables that have some storage
> architecture that does not have tids, or should pg_dump just error out
> in this case?

I think you should just document that it only applies to heap tables.
I don't think erroring out is useful to anyone, especially if the
error only arrives after pg_dump has been running for several hours or
even days.

> > 4. I think using "int" here is a future complaint waiting to happen.
> >
> > + if (!option_parse_int(optarg, "--huge-table-chunk-pages", 1, INT32_MAX,
> > +   &dopt.huge_table_chunk_pages))
> >
> > I bet we'll eventually see a complaint that someone can't make the
> > segment size larger than 16TB. I think option_parse_uint32() might be
> > called for.
>
> There can be no more than 2 * INT32_MAX pages anyway.
> I thought half of the maximum possible size should be enough.
> Do you really think that somebody would want that?

IMO, if the option can't represent the full range of BlockNumber, then
that's a bug.

I see pg_resetwal has recently invented strtouint32_strict for this.
It might be a good idea to refactor that and put it into
option_utils.c rather than having each client app have to invent their
own method.
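
To illustrate the kind of shared helper meant here, below is a sketch
under my own assumptions. parse_uint32_strict() is a hypothetical name;
it is neither pg_resetwal's strtouint32_strict nor an existing
option_utils.c function, and it leaves out the error reporting a real
option parser would need.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a decimal string into a uint32 within [min, max].
 * Rejects empty input, signs, leading whitespace, trailing junk and
 * out-of-range values.  Returns true and sets *result on success.
 */
static bool
parse_uint32_strict(const char *s, uint32_t min, uint32_t max,
                    uint32_t *result)
{
    char       *end;
    unsigned long long val;

    assert(s != NULL && result != NULL);

    /* strtoull would accept whitespace and a sign; insist on a digit */
    if (!isdigit((unsigned char) s[0]))
        return false;

    errno = 0;
    val = strtoull(s, &end, 10);
    if (errno == ERANGE || *end != '\0')
        return false;
    if (val < min || val > max)
        return false;

    *result = (uint32_t) val;
    return true;
}
```

Called with min = 1 and max = UINT32_MAX, this accepts "4294967295" but
rejects "4294967296", "-1", "0" and "12abc".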

David

* Re: Patch: dumping tables data in multiple chunks in pg_dump
  2026-01-13 02:27 Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
@ 2026-01-19 19:01 ` Hannu Krosing <[email protected]>
  2026-01-19 21:15   ` Re: Patch: dumping tables data in multiple chunks in pg_dump Zsolt Parragi <[email protected]>
  2026-01-20 02:20   ` Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
  2 siblings, 2 replies; 24+ messages in thread

From: Hannu Krosing @ 2026-01-19 19:01 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Ashutosh Bapat <[email protected]>; PostgreSQL Hackers <[email protected]>; Nathan Bossart <[email protected]>

Here is a new patch which has

* changed flag name to max-table-segment-pages
* added check for amname = "heap"
* made the table info query use pg_relation_size() to get relpages if
the --max-table-segment-pages is set
* added simple chunked dump and restore test

Currently there is no check for actual restore integrity; this is
what t/002_pg_dump.pl says:

# TODO: Have pg_restore actually restore to an independent
# database and then pg_dump *that* database (or something along
# those lines) to validate that part of the process.

As my perl-fu is weak I did not build the new facility to have full
restored data checking, but I did add simple count + table hash
warnings for original and restored data so I could manually verify the
restore.

I added this for both the original and the chunk-restored database:

DO \$\$
DECLARE
    thash_rec RECORD;
BEGIN
    SELECT 'tplain', count(*), sum(hashtext(t::text)) as tablehash
  INTO thash_rec
  FROM tplain AS t;
    RAISE WARNING 'thash after parallel chunked restore: %', thash_rec;
END;
\$\$;

And this is the verification I did after running `make check` in
src/bin/pg_dump/

hannu@HK395:~/work/pggit/src/bin/pg_dump$ grep "WARNING.*thash"
tmp_check/log/004_pg_dump_parallel_main.log
    RAISE WARNING 'thash: %', thash_rec;
2026-01-19 19:27:57.444 CET client backend[678937]
004_pg_dump_parallel.pl WARNING:  thash: (tplain,1000,38441792160)
    RAISE WARNING 'thash after parallel chunked restore: %', thash_rec;
2026-01-19 19:27:57.605 CET client backend[678985]
004_pg_dump_parallel.pl WARNING:  thash after parallel chunked
restore: (tplain,1000,38441792160)

As you can see, both have 1000 rows with the sum of full-row hashes == 38441792160

Other rows in the same log file show that it was dumped as 3 chunks,
as I still have the warnings in the code which show the query used.

Anyone with a better understanding of our Perl tests is welcome to
turn this into proper tests or advise me where to find info on how to
do it.

On Tue, Jan 13, 2026 at 3:27 AM David Rowley <[email protected]> wrote:
>
...
> 3. You should be able to simplify the code in dumpTableData() so
> you're not adding any extra cases. You could use InvalidBlockNumber to
> indicate an unbounded ctid range and only add ctid qual to the WHERE
> clause when you have a bounded range (i.e. not InvalidBlockNumber).
> That way the first segment will need WHERE ctid <= '...' and the final
> one will need WHERE ctid >= '...'.  Everything in between will have an
> upper and lower bound. That results in no ctid quals being added when
> both ranges are set to InvalidBlockNumber, which you should use for
> all tables not large enough to be segmented, thus no special case.
>
> TID Range scans are perfectly capable of working when only bounded at one side.

I changed the last open-ended chunk to use ctid >= (N,1) for clarity
but did not change anything else.

To me it looked like having a loop around the whole thing when there
is no chunking would complicate things for anyone reading the code.
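
For comparison, here is one way the single-loop variant could be
sketched. This is only an illustration under assumed names (Segment and
plan_segments() are hypothetical, not from the patch): a table smaller
than the limit simply comes out as one segment with both bounds
unbounded, and the 64-bit counter avoids wraparound near the top of the
BlockNumber range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t BlockNumber;                           /* local stand-in */
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)   /* same value as in PostgreSQL */

typedef struct Segment
{
    BlockNumber start_page;     /* inclusive; Invalid = no lower bound */
    BlockNumber end_page;       /* exclusive; Invalid = no upper bound */
} Segment;

/* Fill out[] with the segments for a table; returns how many were written. */
static size_t
plan_segments(BlockNumber relpages, BlockNumber max_pages,
              Segment *out, size_t out_cap)
{
    uint64_t    start = 0;      /* 64-bit so start + max_pages cannot wrap */
    size_t      n = 0;

    assert(max_pages > 0 && out != NULL);

    do
    {
        uint64_t    next = start + max_pages;

        assert(n < out_cap);
        /* the first segment has no lower bound, the last no upper bound */
        out[n].start_page = (start == 0) ? InvalidBlockNumber : (BlockNumber) start;
        out[n].end_page = (next >= relpages) ? InvalidBlockNumber : (BlockNumber) next;
        n++;
        start = next;
    } while (start < relpages);

    return n;
}
```

A 12-page table with a 5-page limit gives three segments (unbounded to
5, 5 to 10, 10 to unbounded), while a 3-page table gives a single
segment that is unbounded on both sides.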

> 4. I think using "int" here is a future complaint waiting to happen.
>
> + if (!option_parse_int(optarg, "--huge-table-chunk-pages", 1, INT32_MAX,
> +   &dopt.huge_table_chunk_pages))
>
> I bet we'll eventually see a complaint that someone can't make the
> segment size larger than 16TB. I think option_parse_uint32() might be
> called for.

I have not done anything with this yet, so the maximum chunk size
for now is half of the maximum relpages.


Attachments:

  [application/x-patch] v7-0001-changed-flag-name-to-max-table-segment-pages.patch (14.5K, 2-v7-0001-changed-flag-name-to-max-table-segment-pages.patch)
  download | inline diff:
From 9e4a18c477c7df346ea4150830f34c115fc726be Mon Sep 17 00:00:00 2001
From: Hannu Krosing <[email protected]>
Date: Mon, 19 Jan 2026 19:37:58 +0100
Subject: [PATCH v7] * changed flag mname to max-table-segment-pages * added
 check for amname = "heap" * added simple chunked dump and restore test

* added a WARNING with count and table data hash to source and chunked restore database
---
 src/bin/pg_dump/pg_backup.h               |   1 +
 src/bin/pg_dump/pg_backup_archiver.c      |   1 +
 src/bin/pg_dump/pg_dump.c                 | 172 +++++++++++++++++-----
 src/bin/pg_dump/pg_dump.h                 |   5 +
 src/bin/pg_dump/t/004_pg_dump_parallel.pl |  52 +++++++
 5 files changed, 193 insertions(+), 38 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index d9041dad720..28df18fd993 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -178,6 +178,7 @@ typedef struct _dumpOptions
 	bool		aclsSkip;
 	const char *lockWaitTimeout;
 	int			dump_inserts;	/* 0 = COPY, otherwise rows per INSERT */
+	int			max_table_segment_pages; /* chunk when relpages is above this */
 
 	/* flags for various command-line long options */
 	int			disable_dollar_quoting;
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 4a63f7392ae..70e4da9a970 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -154,6 +154,7 @@ InitDumpOptions(DumpOptions *opts)
 	opts->dumpSchema = true;
 	opts->dumpData = true;
 	opts->dumpStatistics = false;
+	opts->max_table_segment_pages = UINT32_MAX; /* == InvalidBlockNumber, disable chunking by default */
 }
 
 /*
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 687dc98e46d..515e2f2f64a 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -539,6 +539,7 @@ main(int argc, char **argv)
 		{"exclude-extension", required_argument, NULL, 17},
 		{"sequence-data", no_argument, &dopt.sequence_data, 1},
 		{"restrict-key", required_argument, NULL, 25},
+		{"max-table-segment-pages", required_argument, NULL, 26},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -803,6 +804,13 @@ main(int argc, char **argv)
 				dopt.restrict_key = pg_strdup(optarg);
 				break;
 
+			case 26:			/* huge table chunk pages */
+				if (!option_parse_int(optarg, "--max-table-segment-pages", 1, INT32_MAX,
+									  &dopt.max_table_segment_pages))
+					exit_nicely(1);
+				pg_log_warning("CHUNKING: set dopt.max_table_segment_pages to [%u]",(BlockNumber) dopt.max_table_segment_pages);
+				break;
+
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -1372,6 +1380,9 @@ help(const char *progname)
 	printf(_("  --extra-float-digits=NUM     override default setting for extra_float_digits\n"));
 	printf(_("  --filter=FILENAME            include or exclude objects and data from dump\n"
 			 "                               based on expressions in FILENAME\n"));
+	printf(_("  --max-table-segment-pages=NUMPAGES\n"
+		     "                               Number of main table pages above which data is \n"
+			 "                               copied out in chunks, also determines the chunk size\n"));
 	printf(_("  --if-exists                  use IF EXISTS when dropping objects\n"));
 	printf(_("  --include-foreign-data=PATTERN\n"
 			 "                               include data of foreign tables on foreign\n"
@@ -2412,7 +2423,7 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 	 * a filter condition was specified.  For other cases a simple COPY
 	 * suffices.
 	 */
-	if (tdinfo->filtercond || tbinfo->relkind == RELKIND_FOREIGN_TABLE)
+	if (tdinfo->filtercond || tdinfo->chunking || tbinfo->relkind == RELKIND_FOREIGN_TABLE)
 	{
 		/* Temporary allows to access to foreign tables to dump data */
 		if (tbinfo->relkind == RELKIND_FOREIGN_TABLE)
@@ -2428,9 +2439,23 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 		else
 			appendPQExpBufferStr(q, "* ");
 
-		appendPQExpBuffer(q, "FROM %s %s) TO stdout;",
+		appendPQExpBuffer(q, "FROM %s %s",
 						  fmtQualifiedDumpable(tbinfo),
 						  tdinfo->filtercond ? tdinfo->filtercond : "");
+		if (tdinfo->chunking)
+		{
+			if(tdinfo->endPage != InvalidBlockNumber)
+				appendPQExpBuffer(q, "%s ctid BETWEEN '(%u,1)' AND '(%u,32000)'", /* there is no (*,0) tuple */
+								 tdinfo->filtercond?" AND ":" WHERE ",
+								 tdinfo->startPage, tdinfo->endPage);
+			else
+				appendPQExpBuffer(q, "%s ctid >= '(%u,1)'", /* there is no (*,0) tuple */
+								 tdinfo->filtercond?" AND ":" WHERE ",
+								 tdinfo->startPage);
+			pg_log_warning("CHUNKING: pages [%u:%u]",tdinfo->startPage, tdinfo->endPage);
+		}
+		
+		appendPQExpBuffer(q, ") TO stdout;");
 	}
 	else
 	{
@@ -2438,6 +2463,9 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 						  fmtQualifiedDumpable(tbinfo),
 						  column_list);
 	}
+
+	pg_log_warning("CHUNKING: data query: %s", q->data);
+	
 	res = ExecuteSqlQuery(fout, q->data, PGRES_COPY_OUT);
 	PQclear(res);
 	destroyPQExpBuffer(clistBuf);
@@ -2933,42 +2961,100 @@ dumpTableData(Archive *fout, const TableDataInfo *tdinfo)
 	{
 		TocEntry   *te;
 
-		te = ArchiveEntry(fout, tdinfo->dobj.catId, tdinfo->dobj.dumpId,
-						  ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
-									   .namespace = tbinfo->dobj.namespace->dobj.name,
-									   .owner = tbinfo->rolname,
-									   .description = "TABLE DATA",
-									   .section = SECTION_DATA,
-									   .createStmt = tdDefn,
-									   .copyStmt = copyStmt,
-									   .deps = &(tbinfo->dobj.dumpId),
-									   .nDeps = 1,
-									   .dumpFn = dumpFn,
-									   .dumpArg = tdinfo));
-
-		/*
-		 * Set the TocEntry's dataLength in case we are doing a parallel dump
-		 * and want to order dump jobs by table size.  We choose to measure
-		 * dataLength in table pages (including TOAST pages) during dump, so
-		 * no scaling is needed.
-		 *
-		 * However, relpages is declared as "integer" in pg_class, and hence
-		 * also in TableInfo, but it's really BlockNumber a/k/a unsigned int.
-		 * Cast so that we get the right interpretation of table sizes
-		 * exceeding INT_MAX pages.
+		/* chunking works off relpages, which may be slightly off
+		 * but is the best we have without doing our own page count
+		 * it should be enough for typical use case of huge tables which 
+		 * should have their relpages updated by autovacuum
+		 * 
+		 * For now we only do cunking when table access method is heap
+		 * we may add other chunking methods later. 
 		 */
-		te->dataLength = (BlockNumber) tbinfo->relpages;
-		te->dataLength += (BlockNumber) tbinfo->toastpages;
+		if ((BlockNumber) tbinfo->relpages < dopt->max_table_segment_pages || 
+			strcmp(tbinfo->amname, "heap") != 0)
+		{
+			te = ArchiveEntry(fout, tdinfo->dobj.catId, tdinfo->dobj.dumpId,
+							ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
+										.namespace = tbinfo->dobj.namespace->dobj.name,
+										.owner = tbinfo->rolname,
+										.description = "TABLE DATA",
+										.section = SECTION_DATA,
+										.createStmt = tdDefn,
+										.copyStmt = copyStmt,
+										.deps = &(tbinfo->dobj.dumpId),
+										.nDeps = 1,
+										.dumpFn = dumpFn,
+										.dumpArg = tdinfo));
 
-		/*
-		 * If pgoff_t is only 32 bits wide, the above refinement is useless,
-		 * and instead we'd better worry about integer overflow.  Clamp to
-		 * INT_MAX if the correct result exceeds that.
-		 */
-		if (sizeof(te->dataLength) == 4 &&
-			(tbinfo->relpages < 0 || tbinfo->toastpages < 0 ||
-			 te->dataLength < 0))
-			te->dataLength = INT_MAX;
+			/*
+			* Set the TocEntry's dataLength in case we are doing a parallel dump
+			* and want to order dump jobs by table size.  We choose to measure
+			* dataLength in table pages (including TOAST pages) during dump, so
+			* no scaling is needed.
+			*
+			* However, relpages is declared as "integer" in pg_class, and hence
+			* also in TableInfo, but it's really BlockNumber a/k/a unsigned int.
+			* Cast so that we get the right interpretation of table sizes
+			* exceeding INT_MAX pages.
+			*/
+			te->dataLength = (BlockNumber) tbinfo->relpages;
+			te->dataLength += (BlockNumber) tbinfo->toastpages;
+
+			/*
+			* If pgoff_t is only 32 bits wide, the above refinement is useless,
+			* and instead we'd better worry about integer overflow.  Clamp to
+			* INT_MAX if the correct result exceeds that.
+			*/
+			if (sizeof(te->dataLength) == 4 &&
+				(tbinfo->relpages < 0 || tbinfo->toastpages < 0 ||
+				te->dataLength < 0))
+				te->dataLength = INT_MAX;
+		}
+		else
+		{
+			BlockNumber current_chunk_start = 0;
+			PQExpBuffer chunk_desc = createPQExpBuffer();
+			
+			pg_log_warning("CHUNKING: toc for chunked relpages [%u]",(BlockNumber) tbinfo->relpages);
+
+			while (current_chunk_start < (BlockNumber) tbinfo->relpages)
+			{
+				TableDataInfo *chunk_tdinfo = (TableDataInfo *) pg_malloc(sizeof(TableDataInfo));
+
+				memcpy(chunk_tdinfo, tdinfo, sizeof(TableDataInfo));
+				AssignDumpId(&chunk_tdinfo->dobj);
+				//addObjectDependency(&chunk_tdinfo->dobj, tbinfo->dobj.dumpId); /* do we need this here */
+				chunk_tdinfo->chunking = true;
+				chunk_tdinfo->startPage = current_chunk_start;
+				chunk_tdinfo->endPage = current_chunk_start + dopt->max_table_segment_pages - 1;
+
+				pg_log_warning("CHUNKING: toc for pages [%u:%u]",chunk_tdinfo->startPage, chunk_tdinfo->endPage);
+				
+				current_chunk_start += dopt->max_table_segment_pages;
+				if (current_chunk_start >= (BlockNumber) tbinfo->relpages)
+					chunk_tdinfo->endPage = UINT32_MAX; /* last chunk is for "all the rest" */
+
+				printfPQExpBuffer(chunk_desc, "TABLE DATA (pages %u:%u)", chunk_tdinfo->startPage, chunk_tdinfo->endPage);
+
+				te = ArchiveEntry(fout, chunk_tdinfo->dobj.catId, chunk_tdinfo->dobj.dumpId,
+							ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
+										.namespace = tbinfo->dobj.namespace->dobj.name,
+										.owner = tbinfo->rolname,
+										.description = chunk_desc->data,
+										.section = SECTION_DATA,
+										.createStmt = tdDefn,
+										.copyStmt = copyStmt,
+										.deps = &(tbinfo->dobj.dumpId),
+										.nDeps = 1,
+										.dumpFn = dumpFn,
+										.dumpArg = chunk_tdinfo));
+
+				te->dataLength = dopt->max_table_segment_pages;
+				/* let's assume toast pages distribute evenly among chunks */
+				te->dataLength += (off_t)dopt->max_table_segment_pages * tbinfo->toastpages / tbinfo->relpages;
+			}
+
+			destroyPQExpBuffer(chunk_desc);
+		}
 	}
 
 	destroyPQExpBuffer(copyBuf);
@@ -3092,6 +3178,9 @@ makeTableDataInfo(DumpOptions *dopt, TableInfo *tbinfo)
 	tdinfo->dobj.namespace = tbinfo->dobj.namespace;
 	tdinfo->tdtable = tbinfo;
 	tdinfo->filtercond = NULL;	/* might get set later */
+	tdinfo->chunking = false; /* defaults */
+	tdinfo->startPage = 0;
+	tdinfo->endPage = InvalidBlockNumber;
 	addObjectDependency(&tdinfo->dobj, tbinfo->dobj.dumpId);
 
 	/* A TableDataInfo contains data, of course */
@@ -7254,8 +7343,15 @@ getTables(Archive *fout, int *numTables)
 						 "c.relnamespace, c.relkind, c.reltype, "
 						 "c.relowner, "
 						 "c.relchecks, "
-						 "c.relhasindex, c.relhasrules, c.relpages, "
-						 "c.reltuples, c.relallvisible, ");
+						 "c.relhasindex, c.relhasrules, ");
+
+	/* use real relation size if chunking is requested */
+	if(dopt->max_table_segment_pages != InvalidBlockNumber)
+		appendPQExpBufferStr(query, "pg_relation_size(c.tableoid)/8192 AS relpages, ");
+	else
+		appendPQExpBufferStr(query, "c.relpages, ");
+
+	appendPQExpBufferStr(query, "c.reltuples, c.relallvisible, ");
 
 	if (fout->remoteVersion >= 180000)
 		appendPQExpBufferStr(query, "c.relallfrozen, ");
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 4c4b14e5fc7..ddaf341bb3b 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -16,6 +16,7 @@
 
 #include "pg_backup.h"
 #include "catalog/pg_publication_d.h"
+#include "storage/block.h"
 
 
 #define oidcmp(x,y) ( ((x) < (y) ? -1 : ((x) > (y)) ?  1 : 0) )
@@ -413,6 +414,10 @@ typedef struct _tableDataInfo
 	DumpableObject dobj;
 	TableInfo  *tdtable;		/* link to table to dump */
 	char	   *filtercond;		/* WHERE condition to limit rows dumped */
+	bool 		chunking;
+	BlockNumber	startPage;		/* starting table page */
+	BlockNumber	endPage;		/* ending table page for page-range dump,
+	                    		 * mostly startPage+max_table_segment_pages */
 } TableDataInfo;
 
 typedef struct _indxInfo
diff --git a/src/bin/pg_dump/t/004_pg_dump_parallel.pl b/src/bin/pg_dump/t/004_pg_dump_parallel.pl
index 738f34b1c1b..9094352e29f 100644
--- a/src/bin/pg_dump/t/004_pg_dump_parallel.pl
+++ b/src/bin/pg_dump/t/004_pg_dump_parallel.pl
@@ -11,6 +11,7 @@ use Test::More;
 my $dbname1 = 'regression_src';
 my $dbname2 = 'regression_dest1';
 my $dbname3 = 'regression_dest2';
+my $dbname4 = 'regression_dest3';
 
 my $node = PostgreSQL::Test::Cluster->new('main');
 $node->init;
@@ -21,6 +22,7 @@ my $backupdir = $node->backup_dir;
 $node->run_log([ 'createdb', $dbname1 ]);
 $node->run_log([ 'createdb', $dbname2 ]);
 $node->run_log([ 'createdb', $dbname3 ]);
+$node->run_log([ 'createdb', $dbname4 ]);
 
 $node->safe_psql(
 	$dbname1,
@@ -44,6 +46,18 @@ create table tht_p1 partition of tht for values with (modulus 3, remainder 0);
 create table tht_p2 partition of tht for values with (modulus 3, remainder 1);
 create table tht_p3 partition of tht for values with (modulus 3, remainder 2);
 insert into tht select (x%10)::text::digit, x from generate_series(1,1000) x;
+
+-- raise warning so I can check in .log if data was correct
+DO \$\$
+DECLARE
+    thash_rec RECORD;
+BEGIN
+    SELECT 'tplain', count(*), sum(hashtext(t::text)) as tablehash 
+	  INTO thash_rec
+	  FROM tplain AS t;
+    RAISE WARNING 'thash: %', thash_rec;
+END;
+\$\$;
 	});
 
 $node->command_ok(
@@ -87,4 +101,42 @@ $node->command_ok(
 	],
 	'parallel restore as inserts');
 
+$node->command_ok(
+	[
+		'pg_dump',
+		'--format' => 'directory',
+		'--max-table-segment-pages' => 5,
+		'--no-sync',
+		'--jobs' => 2,
+		'--file' => "$backupdir/dump3",
+		$node->connstr($dbname1),
+	],
+	'parallel dump with chunks of five heap pages');
+
+$node->command_ok(
+	[
+		'pg_restore', '--verbose',
+		'--dbname' => $node->connstr($dbname4),
+		'--jobs' => 3,
+		"$backupdir/dump3",
+	],
+	'parallel restore with chunks of five heap pages');
+
+$node->safe_psql(
+	$dbname4,
+	qq{
+
+-- raise warning so I can check in .log if data was correct
+DO \$\$
+DECLARE
+    thash_rec RECORD;
+BEGIN
+    SELECT 'tplain', count(*), sum(hashtext(t::text)) as tablehash 
+	  INTO thash_rec
+	  FROM tplain AS t;
+    RAISE WARNING 'thash after parallel chunked restore: %', thash_rec;
+END;
+\$\$;
+	});
+
 done_testing();
-- 
2.43.0

* Re: Patch: dumping tables data in multiple chunks in pg_dump
  2026-01-13 02:27 Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
  2026-01-19 19:01 ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
@ 2026-01-19 21:15   ` Zsolt Parragi <[email protected]>
  2026-01-19 23:07     ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  1 sibling, 1 reply; 24+ messages in thread

From: Zsolt Parragi @ 2026-01-19 21:15 UTC (permalink / raw)
  To: Hannu Krosing <[email protected]>; +Cc: David Rowley <[email protected]>; Ashutosh Bapat <[email protected]>; PostgreSQL Hackers <[email protected]>; Nathan Bossart <[email protected]>

Hello

pg_dump.c:7174

+ appendPQExpBufferStr(query, "pg_relation_size(c.tableoid)/8192 AS relpages, ");

Shouldn't this be something like

+ appendPQExpBufferStr(query,
+     "pg_relation_size(c.oid)/current_setting('block_size')::int AS relpages, ");

instead?

* Re: Patch: dumping tables data in multiple chunks in pg_dump
  2026-01-13 02:27 Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
  2026-01-19 19:01 ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-19 21:15   ` Re: Patch: dumping tables data in multiple chunks in pg_dump Zsolt Parragi <[email protected]>
@ 2026-01-19 23:07     ` Hannu Krosing <[email protected]>
  2026-01-20 06:13       ` Re: Patch: dumping tables data in multiple chunks in pg_dump Zsolt Parragi <[email protected]>
  0 siblings, 1 reply; 24+ messages in thread

From: Hannu Krosing @ 2026-01-19 23:07 UTC (permalink / raw)
  To: Zsolt Parragi <[email protected]>; +Cc: David Rowley <[email protected]>; Ashutosh Bapat <[email protected]>; PostgreSQL Hackers <[email protected]>; Nathan Bossart <[email protected]>

Thanks Zsolt

I changed it to use the configured BLCKSZ in the attached patch.

But you may be right that current_setting('block_size')::int is the
better way to go, as we are interested in the page size at the target
database, not the one pg_dump was compiled with.

I'll wait for other feedback as well and then send the next patch with the changes.
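
To illustrate why the divisor matters, here is a minimal sketch in plain
C. It is not code from the patch: pages_for_size() is a hypothetical
helper, and the byte count used with it is a made-up input standing in
for what pg_relation_size() would return.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of whole pages in a relation fork that is
 * size_bytes long, on a server whose block size is block_size bytes.
 */
static uint32_t
pages_for_size(uint64_t size_bytes, uint32_t block_size)
{
	assert(block_size > 0);
	return (uint32_t) (size_bytes / block_size);
}
```

For a 163840-byte relation, dividing by a hard-coded 8192 gives 20
pages, while a server built with 16 kB blocks actually has 10. The
overcounted value would yield extra segments covering pages that do not
exist.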




On Mon, Jan 19, 2026 at 10:15 PM Zsolt Parragi
<[email protected]> wrote:
>
> Hello
>
> pg_dump.c:7174
>
> + appendPQExpBufferStr(query, "pg_relation_size(c.tableoid)/8192 AS relpages, ");
>
> Shouldn't this be something like
>
> + appendPQExpBufferStr(query,
> "pg_relation_size(c.oid)/current_setting('block_size')::int AS relpages, ");
>
> instead?


Attachments:

  [application/x-patch] v8-0001-changed-flag-name-to-max-table-segment-pages.patch (14.6K, 2-v8-0001-changed-flag-name-to-max-table-segment-pages.patch)
From 6577b964a4b4b85aa51cc7ba12f785ed5567894a Mon Sep 17 00:00:00 2001
From: Hannu Krosing <[email protected]>
Date: Mon, 19 Jan 2026 23:56:49 +0100
Subject: [PATCH v8] * changed flag name to max-table-segment-pages * added
 check for amname = "heap" * added simple chunked dump and restore test *
 switched to using pg_relation_size()/BLCKSZ when --max-table-segment-pages
 is set

* added a WARNING with count and table data hash to source and chunked restore database
---
 src/bin/pg_dump/pg_backup.h               |   1 +
 src/bin/pg_dump/pg_backup_archiver.c      |   1 +
 src/bin/pg_dump/pg_dump.c                 | 172 +++++++++++++++++-----
 src/bin/pg_dump/pg_dump.h                 |   5 +
 src/bin/pg_dump/t/004_pg_dump_parallel.pl |  52 +++++++
 5 files changed, 193 insertions(+), 38 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index d9041dad720..28df18fd993 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -178,6 +178,7 @@ typedef struct _dumpOptions
 	bool		aclsSkip;
 	const char *lockWaitTimeout;
 	int			dump_inserts;	/* 0 = COPY, otherwise rows per INSERT */
+	int			max_table_segment_pages; /* chunk when relpages is above this */
 
 	/* flags for various command-line long options */
 	int			disable_dollar_quoting;
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 4a63f7392ae..70e4da9a970 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -154,6 +154,7 @@ InitDumpOptions(DumpOptions *opts)
 	opts->dumpSchema = true;
 	opts->dumpData = true;
 	opts->dumpStatistics = false;
+	opts->max_table_segment_pages = UINT32_MAX; /* == InvalidBlockNumber, disable chunking by default */
 }
 
 /*
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 687dc98e46d..747b396c788 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -539,6 +539,7 @@ main(int argc, char **argv)
 		{"exclude-extension", required_argument, NULL, 17},
 		{"sequence-data", no_argument, &dopt.sequence_data, 1},
 		{"restrict-key", required_argument, NULL, 25},
+		{"max-table-segment-pages", required_argument, NULL, 26},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -803,6 +804,13 @@ main(int argc, char **argv)
 				dopt.restrict_key = pg_strdup(optarg);
 				break;
 
+			case 26:			/* huge table chunk pages */
+				if (!option_parse_int(optarg, "--max-table-segment-pages", 1, INT32_MAX,
+									  &dopt.max_table_segment_pages))
+					exit_nicely(1);
+				pg_log_warning("CHUNKING: set dopt.max_table_segment_pages to [%u]",(BlockNumber) dopt.max_table_segment_pages);
+				break;
+
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -1372,6 +1380,9 @@ help(const char *progname)
 	printf(_("  --extra-float-digits=NUM     override default setting for extra_float_digits\n"));
 	printf(_("  --filter=FILENAME            include or exclude objects and data from dump\n"
 			 "                               based on expressions in FILENAME\n"));
+	printf(_("  --max-table-segment-pages=NUMPAGES\n"
+		     "                               Number of main table pages above which data is \n"
+			 "                               copied out in chunks, also determines the chunk size\n"));
 	printf(_("  --if-exists                  use IF EXISTS when dropping objects\n"));
 	printf(_("  --include-foreign-data=PATTERN\n"
 			 "                               include data of foreign tables on foreign\n"
@@ -2412,7 +2423,7 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 	 * a filter condition was specified.  For other cases a simple COPY
 	 * suffices.
 	 */
-	if (tdinfo->filtercond || tbinfo->relkind == RELKIND_FOREIGN_TABLE)
+	if (tdinfo->filtercond || tdinfo->chunking || tbinfo->relkind == RELKIND_FOREIGN_TABLE)
 	{
 		/* Temporary allows to access to foreign tables to dump data */
 		if (tbinfo->relkind == RELKIND_FOREIGN_TABLE)
@@ -2428,9 +2439,23 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 		else
 			appendPQExpBufferStr(q, "* ");
 
-		appendPQExpBuffer(q, "FROM %s %s) TO stdout;",
+		appendPQExpBuffer(q, "FROM %s %s",
 						  fmtQualifiedDumpable(tbinfo),
 						  tdinfo->filtercond ? tdinfo->filtercond : "");
+		if (tdinfo->chunking)
+		{
+			if(tdinfo->endPage != InvalidBlockNumber)
+				appendPQExpBuffer(q, "%s ctid BETWEEN '(%u,1)' AND '(%u,32000)'", /* there is no (*,0) tuple */
+								 tdinfo->filtercond?" AND ":" WHERE ",
+								 tdinfo->startPage, tdinfo->endPage);
+			else
+				appendPQExpBuffer(q, "%s ctid >= '(%u,1)'", /* there is no (*,0) tuple */
+								 tdinfo->filtercond?" AND ":" WHERE ",
+								 tdinfo->startPage);
+			pg_log_warning("CHUNKING: pages [%u:%u]",tdinfo->startPage, tdinfo->endPage);
+		}
+		
+		appendPQExpBuffer(q, ") TO stdout;");
 	}
 	else
 	{
@@ -2438,6 +2463,9 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 						  fmtQualifiedDumpable(tbinfo),
 						  column_list);
 	}
+
+	pg_log_warning("CHUNKING: data query: %s", q->data);
+	
 	res = ExecuteSqlQuery(fout, q->data, PGRES_COPY_OUT);
 	PQclear(res);
 	destroyPQExpBuffer(clistBuf);
@@ -2933,42 +2961,100 @@ dumpTableData(Archive *fout, const TableDataInfo *tdinfo)
 	{
 		TocEntry   *te;
 
-		te = ArchiveEntry(fout, tdinfo->dobj.catId, tdinfo->dobj.dumpId,
-						  ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
-									   .namespace = tbinfo->dobj.namespace->dobj.name,
-									   .owner = tbinfo->rolname,
-									   .description = "TABLE DATA",
-									   .section = SECTION_DATA,
-									   .createStmt = tdDefn,
-									   .copyStmt = copyStmt,
-									   .deps = &(tbinfo->dobj.dumpId),
-									   .nDeps = 1,
-									   .dumpFn = dumpFn,
-									   .dumpArg = tdinfo));
-
-		/*
-		 * Set the TocEntry's dataLength in case we are doing a parallel dump
-		 * and want to order dump jobs by table size.  We choose to measure
-		 * dataLength in table pages (including TOAST pages) during dump, so
-		 * no scaling is needed.
-		 *
-		 * However, relpages is declared as "integer" in pg_class, and hence
-		 * also in TableInfo, but it's really BlockNumber a/k/a unsigned int.
-		 * Cast so that we get the right interpretation of table sizes
-		 * exceeding INT_MAX pages.
+		/* chunking works off relpages, which may be slightly off
+		 * but is the best we have without doing our own page count
+		 * it should be enough for typical use case of huge tables which 
+		 * should have their relpages updated by autovacuum
+		 * 
+		 * For now we only do chunking when table access method is heap
+		 * we may add other chunking methods later. 
 		 */
-		te->dataLength = (BlockNumber) tbinfo->relpages;
-		te->dataLength += (BlockNumber) tbinfo->toastpages;
+		if ((BlockNumber) tbinfo->relpages < dopt->max_table_segment_pages || 
+			strcmp(tbinfo->amname, "heap") != 0)
+		{
+			te = ArchiveEntry(fout, tdinfo->dobj.catId, tdinfo->dobj.dumpId,
+							ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
+										.namespace = tbinfo->dobj.namespace->dobj.name,
+										.owner = tbinfo->rolname,
+										.description = "TABLE DATA",
+										.section = SECTION_DATA,
+										.createStmt = tdDefn,
+										.copyStmt = copyStmt,
+										.deps = &(tbinfo->dobj.dumpId),
+										.nDeps = 1,
+										.dumpFn = dumpFn,
+										.dumpArg = tdinfo));
 
-		/*
-		 * If pgoff_t is only 32 bits wide, the above refinement is useless,
-		 * and instead we'd better worry about integer overflow.  Clamp to
-		 * INT_MAX if the correct result exceeds that.
-		 */
-		if (sizeof(te->dataLength) == 4 &&
-			(tbinfo->relpages < 0 || tbinfo->toastpages < 0 ||
-			 te->dataLength < 0))
-			te->dataLength = INT_MAX;
+			/*
+			* Set the TocEntry's dataLength in case we are doing a parallel dump
+			* and want to order dump jobs by table size.  We choose to measure
+			* dataLength in table pages (including TOAST pages) during dump, so
+			* no scaling is needed.
+			*
+			* However, relpages is declared as "integer" in pg_class, and hence
+			* also in TableInfo, but it's really BlockNumber a/k/a unsigned int.
+			* Cast so that we get the right interpretation of table sizes
+			* exceeding INT_MAX pages.
+			*/
+			te->dataLength = (BlockNumber) tbinfo->relpages;
+			te->dataLength += (BlockNumber) tbinfo->toastpages;
+
+			/*
+			* If pgoff_t is only 32 bits wide, the above refinement is useless,
+			* and instead we'd better worry about integer overflow.  Clamp to
+			* INT_MAX if the correct result exceeds that.
+			*/
+			if (sizeof(te->dataLength) == 4 &&
+				(tbinfo->relpages < 0 || tbinfo->toastpages < 0 ||
+				te->dataLength < 0))
+				te->dataLength = INT_MAX;
+		}
+		else
+		{
+			BlockNumber current_chunk_start = 0;
+			PQExpBuffer chunk_desc = createPQExpBuffer();
+			
+			pg_log_warning("CHUNKING: toc for chunked relpages [%u]",(BlockNumber) tbinfo->relpages);
+
+			while (current_chunk_start < (BlockNumber) tbinfo->relpages)
+			{
+				TableDataInfo *chunk_tdinfo = (TableDataInfo *) pg_malloc(sizeof(TableDataInfo));
+
+				memcpy(chunk_tdinfo, tdinfo, sizeof(TableDataInfo));
+				AssignDumpId(&chunk_tdinfo->dobj);
+				//addObjectDependency(&chunk_tdinfo->dobj, tbinfo->dobj.dumpId); /* do we need this here */
+				chunk_tdinfo->chunking = true;
+				chunk_tdinfo->startPage = current_chunk_start;
+				chunk_tdinfo->endPage = current_chunk_start + dopt->max_table_segment_pages - 1;
+
+				pg_log_warning("CHUNKING: toc for pages [%u:%u]",chunk_tdinfo->startPage, chunk_tdinfo->endPage);
+				
+				current_chunk_start += dopt->max_table_segment_pages;
+				if (current_chunk_start >= (BlockNumber) tbinfo->relpages)
+					chunk_tdinfo->endPage = UINT32_MAX; /* last chunk is for "all the rest" */
+
+				printfPQExpBuffer(chunk_desc, "TABLE DATA (pages %u:%u)", chunk_tdinfo->startPage, chunk_tdinfo->endPage);
+
+				te = ArchiveEntry(fout, chunk_tdinfo->dobj.catId, chunk_tdinfo->dobj.dumpId,
+							ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
+										.namespace = tbinfo->dobj.namespace->dobj.name,
+										.owner = tbinfo->rolname,
+										.description = chunk_desc->data,
+										.section = SECTION_DATA,
+										.createStmt = tdDefn,
+										.copyStmt = copyStmt,
+										.deps = &(tbinfo->dobj.dumpId),
+										.nDeps = 1,
+										.dumpFn = dumpFn,
+										.dumpArg = chunk_tdinfo));
+
+				te->dataLength = dopt->max_table_segment_pages;
+				/* let's assume toast pages distribute evenly among chunks */
+				te->dataLength += (off_t)dopt->max_table_segment_pages * tbinfo->toastpages / tbinfo->relpages;
+			}
+
+			destroyPQExpBuffer(chunk_desc);
+		}
 	}
 
 	destroyPQExpBuffer(copyBuf);
@@ -3092,6 +3178,9 @@ makeTableDataInfo(DumpOptions *dopt, TableInfo *tbinfo)
 	tdinfo->dobj.namespace = tbinfo->dobj.namespace;
 	tdinfo->tdtable = tbinfo;
 	tdinfo->filtercond = NULL;	/* might get set later */
+	tdinfo->chunking = false; /* defaults */
+	tdinfo->startPage = 0;
+	tdinfo->endPage = InvalidBlockNumber;
 	addObjectDependency(&tdinfo->dobj, tbinfo->dobj.dumpId);
 
 	/* A TableDataInfo contains data, of course */
@@ -7254,8 +7343,15 @@ getTables(Archive *fout, int *numTables)
 						 "c.relnamespace, c.relkind, c.reltype, "
 						 "c.relowner, "
 						 "c.relchecks, "
-						 "c.relhasindex, c.relhasrules, c.relpages, "
-						 "c.reltuples, c.relallvisible, ");
+						 "c.relhasindex, c.relhasrules, ");
+
+	/* use real relation size if chunking is requested */
+	if(dopt->max_table_segment_pages != InvalidBlockNumber)
+		appendPQExpBuffer(query, "pg_relation_size(c.tableoid)/%d AS relpages, ", BLCKSZ);
+	else
+		appendPQExpBufferStr(query, "c.relpages, ");
+
+	appendPQExpBufferStr(query, "c.reltuples, c.relallvisible, ");
 
 	if (fout->remoteVersion >= 180000)
 		appendPQExpBufferStr(query, "c.relallfrozen, ");
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 4c4b14e5fc7..ddaf341bb3b 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -16,6 +16,7 @@
 
 #include "pg_backup.h"
 #include "catalog/pg_publication_d.h"
+#include "storage/block.h"
 
 
 #define oidcmp(x,y) ( ((x) < (y) ? -1 : ((x) > (y)) ?  1 : 0) )
@@ -413,6 +414,10 @@ typedef struct _tableDataInfo
 	DumpableObject dobj;
 	TableInfo  *tdtable;		/* link to table to dump */
 	char	   *filtercond;		/* WHERE condition to limit rows dumped */
+	bool 		chunking;
+	BlockNumber	startPage;		/* starting table page */
+	BlockNumber	endPage;		/* ending table page for page-range dump,
+	                    		 * mostly startPage+max_table_segment_pages */
 } TableDataInfo;
 
 typedef struct _indxInfo
diff --git a/src/bin/pg_dump/t/004_pg_dump_parallel.pl b/src/bin/pg_dump/t/004_pg_dump_parallel.pl
index 738f34b1c1b..9094352e29f 100644
--- a/src/bin/pg_dump/t/004_pg_dump_parallel.pl
+++ b/src/bin/pg_dump/t/004_pg_dump_parallel.pl
@@ -11,6 +11,7 @@ use Test::More;
 my $dbname1 = 'regression_src';
 my $dbname2 = 'regression_dest1';
 my $dbname3 = 'regression_dest2';
+my $dbname4 = 'regression_dest3';
 
 my $node = PostgreSQL::Test::Cluster->new('main');
 $node->init;
@@ -21,6 +22,7 @@ my $backupdir = $node->backup_dir;
 $node->run_log([ 'createdb', $dbname1 ]);
 $node->run_log([ 'createdb', $dbname2 ]);
 $node->run_log([ 'createdb', $dbname3 ]);
+$node->run_log([ 'createdb', $dbname4 ]);
 
 $node->safe_psql(
 	$dbname1,
@@ -44,6 +46,18 @@ create table tht_p1 partition of tht for values with (modulus 3, remainder 0);
 create table tht_p2 partition of tht for values with (modulus 3, remainder 1);
 create table tht_p3 partition of tht for values with (modulus 3, remainder 2);
 insert into tht select (x%10)::text::digit, x from generate_series(1,1000) x;
+
+-- raise warning so I can check in .log if data was correct
+DO \$\$
+DECLARE
+    thash_rec RECORD;
+BEGIN
+    SELECT 'tplain', count(*), sum(hashtext(t::text)) as tablehash 
+	  INTO thash_rec
+	  FROM tplain AS t;
+    RAISE WARNING 'thash: %', thash_rec;
+END;
+\$\$;
 	});
 
 $node->command_ok(
@@ -87,4 +101,42 @@ $node->command_ok(
 	],
 	'parallel restore as inserts');
 
+$node->command_ok(
+	[
+		'pg_dump',
+		'--format' => 'directory',
+		'--max-table-segment-pages' => 5,
+		'--no-sync',
+		'--jobs' => 2,
+		'--file' => "$backupdir/dump3",
+		$node->connstr($dbname1),
+	],
+	'parallel dump with chunks of five heap pages');
+
+$node->command_ok(
+	[
+		'pg_restore', '--verbose',
+		'--dbname' => $node->connstr($dbname4),
+		'--jobs' => 3,
+		"$backupdir/dump3",
+	],
+	'parallel restore with chunks of five heap pages');
+
+$node->safe_psql(
+	$dbname4,
+	qq{
+
+-- raise warning so I can check in .log if data was correct
+DO \$\$
+DECLARE
+    thash_rec RECORD;
+BEGIN
+    SELECT 'tplain', count(*), sum(hashtext(t::text)) as tablehash 
+	  INTO thash_rec
+	  FROM tplain AS t;
+    RAISE WARNING 'thash after parallel chunked restore: %', thash_rec;
+END;
+\$\$;
+	});
+
 done_testing();
-- 
2.43.0




* Re: Patch: dumping tables data in multiple chunks in pg_dump
  2026-01-13 02:27 Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
  2026-01-19 19:01 ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-19 21:15   ` Re: Patch: dumping tables data in multiple chunks in pg_dump Zsolt Parragi <[email protected]>
  2026-01-19 23:07     ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
@ 2026-01-20 06:13       ` Zsolt Parragi <[email protected]>
  2026-01-20 12:48         ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  0 siblings, 1 reply; 24+ messages in thread

From: Zsolt Parragi @ 2026-01-20 06:13 UTC (permalink / raw)
  To: Hannu Krosing <[email protected]>; +Cc: David Rowley <[email protected]>; Ashutosh Bapat <[email protected]>; PostgreSQL Hackers <[email protected]>; Nathan Bossart <[email protected]>

Hello

I changed two things on that line in my previous email. I think
c.tableoid is also wrong: pg_relation_size(c.tableoid) will return the
size of pg_class, not the size of the relation in question. That
should be pg_relation_size(c.oid).







* Re: Patch: dumping tables data in multiple chunks in pg_dump
  2026-01-13 02:27 Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
  2026-01-19 19:01 ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-19 21:15   ` Re: Patch: dumping tables data in multiple chunks in pg_dump Zsolt Parragi <[email protected]>
  2026-01-19 23:07     ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-20 06:13       ` Re: Patch: dumping tables data in multiple chunks in pg_dump Zsolt Parragi <[email protected]>
@ 2026-01-20 12:48         ` Hannu Krosing <[email protected]>
  2026-01-21 13:05           ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  0 siblings, 1 reply; 24+ messages in thread

From: Hannu Krosing @ 2026-01-20 12:48 UTC (permalink / raw)
  To: Zsolt Parragi <[email protected]>; +Cc: David Rowley <[email protected]>; Ashutosh Bapat <[email protected]>; PostgreSQL Hackers <[email protected]>; Nathan Bossart <[email protected]>

On Tue, Jan 20, 2026 at 7:14 AM Zsolt Parragi <[email protected]> wrote:
>
> Hello
>
> I changed two things on that line in my previous email. I think
> c.tableoid is also wrong: pg_relation_size(c.tableoid) will return the
> size of pg_class, not the size of the relation in question. That
> should be pg_relation_size(c.oid).

Thanks, a good catch :)







* Re: Patch: dumping tables data in multiple chunks in pg_dump
  2026-01-13 02:27 Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
  2026-01-19 19:01 ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-19 21:15   ` Re: Patch: dumping tables data in multiple chunks in pg_dump Zsolt Parragi <[email protected]>
  2026-01-19 23:07     ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-20 06:13       ` Re: Patch: dumping tables data in multiple chunks in pg_dump Zsolt Parragi <[email protected]>
  2026-01-20 12:48         ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
@ 2026-01-21 13:05           ` Hannu Krosing <[email protected]>
  2026-01-22 17:05             ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  0 siblings, 1 reply; 24+ messages in thread

From: Hannu Krosing @ 2026-01-21 13:05 UTC (permalink / raw)
  To: David Rowley <[email protected]>; Zsolt Parragi <[email protected]>; +Cc: Ashutosh Bapat <[email protected]>; PostgreSQL Hackers <[email protected]>; Nathan Bossart <[email protected]>

Please find the latest patch attached which incorporates the feedback received.

* changed flag name to --max-table-segment-pages
* added check for amname = "heap"
* switched to using
pg_relation_size(c.oid)/current_setting('block_size')::int when
--max-table-segment-pages is set
* added option_parse_uint32(...) to be used for the full range of page numbers
* The COPY SELECTs now use <=, BETWEEN or >= depending on the segment position

* added documentation

* TESTS:
  * added simple chunked dump and restore test
  * added a WARNING with count and table data hash to source and
chunked restore database
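
The segment layout and the three predicate shapes listed above can be
sketched as follows. This is a standalone illustration under stated
assumptions, not the patch code: SegmentRange, nth_segment() and
segment_predicate() are hypothetical names, BlockNumber and
InvalidBlockNumber are local stand-ins for the definitions in
storage/block.h, and the start offset is computed in 64 bits, which is
my choice to avoid wraparound rather than something the patch does.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Local stand-ins for the definitions in storage/block.h. */
typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Hypothetical type: one page range of a segmented table dump. */
typedef struct SegmentRange
{
	BlockNumber startPage;
	BlockNumber endPage;	/* InvalidBlockNumber = open-ended last segment */
} SegmentRange;

/*
 * Hypothetical helper: compute the n-th segment (0-based) of a table with
 * relpages pages, cut into pieces of at most seg_pages pages.
 * Returns 1 on success, 0 if n is past the end of the table.
 */
static int
nth_segment(BlockNumber relpages, BlockNumber seg_pages, uint32_t n,
			SegmentRange *out)
{
	uint64_t	start = (uint64_t) n * seg_pages;

	assert(seg_pages > 0 && out != NULL);

	if (start >= relpages)
		return 0;

	out->startPage = (BlockNumber) start;
	if (start + seg_pages >= relpages)
		out->endPage = InvalidBlockNumber;	/* last one takes "all the rest" */
	else
		out->endPage = (BlockNumber) (start + seg_pages - 1);
	return 1;
}

/*
 * Hypothetical helper: write the ctid predicate for one segment into buf.
 * Offset 32000 is the upper bound used in the patch; there is no (*,0) tuple.
 */
static void
segment_predicate(char *buf, size_t buflen, const SegmentRange *seg)
{
	assert(buf != NULL && buflen > 0 && seg != NULL);

	if (seg->endPage == InvalidBlockNumber)
		snprintf(buf, buflen, "ctid >= '(%u,1)'",
				 (unsigned) seg->startPage);
	else if (seg->startPage == 0)
		snprintf(buf, buflen, "ctid <= '(%u,32000)'",
				 (unsigned) seg->endPage);
	else
		snprintf(buf, buflen, "ctid BETWEEN '(%u,1)' AND '(%u,32000)'",
				 (unsigned) seg->startPage, (unsigned) seg->endPage);
}
```

For a 12-page table and a segment size of 5, this yields three
segments: pages 0 to 4, pages 5 to 9, and an open-ended one starting at
page 10. The sketch tests the open-ended case first, so a range with
startPage 0 and no end page would get a >= predicate rather than <=.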

I left in the boolean to indicate if this is a full table or a chunk
(was named chunking, now is_segment).

An alternative would be to use an expression like (td->startPage != 0
|| td->endPage != InvalidBlockNumber) whenever td->is_segment is
needed.

If you insist on not having a separate structure member we could turn
this into something like this

#define is_segment(td) ((td->startPage != 0 || td->endPage != InvalidBlockNumber))

and then use is_segment(td) instead of td->is_segment where needed.


Attachments:

  [application/x-patch] v9-0001-changed-flag-name-to-max-table-segment-pages.patch (19.3K, 2-v9-0001-changed-flag-name-to-max-table-segment-pages.patch)
From 11bd8c953968299e62d32ec4ca648077ea8fd8c9 Mon Sep 17 00:00:00 2001
From: Hannu Krosing <[email protected]>
Date: Wed, 21 Jan 2026 13:45:46 +0100
Subject: [PATCH v9] * changed flag name to max-table-segment-pages * added
 check for amname = "heap" * added simple chunked dump and restore test *
 switched to using
 pg_relation_size(c.oid)/current_setting('block_size')::int when
 --max-table-segment-pages is set * added documentation * added
 option_parse_uint32(...) to be used for full range of page numbers

* TESTS: added a WARNING with count and table data hash to source and chunked restore database
---
 doc/src/sgml/ref/pg_dump.sgml             |  24 +++
 src/bin/pg_dump/pg_backup.h               |   2 +
 src/bin/pg_dump/pg_backup_archiver.c      |   2 +
 src/bin/pg_dump/pg_dump.c                 | 170 +++++++++++++++++-----
 src/bin/pg_dump/pg_dump.h                 |   7 +
 src/bin/pg_dump/t/004_pg_dump_parallel.pl |  52 +++++++
 src/fe_utils/option_utils.c               |  54 +++++++
 src/include/fe_utils/option_utils.h       |   3 +
 8 files changed, 279 insertions(+), 35 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 688e23c0e90..1811c67d141 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1088,6 +1088,30 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--max-table-segment-pages=<replaceable class="parameter">npages</replaceable></option></term>
+      <listitem>
+       <para>
+        Dump data in segments based on number of pages in the main relation.
+        If the number of data pages in the relation is more than <replaceable class="parameter">npages</replaceable> 
+        the data is split into segments based on that number of pages.
+        Individual segments can be dumped in parallel.
+       </para>
+
+       <note>
+        <para>
+         The option <option>--max-table-segment-pages</option> is applied to only pages
+         in the main heap and if the table has a large TOASTed part this has to be
+         taken into account when deciding on the number of pages to use.
+         In the extreme case a single 8kB heap page can have ~200 toast pointers each 
+         corresponding to 1GB of data. If this data is also non-compressible then a 
+         single-page segment can dump as a 200GB file.
+        </para>
+       </note>
+
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--no-comments</option></term>
       <listitem>
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index d9041dad720..b63ae05d895 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -27,6 +27,7 @@
 #include "common/file_utils.h"
 #include "fe_utils/simple_list.h"
 #include "libpq-fe.h"
+#include "storage/block.h"
 
 
 typedef enum trivalue
@@ -178,6 +179,7 @@ typedef struct _dumpOptions
 	bool		aclsSkip;
 	const char *lockWaitTimeout;
 	int			dump_inserts;	/* 0 = COPY, otherwise rows per INSERT */
+	BlockNumber	max_table_segment_pages; /* chunk when relpages is above this */
 
 	/* flags for various command-line long options */
 	int			disable_dollar_quoting;
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 4a63f7392ae..ed1913d66bc 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -44,6 +44,7 @@
 #include "pg_backup_archiver.h"
 #include "pg_backup_db.h"
 #include "pg_backup_utils.h"
+#include "storage/block.h"
 
 #define TEXT_DUMP_HEADER "--\n-- PostgreSQL database dump\n--\n\n"
 #define TEXT_DUMPALL_HEADER "--\n-- PostgreSQL database cluster dump\n--\n\n"
@@ -154,6 +155,7 @@ InitDumpOptions(DumpOptions *opts)
 	opts->dumpSchema = true;
 	opts->dumpData = true;
 	opts->dumpStatistics = false;
+	opts->max_table_segment_pages = InvalidBlockNumber;
 }
 
 /*
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 687dc98e46d..ca7e9c5eeba 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -539,6 +539,7 @@ main(int argc, char **argv)
 		{"exclude-extension", required_argument, NULL, 17},
 		{"sequence-data", no_argument, &dopt.sequence_data, 1},
 		{"restrict-key", required_argument, NULL, 25},
+		{"max-table-segment-pages", required_argument, NULL, 26},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -803,6 +804,13 @@ main(int argc, char **argv)
 				dopt.restrict_key = pg_strdup(optarg);
 				break;
 
+			case 26:
+				if (!option_parse_uint32(optarg, "--max-table-segment-pages", 1, MaxBlockNumber,
+									  &dopt.max_table_segment_pages))
+					exit_nicely(1);
+				pg_log_warning("CHUNKING: set dopt.max_table_segment_pages to [%u]",(BlockNumber) dopt.max_table_segment_pages);
+				break;
+
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -1372,6 +1380,9 @@ help(const char *progname)
 	printf(_("  --extra-float-digits=NUM     override default setting for extra_float_digits\n"));
 	printf(_("  --filter=FILENAME            include or exclude objects and data from dump\n"
 			 "                               based on expressions in FILENAME\n"));
+	printf(_("  --max-table-segment-pages=NUMPAGES\n"
+		     "                               Number of main table pages above which data is \n"
+			 "                               copied out in chunks, also determines the chunk size\n"));
 	printf(_("  --if-exists                  use IF EXISTS when dropping objects\n"));
 	printf(_("  --include-foreign-data=PATTERN\n"
 			 "                               include data of foreign tables on foreign\n"
@@ -2412,7 +2423,7 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 	 * a filter condition was specified.  For other cases a simple COPY
 	 * suffices.
 	 */
-	if (tdinfo->filtercond || tbinfo->relkind == RELKIND_FOREIGN_TABLE)
+	if (tdinfo->filtercond || tdinfo->is_segment || tbinfo->relkind == RELKIND_FOREIGN_TABLE)
 	{
 		/* Temporary allows to access to foreign tables to dump data */
 		if (tbinfo->relkind == RELKIND_FOREIGN_TABLE)
@@ -2428,9 +2439,25 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 		else
 			appendPQExpBufferStr(q, "* ");
 
-		appendPQExpBuffer(q, "FROM %s %s) TO stdout;",
+		appendPQExpBuffer(q, "FROM %s %s",
 						  fmtQualifiedDumpable(tbinfo),
 						  tdinfo->filtercond ? tdinfo->filtercond : "");
+		if (tdinfo->is_segment)
+		{
+			appendPQExpBufferStr(q, tdinfo->filtercond?" AND ":" WHERE ");
+			if(tdinfo->startPage == 0)
+				appendPQExpBuffer(q, "ctid <= '(%u,32000)'", /* there is no (*,0) tuple */
+								 tdinfo->endPage);			
+			else if(tdinfo->endPage != InvalidBlockNumber)
+				appendPQExpBuffer(q, "ctid BETWEEN '(%u,1)' AND '(%u,32000)'", /* there is no (*,0) tuple */
+								 tdinfo->startPage, tdinfo->endPage);
+			else
+				appendPQExpBuffer(q, "ctid >= '(%u,1)'", /* there is no (*,0) tuple */
+								 tdinfo->startPage);
+			pg_log_warning("CHUNKING: pages [%u:%u]",tdinfo->startPage, tdinfo->endPage);
+		}
+
+		appendPQExpBuffer(q, ") TO stdout;");
 	}
 	else
 	{
@@ -2438,6 +2465,9 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 						  fmtQualifiedDumpable(tbinfo),
 						  column_list);
 	}
+
+	pg_log_warning("CHUNKING: data query: %s", q->data);
+	
 	res = ExecuteSqlQuery(fout, q->data, PGRES_COPY_OUT);
 	PQclear(res);
 	destroyPQExpBuffer(clistBuf);
@@ -2933,41 +2963,101 @@ dumpTableData(Archive *fout, const TableDataInfo *tdinfo)
 	{
 		TocEntry   *te;
 
-		te = ArchiveEntry(fout, tdinfo->dobj.catId, tdinfo->dobj.dumpId,
-						  ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
-									   .namespace = tbinfo->dobj.namespace->dobj.name,
-									   .owner = tbinfo->rolname,
-									   .description = "TABLE DATA",
-									   .section = SECTION_DATA,
-									   .createStmt = tdDefn,
-									   .copyStmt = copyStmt,
-									   .deps = &(tbinfo->dobj.dumpId),
-									   .nDeps = 1,
-									   .dumpFn = dumpFn,
-									   .dumpArg = tdinfo));
-
-		/*
-		 * Set the TocEntry's dataLength in case we are doing a parallel dump
-		 * and want to order dump jobs by table size.  We choose to measure
-		 * dataLength in table pages (including TOAST pages) during dump, so
-		 * no scaling is needed.
-		 *
-		 * However, relpages is declared as "integer" in pg_class, and hence
-		 * also in TableInfo, but it's really BlockNumber a/k/a unsigned int.
-		 * Cast so that we get the right interpretation of table sizes
-		 * exceeding INT_MAX pages.
+		/* data chunking works off relpages, which are computed exactly using
+		 * pg_relation_size() when --max-table-segment-pages was set
+		 * 
+		 * We also don't chunk if table access method is not "heap"
+		 * TODO: we may add chunking for other access methods later, maybe 
+		 * based on primary key ranges
 		 */
-		te->dataLength = (BlockNumber) tbinfo->relpages;
-		te->dataLength += (BlockNumber) tbinfo->toastpages;
+		if ((BlockNumber) tbinfo->relpages <= dopt->max_table_segment_pages || 
+			strcmp(tbinfo->amname, "heap") != 0)
+		{
+			te = ArchiveEntry(fout, tdinfo->dobj.catId, tdinfo->dobj.dumpId,
+							ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
+										.namespace = tbinfo->dobj.namespace->dobj.name,
+										.owner = tbinfo->rolname,
+										.description = "TABLE DATA",
+										.section = SECTION_DATA,
+										.createStmt = tdDefn,
+										.copyStmt = copyStmt,
+										.deps = &(tbinfo->dobj.dumpId),
+										.nDeps = 1,
+										.dumpFn = dumpFn,
+										.dumpArg = tdinfo));
 
-		/*
-		 * If pgoff_t is only 32 bits wide, the above refinement is useless,
-		 * and instead we'd better worry about integer overflow.  Clamp to
-		 * INT_MAX if the correct result exceeds that.
-		 */
+			/*
+			 * Set the TocEntry's dataLength in case we are doing a parallel dump
+			 * and want to order dump jobs by table size.  We choose to measure
+			 * dataLength in table pages (including TOAST pages) during dump, so
+			 * no scaling is needed.
+			 *
+			 * However, relpages is declared as "integer" in pg_class, and hence
+			 * also in TableInfo, but it's really BlockNumber a/k/a unsigned int.
+			 * Cast so that we get the right interpretation of table sizes
+			 * exceeding INT_MAX pages.
+			 */
+			te->dataLength = (BlockNumber) tbinfo->relpages;
+			te->dataLength += (BlockNumber) tbinfo->toastpages;
+		}
+		else
+		{
+			BlockNumber current_chunk_start = 0;
+			PQExpBuffer chunk_desc = createPQExpBuffer();
+			
+			pg_log_warning("CHUNKING: toc for chunked relpages [%u]",(BlockNumber) tbinfo->relpages);
+
+			while (current_chunk_start < (BlockNumber) tbinfo->relpages)
+			{
+				TableDataInfo *chunk_tdinfo = (TableDataInfo *) pg_malloc(sizeof(TableDataInfo));
+
+				memcpy(chunk_tdinfo, tdinfo, sizeof(TableDataInfo));
+				AssignDumpId(&chunk_tdinfo->dobj);
+				//addObjectDependency(&chunk_tdinfo->dobj, tbinfo->dobj.dumpId); /* do we need this here */
+				chunk_tdinfo->is_segment = true;
+				chunk_tdinfo->startPage = current_chunk_start;
+				chunk_tdinfo->endPage = current_chunk_start + dopt->max_table_segment_pages - 1;
+
+				pg_log_warning("CHUNKING: toc for pages [%u:%u]",chunk_tdinfo->startPage, chunk_tdinfo->endPage);
+				
+				current_chunk_start += dopt->max_table_segment_pages;
+				if (current_chunk_start >= (BlockNumber) tbinfo->relpages)
+					chunk_tdinfo->endPage = InvalidBlockNumber; /* last chunk is for "all the rest" */
+
+				printfPQExpBuffer(chunk_desc, "TABLE DATA (pages %u:%u)", chunk_tdinfo->startPage, chunk_tdinfo->endPage);
+
+				te = ArchiveEntry(fout, chunk_tdinfo->dobj.catId, chunk_tdinfo->dobj.dumpId,
+							ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
+										.namespace = tbinfo->dobj.namespace->dobj.name,
+										.owner = tbinfo->rolname,
+										.description = chunk_desc->data,
+										.section = SECTION_DATA,
+										.createStmt = tdDefn,
+										.copyStmt = copyStmt,
+										.deps = &(tbinfo->dobj.dumpId),
+										.nDeps = 1,
+										.dumpFn = dumpFn,
+										.dumpArg = chunk_tdinfo));
+
+				if(chunk_tdinfo->endPage == InvalidBlockNumber)
+					te->dataLength = (BlockNumber) tbinfo->relpages - chunk_tdinfo->startPage;
+				else
+					te->dataLength = dopt->max_table_segment_pages;
+				/* let's assume toast pages distribute evenly among chunks */
+				if(tbinfo->relpages)
+					te->dataLength += te->dataLength * tbinfo->toastpages / tbinfo->relpages;
+			}
+
+			destroyPQExpBuffer(chunk_desc);
+		}
+	   /*
+		* If pgoff_t is only 32 bits wide, the above refinement is useless,
+		* and instead we'd better worry about integer overflow.  Clamp to
+		* INT_MAX if the correct result exceeds that.
+		*/
 		if (sizeof(te->dataLength) == 4 &&
 			(tbinfo->relpages < 0 || tbinfo->toastpages < 0 ||
-			 te->dataLength < 0))
+			te->dataLength < 0))
 			te->dataLength = INT_MAX;
 	}
 
@@ -3092,6 +3182,9 @@ makeTableDataInfo(DumpOptions *dopt, TableInfo *tbinfo)
 	tdinfo->dobj.namespace = tbinfo->dobj.namespace;
 	tdinfo->tdtable = tbinfo;
 	tdinfo->filtercond = NULL;	/* might get set later */
+	tdinfo->is_segment = false; /* we could use (tdinfo->startPage != 0 || tdinfo->endPage != InvalidBlockNumber) */
+	tdinfo->startPage = 0;
+	tdinfo->endPage = InvalidBlockNumber;
 	addObjectDependency(&tdinfo->dobj, tbinfo->dobj.dumpId);
 
 	/* A TableDataInfo contains data, of course */
@@ -7254,8 +7347,15 @@ getTables(Archive *fout, int *numTables)
 						 "c.relnamespace, c.relkind, c.reltype, "
 						 "c.relowner, "
 						 "c.relchecks, "
-						 "c.relhasindex, c.relhasrules, c.relpages, "
-						 "c.reltuples, c.relallvisible, ");
+						 "c.relhasindex, c.relhasrules, ");
+
+	/* fetch current relation size if chunking is requested */
+	if(dopt->max_table_segment_pages != InvalidBlockNumber)
+		appendPQExpBufferStr(query, "pg_relation_size(c.oid)/current_setting('block_size')::int AS relpages, ");
+	else
+		appendPQExpBufferStr(query, "c.relpages, ");
+
+	appendPQExpBufferStr(query, "c.reltuples, c.relallvisible, ");
 
 	if (fout->remoteVersion >= 180000)
 		appendPQExpBufferStr(query, "c.relallfrozen, ");
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 4c4b14e5fc7..e362253d4d5 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -16,6 +16,7 @@
 
 #include "pg_backup.h"
 #include "catalog/pg_publication_d.h"
+#include "storage/block.h"
 
 
 #define oidcmp(x,y) ( ((x) < (y) ? -1 : ((x) > (y)) ?  1 : 0) )
@@ -413,6 +414,12 @@ typedef struct _tableDataInfo
 	DumpableObject dobj;
 	TableInfo  *tdtable;		/* link to table to dump */
 	char	   *filtercond;		/* WHERE condition to limit rows dumped */
+	bool 		is_segment;		/* true if this is a data segment.
+								 * we could use (tdinfo->startPage != 0 || 
+								 * tdinfo->endPage != InvalidBlockNumber) */
+	BlockNumber	startPage;		/* starting table page */
+	BlockNumber	endPage;		/* ending table page for page-range dump,
+	                    		 * mostly startPage+max_table_segment_pages */
 } TableDataInfo;
 
 typedef struct _indxInfo
diff --git a/src/bin/pg_dump/t/004_pg_dump_parallel.pl b/src/bin/pg_dump/t/004_pg_dump_parallel.pl
index 738f34b1c1b..88af25d2889 100644
--- a/src/bin/pg_dump/t/004_pg_dump_parallel.pl
+++ b/src/bin/pg_dump/t/004_pg_dump_parallel.pl
@@ -11,6 +11,7 @@ use Test::More;
 my $dbname1 = 'regression_src';
 my $dbname2 = 'regression_dest1';
 my $dbname3 = 'regression_dest2';
+my $dbname4 = 'regression_dest3';
 
 my $node = PostgreSQL::Test::Cluster->new('main');
 $node->init;
@@ -21,6 +22,7 @@ my $backupdir = $node->backup_dir;
 $node->run_log([ 'createdb', $dbname1 ]);
 $node->run_log([ 'createdb', $dbname2 ]);
 $node->run_log([ 'createdb', $dbname3 ]);
+$node->run_log([ 'createdb', $dbname4 ]);
 
 $node->safe_psql(
 	$dbname1,
@@ -44,6 +46,18 @@ create table tht_p1 partition of tht for values with (modulus 3, remainder 0);
 create table tht_p2 partition of tht for values with (modulus 3, remainder 1);
 create table tht_p3 partition of tht for values with (modulus 3, remainder 2);
 insert into tht select (x%10)::text::digit, x from generate_series(1,1000) x;
+
+-- raise warning so I can check in .log if data was correct
+DO \$\$
+DECLARE
+    thash_rec RECORD;
+BEGIN
+    SELECT 'tplain', count(*), sum(hashtext(t::text)) as tablehash 
+	  INTO thash_rec
+	  FROM tplain AS t;
+    RAISE WARNING 'thash: %', thash_rec;
+END;
+\$\$;
 	});
 
 $node->command_ok(
@@ -87,4 +101,42 @@ $node->command_ok(
 	],
 	'parallel restore as inserts');
 
+$node->command_ok(
+	[
+		'pg_dump',
+		'--format' => 'directory',
+		'--max-table-segment-pages' => 2,
+		'--no-sync',
+		'--jobs' => 2,
+		'--file' => "$backupdir/dump3",
+		$node->connstr($dbname1),
+	],
+	'parallel dump with chunks of two heap pages');
+
+$node->command_ok(
+	[
+		'pg_restore', '--verbose',
+		'--dbname' => $node->connstr($dbname4),
+		'--jobs' => 3,
+		"$backupdir/dump3",
+	],
+	'parallel restore with chunks of two heap pages');
+
+$node->safe_psql(
+	$dbname4,
+	qq{
+
+-- raise warning so I can check in .log if data was correct
+DO \$\$
+DECLARE
+    thash_rec RECORD;
+BEGIN
+    SELECT 'tplain', count(*), sum(hashtext(t::text)) as tablehash 
+	  INTO thash_rec
+	  FROM tplain AS t;
+    RAISE WARNING 'thash after parallel chunked restore: %', thash_rec;
+END;
+\$\$;
+	});
+
 done_testing();
diff --git a/src/fe_utils/option_utils.c b/src/fe_utils/option_utils.c
index cc483ae176c..93d58d7e1a9 100644
--- a/src/fe_utils/option_utils.c
+++ b/src/fe_utils/option_utils.c
@@ -83,6 +83,60 @@ option_parse_int(const char *optarg, const char *optname,
 	return true;
 }
 
+/*
+ * option_parse_uint32
+ *
+ * Parse integer value for an option.  If the parsing is successful, returns
+ * true and stores the result in *result if that's given; if parsing fails,
+ * returns false.
+ */
+bool
+option_parse_uint32(const char *optarg, const char *optname,
+				 uint64 min_range, uint64 max_range,
+				 uint32 *result)
+{
+	char	   *endptr;
+	uint64		val64;
+
+	/* check there is no minus sign in value because strtoull()
+	 * will silently convert negative numbers to two's complement */
+	for(endptr = optarg; *endptr != '\0'; endptr++)
+		if(*endptr == '-')
+		{
+			pg_log_error("negative value \"%s\" for option %s",
+						optarg, optname);
+			return false;
+		}
+
+	errno = 0;
+	val64 = strtoull(optarg, &endptr, 10);
+
+	/*
+	 * Skip any trailing whitespace; if anything but whitespace remains before
+	 * the terminating character, fail.
+	 */
+	while (*endptr != '\0' && isspace((unsigned char) *endptr))
+		endptr++;
+
+	if (*endptr != '\0')
+	{
+		pg_log_error("invalid value \"%s\" for option %s",
+					 optarg, optname);
+		return false;
+	}
+
+	if (errno == ERANGE || val64 < min_range || val64 > max_range)
+	{
+		pg_log_error("%s must be in range %lu..%lu",
+					 optname, min_range, max_range);
+		return false;
+	}
+
+	if (result)
+		*result = (uint32)val64;
+	return true;
+}
+
 /*
  * Provide strictly harmonized handling of the --sync-method option.
  */
diff --git a/src/include/fe_utils/option_utils.h b/src/include/fe_utils/option_utils.h
index 0db6e3b6e91..268590a18bd 100644
--- a/src/include/fe_utils/option_utils.h
+++ b/src/include/fe_utils/option_utils.h
@@ -22,6 +22,9 @@ extern void handle_help_version_opts(int argc, char *argv[],
 extern bool option_parse_int(const char *optarg, const char *optname,
 							 int min_range, int max_range,
 							 int *result);
+extern bool option_parse_uint32(const char *optarg, const char *optname,
+							 uint64 min_range, uint64 max_range,
+							 uint32 *result);
 extern bool parse_sync_method(const char *optarg,
 							  DataDirSyncMethod *sync_method);
 
-- 
2.43.0




* Re: Patch: dumping tables data in multiple chunks in pg_dump
  2026-01-13 02:27 Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
  2026-01-19 19:01 ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-19 21:15   ` Re: Patch: dumping tables data in multiple chunks in pg_dump Zsolt Parragi <[email protected]>
  2026-01-19 23:07     ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-20 06:13       ` Re: Patch: dumping tables data in multiple chunks in pg_dump Zsolt Parragi <[email protected]>
  2026-01-20 12:48         ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-21 13:05           ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
@ 2026-01-22 17:05             ` Hannu Krosing <[email protected]>
  2026-01-23 02:15               ` Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
  0 siblings, 1 reply; 24+ messages in thread

From: Hannu Krosing @ 2026-01-22 17:05 UTC
  To: David Rowley <[email protected]>; Zsolt Parragi <[email protected]>; +Cc: Ashutosh Bapat <[email protected]>; PostgreSQL Hackers <[email protected]>; Nathan Bossart <[email protected]>

Fixing all the warnings
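
For readers comparing this version with the previous one: the visible differences are in option_parse_uint32(), where the minus-sign scan now uses a separate const pointer and the uint64 range bounds are cast before being printed with %lu. The following is a minimal standalone sketch of the same parsing idea, not the patch's code. The name parse_uint32_option is hypothetical, it reports errors with fprintf() instead of pg_log_error(), and the "no digits at all" check is an addition that the patch does not have.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, loosely modelled on option_parse_uint32().
 * Returns true and stores the value in *result (if given) on success.
 * max_range must fit in 32 bits so that the final cast is lossless.
 */
static bool
parse_uint32_option(const char *arg, uint64_t min_range, uint64_t max_range,
					uint32_t *result)
{
	char	   *endptr;
	unsigned long long val;

	assert(arg != NULL && max_range <= UINT32_MAX);

	/* strtoull() accepts a leading '-' and negates the value, so reject it */
	if (strchr(arg, '-') != NULL)
	{
		fprintf(stderr, "value \"%s\" can not be negative\n", arg);
		return false;
	}

	errno = 0;
	val = strtoull(arg, &endptr, 10);

	/* addition in this sketch: fail if no digits were consumed */
	if (endptr == arg)
	{
		fprintf(stderr, "invalid value \"%s\"\n", arg);
		return false;
	}

	/* only whitespace may follow the number */
	while (*endptr != '\0' && isspace((unsigned char) *endptr))
		endptr++;
	if (*endptr != '\0')
	{
		fprintf(stderr, "invalid value \"%s\"\n", arg);
		return false;
	}

	if (errno == ERANGE || val < min_range || val > max_range)
	{
		/* PRIu64 avoids a format mismatch where uint64_t is not unsigned long */
		fprintf(stderr, "value must be in range %" PRIu64 "..%" PRIu64 "\n",
				min_range, max_range);
		return false;
	}

	if (result)
		*result = (uint32_t) val;
	return true;
}
```

Using PRIu64 here is only one way to keep the format string and argument types in agreement; the patch instead casts the bounds to long unsigned int.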


On Wed, Jan 21, 2026 at 2:05 PM Hannu Krosing <[email protected]> wrote:
>
> Please find the latest patch attached which incorporates the feedback received.
>
> * changed flag name to --max-table-segment-pages
> * added check for amname = "heap"
> * switched to using
> pg_relation_size(c.oid)/current_setting('block_size')::int when
> --max-table-segment-pages is set
> * added option_parse_uint32(...) to be used for the full range of page numbers
> * The COPY SELECTs now use <=, BETWEEN, or >= depending on the segment position
>
> * added documentation
>
> * TESTS:
>   * added simple chunked dump and restore test
>   * added a WARNING with count and table data hash to source and
> chunked restore database
>
> I left in the boolean to indicate if this is a full table or chunk
> (was named chunking, now is_segment)
>
> An alternative would be to use an expression like (td->startPage != 0
> || td->endPage != InvalidBlockNumber) whenever td->is_segment is
> needed
>
> If you insist on not having a separate structure member we could turn
> this into something like this
>
> #define is_segment(td) (((td)->startPage != 0 || (td)->endPage !=
> InvalidBlockNumber))
>
> and then use is_segment(td) instead of td->is_segment where needed.
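
As an illustration of two points from the quoted mail (the derived is_segment test, and the choice of <=, BETWEEN, or >= by segment position), here is a minimal standalone sketch. It is not the patch's code: SegmentRange and segment_predicate are hypothetical names, BlockNumber and InvalidBlockNumber are redefined locally so the example compiles on its own, and the ctid bounds simply copy the '(N,1)' and '(N,32000)' literals used in dumpTableData_copy().

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Simplified stand-in for the segment fields of TableDataInfo */
typedef struct SegmentRange
{
	BlockNumber startPage;		/* first page of the segment */
	BlockNumber endPage;		/* last page, or InvalidBlockNumber for
								 * "everything up to the end of the table" */
} SegmentRange;

/* Derived test as proposed in the mail, with the argument parenthesized */
#define is_segment(td) \
	((td)->startPage != 0 || (td)->endPage != InvalidBlockNumber)

/*
 * Write the ctid predicate for one segment into buf and return its length.
 * A full-table range produces an empty string.
 */
static int
segment_predicate(const SegmentRange *td, char *buf, size_t buflen)
{
	assert(td != NULL && buf != NULL && buflen > 0);

	if (!is_segment(td))
	{
		buf[0] = '\0';
		return 0;
	}

	/* first segment: only an upper bound is needed */
	if (td->startPage == 0)
		return snprintf(buf, buflen, "ctid <= '(%u,32000)'",
						(unsigned) td->endPage);

	/* middle segment: bounded on both sides */
	if (td->endPage != InvalidBlockNumber)
		return snprintf(buf, buflen,
						"ctid BETWEEN '(%u,1)' AND '(%u,32000)'",
						(unsigned) td->startPage, (unsigned) td->endPage);

	/* last segment: open-ended, picks up "all the rest" */
	return snprintf(buf, buflen, "ctid >= '(%u,1)'",
					(unsigned) td->startPage);
}
```

With a segment size of 2 pages, the ranges {0,1}, {2,3}, and {4,InvalidBlockNumber} would yield one predicate of each kind. Whether a derived test like this is preferable to a stored boolean is exactly the open question in the mail; the sketch only shows that the three cases can be told apart from startPage and endPage alone.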


Attachments:

  [application/x-patch] v10-0001-changed-flag-name-to-max-table-segment-pages.patch (19.4K, 2-v10-0001-changed-flag-name-to-max-table-segment-pages.patch)
From 9a77781df2140cb5676a79eea6b32934672ee36f Mon Sep 17 00:00:00 2001
From: Hannu Krosing <[email protected]>
Date: Thu, 22 Jan 2026 17:56:34 +0100
Subject: [PATCH v10] * changed flag name to max-table-segment-pages * added
 check for amname = "heap" * added simple chunked dump and restore test *
 switched to using of
 pg_relation_size(c.oid)/current_setting('block_size')::int when
 --max-table-segment-pages is set * added documentation * added
 option_parse_uint32(...) to be used for full range of pages numbers * .. and
 fixed the warnings

* TESTS: added a WARNING with count and table data hash to source and chunked restore database
---
 doc/src/sgml/ref/pg_dump.sgml             |  24 +++
 src/bin/pg_dump/pg_backup.h               |   2 +
 src/bin/pg_dump/pg_backup_archiver.c      |   2 +
 src/bin/pg_dump/pg_dump.c                 | 170 +++++++++++++++++-----
 src/bin/pg_dump/pg_dump.h                 |   7 +
 src/bin/pg_dump/t/004_pg_dump_parallel.pl |  52 +++++++
 src/fe_utils/option_utils.c               |  57 ++++++++
 src/include/fe_utils/option_utils.h       |   3 +
 8 files changed, 282 insertions(+), 35 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 688e23c0e90..1811c67d141 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1088,6 +1088,30 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--max-table-segment-pages=<replaceable class="parameter">npages</replaceable></option></term>
+      <listitem>
+       <para>
+        Dump data in segments based on the number of pages in the main relation.
+        If the number of data pages in the relation is more than <replaceable class="parameter">npages</replaceable>,
+        the data is split into segments of at most that many pages.
+        Individual segments can be dumped in parallel.
+       </para>
+
+       <note>
+        <para>
+         The option <option>--max-table-segment-pages</option> applies only to pages
+         in the main heap. If the table has a large TOASTed part, this has to be
+         taken into account when deciding on the number of pages to use.
+         In the extreme case, a single 8kB heap page can have ~200 toast pointers, each
+         corresponding to 1GB of data. If this data is also non-compressible, then a
+         single-page segment can dump as a 200GB file.
+        </para>
+       </note>
+
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--no-comments</option></term>
       <listitem>
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index d9041dad720..b63ae05d895 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -27,6 +27,7 @@
 #include "common/file_utils.h"
 #include "fe_utils/simple_list.h"
 #include "libpq-fe.h"
+#include "storage/block.h"
 
 
 typedef enum trivalue
@@ -178,6 +179,7 @@ typedef struct _dumpOptions
 	bool		aclsSkip;
 	const char *lockWaitTimeout;
 	int			dump_inserts;	/* 0 = COPY, otherwise rows per INSERT */
+	BlockNumber	max_table_segment_pages; /* chunk when relpages is above this */
 
 	/* flags for various command-line long options */
 	int			disable_dollar_quoting;
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 4a63f7392ae..ed1913d66bc 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -44,6 +44,7 @@
 #include "pg_backup_archiver.h"
 #include "pg_backup_db.h"
 #include "pg_backup_utils.h"
+#include "storage/block.h"
 
 #define TEXT_DUMP_HEADER "--\n-- PostgreSQL database dump\n--\n\n"
 #define TEXT_DUMPALL_HEADER "--\n-- PostgreSQL database cluster dump\n--\n\n"
@@ -154,6 +155,7 @@ InitDumpOptions(DumpOptions *opts)
 	opts->dumpSchema = true;
 	opts->dumpData = true;
 	opts->dumpStatistics = false;
+	opts->max_table_segment_pages = InvalidBlockNumber;
 }
 
 /*
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 687dc98e46d..ca7e9c5eeba 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -539,6 +539,7 @@ main(int argc, char **argv)
 		{"exclude-extension", required_argument, NULL, 17},
 		{"sequence-data", no_argument, &dopt.sequence_data, 1},
 		{"restrict-key", required_argument, NULL, 25},
+		{"max-table-segment-pages", required_argument, NULL, 26},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -803,6 +804,13 @@ main(int argc, char **argv)
 				dopt.restrict_key = pg_strdup(optarg);
 				break;
 
+			case 26:
+				if (!option_parse_uint32(optarg, "--max-table-segment-pages", 1, MaxBlockNumber,
+									  &dopt.max_table_segment_pages))
+					exit_nicely(1);
+				pg_log_warning("CHUNKING: set dopt.max_table_segment_pages to [%u]",(BlockNumber) dopt.max_table_segment_pages);
+				break;
+
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -1372,6 +1380,9 @@ help(const char *progname)
 	printf(_("  --extra-float-digits=NUM     override default setting for extra_float_digits\n"));
 	printf(_("  --filter=FILENAME            include or exclude objects and data from dump\n"
 			 "                               based on expressions in FILENAME\n"));
+	printf(_("  --max-table-segment-pages=NUMPAGES\n"
+		     "                               Number of main table pages above which data is \n"
+			 "                               copied out in chunks, also determines the chunk size\n"));
 	printf(_("  --if-exists                  use IF EXISTS when dropping objects\n"));
 	printf(_("  --include-foreign-data=PATTERN\n"
 			 "                               include data of foreign tables on foreign\n"
@@ -2412,7 +2423,7 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 	 * a filter condition was specified.  For other cases a simple COPY
 	 * suffices.
 	 */
-	if (tdinfo->filtercond || tbinfo->relkind == RELKIND_FOREIGN_TABLE)
+	if (tdinfo->filtercond || tdinfo->is_segment || tbinfo->relkind == RELKIND_FOREIGN_TABLE)
 	{
 		/* Temporary allows to access to foreign tables to dump data */
 		if (tbinfo->relkind == RELKIND_FOREIGN_TABLE)
@@ -2428,9 +2439,25 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 		else
 			appendPQExpBufferStr(q, "* ");
 
-		appendPQExpBuffer(q, "FROM %s %s) TO stdout;",
+		appendPQExpBuffer(q, "FROM %s %s",
 						  fmtQualifiedDumpable(tbinfo),
 						  tdinfo->filtercond ? tdinfo->filtercond : "");
+		if (tdinfo->is_segment)
+		{
+			appendPQExpBufferStr(q, tdinfo->filtercond?" AND ":" WHERE ");
+			if(tdinfo->startPage == 0)
+				appendPQExpBuffer(q, "ctid <= '(%u,32000)'", /* there is no (*,0) tuple */
+								 tdinfo->endPage);			
+			else if(tdinfo->endPage != InvalidBlockNumber)
+				appendPQExpBuffer(q, "ctid BETWEEN '(%u,1)' AND '(%u,32000)'", /* there is no (*,0) tuple */
+								 tdinfo->startPage, tdinfo->endPage);
+			else
+				appendPQExpBuffer(q, "ctid >= '(%u,1)'", /* there is no (*,0) tuple */
+								 tdinfo->startPage);
+			pg_log_warning("CHUNKING: pages [%u:%u]",tdinfo->startPage, tdinfo->endPage);
+		}
+
+		appendPQExpBuffer(q, ") TO stdout;");
 	}
 	else
 	{
@@ -2438,6 +2465,9 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 						  fmtQualifiedDumpable(tbinfo),
 						  column_list);
 	}
+
+	pg_log_warning("CHUNKING: data query: %s", q->data);
+	
 	res = ExecuteSqlQuery(fout, q->data, PGRES_COPY_OUT);
 	PQclear(res);
 	destroyPQExpBuffer(clistBuf);
@@ -2933,41 +2963,101 @@ dumpTableData(Archive *fout, const TableDataInfo *tdinfo)
 	{
 		TocEntry   *te;
 
-		te = ArchiveEntry(fout, tdinfo->dobj.catId, tdinfo->dobj.dumpId,
-						  ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
-									   .namespace = tbinfo->dobj.namespace->dobj.name,
-									   .owner = tbinfo->rolname,
-									   .description = "TABLE DATA",
-									   .section = SECTION_DATA,
-									   .createStmt = tdDefn,
-									   .copyStmt = copyStmt,
-									   .deps = &(tbinfo->dobj.dumpId),
-									   .nDeps = 1,
-									   .dumpFn = dumpFn,
-									   .dumpArg = tdinfo));
-
-		/*
-		 * Set the TocEntry's dataLength in case we are doing a parallel dump
-		 * and want to order dump jobs by table size.  We choose to measure
-		 * dataLength in table pages (including TOAST pages) during dump, so
-		 * no scaling is needed.
-		 *
-		 * However, relpages is declared as "integer" in pg_class, and hence
-		 * also in TableInfo, but it's really BlockNumber a/k/a unsigned int.
-		 * Cast so that we get the right interpretation of table sizes
-		 * exceeding INT_MAX pages.
+		/* data chunking works off relpages, which are computed exactly using
+		 * pg_relation_size() when --max-table-segment-pages was set
+		 * 
+		 * We also don't chunk if table access method is not "heap"
+		 * TODO: we may add chunking for other access methods later, maybe 
+		 * based on primary key ranges
 		 */
-		te->dataLength = (BlockNumber) tbinfo->relpages;
-		te->dataLength += (BlockNumber) tbinfo->toastpages;
+		if ((BlockNumber) tbinfo->relpages <= dopt->max_table_segment_pages || 
+			strcmp(tbinfo->amname, "heap") != 0)
+		{
+			te = ArchiveEntry(fout, tdinfo->dobj.catId, tdinfo->dobj.dumpId,
+							ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
+										.namespace = tbinfo->dobj.namespace->dobj.name,
+										.owner = tbinfo->rolname,
+										.description = "TABLE DATA",
+										.section = SECTION_DATA,
+										.createStmt = tdDefn,
+										.copyStmt = copyStmt,
+										.deps = &(tbinfo->dobj.dumpId),
+										.nDeps = 1,
+										.dumpFn = dumpFn,
+										.dumpArg = tdinfo));
 
-		/*
-		 * If pgoff_t is only 32 bits wide, the above refinement is useless,
-		 * and instead we'd better worry about integer overflow.  Clamp to
-		 * INT_MAX if the correct result exceeds that.
-		 */
+			/*
+			 * Set the TocEntry's dataLength in case we are doing a parallel dump
+			 * and want to order dump jobs by table size.  We choose to measure
+			 * dataLength in table pages (including TOAST pages) during dump, so
+			 * no scaling is needed.
+			 *
+			 * However, relpages is declared as "integer" in pg_class, and hence
+			 * also in TableInfo, but it's really BlockNumber a/k/a unsigned int.
+			 * Cast so that we get the right interpretation of table sizes
+			 * exceeding INT_MAX pages.
+			 */
+			te->dataLength = (BlockNumber) tbinfo->relpages;
+			te->dataLength += (BlockNumber) tbinfo->toastpages;
+		}
+		else
+		{
+			BlockNumber current_chunk_start = 0;
+			PQExpBuffer chunk_desc = createPQExpBuffer();
+			
+			pg_log_warning("CHUNKING: toc for chunked relpages [%u]",(BlockNumber) tbinfo->relpages);
+
+			while (current_chunk_start < (BlockNumber) tbinfo->relpages)
+			{
+				TableDataInfo *chunk_tdinfo = (TableDataInfo *) pg_malloc(sizeof(TableDataInfo));
+
+				memcpy(chunk_tdinfo, tdinfo, sizeof(TableDataInfo));
+				AssignDumpId(&chunk_tdinfo->dobj);
+				//addObjectDependency(&chunk_tdinfo->dobj, tbinfo->dobj.dumpId); /* do we need this here */
+				chunk_tdinfo->is_segment = true;
+				chunk_tdinfo->startPage = current_chunk_start;
+				chunk_tdinfo->endPage = current_chunk_start + dopt->max_table_segment_pages - 1;
+
+				pg_log_warning("CHUNKING: toc for pages [%u:%u]",chunk_tdinfo->startPage, chunk_tdinfo->endPage);
+				
+				current_chunk_start += dopt->max_table_segment_pages;
+				if (current_chunk_start >= (BlockNumber) tbinfo->relpages)
+					chunk_tdinfo->endPage = InvalidBlockNumber; /* last chunk is for "all the rest" */
+
+				printfPQExpBuffer(chunk_desc, "TABLE DATA (pages %u:%u)", chunk_tdinfo->startPage, chunk_tdinfo->endPage);
+
+				te = ArchiveEntry(fout, chunk_tdinfo->dobj.catId, chunk_tdinfo->dobj.dumpId,
+							ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
+										.namespace = tbinfo->dobj.namespace->dobj.name,
+										.owner = tbinfo->rolname,
+										.description = chunk_desc->data,
+										.section = SECTION_DATA,
+										.createStmt = tdDefn,
+										.copyStmt = copyStmt,
+										.deps = &(tbinfo->dobj.dumpId),
+										.nDeps = 1,
+										.dumpFn = dumpFn,
+										.dumpArg = chunk_tdinfo));
+
+				if(chunk_tdinfo->endPage == InvalidBlockNumber)
+					te->dataLength = (BlockNumber) tbinfo->relpages - chunk_tdinfo->startPage;
+				else
+					te->dataLength = dopt->max_table_segment_pages;
+				/* let's assume toast pages distribute evenly among chunks */
+				if(tbinfo->relpages)
+					te->dataLength += te->dataLength * tbinfo->toastpages / tbinfo->relpages;
+			}
+
+			destroyPQExpBuffer(chunk_desc);
+		}
+	   /*
+		* If pgoff_t is only 32 bits wide, the above refinement is useless,
+		* and instead we'd better worry about integer overflow.  Clamp to
+		* INT_MAX if the correct result exceeds that.
+		*/
 		if (sizeof(te->dataLength) == 4 &&
 			(tbinfo->relpages < 0 || tbinfo->toastpages < 0 ||
-			 te->dataLength < 0))
+			te->dataLength < 0))
 			te->dataLength = INT_MAX;
 	}
 
@@ -3092,6 +3182,9 @@ makeTableDataInfo(DumpOptions *dopt, TableInfo *tbinfo)
 	tdinfo->dobj.namespace = tbinfo->dobj.namespace;
 	tdinfo->tdtable = tbinfo;
 	tdinfo->filtercond = NULL;	/* might get set later */
+	tdinfo->is_segment = false; /* we could use (tdinfo->startPage != 0 || tdinfo->endPage != InvalidBlockNumber) */
+	tdinfo->startPage = 0;
+	tdinfo->endPage = InvalidBlockNumber;
 	addObjectDependency(&tdinfo->dobj, tbinfo->dobj.dumpId);
 
 	/* A TableDataInfo contains data, of course */
@@ -7254,8 +7347,15 @@ getTables(Archive *fout, int *numTables)
 						 "c.relnamespace, c.relkind, c.reltype, "
 						 "c.relowner, "
 						 "c.relchecks, "
-						 "c.relhasindex, c.relhasrules, c.relpages, "
-						 "c.reltuples, c.relallvisible, ");
+						 "c.relhasindex, c.relhasrules, ");
+
+	/* fetch current relation size if chunking is requested */
+	if(dopt->max_table_segment_pages != InvalidBlockNumber)
+		appendPQExpBufferStr(query, "pg_relation_size(c.oid)/current_setting('block_size')::int AS relpages, ");
+	else
+		appendPQExpBufferStr(query, "c.relpages, ");
+
+	appendPQExpBufferStr(query, "c.reltuples, c.relallvisible, ");
 
 	if (fout->remoteVersion >= 180000)
 		appendPQExpBufferStr(query, "c.relallfrozen, ");
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 4c4b14e5fc7..e362253d4d5 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -16,6 +16,7 @@
 
 #include "pg_backup.h"
 #include "catalog/pg_publication_d.h"
+#include "storage/block.h"
 
 
 #define oidcmp(x,y) ( ((x) < (y) ? -1 : ((x) > (y)) ?  1 : 0) )
@@ -413,6 +414,12 @@ typedef struct _tableDataInfo
 	DumpableObject dobj;
 	TableInfo  *tdtable;		/* link to table to dump */
 	char	   *filtercond;		/* WHERE condition to limit rows dumped */
+	bool 		is_segment;		/* true if this is a data segment.
+								 * we could use (tdinfo->startPage != 0 || 
+								 * tdinfo->endPage != InvalidBlockNumber) */
+	BlockNumber	startPage;		/* starting table page */
+	BlockNumber	endPage;		/* ending table page for page-range dump,
+	                    		 * mostly startPage+max_table_segment_pages */
 } TableDataInfo;
 
 typedef struct _indxInfo
diff --git a/src/bin/pg_dump/t/004_pg_dump_parallel.pl b/src/bin/pg_dump/t/004_pg_dump_parallel.pl
index 738f34b1c1b..88af25d2889 100644
--- a/src/bin/pg_dump/t/004_pg_dump_parallel.pl
+++ b/src/bin/pg_dump/t/004_pg_dump_parallel.pl
@@ -11,6 +11,7 @@ use Test::More;
 my $dbname1 = 'regression_src';
 my $dbname2 = 'regression_dest1';
 my $dbname3 = 'regression_dest2';
+my $dbname4 = 'regression_dest3';
 
 my $node = PostgreSQL::Test::Cluster->new('main');
 $node->init;
@@ -21,6 +22,7 @@ my $backupdir = $node->backup_dir;
 $node->run_log([ 'createdb', $dbname1 ]);
 $node->run_log([ 'createdb', $dbname2 ]);
 $node->run_log([ 'createdb', $dbname3 ]);
+$node->run_log([ 'createdb', $dbname4 ]);
 
 $node->safe_psql(
 	$dbname1,
@@ -44,6 +46,18 @@ create table tht_p1 partition of tht for values with (modulus 3, remainder 0);
 create table tht_p2 partition of tht for values with (modulus 3, remainder 1);
 create table tht_p3 partition of tht for values with (modulus 3, remainder 2);
 insert into tht select (x%10)::text::digit, x from generate_series(1,1000) x;
+
+-- raise warning so I can check in .log if data was correct
+DO \$\$
+DECLARE
+    thash_rec RECORD;
+BEGIN
+    SELECT 'tplain', count(*), sum(hashtext(t::text)) as tablehash 
+	  INTO thash_rec
+	  FROM tplain AS t;
+    RAISE WARNING 'thash: %', thash_rec;
+END;
+\$\$;
 	});
 
 $node->command_ok(
@@ -87,4 +101,42 @@ $node->command_ok(
 	],
 	'parallel restore as inserts');
 
+$node->command_ok(
+	[
+		'pg_dump',
+		'--format' => 'directory',
+		'--max-table-segment-pages' => 2,
+		'--no-sync',
+		'--jobs' => 2,
+		'--file' => "$backupdir/dump3",
+		$node->connstr($dbname1),
+	],
+	'parallel dump with chunks of two heap pages');
+
+$node->command_ok(
+	[
+		'pg_restore', '--verbose',
+		'--dbname' => $node->connstr($dbname4),
+		'--jobs' => 3,
+		"$backupdir/dump3",
+	],
+	'parallel restore with chunks of two heap pages');
+
+$node->safe_psql(
+	$dbname4,
+	qq{
+
+-- raise warning so I can check in .log if data was correct
+DO \$\$
+DECLARE
+    thash_rec RECORD;
+BEGIN
+    SELECT 'tplain', count(*), sum(hashtext(t::text)) as tablehash 
+	  INTO thash_rec
+	  FROM tplain AS t;
+    RAISE WARNING 'thash after parallel chunked restore: %', thash_rec;
+END;
+\$\$;
+	});
+
 done_testing();
diff --git a/src/fe_utils/option_utils.c b/src/fe_utils/option_utils.c
index cc483ae176c..958f47711aa 100644
--- a/src/fe_utils/option_utils.c
+++ b/src/fe_utils/option_utils.c
@@ -83,6 +83,63 @@ option_parse_int(const char *optarg, const char *optname,
 	return true;
 }
 
+/*
+ * option_parse_uint32
+ *
+ * Parse integer value for an option.  If the parsing is successful, returns
+ * true and stores the result in *result if that's given; if parsing fails,
+ * returns false.
+ */
+bool
+option_parse_uint32(const char *optarg, const char *optname,
+				 uint64 min_range, uint64 max_range,
+				 uint32 *result)
+{
+	const char *chkptr;
+	char	   *endptr;
+	uint64		val64;
+
+	/* check there is no minus sign in value because strtoull()
+	 * will silently convert negative numbers to two's complement */
+	for(chkptr = optarg; *chkptr != '\0'; chkptr++)
+		if(*chkptr == '-')
+		{
+			pg_log_error("value \"%s\" for option %s can not be negative",
+						optarg, optname);
+			return false;
+		}
+
+	errno = 0;
+	val64 = strtoull(optarg, &endptr, 10);
+
+	/*
+	 * Skip any trailing whitespace; if anything but whitespace remains before
+	 * the terminating character, fail.
+	 */
+	while (*endptr != '\0' && isspace((unsigned char) *endptr))
+		endptr++;
+
+	if (*endptr != '\0')
+	{
+		pg_log_error("invalid value \"%s\" for option %s",
+					 optarg, optname);
+		return false;
+	}
+
+	if (errno == ERANGE || val64 < min_range || val64 > max_range)
+	{
+		pg_log_error("%s must be in range %lu..%lu",
+					 optname, 
+					 (long unsigned int) min_range, 
+					 (long unsigned int) max_range);
+		return false;
+	}
+
+	if (result)
+		*result = (uint32)val64;
+	return true;
+}
+
 /*
  * Provide strictly harmonized handling of the --sync-method option.
  */
diff --git a/src/include/fe_utils/option_utils.h b/src/include/fe_utils/option_utils.h
index 0db6e3b6e91..268590a18bd 100644
--- a/src/include/fe_utils/option_utils.h
+++ b/src/include/fe_utils/option_utils.h
@@ -22,6 +22,9 @@ extern void handle_help_version_opts(int argc, char *argv[],
 extern bool option_parse_int(const char *optarg, const char *optname,
 							 int min_range, int max_range,
 							 int *result);
+extern bool option_parse_uint32(const char *optarg, const char *optname,
+							 uint64 min_range, uint64 max_range,
+							 uint32 *result);
 extern bool parse_sync_method(const char *optarg,
 							  DataDirSyncMethod *sync_method);
 
-- 
2.43.0




* Re: Patch: dumping tables data in multiple chunks in pg_dump
  2026-01-13 02:27 Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
  2026-01-19 19:01 ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-19 21:15   ` Re: Patch: dumping tables data in multiple chunks in pg_dump Zsolt Parragi <[email protected]>
  2026-01-19 23:07     ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-20 06:13       ` Re: Patch: dumping tables data in multiple chunks in pg_dump Zsolt Parragi <[email protected]>
  2026-01-20 12:48         ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-21 13:05           ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-22 17:05             ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
@ 2026-01-23 02:15               ` David Rowley <[email protected]>
  2026-01-27 22:43                 ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-28 21:27                 ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  0 siblings, 2 replies; 24+ messages in thread

From: David Rowley @ 2026-01-23 02:15 UTC (permalink / raw)
  To: Hannu Krosing <[email protected]>; +Cc: Zsolt Parragi <[email protected]>; Ashutosh Bapat <[email protected]>; PostgreSQL Hackers <[email protected]>; Nathan Bossart <[email protected]>

On Fri, 23 Jan 2026 at 06:05, Hannu Krosing <[email protected]> wrote:
>
> Fixing all the warnings

I think overall this needs significantly more care and precision than
what you've given it so far. For example, you have:

+    if(dopt->max_table_segment_pages != InvalidBlockNumber)
+        appendPQExpBufferStr(query,
"pg_relation_size(c.oid)/current_setting('block_size')::int AS
relpages, ");
+    else
+        appendPQExpBufferStr(query, "c.relpages, ");

Note that pg_class.relpages is "int". Later the code in master does:

tblinfo[i].relpages = atoi(PQgetvalue(res, i, i_relpages));

If you look in vacuum.c, you'll see "pgcform->relpages = (int32)
num_pages;", which means the value stored in relpages will be negative when
the table is >= 16TB (assuming 8k pages). Your pg_relation_size
expression is not going to produce an INT. It'll produce a BIGINT, per
"select pg_typeof(pg_relation_size('pg_class') /
current_setting('block_size')::int);". So the atoi() can receive a
string of digits representing an integer larger than INT_MAX in this
case. Looking at [1], I see:

"7.22.1 Numeric conversion functions 1 The functions atof, atoi, atol,
and atoll need not affect the value of the integer expression errno on
an error. If the value of the result cannot be represented, *the
behavior is undefined.*"

And testing locally, I see that my Microsoft compiler will just return
INT_MAX on overflow, whereas I see gcc does nothing to prevent
overflows and just continues to multiply by 10 regardless of what
overflows occur, which I think would just make the code work by
accident.
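
To illustrate the point above, here is a minimal sketch of a conversion that does not rely on atoi()'s overflow behaviour. It is not code from the patch: parse_block_count is a hypothetical helper name, and it assumes the value arrives as a plain string of decimal digits.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not from the patch): parse a non-negative page
 * count, which the server may send as a BIGINT string, into 32 bits.
 * Returns false instead of overflowing.
 */
static bool
parse_block_count(const char *str, uint32_t *result)
{
	char	   *endptr;
	unsigned long long val;

	assert(str != NULL && result != NULL);

	/* strtoull() would accept leading whitespace and a sign; require a digit */
	if (!isdigit((unsigned char) *str))
		return false;

	errno = 0;
	val = strtoull(str, &endptr, 10);

	/* reject out-of-range input, trailing garbage and values above 32 bits */
	if (errno == ERANGE || *endptr != '\0' || val > UINT32_MAX)
		return false;

	*result = (uint32_t) val;
	return true;
}
```

With something along these lines, a value such as "4294967296" is rejected explicitly rather than left to whatever the C library's atoi() happens to do.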

Aside from that, nothing in the documentation mentions that this is
for "heap" tables only. That should be mentioned as it'll just result
in people posting questions about why it's not working for some other
table access method. There's also not much care for white space.
You've introduced a bunch of whitespace changes unrelated to code
changes you've made, plus there's not much regard for following
project standard. For example, you commonly do "if(" and don't
consistently follow the bracing rules, e.g:

+ for(chkptr = optarg; *chkptr != '\0'; chkptr++)
+     if(*chkptr == '-')

Things like the following help convey the level of care that's gone into this:

+/*
+ * option_parse_int
+ *
+ * Parse integer value for an option.  If the parsing is successful, returns
+ * true and stores the result in *result if that's given; if parsing fails,
+ * returns false.
+ */
+bool
+option_parse_uint32(const char *optarg, const char *optname,

i.e. zero effort has gone into modifying the comments after pasting them
from option_parse_int().

Another example:

+ pg_log_error("%s musst be in range %lu..%lu",

Also, I have no comprehension of why you'd use uint64 for the valid
range when the function is for processing uint32 types in:

+bool
+option_parse_uint32(const char *optarg, const char *optname,
+ uint64 min_range, uint64 max_range,
+ uint32 *result)

In its current state, it's quite hard to take this patch seriously.
Please spend longer self-reviewing it before posting. You could
temporarily hard-code something for testing which makes at least 1
table appear to be larger than 16TB and ensure your code works. What
you have is visually broken and depends on whatever the atoi
implementation opts to do in the overflow case. These are all things
diligent committers will be testing and it's sad to see how little
effort you're putting into this. How do you expect this community to
scale with this quality level of patch submissions? You've been around
long enough and should know and do better.  Are you just expecting the
committer to fix these things for you? That work does not get done via
magic wand. Being on v10 already, I'd have expected the patch to be
far beyond proof of concept grade. If you're withholding investing
time on this until you see more community buy-in, then I'd suggest you
write that and withhold further revisions until you're happy with the
level of buy-in.

I'm also still not liking your de-normalised TableInfo representation
for "is_segment". IMO, InvalidBlockNumber should be used to represent
open bounded ranges, and if there's no chunking, then startPage and
endPage will both be InvalidBlockNumber. IMO, what you have now
needlessly allows invalid states where is_segment == true and
startPage, endPage are not set correctly. If you want to keep the code
simple, hide the complexity in a macro or an inline function. There's
just no performance reason to materialise the more complex condition
into a dedicated boolean flag.

If the quality level of this has not improved significantly by v11,
count me out.

David

[1] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1548.pdf

* Re: Patch: dumping tables data in multiple chunks in pg_dump
  2026-01-13 02:27 Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
  2026-01-19 19:01 ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-19 21:15   ` Re: Patch: dumping tables data in multiple chunks in pg_dump Zsolt Parragi <[email protected]>
  2026-01-19 23:07     ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-20 06:13       ` Re: Patch: dumping tables data in multiple chunks in pg_dump Zsolt Parragi <[email protected]>
  2026-01-20 12:48         ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-21 13:05           ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-22 17:05             ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-23 02:15               ` Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
@ 2026-01-27 22:43                 ` Hannu Krosing <[email protected]>
  2026-01-28 17:29                   ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  1 sibling, 1 reply; 24+ messages in thread

From: Hannu Krosing @ 2026-01-27 22:43 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Zsolt Parragi <[email protected]>; Ashutosh Bapat <[email protected]>; PostgreSQL Hackers <[email protected]>; Nathan Bossart <[email protected]>

Hi David

Thanks for reviewing.

Please hold back reviewing this v12 patch until I have verified that
it passes CfBot and until I have finished my testing with a 17TB table.

I would appreciate pointers or help on adding the data correctness
tests to the TAP tests as real tests.
For now I have put them as DO $$ ... $$ blocks in the parallel test .pl
file, and I check manually that the table data checksums match.
If we add them, we should add them to other places in the pg_dump tests
as well. Currently we only test that dump and restore do not fail, not
that the restored data is correct.

On Fri, Jan 23, 2026 at 3:15 AM David Rowley <[email protected]> wrote:
>
> On Fri, 23 Jan 2026 at 06:05, Hannu Krosing <[email protected]> wrote:
> >
> > Fixing all the warnings
>
> I think overall this needs significantly more care and precision than
> what you've given it so far. For example, you have:
>
> +    if(dopt->max_table_segment_pages != InvalidBlockNumber)
> +        appendPQExpBufferStr(query,
> "pg_relation_size(c.oid)/current_setting('block_size')::int AS
> relpages, ");
> +    else
> +        appendPQExpBufferStr(query, "c.relpages, ");
>
> Note that pg_class.relpages is "int". Later the code in master does:
>
> tblinfo[i].relpages = atoi(PQgetvalue(res, i, i_relpages));

I have now fixed the base issue by changing the data type of
TableInfo.relpages to BlockNumber, and also changed the way we get it
by

1. converting it to unsigned int ( c.relpages::oid ) in the query
2. reading it from the result using strtoul()

(technically it should have been enough to just use strtoul() as it
already wraps signed ints to unsigned ones, but having it converted in
the query seems cleaner)
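
The signed-to-unsigned reinterpretation described above can be sketched in isolation as follows. This is an illustration only, not code from the patch; both function names are hypothetical, and the 8kB block size used in the note below is the default assumed earlier in the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Reinterpret the signed value stored in pg_class.relpages as a page count */
static uint32_t
relpages_to_block_count(int32_t stored)
{
	return (uint32_t) stored;	/* well-defined in C: reduced modulo 2^32 */
}

/* Size in bytes of a relation with the given number of pages */
static uint64_t
relation_size_bytes(uint32_t pages, uint32_t block_size)
{
	assert(block_size > 0);
	return (uint64_t) pages * block_size;
}
```

The most negative stored value maps back to 2^31 pages, which at 8192 bytes per page is 2^44 bytes, i.e. the 16TB threshold mentioned in the review.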

This allowed removing the casts to (BlockNumber) everywhere
.relpages was used.

Functionally, the value was only ever used for ordering, and even then
loosely, which explains why patch v10 did not break anything.

I also changed the data type of TocEntry.dataLength from pgoff_t to
uint64. The current code clearly had an overflow in the case where off_t
is 32 bit and the sum of relpages from heap and toast was larger than
off_t allows.
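
As a rough sketch of the arithmetic involved (not the patch's code; the function names are hypothetical, and toastpages is taken as unsigned here for simplicity), summing in 64 bits avoids the wraparound, and a per-segment estimate can follow the patch's stated assumption that toast pages distribute evenly among segments:

```c
#include <assert.h>
#include <stdint.h>

/* Heap plus toast pages, summed in 64 bits so the result cannot wrap */
static uint64_t
table_data_length(uint32_t relpages, uint32_t toastpages)
{
	return (uint64_t) relpages + toastpages;
}

/*
 * Estimated size of one segment: its own heap pages plus a proportional
 * share of the toast pages (assumes an even toast distribution).
 */
static uint64_t
segment_data_length(uint32_t segment_pages, uint32_t relpages,
					uint32_t toastpages)
{
	uint64_t	len = segment_pages;

	assert(segment_pages <= relpages);
	if (relpages > 0)
		len += len * toastpages / relpages;
	return len;
}
```

Since the value is only used to order parallel jobs, a rough estimate like this is probably sufficient.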

> If you look in vacuum.c, you'll see "pgcform->relpages = (int32)
> num_pages;" that the value stored in relpages will be negative when
> the table is >= 16TB (assuming 8k pages). Your pg_relation_size
> expression is not going to produce an INT. It'll produce a BIGINT, per
> "select pg_typeof(pg_relation_size('pg_class') /
> current_setting('block_size')::int);". So the atoi() can receive a
> string of digits representing an integer larger than INT_MAX in this
> case. Looking at [1], I see:

As said above, this should be fixed now by using the correct type in the
struct and strtoul().
To be sure, I have now created a 17TB table and am running some tests on
it as well.
I will let you know here when done.

> "7.22.1 Numeric conversion functions 1 The functions atof, atoi, atol,
> and atoll need not affect the value of the integer expression errno on
> an error. If the value of the result cannot be represented, *the
> behavior is undefined.*"
>
> And testing locally, I see that my Microsoft compiler will just return
> INT_MAX on overflow, whereas I see gcc does nothing to prevent
> overflows and just continues to multiply by 10 regardless of what
> overflows occur, which I think would just make the code work by
> accident.

As .relpages was only ever used for ordering parallel copies, it does
work, just not optimally.

The old code has a similar overflow/wraparound in the case where off_t is
a 32 bit int and the sum of relpages from the heap and toast tables is
above INT_MAX.

I have removed the whole part where this was partially fixed for the
case when one of them was > 0x7fffffff (i.e. negative) by pinning
dataLength to INT_MAX in that case.

> Aside from that, nothing in the documentation mentions that this is
> for "heap" tables only. That should be mentioned as it'll just result
> in people posting questions about why it's not working for some other
> table access method. There's also not much care for white space.
> You've introduced a bunch of whitespace changes unrelated to code
> changes you've made, plus there's not much regard for following
> project standard. For example, you commonly do "if(" and don't
> consistently follow the bracing rules, e.g:
>
> + for(chkptr = optarg; *chkptr != '\0'; chkptr++)
> +     if(*chkptr == '-')

I assumed that it is the classical "single statement -- no braces" rule.

Do we have a writeup of our coding standards somewhere?

Now this specific case is rewritten using while(), so it should be ok.

> Things like the following help convey the level of care that's gone into this:
>
> +/*
> + * option_parse_int
> + *
> + * Parse integer value for an option.  If the parsing is successful, returns
> + * true and stores the result in *result if that's given; if parsing fails,
> + * returns false.
> + */
> +bool
> +option_parse_uint32(const char *optarg, const char *optname,
>
> i.e zero effort gone in to modify the comments after pasting them from
> option_parse_int().
>
> Another example:
>
> + pg_log_error("%s musst be in range %lu..%lu",
>
> Also, I have no comprehension of why you'd use uint64 for the valid
> range when the function is for processing uint32 types in:

The uint64 there I picked up from the referenced long unsigned usage
in pg_resetwal, after I managed to get pg_log_warning to print out -1
for format %u and did not want to debug why that happens.

I have now made all the arguments uint32.

> +bool
> +option_parse_uint32(const char *optarg, const char *optname,
> + uint64 min_range, uint64 max_range,
> + uint32 *result)
>
> In its current state, it's quite hard to take this patch seriously.
> Please spend longer self-reviewing it before posting. You could
> temporarily hard-code something for testing which makes at least 1
> table appear to be larger than 16TB and ensure your code works. What
> you have is visually broken and depends on whatever the atoi
> implementation opts to do in the overflow case. These are all things
> diligent commiters will be testing and it's sad to see how little
> effort you're putting into this. How do you expect this community to
> scale with this quality level of patch submissions? You've been around
> long enough and should know and do better.  Are you just expecting the
> committer to fix these things for you? That work does not get done via
> magic wand. Being on v10 already, I'd have expected the patch to be
> far beyond proof of concept grade. If you're withholding investing
> time on this until you see more community buy-in, then I'd suggest you
> write that and withhold further revisions until you're happy with the
> level of buy-in.

> I'm also still not liking your de-normalised TableInfo representation
> for "is_segment".
> IMO, InvalidBlockNumber should be used to represent
> open bounded ranges, and if there's no chunking, then startPage and
> endPage will both be InvalidBlockNumber.

That's what I ended up doing.

I switched to using startPage = InvalidBlockNumber to indicate that no
chunking is in effect.

This is safe because when chunking is in use I can always set the
lower bound of every segment.

Only for the last segment is endPage left as InvalidBlockNumber.
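
The scheme described above can be sketched roughly as follows. This is an illustration under assumptions, not the patch itself: segment_bounds and segment_predicate are hypothetical names, and the 32000 upper tuple offset is simply the value the v12 patch uses in its ctid conditions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/*
 * Compute the page range of segment number idx.  The start is always
 * known; the last segment keeps an open end (InvalidBlockNumber).
 */
static void
segment_bounds(BlockNumber relpages, BlockNumber max_pages, uint64_t idx,
			   BlockNumber *start, BlockNumber *end)
{
	uint64_t	first = idx * max_pages;	/* 64 bits avoids wraparound */

	assert(max_pages > 0 && first < relpages);
	*start = (BlockNumber) first;
	if (first + max_pages >= relpages)
		*end = InvalidBlockNumber;
	else
		*end = (BlockNumber) (first + max_pages - 1);
}

/* Build the ctid condition for one segment; returns snprintf()'s result */
static int
segment_predicate(char *buf, size_t buflen, BlockNumber start, BlockNumber end)
{
	if (end == InvalidBlockNumber)
		return snprintf(buf, buflen, "ctid >= '(%u,1)'",
						(unsigned int) start);
	if (start == 0)
		return snprintf(buf, buflen, "ctid <= '(%u,32000)'",
						(unsigned int) end);
	return snprintf(buf, buflen, "ctid BETWEEN '(%u,1)' AND '(%u,32000)'",
					(unsigned int) start, (unsigned int) end);
}
```

For a 250-page table with a limit of 100 pages this yields three segments: pages 0 to 99, pages 100 to 199, and an open-ended segment starting at page 200.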

> IMO, what you have now
> needlessly allows invalid states where is_segment == true and
> startPage, endPage are not set correctly. If you want to keep the code
> simple, hide the complexity in a macro or an inline function. There's
> just no performance reason to materialise the more complex condition
> into a dedicated boolean flag.


Attachments:

  [text/x-patch] v12-0001-changed-flag-name-to-max-table-segment-pages.patch (21.9K, 2-v12-0001-changed-flag-name-to-max-table-segment-pages.patch)
  download | inline diff:
From 2b76c8242fe9b1b294b5ad443612fe583e51792a Mon Sep 17 00:00:00 2001
From: Hannu Krosing <[email protected]>
Date: Tue, 27 Jan 2026 22:51:03 +0100
Subject: [PATCH v12] * changed flag name to max-table-segment-pages * added
 check for amname = "heap" * added simple chunked dump and restore test *
 changed the data type of TableInfo.relpages to BlockNumber,   * select it
 using relpages::oid to get unsigned int out   * read it in from query result
 using strtoul()   * removed a bunch of casts from .relpages to (BlockNumber) *
 changed the data type of TocEntry.dataLength to uint64   current pgoff_t
 certainly had an overflow in 32bit case when heap relpages + toast relpages >
 INT_MAX * switched to using
 pg_relation_size(c.oid)/current_setting('block_size')::int when
 --max-table-segment-pages is set * added documentation * added
 option_parse_uint32(...) to be used for full range of pages numbers

* TESTS: added a WARNING with count and table data hash to source and chunked restore database
---
 doc/src/sgml/ref/pg_dump.sgml             |  24 +++
 src/bin/pg_dump/pg_backup.h               |   2 +
 src/bin/pg_dump/pg_backup_archiver.c      |   2 +
 src/bin/pg_dump/pg_backup_archiver.h      |   2 +-
 src/bin/pg_dump/pg_dump.c                 | 169 +++++++++++++++++-----
 src/bin/pg_dump/pg_dump.h                 |  22 ++-
 src/bin/pg_dump/t/004_pg_dump_parallel.pl |  52 +++++++
 src/fe_utils/option_utils.c               |  55 +++++++
 src/include/fe_utils/option_utils.h       |   3 +
 9 files changed, 289 insertions(+), 42 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 688e23c0e90..1811c67d141 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1088,6 +1088,30 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--max-table-segment-pages=<replaceable class="parameter">npages</replaceable></option></term>
+      <listitem>
+       <para>
+        Dump data in segments based on number of pages in the main relation.
+        If the number of data pages in the relation is more than <replaceable class="parameter">npages</replaceable> 
+        the data is split into segments based on that number of pages.
+        Individual segments can be dumped in parallel.
+       </para>
+
+       <note>
+        <para>
+         The option <option>--max-table-segment-pages</option> applies only to pages
+         in the main heap. If the table has a large TOASTed part, this has to be
+         taken into account when deciding on the number of pages to use.
+         In the extreme case a single 8kB heap page can have ~200 toast pointers, each
+         corresponding to 1GB of data. If this data is also non-compressible, then a
+         single-page segment can dump as a 200GB file.
+        </para>
+       </note>
+
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--no-comments</option></term>
       <listitem>
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index d9041dad720..b63ae05d895 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -27,6 +27,7 @@
 #include "common/file_utils.h"
 #include "fe_utils/simple_list.h"
 #include "libpq-fe.h"
+#include "storage/block.h"
 
 
 typedef enum trivalue
@@ -178,6 +179,7 @@ typedef struct _dumpOptions
 	bool		aclsSkip;
 	const char *lockWaitTimeout;
 	int			dump_inserts;	/* 0 = COPY, otherwise rows per INSERT */
+	BlockNumber	max_table_segment_pages; /* chunk when relpages is above this */
 
 	/* flags for various command-line long options */
 	int			disable_dollar_quoting;
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 4a63f7392ae..ed1913d66bc 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -44,6 +44,7 @@
 #include "pg_backup_archiver.h"
 #include "pg_backup_db.h"
 #include "pg_backup_utils.h"
+#include "storage/block.h"
 
 #define TEXT_DUMP_HEADER "--\n-- PostgreSQL database dump\n--\n\n"
 #define TEXT_DUMPALL_HEADER "--\n-- PostgreSQL database cluster dump\n--\n\n"
@@ -154,6 +155,7 @@ InitDumpOptions(DumpOptions *opts)
 	opts->dumpSchema = true;
 	opts->dumpData = true;
 	opts->dumpStatistics = false;
+	opts->max_table_segment_pages = InvalidBlockNumber;
 }
 
 /*
diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h
index 325b53fc9bd..b6a9f16a122 100644
--- a/src/bin/pg_dump/pg_backup_archiver.h
+++ b/src/bin/pg_dump/pg_backup_archiver.h
@@ -377,7 +377,7 @@ struct _tocEntry
 	size_t		defnLen;		/* length of dumped definition */
 
 	/* working state while dumping/restoring */
-	pgoff_t		dataLength;		/* item's data size; 0 if none or unknown */
+	uint64		dataLength;		/* item's data size; 0 if none or unknown */
 	int			reqs;			/* do we need schema and/or data of object
 								 * (REQ_* bit mask) */
 	bool		created;		/* set for DATA member if TABLE was created */
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 687dc98e46d..0badb245b55 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -539,6 +539,7 @@ main(int argc, char **argv)
 		{"exclude-extension", required_argument, NULL, 17},
 		{"sequence-data", no_argument, &dopt.sequence_data, 1},
 		{"restrict-key", required_argument, NULL, 25},
+		{"max-table-segment-pages", required_argument, NULL, 26},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -803,6 +804,13 @@ main(int argc, char **argv)
 				dopt.restrict_key = pg_strdup(optarg);
 				break;
 
+			case 26:
+				if (!option_parse_uint32(optarg, "--max-table-segment-pages", 1, MaxBlockNumber,
+									  &dopt.max_table_segment_pages))
+					exit_nicely(1);
+				pg_log_warning("CHUNKING: set dopt.max_table_segment_pages to [%u]", dopt.max_table_segment_pages);
+				break;
+
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -1372,6 +1380,9 @@ help(const char *progname)
 	printf(_("  --extra-float-digits=NUM     override default setting for extra_float_digits\n"));
 	printf(_("  --filter=FILENAME            include or exclude objects and data from dump\n"
 			 "                               based on expressions in FILENAME\n"));
+	printf(_("  --max-table-segment-pages=NUMPAGES\n"
+		     "                               Number of main table pages above which data is \n"
+			 "                               copied out in chunks, also determines the chunk size\n"));
 	printf(_("  --if-exists                  use IF EXISTS when dropping objects\n"));
 	printf(_("  --include-foreign-data=PATTERN\n"
 			 "                               include data of foreign tables on foreign\n"
@@ -2412,7 +2423,7 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 	 * a filter condition was specified.  For other cases a simple COPY
 	 * suffices.
 	 */
-	if (tdinfo->filtercond || tbinfo->relkind == RELKIND_FOREIGN_TABLE)
+	if (tdinfo->filtercond || is_segment(tdinfo) || tbinfo->relkind == RELKIND_FOREIGN_TABLE)
 	{
 		/* Temporary allows to access to foreign tables to dump data */
 		if (tbinfo->relkind == RELKIND_FOREIGN_TABLE)
@@ -2428,9 +2439,23 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 		else
 			appendPQExpBufferStr(q, "* ");
 
-		appendPQExpBuffer(q, "FROM %s %s) TO stdout;",
+		appendPQExpBuffer(q, "FROM %s %s",
 						  fmtQualifiedDumpable(tbinfo),
 						  tdinfo->filtercond ? tdinfo->filtercond : "");
+		if (is_segment(tdinfo))
+		{
+			appendPQExpBufferStr(q, tdinfo->filtercond?" AND ":" WHERE ");
+			if(tdinfo->startPage == 0)
+				appendPQExpBuffer(q, "ctid <= '(%u,32000)'", tdinfo->endPage);			
+			else if(tdinfo->endPage != InvalidBlockNumber)
+				appendPQExpBuffer(q, "ctid BETWEEN '(%u,1)' AND '(%u,32000)'",
+								 tdinfo->startPage, tdinfo->endPage);
+			else
+				appendPQExpBuffer(q, "ctid >= '(%u,1)'", tdinfo->startPage);
+			pg_log_warning("CHUNKING: pages [%u:%u]",tdinfo->startPage, tdinfo->endPage);
+		}
+
+		appendPQExpBuffer(q, ") TO stdout;");
 	}
 	else
 	{
@@ -2438,6 +2463,9 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 						  fmtQualifiedDumpable(tbinfo),
 						  column_list);
 	}
+
+	pg_log_warning("CHUNKING: data query: %s", q->data);
+	
 	res = ExecuteSqlQuery(fout, q->data, PGRES_COPY_OUT);
 	PQclear(res);
 	destroyPQExpBuffer(clistBuf);
@@ -2933,42 +2961,95 @@ dumpTableData(Archive *fout, const TableDataInfo *tdinfo)
 	{
 		TocEntry   *te;
 
-		te = ArchiveEntry(fout, tdinfo->dobj.catId, tdinfo->dobj.dumpId,
-						  ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
-									   .namespace = tbinfo->dobj.namespace->dobj.name,
-									   .owner = tbinfo->rolname,
-									   .description = "TABLE DATA",
-									   .section = SECTION_DATA,
-									   .createStmt = tdDefn,
-									   .copyStmt = copyStmt,
-									   .deps = &(tbinfo->dobj.dumpId),
-									   .nDeps = 1,
-									   .dumpFn = dumpFn,
-									   .dumpArg = tdinfo));
-
-		/*
-		 * Set the TocEntry's dataLength in case we are doing a parallel dump
-		 * and want to order dump jobs by table size.  We choose to measure
-		 * dataLength in table pages (including TOAST pages) during dump, so
-		 * no scaling is needed.
-		 *
-		 * However, relpages is declared as "integer" in pg_class, and hence
-		 * also in TableInfo, but it's really BlockNumber a/k/a unsigned int.
-		 * Cast so that we get the right interpretation of table sizes
-		 * exceeding INT_MAX pages.
+		/* Data chunking works off relpages, which is computed exactly using
+		 * pg_relation_size() when --max-table-segment-pages is set.
+		 *
+		 * We also don't chunk if the table access method is not "heap".
+		 * TODO: we may add chunking for other access methods later, maybe
+		 * based on primary key ranges.
 		 */
-		te->dataLength = (BlockNumber) tbinfo->relpages;
-		te->dataLength += (BlockNumber) tbinfo->toastpages;
+		if (tbinfo->relpages <= dopt->max_table_segment_pages || 
+			strcmp(tbinfo->amname, "heap") != 0)
+		{
+			te = ArchiveEntry(fout, tdinfo->dobj.catId, tdinfo->dobj.dumpId,
+							ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
+										.namespace = tbinfo->dobj.namespace->dobj.name,
+										.owner = tbinfo->rolname,
+										.description = "TABLE DATA",
+										.section = SECTION_DATA,
+										.createStmt = tdDefn,
+										.copyStmt = copyStmt,
+										.deps = &(tbinfo->dobj.dumpId),
+										.nDeps = 1,
+										.dumpFn = dumpFn,
+										.dumpArg = tdinfo));
 
-		/*
-		 * If pgoff_t is only 32 bits wide, the above refinement is useless,
-		 * and instead we'd better worry about integer overflow.  Clamp to
-		 * INT_MAX if the correct result exceeds that.
-		 */
-		if (sizeof(te->dataLength) == 4 &&
-			(tbinfo->relpages < 0 || tbinfo->toastpages < 0 ||
-			 te->dataLength < 0))
-			te->dataLength = INT_MAX;
+			/*
+			 * Set the TocEntry's dataLength in case we are doing a parallel dump
+			 * and want to order dump jobs by table size.  We choose to measure
+			 * dataLength in table pages (including TOAST pages) during dump, so
+			 * no scaling is needed.
+			 *
+			 * While pg_class.relpages, which stores a BlockNumber (a/k/a unsigned
+			 * int), is declared as "integer", we convert it back and store it as
+			 * BlockNumber in TableInfo.
+			 * dataLength is now uint64, so the sum does not overflow even for
+			 * 2 x UINT32_MAX.
+			 */
+			te->dataLength = tbinfo->relpages;
+			te->dataLength += tbinfo->toastpages;
+		}
+		else
+		{
+			uint64 current_chunk_start = 0;
+			PQExpBuffer chunk_desc = createPQExpBuffer();
+			
+			pg_log_warning("CHUNKING: toc for chunked relpages [%u]", tbinfo->relpages);
+
+			/* TODO - use uint 64 for current_chunk_start to avoid wraparound */
+			while (current_chunk_start < tbinfo->relpages)
+			{
+				TableDataInfo *chunk_tdinfo = (TableDataInfo *) pg_malloc(sizeof(TableDataInfo));
+
+				memcpy(chunk_tdinfo, tdinfo, sizeof(TableDataInfo));
+				AssignDumpId(&chunk_tdinfo->dobj);
+				//addObjectDependency(&chunk_tdinfo->dobj, tbinfo->dobj.dumpId); /* do we need this here */
+//				chunk_tdinfo->is_segment = true;
+				chunk_tdinfo->startPage = (BlockNumber) current_chunk_start;
+				chunk_tdinfo->endPage = chunk_tdinfo->startPage + dopt->max_table_segment_pages - 1;
+
+				pg_log_warning("CHUNKING: toc for pages [%u:%u]",chunk_tdinfo->startPage, chunk_tdinfo->endPage);
+				
+				current_chunk_start += dopt->max_table_segment_pages;
+				if (current_chunk_start >= tbinfo->relpages)
+					chunk_tdinfo->endPage = InvalidBlockNumber; /* last chunk is for "all the rest" */
+
+				printfPQExpBuffer(chunk_desc, "TABLE DATA (pages %u:%u)", chunk_tdinfo->startPage, chunk_tdinfo->endPage);
+
+				te = ArchiveEntry(fout, chunk_tdinfo->dobj.catId, chunk_tdinfo->dobj.dumpId,
+							ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
+										.namespace = tbinfo->dobj.namespace->dobj.name,
+										.owner = tbinfo->rolname,
+										.description = chunk_desc->data,
+										.section = SECTION_DATA,
+										.createStmt = tdDefn,
+										.copyStmt = copyStmt,
+										.deps = &(tbinfo->dobj.dumpId),
+										.nDeps = 1,
+										.dumpFn = dumpFn,
+										.dumpArg = chunk_tdinfo));
+
+				if(chunk_tdinfo->endPage == InvalidBlockNumber)
+					te->dataLength = tbinfo->relpages - chunk_tdinfo->startPage;
+				else
+					te->dataLength = dopt->max_table_segment_pages;
+				/* let's assume toast pages distribute evenly among chunks */
+				if(tbinfo->relpages)
+					te->dataLength += te->dataLength * tbinfo->toastpages / tbinfo->relpages;
+			}
+
+			destroyPQExpBuffer(chunk_desc);
+		}
 	}
 
 	destroyPQExpBuffer(copyBuf);
@@ -3092,6 +3173,8 @@ makeTableDataInfo(DumpOptions *dopt, TableInfo *tbinfo)
 	tdinfo->dobj.namespace = tbinfo->dobj.namespace;
 	tdinfo->tdtable = tbinfo;
 	tdinfo->filtercond = NULL;	/* might get set later */
+	tdinfo->startPage = InvalidBlockNumber; /* we use this as indication that no chunking is needed */
+	tdinfo->endPage = InvalidBlockNumber;
 	addObjectDependency(&tdinfo->dobj, tbinfo->dobj.dumpId);
 
 	/* A TableDataInfo contains data, of course */
@@ -7254,8 +7337,16 @@ getTables(Archive *fout, int *numTables)
 						 "c.relnamespace, c.relkind, c.reltype, "
 						 "c.relowner, "
 						 "c.relchecks, "
-						 "c.relhasindex, c.relhasrules, c.relpages, "
-						 "c.reltuples, c.relallvisible, ");
+						 "c.relhasindex, c.relhasrules, ");
+
+	/* fetch current relation size if chunking is requested */
+	if(dopt->max_table_segment_pages != InvalidBlockNumber)
+		appendPQExpBufferStr(query, "pg_relation_size(c.oid)/current_setting('block_size')::int AS relpages, ");
+	else
+		/* pg_class.relpages stores BlockNumber (uint32) in an int field, convert to oid to get unsigned int out */
+		appendPQExpBufferStr(query, "c.relpages::oid, ");
+
+	appendPQExpBufferStr(query, "c.reltuples, c.relallvisible, ");
 
 	if (fout->remoteVersion >= 180000)
 		appendPQExpBufferStr(query, "c.relallfrozen, ");
@@ -7495,7 +7586,7 @@ getTables(Archive *fout, int *numTables)
 		tblinfo[i].ncheck = atoi(PQgetvalue(res, i, i_relchecks));
 		tblinfo[i].hasindex = (strcmp(PQgetvalue(res, i, i_relhasindex), "t") == 0);
 		tblinfo[i].hasrules = (strcmp(PQgetvalue(res, i, i_relhasrules), "t") == 0);
-		tblinfo[i].relpages = atoi(PQgetvalue(res, i, i_relpages));
+		tblinfo[i].relpages = strtoul(PQgetvalue(res, i, i_relpages), NULL, 10);
 		if (PQgetisnull(res, i, i_toastpages))
 			tblinfo[i].toastpages = 0;
 		else
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 4c4b14e5fc7..be71661ac41 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -16,6 +16,7 @@
 
 #include "pg_backup.h"
 #include "catalog/pg_publication_d.h"
+#include "storage/block.h"
 
 
 #define oidcmp(x,y) ( ((x) < (y) ? -1 : ((x) > (y)) ?  1 : 0) )
@@ -335,7 +336,11 @@ typedef struct _tableInfo
 	Oid			owning_tab;		/* OID of table owning sequence */
 	int			owning_col;		/* attr # of column owning sequence */
 	bool		is_identity_sequence;
-	int32		relpages;		/* table's size in pages (from pg_class) */
+	BlockNumber	relpages;		/* table's size in pages (from pg_class),
+								 * converted to unsigned integer;
+								 * when --max-table-segment-pages is set
+								 * it is computed from pg_relation_size()
+								 */
 	int			toastpages;		/* toast table's size in pages, if any */
 
 	bool		interesting;	/* true if need to collect more data */
@@ -413,8 +418,21 @@ typedef struct _tableDataInfo
 	DumpableObject dobj;
 	TableInfo  *tdtable;		/* link to table to dump */
 	char	   *filtercond;		/* WHERE condition to limit rows dumped */
+	/* startPage and endPage to support segmented dump */
+	BlockNumber	startPage;		/* As we always know the lowest segment page
+								 * number we can use InvalidBlockNumber here
+								 * to recognize no segmenting case.
+								 * When 0 for the first page of first
+								 * segment we can omit in range query */
+	BlockNumber	endPage;		/* last page in segment for page-range dump,
+	                    		 * startPage+max_table_segment_pages-1 for 
+								 * most segments, but InvalidBlockNumber for
+								 * the last one to indicate open range
+								 */
 } TableDataInfo;
 
+#define is_segment(tdiptr) (tdiptr->startPage != InvalidBlockNumber)
+
 typedef struct _indxInfo
 {
 	DumpableObject dobj;
@@ -448,7 +466,7 @@ typedef struct _indexAttachInfo
 typedef struct _relStatsInfo
 {
 	DumpableObject dobj;
-	int32		relpages;
+	BlockNumber	relpages;
 	char	   *reltuples;
 	int32		relallvisible;
 	int32		relallfrozen;
diff --git a/src/bin/pg_dump/t/004_pg_dump_parallel.pl b/src/bin/pg_dump/t/004_pg_dump_parallel.pl
index 738f34b1c1b..88af25d2889 100644
--- a/src/bin/pg_dump/t/004_pg_dump_parallel.pl
+++ b/src/bin/pg_dump/t/004_pg_dump_parallel.pl
@@ -11,6 +11,7 @@ use Test::More;
 my $dbname1 = 'regression_src';
 my $dbname2 = 'regression_dest1';
 my $dbname3 = 'regression_dest2';
+my $dbname4 = 'regression_dest3';
 
 my $node = PostgreSQL::Test::Cluster->new('main');
 $node->init;
@@ -21,6 +22,7 @@ my $backupdir = $node->backup_dir;
 $node->run_log([ 'createdb', $dbname1 ]);
 $node->run_log([ 'createdb', $dbname2 ]);
 $node->run_log([ 'createdb', $dbname3 ]);
+$node->run_log([ 'createdb', $dbname4 ]);
 
 $node->safe_psql(
 	$dbname1,
@@ -44,6 +46,18 @@ create table tht_p1 partition of tht for values with (modulus 3, remainder 0);
 create table tht_p2 partition of tht for values with (modulus 3, remainder 1);
 create table tht_p3 partition of tht for values with (modulus 3, remainder 2);
 insert into tht select (x%10)::text::digit, x from generate_series(1,1000) x;
+
+-- raise warning so I can check in .log if data was correct
+DO \$\$
+DECLARE
+    thash_rec RECORD;
+BEGIN
+    SELECT 'tplain', count(*), sum(hashtext(t::text)) as tablehash 
+	  INTO thash_rec
+	  FROM tplain AS t;
+    RAISE WARNING 'thash: %', thash_rec;
+END;
+\$\$;
 	});
 
 $node->command_ok(
@@ -87,4 +101,42 @@ $node->command_ok(
 	],
 	'parallel restore as inserts');
 
+$node->command_ok(
+	[
+		'pg_dump',
+		'--format' => 'directory',
+		'--max-table-segment-pages' => 2,
+		'--no-sync',
+		'--jobs' => 2,
+		'--file' => "$backupdir/dump3",
+		$node->connstr($dbname1),
+	],
+	'parallel dump with chunks of two heap pages');
+
+$node->command_ok(
+	[
+		'pg_restore', '--verbose',
+		'--dbname' => $node->connstr($dbname4),
+		'--jobs' => 3,
+		"$backupdir/dump3",
+	],
+	'parallel restore with chunks of two heap pages');
+
+$node->safe_psql(
+	$dbname4,
+	qq{
+
+-- raise warning so I can check in .log if data was correct
+DO \$\$
+DECLARE
+    thash_rec RECORD;
+BEGIN
+    SELECT 'tplain', count(*), sum(hashtext(t::text)) as tablehash 
+	  INTO thash_rec
+	  FROM tplain AS t;
+    RAISE WARNING 'thash after parallel chunked restore: %', thash_rec;
+END;
+\$\$;
+	});
+
 done_testing();
diff --git a/src/fe_utils/option_utils.c b/src/fe_utils/option_utils.c
index cc483ae176c..aff1fbd31a3 100644
--- a/src/fe_utils/option_utils.c
+++ b/src/fe_utils/option_utils.c
@@ -83,6 +83,61 @@ option_parse_int(const char *optarg, const char *optname,
 	return true;
 }
 
+/*
+ * option_parse_uint32
+ *
+ * Parse unsigned integer value for an option.  If the parsing is successful,
+ * returns true and stores the result in *result if that's given;
+ * if parsing fails, returns false.
+ */
+bool
+option_parse_uint32(const char *optarg, const char *optname,
+				 uint32 min_range, uint32 max_range,
+				 uint32 *result)
+{
+	char	   		*endptr;
+	unsigned long	val;
+
+	/* Fail if there is a minus sign at the start of value */
+	while(isspace((unsigned char) *optarg))
+		optarg++;
+	if(*optarg == '-')
+	{
+		pg_log_error("value \"%s\" for option %s can not be negative",
+					optarg, optname);
+		return false;
+	}
+
+	errno = 0;
+	val = strtoul(optarg, &endptr, 10);
+
+	/*
+	 * Skip any trailing whitespace; if anything but whitespace remains before
+	 * the terminating character, fail.
+	 */
+	while (*endptr != '\0' && isspace((unsigned char) *endptr))
+		endptr++;
+
+	if (*endptr != '\0')
+	{
+		pg_log_error("invalid value \"%s\" for option %s",
+					 optarg, optname);
+		return false;
+	}
+
+	/* as min_range and max_range are uint32 then the range check will
+	 * catch the case where unsigned long val is outside 32 bit range */
+	if (errno == ERANGE || val < min_range || val > max_range)
+	{
+		pg_log_error("%s not in range %u..%u", optname, min_range, max_range);
+		return false;
+	}
+
+	if (result)
+		*result = (uint32) val;
+	return true;
+}
+
 /*
  * Provide strictly harmonized handling of the --sync-method option.
  */
diff --git a/src/include/fe_utils/option_utils.h b/src/include/fe_utils/option_utils.h
index 0db6e3b6e91..c74cd1fb595 100644
--- a/src/include/fe_utils/option_utils.h
+++ b/src/include/fe_utils/option_utils.h
@@ -22,6 +22,9 @@ extern void handle_help_version_opts(int argc, char *argv[],
 extern bool option_parse_int(const char *optarg, const char *optname,
 							 int min_range, int max_range,
 							 int *result);
+extern bool option_parse_uint32(const char *optarg, const char *optname,
+							 uint32 min_range, uint32 max_range,
+							 uint32 *result);
 extern bool parse_sync_method(const char *optarg,
 							  DataDirSyncMethod *sync_method);
 
-- 
2.43.0




* Re: Patch: dumping tables data in multiple chunks in pg_dump
  2026-01-13 02:27 Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
  2026-01-19 19:01 ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-19 21:15   ` Re: Patch: dumping tables data in multiple chunks in pg_dump Zsolt Parragi <[email protected]>
  2026-01-19 23:07     ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-20 06:13       ` Re: Patch: dumping tables data in multiple chunks in pg_dump Zsolt Parragi <[email protected]>
  2026-01-20 12:48         ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-21 13:05           ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-22 17:05             ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-23 02:15               ` Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
  2026-01-27 22:43                 ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
@ 2026-01-28 17:29                   ` Hannu Krosing <[email protected]>
  2026-02-12 06:13                     ` Re: Patch: dumping tables data in multiple chunks in pg_dump Dilip Kumar <[email protected]>
  0 siblings, 1 reply; 24+ messages in thread

From: Hannu Krosing @ 2026-01-28 17:29 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Zsolt Parragi <[email protected]>; Ashutosh Bapat <[email protected]>; PostgreSQL Hackers <[email protected]>; Nathan Bossart <[email protected]>

v13 has added a proper test comparing the original and restored table data.



On Tue, Jan 27, 2026 at 11:43 PM Hannu Krosing <[email protected]> wrote:
>
> Hi David
>
> Thanks for reviewing.
>
> Please hold back reviewing this v12 patch until I have verified that
> it passes CfBot and until I have finished my testing with a 17TB table.
>
> I would appreciate pointers or help on adding the data correctness
> tests to the TAP tests as real tests.
> For now I put them as DO $$ ... $$ blocks in the parallel test .pl and
> I check manually that the table data checksums match.
> If we add them, we should add them to other places in pg_dump tests as
> well. Currently we just test that dump and restore do not fail, not
> that the restored data is correct.
>
> On Fri, Jan 23, 2026 at 3:15 AM David Rowley <[email protected]> wrote:
> >
> > On Fri, 23 Jan 2026 at 06:05, Hannu Krosing <[email protected]> wrote:
> > >
> > > Fixing all the warnings
> >
> > I think overall this needs significantly more care and precision than
> > what you've given it so far. For example, you have:
> >
> > +    if(dopt->max_table_segment_pages != InvalidBlockNumber)
> > +        appendPQExpBufferStr(query,
> > "pg_relation_size(c.oid)/current_setting('block_size')::int AS
> > relpages, ");
> > +    else
> > +        appendPQExpBufferStr(query, "c.relpages, ");
> >
> > Note that pg_class.relpages is "int". Later the code in master does:
> >
> > tblinfo[i].relpages = atoi(PQgetvalue(res, i, i_relpages));
>
> I have now fixed the base issue by changing the data type of
> TableInfo.relpages to BlockNumber, and also changed the way we get it
> by
>
> 1. converting it to unsigned int ( c.relpages::oid ) in the query
> 2. reading it from the result using strtoul()
>
> (technically it should have been enough to just use strtoul() as it
> already wraps signed ints to unsigned ones, but having it converted in
> the query seems cleaner)
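
For illustration only, here is a minimal sketch of a checked conversion of
the relpages text into a BlockNumber. It is not the code in the patch, which
calls strtoul() directly. The helper name parse_relpages() is hypothetical,
and it uses strtoll() instead of strtoul() so that out-of-range input can be
rejected explicitly. It accepts both the unsigned text produced by
c.relpages::oid and the negative text a raw int4 relpages would give for
tables of 16TB and more.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t BlockNumber;

/*
 * Hypothetical helper, not part of the patch: convert the text form of
 * relpages into a BlockNumber.  Returns false on empty input, trailing
 * garbage, or a value that fits neither int32 nor uint32.
 */
static bool
parse_relpages(const char *text, BlockNumber *result)
{
	char	   *endptr;
	long long	val;

	errno = 0;
	val = strtoll(text, &endptr, 10);

	/* reject overflow, empty input and trailing non-digits */
	if (errno == ERANGE || endptr == text || *endptr != '\0')
		return false;

	/* anything outside int32 and uint32 cannot be a page count */
	if (val < INT32_MIN || val > UINT32_MAX)
		return false;

	/* a negative int4 value wraps modulo 2^32 to the unsigned count */
	*result = (BlockNumber) val;
	return true;
}
```

With this, the string "-2147483648" (how a 16TB table would show up in the
int4 column) comes out as 2147483648 pages, and "4294967296" is rejected
instead of being silently truncated.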
>
> This allowed removing casts to (BlockNumber) everywhere where
> .relpages was used.
>
> Functionally the value was only ever used for ordering, and even that
> loosely, which explains why patch v10 did not break anything.
>
> I also changed the data type of TocEntry.dataLength from pgoff_t to
> uint64. The current type clearly had an overflow in the case when off_t
> was 32 bits and the sum of relpages from heap and toast was larger than
> it allows.
>
> > If you look in vacuum.c, you'll see "pgcform->relpages = (int32)
> > num_pages;" that the value stored in relpages will be negative when
> > the table is >= 16TB (assuming 8k pages). Your pg_relation_size
> > expression is not going to produce an INT. It'll produce a BIGINT, per
> > "select pg_typeof(pg_relation_size('pg_class') /
> > current_setting('block_size')::int);". So the atoi() can receive a
> > string of digits representing an integer larger than INT_MAX in this
> > case. Looking at [1], I see:
>
> As said above, this should be fixed now by using the correct type in the
> struct and strtoul().
> To be sure, I have now created a 17TB table and am running some tests on
> this as well.
> Will let you know here when done.
>
> > "7.22.1 Numeric conversion functions 1 The functions atof, atoi, atol,
> > and atoll need not affect the value of the integer expression errno on
> > an error. If the value of the result cannot be represented, *the
> > behavior is undefined.*"
> >
> > And testing locally, I see that my Microsoft compiler will just return
> > INT_MAX on overflow, whereas I see gcc does nothing to prevent
> > overflows and just continues to multiply by 10 regardless of what
> > overflows occur, which I think would just make the code work by
> > accident.
>
> As .relpages was only ever used for ordering parallel copies, it does
> work, just not optimally.
>
> The old code has a similar overflow/wraparound for the case when off_t is
> a 32 bit int and the sum of relpages from heap and toast table is above
> INT_MAX.
>
> I have removed the whole part where this was partially fixed for the
> case when one of them was > 0x7fffffff (i.e. negative) by pinning the
> dataLength to INT_MAX in that case.
>
> > Aside from that, nothing in the documentation mentions that this is
> > for "heap" tables only. That should be mentioned as it'll just result
> > in people posting questions about why it's not working for some other
> > table access method. There's also not much care for white space.
> > You've introduced a bunch of whitespace changes unrelated to code
> > changes you've made, plus there's not much regard for following
> > project standard. For example, you commonly do "if(" and don't
> > consistently follow the bracing rules, e.g:
> >
> > + for(chkptr = optarg; *chkptr != '\0'; chkptr++)
> > +     if(*chkptr == '-')
>
> I assumed that it is the classical "single statement -- no braces".
>
> Do we have a writeup of our coding standards somewhere?
>
> Now this specific case is rewritten using while() so it should be ok.
>
> > Things like the following help convey the level of care that's gone into this:
> >
> > +/*
> > + * option_parse_int
> > + *
> > + * Parse integer value for an option.  If the parsing is successful, returns
> > + * true and stores the result in *result if that's given; if parsing fails,
> > + * returns false.
> > + */
> > +bool
> > +option_parse_uint32(const char *optarg, const char *optname,
> >
> > i.e zero effort gone in to modify the comments after pasting them from
> > option_parse_int().
> >
> > Another example:
> >
> > + pg_log_error("%s musst be in range %lu..%lu",
> >
> > Also, I have no comprehension of why you'd use uint64 for the valid
> > range when the function is for processing uint32 types in:
>
> The uint64 there I picked up from the referenced long unsigned usage
> in pg_resetval after I managed to get pg_log_warning to print out -1
> for format %u and did not want to debug why that happens.
>
> I have now made all the arguments uint32.
>
> > +bool
> > +option_parse_uint32(const char *optarg, const char *optname,
> > + uint64 min_range, uint64 max_range,
> > + uint32 *result)
> >
> > In its current state, it's quite hard to take this patch seriously.
> > Please spend longer self-reviewing it before posting. You could
> > temporarily hard-code something for testing which makes at least 1
> > table appear to be larger than 16TB and ensure your code works. What
> > you have is visually broken and depends on whatever the atoi
> > implementation opts to do in the overflow case. These are all things
> > diligent committers will be testing and it's sad to see how little
> > effort you're putting into this. How do you expect this community to
> > scale with this quality level of patch submissions? You've been around
> > long enough and should know and do better.  Are you just expecting the
> > committer to fix these things for you? That work does not get done via
> > magic wand. Being on v10 already, I'd have expected the patch to be
> > far beyond proof of concept grade. If you're withholding investing
> > time on this until you see more community buy-in, then I'd suggest you
> > write that and withhold further revisions until you're happy with the
> > level of buy-in.
>
> > I'm also still not liking your de-normalised TableInfo representation
> > for "is_segment".
> > IMO, InvalidBlockNumber should be used to represent
> > open bounded ranges, and if there's no chunking, then startPage and
> > endPage will both be InvalidBlockNumber.
>
> That's what I ended up doing.
>
> I switched to using startPage = InvalidBlockNumber to indicate that no
> chunking is in effect.
>
> This is safe because when chunking is in use I always try to set both
> chunk boundaries, and the lower bound I can always set.
>
> Only for the last segment is endPage left as InvalidBlockNumber.
>
> > IMO, what you have now
> > needlessly allows invalid states where is_segment == true and
> > startPage, endPage are not set correctly. If you want to keep the code
> > simple, hide the complexity in a macro or an inline function. There's
> > just no performance reason to materialise the more complex condition
> > into a dedicated boolean flag.


Attachments:

  [application/x-patch] v13-0001-changed-flag-name-to-max-table-segment-pages.patch (21.2K, 2-v13-0001-changed-flag-name-to-max-table-segment-pages.patch)
From e598191f7464ca2ecfa9779a823d1aa8a409cdf7 Mon Sep 17 00:00:00 2001
From: Hannu Krosing <[email protected]>
Date: Wed, 28 Jan 2026 18:24:19 +0100
Subject: [PATCH v13] * changed flag name to max-table-segment-pages * added
 check for amname = "heap" * added simple chunked dump and restore test *
 changed the data type of TableInfo.relpages to BlockNumber,   * select it
 using relpages:oid to get unsigned int out   * read it in from query result
 using strtoul()   * removed a bunch of casts from .relpages to (BlocNumber) *
 changed the data type of TocEntry.dataLength to uint64   current pgoff_t
 certainly had an overflow in 32bit case when heap relpages + toast relpages >
 INT_MAX * switched to using of
 pg_relation_size(c.oid)/current_setting('block_size')::int when
 --max-table-segment-pages is set * added documentation * added
 option_parse_uint32(...) to be used for full range of pages numbers

* TESTS: added test  to compare original and restored table contents
---
 doc/src/sgml/ref/pg_dump.sgml             |  24 +++
 src/bin/pg_dump/pg_backup.h               |   2 +
 src/bin/pg_dump/pg_backup_archiver.c      |   2 +
 src/bin/pg_dump/pg_backup_archiver.h      |   2 +-
 src/bin/pg_dump/pg_dump.c                 | 169 +++++++++++++++++-----
 src/bin/pg_dump/pg_dump.h                 |  22 ++-
 src/bin/pg_dump/t/004_pg_dump_parallel.pl |  31 ++++
 src/fe_utils/option_utils.c               |  55 +++++++
 src/include/fe_utils/option_utils.h       |   3 +
 9 files changed, 268 insertions(+), 42 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 688e23c0e90..1811c67d141 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1088,6 +1088,30 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--max-table-segment-pages=<replaceable class="parameter">npages</replaceable></option></term>
+      <listitem>
+       <para>
+        Dump data in segments based on number of pages in the main relation.
+        If the number of data pages in the relation is more than <replaceable class="parameter">npages</replaceable> 
+        the data is split into segments based on that number of pages.
+        Individual segments can be dumped in parallel.
+       </para>
+
+       <note>
+        <para>
+         The option <option>--max-table-segment-pages</option> applies only to pages
+         in the main heap. If the table has a large TOASTed part, this has to be
+         taken into account when deciding on the number of pages to use.
+         In the extreme case a single 8kB heap page can have ~200 toast pointers, each
+         corresponding to 1GB of data. If this data is also non-compressible, then a
+         single-page segment can dump as a 200GB file.
+        </para>
+       </note>
+
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--no-comments</option></term>
       <listitem>
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index d9041dad720..b63ae05d895 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -27,6 +27,7 @@
 #include "common/file_utils.h"
 #include "fe_utils/simple_list.h"
 #include "libpq-fe.h"
+#include "storage/block.h"
 
 
 typedef enum trivalue
@@ -178,6 +179,7 @@ typedef struct _dumpOptions
 	bool		aclsSkip;
 	const char *lockWaitTimeout;
 	int			dump_inserts;	/* 0 = COPY, otherwise rows per INSERT */
+	BlockNumber	max_table_segment_pages; /* chunk when relpages is above this */
 
 	/* flags for various command-line long options */
 	int			disable_dollar_quoting;
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 4a63f7392ae..ed1913d66bc 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -44,6 +44,7 @@
 #include "pg_backup_archiver.h"
 #include "pg_backup_db.h"
 #include "pg_backup_utils.h"
+#include "storage/block.h"
 
 #define TEXT_DUMP_HEADER "--\n-- PostgreSQL database dump\n--\n\n"
 #define TEXT_DUMPALL_HEADER "--\n-- PostgreSQL database cluster dump\n--\n\n"
@@ -154,6 +155,7 @@ InitDumpOptions(DumpOptions *opts)
 	opts->dumpSchema = true;
 	opts->dumpData = true;
 	opts->dumpStatistics = false;
+	opts->max_table_segment_pages = InvalidBlockNumber;
 }
 
 /*
diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h
index 325b53fc9bd..b6a9f16a122 100644
--- a/src/bin/pg_dump/pg_backup_archiver.h
+++ b/src/bin/pg_dump/pg_backup_archiver.h
@@ -377,7 +377,7 @@ struct _tocEntry
 	size_t		defnLen;		/* length of dumped definition */
 
 	/* working state while dumping/restoring */
-	pgoff_t		dataLength;		/* item's data size; 0 if none or unknown */
+	uint64		dataLength;		/* item's data size; 0 if none or unknown */
 	int			reqs;			/* do we need schema and/or data of object
 								 * (REQ_* bit mask) */
 	bool		created;		/* set for DATA member if TABLE was created */
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 687dc98e46d..0badb245b55 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -539,6 +539,7 @@ main(int argc, char **argv)
 		{"exclude-extension", required_argument, NULL, 17},
 		{"sequence-data", no_argument, &dopt.sequence_data, 1},
 		{"restrict-key", required_argument, NULL, 25},
+		{"max-table-segment-pages", required_argument, NULL, 26},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -803,6 +804,13 @@ main(int argc, char **argv)
 				dopt.restrict_key = pg_strdup(optarg);
 				break;
 
+			case 26:
+				if (!option_parse_uint32(optarg, "--max-table-segment-pages", 1, MaxBlockNumber,
+									  &dopt.max_table_segment_pages))
+					exit_nicely(1);
+				pg_log_warning("CHUNKING: set dopt.max_table_segment_pages to [%u]", dopt.max_table_segment_pages);
+				break;
+
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -1372,6 +1380,9 @@ help(const char *progname)
 	printf(_("  --extra-float-digits=NUM     override default setting for extra_float_digits\n"));
 	printf(_("  --filter=FILENAME            include or exclude objects and data from dump\n"
 			 "                               based on expressions in FILENAME\n"));
+	printf(_("  --max-table-segment-pages=NUMPAGES\n"
+		     "                               Number of main table pages above which data is \n"
+			 "                               copied out in chunks, also determines the chunk size\n"));
 	printf(_("  --if-exists                  use IF EXISTS when dropping objects\n"));
 	printf(_("  --include-foreign-data=PATTERN\n"
 			 "                               include data of foreign tables on foreign\n"
@@ -2412,7 +2423,7 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 	 * a filter condition was specified.  For other cases a simple COPY
 	 * suffices.
 	 */
-	if (tdinfo->filtercond || tbinfo->relkind == RELKIND_FOREIGN_TABLE)
+	if (tdinfo->filtercond || is_segment(tdinfo) || tbinfo->relkind == RELKIND_FOREIGN_TABLE)
 	{
 		/* Temporary allows to access to foreign tables to dump data */
 		if (tbinfo->relkind == RELKIND_FOREIGN_TABLE)
@@ -2428,9 +2439,23 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 		else
 			appendPQExpBufferStr(q, "* ");
 
-		appendPQExpBuffer(q, "FROM %s %s) TO stdout;",
+		appendPQExpBuffer(q, "FROM %s %s",
 						  fmtQualifiedDumpable(tbinfo),
 						  tdinfo->filtercond ? tdinfo->filtercond : "");
+		if (is_segment(tdinfo))
+		{
+			appendPQExpBufferStr(q, tdinfo->filtercond?" AND ":" WHERE ");
+			if(tdinfo->startPage == 0)
+				appendPQExpBuffer(q, "ctid <= '(%u,32000)'", tdinfo->endPage);			
+			else if(tdinfo->endPage != InvalidBlockNumber)
+				appendPQExpBuffer(q, "ctid BETWEEN '(%u,1)' AND '(%u,32000)'",
+								 tdinfo->startPage, tdinfo->endPage);
+			else
+				appendPQExpBuffer(q, "ctid >= '(%u,1)'", tdinfo->startPage);
+			pg_log_warning("CHUNKING: pages [%u:%u]",tdinfo->startPage, tdinfo->endPage);
+		}
+
+		appendPQExpBuffer(q, ") TO stdout;");
 	}
 	else
 	{
@@ -2438,6 +2463,9 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 						  fmtQualifiedDumpable(tbinfo),
 						  column_list);
 	}
+
+	pg_log_warning("CHUNKING: data query: %s", q->data);
+	
 	res = ExecuteSqlQuery(fout, q->data, PGRES_COPY_OUT);
 	PQclear(res);
 	destroyPQExpBuffer(clistBuf);
@@ -2933,42 +2961,95 @@ dumpTableData(Archive *fout, const TableDataInfo *tdinfo)
 	{
 		TocEntry   *te;
 
-		te = ArchiveEntry(fout, tdinfo->dobj.catId, tdinfo->dobj.dumpId,
-						  ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
-									   .namespace = tbinfo->dobj.namespace->dobj.name,
-									   .owner = tbinfo->rolname,
-									   .description = "TABLE DATA",
-									   .section = SECTION_DATA,
-									   .createStmt = tdDefn,
-									   .copyStmt = copyStmt,
-									   .deps = &(tbinfo->dobj.dumpId),
-									   .nDeps = 1,
-									   .dumpFn = dumpFn,
-									   .dumpArg = tdinfo));
-
-		/*
-		 * Set the TocEntry's dataLength in case we are doing a parallel dump
-		 * and want to order dump jobs by table size.  We choose to measure
-		 * dataLength in table pages (including TOAST pages) during dump, so
-		 * no scaling is needed.
-		 *
-		 * However, relpages is declared as "integer" in pg_class, and hence
-		 * also in TableInfo, but it's really BlockNumber a/k/a unsigned int.
-		 * Cast so that we get the right interpretation of table sizes
-		 * exceeding INT_MAX pages.
+		/* data chunking works off relpages, which are computed exactly using
+		 * pg_relation_size() when --max-table-segment-pages was set
+		 * 
+		 * We also don't chunk if table access method is not "heap"
+		 * TODO: we may add chunking for other access methods later, maybe 
+		 * based on primary key ranges
 		 */
-		te->dataLength = (BlockNumber) tbinfo->relpages;
-		te->dataLength += (BlockNumber) tbinfo->toastpages;
+		if (tbinfo->relpages <= dopt->max_table_segment_pages || 
+			strcmp(tbinfo->amname, "heap") != 0)
+		{
+			te = ArchiveEntry(fout, tdinfo->dobj.catId, tdinfo->dobj.dumpId,
+							ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
+										.namespace = tbinfo->dobj.namespace->dobj.name,
+										.owner = tbinfo->rolname,
+										.description = "TABLE DATA",
+										.section = SECTION_DATA,
+										.createStmt = tdDefn,
+										.copyStmt = copyStmt,
+										.deps = &(tbinfo->dobj.dumpId),
+										.nDeps = 1,
+										.dumpFn = dumpFn,
+										.dumpArg = tdinfo));
 
-		/*
-		 * If pgoff_t is only 32 bits wide, the above refinement is useless,
-		 * and instead we'd better worry about integer overflow.  Clamp to
-		 * INT_MAX if the correct result exceeds that.
-		 */
-		if (sizeof(te->dataLength) == 4 &&
-			(tbinfo->relpages < 0 || tbinfo->toastpages < 0 ||
-			 te->dataLength < 0))
-			te->dataLength = INT_MAX;
+			/*
+			 * Set the TocEntry's dataLength in case we are doing a parallel dump
+			 * and want to order dump jobs by table size.  We choose to measure
+			 * dataLength in table pages (including TOAST pages) during dump, so
+			 * no scaling is needed.
+			 *
+			 * pg_class.relpages stores a BlockNumber, a/k/a unsigned int, but
+			 * is declared as "integer", so we convert it back and store it as
+			 * BlockNumber in TableInfo.
+			 * dataLength is uint64, so it does not overflow even for
+			 * 2 x UINT32_MAX.
+			 */
+			te->dataLength = tbinfo->relpages;
+			te->dataLength += tbinfo->toastpages;
+		}
+		else
+		{
+			uint64 current_chunk_start = 0;
+			PQExpBuffer chunk_desc = createPQExpBuffer();
+			
+			pg_log_warning("CHUNKING: toc for chunked relpages [%u]", tbinfo->relpages);
+
+			/* TODO - use uint 64 for current_chunk_start to avoid wraparound */
+			while (current_chunk_start < tbinfo->relpages)
+			{
+				TableDataInfo *chunk_tdinfo = (TableDataInfo *) pg_malloc(sizeof(TableDataInfo));
+
+				memcpy(chunk_tdinfo, tdinfo, sizeof(TableDataInfo));
+				AssignDumpId(&chunk_tdinfo->dobj);
+				//addObjectDependency(&chunk_tdinfo->dobj, tbinfo->dobj.dumpId); /* do we need this here */
+//				chunk_tdinfo->is_segment = true;
+				chunk_tdinfo->startPage = (BlockNumber) current_chunk_start;
+				chunk_tdinfo->endPage = chunk_tdinfo->startPage + dopt->max_table_segment_pages - 1;
+
+				pg_log_warning("CHUNKING: toc for pages [%u:%u]",chunk_tdinfo->startPage, chunk_tdinfo->endPage);
+				
+				current_chunk_start += dopt->max_table_segment_pages;
+				if (current_chunk_start >= tbinfo->relpages)
+					chunk_tdinfo->endPage = InvalidBlockNumber; /* last chunk is for "all the rest" */
+
+				printfPQExpBuffer(chunk_desc, "TABLE DATA (pages %u:%u)", chunk_tdinfo->startPage, chunk_tdinfo->endPage);
+
+				te = ArchiveEntry(fout, chunk_tdinfo->dobj.catId, chunk_tdinfo->dobj.dumpId,
+							ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
+										.namespace = tbinfo->dobj.namespace->dobj.name,
+										.owner = tbinfo->rolname,
+										.description = chunk_desc->data,
+										.section = SECTION_DATA,
+										.createStmt = tdDefn,
+										.copyStmt = copyStmt,
+										.deps = &(tbinfo->dobj.dumpId),
+										.nDeps = 1,
+										.dumpFn = dumpFn,
+										.dumpArg = chunk_tdinfo));
+
+				if(chunk_tdinfo->endPage == InvalidBlockNumber)
+					te->dataLength = tbinfo->relpages - chunk_tdinfo->startPage;
+				else
+					te->dataLength = dopt->max_table_segment_pages;
+				/* let's assume toast pages distribute evenly among chunks */
+				if(tbinfo->relpages)
+					te->dataLength += te->dataLength * tbinfo->toastpages / tbinfo->relpages;
+			}
+
+			destroyPQExpBuffer(chunk_desc);
+		}
 	}
 
 	destroyPQExpBuffer(copyBuf);
@@ -3092,6 +3173,8 @@ makeTableDataInfo(DumpOptions *dopt, TableInfo *tbinfo)
 	tdinfo->dobj.namespace = tbinfo->dobj.namespace;
 	tdinfo->tdtable = tbinfo;
 	tdinfo->filtercond = NULL;	/* might get set later */
+	tdinfo->startPage = InvalidBlockNumber; /* we use this as indication that no chunking is needed */
+	tdinfo->endPage = InvalidBlockNumber;
 	addObjectDependency(&tdinfo->dobj, tbinfo->dobj.dumpId);
 
 	/* A TableDataInfo contains data, of course */
@@ -7254,8 +7337,16 @@ getTables(Archive *fout, int *numTables)
 						 "c.relnamespace, c.relkind, c.reltype, "
 						 "c.relowner, "
 						 "c.relchecks, "
-						 "c.relhasindex, c.relhasrules, c.relpages, "
-						 "c.reltuples, c.relallvisible, ");
+						 "c.relhasindex, c.relhasrules, ");
+
+	/* fetch current relation size if chunking is requested */
+	if(dopt->max_table_segment_pages != InvalidBlockNumber)
+		appendPQExpBufferStr(query, "pg_relation_size(c.oid)/current_setting('block_size')::int AS relpages, ");
+	else
+		/* pg_class.relpages stores BlockNumber (uint32) in an int field, convert to oid to get unsigned int out */
+		appendPQExpBufferStr(query, "c.relpages::oid, ");
+
+	appendPQExpBufferStr(query, "c.reltuples, c.relallvisible, ");
 
 	if (fout->remoteVersion >= 180000)
 		appendPQExpBufferStr(query, "c.relallfrozen, ");
@@ -7495,7 +7586,7 @@ getTables(Archive *fout, int *numTables)
 		tblinfo[i].ncheck = atoi(PQgetvalue(res, i, i_relchecks));
 		tblinfo[i].hasindex = (strcmp(PQgetvalue(res, i, i_relhasindex), "t") == 0);
 		tblinfo[i].hasrules = (strcmp(PQgetvalue(res, i, i_relhasrules), "t") == 0);
-		tblinfo[i].relpages = atoi(PQgetvalue(res, i, i_relpages));
+		tblinfo[i].relpages = strtoul(PQgetvalue(res, i, i_relpages), NULL, 10);
 		if (PQgetisnull(res, i, i_toastpages))
 			tblinfo[i].toastpages = 0;
 		else
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 4c4b14e5fc7..be71661ac41 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -16,6 +16,7 @@
 
 #include "pg_backup.h"
 #include "catalog/pg_publication_d.h"
+#include "storage/block.h"
 
 
 #define oidcmp(x,y) ( ((x) < (y) ? -1 : ((x) > (y)) ?  1 : 0) )
@@ -335,7 +336,11 @@ typedef struct _tableInfo
 	Oid			owning_tab;		/* OID of table owning sequence */
 	int			owning_col;		/* attr # of column owning sequence */
 	bool		is_identity_sequence;
-	int32		relpages;		/* table's size in pages (from pg_class) */
+	BlockNumber	relpages;		/* table's size in pages (from pg_class) 
+	                             * converted to unsigned integer
+								 * when --max-table-segment-pages is set
+								 * the computed from pog_relation_size()
+	                             */
 	int			toastpages;		/* toast table's size in pages, if any */
 
 	bool		interesting;	/* true if need to collect more data */
@@ -413,8 +418,21 @@ typedef struct _tableDataInfo
 	DumpableObject dobj;
 	TableInfo  *tdtable;		/* link to table to dump */
 	char	   *filtercond;		/* WHERE condition to limit rows dumped */
+	/* startPage and endPage to support segmented dump */
+	BlockNumber	startPage;		/* As we always know the lowest segment page
+								 * number we can use InvalidBlockNumber here
+								 * to recognize no segmenting case.
+								 * When 0 for the first page of first
+								 * segment we can omit in range query */
+	BlockNumber	endPage;		/* last page in segment for page-range dump,
+	                    		 * startPage+max_table_segment_pages-1 for 
+								 * most segments, but InvalidBlockNumber for
+								 * the last one to indicate open range
+								 */
 } TableDataInfo;
 
+#define is_segment(tdiptr) (tdiptr->startPage != InvalidBlockNumber)
+
 typedef struct _indxInfo
 {
 	DumpableObject dobj;
@@ -448,7 +466,7 @@ typedef struct _indexAttachInfo
 typedef struct _relStatsInfo
 {
 	DumpableObject dobj;
-	int32		relpages;
+	BlockNumber	relpages;
 	char	   *reltuples;
 	int32		relallvisible;
 	int32		relallfrozen;
diff --git a/src/bin/pg_dump/t/004_pg_dump_parallel.pl b/src/bin/pg_dump/t/004_pg_dump_parallel.pl
index 738f34b1c1b..4f35aeed9b9 100644
--- a/src/bin/pg_dump/t/004_pg_dump_parallel.pl
+++ b/src/bin/pg_dump/t/004_pg_dump_parallel.pl
@@ -11,6 +11,7 @@ use Test::More;
 my $dbname1 = 'regression_src';
 my $dbname2 = 'regression_dest1';
 my $dbname3 = 'regression_dest2';
+my $dbname4 = 'regression_dest3';
 
 my $node = PostgreSQL::Test::Cluster->new('main');
 $node->init;
@@ -21,6 +22,7 @@ my $backupdir = $node->backup_dir;
 $node->run_log([ 'createdb', $dbname1 ]);
 $node->run_log([ 'createdb', $dbname2 ]);
 $node->run_log([ 'createdb', $dbname3 ]);
+$node->run_log([ 'createdb', $dbname4 ]);
 
 $node->safe_psql(
 	$dbname1,
@@ -87,4 +89,33 @@ $node->command_ok(
 	],
 	'parallel restore as inserts');
 
+$node->command_ok(
+	[
+		'pg_dump',
+		'--format' => 'directory',
+		'--max-table-segment-pages' => 2,
+		'--no-sync',
+		'--jobs' => 2,
+		'--file' => "$backupdir/dump3",
+		$node->connstr($dbname1),
+	],
+	'parallel dump with chunks of two heap pages');
+
+$node->command_ok(
+	[
+		'pg_restore', '--verbose',
+		'--dbname' => $node->connstr($dbname4),
+		'--jobs' => 3,
+		"$backupdir/dump3",
+	],
+	'parallel restore with chunks of two heap pages');
+
+my $table = 'tplain';
+my $tablehash_query = "SELECT '$table', sum(hashtext(t::text)), count(*) FROM $table AS t";
+
+my $result_1 = $node->safe_psql($dbname1, $tablehash_query);
+my $result_4 = $node->safe_psql($dbname4, $tablehash_query);
+
+is($result_4, $result_1, "Hash check for $table: restored db ($result_4) vs original db ($result_1)");
+
 done_testing();
diff --git a/src/fe_utils/option_utils.c b/src/fe_utils/option_utils.c
index cc483ae176c..aff1fbd31a3 100644
--- a/src/fe_utils/option_utils.c
+++ b/src/fe_utils/option_utils.c
@@ -83,6 +83,61 @@ option_parse_int(const char *optarg, const char *optname,
 	return true;
 }
 
+/*
+ * option_parse_uint32
+ *
+ * Parse unsigned integer value for an option.  If the parsing is successful,
+ * returns true and stores the result in *result if that's given;
+ * if parsing fails, returns false.
+ */
+bool
+option_parse_uint32(const char *optarg, const char *optname,
+				 uint32 min_range, uint32 max_range,
+				 uint32 *result)
+{
+	char	   		*endptr;
+	unsigned long	val;
+
+	/* Fail if there is a minus sign at the start of value */
+	while(isspace((unsigned char) *optarg))
+		optarg++;
+	if(*optarg == '-')
+	{
+		pg_log_error("value \"%s\" for option %s can not be negative",
+					optarg, optname);
+		return false;
+	}
+
+	errno = 0;
+	val = strtoul(optarg, &endptr, 10);
+
+	/*
+	 * Skip any trailing whitespace; if anything but whitespace remains before
+	 * the terminating character, fail.
+	 */
+	while (*endptr != '\0' && isspace((unsigned char) *endptr))
+		endptr++;
+
+	if (*endptr != '\0')
+	{
+		pg_log_error("invalid value \"%s\" for option %s",
+					 optarg, optname);
+		return false;
+	}
+
+	/* as min_range and max_range are uint32 then the range check will
+	 * catch the case where unsigned long val is outside 32 bit range */
+	if (errno == ERANGE || val < min_range || val > max_range)
+	{
+		pg_log_error("%s not in range %u..%u", optname, min_range, max_range);
+		return false;
+	}
+
+	if (result)
+		*result = (uint32) val;
+	return true;
+}
+
 /*
  * Provide strictly harmonized handling of the --sync-method option.
  */
diff --git a/src/include/fe_utils/option_utils.h b/src/include/fe_utils/option_utils.h
index 0db6e3b6e91..c74cd1fb595 100644
--- a/src/include/fe_utils/option_utils.h
+++ b/src/include/fe_utils/option_utils.h
@@ -22,6 +22,9 @@ extern void handle_help_version_opts(int argc, char *argv[],
 extern bool option_parse_int(const char *optarg, const char *optname,
 							 int min_range, int max_range,
 							 int *result);
+extern bool option_parse_uint32(const char *optarg, const char *optname,
+							 uint32 min_range, uint32 max_range,
+							 uint32 *result);
 extern bool parse_sync_method(const char *optarg,
 							  DataDirSyncMethod *sync_method);
 
-- 
2.43.0




* Re: Patch: dumping tables data in multiple chunks in pg_dump
  2026-01-13 02:27 Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
  2026-01-19 19:01 ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-19 21:15   ` Re: Patch: dumping tables data in multiple chunks in pg_dump Zsolt Parragi <[email protected]>
  2026-01-19 23:07     ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-20 06:13       ` Re: Patch: dumping tables data in multiple chunks in pg_dump Zsolt Parragi <[email protected]>
  2026-01-20 12:48         ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-21 13:05           ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-22 17:05             ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-23 02:15               ` Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
  2026-01-27 22:43                 ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-28 17:29                   ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
@ 2026-02-12 06:13                     ` Dilip Kumar <[email protected]>
  2026-03-28 10:59                       ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  0 siblings, 1 reply; 24+ messages in thread

From: Dilip Kumar @ 2026-02-12 06:13 UTC (permalink / raw)
  To: Hannu Krosing <[email protected]>; +Cc: David Rowley <[email protected]>; Zsolt Parragi <[email protected]>; Ashutosh Bapat <[email protected]>; PostgreSQL Hackers <[email protected]>; Nathan Bossart <[email protected]>

On Wed, Jan 28, 2026 at 11:00 PM Hannu Krosing <[email protected]> wrote:
>
> v13 has added a proper test comparing original and restored table data
>
I was reviewing v13 and here are some initial comments I have

1. IMHO the commit message describes the work's progress instead of giving
a high-level idea of what the patch actually does and how.
Suggestion:

SUBJECT: Add --max-table-segment-pages option to pg_dump for parallel
table dumping.

This patch introduces the ability to split large heap tables into segments
based on a specified number of pages. These segments can then be dumped in
parallel using the existing jobs infrastructure, significantly reducing
the time required to dump very large tables.

The implementation uses ctid-based range queries (e.g., WHERE ctid >=
'(start,1)'
AND ctid <= '(end,32000)') to extract specific chunks of the relation.

<more architecture details and limitation if any>

2.
+ pg_log_warning("CHUNKING: set dopt.max_table_segment_pages to [%u]",
dopt.max_table_segment_pages);
+ break;

IMHO we don't need to place warning here while processing the input parameters

3.
+ printf(_("  --max-table-segment-pages=NUMPAGES\n"
+      "                               Number of main table pages
above which data is \n"
+ "                               copied out in chunks, also
determines the chunk size\n"));

Check the formatting: all the parameter descriptions start with a
lower-case letter, so it is better to start with "number" rather than "Number".

4.
+ if (is_segment(tdinfo))
+ {
+ appendPQExpBufferStr(q, tdinfo->filtercond?" AND ":" WHERE ");
+ if(tdinfo->startPage == 0)
+ appendPQExpBuffer(q, "ctid <= '(%u,32000)'", tdinfo->endPage);
+ else if(tdinfo->endPage != InvalidBlockNumber)
+ appendPQExpBuffer(q, "ctid BETWEEN '(%u,1)' AND '(%u,32000)'",
+ tdinfo->startPage, tdinfo->endPage);
+ else
+ appendPQExpBuffer(q, "ctid >= '(%u,1)'", tdinfo->startPage);
+ pg_log_warning("CHUNKING: pages [%u:%u]",tdinfo->startPage, tdinfo->endPage);
+ }

IMHO we should explain this chunking logic in the comment above this code block?




--
Regards,
Dilip Kumar
Google







* Re: Patch: dumping tables data in multiple chunks in pg_dump
  2026-01-13 02:27 Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
  2026-01-19 19:01 ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-19 21:15   ` Re: Patch: dumping tables data in multiple chunks in pg_dump Zsolt Parragi <[email protected]>
  2026-01-19 23:07     ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-20 06:13       ` Re: Patch: dumping tables data in multiple chunks in pg_dump Zsolt Parragi <[email protected]>
  2026-01-20 12:48         ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-21 13:05           ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-22 17:05             ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-23 02:15               ` Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
  2026-01-27 22:43                 ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-28 17:29                   ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-02-12 06:13                     ` Re: Patch: dumping tables data in multiple chunks in pg_dump Dilip Kumar <[email protected]>
@ 2026-03-28 10:59                       ` Hannu Krosing <[email protected]>
  0 siblings, 0 replies; 24+ messages in thread

From: Hannu Krosing @ 2026-03-28 10:59 UTC (permalink / raw)
  To: Zsolt Parragi <[email protected]>; Dilip Kumar <[email protected]>; +Cc: David Rowley <[email protected]>; Ashutosh Bapat <[email protected]>; PostgreSQL Hackers <[email protected]>; Nathan Bossart <[email protected]>

Hi Zsolt and Dilip,

Thanks for the review and the useful comments!

On Tue, Feb 3, 2026 at 10:10 PM Zsolt Parragi <[email protected]> wrote:
>
> Hello!
>
> I did some testing with this patch, and I think there are some issues
> during restoration:
>
> 1. Isn't there a possible race / scheduling mistake during restore
> because of missing dependencies? The code now prints out "TABLE DATA
> (pages %u:%u)", while the restore code checks for the explicit "TABLE
> DATA" string for dependency tracking (pg_backup_archiver.c:2013 and a
> few other places). This causes POST DATA to have no dependency on the
> table data, and can be scheduled before we load all table data.

I have resolved this by adding a second array to the reverse dependencies
mechanism in buildTocEntryArrays() for chunked dump where I collect arrays
of ids in AH->tableDataChunkIds.

For this I extracted the list management part from DumpableObject:

typedef struct _DependencyList
{
DumpId    *dependencies; /* dumpIds of objects this one depends on */
int nDeps; /* number of valid dependencies */
int allocDeps; /* allocated size of dependencies[] */
} DependencyList;

And I added addStandaloneDependency(), based on addObjectDependency().

I simplified it to always use realloc, as realloc can handle the NULL case:

void
addStandaloneDependency(DependencyList *dobj, DumpId refId)
{
if (dobj->nDeps >= dobj->allocDeps)
{
dobj->allocDeps = (dobj->allocDeps <= 0) ? 16 : dobj->allocDeps * 2;
dobj->dependencies = pg_realloc_array(dobj->dependencies,
  DumpId, dobj->allocDeps);
}
dobj->dependencies[dobj->nDeps++] = refId;
}

And then I use AH->tableDataChunkIds in repoint_table_dependencies() to
- replace the dependency on the table definition with a dependency on the first chunk
- add the remaining chunks at the end of the dependency list.
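
The two steps above can be sketched in isolation. This is only an
illustration, not the patch's code: repoint_to_chunks() is a hypothetical
helper, plain int stands in for DumpId, and plain realloc() stands in for
pg_realloc_array(). It overwrites the slot that pointed at the table with
the first chunk id and appends the remaining chunk ids, keeping the
dependency count in step with the array contents.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical helper: replace deps[pos] (the dependency on the table)
 * with chunks[0], then append chunks[1..nchunks-1] to the end of deps.
 * Returns the (possibly moved) array and updates *ndeps.
 */
static int *
repoint_to_chunks(int *deps, int *ndeps, int pos,
				  const int *chunks, int nchunks)
{
	assert(nchunks > 0 && pos >= 0 && pos < *ndeps);

	/* the first chunk takes over the slot that referenced the table */
	deps[pos] = chunks[0];

	if (nchunks > 1)
	{
		/* make room for the nchunks - 1 additional entries */
		int		   *tmp = realloc(deps, (*ndeps + nchunks - 1) * sizeof(int));

		if (tmp == NULL)
			abort();			/* assumption: treat out-of-memory as fatal */
		deps = tmp;

		/* append at the current end, bumping the count as we go */
		for (int j = 1; j < nchunks; j++)
			deps[(*ndeps)++] = chunks[j];
	}
	return deps;
}
```

Writing through the post-incremented count means the first appended id
lands directly after the last existing dependency, and the final count
equals the old count plus nchunks - 1, which is exactly the size that
was allocated.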

> I was able to verify the scheduling issue with an index: the INDEX
> part is scheduled too early, before all TABLE DATA completes, but then
> locking prevents it from progressing, so everything completed fine in
> the end. Even if that's guaranteed, which I'm not 100% sure of, it's
> still based on luck and not proper logic, and takes up a slot (or
> multiple), reducing parallelism.
>
> 2. Fixing the TABLE DATA strcmp checks solves the scheduling issue,
> but it's not that simple, because then it causes truncation issues
> during restore, which needs additional changes in the restore code. I
> did a quick fix for that by adding an additional condition to the
> created flag, and with that it seems to restore everything properly,
> and with proper ordering, only starting index/constraint/etc after all
> table data is completed. However this was definitely just a quick test
> fix, this needs a proper better solution.
>
> Other issues I see are more minor, but numerous:

I collect the chunk dependencies in a separate array, which
should solve the truncation issue.

Can you suggest a good check to add to the TAP tests to verify this?

> 3. The patch still has lots of debug output (pg_log_WARNING("CHUNKING
> ...")); Is this intended? Shouldn't these be behind some verbose
> check, and maybe use info instead of warning?

These were left in to ease initial reviewing. I have either removed them
or turned them into pg_log_debug().

> 4. The is_segment macro should have () around the use of tdiptr

Thanks, fixed.

> 5. There's still a 32000 magic constant, shouldn't that have some
> descriptive name / explanatory comment?

I turned this into "ctid < (pagenr+1, 0)" for clarity and
futureproofing, as it is not entirely impossible that we could have
at some point more than 32000 items per page.
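
A minimal sketch of the half-open range idea, assuming the query text is
built in a plain char buffer. append_ctid_range() and INVALID_BLOCK are
hypothetical stand-ins (the patch itself works with PQExpBuffer and
InvalidBlockNumber), and the exact clause emitted by v14 may differ:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define INVALID_BLOCK UINT32_MAX	/* stand-in for InvalidBlockNumber */

/*
 * Hypothetical helper: append a ctid range predicate to "query".
 * The upper bound is exclusive, '(end_page + 1, 0)', so no assumption
 * about the maximum number of items per page is needed.  An end_page of
 * INVALID_BLOCK means the last segment, which is left open-ended.
 */
static void
append_ctid_range(char *query, size_t querylen, int has_filter,
				  uint32_t start_page, uint32_t end_page)
{
	size_t		used = strlen(query);
	const char *glue = has_filter ? " AND " : " WHERE ";

	assert(used < querylen);
	assert(start_page != INVALID_BLOCK);

	if (end_page == INVALID_BLOCK)
		snprintf(query + used, querylen - used,
				 "%sctid >= '(%u,1)'", glue, (unsigned) start_page);
	else
	{
		assert(end_page >= start_page);
		snprintf(query + used, querylen - used,
				 "%sctid >= '(%u,1)' AND ctid < '(%u,0)'", glue,
				 (unsigned) start_page, (unsigned) (end_page + 1));
	}
}
```

For pages 100 to 199 this appends ctid >= '(100,1)' AND ctid < '(200,0)',
so adjacent segments meet exactly at a page boundary without overlapping.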

> 6. formatting issues at multiple places, mostly missing spaces after
> if/while/for statements

My hope was that the pre-release automatic formatting run would take care
of this.

I will eyeball the patch to see if I can find them, but I don't think I
have a good way to detect them all.

Suggestions very much welcome!

> 7. inconsistent error messages (not in range vs must be in range)

> 8. There's a remaining TODO that seems stale, current_chunk_start is
> already uint64

Removed.

> 9. typo: "the computed from pog_relation_size" -> "then computed from
> pg_relation_size"

Fixed.

On Thu, Feb 12, 2026 at 7:13 AM Dilip Kumar <[email protected]> wrote:
>
> On Wed, Jan 28, 2026 at 11:00 PM Hannu Krosing <[email protected]> wrote:
> >
> > v13 has added a proper test comparing original and restored table data
> >
> I was reviewing v13 and here are some initial comments I have
>
> 1. IMHO the commit message describes the work's progress instead of giving
> a high-level idea of what the patch actually does and how.
> Suggestion:
>
> SUBJECT: Add --max-table-segment-pages option to pg_dump for parallel
> table dumping.
>
> This patch introduces the ability to split large heap tables into segments
> based on a specified number of pages. These segments can then be dumped in
> parallel using the existing jobs infrastructure, significantly reducing
> the time required to dump very large tables.
>
> The implementation uses ctid-based range queries (e.g., WHERE ctid >=
> '(start,1)'
> AND ctid <= '(end,32000)') to extract specific chunks of the relation.
>
> <more architecture details and limitation if any>

SUBJECT: Add --max-table-segment-pages option to pg_dump for parallel
table dumping.

This patch introduces the ability to split large heap tables into segments
based on a specified number of pages. These segments can then be dumped in
parallel using the existing jobs infrastructure, significantly reducing
the time required to dump very large tables.

The --max-table-segment-pages value counts main table (heap) pages only,
so it guarantees nothing about output size. The output of a segment can be
empty if there are no live tuples in its page range, or it can be almost
200 GB if a page holds just pointers to 1GB TOAST items.

The implementation uses ctid-based range queries (e.g., WHERE ctid >=
'(startPage,1)' AND ctid < '(endPage+1,0)') to extract specific chunks of
the relation.

This is only efficiently supported on PostgreSQL version 14+, though it does
work, inefficiently, on earlier versions.

The patch only supports the "heap" access method, as others may not even
have the ctid column.
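
As an illustration of how a table's page count maps to segments, here is a
small sketch. It assumes that a table is split only when it has more pages
than the limit, and that the last segment is left open-ended (marked here
with INVALID_BLOCK, a stand-in for InvalidBlockNumber), as the struct
comments in the patch describe. Segment, segment_count() and segment_at()
are hypothetical names, not taken from the patch:

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_BLOCK UINT32_MAX	/* stand-in for InvalidBlockNumber */

typedef struct Segment
{
	uint32_t	start_page;		/* first heap page of the segment */
	uint32_t	end_page;		/* last heap page, or INVALID_BLOCK if open */
} Segment;

/*
 * Number of segments for a table of "relpages" pages.  Returns 0 when
 * the table fits within the limit and is dumped as a single item.
 */
static uint32_t
segment_count(uint32_t relpages, uint32_t max_pages)
{
	assert(max_pages > 0);
	if (relpages <= max_pages)
		return 0;
	return relpages / max_pages + (relpages % max_pages != 0);
}

/* Page range of segment number "idx" (0-based). */
static Segment
segment_at(uint32_t relpages, uint32_t max_pages, uint32_t idx)
{
	uint32_t	n = segment_count(relpages, max_pages);
	Segment		seg;

	assert(idx < n);
	seg.start_page = idx * max_pages;
	/* the last segment has no upper bound */
	seg.end_page = (idx == n - 1) ? INVALID_BLOCK
		: seg.start_page + max_pages - 1;
	return seg;
}
```

With 25 pages and a limit of 10, this gives the ranges 0..9, 10..19 and an
open-ended segment starting at page 20. Leaving the final range open
presumably lets it also pick up rows on pages added after the size was
read, but that rationale is an assumption here.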

> 2.
> + pg_log_warning("CHUNKING: set dopt.max_table_segment_pages to [%u]",
> dopt.max_table_segment_pages);
> + break;
>
> IMHO we don't need to place warning here while processing the input parameters

Either removed or turned to pg_log_debug()

> 3.
> + printf(_("  --max-table-segment-pages=NUMPAGES\n"
> +      "                               Number of main table pages
> above which data is \n"
> + "                               copied out in chunks, also
> determines the chunk size\n"));
>
> Check the formatting: all the parameter descriptions start with a
> lower-case letter, so it is better to start with "number" rather than "Number".

Fixed

> 4.
> + if (is_segment(tdinfo))
> + {
> + appendPQExpBufferStr(q, tdinfo->filtercond?" AND ":" WHERE ");
> + if(tdinfo->startPage == 0)
> + appendPQExpBuffer(q, "ctid <= '(%u,32000)'", tdinfo->endPage);
> + else if(tdinfo->endPage != InvalidBlockNumber)
> + appendPQExpBuffer(q, "ctid BETWEEN '(%u,1)' AND '(%u,32000)'",
> + tdinfo->startPage, tdinfo->endPage);
> + else
> + appendPQExpBuffer(q, "ctid >= '(%u,1)'", tdinfo->startPage);
> + pg_log_warning("CHUNKING: pages [%u:%u]",tdinfo->startPage, tdinfo->endPage);
> + }
>
> IMHO we should explain this chunking logic in the comment above this code block?

I added the comment.
I also changed the chunk end logic to "ctid < '(LastPage+1,0)'" for clarity and
future-proofing.

----
Best Regards

Hannu


Attachments:

  [application/x-patch] v14-0001-SUBJECT-Add-max-table-segment-pages-option-to-pg.patch (27.9K, 2-v14-0001-SUBJECT-Add-max-table-segment-pages-option-to-pg.patch)
  download | inline diff:
From d9442eb6476ba27e0f3dee085e48de2efbb445d6 Mon Sep 17 00:00:00 2001
From: Hannu Krosing <[email protected]>
Date: Sat, 28 Mar 2026 11:53:39 +0100
Subject: [PATCH v14] SUBJECT: Add --max-table-segment-pages option to pg_dump
 for parallel table dumping.

This patch introduces the ability to split large heap tables into segments
based on a specified number of pages. These segments can then be dumped in
parallel using the existing jobs infrastructure, significantly reducing
the time required to dump very large tables.

The --max-table-segment-pages value counts main table (heap) pages only,
so it guarantees nothing about output size. The output of a segment can be
empty if there are no live tuples in its page range, or it can be almost
200 GB if a page holds just pointers to 1GB TOAST items.

The implementation uses ctid-based range queries (e.g., WHERE ctid >=
'(startPage,1)' AND ctid < '(endPage+1,0)') to extract specific chunks of
the relation.

This is only efficiently supported on PostgreSQL version 14+, though it does
work, inefficiently, on earlier versions.

The patch only supports the "heap" access method, as others may not even
have the ctid column.
---
 doc/src/sgml/ref/pg_dump.sgml             |  24 +++
 src/bin/pg_dump/pg_backup.h               |   2 +
 src/bin/pg_dump/pg_backup_archiver.c      |  84 +++++++++-
 src/bin/pg_dump/pg_backup_archiver.h      |  12 +-
 src/bin/pg_dump/pg_dump.c                 | 177 +++++++++++++++++-----
 src/bin/pg_dump/pg_dump.h                 |  22 ++-
 src/bin/pg_dump/t/004_pg_dump_parallel.pl |  31 ++++
 src/fe_utils/option_utils.c               |  55 +++++++
 src/include/fe_utils/option_utils.h       |   3 +
 9 files changed, 364 insertions(+), 46 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 7f538e90194..5f056bb4af6 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1066,6 +1066,30 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--max-table-segment-pages=<replaceable class="parameter">npages</replaceable></option></term>
+      <listitem>
+       <para>
+        Dump data in segments based on number of pages in the main relation.
+        If the number of data pages in the relation is more than <replaceable class="parameter">npages</replaceable> 
+        the data is split into segments based on that number of pages.
+        Individual segments can be dumped in parallel.
+       </para>
+
+       <note>
+        <para>
+         The option <option>--max-table-segment-pages</option> applies only to pages
+         in the main heap. If the table has a large TOASTed part, this has to be
+         taken into account when deciding on the number of pages to use.
+         In the extreme case a single 8kB heap page can have ~200 toast pointers, each
+         corresponding to 1GB of data. If this data is also non-compressible, then a
+         single-page segment can dump as a 200GB file.
+        </para>
+       </note>
+
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--no-comments</option></term>
       <listitem>
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index fda912ba0a9..11863a1915f 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -27,6 +27,7 @@
 #include "common/file_utils.h"
 #include "fe_utils/simple_list.h"
 #include "libpq-fe.h"
+#include "storage/block.h"
 
 
 typedef enum trivalue
@@ -179,6 +180,7 @@ typedef struct _dumpOptions
 	bool		aclsSkip;
 	const char *lockWaitTimeout;
 	int			dump_inserts;	/* 0 = COPY, otherwise rows per INSERT */
+	BlockNumber	max_table_segment_pages; /* chunk when relpages is above this */
 
 	/* flags for various command-line long options */
 	int			disable_dollar_quoting;
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 271a2c3e481..384add0713b 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -44,6 +44,7 @@
 #include "pg_backup_archiver.h"
 #include "pg_backup_db.h"
 #include "pg_backup_utils.h"
+#include "storage/block.h"
 
 #define TEXT_DUMP_HEADER "--\n-- PostgreSQL database dump\n--\n\n"
 #define TEXT_DUMPALL_HEADER "--\n-- PostgreSQL database cluster dump\n--\n\n"
@@ -154,6 +155,7 @@ InitDumpOptions(DumpOptions *opts)
 	opts->dumpSchema = true;
 	opts->dumpData = true;
 	opts->dumpStatistics = false;
+	opts->max_table_segment_pages = InvalidBlockNumber;
 }
 
 /*
@@ -1995,6 +1997,28 @@ _moveBefore(TocEntry *pos, TocEntry *te)
 	pos->prev = te;
 }
 
+/*
+ * Add a dependency id to a DependencyList object
+ * This is currently used for collecting reverse 
+ * dependencies for chunked data dump 
+ *
+ * Note: duplicate dependencies are currently not eliminated
+ */
+void
+addStandaloneDependency(DependencyList *dobj, DumpId refId)
+{
+	pg_log_debug("Adding dep: list %p + dep %d", (void *) dobj->dependencies, refId);
+	if (dobj->nDeps >= dobj->allocDeps)
+	{
+		dobj->allocDeps = (dobj->allocDeps <= 0) ? 16 : dobj->allocDeps * 2;
+		dobj->dependencies = pg_realloc_array(dobj->dependencies,
+											  DumpId, dobj->allocDeps);
+		pg_log_debug("Realloced list %p to size %d", (void *) dobj->dependencies, dobj->allocDeps);
+	}
+	pg_log_debug("Added dep: list %p + dep %d", (void *) dobj->dependencies, refId);
+	dobj->dependencies[dobj->nDeps++] = refId;
+}
+
 /*
  * Build index arrays for the TOC list
  *
@@ -2014,6 +2038,7 @@ buildTocEntryArrays(ArchiveHandle *AH)
 
 	AH->tocsByDumpId = pg_malloc0_array(TocEntry *, (maxDumpId + 1));
 	AH->tableDataId = pg_malloc0_array(DumpId, (maxDumpId + 1));
+	AH->tableDataChunkIds = pg_malloc0_array(DependencyList, (maxDumpId + 1));
 
 	for (te = AH->toc->next; te != AH->toc; te = te->next)
 	{
@@ -2029,8 +2054,12 @@ buildTocEntryArrays(ArchiveHandle *AH)
 		 * TOC entry that has a DATA item.  We compute this by reversing the
 		 * TABLE DATA item's dependency, knowing that a TABLE DATA item has
 		 * just one dependency and it is the TABLE item.
+		 *
+		 * For chunked table data, the TABLE DATA item has a description like
+		 * "TABLE DATA (pages 100:199)", and we collect all such items as
+		 * reverse dependencies for the parent table's entry in tableDataChunkIds.
 		 */
-		if (strcmp(te->desc, "TABLE DATA") == 0 && te->nDeps > 0)
+		if (strncmp(te->desc, "TABLE DATA", 10) == 0 && te->nDeps > 0)
 		{
 			DumpId		tableId = te->dependencies[0];
 
@@ -2042,7 +2071,14 @@ buildTocEntryArrays(ArchiveHandle *AH)
 			if (tableId <= 0 || tableId > maxDumpId)
 				pg_fatal("bad table dumpId for TABLE DATA item");
 
-			AH->tableDataId[tableId] = te->dumpId;
+			if (te->desc[10] == '\0') /* te->desc == "TABLE DATA" */
+				AH->tableDataId[tableId] = te->dumpId;
+			else
+			{
+				/* Chunked table data, the description is "TABLE DATA (pages %u:%u)" */
+				addStandaloneDependency(&(AH->tableDataChunkIds[tableId]), te->dumpId);
+				pg_log_debug("Added chunked table data dependency: tableId %u + chunkId %u",
+							 tableId, te->dumpId);}
 		}
 	}
 }
@@ -5017,6 +5053,12 @@ fix_dependencies(ArchiveHandle *AH)
  * that parallel restore will prioritize larger jobs (index builds, FK
  * constraint checks, etc) over smaller ones, avoiding situations where we
  * end a restore with only one active job working on a large table.
+ *
+ * In case of chunked dumps, we replace the dependency on the table with a
+ * dependency on the first chunk of data and add the remaining chunk ids, if
+ * any, to the end of the dependency list.
+ * We also calculate fullDataLength as the sum of the lengths of the chunk
+ * data items and use that to set the item's dataLength.
  */
 static void
 repoint_table_dependencies(ArchiveHandle *AH)
@@ -5032,8 +5074,9 @@ repoint_table_dependencies(ArchiveHandle *AH)
 		for (i = 0; i < te->nDeps; i++)
 		{
 			olddep = te->dependencies[i];
-			if (olddep <= AH->maxDumpId &&
-				AH->tableDataId[olddep] != 0)
+			if (olddep > AH->maxDumpId)
+				continue;
+			if (AH->tableDataId[olddep] != 0)
 			{
 				DumpId		tabledataid = AH->tableDataId[olddep];
 				TocEntry   *tabledatate = AH->tocsByDumpId[tabledataid];
@@ -5043,6 +5086,39 @@ repoint_table_dependencies(ArchiveHandle *AH)
 				pg_log_debug("transferring dependency %d -> %d to %d",
 							 te->dumpId, olddep, tabledataid);
 			}
+			else if (AH->tableDataChunkIds[olddep].nDeps > 0)
+			{
+				int			j;
+				DumpId		chunkdataid;
+				uint64		fullDataLength;
+				DependencyList *deplist = &AH->tableDataChunkIds[olddep];
+
+				/* first in list replaces the dependency on table */
+				chunkdataid = deplist->dependencies[0];
+				te->dependencies[i] = chunkdataid;
+				fullDataLength = AH->tocsByDumpId[chunkdataid]->dataLength;
+				pg_log_debug("transferring chunk list %d -> %d to %d",
+							 te->dumpId, olddep, chunkdataid);
+
+				if (deplist->nDeps > 1)
+				{
+					/* make space */
+					te->dependencies = pg_realloc_array(te->dependencies,
+												  DumpId,
+												  te->nDeps + deplist->nDeps - 1);
+
+					/* the rest are appended to dependencies */
+					for (j = 1; j < deplist->nDeps; j++)
+					{
+						chunkdataid = deplist->dependencies[j];
+						te->dependencies[te->nDeps++] = chunkdataid;
+						fullDataLength += AH->tocsByDumpId[chunkdataid]->dataLength;
+						pg_log_debug("adding chunk list %d -> %d to %d",
+									te->dumpId, olddep, chunkdataid);
+					}
+				}
+				te->dataLength = Max(te->dataLength, fullDataLength);
+			}
 		}
 	}
 }
diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h
index 365073b3eae..cfa3ea1bbd6 100644
--- a/src/bin/pg_dump/pg_backup_archiver.h
+++ b/src/bin/pg_dump/pg_backup_archiver.h
@@ -179,6 +179,13 @@ typedef enum
 	OUTPUT_OTHERDATA,			/* writing data as INSERT commands */
 } ArchiverOutput;
 
+typedef struct _DependencyList
+{
+	DumpId	   *dependencies;	/* dumpIds of objects this one depends on */
+	int			nDeps;			/* number of valid dependencies */
+	int			allocDeps;		/* allocated size of dependencies[] */
+} DependencyList;
+
 /*
  * For historical reasons, ACL items are interspersed with everything else in
  * a dump file's TOC; typically they're right after the object they're for.
@@ -311,6 +318,7 @@ struct _archiveHandle
 	/* arrays created after the TOC list is complete: */
 	struct _tocEntry **tocsByDumpId;	/* TOCs indexed by dumpId */
 	DumpId	   *tableDataId;	/* TABLE DATA ids, indexed by table dumpId */
+	DependencyList *tableDataChunkIds; /* dependencies indexed by dumpId */
 
 	struct _tocEntry *currToc;	/* Used when dumping data */
 	pg_compress_specification compression_spec; /* Requested specification for
@@ -377,7 +385,7 @@ struct _tocEntry
 	size_t		defnLen;		/* length of dumped definition */
 
 	/* working state while dumping/restoring */
-	pgoff_t		dataLength;		/* item's data size; 0 if none or unknown */
+	uint64		dataLength;		/* item's data size; 0 if none or unknown */
 	int			reqs;			/* do we need schema and/or data of object
 								 * (REQ_* bit mask) */
 	bool		created;		/* set for DATA member if TABLE was created */
@@ -437,6 +445,8 @@ extern int	TocIDRequired(ArchiveHandle *AH, DumpId id);
 TocEntry   *getTocEntryByDumpId(ArchiveHandle *AH, DumpId id);
 extern bool checkSeek(FILE *fp);
 
+extern void addStandaloneDependency(DependencyList *dobj, DumpId refId);
+
 #define appendStringLiteralAHX(buf,str,AH) \
 	appendStringLiteral(buf, str, (AH)->public.encoding, (AH)->public.std_strings)
 
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 5d1f7682f11..1e7d9a3f7f3 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -535,6 +535,7 @@ main(int argc, char **argv)
 		{"exclude-extension", required_argument, NULL, 17},
 		{"sequence-data", no_argument, &dopt.sequence_data, 1},
 		{"restrict-key", required_argument, NULL, 25},
+		{"max-table-segment-pages", required_argument, NULL, 26},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -799,6 +800,12 @@ main(int argc, char **argv)
 				dopt.restrict_key = pg_strdup(optarg);
 				break;
 
+			case 26:
+				if (!option_parse_uint32(optarg, "--max-table-segment-pages", 1, MaxBlockNumber,
+									  &dopt.max_table_segment_pages))
+					exit_nicely(1);
+				break;
+
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -1344,6 +1351,9 @@ help(const char *progname)
 	printf(_("  --extra-float-digits=NUM     override default setting for extra_float_digits\n"));
 	printf(_("  --filter=FILENAME            include or exclude objects and data from dump\n"
 			 "                               based on expressions in FILENAME\n"));
+	printf(_("  --max-table-segment-pages=NUMPAGES\n"
+		     "                               number of main table pages above which data is \n"
+			 "                               copied out in chunks, also determines the chunk size\n"));
 	printf(_("  --if-exists                  use IF EXISTS when dropping objects\n"));
 	printf(_("  --include-foreign-data=PATTERN\n"
 			 "                               include data of foreign tables on foreign\n"
@@ -2396,7 +2406,7 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 	 * dumping an old pg_largeobject_metadata defined WITH OIDS.  For other
 	 * cases a simple COPY suffices.
 	 */
-	if (tdinfo->filtercond || tbinfo->relkind == RELKIND_FOREIGN_TABLE ||
+	if (tdinfo->filtercond || is_segment(tdinfo) || tbinfo->relkind == RELKIND_FOREIGN_TABLE ||
 		(fout->dopt->binary_upgrade && fout->remoteVersion < 120000 &&
 		 tbinfo->dobj.catId.oid == LargeObjectMetadataRelationId))
 	{
@@ -2414,9 +2424,37 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 		else
 			appendPQExpBufferStr(q, "* ");
 
-		appendPQExpBuffer(q, "FROM %s %s) TO stdout;",
+		appendPQExpBuffer(q, "FROM %s %s",
 						  fmtQualifiedDumpable(tbinfo),
 						  tdinfo->filtercond ? tdinfo->filtercond : "");
+		/* If it's a segment, we need to add a filter condition to select the
+		 * right page range:
+		 * - for the first segment we add "ctid < (endPage+1, 0)";
+		 *   the first segment is the one with startPage == 0
+		 * - for the last segment we add "ctid >= (startPage, 1)";
+		 *   the last segment is the one with endPage == InvalidBlockNumber,
+		 *   we leave the upper bound open for the case where more pages
+		 *   were added after we measured
+		 * - for middle segments we add
+		 *   "ctid >= (startPage, 1) AND ctid < (endPage+1, 0)"
+		 *
+		 * "ctid < (endPage+1, 0)" instead of "ctid <= (endPage, maxtuple)"
+		 * was chosen as range end so that we do not have to estimate maxtuple.
+		 *
+		 */
+		if (is_segment(tdinfo))
+		{
+			appendPQExpBufferStr(q, tdinfo->filtercond?" AND ":" WHERE ");
+			if(tdinfo->startPage == 0)
+				appendPQExpBuffer(q, "ctid < '(%u,0)'", tdinfo->endPage+1);			
+			else if(tdinfo->endPage != InvalidBlockNumber)
+				appendPQExpBuffer(q, "ctid >= '(%u,1)' AND ctid < '(%u,0)'",
+								 tdinfo->startPage, tdinfo->endPage+1);
+			else
+				appendPQExpBuffer(q, "ctid >= '(%u,1)'", tdinfo->startPage);
+		}
+
+		appendPQExpBuffer(q, ") TO stdout;");
 	}
 	else
 	{
@@ -2424,6 +2462,10 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 						  fmtQualifiedDumpable(tbinfo),
 						  column_list);
 	}
+
+	if (is_segment(tdinfo))
+		pg_log_debug("CHUNKING: data query: %s", q->data);
+	
 	res = ExecuteSqlQuery(fout, q->data, PGRES_COPY_OUT);
 	PQclear(res);
 	destroyPQExpBuffer(clistBuf);
@@ -2919,42 +2961,89 @@ dumpTableData(Archive *fout, const TableDataInfo *tdinfo)
 	{
 		TocEntry   *te;
 
-		te = ArchiveEntry(fout, tdinfo->dobj.catId, tdinfo->dobj.dumpId,
-						  ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
-									   .namespace = tbinfo->dobj.namespace->dobj.name,
-									   .owner = tbinfo->rolname,
-									   .description = "TABLE DATA",
-									   .section = SECTION_DATA,
-									   .createStmt = tdDefn,
-									   .copyStmt = copyStmt,
-									   .deps = &(tbinfo->dobj.dumpId),
-									   .nDeps = 1,
-									   .dumpFn = dumpFn,
-									   .dumpArg = tdinfo));
-
-		/*
-		 * Set the TocEntry's dataLength in case we are doing a parallel dump
-		 * and want to order dump jobs by table size.  We choose to measure
-		 * dataLength in table pages (including TOAST pages) during dump, so
-		 * no scaling is needed.
-		 *
-		 * However, relpages is declared as "integer" in pg_class, and hence
-		 * also in TableInfo, but it's really BlockNumber a/k/a unsigned int.
-		 * Cast so that we get the right interpretation of table sizes
-		 * exceeding INT_MAX pages.
+		/* Data chunking works off relpages, which is computed exactly using
+		 * pg_relation_size() when --max-table-segment-pages is set.
+		 *
+		 * We also don't chunk if the table access method is not "heap".
+		 * TODO: we may add chunking for other access methods later, maybe
+		 * based on primary key ranges
 		 */
-		te->dataLength = (BlockNumber) tbinfo->relpages;
-		te->dataLength += (BlockNumber) tbinfo->toastpages;
+		if (tbinfo->relpages <= dopt->max_table_segment_pages || 
+			strcmp(tbinfo->amname, "heap") != 0)
+		{
+			te = ArchiveEntry(fout, tdinfo->dobj.catId, tdinfo->dobj.dumpId,
+							ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
+										.namespace = tbinfo->dobj.namespace->dobj.name,
+										.owner = tbinfo->rolname,
+										.description = "TABLE DATA",
+										.section = SECTION_DATA,
+										.createStmt = tdDefn,
+										.copyStmt = copyStmt,
+										.deps = &(tbinfo->dobj.dumpId),
+										.nDeps = 1,
+										.dumpFn = dumpFn,
+										.dumpArg = tdinfo));
 
-		/*
-		 * If pgoff_t is only 32 bits wide, the above refinement is useless,
-		 * and instead we'd better worry about integer overflow.  Clamp to
-		 * INT_MAX if the correct result exceeds that.
-		 */
-		if (sizeof(te->dataLength) == 4 &&
-			(tbinfo->relpages < 0 || tbinfo->toastpages < 0 ||
-			 te->dataLength < 0))
-			te->dataLength = INT_MAX;
+			/*
+			 * Set the TocEntry's dataLength in case we are doing a parallel dump
+			 * and want to order dump jobs by table size.  We choose to measure
+			 * dataLength in table pages (including TOAST pages) during dump, so
+			 * no scaling is needed.
+			 *
+			 * Although pg_class.relpages is declared as "integer", it really
+			 * stores a BlockNumber, a/k/a unsigned int, so we convert it back
+			 * and store it as BlockNumber in TableInfo.
+			 * dataLength is uint64, so it does not overflow even for
+			 * 2 x UINT32_MAX.
+			 */
+			te->dataLength = tbinfo->relpages;
+			te->dataLength += tbinfo->toastpages;
+		}
+		else
+		{
+			uint64 current_chunk_start = 0;
+			PQExpBuffer chunk_desc = createPQExpBuffer();
+
+			while (current_chunk_start < tbinfo->relpages)
+			{
+				TableDataInfo *chunk_tdinfo = (TableDataInfo *) pg_malloc(sizeof(TableDataInfo));
+
+				memcpy(chunk_tdinfo, tdinfo, sizeof(TableDataInfo));
+				AssignDumpId(&chunk_tdinfo->dobj);
+				addObjectDependency(&chunk_tdinfo->dobj, tbinfo->dobj.dumpId);
+				chunk_tdinfo->startPage = (BlockNumber) current_chunk_start;
+				chunk_tdinfo->endPage = chunk_tdinfo->startPage + dopt->max_table_segment_pages - 1;
+				
+				current_chunk_start += dopt->max_table_segment_pages;
+				if (current_chunk_start >= tbinfo->relpages)
+					chunk_tdinfo->endPage = InvalidBlockNumber; /* last chunk is for "all the rest" */
+
+				printfPQExpBuffer(chunk_desc, "TABLE DATA (pages %u:%u)", chunk_tdinfo->startPage, chunk_tdinfo->endPage);
+
+				te = ArchiveEntry(fout, chunk_tdinfo->dobj.catId, chunk_tdinfo->dobj.dumpId,
+							ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
+										.namespace = tbinfo->dobj.namespace->dobj.name,
+										.owner = tbinfo->rolname,
+										.description = chunk_desc->data,
+										.section = SECTION_DATA,
+										.createStmt = tdDefn,
+										.copyStmt = copyStmt,
+										.deps = &(tbinfo->dobj.dumpId),
+										.nDeps = 1,
+										.dumpFn = dumpFn,
+										.dumpArg = chunk_tdinfo));
+
+				if(chunk_tdinfo->endPage == InvalidBlockNumber)
+					te->dataLength = tbinfo->relpages - chunk_tdinfo->startPage;
+				else
+					te->dataLength = dopt->max_table_segment_pages;
+				/* let's assume toast pages distribute evenly among chunks */
+				if(tbinfo->relpages)
+					te->dataLength += te->dataLength * tbinfo->toastpages / tbinfo->relpages;
+			}
+
+			destroyPQExpBuffer(chunk_desc);
+		}
 	}
 
 	destroyPQExpBuffer(copyBuf);
@@ -3081,6 +3170,8 @@ makeTableDataInfo(DumpOptions *dopt, TableInfo *tbinfo)
 	tdinfo->dobj.namespace = tbinfo->dobj.namespace;
 	tdinfo->tdtable = tbinfo;
 	tdinfo->filtercond = NULL;	/* might get set later */
+	tdinfo->startPage = InvalidBlockNumber; /* we use this as indication that no chunking is needed */
+	tdinfo->endPage = InvalidBlockNumber;
 	addObjectDependency(&tdinfo->dobj, tbinfo->dobj.dumpId);
 
 	/* A TableDataInfo contains data, of course */
@@ -7347,8 +7438,16 @@ getTables(Archive *fout, int *numTables)
 						 "c.relnamespace, c.relkind, c.reltype, "
 						 "c.relowner, "
 						 "c.relchecks, "
-						 "c.relhasindex, c.relhasrules, c.relpages, "
-						 "c.reltuples, c.relallvisible, ");
+						 "c.relhasindex, c.relhasrules, ");
+
+	/* fetch current relation size if chunking is requested */
+	if(dopt->max_table_segment_pages != InvalidBlockNumber)
+		appendPQExpBufferStr(query, "pg_relation_size(c.oid)/current_setting('block_size')::int AS relpages, ");
+	else
+		/* pg_class.relpages stores BlockNumber (uint32) in an int field, convert to oid to get unsigned int out */
+		appendPQExpBufferStr(query, "c.relpages::oid, ");
+
+	appendPQExpBufferStr(query, "c.reltuples, c.relallvisible, ");
 
 	if (fout->remoteVersion >= 180000)
 		appendPQExpBufferStr(query, "c.relallfrozen, ");
@@ -7589,7 +7688,7 @@ getTables(Archive *fout, int *numTables)
 		tblinfo[i].ncheck = atoi(PQgetvalue(res, i, i_relchecks));
 		tblinfo[i].hasindex = (strcmp(PQgetvalue(res, i, i_relhasindex), "t") == 0);
 		tblinfo[i].hasrules = (strcmp(PQgetvalue(res, i, i_relhasrules), "t") == 0);
-		tblinfo[i].relpages = atoi(PQgetvalue(res, i, i_relpages));
+		tblinfo[i].relpages = strtoul(PQgetvalue(res, i, i_relpages), NULL, 10);
 		if (PQgetisnull(res, i, i_toastpages))
 			tblinfo[i].toastpages = 0;
 		else
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 5a6726d8b12..84e682d585f 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -16,6 +16,7 @@
 
 #include "pg_backup.h"
 #include "catalog/pg_publication_d.h"
+#include "storage/block.h"
 
 
 #define oidcmp(x,y) ( ((x) < (y) ? -1 : ((x) > (y)) ?  1 : 0) )
@@ -335,7 +336,11 @@ typedef struct _tableInfo
 	Oid			owning_tab;		/* OID of table owning sequence */
 	int			owning_col;		/* attr # of column owning sequence */
 	bool		is_identity_sequence;
-	int32		relpages;		/* table's size in pages (from pg_class) */
+	BlockNumber	relpages;		/* table's size in pages (from pg_class) 
+	                             * converted to unsigned integer
+								 * when --max-table-segment-pages is set
+								 * the computed from pg_relation_size()
+	                             */
 	int			toastpages;		/* toast table's size in pages, if any */
 
 	bool		interesting;	/* true if need to collect more data */
@@ -413,8 +418,21 @@ typedef struct _tableDataInfo
 	DumpableObject dobj;
 	TableInfo  *tdtable;		/* link to table to dump */
 	char	   *filtercond;		/* WHERE condition to limit rows dumped */
+	/* startPage and endPage to support segmented dump */
+	BlockNumber	startPage;		/* As we always know the lowest segment page
+								 * number, we can use InvalidBlockNumber here
+								 * to recognize the no-segmenting case.
+								 * When it is 0 (first page of the first
+								 * segment) we can omit it in the range query */
+	BlockNumber	endPage;		/* last page in segment for page-range dump,
+								 * startPage+max_table_segment_pages-1 for
+								 * most segments, but InvalidBlockNumber for
+								 * the last one to indicate open range
+								 */
 } TableDataInfo;
 
+#define is_segment(tdiptr) ((tdiptr)->startPage != InvalidBlockNumber)
+
 typedef struct _indxInfo
 {
 	DumpableObject dobj;
@@ -449,7 +467,7 @@ typedef struct _relStatsInfo
 {
 	DumpableObject dobj;
 	Oid			relid;
-	int32		relpages;
+	BlockNumber	relpages;
 	char	   *reltuples;
 	int32		relallvisible;
 	int32		relallfrozen;
diff --git a/src/bin/pg_dump/t/004_pg_dump_parallel.pl b/src/bin/pg_dump/t/004_pg_dump_parallel.pl
index 738f34b1c1b..4f35aeed9b9 100644
--- a/src/bin/pg_dump/t/004_pg_dump_parallel.pl
+++ b/src/bin/pg_dump/t/004_pg_dump_parallel.pl
@@ -11,6 +11,7 @@ use Test::More;
 my $dbname1 = 'regression_src';
 my $dbname2 = 'regression_dest1';
 my $dbname3 = 'regression_dest2';
+my $dbname4 = 'regression_dest3';
 
 my $node = PostgreSQL::Test::Cluster->new('main');
 $node->init;
@@ -21,6 +22,7 @@ my $backupdir = $node->backup_dir;
 $node->run_log([ 'createdb', $dbname1 ]);
 $node->run_log([ 'createdb', $dbname2 ]);
 $node->run_log([ 'createdb', $dbname3 ]);
+$node->run_log([ 'createdb', $dbname4 ]);
 
 $node->safe_psql(
 	$dbname1,
@@ -87,4 +89,33 @@ $node->command_ok(
 	],
 	'parallel restore as inserts');
 
+$node->command_ok(
+	[
+		'pg_dump',
+		'--format' => 'directory',
+		'--max-table-segment-pages' => 2,
+		'--no-sync',
+		'--jobs' => 2,
+		'--file' => "$backupdir/dump3",
+		$node->connstr($dbname1),
+	],
+	'parallel dump with chunks of two heap pages');
+
+$node->command_ok(
+	[
+		'pg_restore', '--verbose',
+		'--dbname' => $node->connstr($dbname4),
+		'--jobs' => 3,
+		"$backupdir/dump3",
+	],
+	'parallel restore with chunks of two heap pages');
+
+my $table = 'tplain';
+my $tablehash_query = "SELECT '$table', sum(hashtext(t::text)), count(*) FROM $table AS t";
+
+my $result_1 = $node->safe_psql($dbname1, $tablehash_query);
+my $result_4 = $node->safe_psql($dbname4, $tablehash_query);
+
+is($result_4, $result_1, "Hash check for $table: restored db ($result_4) vs original db ($result_1)");
+
 done_testing();
diff --git a/src/fe_utils/option_utils.c b/src/fe_utils/option_utils.c
index 8d0659c1164..a516d8c86a9 100644
--- a/src/fe_utils/option_utils.c
+++ b/src/fe_utils/option_utils.c
@@ -83,6 +83,61 @@ option_parse_int(const char *optarg, const char *optname,
 	return true;
 }
 
+/*
+ * option_parse_uint32
+ *
+ * Parse unsigned integer value for an option.  If the parsing is successful,
+ * returns true and stores the result in *result if that's given;
+ * if parsing fails, returns false.
+ */
+bool
+option_parse_uint32(const char *optarg, const char *optname,
+				 uint32 min_range, uint32 max_range,
+				 uint32 *result)
+{
+	char	   		*endptr;
+	unsigned long	val;
+
+	/* Fail if there is a minus sign at the start of value */
+	while(isspace((unsigned char) *optarg))
+		optarg++;
+	if(*optarg == '-')
+	{
+		pg_log_error("value \"%s\" for option %s can not be negative",
+					optarg, optname);
+		return false;
+	}
+
+	errno = 0;
+	val = strtoul(optarg, &endptr, 10);
+
+	/*
+	 * Skip any trailing whitespace; if anything but whitespace remains before
+	 * the terminating character, fail.
+	 */
+	while (*endptr != '\0' && isspace((unsigned char) *endptr))
+		endptr++;
+
+	if (*endptr != '\0')
+	{
+		pg_log_error("invalid value \"%s\" for option %s",
+					 optarg, optname);
+		return false;
+	}
+
+	/* as min_range and max_range are uint32 then the range check will
+	 * catch the case where unsigned long val is outside 32 bit range */
+	if (errno == ERANGE || val < min_range || val > max_range)
+	{
+		pg_log_error("%s not in range %u..%u", optname, min_range, max_range);
+		return false;
+	}
+
+	if (result)
+		*result = (uint32) val;
+	return true;
+}
+
 /*
  * Provide strictly harmonized handling of the --sync-method option.
  */
diff --git a/src/include/fe_utils/option_utils.h b/src/include/fe_utils/option_utils.h
index d975db77af2..67fd3650d7a 100644
--- a/src/include/fe_utils/option_utils.h
+++ b/src/include/fe_utils/option_utils.h
@@ -22,6 +22,9 @@ extern void handle_help_version_opts(int argc, char *argv[],
 extern bool option_parse_int(const char *optarg, const char *optname,
 							 int min_range, int max_range,
 							 int *result);
+extern bool option_parse_uint32(const char *optarg, const char *optname,
+							 uint32 min_range, uint32 max_range,
+							 uint32 *result);
 extern bool parse_sync_method(const char *optarg,
 							  DataDirSyncMethod *sync_method);
 extern void check_mut_excl_opts_internal(int n,...);
-- 
2.43.0



^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Patch: dumping tables data in multiple chunks in pg_dump
  2026-01-13 02:27 Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
  2026-01-19 19:01 ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-19 21:15   ` Re: Patch: dumping tables data in multiple chunks in pg_dump Zsolt Parragi <[email protected]>
  2026-01-19 23:07     ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-20 06:13       ` Re: Patch: dumping tables data in multiple chunks in pg_dump Zsolt Parragi <[email protected]>
  2026-01-20 12:48         ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-21 13:05           ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-22 17:05             ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-23 02:15               ` Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
@ 2026-01-28 21:27                 ` Hannu Krosing <[email protected]>
  2026-01-28 21:33                   ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  1 sibling, 1 reply; 24+ messages in thread

From: Hannu Krosing @ 2026-01-28 21:27 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Zsolt Parragi <[email protected]>; Ashutosh Bapat <[email protected]>; PostgreSQL Hackers <[email protected]>; Nathan Bossart <[email protected]>

Hi David

About the documentation:

On Fri, Jan 23, 2026 at 3:15 AM David Rowley <[email protected]> wrote:
>
> Aside from that, nothing in the documentation mentions that this is
> for "heap" tables only.

The <note> part did mention it and even gave an example:

     <varlistentry>
      <term><option>--max-table-segment-pages=<replaceable class="parameter">npages</replaceable></option></term>
      <listitem>
       <para>
        Dump data in segments based on number of pages in the main relation.
        If the number of data pages in the relation is more than
        <replaceable class="parameter">npages</replaceable>
        the data is split into segments based on that number of pages.
        Individual segments can be dumped in parallel.
       </para>

       <note>
        <para>
         The option <option>--max-table-segment-pages</option> is applied to only pages
         in the main heap and if the table has a large TOASTed part this has to be
         taken into account when deciding on the number of pages to use.
         In the extreme case a single 8kB heap page can have ~200 toast pointers each
         corresponding to 1GB of data. If this data is also non-compressible then a
         single-page segment can dump as 200GB file.
        </para>
       </note>
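
To make the splitting rule in the paragraph above concrete, here is a minimal sketch in C of how the page ranges fall out of a given npages. It follows the loop in the patch's dumpTableData(), but `PageRange`, `split_into_segments` and `INVALID_BLOCK` are illustrative names that do not exist in pg_dump. As in the patch, the last segment is left open-ended so that pages added after the size was measured are still dumped.

```c
#include <stddef.h>
#include <stdint.h>

#define INVALID_BLOCK UINT32_MAX	/* stands in for InvalidBlockNumber */

typedef struct PageRange
{
	uint32_t	start_page;		/* first heap page of the segment */
	uint32_t	end_page;		/* last heap page, or INVALID_BLOCK if open */
} PageRange;

/*
 * Fill out[] with up to max_out segments of at most npages heap pages each
 * and return the number of segments the table needs.  A table with
 * relpages <= npages is not split, which is reported as 0 segments.
 */
size_t
split_into_segments(uint32_t relpages, uint32_t npages,
					PageRange *out, size_t max_out)
{
	uint64_t	start = 0;		/* 64-bit so that start + npages cannot wrap */
	size_t		n = 0;

	if (npages == 0 || relpages <= npages)
		return 0;

	while (start < relpages)
	{
		uint64_t	next = start + npages;

		if (n < max_out)
		{
			out[n].start_page = (uint32_t) start;
			/* the last segment stays open to catch pages added later */
			out[n].end_page = (next >= relpages) ?
				INVALID_BLOCK : (uint32_t) (next - 1);
		}
		n++;
		start = next;
	}
	return n;
}
```

With relpages = 5 and npages = 2 this gives the ranges 0..1, 2..3 and 4..open. Nothing here says how much output each range produces, which is the point of the note above.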

Would it be a good idea to add a second paragraph like this to make it
even clearer?

        <para>
         It is also possible that segments end up with very different sizes even when
         no TOAST is involved, as there is no guarantee that pages are not unevenly
         bloated in a real production database, up to a point where some pages may
         have no live tuples in them.
        </para>

---
Hannu






^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Patch: dumping tables data in multiple chunks in pg_dump
  2026-01-13 02:27 Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
  2026-01-19 19:01 ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-19 21:15   ` Re: Patch: dumping tables data in multiple chunks in pg_dump Zsolt Parragi <[email protected]>
  2026-01-19 23:07     ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-20 06:13       ` Re: Patch: dumping tables data in multiple chunks in pg_dump Zsolt Parragi <[email protected]>
  2026-01-20 12:48         ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-21 13:05           ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-22 17:05             ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-23 02:15               ` Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
  2026-01-28 21:27                 ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
@ 2026-01-28 21:33                   ` Hannu Krosing <[email protected]>
  2026-02-03 21:10                     ` Re: Patch: dumping tables data in multiple chunks in pg_dump Zsolt Parragi <[email protected]>
  0 siblings, 1 reply; 24+ messages in thread

From: Hannu Krosing @ 2026-01-28 21:33 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Zsolt Parragi <[email protected]>; Ashutosh Bapat <[email protected]>; PostgreSQL Hackers <[email protected]>; Nathan Bossart <[email protected]>

On Wed, Jan 28, 2026 at 10:27 PM Hannu Krosing <[email protected]> wrote:
>
> Hi David
>
> About documentation :
>
> On Fri, Jan 23, 2026 at 3:15 AM David Rowley <[email protected]> wrote:
> >
> > Aside from that, nothing in the documentation mentions that this is
> > for "heap" tables only.

On re-reading I finally understood what you meant: that the
chunking applies only to the standard PostgreSQL TAM called "heap".

Will add that as well. Would something like this work?

"This flag applies only to tables that use the standard PostgreSQL
Table Access Method (TAM) called "heap".
Tables that use any custom TAM are dumped as if
--max-table-segment-pages was not set."
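
The rule described in that wording can be sketched as a single predicate. This is only an illustration of the condition the patch tests in dumpTableData() (relpages above the threshold and amname equal to "heap"); the function name `table_is_chunked` and the `INVALID_BLOCK` macro are assumptions made for the example, not pg_dump identifiers.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define INVALID_BLOCK UINT32_MAX	/* stands in for InvalidBlockNumber */

/*
 * Return true if a table's data would be dumped in segments.
 * max_segment_pages == INVALID_BLOCK means the option was not given.
 */
bool
table_is_chunked(const char *amname, uint32_t relpages,
				 uint32_t max_segment_pages)
{
	/* option not set: every table is dumped in one piece */
	if (max_segment_pages == INVALID_BLOCK)
		return false;

	/* custom table access methods are dumped as if the option was not set */
	if (amname == NULL || strcmp(amname, "heap") != 0)
		return false;

	/* only tables larger than the threshold are split */
	return relpages > max_segment_pages;
}
```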

---
Hannu






^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Patch: dumping tables data in multiple chunks in pg_dump
  2026-01-13 02:27 Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
  2026-01-19 19:01 ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-19 21:15   ` Re: Patch: dumping tables data in multiple chunks in pg_dump Zsolt Parragi <[email protected]>
  2026-01-19 23:07     ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-20 06:13       ` Re: Patch: dumping tables data in multiple chunks in pg_dump Zsolt Parragi <[email protected]>
  2026-01-20 12:48         ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-21 13:05           ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-22 17:05             ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-23 02:15               ` Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
  2026-01-28 21:27                 ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-01-28 21:33                   ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
@ 2026-02-03 21:10                     ` Zsolt Parragi <[email protected]>
  0 siblings, 0 replies; 24+ messages in thread

From: Zsolt Parragi @ 2026-02-03 21:10 UTC (permalink / raw)
  To: Hannu Krosing <[email protected]>; +Cc: David Rowley <[email protected]>; Ashutosh Bapat <[email protected]>; PostgreSQL Hackers <[email protected]>; Nathan Bossart <[email protected]>

Hello!

I did some testing with this patch, and I think there are some issues
during restoration:

1. Isn't there a possible race / scheduling mistake during restore
because of missing dependencies? The code now prints out "TABLE DATA
(pages %u:%u)", while the restore code checks for the explicit "TABLE
DATA" string for dependency tracking (pg_backup_archiver.c:2013 and a
few other places). As a result, POST DATA items have no dependency on the
table data and can be scheduled before all table data is loaded.

I was able to verify the scheduling issue with an index: the INDEX
part is scheduled too early, before all TABLE DATA completes, but then
locking prevents it from progressing, so everything completed fine in
the end. Even if that's guaranteed, which I'm not 100% sure of, it's
still based on luck and not proper logic, and takes up a slot (or
multiple), reducing parallelism.

2. Fixing the TABLE DATA strcmp checks solves the scheduling issue,
but it's not that simple, because then it causes truncation issues
during restore, which needs additional changes in the restore code. I
did a quick fix for that by adding an additional condition to the
created flag, and with that it seems to restore everything properly,
and with proper ordering, only starting index/constraint/etc after all
table data is completed. However, this was definitely just a quick test
fix; this needs a proper solution.
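
As an illustration of the matching problem in points 1 and 2, the sketch below shows one way a description check could accept both the plain "TABLE DATA" string and the chunked "TABLE DATA (pages %u:%u)" form. The helper name `is_table_data_desc` is hypothetical; it is not in pg_backup_archiver.c and is not the quick fix described above. A check like this would only address the dependency tracking; the truncation handling from point 2 would still need separate changes.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Return true for "TABLE DATA" itself and for the chunked form
 * "TABLE DATA (pages N:M)", but not for unrelated descriptions that
 * merely start with the same characters.
 */
bool
is_table_data_desc(const char *desc)
{
	static const char prefix[] = "TABLE DATA";
	size_t		len = sizeof(prefix) - 1;

	if (desc == NULL || strncmp(desc, prefix, len) != 0)
		return false;

	/* exact match, or the " (pages " suffix written for chunks */
	return desc[len] == '\0' || strncmp(desc + len, " (pages ", 8) == 0;
}
```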

Other issues I see are more minor, but numerous:

3. The patch still has lots of debug output (pg_log_WARNING("CHUNKING
...")); Is this intended? Shouldn't these be behind some verbose
check, and maybe use info instead of warning?

4. The is_segment macro should have () around the use of tdiptr

5. There's still a 32000 magic constant, shouldn't that have some
descriptive name / explanatory comment?

6. formatting issues at multiple places, mostly missing spaces after
if/while/for statements

7. inconsistent error messages (not in range vs must be in range)

8. There's a remaining TODO that seems stale, current_chunk_start is
already uint64

9. typo: "the computed from pg_relation_size" -> "then computed from
pg_relation_size"






^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Patch: dumping tables data in multiple chunks in pg_dump
  2026-01-13 02:27 Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
  2026-01-19 19:01 ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
@ 2026-01-20 02:20   ` David Rowley <[email protected]>
  1 sibling, 0 replies; 24+ messages in thread

From: David Rowley @ 2026-01-20 02:20 UTC (permalink / raw)
  To: Hannu Krosing <[email protected]>; +Cc: Ashutosh Bapat <[email protected]>; PostgreSQL Hackers <[email protected]>; Nathan Bossart <[email protected]>

On Tue, 20 Jan 2026 at 08:01, Hannu Krosing <[email protected]> wrote:
> I changed the last open-ended chunk to use ctid >= (N,1) for clarity
> but did not change anything else.

You have:

int max_table_segment_pages; /* chunk when relpages is above this */

and:

opts->max_table_segment_pages = UINT32_MAX; /* == InvalidBlockNumber,
disable chunking by default */

It's not valid to assign UINT32_MAX to a signed int.
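
A minimal sketch of the alternative, assuming the field is simply declared unsigned: with a uint32 field the sentinel compares cleanly against InvalidBlockNumber, whereas converting UINT32_MAX to a signed int gives an implementation-defined result (commonly -1). `DumpOptsSketch`, `init_opts` and `chunking_requested` are made-up names for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define INVALID_BLOCK UINT32_MAX	/* stands in for InvalidBlockNumber */

typedef struct DumpOptsSketch
{
	/* unsigned, so the full BlockNumber range and the sentinel both fit */
	uint32_t	max_table_segment_pages;
} DumpOptsSketch;

/* Default: chunking disabled. */
void
init_opts(DumpOptsSketch *opts)
{
	opts->max_table_segment_pages = INVALID_BLOCK;
}

/* True once the user has supplied a real page count. */
bool
chunking_requested(const DumpOptsSketch *opts)
{
	return opts->max_table_segment_pages != INVALID_BLOCK;
}
```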


> To me it looked like having a loop around the whole thing when there
> is no chunking would complicate things for anyone reading the code.

The problem I have with it is the duplicate code. If you don't want to
loop around the standard code, then make a function and call that
instead of copying and pasting the code.

I'd also get rid of the "chunking" boolean and make use of
InvalidBlockNumber to determine if the range is constrained. It also
seems very strange that you opted to do that just for endPage and not
for startPage.
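
One reading of that suggestion, sketched in C: treat InvalidBlockNumber on either side as "no bound" and derive the ctid predicate from the two fields alone, with no separate boolean. This is not what the posted patch does (it uses startPage == 0 for the first segment); `build_ctid_filter` and `INVALID_BLOCK` are names invented for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define INVALID_BLOCK UINT32_MAX	/* stands in for InvalidBlockNumber */

/*
 * Write the ctid predicate for pages start_page..end_page into buf.
 * INVALID_BLOCK on either side means that side is unconstrained; if both
 * are unconstrained the result is an empty string.  Returns the length
 * that snprintf reports.
 */
int
build_ctid_filter(char *buf, size_t buflen,
				  uint32_t start_page, uint32_t end_page)
{
	if (start_page == INVALID_BLOCK && end_page == INVALID_BLOCK)
	{
		if (buflen > 0)
			buf[0] = '\0';
		return 0;
	}

	/* upper bound only: everything before the first tuple of end_page+1 */
	if (start_page == INVALID_BLOCK)
		return snprintf(buf, buflen, "ctid < '(%u,0)'",
						(unsigned int) (end_page + 1));

	/* lower bound only: open-ended last segment */
	if (end_page == INVALID_BLOCK)
		return snprintf(buf, buflen, "ctid >= '(%u,1)'",
						(unsigned int) start_page);

	return snprintf(buf, buflen, "ctid >= '(%u,1)' AND ctid < '(%u,0)'",
					(unsigned int) start_page,
					(unsigned int) (end_page + 1));
}
```

Since a valid end_page is at most MaxBlockNumber (0xFFFFFFFE), end_page + 1 still fits in 32 bits.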

> > 4. I think using "int" here is a future complaint waiting to happen.
> >
> > + if (!option_parse_int(optarg, "--huge-table-chunk-pages", 1, INT32_MAX,
> > +   &dopt.huge_table_chunk_pages))
> >
> > I bet we'll eventually see a complaint that someone can't make the
> > segment size larger than 16TB. I think option_parse_uint32() might be
> > called for.
>
> I have not yet done anything with this yet, so the maximum chunk size
> for now is half of the maximum relpages.

OK. I can look again once all that's done.
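
For reference, the 16TB figure in the quoted point 4 follows from multiplying the page limit by the block size. The sketch below assumes the default 8kB block size; `segment_span_bytes` is an illustrative name.

```c
#include <stdint.h>

/*
 * Largest amount of heap, in bytes, that one segment of max_pages pages
 * can span.  The multiplication is done in 64 bits to avoid overflow.
 */
uint64_t
segment_span_bytes(uint32_t max_pages, uint32_t block_size)
{
	return (uint64_t) max_pages * block_size;
}
```

With an int limit of INT32_MAX pages this comes to just under 2^44 bytes (16TB); with a uint32 limit of MaxBlockNumber pages it is just under 2^45 bytes (32TB).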

David






^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Patch: dumping tables data in multiple chunks in pg_dump
  2026-01-13 02:27 Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
@ 2026-03-28 15:32 ` Hannu Krosing <[email protected]>
  2026-03-28 15:33   ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2 siblings, 1 reply; 24+ messages in thread

From: Hannu Krosing @ 2026-03-28 15:32 UTC (permalink / raw)
  To: Michael Banck <[email protected]>; +Cc: David Rowley <[email protected]>; Ashutosh Bapat <[email protected]>; PostgreSQL Hackers <[email protected]>; Nathan Bossart <[email protected]>

The issue is that currently the value is given in "main table pages"
and it would be somewhat deceptive, or at least confusing, to try to
express this in any other unit.

As I explained in the commit message:

---------8<-------------------8<-------------------8<----------------
This --max-table-segment-pages number specifically applies to main table
pages which does not guarantee anything about output size.
The output could be empty if there are no live tuples in the page range.
Or it can be almost 200 GB if the page has just pointers to 1GB TOAST items.
---------8<-------------------8<-------------------8<----------------

And I can think of no cheap and reliable way to change that equation.

I'd be very happy to hear any good ideas for improving the flag name,
or even a proposal for better estimating the resulting dump file size,
so that we could give the chunk size in better units.
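
The closest thing the patch has to a size estimate is the per-chunk dataLength used to order parallel jobs, which assumes TOAST pages are spread evenly over the heap pages. A minimal sketch of that estimate follows; `estimate_chunk_pages` is an illustrative name, and the result is a scheduling weight in pages, not a prediction of output size.

```c
#include <stdint.h>

/*
 * Estimate the number of pages (heap plus a proportional share of TOAST)
 * that a chunk of chunk_heap_pages heap pages covers, assuming TOAST pages
 * are distributed evenly across the table's relpages heap pages.
 */
uint64_t
estimate_chunk_pages(uint32_t chunk_heap_pages, uint32_t relpages,
					 uint32_t toastpages)
{
	uint64_t	est = chunk_heap_pages;

	/* guard against division by zero for empty tables */
	if (relpages > 0)
		est += est * toastpages / relpages;
	return est;
}
```

The even-distribution assumption is exactly what fails in the cases described above: a chunk with no live tuples, or a single page full of pointers to large TOAST values.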

---
Hannu





On Sat, Mar 28, 2026 at 12:26 PM Michael Banck <[email protected]> wrote:
>
> Hi,
>
> On Tue, Jan 13, 2026 at 03:27:25PM +1300, David Rowley wrote:
> > Perhaps --max-table-segment-pages is a better name than
> > --huge-table-chunk-pages as it's quite subjective what the minimum
> > number of pages required to make a table "huge".
>
> I'm not sure that's better - without looking at the documentation,
> people might confuse segment here with the 1GB split of tables into
> segments. As pg_dump is a very common and basic user tool, I don't think
> implementation details like pages/page sizes and blocks should be part
> of its UX.
>
> Can't we just make it a storage size, like '10GB' and then rename it to
> --table-parallel-threshold or something? I agree it's bikeshedding, but
> I personally don't like either --max-table-segment-pages or
> --huge-table-chunk-pages.
>
>
> Michael





^ permalink  raw  reply  [nested|flat] 24+ messages in thread

* Re: Patch: dumping tables data in multiple chunks in pg_dump
  2026-01-13 02:27 Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
  2026-03-28 15:32 ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
@ 2026-03-28 15:33   ` Hannu Krosing <[email protected]>
  2026-03-29 21:49     ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  0 siblings, 1 reply; 24+ messages in thread

From: Hannu Krosing @ 2026-03-28 15:33 UTC (permalink / raw)
  To: Michael Banck <[email protected]>; +Cc: David Rowley <[email protected]>; Ashutosh Bapat <[email protected]>; PostgreSQL Hackers <[email protected]>; Nathan Bossart <[email protected]>

The above

"Or it can be almost 200 GB if the page has just pointers to 1GB TOAST items."

should read

"Or it can be almost 200 GB *for a single page* if the page has just
pointers to 1GB TOAST items."


On Sat, Mar 28, 2026 at 4:32 PM Hannu Krosing <[email protected]> wrote:
>
> The issue is that currently the value is given in "main table pages"
> and it would be somewhat deceptive, or at least confusing, to try to
> express this in any other unit.
>
> As I explained in the commit message:
>
> ---------8<-------------------8<-------------------8<----------------
> This --max-table-segment-pages number specifically applies to main table
> pages which does not guarantee anything about output size.
> The output could be empty if there are no live tuples in the page range.
> Or it can be almost 200 GB if the page has just pointers to 1GB TOAST items.
> ---------8<-------------------8<-------------------8<----------------
>
> And I can think of no cheap and reliable way to change that equation.
>
> I'll be very happy if you have any good ideas for either improving the
> flag name, or even propose a way to better estimate the resulting dump
> file size so we could give the chunk size in better units
>
> ---
> Hannu
>
>
>
>
>
> On Sat, Mar 28, 2026 at 12:26 PM Michael Banck <[email protected]> wrote:
> >
> > Hi,
> >
> > On Tue, Jan 13, 2026 at 03:27:25PM +1300, David Rowley wrote:
> > > Perhaps --max-table-segment-pages is a better name than
> > > --huge-table-chunk-pages as it's quite subjective what the minimum
> > > number of pages required to make a table "huge".
> >
> > I'm not sure that's better - without looking at the documentation,
> > people might confuse segment here with the 1GB split of tables into
> > segments. As pg_dump is a very common and basic user tool, I don't think
> > implementation details like pages/page sizes and blocks should be part
> > of its UX.
> >
> > Can't we just make it a storage size, like '10GB' and then rename it to
> > --table-parallel-threshold or something? I agree it's bikeshedding, but
> > I personally don't like either --max-table-segment-pages or
> > --huge-table-chunk-pages.
> >
> >
> > Michael






* Re: Patch: dumping tables data in multiple chunks in pg_dump
  2026-01-13 02:27 Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
  2026-03-28 15:32 ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-03-28 15:33   ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
@ 2026-03-29 21:49     ` Hannu Krosing <[email protected]>
  2026-03-30 17:32       ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  0 siblings, 1 reply; 24+ messages in thread

From: Hannu Krosing @ 2026-03-29 21:49 UTC (permalink / raw)
  To: Michael Banck <[email protected]>; +Cc: David Rowley <[email protected]>; Ashutosh Bapat <[email protected]>; PostgreSQL Hackers <[email protected]>; Nathan Bossart <[email protected]>

Fixing an off-by-one error in copying over dependencies.


On Sat, Mar 28, 2026 at 4:33 PM Hannu Krosing <[email protected]> wrote:
>
> The above
>
> "Or it can be almost 200 GB if the page has just pointers to 1GB TOAST items."
>
> should read
>
> "Or it can be almost 200 GB *for a single page* if the page has just
> pointers to 1GB TOAST items."
>
>
> On Sat, Mar 28, 2026 at 4:32 PM Hannu Krosing <[email protected]> wrote:
> >
> > The issue is that currently the value is given in "main table pages"
> > and it would be somewhat deceptive, or at least confusing, to try to
> > express this in any other unit.
> >
> > As I explained in the commit message:
> >
> > ---------8<-------------------8<-------------------8<----------------
> > This --max-table-segment-pages number specifically applies to main table
> > pages which does not guarantee anything about output size.
> > The output could be empty if there are no live tuples in the page range.
> > Or it can be almost 200 GB if the page has just pointers to 1GB TOAST items.
> > ---------8<-------------------8<-------------------8<----------------
> >
> > And I can think of no cheap and reliable way to change that equation.
> >
> > I'll be very happy if you have any good ideas for either improving the
> > flag name, or even propose a way to better estimate the resulting dump
> > file size so we could give the chunk size in better units
> >
> > ---
> > Hannu
> >
> >
> >
> >
> >
> > On Sat, Mar 28, 2026 at 12:26 PM Michael Banck <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > On Tue, Jan 13, 2026 at 03:27:25PM +1300, David Rowley wrote:
> > > > Perhaps --max-table-segment-pages is a better name than
> > > > --huge-table-chunk-pages as it's quite subjective what the minimum
> > > > number of pages required to make a table "huge".
> > >
> > > I'm not sure that's better - without looking at the documentation,
> > > people might confuse segment here with the 1GB split of tables into
> > > segments. As pg_dump is a very common and basic user tool, I don't think
> > > implementation details like pages/page sizes and blocks should be part
> > > of its UX.
> > >
> > > Can't we just make it a storage size, like '10GB' and then rename it to
> > > --table-parallel-threshold or something? I agree it's bikeshedding, but
> > > I personally don't like either --max-table-segment-pages or
> > > --huge-table-chunk-pages.
> > >
> > >
> > > Michael


Attachments:

  [application/x-patch] v15-0001-Add-max-table-segment-pages-option-to-pg.patch (27.9K, 2-v15-0001-Add-max-table-segment-pages-option-to-pg.patch)
From d9442eb6476ba27e0f3dee085e48de2efbb445d6 Mon Sep 17 00:00:00 2001
From: Hannu Krosing <[email protected]>
Date: Sat, 28 Mar 2026 11:53:39 +0100
Subject: [PATCH v14] Add --max-table-segment-pages option to pg_dump
 for parallel table dumping.

This patch introduces the ability to split large heap tables into segments
based on a specified number of pages. These segments can then be dumped in
parallel using the existing jobs infrastructure, significantly reducing
the time required to dump very large tables.
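
As an illustration only (not part of the patch), the way a table is cut
into page ranges can be sketched as below. The type and function names
are hypothetical, and SKETCH_INVALID_BLOCK is a stand-in for PostgreSQL's
InvalidBlockNumber, used here to mark the open-ended last segment.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for InvalidBlockNumber (assumption for this sketch). */
#define SKETCH_INVALID_BLOCK UINT32_MAX

typedef struct
{
	uint32_t	start_page;		/* first page of the segment */
	uint32_t	end_page;		/* last page, or SKETCH_INVALID_BLOCK */
} SketchSegment;

/*
 * Fill segs[] (up to max_segs entries) with the page ranges for a table
 * of relpages pages.  Returns the number of segments needed, or 0 when
 * the table is not larger than max_pages and is dumped in one piece.
 */
static size_t
split_into_segments(uint32_t relpages, uint32_t max_pages,
					SketchSegment *segs, size_t max_segs)
{
	uint64_t	start = 0;
	size_t		n = 0;

	if (max_pages == 0 || relpages <= max_pages)
		return 0;

	while (start < relpages)
	{
		uint64_t	next = start + max_pages;

		if (n < max_segs)
		{
			segs[n].start_page = (uint32_t) start;
			/* the last segment is left open-ended */
			segs[n].end_page = (next >= relpages)
				? SKETCH_INVALID_BLOCK
				: (uint32_t) (next - 1);
		}
		n++;
		start = next;
	}
	return n;
}
```

For a five-page table and a limit of two pages this yields the ranges
0..1, 2..3 and 4..(open).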

This --max-table-segment-pages number specifically applies to main table
pages which does not guarantee anything about output size.
The output could be empty if there are no live tuples in the page range.
Or it can be almost 200 GB if the page has just pointers to 1GB TOAST items.

The implementation uses ctid-based range queries (e.g., WHERE ctid >=
'(startPage,1)' AND ctid < '(endPage+1,0)') to extract specific chunks of
the relation.
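
For illustration, a minimal sketch of how such a range predicate could be
assembled for one segment. The helper name and the SKETCH_INVALID_BLOCK
macro (a stand-in for InvalidBlockNumber) are assumptions of this sketch;
the actual patch builds the string with PQExpBuffer inside
dumpTableData_copy().

```c
#include <stdint.h>
#include <stdio.h>

/* Stand-in for InvalidBlockNumber (assumption for this sketch). */
#define SKETCH_INVALID_BLOCK UINT32_MAX

/*
 * Hypothetical helper: write the ctid range predicate for one segment
 * into buf.  end_page == SKETCH_INVALID_BLOCK means the upper bound is
 * left open.  Returns whatever snprintf() returns.
 */
static int
segment_predicate(char *buf, size_t buflen,
				  uint32_t start_page, uint32_t end_page)
{
	if (start_page == 0)
		/* first segment: no lower bound needed */
		return snprintf(buf, buflen, "ctid < '(%u,0)'",
						(unsigned int) (end_page + 1));

	if (end_page != SKETCH_INVALID_BLOCK)
		/* middle segment: bounded on both sides */
		return snprintf(buf, buflen,
						"ctid >= '(%u,1)' AND ctid < '(%u,0)'",
						(unsigned int) start_page,
						(unsigned int) (end_page + 1));

	/* last segment: open upper bound, picks up pages added later */
	return snprintf(buf, buflen, "ctid >= '(%u,1)'",
					(unsigned int) start_page);
}
```

Using "ctid < (endPage+1,0)" as the upper bound avoids having to know the
highest tuple offset on the last page of the range.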

This is only efficient on PostgreSQL 14 and later; it does work on earlier
versions, but inefficiently.

The patch only supports the "heap" access method, as others may not even
have a ctid column.
---
 doc/src/sgml/ref/pg_dump.sgml             |  24 +++
 src/bin/pg_dump/pg_backup.h               |   2 +
 src/bin/pg_dump/pg_backup_archiver.c      |  84 +++++++++-
 src/bin/pg_dump/pg_backup_archiver.h      |  12 +-
 src/bin/pg_dump/pg_dump.c                 | 177 +++++++++++++++++-----
 src/bin/pg_dump/pg_dump.h                 |  22 ++-
 src/bin/pg_dump/t/004_pg_dump_parallel.pl |  31 ++++
 src/fe_utils/option_utils.c               |  55 +++++++
 src/include/fe_utils/option_utils.h       |   3 +
 9 files changed, 364 insertions(+), 46 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 7f538e90194..5f056bb4af6 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1066,6 +1066,30 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--max-table-segment-pages=<replaceable class="parameter">npages</replaceable></option></term>
+      <listitem>
+       <para>
+        Dump data in segments based on number of pages in the main relation.
+        If the number of data pages in the relation is more than <replaceable class="parameter">npages</replaceable> 
+        the data is split into segments based on that number of pages.
+        Individual segments can be dumped in parallel.
+       </para>
+
+       <note>
+        <para>
+         The option <option>--max-table-segment-pages</option> applies only to pages
+         in the main heap; if the table has a large TOASTed part, this has to be
+         taken into account when deciding on the number of pages to use.
+         In the extreme case a single 8kB heap page can hold ~200 TOAST pointers, each
+         corresponding to 1GB of data. If this data is also incompressible, a
+         single-page segment can produce a 200GB dump file.
+        </para>
+       </note>
+
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--no-comments</option></term>
       <listitem>
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index fda912ba0a9..11863a1915f 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -27,6 +27,7 @@
 #include "common/file_utils.h"
 #include "fe_utils/simple_list.h"
 #include "libpq-fe.h"
+#include "storage/block.h"
 
 
 typedef enum trivalue
@@ -179,6 +180,7 @@ typedef struct _dumpOptions
 	bool		aclsSkip;
 	const char *lockWaitTimeout;
 	int			dump_inserts;	/* 0 = COPY, otherwise rows per INSERT */
+	BlockNumber	max_table_segment_pages; /* chunk when relpages is above this */
 
 	/* flags for various command-line long options */
 	int			disable_dollar_quoting;
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 271a2c3e481..384add0713b 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -44,6 +44,7 @@
 #include "pg_backup_archiver.h"
 #include "pg_backup_db.h"
 #include "pg_backup_utils.h"
+#include "storage/block.h"
 
 #define TEXT_DUMP_HEADER "--\n-- PostgreSQL database dump\n--\n\n"
 #define TEXT_DUMPALL_HEADER "--\n-- PostgreSQL database cluster dump\n--\n\n"
@@ -154,6 +155,7 @@ InitDumpOptions(DumpOptions *opts)
 	opts->dumpSchema = true;
 	opts->dumpData = true;
 	opts->dumpStatistics = false;
+	opts->max_table_segment_pages = InvalidBlockNumber;
 }
 
 /*
@@ -1995,6 +1997,28 @@ _moveBefore(TocEntry *pos, TocEntry *te)
 	pos->prev = te;
 }
 
+/*
+ * Add a dependency id to a DependencyList object
+ * This is currently used for collecting reverse 
+ * dependencies for chunked data dump 
+ *
+ * Note: duplicate dependencies are currently not eliminated
+ */
+void
+addStandaloneDependency(DependencyList *dobj, DumpId refId)
+{
+	pg_log_debug("Adding dep: list %p + dep %u", (void *) dobj->dependencies, refId);
+	if (dobj->nDeps >= dobj->allocDeps)
+	{
+		dobj->allocDeps = (dobj->allocDeps <= 0) ? 16 : dobj->allocDeps * 2;
+		dobj->dependencies = pg_realloc_array(dobj->dependencies,
+											  DumpId, dobj->allocDeps);
+		pg_log_debug("Realloced list %p to size %d", (void *) dobj->dependencies, dobj->allocDeps);
+	}
+	pg_log_debug("Added dep: list %p + dep %u", (void *) dobj->dependencies, refId);
+	dobj->dependencies[dobj->nDeps++] = refId;
+}
+
 /*
  * Build index arrays for the TOC list
  *
@@ -2014,6 +2038,7 @@ buildTocEntryArrays(ArchiveHandle *AH)
 
 	AH->tocsByDumpId = pg_malloc0_array(TocEntry *, (maxDumpId + 1));
 	AH->tableDataId = pg_malloc0_array(DumpId, (maxDumpId + 1));
+	AH->tableDataChunkIds = pg_malloc0_array(DependencyList, (maxDumpId + 1));
 
 	for (te = AH->toc->next; te != AH->toc; te = te->next)
 	{
@@ -2029,8 +2054,12 @@ buildTocEntryArrays(ArchiveHandle *AH)
 		 * TOC entry that has a DATA item.  We compute this by reversing the
 		 * TABLE DATA item's dependency, knowing that a TABLE DATA item has
 		 * just one dependency and it is the TABLE item.
+		 *
+		 * For chunked table data, the TABLE DATA item has a description like
+		 * "TABLE DATA (pages 100:199)", and we collect all such items as
+		 * reverse dependencies for the parent table's entry in tableDataChunkIds.
 		 */
-		if (strcmp(te->desc, "TABLE DATA") == 0 && te->nDeps > 0)
+		if (strncmp(te->desc, "TABLE DATA", 10) == 0 && te->nDeps > 0)
 		{
 			DumpId		tableId = te->dependencies[0];
 
@@ -2042,7 +2071,14 @@ buildTocEntryArrays(ArchiveHandle *AH)
 			if (tableId <= 0 || tableId > maxDumpId)
 				pg_fatal("bad table dumpId for TABLE DATA item");
 
-			AH->tableDataId[tableId] = te->dumpId;
+			if (te->desc[10] == '\0') /* te->desc == "TABLE DATA" */
+				AH->tableDataId[tableId] = te->dumpId;
+			else
+			{
+				/* Chunked table data, the description is "TABLE DATA (pages %u:%u)" */
+				addStandaloneDependency(&(AH->tableDataChunkIds[tableId]), te->dumpId);
+				pg_log_debug("Added chunked table data dependency: tableId %u + chunkId %u",
+							 tableId, te->dumpId);}
 		}
 	}
 }
@@ -5017,6 +5053,12 @@ fix_dependencies(ArchiveHandle *AH)
  * that parallel restore will prioritize larger jobs (index builds, FK
  * constraint checks, etc) over smaller ones, avoiding situations where we
  * end a restore with only one active job working on a large table.
+ *
+ * In case of chunked dumps, we replace the dependency on the table with a
+ * dependency on the first chunk of data and append the remaining chunk ids,
+ * if any, to the end of the dependency list.
+ * We also calculate fullDataLength as the sum of the lengths of the chunk
+ * data items and use that to set the item's dataLength.
  */
 static void
 repoint_table_dependencies(ArchiveHandle *AH)
@@ -5032,8 +5074,9 @@ repoint_table_dependencies(ArchiveHandle *AH)
 		for (i = 0; i < te->nDeps; i++)
 		{
 			olddep = te->dependencies[i];
-			if (olddep <= AH->maxDumpId &&
-				AH->tableDataId[olddep] != 0)
+			if (olddep > AH->maxDumpId)
+				continue;
+			if (AH->tableDataId[olddep] != 0)
 			{
 				DumpId		tabledataid = AH->tableDataId[olddep];
 				TocEntry   *tabledatate = AH->tocsByDumpId[tabledataid];
@@ -5043,6 +5086,39 @@ repoint_table_dependencies(ArchiveHandle *AH)
 				pg_log_debug("transferring dependency %d -> %d to %d",
 							 te->dumpId, olddep, tabledataid);
 			}
+			else if (AH->tableDataChunkIds[olddep].nDeps > 0)
+			{
+				int			j;
+				DumpId		chunkdataid;
+				uint64		fullDataLength;
+				DependencyList *deplist = &AH->tableDataChunkIds[olddep];
+
+				/* first in list replaces the dependency on table */
+				chunkdataid = deplist->dependencies[0];
+				te->dependencies[i] = chunkdataid;
+				fullDataLength = AH->tocsByDumpId[chunkdataid]->dataLength;
+				pg_log_debug("transferring chunk list %d -> %d to %d",
+							 te->dumpId, olddep, chunkdataid);
+
+				if (deplist->nDeps > 1)
+				{
+					/* make space */
+					te->dependencies = pg_realloc_array(te->dependencies,
+												  DumpId,
+												  te->nDeps + deplist->nDeps - 1);
+
+					/* the rest are appended to dependencies */
+					for (j = 1; j < deplist->nDeps; j++)
+					{
+						chunkdataid = deplist->dependencies[j];
+						te->dependencies[te->nDeps++] = chunkdataid;
+						fullDataLength += AH->tocsByDumpId[chunkdataid]->dataLength;
+						pg_log_debug("adding chunk list %d -> %d to %d",
+									te->dumpId, olddep, chunkdataid);
+					}
+				}
+				te->dataLength = Max(te->dataLength, fullDataLength);
+			}
 		}
 	}
 }
diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h
index 365073b3eae..cfa3ea1bbd6 100644
--- a/src/bin/pg_dump/pg_backup_archiver.h
+++ b/src/bin/pg_dump/pg_backup_archiver.h
@@ -179,6 +179,13 @@ typedef enum
 	OUTPUT_OTHERDATA,			/* writing data as INSERT commands */
 } ArchiverOutput;
 
+typedef struct _DependencyList
+{
+	DumpId	   *dependencies;	/* dumpIds of objects this one depends on */
+	int			nDeps;			/* number of valid dependencies */
+	int			allocDeps;		/* allocated size of dependencies[] */
+} DependencyList;
+
 /*
  * For historical reasons, ACL items are interspersed with everything else in
  * a dump file's TOC; typically they're right after the object they're for.
@@ -311,6 +318,7 @@ struct _archiveHandle
 	/* arrays created after the TOC list is complete: */
 	struct _tocEntry **tocsByDumpId;	/* TOCs indexed by dumpId */
 	DumpId	   *tableDataId;	/* TABLE DATA ids, indexed by table dumpId */
+	DependencyList *tableDataChunkIds; /* dependencies indexed by dumpId */
 
 	struct _tocEntry *currToc;	/* Used when dumping data */
 	pg_compress_specification compression_spec; /* Requested specification for
@@ -377,7 +385,7 @@ struct _tocEntry
 	size_t		defnLen;		/* length of dumped definition */
 
 	/* working state while dumping/restoring */
-	pgoff_t		dataLength;		/* item's data size; 0 if none or unknown */
+	uint64		dataLength;		/* item's data size; 0 if none or unknown */
 	int			reqs;			/* do we need schema and/or data of object
 								 * (REQ_* bit mask) */
 	bool		created;		/* set for DATA member if TABLE was created */
@@ -437,6 +445,8 @@ extern int	TocIDRequired(ArchiveHandle *AH, DumpId id);
 TocEntry   *getTocEntryByDumpId(ArchiveHandle *AH, DumpId id);
 extern bool checkSeek(FILE *fp);
 
+extern void addStandaloneDependency(DependencyList *dobj, DumpId refId);
+
 #define appendStringLiteralAHX(buf,str,AH) \
 	appendStringLiteral(buf, str, (AH)->public.encoding, (AH)->public.std_strings)
 
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 5d1f7682f11..1e7d9a3f7f3 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -535,6 +535,7 @@ main(int argc, char **argv)
 		{"exclude-extension", required_argument, NULL, 17},
 		{"sequence-data", no_argument, &dopt.sequence_data, 1},
 		{"restrict-key", required_argument, NULL, 25},
+		{"max-table-segment-pages", required_argument, NULL, 26},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -799,6 +800,12 @@ main(int argc, char **argv)
 				dopt.restrict_key = pg_strdup(optarg);
 				break;
 
+			case 26:
+				if (!option_parse_uint32(optarg, "--max-table-segment-pages", 1, MaxBlockNumber,
+									  &dopt.max_table_segment_pages))
+					exit_nicely(1);
+				break;
+
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -1344,6 +1351,9 @@ help(const char *progname)
 	printf(_("  --extra-float-digits=NUM     override default setting for extra_float_digits\n"));
 	printf(_("  --filter=FILENAME            include or exclude objects and data from dump\n"
 			 "                               based on expressions in FILENAME\n"));
+	printf(_("  --max-table-segment-pages=NUMPAGES\n"
+		     "                               number of main table pages above which data is \n"
+			 "                               copied out in chunks, also determines the chunk size\n"));
 	printf(_("  --if-exists                  use IF EXISTS when dropping objects\n"));
 	printf(_("  --include-foreign-data=PATTERN\n"
 			 "                               include data of foreign tables on foreign\n"
@@ -2396,7 +2406,7 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 	 * dumping an old pg_largeobject_metadata defined WITH OIDS.  For other
 	 * cases a simple COPY suffices.
 	 */
-	if (tdinfo->filtercond || tbinfo->relkind == RELKIND_FOREIGN_TABLE ||
+	if (tdinfo->filtercond || is_segment(tdinfo) || tbinfo->relkind == RELKIND_FOREIGN_TABLE ||
 		(fout->dopt->binary_upgrade && fout->remoteVersion < 120000 &&
 		 tbinfo->dobj.catId.oid == LargeObjectMetadataRelationId))
 	{
@@ -2414,9 +2424,37 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 		else
 			appendPQExpBufferStr(q, "* ");
 
-		appendPQExpBuffer(q, "FROM %s %s) TO stdout;",
+		appendPQExpBuffer(q, "FROM %s %s",
 						  fmtQualifiedDumpable(tbinfo),
 						  tdinfo->filtercond ? tdinfo->filtercond : "");
+		/*
+		 * If it's a segment, add a filter condition to select the right
+		 * page range:
+		 * - for the first segment (startPage == 0) we add
+		 *   "ctid < (endPage+1, 0)"
+		 * - for the last segment (endPage == InvalidBlockNumber) we add
+		 *   "ctid >= (startPage, 1)"; the upper bound is left open in
+		 *   case more pages were added after we measured
+		 * - for middle segments we add
+		 *   "ctid >= (startPage, 1) AND ctid < (endPage+1, 0)"
+		 *
+		 * "ctid < (endPage+1, 0)" rather than "ctid <= (endPage, maxtuple)"
+		 * was chosen as the range end so that we do not have to estimate
+		 * maxtuple.
+		 */
+		if (is_segment(tdinfo))
+		{
+			appendPQExpBufferStr(q, tdinfo->filtercond?" AND ":" WHERE ");
+			if(tdinfo->startPage == 0)
+				appendPQExpBuffer(q, "ctid < '(%u,0)'", tdinfo->endPage+1);			
+			else if(tdinfo->endPage != InvalidBlockNumber)
+				appendPQExpBuffer(q, "ctid >= '(%u,1)' AND ctid < '(%u,0)'",
+								 tdinfo->startPage, tdinfo->endPage+1);
+			else
+				appendPQExpBuffer(q, "ctid >= '(%u,1)'", tdinfo->startPage);
+		}
+
+		appendPQExpBuffer(q, ") TO stdout;");
 	}
 	else
 	{
@@ -2424,6 +2462,10 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 						  fmtQualifiedDumpable(tbinfo),
 						  column_list);
 	}
+
+	if (is_segment(tdinfo))
+		pg_log_debug("CHUNKING: data query: %s", q->data);
+	
 	res = ExecuteSqlQuery(fout, q->data, PGRES_COPY_OUT);
 	PQclear(res);
 	destroyPQExpBuffer(clistBuf);
@@ -2919,42 +2961,89 @@ dumpTableData(Archive *fout, const TableDataInfo *tdinfo)
 	{
 		TocEntry   *te;
 
-		te = ArchiveEntry(fout, tdinfo->dobj.catId, tdinfo->dobj.dumpId,
-						  ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
-									   .namespace = tbinfo->dobj.namespace->dobj.name,
-									   .owner = tbinfo->rolname,
-									   .description = "TABLE DATA",
-									   .section = SECTION_DATA,
-									   .createStmt = tdDefn,
-									   .copyStmt = copyStmt,
-									   .deps = &(tbinfo->dobj.dumpId),
-									   .nDeps = 1,
-									   .dumpFn = dumpFn,
-									   .dumpArg = tdinfo));
-
-		/*
-		 * Set the TocEntry's dataLength in case we are doing a parallel dump
-		 * and want to order dump jobs by table size.  We choose to measure
-		 * dataLength in table pages (including TOAST pages) during dump, so
-		 * no scaling is needed.
-		 *
-		 * However, relpages is declared as "integer" in pg_class, and hence
-		 * also in TableInfo, but it's really BlockNumber a/k/a unsigned int.
-		 * Cast so that we get the right interpretation of table sizes
-		 * exceeding INT_MAX pages.
+		/* Data chunking works off relpages, which is computed exactly using
+		 * pg_relation_size() when --max-table-segment-pages is set.
+		 *
+		 * We also don't chunk if the table access method is not "heap".
+		 * TODO: we may add chunking for other access methods later, maybe
+		 * based on primary key ranges.
 		 */
-		te->dataLength = (BlockNumber) tbinfo->relpages;
-		te->dataLength += (BlockNumber) tbinfo->toastpages;
+		if (tbinfo->relpages <= dopt->max_table_segment_pages || 
+			strcmp(tbinfo->amname, "heap") != 0)
+		{
+			te = ArchiveEntry(fout, tdinfo->dobj.catId, tdinfo->dobj.dumpId,
+							ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
+										.namespace = tbinfo->dobj.namespace->dobj.name,
+										.owner = tbinfo->rolname,
+										.description = "TABLE DATA",
+										.section = SECTION_DATA,
+										.createStmt = tdDefn,
+										.copyStmt = copyStmt,
+										.deps = &(tbinfo->dobj.dumpId),
+										.nDeps = 1,
+										.dumpFn = dumpFn,
+										.dumpArg = tdinfo));
 
-		/*
-		 * If pgoff_t is only 32 bits wide, the above refinement is useless,
-		 * and instead we'd better worry about integer overflow.  Clamp to
-		 * INT_MAX if the correct result exceeds that.
-		 */
-		if (sizeof(te->dataLength) == 4 &&
-			(tbinfo->relpages < 0 || tbinfo->toastpages < 0 ||
-			 te->dataLength < 0))
-			te->dataLength = INT_MAX;
+			/*
+			 * Set the TocEntry's dataLength in case we are doing a parallel dump
+			 * and want to order dump jobs by table size.  We choose to measure
+			 * dataLength in table pages (including TOAST pages) during dump, so
+			 * no scaling is needed.
+			 *
+			 * Although pg_class.relpages, which stores a BlockNumber (a/k/a
+			 * unsigned int), is declared as "integer", we convert it back and
+			 * store it as BlockNumber in TableInfo.
+			 * dataLength is uint64, so it does not overflow even for
+			 * 2 x UINT32_MAX.
+			 */
+			te->dataLength = tbinfo->relpages;
+			te->dataLength += tbinfo->toastpages;
+		}
+		else
+		{
+			uint64 current_chunk_start = 0;
+			PQExpBuffer chunk_desc = createPQExpBuffer();
+
+			while (current_chunk_start < tbinfo->relpages)
+			{
+				TableDataInfo *chunk_tdinfo = (TableDataInfo *) pg_malloc(sizeof(TableDataInfo));
+
+				memcpy(chunk_tdinfo, tdinfo, sizeof(TableDataInfo));
+				AssignDumpId(&chunk_tdinfo->dobj);
+				addObjectDependency(&chunk_tdinfo->dobj, tbinfo->dobj.dumpId);
+				chunk_tdinfo->startPage = (BlockNumber) current_chunk_start;
+				chunk_tdinfo->endPage = chunk_tdinfo->startPage + dopt->max_table_segment_pages - 1;
+				
+				current_chunk_start += dopt->max_table_segment_pages;
+				if (current_chunk_start >= tbinfo->relpages)
+					chunk_tdinfo->endPage = InvalidBlockNumber; /* last chunk is for "all the rest" */
+
+				printfPQExpBuffer(chunk_desc, "TABLE DATA (pages %u:%u)", chunk_tdinfo->startPage, chunk_tdinfo->endPage);
+
+				te = ArchiveEntry(fout, chunk_tdinfo->dobj.catId, chunk_tdinfo->dobj.dumpId,
+							ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
+										.namespace = tbinfo->dobj.namespace->dobj.name,
+										.owner = tbinfo->rolname,
+										.description = chunk_desc->data,
+										.section = SECTION_DATA,
+										.createStmt = tdDefn,
+										.copyStmt = copyStmt,
+										.deps = &(tbinfo->dobj.dumpId),
+										.nDeps = 1,
+										.dumpFn = dumpFn,
+										.dumpArg = chunk_tdinfo));
+
+				if(chunk_tdinfo->endPage == InvalidBlockNumber)
+					te->dataLength = tbinfo->relpages - chunk_tdinfo->startPage;
+				else
+					te->dataLength = dopt->max_table_segment_pages;
+				/* let's assume toast pages distribute evenly among chunks */
+				if(tbinfo->relpages)
+					te->dataLength += te->dataLength * tbinfo->toastpages / tbinfo->relpages;
+			}
+
+			destroyPQExpBuffer(chunk_desc);
+		}
 	}
 
 	destroyPQExpBuffer(copyBuf);
@@ -3081,6 +3170,8 @@ makeTableDataInfo(DumpOptions *dopt, TableInfo *tbinfo)
 	tdinfo->dobj.namespace = tbinfo->dobj.namespace;
 	tdinfo->tdtable = tbinfo;
 	tdinfo->filtercond = NULL;	/* might get set later */
+	tdinfo->startPage = InvalidBlockNumber; /* we use this as indication that no chunking is needed */
+	tdinfo->endPage = InvalidBlockNumber;
 	addObjectDependency(&tdinfo->dobj, tbinfo->dobj.dumpId);
 
 	/* A TableDataInfo contains data, of course */
@@ -7347,8 +7438,16 @@ getTables(Archive *fout, int *numTables)
 						 "c.relnamespace, c.relkind, c.reltype, "
 						 "c.relowner, "
 						 "c.relchecks, "
-						 "c.relhasindex, c.relhasrules, c.relpages, "
-						 "c.reltuples, c.relallvisible, ");
+						 "c.relhasindex, c.relhasrules, ");
+
+	/* fetch current relation size if chunking is requested */
+	if(dopt->max_table_segment_pages != InvalidBlockNumber)
+		appendPQExpBufferStr(query, "pg_relation_size(c.oid)/current_setting('block_size')::int AS relpages, ");
+	else
+		/* pg_class.relpages stores BlockNumber (uint32) in an int field, convert to oid to get unsigned int out */
+		appendPQExpBufferStr(query, "c.relpages::oid, ");
+
+	appendPQExpBufferStr(query, "c.reltuples, c.relallvisible, ");
 
 	if (fout->remoteVersion >= 180000)
 		appendPQExpBufferStr(query, "c.relallfrozen, ");
@@ -7589,7 +7688,7 @@ getTables(Archive *fout, int *numTables)
 		tblinfo[i].ncheck = atoi(PQgetvalue(res, i, i_relchecks));
 		tblinfo[i].hasindex = (strcmp(PQgetvalue(res, i, i_relhasindex), "t") == 0);
 		tblinfo[i].hasrules = (strcmp(PQgetvalue(res, i, i_relhasrules), "t") == 0);
-		tblinfo[i].relpages = atoi(PQgetvalue(res, i, i_relpages));
+		tblinfo[i].relpages = strtoul(PQgetvalue(res, i, i_relpages), NULL, 10);
 		if (PQgetisnull(res, i, i_toastpages))
 			tblinfo[i].toastpages = 0;
 		else
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 5a6726d8b12..84e682d585f 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -16,6 +16,7 @@
 
 #include "pg_backup.h"
 #include "catalog/pg_publication_d.h"
+#include "storage/block.h"
 
 
 #define oidcmp(x,y) ( ((x) < (y) ? -1 : ((x) > (y)) ?  1 : 0) )
@@ -335,7 +336,11 @@ typedef struct _tableInfo
 	Oid			owning_tab;		/* OID of table owning sequence */
 	int			owning_col;		/* attr # of column owning sequence */
 	bool		is_identity_sequence;
-	int32		relpages;		/* table's size in pages (from pg_class) */
+	BlockNumber	relpages;		/* table's size in pages (from pg_class),
+								 * converted to unsigned integer; when
+								 * --max-table-segment-pages is set it is
+								 * computed from pg_relation_size()
+								 */
 	int			toastpages;		/* toast table's size in pages, if any */
 
 	bool		interesting;	/* true if need to collect more data */
@@ -413,8 +418,21 @@ typedef struct _tableDataInfo
 	DumpableObject dobj;
 	TableInfo  *tdtable;		/* link to table to dump */
 	char	   *filtercond;		/* WHERE condition to limit rows dumped */
+	/* startPage and endPage to support segmented dump */
+	BlockNumber	startPage;		/* first page of the segment; since a real
+								 * segment always has a valid first page,
+								 * InvalidBlockNumber here marks the
+								 * non-segmented case.  When it is 0 (first
+								 * segment) the lower bound is omitted */
+	BlockNumber	endPage;		/* last page in segment for page-range dump,
+								 * startPage+max_table_segment_pages-1 for
+								 * most segments, but InvalidBlockNumber for
+								 * the last one to indicate an open range
+								 */
 } TableDataInfo;
 
+#define is_segment(tdiptr) ((tdiptr)->startPage != InvalidBlockNumber)
+
 typedef struct _indxInfo
 {
 	DumpableObject dobj;
@@ -449,7 +467,7 @@ typedef struct _relStatsInfo
 {
 	DumpableObject dobj;
 	Oid			relid;
-	int32		relpages;
+	BlockNumber	relpages;
 	char	   *reltuples;
 	int32		relallvisible;
 	int32		relallfrozen;
diff --git a/src/bin/pg_dump/t/004_pg_dump_parallel.pl b/src/bin/pg_dump/t/004_pg_dump_parallel.pl
index 738f34b1c1b..4f35aeed9b9 100644
--- a/src/bin/pg_dump/t/004_pg_dump_parallel.pl
+++ b/src/bin/pg_dump/t/004_pg_dump_parallel.pl
@@ -11,6 +11,7 @@ use Test::More;
 my $dbname1 = 'regression_src';
 my $dbname2 = 'regression_dest1';
 my $dbname3 = 'regression_dest2';
+my $dbname4 = 'regression_dest3';
 
 my $node = PostgreSQL::Test::Cluster->new('main');
 $node->init;
@@ -21,6 +22,7 @@ my $backupdir = $node->backup_dir;
 $node->run_log([ 'createdb', $dbname1 ]);
 $node->run_log([ 'createdb', $dbname2 ]);
 $node->run_log([ 'createdb', $dbname3 ]);
+$node->run_log([ 'createdb', $dbname4 ]);
 
 $node->safe_psql(
 	$dbname1,
@@ -87,4 +89,33 @@ $node->command_ok(
 	],
 	'parallel restore as inserts');
 
+$node->command_ok(
+	[
+		'pg_dump',
+		'--format' => 'directory',
+		'--max-table-segment-pages' => 2,
+		'--no-sync',
+		'--jobs' => 2,
+		'--file' => "$backupdir/dump3",
+		$node->connstr($dbname1),
+	],
+	'parallel dump with chunks of two heap pages');
+
+$node->command_ok(
+	[
+		'pg_restore', '--verbose',
+		'--dbname' => $node->connstr($dbname4),
+		'--jobs' => 3,
+		"$backupdir/dump3",
+	],
+	'parallel restore with chunks of two heap pages');
+
+my $table = 'tplain';
+my $tablehash_query = "SELECT '$table', sum(hashtext(t::text)), count(*) FROM $table AS t";
+
+my $result_1 = $node->safe_psql($dbname1, $tablehash_query);
+my $result_4 = $node->safe_psql($dbname4, $tablehash_query);
+
+is($result_4, $result_1, "Hash check for $table: restored db ($result_4) vs original db ($result_1)");
+
 done_testing();
diff --git a/src/fe_utils/option_utils.c b/src/fe_utils/option_utils.c
index 8d0659c1164..a516d8c86a9 100644
--- a/src/fe_utils/option_utils.c
+++ b/src/fe_utils/option_utils.c
@@ -83,6 +83,61 @@ option_parse_int(const char *optarg, const char *optname,
 	return true;
 }
 
+/*
+ * option_parse_uint32
+ *
+ * Parse unsigned integer value for an option.  If the parsing is successful,
+ * returns true and stores the result in *result if that's given;
+ * if parsing fails, returns false.
+ */
+bool
+option_parse_uint32(const char *optarg, const char *optname,
+				 uint32 min_range, uint32 max_range,
+				 uint32 *result)
+{
+	char	   		*endptr;
+	unsigned long	val;
+
+	/* Fail if there is a minus sign at the start of value */
+	while (isspace((unsigned char) *optarg))
+		optarg++;
+	if (*optarg == '-')
+	{
+		pg_log_error("value \"%s\" for option %s cannot be negative",
+					 optarg, optname);
+		return false;
+	}
+
+	errno = 0;
+	val = strtoul(optarg, &endptr, 10);
+
+	/*
+	 * Skip any trailing whitespace; if anything but whitespace remains before
+	 * the terminating character, fail.
+	 */
+	while (*endptr != '\0' && isspace((unsigned char) *endptr))
+		endptr++;
+
+	if (*endptr != '\0')
+	{
+		pg_log_error("invalid value \"%s\" for option %s",
+					 optarg, optname);
+		return false;
+	}
+
+	/* As min_range and max_range are uint32, the range check also catches
+	 * the case where the unsigned long val is outside the 32-bit range. */
+	if (errno == ERANGE || val < min_range || val > max_range)
+	{
+		pg_log_error("%s not in range %u..%u", optname, min_range, max_range);
+		return false;
+	}
+
+	if (result)
+		*result = (uint32) val;
+	return true;
+}
+
 /*
  * Provide strictly harmonized handling of the --sync-method option.
  */
diff --git a/src/include/fe_utils/option_utils.h b/src/include/fe_utils/option_utils.h
index d975db77af2..67fd3650d7a 100644
--- a/src/include/fe_utils/option_utils.h
+++ b/src/include/fe_utils/option_utils.h
@@ -22,6 +22,9 @@ extern void handle_help_version_opts(int argc, char *argv[],
 extern bool option_parse_int(const char *optarg, const char *optname,
 							 int min_range, int max_range,
 							 int *result);
+extern bool option_parse_uint32(const char *optarg, const char *optname,
+							 uint32 min_range, uint32 max_range,
+							 uint32 *result);
 extern bool parse_sync_method(const char *optarg,
 							  DataDirSyncMethod *sync_method);
 extern void check_mut_excl_opts_internal(int n,...);
-- 
2.43.0




* Re: Patch: dumping tables data in multiple chunks in pg_dump
  2026-01-13 02:27 Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
  2026-03-28 15:32 ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-03-28 15:33   ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-03-29 21:49     ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
@ 2026-03-30 17:32       ` Hannu Krosing <[email protected]>
  2026-03-30 21:32         ` Re: Patch: dumping tables data in multiple chunks in pg_dump Zsolt Parragi <[email protected]>
  0 siblings, 1 reply; 24+ messages in thread

From: Hannu Krosing @ 2026-03-30 17:32 UTC (permalink / raw)
  To: PostgreSQL Hackers <[email protected]>; +Cc: David Rowley <[email protected]>; Michael Banck <[email protected]>; Ashutosh Bapat <[email protected]>; Nathan Bossart <[email protected]>

Now the dependencies on chunks should also behave correctly
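
For readers following along, here is a minimal sketch of the idea behind
the dependency handling for chunks: an entry that depended on the table is
pointed at the first chunk, and the remaining chunk ids are appended to its
dependency list. This is not the patch's code; the function name
repoint_to_chunks and the use of plain int ids are assumptions made only
for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Replace deps[idx] (a dependency on the table) with the first chunk id and
 * append the remaining chunk ids.  Returns the possibly reallocated array
 * and updates *ndeps.  All names here are illustrative, not from the patch.
 */
int *
repoint_to_chunks(int *deps, int *ndeps, int idx,
                  const int *chunks, int nchunks)
{
    assert(idx >= 0 && idx < *ndeps && nchunks > 0);

    /* the first chunk takes over the slot that pointed at the table */
    deps[idx] = chunks[0];

    if (nchunks > 1)
    {
        /* room for the old entries plus the nchunks - 1 new ones */
        int *grown = realloc(deps,
                             sizeof(int) * (size_t) (*ndeps + nchunks - 1));

        if (grown == NULL)
            abort();
        deps = grown;

        /* the rest are appended at the end of the list */
        for (int j = 1; j < nchunks; j++)
            deps[(*ndeps)++] = chunks[j];
    }
    return deps;
}
```

With a list {10, 20} where 20 is the table and chunks {31, 32, 33}, the
list becomes {10, 31, 32, 33}.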


On Sun, Mar 29, 2026 at 11:49 PM Hannu Krosing <[email protected]> wrote:
>
> Fixing an off-by-one error in copying over dependencies
>
>
> On Sat, Mar 28, 2026 at 4:33 PM Hannu Krosing <[email protected]> wrote:
> >
> > The above
> >
> > "Or it can be almost 200 GB if the page has just pointers to 1GB TOAST items."
> >
> > should read
> >
> > "Or it can be almost 200 GB *for a single page* if the page has just
> > pointers to 1GB TOAST items."
> >
> >
> > On Sat, Mar 28, 2026 at 4:32 PM Hannu Krosing <[email protected]> wrote:
> > >
> > > The issue is that currently the value is given in "main table pages"
> > > and it would be somewhat deceptive, or at least confusing, to try to
> > > express this in any other unit.
> > >
> > > As I explained in the commit message:
> > >
> > > ---------8<-------------------8<-------------------8<----------------
> > > This --max-table-segment-pages number specifically applies to main table
> > > pages which does not guarantee anything about output size.
> > > The output could be empty if there are no live tuples in the page range.
> > > Or it can be almost 200 GB if the page has just pointers to 1GB TOAST items.
> > > ---------8<-------------------8<-------------------8<----------------
> > >
> > > And I can think of no cheap and reliable way to change that equation.
> > >
> > > I'll be very happy if you have any good ideas for improving the
> > > flag name, or can even propose a way to better estimate the resulting
> > > dump file size so we could give the chunk size in better units.
> > >
> > > ---
> > > Hannu
> > >
> > >
> > >
> > >
> > >
> > > On Sat, Mar 28, 2026 at 12:26 PM Michael Banck <[email protected]> wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Tue, Jan 13, 2026 at 03:27:25PM +1300, David Rowley wrote:
> > > > > Perhaps --max-table-segment-pages is a better name than
> > > > > --huge-table-chunk-pages as it's quite subjective what the minimum
> > > > > number of pages required to make a table "huge" is.
> > > >
> > > > I'm not sure that's better - without looking at the documentation,
> > > > people might confuse segment here with the 1GB split of tables into
> > > > segments. As pg_dump is a very common and basic user tool, I don't think
> > > > implementation details like pages/page sizes and blocks should be part
> > > > of its UX.
> > > >
> > > > Can't we just make it a storage size, like '10GB' and then rename it to
> > > > --table-parallel-threshold or something? I agree it's bikeshedding, but
> > > > I personally don't like either --max-table-segment-pages or
> > > > --huge-table-chunk-pages.
> > > >
> > > >
> > > > Michael


Attachments:

  [application/x-patch] v16-0001-Add-max-table-segment-pages-option-to-pg_dump-fo.patch (29.5K, 2-v16-0001-Add-max-table-segment-pages-option-to-pg_dump-fo.patch)
  download | inline diff:
From b0d27b32c17d1e09f9484a81b3d3c3581d190adb Mon Sep 17 00:00:00 2001
From: Hannu Krosing <[email protected]>
Date: Mon, 30 Mar 2026 19:28:45 +0200
Subject: [PATCH v16] Add --max-table-segment-pages option to pg_dump for
 parallel table dumping.

This patch introduces the ability to split large heap tables into segments
based on a specified number of pages. These segments can then be dumped in
parallel using the existing jobs infrastructure, significantly reducing
the time required to dump very large tables.

This --max-table-segment-pages number applies specifically to main table
pages, which guarantees nothing about output size.
The output could be empty if there are no live tuples in the page range.
Or it can be almost 200 GB for a single page if the page has just pointers
to 1GB TOAST items.

The implementation uses ctid-based range queries (e.g., WHERE ctid >=
'(startPage,1)' AND ctid < '(endPage+1,0)') to extract specific chunks of
the relation.
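
As an illustration only, the three shapes of range predicate can be
sketched as below. This is a standalone approximation, not the code in
pg_dump.c: segment_predicate and INVALID_BLOCK are assumed names, with
INVALID_BLOCK standing in for InvalidBlockNumber.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define INVALID_BLOCK UINT32_MAX	/* stands in for InvalidBlockNumber */

/*
 * Write the ctid range predicate for one segment into buf and return the
 * snprintf() result.  The first segment needs no lower bound, the last
 * one (end_page == INVALID_BLOCK) has no upper bound.
 */
int
segment_predicate(char *buf, size_t buflen,
                  uint32_t start_page, uint32_t end_page)
{
    assert(start_page != INVALID_BLOCK);

    if (start_page == 0 && end_page != INVALID_BLOCK)
        return snprintf(buf, buflen, "ctid < '(%u,0)'",
                        (unsigned) (end_page + 1));

    if (end_page != INVALID_BLOCK)
        return snprintf(buf, buflen,
                        "ctid >= '(%u,1)' AND ctid < '(%u,0)'",
                        (unsigned) start_page, (unsigned) (end_page + 1));

    return snprintf(buf, buflen, "ctid >= '(%u,1)'", (unsigned) start_page);
}
```

The exclusive upper bound '(endPage+1,0)' avoids having to know the
highest tuple offset on the last page of the segment.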

This is only efficient on PostgreSQL 14 and later, where TID range scans are
available; it does work on earlier versions, but inefficiently.

The patch supports only the "heap" access method, as others may not even have
the ctid column.
---
 doc/src/sgml/ref/pg_dump.sgml             |  24 +++
 src/bin/pg_dump/pg_backup.h               |   2 +
 src/bin/pg_dump/pg_backup_archiver.c      |  92 ++++++++++-
 src/bin/pg_dump/pg_backup_archiver.h      |  12 +-
 src/bin/pg_dump/pg_dump.c                 | 177 +++++++++++++++++-----
 src/bin/pg_dump/pg_dump.h                 |  22 ++-
 src/bin/pg_dump/t/004_pg_dump_parallel.pl |  31 ++++
 src/fe_utils/option_utils.c               |  55 +++++++
 src/include/fe_utils/option_utils.h       |   3 +
 9 files changed, 368 insertions(+), 50 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 7f538e90194..5f056bb4af6 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1066,6 +1066,30 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--max-table-segment-pages=<replaceable class="parameter">npages</replaceable></option></term>
+      <listitem>
+       <para>
+        Dump table data in segments based on the number of pages in the main relation.
+        If the relation has more than <replaceable class="parameter">npages</replaceable> data pages,
+        its data is split into segments of at most that many pages.
+        Individual segments can be dumped in parallel.
+       </para>
+
+       <note>
+        <para>
+         The option <option>--max-table-segment-pages</option> applies only to pages
+         in the main heap. If the table has a large TOASTed part, this has to be
+         taken into account when deciding on the number of pages to use.
+         In the extreme case, a single 8kB heap page can hold ~200 TOAST pointers, each
+         corresponding to 1GB of data. If this data is also incompressible, a
+         single-page segment can be dumped as a 200GB file.
+        </para>
+       </note>
+
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--no-comments</option></term>
       <listitem>
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index fda912ba0a9..11863a1915f 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -27,6 +27,7 @@
 #include "common/file_utils.h"
 #include "fe_utils/simple_list.h"
 #include "libpq-fe.h"
+#include "storage/block.h"
 
 
 typedef enum trivalue
@@ -179,6 +180,7 @@ typedef struct _dumpOptions
 	bool		aclsSkip;
 	const char *lockWaitTimeout;
 	int			dump_inserts;	/* 0 = COPY, otherwise rows per INSERT */
+	BlockNumber	max_table_segment_pages; /* chunk when relpages is above this */
 
 	/* flags for various command-line long options */
 	int			disable_dollar_quoting;
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 271a2c3e481..e32bd8149cb 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -44,6 +44,7 @@
 #include "pg_backup_archiver.h"
 #include "pg_backup_db.h"
 #include "pg_backup_utils.h"
+#include "storage/block.h"
 
 #define TEXT_DUMP_HEADER "--\n-- PostgreSQL database dump\n--\n\n"
 #define TEXT_DUMPALL_HEADER "--\n-- PostgreSQL database cluster dump\n--\n\n"
@@ -154,6 +155,7 @@ InitDumpOptions(DumpOptions *opts)
 	opts->dumpSchema = true;
 	opts->dumpData = true;
 	opts->dumpStatistics = false;
+	opts->max_table_segment_pages = InvalidBlockNumber;
 }
 
 /*
@@ -1995,6 +1997,28 @@ _moveBefore(TocEntry *pos, TocEntry *te)
 	pos->prev = te;
 }
 
+/*
+ * Add a dependency id to a DependencyList object.
+ * This is currently used for collecting reverse
+ * dependencies for chunked data dumps.
+ *
+ * Note: duplicate dependencies are currently not eliminated
+ */
+void
+addStandaloneDependency(DependencyList *dobj, DumpId refId)
+{
+	pg_log_debug("Adding dep: list %p + dep %d", (void *) dobj->dependencies, refId);
+	if (dobj->nDeps >= dobj->allocDeps)
+	{
+		dobj->allocDeps = (dobj->allocDeps <= 0) ? 16 : dobj->allocDeps * 2;
+		dobj->dependencies = pg_realloc_array(dobj->dependencies,
+											  DumpId, dobj->allocDeps);
+		pg_log_debug("Realloced list %p to size %d", (void *) dobj->dependencies, dobj->allocDeps);
+	}
+	pg_log_debug("Added dep: list %p + dep %d", (void *) dobj->dependencies, refId);
+	dobj->dependencies[dobj->nDeps++] = refId;
+}
+
 /*
  * Build index arrays for the TOC list
  *
@@ -2014,6 +2038,7 @@ buildTocEntryArrays(ArchiveHandle *AH)
 
 	AH->tocsByDumpId = pg_malloc0_array(TocEntry *, (maxDumpId + 1));
 	AH->tableDataId = pg_malloc0_array(DumpId, (maxDumpId + 1));
+	AH->tableDataChunkIds = pg_malloc0_array(DependencyList, (maxDumpId + 1));
 
 	for (te = AH->toc->next; te != AH->toc; te = te->next)
 	{
@@ -2029,8 +2054,12 @@ buildTocEntryArrays(ArchiveHandle *AH)
 		 * TOC entry that has a DATA item.  We compute this by reversing the
 		 * TABLE DATA item's dependency, knowing that a TABLE DATA item has
 		 * just one dependency and it is the TABLE item.
+		 *
+		 * For chunked table data, the TABLE DATA item has a description like
+		 * "TABLE DATA (pages 100:199)", and we collect all such items as
+		 * reverse dependencies for the parent table's entry in tableDataChunkIds.
 		 */
-		if (strcmp(te->desc, "TABLE DATA") == 0 && te->nDeps > 0)
+		if (strncmp(te->desc, "TABLE DATA", 10) == 0 && te->nDeps > 0)
 		{
 			DumpId		tableId = te->dependencies[0];
 
@@ -2042,7 +2071,14 @@ buildTocEntryArrays(ArchiveHandle *AH)
 			if (tableId <= 0 || tableId > maxDumpId)
 				pg_fatal("bad table dumpId for TABLE DATA item");
 
-			AH->tableDataId[tableId] = te->dumpId;
+			if (te->desc[10] == '\0') /* te->desc == "TABLE DATA" */
+				AH->tableDataId[tableId] = te->dumpId;
+			else
+			{
+				/* Chunked table data, the description is "TABLE DATA (pages %u:%u)" */
+				addStandaloneDependency(&(AH->tableDataChunkIds[tableId]), te->dumpId);
+				pg_log_debug("Added chunked table data dependency: tableId %d + chunkId %d", tableId, te->dumpId);
+			}
 		}
 	}
 }
@@ -2785,7 +2821,7 @@ ReadToc(ArchiveHandle *AH)
 				strcmp(te->desc, "ACL") == 0 ||
 				strcmp(te->desc, "ACL LANGUAGE") == 0)
 				te->section = SECTION_NONE;
-			else if (strcmp(te->desc, "TABLE DATA") == 0 ||
+			else if (strncmp(te->desc, "TABLE DATA", 10) == 0 ||
 					 strcmp(te->desc, "BLOBS") == 0 ||
 					 strcmp(te->desc, "BLOB COMMENTS") == 0)
 				te->section = SECTION_DATA;
@@ -3015,7 +3051,7 @@ _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH)
 	 * associated pg_shdepend rows. This is faster to restore than the
 	 * equivalent set of large object commands.
 	 */
-	if (ropt->binary_upgrade && strcmp(te->desc, "TABLE DATA") == 0 &&
+	if (ropt->binary_upgrade && strncmp(te->desc, "TABLE DATA", 10) == 0 &&
 		(te->catalogId.oid == LargeObjectMetadataRelationId ||
 		 te->catalogId.oid == SharedDependRelationId))
 		return REQ_DATA;
@@ -3246,7 +3282,7 @@ _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH)
 		if (ropt->selTypes)
 		{
 			if (strcmp(te->desc, "TABLE") == 0 ||
-				strcmp(te->desc, "TABLE DATA") == 0 ||
+				strncmp(te->desc, "TABLE DATA", 10) == 0 ||
 				strcmp(te->desc, "VIEW") == 0 ||
 				strcmp(te->desc, "FOREIGN TABLE") == 0 ||
 				strcmp(te->desc, "MATERIALIZED VIEW") == 0 ||
@@ -5017,6 +5053,12 @@ fix_dependencies(ArchiveHandle *AH)
  * that parallel restore will prioritize larger jobs (index builds, FK
  * constraint checks, etc) over smaller ones, avoiding situations where we
  * end a restore with only one active job working on a large table.
+ *
+ * In case of chunked dumps, we replace the dependency on the table with a
+ * dependency on the first chunk of data and append the remaining chunk ids,
+ * if any, to the end of the dependency list.
+ * We also calculate fullDataLength as the sum of the lengths of the chunk
+ * data items and use that to set the item's dataLength.
  */
 static void
 repoint_table_dependencies(ArchiveHandle *AH)
@@ -5032,8 +5074,9 @@ repoint_table_dependencies(ArchiveHandle *AH)
 		for (i = 0; i < te->nDeps; i++)
 		{
 			olddep = te->dependencies[i];
-			if (olddep <= AH->maxDumpId &&
-				AH->tableDataId[olddep] != 0)
+			if (olddep > AH->maxDumpId)
+				continue;
+			if (AH->tableDataId[olddep] != 0)
 			{
 				DumpId		tabledataid = AH->tableDataId[olddep];
 				TocEntry   *tabledatate = AH->tocsByDumpId[tabledataid];
@@ -5043,6 +5086,39 @@ repoint_table_dependencies(ArchiveHandle *AH)
 				pg_log_debug("transferring dependency %d -> %d to %d",
 							 te->dumpId, olddep, tabledataid);
 			}
+			else if (AH->tableDataChunkIds[olddep].nDeps > 0)
+			{
+				int			j;
+				DumpId		chunkdataid;
+				uint64		fullDataLength;
+				DependencyList *deplist = &AH->tableDataChunkIds[olddep];
+
+				/* first in list replaces the dependency on table */
+				chunkdataid = deplist->dependencies[0];
+				te->dependencies[i] = chunkdataid;
+				fullDataLength = AH->tocsByDumpId[chunkdataid]->dataLength;
+				pg_log_debug("transferring chunk list %d -> %d to %d",
+							 te->dumpId, olddep, chunkdataid);
+
+				if (deplist->nDeps > 1)
+				{
+					/* make space */
+					te->dependencies = pg_realloc_array(te->dependencies,
+												  DumpId,
+												  te->nDeps + deplist->nDeps - 1);
+
+					/* the rest are appended to dependencies */
+					for (j = 1; j < deplist->nDeps; j++)
+					{
+						chunkdataid = deplist->dependencies[j];
+						te->dependencies[te->nDeps++] = chunkdataid;
+						fullDataLength += AH->tocsByDumpId[chunkdataid]->dataLength;
+						pg_log_debug("adding chunk list %d -> %d to %d",
+									te->dumpId, olddep, chunkdataid);
+					}
+				}
+				te->dataLength = Max(te->dataLength, fullDataLength);
+			}
 		}
 	}
 }
@@ -5096,7 +5172,7 @@ identify_locking_dependencies(ArchiveHandle *AH, TocEntry *te)
 		DumpId		depid = te->dependencies[i];
 
 		if (depid <= AH->maxDumpId && AH->tocsByDumpId[depid] != NULL &&
-			((strcmp(AH->tocsByDumpId[depid]->desc, "TABLE DATA") == 0) ||
+			((strncmp(AH->tocsByDumpId[depid]->desc, "TABLE DATA", 10) == 0) ||
 			 strcmp(AH->tocsByDumpId[depid]->desc, "TABLE") == 0))
 			lockids[nlockids++] = depid;
 	}
diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h
index 365073b3eae..cfa3ea1bbd6 100644
--- a/src/bin/pg_dump/pg_backup_archiver.h
+++ b/src/bin/pg_dump/pg_backup_archiver.h
@@ -179,6 +179,13 @@ typedef enum
 	OUTPUT_OTHERDATA,			/* writing data as INSERT commands */
 } ArchiverOutput;
 
+typedef struct _DependencyList
+{
+	DumpId	   *dependencies;	/* dumpIds of objects this one depends on */
+	int			nDeps;			/* number of valid dependencies */
+	int			allocDeps;		/* allocated size of dependencies[] */
+} DependencyList;
+
 /*
  * For historical reasons, ACL items are interspersed with everything else in
  * a dump file's TOC; typically they're right after the object they're for.
@@ -311,6 +318,7 @@ struct _archiveHandle
 	/* arrays created after the TOC list is complete: */
 	struct _tocEntry **tocsByDumpId;	/* TOCs indexed by dumpId */
 	DumpId	   *tableDataId;	/* TABLE DATA ids, indexed by table dumpId */
+	DependencyList *tableDataChunkIds; /* dependencies indexed by dumpId */
 
 	struct _tocEntry *currToc;	/* Used when dumping data */
 	pg_compress_specification compression_spec; /* Requested specification for
@@ -377,7 +385,7 @@ struct _tocEntry
 	size_t		defnLen;		/* length of dumped definition */
 
 	/* working state while dumping/restoring */
-	pgoff_t		dataLength;		/* item's data size; 0 if none or unknown */
+	uint64		dataLength;		/* item's data size; 0 if none or unknown */
 	int			reqs;			/* do we need schema and/or data of object
 								 * (REQ_* bit mask) */
 	bool		created;		/* set for DATA member if TABLE was created */
@@ -437,6 +445,8 @@ extern int	TocIDRequired(ArchiveHandle *AH, DumpId id);
 TocEntry   *getTocEntryByDumpId(ArchiveHandle *AH, DumpId id);
 extern bool checkSeek(FILE *fp);
 
+extern void addStandaloneDependency(DependencyList *dobj, DumpId refId);
+
 #define appendStringLiteralAHX(buf,str,AH) \
 	appendStringLiteral(buf, str, (AH)->public.encoding, (AH)->public.std_strings)
 
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 5d1f7682f11..1e7d9a3f7f3 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -535,6 +535,7 @@ main(int argc, char **argv)
 		{"exclude-extension", required_argument, NULL, 17},
 		{"sequence-data", no_argument, &dopt.sequence_data, 1},
 		{"restrict-key", required_argument, NULL, 25},
+		{"max-table-segment-pages", required_argument, NULL, 26},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -799,6 +800,12 @@ main(int argc, char **argv)
 				dopt.restrict_key = pg_strdup(optarg);
 				break;
 
+			case 26:
+				if (!option_parse_uint32(optarg, "--max-table-segment-pages", 1, MaxBlockNumber,
+									  &dopt.max_table_segment_pages))
+					exit_nicely(1);
+				break;
+
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -1344,6 +1351,9 @@ help(const char *progname)
 	printf(_("  --extra-float-digits=NUM     override default setting for extra_float_digits\n"));
 	printf(_("  --filter=FILENAME            include or exclude objects and data from dump\n"
 			 "                               based on expressions in FILENAME\n"));
+	printf(_("  --max-table-segment-pages=NUMPAGES\n"
+			 "                               number of main table pages above which data is\n"
+			 "                               copied out in chunks, also determines the chunk size\n"));
 	printf(_("  --if-exists                  use IF EXISTS when dropping objects\n"));
 	printf(_("  --include-foreign-data=PATTERN\n"
 			 "                               include data of foreign tables on foreign\n"
@@ -2396,7 +2406,7 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 	 * dumping an old pg_largeobject_metadata defined WITH OIDS.  For other
 	 * cases a simple COPY suffices.
 	 */
-	if (tdinfo->filtercond || tbinfo->relkind == RELKIND_FOREIGN_TABLE ||
+	if (tdinfo->filtercond || is_segment(tdinfo) || tbinfo->relkind == RELKIND_FOREIGN_TABLE ||
 		(fout->dopt->binary_upgrade && fout->remoteVersion < 120000 &&
 		 tbinfo->dobj.catId.oid == LargeObjectMetadataRelationId))
 	{
@@ -2414,9 +2424,37 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 		else
 			appendPQExpBufferStr(q, "* ");
 
-		appendPQExpBuffer(q, "FROM %s %s) TO stdout;",
+		appendPQExpBuffer(q, "FROM %s %s",
 						  fmtQualifiedDumpable(tbinfo),
 						  tdinfo->filtercond ? tdinfo->filtercond : "");
+		/* If it's a segment, we need to add a filter condition to select the
+		 * right page range:
+		 * - for the first segment we add "ctid < (endPage+1, 0)";
+		 *   the first segment is the one with startPage == 0
+		 * - for the last segment we add "ctid >= (startPage, 1)";
+		 *   the last segment is the one with endPage == InvalidBlockNumber,
+		 *   and we leave the upper bound open in case more pages
+		 *   were added after we measured
+		 * - for middle segments we add
+		 *   "ctid >= (startPage, 1) AND ctid < (endPage+1, 0)"
+		 *
+		 * "ctid < (endPage+1, 0)" instead of "ctid <= (endPage, maxtuple)"
+		 * was chosen as range end so that we do not have to estimate maxtuple.
+		 *
+		 */
+		if (is_segment(tdinfo))
+		{
+			appendPQExpBufferStr(q, tdinfo->filtercond ? " AND " : " WHERE ");
+			if (tdinfo->startPage == 0)
+				appendPQExpBuffer(q, "ctid < '(%u,0)'", tdinfo->endPage + 1);
+			else if (tdinfo->endPage != InvalidBlockNumber)
+				appendPQExpBuffer(q, "ctid >= '(%u,1)' AND ctid < '(%u,0)'",
+								 tdinfo->startPage, tdinfo->endPage + 1);
+			else
+				appendPQExpBuffer(q, "ctid >= '(%u,1)'", tdinfo->startPage);
+		}
+
+		appendPQExpBufferStr(q, ") TO stdout;");
 	}
 	else
 	{
@@ -2424,6 +2462,10 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 						  fmtQualifiedDumpable(tbinfo),
 						  column_list);
 	}
+
+	if (is_segment(tdinfo))
+		pg_log_debug("CHUNKING: data query: %s", q->data);
+	
 	res = ExecuteSqlQuery(fout, q->data, PGRES_COPY_OUT);
 	PQclear(res);
 	destroyPQExpBuffer(clistBuf);
@@ -2919,42 +2961,89 @@ dumpTableData(Archive *fout, const TableDataInfo *tdinfo)
 	{
 		TocEntry   *te;
 
-		te = ArchiveEntry(fout, tdinfo->dobj.catId, tdinfo->dobj.dumpId,
-						  ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
-									   .namespace = tbinfo->dobj.namespace->dobj.name,
-									   .owner = tbinfo->rolname,
-									   .description = "TABLE DATA",
-									   .section = SECTION_DATA,
-									   .createStmt = tdDefn,
-									   .copyStmt = copyStmt,
-									   .deps = &(tbinfo->dobj.dumpId),
-									   .nDeps = 1,
-									   .dumpFn = dumpFn,
-									   .dumpArg = tdinfo));
-
-		/*
-		 * Set the TocEntry's dataLength in case we are doing a parallel dump
-		 * and want to order dump jobs by table size.  We choose to measure
-		 * dataLength in table pages (including TOAST pages) during dump, so
-		 * no scaling is needed.
-		 *
-		 * However, relpages is declared as "integer" in pg_class, and hence
-		 * also in TableInfo, but it's really BlockNumber a/k/a unsigned int.
-		 * Cast so that we get the right interpretation of table sizes
-		 * exceeding INT_MAX pages.
+		/* Data chunking works off relpages, which is computed exactly using
+		 * pg_relation_size() when --max-table-segment-pages is set.
+		 *
+		 * We also don't chunk if the table access method is not "heap".
+		 * TODO: we may add chunking for other access methods later, maybe
+		 * based on primary key ranges.
 		 */
-		te->dataLength = (BlockNumber) tbinfo->relpages;
-		te->dataLength += (BlockNumber) tbinfo->toastpages;
+		if (tbinfo->relpages <= dopt->max_table_segment_pages || 
+			strcmp(tbinfo->amname, "heap") != 0)
+		{
+			te = ArchiveEntry(fout, tdinfo->dobj.catId, tdinfo->dobj.dumpId,
+							ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
+										.namespace = tbinfo->dobj.namespace->dobj.name,
+										.owner = tbinfo->rolname,
+										.description = "TABLE DATA",
+										.section = SECTION_DATA,
+										.createStmt = tdDefn,
+										.copyStmt = copyStmt,
+										.deps = &(tbinfo->dobj.dumpId),
+										.nDeps = 1,
+										.dumpFn = dumpFn,
+										.dumpArg = tdinfo));
 
-		/*
-		 * If pgoff_t is only 32 bits wide, the above refinement is useless,
-		 * and instead we'd better worry about integer overflow.  Clamp to
-		 * INT_MAX if the correct result exceeds that.
-		 */
-		if (sizeof(te->dataLength) == 4 &&
-			(tbinfo->relpages < 0 || tbinfo->toastpages < 0 ||
-			 te->dataLength < 0))
-			te->dataLength = INT_MAX;
+			/*
+			 * Set the TocEntry's dataLength in case we are doing a parallel dump
+			 * and want to order dump jobs by table size.  We choose to measure
+			 * dataLength in table pages (including TOAST pages) during dump, so
+			 * no scaling is needed.
+			 *
+			 * pg_class.relpages stores a BlockNumber, a/k/a unsigned int, but is
+			 * declared as "integer", so we convert it back and store it as
+			 * BlockNumber in TableInfo.
+			 * dataLength is uint64, so it does not overflow even for
+			 * 2 x UINT32_MAX.
+			 */
+			te->dataLength = tbinfo->relpages;
+			te->dataLength += tbinfo->toastpages;
+		}
+		else
+		{
+			uint64 current_chunk_start = 0;
+			PQExpBuffer chunk_desc = createPQExpBuffer();
+
+			while (current_chunk_start < tbinfo->relpages)
+			{
+				TableDataInfo *chunk_tdinfo = (TableDataInfo *) pg_malloc(sizeof(TableDataInfo));
+
+				memcpy(chunk_tdinfo, tdinfo, sizeof(TableDataInfo));
+				AssignDumpId(&chunk_tdinfo->dobj);
+				addObjectDependency(&chunk_tdinfo->dobj, tbinfo->dobj.dumpId);
+				chunk_tdinfo->startPage = (BlockNumber) current_chunk_start;
+				chunk_tdinfo->endPage = chunk_tdinfo->startPage + dopt->max_table_segment_pages - 1;
+				
+				current_chunk_start += dopt->max_table_segment_pages;
+				if (current_chunk_start >= tbinfo->relpages)
+					chunk_tdinfo->endPage = InvalidBlockNumber; /* last chunk is for "all the rest" */
+
+				printfPQExpBuffer(chunk_desc, "TABLE DATA (pages %u:%u)", chunk_tdinfo->startPage, chunk_tdinfo->endPage);
+
+				te = ArchiveEntry(fout, chunk_tdinfo->dobj.catId, chunk_tdinfo->dobj.dumpId,
+							ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
+										.namespace = tbinfo->dobj.namespace->dobj.name,
+										.owner = tbinfo->rolname,
+										.description = chunk_desc->data,
+										.section = SECTION_DATA,
+										.createStmt = tdDefn,
+										.copyStmt = copyStmt,
+										.deps = &(tbinfo->dobj.dumpId),
+										.nDeps = 1,
+										.dumpFn = dumpFn,
+										.dumpArg = chunk_tdinfo));
+
+				if(chunk_tdinfo->endPage == InvalidBlockNumber)
+					te->dataLength = tbinfo->relpages - chunk_tdinfo->startPage;
+				else
+					te->dataLength = dopt->max_table_segment_pages;
+				/* let's assume toast pages distribute evenly among chunks */
+				if(tbinfo->relpages)
+					te->dataLength += te->dataLength * tbinfo->toastpages / tbinfo->relpages;
+			}
+
+			destroyPQExpBuffer(chunk_desc);
+		}
 	}
 
 	destroyPQExpBuffer(copyBuf);
@@ -3081,6 +3170,8 @@ makeTableDataInfo(DumpOptions *dopt, TableInfo *tbinfo)
 	tdinfo->dobj.namespace = tbinfo->dobj.namespace;
 	tdinfo->tdtable = tbinfo;
 	tdinfo->filtercond = NULL;	/* might get set later */
+	tdinfo->startPage = InvalidBlockNumber; /* we use this as indication that no chunking is needed */
+	tdinfo->endPage = InvalidBlockNumber;
 	addObjectDependency(&tdinfo->dobj, tbinfo->dobj.dumpId);
 
 	/* A TableDataInfo contains data, of course */
@@ -7347,8 +7438,16 @@ getTables(Archive *fout, int *numTables)
 						 "c.relnamespace, c.relkind, c.reltype, "
 						 "c.relowner, "
 						 "c.relchecks, "
-						 "c.relhasindex, c.relhasrules, c.relpages, "
-						 "c.reltuples, c.relallvisible, ");
+						 "c.relhasindex, c.relhasrules, ");
+
+	/* fetch current relation size if chunking is requested */
+	if(dopt->max_table_segment_pages != InvalidBlockNumber)
+		appendPQExpBufferStr(query, "pg_relation_size(c.oid)/current_setting('block_size')::int AS relpages, ");
+	else
+		/* pg_class.relpages stores BlockNumber (uint32) in an int field, convert to oid to get unsigned int out */
+		appendPQExpBufferStr(query, "c.relpages::oid, ");
+
+	appendPQExpBufferStr(query, "c.reltuples, c.relallvisible, ");
 
 	if (fout->remoteVersion >= 180000)
 		appendPQExpBufferStr(query, "c.relallfrozen, ");
@@ -7589,7 +7688,7 @@ getTables(Archive *fout, int *numTables)
 		tblinfo[i].ncheck = atoi(PQgetvalue(res, i, i_relchecks));
 		tblinfo[i].hasindex = (strcmp(PQgetvalue(res, i, i_relhasindex), "t") == 0);
 		tblinfo[i].hasrules = (strcmp(PQgetvalue(res, i, i_relhasrules), "t") == 0);
-		tblinfo[i].relpages = atoi(PQgetvalue(res, i, i_relpages));
+		tblinfo[i].relpages = strtoul(PQgetvalue(res, i, i_relpages), NULL, 10);
 		if (PQgetisnull(res, i, i_toastpages))
 			tblinfo[i].toastpages = 0;
 		else
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 5a6726d8b12..84e682d585f 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -16,6 +16,7 @@
 
 #include "pg_backup.h"
 #include "catalog/pg_publication_d.h"
+#include "storage/block.h"
 
 
 #define oidcmp(x,y) ( ((x) < (y) ? -1 : ((x) > (y)) ?  1 : 0) )
@@ -335,7 +336,11 @@ typedef struct _tableInfo
 	Oid			owning_tab;		/* OID of table owning sequence */
 	int			owning_col;		/* attr # of column owning sequence */
 	bool		is_identity_sequence;
-	int32		relpages;		/* table's size in pages (from pg_class) */
+	BlockNumber	relpages;		/* table's size in pages (from pg_class),
+								 * converted to unsigned integer; when
+								 * --max-table-segment-pages is set it is
+								 * instead computed from pg_relation_size()
+								 */
 	int			toastpages;		/* toast table's size in pages, if any */
 
 	bool		interesting;	/* true if need to collect more data */
@@ -413,8 +418,21 @@ typedef struct _tableDataInfo
 	DumpableObject dobj;
 	TableInfo  *tdtable;		/* link to table to dump */
 	char	   *filtercond;		/* WHERE condition to limit rows dumped */
+	/* startPage and endPage to support segmented dump */
+	BlockNumber	startPage;		/* first page of this segment.  A segment's
+								 * start is always known, so
+								 * InvalidBlockNumber marks the unsegmented
+								 * case.  When 0 (first segment), the lower
+								 * bound can be omitted from the range query */
+	BlockNumber	endPage;		/* last page in segment for page-range dump,
+	                    		 * startPage+max_table_segment_pages-1 for 
+								 * most segments, but InvalidBlockNumber for
+								 * the last one to indicate open range
+								 */
 } TableDataInfo;
 
+#define is_segment(tdiptr) ((tdiptr)->startPage != InvalidBlockNumber)
+
 typedef struct _indxInfo
 {
 	DumpableObject dobj;
@@ -449,7 +467,7 @@ typedef struct _relStatsInfo
 {
 	DumpableObject dobj;
 	Oid			relid;
-	int32		relpages;
+	BlockNumber	relpages;
 	char	   *reltuples;
 	int32		relallvisible;
 	int32		relallfrozen;
diff --git a/src/bin/pg_dump/t/004_pg_dump_parallel.pl b/src/bin/pg_dump/t/004_pg_dump_parallel.pl
index 738f34b1c1b..4f35aeed9b9 100644
--- a/src/bin/pg_dump/t/004_pg_dump_parallel.pl
+++ b/src/bin/pg_dump/t/004_pg_dump_parallel.pl
@@ -11,6 +11,7 @@ use Test::More;
 my $dbname1 = 'regression_src';
 my $dbname2 = 'regression_dest1';
 my $dbname3 = 'regression_dest2';
+my $dbname4 = 'regression_dest3';
 
 my $node = PostgreSQL::Test::Cluster->new('main');
 $node->init;
@@ -21,6 +22,7 @@ my $backupdir = $node->backup_dir;
 $node->run_log([ 'createdb', $dbname1 ]);
 $node->run_log([ 'createdb', $dbname2 ]);
 $node->run_log([ 'createdb', $dbname3 ]);
+$node->run_log([ 'createdb', $dbname4 ]);
 
 $node->safe_psql(
 	$dbname1,
@@ -87,4 +89,33 @@ $node->command_ok(
 	],
 	'parallel restore as inserts');
 
+$node->command_ok(
+	[
+		'pg_dump',
+		'--format' => 'directory',
+		'--max-table-segment-pages' => 2,
+		'--no-sync',
+		'--jobs' => 2,
+		'--file' => "$backupdir/dump3",
+		$node->connstr($dbname1),
+	],
+	'parallel dump with chunks of two heap pages');
+
+$node->command_ok(
+	[
+		'pg_restore', '--verbose',
+		'--dbname' => $node->connstr($dbname4),
+		'--jobs' => 3,
+		"$backupdir/dump3",
+	],
+	'parallel restore with chunks of two heap pages');
+
+my $table = 'tplain';
+my $tablehash_query = "SELECT '$table', sum(hashtext(t::text)), count(*) FROM $table AS t";
+
+my $result_1 = $node->safe_psql($dbname1, $tablehash_query);
+my $result_4 = $node->safe_psql($dbname4, $tablehash_query);
+
+is($result_4, $result_1, "Hash check for $table: restored db ($result_4) vs original db ($result_1)");
+
 done_testing();
diff --git a/src/fe_utils/option_utils.c b/src/fe_utils/option_utils.c
index 8d0659c1164..a516d8c86a9 100644
--- a/src/fe_utils/option_utils.c
+++ b/src/fe_utils/option_utils.c
@@ -83,6 +83,61 @@ option_parse_int(const char *optarg, const char *optname,
 	return true;
 }
 
+/*
+ * option_parse_uint32
+ *
+ * Parse unsigned integer value for an option.  If the parsing is successful,
+ * returns true and stores the result in *result if that's given;
+ * if parsing fails, returns false.
+ */
+bool
+option_parse_uint32(const char *optarg, const char *optname,
+				 uint32 min_range, uint32 max_range,
+				 uint32 *result)
+{
+	char	   		*endptr;
+	unsigned long	val;
+
+	/* Fail if there is a minus sign at the start of value */
+	while(isspace((unsigned char) *optarg))
+		optarg++;
+	if(*optarg == '-')
+	{
+		pg_log_error("value \"%s\" for option %s cannot be negative",
+					optarg, optname);
+		return false;
+	}
+
+	errno = 0;
+	val = strtoul(optarg, &endptr, 10);
+
+	/*
+	 * Skip any trailing whitespace; if anything but whitespace remains before
+	 * the terminating character, fail.
+	 */
+	while (*endptr != '\0' && isspace((unsigned char) *endptr))
+		endptr++;
+
+	if (*endptr != '\0')
+	{
+		pg_log_error("invalid value \"%s\" for option %s",
+					 optarg, optname);
+		return false;
+	}
+
+	/* min_range and max_range are uint32, so the range check also
+	 * catches an unsigned long val outside the 32-bit range */
+	if (errno == ERANGE || val < min_range || val > max_range)
+	{
+		pg_log_error("%s not in range %u..%u", optname, min_range, max_range);
+		return false;
+	}
+
+	if (result)
+		*result = (uint32) val;
+	return true;
+}
+
 /*
  * Provide strictly harmonized handling of the --sync-method option.
  */
diff --git a/src/include/fe_utils/option_utils.h b/src/include/fe_utils/option_utils.h
index d975db77af2..67fd3650d7a 100644
--- a/src/include/fe_utils/option_utils.h
+++ b/src/include/fe_utils/option_utils.h
@@ -22,6 +22,9 @@ extern void handle_help_version_opts(int argc, char *argv[],
 extern bool option_parse_int(const char *optarg, const char *optname,
 							 int min_range, int max_range,
 							 int *result);
+extern bool option_parse_uint32(const char *optarg, const char *optname,
+							 uint32 min_range, uint32 max_range,
+							 uint32 *result);
 extern bool parse_sync_method(const char *optarg,
 							  DataDirSyncMethod *sync_method);
 extern void check_mut_excl_opts_internal(int n,...);
-- 
2.53.0.1018.g2bb0e51243-goog
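
A note on the explicit minus-sign check in option_parse_uint32() above:
strtoul() accepts a leading '-' and returns the negated value in the
unsigned type, so without that check "-1" would parse as ULONG_MAX rather
than fail. The following is a minimal standalone sketch of the same
parsing steps, not the patch's code: sketch_parse_uint32 is a hypothetical
name, and the pg_log_error() calls are left out so that it compiles outside
the PostgreSQL tree.

```c
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for option_parse_uint32() without error reporting.
 * Returns true and stores the value in *result (if given) when s is a
 * non-negative decimal number within min..max; returns false otherwise.
 */
bool
sketch_parse_uint32(const char *s, uint32_t min, uint32_t max,
					uint32_t *result)
{
	char	   *endptr;
	unsigned long val;

	/* strtoul() would silently negate the value, so reject '-' up front */
	while (isspace((unsigned char) *s))
		s++;
	if (*s == '-')
		return false;

	errno = 0;
	val = strtoul(s, &endptr, 10);

	/* allow trailing whitespace, but nothing else */
	while (*endptr != '\0' && isspace((unsigned char) *endptr))
		endptr++;
	if (*endptr != '\0')
		return false;

	/* ERANGE covers overflow of unsigned long; max covers the 32-bit limit */
	if (errno == ERANGE || val < min || val > max)
		return false;

	if (result)
		*result = (uint32_t) val;
	return true;
}
```

With this, sketch_parse_uint32("-1", 0, UINT32_MAX, &v) returns false. The
range check alone would not be enough on a platform where unsigned long is
32 bits wide, because there the wrapped value of "-1" equals UINT32_MAX.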



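The startPage/endPage comments in the pg_dump.h hunk above describe how a
table is cut into page ranges. The following is a rough sketch of that
layout under stated assumptions, not the patch's implementation:
SketchSegment, sketch_split_table and sketch_segment_filter are
hypothetical names, the ceil(relpages / segment size) split is assumed, and
the ctid predicate is only one plausible form of the range query.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

typedef struct SketchSegment
{
	BlockNumber startPage;		/* first page of the segment */
	BlockNumber endPage;		/* last page, or InvalidBlockNumber for the
								 * open-ended final segment */
} SketchSegment;

/*
 * Cover relpages pages with ranges of at most seg_pages pages each.
 * Writes at most max_segs entries and returns how many were written.
 */
size_t
sketch_split_table(BlockNumber relpages, BlockNumber seg_pages,
				   SketchSegment *segs, size_t max_segs)
{
	size_t		n = 0;
	BlockNumber start = 0;

	if (seg_pages == 0)
		return 0;

	while (n < max_segs)
	{
		segs[n].startPage = start;
		if (relpages - start <= seg_pages)
		{
			/* the last segment stays open so newly added pages are kept */
			segs[n++].endPage = InvalidBlockNumber;
			break;
		}
		segs[n++].endPage = start + seg_pages - 1;
		start += seg_pages;
	}
	return n;
}

/*
 * Build an assumed ctid range predicate for one segment.  The lower bound
 * is omitted when startPage is 0, the upper bound when endPage is open.
 * Returns the snprintf() result.
 */
int
sketch_segment_filter(const SketchSegment *seg, char *buf, size_t buflen)
{
	int			lower = (seg->startPage != 0);
	int			upper = (seg->endPage != InvalidBlockNumber);

	if (lower && upper)
		return snprintf(buf, buflen, "ctid >= '(%u,0)' AND ctid < '(%u,0)'",
						(unsigned) seg->startPage,
						(unsigned) seg->endPage + 1);
	if (lower)
		return snprintf(buf, buflen, "ctid >= '(%u,0)'",
						(unsigned) seg->startPage);
	if (upper)
		return snprintf(buf, buflen, "ctid < '(%u,0)'",
						(unsigned) seg->endPage + 1);
	return snprintf(buf, buflen, "%s", "");
}
```

For a five-page table with a segment size of two pages, the sketch yields
the ranges 0..1, 2..3 and an open range starting at page 4.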

* Re: Patch: dumping tables data in multiple chunks in pg_dump
  2026-01-13 02:27 Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
  2026-03-28 15:32 ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-03-28 15:33   ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-03-29 21:49     ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
  2026-03-30 17:32       ` Re: Patch: dumping tables data in multiple chunks in pg_dump Hannu Krosing <[email protected]>
@ 2026-03-30 21:32         ` Zsolt Parragi <[email protected]>
  0 siblings, 0 replies; 24+ messages in thread

From: Zsolt Parragi @ 2026-03-30 21:32 UTC (permalink / raw)
  To: Hannu Krosing <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>; David Rowley <[email protected]>; Michael Banck <[email protected]>; Ashutosh Bapat <[email protected]>; Nathan Bossart <[email protected]>

Hello!

A simple test causes an assertion failure in my testing; dependency
counting still doesn't seem to work correctly:

pg_restore: >...>/pg_backup_archiver.c:5207: reduce_dependencies:
Assertion `otherte->depCount > 0' failed.

Without assertions it results in data loss.

004_pg_dump_parallel also showcases the issue in my testing.

But simple manual testing also confirms it:

1. create some data

CREATE TABLE tplain (id int UNIQUE);
INSERT INTO tplain SELECT x FROM generate_series(1,1000) x;

2. create a dump

dump with --max-table-segment-pages=2

3. try to restore

restore with --jobs=3
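
For context on the assertion: in pg_restore's parallel scheduler each TOC
entry carries a depCount of unfinished dependencies, and
reduce_dependencies() decrements it for every dependent of an item that has
just finished. The toy model below is a sketch only, with hypothetical names
(ToyEntry, toy_reduce_dependencies). It shows one way such a counter can be
decremented more often than it was incremented: a dependent counts the
table's data once, but each data segment lists it as a dependent. Whether
that is what happens in the patch is an assumption, not something
established here.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyEntry
{
	int			dep_count;		/* dependencies not yet restored */
	size_t		n_rev_deps;		/* number of entries depending on this one */
	struct ToyEntry **rev_deps; /* those entries */
} ToyEntry;

/*
 * Called when "done" has been restored: release one dependency on each of
 * its dependents.  Returns false at the point where the real code's
 * Assert(otherte->depCount > 0) would fail.
 */
bool
toy_reduce_dependencies(ToyEntry *done)
{
	for (size_t i = 0; i < done->n_rev_deps; i++)
	{
		ToyEntry   *other = done->rev_deps[i];

		if (other->dep_count <= 0)
			return false;		/* counter would go negative */
		other->dep_count--;
		/* when dep_count reaches 0 the dependent becomes runnable */
	}
	return true;
}
```

In this model, an index entry with dep_count = 1 that is listed as a
dependent of two data segments becomes runnable after the first segment
finishes, and the call for the second segment reports the failure.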





end of thread

Thread overview: 24+ messages
2026-01-13 02:27 Re: Patch: dumping tables data in multiple chunks in pg_dump David Rowley <[email protected]>
2026-01-14 10:52 ` Hannu Krosing <[email protected]>
2026-01-14 21:10   ` David Rowley <[email protected]>
2026-01-19 19:01 ` Hannu Krosing <[email protected]>
2026-01-19 21:15   ` Zsolt Parragi <[email protected]>
2026-01-19 23:07     ` Hannu Krosing <[email protected]>
2026-01-20 06:13       ` Zsolt Parragi <[email protected]>
2026-01-20 12:48         ` Hannu Krosing <[email protected]>
2026-01-21 13:05           ` Hannu Krosing <[email protected]>
2026-01-22 17:05             ` Hannu Krosing <[email protected]>
2026-01-23 02:15               ` David Rowley <[email protected]>
2026-01-27 22:43                 ` Hannu Krosing <[email protected]>
2026-01-28 17:29                   ` Hannu Krosing <[email protected]>
2026-02-12 06:13                     ` Dilip Kumar <[email protected]>
2026-03-28 10:59                       ` Hannu Krosing <[email protected]>
2026-01-28 21:27                 ` Hannu Krosing <[email protected]>
2026-01-28 21:33                   ` Hannu Krosing <[email protected]>
2026-02-03 21:10                     ` Zsolt Parragi <[email protected]>
2026-01-20 02:20   ` David Rowley <[email protected]>
2026-03-28 15:32 ` Hannu Krosing <[email protected]>
2026-03-28 15:33   ` Hannu Krosing <[email protected]>
2026-03-29 21:49     ` Hannu Krosing <[email protected]>
2026-03-30 17:32       ` Hannu Krosing <[email protected]>
2026-03-30 21:32         ` Zsolt Parragi <[email protected]>
