public inbox for [email protected]
help / color / mirror / Atom feedFrom: Hannu Krosing <[email protected]>
To: David Rowley <[email protected]>
Cc: Zsolt Parragi <[email protected]>
Cc: Ashutosh Bapat <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Cc: Nathan Bossart <[email protected]>
Subject: Re: Patch: dumping tables data in multiple chunks in pg_dump
Date: Wed, 28 Jan 2026 18:29:58 +0100
Message-ID: <CAMT0RQQKWWrYYYQ8QNurs7hXBC5DwBAV6b0JmHKvJk7wZnup0g@mail.gmail.com> (raw)
In-Reply-To: <CAMT0RQQT1dQTPj4etRTc0877mirCPVKzjbF_U5KRAyPwhHMr0Q@mail.gmail.com>
References: <CAMT0RQT_0qVxcTT6ycM20QUN-pEQ6iMLbz6gLWgLpeF0NmNOUA@mail.gmail.com>
<CAExHW5t54GPKFbW3KLzintJ6jMMRYwb-t2Fjm4JTxEcZbGDomA@mail.gmail.com>
<CAMT0RQTHoL8S7OonFWC_aDSC-2oX7BGBBLAQ+OOBhRPcxV2eiw@mail.gmail.com>
<CAMT0RQQAH1a8kY-mx7B07Uzn3T_zeaU9detqFFtW36_k67Su+A@mail.gmail.com>
<CAMT0RQQr7KtPAY903+F42csiHc1EPHo70Xji-znkxEhwdoKa6w@mail.gmail.com>
<CAMT0RQSNHFffbCmDNxQogVBD8H5gTDJNwhUR2btCVE+Lq1sGGw@mail.gmail.com>
<CAMT0RQTEFGctCfgVx3u2XgVRCAj_QURV2tfdzL0HOQi=u0sV2A@mail.gmail.com>
<CAApHDvr8ay+31Wd0TptDGp8cAg2-NOnWddx8csnUE3R03EbvZw@mail.gmail.com>
<CAMT0RQShjXPPdXQS-5uzDC3bXt+QEZR5tO02o1NHdXWNu2quvw@mail.gmail.com>
<CAN4CZFOTOX1nFkPrmM7fQ9qnrccvjwywXPf4Eo7C+DF-h-x96g@mail.gmail.com>
<CAMT0RQTbQxjdN5nv6M_HhFWiLqdT84=NYBM1ZYQKaAcf8Ufyaw@mail.gmail.com>
<CAN4CZFOjG2kUciKeVBpxrHJZNkZuzY_Q5ij_qpDZcAS3Ak2GxA@mail.gmail.com>
<CAMT0RQStjytRrGTe0X03ErC7anwxNRHAULYBsSmdWZV3fr4-Dg@mail.gmail.com>
<CAMT0RQROPMSPwfxAUCm1gZs9cUr7FmvwX+eO6Wzq_wWdd6eEAQ@mail.gmail.com>
<CAMT0RQQ8DX+K7OTw3Lg+Yp2ew8TsZduiqtPszfiBixcpxKbz-A@mail.gmail.com>
<CAApHDvo29-vQz=xV6+x5hU--NZ9qGPXsCNBuOAf88pAHjTpvvQ@mail.gmail.com>
<CAMT0RQQT1dQTPj4etRTc0877mirCPVKzjbF_U5KRAyPwhHMr0Q@mail.gmail.com>
v13 has added a proper test comparing original and restored table data
On Tue, Jan 27, 2026 at 11:43 PM Hannu Krosing <[email protected]> wrote:
>
> Hi David
>
> Thanks for reviewing.
>
> Please hold back reviewing this v12 patch until I have verified that
> it passes CfBot and until I have finished my testing with 17TB table.
>
> I would appreciate pointers or help on adding the data correctness
> tests to the tap tests as real tests.
> For now I put them as DO$$ ... $$ blocks in the parallel test .pl and
> I check manually that the table data checksums match
> If we add them we should add them to other places in pg_dump tests as
> well. Currently we just test that dump and restore do not fail, not
> that the restored data is correct.
>
> On Fri, Jan 23, 2026 at 3:15 AM David Rowley <[email protected]> wrote:
> >
> > On Fri, 23 Jan 2026 at 06:05, Hannu Krosing <[email protected]> wrote:
> > >
> > > Fixing all the warnings
> >
> > I think overall this needs significantly more care and precision than
> > what you've given it so far. For example, you have:
> >
> > + if(dopt->max_table_segment_pages != InvalidBlockNumber)
> > + appendPQExpBufferStr(query,
> > "pg_relation_size(c.oid)/current_setting('block_size')::int AS
> > relpages, ");
> > + else
> > + appendPQExpBufferStr(query, "c.relpages, ");
> >
> > Note that pg_class.relpages is "int". Later the code in master does:
> >
> > tblinfo[i].relpages = atoi(PQgetvalue(res, i, i_relpages));
>
> I have now fixed the base issue by changing the data type of
> TableInfo.relpages to BlockNumber, and also changed the way we get it
> by
>
> 1. converting it to unsigned int ( c.relpages::oid ) in the query
> 2. reading it from the result using strtoul()
>
> (technically it should have been enough to just use strtoul() as it
> already wraps signed ints to unsigned ones, but having it converted in
> the query seems cleaner)
>
> This allowed removing casts to (BlockNumber) everywhere where
> .relpages was used.
>
> Functionally value was ever only used for ordering and even this
> loosley, which explains why patch v10 did not break anything.
>
> I also changed the data type of TocEntry.dataLength from pgoff_t to
> uint64. The current clearly had an overflow in case when off_t was 32
> bit and sum of relpages from heap and toast was larger than allowed
> for it.
>
> > If you look in vacuum.c, you'll see "pgcform->relpages = (int32)
> > num_pages;" that the value stored in relpages will be negative when
> > the table is >= 16TB (assuming 8k pages). Your pg_relation_size
> > expression is not going to produce an INT. It'll produce a BIGINT, per
> > "select pg_typeof(pg_relation_size('pg_class') /
> > current_setting('block_size')::int);". So the atoi() can receive a
> > string of digits representing an integer larger than INT_MAX in this
> > case. Looking at [1], I see:
>
> As said above this should be fixed now by using correct type in struch
> and strtoul().
> To be sure I have now created a 17TB table and running some tests on
> this as well.
> Will let you know here when done.
>
> > "7.22.1 Numeric conversion functions 1 The functions atof, atoi, atol,
> > and atoll need not affect the value of the integer expression errno on
> > an error. If the value of the result cannot be represented, *the
> > behavior is undefined.*"
> >
> > And testing locally, I see that my Microsoft compiler will just return
> > INT_MAX on overflow, whereas I see gcc does nothing to prevent
> > overflows and just continues to multiply by 10 regardless of what
> > overflows occur, which I think would just make the code work by
> > accident.
>
> As .relpages was only ever used for ordering parallel copies it does
> work just not optimally.
>
> The old code has similar overflow/wraparound for case when off_t is 32
> bit int and the sum of relpages from heap and toast table is above
> INT_MAX
>
> I have removed the whole part where this was partially fixed for the
> case when one of them was > 0x7fffffff (i.e. negative) by pinning the
> dataLength to INT_MAX in that case
>
> > Aside from that, nothing in the documentation mentions that this is
> > for "heap" tables only. That should be mentioned as it'll just result
> > in people posting questions about why it's not working for some other
> > table access method. There's also not much care for white space.
> > You've introduced a bunch of whitespace changes unrelated to code
> > changes you've made, plus there's not much regard for following
> > project standard. For example, you commonly do "if(" and don't
> > consistently follow the bracing rules, e.g:
> >
> > + for(chkptr = optarg; *chkptr != '\0'; chkptr++)
> > + if(*chkptr == '-')
>
> I assumed that it is the classical "single statemet -- no braces.
>
> Do we have a writeup of our coding standards somewhere ?
>
> Now this specific case is rewritten using while() so shoud be ok.
>
> > Things like the following help convey the level of care that's gone into this:
> >
> > +/*
> > + * option_parse_int
> > + *
> > + * Parse integer value for an option. If the parsing is successful, returns
> > + * true and stores the result in *result if that's given; if parsing fails,
> > + * returns false.
> > + */
> > +bool
> > +option_parse_uint32(const char *optarg, const char *optname,
> >
> > i.e zero effort gone in to modify the comments after pasting them from
> > option_parse_int().
> >
> > Another example:
> >
> > + pg_log_error("%s musst be in range %lu..%lu",
> >
> > Also, I have no comprehension of why you'd use uint64 for the valid
> > range when the function is for processing uint32 types in:
>
> The uint64 there I picked up from the referenced long unsigned usage
> in pg_resetval after I managed to get pg_log_warning to print out -1
> for format %u and did not want to go to debug why that happens.
>
> I have now made all the arguments uint32
>
> > +bool
> > +option_parse_uint32(const char *optarg, const char *optname,
> > + uint64 min_range, uint64 max_range,
> > + uint32 *result)
> >
> > In its current state, it's quite hard to take this patch seriously.
> > Please spend longer self-reviewing it before posting. You could
> > temporarily hard-code something for testing which makes at least 1
> > table appear to be larger than 16TB and ensure your code works. What
> > you have is visually broken and depends on whatever the atoi
> > implementation opts to do in the overflow case. These are all things
> > diligent commiters will be testing and it's sad to see how little
> > effort you're putting into this. How do you expect this community to
> > scale with this quality level of patch submissions? You've been around
> > long enough and should know and do better. Are you just expecting the
> > committer to fix these things for you? That work does not get done via
> > magic wand. Being on v10 already, I'd have expected the patch to be
> > far beyond proof of concept grade. If you're withholding investing
> > time on this until you see more community buy-in, then I'd suggest you
> > write that and withhold further revisions until you're happy with the
> > level of buy-in.
>
> > I'm also still not liking your de-normalised TableInfo representation
> > for "is_segment".
> > IMO, InvalidBlockNumber should be used to represent
> > open bounded ranges, and if there's no chunking, then startPage and
> > endPage will both be InvalidBlockNumber.
>
> That's what I ended up doing
>
> I switched to using startPage = InvalidBlockNumber to indicate that no
> chunking is in effect.
>
> This is safe because when chunking is in use I always try to set both
> chunk end pages, and lower bound I can always set the lower bound.
>
> Only for the last page is the endPage left to InvalidBlockNumber.
>
> > IMO, what you have now
> > needlessly allows invalid states where is_segment == true and
> > startPage, endPage are not set correctly. If you want to keep the code
> > simple, hide the complexity in a macro or an inline function. There's
> > just no performance reason to materialise the more complex condition
> > into a dedicated boolean flag.
Attachments:
[application/x-patch] v13-0001-changed-flag-name-to-max-table-segment-pages.patch (21.2K, 2-v13-0001-changed-flag-name-to-max-table-segment-pages.patch)
download | inline diff:
From e598191f7464ca2ecfa9779a823d1aa8a409cdf7 Mon Sep 17 00:00:00 2001
From: Hannu Krosing <[email protected]>
Date: Wed, 28 Jan 2026 18:24:19 +0100
Subject: [PATCH v13] * changed flag name to max-table-segment-pages * added
check for amname = "heap" * added simple chunked dump and restore test *
changed the data type of TableInfo.relpages to BlockNumber, * select it
using relpages:oid to get unsigned int out * read it in from query result
using strtoul() * removed a bunch of casts from .relpages to (BlocNumber) *
changed the data type of TocEntry.dataLength to uint64 current pgoff_t
certainly had an overflow in 32bit case when heap relpages + toast relpages >
INT_MAX * switched to using of
pg_relation_size(c.oid)/current_setting('block_size')::int when
--max-table-segment-pages is set * added documentation * added
option_parse_uint32(...) to be used for full range of pages numbers
* TESTS: added test to compare original and restored table contents
---
doc/src/sgml/ref/pg_dump.sgml | 24 +++
src/bin/pg_dump/pg_backup.h | 2 +
src/bin/pg_dump/pg_backup_archiver.c | 2 +
src/bin/pg_dump/pg_backup_archiver.h | 2 +-
src/bin/pg_dump/pg_dump.c | 169 +++++++++++++++++-----
src/bin/pg_dump/pg_dump.h | 22 ++-
src/bin/pg_dump/t/004_pg_dump_parallel.pl | 31 ++++
src/fe_utils/option_utils.c | 55 +++++++
src/include/fe_utils/option_utils.h | 3 +
9 files changed, 268 insertions(+), 42 deletions(-)
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 688e23c0e90..1811c67d141 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1088,6 +1088,30 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>--max-table-segment-pages=<replaceable class="parameter">npages</replaceable></option></term>
+ <listitem>
+ <para>
+ Dump data in segments based on number of pages in the main relation.
+ If the number of data pages in the relation is more than <replaceable class="parameter">npages</replaceable>
+ the data is split into segments based on that number of pages.
+ Individual segments can be dumped in parallel.
+ </para>
+
+ <note>
+ <para>
+ The option <option>--max-table-segment-pages</option> is applied to only pages
+ in the main heap and if the table has a large TOASTed part this has to be
+ taken into account when deciding on the number of pages to use.
+ In the extreme case a single 8kB heap page can have ~200 toast pointers each
+ corresponding to 1GB of data. If this data is also non-compressible then a
+ single-page segment can dump as 200GB file.
+ </para>
+ </note>
+
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>--no-comments</option></term>
<listitem>
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index d9041dad720..b63ae05d895 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -27,6 +27,7 @@
#include "common/file_utils.h"
#include "fe_utils/simple_list.h"
#include "libpq-fe.h"
+#include "storage/block.h"
typedef enum trivalue
@@ -178,6 +179,7 @@ typedef struct _dumpOptions
bool aclsSkip;
const char *lockWaitTimeout;
int dump_inserts; /* 0 = COPY, otherwise rows per INSERT */
+ BlockNumber max_table_segment_pages; /* chunk when relpages is above this */
/* flags for various command-line long options */
int disable_dollar_quoting;
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 4a63f7392ae..ed1913d66bc 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -44,6 +44,7 @@
#include "pg_backup_archiver.h"
#include "pg_backup_db.h"
#include "pg_backup_utils.h"
+#include "storage/block.h"
#define TEXT_DUMP_HEADER "--\n-- PostgreSQL database dump\n--\n\n"
#define TEXT_DUMPALL_HEADER "--\n-- PostgreSQL database cluster dump\n--\n\n"
@@ -154,6 +155,7 @@ InitDumpOptions(DumpOptions *opts)
opts->dumpSchema = true;
opts->dumpData = true;
opts->dumpStatistics = false;
+ opts->max_table_segment_pages = InvalidBlockNumber;
}
/*
diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h
index 325b53fc9bd..b6a9f16a122 100644
--- a/src/bin/pg_dump/pg_backup_archiver.h
+++ b/src/bin/pg_dump/pg_backup_archiver.h
@@ -377,7 +377,7 @@ struct _tocEntry
size_t defnLen; /* length of dumped definition */
/* working state while dumping/restoring */
- pgoff_t dataLength; /* item's data size; 0 if none or unknown */
+ uint64 dataLength; /* item's data size; 0 if none or unknown */
int reqs; /* do we need schema and/or data of object
* (REQ_* bit mask) */
bool created; /* set for DATA member if TABLE was created */
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 687dc98e46d..0badb245b55 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -539,6 +539,7 @@ main(int argc, char **argv)
{"exclude-extension", required_argument, NULL, 17},
{"sequence-data", no_argument, &dopt.sequence_data, 1},
{"restrict-key", required_argument, NULL, 25},
+ {"max-table-segment-pages", required_argument, NULL, 26},
{NULL, 0, NULL, 0}
};
@@ -803,6 +804,13 @@ main(int argc, char **argv)
dopt.restrict_key = pg_strdup(optarg);
break;
+ case 26:
+ if (!option_parse_uint32(optarg, "--max-table-segment-pages", 1, MaxBlockNumber,
+ &dopt.max_table_segment_pages))
+ exit_nicely(1);
+ pg_log_warning("CHUNKING: set dopt.max_table_segment_pages to [%u]", dopt.max_table_segment_pages);
+ break;
+
default:
/* getopt_long already emitted a complaint */
pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -1372,6 +1380,9 @@ help(const char *progname)
printf(_(" --extra-float-digits=NUM override default setting for extra_float_digits\n"));
printf(_(" --filter=FILENAME include or exclude objects and data from dump\n"
" based on expressions in FILENAME\n"));
+ printf(_(" --max-table-segment-pages=NUMPAGES\n"
+ " Number of main table pages above which data is \n"
+ " copied out in chunks, also determines the chunk size\n"));
printf(_(" --if-exists use IF EXISTS when dropping objects\n"));
printf(_(" --include-foreign-data=PATTERN\n"
" include data of foreign tables on foreign\n"
@@ -2412,7 +2423,7 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
* a filter condition was specified. For other cases a simple COPY
* suffices.
*/
- if (tdinfo->filtercond || tbinfo->relkind == RELKIND_FOREIGN_TABLE)
+ if (tdinfo->filtercond || is_segment(tdinfo) || tbinfo->relkind == RELKIND_FOREIGN_TABLE)
{
/* Temporary allows to access to foreign tables to dump data */
if (tbinfo->relkind == RELKIND_FOREIGN_TABLE)
@@ -2428,9 +2439,23 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
else
appendPQExpBufferStr(q, "* ");
- appendPQExpBuffer(q, "FROM %s %s) TO stdout;",
+ appendPQExpBuffer(q, "FROM %s %s",
fmtQualifiedDumpable(tbinfo),
tdinfo->filtercond ? tdinfo->filtercond : "");
+ if (is_segment(tdinfo))
+ {
+ appendPQExpBufferStr(q, tdinfo->filtercond?" AND ":" WHERE ");
+ if(tdinfo->startPage == 0)
+ appendPQExpBuffer(q, "ctid <= '(%u,32000)'", tdinfo->endPage);
+ else if(tdinfo->endPage != InvalidBlockNumber)
+ appendPQExpBuffer(q, "ctid BETWEEN '(%u,1)' AND '(%u,32000)'",
+ tdinfo->startPage, tdinfo->endPage);
+ else
+ appendPQExpBuffer(q, "ctid >= '(%u,1)'", tdinfo->startPage);
+ pg_log_warning("CHUNKING: pages [%u:%u]",tdinfo->startPage, tdinfo->endPage);
+ }
+
+ appendPQExpBuffer(q, ") TO stdout;");
}
else
{
@@ -2438,6 +2463,9 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
fmtQualifiedDumpable(tbinfo),
column_list);
}
+
+ pg_log_warning("CHUNKING: data query: %s", q->data);
+
res = ExecuteSqlQuery(fout, q->data, PGRES_COPY_OUT);
PQclear(res);
destroyPQExpBuffer(clistBuf);
@@ -2933,42 +2961,95 @@ dumpTableData(Archive *fout, const TableDataInfo *tdinfo)
{
TocEntry *te;
- te = ArchiveEntry(fout, tdinfo->dobj.catId, tdinfo->dobj.dumpId,
- ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
- .namespace = tbinfo->dobj.namespace->dobj.name,
- .owner = tbinfo->rolname,
- .description = "TABLE DATA",
- .section = SECTION_DATA,
- .createStmt = tdDefn,
- .copyStmt = copyStmt,
- .deps = &(tbinfo->dobj.dumpId),
- .nDeps = 1,
- .dumpFn = dumpFn,
- .dumpArg = tdinfo));
-
- /*
- * Set the TocEntry's dataLength in case we are doing a parallel dump
- * and want to order dump jobs by table size. We choose to measure
- * dataLength in table pages (including TOAST pages) during dump, so
- * no scaling is needed.
- *
- * However, relpages is declared as "integer" in pg_class, and hence
- * also in TableInfo, but it's really BlockNumber a/k/a unsigned int.
- * Cast so that we get the right interpretation of table sizes
- * exceeding INT_MAX pages.
+ /* data chunking works off relpages, which are computed exactly using
+ * pg_relation_size() when --max-table-segment-pages was set
+ *
+ * We also don't chunk if table access method is not "heap"
+ * TODO: we may add chunking for other access methods later, maybe
+ * based on primary key tranges
*/
- te->dataLength = (BlockNumber) tbinfo->relpages;
- te->dataLength += (BlockNumber) tbinfo->toastpages;
+ if (tbinfo->relpages <= dopt->max_table_segment_pages ||
+ strcmp(tbinfo->amname, "heap") != 0)
+ {
+ te = ArchiveEntry(fout, tdinfo->dobj.catId, tdinfo->dobj.dumpId,
+ ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
+ .namespace = tbinfo->dobj.namespace->dobj.name,
+ .owner = tbinfo->rolname,
+ .description = "TABLE DATA",
+ .section = SECTION_DATA,
+ .createStmt = tdDefn,
+ .copyStmt = copyStmt,
+ .deps = &(tbinfo->dobj.dumpId),
+ .nDeps = 1,
+ .dumpFn = dumpFn,
+ .dumpArg = tdinfo));
- /*
- * If pgoff_t is only 32 bits wide, the above refinement is useless,
- * and instead we'd better worry about integer overflow. Clamp to
- * INT_MAX if the correct result exceeds that.
- */
- if (sizeof(te->dataLength) == 4 &&
- (tbinfo->relpages < 0 || tbinfo->toastpages < 0 ||
- te->dataLength < 0))
- te->dataLength = INT_MAX;
+ /*
+ * Set the TocEntry's dataLength in case we are doing a parallel dump
+ * and want to order dump jobs by table size. We choose to measure
+ * dataLength in table pages (including TOAST pages) during dump, so
+ * no scaling is needed.
+ *
+ * While pg_class.relpages which stores BlockNumber, a/k/a unsigned int,
+ * is declared as "integer" we convert it back and store it as
+ * BlockNumber in TableInfo.
+ * And dataLenght is pgoff_t (long int) so does now overflow for
+ * 2 x UINT32_MAX
+ */
+ te->dataLength = tbinfo->relpages;
+ te->dataLength += tbinfo->toastpages;
+ }
+ else
+ {
+ uint64 current_chunk_start = 0;
+ PQExpBuffer chunk_desc = createPQExpBuffer();
+
+ pg_log_warning("CHUNKING: toc for chunked relpages [%u]", tbinfo->relpages);
+
+ /* TODO - use uint 64 for current_chunk_start to avoid wraparound */
+ while (current_chunk_start < tbinfo->relpages)
+ {
+ TableDataInfo *chunk_tdinfo = (TableDataInfo *) pg_malloc(sizeof(TableDataInfo));
+
+ memcpy(chunk_tdinfo, tdinfo, sizeof(TableDataInfo));
+ AssignDumpId(&chunk_tdinfo->dobj);
+ //addObjectDependency(&chunk_tdinfo->dobj, tbinfo->dobj.dumpId); /* do we need this here */
+// chunk_tdinfo->is_segment = true;
+ chunk_tdinfo->startPage = (BlockNumber) current_chunk_start;
+ chunk_tdinfo->endPage = chunk_tdinfo->startPage + dopt->max_table_segment_pages - 1;
+
+ pg_log_warning("CHUNKING: toc for pages [%u:%u]",chunk_tdinfo->startPage, chunk_tdinfo->endPage);
+
+ current_chunk_start += dopt->max_table_segment_pages;
+ if (current_chunk_start >= tbinfo->relpages)
+ chunk_tdinfo->endPage = InvalidBlockNumber; /* last chunk is for "all the rest" */
+
+ printfPQExpBuffer(chunk_desc, "TABLE DATA (pages %u:%u)", chunk_tdinfo->startPage, chunk_tdinfo->endPage);
+
+ te = ArchiveEntry(fout, chunk_tdinfo->dobj.catId, chunk_tdinfo->dobj.dumpId,
+ ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
+ .namespace = tbinfo->dobj.namespace->dobj.name,
+ .owner = tbinfo->rolname,
+ .description = chunk_desc->data,
+ .section = SECTION_DATA,
+ .createStmt = tdDefn,
+ .copyStmt = copyStmt,
+ .deps = &(tbinfo->dobj.dumpId),
+ .nDeps = 1,
+ .dumpFn = dumpFn,
+ .dumpArg = chunk_tdinfo));
+
+ if(chunk_tdinfo->endPage == InvalidBlockNumber)
+ te->dataLength = tbinfo->relpages - chunk_tdinfo->startPage;
+ else
+ te->dataLength = dopt->max_table_segment_pages;
+ /* let's assume toast pages distribute evenly among chunks */
+ if(tbinfo->relpages)
+ te->dataLength += te->dataLength * tbinfo->toastpages / tbinfo->relpages;
+ }
+
+ destroyPQExpBuffer(chunk_desc);
+ }
}
destroyPQExpBuffer(copyBuf);
@@ -3092,6 +3173,8 @@ makeTableDataInfo(DumpOptions *dopt, TableInfo *tbinfo)
tdinfo->dobj.namespace = tbinfo->dobj.namespace;
tdinfo->tdtable = tbinfo;
tdinfo->filtercond = NULL; /* might get set later */
+ tdinfo->startPage = InvalidBlockNumber; /* we use this as indication that no chunking is needed */
+ tdinfo->endPage = InvalidBlockNumber;
addObjectDependency(&tdinfo->dobj, tbinfo->dobj.dumpId);
/* A TableDataInfo contains data, of course */
@@ -7254,8 +7337,16 @@ getTables(Archive *fout, int *numTables)
"c.relnamespace, c.relkind, c.reltype, "
"c.relowner, "
"c.relchecks, "
- "c.relhasindex, c.relhasrules, c.relpages, "
- "c.reltuples, c.relallvisible, ");
+ "c.relhasindex, c.relhasrules, ");
+
+ /* fetch current relation size if chunking is requested */
+ if(dopt->max_table_segment_pages != InvalidBlockNumber)
+ appendPQExpBufferStr(query, "pg_relation_size(c.oid)/current_setting('block_size')::int AS relpages, ");
+ else
+ /* pg_class.relpages stores BlockNumber (uint32) in an int field, convert to oid to get unsigned int out */
+ appendPQExpBufferStr(query, "c.relpages::oid, ");
+
+ appendPQExpBufferStr(query, "c.reltuples, c.relallvisible, ");
if (fout->remoteVersion >= 180000)
appendPQExpBufferStr(query, "c.relallfrozen, ");
@@ -7495,7 +7586,7 @@ getTables(Archive *fout, int *numTables)
tblinfo[i].ncheck = atoi(PQgetvalue(res, i, i_relchecks));
tblinfo[i].hasindex = (strcmp(PQgetvalue(res, i, i_relhasindex), "t") == 0);
tblinfo[i].hasrules = (strcmp(PQgetvalue(res, i, i_relhasrules), "t") == 0);
- tblinfo[i].relpages = atoi(PQgetvalue(res, i, i_relpages));
+ tblinfo[i].relpages = strtoul(PQgetvalue(res, i, i_relpages), NULL, 10);
if (PQgetisnull(res, i, i_toastpages))
tblinfo[i].toastpages = 0;
else
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 4c4b14e5fc7..be71661ac41 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -16,6 +16,7 @@
#include "pg_backup.h"
#include "catalog/pg_publication_d.h"
+#include "storage/block.h"
#define oidcmp(x,y) ( ((x) < (y) ? -1 : ((x) > (y)) ? 1 : 0) )
@@ -335,7 +336,11 @@ typedef struct _tableInfo
Oid owning_tab; /* OID of table owning sequence */
int owning_col; /* attr # of column owning sequence */
bool is_identity_sequence;
- int32 relpages; /* table's size in pages (from pg_class) */
+ BlockNumber relpages; /* table's size in pages (from pg_class)
+ * converted to unsigned integer
+ * when --max-table-segment-pages is set
+ * the computed from pog_relation_size()
+ */
int toastpages; /* toast table's size in pages, if any */
bool interesting; /* true if need to collect more data */
@@ -413,8 +418,21 @@ typedef struct _tableDataInfo
DumpableObject dobj;
TableInfo *tdtable; /* link to table to dump */
char *filtercond; /* WHERE condition to limit rows dumped */
+ /* startPage and endPage to support segmented dump */
+ BlockNumber startPage; /* As we always know the lowest segment page
+ * number we can use InvalidBlockNumber here
+ * to recognize no segmenting case.
+ * When 0 for the first page of first
+ * segment we can omit in range query */
+ BlockNumber endPage; /* last page in segment for page-range dump,
+ * startPage+max_table_segment_pages-1 for
+ * most segments, but InvalidBlockNumber for
+ * the last one to indicate open range
+ */
} TableDataInfo;
+#define is_segment(tdiptr) (tdiptr->startPage != InvalidBlockNumber)
+
typedef struct _indxInfo
{
DumpableObject dobj;
@@ -448,7 +466,7 @@ typedef struct _indexAttachInfo
typedef struct _relStatsInfo
{
DumpableObject dobj;
- int32 relpages;
+ BlockNumber relpages;
char *reltuples;
int32 relallvisible;
int32 relallfrozen;
diff --git a/src/bin/pg_dump/t/004_pg_dump_parallel.pl b/src/bin/pg_dump/t/004_pg_dump_parallel.pl
index 738f34b1c1b..4f35aeed9b9 100644
--- a/src/bin/pg_dump/t/004_pg_dump_parallel.pl
+++ b/src/bin/pg_dump/t/004_pg_dump_parallel.pl
@@ -11,6 +11,7 @@ use Test::More;
my $dbname1 = 'regression_src';
my $dbname2 = 'regression_dest1';
my $dbname3 = 'regression_dest2';
+my $dbname4 = 'regression_dest3';
my $node = PostgreSQL::Test::Cluster->new('main');
$node->init;
@@ -21,6 +22,7 @@ my $backupdir = $node->backup_dir;
$node->run_log([ 'createdb', $dbname1 ]);
$node->run_log([ 'createdb', $dbname2 ]);
$node->run_log([ 'createdb', $dbname3 ]);
+$node->run_log([ 'createdb', $dbname4 ]);
$node->safe_psql(
$dbname1,
@@ -87,4 +89,33 @@ $node->command_ok(
],
'parallel restore as inserts');
+$node->command_ok(
+ [
+ 'pg_dump',
+ '--format' => 'directory',
+ '--max-table-segment-pages' => 2,
+ '--no-sync',
+ '--jobs' => 2,
+ '--file' => "$backupdir/dump3",
+ $node->connstr($dbname1),
+ ],
+ 'parallel dump with chunks of two heap pages');
+
+$node->command_ok(
+ [
+ 'pg_restore', '--verbose',
+ '--dbname' => $node->connstr($dbname4),
+ '--jobs' => 3,
+ "$backupdir/dump3",
+ ],
+ 'parallel restore with chunks of two heap pages');
+
+my $table = 'tplain';
+my $tablehash_query = "SELECT '$table', sum(hashtext(t::text)), count(*) FROM $table AS t";
+
+my $result_1 = $node->safe_psql($dbname1, $tablehash_query);
+my $result_4 = $node->safe_psql($dbname4, $tablehash_query);
+
+is($result_4, $result_1, "Hash check for $table: restored db ($result_4) vs original db ($result_1)");
+
done_testing();
diff --git a/src/fe_utils/option_utils.c b/src/fe_utils/option_utils.c
index cc483ae176c..aff1fbd31a3 100644
--- a/src/fe_utils/option_utils.c
+++ b/src/fe_utils/option_utils.c
@@ -83,6 +83,61 @@ option_parse_int(const char *optarg, const char *optname,
return true;
}
+/*
+ * option_parse_uint32
+ *
+ * Parse unsigned integer value for an option. If the parsing is successful,
+ * returns true and stores the result in *result if that's given;
+ * if parsing fails, returns false.
+ */
+bool
+option_parse_uint32(const char *optarg, const char *optname,
+ uint32 min_range, uint32 max_range,
+ uint32 *result)
+{
+ char *endptr;
+ unsigned long val;
+
+ /* Fail if there is a minus sign at the start of value */
+ while(isspace((unsigned char) *optarg))
+ optarg++;
+ if(*optarg == '-')
+ {
+ pg_log_error("value \"%s\" for option %s can not be negative",
+ optarg, optname);
+ return false;
+ }
+
+ errno = 0;
+ val = strtoul(optarg, &endptr, 10);
+
+ /*
+ * Skip any trailing whitespace; if anything but whitespace remains before
+ * the terminating character, fail.
+ */
+ while (*endptr != '\0' && isspace((unsigned char) *endptr))
+ endptr++;
+
+ if (*endptr != '\0')
+ {
+ pg_log_error("invalid value \"%s\" for option %s",
+ optarg, optname);
+ return false;
+ }
+
+ /* as min_range and max_range are uint32 then the range check will
+ * catch the case where unsigned long val is outside 32 bit range */
+ if (errno == ERANGE || val < min_range || val > max_range)
+ {
+ pg_log_error("%s not in range %u..%u", optname, min_range, max_range);
+ return false;
+ }
+
+ if (result)
+ *result = (uint32) val;
+ return true;
+}
+
/*
* Provide strictly harmonized handling of the --sync-method option.
*/
diff --git a/src/include/fe_utils/option_utils.h b/src/include/fe_utils/option_utils.h
index 0db6e3b6e91..c74cd1fb595 100644
--- a/src/include/fe_utils/option_utils.h
+++ b/src/include/fe_utils/option_utils.h
@@ -22,6 +22,9 @@ extern void handle_help_version_opts(int argc, char *argv[],
extern bool option_parse_int(const char *optarg, const char *optname,
int min_range, int max_range,
int *result);
+extern bool option_parse_uint32(const char *optarg, const char *optname,
+ uint32 min_range, uint32 max_range,
+ uint32 *result);
extern bool parse_sync_method(const char *optarg,
DataDirSyncMethod *sync_method);
--
2.43.0
view thread (24+ messages) latest in thread
reply
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Reply to all the recipients using the --to and --cc options:
reply via email
To: [email protected]
Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
Subject: Re: Patch: dumping tables data in multiple chunks in pg_dump
In-Reply-To: <CAMT0RQQKWWrYYYQ8QNurs7hXBC5DwBAV6b0JmHKvJk7wZnup0g@mail.gmail.com>
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox