public inbox for [email protected]
help / color / mirror / Atom feedFrom: Kumar, Sachin <[email protected]>
To: Tom Lane <[email protected]>
To: Robins Tharakan <[email protected]>
Cc: Nathan Bossart <[email protected]>
Cc: Jan Wieck <[email protected]>
Cc: Bruce Momjian <[email protected]>
Cc: Zhihong Yu <[email protected]>
Cc: Andrew Dunstan <[email protected]>
Cc: Magnus Hagander <[email protected]>
Cc: Robins Tharakan <[email protected]>
Cc: Peter Eisentraut <[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: pg_upgrade failing for 200+ million Large Objects
Date: Tue, 2 Jan 2024 17:33:00 +0000
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
References: <[email protected]>
<[email protected]>
<[email protected]>
<[email protected]>
<[email protected]>
<[email protected]>
<[email protected]>
<[email protected]>
<20220825003227.GA1456581@nathanxps13>
<[email protected]>
<20220908231807.GA2242918@nathanxps13>
<CAAWbhmgUb8p7ff_ZX5jCvqM=ipPxbbDJTXMNVzH-Ho_CXVkRHA@mail.gmail.com>
<[email protected]>
<[email protected]>
<[email protected]>
<[email protected]>
<[email protected]>
<[email protected]>
> On 11/12/2023, 01:43, "Tom Lane" <[email protected] <mailto:[email protected]>> wrote:
> I had initially supposed that in a parallel restore we could
> have child workers also commit after every N TOC items, but was
> soon disabused of that idea. After a worker processes a TOC
> item, any dependent items (such as index builds) might get
> dispatched to some other worker, which had better be able to
> see the results of the first worker's step. So at least in
> this implementation, we disable the multi-command-per-COMMIT
> behavior during the parallel part of the restore. Maybe that
> could be improved in future, but it seems like it'd add a
> lot more complexity, and it wouldn't make life any better for
> pg_upgrade (which doesn't use parallel pg_restore, and seems
> unlikely to want to in future).
I was not able to find email thread which details why we are not using
parallel pg_restore for pg_upgrade. IMHO most of the customer will have single large
database, and not using parallel restore will cause slow pg_upgrade.
I am attaching a patch which enables parallel pg_restore for DATA and POST-DATA part
of dump. It will push down --jobs value to pg_restore and will restore database sequentially.
Benchmarks
{5 million LOs 1 large DB}
Patched {v9}
time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub --jobs=20
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir 17.51s user 65.80s system 35% cpu 3:56.64 total
time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub -r
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir 17.51s user 65.85s system 34% cpu 3:58.39 total
HEAD
time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub -r --jobs=20
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir 53.95s user 82.44s system 41% cpu 5:25.23 total
time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub -r
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir 54.94s user 81.26s system 41% cpu 5:24.86 total
Fix with --jobs propagation to pg_restore {on top of v9}
time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub -r --jobs=20
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir 29.12s user 69.85s system 275% cpu 35.930 total
Although parallel restore does have small regression in ideal case of pg_upgrade --jobs
Multiple DBs {4 DBs each having 2 million LOs}
Fix with --jobs scheduling
time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub -r --jobs=4
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir 31.80s user 109.52s system 120% cpu 1:57.35 total
Patched {v9}
time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub -r --jobs=4
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir 30.88s user 110.05s system 135% cpu 1:43.97 total
Regards
Sachin
Attachments:
[application/octet-stream] v9-005-parallel_pg_restore.patch (5.0K, 2-v9-005-parallel_pg_restore.patch)
download | inline diff:
commit 02513d121ce0b96be5619edfd0317b46b70a44da
Author: Sachin Kumar <[email protected]>
Date: Tue Jan 2 16:57:42 2024 +0000
Pass pg_upgrade --jobs parameter to pg_restore
This patch changes pg_upgrade --jobs behaviour for data, post_data
part of restore. Instead for restoring N databases in parallel,
they are restored in sequential order but with pg_restore
--jobs = {original pg_upgrade --jobs value}
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 5cfd2282e1..73ffdafd19 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -54,6 +54,7 @@
static void set_locale_and_encoding(void);
static void prepare_new_cluster(void);
static void prepare_new_globals(void);
+static void parallel_pg_restore_dbs(bool);
static void create_new_objects(void);
static void copy_xact_xlog_xid(void);
static void set_frozenxids(bool minmxid_only);
@@ -508,6 +509,96 @@ prepare_new_globals(void)
check_ok();
}
+/*
+ * This function will use parallel pg_restore to restore
+ * {data}, {post_data} section of the dump.
+ * Since {pre_data} section cant be parallelized , instead we will
+ * restore {pg_upgrade --jobs} dbs in parallel.
+ */
+static void
+parallel_pg_restore_dbs(bool pre_data)
+{
+ int dbnum;
+ int jobs = user_opts.jobs ? user_opts.jobs : 1;
+ /*
+ * Restore @section of the dump with parallel pg_restore
+ */
+ for (dbnum = 0; dbnum < old_cluster.dbarr.ndbs; dbnum++)
+ {
+ char sql_file_name[MAXPGPATH],
+ log_file_name[MAXPGPATH];
+ DbInfo *old_db = &old_cluster.dbarr.dbs[dbnum];
+ const char *create_opts;
+
+ /* Skip template1 in this pass */
+ if (strcmp(old_db->db_name, "template1") == 0)
+ continue;
+
+ pg_log(PG_STATUS, "%s", old_db->db_name);
+ snprintf(sql_file_name, sizeof(sql_file_name), DB_DUMP_FILE_MASK, old_db->db_oid);
+ snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
+
+ /*
+ * postgres database will already exist in the target installation, so
+ * tell pg_restore to drop and recreate it; otherwise we would fail to
+ * propagate its database-level properties.
+ */
+ if (pre_data)
+ {
+ if (strcmp(old_db->db_name, "postgres") == 0)
+ create_opts = "--clean --create";
+ else if (pre_data)
+ create_opts = "--create";
+ }
+ else
+ create_opts = "";
+
+ /*
+ * Restore pre-data section of the dump in parallel with single pg_restore job
+ * This section of dump cant be parallelized with parallel pg_restore
+ */
+ if (pre_data)
+ {
+ parallel_exec_prog(log_file_name,
+ NULL,
+ "\"%s/pg_restore\" %s %s --exit-on-error --verbose "
+ " --section=pre-data "
+ "--dbname template1 \"%s/%s\"",
+ new_cluster.bindir,
+ cluster_conn_opts(&new_cluster),
+ create_opts,
+ log_opts.dumpdir,
+ sql_file_name);
+
+ }
+ else
+ {
+ exec_prog(log_file_name,
+ NULL,
+ true,
+ true,
+ "\"%s/pg_restore\" %s %s --exit-on-error --verbose "
+ "--transaction-size=1000 --jobs %d --section=data --section=post-data "
+ "--dbname template1 \"%s/%s\"",
+ new_cluster.bindir,
+ cluster_conn_opts(&new_cluster),
+ create_opts,
+ jobs,
+ log_opts.dumpdir,
+ sql_file_name);
+ }
+ }
+ /*
+ * Wait for child process for pre-data section
+ */
+ if (pre_data)
+ {
+ /* reap all children */
+ while (reap_child(true) == true)
+ ;
+ }
+
+}
static void
create_new_objects(void)
@@ -559,46 +650,10 @@ create_new_objects(void)
break; /* done once we've processed template1 */
}
- for (dbnum = 0; dbnum < old_cluster.dbarr.ndbs; dbnum++)
- {
- char sql_file_name[MAXPGPATH],
- log_file_name[MAXPGPATH];
- DbInfo *old_db = &old_cluster.dbarr.dbs[dbnum];
- const char *create_opts;
-
- /* Skip template1 in this pass */
- if (strcmp(old_db->db_name, "template1") == 0)
- continue;
-
- pg_log(PG_STATUS, "%s", old_db->db_name);
- snprintf(sql_file_name, sizeof(sql_file_name), DB_DUMP_FILE_MASK, old_db->db_oid);
- snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
-
- /*
- * postgres database will already exist in the target installation, so
- * tell pg_restore to drop and recreate it; otherwise we would fail to
- * propagate its database-level properties.
- */
- if (strcmp(old_db->db_name, "postgres") == 0)
- create_opts = "--clean --create";
- else
- create_opts = "--create";
-
- parallel_exec_prog(log_file_name,
- NULL,
- "\"%s/pg_restore\" %s %s --exit-on-error --verbose "
- "--transaction-size=1000 "
- "--dbname template1 \"%s/%s\"",
- new_cluster.bindir,
- cluster_conn_opts(&new_cluster),
- create_opts,
- log_opts.dumpdir,
- sql_file_name);
- }
-
- /* reap all children */
- while (reap_child(true) == true)
- ;
+ /* Restore pre_data */
+ parallel_pg_restore_dbs(true);
+ /* Restore data, post_data */
+ parallel_pg_restore_dbs(false);
end_progress_output();
check_ok();
view thread (18+ messages) latest in thread
reply
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Reply to all the recipients using the --to and --cc options:
reply via email
To: [email protected]
Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
Subject: Re: pg_upgrade failing for 200+ million Large Objects
In-Reply-To: <[email protected]>
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox