Re: pg_upgrade failing for 200+ million Large Objects

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Kumar, Sachin <[email protected]>
To: Tom Lane <[email protected]>
To: Robins Tharakan <[email protected]>
Cc: Nathan Bossart <[email protected]>
Cc: Jan Wieck <[email protected]>
Cc: Bruce Momjian <[email protected]>
Cc: Zhihong Yu <[email protected]>
Cc: Andrew Dunstan <[email protected]>
Cc: Magnus Hagander <[email protected]>
Cc: Robins Tharakan <[email protected]>
Cc: Peter Eisentraut <[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: pg_upgrade failing for 200+ million Large Objects
Date: Tue, 2 Jan 2024 17:33:00 +0000
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
References: <[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<20220825003227.GA1456581@nathanxps13>
	<[email protected]>
	<20220908231807.GA2242918@nathanxps13>
	<CAAWbhmgUb8p7ff_ZX5jCvqM=ipPxbbDJTXMNVzH-Ho_CXVkRHA@mail.gmail.com>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>

> On 11/12/2023, 01:43, "Tom Lane" <[email protected] <mailto:[email protected]>> wrote:

> I had initially supposed that in a parallel restore we could
> have child workers also commit after every N TOC items, but was
> soon disabused of that idea. After a worker processes a TOC
> item, any dependent items (such as index builds) might get
> dispatched to some other worker, which had better be able to
> see the results of the first worker's step. So at least in
> this implementation, we disable the multi-command-per-COMMIT
> behavior during the parallel part of the restore. Maybe that
> could be improved in future, but it seems like it'd add a
> lot more complexity, and it wouldn't make life any better for
> pg_upgrade (which doesn't use parallel pg_restore, and seems
> unlikely to want to in future).

I was not able to find email thread which details why we are not using
parallel pg_restore for pg_upgrade. IMHO most of the customer will have single large
database, and not using parallel restore will cause slow pg_upgrade.

I am attaching a patch which enables parallel pg_restore for DATA and POST-DATA part
of dump. It will push down --jobs value to pg_restore and will restore database sequentially.

Benchmarks

{5 million LOs 1 large DB}
Patched {v9}
time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub --jobs=20
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir      17.51s user 65.80s system 35% cpu 3:56.64 total


time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub -r
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir      17.51s user 65.85s system 34% cpu 3:58.39 total


HEAD
time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub -r --jobs=20
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir      53.95s user 82.44s system 41% cpu 5:25.23 total

time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub -r
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir      54.94s user 81.26s system 41% cpu 5:24.86 total



Fix with --jobs propagation to pg_restore {on top of v9}
time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub -r --jobs=20
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir      29.12s user 69.85s system 275% cpu 35.930 total 


Although parallel restore does have small regression in ideal case of pg_upgrade --jobs


Multiple DBs {4 DBs each having 2 million LOs}

Fix with --jobs scheduling
time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub -r --jobs=4
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir      31.80s user 109.52s system 120% cpu 1:57.35 total


Patched {v9}
time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub -r --jobs=4
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir      30.88s user 110.05s system 135% cpu 1:43.97 total


Regards
Sachin



Attachments:

  [application/octet-stream] v9-005-parallel_pg_restore.patch (5.0K, 2-v9-005-parallel_pg_restore.patch)
  download | inline diff:
commit 02513d121ce0b96be5619edfd0317b46b70a44da
Author: Sachin Kumar <[email protected]>
Date:   Tue Jan 2 16:57:42 2024 +0000

    Pass pg_upgrade --jobs parameter to pg_restore
    
    This patch changes pg_upgrade --jobs behaviour for data, post_data
    part of restore. Instead for restoring N databases in parallel,
    they are restored in sequential order but with pg_restore
    --jobs = {original pg_upgrade --jobs value}

diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 5cfd2282e1..73ffdafd19 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -54,6 +54,7 @@
 static void set_locale_and_encoding(void);
 static void prepare_new_cluster(void);
 static void prepare_new_globals(void);
+static void parallel_pg_restore_dbs(bool);
 static void create_new_objects(void);
 static void copy_xact_xlog_xid(void);
 static void set_frozenxids(bool minmxid_only);
@@ -508,6 +509,96 @@ prepare_new_globals(void)
 	check_ok();
 }
 
+/*
+ * This function will use parallel pg_restore to restore
+ * {data}, {post_data} section of the dump.
+ * Since {pre_data} section cant be parallelized , instead we will
+ * restore {pg_upgrade --jobs} dbs in parallel.
+ */
+static void
+parallel_pg_restore_dbs(bool pre_data)
+{
+	int			dbnum;
+	int         jobs = user_opts.jobs ? user_opts.jobs : 1;
+	/*
+	 * Restore @section of the dump with parallel pg_restore
+	 */
+	for (dbnum = 0; dbnum < old_cluster.dbarr.ndbs; dbnum++)
+	{
+		char		sql_file_name[MAXPGPATH],
+					log_file_name[MAXPGPATH];
+		DbInfo	   *old_db = &old_cluster.dbarr.dbs[dbnum];
+		const char *create_opts;
+
+		/* Skip template1 in this pass */
+		if (strcmp(old_db->db_name, "template1") == 0)
+			continue;
+
+		pg_log(PG_STATUS, "%s", old_db->db_name);
+		snprintf(sql_file_name, sizeof(sql_file_name), DB_DUMP_FILE_MASK, old_db->db_oid);
+		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
+
+		/*
+		 * postgres database will already exist in the target installation, so
+		 * tell pg_restore to drop and recreate it; otherwise we would fail to
+		 * propagate its database-level properties.
+		 */
+		if (pre_data)
+		{
+			if (strcmp(old_db->db_name, "postgres") == 0)
+				create_opts = "--clean --create";
+			else if (pre_data)
+				create_opts = "--create";
+		}
+		else
+			create_opts = "";
+
+		/*
+		 * Restore pre-data section of the dump in parallel with single pg_restore job
+		 * This section of dump cant be parallelized with parallel pg_restore
+		 */
+		if (pre_data)
+		{
+			parallel_exec_prog(log_file_name,
+							   NULL,
+							   "\"%s/pg_restore\" %s %s --exit-on-error --verbose "
+							   " --section=pre-data "
+							   "--dbname template1 \"%s/%s\"",
+							   new_cluster.bindir,
+							   cluster_conn_opts(&new_cluster),
+							   create_opts,
+							   log_opts.dumpdir,
+							   sql_file_name);
+
+		}
+		else
+		{
+			exec_prog(log_file_name,
+							   NULL,
+							   true,
+							   true,
+							   "\"%s/pg_restore\" %s %s --exit-on-error --verbose "
+							   "--transaction-size=1000 --jobs %d --section=data --section=post-data "
+							   "--dbname template1 \"%s/%s\"",
+							   new_cluster.bindir,
+							   cluster_conn_opts(&new_cluster),
+							   create_opts,
+							   jobs,
+							   log_opts.dumpdir,
+							   sql_file_name);
+		}
+	}
+	/*
+	 * Wait for child process for pre-data section
+	 */
+	if (pre_data)
+	{
+		/* reap all children */
+		while (reap_child(true) == true)
+			;
+	}
+
+}
 
 static void
 create_new_objects(void)
@@ -559,46 +650,10 @@ create_new_objects(void)
 		break;					/* done once we've processed template1 */
 	}
 
-	for (dbnum = 0; dbnum < old_cluster.dbarr.ndbs; dbnum++)
-	{
-		char		sql_file_name[MAXPGPATH],
-					log_file_name[MAXPGPATH];
-		DbInfo	   *old_db = &old_cluster.dbarr.dbs[dbnum];
-		const char *create_opts;
-
-		/* Skip template1 in this pass */
-		if (strcmp(old_db->db_name, "template1") == 0)
-			continue;
-
-		pg_log(PG_STATUS, "%s", old_db->db_name);
-		snprintf(sql_file_name, sizeof(sql_file_name), DB_DUMP_FILE_MASK, old_db->db_oid);
-		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
-
-		/*
-		 * postgres database will already exist in the target installation, so
-		 * tell pg_restore to drop and recreate it; otherwise we would fail to
-		 * propagate its database-level properties.
-		 */
-		if (strcmp(old_db->db_name, "postgres") == 0)
-			create_opts = "--clean --create";
-		else
-			create_opts = "--create";
-
-		parallel_exec_prog(log_file_name,
-						   NULL,
-						   "\"%s/pg_restore\" %s %s --exit-on-error --verbose "
-						   "--transaction-size=1000 "
-						   "--dbname template1 \"%s/%s\"",
-						   new_cluster.bindir,
-						   cluster_conn_opts(&new_cluster),
-						   create_opts,
-						   log_opts.dumpdir,
-						   sql_file_name);
-	}
-
-	/* reap all children */
-	while (reap_child(true) == true)
-		;
+	/* Restore pre_data */
+	parallel_pg_restore_dbs(true);
+	/* Restore data, post_data */
+	parallel_pg_restore_dbs(false);
 
 	end_progress_output();
 	check_ok();

view thread (18+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
  Subject: Re: pg_upgrade failing for 200+ million Large Objects
  In-Reply-To: <[email protected]>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox