RE: pg_upgrade failing for 200+ million Large Objects

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Tharakan, Robins <[email protected]>
To: [email protected] <[email protected]>
Subject: RE: pg_upgrade failing for 200+ million Large Objects
Date: Sun, 7 Mar 2021 08:43:28 +0000
Message-ID: <[email protected]> (raw)

Hi all,

Attached is a proof-of-concept patch that allows Postgres to perform
pg_upgrade if the instance has Millions of objects.

It would be great if someone could take a look and see if this patch is in
the right direction. There are some pending tasks (such as documentation /
pg_resetxlog vs pg_resetwal related changes) but for now, the patch helps
remove a stalemate where if a Postgres instance has a large number
(accurately speaking 146+ Million) of Large Objects, pg_upgrade fails. This
is easily reproducible and besides deleting Large Objects before upgrade,
there is no other (apparent) way for pg_upgrade to complete.

The patch (attached):
- Applies cleanly on REL9_6_STABLE -
c7a4fc3dd001646d5938687ad59ab84545d5d043
- 'make check' passes
- Allows the user to provide a constant via pg_upgrade command-line, that
overrides the 2 billion constant in pg_resetxlog [1] thereby increasing the
(window of) Transaction IDs available for pg_upgrade to complete. 


Sample argument for pg_upgrade:
$ /opt/postgres/96/bin/pg_upgrade --max-limit-xid 1000000000 --old-bindir
...


With this patch, pg_upgrade is now able to upgrade a v9.5 cluster with 500
million Large Objects successfully to v9.6 - some stats below:

Source Postgres - v9.5.24
Target Version - v9.6.21
Large Object Count: 500 Million Large Objects
Machine - r5.4xlarge (16vCPU / 128GB RAM + 1TB swap)
Memory used during pg_upgrade - ~350GB
Time taken - 25+ hrs. (tested twice) - (All LOs processed sequentially ->
Scope for optimization)

Although counter-intuitive, for this testing purpose all Large Objects were
small (essentially the idea was to test the count) and created by using
something like this:

seq 1 50000 | xargs -n 1 -i -P 10 /opt/postgres/95/bin/psql -c "select
lo_from_bytea(0, '\xffffff00') from generate_series(1,10000);" > /dev/null

I am not married to the patch (especially the argument name) but ideally I'd
prefer a way to get this upgrade going without a patch. For now, I am unable
to find any other way to upgrade a v9.5 Postgres database in this scenario,
facing End-of-Life.

Reference:
1) 2 Billion constant -
https://github.com/postgres/postgres/blob/ca3b37487be333a1d241dab1bbdd17a211
a88f43/src/bin/pg_resetwal/pg_resetwal.c#L444

Thanks,
Robins Tharakan

> -----Original Message-----
> From: Tharakan, Robins
> Sent: Wednesday, 3 March 2021 10:36 PM
> To: [email protected]
> Subject: pg_upgrade failing for 200+ million Large Objects
> 
> Hi,
> 
> While reviewing a failed upgrade from Postgres v9.5 (to v9.6) I saw that
> the
> instance had ~200 million (in-use) Large Objects. I was able to reproduce
> this on a test instance which too fails with a similar error.
> 
> 
> pg_restore: executing BLOB 4980622
> pg_restore: WARNING: database with OID 0 must be vacuumed within 1000001
> transactions
> HINT: To avoid a database shutdown, execute a database-wide VACUUM in
> that
> database.
> You might also need to commit or roll back old prepared transactions.
> pg_restore: executing BLOB 4980623
> pg_restore: [archiver (db)] Error while PROCESSING TOC:
> pg_restore: [archiver (db)] Error from TOC entry 2565; 2613 4980623 BLOB
> 4980623 postgres
> pg_restore: [archiver (db)] could not execute query: ERROR: database is
> not
> accepting commands to avoid wraparound data loss in database with OID 0
> HINT: Stop the postmaster and vacuum that database in single-user mode.
> You might also need to commit or roll back old prepared transactions.
> Command was: SELECT pg_catalog.lo_create('4980623');
> 
> 
> 
> To remove the obvious possibilities, these Large Objects that are still
> in-use (so vacuumlo wouldn't help), giving more system resources doesn't
> help, moving Large Objects around to another database doesn't help (since
> this is cluster-wide restriction), the source instance is nowhere close
> to
> wraparound and lastly recent-most minor versions don't help either (I
> tried
> compiling 9_6_STABLE + upgrade database with 150 million LO and still
> encountered the same issue).
> 
> Do let me know if I am missing something obvious but it appears that this
> is
> happening owing to 2 things coming together:
> 
> * Each Large Object is migrated in its own transaction during pg_upgrade
> * pg_resetxlog appears to be narrowing the window (available for
> pg_upgrade)
> to ~146 Million XIDs (2^31 - 1 million XID wraparound margin - 2 billion
> which is a hard-coded constant - see [1] - in what appears to be an
> attempt
> to force an Autovacuum Wraparound session soon after upgrade completes).
> 
> Ideally such an XID based restriction, is limiting for an instance that's
> actively using a lot of Large Objects. Besides forcing AutoVacuum
> Wraparound
> logic to kick in soon after, I am unclear what much else it aims to do.
> What
> it does seem to be doing is to block Major Version upgrades if the
> pre-upgrade instance has >146 Million Large Objects (half that, if the LO
> additionally requires ALTER LARGE OBJECT OWNER TO for each of those
> objects
> during pg_restore)
> 
> For long-term these ideas came to mind, although am unsure which are
> low-hanging fruits and which outright impossible - For e.g. clubbing
> multiple objects in a transaction [2] / Force AutoVacuum post upgrade
> (and
> thus remove this limitation altogether) or see if "pg_resetxlog -x" (from
> within pg_upgrade) could help in some way to work-around this limitation.
> 
> Is there a short-term recommendation for this scenario?
> 
> I can understand a high number of small-sized objects is not a great way
> to
> use pg_largeobject (since Large Objects was intended to be for, well,
> 'large
> objects') but this magic number of Large Objects is now a stalemate at
> this
> point (with respect to v9.5 EOL).
> 
> 
> Reference:
> 1) pg_resetxlog -
> https://github.com/postgres/postgres/blob/ca3b37487be333a1d241dab1bbdd17a
> 211
> a88f43/src/bin/pg_resetwal/pg_resetwal.c#L444
> 2)
> https://www.postgresql.org/message-id/ed7d86a1-b907-4f53-9f6e-
> 63482d2f2bac%4
> 0manitou-mail.org
> 
> -
> Thanks
> Robins Tharakan


Attachments:

  [application/octet-stream] pgupgrade_lo_v2.patch (5.4K, 2-pgupgrade_lo_v2.patch)
  download | inline diff:
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index 3e79482ca2..f2e9824cb5 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -67,6 +67,7 @@ static TransactionId set_xid = 0;
 static TransactionId set_oldest_commit_ts_xid = 0;
 static TransactionId set_newest_commit_ts_xid = 0;
 static Oid	set_oid = 0;
+static Oid	max_limit_xid = 2000000000;
 static MultiXactId set_mxid = 0;
 static MultiXactOffset set_mxoff = (MultiXactOffset) -1;
 static uint32 minXlogTli = 0;
@@ -116,7 +117,7 @@ main(int argc, char *argv[])
 	}
 
 
-	while ((c = getopt(argc, argv, "c:D:e:fl:m:no:O:x:")) != -1)
+	while ((c = getopt(argc, argv, "c:D:e:fl:L:m:no:O:x:")) != -1)
 	{
 		switch (c)
 		{
@@ -210,6 +211,21 @@ main(int argc, char *argv[])
 				}
 				break;
 
+			case 'L':
+				max_limit_xid = strtoul(optarg, &endptr, 0);
+				if (endptr == optarg || *endptr != '\0')
+				{
+					fprintf(stderr, _("%s: invalid argument for option %s\n"), progname, "-L");
+					fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
+					exit(1);
+				}
+				if (max_limit_xid <= 500000000)
+				{
+					fprintf(stderr, _("%s: Max Limit XID (-L) must not be less than 500 Million\n"), progname);
+					exit(1);
+				}
+				break;
+
 			case 'm':
 				set_mxid = strtoul(optarg, &endptr, 0);
 				if (endptr == optarg || *endptr != ',')
@@ -381,7 +397,7 @@ main(int argc, char *argv[])
 		 * reasonably safe.  The magic constant here corresponds to the
 		 * maximum allowed value of autovacuum_freeze_max_age.
 		 */
-		ControlFile.checkPointCopy.oldestXid = set_xid - 2000000000;
+		ControlFile.checkPointCopy.oldestXid = set_xid - max_limit_xid;
 		if (ControlFile.checkPointCopy.oldestXid < FirstNormalTransactionId)
 			ControlFile.checkPointCopy.oldestXid += FirstNormalTransactionId;
 		ControlFile.checkPointCopy.oldestXidDB = InvalidOid;
@@ -1239,6 +1255,7 @@ usage(void)
 	printf(_("  -e XIDEPOCH      set next transaction ID epoch\n"));
 	printf(_("  -f               force update to be done\n"));
 	printf(_("  -l XLOGFILE      force minimum WAL starting location for new transaction log\n"));
+	printf(_("  -L MAXLIMITXID   force max XID starting location for new transaction log\n"));
 	printf(_("  -m MXID,MXID     set next and oldest multitransaction ID\n"));
 	printf(_("  -n               no update, just show what would be done (for testing)\n"));
 	printf(_("  -o OID           set next OID\n"));
diff --git a/src/bin/pg_upgrade/option.c b/src/bin/pg_upgrade/option.c
index 7ab284a51b..3a861739f7 100644
--- a/src/bin/pg_upgrade/option.c
+++ b/src/bin/pg_upgrade/option.c
@@ -53,6 +53,7 @@ parseCommandLine(int argc, char *argv[])
 		{"link", no_argument, NULL, 'k'},
 		{"retain", no_argument, NULL, 'r'},
 		{"jobs", required_argument, NULL, 'j'},
+		{"max-limit-xid", required_argument, NULL, 'L'},
 		{"verbose", no_argument, NULL, 'v'},
 		{NULL, 0, NULL, 0}
 	};
@@ -64,6 +65,7 @@ parseCommandLine(int argc, char *argv[])
 	time_t		run_time = time(NULL);
 
 	user_opts.transfer_mode = TRANSFER_MODE_COPY;
+	user_opts.maxlimitxid = 2000000000;
 
 	os_info.progname = get_progname(argv[0]);
 
@@ -101,7 +103,7 @@ parseCommandLine(int argc, char *argv[])
 	if ((log_opts.internal = fopen_priv(INTERNAL_LOG_FILE, "a")) == NULL)
 		pg_fatal("cannot write to log file %s\n", INTERNAL_LOG_FILE);
 
-	while ((option = getopt_long(argc, argv, "d:D:b:B:cj:ko:O:p:P:rU:v",
+	while ((option = getopt_long(argc, argv, "d:D:b:B:cj:ko:L:O:p:P:rU:v",
 								 long_options, &optindex)) != -1)
 	{
 		switch (option)
@@ -132,6 +134,10 @@ parseCommandLine(int argc, char *argv[])
 				user_opts.jobs = atoi(optarg);
 				break;
 
+			case 'L':
+				user_opts.maxlimitxid = atoi(optarg);
+				break;
+
 			case 'k':
 				user_opts.transfer_mode = TRANSFER_MODE_LINK;
 				break;
@@ -287,6 +293,7 @@ Options:\n\
   -D, --new-datadir=DATADIR     new cluster data directory\n\
   -j, --jobs=NUM                number of simultaneous processes or threads to use\n\
   -k, --link                    link instead of copying files to new cluster\n\
+  -L, --max-limit-xid=NUM       max-limit XIDs to consider\n\
   -o, --old-options=OPTIONS     old cluster options to pass to the server\n\
   -O, --new-options=OPTIONS     new cluster options to pass to the server\n\
   -p, --old-port=PORT           old cluster port number (default %d)\n\
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 02078c0357..2d0f3a7e04 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -411,8 +411,10 @@ copy_clog_xlog_xid(void)
 	/* set the next transaction id and epoch of the new cluster */
 	prep_status("Setting next transaction ID and epoch for new cluster");
 	exec_prog(UTILITY_LOG_FILE, NULL, true,
-			  "\"%s/pg_resetxlog\" -f -x %u \"%s\"",
-			  new_cluster.bindir, old_cluster.controldata.chkpnt_nxtxid,
+			  "\"%s/pg_resetxlog\" -L %u -f -x %u \"%s\"",
+			  new_cluster.bindir,
+			  user_opts.maxlimitxid,
+			  old_cluster.controldata.chkpnt_nxtxid,
 			  new_cluster.pgdata);
 	exec_prog(UTILITY_LOG_FILE, NULL, true,
 			  "\"%s/pg_resetxlog\" -f -e %u \"%s\"",
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 9fbdacc53e..50fe73ae09 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -298,6 +298,7 @@ typedef struct
 								 * changes */
 	transferMode transfer_mode; /* copy files or link them? */
 	int			jobs;
+	int			maxlimitxid;
 } UserOpts;
 
 


  [application/pkcs7-signature] smime.p7s (5.0K, 3-smime.p7s)
  download

view thread (49+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected]
  Subject: RE: pg_upgrade failing for 200+ million Large Objects
  In-Reply-To: <[email protected]>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox