Re: pg_upgrade failing for 200+ million Large Objects

public inbox for [email protected]  
help / color / mirror / Atom feed

Re: pg_upgrade failing for 200+ million Large Objects
15+ messages / 5 participants
[nested] [flat]

* Re: pg_upgrade failing for 200+ million Large Objects
@ 2024-04-01 19:28  Tom Lane <[email protected]>
  0 siblings, 2 replies; 15+ messages in thread

From: Tom Lane @ 2024-04-01 19:28 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: Michael Banck <[email protected]>; Laurenz Albe <[email protected]>; vignesh C <[email protected]>; Kumar, Sachin <[email protected]>; Robins Tharakan <[email protected]>; Jan Wieck <[email protected]>; Bruce Momjian <[email protected]>; Andrew Dunstan <[email protected]>; Magnus Hagander <[email protected]>; Peter Eisentraut <[email protected]>; pgsql-hackers

Nathan Bossart <[email protected]> writes:
> Sorry for taking so long to get back to this one.  Overall, I think the
> code is in decent shape.

Thanks for looking at it!

> The one design point that worries me a little is the non-configurability of
> --transaction-size in pg_upgrade.  I think it's fine to default it to 1,000
> or something, but given how often I've had to fiddle with
> max_locks_per_transaction, I'm wondering if we might regret hard-coding it.

Well, we could add a command-line switch to pg_upgrade, but I'm
unconvinced that it'd be worth the trouble.  I think a very large
fraction of users invoke pg_upgrade by means of packager-supplied
scripts that are unlikely to provide a way to pass through such
a switch.  I'm inclined to say let's leave it as-is until we get
some actual field requests for a switch.

			regards, tom lane






^ permalink  raw  reply  [nested|flat] 15+ messages in thread

* Re: pg_upgrade failing for 200+ million Large Objects
@ 2024-04-01 19:37  Nathan Bossart <[email protected]>
  parent: Tom Lane <[email protected]>
  1 sibling, 0 replies; 15+ messages in thread

From: Nathan Bossart @ 2024-04-01 19:37 UTC (permalink / raw)
  To: Tom Lane <[email protected]>; +Cc: Michael Banck <[email protected]>; Laurenz Albe <[email protected]>; vignesh C <[email protected]>; Kumar, Sachin <[email protected]>; Robins Tharakan <[email protected]>; Jan Wieck <[email protected]>; Bruce Momjian <[email protected]>; Andrew Dunstan <[email protected]>; Magnus Hagander <[email protected]>; Peter Eisentraut <[email protected]>; pgsql-hackers

On Mon, Apr 01, 2024 at 03:28:26PM -0400, Tom Lane wrote:
> Nathan Bossart <[email protected]> writes:
>> The one design point that worries me a little is the non-configurability of
>> --transaction-size in pg_upgrade.  I think it's fine to default it to 1,000
>> or something, but given how often I've had to fiddle with
>> max_locks_per_transaction, I'm wondering if we might regret hard-coding it.
> 
> Well, we could add a command-line switch to pg_upgrade, but I'm
> unconvinced that it'd be worth the trouble.  I think a very large
> fraction of users invoke pg_upgrade by means of packager-supplied
> scripts that are unlikely to provide a way to pass through such
> a switch.  I'm inclined to say let's leave it as-is until we get
> some actual field requests for a switch.

Okay.  I'll let you know if I see anything.  IIRC usually the pg_dump side
of pg_upgrade is more prone to lock exhaustion, so you may very well be
right that this is unnecessary.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com






^ permalink  raw  reply  [nested|flat] 15+ messages in thread

* Re: pg_upgrade failing for 200+ million Large Objects
@ 2024-07-24 14:17  Justin Pryzby <[email protected]>
  parent: Tom Lane <[email protected]>
  1 sibling, 1 reply; 15+ messages in thread

From: Justin Pryzby @ 2024-07-24 14:17 UTC (permalink / raw)
  To: Tom Lane <[email protected]>; +Cc: Nathan Bossart <[email protected]>; Michael Banck <[email protected]>; Laurenz Albe <[email protected]>; vignesh C <[email protected]>; Kumar, Sachin <[email protected]>; Robins Tharakan <[email protected]>; Jan Wieck <[email protected]>; Bruce Momjian <[email protected]>; Andrew Dunstan <[email protected]>; Magnus Hagander <[email protected]>; Peter Eisentraut <[email protected]>; pgsql-hackers

On Mon, Apr 01, 2024 at 03:28:26PM -0400, Tom Lane wrote:
> Nathan Bossart <[email protected]> writes:
> > The one design point that worries me a little is the non-configurability of
> > --transaction-size in pg_upgrade.  I think it's fine to default it to 1,000
> > or something, but given how often I've had to fiddle with
> > max_locks_per_transaction, I'm wondering if we might regret hard-coding it.
> 
> Well, we could add a command-line switch to pg_upgrade, but I'm
> unconvinced that it'd be worth the trouble.  I think a very large
> fraction of users invoke pg_upgrade by means of packager-supplied
> scripts that are unlikely to provide a way to pass through such
> a switch.  I'm inclined to say let's leave it as-is until we get
> some actual field requests for a switch.

I've been importing our schemas and doing upgrade testing, and was
surprised when a postgres backend was killed for OOM during pg_upgrade:

Killed process 989302 (postgres) total-vm:5495648kB, anon-rss:5153292kB, ...

Upgrading from v16 => v16 doesn't use nearly as much RAM.

While tracking down the responsible commit, I reproduced the problem
using a subset of tables; at 959b38d770, the backend process used
~650 MB RAM, and at its parent commit used at most ~120 MB.

959b38d770b Invent --transaction-size option for pg_restore.

By changing RESTORE_TRANSACTION_SIZE to 100, backend RAM use goes to
180 MB during pg_upgrade, which is reasonable.

With partitioning, we have a lot of tables, some of them wide (126
partitioned tables, 8942 childs, total 1019315 columns).  I didn't track
if certain parts of our schema contribute most to the high backend mem
use, just that it's now 5x (while testing a subset) to 50x higher.

We'd surely prefer that the transaction size be configurable.

-- 
Justin

^ permalink  raw  reply  [nested|flat] 15+ messages in thread

* Re: pg_upgrade failing for 200+ million Large Objects
@ 2024-07-26 19:53  Alexander Korotkov <[email protected]>
  parent: Justin Pryzby <[email protected]>
  0 siblings, 2 replies; 15+ messages in thread

From: Alexander Korotkov @ 2024-07-26 19:53 UTC (permalink / raw)
  To: Justin Pryzby <[email protected]>; +Cc: Tom Lane <[email protected]>; Nathan Bossart <[email protected]>; Michael Banck <[email protected]>; Laurenz Albe <[email protected]>; vignesh C <[email protected]>; Kumar, Sachin <[email protected]>; Robins Tharakan <[email protected]>; Jan Wieck <[email protected]>; Bruce Momjian <[email protected]>; Andrew Dunstan <[email protected]>; Magnus Hagander <[email protected]>; Peter Eisentraut <[email protected]>; pgsql-hackers

Hi, Justin!

Thank you for sharing this.

On Wed, Jul 24, 2024 at 5:18 PM Justin Pryzby <[email protected]> wrote:
> On Mon, Apr 01, 2024 at 03:28:26PM -0400, Tom Lane wrote:
> > Nathan Bossart <[email protected]> writes:
> > > The one design point that worries me a little is the non-configurability of
> > > --transaction-size in pg_upgrade.  I think it's fine to default it to 1,000
> > > or something, but given how often I've had to fiddle with
> > > max_locks_per_transaction, I'm wondering if we might regret hard-coding it.
> >
> > Well, we could add a command-line switch to pg_upgrade, but I'm
> > unconvinced that it'd be worth the trouble.  I think a very large
> > fraction of users invoke pg_upgrade by means of packager-supplied
> > scripts that are unlikely to provide a way to pass through such
> > a switch.  I'm inclined to say let's leave it as-is until we get
> > some actual field requests for a switch.
>
> I've been importing our schemas and doing upgrade testing, and was
> surprised when a postgres backend was killed for OOM during pg_upgrade:
>
> Killed process 989302 (postgres) total-vm:5495648kB, anon-rss:5153292kB, ...
>
> Upgrading from v16 => v16 doesn't use nearly as much RAM.
>
> While tracking down the responsible commit, I reproduced the problem
> using a subset of tables; at 959b38d770, the backend process used
> ~650 MB RAM, and at its parent commit used at most ~120 MB.
>
> 959b38d770b Invent --transaction-size option for pg_restore.
>
> By changing RESTORE_TRANSACTION_SIZE to 100, backend RAM use goes to
> 180 MB during pg_upgrade, which is reasonable.
>
> With partitioning, we have a lot of tables, some of them wide (126
> partitioned tables, 8942 childs, total 1019315 columns).  I didn't track
> if certain parts of our schema contribute most to the high backend mem
> use, just that it's now 5x (while testing a subset) to 50x higher.

Do you think there is a way to anonymize the schema and share it?

> We'd surely prefer that the transaction size be configurable.

I think we can add an option to pg_upgrade.  But I wonder if there is
something else we can do.  It seems that restoring some objects is
much more expensive than restoring others.  It would be nice to
identify such cases and check which memory contexts are growing and
why.  It would be helpful if you could share your data schema, so we
could dig into it.

I can imagine we need to count some DDL commands in aspect of maximum
restore transaction size in a different way than others.  Also, we
probably need to change the default restore transaction size.

------
Regards,
Alexander Korotkov
Supabase






^ permalink  raw  reply  [nested|flat] 15+ messages in thread

* Re: pg_upgrade failing for 200+ million Large Objects
@ 2024-07-26 20:05  Tom Lane <[email protected]>
  parent: Alexander Korotkov <[email protected]>
  1 sibling, 0 replies; 15+ messages in thread

From: Tom Lane @ 2024-07-26 20:05 UTC (permalink / raw)
  To: Alexander Korotkov <[email protected]>; +Cc: Justin Pryzby <[email protected]>; Nathan Bossart <[email protected]>; Michael Banck <[email protected]>; Laurenz Albe <[email protected]>; vignesh C <[email protected]>; Kumar, Sachin <[email protected]>; Robins Tharakan <[email protected]>; Jan Wieck <[email protected]>; Bruce Momjian <[email protected]>; Andrew Dunstan <[email protected]>; Magnus Hagander <[email protected]>; Peter Eisentraut <[email protected]>; pgsql-hackers

Alexander Korotkov <[email protected]> writes:
> On Wed, Jul 24, 2024 at 5:18 PM Justin Pryzby <[email protected]> wrote:
>> We'd surely prefer that the transaction size be configurable.

> I think we can add an option to pg_upgrade.  But I wonder if there is
> something else we can do.

Yeah, I'm not enamored of adding a command-line option, if only
because I think a lot of people invoke pg_upgrade through
vendor-provided scripts that aren't going to cooperate with that.
If we can find some way to make it adapt without help, that
would be much better.

			regards, tom lane






^ permalink  raw  reply  [nested|flat] 15+ messages in thread

* Re: pg_upgrade failing for 200+ million Large Objects
@ 2024-07-26 20:36  Justin Pryzby <[email protected]>
  parent: Alexander Korotkov <[email protected]>
  1 sibling, 2 replies; 15+ messages in thread

From: Justin Pryzby @ 2024-07-26 20:36 UTC (permalink / raw)
  To: Alexander Korotkov <[email protected]>; +Cc: Tom Lane <[email protected]>; Nathan Bossart <[email protected]>; Michael Banck <[email protected]>; Laurenz Albe <[email protected]>; vignesh C <[email protected]>; Kumar, Sachin <[email protected]>; Robins Tharakan <[email protected]>; Jan Wieck <[email protected]>; Bruce Momjian <[email protected]>; Andrew Dunstan <[email protected]>; Magnus Hagander <[email protected]>; Peter Eisentraut <[email protected]>; pgsql-hackers

On Wed, Jul 24, 2024 at 09:17:51AM -0500, Justin Pryzby wrote:
> With partitioning, we have a lot of tables, some of them wide (126
> partitioned tables, 8942 childs, total 1019315 columns).

On Fri, Jul 26, 2024 at 10:53:30PM +0300, Alexander Korotkov wrote:
> It would be nice to identify such cases and check which memory contexts are
> growing and why.

I reproduced the problem with this schema:

SELECT format('CREATE TABLE p(i int, %s) PARTITION BY RANGE(i)', array_to_string(a, ', ')) FROM (SELECT array_agg(format('i%s int', i))a FROM generate_series(1,999)i);
SELECT format('CREATE TABLE t%s PARTITION OF p FOR VALUES FROM (%s)TO(%s)', i,i,i+1) FROM generate_series(1,999)i;

This used over 4 GB of RAM.
3114201 pryzbyj   20   0 5924520   4.2g  32476 T   0.0  53.8   0:27.35 postgres: pryzbyj postgres [local] UPDATE

The large context is:
2024-07-26 15:22:19.280 CDT [3114201] LOG:  level: 1; CacheMemoryContext: 5211209088 total in 50067 blocks; 420688 free (14 chunks); 5210788400 used

Note that there seemed to be no issue when I created 999 tables without
partitioning:

SELECT format('CREATE TABLE t%s(LIKE p)', i,i,i+1) FROM generate_series(1,999)i;

-- 
Justin






^ permalink  raw  reply  [nested|flat] 15+ messages in thread

* Re: pg_upgrade failing for 200+ million Large Objects
@ 2024-07-26 21:42  Alexander Korotkov <[email protected]>
  parent: Justin Pryzby <[email protected]>
  1 sibling, 0 replies; 15+ messages in thread

From: Alexander Korotkov @ 2024-07-26 21:42 UTC (permalink / raw)
  To: Justin Pryzby <[email protected]>; +Cc: Tom Lane <[email protected]>; Nathan Bossart <[email protected]>; Michael Banck <[email protected]>; Laurenz Albe <[email protected]>; vignesh C <[email protected]>; Kumar, Sachin <[email protected]>; Robins Tharakan <[email protected]>; Jan Wieck <[email protected]>; Bruce Momjian <[email protected]>; Andrew Dunstan <[email protected]>; Magnus Hagander <[email protected]>; Peter Eisentraut <[email protected]>; pgsql-hackers

On Fri, Jul 26, 2024 at 11:36 PM Justin Pryzby <[email protected]> wrote:
> On Wed, Jul 24, 2024 at 09:17:51AM -0500, Justin Pryzby wrote:
> > With partitioning, we have a lot of tables, some of them wide (126
> > partitioned tables, 8942 childs, total 1019315 columns).
>
> On Fri, Jul 26, 2024 at 10:53:30PM +0300, Alexander Korotkov wrote:
> > It would be nice to identify such cases and check which memory contexts are
> > growing and why.
>
> I reproduced the problem with this schema:
>
> SELECT format('CREATE TABLE p(i int, %s) PARTITION BY RANGE(i)', array_to_string(a, ', ')) FROM (SELECT array_agg(format('i%s int', i))a FROM generate_series(1,999)i);
> SELECT format('CREATE TABLE t%s PARTITION OF p FOR VALUES FROM (%s)TO(%s)', i,i,i+1) FROM generate_series(1,999)i;
>
> This used over 4 GB of RAM.
> 3114201 pryzbyj   20   0 5924520   4.2g  32476 T   0.0  53.8   0:27.35 postgres: pryzbyj postgres [local] UPDATE
>
> The large context is:
> 2024-07-26 15:22:19.280 CDT [3114201] LOG:  level: 1; CacheMemoryContext: 5211209088 total in 50067 blocks; 420688 free (14 chunks); 5210788400 used
>
> Note that there seemed to be no issue when I created 999 tables without
> partitioning:
>
> SELECT format('CREATE TABLE t%s(LIKE p)', i,i,i+1) FROM generate_series(1,999)i;

Thank you!  That was quick.
I'm looking into this.

------
Regards,
Alexander Korotkov
Supabase






^ permalink  raw  reply  [nested|flat] 15+ messages in thread

* Re: pg_upgrade failing for 200+ million Large Objects
@ 2024-07-26 22:37  Tom Lane <[email protected]>
  parent: Justin Pryzby <[email protected]>
  1 sibling, 1 reply; 15+ messages in thread

From: Tom Lane @ 2024-07-26 22:37 UTC (permalink / raw)
  To: Justin Pryzby <[email protected]>; +Cc: Alexander Korotkov <[email protected]>; Nathan Bossart <[email protected]>; Michael Banck <[email protected]>; Laurenz Albe <[email protected]>; vignesh C <[email protected]>; Kumar, Sachin <[email protected]>; Robins Tharakan <[email protected]>; Jan Wieck <[email protected]>; Bruce Momjian <[email protected]>; Andrew Dunstan <[email protected]>; Magnus Hagander <[email protected]>; Peter Eisentraut <[email protected]>; pgsql-hackers

Justin Pryzby <[email protected]> writes:
> On Fri, Jul 26, 2024 at 10:53:30PM +0300, Alexander Korotkov wrote:
>> It would be nice to identify such cases and check which memory contexts are
>> growing and why.

> I reproduced the problem with this schema:

> SELECT format('CREATE TABLE p(i int, %s) PARTITION BY RANGE(i)', array_to_string(a, ', ')) FROM (SELECT array_agg(format('i%s int', i))a FROM generate_series(1,999)i);
> SELECT format('CREATE TABLE t%s PARTITION OF p FOR VALUES FROM (%s)TO(%s)', i,i,i+1) FROM generate_series(1,999)i;

> This used over 4 GB of RAM.

Interesting.  This doesn't bloat particularly much in a regular
pg_restore, even with --transaction-size=1000; but it does in
pg_upgrade, as you say.  I found that the bloat was occurring
during these long sequences of UPDATE commands issued by pg_upgrade:

-- For binary upgrade, recreate inherited column.
UPDATE pg_catalog.pg_attribute
SET attislocal = false
WHERE attname = 'i'
  AND attrelid = '\"public\".\"t139\"'::pg_catalog.regclass;

-- For binary upgrade, recreate inherited column.
UPDATE pg_catalog.pg_attribute
SET attislocal = false
WHERE attname = 'i1'
  AND attrelid = '\"public\".\"t139\"'::pg_catalog.regclass;

-- For binary upgrade, recreate inherited column.
UPDATE pg_catalog.pg_attribute
SET attislocal = false
WHERE attname = 'i2'
  AND attrelid = '\"public\".\"t139\"'::pg_catalog.regclass;

I think the problem is basically that each one of these commands
causes a relcache inval, for which we can't reclaim space right
away, so that we end up consuming O(N^2) cache space for an
N-column inherited table.

It's fairly easy to fix things so that this example doesn't cause
that to happen: we just need to issue these updates as one command
not N commands per table.  See attached.  However, I fear this should
just be considered a draft, because the other code for binary upgrade
in the immediate vicinity is just as aggressively stupid and
unoptimized as this bit, and can probably also be driven to O(N^2)
behavior with enough CHECK constraints etc.  We've gone out of our way
to make ALTER TABLE capable of handling many updates to a table's DDL
in one command, but whoever wrote this code appears not to have read
that memo, or at least to have believed that performance of pg_upgrade
isn't of concern.

> Note that there seemed to be no issue when I created 999 tables without
> partitioning:
> SELECT format('CREATE TABLE t%s(LIKE p)', i,i,i+1) FROM generate_series(1,999)i;

Yeah, because then we don't need to play games with attislocal.

			regards, tom lane



Attachments:

  [text/x-diff] reduce-pg_upgrade-cache-bloat-wip.patch (2.2K, 2-reduce-pg_upgrade-cache-bloat-wip.patch)
  download | inline diff:
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index b8b1888bd3..19f98bdf43 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -16094,6 +16094,12 @@ dumpTableSchema(Archive *fout, const TableInfo *tbinfo)
 			 tbinfo->relkind == RELKIND_FOREIGN_TABLE ||
 			 tbinfo->relkind == RELKIND_PARTITIONED_TABLE))
 		{
+			bool		firstinhcol = true;
+
+			/*
+			 * Drop any dropped columns.  We don't really expect there to be a
+			 * lot, else this code would be pretty inefficient.
+			 */
 			for (j = 0; j < tbinfo->numatts; j++)
 			{
 				if (tbinfo->attisdropped[j])
@@ -16120,18 +16126,37 @@ dumpTableSchema(Archive *fout, const TableInfo *tbinfo)
 					appendPQExpBuffer(q, "DROP COLUMN %s;\n",
 									  fmtId(tbinfo->attnames[j]));
 				}
-				else if (!tbinfo->attislocal[j])
+			}
+
+			/*
+			 * Fix up inherited columns.  There could be a lot of these, so we
+			 * do the operation in a single SQL command; otherwise, we risk
+			 * O(N^2) relcache bloat thanks to repeatedly invalidating the
+			 * table's relcache entry.
+			 */
+			for (j = 0; j < tbinfo->numatts; j++)
+			{
+				if (!tbinfo->attisdropped[j] &&
+					!tbinfo->attislocal[j])
 				{
-					appendPQExpBufferStr(q, "\n-- For binary upgrade, recreate inherited column.\n");
-					appendPQExpBufferStr(q, "UPDATE pg_catalog.pg_attribute\n"
-										 "SET attislocal = false\n"
-										 "WHERE attname = ");
+					if (firstinhcol)
+					{
+						appendPQExpBufferStr(q, "\n-- For binary upgrade, recreate inherited columns.\n");
+						appendPQExpBufferStr(q, "UPDATE pg_catalog.pg_attribute\n"
+											 "SET attislocal = false\n"
+											 "WHERE attrelid = ");
+						appendStringLiteralAH(q, qualrelname, fout);
+						appendPQExpBufferStr(q, "::pg_catalog.regclass\n"
+											 "  AND attname IN (");
+						firstinhcol = false;
+					}
+					else
+						appendPQExpBufferStr(q, ", ");
 					appendStringLiteralAH(q, tbinfo->attnames[j], fout);
-					appendPQExpBufferStr(q, "\n  AND attrelid = ");
-					appendStringLiteralAH(q, qualrelname, fout);
-					appendPQExpBufferStr(q, "::pg_catalog.regclass;\n");
 				}
 			}
+			if (!firstinhcol)
+				appendPQExpBufferStr(q, ");\n");
 
 			/*
 			 * Add inherited CHECK constraints, if any.


^ permalink  raw  reply  [nested|flat] 15+ messages in thread

* Re: pg_upgrade failing for 200+ million Large Objects
@ 2024-07-26 22:55  Alexander Korotkov <[email protected]>
  parent: Tom Lane <[email protected]>
  0 siblings, 1 reply; 15+ messages in thread

From: Alexander Korotkov @ 2024-07-26 22:55 UTC (permalink / raw)
  To: Tom Lane <[email protected]>; +Cc: Justin Pryzby <[email protected]>; Nathan Bossart <[email protected]>; Michael Banck <[email protected]>; Laurenz Albe <[email protected]>; vignesh C <[email protected]>; Kumar, Sachin <[email protected]>; Robins Tharakan <[email protected]>; Jan Wieck <[email protected]>; Bruce Momjian <[email protected]>; Andrew Dunstan <[email protected]>; Magnus Hagander <[email protected]>; Peter Eisentraut <[email protected]>; pgsql-hackers

On Sat, Jul 27, 2024 at 1:37 AM Tom Lane <[email protected]> wrote:
> Justin Pryzby <[email protected]> writes:
> > On Fri, Jul 26, 2024 at 10:53:30PM +0300, Alexander Korotkov wrote:
> >> It would be nice to identify such cases and check which memory contexts are
> >> growing and why.
>
> > I reproduced the problem with this schema:
>
> > SELECT format('CREATE TABLE p(i int, %s) PARTITION BY RANGE(i)', array_to_string(a, ', ')) FROM (SELECT array_agg(format('i%s int', i))a FROM generate_series(1,999)i);
> > SELECT format('CREATE TABLE t%s PARTITION OF p FOR VALUES FROM (%s)TO(%s)', i,i,i+1) FROM generate_series(1,999)i;
>
> > This used over 4 GB of RAM.
>
> Interesting.  This doesn't bloat particularly much in a regular
> pg_restore, even with --transaction-size=1000; but it does in
> pg_upgrade, as you say.  I found that the bloat was occurring
> during these long sequences of UPDATE commands issued by pg_upgrade:
>
> -- For binary upgrade, recreate inherited column.
> UPDATE pg_catalog.pg_attribute
> SET attislocal = false
> WHERE attname = 'i'
>   AND attrelid = '\"public\".\"t139\"'::pg_catalog.regclass;
>
> -- For binary upgrade, recreate inherited column.
> UPDATE pg_catalog.pg_attribute
> SET attislocal = false
> WHERE attname = 'i1'
>   AND attrelid = '\"public\".\"t139\"'::pg_catalog.regclass;
>
> -- For binary upgrade, recreate inherited column.
> UPDATE pg_catalog.pg_attribute
> SET attislocal = false
> WHERE attname = 'i2'
>   AND attrelid = '\"public\".\"t139\"'::pg_catalog.regclass;
>
> I think the problem is basically that each one of these commands
> causes a relcache inval, for which we can't reclaim space right
> away, so that we end up consuming O(N^2) cache space for an
> N-column inherited table.

I was about to report the same.

> It's fairly easy to fix things so that this example doesn't cause
> that to happen: we just need to issue these updates as one command
> not N commands per table.  See attached.  However, I fear this should
> just be considered a draft, because the other code for binary upgrade
> in the immediate vicinity is just as aggressively stupid and
> unoptimized as this bit, and can probably also be driven to O(N^2)
> behavior with enough CHECK constraints etc.  We've gone out of our way
> to make ALTER TABLE capable of handling many updates to a table's DDL
> in one command, but whoever wrote this code appears not to have read
> that memo, or at least to have believed that performance of pg_upgrade
> isn't of concern.

I was thinking about counting actual number of queries, not TOC
entries for transaction number as a more universal solution.  But that
would require usage of psql_scan() or writing simpler alternative for
this particular purpose.  That looks quite annoying.  What do you
think?


------
Regards,
Alexander Korotkov
Supabase






^ permalink  raw  reply  [nested|flat] 15+ messages in thread

* Re: pg_upgrade failing for 200+ million Large Objects
@ 2024-07-26 23:06  Tom Lane <[email protected]>
  parent: Alexander Korotkov <[email protected]>
  0 siblings, 1 reply; 15+ messages in thread

From: Tom Lane @ 2024-07-26 23:06 UTC (permalink / raw)
  To: Alexander Korotkov <[email protected]>; +Cc: Justin Pryzby <[email protected]>; Nathan Bossart <[email protected]>; Michael Banck <[email protected]>; Laurenz Albe <[email protected]>; vignesh C <[email protected]>; Kumar, Sachin <[email protected]>; Robins Tharakan <[email protected]>; Jan Wieck <[email protected]>; Bruce Momjian <[email protected]>; Andrew Dunstan <[email protected]>; Magnus Hagander <[email protected]>; Peter Eisentraut <[email protected]>; pgsql-hackers

Alexander Korotkov <[email protected]> writes:
> On Sat, Jul 27, 2024 at 1:37 AM Tom Lane <[email protected]> wrote:
>> It's fairly easy to fix things so that this example doesn't cause
>> that to happen: we just need to issue these updates as one command
>> not N commands per table.

> I was thinking about counting actual number of queries, not TOC
> entries for transaction number as a more universal solution.  But that
> would require usage of psql_scan() or writing simpler alternative for
> this particular purpose.  That looks quite annoying.  What do you
> think?

The assumption underlying what we're doing now is that the number
of SQL commands per TOC entry is limited.  I'd prefer to fix the
code so that that assumption is correct, at least in normal cases.
I confess I'd not looked closely enough at the binary-upgrade support
code to realize it wasn't correct already :-(.  If we go that way,
we can fix this while also making pg_upgrade faster rather than
slower.  I also expect that it'll be a lot simpler than putting
a full SQL parser in pg_restore.

			regards, tom lane






^ permalink  raw  reply  [nested|flat] 15+ messages in thread

* Re: pg_upgrade failing for 200+ million Large Objects
@ 2024-07-27 03:00  Alexander Korotkov <[email protected]>
  parent: Tom Lane <[email protected]>
  0 siblings, 1 reply; 15+ messages in thread

From: Alexander Korotkov @ 2024-07-27 03:00 UTC (permalink / raw)
  To: Tom Lane <[email protected]>; +Cc: Justin Pryzby <[email protected]>; Nathan Bossart <[email protected]>; Michael Banck <[email protected]>; Laurenz Albe <[email protected]>; vignesh C <[email protected]>; Kumar, Sachin <[email protected]>; Robins Tharakan <[email protected]>; Jan Wieck <[email protected]>; Bruce Momjian <[email protected]>; Andrew Dunstan <[email protected]>; Magnus Hagander <[email protected]>; Peter Eisentraut <[email protected]>; pgsql-hackers

On Sat, Jul 27, 2024 at 2:06 AM Tom Lane <[email protected]> wrote:
> Alexander Korotkov <[email protected]> writes:
> > On Sat, Jul 27, 2024 at 1:37 AM Tom Lane <[email protected]> wrote:
> >> It's fairly easy to fix things so that this example doesn't cause
> >> that to happen: we just need to issue these updates as one command
> >> not N commands per table.
>
> > I was thinking about counting actual number of queries, not TOC
> > entries for transaction number as a more universal solution.  But that
> > would require usage of psql_scan() or writing simpler alternative for
> > this particular purpose.  That looks quite annoying.  What do you
> > think?
>
> The assumption underlying what we're doing now is that the number
> of SQL commands per TOC entry is limited.  I'd prefer to fix the
> code so that that assumption is correct, at least in normal cases.
> I confess I'd not looked closely enough at the binary-upgrade support
> code to realize it wasn't correct already :-(.  If we go that way,
> we can fix this while also making pg_upgrade faster rather than
> slower.  I also expect that it'll be a lot simpler than putting
> a full SQL parser in pg_restore.

I'm good with that as soon as we're not going to meet many cases of
high number SQL commands per TOC entry.

J4F, I have an idea to count number of ';' sings and use it for
transaction size counter, since it is as upper bound estimate of
number of SQL commands :-)

------
Regards,
Alexander Korotkov
Supabase






^ permalink  raw  reply  [nested|flat] 15+ messages in thread

* Re: pg_upgrade failing for 200+ million Large Objects
@ 2024-07-27 03:08  Tom Lane <[email protected]>
  parent: Alexander Korotkov <[email protected]>
  0 siblings, 1 reply; 15+ messages in thread

From: Tom Lane @ 2024-07-27 03:08 UTC (permalink / raw)
  To: Alexander Korotkov <[email protected]>; +Cc: Justin Pryzby <[email protected]>; Nathan Bossart <[email protected]>; Michael Banck <[email protected]>; Laurenz Albe <[email protected]>; vignesh C <[email protected]>; Kumar, Sachin <[email protected]>; Robins Tharakan <[email protected]>; Jan Wieck <[email protected]>; Bruce Momjian <[email protected]>; Andrew Dunstan <[email protected]>; Magnus Hagander <[email protected]>; Peter Eisentraut <[email protected]>; pgsql-hackers

Alexander Korotkov <[email protected]> writes:
> J4F, I have an idea to count number of ';' sings and use it for
> transaction size counter, since it is as upper bound estimate of
> number of SQL commands :-)

Hmm ... that's not a completely silly idea.  Let's keep it in
the back pocket in case we can't easily reduce the number of
SQL commands in some cases.

It's late here, and I've got some other commitments tomorrow,
but I'll try to produce a patch to merge more of the SQL
commands in a day or two.

			regards, tom lane






^ permalink  raw  reply  [nested|flat] 15+ messages in thread

* Re: pg_upgrade failing for 200+ million Large Objects
@ 2024-07-28 21:24  Tom Lane <[email protected]>
  parent: Tom Lane <[email protected]>
  0 siblings, 1 reply; 15+ messages in thread

From: Tom Lane @ 2024-07-28 21:24 UTC (permalink / raw)
  To: Alexander Korotkov <[email protected]>; +Cc: Justin Pryzby <[email protected]>; Nathan Bossart <[email protected]>; Michael Banck <[email protected]>; Laurenz Albe <[email protected]>; vignesh C <[email protected]>; Kumar, Sachin <[email protected]>; Robins Tharakan <[email protected]>; Jan Wieck <[email protected]>; Bruce Momjian <[email protected]>; Andrew Dunstan <[email protected]>; Magnus Hagander <[email protected]>; Peter Eisentraut <[email protected]>; pgsql-hackers

I wrote:
> Alexander Korotkov <[email protected]> writes:
>> J4F, I have an idea to count number of ';' sings and use it for
>> transaction size counter, since it is as upper bound estimate of
>> number of SQL commands :-)

> Hmm ... that's not a completely silly idea.  Let's keep it in
> the back pocket in case we can't easily reduce the number of
> SQL commands in some cases.

After poking at this for awhile, we can fix Justin's example
case by avoiding repeated UPDATEs on pg_attribute, so I think
we should do that.  It seems clearly a win, with no downside
other than a small increment of complexity in pg_dump.

However, that's probably not sufficient to mark this issue
as closed.  It seems likely that there are other patterns
that would cause backend memory bloat.  One case that I found
is tables with a lot of inherited constraints (not partitions,
but old-style inheritance).  For example, load the output of
this Perl script into a database:

-----
for (my $i = 0; $i < 100; $i++)
{
	print "CREATE TABLE test_inh_check$i (\n";
	for (my $j = 0; $j < 1000; $j++)
	{
		print "a$j float check (a$j > 10.2),\n";
	}
	print "b float);\n";
	print "CREATE TABLE test_inh_check_child$i() INHERITS(test_inh_check$i);\n";
}
-----

pg_dump is horrendously slow on this, thanks to O(N^2) behavior in
ruleutils.c, and pg_upgrade is worse --- and leaks memory too in
HEAD/v17.  The slowness was there before, so I think the lack of
field complaints indicates that this isn't a real-world use case.
Still, it's bad if pg_upgrade fails when it would not have before,
and there may be other similar issues.

So I'm forced to the conclusion that we'd better make the transaction
size adaptive as per Alexander's suggestion.

In addition to the patches attached, I experimented with making
dumpTableSchema fold all the ALTER TABLE commands for a single table
into one command.  That's do-able without too much effort, but I'm now
convinced that we shouldn't.  It would break the semicolon-counting
hack for detecting that tables like these involve extra work.
I'm also not very confident that the backend won't have trouble with
ALTER TABLE commands containing hundreds of subcommands.  That's
something we ought to work on probably, but it's not a project that
I want to condition v17 pg_upgrade's stability on.

Anyway, proposed patches attached.  0001 is some trivial cleanup
that I noticed while working on the failed single-ALTER-TABLE idea.
0002 merges the catalog-UPDATE commands that dumpTableSchema issues,
and 0003 is Alexander's suggestion.

			regards, tom lane



Attachments:

  [text/x-diff] v1-0001-Fix-missing-ONLY-in-one-dumpTableSchema-command.patch (1.8K, 2-v1-0001-Fix-missing-ONLY-in-one-dumpTableSchema-command.patch)
  download | inline diff:
From 34ebed72e9029f690e5d3f0cb7464670e83caa5c Mon Sep 17 00:00:00 2001
From: Tom Lane <[email protected]>
Date: Sun, 28 Jul 2024 13:02:27 -0400
Subject: [PATCH v1 1/3] Fix missing ONLY in one dumpTableSchema command.

The various ALTER [FOREIGN] TABLE commands issued by dumpTableSchema
all use ONLY, except for this one.  I think it's not a live bug
because we don't permit foreign tables to have children, but
it's still inconsistent.  I happened across this while refactoring
dumpTableSchema to merge all its ALTERs into one; while I'm not
certain we should actually do that, this seems like good cleanup.
---
 src/bin/pg_dump/pg_dump.c        | 2 +-
 src/bin/pg_dump/t/002_pg_dump.pl | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index b8b1888bd3..7cd6a7fb97 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -16344,7 +16344,7 @@ dumpTableSchema(Archive *fout, const TableInfo *tbinfo)
 			if (tbinfo->relkind == RELKIND_FOREIGN_TABLE &&
 				tbinfo->attfdwoptions[j][0] != '\0')
 				appendPQExpBuffer(q,
-								  "ALTER FOREIGN TABLE %s ALTER COLUMN %s OPTIONS (\n"
+								  "ALTER FOREIGN TABLE ONLY %s ALTER COLUMN %s OPTIONS (\n"
 								  "    %s\n"
 								  ");\n",
 								  qualrelname,
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index d3dd8784d6..5bcc2244d5 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -1154,7 +1154,7 @@ my %tests = (
 
 	'ALTER FOREIGN TABLE foreign_table ALTER COLUMN c1 OPTIONS' => {
 		regexp => qr/^
-			\QALTER FOREIGN TABLE dump_test.foreign_table ALTER COLUMN c1 OPTIONS (\E\n
+			\QALTER FOREIGN TABLE ONLY dump_test.foreign_table ALTER COLUMN c1 OPTIONS (\E\n
 			\s+\Qcolumn_name 'col1'\E\n
 			\Q);\E\n
 			/xm,
-- 
2.43.5



  [text/x-diff] v1-0002-Reduce-number-of-commands-dumpTableSchema-emits-f.patch (8.5K, 3-v1-0002-Reduce-number-of-commands-dumpTableSchema-emits-f.patch)
  download | inline diff:
From c8f0d0252e292f276fe9631ae31e6aea11d294d2 Mon Sep 17 00:00:00 2001
From: Tom Lane <[email protected]>
Date: Sun, 28 Jul 2024 16:19:48 -0400
Subject: [PATCH v1 2/3] Reduce number of commands dumpTableSchema emits for
 binary upgrade.

Avoid issuing a separate SQL UPDATE command for each column when
directly manipulating pg_attribute contents in binary upgrade mode.
With the separate updates, we triggered a relcache invalidation with
each update.  For a table with N columns, that causes O(N^2) relcache
bloat in txn_size mode because the table's newly-created relcache
entry can't be flushed till end of transaction.  Reducing the number
of commands is marginally faster as well as avoiding that problem.

While at it, likewise avoid issuing a separate UPDATE on pg_constraint
for each inherited constraint.  This is less exciting, first because
inherited (non-partitioned) constraints are relatively rare, and
second because the backend has a good deal of trouble anyway with
restoring tables containing many such constraints, due to
MergeConstraintsIntoExisting being horribly inefficient.  But it seems
more consistent to do it this way here too, and it surely can't hurt.

Per report from Justin Pryzby.  Back-patch to v17 where txn_size mode
was introduced.

Discussion: https://postgr.es/m/ZqEND4ZcTDBmcv31@pryzbyj2023
---
 src/bin/pg_dump/pg_dump.c | 132 ++++++++++++++++++++++++++++----------
 1 file changed, 97 insertions(+), 35 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 7cd6a7fb97..2b02148559 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -15670,6 +15670,7 @@ dumpTableSchema(Archive *fout, const TableInfo *tbinfo)
 	DumpOptions *dopt = fout->dopt;
 	PQExpBuffer q = createPQExpBuffer();
 	PQExpBuffer delq = createPQExpBuffer();
+	PQExpBuffer extra = createPQExpBuffer();
 	char	   *qrelname;
 	char	   *qualrelname;
 	int			numParents;
@@ -15736,7 +15737,7 @@ dumpTableSchema(Archive *fout, const TableInfo *tbinfo)
 		char	   *partkeydef = NULL;
 		char	   *ftoptions = NULL;
 		char	   *srvname = NULL;
-		char	   *foreign = "";
+		const char *foreign = "";
 
 		/*
 		 * Set reltypename, and collect any relkind-specific data that we
@@ -16094,51 +16095,98 @@ dumpTableSchema(Archive *fout, const TableInfo *tbinfo)
 			 tbinfo->relkind == RELKIND_FOREIGN_TABLE ||
 			 tbinfo->relkind == RELKIND_PARTITIONED_TABLE))
 		{
+			bool		firstitem;
+
+			/*
+			 * Drop any dropped columns.  Merge the pg_attribute manipulations
+			 * into a single SQL command, so that we don't cause repeated
+			 * relcache flushes on the target table.  Otherwise we risk O(N^2)
+			 * relcache bloat while dropping N columns.
+			 */
+			resetPQExpBuffer(extra);
+			firstitem = true;
 			for (j = 0; j < tbinfo->numatts; j++)
 			{
 				if (tbinfo->attisdropped[j])
 				{
-					appendPQExpBufferStr(q, "\n-- For binary upgrade, recreate dropped column.\n");
-					appendPQExpBuffer(q, "UPDATE pg_catalog.pg_attribute\n"
-									  "SET attlen = %d, "
-									  "attalign = '%c', attbyval = false\n"
-									  "WHERE attname = ",
+					if (firstitem)
+					{
+						appendPQExpBufferStr(q, "\n-- For binary upgrade, recreate dropped columns.\n"
+											 "UPDATE pg_catalog.pg_attribute\n"
+											 "SET attlen = v.dlen, "
+											 "attalign = v.dalign, "
+											 "attbyval = false\n"
+											 "FROM (VALUES ");
+						firstitem = false;
+					}
+					else
+						appendPQExpBufferStr(q, ",\n             ");
+					appendPQExpBufferChar(q, '(');
+					appendStringLiteralAH(q, tbinfo->attnames[j], fout);
+					appendPQExpBuffer(q, ", %d, '%c')",
 									  tbinfo->attlen[j],
 									  tbinfo->attalign[j]);
-					appendStringLiteralAH(q, tbinfo->attnames[j], fout);
-					appendPQExpBufferStr(q, "\n  AND attrelid = ");
-					appendStringLiteralAH(q, qualrelname, fout);
-					appendPQExpBufferStr(q, "::pg_catalog.regclass;\n");
-
-					if (tbinfo->relkind == RELKIND_RELATION ||
-						tbinfo->relkind == RELKIND_PARTITIONED_TABLE)
-						appendPQExpBuffer(q, "ALTER TABLE ONLY %s ",
-										  qualrelname);
-					else
-						appendPQExpBuffer(q, "ALTER FOREIGN TABLE ONLY %s ",
-										  qualrelname);
-					appendPQExpBuffer(q, "DROP COLUMN %s;\n",
+					/* The ALTER ... DROP COLUMN commands must come after */
+					appendPQExpBuffer(extra, "ALTER %sTABLE ONLY %s ",
+									  foreign, qualrelname);
+					appendPQExpBuffer(extra, "DROP COLUMN %s;\n",
 									  fmtId(tbinfo->attnames[j]));
 				}
-				else if (!tbinfo->attislocal[j])
+			}
+			if (!firstitem)
+			{
+				appendPQExpBufferStr(q, ") v(dname, dlen, dalign)\n"
+									 "WHERE attrelid = ");
+				appendStringLiteralAH(q, qualrelname, fout);
+				appendPQExpBufferStr(q, "::pg_catalog.regclass\n"
+									 "  AND attname = v.dname;\n");
+				/* Now we can issue the actual DROP COLUMN commands */
+				appendBinaryPQExpBuffer(q, extra->data, extra->len);
+			}
+
+			/*
+			 * Fix up inherited columns.  As above, do the pg_attribute
+			 * manipulations in a single SQL command.
+			 */
+			firstitem = true;
+			for (j = 0; j < tbinfo->numatts; j++)
+			{
+				if (!tbinfo->attisdropped[j] &&
+					!tbinfo->attislocal[j])
 				{
-					appendPQExpBufferStr(q, "\n-- For binary upgrade, recreate inherited column.\n");
-					appendPQExpBufferStr(q, "UPDATE pg_catalog.pg_attribute\n"
-										 "SET attislocal = false\n"
-										 "WHERE attname = ");
+					if (firstitem)
+					{
+						appendPQExpBufferStr(q, "\n-- For binary upgrade, recreate inherited columns.\n");
+						appendPQExpBufferStr(q, "UPDATE pg_catalog.pg_attribute\n"
+											 "SET attislocal = false\n"
+											 "WHERE attrelid = ");
+						appendStringLiteralAH(q, qualrelname, fout);
+						appendPQExpBufferStr(q, "::pg_catalog.regclass\n"
+											 "  AND attname IN (");
+						firstitem = false;
+					}
+					else
+						appendPQExpBufferStr(q, ", ");
 					appendStringLiteralAH(q, tbinfo->attnames[j], fout);
-					appendPQExpBufferStr(q, "\n  AND attrelid = ");
-					appendStringLiteralAH(q, qualrelname, fout);
-					appendPQExpBufferStr(q, "::pg_catalog.regclass;\n");
 				}
 			}
+			if (!firstitem)
+				appendPQExpBufferStr(q, ");\n");
 
 			/*
 			 * Add inherited CHECK constraints, if any.
 			 *
 			 * For partitions, they were already dumped, and conislocal
 			 * doesn't need fixing.
+			 *
+			 * As above, issue only one direct manipulation of pg_constraint.
+			 * Although it is tempting to merge the ALTER ADD CONSTRAINT
+			 * commands into one as well, refrain for now due to concern about
+			 * possible backend memory bloat if there are many such
+			 * constraints.
 			 */
+			resetPQExpBuffer(extra);
+			firstitem = true;
 			for (k = 0; k < tbinfo->ncheck; k++)
 			{
 				ConstraintInfo *constr = &(tbinfo->checkexprs[k]);
@@ -16146,18 +16194,31 @@ dumpTableSchema(Archive *fout, const TableInfo *tbinfo)
 				if (constr->separate || constr->conislocal || tbinfo->ispartition)
 					continue;
 
-				appendPQExpBufferStr(q, "\n-- For binary upgrade, set up inherited constraint.\n");
+				if (firstitem)
+					appendPQExpBufferStr(q, "\n-- For binary upgrade, set up inherited constraints.\n");
 				appendPQExpBuffer(q, "ALTER %sTABLE ONLY %s ADD CONSTRAINT %s %s;\n",
 								  foreign, qualrelname,
 								  fmtId(constr->dobj.name),
 								  constr->condef);
-				appendPQExpBufferStr(q, "UPDATE pg_catalog.pg_constraint\n"
-									 "SET conislocal = false\n"
-									 "WHERE contype = 'c' AND conname = ");
-				appendStringLiteralAH(q, constr->dobj.name, fout);
-				appendPQExpBufferStr(q, "\n  AND conrelid = ");
-				appendStringLiteralAH(q, qualrelname, fout);
-				appendPQExpBufferStr(q, "::pg_catalog.regclass;\n");
+				/* Update pg_constraint after all the ALTER TABLEs */
+				if (firstitem)
+				{
+					appendPQExpBufferStr(extra, "UPDATE pg_catalog.pg_constraint\n"
+										 "SET conislocal = false\n"
+										 "WHERE contype = 'c' AND conrelid = ");
+					appendStringLiteralAH(extra, qualrelname, fout);
+					appendPQExpBufferStr(extra, "::pg_catalog.regclass\n");
+					appendPQExpBufferStr(extra, "  AND conname IN (");
+					firstitem = false;
+				}
+				else
+					appendPQExpBufferStr(extra, ", ");
+				appendStringLiteralAH(extra, constr->dobj.name, fout);
+			}
+			if (!firstitem)
+			{
+				appendPQExpBufferStr(extra, ");\n");
+				appendBinaryPQExpBuffer(q, extra->data, extra->len);
 			}
 
 			if (numParents > 0 && !tbinfo->ispartition)
@@ -16445,6 +16506,7 @@ dumpTableSchema(Archive *fout, const TableInfo *tbinfo)
 
 	destroyPQExpBuffer(q);
 	destroyPQExpBuffer(delq);
+	destroyPQExpBuffer(extra);
 	free(qrelname);
 	free(qualrelname);
 }
-- 
2.43.5



  [text/x-diff] v1-0003-Count-individual-SQL-commands-in-pg_restore-s-tra.patch (2.8K, 4-v1-0003-Count-individual-SQL-commands-in-pg_restore-s-tra.patch)
  download | inline diff:
From 7cfea69d3f5df4c15681f4902a86b366f0ed4292 Mon Sep 17 00:00:00 2001
From: Tom Lane <[email protected]>
Date: Sun, 28 Jul 2024 16:40:35 -0400
Subject: [PATCH v1 3/3] Count individual SQL commands in pg_restore's
 --transaction-size mode.

The initial implementation counted one action per TOC entry (except
for some special cases for multi-blob BLOBS entries).  This assumes
that TOC entries are all about equally complex, but it turns out
that that assumption doesn't hold up very well in binary-upgrade
mode.  For example, even after the previous patch I was able to
cause backend bloat with tables having many inherited constraints.

To fix, count multi-command TOC entries as N actions, allowing the
transaction size to be scaled down when we hit a complex TOC entry.
Rather than add a SQL parser to pg_restore, approximate "multi
command" by counting semicolons in the TOC entry's defn string.
This will be fooled by semicolons appearing in string literals ---
but the error is in the conservative direction, so it doesn't seem
worth working harder.  The biggest risk is with function/procedure
TOC entries, but we can just explicitly skip those.

(This is undoubtedly a hack, and maybe someday we'll be able to
revert it after fixing the backend's bloat issues or rethinking
what pg_dump emits in binary upgrade mode.  But that surely isn't
a project for v17.)

Thanks to Alexander Korotkov for the let's-count-semicolons idea.

Per report from Justin Pryzby.  Back-patch to v17 where txn_size mode
was introduced.

Discussion: https://postgr.es/m/ZqEND4ZcTDBmcv31@pryzbyj2023
---
 src/bin/pg_dump/pg_backup_archiver.c | 28 +++++++++++++++++++++++++---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 68e321212d..8c20c263c4 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -3827,10 +3827,32 @@ _printTocEntry(ArchiveHandle *AH, TocEntry *te, bool isData)
 	{
 		IssueACLPerBlob(AH, te);
 	}
-	else
+	else if (te->defn && strlen(te->defn) > 0)
 	{
-		if (te->defn && strlen(te->defn) > 0)
-			ahprintf(AH, "%s\n\n", te->defn);
+		ahprintf(AH, "%s\n\n", te->defn);
+
+		/*
+		 * If the defn string contains multiple SQL commands, txn_size mode
+		 * should count it as N actions not one.  But rather than build a full
+		 * SQL parser, approximate this by counting semicolons.  One case
+		 * where that tends to be badly fooled is function definitions, so
+		 * ignore them.  (restore_toc_entry will count one action anyway.)
+		 */
+		if (ropt->txn_size > 0 &&
+			strcmp(te->desc, "FUNCTION") != 0 &&
+			strcmp(te->desc, "PROCEDURE") != 0)
+		{
+			const char *p = te->defn;
+			int			nsemis = 0;
+
+			while ((p = strchr(p, ';')) != NULL)
+			{
+				nsemis++;
+				p++;
+			}
+			if (nsemis > 1)
+				AH->txnCount += nsemis - 1;
+		}
 	}
 
 	/*
-- 
2.43.5



^ permalink  raw  reply  [nested|flat] 15+ messages in thread

* Re: pg_upgrade failing for 200+ million Large Objects
@ 2024-07-31 13:39  Alexander Korotkov <[email protected]>
  parent: Tom Lane <[email protected]>
  0 siblings, 1 reply; 15+ messages in thread

From: Alexander Korotkov @ 2024-07-31 13:39 UTC (permalink / raw)
  To: Tom Lane <[email protected]>; +Cc: Justin Pryzby <[email protected]>; Nathan Bossart <[email protected]>; Michael Banck <[email protected]>; Laurenz Albe <[email protected]>; vignesh C <[email protected]>; Kumar, Sachin <[email protected]>; Robins Tharakan <[email protected]>; Jan Wieck <[email protected]>; Bruce Momjian <[email protected]>; Andrew Dunstan <[email protected]>; Magnus Hagander <[email protected]>; Peter Eisentraut <[email protected]>; pgsql-hackers

On Mon, Jul 29, 2024 at 12:24 AM Tom Lane <[email protected]> wrote:
> So I'm forced to the conclusion that we'd better make the transaction
> size adaptive as per Alexander's suggestion.
>
> In addition to the patches attached, I experimented with making
> dumpTableSchema fold all the ALTER TABLE commands for a single table
> into one command.  That's do-able without too much effort, but I'm now
> convinced that we shouldn't.  It would break the semicolon-counting
> hack for detecting that tables like these involve extra work.
> I'm also not very confident that the backend won't have trouble with
> ALTER TABLE commands containing hundreds of subcommands.  That's
> something we ought to work on probably, but it's not a project that
> I want to condition v17 pg_upgrade's stability on.
>
> Anyway, proposed patches attached.  0001 is some trivial cleanup
> that I noticed while working on the failed single-ALTER-TABLE idea.
> 0002 merges the catalog-UPDATE commands that dumpTableSchema issues,
> and 0003 is Alexander's suggestion.

Nice to see you picked up my idea.  I took a look over the patchset.
Looks good to me.

------
Regards,
Alexander Korotkov
Supabase






^ permalink  raw  reply  [nested|flat] 15+ messages in thread

* Re: pg_upgrade failing for 200+ million Large Objects
@ 2026-03-23 19:47  PP L <[email protected]>
  parent: Alexander Korotkov <[email protected]>
  0 siblings, 0 replies; 15+ messages in thread

From: PP L @ 2026-03-23 19:47 UTC (permalink / raw)
  To: Alexander Korotkov <[email protected]>; Tom Lane <[email protected]>; +Cc: Justin Pryzby <[email protected]>; Nathan Bossart <[email protected]>; Michael Banck <[email protected]>; Laurenz Albe <[email protected]>; vignesh C <[email protected]>; Kumar, Sachin <[email protected]>; Robins Tharakan <[email protected]>; Jan Wieck <[email protected]>; Bruce Momjian <[email protected]>; Andrew Dunstan <[email protected]>; Magnus Hagander <[email protected]>; Peter Eisentraut <[email protected]>; pgsql-hackers

Hello hackers,

I wanted to revive this thread specifically around the attislocal
optimization discussion. As part of
https://github.com/postgres/postgres/commit/b3f0e0503f3, we now batch all
the attislocal UPDATEs together, hence making it more performant.

I think we might be able to go one step further and completely skip the
attislocal UPDATE for partition tables. This is because the attislocal
UPDATE is done immediately after 'CREATE TABLE', during the 'ATTACH
PARTITION' step(see attislocal being set to false in
MergeAttributesIntoExisting). The UPDATEs emitted by pg_dump are therefore
redundant. Even with batching, the single UPDATE still modifies N(no of
columns) rows causing N relcache invalidations. This same workflow is then
repeated by ATTACH PARTITION causing another N relcache invalidations.

Skipping the attislocal UPDATE definitely speeds up the runtime if there
are a lot of partition tables because we will avoid quite a lot of relcache
invalidations and rebuild calls. Since this optimization removes the
attislocal UPDATE completely, the effect will be even more pronounced for
wider partition tables.

There's already precedent for this:
* attinhcount is never explicitly set by pg_dump. It is only modified by
MergeAttributesIntoExisting during ATTACH PARTITION
* conislocal for CHECK constraints is explicitly not fixed for partitions.
See comment "No need to fix conislocal: ATTACH PARTITION does that" in
dumpTableSchema

The only risk I can foresee is the window between CREATE TABLE and ATTACH
PARTITION where attislocal will be incorrectly set to true. But I think
this window is small enough to not worry about since ATTACH PARTITION
immediately succeeds CREATE TABLE(maybe barring some other minor updates)

Here is a simple patch
```
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 137161aa5e0..d3d7403228a 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -17734,7 +17734,8 @@ dumpTableSchema(Archive *fout, const TableInfo
*tbinfo)
                        for (j = 0; j < tbinfo->numatts; j++)
                        {
                                if (!tbinfo->attisdropped[j] &&
-                                       !tbinfo->attislocal[j])
+                                       !tbinfo->attislocal[j] &&
+                                       !tbinfo->ispartition)
                                {
                                        if (firstitem)
                                        {
```

I tried a few experiments on my local macbook and noticed an improvement of
around ~33-36% (% can vary mostly depending on the number of columns)

Setup(master branch): 300 partitioned root tables with 200 leaves each =
60000 partition tables

300 columns per partition:
    baseline    : ~30 mins
    with patch : ~20 mins (~33% faster)

700 columns per partition:
    baseline   : ~66 mins
    with patch : ~42 mins (~36% faster)

Thanks
Nikhil
Broadcom Inc.

On Mon, 23 Mar 2026 at 11:50, Alexander Korotkov <[email protected]>
wrote:

> On Mon, Jul 29, 2024 at 12:24 AM Tom Lane <[email protected]> wrote:
> > So I'm forced to the conclusion that we'd better make the transaction
> > size adaptive as per Alexander's suggestion.
> >
> > In addition to the patches attached, I experimented with making
> > dumpTableSchema fold all the ALTER TABLE commands for a single table
> > into one command.  That's do-able without too much effort, but I'm now
> > convinced that we shouldn't.  It would break the semicolon-counting
> > hack for detecting that tables like these involve extra work.
> > I'm also not very confident that the backend won't have trouble with
> > ALTER TABLE commands containing hundreds of subcommands.  That's
> > something we ought to work on probably, but it's not a project that
> > I want to condition v17 pg_upgrade's stability on.
> >
> > Anyway, proposed patches attached.  0001 is some trivial cleanup
> > that I noticed while working on the failed single-ALTER-TABLE idea.
> > 0002 merges the catalog-UPDATE commands that dumpTableSchema issues,
> > and 0003 is Alexander's suggestion.
>
> Nice to see you picked up my idea.  I took a look over the patchset.
> Looks good to me.
>
> ------
> Regards,
> Alexander Korotkov
> Supabase
>
>
>
>
>

^ permalink  raw  reply  [nested|flat] 15+ messages in thread

end of thread, other threads:[~2026-03-23 19:47 UTC | newest]

Thread overview: 15+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2024-04-01 19:28 Re: pg_upgrade failing for 200+ million Large Objects Tom Lane <[email protected]>
2024-04-01 19:37 ` Nathan Bossart <[email protected]>
2024-07-24 14:17 ` Justin Pryzby <[email protected]>
2024-07-26 19:53   ` Alexander Korotkov <[email protected]>
2024-07-26 20:05     ` Tom Lane <[email protected]>
2024-07-26 20:36     ` Justin Pryzby <[email protected]>
2024-07-26 21:42       ` Alexander Korotkov <[email protected]>
2024-07-26 22:37       ` Tom Lane <[email protected]>
2024-07-26 22:55         ` Alexander Korotkov <[email protected]>
2024-07-26 23:06           ` Tom Lane <[email protected]>
2024-07-27 03:00             ` Alexander Korotkov <[email protected]>
2024-07-27 03:08               ` Tom Lane <[email protected]>
2024-07-28 21:24                 ` Tom Lane <[email protected]>
2024-07-31 13:39                   ` Alexander Korotkov <[email protected]>
2026-03-23 19:47                     ` PP L <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox