Re: PARALLEL CTAS - Steve Midgley

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Steve Midgley <[email protected]>
To: Shane Borden <[email protected]>
Cc: Rob Sargent <[email protected]>
Cc: [email protected]
Subject: Re: PARALLEL CTAS
Date: Mon, 12 Dec 2022 10:19:00 -0800
Message-ID: <CAJexoSLz2dvbVeJHbFnYNTCPbA5qLNrs4W0nPQL1dsPV2nG1xg@mail.gmail.com> (raw)
In-Reply-To: <[email protected]>
References: <[email protected]>
	<CAFcck8HhsTh70ZjQEuvm2kfx0LULJAZR3NZNHvSEX1Y5BaxAMg@mail.gmail.com>
	<[email protected]>
	<[email protected]>
	<[email protected]>

On Mon, Dec 12, 2022 at 9:36 AM Shane Borden <[email protected]> wrote:

> I saw this as an option.  Let’s say you have a million row table.  If you
> copy from STDIN, do you have to spool the source table to a local file
> first?  Looking to replace a CTAS thats doing this all within SQL
> statements now.
> ---
>
> Thanks,
>
>
> Shane Borden
> [email protected]
>
> On Dec 12, 2022, at 12:25 PM, Rob Sargent <[email protected]> wrote:
>
> On 12/12/22 10:13, Shane Borden wrote:
>
> The issue is there are certain performance benefits to be had by doing
> parallel CTAS operations and when converting from Oracle to PostgreSQL
> switching to a “COPY” operation isn’t feasible.
> ---
>
> Thanks,
>
> Today I suspect you're left with something like the following:
> - CTAS from source where 1=2 (i.e. table definition via select semantics)
> - copy from stdin (filled with intended CTAS select)
>
> I'm not at all familiar with Oracle / CTAS (fair warning). But I did spend
some time, years ago, building a fast import using COPY/STDIN in Postgres
awhile back. I copied that code into a Ruby language gist
<https://gist.github.com/science/393907d4123c87ed767bc81e9dd5a7da;
demonstrating the core concept recently.

The key idea, IMO, is to use an intermediate language like Ruby to manage
the "spooling" from the source file to STDIN. That system could presumably
fork processes to have more than one COPY from the file going at a time
(and coordinate each thread to access different rows of the source file).

However, my experience was that if you want the absolutely fastest import,
you should copy the source file to a fast disk/array on the Postgres server
itself, and then run the COPY command pointing to that file locally, not
piping from STDIN over a network connection. Postgres, at least in my
testing, maxes I/O on a single COPY command pretty heartily - so much so
that you might find performance otherwise becomes a problem, esp if your
DBs are on that disk/array. IIRC, I was seeing millions of rows loaded per
second on a high performing Postgres server, using this method. Also, IIRC,
I removed the indexes from the table, again to max throughput, and then
added them back after the COPY completed (not sure if modern Postgres COPY
needs that - this was years ago).

I think total time to completion was also faster, as the index rebuilds
were faster after the fact, than done during the loading (again I'm foggy
so maybe inaccurate - this was years ago), but I was loading into a blank
table -- rebuilding already constructed indexes might have an overwhelming
cost.

If you have to do a network based COPY from STDIN, then my gist above will
hopefully show the way - and I wouldn't be surprised if running multiple of
these simultaneously gets better performance than a single COPY. At that
point, you'll just be fine tuning to max saturation on the network
bandwidth - not the max capability of the postgres server, which the local
disk COPY does.

Hopefully this input is relevant - please ignore if I'm missing your goal
entirely!

Best,
Steve

view thread (7+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected]
  Subject: Re: PARALLEL CTAS
  In-Reply-To: <CAJexoSLz2dvbVeJHbFnYNTCPbA5qLNrs4W0nPQL1dsPV2nG1xg@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox