PARALLEL CTAS

public inbox for [email protected]  
help / color / mirror / Atom feed

PARALLEL CTAS
7+ messages / 5 participants
[nested] [flat]

* PARALLEL CTAS
@ 2022-12-12 16:16  Shane Borden <[email protected]>
  0 siblings, 1 reply; 7+ messages in thread

From: Shane Borden @ 2022-12-12 16:16 UTC (permalink / raw)
  To: [email protected]

Based on testing thus far, it appears that the only way to get a bulk load operation (similar to oracle) would be to use the COPY command or to use multiple connections and break up the inserts across connections.  I did see a possibility of using the postgres_fdw extension and use the batch functionality, but I haven’t tested that yet.

Does anyone know the status of this feature / patch that was discussed in this link?

https://www.postgresql.org/message-id/CAJcOf-cXnB5cnMKqWEp2E2z7Mvcd04iLVmV%3DqpFJrR3AcrTS3g%40mail.g...

Is there a way in PostgreSQL to parallelize a CTAS statement similar to how you can do it in Oracle?

---

Thanks,

Shane Borden
[email protected]

^ permalink  raw  reply  [nested|flat] 7+ messages in thread

* Re: PARALLEL CTAS
@ 2022-12-12 16:55  Erik Brandsberg <[email protected]>
  parent: Shane Borden <[email protected]>
  0 siblings, 1 reply; 7+ messages in thread

From: Erik Brandsberg @ 2022-12-12 16:55 UTC (permalink / raw)
  To: Shane Borden <[email protected]>; +Cc: [email protected]

Copy operations are by far the most efficient way to do bulk loads.  The
fdw extension isn't very efficient based on prior experience.

On Mon, Dec 12, 2022 at 11:17 AM Shane Borden <[email protected]> wrote:

> Based on testing thus far, it appears that the only way to get a bulk load
> operation (similar to oracle) would be to use the COPY command or to use
> multiple connections and break up the inserts across connections.  I did
> see a possibility of using the postgres_fdw extension and use the batch
> functionality, but I haven’t tested that yet.
>
> Does anyone know the status of this feature / patch that was discussed in
> this link?
>
>
> https://www.postgresql.org/message-id/CAJcOf-cXnB5cnMKqWEp2E2z7Mvcd04iLVmV%3DqpFJrR3AcrTS3g%40mail.g...
>
> Is there a way in PostgreSQL to parallelize a CTAS statement similar to
> how you can do it in Oracle?
>
> ---
>
> Thanks,
>
>
> Shane Borden
> [email protected]
>
>


^ permalink  raw  reply  [nested|flat] 7+ messages in thread

* Re: PARALLEL CTAS
@ 2022-12-12 17:13  Shane Borden <[email protected]>
  parent: Erik Brandsberg <[email protected]>
  0 siblings, 1 reply; 7+ messages in thread

From: Shane Borden @ 2022-12-12 17:13 UTC (permalink / raw)
  To: Erik Brandsberg <[email protected]>; +Cc: [email protected]

The issue is there are certain performance benefits to be had by doing parallel CTAS operations and when converting from Oracle to PostgreSQL switching to a “COPY” operation isn’t feasible.
---

Thanks,


Shane Borden
[email protected]

> On Dec 12, 2022, at 11:55 AM, Erik Brandsberg <[email protected]> wrote:
> 
> Copy operations are by far the most efficient way to do bulk loads.  The fdw extension isn't very efficient based on prior experience.
> 
> On Mon, Dec 12, 2022 at 11:17 AM Shane Borden <[email protected] <mailto:[email protected]>> wrote:
>> Based on testing thus far, it appears that the only way to get a bulk load operation (similar to oracle) would be to use the COPY command or to use multiple connections and break up the inserts across connections.  I did see a possibility of using the postgres_fdw extension and use the batch functionality, but I haven’t tested that yet.
>> 
>> Does anyone know the status of this feature / patch that was discussed in this link?
>> 
>> https://www.postgresql.org/message-id/CAJcOf-cXnB5cnMKqWEp2E2z7Mvcd04iLVmV%3DqpFJrR3AcrTS3g%40mail.g...
>> 
>> Is there a way in PostgreSQL to parallelize a CTAS statement similar to how you can do it in Oracle?
>> 
>> ---
>> 
>> Thanks,
>> 
>> 
>> Shane Borden
>> [email protected] <mailto:[email protected]>



^ permalink  raw  reply  [nested|flat] 7+ messages in thread

* Re: PARALLEL CTAS
@ 2022-12-12 17:25  Rob Sargent <[email protected]>
  parent: Shane Borden <[email protected]>
  0 siblings, 1 reply; 7+ messages in thread

From: Rob Sargent @ 2022-12-12 17:25 UTC (permalink / raw)
  To: [email protected]

On 12/12/22 10:13, Shane Borden wrote:
> The issue is there are certain performance benefits to be had by doing 
> parallel CTAS operations and when converting from Oracle to PostgreSQL 
> switching to a “COPY” operation isn’t feasible.
> ---
>
> Thanks,
>
Today I suspect you're left with something like the following:
- CTAS from source where 1=2 (i.e. table definition via select semantics)
- copy from stdin (filled with intended CTAS select)






^ permalink  raw  reply  [nested|flat] 7+ messages in thread

* Re: PARALLEL CTAS
@ 2022-12-12 17:36  Shane Borden <[email protected]>
  parent: Rob Sargent <[email protected]>
  0 siblings, 1 reply; 7+ messages in thread

From: Shane Borden @ 2022-12-12 17:36 UTC (permalink / raw)
  To: Rob Sargent <[email protected]>; +Cc: [email protected]

I saw this as an option.  Let’s say you have a million row table.  If you copy from STDIN, do you have to spool the source table to a local file first?  Looking to replace a CTAS thats doing this all within SQL statements now.
---

Thanks,


Shane Borden
[email protected]

> On Dec 12, 2022, at 12:25 PM, Rob Sargent <[email protected]> wrote:
> 
> On 12/12/22 10:13, Shane Borden wrote:
>> The issue is there are certain performance benefits to be had by doing parallel CTAS operations and when converting from Oracle to PostgreSQL switching to a “COPY” operation isn’t feasible.
>> ---
>> 
>> Thanks,
>> 
> Today I suspect you're left with something like the following:
> - CTAS from source where 1=2 (i.e. table definition via select semantics)
> - copy from stdin (filled with intended CTAS select)
> 
> 
> 



^ permalink  raw  reply  [nested|flat] 7+ messages in thread

* Re: PARALLEL CTAS
@ 2022-12-12 18:19  Steve Midgley <[email protected]>
  parent: Shane Borden <[email protected]>
  0 siblings, 1 reply; 7+ messages in thread

From: Steve Midgley @ 2022-12-12 18:19 UTC (permalink / raw)
  To: Shane Borden <[email protected]>; +Cc: Rob Sargent <[email protected]>; [email protected]

On Mon, Dec 12, 2022 at 9:36 AM Shane Borden <[email protected]> wrote:

> I saw this as an option.  Let’s say you have a million row table.  If you
> copy from STDIN, do you have to spool the source table to a local file
> first?  Looking to replace a CTAS thats doing this all within SQL
> statements now.
> ---
>
> Thanks,
>
>
> Shane Borden
> [email protected]
>
> On Dec 12, 2022, at 12:25 PM, Rob Sargent <[email protected]> wrote:
>
> On 12/12/22 10:13, Shane Borden wrote:
>
> The issue is there are certain performance benefits to be had by doing
> parallel CTAS operations and when converting from Oracle to PostgreSQL
> switching to a “COPY” operation isn’t feasible.
> ---
>
> Thanks,
>
> Today I suspect you're left with something like the following:
> - CTAS from source where 1=2 (i.e. table definition via select semantics)
> - copy from stdin (filled with intended CTAS select)
>
> I'm not at all familiar with Oracle / CTAS (fair warning). But I did spend
some time, years ago, building a fast import using COPY/STDIN in Postgres
awhile back. I copied that code into a Ruby language gist
<https://gist.github.com/science/393907d4123c87ed767bc81e9dd5a7da;
demonstrating the core concept recently.

The key idea, IMO, is to use an intermediate language like Ruby to manage
the "spooling" from the source file to STDIN. That system could presumably
fork processes to have more than one COPY from the file going at a time
(and coordinate each thread to access different rows of the source file).

However, my experience was that if you want the absolutely fastest import,
you should copy the source file to a fast disk/array on the Postgres server
itself, and then run the COPY command pointing to that file locally, not
piping from STDIN over a network connection. Postgres, at least in my
testing, maxes I/O on a single COPY command pretty heartily - so much so
that you might find performance otherwise becomes a problem, esp if your
DBs are on that disk/array. IIRC, I was seeing millions of rows loaded per
second on a high performing Postgres server, using this method. Also, IIRC,
I removed the indexes from the table, again to max throughput, and then
added them back after the COPY completed (not sure if modern Postgres COPY
needs that - this was years ago).

I think total time to completion was also faster, as the index rebuilds
were faster after the fact, than done during the loading (again I'm foggy
so maybe inaccurate - this was years ago), but I was loading into a blank
table -- rebuilding already constructed indexes might have an overwhelming
cost.

If you have to do a network based COPY from STDIN, then my gist above will
hopefully show the way - and I wouldn't be surprised if running multiple of
these simultaneously gets better performance than a single COPY. At that
point, you'll just be fine tuning to max saturation on the network
bandwidth - not the max capability of the postgres server, which the local
disk COPY does.

Hopefully this input is relevant - please ignore if I'm missing your goal
entirely!

Best,
Steve

^ permalink  raw  reply  [nested|flat] 7+ messages in thread

* Re: PARALLEL CTAS
@ 2022-12-12 18:52  Tom Lane <[email protected]>
  parent: Steve Midgley <[email protected]>
  0 siblings, 0 replies; 7+ messages in thread

From: Tom Lane @ 2022-12-12 18:52 UTC (permalink / raw)
  To: Steve Midgley <[email protected]>; +Cc: Shane Borden <[email protected]>; Rob Sargent <[email protected]>; [email protected]

>> Today I suspect you're left with something like the following:
>> - CTAS from source where 1=2 (i.e. table definition via select semantics)
>> - copy from stdin (filled with intended CTAS select)

As far as I can tell, all supported versions of Postgres are perfectly
content to parallelize the source-row computation in a CREATE TABLE AS
SELECT, if they would parallelize the underlying SELECT.  Note that
this is not the same as INSERT INTO ... SELECT, which is a harder
problem because the target table might already have indexes, constraints,
etc.

If what you are concerned about is parallelization of the physical
insertions of the tuples, I'm pretty sure we don't have anything that
can do that today, including COPY.  COPY does have some batch-insertion
optimizations, but that's not parallelism.

Can you parallelize your problem at a higher level, ie do several
table loads at once?

			regards, tom lane

^ permalink  raw  reply  [nested|flat] 7+ messages in thread

end of thread, other threads:[~2022-12-12 18:52 UTC | newest]

Thread overview: 7+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2022-12-12 16:16 PARALLEL CTAS Shane Borden <[email protected]>
2022-12-12 16:55 ` Erik Brandsberg <[email protected]>
2022-12-12 17:13   ` Shane Borden <[email protected]>
2022-12-12 17:25     ` Rob Sargent <[email protected]>
2022-12-12 17:36       ` Shane Borden <[email protected]>
2022-12-12 18:19         ` Steve Midgley <[email protected]>
2022-12-12 18:52           ` Tom Lane <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox