public inbox for [email protected]help / color / mirror / Atom feed
PARALLEL CTAS 7+ messages / 5 participants [nested] [flat]
* PARALLEL CTAS @ 2022-12-12 16:16 Shane Borden <[email protected]> 0 siblings, 1 reply; 7+ messages in thread From: Shane Borden @ 2022-12-12 16:16 UTC (permalink / raw) To: [email protected] Based on testing thus far, it appears that the only way to get a bulk load operation (similar to oracle) would be to use the COPY command or to use multiple connections and break up the inserts across connections. I did see a possibility of using the postgres_fdw extension and use the batch functionality, but I haven’t tested that yet. Does anyone know the status of this feature / patch that was discussed in this link? https://www.postgresql.org/message-id/CAJcOf-cXnB5cnMKqWEp2E2z7Mvcd04iLVmV%3DqpFJrR3AcrTS3g%40mail.g... Is there a way in PostgreSQL to parallelize a CTAS statement similar to how you can do it in Oracle? --- Thanks, Shane Borden [email protected] ^ permalink raw reply [nested|flat] 7+ messages in thread
* Re: PARALLEL CTAS @ 2022-12-12 16:55 Erik Brandsberg <[email protected]> parent: Shane Borden <[email protected]> 0 siblings, 1 reply; 7+ messages in thread From: Erik Brandsberg @ 2022-12-12 16:55 UTC (permalink / raw) To: Shane Borden <[email protected]>; +Cc: [email protected] Copy operations are by far the most efficient way to do bulk loads. The fdw extension isn't very efficient based on prior experience. On Mon, Dec 12, 2022 at 11:17 AM Shane Borden <[email protected]> wrote: > Based on testing thus far, it appears that the only way to get a bulk load > operation (similar to oracle) would be to use the COPY command or to use > multiple connections and break up the inserts across connections. I did > see a possibility of using the postgres_fdw extension and use the batch > functionality, but I haven’t tested that yet. > > Does anyone know the status of this feature / patch that was discussed in > this link? > > > https://www.postgresql.org/message-id/CAJcOf-cXnB5cnMKqWEp2E2z7Mvcd04iLVmV%3DqpFJrR3AcrTS3g%40mail.g... > > Is there a way in PostgreSQL to parallelize a CTAS statement similar to > how you can do it in Oracle? > > --- > > Thanks, > > > Shane Borden > [email protected] > > ^ permalink raw reply [nested|flat] 7+ messages in thread
* Re: PARALLEL CTAS @ 2022-12-12 17:13 Shane Borden <[email protected]> parent: Erik Brandsberg <[email protected]> 0 siblings, 1 reply; 7+ messages in thread From: Shane Borden @ 2022-12-12 17:13 UTC (permalink / raw) To: Erik Brandsberg <[email protected]>; +Cc: [email protected] The issue is there are certain performance benefits to be had by doing parallel CTAS operations and when converting from Oracle to PostgreSQL switching to a “COPY” operation isn’t feasible. --- Thanks, Shane Borden [email protected] > On Dec 12, 2022, at 11:55 AM, Erik Brandsberg <[email protected]> wrote: > > Copy operations are by far the most efficient way to do bulk loads. The fdw extension isn't very efficient based on prior experience. > > On Mon, Dec 12, 2022 at 11:17 AM Shane Borden <[email protected] <mailto:[email protected]>> wrote: >> Based on testing thus far, it appears that the only way to get a bulk load operation (similar to oracle) would be to use the COPY command or to use multiple connections and break up the inserts across connections. I did see a possibility of using the postgres_fdw extension and use the batch functionality, but I haven’t tested that yet. >> >> Does anyone know the status of this feature / patch that was discussed in this link? >> >> https://www.postgresql.org/message-id/CAJcOf-cXnB5cnMKqWEp2E2z7Mvcd04iLVmV%3DqpFJrR3AcrTS3g%40mail.g... >> >> Is there a way in PostgreSQL to parallelize a CTAS statement similar to how you can do it in Oracle? >> >> --- >> >> Thanks, >> >> >> Shane Borden >> [email protected] <mailto:[email protected]> ^ permalink raw reply [nested|flat] 7+ messages in thread
* Re: PARALLEL CTAS @ 2022-12-12 17:25 Rob Sargent <[email protected]> parent: Shane Borden <[email protected]> 0 siblings, 1 reply; 7+ messages in thread From: Rob Sargent @ 2022-12-12 17:25 UTC (permalink / raw) To: [email protected] On 12/12/22 10:13, Shane Borden wrote: > The issue is there are certain performance benefits to be had by doing > parallel CTAS operations and when converting from Oracle to PostgreSQL > switching to a “COPY” operation isn’t feasible. > --- > > Thanks, > Today I suspect you're left with something like the following: - CTAS from source where 1=2 (i.e. table definition via select semantics) - copy from stdin (filled with intended CTAS select) ^ permalink raw reply [nested|flat] 7+ messages in thread
* Re: PARALLEL CTAS @ 2022-12-12 17:36 Shane Borden <[email protected]> parent: Rob Sargent <[email protected]> 0 siblings, 1 reply; 7+ messages in thread From: Shane Borden @ 2022-12-12 17:36 UTC (permalink / raw) To: Rob Sargent <[email protected]>; +Cc: [email protected] I saw this as an option. Let’s say you have a million row table. If you copy from STDIN, do you have to spool the source table to a local file first? Looking to replace a CTAS thats doing this all within SQL statements now. --- Thanks, Shane Borden [email protected] > On Dec 12, 2022, at 12:25 PM, Rob Sargent <[email protected]> wrote: > > On 12/12/22 10:13, Shane Borden wrote: >> The issue is there are certain performance benefits to be had by doing parallel CTAS operations and when converting from Oracle to PostgreSQL switching to a “COPY” operation isn’t feasible. >> --- >> >> Thanks, >> > Today I suspect you're left with something like the following: > - CTAS from source where 1=2 (i.e. table definition via select semantics) > - copy from stdin (filled with intended CTAS select) > > > ^ permalink raw reply [nested|flat] 7+ messages in thread
* Re: PARALLEL CTAS @ 2022-12-12 18:19 Steve Midgley <[email protected]> parent: Shane Borden <[email protected]> 0 siblings, 1 reply; 7+ messages in thread From: Steve Midgley @ 2022-12-12 18:19 UTC (permalink / raw) To: Shane Borden <[email protected]>; +Cc: Rob Sargent <[email protected]>; [email protected] On Mon, Dec 12, 2022 at 9:36 AM Shane Borden <[email protected]> wrote: > I saw this as an option. Let’s say you have a million row table. If you > copy from STDIN, do you have to spool the source table to a local file > first? Looking to replace a CTAS thats doing this all within SQL > statements now. > --- > > Thanks, > > > Shane Borden > [email protected] > > On Dec 12, 2022, at 12:25 PM, Rob Sargent <[email protected]> wrote: > > On 12/12/22 10:13, Shane Borden wrote: > > The issue is there are certain performance benefits to be had by doing > parallel CTAS operations and when converting from Oracle to PostgreSQL > switching to a “COPY” operation isn’t feasible. > --- > > Thanks, > > Today I suspect you're left with something like the following: > - CTAS from source where 1=2 (i.e. table definition via select semantics) > - copy from stdin (filled with intended CTAS select) > > I'm not at all familiar with Oracle / CTAS (fair warning). But I did spend some time, years ago, building a fast import using COPY/STDIN in Postgres awhile back. I copied that code into a Ruby language gist <https://gist.github.com/science/393907d4123c87ed767bc81e9dd5a7da; demonstrating the core concept recently. The key idea, IMO, is to use an intermediate language like Ruby to manage the "spooling" from the source file to STDIN. That system could presumably fork processes to have more than one COPY from the file going at a time (and coordinate each thread to access different rows of the source file). However, my experience was that if you want the absolutely fastest import, you should copy the source file to a fast disk/array on the Postgres server itself, and then run the COPY command pointing to that file locally, not piping from STDIN over a network connection. Postgres, at least in my testing, maxes I/O on a single COPY command pretty heartily - so much so that you might find performance otherwise becomes a problem, esp if your DBs are on that disk/array. IIRC, I was seeing millions of rows loaded per second on a high performing Postgres server, using this method. Also, IIRC, I removed the indexes from the table, again to max throughput, and then added them back after the COPY completed (not sure if modern Postgres COPY needs that - this was years ago). I think total time to completion was also faster, as the index rebuilds were faster after the fact, than done during the loading (again I'm foggy so maybe inaccurate - this was years ago), but I was loading into a blank table -- rebuilding already constructed indexes might have an overwhelming cost. If you have to do a network based COPY from STDIN, then my gist above will hopefully show the way - and I wouldn't be surprised if running multiple of these simultaneously gets better performance than a single COPY. At that point, you'll just be fine tuning to max saturation on the network bandwidth - not the max capability of the postgres server, which the local disk COPY does. Hopefully this input is relevant - please ignore if I'm missing your goal entirely! Best, Steve ^ permalink raw reply [nested|flat] 7+ messages in thread
* Re: PARALLEL CTAS @ 2022-12-12 18:52 Tom Lane <[email protected]> parent: Steve Midgley <[email protected]> 0 siblings, 0 replies; 7+ messages in thread From: Tom Lane @ 2022-12-12 18:52 UTC (permalink / raw) To: Steve Midgley <[email protected]>; +Cc: Shane Borden <[email protected]>; Rob Sargent <[email protected]>; [email protected] >> Today I suspect you're left with something like the following: >> - CTAS from source where 1=2 (i.e. table definition via select semantics) >> - copy from stdin (filled with intended CTAS select) As far as I can tell, all supported versions of Postgres are perfectly content to parallelize the source-row computation in a CREATE TABLE AS SELECT, if they would parallelize the underlying SELECT. Note that this is not the same as INSERT INTO ... SELECT, which is a harder problem because the target table might already have indexes, constraints, etc. If what you are concerned about is parallelization of the physical insertions of the tuples, I'm pretty sure we don't have anything that can do that today, including COPY. COPY does have some batch-insertion optimizations, but that's not parallelism. Can you parallelize your problem at a higher level, ie do several table loads at once? regards, tom lane ^ permalink raw reply [nested|flat] 7+ messages in thread
end of thread, other threads:[~2022-12-12 18:52 UTC | newest] Thread overview: 7+ messages (download: mbox mbox.gz follow: Atom feed) -- links below jump to the message on this page -- 2022-12-12 16:16 PARALLEL CTAS Shane Borden <[email protected]> 2022-12-12 16:55 ` Erik Brandsberg <[email protected]> 2022-12-12 17:13 ` Shane Borden <[email protected]> 2022-12-12 17:25 ` Rob Sargent <[email protected]> 2022-12-12 17:36 ` Shane Borden <[email protected]> 2022-12-12 18:19 ` Steve Midgley <[email protected]> 2022-12-12 18:52 ` Tom Lane <[email protected]>
This inbox is served by agora; see mirroring instructions for how to clone and mirror all data and code used for this inbox