public inbox for [email protected]help / color / mirror / Atom feed
Faster data load 9+ messages / 7 participants [nested] [flat]
* Faster data load @ 2024-09-05 20:14 Lok P <[email protected]> 0 siblings, 4 replies; 9+ messages in thread From: Lok P @ 2024-09-05 20:14 UTC (permalink / raw) To: pgsql-general <[email protected]> Hi, We are having a requirement to create approx 50 billion rows in a partition table(~1 billion rows per partition, 200+gb size daily partitions) for a performance test. We are currently using ' insert into <target table_partition> select.. From <source_table_partition> or <some transformed query>;' method . We have dropped all indexes and constraints First and then doing the load. Still it's taking 2-3 hours to populate one partition. Is there a faster way to achieve this? Few teammate suggesting to use copy command and use file load instead, which will be faster. So I wanted to understand, how different things it does behind the scenes as compared to insert as select command? As because it only deals with sql engine only. Additionally, when we were trying to create indexes post data load on one partition, it took 30+ minutes. Any possible way to make it faster? Is there any way to drive the above things in parallel by utilizing full database resources? It's postgres 15.4 Regards Lok ^ permalink raw reply [nested|flat] 9+ messages in thread
* Re: Faster data load @ 2024-09-05 21:45 Ron Johnson <[email protected]> parent: Lok P <[email protected]> 3 siblings, 0 replies; 9+ messages in thread From: Ron Johnson @ 2024-09-05 21:45 UTC (permalink / raw) To: pgsql-general <[email protected]> On Thu, Sep 5, 2024 at 4:14 PM Lok P <[email protected]> wrote: > Hi, > > We are having a requirement to create approx 50 billion rows in a > partition table(~1 billion rows per partition, 200+gb size daily > partitions) for a performance test. We are currently using ' insert into > <target table_partition> select.. From <source_table_partition> or <some > transformed query>;' method . We have dropped all indexes and constraints > First and then doing the load. Still it's taking 2-3 hours to populate one > partition. > At three hours, that's 92,593 records/second. Seems pretty slow. How much of that time is taken by <some transformed query>? How big are the records? How fast is the hardware? Is there a faster way to achieve this? > Testing is the only way to know for sure. > Few teammate suggesting to use copy command and use file load instead, > which will be faster. So I wanted to understand, how different things it > does behind the scenes as compared to insert as select command? As because > it only deals with sql engine only. > COPY is highly optimized for buffered operation. INSERT... maybe not so much. But if the source data is already in a table, that would require piping the data to stdout and then back into the database. psql appdb -c "COPY (SELECT ...) TO STDOUT;" | psql appdb -c "COPY some_table FROM STDOUT;". Use binary mode, so text conversion isn't required. Maybe that's faster, maybe not. Additionally, when we were trying to create indexes post data load on one > partition, it took 30+ minutes. Any possible way to make it faster? > > Is there any way to drive the above things in parallel by utilizing full > database resources? > Put the destination tables in a different tablespace on a different controller. > It's postgres 15.4 > Why not 15.8? -- Death to America, and butter sauce. Iraq lobster! ^ permalink raw reply [nested|flat] 9+ messages in thread
* Re: Faster data load @ 2024-09-05 22:21 Jeff Ross <[email protected]> parent: Lok P <[email protected]> 3 siblings, 0 replies; 9+ messages in thread From: Jeff Ross @ 2024-09-05 22:21 UTC (permalink / raw) To: [email protected] On 9/5/24 14:14, Lok P wrote: > Hi, > > We are having a requirement to create approx 50 billion rows in a > partition table(~1 billion rows per partition, 200+gb size daily > partitions) for a performance test. We are currently using ' insert > into <target table_partition> select.. From <source_table_partition> > or <some transformed query>;' method . We have dropped all indexes and > constraints First and then doing the load. Still it's taking 2-3 hours > to populate one partition. Is there a faster way to achieve this? > > Few teammate suggesting to use copy command and use file load instead, > which will be faster. So I wanted to understand, how different things > it does behind the scenes as compared to insert as select command? As > because it only deals with sql engine only. > > Additionally, when we were trying to create indexes post data load on > one partition, it took 30+ minutes. Any possible way to make it faster? > > Is there any way to drive the above things in parallel by utilizing > full database resources? > > It's postgres 15.4 > > Regards > Lok Try pg_bulkload to load the data--takes a little set up but it is very fast. Do pay attention to the caveats. For a performance test they probably won't be relevant. https://github.com/ossc-db/pg_bulkload?tab=readme-ov-file Jeff ^ permalink raw reply [nested|flat] 9+ messages in thread
* Re: Faster data load @ 2024-09-06 03:49 Muhammad Usman Khan <[email protected]> parent: Lok P <[email protected]> 3 siblings, 1 reply; 9+ messages in thread From: Muhammad Usman Khan @ 2024-09-06 03:49 UTC (permalink / raw) To: Lok P <[email protected]>; +Cc: pgsql-general <[email protected]> Hi, You can use pg_partman. If your table is partitioned, you can manage partitions in parallel by distributing the load across partitions concurrently. Or you can use citus. It can be an excellent solution, especially for handling large data volumes and parallelizing data operations On Fri, 6 Sept 2024 at 01:14, Lok P <[email protected]> wrote: > Hi, > > We are having a requirement to create approx 50 billion rows in a > partition table(~1 billion rows per partition, 200+gb size daily > partitions) for a performance test. We are currently using ' insert into > <target table_partition> select.. From <source_table_partition> or <some > transformed query>;' method . We have dropped all indexes and constraints > First and then doing the load. Still it's taking 2-3 hours to populate one > partition. Is there a faster way to achieve this? > > Few teammate suggesting to use copy command and use file load instead, > which will be faster. So I wanted to understand, how different things it > does behind the scenes as compared to insert as select command? As because > it only deals with sql engine only. > > Additionally, when we were trying to create indexes post data load on one > partition, it took 30+ minutes. Any possible way to make it faster? > > Is there any way to drive the above things in parallel by utilizing full > database resources? > > It's postgres 15.4 > > Regards > Lok > ^ permalink raw reply [nested|flat] 9+ messages in thread
* Re: Faster data load @ 2024-09-06 04:42 Lok P <[email protected]> parent: Muhammad Usman Khan <[email protected]> 0 siblings, 1 reply; 9+ messages in thread From: Lok P @ 2024-09-06 04:42 UTC (permalink / raw) To: Muhammad Usman Khan <[email protected]>; +Cc: pgsql-general <[email protected]> On Fri, 6 Sept, 2024, 9:20 am Muhammad Usman Khan, <[email protected]> wrote: > Hi, > > You can use pg_partman. If your table is partitioned, you can manage > partitions in parallel by distributing the load across partitions > concurrently. Or you can use citus. It can be an excellent solution, > especially for handling large data volumes and parallelizing data operations > Thank you. The tables are partitioned. Also during index creation we are trying to do it multiple partitions at same time from multiple sessions.But seeing out of memory error in 5th or 6th session. And even each sessions taking 30mins per partitions for index creation. Attach index partitions happening in seconds though. > > ^ permalink raw reply [nested|flat] 9+ messages in thread
* Re: Faster data load @ 2024-09-06 05:12 Ron Johnson <[email protected]> parent: Lok P <[email protected]> 0 siblings, 0 replies; 9+ messages in thread From: Ron Johnson @ 2024-09-06 05:12 UTC (permalink / raw) To: pgsql-general <[email protected]> On Fri, Sep 6, 2024 at 12:43 AM Lok P <[email protected]> wrote: > Also during index creation we are trying to do it multiple partitions at > same time from multiple sessions.But seeing out of memory error in 5th or > 6th session. > Had that same problem during pg_restore. Reduced maintenance_work_mem and the problem went away. -- Death to America, and butter sauce. Iraq lobster! ^ permalink raw reply [nested|flat] 9+ messages in thread
* Re: Faster data load @ 2024-09-08 17:45 Peter J. Holzer <[email protected]> parent: Lok P <[email protected]> 3 siblings, 1 reply; 9+ messages in thread From: Peter J. Holzer @ 2024-09-08 17:45 UTC (permalink / raw) To: [email protected] On 2024-09-06 01:44:00 +0530, Lok P wrote: > We are having a requirement to create approx 50 billion rows in a partition > table(~1 billion rows per partition, 200+gb size daily partitions) for a > performance test. We are currently using ' insert into <target table_partition> > select.. From <source_table_partition> or <some transformed query>;' method . > We have dropped all indexes and constraints First and then doing the load. > Still it's taking 2-3 hours to populate one partition. That seems quite slow. Is the table very wide or does it have a large number of indexes? > Is there a faster way to achieve this? > > Few teammate suggesting to use copy command and use file load instead, which > will be faster. I doubt that. I benchmarked several strategies for populating tables 5 years ago and (for my test data and on our hardware at the time - YMMV) s simple INSERT ... SELECT was more than twice as fast as 8 parallel COPY operations (and about 8 times as fast as a single COPY). Details will have changed since then (I should rerun that benchmark on a current system), but I'd be surprised if COPY became that much faster relative to INSERT ... SELECT. hp -- _ | Peter J. Holzer | Story must make more sense than reality. |_|_) | | | | | [email protected] | -- Charles Stross, "Creative writing __/ | http://www.hjp.at/ | challenge!" Attachments: [application/pgp-signature] signature.asc (833B, 2-signature.asc) download ^ permalink raw reply [nested|flat] 9+ messages in thread
* Re: Faster data load @ 2024-09-08 18:26 Adrian Klaver <[email protected]> parent: Peter J. Holzer <[email protected]> 0 siblings, 1 reply; 9+ messages in thread From: Adrian Klaver @ 2024-09-08 18:26 UTC (permalink / raw) To: [email protected] On 9/8/24 10:45, Peter J. Holzer wrote: > On 2024-09-06 01:44:00 +0530, Lok P wrote: >> We are having a requirement to create approx 50 billion rows in a partition >> table(~1 billion rows per partition, 200+gb size daily partitions) for a >> performance test. We are currently using ' insert into <target table_partition> >> select.. From <source_table_partition> or <some transformed query>;' method . >> We have dropped all indexes and constraints First and then doing the load. >> Still it's taking 2-3 hours to populate one partition. > > That seems quite slow. Is the table very wide or does it have a large > number of indexes? > >> Is there a faster way to achieve this? >> >> Few teammate suggesting to use copy command and use file load instead, which >> will be faster. > > I doubt that. > > I benchmarked several strategies for populating tables 5 years ago and > (for my test data and on our hardware at the time - YMMV) s simple > INSERT ... SELECT was more than twice as fast as 8 parallel COPY > operations (and about 8 times as fast as a single COPY). > > Details will have changed since then (I should rerun that benchmark on > a current system), but I'd be surprised if COPY became that much faster > relative to INSERT ... SELECT. Yeah they seem to have changed a great deal. Though you are correct in saying COPY is not faster then INSERT..SELECT select version(); version ----------------------------------------------------------------------------------------------------------------------------------- PostgreSQL 16.4 (Ubuntu 16.4-1.pgdg22.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit select count(*) from nyc_taxi_duckdb ; count --------- 2846722 ll -h nyc_taxi.csv -rw-rw-r-- 1 aklaver aklaver 252M Sep 8 10:54 nyc_taxi.csv insert into nyc_duplicate select * from nyc_taxi_duckdb ; INSERT 0 2846722 Time: 7015.072 ms (00:07.015) truncate nyc_duplicate ; \copy nyc_duplicate from 'nyc_taxi.csv' with csv header COPY 2846722 Time: 8760.197 ms (00:08.760) copy nyc_duplicate from '/tmp/nyc_taxi.csv' with csv header; COPY 2846722 Time: 7904.279 ms (00:07.904) Just to see what the coming attraction offers: select version(); version -------------------------------------------------------------------------------------------------------------------------------------- PostgreSQL 17rc1 (Ubuntu 17~rc1-1.pgdg22.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit insert into nyc_duplicate select * from nyc_taxi_duckdb ; INSERT 0 2846722 Time: 5315.878 ms (00:05.316) \copy nyc_duplicate from 'nyc_taxi.csv' with csv header COPY 2846722 Time: 10042.129 ms (00:10.042) copy nyc_duplicate from '/tmp/nyc_taxi.csv' with csv header; COPY 2846722 Time: 8422.503 ms (00:08.423) > > hp > -- Adrian Klaver [email protected] ^ permalink raw reply [nested|flat] 9+ messages in thread
* Re: Faster data load @ 2024-09-09 07:25 Dominique Devienne <[email protected]> parent: Adrian Klaver <[email protected]> 0 siblings, 0 replies; 9+ messages in thread From: Dominique Devienne @ 2024-09-09 07:25 UTC (permalink / raw) To: Adrian Klaver <[email protected]>; +Cc: [email protected] On Sun, Sep 8, 2024 at 8:27 PM Adrian Klaver <[email protected]> wrote: > > simple INSERT ... SELECT was more than twice as fast as 8 parallel COPY > > operations (and about 8 times as fast as a single COPY). > > Yeah they seem to have changed a great deal. Though you are correct in > saying COPY is not faster then INSERT..SELECT Interesting. Thanks for sharing. Although to be fair, that adds CSV parsing to the mix, something INSERT-FROM-SELECT does not have to do, skewing the results a little maybe. Comparing against COPY BINARY would be fairer, but less practical I guess. --DD ^ permalink raw reply [nested|flat] 9+ messages in thread
end of thread, other threads:[~2024-09-09 07:25 UTC | newest] Thread overview: 9+ messages (download: mbox mbox.gz follow: Atom feed) -- links below jump to the message on this page -- 2024-09-05 20:14 Faster data load Lok P <[email protected]> 2024-09-05 21:45 ` Ron Johnson <[email protected]> 2024-09-05 22:21 ` Jeff Ross <[email protected]> 2024-09-06 03:49 ` Muhammad Usman Khan <[email protected]> 2024-09-06 04:42 ` Lok P <[email protected]> 2024-09-06 05:12 ` Ron Johnson <[email protected]> 2024-09-08 17:45 ` Peter J. Holzer <[email protected]> 2024-09-08 18:26 ` Adrian Klaver <[email protected]> 2024-09-09 07:25 ` Dominique Devienne <[email protected]>
This inbox is served by agora; see mirroring instructions for how to clone and mirror all data and code used for this inbox