Re: Faster data load - Adrian Klaver

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Adrian Klaver <[email protected]>
To: [email protected]
Subject: Re: Faster data load
Date: Sun, 8 Sep 2024 11:26:42 -0700
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
References: <CAKna9VaVsDzfOfOGu1+grStp9BBHFMKrH5DCEbbtGcQUWJ74KQ@mail.gmail.com>
	<[email protected]>

On 9/8/24 10:45, Peter J. Holzer wrote:
> On 2024-09-06 01:44:00 +0530, Lok P wrote:
>> We are having a requirement to create approx 50 billion rows in a partition
>> table(~1 billion rows per partition, 200+gb size daily partitions) for a
>> performance test. We are currently using ' insert into <target table_partition>
>> select.. From <source_table_partition> or <some transformed query>;' method .
>> We have dropped all indexes and constraints First and then doing the load.
>> Still it's taking 2-3 hours to populate one partition.
> 
> That seems quite slow. Is the table very wide or does it have a large
> number of indexes?
> 
>> Is there a faster way to achieve this?
>>
>> Few teammate suggesting to use copy command and use file load instead, which
>> will be faster.
> 
> I doubt that.
> 
> I benchmarked several strategies for populating tables 5 years ago and
> (for my test data and on our hardware at the time - YMMV) s simple
> INSERT ... SELECT was more than twice as fast as 8 parallel COPY
> operations (and about 8 times as fast as a single COPY). >
> Details will have changed since then (I should rerun that benchmark on
> a current system), but I'd be surprised if COPY became that much faster
> relative to INSERT ... SELECT.

Yeah they seem to have changed a great deal. Though you are correct in 
saying COPY is not faster then INSERT..SELECT


select version();
                                                               version 

-----------------------------------------------------------------------------------------------------------------------------------
  PostgreSQL 16.4 (Ubuntu 16.4-1.pgdg22.04+1) on x86_64-pc-linux-gnu, 
compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit


select count(*) from nyc_taxi_duckdb ;
   count
---------
  2846722

ll -h nyc_taxi.csv
-rw-rw-r-- 1 aklaver aklaver 252M Sep  8 10:54 nyc_taxi.csv

insert into nyc_duplicate select * from nyc_taxi_duckdb ;
INSERT 0 2846722
Time: 7015.072 ms (00:07.015)

truncate  nyc_duplicate ;

\copy nyc_duplicate from 'nyc_taxi.csv' with csv header
COPY 2846722
Time: 8760.197 ms (00:08.760)

copy nyc_duplicate from '/tmp/nyc_taxi.csv' with csv header;
COPY 2846722
Time: 7904.279 ms (00:07.904)

Just to see what the coming attraction offers:

select version();
                                                                version 

--------------------------------------------------------------------------------------------------------------------------------------
  PostgreSQL 17rc1 (Ubuntu 17~rc1-1.pgdg22.04+1) on x86_64-pc-linux-gnu, 
compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit


insert into nyc_duplicate select * from nyc_taxi_duckdb ;
INSERT 0 2846722
Time: 5315.878 ms (00:05.316)

\copy nyc_duplicate from 'nyc_taxi.csv' with csv header
COPY 2846722
Time: 10042.129 ms (00:10.042)

copy nyc_duplicate from '/tmp/nyc_taxi.csv' with csv header;
COPY 2846722
Time: 8422.503 ms (00:08.423)

> 
>          hp
> 

-- 
Adrian Klaver
[email protected]

view thread (9+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected]
  Subject: Re: Faster data load
  In-Reply-To: <[email protected]>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox