Faster data load

public inbox for [email protected]  
help / color / mirror / Atom feed

Faster data load
9+ messages / 7 participants
[nested] [flat]

* Faster data load
@ 2024-09-05 20:14  Lok P <[email protected]>
  0 siblings, 4 replies; 9+ messages in thread

From: Lok P @ 2024-09-05 20:14 UTC (permalink / raw)
  To: pgsql-general <[email protected]>

Hi,

We are having a requirement to create approx 50 billion rows in a partition
table(~1 billion rows per partition, 200+gb size daily partitions) for a
performance test. We are currently using ' insert into <target
table_partition> select.. From <source_table_partition> or <some
transformed query>;' method . We have dropped all indexes and constraints
First and then doing the load. Still it's taking 2-3 hours to populate one
partition. Is there a faster way to achieve this?

Few teammate suggesting to use copy command and use file load instead,
which will be faster. So I wanted to understand, how different things it
does behind the scenes as compared to insert as select command? As because
it only deals with sql engine only.

Additionally, when we were trying to create indexes post data load on one
partition, it took 30+ minutes. Any possible way to make it faster?

Is there any way to drive the above things in parallel by utilizing full
database resources?

It's postgres 15.4

Regards
Lok

^ permalink  raw  reply  [nested|flat] 9+ messages in thread

* Re: Faster data load
@ 2024-09-05 21:45  Ron Johnson <[email protected]>
  parent: Lok P <[email protected]>
  3 siblings, 0 replies; 9+ messages in thread

From: Ron Johnson @ 2024-09-05 21:45 UTC (permalink / raw)
  To: pgsql-general <[email protected]>

On Thu, Sep 5, 2024 at 4:14 PM Lok P <[email protected]> wrote:

> Hi,
>
> We are having a requirement to create approx 50 billion rows in a
> partition table(~1 billion rows per partition, 200+gb size daily
> partitions) for a performance test. We are currently using ' insert into
> <target table_partition> select.. From <source_table_partition> or <some
> transformed query>;' method . We have dropped all indexes and constraints
> First and then doing the load. Still it's taking 2-3 hours to populate one
> partition.
>

At three hours, that's 92,593 records/second.  Seems pretty slow.

How much of that time is taken by <some transformed query>?
How big are the records?
How fast is the hardware?

Is there a faster way to achieve this?
>

Testing is the only way to know for sure.

> Few teammate suggesting to use copy command and use file load instead,
> which will be faster. So I wanted to understand, how different things it
> does behind the scenes as compared to insert as select command? As because
> it only deals with sql engine only.
>

COPY is highly optimized for buffered operation.  INSERT... maybe not so
much.

But if the source data is already in a table, that would require piping the
data to stdout and then back into the database.

psql appdb -c "COPY (SELECT ...) TO STDOUT;" | psql appdb -c "COPY
some_table FROM STDOUT;".  Use binary mode, so text conversion isn't
required.

Maybe that's faster, maybe not.

Additionally, when we were trying to create indexes post data load on one
> partition, it took 30+ minutes. Any possible way to make it faster?
>
> Is there any way to drive the above things in parallel by utilizing full
> database resources?
>

Put the destination tables in a different tablespace on a different
controller.

> It's postgres 15.4
>

Why not 15.8?

-- 
Death to America, and butter sauce.
Iraq lobster!

^ permalink  raw  reply  [nested|flat] 9+ messages in thread

* Re: Faster data load
@ 2024-09-05 22:21  Jeff Ross <[email protected]>
  parent: Lok P <[email protected]>
  3 siblings, 0 replies; 9+ messages in thread

From: Jeff Ross @ 2024-09-05 22:21 UTC (permalink / raw)
  To: [email protected]

On 9/5/24 14:14, Lok P wrote:

> Hi,
>
> We are having a requirement to create approx 50 billion rows in a 
> partition table(~1 billion rows per partition, 200+gb size daily 
> partitions) for a performance test. We are currently using ' insert 
> into <target table_partition> select.. From <source_table_partition> 
> or <some transformed query>;' method . We have dropped all indexes and 
> constraints First and then doing the load. Still it's taking 2-3 hours 
> to populate one partition. Is there a faster way to achieve this?
>
> Few teammate suggesting to use copy command and use file load instead, 
> which will be faster. So I wanted to understand, how different things 
> it does behind the scenes as compared to insert as select command? As 
> because it only deals with sql engine only.
>
> Additionally, when we were trying to create indexes post data load on 
> one partition, it took 30+ minutes. Any possible way to make it faster?
>
> Is there any way to drive the above things in parallel by utilizing 
> full database resources?
>
> It's postgres 15.4
>
> Regards
> Lok

Try pg_bulkload to load the data--takes a little set up but it is very 
fast.  Do pay attention to the caveats.  For a performance test they 
probably won't be relevant.

https://github.com/ossc-db/pg_bulkload?tab=readme-ov-file

Jeff







^ permalink  raw  reply  [nested|flat] 9+ messages in thread

* Re: Faster data load
@ 2024-09-06 03:49  Muhammad Usman Khan <[email protected]>
  parent: Lok P <[email protected]>
  3 siblings, 1 reply; 9+ messages in thread

From: Muhammad Usman Khan @ 2024-09-06 03:49 UTC (permalink / raw)
  To: Lok P <[email protected]>; +Cc: pgsql-general <[email protected]>

Hi,

You can use pg_partman. If your table is partitioned, you can manage
partitions in parallel by distributing the load across partitions
concurrently. Or you can use citus. It can be an excellent solution,
especially for handling large data volumes and parallelizing data operations


On Fri, 6 Sept 2024 at 01:14, Lok P <[email protected]> wrote:

> Hi,
>
> We are having a requirement to create approx 50 billion rows in a
> partition table(~1 billion rows per partition, 200+gb size daily
> partitions) for a performance test. We are currently using ' insert into
> <target table_partition> select.. From <source_table_partition> or <some
> transformed query>;' method . We have dropped all indexes and constraints
> First and then doing the load. Still it's taking 2-3 hours to populate one
> partition. Is there a faster way to achieve this?
>
> Few teammate suggesting to use copy command and use file load instead,
> which will be faster. So I wanted to understand, how different things it
> does behind the scenes as compared to insert as select command? As because
> it only deals with sql engine only.
>
> Additionally, when we were trying to create indexes post data load on one
> partition, it took 30+ minutes. Any possible way to make it faster?
>
> Is there any way to drive the above things in parallel by utilizing full
> database resources?
>
> It's postgres 15.4
>
> Regards
> Lok
>


^ permalink  raw  reply  [nested|flat] 9+ messages in thread

* Re: Faster data load
@ 2024-09-06 04:42  Lok P <[email protected]>
  parent: Muhammad Usman Khan <[email protected]>
  0 siblings, 1 reply; 9+ messages in thread

From: Lok P @ 2024-09-06 04:42 UTC (permalink / raw)
  To: Muhammad Usman Khan <[email protected]>; +Cc: pgsql-general <[email protected]>

On Fri, 6 Sept, 2024, 9:20 am Muhammad Usman Khan, <[email protected]>
wrote:

> Hi,
>
> You can use pg_partman. If your table is partitioned, you can manage
> partitions in parallel by distributing the load across partitions
> concurrently. Or you can use citus. It can be an excellent solution,
> especially for handling large data volumes and parallelizing data operations
>

Thank you.
The tables are partitioned. Also during index creation we are trying to do
it multiple partitions at same time from multiple sessions.But seeing out
of memory error in 5th or 6th session. And even each sessions taking 30mins
per partitions for index creation. Attach index partitions happening in
seconds though.

>
>

^ permalink  raw  reply  [nested|flat] 9+ messages in thread

* Re: Faster data load
@ 2024-09-06 05:12  Ron Johnson <[email protected]>
  parent: Lok P <[email protected]>
  0 siblings, 0 replies; 9+ messages in thread

From: Ron Johnson @ 2024-09-06 05:12 UTC (permalink / raw)
  To: pgsql-general <[email protected]>

On Fri, Sep 6, 2024 at 12:43 AM Lok P <[email protected]> wrote:

> Also during index creation we are trying to do it multiple partitions at
> same time from multiple sessions.But seeing out of memory error in 5th or
> 6th session.
>

Had that same problem during pg_restore.  Reduced maintenance_work_mem and
the problem went away.

-- 
Death to America, and butter sauce.
Iraq lobster!


^ permalink  raw  reply  [nested|flat] 9+ messages in thread

* Re: Faster data load
@ 2024-09-08 17:45  Peter J. Holzer <[email protected]>
  parent: Lok P <[email protected]>
  3 siblings, 1 reply; 9+ messages in thread

From: Peter J. Holzer @ 2024-09-08 17:45 UTC (permalink / raw)
  To: [email protected]

On 2024-09-06 01:44:00 +0530, Lok P wrote:
> We are having a requirement to create approx 50 billion rows in a partition
> table(~1 billion rows per partition, 200+gb size daily partitions) for a
> performance test. We are currently using ' insert into <target table_partition>
> select.. From <source_table_partition> or <some transformed query>;' method .
> We have dropped all indexes and constraints First and then doing the load.
> Still it's taking 2-3 hours to populate one partition.

That seems quite slow. Is the table very wide or does it have a large
number of indexes?

> Is there a faster way to achieve this? 
> 
> Few teammate suggesting to use copy command and use file load instead, which
> will be faster.

I doubt that.

I benchmarked several strategies for populating tables 5 years ago and
(for my test data and on our hardware at the time - YMMV) s simple
INSERT ... SELECT was more than twice as fast as 8 parallel COPY
operations (and about 8 times as fast as a single COPY).

Details will have changed since then (I should rerun that benchmark on
a current system), but I'd be surprised if COPY became that much faster
relative to INSERT ... SELECT.

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | [email protected]         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"


Attachments:

  [application/pgp-signature] signature.asc (833B, 2-signature.asc)
  download

^ permalink  raw  reply  [nested|flat] 9+ messages in thread

* Re: Faster data load
@ 2024-09-08 18:26  Adrian Klaver <[email protected]>
  parent: Peter J. Holzer <[email protected]>
  0 siblings, 1 reply; 9+ messages in thread

From: Adrian Klaver @ 2024-09-08 18:26 UTC (permalink / raw)
  To: [email protected]

On 9/8/24 10:45, Peter J. Holzer wrote:
> On 2024-09-06 01:44:00 +0530, Lok P wrote:
>> We are having a requirement to create approx 50 billion rows in a partition
>> table(~1 billion rows per partition, 200+gb size daily partitions) for a
>> performance test. We are currently using ' insert into <target table_partition>
>> select.. From <source_table_partition> or <some transformed query>;' method .
>> We have dropped all indexes and constraints First and then doing the load.
>> Still it's taking 2-3 hours to populate one partition.
> 
> That seems quite slow. Is the table very wide or does it have a large
> number of indexes?
> 
>> Is there a faster way to achieve this?
>>
>> Few teammate suggesting to use copy command and use file load instead, which
>> will be faster.
> 
> I doubt that.
> 
> I benchmarked several strategies for populating tables 5 years ago and
> (for my test data and on our hardware at the time - YMMV) s simple
> INSERT ... SELECT was more than twice as fast as 8 parallel COPY
> operations (and about 8 times as fast as a single COPY). >
> Details will have changed since then (I should rerun that benchmark on
> a current system), but I'd be surprised if COPY became that much faster
> relative to INSERT ... SELECT.

Yeah they seem to have changed a great deal. Though you are correct in 
saying COPY is not faster then INSERT..SELECT


select version();
                                                               version 

-----------------------------------------------------------------------------------------------------------------------------------
  PostgreSQL 16.4 (Ubuntu 16.4-1.pgdg22.04+1) on x86_64-pc-linux-gnu, 
compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit


select count(*) from nyc_taxi_duckdb ;
   count
---------
  2846722

ll -h nyc_taxi.csv
-rw-rw-r-- 1 aklaver aklaver 252M Sep  8 10:54 nyc_taxi.csv

insert into nyc_duplicate select * from nyc_taxi_duckdb ;
INSERT 0 2846722
Time: 7015.072 ms (00:07.015)

truncate  nyc_duplicate ;

\copy nyc_duplicate from 'nyc_taxi.csv' with csv header
COPY 2846722
Time: 8760.197 ms (00:08.760)

copy nyc_duplicate from '/tmp/nyc_taxi.csv' with csv header;
COPY 2846722
Time: 7904.279 ms (00:07.904)

Just to see what the coming attraction offers:

select version();
                                                                version 

--------------------------------------------------------------------------------------------------------------------------------------
  PostgreSQL 17rc1 (Ubuntu 17~rc1-1.pgdg22.04+1) on x86_64-pc-linux-gnu, 
compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit


insert into nyc_duplicate select * from nyc_taxi_duckdb ;
INSERT 0 2846722
Time: 5315.878 ms (00:05.316)

\copy nyc_duplicate from 'nyc_taxi.csv' with csv header
COPY 2846722
Time: 10042.129 ms (00:10.042)

copy nyc_duplicate from '/tmp/nyc_taxi.csv' with csv header;
COPY 2846722
Time: 8422.503 ms (00:08.423)

> 
>          hp
> 

-- 
Adrian Klaver
[email protected]







^ permalink  raw  reply  [nested|flat] 9+ messages in thread

* Re: Faster data load
@ 2024-09-09 07:25  Dominique Devienne <[email protected]>
  parent: Adrian Klaver <[email protected]>
  0 siblings, 0 replies; 9+ messages in thread

From: Dominique Devienne @ 2024-09-09 07:25 UTC (permalink / raw)
  To: Adrian Klaver <[email protected]>; +Cc: [email protected]

On Sun, Sep 8, 2024 at 8:27 PM Adrian Klaver <[email protected]> wrote:
> >  simple INSERT ... SELECT was more than twice as fast as 8 parallel COPY
> > operations (and about 8 times as fast as a single COPY).
>
> Yeah they seem to have changed a great deal. Though you are correct in
> saying COPY is not faster then INSERT..SELECT

Interesting. Thanks for sharing. Although to be fair, that adds CSV parsing
to the mix, something INSERT-FROM-SELECT does not have to do,
skewing the results a little maybe. Comparing against COPY BINARY
would be fairer, but less practical I guess. --DD






^ permalink  raw  reply  [nested|flat] 9+ messages in thread

end of thread, other threads:[~2024-09-09 07:25 UTC | newest]

Thread overview: 9+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2024-09-05 20:14 Faster data load Lok P <[email protected]>
2024-09-05 21:45 ` Ron Johnson <[email protected]>
2024-09-05 22:21 ` Jeff Ross <[email protected]>
2024-09-06 03:49 ` Muhammad Usman Khan <[email protected]>
2024-09-06 04:42   ` Lok P <[email protected]>
2024-09-06 05:12     ` Ron Johnson <[email protected]>
2024-09-08 17:45 ` Peter J. Holzer <[email protected]>
2024-09-08 18:26   ` Adrian Klaver <[email protected]>
2024-09-09 07:25     ` Dominique Devienne <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox