Re: Moving delta data faster

public inbox for [email protected]  
help / color / mirror / Atom feed

Re: Moving delta data faster
5+ messages / 2 participants
[nested] [flat]

* Re: Moving delta data faster
@ 2024-04-06 15:47  yudhi s <[email protected]>
  0 siblings, 1 reply; 5+ messages in thread

From: yudhi s @ 2024-04-06 15:47 UTC (permalink / raw)
  To: veem v <[email protected]>; Adrian Klaver <[email protected]>; Greg Sabino Mullane <[email protected]>; pgsql-general <[email protected]>

Thank you Adrian, Greg and Veem.

I tried writing a small routine to see how the performance differs in these
four approaches i.e. Upsert VS traditional update+insert VS Merge vs
Truncate+load.

Initially I was thinking Upsert will perform the same as Merge as the logic
looks similar but it seems it's the worst performing among all, not sure
why , yet to know the reason though. Truncate+ load seems to be the best
performing among all. Hope i am doing it correctly. Please correct me if
I'm wrong.

UPSERT approach execution time: *00:00:20.921343*
UPSERT approach rows: 1000000

insert/update approach execution time: *00:00:15.53612*
insert/update approach update rows : 500000
insert/update approach Insert rows: 500000

MERGE approach execution time: *00:00:14.884623*
MERGE approach rows: 1000000

truncate load approach execution time:* 00:00:07.428826*
truncate load rows: 1000000

********* Routine ********

**************** UPSERT Testcase ***********
drop table source_tab;
drop table target_tab;

CREATE TABLE source_tab (
id SERIAL PRIMARY KEY,
column1 VARCHAR(100),
column2 VARCHAR(100)
);

-- Create target table
CREATE TABLE target_tab (
id SERIAL PRIMARY KEY,
column1 VARCHAR(100),
column2 VARCHAR(100)
);

INSERT INTO source_tab (column1, column2)
SELECT
'Value ' || i,
'Value ' || (i * 2)
FROM generate_series(1, 1000000) AS i;

INSERT INTO target_tab (column1, column2)
SELECT
'Value ' || i,
'Value ' || (i * 2)
FROM generate_series(1, 500000) AS i;

DO $$
DECLARE
start_time timestamp;
end_time timestamp;
rows_inserted integer:=0;
rows_updated integer:=0;
rows_upserted integer:=0;
rows_merged integer:=0;
BEGIN
-- Measure performance of UPSERT
start_time := clock_timestamp();
INSERT INTO target_tab (id, column1, column2)
SELECT id, column1, column2
FROM source_tab
ON CONFLICT (id) DO UPDATE
SET
column1 = EXCLUDED.column1,
column2 = EXCLUDED.column2;
get diagnostics rows_upserted=row_count;
end_time := clock_timestamp();
RAISE NOTICE 'UPSERT approach execution time: %', end_time - start_time;
RAISE NOTICE 'UPSERT approach rows: %', rows_upserted;

rollback;
END $$;


**************** Traditional Insert+update Testcase ***********
drop table source_tab;
drop table target_tab;

CREATE TABLE source_tab (
id SERIAL PRIMARY KEY,
column1 VARCHAR(100),
column2 VARCHAR(100)
);

-- Create target table
CREATE TABLE target_tab (
id SERIAL PRIMARY KEY,
column1 VARCHAR(100),
column2 VARCHAR(100)
);

INSERT INTO source_tab (column1, column2)
SELECT
'Value ' || i,
'Value ' || (i * 2)
FROM generate_series(1, 1000000) AS i;

INSERT INTO target_tab (column1, column2)
SELECT
'Value ' || i,
'Value ' || (i * 2)
FROM generate_series(1, 500000) AS i;

DO $$
DECLARE
start_time timestamp;
end_time timestamp;
rows_inserted integer:=0;
rows_updated integer:=0;
rows_upserted integer:=0;
rows_merged integer:=0;
BEGIN
-- Measure performance of insert/update approach
start_time := clock_timestamp();
-- Update existing records
UPDATE target_tab AS t
SET
column1 = s.column1,
column2 = s.column2
FROM source_tab AS s
WHERE t.id = s.id;
get diagnostics rows_updated=row_count;

-- Insert new records
INSERT INTO target_tab (id, column1, column2)
SELECT s.id, s.column1, s.column2
FROM source_tab AS s
LEFT JOIN target_tab AS t ON s.id = t.id
WHERE t.id IS NULL;
get diagnostics rows_inserted=row_count;

end_time := clock_timestamp();
RAISE NOTICE 'insert/update approach execution time: %', end_time -
start_time;
RAISE NOTICE 'insert/update approach update rows : %', rows_updated;
RAISE NOTICE 'insert/update approach Insert rows: %', rows_inserted;

rollback;
END $$;


**************** MERGE Testcase ***********
drop table source_tab;
drop table target_tab;

CREATE TABLE source_tab (
id SERIAL PRIMARY KEY,
column1 VARCHAR(100),
column2 VARCHAR(100)
);

-- Create target table
CREATE TABLE target_tab (
id SERIAL PRIMARY KEY,
column1 VARCHAR(100),
column2 VARCHAR(100)
);

INSERT INTO source_tab (column1, column2)
SELECT
'Value ' || i,
'Value ' || (i * 2)
FROM generate_series(1, 1000000) AS i;

INSERT INTO target_tab (column1, column2)
SELECT
'Value ' || i,
'Value ' || (i * 2)
FROM generate_series(1, 500000) AS i;

DO $$
DECLARE
start_time timestamp;
end_time timestamp;
rows_inserted integer:=0;
rows_updated integer:=0;
rows_upserted integer:=0;
rows_merged integer:=0;
begin

start_time := clock_timestamp();

merge into
target_tab t
using source_tab s on
t. id = s. id
when matched then
update
set column1 = s.column1,
column2 = s.column2
when not matched then
insert
values (id, column1, column2);
get diagnostics rows_merged=row_count;

end_time := clock_timestamp();
RAISE NOTICE 'MERGE approach execution time: %', end_time - start_time;
RAISE NOTICE 'MERGE approach rows: %', rows_merged;

rollback;
END $$;


**************** Truncate+load Testcase ***********
drop table source_tab;
drop table target_tab;

CREATE TABLE source_tab (
id SERIAL PRIMARY KEY,
column1 VARCHAR(100),
column2 VARCHAR(100)
);

-- Create target table
CREATE TABLE target_tab (
id SERIAL PRIMARY KEY,
column1 VARCHAR(100),
column2 VARCHAR(100)
);

INSERT INTO source_tab (column1, column2)
SELECT
'Value ' || i,
'Value ' || (i * 2)
FROM generate_series(1, 1000000) AS i;

INSERT INTO target_tab (column1, column2)
SELECT
'Value ' || i,
'Value ' || (i * 2)
FROM generate_series(1, 500000) AS i;

DO $$
DECLARE
start_time timestamp;
end_time timestamp;
rows_inserted integer:=0;
rows_updated integer:=0;
rows_upserted integer:=0;
rows_truncate_loaded integer:=0;
begin

start_time := clock_timestamp();

truncate table target_tab;

INSERT INTO target_tab (id, column1, column2)
SELECT s.id, s.column1, s.column2
FROM source_tab AS s;
get diagnostics rows_truncate_loaded=row_count;

end_time := clock_timestamp();
RAISE NOTICE 'truncate load approach execution time: %', end_time -
start_time;
RAISE NOTICE 'truncate load rows: %', rows_truncate_loaded;

rollback;
END $$;


^ permalink  raw  reply  [nested|flat] 5+ messages in thread

* Re: Moving delta data faster
@ 2024-04-06 16:55  Adrian Klaver <[email protected]>
  parent: yudhi s <[email protected]>
  0 siblings, 1 reply; 5+ messages in thread

From: Adrian Klaver @ 2024-04-06 16:55 UTC (permalink / raw)
  To: yudhi s <[email protected]>; veem v <[email protected]>; Greg Sabino Mullane <[email protected]>; pgsql-general <[email protected]>

On 4/6/24 08:47, yudhi s wrote:
> Thank you Adrian, Greg and Veem.
> 
> I tried writing a small routine to see how the performance differs in 
> these four approaches i.e. Upsert VS traditional update+insert VS Merge 
> vs Truncate+load.
> 
> Initially I was thinking Upsert will perform the same as Merge as the 
> logic looks similar but it seems it's the worst performing among all, 
> not sure why , yet to know the reason though. Truncate+ load seems to be 
> the best performing among all. Hope i am doing it correctly. Please 
> correct me if I'm wrong.

Your original problem description was:

"Then subsequently these rows will be inserted/updated based on the 
delta number of rows that got inserted/updated in the source database. 
In some cases these changed data can flow multiple times per day to the 
downstream i.e. postgres database and in other cases once daily."

If the above is not a hard rule, then yes up to some point just 
replacing the data in mass would be the simplest/fastest method. You 
could cut a step out by doing something like TRUNCATE target_tab and 
then COPY target_tab FROM 'source.csv' bypassing the INSERT INTO source_tab.

-- 
Adrian Klaver
[email protected]

^ permalink  raw  reply  [nested|flat] 5+ messages in thread

* Re: Moving delta data faster
@ 2024-04-06 20:04  yudhi s <[email protected]>
  parent: Adrian Klaver <[email protected]>
  0 siblings, 1 reply; 5+ messages in thread

From: yudhi s @ 2024-04-06 20:04 UTC (permalink / raw)
  To: Adrian Klaver <[email protected]>; +Cc: veem v <[email protected]>; Greg Sabino Mullane <[email protected]>; pgsql-general <[email protected]>

On Sat, Apr 6, 2024 at 10:25 PM Adrian Klaver <[email protected]>
wrote:

>
> Your original problem description was:
>
> "Then subsequently these rows will be inserted/updated based on the
> delta number of rows that got inserted/updated in the source database.
> In some cases these changed data can flow multiple times per day to the
> downstream i.e. postgres database and in other cases once daily."
>
> If the above is not a hard rule, then yes up to some point just
> replacing the data in mass would be the simplest/fastest method. You
> could cut a step out by doing something like TRUNCATE target_tab and
> then COPY target_tab FROM 'source.csv' bypassing the INSERT INTO
> source_tab.
>
> Yes, actually i didn't realize that truncate table transactional/online
here in postgres. In other databases like Oracle its downtime for the read
queries on the target table, as data will be vanished from the target table
post truncate(until the data load happens) and those are auto commit.
Thanks Veem for sharing that  option.

 I also think that truncate will be faster if the changes/delta is large ,
but if its handful of rows like <5%of the rows in the table then
Upsert/Merge will be better performant. And also the down side of the
truncate option is,  it does ask to bring/export all the data from source
to the S3 file which may take longer as compared to bringing just the delta
records. Correct me if I'm wrong.

However I am still not able to understand why the upsert is less performant
than merge, could you throw some light on this please?


^ permalink  raw  reply  [nested|flat] 5+ messages in thread

* Re: Moving delta data faster
@ 2024-04-06 20:55  Adrian Klaver <[email protected]>
  parent: yudhi s <[email protected]>
  0 siblings, 1 reply; 5+ messages in thread

From: Adrian Klaver @ 2024-04-06 20:55 UTC (permalink / raw)
  To: yudhi s <[email protected]>; +Cc: veem v <[email protected]>; Greg Sabino Mullane <[email protected]>; pgsql-general <[email protected]>

On 4/6/24 13:04, yudhi s wrote:
> 
> On Sat, Apr 6, 2024 at 10:25 PM Adrian Klaver <[email protected] 
> <mailto:[email protected]>> wrote:
> 
> 
>     Your original problem description was:
> 
>     "Then subsequently these rows will be inserted/updated based on the
>     delta number of rows that got inserted/updated in the source database.
>     In some cases these changed data can flow multiple times per day to the
>     downstream i.e. postgres database and in other cases once daily."
> 
>     If the above is not a hard rule, then yes up to some point just
>     replacing the data in mass would be the simplest/fastest method. You
>     could cut a step out by doing something like TRUNCATE target_tab and
>     then COPY target_tab FROM 'source.csv' bypassing the INSERT INTO
>     source_tab.
> 
> Yes, actually i didn't realize that truncate table transactional/online 
> here in postgres. In other databases like Oracle its downtime for the 
> read queries on the target table, as data will be vanished from the 
> target table post truncate(until the data load happens) and those are 
> auto commit. Thanks Veem for sharing that  option.
> 
>   I also think that truncate will be faster if the changes/delta is 
> large , but if its handful of rows like <5%of the rows in the table then 
> Upsert/Merge will be better performant. And also the down side of the 
> truncate option is,  it does ask to bring/export all the data from 
> source to the S3 file which may take longer as compared to bringing just 
> the delta records. Correct me if I'm wrong.

Since you still have not specified how the data is stored in S3 and how 
you propose to move them into Postgres I can't really answer.

> 
> However I am still not able to understand why the upsert is less 
> performant than merge, could you throw some light on this please?
> 

I have no idea how this works in the code, but my suspicion is it is due
to the following:

https://www.postgresql.org/docs/current/sql-insert.html#SQL-ON-CONFLICT

"The optional ON CONFLICT clause specifies an alternative action to 
raising a unique violation or exclusion constraint violation error. For 
each individual row proposed for insertion, either the insertion 
proceeds, or, if an arbiter constraint or index specified by 
conflict_target is violated, the alternative conflict_action is taken. 
ON CONFLICT DO NOTHING simply avoids inserting a row as its alternative 
action. ON CONFLICT DO UPDATE updates the existing row that conflicts 
with the row proposed for insertion as its alternative action."

vs this:

"First, the MERGE command performs a join from data_source to 
target_table_name producing zero or more candidate change rows. For each 
candidate change row, the status of MATCHED or NOT MATCHED is set just 
once, after which WHEN clauses are evaluated in the order specified. For 
each candidate change row, the first clause to evaluate as true is 
executed. No more than one WHEN clause is executed for any candidate 
change row."

Where ON CONFLICT attempts the INSERT then on failure does the UPDATE 
for the ON CONFLICT DO UPDATE case. MERGE on the hand evaluates based on 
the join condition(ON tbl1.fld =tbl2.fld) and then based on MATCH/NOT 
MATCHED takes the appropriate action for the first WHEN match. In other 
words it goes directly to the appropriate action.

-- 
Adrian Klaver
[email protected]

^ permalink  raw  reply  [nested|flat] 5+ messages in thread

* Re: Moving delta data faster
@ 2024-04-09 16:36  yudhi s <[email protected]>
  parent: Adrian Klaver <[email protected]>
  0 siblings, 0 replies; 5+ messages in thread

From: yudhi s @ 2024-04-09 16:36 UTC (permalink / raw)
  To: Adrian Klaver <[email protected]>; +Cc: veem v <[email protected]>; Greg Sabino Mullane <[email protected]>; pgsql-general <[email protected]>

On Sun, Apr 7, 2024 at 2:25 AM Adrian Klaver <[email protected]>
wrote:

>
> I have no idea how this works in the code, but my suspicion is it is due
> to the following:
>
> https://www.postgresql.org/docs/current/sql-insert.html#SQL-ON-CONFLICT
>
> "The optional ON CONFLICT clause specifies an alternative action to
> raising a unique violation or exclusion constraint violation error. For
> each individual row proposed for insertion, either the insertion
> proceeds, or, if an arbiter constraint or index specified by
> conflict_target is violated, the alternative conflict_action is taken.
> ON CONFLICT DO NOTHING simply avoids inserting a row as its alternative
> action. ON CONFLICT DO UPDATE updates the existing row that conflicts
> with the row proposed for insertion as its alternative action."
>
> vs this:
>
> "First, the MERGE command performs a join from data_source to
> target_table_name producing zero or more candidate change rows. For each
> candidate change row, the status of MATCHED or NOT MATCHED is set just
> once, after which WHEN clauses are evaluated in the order specified. For
> each candidate change row, the first clause to evaluate as true is
> executed. No more than one WHEN clause is executed for any candidate
> change row."
>
> Where ON CONFLICT attempts the INSERT then on failure does the UPDATE
> for the ON CONFLICT DO UPDATE case. MERGE on the hand evaluates based on
> the join condition(ON tbl1.fld =tbl2.fld) and then based on MATCH/NOT
> MATCHED takes the appropriate action for the first WHEN match. In other
> words it goes directly to the appropriate action.
>
>
Thank you Adrian. I think you are spoton on the cause of upsert becoming
slower than Merge. Below is the explain plan I captured for both the
operations and it looks like even the planning time is small for the
UPSERT, as because it operates on the constraint i.e the Update will wait
for all the failure records from the INSERT and thus it takes longer. The
Merge seems to be evaluated on the Joins i.e it is directly able to get the
set of rows which has to be Updated rather waiting for the INSERT to make
it fail based on the PK constraint.

************** Explain plan for UPSERT *****************

Insert on public.target_tab (cost=0.00..17353.00 rows=0 width=0) (actual
time=19957.569..19957.570 rows=0 loops=1)
Conflict Resolution: UPDATE
Conflict Arbiter Indexes: target_tab_pkey
Tuples Inserted: 500000
Conflicting Tuples: 500000
Buffers: shared hit=8545735 written=10094
-> Seq Scan on public.source_tab (cost=0.00..17353.00 rows=1000000
width=29) (actual time=0.006..208.306 rows=1000000 loops=1)
Output: source_tab.id, source_tab.column1, source_tab.column2
Buffers: shared hit=7353
Settings: effective_cache_size = '10475192kB', maintenance_io_concurrency =
'1', max_parallel_workers = '32', max_parallel_workers_per_gather = '4',
search_path = 'public, public, "$user"', temp_buffers = '16MB', work_mem =
'8MB'
Query Identifier: -1356019529835809419
Planning:
Buffers: shared hit=41
Planning Time: 0.199 ms
Execution Time: 19959.261 ms

************** Explain plan for Merge *****************

Merge on public.target_tab t (cost=17368.00..53460.01 rows=0 width=0)
(actual time=14209.966..14209.968 rows=0 loops=1)
Tuples: inserted=500000 updated=500000
Buffers: shared hit=5040097 written=10460, temp read=4143 written=4143
I/O Timings: temp read=26.746 write=68.596
-> Hash Left Join (cost=17368.00..53460.01 rows=1000000 width=35) (actual
time=179.233..1332.264 rows=1000000 loops=1)
Output: t.ctid, s.column1, s.column2, s.id
Inner Unique: true
Hash Cond: (s.id = t.id)
Buffers: shared hit=11029, temp read=4143 written=4143
I/O Timings: temp read=26.746 write=68.596
-> Seq Scan on public.source_tab s (cost=0.00..17353.00 rows=1000000
width=29) (actual time=0.008..268.506 rows=1000000 loops=1)
Output: s.column1, s.column2, s.id
Buffers: shared hit=7353
-> Hash (cost=8676.00..8676.00 rows=500000 width=10) (actual
time=178.101..178.102 rows=500000 loops=1)
Output: t.ctid, t.id
Buckets: 524288 Batches: 2 Memory Usage: 14824kB
Buffers: shared hit=3676, temp written=977
I/O Timings: temp write=5.904
-> Seq Scan on public.target_tab t (cost=0.00..8676.00 rows=500000
width=10) (actual time=0.007..66.441 rows=500000 loops=1)
Output: t.ctid, t.id
Buffers: shared hit=3676
Settings: effective_cache_size = '10475192kB', maintenance_io_concurrency =
'1', max_parallel_workers = '32', max_parallel_workers_per_gather = '4',
search_path = 'public, public, "$user"', temp_buffers = '16MB', work_mem =
'8MB'
Query Identifier: -2297080081674771467
Planning:
Buffers: shared hit=85
Planning Time: 0.466 ms
Execution Time: 14212.061 ms


^ permalink  raw  reply  [nested|flat] 5+ messages in thread

end of thread, other threads:[~2024-04-09 16:36 UTC | newest]

Thread overview: 5+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2024-04-06 15:47 Re: Moving delta data faster yudhi s <[email protected]>
2024-04-06 16:55 ` Adrian Klaver <[email protected]>
2024-04-06 20:04   ` yudhi s <[email protected]>
2024-04-06 20:55     ` Adrian Klaver <[email protected]>
2024-04-09 16:36       ` yudhi s <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox