public inbox for [email protected]help / color / mirror / Atom feed
Re: Moving delta data faster 5+ messages / 2 participants [nested] [flat]
* Re: Moving delta data faster @ 2024-04-06 15:47 yudhi s <[email protected]> 0 siblings, 1 reply; 5+ messages in thread From: yudhi s @ 2024-04-06 15:47 UTC (permalink / raw) To: veem v <[email protected]>; Adrian Klaver <[email protected]>; Greg Sabino Mullane <[email protected]>; pgsql-general <[email protected]> Thank you Adrian, Greg and Veem. I tried writing a small routine to see how the performance differs in these four approaches i.e. Upsert VS traditional update+insert VS Merge vs Truncate+load. Initially I was thinking Upsert will perform the same as Merge as the logic looks similar but it seems it's the worst performing among all, not sure why , yet to know the reason though. Truncate+ load seems to be the best performing among all. Hope i am doing it correctly. Please correct me if I'm wrong. UPSERT approach execution time: *00:00:20.921343* UPSERT approach rows: 1000000 insert/update approach execution time: *00:00:15.53612* insert/update approach update rows : 500000 insert/update approach Insert rows: 500000 MERGE approach execution time: *00:00:14.884623* MERGE approach rows: 1000000 truncate load approach execution time:* 00:00:07.428826* truncate load rows: 1000000 ********* Routine ******** **************** UPSERT Testcase *********** drop table source_tab; drop table target_tab; CREATE TABLE source_tab ( id SERIAL PRIMARY KEY, column1 VARCHAR(100), column2 VARCHAR(100) ); -- Create target table CREATE TABLE target_tab ( id SERIAL PRIMARY KEY, column1 VARCHAR(100), column2 VARCHAR(100) ); INSERT INTO source_tab (column1, column2) SELECT 'Value ' || i, 'Value ' || (i * 2) FROM generate_series(1, 1000000) AS i; INSERT INTO target_tab (column1, column2) SELECT 'Value ' || i, 'Value ' || (i * 2) FROM generate_series(1, 500000) AS i; DO $$ DECLARE start_time timestamp; end_time timestamp; rows_inserted integer:=0; rows_updated integer:=0; rows_upserted integer:=0; rows_merged integer:=0; BEGIN -- Measure performance of UPSERT start_time := clock_timestamp(); INSERT INTO target_tab (id, column1, column2) SELECT id, column1, column2 FROM source_tab ON CONFLICT (id) DO UPDATE SET column1 = EXCLUDED.column1, column2 = EXCLUDED.column2; get diagnostics rows_upserted=row_count; end_time := clock_timestamp(); RAISE NOTICE 'UPSERT approach execution time: %', end_time - start_time; RAISE NOTICE 'UPSERT approach rows: %', rows_upserted; rollback; END $$; **************** Traditional Insert+update Testcase *********** drop table source_tab; drop table target_tab; CREATE TABLE source_tab ( id SERIAL PRIMARY KEY, column1 VARCHAR(100), column2 VARCHAR(100) ); -- Create target table CREATE TABLE target_tab ( id SERIAL PRIMARY KEY, column1 VARCHAR(100), column2 VARCHAR(100) ); INSERT INTO source_tab (column1, column2) SELECT 'Value ' || i, 'Value ' || (i * 2) FROM generate_series(1, 1000000) AS i; INSERT INTO target_tab (column1, column2) SELECT 'Value ' || i, 'Value ' || (i * 2) FROM generate_series(1, 500000) AS i; DO $$ DECLARE start_time timestamp; end_time timestamp; rows_inserted integer:=0; rows_updated integer:=0; rows_upserted integer:=0; rows_merged integer:=0; BEGIN -- Measure performance of insert/update approach start_time := clock_timestamp(); -- Update existing records UPDATE target_tab AS t SET column1 = s.column1, column2 = s.column2 FROM source_tab AS s WHERE t.id = s.id; get diagnostics rows_updated=row_count; -- Insert new records INSERT INTO target_tab (id, column1, column2) SELECT s.id, s.column1, s.column2 FROM source_tab AS s LEFT JOIN target_tab AS t ON s.id = t.id WHERE t.id IS NULL; get diagnostics rows_inserted=row_count; end_time := clock_timestamp(); RAISE NOTICE 'insert/update approach execution time: %', end_time - start_time; RAISE NOTICE 'insert/update approach update rows : %', rows_updated; RAISE NOTICE 'insert/update approach Insert rows: %', rows_inserted; rollback; END $$; **************** MERGE Testcase *********** drop table source_tab; drop table target_tab; CREATE TABLE source_tab ( id SERIAL PRIMARY KEY, column1 VARCHAR(100), column2 VARCHAR(100) ); -- Create target table CREATE TABLE target_tab ( id SERIAL PRIMARY KEY, column1 VARCHAR(100), column2 VARCHAR(100) ); INSERT INTO source_tab (column1, column2) SELECT 'Value ' || i, 'Value ' || (i * 2) FROM generate_series(1, 1000000) AS i; INSERT INTO target_tab (column1, column2) SELECT 'Value ' || i, 'Value ' || (i * 2) FROM generate_series(1, 500000) AS i; DO $$ DECLARE start_time timestamp; end_time timestamp; rows_inserted integer:=0; rows_updated integer:=0; rows_upserted integer:=0; rows_merged integer:=0; begin start_time := clock_timestamp(); merge into target_tab t using source_tab s on t. id = s. id when matched then update set column1 = s.column1, column2 = s.column2 when not matched then insert values (id, column1, column2); get diagnostics rows_merged=row_count; end_time := clock_timestamp(); RAISE NOTICE 'MERGE approach execution time: %', end_time - start_time; RAISE NOTICE 'MERGE approach rows: %', rows_merged; rollback; END $$; **************** Truncate+load Testcase *********** drop table source_tab; drop table target_tab; CREATE TABLE source_tab ( id SERIAL PRIMARY KEY, column1 VARCHAR(100), column2 VARCHAR(100) ); -- Create target table CREATE TABLE target_tab ( id SERIAL PRIMARY KEY, column1 VARCHAR(100), column2 VARCHAR(100) ); INSERT INTO source_tab (column1, column2) SELECT 'Value ' || i, 'Value ' || (i * 2) FROM generate_series(1, 1000000) AS i; INSERT INTO target_tab (column1, column2) SELECT 'Value ' || i, 'Value ' || (i * 2) FROM generate_series(1, 500000) AS i; DO $$ DECLARE start_time timestamp; end_time timestamp; rows_inserted integer:=0; rows_updated integer:=0; rows_upserted integer:=0; rows_truncate_loaded integer:=0; begin start_time := clock_timestamp(); truncate table target_tab; INSERT INTO target_tab (id, column1, column2) SELECT s.id, s.column1, s.column2 FROM source_tab AS s; get diagnostics rows_truncate_loaded=row_count; end_time := clock_timestamp(); RAISE NOTICE 'truncate load approach execution time: %', end_time - start_time; RAISE NOTICE 'truncate load rows: %', rows_truncate_loaded; rollback; END $$; ^ permalink raw reply [nested|flat] 5+ messages in thread
* Re: Moving delta data faster @ 2024-04-06 16:55 Adrian Klaver <[email protected]> parent: yudhi s <[email protected]> 0 siblings, 1 reply; 5+ messages in thread From: Adrian Klaver @ 2024-04-06 16:55 UTC (permalink / raw) To: yudhi s <[email protected]>; veem v <[email protected]>; Greg Sabino Mullane <[email protected]>; pgsql-general <[email protected]> On 4/6/24 08:47, yudhi s wrote: > Thank you Adrian, Greg and Veem. > > I tried writing a small routine to see how the performance differs in > these four approaches i.e. Upsert VS traditional update+insert VS Merge > vs Truncate+load. > > Initially I was thinking Upsert will perform the same as Merge as the > logic looks similar but it seems it's the worst performing among all, > not sure why , yet to know the reason though. Truncate+ load seems to be > the best performing among all. Hope i am doing it correctly. Please > correct me if I'm wrong. Your original problem description was: "Then subsequently these rows will be inserted/updated based on the delta number of rows that got inserted/updated in the source database. In some cases these changed data can flow multiple times per day to the downstream i.e. postgres database and in other cases once daily." If the above is not a hard rule, then yes up to some point just replacing the data in mass would be the simplest/fastest method. You could cut a step out by doing something like TRUNCATE target_tab and then COPY target_tab FROM 'source.csv' bypassing the INSERT INTO source_tab. -- Adrian Klaver [email protected] ^ permalink raw reply [nested|flat] 5+ messages in thread
* Re: Moving delta data faster @ 2024-04-06 20:04 yudhi s <[email protected]> parent: Adrian Klaver <[email protected]> 0 siblings, 1 reply; 5+ messages in thread From: yudhi s @ 2024-04-06 20:04 UTC (permalink / raw) To: Adrian Klaver <[email protected]>; +Cc: veem v <[email protected]>; Greg Sabino Mullane <[email protected]>; pgsql-general <[email protected]> On Sat, Apr 6, 2024 at 10:25 PM Adrian Klaver <[email protected]> wrote: > > Your original problem description was: > > "Then subsequently these rows will be inserted/updated based on the > delta number of rows that got inserted/updated in the source database. > In some cases these changed data can flow multiple times per day to the > downstream i.e. postgres database and in other cases once daily." > > If the above is not a hard rule, then yes up to some point just > replacing the data in mass would be the simplest/fastest method. You > could cut a step out by doing something like TRUNCATE target_tab and > then COPY target_tab FROM 'source.csv' bypassing the INSERT INTO > source_tab. > > Yes, actually i didn't realize that truncate table transactional/online here in postgres. In other databases like Oracle its downtime for the read queries on the target table, as data will be vanished from the target table post truncate(until the data load happens) and those are auto commit. Thanks Veem for sharing that option. I also think that truncate will be faster if the changes/delta is large , but if its handful of rows like <5%of the rows in the table then Upsert/Merge will be better performant. And also the down side of the truncate option is, it does ask to bring/export all the data from source to the S3 file which may take longer as compared to bringing just the delta records. Correct me if I'm wrong. However I am still not able to understand why the upsert is less performant than merge, could you throw some light on this please? ^ permalink raw reply [nested|flat] 5+ messages in thread
* Re: Moving delta data faster @ 2024-04-06 20:55 Adrian Klaver <[email protected]> parent: yudhi s <[email protected]> 0 siblings, 1 reply; 5+ messages in thread From: Adrian Klaver @ 2024-04-06 20:55 UTC (permalink / raw) To: yudhi s <[email protected]>; +Cc: veem v <[email protected]>; Greg Sabino Mullane <[email protected]>; pgsql-general <[email protected]> On 4/6/24 13:04, yudhi s wrote: > > On Sat, Apr 6, 2024 at 10:25 PM Adrian Klaver <[email protected] > <mailto:[email protected]>> wrote: > > > Your original problem description was: > > "Then subsequently these rows will be inserted/updated based on the > delta number of rows that got inserted/updated in the source database. > In some cases these changed data can flow multiple times per day to the > downstream i.e. postgres database and in other cases once daily." > > If the above is not a hard rule, then yes up to some point just > replacing the data in mass would be the simplest/fastest method. You > could cut a step out by doing something like TRUNCATE target_tab and > then COPY target_tab FROM 'source.csv' bypassing the INSERT INTO > source_tab. > > Yes, actually i didn't realize that truncate table transactional/online > here in postgres. In other databases like Oracle its downtime for the > read queries on the target table, as data will be vanished from the > target table post truncate(until the data load happens) and those are > auto commit. Thanks Veem for sharing that option. > > I also think that truncate will be faster if the changes/delta is > large , but if its handful of rows like <5%of the rows in the table then > Upsert/Merge will be better performant. And also the down side of the > truncate option is, it does ask to bring/export all the data from > source to the S3 file which may take longer as compared to bringing just > the delta records. Correct me if I'm wrong. Since you still have not specified how the data is stored in S3 and how you propose to move them into Postgres I can't really answer. > > However I am still not able to understand why the upsert is less > performant than merge, could you throw some light on this please? > I have no idea how this works in the code, but my suspicion is it is due to the following: https://www.postgresql.org/docs/current/sql-insert.html#SQL-ON-CONFLICT "The optional ON CONFLICT clause specifies an alternative action to raising a unique violation or exclusion constraint violation error. For each individual row proposed for insertion, either the insertion proceeds, or, if an arbiter constraint or index specified by conflict_target is violated, the alternative conflict_action is taken. ON CONFLICT DO NOTHING simply avoids inserting a row as its alternative action. ON CONFLICT DO UPDATE updates the existing row that conflicts with the row proposed for insertion as its alternative action." vs this: "First, the MERGE command performs a join from data_source to target_table_name producing zero or more candidate change rows. For each candidate change row, the status of MATCHED or NOT MATCHED is set just once, after which WHEN clauses are evaluated in the order specified. For each candidate change row, the first clause to evaluate as true is executed. No more than one WHEN clause is executed for any candidate change row." Where ON CONFLICT attempts the INSERT then on failure does the UPDATE for the ON CONFLICT DO UPDATE case. MERGE on the hand evaluates based on the join condition(ON tbl1.fld =tbl2.fld) and then based on MATCH/NOT MATCHED takes the appropriate action for the first WHEN match. In other words it goes directly to the appropriate action. -- Adrian Klaver [email protected] ^ permalink raw reply [nested|flat] 5+ messages in thread
* Re: Moving delta data faster @ 2024-04-09 16:36 yudhi s <[email protected]> parent: Adrian Klaver <[email protected]> 0 siblings, 0 replies; 5+ messages in thread From: yudhi s @ 2024-04-09 16:36 UTC (permalink / raw) To: Adrian Klaver <[email protected]>; +Cc: veem v <[email protected]>; Greg Sabino Mullane <[email protected]>; pgsql-general <[email protected]> On Sun, Apr 7, 2024 at 2:25 AM Adrian Klaver <[email protected]> wrote: > > I have no idea how this works in the code, but my suspicion is it is due > to the following: > > https://www.postgresql.org/docs/current/sql-insert.html#SQL-ON-CONFLICT > > "The optional ON CONFLICT clause specifies an alternative action to > raising a unique violation or exclusion constraint violation error. For > each individual row proposed for insertion, either the insertion > proceeds, or, if an arbiter constraint or index specified by > conflict_target is violated, the alternative conflict_action is taken. > ON CONFLICT DO NOTHING simply avoids inserting a row as its alternative > action. ON CONFLICT DO UPDATE updates the existing row that conflicts > with the row proposed for insertion as its alternative action." > > vs this: > > "First, the MERGE command performs a join from data_source to > target_table_name producing zero or more candidate change rows. For each > candidate change row, the status of MATCHED or NOT MATCHED is set just > once, after which WHEN clauses are evaluated in the order specified. For > each candidate change row, the first clause to evaluate as true is > executed. No more than one WHEN clause is executed for any candidate > change row." > > Where ON CONFLICT attempts the INSERT then on failure does the UPDATE > for the ON CONFLICT DO UPDATE case. MERGE on the hand evaluates based on > the join condition(ON tbl1.fld =tbl2.fld) and then based on MATCH/NOT > MATCHED takes the appropriate action for the first WHEN match. In other > words it goes directly to the appropriate action. > > Thank you Adrian. I think you are spoton on the cause of upsert becoming slower than Merge. Below is the explain plan I captured for both the operations and it looks like even the planning time is small for the UPSERT, as because it operates on the constraint i.e the Update will wait for all the failure records from the INSERT and thus it takes longer. The Merge seems to be evaluated on the Joins i.e it is directly able to get the set of rows which has to be Updated rather waiting for the INSERT to make it fail based on the PK constraint. ************** Explain plan for UPSERT ***************** Insert on public.target_tab (cost=0.00..17353.00 rows=0 width=0) (actual time=19957.569..19957.570 rows=0 loops=1) Conflict Resolution: UPDATE Conflict Arbiter Indexes: target_tab_pkey Tuples Inserted: 500000 Conflicting Tuples: 500000 Buffers: shared hit=8545735 written=10094 -> Seq Scan on public.source_tab (cost=0.00..17353.00 rows=1000000 width=29) (actual time=0.006..208.306 rows=1000000 loops=1) Output: source_tab.id, source_tab.column1, source_tab.column2 Buffers: shared hit=7353 Settings: effective_cache_size = '10475192kB', maintenance_io_concurrency = '1', max_parallel_workers = '32', max_parallel_workers_per_gather = '4', search_path = 'public, public, "$user"', temp_buffers = '16MB', work_mem = '8MB' Query Identifier: -1356019529835809419 Planning: Buffers: shared hit=41 Planning Time: 0.199 ms Execution Time: 19959.261 ms ************** Explain plan for Merge ***************** Merge on public.target_tab t (cost=17368.00..53460.01 rows=0 width=0) (actual time=14209.966..14209.968 rows=0 loops=1) Tuples: inserted=500000 updated=500000 Buffers: shared hit=5040097 written=10460, temp read=4143 written=4143 I/O Timings: temp read=26.746 write=68.596 -> Hash Left Join (cost=17368.00..53460.01 rows=1000000 width=35) (actual time=179.233..1332.264 rows=1000000 loops=1) Output: t.ctid, s.column1, s.column2, s.id Inner Unique: true Hash Cond: (s.id = t.id) Buffers: shared hit=11029, temp read=4143 written=4143 I/O Timings: temp read=26.746 write=68.596 -> Seq Scan on public.source_tab s (cost=0.00..17353.00 rows=1000000 width=29) (actual time=0.008..268.506 rows=1000000 loops=1) Output: s.column1, s.column2, s.id Buffers: shared hit=7353 -> Hash (cost=8676.00..8676.00 rows=500000 width=10) (actual time=178.101..178.102 rows=500000 loops=1) Output: t.ctid, t.id Buckets: 524288 Batches: 2 Memory Usage: 14824kB Buffers: shared hit=3676, temp written=977 I/O Timings: temp write=5.904 -> Seq Scan on public.target_tab t (cost=0.00..8676.00 rows=500000 width=10) (actual time=0.007..66.441 rows=500000 loops=1) Output: t.ctid, t.id Buffers: shared hit=3676 Settings: effective_cache_size = '10475192kB', maintenance_io_concurrency = '1', max_parallel_workers = '32', max_parallel_workers_per_gather = '4', search_path = 'public, public, "$user"', temp_buffers = '16MB', work_mem = '8MB' Query Identifier: -2297080081674771467 Planning: Buffers: shared hit=85 Planning Time: 0.466 ms Execution Time: 14212.061 ms ^ permalink raw reply [nested|flat] 5+ messages in thread
end of thread, other threads:[~2024-04-09 16:36 UTC | newest] Thread overview: 5+ messages (download: mbox mbox.gz follow: Atom feed) -- links below jump to the message on this page -- 2024-04-06 15:47 Re: Moving delta data faster yudhi s <[email protected]> 2024-04-06 16:55 ` Adrian Klaver <[email protected]> 2024-04-06 20:04 ` yudhi s <[email protected]> 2024-04-06 20:55 ` Adrian Klaver <[email protected]> 2024-04-09 16:36 ` yudhi s <[email protected]>
This inbox is served by agora; see mirroring instructions for how to clone and mirror all data and code used for this inbox