public inbox for [email protected]  
A way to optimize sql about the last temporary-related row

From: [email protected] @ 2024-06-27 15:20 UTC (permalink / raw)
  To: [email protected]

Hello everyone,
Sorry to bother you, but I have a query that is driving me crazy.

I need to retrieve the most recent valid record (by timestamp) for each 
value of a specific parameter.

First, some details about the setup:
Rocky Linux 8.10 environment, minimal installation (on a KVM VM hosted 
on Fedora 40).
PostgreSQL 16.3, installed following the official PostgreSQL guide.
effective_cache_size = '1000 MB';
shared_buffers = '500 MB';
work_mem = '16MB';
The configuration changes are deliberately minimal so that anyone can 
reproduce the problem.

Table script:
CREATE TABLE test_table
(
  pk_id int NOT NULL,
  integer_field_1 int ,
  integer_field_2 int,
  datetime_field_1 timestamp,
  primary key (pk_id)
);

-- insert 4M records
insert into test_table(pk_id) select generate_series(1,4000000,1);

-- now set some random data, distributed between specific ranges (as in 
-- my production table)
update test_table set
datetime_field_1 = timestamp '2000-01-01 00:00:00' + random() * 
(timestamp '2024-05-31 23:59:59' - timestamp '2000-01-01 00:00:00'),
integer_field_1 = floor(random() * (6-1+1) + 1)::int,
integer_field_2 = floor(random() * (200000-1+1) + 1)::int;


-- indexes
CREATE INDEX idx_test_table_integer_field_1 ON test_table(integer_field_1);
CREATE INDEX xtest_table_datetime_field_1 ON test_table(datetime_field_1 
desc);
CREATE INDEX idx_test_table_integer_field_2 ON test_table(integer_field_2);


--vacuum

vacuum full test_table;



Now the query:
explain (verbose, buffers, analyze)
with last_table_ids as materialized(
   select xx from (
   select LAST_VALUE(pk_id) over (partition by integer_field_2 order by 
datetime_field_1 RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED 
FOLLOWING) xx
   from test_table
   where integer_field_1 = 1
   and datetime_field_1 <= CURRENT_TIMESTAMP
   ) ww group by ww.xx

),
last_row_per_ids as (
   select tt.* from last_table_ids lt
   inner join test_table tt on (tt.pk_id = lt.xx)

)

select * /* or count(*) */ from last_row_per_ids;


This query, on my PC, takes 46 seconds!!!
I was expecting about 2-3 seconds (in line with my other queries on 
this table), but it seems that the xtest_table_datetime_field_1 index is 
not being used.
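
A possible explanation, offered as a sketch rather than a diagnosis: every 
generated datetime_field_1 value is at most 2024-05-31 23:59:59, so the 
condition datetime_field_1 <= CURRENT_TIMESTAMP keeps all rows, and an 
index on that column alone cannot narrow the scan. The counts below, run 
against the test_table defined above, show how selective each filter is:

```sql
-- Compare how many rows each predicate keeps against the table total.
-- With the random data above, rows_before_now should equal total_rows,
-- while rows_field_1_eq_1 should be roughly one sixth of it.
SELECT count(*) FILTER (WHERE datetime_field_1 <= CURRENT_TIMESTAMP) AS rows_before_now,
       count(*) FILTER (WHERE integer_field_1 = 1)                   AS rows_field_1_eq_1,
       count(*)                                                      AS total_rows
FROM test_table;
```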

Do you think there is a way to optimize the query?

Thanks so much for the support,

Agharta








From: David Rowley @ 2024-06-27 16:16 UTC (permalink / raw)
  To: [email protected]; +Cc: PostgreSQL General <[email protected]>

On Fri, 28 Jun 2024, 3:20 am [email protected], <[email protected]>
wrote:

>
> Now the query:
> explain (verbose, buffers, analyze)
> with last_table_ids as materialized(
>    select xx from (
>    select LAST_VALUE(pk_id) over (partition by integer_field_2 order by
> datetime_field_1 RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED
> FOLLOWING) xx
>    from test_table
>    where integer_field_1 = 1
>    and datetime_field_1 <= CURRENT_TIMESTAMP
>    ) ww group by ww.xx
>
> ),
> last_row_per_ids as (
>    select tt.* from last_table_ids lt
>    inner join test_table tt on (tt.pk_id = lt.xx)
>
> )
>
> select * /* or count(*) */ from last_row_per_ids;
>
>
> This query, on my PC, takes 46 seconds!!!
>

(Away from laptop and using my phone)

Something like:

select distinct on (integer_field_2) * from test_table where
integer_field_1 = 1 and datetime_field_1 <= CURRENT_TIMESTAMP order by
integer_field_2,datetime_field_1 desc;

Might run a bit faster.  However if it's slow due to I/O then maybe not
much faster.  Your version took about 5 seconds on my phone and my version
ran in 1.5 seconds.
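
To make the behaviour of DISTINCT ON concrete, here is a small 
self-contained illustration. The column names grp, id and ts and the four 
rows are made up for this example and are not part of the thread's data. 
DISTINCT ON (grp) keeps the first row of each grp group in ORDER BY order, 
so sorting by ts DESC within the group yields the latest row:

```sql
-- Toy data: two groups, two rows each (hypothetical values).
SELECT DISTINCT ON (grp) grp, id, ts
FROM (VALUES
        (1, 10, timestamp '2024-01-01 00:00:00'),
        (1, 11, timestamp '2024-03-01 00:00:00'),
        (2, 20, timestamp '2023-06-01 00:00:00'),
        (2, 21, timestamp '2023-05-01 00:00:00')
     ) AS t(grp, id, ts)
-- The leading ORDER BY expressions must match the DISTINCT ON list;
-- the remaining ones decide which row of each group is kept.
ORDER BY grp, ts DESC;
-- Returns (1, 11, 2024-03-01 00:00:00) and (2, 20, 2023-06-01 00:00:00).
```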

It's difficult for me to check the results match with each query from my
phone. A quick scan of the first 10 or so records looked good.
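
One way to compare the two result sets in full, as a sketch against the 
test_table from the first message: compute both queries as CTEs and count 
the rows that appear on only one side. The CTE names original and 
rewritten are chosen here for illustration. If several rows in the same 
integer_field_2 group share the same latest datetime_field_1, the two 
queries may legitimately pick different rows, so a small non-zero count 
does not necessarily indicate a bug.

```sql
WITH original AS (
    -- The window-function version from the first message.
    SELECT tt.*
    FROM (
        SELECT xx
        FROM (
            SELECT last_value(pk_id) OVER (
                       PARTITION BY integer_field_2
                       ORDER BY datetime_field_1
                       RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                   ) AS xx
            FROM test_table
            WHERE integer_field_1 = 1
              AND datetime_field_1 <= CURRENT_TIMESTAMP
        ) ww
        GROUP BY ww.xx
    ) lt
    JOIN test_table tt ON tt.pk_id = lt.xx
),
rewritten AS (
    -- The DISTINCT ON version suggested above.
    SELECT DISTINCT ON (integer_field_2) *
    FROM test_table
    WHERE integer_field_1 = 1
      AND datetime_field_1 <= CURRENT_TIMESTAMP
    ORDER BY integer_field_2, datetime_field_1 DESC
)
-- Both counts are 0 when the two queries return exactly the same rows.
SELECT 'only in original' AS side, count(*) AS row_count
FROM (SELECT * FROM original EXCEPT SELECT * FROM rewritten) a
UNION ALL
SELECT 'only in rewritten', count(*)
FROM (SELECT * FROM rewritten EXCEPT SELECT * FROM original) b;
```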

If the updated query is still too slow on cold cache then faster disks
might be needed.
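
Before changing hardware, one option that may be worth testing is a 
composite index that matches both the equality filter and the DISTINCT ON 
ordering. This is an untested assumption, not something measured in this 
thread; the index name below is hypothetical. Such an index can supply 
rows already sorted by integer_field_2, datetime_field_1 DESC for 
integer_field_1 = 1, but with SELECT * the planner may still prefer a 
sequential scan plus sort, so the plan should be checked with EXPLAIN:

```sql
-- Hypothetical index: equality column first, then the DISTINCT ON ordering.
CREATE INDEX idx_test_table_f1_f2_dt_desc
    ON test_table (integer_field_1, integer_field_2, datetime_field_1 DESC);

-- Refresh planner statistics after the bulk update and index creation.
ANALYZE test_table;

-- Inspect whether the new index is actually chosen and how much I/O it costs.
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON (integer_field_2) *
FROM test_table
WHERE integer_field_1 = 1
  AND datetime_field_1 <= CURRENT_TIMESTAMP
ORDER BY integer_field_2, datetime_field_1 DESC;
```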

David


From: [email protected] @ 2024-06-28 07:20 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: PostgreSQL General <[email protected]>

HOO-HA! This is HUGE!

Only 2.2 seconds on my data!!!! Amazing!

distinct on (field) followed by "*" is a hidden gem!

Thank you so much and thanks to everyone who helped me!  Thank you very 
much!!

Cheers,

Agharta



