public inbox for [email protected]
help / color / mirror / Atom feedchecking for a NULL date in a partitioned table kills performance
16+ messages / 9 participants
[nested] [flat]
* checking for a NULL date in a partitioned table kills performance
@ 2024-08-22 21:44 Sbob <[email protected]>
2024-08-22 23:06 ` Re: checking for a NULL date in a partitioned table kills performance Rui DeSousa <[email protected]>
2024-08-23 04:40 ` Re: checking for a NULL date in a partitioned table kills performance Muhammad Usman Khan <[email protected]>
2024-08-23 06:45 ` Re: checking for a NULL date in a partitioned table kills performance [email protected]
0 siblings, 3 replies; 16+ messages in thread
From: Sbob @ 2024-08-22 21:44 UTC (permalink / raw)
To: [email protected] <[email protected]>
All;
I am running a select from a partitioned table. The table (and all the
partitions) have an index on contract_date like this:
CREATE INDEX on part_tab (contract_date) where contract_date >
'2022-01-01'::date
The table (including all partitions) has 32million rows
The db server is an aurora postgresql instance with 128GB of ram and 16
vcpu's
The shared buffers is set to 90GB and effective_cache_size is also 90GB
I set default_statistics_target to 1000 and ram a vacuum analyze on the
table
I am selecting a number of columns and specifying this where clause:
WHERE (
(contract_date IS NULL)
OR
(contract_date > '2022-01-01'::date)
)
This takes 15 seconds to run and an explain says it's doing a table scan
on all partitions (the query is not specifying the partition key)
If I change the where clause to look like this:
WHERE (
(contract_date > '2022-01-01'::date)
)
Then it performs index scans on all the partitions and runs in about 600ms
If i leave the where clause off entirely it performs table scans of the
partitions and takes approx 18 seconds to run
I am trying to get the performance to less than 2sec,
I have tried adding indexes on the table and all partitions like this:
CREATE INDEX ON table (contract_date NULLS FIRST) ;
but the performance with the full where clause is the same:
WHERE (
(contract_date IS NULL)
OR
(contract_date > '2022-01-01'::date)
)
runs in 15 seconds and scans all partitions
I also tried indexes i=on the table and all partitions like this:
CREATE INDEX ON table (contract_date) WHERE contract_date IS NULL;
but I get the same result, table scans on all partitions and it runs in
15 seconds
Any help or advice ?
Thanks in advance
^ permalink raw reply [nested|flat] 16+ messages in thread
* Re: checking for a NULL date in a partitioned table kills performance
2024-08-22 21:44 checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
@ 2024-08-22 23:06 ` Rui DeSousa <[email protected]>
2024-08-22 23:26 ` Re: checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
2 siblings, 1 reply; 16+ messages in thread
From: Rui DeSousa @ 2024-08-22 23:06 UTC (permalink / raw)
To: Sbob <[email protected]>; +Cc: [email protected] <[email protected]>
> On Aug 22, 2024, at 5:44 PM, Sbob <[email protected]> wrote:
>
> All;
>
> I am running a select from a partitioned table. The table (and all the partitions) have an index on contract_date like this:
> CREATE INDEX on part_tab (contract_date) where contract_date > '2022-01-01'::date
>
> The table (including all partitions) has 32million rows
> The db server is an aurora postgresql instance with 128GB of ram and 16 vcpu's
>
> The shared buffers is set to 90GB and effective_cache_size is also 90GB
> I set default_statistics_target to 1000 and ram a vacuum analyze on the table
>
> I am selecting a number of columns and specifying this where clause:
>
> WHERE (
> (contract_date IS NULL)
> OR
> (contract_date > '2022-01-01'::date)
> )
>
> This takes 15 seconds to run and an explain says it's doing a table scan on all partitions (the query is not specifying the partition key)
> If I change the where clause to look like this:
>
> WHERE (
> (contract_date > '2022-01-01'::date)
> )
>
> Then it performs index scans on all the partitions and runs in about 600ms
>
> If i leave the where clause off entirely it performs table scans of the partitions and takes approx 18 seconds to run
>
> I am trying to get the performance to less than 2sec,
> I have tried adding indexes on the table and all partitions like this:
> CREATE INDEX ON table (contract_date NULLS FIRST) ;
> but the performance with the full where clause is the same:
>
> WHERE (
> (contract_date IS NULL)
> OR
> (contract_date > '2022-01-01'::date)
> )
>
> runs in 15 seconds and scans all partitions
>
> I also tried indexes i=on the table and all partitions like this:
> CREATE INDEX ON table (contract_date) WHERE contract_date IS NULL;
>
> but I get the same result, table scans on all partitions and it runs in 15 seconds
>
> Any help or advice ?
>
> Thanks in advance
>
>
What is contract_date and when will it be null?
^ permalink raw reply [nested|flat] 16+ messages in thread
* Re: checking for a NULL date in a partitioned table kills performance
2024-08-22 21:44 checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
2024-08-22 23:06 ` Re: checking for a NULL date in a partitioned table kills performance Rui DeSousa <[email protected]>
@ 2024-08-22 23:26 ` Sbob <[email protected]>
2024-08-22 23:32 ` Re: checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
0 siblings, 1 reply; 16+ messages in thread
From: Sbob @ 2024-08-22 23:26 UTC (permalink / raw)
To: Rui DeSousa <[email protected]>; +Cc: [email protected] <[email protected]>
On 8/22/24 5:06 PM, Rui DeSousa wrote:
>
>> On Aug 22, 2024, at 5:44 PM, Sbob <[email protected]> wrote:
>>
>> All;
>>
>> I am running a select from a partitioned table. The table (and all the partitions) have an index on contract_date like this:
>> CREATE INDEX on part_tab (contract_date) where contract_date > '2022-01-01'::date
>>
>> The table (including all partitions) has 32million rows
>> The db server is an aurora postgresql instance with 128GB of ram and 16 vcpu's
>>
>> The shared buffers is set to 90GB and effective_cache_size is also 90GB
>> I set default_statistics_target to 1000 and ram a vacuum analyze on the table
>>
>> I am selecting a number of columns and specifying this where clause:
>>
>> WHERE (
>> (contract_date IS NULL)
>> OR
>> (contract_date > '2022-01-01'::date)
>> )
>>
>> This takes 15 seconds to run and an explain says it's doing a table scan on all partitions (the query is not specifying the partition key)
>> If I change the where clause to look like this:
>>
>> WHERE (
>> (contract_date > '2022-01-01'::date)
>> )
>>
>> Then it performs index scans on all the partitions and runs in about 600ms
>>
>> If i leave the where clause off entirely it performs table scans of the partitions and takes approx 18 seconds to run
>>
>> I am trying to get the performance to less than 2sec,
>> I have tried adding indexes on the table and all partitions like this:
>> CREATE INDEX ON table (contract_date NULLS FIRST) ;
>> but the performance with the full where clause is the same:
>>
>> WHERE (
>> (contract_date IS NULL)
>> OR
>> (contract_date > '2022-01-01'::date)
>> )
>>
>> runs in 15 seconds and scans all partitions
>>
>> I also tried indexes i=on the table and all partitions like this:
>> CREATE INDEX ON table (contract_date) WHERE contract_date IS NULL;
>>
>> but I get the same result, table scans on all partitions and it runs in 15 seconds
>>
>> Any help or advice ?
>>
>> Thanks in advance
>>
>>
> What is contract_date and when will it be null?
it's a date data type and it allows NULL's not sure why, this is a
client's system
^ permalink raw reply [nested|flat] 16+ messages in thread
* Re: checking for a NULL date in a partitioned table kills performance
2024-08-22 21:44 checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
2024-08-22 23:06 ` Re: checking for a NULL date in a partitioned table kills performance Rui DeSousa <[email protected]>
2024-08-22 23:26 ` Re: checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
@ 2024-08-22 23:32 ` Sbob <[email protected]>
2024-08-23 00:01 ` Re: checking for a NULL date in a partitioned table kills performance David G. Johnston <[email protected]>
2024-08-23 00:05 ` Re: checking for a NULL date in a partitioned table kills performance Tom Lane <[email protected]>
2024-08-23 00:07 ` Re: checking for a NULL date in a partitioned table kills performance Rui DeSousa <[email protected]>
2024-08-23 11:39 ` Re: checking for a NULL date in a partitioned table kills performance Doug Reynolds <[email protected]>
0 siblings, 4 replies; 16+ messages in thread
From: Sbob @ 2024-08-22 23:32 UTC (permalink / raw)
To: Rui DeSousa <[email protected]>; +Cc: [email protected] <[email protected]>
On 8/22/24 5:26 PM, Sbob wrote:
>
> On 8/22/24 5:06 PM, Rui DeSousa wrote:
>>
>>> On Aug 22, 2024, at 5:44 PM, Sbob <[email protected]> wrote:
>>>
>>> All;
>>>
>>> I am running a select from a partitioned table. The table (and all
>>> the partitions) have an index on contract_date like this:
>>> CREATE INDEX on part_tab (contract_date) where contract_date >
>>> '2022-01-01'::date
>>>
>>> The table (including all partitions) has 32million rows
>>> The db server is an aurora postgresql instance with 128GB of ram and
>>> 16 vcpu's
>>>
>>> The shared buffers is set to 90GB and effective_cache_size is also 90GB
>>> I set default_statistics_target to 1000 and ram a vacuum analyze on
>>> the table
>>>
>>> I am selecting a number of columns and specifying this where clause:
>>>
>>> WHERE (
>>> (contract_date IS NULL)
>>> OR
>>> (contract_date > '2022-01-01'::date)
>>> )
>>>
>>> This takes 15 seconds to run and an explain says it's doing a table
>>> scan on all partitions (the query is not specifying the partition key)
>>> If I change the where clause to look like this:
>>>
>>> WHERE (
>>> (contract_date > '2022-01-01'::date)
>>> )
>>>
>>> Then it performs index scans on all the partitions and runs in about
>>> 600ms
>>>
>>> If i leave the where clause off entirely it performs table scans of
>>> the partitions and takes approx 18 seconds to run
>>>
>>> I am trying to get the performance to less than 2sec,
>>> I have tried adding indexes on the table and all partitions like this:
>>> CREATE INDEX ON table (contract_date NULLS FIRST) ;
>>> but the performance with the full where clause is the same:
>>>
>>> WHERE (
>>> (contract_date IS NULL)
>>> OR
>>> (contract_date > '2022-01-01'::date)
>>> )
>>>
>>> runs in 15 seconds and scans all partitions
>>>
>>> I also tried indexes i=on the table and all partitions like this:
>>> CREATE INDEX ON table (contract_date) WHERE contract_date IS NULL;
>>>
>>> but I get the same result, table scans on all partitions and it runs
>>> in 15 seconds
>>>
>>> Any help or advice ?
>>>
>>> Thanks in advance
>>>
>>>
>> What is contract_date and when will it be null?
>
>
> it's a date data type and it allows NULL's not sure why, this is a
> client's system
>
>
29 million of the 32 million rows in the table have NULL for contract_date
^ permalink raw reply [nested|flat] 16+ messages in thread
* Re: checking for a NULL date in a partitioned table kills performance
2024-08-22 21:44 checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
2024-08-22 23:06 ` Re: checking for a NULL date in a partitioned table kills performance Rui DeSousa <[email protected]>
2024-08-22 23:26 ` Re: checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
2024-08-22 23:32 ` Re: checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
@ 2024-08-23 00:01 ` David G. Johnston <[email protected]>
3 siblings, 0 replies; 16+ messages in thread
From: David G. Johnston @ 2024-08-23 00:01 UTC (permalink / raw)
To: Sbob <[email protected]>; +Cc: Rui DeSousa <[email protected]>; [email protected] <[email protected]>
On Thu, Aug 22, 2024 at 4:32 PM Sbob <[email protected]> wrote:
>
> 29 million of the 32 million rows in the table have NULL for contract_date
>
>
Your expectation that this query should use an index is flawed. Indexes
are for highly selective queries. Finding nulls on that table is not
selective.
David J.
^ permalink raw reply [nested|flat] 16+ messages in thread
* Re: checking for a NULL date in a partitioned table kills performance
2024-08-22 21:44 checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
2024-08-22 23:06 ` Re: checking for a NULL date in a partitioned table kills performance Rui DeSousa <[email protected]>
2024-08-22 23:26 ` Re: checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
2024-08-22 23:32 ` Re: checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
@ 2024-08-23 00:05 ` Tom Lane <[email protected]>
2024-08-23 09:54 ` Re: checking for a NULL date in a partitioned table kills performance Rui DeSousa <[email protected]>
3 siblings, 1 reply; 16+ messages in thread
From: Tom Lane @ 2024-08-23 00:05 UTC (permalink / raw)
To: Sbob <[email protected]>; +Cc: Rui DeSousa <[email protected]>; [email protected] <[email protected]>
Sbob <[email protected]> writes:
> 29 million of the 32 million rows in the table have NULL for contract_date
[ blink... ] So your query is selecting at least 29/32nds of the
table, plus however much matches the contract_date > '2022-01-01'
alternative. I'm not sure how you expect that to be significantly
cheaper than scanning the whole table.
regards, tom lane
^ permalink raw reply [nested|flat] 16+ messages in thread
* Re: checking for a NULL date in a partitioned table kills performance
2024-08-22 21:44 checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
2024-08-22 23:06 ` Re: checking for a NULL date in a partitioned table kills performance Rui DeSousa <[email protected]>
2024-08-22 23:26 ` Re: checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
2024-08-22 23:32 ` Re: checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
2024-08-23 00:05 ` Re: checking for a NULL date in a partitioned table kills performance Tom Lane <[email protected]>
@ 2024-08-23 09:54 ` Rui DeSousa <[email protected]>
0 siblings, 0 replies; 16+ messages in thread
From: Rui DeSousa @ 2024-08-23 09:54 UTC (permalink / raw)
To: Tom Lane <[email protected]>; +Cc: Sbob <[email protected]>; [email protected] <[email protected]>
> On Aug 22, 2024, at 8:05 PM, Tom Lane <[email protected]> wrote:
>
> Sbob <[email protected]> writes:
>> 29 million of the 32 million rows in the table have NULL for contract_date
>
> [ blink... ] So your query is selecting at least 29/32nds of the
> table, plus however much matches the contract_date > '2022-01-01'
> alternative. I'm not sure how you expect that to be significantly
> cheaper than scanning the whole table.
>
> regards, tom lane
^^^ This is best answer; however, what is the actual query and how is it used? I assume it’s a analytical query and not actually extracting the rows. Is for a dashboard or an ad-hoc query?
Here’s simple example; from 4/1 is uses the index 3/1 it does a full table scan. Depending on what you using the query for you could use a covered index or a materialized view.
prod=# create index emp_idx3 on emp (contract_date);
CREATE INDEX
Time: 6036.030 ms (00:06.036)
prod=# select sum(site_id) from emp where contract_date > '4/1/2024';
sum
-----------
927473447
(1 row)
Time: 711.774 ms
prod=# select sum(site_id) from emp where contract_date > '3/1/2024';
sum
------------
1128971203
(1 row)
Time: 1945.397 ms (00:01.945)
prod=# select sum(site_id) from emp where contract_date > '3/1/2024' or contract_date is null;
sum
------------
3823075309
(1 row)
Time: 1821.284 ms (00:01.821)
prod=# explain select sum(site_id) from emp where contract_date > '4/1/2024';
QUERY PLAN
--------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=754070.31..754070.32 rows=1 width=8)
-> Gather (cost=754069.59..754070.30 rows=7 width=8)
Workers Planned: 7
-> Partial Aggregate (cost=753069.59..753069.60 rows=1 width=8)
-> Parallel Bitmap Heap Scan on emp (cost=5343.20..752867.80 rows=80715 width=4)
Recheck Cond: (contract_date > '2024-04-01'::date)
-> Bitmap Index Scan on emp_idx3 (cost=0.00..5201.95 rows=565002 width=0)
Index Cond: (contract_date > '2024-04-01'::date)
JIT:
Functions: 7
Options: Inlining true, Optimization true, Expressions true, Deforming true
(11 rows)
Time: 1.196 ms
prod=# explain select sum(site_id) from emp where contract_date > '3/1/2024';
QUERY PLAN
---------------------------------------------------------------------------------------
Finalize Aggregate (cost=764566.90..764566.91 rows=1 width=8)
-> Gather (cost=764566.18..764566.89 rows=7 width=8)
Workers Planned: 7
-> Partial Aggregate (cost=763566.18..763566.19 rows=1 width=8)
-> Parallel Seq Scan on emp (cost=0.00..763320.15 rows=98411 width=4)
Filter: (contract_date > '2024-03-01'::date)
JIT:
Functions: 7
Options: Inlining true, Optimization true, Expressions true, Deforming true
(9 rows)
Time: 1.172 ms
prod=# drop index emp_idx3;
DROP INDEX
Time: 8.663 ms
prod=# create index emp_idx3 on emp (contract_date) include (site_id);
CREATE INDEX
Time: 7002.860 ms (00:07.003)
prod=# select sum(site_id) from emp where contract_date > '4/1/2024';
sum
-----------
927473447
(1 row)
Time: 56.450 ms
prod=# select sum(site_id) from emp where contract_date > '3/1/2024';
sum
------------
1128971203
(1 row)
Time: 49.115 ms
prod=# select sum(site_id) from emp where contract_date > '3/1/2024' or contract_date is null;
sum
------------
3823075309
(1 row)
Time: 702.962 ms
prod=# explain select sum(site_id) from emp where contract_date > '3/1/2024' or contract_date is null;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=216035.47..216035.48 rows=1 width=8)
-> Gather (cost=216034.74..216035.45 rows=7 width=8)
Workers Planned: 7
-> Partial Aggregate (cost=215034.74..215034.75 rows=1 width=8)
-> Parallel Index Only Scan using emp_idx3 on emp (cost=0.44..213686.81 rows=539174 width=4)
Filter: ((contract_date > '2024-03-01'::date) OR (contract_date IS NULL))
JIT:
Functions: 5
Options: Inlining false, Optimization false, Expressions true, Deforming true
(9 rows)
Time: 0.972 ms
prod=# select count(*) from emp;
count
----------
16862243
(1 row)
Time: 629.995 ms
^ permalink raw reply [nested|flat] 16+ messages in thread
* Re: checking for a NULL date in a partitioned table kills performance
2024-08-22 21:44 checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
2024-08-22 23:06 ` Re: checking for a NULL date in a partitioned table kills performance Rui DeSousa <[email protected]>
2024-08-22 23:26 ` Re: checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
2024-08-22 23:32 ` Re: checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
@ 2024-08-23 00:07 ` Rui DeSousa <[email protected]>
2024-08-23 13:47 ` Re: checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
3 siblings, 1 reply; 16+ messages in thread
From: Rui DeSousa @ 2024-08-23 00:07 UTC (permalink / raw)
To: Sbob <[email protected]>; +Cc: [email protected] <[email protected]>
> On Aug 22, 2024, at 7:32 PM, Sbob <[email protected]> wrote:
>
>
> On 8/22/24 5:26 PM, Sbob wrote:
>>
>> On 8/22/24 5:06 PM, Rui DeSousa wrote:
>>>
>>>> On Aug 22, 2024, at 5:44 PM, Sbob <[email protected]> wrote:
>>>>
>>>> All;
>>>>
>>>> I am running a select from a partitioned table. The table (and all the partitions) have an index on contract_date like this:
>>>> CREATE INDEX on part_tab (contract_date) where contract_date > '2022-01-01'::date
>>>>
>>>> The table (including all partitions) has 32million rows
>>>> The db server is an aurora postgresql instance with 128GB of ram and 16 vcpu's
>>>>
>>>> The shared buffers is set to 90GB and effective_cache_size is also 90GB
>>>> I set default_statistics_target to 1000 and ram a vacuum analyze on the table
>>>>
>>>> I am selecting a number of columns and specifying this where clause:
>>>>
>>>> WHERE (
>>>> (contract_date IS NULL)
>>>> OR
>>>> (contract_date > '2022-01-01'::date)
>>>> )
>>>>
>>>> This takes 15 seconds to run and an explain says it's doing a table scan on all partitions (the query is not specifying the partition key)
>>>> If I change the where clause to look like this:
>>>>
>>>> WHERE (
>>>> (contract_date > '2022-01-01'::date)
>>>> )
>>>>
>>>> Then it performs index scans on all the partitions and runs in about 600ms
>>>>
>>>> If i leave the where clause off entirely it performs table scans of the partitions and takes approx 18 seconds to run
>>>>
>>>> I am trying to get the performance to less than 2sec,
>>>> I have tried adding indexes on the table and all partitions like this:
>>>> CREATE INDEX ON table (contract_date NULLS FIRST) ;
>>>> but the performance with the full where clause is the same:
>>>>
>>>> WHERE (
>>>> (contract_date IS NULL)
>>>> OR
>>>> (contract_date > '2022-01-01'::date)
>>>> )
>>>>
>>>> runs in 15 seconds and scans all partitions
>>>>
>>>> I also tried indexes i=on the table and all partitions like this:
>>>> CREATE INDEX ON table (contract_date) WHERE contract_date IS NULL;
>>>>
>>>> but I get the same result, table scans on all partitions and it runs in 15 seconds
>>>>
>>>> Any help or advice ?
>>>>
>>>> Thanks in advance
>>>>
>>>>
>>> What is contract_date and when will it be null?
>>
>>
>> it's a date data type and it allows NULL's not sure why, this is a client's system
>>
>>
> 29 million of the 32 million rows in the table have NULL for contract_date
>
NULLs are not indexed thus the OR predicate invalidate the use of the index.
Since you are already creating a partial index just include the NULLs. It index will get used for both of your queries.
create index table_idx1
on table (contract_date)
where contract_date > ‘1/1/2022’
or contract_date is null
;
The reason why I asked when is contract_date null is because attributes in a table should be non nullable. If it’s nullable then that begs the question if it belong in that table in the first place; and sometimes the answer is yes. I just see a lot of half baked schemas out there. I refer to them as organically designed schemas.
-Rui.
^ permalink raw reply [nested|flat] 16+ messages in thread
* Re: checking for a NULL date in a partitioned table kills performance
2024-08-22 21:44 checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
2024-08-22 23:06 ` Re: checking for a NULL date in a partitioned table kills performance Rui DeSousa <[email protected]>
2024-08-22 23:26 ` Re: checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
2024-08-22 23:32 ` Re: checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
2024-08-23 00:07 ` Re: checking for a NULL date in a partitioned table kills performance Rui DeSousa <[email protected]>
@ 2024-08-23 13:47 ` Sbob <[email protected]>
0 siblings, 0 replies; 16+ messages in thread
From: Sbob @ 2024-08-23 13:47 UTC (permalink / raw)
To: Rui DeSousa <[email protected]>; +Cc: [email protected] <[email protected]>
On 8/22/24 6:07 PM, Rui DeSousa wrote:
>
>
>> On Aug 22, 2024, at 7:32 PM, Sbob <[email protected]> wrote:
>>
>>
>> On 8/22/24 5:26 PM, Sbob wrote:
>>>
>>> On 8/22/24 5:06 PM, Rui DeSousa wrote:
>>>>
>>>>> On Aug 22, 2024, at 5:44 PM, Sbob <[email protected]> wrote:
>>>>>
>>>>> All;
>>>>>
>>>>> I am running a select from a partitioned table. The table (and all
>>>>> the partitions) have an index on contract_date like this:
>>>>> CREATE INDEX on part_tab (contract_date) where contract_date >
>>>>> '2022-01-01'::date
>>>>>
>>>>> The table (including all partitions) has 32million rows
>>>>> The db server is an aurora postgresql instance with 128GB of ram
>>>>> and 16 vcpu's
>>>>>
>>>>> The shared buffers is set to 90GB and effective_cache_size is also
>>>>> 90GB
>>>>> I set default_statistics_target to 1000 and ram a vacuum analyze
>>>>> on the table
>>>>>
>>>>> I am selecting a number of columns and specifying this where clause:
>>>>>
>>>>> WHERE (
>>>>> (contract_date IS NULL)
>>>>> OR
>>>>> (contract_date > '2022-01-01'::date)
>>>>> )
>>>>>
>>>>> This takes 15 seconds to run and an explain says it's doing a
>>>>> table scan on all partitions (the query is not specifying the
>>>>> partition key)
>>>>> If I change the where clause to look like this:
>>>>>
>>>>> WHERE (
>>>>> (contract_date > '2022-01-01'::date)
>>>>> )
>>>>>
>>>>> Then it performs index scans on all the partitions and runs in
>>>>> about 600ms
>>>>>
>>>>> If i leave the where clause off entirely it performs table scans
>>>>> of the partitions and takes approx 18 seconds to run
>>>>>
>>>>> I am trying to get the performance to less than 2sec,
>>>>> I have tried adding indexes on the table and all partitions like this:
>>>>> CREATE INDEX ON table (contract_date NULLS FIRST) ;
>>>>> but the performance with the full where clause is the same:
>>>>>
>>>>> WHERE (
>>>>> (contract_date IS NULL)
>>>>> OR
>>>>> (contract_date > '2022-01-01'::date)
>>>>> )
>>>>>
>>>>> runs in 15 seconds and scans all partitions
>>>>>
>>>>> I also tried indexes i=on the table and all partitions like this:
>>>>> CREATE INDEX ON table (contract_date) WHERE contract_date IS NULL;
>>>>>
>>>>> but I get the same result, table scans on all partitions and it
>>>>> runs in 15 seconds
>>>>>
>>>>> Any help or advice ?
>>>>>
>>>>> Thanks in advance
>>>>>
>>>>>
>>>> What is contract_date and when will it be null?
>>>
>>>
>>> it's a date data type and it allows NULL's not sure why, this is a
>>> client's system
>>>
>>>
>> 29 million of the 32 million rows in the table have NULL for
>> contract_date
>>
>
> NULLs are not indexed thus the OR predicate invalidate the use of the
> index.
>
> Since you are already creating a partial index just include the NULLs.
> It index will get used for both of your queries.
>
> create index table_idx1
> on table (contract_date)
> where contract_date > ‘1/1/2022’
> or contract_date is null
> ;
>
>
> The reason why I asked when is contract_date null is because
> attributes in a table should be non nullable. If it’s nullable then
> that begs the question if it belong in that table in the first place;
> and sometimes the answer is yes. I just see a lot of half baked
> schemas out there. I refer to them as organically designed schemas.
>
> -Rui.
>
I agree, I will find out from the client
^ permalink raw reply [nested|flat] 16+ messages in thread
* Re: checking for a NULL date in a partitioned table kills performance
2024-08-22 21:44 checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
2024-08-22 23:06 ` Re: checking for a NULL date in a partitioned table kills performance Rui DeSousa <[email protected]>
2024-08-22 23:26 ` Re: checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
2024-08-22 23:32 ` Re: checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
@ 2024-08-23 11:39 ` Doug Reynolds <[email protected]>
2024-08-23 15:17 ` Re: checking for a NULL date in a partitioned table kills performance Scott Ribe <[email protected]>
3 siblings, 1 reply; 16+ messages in thread
From: Doug Reynolds @ 2024-08-23 11:39 UTC (permalink / raw)
To: Pgsql-admin <[email protected]>
I don’t see how an index is going to help since virtually all of the rows are null AND contract_date isn’t the partition key.
Perhaps, you could try a UNION ALL with one query selecting the date and the other selecting where the date is null.
You could try something really ugly where you make a function index that COALESCEs the nulls to 1-1-1900 and use the COALESCE in the query.
Sent from my iPhone
> On Aug 22, 2024, at 7:43 PM, Sbob <[email protected]> wrote:
>
>
>> On 8/22/24 5:26 PM, Sbob wrote:
>>
>>> On 8/22/24 5:06 PM, Rui DeSousa wrote:
>>>
>>>> On Aug 22, 2024, at 5:44 PM, Sbob <[email protected]> wrote:
>>>>
>>>> All;
>>>>
>>>> I am running a select from a partitioned table. The table (and all the partitions) have an index on contract_date like this:
>>>> CREATE INDEX on part_tab (contract_date) where contract_date > '2022-01-01'::date
>>>>
>>>> The table (including all partitions) has 32million rows
>>>> The db server is an aurora postgresql instance with 128GB of ram and 16 vcpu's
>>>>
>>>> The shared buffers is set to 90GB and effective_cache_size is also 90GB
>>>> I set default_statistics_target to 1000 and ram a vacuum analyze on the table
>>>>
>>>> I am selecting a number of columns and specifying this where clause:
>>>>
>>>> WHERE (
>>>> (contract_date IS NULL)
>>>> OR
>>>> (contract_date > '2022-01-01'::date)
>>>> )
>>>>
>>>> This takes 15 seconds to run and an explain says it's doing a table scan on all partitions (the query is not specifying the partition key)
>>>> If I change the where clause to look like this:
>>>>
>>>> WHERE (
>>>> (contract_date > '2022-01-01'::date)
>>>> )
>>>>
>>>> Then it performs index scans on all the partitions and runs in about 600ms
>>>>
>>>> If i leave the where clause off entirely it performs table scans of the partitions and takes approx 18 seconds to run
>>>>
>>>> I am trying to get the performance to less than 2sec,
>>>> I have tried adding indexes on the table and all partitions like this:
>>>> CREATE INDEX ON table (contract_date NULLS FIRST) ;
>>>> but the performance with the full where clause is the same:
>>>>
>>>> WHERE (
>>>> (contract_date IS NULL)
>>>> OR
>>>> (contract_date > '2022-01-01'::date)
>>>> )
>>>>
>>>> runs in 15 seconds and scans all partitions
>>>>
>>>> I also tried indexes i=on the table and all partitions like this:
>>>> CREATE INDEX ON table (contract_date) WHERE contract_date IS NULL;
>>>>
>>>> but I get the same result, table scans on all partitions and it runs in 15 seconds
>>>>
>>>> Any help or advice ?
>>>>
>>>> Thanks in advance
>>>>
>>>>
>>> What is contract_date and when will it be null?
>>
>>
>> it's a date data type and it allows NULL's not sure why, this is a client's system
>>
>>
> 29 million of the 32 million rows in the table have NULL for contract_date
>
>
>
>
>
>
^ permalink raw reply [nested|flat] 16+ messages in thread
* Re: checking for a NULL date in a partitioned table kills performance
2024-08-22 21:44 checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
2024-08-22 23:06 ` Re: checking for a NULL date in a partitioned table kills performance Rui DeSousa <[email protected]>
2024-08-22 23:26 ` Re: checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
2024-08-22 23:32 ` Re: checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
2024-08-23 11:39 ` Re: checking for a NULL date in a partitioned table kills performance Doug Reynolds <[email protected]>
@ 2024-08-23 15:17 ` Scott Ribe <[email protected]>
2024-08-23 15:42 ` Re: checking for a NULL date in a partitioned table kills performance Doug Reynolds <[email protected]>
2024-08-23 15:49 ` Re: checking for a NULL date in a partitioned table kills performance Wetmore, Matthew (CTR) <[email protected]>
0 siblings, 2 replies; 16+ messages in thread
From: Scott Ribe @ 2024-08-23 15:17 UTC (permalink / raw)
To: Pgsql-admin <[email protected]>
> On Aug 23, 2024, at 5:39 AM, Doug Reynolds <[email protected]> wrote:
>
> You could try something really ugly where you make a function index that COALESCEs the nulls to 1-1-1900 and use the COALESCE in the query.
I don't see how that could be better than just creating a partial index on it WHERE contract_date IS NULL--and anyway I'm sure you're right that no index would help. No matter what, it seems that sequential scans of all partitions will be required since most rows have it null, and it's not even related to the partition key.
^ permalink raw reply [nested|flat] 16+ messages in thread
* Re: checking for a NULL date in a partitioned table kills performance
2024-08-22 21:44 checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
2024-08-22 23:06 ` Re: checking for a NULL date in a partitioned table kills performance Rui DeSousa <[email protected]>
2024-08-22 23:26 ` Re: checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
2024-08-22 23:32 ` Re: checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
2024-08-23 11:39 ` Re: checking for a NULL date in a partitioned table kills performance Doug Reynolds <[email protected]>
2024-08-23 15:17 ` Re: checking for a NULL date in a partitioned table kills performance Scott Ribe <[email protected]>
@ 2024-08-23 15:42 ` Doug Reynolds <[email protected]>
2024-08-23 16:08 ` Re: checking for a NULL date in a partitioned table kills performance Scott Ribe <[email protected]>
1 sibling, 1 reply; 16+ messages in thread
From: Doug Reynolds @ 2024-08-23 15:42 UTC (permalink / raw)
To: Pgsql-admin <[email protected]>
The only difference is that you would be reading from one index instead of two, which could be more efficient.
Sent from my iPhone
> On Aug 23, 2024, at 11:19 AM, Scott Ribe <[email protected]> wrote:
>
>
>>
>> On Aug 23, 2024, at 5:39 AM, Doug Reynolds <[email protected]> wrote:
>>
>> You could try something really ugly where you make a function index that COALESCEs the nulls to 1-1-1900 and use the COALESCE in the query.
>
> I don't see how that could be better than just creating a partial index on it WHERE contract_date IS NULL--and anyway I'm sure you're right that no index would help. No matter what, it seems that sequential scans of all partitions will be required since most rows have it null, and it's not even related to the partition key.
>
^ permalink raw reply [nested|flat] 16+ messages in thread
* Re: checking for a NULL date in a partitioned table kills performance
2024-08-22 21:44 checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
2024-08-22 23:06 ` Re: checking for a NULL date in a partitioned table kills performance Rui DeSousa <[email protected]>
2024-08-22 23:26 ` Re: checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
2024-08-22 23:32 ` Re: checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
2024-08-23 11:39 ` Re: checking for a NULL date in a partitioned table kills performance Doug Reynolds <[email protected]>
2024-08-23 15:17 ` Re: checking for a NULL date in a partitioned table kills performance Scott Ribe <[email protected]>
2024-08-23 15:42 ` Re: checking for a NULL date in a partitioned table kills performance Doug Reynolds <[email protected]>
@ 2024-08-23 16:08 ` Scott Ribe <[email protected]>
0 siblings, 0 replies; 16+ messages in thread
From: Scott Ribe @ 2024-08-23 16:08 UTC (permalink / raw)
To: Doug Reynolds <[email protected]>; +Cc: Pgsql-admin <[email protected]>
> On Aug 23, 2024, at 9:42 AM, Doug Reynolds <[email protected]> wrote:
>
> The only difference is that you would be reading from one index instead of two, which could be more efficient.
Ah yes, that's a good point to take into consideration in such a case.
In the one at hand though, if statistics are correct, neither index is going to be used, given the 90% of rows with NULL values. Using an index would just waste time compared to a simple sequential scan.
^ permalink raw reply [nested|flat] 16+ messages in thread
* Re: checking for a NULL date in a partitioned table kills performance
2024-08-22 21:44 checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
2024-08-22 23:06 ` Re: checking for a NULL date in a partitioned table kills performance Rui DeSousa <[email protected]>
2024-08-22 23:26 ` Re: checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
2024-08-22 23:32 ` Re: checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
2024-08-23 11:39 ` Re: checking for a NULL date in a partitioned table kills performance Doug Reynolds <[email protected]>
2024-08-23 15:17 ` Re: checking for a NULL date in a partitioned table kills performance Scott Ribe <[email protected]>
@ 2024-08-23 15:49 ` Wetmore, Matthew (CTR) <[email protected]>
1 sibling, 0 replies; 16+ messages in thread
From: Wetmore, Matthew (CTR) @ 2024-08-23 15:49 UTC (permalink / raw)
To: Scott Ribe <[email protected]>; Pgsql-admin <[email protected]>
I have had this issue in the past.
The real admin fix to this is to have a NULL replacement character that prevents this.
This does a few things:
1. An index will index on a replacement character ( I use <->)
2. A join is easier on a replacement character than NULL (Nulls last/first avoided)
3. Stops all evil NULL rules.
We strive to fix things, but the real solution, IMHO, is better arch design and better fundamental understanding of how NULL works.
Partitioned tables under 500M-750M rows will always have these performance issues.
Great ideas on the workaround’s though, I do understand sometimes you inherit a bad db.
From: Scott Ribe <[email protected]>
Date: Friday, August 23, 2024 at 8:17 AM
To: Pgsql-admin <[email protected]>
Subject: Re: checking for a NULL date in a partitioned table kills performance
> On Aug 23, 2024, at 5: 39 AM, Doug Reynolds <mav@ wastegate. net> wrote: > > You could try something really ugly where you make a function index that COALESCEs the nulls to 1-1-1900 and use the COALESCE in the query. I don't see
> On Aug 23, 2024, at 5:39 AM, Doug Reynolds <[email protected]> wrote:
>
> You could try something really ugly where you make a function index that COALESCEs the nulls to 1-1-1900 and use the COALESCE in the query.
I don't see how that could be better than just creating a partial index on it WHERE contract_date IS NULL--and anyway I'm sure you're right that no index would help. No matter what, it seems that sequential scans of all partitions will be required since most rows have it null, and it's not even related to the partition key.
^ permalink raw reply [nested|flat] 16+ messages in thread
* Re: checking for a NULL date in a partitioned table kills performance
2024-08-22 21:44 checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
@ 2024-08-23 04:40 ` Muhammad Usman Khan <[email protected]>
2 siblings, 0 replies; 16+ messages in thread
From: Muhammad Usman Khan @ 2024-08-23 04:40 UTC (permalink / raw)
To: Sbob <[email protected]>; +Cc: [email protected] <[email protected]>
Hi Sbob,
Have you tried using the following indexes ?
B-tree
-
The default and most commonly used type, ideal for equality, range
and Pattern queries.
BRIN (Block Range INdex)
-
Compact indexes that are efficient for large tables where the data is
naturally ordered.
On Fri, 23 Aug 2024 at 02:45, Sbob <[email protected]> wrote:
> All;
>
> I am running a select from a partitioned table. The table (and all the
> partitions) have an index on contract_date like this:
> CREATE INDEX on part_tab (contract_date) where contract_date >
> '2022-01-01'::date
>
> The table (including all partitions) has 32million rows
> The db server is an aurora postgresql instance with 128GB of ram and 16
> vcpu's
>
> The shared buffers is set to 90GB and effective_cache_size is also 90GB
> I set default_statistics_target to 1000 and ram a vacuum analyze on the
> table
>
> I am selecting a number of columns and specifying this where clause:
>
> WHERE (
> (contract_date IS NULL)
> OR
> (contract_date > '2022-01-01'::date)
> )
>
> This takes 15 seconds to run and an explain says it's doing a table scan
> on all partitions (the query is not specifying the partition key)
> If I change the where clause to look like this:
>
> WHERE (
> (contract_date > '2022-01-01'::date)
> )
>
> Then it performs index scans on all the partitions and runs in about 600ms
>
> If i leave the where clause off entirely it performs table scans of the
> partitions and takes approx 18 seconds to run
>
> I am trying to get the performance to less than 2sec,
> I have tried adding indexes on the table and all partitions like this:
> CREATE INDEX ON table (contract_date NULLS FIRST) ;
> but the performance with the full where clause is the same:
>
> WHERE (
> (contract_date IS NULL)
> OR
> (contract_date > '2022-01-01'::date)
> )
>
> runs in 15 seconds and scans all partitions
>
> I also tried indexes i=on the table and all partitions like this:
> CREATE INDEX ON table (contract_date) WHERE contract_date IS NULL;
>
> but I get the same result, table scans on all partitions and it runs in
> 15 seconds
>
> Any help or advice ?
>
> Thanks in advance
>
>
>
^ permalink raw reply [nested|flat] 16+ messages in thread
* Re: checking for a NULL date in a partitioned table kills performance
2024-08-22 21:44 checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
@ 2024-08-23 06:45 ` [email protected]
2 siblings, 0 replies; 16+ messages in thread
From: [email protected] @ 2024-08-23 06:45 UTC (permalink / raw)
To: [email protected]
Sbob schrieb am 22.08.2024 um 23:44:
>
> I am selecting a number of columns and specifying this where clause:
>
> WHERE (
> (contract_date IS NULL)
> OR
> (contract_date > '2022-01-01'::date)
> )
>
It's not the check for NULL, it's the OR that makes this perform so badly.
I typically never set columns used for range queries to NULL.
Would using infinity instead of NULL be a viable option here?
Then you can remove the OR condition entirely.
^ permalink raw reply [nested|flat] 16+ messages in thread
end of thread, other threads:[~2024-08-23 16:08 UTC | newest]
Thread overview: 16+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2024-08-22 21:44 checking for a NULL date in a partitioned table kills performance Sbob <[email protected]>
2024-08-22 23:06 ` Rui DeSousa <[email protected]>
2024-08-22 23:26 ` Sbob <[email protected]>
2024-08-22 23:32 ` Sbob <[email protected]>
2024-08-23 00:01 ` David G. Johnston <[email protected]>
2024-08-23 00:05 ` Tom Lane <[email protected]>
2024-08-23 09:54 ` Rui DeSousa <[email protected]>
2024-08-23 00:07 ` Rui DeSousa <[email protected]>
2024-08-23 13:47 ` Sbob <[email protected]>
2024-08-23 11:39 ` Doug Reynolds <[email protected]>
2024-08-23 15:17 ` Scott Ribe <[email protected]>
2024-08-23 15:42 ` Doug Reynolds <[email protected]>
2024-08-23 16:08 ` Scott Ribe <[email protected]>
2024-08-23 15:49 ` Wetmore, Matthew (CTR) <[email protected]>
2024-08-23 04:40 ` Muhammad Usman Khan <[email protected]>
2024-08-23 06:45 ` [email protected]
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox