MIME-Version: 1.0
References: <CAB+=1TX+Av1Fx+Q4YOmUGioUoa8TQ8kGa1h06zPSEona2az39A@mail.gmail.com>
In-Reply-To: <CAB+=1TX+Av1Fx+Q4YOmUGioUoa8TQ8kGa1h06zPSEona2az39A@mail.gmail.com>
From: Lok P <loknath.73@gmail.com>
Date: Sun, 9 Jun 2024 09:45:02 +0530
Message-ID: <CAKna9VajLFW=9Z1Y9ar0WJXKeGTgYXivFtBmdt=gXJoLs4s2Rw@mail.gmail.com>
Subject: Re: How to create efficient index in this scenario?
To: veem v <veema0000@gmail.com>
Cc: pgsql-general <pgsql-general@lists.postgresql.org>
Content-Type: multipart/alternative; boundary="000000000000997095061a6d477e"
Archived-At: <https://www.postgresql.org/message-id/CAKna9VajLFW%3D9Z1Y9ar0WJXKeGTgYXivFtBmdt%3DgXJoLs4s2Rw%40mail.gmail.com>
Precedence: bulk

--000000000000997095061a6d477e
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Sat, Jun 8, 2024 at 7:03 PM veem v <veema0000@gmail.com> wrote:

> Hi ,
> It's postgres version 15.4. A table is daily range partitioned on a column
> transaction_timestamp. It has a unique identifier which is the ideal for
> primary key (say transaction_id) , however as there is a limitation in
> which we have to include the partition key as part of the primary key, so
> it has to be a composite index. Either it has to be
> (transaction_id,transaction_timestamp) or ( transaction_timestamp,
> transaction_id). But which one should we go for, if both of the columns get
> used in all the queries?
>
> We will always be using transaction_timestamp as mostly a range predicate
> filter/join in the query and the transaction_id will be mostly used as a
> join condition/direct filter in the queries. So we were wondering, which
> column should we be using  as a leading column in this index?
>
> There is a blog below (which is for oracle), showing how the index should
> be chosen and it states ,  "*Stick the columns you do range scans on last
> in the index, filters that get equality predicates should come first.* ",
> and in that case we should have the PK created as in the order
> (transaction_id,transaction_timestamp). It's because making the range
> predicate the leading column won't let it be used as an access predicate,
> only as a filter predicate, so it will read more blocks and cause more IO.
> Does this hold true in postgres too?
>
> https://ctandrewsayer.wordpress.com/2017/03/24/the-golden-rule-of-indexing/
>

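I believe the same rule holds in Postgres, so the index in this case should
be on (transaction_id, transaction_timestamp). As far as I understand, a
B-tree scan is bounded by equality conditions on the leading columns plus
an inequality on the first column after them, so leading with
transaction_id lets an equality lookup narrow the scan to the matching
entries.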
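
A minimal sketch of that layout, in case it helps to try it out. The table
name txn_demo, the amount column, the partition name and the literal values
are made up for illustration; only the two key columns come from your
description. The plan you actually get will depend on your data and
statistics, so treat this as something to experiment with rather than a
guaranteed result.

```sql
-- Hypothetical table; only the two key columns come from the thread.
CREATE TABLE txn_demo (
    transaction_id        bigint      NOT NULL,
    transaction_timestamp timestamptz NOT NULL,
    amount                numeric,
    -- Equality column first, partition key (range column) last.
    PRIMARY KEY (transaction_id, transaction_timestamp)
) PARTITION BY RANGE (transaction_timestamp);

-- One daily partition, matching the daily range partitioning described.
CREATE TABLE txn_demo_2024_06_08 PARTITION OF txn_demo
    FOR VALUES FROM ('2024-06-08') TO ('2024-06-09');

-- Equality on the leading column, range on the partition key.
EXPLAIN
SELECT *
FROM txn_demo
WHERE transaction_id = 12345
  AND transaction_timestamp >= '2024-06-08'
  AND transaction_timestamp <  '2024-06-09';
```

Running the same EXPLAIN against a copy of the table keyed on
(transaction_timestamp, transaction_id) and loaded with representative data
should show whether the column order matters for your workload.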


>
>
> Additionally there is another scenario in which we have the requirement to
> have another timestamp column (say create_timestamp) to be added as part of
> the primary key along with transaction_id and we are going to query this
> table frequently by the column create_timestamp as a range predicate. And
> of course we will also have the range predicate filter on partition key
> "transaction_timestamp". But we may or may not have join/filter on column
> transaction_id, so in this scenario we should go for
>  (create_timestamp,transaction_id,transaction_timestamp). because
> "transaction_timestamp" is set as partition key , so putting it last
> doesn't harm us. Will this be the correct order or any other index order is
> appropriate?
>
>
>
In this case, the index should be on
(create_timestamp, transaction_id, transaction_timestamp), since your
queries will always have a predicate on "create_timestamp" but may not have
one on transaction_id.
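
For this second scenario, a rough sketch could look like the following. As
before, the names txn_demo2 and txn_demo2_2024_06_08 and the timestamp
literals are assumptions for illustration only, and the planner's choice
will depend on your actual data.

```sql
-- Hypothetical table for the second scenario; names are illustrative.
CREATE TABLE txn_demo2 (
    transaction_id        bigint      NOT NULL,
    create_timestamp      timestamptz NOT NULL,
    transaction_timestamp timestamptz NOT NULL,
    -- create_timestamp leads because every query filters on it.
    PRIMARY KEY (create_timestamp, transaction_id, transaction_timestamp)
) PARTITION BY RANGE (transaction_timestamp);

CREATE TABLE txn_demo2_2024_06_08 PARTITION OF txn_demo2
    FOR VALUES FROM ('2024-06-08') TO ('2024-06-09');

-- Range on create_timestamp, range on the partition key,
-- and no predicate on transaction_id.
EXPLAIN
SELECT *
FROM txn_demo2
WHERE create_timestamp >= '2024-06-08 10:00'
  AND create_timestamp <  '2024-06-08 11:00'
  AND transaction_timestamp >= '2024-06-08'
  AND transaction_timestamp <  '2024-06-09';
```

The range on transaction_timestamp should mainly help through partition
pruning here, while the range on create_timestamp bounds the index scan
inside each remaining partition.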

--000000000000997095061a6d477e--