MIME-Version: 1.0
In-Reply-To: 
 <CY1PR0201MB189792E39DD5C36B59EEC362FFDD0@CY1PR0201MB1897.namprd02.prod.outlook.com>
References: 
 <CAJhrTGxffyaiu1JwhYkY3noy5ukiyREpzv5yT_5CGubVzMXmqw@mail.gmail.com>
 <CAFj8pRCxqJsOtpahwtdh4M12OGEs9z-zvz8EfAWE19s+Wh075w@mail.gmail.com>
 <CAJhrTGxKR099VKLPYm-scCpj7En0cD9TegHQM9AFDzjs0tkXQw@mail.gmail.com>
 <CAFj8pRBqvyQrEK-msP5DO7aame+Sfc_yCuxRUDQ+xxhXM3tdfQ@mail.gmail.com>
 <CY1PR0201MB189792E39DD5C36B59EEC362FFDD0@CY1PR0201MB1897.namprd02.prod.outlook.com>
From: Yevhenii Kurtov <yevhenii.kurtov@gmail.com>
Date: Thu, 29 Jun 2017 12:17:44 +0700
Message-ID: 
 <CAJhrTGyckvhYsN3Y3jzZWX-PtW1R1iewP=q9nToke4taantqjA@mail.gmail.com>
Subject: Re: 
To: Brad DeJong <Brad.Dejong@infor.com>
Cc: Pavel Stehule <pavel.stehule@gmail.com>,
	"pgsql-performance@postgresql.org" <pgsql-performance@postgresql.org>
Content-Type: multipart/alternative; boundary="001a1147e7a0b282340553126b9d"
Precedence: bulk
Sender: pgsql-performance-owner@postgresql.org

--001a1147e7a0b282340553126b9d
Content-Type: text/plain; charset="UTF-8"

Hello folks,

Thank you very much for the analysis and suggestions; there is a lot to
learn here. I just tried the UNION queries and got the following error:

ERROR:  FOR UPDATE is not allowed with UNION/INTERSECT/EXCEPT

I made a table dump for anyone who wants to give it a spin
https://app.box.com/s/464b12glmlk5o4gvzz7krc4c8s2fxlwr
and here is the gist for the original commands
https://gist.github.com/lessless/33215d0c147645db721e74e07498ac53
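
One way around this error might be to keep the UNION ALL inside a subquery
and take the row locks in an outer query on campaign_jobs, since the
restriction applies only when FOR UPDATE sits at the same query level as
the UNION. Below is an untested sketch, not a measured result. It assumes,
as Brad did, that the status values are the constants 0, 2 and 1, that the
limit is 100, and that "id" is unique. $1, $2 and $3 stand for the excluded
ids, the failed_at bound and the started_at bound.

```sql
-- Sketch only: status constants, LIMIT 100 and unique "id" are assumptions.
SELECT cj."id"
FROM "campaign_jobs" AS cj
WHERE cj."id" IN (
    -- Candidate ids; each branch can be matched to its own partial index.
    SELECT "id" FROM "campaign_jobs"
    WHERE "status" = 0 AND NOT ("id" = ANY($1))
    UNION ALL
    SELECT "id" FROM "campaign_jobs"
    WHERE "status" = 2 AND "failed_at" > $2
    UNION ALL
    SELECT "id" FROM "campaign_jobs"
    WHERE "status" = 1 AND "started_at" < $3
)
-- Ordering and locking happen on the outer table reference, not the UNION.
ORDER BY cj."priority" DESC, cj."times_failed"
LIMIT 100
FOR UPDATE SKIP LOCKED;
```

Because the outer query reads "priority" and "times_failed" from the table
itself, the ORDER BY no longer has to reference columns that the UNION
branches do not select. Whether the planner still picks a partial index for
each branch would need to be checked with EXPLAIN ANALYZE.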

On Wed, Jun 28, 2017 at 8:10 PM, Brad DeJong <Brad.Dejong@infor.com> wrote:

>
>
> On 2017-06-28, Pavel Stehule wrote ...
> > On 2017-06-28, Yevhenii Kurtov wrote ...
> >> On 2017-06-28, Pavel Stehule wrote ...
> >>> On 2017-06-28, Yevhenii Kurtov wrote ...
> >>>> We have a query that is run almost each second and it's very
> >>>> important to squeeze every other ms out of it. The query is:
> >>>> ...
> >>>> I added following index: CREATE INDEX ON campaign_jobs(id, status,
> >>>> failed_at, started_at, priority DESC, times_failed);
> >>>> ...
> >>> There are few issues
> >>> a) parametrized LIMIT
> >>> b) complex predicate with lot of OR
> >>> c) slow external sort
> >>>
> >>> b) signalize maybe some strange in design .. try to replace "OR" by
> >>> "UNION" query
> >>> c) if you can and you have good enough memory .. try to increase
> >>> work_mem .. maybe 20MB
> >>>
> >>> if you change query to union queries, then you can use conditional
> >>> indexes
> >>>
> >>> create index(id) where status = 0;
> >>> create index(failed_at) where status = 2;
> >>> create index(started_at) where status = 1;
> >>
> >> Can you please give a tip how to rewrite the query with UNION clause?
> >
> > SELECT c0."id" FROM "campaign_jobs" AS c0
> > WHERE (((c0."status" = $1) AND NOT (c0."id" = ANY($2))))
> > UNION SELECT c0."id" FROM "campaign_jobs" AS c0
> > WHERE ((c0."status" = $3) AND (c0."failed_at" > $4))
> > UNION SELECT c0."id" FROM "campaign_jobs" AS c0
> > WHERE ((c0."status" = $5) AND (c0."started_at" < $6))
> > ORDER BY c0."priority" DESC, c0."times_failed"
> > LIMIT $7
> > FOR UPDATE SKIP LOCKED
>
>
> Normally (at least for developers I've worked with), that kind of query
> structure is used when the "status" values don't overlap and don't change
> from query to query. Judging from Pavel's suggested conditional indexes
> (i.e. "where status = <constant>"), he also thinks that is likely.
>
> Give the optimizer that information so that it can use it. Assuming $1 = 0
> and $3 = 2 and $5 = 1, substitute literals. Substitute literal for $7 in
> limit. Push order by and limit to each branch of the union all (or does
> Postgres figure that out automatically?) Replace union with union all (not
> sure about Postgres, but allows other dbms to avoid sorting and merging
> result sets to eliminate duplicates). (Use of UNION ALL assumes that "id"
> is unique across rows as implied by only "id" being selected with FOR
> UPDATE. If multiple rows can have the same "id", then use UNION to
> eliminate the duplicates.)
>
> SELECT "id" FROM "campaign_jobs" WHERE "status" = 0 AND NOT "id" = ANY($1)
>   UNION ALL
> SELECT "id" FROM "campaign_jobs" WHERE "status" = 2 AND "failed_at" > $2
>   UNION ALL
> SELECT "id" FROM "campaign_jobs" WHERE "status" = 1 AND "started_at" < $3
> ORDER BY "priority" DESC, "times_failed"
> LIMIT 100
> FOR UPDATE SKIP LOCKED
>
>
> Another thing that you could try is to push the ORDER BY and LIMIT to the
> branches of the UNION (or does Postgres figure that out automatically?) and
> use slightly different indexes. This may not make sense for all the
> branches but one nice thing about UNION is that each branch can be tweaked
> independently. Also, there are probably unmentioned functional dependencies
> that you can use to reduce the index size and/or improve your match rate.
> Example - if status = 1 means that the campaign_job has started but not
> failed or completed, then you may know that started_at is set, but
> failed_at and ended_at are null. The < comparison in and of itself implies
> that only rows where "started_at" is not null will match the condition.
>
> SELECT c0."id" FROM "campaign_jobs" AS c0
> WHERE (((c0."status" = 0) AND NOT (c0."id" = ANY($1))))
> ORDER BY c0."priority" DESC, c0."times_failed" LIMIT 100
> UNION ALL
> SELECT c0."id" FROM "campaign_jobs" AS c0
> WHERE ((c0."status" = 2) AND (c0."failed_at" > $2))
> ORDER BY c0."priority" DESC, c0."times_failed" LIMIT 100
> UNION ALL
> SELECT c0."id" FROM "campaign_jobs" AS c0
> WHERE ((c0."status" = 1) AND (c0."started_at" < $3))
> ORDER BY c0."priority" DESC, c0."times_failed" LIMIT 100
> ORDER BY c0."priority" DESC, c0."times_failed"
> LIMIT 100
> FOR UPDATE SKIP LOCKED
>
> Including the "priority", "times_failed" and "id" columns in the indexes
> along with "failed_at"/"started_at" allows the optimizer to do index only
> scans. (May still have to do random I/O to the data page to determine tuple
> version visibility but I don't think that can be eliminated.)
>
> create index ... ("priority" desc, "times_failed", "id")
>   where "status" = 0;
> create index ... ("priority" desc, "times_failed", "id", "failed_at")
>   where "status" = 2 and "failed_at" is not null;
> create index ... ("priority" desc, "times_failed", "id", "started_at")
>   where "status" = 1 and "started_at" is not null;
>   -- and ended_at is null and ...
>
>
> I'm assuming that the optimizer knows that "where status = 1 and
> started_at < $3" implies "and started_at is not null" and will consider the
> conditional index. If not, then the "and started_at is not null" needs to
> be explicit.
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>

--001a1147e7a0b282340553126b9d--