MIME-Version: 1.0
In-Reply-To: 
 <CAOg7f83Z6Cixa97jfK4MwC_gZUSnghpq+Uo=qUOufhOyBDiq6Q@mail.gmail.com>
References: 
 <CAOg7f833bP5LEHbQkHyjv5ZHA+P_LXrogwLDfOpgiu7jK=rRpQ@mail.gmail.com>
 <CAOg7f83Z6Cixa97jfK4MwC_gZUSnghpq+Uo=qUOufhOyBDiq6Q@mail.gmail.com>
From: Jeff Janes <jeff.janes@gmail.com>
Date: Mon, 26 Jun 2017 14:22:51 -0700
Message-ID: 
 <CAMkU=1wH2W7z6Roynwu0pQq991buSFAjBdmPacSmEiseXqU7Cg@mail.gmail.com>
Subject: Re: Fwd: Slow query from ~7M rows, joined to two tables of
 ~100 rows each
To: Chris Wilson <chris+postgresql@qwirx.com>
Cc: "pgsql-performance@postgresql.org" <pgsql-performance@postgresql.org>
Content-Type: multipart/alternative; boundary="001a113f7f9cb0f89b0552e38de4"
Precedence: bulk
Sender: pgsql-performance-owner@postgresql.org

--001a113f7f9cb0f89b0552e38de4
Content-Type: text/plain; charset="UTF-8"

On Fri, Jun 23, 2017 at 1:09 PM, Chris Wilson <chris+postgresql@qwirx.com>
wrote:

>
> The records can already be read in order from idx_metric_value.... If this
> was selected as the primary table, and metric_pos was joined to it, then
> the output would also be in order, and no sort would be needed.
>
> We should be able to use a merge join to metric_pos, because it can be
> read in order of id_metric (its primary key, and the first column in
> idx_metric_value...). If not, a hash join should be faster than a nested
> loop, if we only have to hash ~100 records.
>

Hash joins do not preserve order.  They could preserve the order of their
"first" input, but only if the hash join is all run in one batch and
doesn't spill to disk.  But the hash join code is never prepared to
guarantee that it won't spill to disk, so the planner never considers a
hash join to preserve order.  It estimates that it only needs to hash 100
rows, but it is never absolutely certain of that until it actually executes.
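
To make the batching point concrete, here is a toy sketch in Python.  It is
not PostgreSQL's executor code; the function name, the modulo partitioning
and the sample rows are all assumptions made up for illustration.  It only
shows that once the input is split into more than one batch, the output no
longer follows the order of the probe side.

```python
def hash_join(outer, inner, n_batches=1):
    """Toy hash join on the first field of each row (illustrative only).

    Builds a hash table on `inner` and probes it with `outer`. With
    n_batches > 1 the rows are partitioned by key and the batches are
    joined one after another, roughly as happens when the hash table
    does not fit in memory.
    """
    joined = []
    for batch in range(n_batches):
        # Build phase: hash only the inner rows that belong to this batch.
        table = {}
        for key, payload in inner:
            if key % n_batches == batch:
                table.setdefault(key, []).append(payload)
        # Probe phase: outer rows are visited in their original order,
        # but only those belonging to the current batch.
        for key, value in outer:
            if key % n_batches == batch:
                for payload in table.get(key, []):
                    joined.append((key, value, payload))
    return joined


# Outer input already sorted by key, like an ordered index scan.
outer = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
inner = [(1, 10), (2, 20), (3, 30), (4, 40)]

one_batch = [row[0] for row in hash_join(outer, inner, n_batches=1)]
two_batches = [row[0] for row in hash_join(outer, inner, n_batches=2)]
print(one_batch)    # [1, 2, 3, 4]
print(two_batches)  # [2, 4, 1, 3]
```

With a single batch the keys come out in probe order; with two batches they
do not, which is why the order cannot be promised at plan time.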

If I set enable_sort to false, then I do get the merge join you want (but
with asset_pos joined by nested loop index scan, not a hash join, for the
reason just stated above) but that is slower than the plan with the sort in
it, just like PostgreSQL thinks it will be.

If I vacuum your fact table, then it can switch to use index only scans.  I
then get a different plan, still using a sort, which runs in 1.6 seconds.
Sorting is not the slow step you think it is.

Be warned that "explain (analyze)" can substantially slow down and distort
this type of query, especially when sorting.  You should run "explain
(analyze, timing off)" first, and then only trust "explain (analyze)" if
the overall execution times between them are similar.


> If I remove one of the joins (asset_pos) then I get a merge join between
> two indexes, as expected, but it has a materialize just before it which
> makes no sense to me. Why do we need to materialize here? And why
> materialise 100 rows into 1.5 million rows? (explain.depesz.com
> <https://explain.depesz.com/s/7mkM>)
>


>    ->  Materialize  (cost=0.14..4.89 rows=100 width=8) (actual time=0.018..228.265 rows=1504801 loops=1)
>          Buffers: shared hit=2
>          ->  Index Only Scan using idx_metric_pos_id_pos on metric_pos  (cost=0.14..4.64 rows=100 width=8) (actual time=0.013..0.133 rows=100 loops=1)
>                Heap Fetches: 100
>                Buffers: shared hit=2
>

It doesn't need to materialize; it does so simply because it thinks it will
be faster (which it is, slightly).  You can prevent it from doing so by
setting enable_material to off.  The reason it is faster is that with the
materialize, it checks tuple visibility once for each of the 100 rows,
rather than repeating those checks every time the rows are re-read.  It is
only materializing 100 rows; the 1504801 is the number of rows returned out
of the materialized table (one for each row on the other side of the join,
in this case), rather than the number of rows contained within it.
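
A small Python model of the row counting, in case it helps.  This is a
simplified sketch, not the real merge join code: the Materialize class, the
merge_join function and the sample data are hypothetical.  The inner side is
read from its child once, but rows are handed back to the join again each
time the join rewinds for a repeated outer key.

```python
class Materialize:
    """Toy materialize node: reads its child once, then replays from memory."""

    def __init__(self, child_rows):
        self.store = list(child_rows)  # the child is scanned exactly once
        self.pos = 0
        self.emitted = 0               # rows handed to the parent node

    def fetch(self):
        if self.pos >= len(self.store):
            return None
        row = self.store[self.pos]
        self.pos += 1
        self.emitted += 1
        return row

    def mark(self):
        return self.pos

    def restore(self, pos):
        self.pos = pos


def merge_join(outer_rows, inner):
    """Merge join of (key, value) rows; both inputs must be sorted by key."""
    result = []
    cur = inner.fetch()
    marked_row, marked_pos = None, None
    for o_key, o_val in outer_rows:
        # Same key as the previous match: rewind the inner side to the mark.
        if marked_row is not None and marked_row[0] == o_key:
            inner.restore(marked_pos)
            cur = marked_row
        # Skip inner rows that sort before the current outer key.
        while cur is not None and cur[0] < o_key:
            cur = inner.fetch()
        if cur is None:
            break  # inner side exhausted, no later outer row can match
        if cur[0] == o_key:
            # Remember the first matching row and the position just after it.
            marked_row, marked_pos = cur, inner.mark()
            while cur is not None and cur[0] == o_key:
                result.append((o_key, o_val, cur[1]))
                cur = inner.fetch()
    return result


inner = Materialize([(1, "p1"), (2, "p2"), (3, "p3")])
outer = [(1, "a"), (1, "b"), (1, "c"), (2, "d"), (2, "e"), (3, "f")]
result = merge_join(outer, inner)
print(len(inner.store))  # 3 rows held in the materialized store
print(inner.emitted)     # 6 rows handed to the join
print(len(result))       # 6 joined rows
```

Three rows are stored, but six are returned to the join, about one per outer
row.  The plan above shows the same effect at a larger scale.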

And again, vacuum your tables.  Heap fetches aren't cheap.


> The size of the result set is approximately 91 MB (measured with psql -c |
> wc -c). Why does it take 4 seconds to transfer this much data over a UNIX
> socket on the same box?
>

It has to convert the data to a format used for the wire protocol (hardware
independent, and able to support user defined and composite types), and
then back again.

> work_mem = 100MB

Can you give it more than that?  How many simultaneous connections do you
expect?
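
The reason the connection count matters: work_mem is a per-operation limit,
so each sort or hash node in each session can use up to that much.  A rough
worst-case estimate in Python, where the connection count and the number of
sort/hash nodes per query are made-up numbers for illustration, not
measurements from your system:

```python
def worst_case_work_mem_mb(work_mem_mb, connections, nodes_per_query):
    """Upper bound if every sort/hash node in every session fills work_mem."""
    return work_mem_mb * connections * nodes_per_query


# work_mem = 100MB is from your config; 20 connections and 2 sort/hash
# nodes per query are assumptions chosen only to show the arithmetic.
estimate_mb = worst_case_work_mem_mb(100, connections=20, nodes_per_query=2)
print(estimate_mb)  # 4000
```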

Cheers,

Jeff

--001a113f7f9cb0f89b0552e38de4--