Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtp (Exim 4.84_2) (envelope-from ) id 1dPbV5-0001X6-Au for pgsql-performance@arkaria.postgresql.org; Mon, 26 Jun 2017 21:24:51 +0000 Received: from localhost ([127.0.0.1] helo=postgresql.org) by malur.postgresql.org with smtp (Exim 4.84_2) (envelope-from ) id 1dPbV4-0006f8-Np for pgsql-performance@arkaria.postgresql.org; Mon, 26 Jun 2017 21:24:50 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA384:256) (Exim 4.84_2) (envelope-from ) id 1dPbTG-0003R9-Ar for pgsql-performance@postgresql.org; Mon, 26 Jun 2017 21:22:58 +0000 Received: from mail-vk0-x230.google.com ([2607:f8b0:400c:c05::230]) by magus.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA1:256) (Exim 4.84_2) (envelope-from ) id 1dPbTB-0007J2-TB for pgsql-performance@postgresql.org; Mon, 26 Jun 2017 21:22:57 +0000 Received: by mail-vk0-x230.google.com with SMTP id r126so7323288vkg.0 for ; Mon, 26 Jun 2017 14:22:53 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc; bh=/XZEqjt2wyT5hWZdX/t0npkMAj0EKtydqn12xnPU8HE=; b=V71a/bKB6dacMwEZgw0IyqPd90uHJeBbjjZq7n65a9EkdtCzAFYA7Xhi9471pM1WfZ AbsjtQxdbkXwKEzG7rqW8Ij+ashw3QOQkiZo3tNa88HNNq3D5VjXKmXEE/yuhxep6Ysu QHl+L2qzJUktLysBDizwmdEoswDrWgnw1lfkEp3Z4VSjfqQEtjI3nhIk/KUTuoL5gQ+0 QnYx6yXBuWiKH/h+WVDwKCK26YXGyPfi1J9G6Xgtp51ypNeXmJqWwBc2T6WnFsBarg26 fzIgaqtZQV7h5f7qqi2yJVHOpJy6x3FIGuB42rUqRsXlClHR/yLftbyPKSXKjJeA+exS 0rAw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc; bh=/XZEqjt2wyT5hWZdX/t0npkMAj0EKtydqn12xnPU8HE=; b=AyiAqm6MqMevwr0YorFQDuhFUhwl8LxdGIN3g5fvBPW/zm3F/DoXQ+bg0gHSyOZVba 2JHH6jCqr9hPCSYsDc/Z2+XVVPwZHRnjPaSop82EqYUKAmTjqG7IDPrJ7GGduXtj9y+W 3O2ol6Sw2EBWdyHx/AlMtoI9hUCBV2rWEKIEoUg3R3w7/lOzD39Xny27GrhG0StTKSZM jtyQKLhm8ktbhbjybblvPS2NWfrUG+tktNCdUVfbfrHU7wKNJl002Xuws+AkqV8qyNTn vupgehMk2J/L/+rn/k8P3CjAhGCbQObr3s1IrDk83wSUZZsrYn58YIfR5sgxaLCdw1/u 9VMQ== X-Gm-Message-State: AKS2vOyAZ7zE+gkIER9qEMqG5ITxIAdouSXTTLiyh4qVr4OqXgW2lOX6 OruTMaZYFjQPpHNZk3TjtgbIJcdE/aEGndM= X-Received: by 10.31.2.209 with SMTP id 200mr1042903vkc.65.1498512171790; Mon, 26 Jun 2017 14:22:51 -0700 (PDT) MIME-Version: 1.0 Received: by 10.159.39.198 with HTTP; Mon, 26 Jun 2017 14:22:51 -0700 (PDT) In-Reply-To: References: From: Jeff Janes Date: Mon, 26 Jun 2017 14:22:51 -0700 Message-ID: Subject: Re: Fwd: Slow query from ~7M rows, joined to two tables of ~100 rows each To: Chris Wilson Cc: "pgsql-performance@postgresql.org" Content-Type: multipart/alternative; boundary="001a113f7f9cb0f89b0552e38de4" List-Archive: List-Help: List-ID: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: X-Mailing-List: pgsql-performance Precedence: bulk Sender: pgsql-performance-owner@postgresql.org --001a113f7f9cb0f89b0552e38de4 Content-Type: text/plain; charset="UTF-8" On Fri, Jun 23, 2017 at 1:09 PM, Chris Wilson wrote: > > The records can already be read in order from idx_metric_value.... If this > was selected as the primary table, and metric_pos was joined to it, then > the output would also be in order, and no sort would be needed. > > We should be able to use a merge join to metric_pos, because it can be > read in order of id_metric (its primary key, and the first column in > idx_metric_value...). If not, a hash join should be faster than a nested > loop, if we only have to hash ~100 records. > Hash joins do not preserve order. They could preserve the order of their "first" input, but only if the hash join is all run in one batch and doesn't spill to disk. But the hash join code is never prepared to make a guarantee that it won't spill to disk, and so never considers it to preserve order. It thinks it only needs to hash 100 rows, but it is never absolutely certain of that, until it actually executes. If I set enable_sort to false, then I do get the merge join you want (but with asset_pos joined by nested loop index scan, not a hash join, for the reason just stated above) but that is slower than the plan with the sort in it, just like PostgreSQL thinks it will be. If I vacuum your fact table, then it can switch to use index only scans. I then get a different plan, still using a sort, which runs in 1.6 seconds. Sorting is not the slow step you think it is. Be warned that "explain (analyze)" can substantially slow down and distort this type of query, especially when sorting. You should run "explain (analyze, timing off)" first, and then only trust "explain (analyze)" if the overall execution times between them are similar. > If I remove one of the joins (asset_pos) then I get a merge join between > two indexes, as expected, but it has a materialize just before it which > makes no sense to me. Why do we need to materialize here? And why > materialise 100 rows into 1.5 million rows? (explain.depesz.com > ) > -> Materialize (cost=0.14..4.89 rows=100 width=8) (actual > time=0.018..228.265 rows=1504801 loops=1) > Buffers: shared hit=2 > -> Index Only Scan using idx_metric_pos_id_pos on metric_pos > (cost=0.14..4.64 rows=100 width=8) (actual time=0.013..0.133 rows=100 > loops=1) > Heap Fetches: 100 > Buffers: shared hit=2 > > It doesn't need to materialize, it does it simply because it thinks it will be faster (which it is, slightly). You can prevent it from doing so by set enable_materialize to off. The reason it is faster is that with the materialize, it can check all the visibility filters at once, rather than having to do it repeatedly. It is only materializing 100 rows, the 1504801 comes from the number of rows the projected out of the materialized table (one for each row in the other side of the join, in this case), rather than the number of rows contained within it. And again, vacuum your tables. Heap fetches aren't cheap. > The size of the result set is approximately 91 MB (measured with psql -c | > wc -c). Why does it take 4 seconds to transfer this much data over a UNIX > socket on the same box? > It has to convert the data to a format used for the wire protocol (hardware independent, and able to support user defined and composite types), and then back again. > work_mem = 100MB Can you give it more than that? How many simultaneous connections do you expect? Cheers, Jeff --001a113f7f9cb0f89b0552e38de4 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
On F= ri, Jun 23, 2017 at 1:09 PM, Chris Wilson <chris+postgresql@qwirx= .com> wrote:

=
The records can already be read in order from idx_metric_value..= .. If this was selected as the primary table, and metric_pos was joined to = it, then the output would also be in order, and no sort would be needed.
We should be able to use a merge join to metric_pos, becaus= e it can be read in order of id_metric (its primary key, and the first colu= mn in idx_metric_value...). If not, a hash join should be faster than a nes= ted loop, if we only have to hash ~100 records.
=

Hash joins do not preserve order.=C2=A0 Th= ey could preserve the order of their "first" input, but only if t= he hash join is all run in one batch and doesn't spill to disk.=C2=A0 B= ut the hash join code is never prepared to make a guarantee that it won'= ;t spill to disk, and so never considers it to preserve order.=C2=A0 It thi= nks it only needs to hash 100 rows, but it is never absolutely certain of t= hat, until it actually executes.

If I set enable_s= ort to false, then I do get the merge join you want (but with asset_pos joi= ned by nested loop index scan, not a hash join, for the reason just stated = above) but that is slower than the plan with the sort in it, just like Post= greSQL thinks it will be.

If I vacuum your fact ta= ble, then it can switch to use index only scans.=C2=A0 I then get a differe= nt plan, still using a sort, which runs in 1.6 seconds.=C2=A0 Sorting is no= t the slow step you think it is.

Be warned that &q= uot;explain (analyze)" can substantially slow down and distort this ty= pe of query, especially when sorting.=C2=A0 You should run "explain (a= nalyze, timing off)" first, and then only trust "explain (analyze= )" if the overall execution times between them are similar.
=
=C2=A0
If I re= move one of the joins (asset_pos) then I get a merge join between two index= es, as expected, but it has a materialize just before it which makes no sen= se to me. Why do we need to materialize here? And why materialise 100 rows = into 1.5 million rows? (explain.depesz.com)


=C2=A0 =C2=A0-> =C2=A0Materialize =C2=A0(c= ost=3D0.14..4.89 rows=3D100 width=3D8) (actual time=3D0.018..228.265 rows= =3D1504801 loops=3D1)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D2=
=C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0-> =C2=A0Index Only Scan using idx_metric_pos_id_pos on= metric_pos =C2=A0(cost=3D0.14..4.64 rows=3D100 width=3D8) (actual time=3D0= .013..0.133 rows=3D100 loops=3D1)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0Heap Fetches: 100
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: s= hared hit=3D2

It doesn't need to materialize, it does it si= mply because it thinks it will be faster (which it is, slightly).=C2=A0 You= can prevent it from doing so by set enable_materialize to off.=C2=A0 The r= eason it is faster is that with the materialize, it can check all the visib= ility filters at once, rather than having to do it repeatedly.=C2=A0 It is = only materializing 100 rows, the 1504801 comes from the number of rows the = projected out of the materialized table (one for each row in the other side= of the join, in this case), rather than the number of rows contained withi= n it.

And again, vacuum your tables.=C2=A0 H= eap fetches aren't cheap.


The size of the result set is app= roximately 91 MB (measured with psql -c | wc -c). Why does it take 4 second= s to transfer this much data over a UNIX socket on the same box?

It has to convert the data to= a format used for the wire protocol (hardware independent, and able to sup= port user defined and composite types), and then back again.

=
>=C2=A0work= _mem =3D 100MB

Can you give it more than that?=C2=A0 How many simultaneous conn= ections do you expect?

Cheers,
=

Jeff

--001a113f7f9cb0f89b0552e38de4--