MIME-Version: 1.0
In-Reply-To: <ea7b55a0-ab05-a539-0c08-c4c8e5c83bbd@catalyst.net.nz>
References: 
 <CADFyZw7aGoD0AaStxdyHByR5Qta=M5wx0v=iptKLhPUp+EOKvA@mail.gmail.com>
 <dc5d2b63-0a1b-ff76-c88a-87c2c41bd5b8@a-kretschmer.de>
 <CADFyZw4UanW5TbFajWKWhN9XcW+8gtCXw+kssHo47Wpr1A=zJw@mail.gmail.com>
 <DM5PR07MB28103FD558CB6628CECE07BEDAA90@DM5PR07MB2810.namprd07.prod.outlook.com>
 <CADFyZw6JrhsLR_eYOeCjjiQMzz5bepk6AfMRBu0hnaQg+vN-=A@mail.gmail.com>
 <a8fa0c8a-1269-338e-80a6-cb121574be7c@catalyst.net.nz>
 <CADFyZw5-WzdkACN39-ND9tvBwmvfdEhFaYL3Ds82a3Rwav-neA@mail.gmail.com>
 <ea7b55a0-ab05-a539-0c08-c4c8e5c83bbd@catalyst.net.nz>
From: Charles Nadeau <charles.nadeau@gmail.com>
Date: Sat, 15 Jul 2017 19:53:56 +0200
Message-ID: 
 <CADFyZw6S+vzHPvGnYMMQ4WcvcGkdkDX=bdo48knBmQcq_fpFFA@mail.gmail.com>
Subject: Re: Very poor read performance, query independent
To: Mark Kirkwood <mark.kirkwood@catalyst.net.nz>,
	"pgsql-performa." <pgsql-performance@postgresql.org>
Content-Type: multipart/alternative; boundary="001a11420e08bb272a05545eda8e"
Precedence: bulk
Sender: pgsql-performance-owner@postgresql.org

--001a11420e08bb272a05545eda8e
Content-Type: text/plain; charset="UTF-8"

Mark,

I increased the read-ahead to 16384 and it didn't improve performance. My
RAID 0 uses a stripe size of 256k, the maximum size supported by the
controller.
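
To put those two settings side by side, here is a rough sketch that converts
the read-ahead value to bytes and compares it with one full stripe across the
array. It rests on assumptions not stated in this thread: that "stripe size"
means the per-disk chunk, and that the read-ahead value is either in 512-byte
sectors (the unit of `blockdev --setra`) or in KiB (the unit of the sysfs
`read_ahead_kb` setting). The function names are made up for illustration.

```python
SECTOR_BYTES = 512   # unit used by `blockdev --setra`
KIB_BYTES = 1024     # unit used by /sys/block/<dev>/queue/read_ahead_kb


def full_stripe_bytes(disks: int, chunk_kib: int) -> int:
    """Bytes in one full stripe, assuming chunk_kib is the per-disk chunk."""
    return disks * chunk_kib * 1024


def read_ahead_bytes(value: int, unit_bytes: int) -> int:
    """Read-ahead window in bytes for a setting expressed in unit_bytes."""
    return value * unit_bytes


# 5 disks in RAID 0 with a 256k stripe, as described in this message.
stripe = full_stripe_bytes(disks=5, chunk_kib=256)

# The same setting of 16384 under both unit assumptions.
ra_if_sectors = read_ahead_bytes(16384, SECTOR_BYTES)
ra_if_kib = read_ahead_bytes(16384, KIB_BYTES)

print(f"full stripe:            {stripe / 1024:.0f} KiB")
print(f"read-ahead if sectors:  {ra_if_sectors / stripe:.1f} full stripes")
print(f"read-ahead if KiB:      {ra_if_kib / stripe:.1f} full stripes")
```

Under either assumption the read-ahead window spans several full stripes, so
a single sequential stream should be able to touch all five disks.
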
Thanks!

Charles
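
P.S. For anyone re-deriving the throughput figures quoted below, here is a
minimal sketch of the arithmetic: relation size divided by elapsed time. It
assumes the whole relation was read from disk (cold cache) and that "MB"
means MiB. The function name is hypothetical; the numbers are copied from the
quoted psql output.

```python
def scan_throughput_mib_s(relation_bytes: int, elapsed_ms: float) -> float:
    """Average sequential-scan throughput in MiB/s."""
    mib = relation_bytes / (1024 * 1024)
    seconds = elapsed_ms / 1000.0
    return mib / seconds


# pg_relation_size() and \timing values from the two tests quoted below.
flows_mib_s = scan_throughput_mib_s(129_865_867_264, 625_546.662)
bench_mib_s = scan_throughput_mib_s(13_429_514_240, 118_884.277)

print(f"flows (5-disk RAID 0):         {flows_mib_s:.1f} MiB/s")
print(f"pgbench_accounts (1 SATA):     {bench_mib_s:.1f} MiB/s")
```

This works out to roughly 198 MiB/s for flows and roughly 108 MiB/s for the
single-drive test, in line with the figures quoted below.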

On Sat, Jul 15, 2017 at 1:02 AM, Mark Kirkwood <
mark.kirkwood@catalyst.net.nz> wrote:

> Ah yes - that seems more sensible (but still slower than I would expect
> for a 5-disk RAID 0). You should be able to get something like 5 * (single
> disk speed), i.e. about 500MB/s.
>
> Might be worth increasing device read-ahead (more than you have already).
> Some of these so-called 'smart' RAID cards need to be hit over the head
> before they will perform. E.g., I believe you have it set to 128 - I'd try
> 4096 or even 16384 (in the past I've used those settings on some extremely
> stupid cards that refused to max out their disks' known speeds).
>
> Also worth investigating is RAID stripe size - for DW work it makes sense
> for it to be reasonably big (256K to 1M), which again will help speed up
> sequential scans.
>
> Cheers
>
> Mark
>
> P.S. I used to work for Greenplum, so this type of problem came up a lot
> :-) . The best cards were the LSI and Areca!
>
>
>
> On 15/07/17 02:09, Charles Nadeau wrote:
>
>> Mark,
>>
>> First I must say that I changed my disk configuration from 4 disks in
>> RAID 10 to 5 disks in RAID 0 because I almost ran out of disk space during
>> the last ingest of data.
>> Here is the result of the test you asked for. It was done with a cold cache:
>>
>>     flows=# \timing
>>     Timing is on.
>>     flows=# explain select count(*) from flows;
>>                                               QUERY PLAN
>>     -----------------------------------------------------------------------------------------------
>>      Finalize Aggregate  (cost=17214914.09..17214914.09 rows=1 width=8)
>>        ->  Gather  (cost=17214914.07..17214914.09 rows=1 width=8)
>>              Workers Planned: 1
>>              ->  Partial Aggregate  (cost=17213914.07..17213914.07 rows=1 width=8)
>>                    ->  Parallel Seq Scan on flows  (cost=0.00..17019464.49 rows=388899162 width=0)
>>     (5 rows)
>>
>>     Time: 171.835 ms
>>     flows=# select pg_relation_size('flows');
>>      pg_relation_size
>>     ------------------
>>          129865867264
>>     (1 row)
>>
>>     Time: 57.157 ms
>>     flows=# select count(*) from flows;
>>     LOG:  duration: 625546.522 ms  statement: select count(*) from flows;
>>        count
>>     -----------
>>      589831190
>>     (1 row)
>>
>>     Time: 625546.662 ms
>>
>> The throughput reported by PostgreSQL is almost 198MB/s, and the
>> throughput as measured by dstat during the query execution was between 25
>> and 299MB/s. It is much better than what I had before! The I/O wait was
>> about 12% throughout the query. One thing I noticed is the discrepancy
>> between the read throughput reported by pg_activity and the one reported by
>> dstat: pg_activity always reports a lower value than dstat.
>>
>> Besides the change of disk configuration, here is what contributed the
>> most to the performance improvement so far:
>>
>>     Using huge pages
>>     Increasing effective_io_concurrency to 256
>>     Reducing random_page_cost from 22 to 4
>>     Reducing min_parallel_relation_size to 512kB to have more workers
>>     when doing a parallel sequential scan of my biggest table
>>
>>
>> Thanks for recommending this test, I now know what the real throughput
>> should be!
>>
>> Charles
>>
>> On Wed, Jul 12, 2017 at 4:11 AM, Mark Kirkwood <
>> mark.kirkwood@catalyst.net.nz> wrote:
>>
>>     Hmm - how are you measuring that sequential scan speed of 4MB/s?
>>     I'd recommend doing a very simple test, e.g. here's one on my
>>     workstation - 13 GB single table on 1 SATA drive - cold cache
>>     after reboot, sequential scan using Postgres 9.6.2:
>>
>>     bench=#  EXPLAIN SELECT count(*) FROM pgbench_accounts;
>>                                          QUERY PLAN
>>     ------------------------------------------------------------------------------------
>>      Aggregate  (cost=2889345.00..2889345.01 rows=1 width=8)
>>        ->  Seq Scan on pgbench_accounts  (cost=0.00..2639345.00 rows=100000000 width=0)
>>     (2 rows)
>>
>>
>>     bench=#  SELECT pg_relation_size('pgbench_accounts');
>>      pg_relation_size
>>     ------------------
>>           13429514240
>>     (1 row)
>>
>>     bench=# SELECT count(*) FROM pgbench_accounts;
>>        count
>>     -----------
>>      100000000
>>     (1 row)
>>
>>     Time: 118884.277 ms
>>
>>
>>     So, doing the math, the sequential read speed is about 110MB/s (i.e.
>>     13 GB in 120 sec). Sure enough, while I was running the query iostat showed:
>>
>>     Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>     sda               0.00     0.00  926.00    0.00   114.89     0.00   254.10     1.90    2.03    2.03    0.00   1.08 100.00
>>
>>
>>     So might be useful for us to see something like that from your
>>     system - note you need to check you really have flushed the cache,
>>     and that no other apps are using the db.
>>
>>     regards
>>
>>     Mark
>>
>>
>>     On 12/07/17 00:46, Charles Nadeau wrote:
>>
>>         After reducing random_page_cost to 4 and testing more, I can
>>         report that the aggregate read throughput for parallel
>>         sequential scan is about 90MB/s. However the throughput for
>>         sequential scan is still around 4MB/s.
>>
>>
>>
>>
>>
>> --
>> Charles Nadeau Ph.D.
>> http://charlesnadeau.blogspot.com/
>>
>
>


-- 
Charles Nadeau Ph.D.
http://charlesnadeau.blogspot.com/

--001a11420e08bb272a05545eda8e--