Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtp (Exim 4.84_2) (envelope-from ) id 1dWRGr-0000y6-LJ for pgsql-performance@arkaria.postgresql.org; Sat, 15 Jul 2017 17:54:25 +0000 Received: from localhost ([127.0.0.1] helo=postgresql.org) by malur.postgresql.org with smtp (Exim 4.84_2) (envelope-from ) id 1dWRGq-0003Xj-Hq for pgsql-performance@arkaria.postgresql.org; Sat, 15 Jul 2017 17:54:24 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA384:256) (Exim 4.84_2) (envelope-from ) id 1dWRGp-0003Wz-Od for pgsql-performance@postgresql.org; Sat, 15 Jul 2017 17:54:23 +0000 Received: from mail-qt0-x233.google.com ([2607:f8b0:400d:c0d::233]) by magus.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA1:256) (Exim 4.84_2) (envelope-from ) id 1dWRGk-0003o4-NF for pgsql-performance@postgresql.org; Sat, 15 Jul 2017 17:54:23 +0000 Received: by mail-qt0-x233.google.com with SMTP id 32so82563787qtv.1 for ; Sat, 15 Jul 2017 10:54:18 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:in-reply-to:references:from:date:message-id:subject:to; bh=a5IthCOZZjbPu2Pi6rpw1rvwRwHTVRGewCq7bzt5Mok=; b=R/leN7/C//0p722sH9SwEqP0E7zMf3im2QOqGJhYwDkk3IS91DkfwputcwXPbm91hZ VhEeY/qGiYJcz2lAwvQw3keE32W7ZT+icFCNAI4sJneB7aLj9DktcDO7pvQ4/pkW3jPd uFeBmQY5RQWVWBL4x4C58yWYNXt0uVF+se+zhyLTAN5wABgS6g3yUC8JM5WBlJ8s8Mf9 jziSKgN3y9uu+MHEFzYIGaJBbPlb+udoyUg4Lj1+fT26kj5HK7pBcZ+HunXEcY4z1+qM NRS/SyUHnlZ5SmHOB/OiEm3JXClzOj9QrDjyJ7qlF3aAFee6kCvO+PT81R9Yp37N5wtj mzDA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to; bh=a5IthCOZZjbPu2Pi6rpw1rvwRwHTVRGewCq7bzt5Mok=; b=kDloOcAegBpKHrupK5ODcy686srcTCZSX0eWsSnzGtVWDU6FMvL20e13+7yNvwnpd3 9p3IGa23Y8JyUAp+Szovl7Iz0fuK1fpXpJGvDHv0nUde0BGOV1ufZSu3fTv1shvIcE8R aKOeTo6EcAmZYI5fsUR0RownBRBwh7vTh6U9uKAiqkO3adSGiYV4RxoaV0bXM+D6Ul7D fXQqLG+XpKASWq2ktY0RLr6aeKmK3Fp7uuwilbUdWV2sYfVQ42GYWUCjt1T1JsiMRaCH yYcM8xrrZy34ZNupSDc0SMWB6oc03xFnEQ5hJwMfT+vQqQ10VcB+5aDsn/Pye+DugtL4 SnTA== X-Gm-Message-State: AIVw113jHhkaxMwv1Ds0/uqfABMWQyB0Ed2BRZmow3xMhih86uyg1TpU 09O1nPJSxfKQB2TwGuMJag+drEyeqQ== X-Received: by 10.237.36.143 with SMTP id t15mr19588163qtc.180.1500141256909; Sat, 15 Jul 2017 10:54:16 -0700 (PDT) MIME-Version: 1.0 Received: by 10.12.157.11 with HTTP; Sat, 15 Jul 2017 10:53:56 -0700 (PDT) In-Reply-To: References: From: Charles Nadeau Date: Sat, 15 Jul 2017 19:53:56 +0200 Message-ID: Subject: Re: Very poor read performance, query independent To: Mark Kirkwood , "pgsql-performa." Content-Type: multipart/alternative; boundary="001a11420e08bb272a05545eda8e" List-Archive: List-Help: List-ID: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: X-Mailing-List: pgsql-performance Precedence: bulk Sender: pgsql-performance-owner@postgresql.org --001a11420e08bb272a05545eda8e Content-Type: text/plain; charset="UTF-8" Mark, I increased the read ahead to 16384 and it doesn't improve performance. My RAID 0 use a stripe size of 256k, the maximum size supported by the controller. Thanks! Charles On Sat, Jul 15, 2017 at 1:02 AM, Mark Kirkwood < mark.kirkwood@catalyst.net.nz> wrote: > Ah yes - that seems more sensible (but still slower than I would expect > for 5 disks RAID 0). You should be able to get something like 5 * (single > disk speed) i.e about 500MB/s. > > Might be worth increasing device read ahead (more than you have already). > Some of these so-called 'smart' RAID cards need to be hit over the head > before they will perform. E.g: I believe you have it set to 128 - I'd try > 4096 or even 16384 (In the past I've used those settings on some extremely > stupid cards that refused to max out their disks known speeds). > > Also worth investigating is RAID stripe size - for DW work it makes sense > for it to be reasonably big (256K to 1M), which again will help speed is > sequential scans. > > Cheers > > Mark > > P.s I used to work for Greenplum, so this type of problem came up a lot > :-) . The best cards were the LSI and Areca! > > > > On 15/07/17 02:09, Charles Nadeau wrote: > >> Mark, >> >> First I must say that I changed my disks configuration from 4 disks in >> RAID 10 to 5 disks in RAID 0 because I almost ran out of disk space during >> the last ingest of data. >> Here is the result test you asked. It was done with a cold cache: >> >> flows=# \timing >> Timing is on. >> flows=# explain select count(*) from flows; >> QUERY PLAN >> ------------------------------------------------------------ >> ----------------------------------- >> Finalize Aggregate (cost=17214914.09..17214914.09 rows=1 width=8) >> -> Gather (cost=17214914.07..17214914.09 rows=1 width=8) >> Workers Planned: 1 >> -> Partial Aggregate (cost=17213914.07..17213914.07 >> rows=1 width=8) >> -> Parallel Seq Scan on flows >> (cost=0.00..17019464.49 rows=388899162 width=0) >> (5 rows) >> >> Time: 171.835 ms >> flows=# select pg_relation_size('flows'); >> pg_relation_size >> ------------------ >> 129865867264 >> (1 row) >> >> Time: 57.157 ms >> flows=# select count(*) from flows; >> LOG: duration: 625546.522 ms statement: select count(*) from flows; >> count >> ----------- >> 589831190 >> (1 row) >> >> Time: 625546.662 ms >> >> The throughput reported by Postgresql is almost 198MB/s, and the >> throughput as mesured by dstat during the query execution was between 25 >> and 299MB/s. It is much better than what I had before! The i/o wait was >> about 12% all through the query. One thing I noticed is the discrepency >> between the read throughput reported by pg_activity and the one reported by >> dstat: pg_activity always report a value lower than dstat. >> >> Besides the change of disks configuration, here is what contributed the >> most to the improvment of the performance so far: >> >> Using Hugepage >> Increasing effective_io_concurrency to 256 >> Reducing random_page_cost from 22 to 4 >> Reducing min_parallel_relation_size to 512kB to have more workers >> when doing sequential parallel scan of my biggest table >> >> >> Thanks for recomending this test, I now know what the real throughput >> should be! >> >> Charles >> >> On Wed, Jul 12, 2017 at 4:11 AM, Mark Kirkwood < >> mark.kirkwood@catalyst.net.nz > >> wrote: >> >> Hmm - how are you measuring that sequential scan speed of 4MB/s? >> I'd recommend doing a very simple test e.g, here's one on my >> workstation - 13 GB single table on 1 SATA drive - cold cache >> after reboot, sequential scan using Postgres 9.6.2: >> >> bench=# EXPLAIN SELECT count(*) FROM pgbench_accounts; >> QUERY PLAN >> ------------------------------------------------------------ >> ------------------------ >> Aggregate (cost=2889345.00..2889345.01 rows=1 width=8) >> -> Seq Scan on pgbench_accounts (cost=0.00..2639345.00 >> rows=100000000 width=0) >> (2 rows) >> >> >> bench=# SELECT pg_relation_size('pgbench_accounts'); >> pg_relation_size >> ------------------ >> 13429514240 >> (1 row) >> >> bench=# SELECT count(*) FROM pgbench_accounts; >> count >> ----------- >> 100000000 >> (1 row) >> >> Time: 118884.277 ms >> >> >> So doing the math seq read speed is about 110MB/s (i.e 13 GB in >> 120 sec). Sure enough, while I was running the query iostat showed: >> >> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s >> avgrq-sz avgqu-sz await r_await w_await svctm %util >> sda 0.00 0.00 926.00 0.00 114.89 0.00 >> 254.10 1.90 2.03 2.03 0.00 1.08 100.00 >> >> >> So might be useful for us to see something like that from your >> system - note you need to check you really have flushed the cache, >> and that no other apps are using the db. >> >> regards >> >> Mark >> >> >> On 12/07/17 00:46, Charles Nadeau wrote: >> >> After reducing random_page_cost to 4 and testing more, I can >> report that the aggregate read throughput for parallel >> sequential scan is about 90MB/s. However the throughput for >> sequential scan is still around 4MB/s. >> >> >> >> >> >> -- >> Charles Nadeau Ph.D. >> http://charlesnadeau.blogspot.com/ >> > > -- Charles Nadeau Ph.D. http://charlesnadeau.blogspot.com/ --001a11420e08bb272a05545eda8e Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Mark,

I increased the read a= head to 16384 and it doesn't improve performance. My RAID 0 use a strip= e size of 256k, the maximum size supported by the controller.
Tha= nks!

Charles

On Sat, Jul 15, 2017 at 1:02 AM, Mark Kirkw= ood <mark.kirkwood@catalyst.net.nz> wrote:
Ah yes - that seems more sensible (but still s= lower than I would expect for 5 disks RAID 0). You should be able to get so= mething like 5 * (single disk speed) i.e about 500MB/s.

Might be worth increasing device read ahead (more than you have already). S= ome of these so-called 'smart' RAID cards need to be hit over the h= ead before they will perform. E.g: I believe you have it set to 128 - I'= ;d try 4096 or even 16384 (In the past I've used those settings on some= extremely stupid cards that refused to max out their disks known speeds).<= br>
Also worth investigating is RAID stripe size - for DW work it makes sense f= or it to be reasonably big (256K to 1M), which again will help speed is seq= uential scans.

Cheers

Mark

P.s I used to work for Greenplum, so this type of problem came up a lot :-)= . The best cards were the LSI and Areca!



On 15/07/17 02:09, Charles Nadeau wrote:
Mark,

First I must say that I changed my disks configuration from 4 disks in RAID= 10 to 5 disks in RAID 0 because I almost ran out of disk space during the = last ingest of data.
Here is the result test you asked. It was done with a cold cache:

=C2=A0 =C2=A0 flows=3D# \timing
=C2=A0 =C2=A0 Timing is on.
=C2=A0 =C2=A0 flows=3D# explain select count(*) from flows;
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 QUERY PLAN
=C2=A0 =C2=A0 --------------------------------------------------------= ---------------------------------------
=C2=A0 =C2=A0 =C2=A0Finalize Aggregate=C2=A0 (cost=3D17214914.09..17214914.= 09 rows=3D1 width=3D8)
=C2=A0 =C2=A0 =C2=A0 =C2=A0->=C2=A0 Gather=C2=A0 (cost=3D17214914.07..17= 214914.09 rows=3D1 width=3D8)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Workers Planned: 1
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0->=C2=A0 Partial Aggrega= te=C2=A0 (cost=3D17213914.07..17213914.07
=C2=A0 =C2=A0 rows=3D1 width=3D8)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0->= =C2=A0 Parallel Seq Scan on flows
=C2=A0 =C2=A0 =C2=A0(cost=3D0.00..17019464.49 rows=3D388899162 width=3D0) =C2=A0 =C2=A0 (5 rows)

=C2=A0 =C2=A0 Time: 171.835 ms
=C2=A0 =C2=A0 flows=3D# select pg_relation_size('flows');
=C2=A0 =C2=A0 =C2=A0pg_relation_size
=C2=A0 =C2=A0 ------------------
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0129865867264
=C2=A0 =C2=A0 (1 row)

=C2=A0 =C2=A0 Time: 57.157 ms
=C2=A0 =C2=A0 flows=3D# select count(*) from flows;
=C2=A0 =C2=A0 LOG:=C2=A0 duration: 625546.522 ms=C2=A0 statement: select co= unt(*) from flows;
=C2=A0 =C2=A0 =C2=A0 =C2=A0count
=C2=A0 =C2=A0 -----------
=C2=A0 =C2=A0 =C2=A0589831190
=C2=A0 =C2=A0 (1 row)

=C2=A0 =C2=A0 Time: 625546.662 ms

The throughput reported by Postgresql is almost 198MB/s, and the throughput= as mesured by dstat during the query execution was between 25 and 299MB/s.= It is much better than what I had before! The i/o wait was about 12% all t= hrough the query. One thing I noticed is the discrepency between the read t= hroughput reported by pg_activity and the one reported by dstat: pg_activit= y always report a value lower than dstat.

Besides the change of disks configuration, here is what contributed the mos= t to the improvment of the performance so far:

=C2=A0 =C2=A0 Using Hugepage
=C2=A0 =C2=A0 Increasing effective_io_concurrency to 256
=C2=A0 =C2=A0 Reducing random_page_cost from 22 to 4
=C2=A0 =C2=A0 Reducing min_parallel_relation_size to 512kB to have more wor= kers
=C2=A0 =C2=A0 when doing sequential parallel scan of my biggest table


Thanks for recomending this test, I now know what the real throughput shoul= d be!

Charles

On Wed, Jul 12, 2017 at 4:11 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz <mailto:mark.kirkwood@catalyst.net.nz>> wrote:

=C2=A0 =C2=A0 Hmm - how are you measuring that sequential scan speed of 4MB= /s?
=C2=A0 =C2=A0 I'd recommend doing a very simple test e.g, here's on= e on my
=C2=A0 =C2=A0 workstation - 13 GB single table on 1 SATA drive - cold cache=
=C2=A0 =C2=A0 after reboot, sequential scan using Postgres 9.6.2:

=C2=A0 =C2=A0 bench=3D#=C2=A0 EXPLAIN SELECT count(*) FROM pgbench_accounts= ;
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0QU= ERY PLAN
=C2=A0 =C2=A0 --------------------------------------------------------= ----------------------------
=C2=A0 =C2=A0 =C2=A0Aggregate=C2=A0 (cost=3D2889345.00..2889345.01 rows=3D1= width=3D8)
=C2=A0 =C2=A0 =C2=A0 =C2=A0->=C2=A0 Seq Scan on pgbench_accounts (cost= =3D0.00..2639345.00
=C2=A0 =C2=A0 rows=3D100000000 width=3D0)
=C2=A0 =C2=A0 (2 rows)


=C2=A0 =C2=A0 bench=3D#=C2=A0 SELECT pg_relation_size('pgbench_accounts');
=C2=A0 =C2=A0 =C2=A0pg_relation_size
=C2=A0 =C2=A0 ------------------
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 13429514240
=C2=A0 =C2=A0 (1 row)

=C2=A0 =C2=A0 bench=3D# SELECT count(*) FROM pgbench_accounts;
=C2=A0 =C2=A0 =C2=A0 =C2=A0count
=C2=A0 =C2=A0 -----------
=C2=A0 =C2=A0 =C2=A0100000000
=C2=A0 =C2=A0 (1 row)

=C2=A0 =C2=A0 Time: 118884.277 ms


=C2=A0 =C2=A0 So doing the math seq read speed is about 110MB/s (i.e 13 GB = in
=C2=A0 =C2=A0 120 sec). Sure enough, while I was running the query iostat s= howed:

=C2=A0 =C2=A0 Device:=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0rrqm/s=C2=A0 =C2=A0w= rqm/s=C2=A0 =C2=A0 =C2=A0r/s=C2=A0 =C2=A0 =C2=A0w/s=C2=A0 =C2=A0 rMB/s wMB/= s
=C2=A0 =C2=A0 avgrq-sz avgqu-sz=C2=A0 =C2=A0await r_await w_await=C2=A0 svc= tm %util
=C2=A0 =C2=A0 sda=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A00.0= 0=C2=A0 =C2=A0 =C2=A00.00=C2=A0 926.00=C2=A0 =C2=A0 0.00 114.89=C2=A0 0.00= =C2=A0 =C2=A0 =C2=A0 254.10=C2=A0 =C2=A0 =C2=A01.90=C2=A0 =C2=A0 2.03=C2=A0= =C2=A0 2.03=C2=A0 =C2=A0 0.00=C2=A0 =C2=A01.08 100.00


=C2=A0 =C2=A0 So might be useful for us to see something like that from you= r
=C2=A0 =C2=A0 system - note you need to check you really have flushed the c= ache,
=C2=A0 =C2=A0 and that no other apps are using the db.

=C2=A0 =C2=A0 regards

=C2=A0 =C2=A0 Mark


=C2=A0 =C2=A0 On 12/07/17 00:46, Charles Nadeau wrote:

=C2=A0 =C2=A0 =C2=A0 =C2=A0 After reducing random_page_cost to 4 and testin= g more, I can
=C2=A0 =C2=A0 =C2=A0 =C2=A0 report that the aggregate read throughput for p= arallel
=C2=A0 =C2=A0 =C2=A0 =C2=A0 sequential scan is about 90MB/s. However the th= roughput for
=C2=A0 =C2=A0 =C2=A0 =C2=A0 sequential scan is still around 4MB/s.





--
Charles Nadeau Ph.D.
http://charlesnadeau.blogspot.com/




--
--001a11420e08bb272a05545eda8e--