Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtp (Exim 4.84_2) (envelope-from ) id 1b5QOr-0000VL-PT for pgsql-performance@arkaria.postgresql.org; Wed, 25 May 2016 04:26:29 +0000 Received: from localhost ([127.0.0.1] helo=postgresql.org) by malur.postgresql.org with smtp (Exim 4.84_2) (envelope-from ) id 1b5QOr-0003B0-BN for pgsql-performance@arkaria.postgresql.org; Wed, 25 May 2016 04:26:29 +0000 Received: from makus.postgresql.org ([2001:4800:1501:1::229]) by malur.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA384:256) (Exim 4.84_2) (envelope-from ) id 1b5QOq-0003At-Mr for pgsql-performance@postgresql.org; Wed, 25 May 2016 04:26:28 +0000 Received: from tamriel.snowman.net ([2001:4830:167b::11]) by makus.postgresql.org with esmtp (Exim 4.84_2) (envelope-from ) id 1b5QOn-0008Ay-W5 for pgsql-performance@postgresql.org; Wed, 25 May 2016 04:26:27 +0000 Received: by tamriel.snowman.net (Postfix, from userid 1000) id 3865E5F795; Wed, 25 May 2016 00:26:25 -0400 (EDT) Date: Wed, 25 May 2016 00:26:25 -0400 From: Stephen Frost To: Peter Geoghegan Cc: Justin Pryzby , pgsql-performance@postgresql.org Subject: Re: index fragmentation on insert-only table with non-unique column Message-ID: <20160525042625.GX21416@tamriel.snowman.net> References: <20160524173914.GA11880@telsasoft.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="H7QJU8Gl1+/xsYtC" Content-Disposition: inline In-Reply-To: X-Editor: Vim http://www.vim.org/ X-Info: http://www.snowman.net X-Operating-System: Linux/3.13.0-77-generic (x86_64) X-Uptime: 00:20:08 up 92 days, 1:13, 37 users, load average: 0.10, 0.10, 0.12 User-Agent: Mutt/1.5.21 (2010-09-15) X-Pg-Spam-Score: -3.3 (---) List-Archive: List-Help: List-ID: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: X-Mailing-List: pgsql-performance Precedence: bulk Sender: pgsql-performance-owner@postgresql.org --H7QJU8Gl1+/xsYtC Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable * Peter Geoghegan (pg@bowt.ie) wrote: > On Tue, May 24, 2016 at 10:39 AM, Justin Pryzby wr= ote: > > I was able to see great improvement without planner parameters by REIND= EX the > > timestamp index. My theory is that the index/planner doesn't handle we= ll the > > case of many tuples with same column value, and returns pages out of lo= gical > > order. Reindex fixes that, rewriting the index data with pages in order > > (confirmed with pageinspect), which causes index scans to fetch heap da= ta more > > or less monotonically (if not consecutively). strace shows that consec= utive > > read()s are common (without intervening seeks). I gather this allows t= he OS > > readahead to kick in. >=20 > The basic problem is that the B-Tree code doesn't maintain this > property. However, B-Tree index builds will create an index that > initially has this property, because the tuplesort.c code happens to > sort index tuples with a CTID tie-breaker. >=20 > > Postgres seems to assume that the high degree of correlation of the tab= le > > column seen in pg_stats is how it will get data from the index scan, wh= ich > > assumption seems to be very poor on what turns out to be a higly fragme= nted > > index. Is there a way to help it to understand otherwise?? >=20 > Your complaint is vague. Are you complaining about the planner making > a poor choice? I don't think that's the issue here, because you never > made any firm statement about the planner making a choice that was > worth than an alternative that it had available. >=20 > If you're arguing for the idea that B-Trees should reliably keep > tuples in order by a tie-break condition, that seems difficult to > implement, and likely not worth it in practice. The ongoing discussion around how to effectively handle duplicate values in B-Tree seems relevant to this. In particular, if we're able to store duplicate values efficiently and all the locations under a single key are easily available then we could certainly sort those prior to going and visiting them. That's not quite the same as keeping the tuples in order in the heap, but would more-or-less achieve the effect desired, I believe? Thanks! Stephen --H7QJU8Gl1+/xsYtC Content-Type: application/pgp-signature; name="signature.asc" Content-Description: Digital signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAEBCgAGBQJXRSlwAAoJEO1sijiDR2RVUrUP/3MEMzXr2cddvxIetb8YFCsd wMK3tb7Zc8/x/oBU0z1bTOCDEjiqnQzGDs48/xAd/YhEVvjSLjuuOJFFZKtZr5Lj v5BBw13lMVyVJ1SjGl3vpZYKCkBb3vk9X467oKzhUEw+lOXYpbfU3JQU45hNvEBa loJelJjJa8N0xTqhtVeAn9dvNjglTbpY3ohLs8B+RqtoccuizIMM1s5568WYM/2I frw9tOcoczlBpW8Ty8N28Pl2auIUnZR4WbU0yq7J8+nYMXRVEdrNrAkoN4FiJguL lncjqcaS7LOCv08n4BxfGWnmTfBlQ8xpa/bGfUYy3pdLQAslGTzIIm4+BQzMYeTA 89rZtfcSdGoESG15kihMpR/a0f3R74+lFngtLmcdPvCQkZN6IAPzGgHcJbV+aDhV Ji1xnuWETpDG1y4/qRQjmXiRIl/Erv2ybSQ0r45wnXF+PmNLZYgfZyQe7Q1Ehy3v pZU/jM7u2gsLGsybKy2V2CdoA7Xk4ZTLH2RVp0TZoMEHXuMqVHeA28NrBLQv5Fco h+470Swf5CXejLaXul7GAUh4ma+yt3eJ9nZZ2BgAvj/U2ZVL+UaoQ/EaDkz7KQj3 T80FfxqWNlSHBykLBCZFT5moQWAfG/xVBJifTaxDVDtIChwmhODKEaillRpSOna8 RHQgxBTgRLf4XHwK1ksW =lJjM -----END PGP SIGNATURE----- --H7QJU8Gl1+/xsYtC--