Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtp (Exim 4.84_2) (envelope-from ) id 1b5SEW-0006TS-Ly for pgsql-performance@arkaria.postgresql.org; Wed, 25 May 2016 06:23:56 +0000 Received: from localhost ([127.0.0.1] helo=postgresql.org) by malur.postgresql.org with smtp (Exim 4.84_2) (envelope-from ) id 1b5SEV-00059P-HO for pgsql-performance@arkaria.postgresql.org; Wed, 25 May 2016 06:23:55 +0000 Received: from makus.postgresql.org ([2001:4800:1501:1::229]) by malur.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA384:256) (Exim 4.84_2) (envelope-from ) id 1b5SET-000594-Fb for pgsql-performance@postgresql.org; Wed, 25 May 2016 06:23:53 +0000 Received: from mail-vk0-x230.google.com ([2607:f8b0:400c:c05::230]) by makus.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA1:256) (Exim 4.84_2) (envelope-from ) id 1b5SEQ-0003oP-9O for pgsql-performance@postgresql.org; Wed, 25 May 2016 06:23:52 +0000 Received: by mail-vk0-x230.google.com with SMTP id y2so51520164vka.3 for ; Tue, 24 May 2016 23:23:50 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc; bh=FpY5P5nWkty+jBajdYAm8NCsJ0n8UUTMvRWMSSPMEYE=; b=NuFCOFBHHomxJCaVVqPk2odnCuvhe7r56zLQbZn+36SMGG0kC8NcLga6B9I6ZXmQGg obMq+JWsglvdBOk5lUsErZV3nnTbZNpz6/57Uhi4o7aO66W+HGn4AKnTvjFh+oF28TkW p5hUGPPZ7VlOrIqg1MqIomUISvCjggAWvQMQg/hy3wgZ3R0GzcBLRlUDvmMyQaovpQD8 fcsRMqjI7ofxWH17P3ZzrWHDPXN9EN9gFG5TLbjXP+NtXLMT2n2rmmPVXFnVIB+/jloC 9Eu3/+YQpRIW8KPLOErGMe/3EX9PvJP9noaoQEMrtKztewlGtfq2xU7E6DkAFSHpv5Wr sazA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:to:cc; bh=FpY5P5nWkty+jBajdYAm8NCsJ0n8UUTMvRWMSSPMEYE=; b=ivEmxtP1RMUlVffcPmvcWV+PgsUdpELz/uRLkg1SCnwr78jpab05D6S07259Z1UbI4 HAAsqF5jT4ReU4TAYV5jU3RrTMwMF4EOGLYsgCYds8b3gM95HJBZ0peEu5LwLSWtPlni vxVdwcFPrwbuRbRn1Glemg1epQh7QAAUiV1HLJW78kaQKu+gR0A3XNFClw/T3yhYNA2F muui2IRmZS9TTN8/yEYHvVu8fzosAgw5zotfXNpgb7YOGXeUwWJVljx6zh3vxP2p52B4 J7WUJJwto5P5jCI/IdeVkcPQSaC5Bsu0MJ4UaPeoDRd27USmgtANlGXtvYK6t866eb1V WOpA== X-Gm-Message-State: ALyK8tLALhdw8FNgP2YWmOXhG0zz4+TDUwT+u5AYHw9fneCgDn5aiPEHpduHXm08vGYju74xeBkg1y2iMIXXgg== MIME-Version: 1.0 X-Received: by 10.31.3.12 with SMTP id 12mr1339169vkd.22.1464157428839; Tue, 24 May 2016 23:23:48 -0700 (PDT) Received: by 10.103.71.203 with HTTP; Tue, 24 May 2016 23:23:48 -0700 (PDT) In-Reply-To: <20160524173914.GA11880@telsasoft.com> References: <20160524173914.GA11880@telsasoft.com> Date: Tue, 24 May 2016 23:23:48 -0700 Message-ID: Subject: Re: index fragmentation on insert-only table with non-unique column From: Jeff Janes To: Justin Pryzby Cc: "pgsql-performance@postgresql.org" Content-Type: text/plain; charset=UTF-8 X-Pg-Spam-Score: -2.7 (--) List-Archive: List-Help: List-ID: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: X-Mailing-List: pgsql-performance Precedence: bulk Sender: pgsql-performance-owner@postgresql.org On Tue, May 24, 2016 at 10:39 AM, Justin Pryzby wrote: > Summary: Non-unique btree indices are returning CTIDs for rows with same > value of indexed column not in logical order, imposing a high performance > penalty. > > Running PG 9.5.3 now, we have a time-based partitions of append-only tables > with data loaded from other sources. The tables are partitioned by time, and > timestamp column has an non-unique, not-null btree index. > > The child tables are each ~75GB and expected to keep growing. For a child > table with a week's worth of data: > relpages | 11255802 > reltuples | 5.90502e+07 > > The data is loaded shortly after it's available, so have high correlation in > pg_statistic: > [pryzbyj@viaero ~]$ psql ts -c "SELECT tablename, correlation, n_distinct FROM pg_stats s JOIN pg_class c ON (c.relname=s.tablename) WHERE tablename LIKE 'cdrs_huawei_pgwrecord%' AND attname='recordopeningtime' ORDER BY 1" |head > tablename | correlation | n_distinct > ----------------------------------+-------------+------------ > cdrs_huawei_pgwrecord | 0.999997 | 102892 > cdrs_huawei_pgwrecord_2016_02_15 | 0.999658 | 96145 > cdrs_huawei_pgwrecord_2016_02_22 | 0.999943 | 91916 > cdrs_huawei_pgwrecord_2016_02_29 | 0.997219 | 50341 > cdrs_huawei_pgwrecord_2016_03_01 | 0.999947 | 97485 > > But note the non-uniqueness of the index column: > ts=# SELECT recordopeningtime, COUNT(1) FROM cdrs_huawei_pgwrecord WHERE recordopeningtime>='2016-05-21' AND recordopeningtime<'2016-05-22' GROUP BY 1 ORDER BY 2 DESC; > recordopeningtime | count > ---------------------+------- > 2016-05-21 12:17:29 | 176 > 2016-05-21 12:17:25 | 171 > 2016-05-21 13:11:33 | 170 > 2016-05-21 10:20:02 | 169 > 2016-05-21 11:30:02 | 167 > [...] That is not that much duplication. You aren't going to have dozens or hundreds of leaf pages all with equal values. (and you only showed the most highly duplicated ones, presumably the average is much less) > We have an daily analytic query which processes the previous day's data. For > new child tables, with only 1 days data loaded, this runs in ~30min, and for > child tables with an entire week's worth of data loaded, takes several hours > (even though both queries process the same amount of data). For an append only table, why would the first day of a new partition be any less fragmented than that same day would be a week from now? Are you sure it isn't just that your week-old data has all been aged out of the cache? > First, I found I was able to get 30-50min query results on full week's table by > prefering a seq scan to an index scan. The row estimates seemed fine, and the > only condition is the timestamp, so the planner's use of index scan is as > expected. Can you show us the query? I would expect a bitmap scan of the index (which would do what you want, but even more so), instead. > > AFAICT what's happening is that the index scan was returning pages > nonsequentially. strace-ing the backend showed alternating lseek()s and > read()s, with the offsets not consistently increasing (nor consistently > decreasing): > % sudo strace -p 25588 2>&1 |grep -m9 'lseek(773' > lseek(773, 1059766272, SEEK_SET) = 1059766272 > lseek(773, 824926208, SEEK_SET) = 824926208 > lseek(773, 990027776, SEEK_SET) = 990027776 > lseek(773, 990330880, SEEK_SET) = 990330880 > lseek(773, 1038942208, SEEK_SET) = 1038942208 > lseek(773, 1059856384, SEEK_SET) = 1059856384 > lseek(773, 977305600, SEEK_SET) = 977305600 > lseek(773, 990347264, SEEK_SET) = 990347264 > lseek(773, 871096320, SEEK_SET) = 871096320 > > .. and consecutive read()s being rare: > read(802, "g"..., 8192) = 8192 > lseek(802, 918003712, SEEK_SET) = 918003712 > read(802, "c"..., 8192) = 8192 > lseek(802, 859136000, SEEK_SET) = 859136000 > read(802, "a"..., 8192) = 8192 > lseek(802, 919601152, SEEK_SET) = 919601152 > read(802, "d"..., 8192) = 8192 > lseek(802, 905101312, SEEK_SET) = 905101312 > read(802, "c"..., 8192) = 8192 > lseek(801, 507863040, SEEK_SET) = 507863040 > read(801, "p"..., 8192) = 8192 > lseek(802, 914235392, SEEK_SET) = 914235392 > read(802, "c"..., 8192) = 8192 Which of those are the table, and which the index? Something doesn't add up here. How could an index of an append-only table possibly become that fragmented, when the highest amount of key duplication is about 170? Cheers, Jeff -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance