Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtp (Exim 4.84_2) (envelope-from ) id 1b5GHi-0000VO-QI for pgsql-performance@arkaria.postgresql.org; Tue, 24 May 2016 17:38:26 +0000 Received: from localhost ([127.0.0.1] helo=postgresql.org) by malur.postgresql.org with smtp (Exim 4.84_2) (envelope-from ) id 1b5GHi-0001bv-8P for pgsql-performance@arkaria.postgresql.org; Tue, 24 May 2016 17:38:26 +0000 Received: from makus.postgresql.org ([2001:4800:1501:1::229]) by malur.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA384:256) (Exim 4.84_2) (envelope-from ) id 1b5GHh-0001aR-Bw for pgsql-performance@postgresql.org; Tue, 24 May 2016 17:38:25 +0000 Received: from mail-io0-x22f.google.com ([2607:f8b0:4001:c06::22f]) by makus.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA1:256) (Exim 4.84_2) (envelope-from ) id 1b5GHe-0001y0-3d for pgsql-performance@postgresql.org; Tue, 24 May 2016 17:38:24 +0000 Received: by mail-io0-x22f.google.com with SMTP id p64so16724436ioi.2 for ; Tue, 24 May 2016 10:38:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=telsasoft-com.20150623.gappssmtp.com; s=20150623; h=date:from:to:subject:message-id:mime-version:content-disposition :user-agent; bh=s6AePb1zwM1vrtN4u+pfGaJMKw32DOKbCtWNkUQQeT0=; b=nF0S/YNx7jq6N4F36Q318BDqD/o8sqj0XmLIESqzFQlLZ9A/JyGsRN7KLJuNG+sZ8s 4x56RvHQ9lH6ciCDoibZNkyPnAn2cUmQvL5b88dVQEIT4ovy852GaV0jYT0stbTd6vOh 2Hnfi+Dy+eyMLHuYE09evHB8sD6fHa5XnV82MTSlnxvvkLHtfIgcw4nMTZ5zkTQY7iGD ymIb07pRqNUvl/STVUjxkEPW2nj2Td/ZZ3s4pyAZwAJUp99j9wttGYY/qDjmjM2MCiiL B6pFtLPclzvuABV0ImmyDCMj57HT29fXp0fE2rSVJ0XBQkFl0GVbtk6p3JQ/Qf9iglp5 hoGQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:date:from:to:subject:message-id:mime-version :content-disposition:user-agent; bh=s6AePb1zwM1vrtN4u+pfGaJMKw32DOKbCtWNkUQQeT0=; b=DkEc4kX6v0MDUypOOmEtDsMaRertlsgM5nFdCKggdPdcmNzkrdMlVrtzJtYtfg87f6 0Z3MQylAHLacJppWGg6UinkpXlvs1L1T2EpOg2bfBsX9PD5s9Gzv/AN0YV1GT4BKnV6G 3P251yfzQeuPHFKmOQsQIbQq5gBdSxpaEwykdHV+gMjQ8+8ztAZ+W8rGMgd7SIiC++By 6914elw/0FzAJlBK44MaBIOGFYiRvtwKQgzqWKwtLuKW8+BexjWUVTVycZqw3/Aw3WQA l00vtWuJPyz985Gv2LzHgpSau5frFXrLbIS72VqIFh/v1V3tBn+/aQMfvOVWnxQnnkma bLMw== X-Gm-Message-State: AOPr4FWGeMHPxDUoyofEI+FODuo1HDAFaaBy3GJu/uiGMCy9nLQhfYh2EdfaBY/vsJpqOg== X-Received: by 10.107.12.154 with SMTP id 26mr17614583iom.21.1464111500687; Tue, 24 May 2016 10:38:20 -0700 (PDT) Received: from pryzbyj (charmander.telsasoft.com. [50.244.222.1]) by smtp.gmail.com with ESMTPSA id u134sm5808760ith.6.2016.05.24.10.38.19 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 24 May 2016 10:38:19 -0700 (PDT) Received: by pryzbyj (Postfix, from userid 1000) id E6F99800C62; Tue, 24 May 2016 12:39:14 -0500 (CDT) Date: Tue, 24 May 2016 12:39:14 -0500 From: Justin Pryzby To: pgsql-performance@postgresql.org Subject: index fragmentation on insert-only table with non-unique column Message-ID: <20160524173914.GA11880@telsasoft.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.23 (2014-03-12) X-Pg-Spam-Score: -2.6 (--) List-Archive: List-Help: List-ID: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: X-Mailing-List: pgsql-performance Precedence: bulk Sender: pgsql-performance-owner@postgresql.org Summary: Non-unique btree indices are returning CTIDs for rows with same value of indexed column not in logical order, imposing a high performance penalty. Running PG 9.5.3 now, we have a time-based partitions of append-only tables with data loaded from other sources. The tables are partitioned by time, and timestamp column has an non-unique, not-null btree index. The child tables are each ~75GB and expected to keep growing. For a child table with a week's worth of data: relpages | 11255802 reltuples | 5.90502e+07 The data is loaded shortly after it's available, so have high correlation in pg_statistic: [pryzbyj@viaero ~]$ psql ts -c "SELECT tablename, correlation, n_distinct FROM pg_stats s JOIN pg_class c ON (c.relname=s.tablename) WHERE tablename LIKE 'cdrs_huawei_pgwrecord%' AND attname='recordopeningtime' ORDER BY 1" |head tablename | correlation | n_distinct ----------------------------------+-------------+------------ cdrs_huawei_pgwrecord | 0.999997 | 102892 cdrs_huawei_pgwrecord_2016_02_15 | 0.999658 | 96145 cdrs_huawei_pgwrecord_2016_02_22 | 0.999943 | 91916 cdrs_huawei_pgwrecord_2016_02_29 | 0.997219 | 50341 cdrs_huawei_pgwrecord_2016_03_01 | 0.999947 | 97485 But note the non-uniqueness of the index column: ts=# SELECT recordopeningtime, COUNT(1) FROM cdrs_huawei_pgwrecord WHERE recordopeningtime>='2016-05-21' AND recordopeningtime<'2016-05-22' GROUP BY 1 ORDER BY 2 DESC; recordopeningtime | count ---------------------+------- 2016-05-21 12:17:29 | 176 2016-05-21 12:17:25 | 171 2016-05-21 13:11:33 | 170 2016-05-21 10:20:02 | 169 2016-05-21 11:30:02 | 167 [...] We have an daily analytic query which processes the previous day's data. For new child tables, with only 1 days data loaded, this runs in ~30min, and for child tables with an entire week's worth of data loaded, takes several hours (even though both queries process the same amount of data). First, I found I was able to get 30-50min query results on full week's table by prefering a seq scan to an index scan. The row estimates seemed fine, and the only condition is the timestamp, so the planner's use of index scan is as expected. AFAICT what's happening is that the index scan was returning pages nonsequentially. strace-ing the backend showed alternating lseek()s and read()s, with the offsets not consistently increasing (nor consistently decreasing): % sudo strace -p 25588 2>&1 |grep -m9 'lseek(773' lseek(773, 1059766272, SEEK_SET) = 1059766272 lseek(773, 824926208, SEEK_SET) = 824926208 lseek(773, 990027776, SEEK_SET) = 990027776 lseek(773, 990330880, SEEK_SET) = 990330880 lseek(773, 1038942208, SEEK_SET) = 1038942208 lseek(773, 1059856384, SEEK_SET) = 1059856384 lseek(773, 977305600, SEEK_SET) = 977305600 lseek(773, 990347264, SEEK_SET) = 990347264 lseek(773, 871096320, SEEK_SET) = 871096320 .. and consecutive read()s being rare: read(802, "g"..., 8192) = 8192 lseek(802, 918003712, SEEK_SET) = 918003712 read(802, "c"..., 8192) = 8192 lseek(802, 859136000, SEEK_SET) = 859136000 read(802, "a"..., 8192) = 8192 lseek(802, 919601152, SEEK_SET) = 919601152 read(802, "d"..., 8192) = 8192 lseek(802, 905101312, SEEK_SET) = 905101312 read(802, "c"..., 8192) = 8192 lseek(801, 507863040, SEEK_SET) = 507863040 read(801, "p"..., 8192) = 8192 lseek(802, 914235392, SEEK_SET) = 914235392 read(802, "c"..., 8192) = 8192 I was able to see great improvement without planner parameters by REINDEX the timestamp index. My theory is that the index/planner doesn't handle well the case of many tuples with same column value, and returns pages out of logical order. Reindex fixes that, rewriting the index data with pages in order (confirmed with pageinspect), which causes index scans to fetch heap data more or less monotonically (if not consecutively). strace shows that consecutive read()s are common (without intervening seeks). I gather this allows the OS readahead to kick in. Postgres seems to assume that the high degree of correlation of the table column seen in pg_stats is how it will get data from the index scan, which assumption seems to be very poor on what turns out to be a higly fragmented index. Is there a way to help it to understand otherwise?? Maybe this is a well-understood problem/deficiency; but, is there any reason why Btree scan can't sort result by ctid for index tuples with same column value (_bt_steppage() or btgettuple())? Or maybe the problem could be mitigated by changing the behavior during INESRT? In the meantime, I'll be implementing a reindex job. Thanks, Justin -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance