Date: Tue, 24 May 2016 12:39:14 -0500
From: Justin Pryzby <pryzby@telsasoft.com>
To: pgsql-performance@postgresql.org
Subject: index fragmentation on insert-only table with non-unique column
Message-ID: <20160524173914.GA11880@telsasoft.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.5.23 (2014-03-12)
Precedence: bulk
Sender: pgsql-performance-owner@postgresql.org

Summary: Non-unique btree indices are returning CTIDs for rows with same
value of indexed column not in logical order, imposing a high performance
penalty.

Running PG 9.5.3 now, we have a time-based partitions of append-only tables
with data loaded from other sources.  The tables are partitioned by time, and
timestamp column has an non-unique, not-null btree index.

The child tables are each ~75GB and expected to keep growing.  For a child
table with a week's worth of data:
relpages  | 11255802
reltuples | 5.90502e+07

The data is loaded shortly after it's available, so have high correlation in
pg_statistic:
[pryzbyj@viaero ~]$ psql ts -c "SELECT tablename, correlation, n_distinct FROM pg_stats s JOIN pg_class c ON (c.relname=s.tablename) WHERE tablename LIKE 'cdrs_huawei_pgwrecord%' AND attname='recordopeningtime' ORDER BY 1" |head
            tablename             | correlation | n_distinct 
----------------------------------+-------------+------------
 cdrs_huawei_pgwrecord            |    0.999997 |     102892
 cdrs_huawei_pgwrecord_2016_02_15 |    0.999658 |      96145
 cdrs_huawei_pgwrecord_2016_02_22 |    0.999943 |      91916
 cdrs_huawei_pgwrecord_2016_02_29 |    0.997219 |      50341
 cdrs_huawei_pgwrecord_2016_03_01 |    0.999947 |      97485

But note the non-uniqueness of the index column:
ts=# SELECT recordopeningtime, COUNT(1) FROM cdrs_huawei_pgwrecord WHERE recordopeningtime>='2016-05-21' AND recordopeningtime<'2016-05-22' GROUP BY 1 ORDER BY 2 DESC;
  recordopeningtime  | count
---------------------+-------
 2016-05-21 12:17:29 |   176
 2016-05-21 12:17:25 |   171
 2016-05-21 13:11:33 |   170
 2016-05-21 10:20:02 |   169
 2016-05-21 11:30:02 |   167
[...]

We have an daily analytic query which processes the previous day's data.  For
new child tables, with only 1 days data loaded, this runs in ~30min, and for
child tables with an entire week's worth of data loaded, takes several hours
(even though both queries process the same amount of data).

First, I found I was able to get 30-50min query results on full week's table by
prefering a seq scan to an index scan.  The row estimates seemed fine, and the
only condition is the timestamp, so the planner's use of index scan is as
expected.

AFAICT what's happening is that the index scan was returning pages
nonsequentially.  strace-ing the backend showed alternating lseek()s and
read()s, with the offsets not consistently increasing (nor consistently
decreasing):
% sudo strace -p 25588 2>&1 |grep -m9 'lseek(773'
lseek(773, 1059766272, SEEK_SET)        = 1059766272
lseek(773, 824926208, SEEK_SET)         = 824926208
lseek(773, 990027776, SEEK_SET)         = 990027776
lseek(773, 990330880, SEEK_SET)         = 990330880
lseek(773, 1038942208, SEEK_SET)        = 1038942208
lseek(773, 1059856384, SEEK_SET)        = 1059856384
lseek(773, 977305600, SEEK_SET)         = 977305600
lseek(773, 990347264, SEEK_SET)         = 990347264
lseek(773, 871096320, SEEK_SET)         = 871096320

.. and consecutive read()s being rare:
read(802, "g"..., 8192)                 = 8192
lseek(802, 918003712, SEEK_SET)         = 918003712
read(802, "c"..., 8192)                 = 8192
lseek(802, 859136000, SEEK_SET)         = 859136000
read(802, "a"..., 8192)                 = 8192
lseek(802, 919601152, SEEK_SET)         = 919601152
read(802, "d"..., 8192)                 = 8192
lseek(802, 905101312, SEEK_SET)         = 905101312
read(802, "c"..., 8192)                 = 8192
lseek(801, 507863040, SEEK_SET)         = 507863040
read(801, "p"..., 8192)                 = 8192
lseek(802, 914235392, SEEK_SET)         = 914235392
read(802, "c"..., 8192)                 = 8192

I was able to see great improvement without planner parameters by REINDEX the
timestamp index.  My theory is that the index/planner doesn't handle well the
case of many tuples with same column value, and returns pages out of logical
order.  Reindex fixes that, rewriting the index data with pages in order
(confirmed with pageinspect), which causes index scans to fetch heap data more
or less monotonically (if not consecutively).  strace shows that consecutive
read()s are common (without intervening seeks).  I gather this allows the OS
readahead to kick in.

Postgres seems to assume that the high degree of correlation of the table
column seen in pg_stats is how it will get data from the index scan, which
assumption seems to be very poor on what turns out to be a higly fragmented
index.  Is there a way to help it to understand otherwise??

Maybe this is a well-understood problem/deficiency; but, is there any reason
why Btree scan can't sort result by ctid for index tuples with same column
value (_bt_steppage() or btgettuple())?  Or maybe the problem could be
mitigated by changing the behavior during INESRT?  In the meantime, I'll be
implementing a reindex job.

Thanks,
Justin


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance