MIME-Version: 1.0
In-Reply-To: <20170707234119.GN17566@telsasoft.com>
References: 
 <CAMkU=1xwGn+O0jhKsvrUrbW9MQp1YX0iB4Y-6h1mEz0ffBxK-Q@mail.gmail.com>
 <CAMkU=1wsSEgByzYQ5uazD1gSsJh9RRYME6ehgJsUp3XcMm2Djg@mail.gmail.com>
 <20160813185448.GP1179@telsasoft.com> <20170707234119.GN17566@telsasoft.com>
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 7 Jul 2017 20:27:02 -0700
Message-ID: 
 <CAH2-WzkRTggiy_LKQUu-oViyp6y_Hhz-a1yWacPy4tcYWV1HoA@mail.gmail.com>
Subject: Re: estimate correlation of index separately from table (Re:
 index fragmentation on insert-only table with non-unique column)
To: Justin Pryzby <pryzby@telsasoft.com>
Cc: pgsql-performance@postgresql.org, Jeff Janes <jeff.janes@gmail.com>,
	Tom Lane <tgl@sss.pgh.pa.us>, Claudio Freire <klaussfreire@gmail.com>,
	Ralph Loader <suckfish@ihug.co.nz>
Content-Type: text/plain; charset="UTF-8"
Precedence: bulk
Sender: pgsql-performance-owner@postgresql.org

On Fri, Jul 7, 2017 at 4:41 PM, Justin Pryzby <pryzby@telsasoft.com> wrote:
> The second change averages separate correlation values of small lists of 300
> consecutive TIDs, rather than the course-granularity/aggregate correlation of a
> small sample of pages across the entire index.  Postgres' existing sampling is
> designed to give an even sample across all rows.  An issue with this
> course-granularity correlation is that the index can have a broad correlation
> (to physical heap location), but with many small-scale deviations, which don't
> show up due to sampling a relatively small fraction of a large table; and/or
> the effect of the deviations is insignificant/noise and correlation is still
> computed near 1.
>
> I believe the "large scale" correlation that postgres computes from block
> sample fails to well represent small-scale uncorrelated reads which contribute
> large number of random reads, but not included in planner cost.

All of that may well be true, but I've actually come around to the
view that we should treat TID as a first class part of the keyspace,
that participates in comparisons as an implicit last column in all
cases (not just when B-Trees are built within tuplesort.c). That would
probably obviate the need for a patch like this entirely, because
pg_stats.correlation would be unaffected by duplicates. I think that
this would have a number of advantages, some of which might be
significant. For example, I think it could make LP_DEAD cleanup within
nbtree more effective for some workloads, especially workloads where
it is important for HOT pruning and LP_DEAD cleanup to work in concert
-- locality of access probably matters with that. Also, with every
entry in the index guaranteed to be unique, we can imagine VACUUM
being much more selective with killing index entries, when the TID
array it uses happens to be very small. With the freeze map stuff
that's now in place, being able to do that matters more than before.

The original Lehman & Yao algorithm is supposed to have unique keys in
all cases, but we don't follow that in order to make duplicates work,
which is implemented by changing an invariant (see nbtree README for
details). So, this could simplify some aspects of how binary searches
must work in the face of having to deal with non-unique keys. I think
that there are cases where many non-HOT UPDATEs have to go through a
bunch of duplicate leaf tuples and do visibility checking on old
versions for no real benefit. With the TID as part of the keyspace, we
could instead tell the B-Tree code to insert a new tuple as part of an
UPDATE while using the TID as part of its insertion scan key, so
rummaging through many duplicates is avoided.

That having been said, it would be hard to do this for all the reasons
I went into in that thread you referenced [1]. If you're going to
treat TID as a part of the keyspace, you have to totally embrace that,
which means that the internal pages need to have heap TIDs too (not
just pointers to the lower levels in the B-Tree, which the
IndexTuple's t_tid pointer is used for there). Those are the place
where you need to literally append this new, implicit heap TID column
as if it was just another user-visible attribute, since that
information isn't stored in the internal pages today at all. Doing
that has a cost, which isn't going to be acceptable if we naively
append a heap TID to every internal page IndexTuple. With a type like
int4, you're going to completely bloat those pages, with big
consequences for fan-in.

So, really, what needs to happen to make it work is someone needs to
write a suffix truncation patch, which implies that only those cases
that actually benefit from increasing the width of internal page
IndexTuples (those with many "would-be duplicates") pay the cost. This
is a classic technique, that I've actually already prototyped, though
that's extremely far from being worth posting here. That was just to
verify my understanding.

I think that I should write and put up for discussion a design
document for various nbtree enhancements. These include internal page
suffix truncation, prefix compression, and key normalization. I'm
probably not going to take any of these projects on, but it would be
useful if there was at least a little bit of clarity about how they
could be implemented. Maybe we could reach some kind of weak consensus
without going to great lengths. These optimizations are closely
intertwined things, and the lack of clarity on how they all fit
together is probably holding back an implementation of any one of
them.

[1] https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com#20160524173914.GA11880@telsasoft.com
-- 
Peter Geoghegan


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance