Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtp (Exim 4.84_2) (envelope-from ) id 1dTgQn-0000i7-Eo for pgsql-performance@arkaria.postgresql.org; Sat, 08 Jul 2017 03:29:17 +0000 Received: from localhost ([127.0.0.1] helo=postgresql.org) by malur.postgresql.org with smtp (Exim 4.84_2) (envelope-from ) id 1dTgQm-0006iI-UU for pgsql-performance@arkaria.postgresql.org; Sat, 08 Jul 2017 03:29:16 +0000 Received: from makus.postgresql.org ([2001:4800:1501:1::229]) by malur.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA384:256) (Exim 4.84_2) (envelope-from ) id 1dTgP1-0003dw-Uy for pgsql-performance@postgresql.org; Sat, 08 Jul 2017 03:27:28 +0000 Received: from mail-oi0-x232.google.com ([2607:f8b0:4003:c06::232]) by makus.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA1:256) (Exim 4.84_2) (envelope-from ) id 1dTgOy-0001J9-HU for pgsql-performance@postgresql.org; Sat, 08 Jul 2017 03:27:26 +0000 Received: by mail-oi0-x232.google.com with SMTP id 191so41682455oii.2 for ; Fri, 07 Jul 2017 20:27:23 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bowt-ie.20150623.gappssmtp.com; s=20150623; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc; bh=8tXWWL3LCcI+mfAJ8XJCUTMm+zgOvvHpgt8nmbejfwY=; b=wHhQmvW0/zSuuP5CO/s9iQKKfwpH9QNVOqXZSkJvoTlkxM9kHhePum8M30enxBNCIe 9mxj5gd8IUuxD/YN9zurkpViKBCEWf/rEiLtrRsjGs001cRUlYxM5/iGY84Qb8xd625Z GW4pW8ZLdiDU4giSnah6jUZ0xQ2uoTOMQXD7py7wMajTs9Cjb2b9A61H9xxg/BHTckka oq7UOlt6ULL3lYuqVQeH365m/O85/GHCQjH3GphRIx/JE2i3Bah+kqOy1rsHIaQewGJd 6MB37yvQB12OWtoquyDSBHJAhT+0lbTkSSfflurTZN6uhREdvtYeHje3uWd+DwKgAAJX BxYg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc; bh=8tXWWL3LCcI+mfAJ8XJCUTMm+zgOvvHpgt8nmbejfwY=; b=rcEGgN1uEoMmryYxjb+3/rNWySAJip8eNwr427F0KzrT3eu/oEbOOexhgKq4f8Irf1 nAR/x9XnO6f8umMZEi14jqN2S3I7xs0KBW0n9VLWFJISJdjKnYJJM998jHzsKbXL+LfL ga/HV8gHBrEvOuMTglopg8kSHZATo5cnBtT6EPFamADuHp4ZtCjoxGrMfZr+HfTi3mq2 H7mWaWs0ngcDQjuLdKs+5RMmGVORIYEpVTTTjurtkzxyRVOBu5Fp1/gJFJeS94o/4kgl jmKDQIPoM/ry+h+H2E7JNY/n0iwxFmsPrdM4xgMXqAhm2uDqowi3nrMTqqf+3ysnwJ2k 200w== X-Gm-Message-State: AIVw110JdCx1zGUoOagz/OGcWQ0VdQW8Rjb+20qbwDCTKJZEQo/mAila yL+IZUGA0wIJTktwGz84acu+dfBsG7mp X-Received: by 10.202.182.86 with SMTP id g83mr2381362oif.4.1499484443072; Fri, 07 Jul 2017 20:27:23 -0700 (PDT) MIME-Version: 1.0 Received: by 10.74.11.90 with HTTP; Fri, 7 Jul 2017 20:27:02 -0700 (PDT) In-Reply-To: <20170707234119.GN17566@telsasoft.com> References: <20160813185448.GP1179@telsasoft.com> <20170707234119.GN17566@telsasoft.com> From: Peter Geoghegan Date: Fri, 7 Jul 2017 20:27:02 -0700 Message-ID: Subject: Re: estimate correlation of index separately from table (Re: index fragmentation on insert-only table with non-unique column) To: Justin Pryzby Cc: pgsql-performance@postgresql.org, Jeff Janes , Tom Lane , Claudio Freire , Ralph Loader Content-Type: text/plain; charset="UTF-8" List-Archive: List-Help: List-ID: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: X-Mailing-List: pgsql-performance Precedence: bulk Sender: pgsql-performance-owner@postgresql.org On Fri, Jul 7, 2017 at 4:41 PM, Justin Pryzby wrote: > The second change averages separate correlation values of small lists of 300 > consecutive TIDs, rather than the course-granularity/aggregate correlation of a > small sample of pages across the entire index. Postgres' existing sampling is > designed to give an even sample across all rows. An issue with this > course-granularity correlation is that the index can have a broad correlation > (to physical heap location), but with many small-scale deviations, which don't > show up due to sampling a relatively small fraction of a large table; and/or > the effect of the deviations is insignificant/noise and correlation is still > computed near 1. > > I believe the "large scale" correlation that postgres computes from block > sample fails to well represent small-scale uncorrelated reads which contribute > large number of random reads, but not included in planner cost. All of that may well be true, but I've actually come around to the view that we should treat TID as a first class part of the keyspace, that participates in comparisons as an implicit last column in all cases (not just when B-Trees are built within tuplesort.c). That would probably obviate the need for a patch like this entirely, because pg_stats.correlation would be unaffected by duplicates. I think that this would have a number of advantages, some of which might be significant. For example, I think it could make LP_DEAD cleanup within nbtree more effective for some workloads, especially workloads where it is important for HOT pruning and LP_DEAD cleanup to work in concert -- locality of access probably matters with that. Also, with every entry in the index guaranteed to be unique, we can imagine VACUUM being much more selective with killing index entries, when the TID array it uses happens to be very small. With the freeze map stuff that's now in place, being able to do that matters more than before. The original Lehman & Yao algorithm is supposed to have unique keys in all cases, but we don't follow that in order to make duplicates work, which is implemented by changing an invariant (see nbtree README for details). So, this could simplify some aspects of how binary searches must work in the face of having to deal with non-unique keys. I think that there are cases where many non-HOT UPDATEs have to go through a bunch of duplicate leaf tuples and do visibility checking on old versions for no real benefit. With the TID as part of the keyspace, we could instead tell the B-Tree code to insert a new tuple as part of an UPDATE while using the TID as part of its insertion scan key, so rummaging through many duplicates is avoided. That having been said, it would be hard to do this for all the reasons I went into in that thread you referenced [1]. If you're going to treat TID as a part of the keyspace, you have to totally embrace that, which means that the internal pages need to have heap TIDs too (not just pointers to the lower levels in the B-Tree, which the IndexTuple's t_tid pointer is used for there). Those are the place where you need to literally append this new, implicit heap TID column as if it was just another user-visible attribute, since that information isn't stored in the internal pages today at all. Doing that has a cost, which isn't going to be acceptable if we naively append a heap TID to every internal page IndexTuple. With a type like int4, you're going to completely bloat those pages, with big consequences for fan-in. So, really, what needs to happen to make it work is someone needs to write a suffix truncation patch, which implies that only those cases that actually benefit from increasing the width of internal page IndexTuples (those with many "would-be duplicates") pay the cost. This is a classic technique, that I've actually already prototyped, though that's extremely far from being worth posting here. That was just to verify my understanding. I think that I should write and put up for discussion a design document for various nbtree enhancements. These include internal page suffix truncation, prefix compression, and key normalization. I'm probably not going to take any of these projects on, but it would be useful if there was at least a little bit of clarity about how they could be implemented. Maybe we could reach some kind of weak consensus without going to great lengths. These optimizations are closely intertwined things, and the lack of clarity on how they all fit together is probably holding back an implementation of any one of them. [1] https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com#20160524173914.GA11880@telsasoft.com -- Peter Geoghegan -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance