Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtp (Exim 4.84_2) (envelope-from ) id 1b5QFT-0008Tj-3n for pgsql-performance@arkaria.postgresql.org; Wed, 25 May 2016 04:16:47 +0000 Received: from localhost ([127.0.0.1] helo=postgresql.org) by malur.postgresql.org with smtp (Exim 4.84_2) (envelope-from ) id 1b5QFS-0000IN-IC for pgsql-performance@arkaria.postgresql.org; Wed, 25 May 2016 04:16:46 +0000 Received: from makus.postgresql.org ([2001:4800:1501:1::229]) by malur.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA384:256) (Exim 4.84_2) (envelope-from ) id 1b5QFR-0000FP-54 for pgsql-performance@postgresql.org; Wed, 25 May 2016 04:16:45 +0000 Received: from mail-oi0-x231.google.com ([2607:f8b0:4003:c06::231]) by makus.postgresql.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_CBC_SHA1:256) (Exim 4.84_2) (envelope-from ) id 1b5QFN-0007pp-93 for pgsql-performance@postgresql.org; Wed, 25 May 2016 04:16:44 +0000 Received: by mail-oi0-x231.google.com with SMTP id k23so60000970oih.0 for ; Tue, 24 May 2016 21:16:40 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bowt-ie.20150623.gappssmtp.com; s=20150623; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc; bh=axH0NTrpdYZKCzA6f6N0JiqHyoPV6zKVNb1wLTo60os=; b=fP4Z1QDDoiQo18+eXKJl2UtZlcdpuA9u8D1k3mOYDerw1OArJF6IpHfqNgpJtPPB2P w6w+1gbWFBA53jMMUuE/BFw7pIefWR6EEDPjKhybxQnqXs7YXvdMSSUvL+mUvnTucbX7 U3CtKAvqI6WYIZnxFvxA6cmiZvfu+XAlxX8CaRvj8Wos2JtmpRA0AhqdBAmCflOwTEPZ IQYEspDRK6aRW9DytVoyq/4tsnkzyyoZ+E64T7n18C5/yAxc+XId7uXwGk1wBow+pNeQ 7NFuFJ15dzgZ+543PPGjEosGPh1MJLd5KpfQ0ZFFJFClIXNqCGfc7XVnS+40VJh6nACg 5hfA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc; bh=axH0NTrpdYZKCzA6f6N0JiqHyoPV6zKVNb1wLTo60os=; b=alluCBy/1JGFzGwgRKaA7413SRwsV9/50diKt4849Xffa9jY5kHNigAT4lacUQF8YP LZI+H7HWVdXA1SoUtXLt8ciL6uCH2MQICj0RmxbSxc1wwULjkbiU34wa4uFYJi8kyZng LMteHPVWj9q+cgxsoGZqJjr396r6EpvBzfilqcTyTn8aoL6oQzmV/cozYErVpFcO+PRP acF0Wstk0xd+YS9VmCdu2MgttR6v2YdGiGGW5Sk9lw32UWEHi5b5z9SOsFlc7J6cMSEY yYX23i84Bpx4dxrwdRVu2dnXQIcLalO+0jFgJOryVtoKMQaM/HYZpZ+b+jmXF455lp00 /Cfw== X-Gm-Message-State: ALyK8tKW6Djdr0jeEvBz/7jLGNDrpEH6CCHy6kc3fb97J6Uwq3Jhqz585y80ekMkvHz1heSo0hVc5G4+DDGUeA== X-Received: by 10.202.168.143 with SMTP id r137mr1041512oie.40.1464149799824; Tue, 24 May 2016 21:16:39 -0700 (PDT) MIME-Version: 1.0 Received: by 10.157.17.74 with HTTP; Tue, 24 May 2016 21:16:20 -0700 (PDT) X-Originating-IP: [75.101.100.201] In-Reply-To: <20160524173914.GA11880@telsasoft.com> References: <20160524173914.GA11880@telsasoft.com> From: Peter Geoghegan Date: Tue, 24 May 2016 21:16:20 -0700 Message-ID: Subject: Re: index fragmentation on insert-only table with non-unique column To: Justin Pryzby Cc: pgsql-performance@postgresql.org Content-Type: text/plain; charset=UTF-8 X-Pg-Spam-Score: -2.6 (--) List-Archive: List-Help: List-ID: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: X-Mailing-List: pgsql-performance Precedence: bulk Sender: pgsql-performance-owner@postgresql.org On Tue, May 24, 2016 at 10:39 AM, Justin Pryzby wrote: > I was able to see great improvement without planner parameters by REINDEX the > timestamp index. My theory is that the index/planner doesn't handle well the > case of many tuples with same column value, and returns pages out of logical > order. Reindex fixes that, rewriting the index data with pages in order > (confirmed with pageinspect), which causes index scans to fetch heap data more > or less monotonically (if not consecutively). strace shows that consecutive > read()s are common (without intervening seeks). I gather this allows the OS > readahead to kick in. The basic problem is that the B-Tree code doesn't maintain this property. However, B-Tree index builds will create an index that initially has this property, because the tuplesort.c code happens to sort index tuples with a CTID tie-breaker. > Postgres seems to assume that the high degree of correlation of the table > column seen in pg_stats is how it will get data from the index scan, which > assumption seems to be very poor on what turns out to be a higly fragmented > index. Is there a way to help it to understand otherwise?? Your complaint is vague. Are you complaining about the planner making a poor choice? I don't think that's the issue here, because you never made any firm statement about the planner making a choice that was worth than an alternative that it had available. If you're arguing for the idea that B-Trees should reliably keep tuples in order by a tie-break condition, that seems difficult to implement, and likely not worth it in practice. -- Peter Geoghegan -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance