Subject: Re: Volunteer to build a configuration tool
From: Mark Lewis <mark.lewis@mir3.com>
To: Greg Smith <gsmith@gregsmith.com>
Cc: pgsql-performance@postgresql.org
In-Reply-To: <Pine.GSO.4.64.0706201044100.9089@westnet.com>
References: <A3AC4FA47DC0B1458C3E5396E685E63302395E53@SAB-DC1.sab.uiuc.edu>
	<1182298719.4482.133.camel@ipso.snappymail.ca>
	<Pine.GSO.4.64.0706200040001.7594@westnet.com>
	<7621.1182318230@sss.pgh.pa.us>
	<Pine.GSO.4.64.0706200224430.16657@westnet.com>
	<9001.1182323187@sss.pgh.pa.us> <op.tt7mn0m9cigqcu@apollo13>
	<Pine.GSO.4.64.0706201044100.9089@westnet.com>
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
Organization: MIR3, Inc.
Date: Wed, 20 Jun 2007 10:27:17 -0700
Message-Id: <1182360437.14407.40.camel@archimedes>
Mime-Version: 1.0

On Wed, 2007-06-20 at 11:21 -0400, Greg Smith wrote:
...
> One of the things that was surprising to me when I started looking at the 
> organization of the PostgreSQL buffer cache is how little gross 
> information about its contents is available.  I kept expecting to find a 
> summary section where you could answer questions like "how much of the 
> cache currently has information about index/table X?" used as an input to 
> the optimizer.  I understand that the design model expects much of this is 
> unknowable due to the interaction with the OS cache, and in earlier 
> versions you couldn't make shared_buffers big enough for its contents to 
> be all that interesting, so until recently this wasn't worth collecting.
> 
> But in the current era, where it's feasible to have multi-GB caches 
> efficiently managed by PG and one can expect processor time is relatively 
> cheap, it seems to me one way to give a major boost to the optimizer is to 
> add some overhead to buffer cache management so it collects such 
> information.  When I was trying to do a complete overhaul on the 
> background writer, the #1 problem was that I had to assemble my own 
> statistics on what was inside the buffer cache as it was scanned, because 
> a direct inspection of every buffer is the only way to know things like 
> what percentage of the cache is currently dirty.
...

One problem with feeding the current state of the buffer cache to the
planner is that the planner may be trying to prepare a plan which will
execute 10,000 times.  For many interesting queries, the state of the
cache will be very different after the first execution, as indexes and
active portions of tables are brought in.

For that matter, an early stage of query execution could significantly
change the contents of the buffer cache as seen by a later stage of the
execution, even inside a single query.

I'm not saying that inspecting the buffer cache more is a bad idea, but
gathering useful information with the current planner is a bit tricky.

For purposes of idle speculation, one could envision some non-trivial
changes to PG which would make really slick use this data:

(1) Allow PG to defer deciding whether to perform an index scan or
sequential scan until the moment it is needed, and then ask the buffer
cache what % of the pages from the relevant indexes/tables are currently
cached.

(2) Automatically re-plan prepared queries with some kind of frequency
(exponential in # of executions?  fixed-time?), to allow the plans to
adjust to changes in the buffer-cache.

Besides being hard to build, the problem with these approaches (or any
other approach which takes current temporary state into account) is that
as much as some of us might want to make use of every piece of data
available to make the planner into a super-brain, there are lots of
other folks who just want plan stability.  The more dynamic the system
is, the less predictable it can be, and especially in mission-critical
stuff, predictability matters more than .  Tom said it really well in a
recent post, 

"To me, the worst catch-22 we face in this area is that we'd like the
optimizer's choices of plan to be stable and understandable, but the
real-world costs of queries depend enormously on short-term conditions
such as how much of the table has been sucked into RAM recently by
other queries.  I have no good answer to that one."