Received: from localhost (maia-5.hub.org [200.46.204.182]) by postgresql.org (Postfix) with ESMTP id B86099FBFBB for ; Wed, 20 Jun 2007 14:27:25 -0300 (ADT) Received: from postgresql.org ([200.46.204.71]) by localhost (mx1.hub.org [200.46.204.182]) (amavisd-maia, port 10024) with ESMTP id 27562-03 for ; Wed, 20 Jun 2007 14:27:19 -0300 (ADT) X-Greylist: from auto-whitelisted by SQLgrey-1.7.4 Received: from mir3-fs.mir3.com (mail.mir3.com [65.208.188.100]) by postgresql.org (Postfix) with ESMTP id 34F769FBFC2 for ; Wed, 20 Jun 2007 14:27:19 -0300 (ADT) Received: MIR3-FS.mir3.com 172.16.1.11 from 172.16.2.68 172.16.2.68 via HTTP with MS-WebStorage 6.0.6249 Received: from archimedes by MIR3-FS.mir3.com; 20 Jun 2007 10:27:18 -0700 Subject: Re: Volunteer to build a configuration tool From: Mark Lewis To: Greg Smith Cc: pgsql-performance@postgresql.org In-Reply-To: References: <1182298719.4482.133.camel@ipso.snappymail.ca> <7621.1182318230@sss.pgh.pa.us> <9001.1182323187@sss.pgh.pa.us> Content-Type: text/plain Content-Transfer-Encoding: 7bit Organization: MIR3, Inc. Date: Wed, 20 Jun 2007 10:27:17 -0700 Message-Id: <1182360437.14407.40.camel@archimedes> Mime-Version: 1.0 X-Mailer: Evolution 2.8.1 X-Virus-Scanned: Maia Mailguard 1.0.1 X-Archive-Number: 200706/429 X-Sequence-Number: 25462 On Wed, 2007-06-20 at 11:21 -0400, Greg Smith wrote: ... > One of the things that was surprising to me when I started looking at the > organization of the PostgreSQL buffer cache is how little gross > information about its contents is available. I kept expecting to find a > summary section where you could answer questions like "how much of the > cache currently has information about index/table X?" used as an input to > the optimizer. I understand that the design model expects much of this is > unknowable due to the interaction with the OS cache, and in earlier > versions you couldn't make shared_buffers big enough for its contents to > be all that interesting, so until recently this wasn't worth collecting. > > But in the current era, where it's feasible to have multi-GB caches > efficiently managed by PG and one can expect processor time is relatively > cheap, it seems to me one way to give a major boost to the optimizer is to > add some overhead to buffer cache management so it collects such > information. When I was trying to do a complete overhaul on the > background writer, the #1 problem was that I had to assemble my own > statistics on what was inside the buffer cache as it was scanned, because > a direct inspection of every buffer is the only way to know things like > what percentage of the cache is currently dirty. ... One problem with feeding the current state of the buffer cache to the planner is that the planner may be trying to prepare a plan which will execute 10,000 times. For many interesting queries, the state of the cache will be very different after the first execution, as indexes and active portions of tables are brought in. For that matter, an early stage of query execution could significantly change the contents of the buffer cache as seen by a later stage of the execution, even inside a single query. I'm not saying that inspecting the buffer cache more is a bad idea, but gathering useful information with the current planner is a bit tricky. For purposes of idle speculation, one could envision some non-trivial changes to PG which would make really slick use this data: (1) Allow PG to defer deciding whether to perform an index scan or sequential scan until the moment it is needed, and then ask the buffer cache what % of the pages from the relevant indexes/tables are currently cached. (2) Automatically re-plan prepared queries with some kind of frequency (exponential in # of executions? fixed-time?), to allow the plans to adjust to changes in the buffer-cache. Besides being hard to build, the problem with these approaches (or any other approach which takes current temporary state into account) is that as much as some of us might want to make use of every piece of data available to make the planner into a super-brain, there are lots of other folks who just want plan stability. The more dynamic the system is, the less predictable it can be, and especially in mission-critical stuff, predictability matters more than . Tom said it really well in a recent post, "To me, the worst catch-22 we face in this area is that we'd like the optimizer's choices of plan to be stable and understandable, but the real-world costs of queries depend enormously on short-term conditions such as how much of the table has been sucked into RAM recently by other queries. I have no good answer to that one."