Date: Fri, 13 Jan 2006 22:16:59 -0400 (AST)
From: "Marc G. Fournier"
To: Josh Berkus
cc: John Hansen, pgsql-www@postgresql.org, "Jim C. Nasby"
Subject: Re: Infrastructure monitoring
In-Reply-To: <200601131714.48132.josh@agliodbs.com>
Message-ID: <20060113220930.R28752@ganymede.hub.org>
References: <20060114010502.GW9017@pervasive.com> <200601131714.48132.josh@agliodbs.com>

On Fri, 13 Jan 2006, Josh Berkus wrote:

> Jim,
>
>> Search has been down for at least 2 days now, and this certainly isn't
>> the first time it's happened. There's also been cases of archives
>> getting stuck, and probably other outages besides those that went on
>> until someone email'd about it.
>>
>> Would it be difficult to setup something to monitor these various
>> services? I know there's at least one OSS tool to do it, though I have
>> no idea how hard it would be to tie that into the current
>> infrastructure.
>
> We have an open offer of Hyperic licenses, and they support FreeBSD now.

Not to discount the offer ... but what exactly would that provide us?
We already monitor the *servers*; it's what is inside the servers that
needs better monitoring ... knowing nothing about Hyperic, does it
provide something for that?

In the case of the archives, for instance, the problem was a perl
process that, for some unknown reason, got stuck randomly ... I removed
that in favor of an awk script, and it hasn't done it since ... I also
redirected cron's email to scrappy@postgresql.org, so that any errors
show up in my mailbox instead of root's, so I get an hourly reminder
that things are running well ...

In the case of search ...
John would be better at answering that, but when he and I talked this
past week, he mentioned that he was moving it all over to two new
servers, which I changed the DNS for on Wednesday ...

As I've said above ... physical servers are being monitored, so if
anyone has some ideas on how we can improve "content monitoring", for
lack of a better word, I know I'm all ears ...

Again, if Hyperic can offer something for this, let me know ...

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664