Date: Tue, 29 Aug 2006 07:04:38 +0200
From: Tino Wildenhain
To: "Joshua D. Drake"
Cc: PostgreSQL WWW
Subject: Re: A counter productive conversation about search.

Joshua D. Drake wrote:
...
> Rolling our own really wouldn't be that hard "if" we can create a
> reasonably smart web page grabber. We have all the tools (tsearch2 and
> pg_trgm) to easily do the searches.
>
> So is anyone up for helping develop a page grabber?

That's not the hardest part, but why do we need to grab anything if the contents of the pages could be in the database? Admittedly, I don't know of any good CMS with a PostgreSQL backend.
Anyway, grabbing the sources of the pages at the moment they are published (like the DocBook sources for the documentation) makes a lot more sense imho. Ditto for the archives. It's much easier to get an idea of the structure and nature of the data when you don't have to deal with the final result (e.g. HTML).

So a couple of scripts that fire when mail comes in, when documentation is compiled, and when some other publishing takes place could really help to keep the index in sync without having to crawl all the sites over and over again.

Regards
Tino Wildenhain
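
To make the publish-time idea above concrete, here is a minimal sketch in Python. It is not from the thread: the class name `SourceIndex`, its methods, and the document ids are all hypothetical, and the in-memory dictionary only stands in for whatever tsearch2-backed table would really hold the index. The point is that each publishing event (incoming mail, a docs build) calls one hook with the source text, so nothing has to be crawled afterwards.

```python
import re
from collections import defaultdict


class SourceIndex:
    """Tiny in-memory inverted index (hypothetical stand-in for a real search table)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.terms_by_doc = {}            # document id -> terms currently indexed

    def publish(self, doc_id, source_text):
        """Hook to call whenever a document is published or republished."""
        self.remove(doc_id)  # drop stale terms so the index stays in sync
        terms = set(re.findall(r"[a-z0-9_]+", source_text.lower()))
        self.terms_by_doc[doc_id] = terms
        for term in terms:
            self.postings[term].add(doc_id)

    def remove(self, doc_id):
        """Forget everything previously indexed for this document."""
        for term in self.terms_by_doc.pop(doc_id, set()):
            self.postings[term].discard(doc_id)

    def search(self, query):
        """Return ids of documents that contain every term in the query."""
        terms = re.findall(r"[a-z0-9_]+", query.lower())
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


# Example: one call per publishing event, no crawling involved.
index = SourceIndex()
index.publish("archives/200608/151", "Grabbing the sources of the pages")
index.publish("docs/tsearch2", "Full text search with tsearch2")
matches = index.search("tsearch2")
```

Republishing a document under the same id replaces its old terms, which is what keeps the index in step with the sources.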