Date: Tue, 29 Aug 2006 09:37:04 +0400 (MSD)
From: Oleg Bartunov
To: Tino Wildenhain
cc: "Joshua D. Drake", PostgreSQL WWW
Subject: Re: A counter productive conversation about search.
In-Reply-To: <44F3CAE6.4040204@wildenhain.de>
References: <44F3B09C.3010104@commandprompt.com> <44F3CAE6.4040204@wildenhain.de>

On Tue, 29 Aug 2006, Tino Wildenhain wrote:

> Joshua D. Drake wrote:
> ...
>> Rolling our own really wouldn't be that hard "if" we can create a
>> reasonably smart web page grabber. We have all the tools (tsearch2 and
>> pg_trgm) to easily do the searches.
>>
>> So is anyone up for helping develop a page grabber?
>
> That's not the hardest part, but why do we need to grab if the contents
> of the pages could be in the database? But admittedly, I don't know
> any good CMS w/ postgresql backend.
> But anyway, grabbing the sources
> of the pages while they are published (like the docbook stuff
> for the documentation) makes a lot more sense imho. Ditto for the
> archives. It's much easier to get an idea of the structure and nature
> of the data when you don't have to deal with the final result (e.g. HTML)
>
> So a couple of scripts that fire when mail comes in, documentation
> is compiled and when some other publishing takes place could
> really help to keep the index in sync w/o having to crawl all sites
> over and over again.

This is exactly what we have on pgsql.ru/db/mw. We use procmail to fire
our backend to process each incoming message. That part is not a problem;
the most complex thing is the backend.

> Regards
> Tino Wildenhain

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
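
For illustration only, here is a minimal sketch of the kind of hook described above: a script that procmail could pipe each incoming message into, which pulls out the fields a search backend would index. This is an assumption about the general shape, not the code running on pgsql.ru; the function name `extract_record`, the field names, the sample message, and the recipe path in the comments are all hypothetical.

```python
from email import message_from_string


def extract_record(raw: str) -> dict:
    """Parse one raw mail message into the fields a search index might store."""
    msg = message_from_string(raw)
    body = ""
    # Take the first text/plain part; a non-multipart message yields itself.
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            body = payload.decode(charset, errors="replace") if payload else ""
            break
    return {
        "message_id": (msg.get("Message-ID") or "").strip(),
        "subject": (msg.get("Subject") or "").strip(),
        "sender": (msg.get("From") or "").strip(),
        "date": (msg.get("Date") or "").strip(),
        "body": body,
    }


# Hypothetical sample; under procmail the text would come from sys.stdin.read(),
# e.g. via a recipe such as:
#   :0 c
#   | /usr/local/bin/index_mail.py        (path is an assumption)
SAMPLE = (
    "From: someone@example.org\n"
    "Subject: Re: search\n"
    "Message-ID: <1@example.org>\n"
    "\n"
    "Index the sources, not the rendered HTML.\n"
)

if __name__ == "__main__":
    record = extract_record(SAMPLE)
    # A real backend would store this row and build a full-text index on it;
    # here we only show that the fields were extracted.
    print(record["subject"], "|", record["message_id"])
```

The point of the design is the one made in the thread: the hook sees the structured source (headers and plain-text body) at publish time, so nothing has to be crawled or scraped from HTML afterwards.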