Date: Sun, 5 Sep 2004 17:25:52 +0400 (MSD)
From: Oleg Bartunov
To: John Hansen
Cc: Greg Sabino Mullane, pgsql-www@postgresql.org
Subject: Re: Suggestion for improving Archives
In-Reply-To: <5066E5A966339E42AA04BA10BA706AE56190@rodrick.geeknet.com.au>

On Sun, 5 Sep 2004, John Hansen wrote:

> > Marc again dropped the last-modified header, so it's
> > impossible to sort results by date (in the general case) without
> > a specific parser.
>
> Yes, that is unfortunate, but the code required to make this happen puts
> stress on the archives to some degree.

What code? I've seen that last-modified header, and now it's gone.
There is no stress on the archives; it's purely a question of several lines of code.

> >
> > Also, he changed the template for messages.
> > These changes cause
> > recrawling of the whole archive each time and overload
> > archives.postgresql.org. A more specific search engine could use
> > another source of information about which messages to crawl, but
> > the one we use at pgsql.ru is a general search engine and it
> > can't get the modification date without a proper header.
>
> There should be no need to reindex the entire archive because of a
> template change, since if you honor the embedded
> .. tags, the body text never changes.
> Unless of course, you want to keep an up-to-date cached copy.
>

Hmm, this is a rather non-standard feature of archives.postgresql.org.
The problem is not with indexing/reindexing! The problem is with the crawler,
which doesn't have enough information to make the right decision. I don't like
non-standard solutions/hacks when there are standard and reliable solutions.

> >
> > I suggest:
> >
> > 1. Use a 3-server architecture (image server, frontend, backend), which
> >    could be reduced to 2 servers (image+frontend, backend) -
> >    the frontend could be plain apache+mod_accel and serve/cache
> >    all backend output; the backend is a mod_perl and/or PHP enabled apache.
> > 2. Return the last-modified header - be friendly to crawlers
> >    and browsers.
>
> Though an accelerator would only work if the last-modified header is returned
> by the backend, this might be worth looking into.
>

I don't see a problem with returning that header. But we'll have a standard
solution for a database-driven site with dynamic content. Note that one frontend
could serve/hide many backends.

> > 3. Stop changing the message template.
> >
>
> Template changes are inevitable, they're part of progress :)
>

It's not a portal page, it's just a message; why should it change so often?
I think I should teach our crawler to recognize whether changes were cosmetic
by using a fuzzy checksum.

> ...
> John
>

	Regards,
		Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
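
A rough sketch, in Python, of the crawler-side decision discussed in this message: trust the Last-Modified header when the server returns it, and otherwise fall back to a checksum that ignores cosmetic changes. This is only an illustration under stated assumptions, not the pgsql.ru crawler. All names (content_checksum, is_modified, needs_reindex) are hypothetical. The "fuzzy checksum" here is simplified to a SHA-1 over the page's visible text after stripping markup and normalizing whitespace and case, so it tolerates markup-only template changes but not changes to template wording. A real fuzzy hash such as simhash would be needed to tolerate small text edits.

```python
import hashlib
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def content_checksum(html):
    """Checksum of the visible text only (hypothetical helper).

    Markup, whitespace and letter case are normalized away, so a change
    that only touches tags or layout yields the same checksum.
    """
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = re.sub(r"\s+", " ", " ".join(parser.parts)).strip().lower()
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def is_modified(headers, last_crawled):
    """Use Last-Modified when the server sends it.

    Returns True or False when the header is present, or None when it is
    missing and the caller has to fall back to comparing checksums.
    last_crawled must be a timezone-aware datetime.
    """
    value = headers.get("Last-Modified")
    if value is None:
        return None
    return parsedate_to_datetime(value) > last_crawled


def needs_reindex(headers, body, last_crawled, stored_checksum):
    """Decide whether a fetched page has to be reindexed."""
    modified = is_modified(headers, last_crawled)
    if modified is not None:
        return modified
    return content_checksum(body) != stored_checksum


if __name__ == "__main__":
    old_page = "<html><body><div class='msg'>Hello   world</div></body></html>"
    new_page = "<table><tr><td><b>Hello</b> world</td></tr></table>"
    stored = content_checksum(old_page)
    last_crawled = datetime(2004, 9, 1, tzinfo=timezone.utc)
    # No Last-Modified header and a markup-only change.
    print(needs_reindex({}, new_page, last_crawled, stored))
    # Last-Modified present and newer than the last crawl.
    print(needs_reindex({"Last-Modified": "Sun, 05 Sep 2004 13:25:52 GMT"},
                        new_page, last_crawled, stored))
```

With the header present, the crawler never has to look at the body at all, which is the point of asking for it. The checksum path still requires downloading every page, so it reduces reindexing but not the load on the archive server.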