From: Oleg Bartunov <[email protected]>
To: John Hansen <[email protected]>
Cc: Greg Sabino Mullane <[email protected]>
Cc: [email protected]
Subject: Re: Suggestion for improving Archives
Date: Sun, 5 Sep 2004 17:25:52 +0400 (MSD)
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>
References: <[email protected]>
On Sun, 5 Sep 2004, John Hansen wrote:
> > Marc again dropped the Last-Modified header, so it's
> > impossible to sort results by date (in the general case) without
> > a specific parser.
>
> Yes, that is unfortunate, but the code required to make this happen puts
> stress on the archives to some degree.
What code? I've seen that Last-Modified header before, and now it's gone.
It puts no stress on the archives; it's purely a matter of a few lines of code.
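
To illustrate the "few lines of code" point, here is a minimal sketch of how
a backend could emit a Last-Modified header and answer If-Modified-Since
requests. The function name `conditional_response` and its parameters are
hypothetical; nothing here is taken from the actual archives software, and it
assumes the backend knows when each message last changed.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(message_mtime, if_modified_since=None):
    """Return (status, headers) for an archived message (hypothetical helper).

    message_mtime: timezone-aware datetime of the message's last change.
    if_modified_since: raw If-Modified-Since request header value, or None.
    """
    # HTTP dates have one-second resolution, so drop microseconds.
    mtime = message_mtime.astimezone(timezone.utc).replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(mtime, usegmt=True)}

    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparsable header: send the full response
        # Only compare when the client's date carries a timezone.
        if since is not None and since.tzinfo is not None and mtime <= since:
            return 304, headers  # Not Modified: no body needs to be sent

    return 200, headers
```

A crawler that repeats the header value it received earlier would get a 304
and could skip the message entirely.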
>
> > Also, he changed the message template. These changes cause
> > the whole archive to be recrawled each time, overloading
> > archives.postgresql.org. A more specialized search engine could use
> > another source of information about which messages to crawl, but
> > the one we use at pgsql.ru is a general search engine and it
> > can't get the modification date without a proper header.
>
> There should be no need to reindex the entire archive because of a
> template change, since if you honor the embedded
> <!--noindex-->..<!--/noindex--> tags, the body text never changes.
> Unless of course, you want to keep an up-to-date cached copy.
>
Hmm, this is a rather non-standard feature of archives.postgresql.org.
The problem is not with indexing/reindexing! The problem is with the crawler,
which doesn't have enough information to make the right decision.
I don't like a non-standard solution/hack when there are standard and
reliable solutions.
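
The crawler-side decision described above can be sketched as follows. This is
only an illustration under stated assumptions: the names
`conditional_headers` and `should_reindex` are hypothetical, and the crawler
is assumed to store the Last-Modified value it saw on its previous visit.

```python
from email.utils import parsedate_to_datetime


def conditional_headers(stored_last_modified):
    """Request headers for a revisit, given the value stored last time."""
    if stored_last_modified:
        return {"If-Modified-Since": stored_last_modified}
    return {}


def should_reindex(stored_last_modified, status, response_headers):
    """Decide whether a fetched page has to be reindexed."""
    if status == 304:
        return False  # server confirmed the page is unchanged
    current = response_headers.get("Last-Modified")
    if current is None or stored_last_modified is None:
        # No modification date available: a general crawler
        # has to assume the page changed.
        return True
    try:
        return (parsedate_to_datetime(current)
                > parsedate_to_datetime(stored_last_modified))
    except (TypeError, ValueError):
        return True  # unparsable or incomparable dates: be conservative
```

Without the header, every branch ends in "reindex", which is the situation
the reply describes.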
> >
> > I suggest:
> >
> > 1. Use a 3-server architecture (image server, frontend, backend), which
> > could be reduced to 2 servers (image+frontend, backend).
> > The frontend could be plain apache+mod_accel and serve/cache
> > all backend output;
> > the backend is a mod_perl and/or PHP-enabled apache.
> > 2. Return the Last-Modified header: be friendly to crawlers
> > and browsers.
>
> Though an accelerator would only work if the Last-Modified header is
> returned by the backend, this might be worth looking into.
>
I don't see a problem with returning that header. And then we'll have a
standard solution for a database-driven site with dynamic content. Note that
one frontend could serve/hide many backends.
> > 3. stop changing message template
> >
>
> Template changes are inevitable, they're part of progress :)
>
It's not a portal page, it's just a message, so why should it change so
often? I think I should teach our crawler to recognize whether changes were
cosmetic, using a fuzzy checksum.
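
The message does not say which kind of fuzzy checksum is meant. One simple
interpretation, sketched here as an assumption, is to hash only the
normalized text of the page: drop the <!--noindex--> regions mentioned
earlier in the thread, strip the markup, and collapse whitespace and case, so
that a template change leaves the checksum unchanged. The name
`fuzzy_checksum` and the two regular expressions are hypothetical.

```python
import hashlib
import re

# Regions the archive marks as template chrome (markers as quoted in the thread).
NOINDEX = re.compile(
    r"<!--\s*noindex\s*-->.*?<!--\s*/noindex\s*-->",
    re.DOTALL | re.IGNORECASE,
)
# Crude tag stripper; adequate for a sketch, not a full HTML parser.
TAG = re.compile(r"<[^>]+>")


def fuzzy_checksum(html):
    """Checksum of the page text that ignores markup and template chrome."""
    text = NOINDEX.sub(" ", html)          # remove noindex regions first
    text = TAG.sub(" ", text)              # then strip remaining tags
    text = " ".join(text.split()).lower()  # collapse whitespace, fold case
    return hashlib.sha1(text.encode("utf-8")).hexdigest()
```

Two fetches of the same message under different templates would then compare
equal, while a change to the message text itself would not.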
> ... John
>
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: [email protected], http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83