incomplete headers: archives.postgresql.org

From: Oleg Bartunov @ 2004-01-13 17:52 UTC (permalink / raw)
  To: pgsql-www

Hi there,

Crawling archives.postgresql.org is a pain because the response headers carry
no last-modified information, so a crawler has to download every message
again. For example:

megera@mira:~$ curl -I http://archives.postgresql.org/pgsql-hackers/2004-01/msg00282.php
HTTP/1.1 200 OK
Date: Tue, 13 Jan 2004 17:38:26 GMT
Server: Apache/1.3.28 (Unix) PHP/4.3.3RC1
X-Powered-By: PHP/4.3.3RC1
Content-Type: text/html

Is it possible to add at least a 'Last-Modified' header, so a crawler can
tell whether a page needs to be downloaded again? It would save bandwidth
and crawling time. I think the best way is to set the 'Last-Modified' header
to the date of the message, taken from its 'Date:' field. Of course, there
should be some protection against 'bad clocks', so the default time could be
the arrival time.
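
A minimal sketch of that rule, written in Python only for illustration (the archive pages themselves are PHP). The function name `last_modified_for`, the `arrival` argument and the one-day `max_skew` threshold are assumptions of mine, not anything the archive software provides: use the 'Date:' field when it looks sane, otherwise fall back to the arrival time.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime


def last_modified_for(date_header, arrival, now=None, max_skew=timedelta(days=1)):
    """Return a Last-Modified value for one archived message.

    date_header: raw 'Date:' field of the message (may be missing or bogus)
    arrival:     timezone-aware time the list server received the message
    max_skew:    hypothetical tolerance between 'Date:' and arrival time
    """
    now = now or datetime.now(timezone.utc)
    try:
        sent = parsedate_to_datetime(date_header)
        if sent.tzinfo is None:
            # No usable zone in the header; treat the value as UTC.
            sent = sent.replace(tzinfo=timezone.utc)
    except (TypeError, ValueError):
        sent = None  # unparseable or missing 'Date:' field

    # 'Bad clock' guard: future dates, or dates far from arrival, are ignored.
    if sent is None or sent > now or abs(sent - arrival) > max_skew:
        sent = arrival

    # HTTP dates are always expressed in GMT.
    return format_datetime(sent.astimezone(timezone.utc), usegmt=True)
```

How far a 'Date:' field may drift from the arrival time before it is distrusted is a policy choice; one day is just a placeholder.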

Also, it could be useful to add an 'Expires' header.
I think the headers should be added only to pages with individual messages,
not to indexes, because index pages do change.
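
Roughly, the server side could look like the sketch below (Python for illustration only). The `respond` function, its arguments and the 30-day `lifetime` are hypothetical. The sketch sends 'Last-Modified' and 'Expires' only for individual message pages, and answers 304 when the crawler's 'If-Modified-Since' value is not older than the message.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime


def http_date(dt):
    """Format a timezone-aware datetime as an HTTP date in GMT."""
    return format_datetime(dt.astimezone(timezone.utc), usegmt=True)


def respond(is_message_page, last_modified, now, if_modified_since=None,
            lifetime=timedelta(days=30)):
    """Return (status, headers) for a page request.

    lifetime is an arbitrary placeholder for how long a message page
    may be treated as fresh.
    """
    if not is_message_page:
        # Index pages keep changing, so they get no caching headers here.
        return 200, {}

    headers = {
        "Last-Modified": http_date(last_modified),
        "Expires": http_date(now + lifetime),
    }

    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
        except (TypeError, ValueError):
            since = None  # ignore a malformed conditional header
        # HTTP dates have one-second resolution, so drop microseconds.
        if since is not None and last_modified.replace(microsecond=0) <= since:
            return 304, headers  # crawler's copy is current; send no body

    return 200, headers
```

With this in place a crawler that remembers the 'Last-Modified' value from its previous visit gets an empty 304 response instead of the whole page.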

I don't think it's very difficult, but it would help the site and people.


btw, I use cacheability to check whether a page can be cached:
http://www.sai.msu.su/admin/cacheability/?query=http%3A%2F%2Farchives.postgresql.org%2Fpgsql-hackers...

http://archives.postgresql.org/pgsql-hackers/2004-01/msg00282.php
Expires   	  -
Cache-Control   	  -
Last-Modified   	  -
ETag   	  -
Content-Length  	  - (actual size: 13277)
Server  	Apache/1.3.28 (Unix) PHP/4.3.3RC1

This object will be considered stale, because it doesn't have any freshness
information assigned. It doesn't have a validator present. It doesn't have a Content-Length header present, so it can't be used in a HTTP/1.0 persistent connection.
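
The three complaints in that report can be approximated with a few header checks. This is only a rough sketch in Python, not the actual logic of the cacheability tool; the function name `cacheability_problems` is made up for the example.

```python
def cacheability_problems(headers):
    """List the caching-related gaps in a dict of response headers."""
    # Header names are case-insensitive, so normalise them first.
    h = {name.lower(): value for name, value in headers.items()}
    problems = []

    cache_control = h.get("cache-control", "").lower()
    if "expires" not in h and "max-age" not in cache_control:
        problems.append("no freshness information")

    if "last-modified" not in h and "etag" not in h:
        problems.append("no validator")

    if "content-length" not in h:
        problems.append("no Content-Length")

    return problems


# Headers taken from the curl -I output earlier in this message.
response = {
    "Date": "Tue, 13 Jan 2004 17:38:26 GMT",
    "Server": "Apache/1.3.28 (Unix) PHP/4.3.3RC1",
    "X-Powered-By": "PHP/4.3.3RC1",
    "Content-Type": "text/html",
}
print(cacheability_problems(response))
```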





	Regards,
		Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: [email protected], http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83




From: Marc G. Fournier @ 2004-01-13 19:49 UTC (permalink / raw)
  To: Oleg Bartunov <[email protected]>; +Cc: pgsql-www


Let me look into it ... I don't think adding that info is particularly
difficult, just a matter of adding a couple of 'header()' calls at the top
of the PHP ... added to my TODO list ...


----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: [email protected]           Yahoo!: yscrappy              ICQ: 7615664



