public inbox for [email protected]
9+ messages / 4 participants
* developer.pgadmin.org/nagios.pgadmin.org - Disk failure
@ 2006-05-12 21:47 Dave Page <[email protected]>
0 siblings, 1 reply; 9+ messages in thread
From: Dave Page @ 2006-05-12 21:47 UTC (permalink / raw)
To: [email protected]; pgsql-www; +Cc: Andreas Pflug <[email protected]>; Greg Sabino Mullane <[email protected]>; Mark Yeatman <[email protected]>
The machine hosting the developer.pgadmin.org and nagios.pgadmin.org
vservers is currently having serious filesystem problems, which are
causing disk intensive operations (like rsync, tar) to segfault for
currently unknown reasons. If you commit to the pgAdmin SVN, please hold
off for a while, or if you are working on other projects on the machine,
please don't for now!
If anyone has any idea what might cause ReiserFS to die horribly like
this, whilst the RAID1 disks don't so much as squeak in the wrong way,
I'd love to hear it!!
Anyhoo, I have backups, and a replacement machine sitting in the wings
so I should be able to get things sorted early next week.
Regards, Dave
* Re: developer.pgadmin.org/nagios.pgadmin.org - Diskfailure
@ 2006-05-12 22:37 Dave Page <[email protected]>
0 siblings, 1 reply; 9+ messages in thread
From: Dave Page @ 2006-05-12 22:37 UTC (permalink / raw)
To: Jeff MacDonald <[email protected]>; +Cc: [email protected]; pgsql-www
> -----Original Message-----
> From: Jeff MacDonald [mailto:[email protected]]
> Sent: 12 May 2006 23:19
> To: Dave Page
> Cc: Jeff MacDonald
> Subject: Re: [pgsql-www]
> developer.pgadmin.org/nagios.pgadmin.org - Diskfailure
>
> On Fri, 2006-05-12 at 22:47 +0100, Dave Page wrote:
> > The machine hosting the developer.pgadmin.org and
> nagios.pgadmin.org
> > vservers is currently having serious filesystem problems, which are
> > causing disk intensive operations (like rsync, tar) to segfault for
> > currently unknown reasons.
>
> do a memory test, swap as needed, see if that solves the
> problem..
I'll try just replacing it - I have some unopened sticks for that mobo.
FWIW, a reboot with a forced fsck found no errors at all and the box is
currently working OK, but I have now found errors similar to the
following:
May 12 21:11:29 barbas rsyncd[32134]: rsync: writefd_unbuffered failed to write 4 bytes: phase "send_file_entry" [sender]: Broken pipe (32)
May 12 21:11:29 barbas rsyncd[32134]: rsync error: error in rsync protocol data stream (code 12) at io.c(1126) [sender]
May 12 22:13:52 barbas kernel: kernel BUG at page_alloc.c:142!
May 12 22:13:52 barbas kernel: invalid operand: 0000
May 12 22:13:52 barbas kernel: CPU: 1
May 12 22:13:52 barbas kernel: EIP: 0010:[<c013cec0>] Not tainted
May 12 22:13:52 barbas kernel: EFLAGS: 00010286
May 12 22:13:52 barbas kernel: eax: d9e18100 ebx: c262c140 ecx: c262c140 edx: 00000000
May 12 22:13:52 barbas kernel: esi: c262c140 edi: 00000000 ebp: 00000000 esp: d50d5edc
May 12 22:13:52 barbas kernel: ds: 0018 es: 0018 ss: 0018
May 12 22:13:52 barbas kernel: Process rsync (pid: 32141, stackpage=d50d5000)
May 12 22:13:52 barbas kernel: Stack: d50d5ee8 c0133ab0 00001000 c262c140 e3a59d44 00006000 c01348e9 00000000
May 12 22:13:52 barbas kernel: 00000000 00001000 c262c140 e3a59d44 00000000 c013423d d50d5f7c c262c140
May 12 22:13:52 barbas kernel: 00000000 00001000 00001000 00000001 00000000 0000013b e3a59c80 c01347f0
May 12 22:13:52 barbas kernel: Call Trace: [<c0133ab0>] [<c01348e9>] [<c013423d>] [<c01347f0>] [<c01347f0>]
May 12 22:13:52 barbas kernel: [<c0134a2f>] [<c01347f0>] [<c0145a50>] [<c0108fdf>]
May 12 22:13:52 barbas kernel:
May 12 22:13:52 barbas kernel: Code: 0f 0b 8e 00 6b ba 37 c0 e9 ba fd ff ff 8b 69 60 85 ed 0f 85
Could well be a duff stick I guess, given where it died.
> the quicker solution may be to just put the backup
> machine into production rather than running exhaustive memory tests.
Yes, well it was going into it anyway to get it out of the current 3U
chassis and into a 1U one with full OOB management. The only problem is
that I'm still awaiting delivery of a cable for the external tape drive
in the rack so I can only do rsync/scp backups until that arrives.
Regards, Dave.
* Re: developer.pgadmin.org/nagios.pgadmin.org - Disk failure
@ 2006-05-13 02:04 Travis Hein <[email protected]>
parent: Dave Page <[email protected]>
0 siblings, 0 replies; 9+ messages in thread
From: Travis Hein @ 2006-05-13 02:04 UTC (permalink / raw)
To: pgsql-www
I used to have a pair of old SCSI drives in software RAID1. I never used
ReiserFS; it was ext2 back in the day. It worked great most of the time,
except when I did backups, when there was more bus or bulk activity. At
first I went nuts thinking the SCSI tape drive was badly terminated or
wreaking havoc on the bus, but then I found the same problem happened with
network backups and backups to an IDE drive.
The issue was that the kernel SCSI card driver was using tagged command
queueing (TCQ), but one of my drives didn't know what to do with it, whilst
the other drive did support it. I am not a SCSI scientist, but my best
theory was that something eventually broke under high load because of the
mismatched TCQ support, and the software RAID1 didn't know what to do then.
I never did fix it; I moved to new hardware and abandoned the system
altogether.
Well, sorry to hear things are funny there, and it is probably not related to
my colorful ranting.
But I have some space you can borrow to stuff things on, if it is helpful.
$> df -h
Filesystem  Size  Used  Avail  Use%  Mounted on
/dev/sda1   337G  196G  141G   59%   /
/dev/sdb1   231G  184G  48G    80%   /mnt/backup
/ (the one with 141G free) is a 6-element RAID 5 on ext3, with rsync backups
to /mnt/backup, which is a USB drive, but it works :)
let me know if there is anything I can do.
On Friday 12 May 2006 17:47, Dave Page wrote:
> The machine hosting the developer.pgadmin.org and nagios.pgadmin.org
> vservers is currently having serious filesystem problems, which are
> causing disk intensive operations (like rsync, tar) to segfault for
> currently unknown reasons. If you commit to the pgAdmin SVN, please hold
> off for a while, or if you are working on other projects on the machine,
> please don't for now!
>
> If anyone has any idea what might cause ReiserFS to die horribly like
> this, whilst the RAID1 disks don't so much as squeak in the wrong way,
> I'd love to hear it!!
>
> Anyhoo, I have backups, and a replacement machine sitting in the wings
> so I should be able to get things sorted early next week.
>
> Regards, Dave
--
Only those who attempt the absurd can achieve the impossible.
* Re: [pgadmin-hackers] developer.pgadmin.org/nagios.pgadmin.org
@ 2006-05-13 11:45 Raphaël Enrici <[email protected]>
parent: Dave Page <[email protected]>
0 siblings, 0 replies; 9+ messages in thread
From: Raphaël Enrici @ 2006-05-13 11:45 UTC (permalink / raw)
To: Dave Page <[email protected]>; +Cc: Jeff MacDonald <[email protected]>; [email protected]; pgsql-www
Dave Page wrote:
>
>
>
>>-----Original Message-----
>>From: Jeff MacDonald [mailto:[email protected]]
>>Sent: 12 May 2006 23:19
>>To: Dave Page
>>Cc: Jeff MacDonald
>>Subject: Re: [pgsql-www]
>>developer.pgadmin.org/nagios.pgadmin.org - Diskfailure
>>
>>On Fri, 2006-05-12 at 22:47 +0100, Dave Page wrote:
>>
>>>The machine hosting the developer.pgadmin.org and
>>
>>nagios.pgadmin.org
>>
>>>vservers is currently having serious filesystem problems, which are
>>>causing disk intensive operations (like rsync, tar) to segfault for
>>>currently unknown reasons.
>>
>>do a memory test, swap as needed, see if that solves the
>>problem..
>
>
> I'll try just replacing it - I have some unopened sticks for that mobo.
> FWIW, a reboot with a forced fsck found no errors at all and the box is
> currently working OK, but I have now found errors similar to the
> following:
>
> May 12 21:11:29 barbas rsyncd[32134]: rsync: writefd_unbuffered failed
> to write 4 bytes: phase "send_file_entry" [sender]: Broken pipe (32)
> May 12 21:11:29 barbas rsyncd[32134]: rsync error: error in rsync
> protocol data stream (code 12) at io.c(1126) [sender]
> May 12 22:13:52 barbas kernel: kernel BUG at page_alloc.c:142!
> May 12 22:13:52 barbas kernel: invalid operand: 0000
> May 12 22:13:52 barbas kernel: CPU: 1
> May 12 22:13:52 barbas kernel: EIP: 0010:[<c013cec0>] Not tainted
> May 12 22:13:52 barbas kernel: EFLAGS: 00010286
> May 12 22:13:52 barbas kernel: eax: d9e18100 ebx: c262c140 ecx:
> c262c140 edx: 00000000
> May 12 22:13:52 barbas kernel: esi: c262c140 edi: 00000000 ebp:
> 00000000 esp: d50d5edc
> May 12 22:13:52 barbas kernel: ds: 0018 es: 0018 ss: 0018
> May 12 22:13:52 barbas kernel: Process rsync (pid: 32141,
> stackpage=d50d5000)
> May 12 22:13:52 barbas kernel: Stack: d50d5ee8 c0133ab0 00001000
> c262c140 e3a59d44 00006000 c01348e9 00000000
> May 12 22:13:52 barbas kernel: 00000000 00001000 c262c140
> e3a59d44 00000000 c013423d d50d5f7c c262c140
> May 12 22:13:52 barbas kernel: 00000000 00001000 00001000
> 00000001 00000000 0000013b e3a59c80 c01347f0
> May 12 22:13:52 barbas kernel: Call Trace: [<c0133ab0>] [<c01348e9>]
> [<c013423d>] [<c01347f0>] [<c01347f0>]
> May 12 22:13:52 barbas kernel: [<c0134a2f>] [<c01347f0>] [<c0145a50>]
> [<c0108fdf>]
> May 12 22:13:52 barbas kernel:
> May 12 22:13:52 barbas kernel: Code: 0f 0b 8e 00 6b ba 37 c0 e9 ba fd ff
> ff 8b 69 60 85 ed 0f 85
Dave,
I recently (2 months ago) experienced a kernel crash with reiserfs after
an electrical failure. I solved the problem by doing a full fsck (I
mean fsck and then a reiserfs rebuild of the tree [dangerous]). It
worked, at least for me.
Regards,
Raphaël
* Re: [pgadmin-hackers] developer.pgadmin.org/nagios.pgadmin.org - Diskfailure
@ 2006-05-13 19:29 Dave Page <[email protected]>
0 siblings, 0 replies; 9+ messages in thread
From: Dave Page @ 2006-05-13 19:29 UTC (permalink / raw)
To: [email protected]; [email protected]; +Cc: [email protected]; [email protected]; pgsql-www
-----Original Message-----
From: "Raphaël Enrici"<[email protected]>
Sent: 13/05/06 12:45:59
To: "Dave Page"<[email protected]>
Cc: "Jeff MacDonald"<[email protected]>, "[email protected]"<[email protected]>, "PostgreSQL WWW"<[email protected]>
Subject: Re: [pgadmin-hackers] [pgsql-www] developer.pgadmin.org/nagios.pgadmin.org - Diskfailure
Hi Raph,
>I recently (2 months ago) experienced kernel crash with reiserfs after
> some electrical failure. I solved the problem by doing a full fsck (I
> mean fsck and then a reiserfs rebuild of the tree [dangerous]). It
> worked, at least for me.
Thanks - I'm leaning towards the memory issue at the moment, as the box seems to be OK again following a reboot, and the svn repo which previously wouldn't tar or rsync now verifies perfectly and can be tarred up.
I'll swap the sticks on Monday, and if that doesn't work, then consider a 'full fsck'. If that fails, I guess I'll just move it into the new chassis, and use scp backups to another box until the new SCSI cable arrives.
Cheers, Dave
* Re: developer.pgadmin.org/nagios.pgadmin.org - Disk failure
@ 2006-05-13 19:29 Dave Page <[email protected]>
0 siblings, 0 replies; 9+ messages in thread
From: Dave Page @ 2006-05-13 19:29 UTC (permalink / raw)
To: [email protected]; pgsql-www
-----Original Message-----
From: "Travis Hein"<[email protected]>
Sent: 13/05/06 03:11:53
To: "[email protected]"<[email protected]>
Subject: Re: [pgsql-www] developer.pgadmin.org/nagios.pgadmin.org - Disk failure
> Well, sorry to hear things are funny there, and it is probably not related to
> my colorful ranting.
:-)
This box has been fine for 18 months or so, so I doubt it's the same issue as yours - still, it's always interesting to hear of others' experiences.
> But I have some space you can borrow to stuff things on, if it is helpful.
Thanks - space isn't a problem though, so I should be OK.
Cheers, Dave.
* developer.pgadmin.org/nagios.pgadmin.org server - update
@ 2006-05-17 09:22 Dave Page <[email protected]>
0 siblings, 0 replies; 9+ messages in thread
From: Dave Page @ 2006-05-17 09:22 UTC (permalink / raw)
To: [email protected]; pgsql-www
Just a quick update on the server problem I was experiencing...
Swapped out the memory on Monday, and put the old sticks into a
different box for testing. The memory appears fault free after 24 hours
of memtest86; however, the server, whilst appearing fine, threw another
error during last night's tape backup.
I've swapped out the mobo & CPU this morning (which, coupled with an ARP
cache on a router, was the cause of this morning's noise from Nagios on
the -slaves list) so we'll see how that goes.
I'm currently working on the system again but being careful to keep an
eye on the logs and keep additional backups of my work around just in
case. I'd advise anyone working on the PostgreSQL website, pgAdmin or
the Nagios install to do the same. Once I have the last cable for the
new server I'll move it all across once and for all.
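For anyone who would rather not watch the logs by hand, here is a minimal sketch of one way to flag suspicious entries. It assumes classic syslog lines in the format quoted earlier in this thread, with one entry per line (i.e. not re-wrapped by a mail client). The names `find_alerts` and `PATTERNS` are made up for illustration and the watch list is only a guess at what is worth flagging.

```python
import re

# Matches classic syslog lines such as:
#   May 12 22:13:52 barbas kernel: kernel BUG at page_alloc.c:142!
SYSLOG_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"(?P<tag>[^:\[\s]+)(?:\[\d+\])?: (?P<msg>.*)$"
)

# Hypothetical watch list: substrings taken from the errors seen so far.
PATTERNS = ("kernel BUG", "rsync error")


def find_alerts(lines):
    """Return (timestamp, tag, message) for each line whose message
    contains one of the watched substrings."""
    alerts = []
    for line in lines:
        m = SYSLOG_RE.match(line.rstrip("\n"))
        if m and any(p in m.group("msg") for p in PATTERNS):
            alerts.append((m.group("stamp"), m.group("tag"), m.group("msg")))
    return alerts


if __name__ == "__main__":
    sample = [
        "May 12 21:11:29 barbas rsyncd[32134]: rsync error: error in rsync "
        "protocol data stream (code 12) at io.c(1126) [sender]",
        "May 12 22:13:52 barbas kernel: kernel BUG at page_alloc.c:142!",
        "May 12 22:13:52 barbas kernel: CPU: 1",
    ]
    for stamp, tag, msg in find_alerts(sample):
        print(f"{stamp} [{tag}] {msg}")
```

In practice the lines would come from the system log file rather than a hard-coded list, and matching on plain substrings is deliberately crude; it only narrows down what to look at.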
Cheers, Dave.
PS. For those that are wondering what this has to do with the pgsql-www
list, nagios.pgadmin.org monitors all the postgresql.org servers, and a
number of the webteam use developer.pgadmin.org as a dev machine for the
main PostgreSQL website.
* Re: [pgadmin-hackers] developer.pgadmin.org/nagios.pgadmin.org server - update
@ 2006-05-18 14:53 Dave Page <[email protected]>
0 siblings, 1 reply; 9+ messages in thread
From: Dave Page @ 2006-05-18 14:53 UTC (permalink / raw)
To: [email protected]; pgsql-www
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Dave Page
> Sent: 17 May 2006 10:23
> To: [email protected]; pgsql-www
> Subject: [pgadmin-hackers]
> developer.pgadmin.org/nagios.pgadmin.org server - update
>
> Once I have the last cable for the new server I'll move
> it all across once and for all.
This is now done. Can anyone with an account on either VM please check
that things look OK?
Thanks, Dave.
* Re: [pgadmin-hackers] developer.pgadmin.org/nagios.pgadmin.org server - update
@ 2006-05-18 18:45 Robert Treat <[email protected]>
parent: Dave Page <[email protected]>
0 siblings, 0 replies; 9+ messages in thread
From: Robert Treat @ 2006-05-18 18:45 UTC (permalink / raw)
To: pgsql-www; +Cc: Dave Page <[email protected]>; [email protected]
On Thursday 18 May 2006 10:53, Dave Page wrote:
> > -----Original Message-----
> > From: [email protected]
> > [mailto:[email protected]] On Behalf Of Dave Page
> > Sent: 17 May 2006 10:23
> > To: [email protected]; pgsql-www
> > Subject: [pgadmin-hackers]
> > developer.pgadmin.org/nagios.pgadmin.org server - update
> >
> > Once I have the last cable for the new server I'll move
> > it all across once and for all.
>
> This is now done. Can anyone with an account on either VM please check
> that things look OK?
>
My cursory glance looks good. :-)
--
Robert Treat
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
Thread overview: 9+ messages
2006-05-12 21:47 developer.pgadmin.org/nagios.pgadmin.org - Disk failure Dave Page <[email protected]>
2006-05-13 02:04 ` Travis Hein <[email protected]>
2006-05-12 22:37 Re: developer.pgadmin.org/nagios.pgadmin.org - Diskfailure Dave Page <[email protected]>
2006-05-13 11:45 ` Re: [pgadmin-hackers] developer.pgadmin.org/nagios.pgadmin.org Raphaël Enrici <[email protected]>
2006-05-13 19:29 Re: [pgadmin-hackers] developer.pgadmin.org/nagios.pgadmin.org - Diskfailure Dave Page <[email protected]>
2006-05-13 19:29 Re: developer.pgadmin.org/nagios.pgadmin.org - Disk failure Dave Page <[email protected]>
2006-05-17 09:22 developer.pgadmin.org/nagios.pgadmin.org server - update Dave Page <[email protected]>
2006-05-18 14:53 Re: [pgadmin-hackers] developer.pgadmin.org/nagios.pgadmin.org server - update Dave Page <[email protected]>
2006-05-18 18:45 ` Re: [pgadmin-hackers] developer.pgadmin.org/nagios.pgadmin.org server - update Robert Treat <[email protected]>