[opensuse-kernel] bnc #696474 and more weirdness

14 Jun 2011

This post is about the following bug report, which you need to read to
understand the rest:

https://bugzilla.novell.com/show_bug.cgi?id=696474

After removing commit=100 from the mount options, the symptoms mentioned
in the bug report disappear, but there is still a major weirdness.
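
For anyone who wants to check whether a commit= option is still active on
their own box, here is a minimal sketch. It assumes the mount table is
available in the usual /proc/mounts format; the function name
mounts_with_commit is made up for illustration and is not part of any
tool mentioned in the bug report.

```python
def mounts_with_commit(mounts_text):
    """Return (mount_point, commit_value) for entries with a commit= option.

    mounts_text is expected in /proc/mounts format:
    device mount_point fstype options dump pass
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        for opt in fields[3].split(","):
            if opt.startswith("commit="):
                found.append((mount_point, opt.split("=", 1)[1]))
    return found

# Hypothetical usage on a live system:
#   with open("/proc/mounts") as f:
#       print(mounts_with_commit(f.read()))
```

An empty list would mean no mounted filesystem currently carries an
explicit commit= option.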

I call it "everything is fine before midnight".

Between 23:00 and 00:00 each day, cron jobs start rsync and various
IO-intensive processes. That is when the apache server running on the
same machine starts to segfault with random errors. Sometimes there is
not even a workable backtrace; memory gets corrupted somehow. It keeps
segfaulting randomly even after the rsync process has ended. The only
solution is to reboot the box after those processes end.

There are no kernel warnings or errors. I have run memtest86+ and the
memtester user-space utility more than once, including for a whole night.

The machine is not out of RAM: it has at least 5 GB free and another
5 GB cached.

The only remotely related (????) warning I get is this:

Jun 15 00:19:32 eq6 smartd[2555]: Device: /dev/twa0 [3ware_disk_00],
SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 110 to 111
Jun 15 00:19:32 eq6 smartd[2555]: Device: /dev/twa0 [3ware_disk_00],
SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 34 to 35
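
To see whether these attribute changes line up with the nightly IO
window, the smartd messages can be pulled out of the log and compared
over several days. The following is only a rough sketch: the regular
expression is written against the two lines quoted above, the function
name parse_smart_changes is hypothetical, and other smartd message
variants may need a different pattern.

```python
import re

# Pattern derived from the two log lines quoted above; \s+ after the
# comma tolerates the line wrap introduced by the mail client.
SMART_CHANGE = re.compile(
    r"Device: (?P<device>\S+) \[(?P<name>[^\]]+)\],\s+"
    r"SMART (?P<kind>Prefailure|Usage) Attribute: "
    r"(?P<id>\d+) (?P<attr>\S+) changed from (?P<old>\d+) to (?P<new>\d+)"
)

def parse_smart_changes(log_text):
    """Return one dict per SMART attribute change found in log_text."""
    changes = []
    for m in SMART_CHANGE.finditer(log_text):
        changes.append({
            "device": m.group("device"),
            "kind": m.group("kind"),
            "id": int(m.group("id")),
            "attr": m.group("attr"),
            "old": int(m.group("old")),
            "new": int(m.group("new")),
        })
    return changes
```

Feeding it a few nights of syslog would show whether the counters only
move while rsync is running or also drift during quiet hours.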

Googling suggests this is nothing to worry about unless SMART reports a
failure. Neither the RAID controller's built-in error reporting nor
smartd says anything about broken disks.

Everything is fine after a reboot, until the next midnight approaches.

Losing hair here, any hints?


Cristian Rodríguez

Dave Howorth

participants (2)