[opensuse-kernel] Leap 4.4.92: Story of kernel warnings that ended in a total loss of data
Hi all,
Some time ago (after 4.4.92 hit the Leap update repository) I filed the following bug https://bugzilla.opensuse.org/show_bug.cgi?id=1064533 because of warnings and kernel traces about a raid10.
Yesterday, after checking every disk with a long SMART test, and as none of them reported an error, I re-added the previously failing hard drive.
There was one suspicious thing: at half an hour of the estimated time, the resynchronisation suddenly showed as OK.
I rebooted the server at that point, and then started to cry (just a bit): the filesystems stored in LVs or VM images were all destroyed (fsck.ext4 couldn't fix them automatically, and even when forced, attempts to write to them failed).
I've added all the information I can for the moment to the bug mentioned above. But I guess a number of things are still missing. Don't hesitate to comment here, or directly in the bug, and ask for more information.
The system should remain in its current state for the next 10 days.
Thanks for your help and advice.
-- Bruno Friedmann Ioda-Net Sàrl www.ioda-net.ch Bareos Partner, openSUSE Member, fsfe fellowship GPG KEY : D5C9B751C4653227 irc: tigerfoot
-- To unsubscribe, e-mail: opensuse-kernel+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse-kernel+owner@opensuse.org
On 16 November 2017 at 17:33, Bruno Friedmann <bruno@ioda-net.ch> wrote:
Hi all,
Some time ago (after 4.4.92 hit the Leap update repository) I filed the following bug https://bugzilla.opensuse.org/show_bug.cgi?id=1064533 because of warnings and kernel traces about a raid10.
Yesterday, after checking every disk with a long SMART test, and as none of them reported an error, I re-added the previously failing hard drive.
There was one suspicious thing: at half an hour of the estimated time, the resynchronisation suddenly showed as OK.
I rebooted the server at that point, and then started to cry (just a bit): the filesystems stored in LVs or VM images were all destroyed (fsck.ext4 couldn't fix them automatically, and even when forced, attempts to write to them failed).
I've added all the information I can for the moment to the bug mentioned above. But I guess a number of things are still missing. Don't hesitate to comment here, or directly in the bug, and ask for more information.
The system should remain in its current state for the next 10 days.
Thanks for your help and advice. --
Bruno Friedmann Ioda-Net Sàrl www.ioda-net.ch Bareos Partner, openSUSE Member, fsfe fellowship GPG KEY : D5C9B751C4653227 irc: tigerfoot
Attempt 2, without HTML so the mailing list gets a copy...
To Bruno Friedmann:
Something you may want to check in your server is the SATA cables. I originally signed up to the mailing list because of problems with a SATA card that appeared only under Linux (it worked under Windows...), with a drive dropping out of a RAID 6 array on my home server. I changed the offending drive, changed the SATA card... I never found the error. Finally, in April this year, while doing a drive swap, I determined that I had two (!!) bad SATA cables in my home server... Both were replaced (together with the drives, as two had failed by then) and everything has been running smoothly since. One giveaway was a high read error rate, but I never associated it with a cable problem... Of course the drives would pass all SMART tests without issues.
Incidentally, at the time neither Unix Stackexchange, nor the openSUSE forum, nor the kernel mailing list could identify the source of the issue.
I hope you find a solution to your problem (with as little data loss as possible).
Best of luck!
Detlev
(this was the original question on Unix Stackexchange at the time: https://unix.stackexchange.com/questions/244419/marvell-88se9128-9123-sata-c... )
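For anyone who wants to check for the cable symptom described above: link-level problems usually show up in the SMART attribute table (as printed by `smartctl -A`) rather than in the self-test result, most commonly as a non-zero raw value for UDMA_CRC_Error_Count. Below is a minimal sketch in Python that parses such a table. The sample text and all values in it are invented for illustration, and the helper names `parse_smart_attributes` and `cable_suspect` are hypothetical; the "any non-zero CRC count is suspicious" rule is a simple assumption, not a vendor threshold.

```python
import re

# Hypothetical excerpt in the layout of `smartctl -A /dev/sdX`; values are invented.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       148579456
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       22500
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       12
"""


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} for every attribute row in the table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            match = re.match(r"\d+", fields[9])  # leading integer of RAW_VALUE
            if match:
                attrs[fields[1]] = int(match.group())
    return attrs


def cable_suspect(attrs):
    """Assumption: any non-zero CRC error count is worth a look at the cabling."""
    return attrs.get("UDMA_CRC_Error_Count", 0) > 0


if __name__ == "__main__":
    attrs = parse_smart_attributes(SAMPLE)
    print(attrs)
    print("check cables:", cable_suspect(attrs))
```

On a real system the text would come from running smartctl against each member disk; comparing the counter before and after a resync is more telling than a single reading, since the raw value is cumulative.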
On 16/11/2017 at 18:02, Detlev Conrad Mielczarek wrote:
Something you may want to check in your server is the SATA cables.
When something goes wrong, first check the cables... always true, but to be honest I may also have missed this one :-( Thanks, jdd -- http://dodin.org
On Thursday, 16 November 2017 18:02:56 CET, Detlev Conrad Mielczarek wrote:
On 16 November 2017 at 17:33, Bruno Friedmann <bruno@ioda-net.ch> wrote:
Hi all,
Some time ago (after 4.4.92 hit the Leap update repository) I filed the following bug https://bugzilla.opensuse.org/show_bug.cgi?id=1064533 because of warnings and kernel traces about a raid10.
Yesterday, after checking every disk with a long SMART test, and as none of them reported an error, I re-added the previously failing hard drive.
There was one suspicious thing: at half an hour of the estimated time, the resynchronisation suddenly showed as OK.
I rebooted the server at that point, and then started to cry (just a bit): the filesystems stored in LVs or VM images were all destroyed (fsck.ext4 couldn't fix them automatically, and even when forced, attempts to write to them failed).
I've added all the information I can for the moment to the bug mentioned above. But I guess a number of things are still missing. Don't hesitate to comment here, or directly in the bug, and ask for more information.
The system should remain in its current state for the next 10 days.
Thanks for your help and advice. --
Bruno Friedmann
Ioda-Net Sàrl www.ioda-net.ch Bareos Partner, openSUSE Member, fsfe fellowship GPG KEY : D5C9B751C4653227 irc: tigerfoot
Attempt 2, without HTML so the mailing list gets a copy...
Yeah, and you can send only to the ML ;-)
To Bruno Friedmann:
Something you may want to check in your server is the SATA cables. I originally signed up to the mailing list because of problems with a SATA card that appeared only under Linux (it worked under Windows...), with a drive dropping out of a RAID 6 array on my home server. I changed the offending drive, changed the SATA card... I never found the error. Finally, in April this year, while doing a drive swap, I determined that I had two (!!) bad SATA cables in my home server... Both were replaced (together with the drives, as two had failed by then) and everything has been running smoothly since. One giveaway was a high read error rate, but I never associated it with a cable problem... Of course the drives would pass all SMART tests without issues.
The hardware is monitored quite seriously, and checked regularly with smartd. The cables in use are high-end, and the same ones are used on other mirrors without any glitches. The failing drive was not always the same one, and 10 days ago we replaced one. The drives have ~22500 hours of operation (about 2.5 years), which is not that much either. We have the same hard drive model in other machines (most of them behind a natsemi adaptec controller) and we don't see any particular failure pattern on this model. We absolutely don't rule out a hardware problem; I just want to find the root cause, and above all understand why the kernel was trapped by faulty hardware when it should have just said no.
Incidentally, at the time neither Unix Stackexchange, nor the openSUSE forum or the kernel mailing list could identify the source of the issue.
I hope you find a solution to your problem (with as little data loss as possible).
We have serious backups (Bareos) that are always ready to restore, so the loss was not that bad. Fortunately, the data that actually changes (the database) was on another mirror (of SSDs). I just got to bed very early this morning :-))
Best of luck!
Detlev
(this was the old original question on Unix Stackexchange at the time: https://unix.stackexchange.com/questions/244419/marvell-88se9128-9123-sata-card-weird-behaviour-opensuse )
-- Bruno Friedmann Ioda-Net Sàrl www.ioda-net.ch Bareos Partner, openSUSE Member, fsfe fellowship GPG KEY : D5C9B751C4653227 irc: tigerfoot
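Since the thread turns on a resynchronisation that reported OK earlier than expected, it may help to record the array state from /proc/mdstat before and after re-adding a drive. The following is a rough sketch in Python, assuming the usual /proc/mdstat layout (a member-status marker such as `[4/3] [UUU_]` and a `recovery = N%` or `resync = N%` progress line). The sample text is invented, and `array_status` is a hypothetical helper name, not part of mdadm.

```python
import re

# Hypothetical /proc/mdstat content for a degraded raid10 during recovery;
# device names and numbers are invented.
SAMPLE_MDSTAT = """\
Personalities : [raid10]
md0 : active raid10 sdd1[4] sdc1[2] sdb1[1] sda1[0]
      1953260544 blocks super 1.2 512K chunks 2 near-copies [4/3] [UUU_]
      [==>..................]  recovery = 12.6% (123456789/976630272) finish=127.5min speed=111111K/sec

unused devices: <none>
"""


def array_status(text):
    """Return {array: {"degraded": bool, "resync": percent or None}}."""
    status = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            status[current] = {"degraded": False, "resync": None}
            continue
        if current is None:
            continue
        # "[4/3] [UUU_]": an underscore marks a missing or rebuilding member.
        members = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if members:
            status[current]["degraded"] = "_" in members.group(3)
        # Progress line shown while a rebuild or resync is running.
        progress = re.search(r"(recovery|resync)\s*=\s*([\d.]+)%", line)
        if progress:
            status[current]["resync"] = float(progress.group(2))
    return status


if __name__ == "__main__":
    print(array_status(SAMPLE_MDSTAT))
```

Logging this at intervals (reading the real /proc/mdstat instead of the sample) would show whether the progress percentage jumped to completion or the progress line simply disappeared, which is the kind of detail that could be attached to the bug report.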
participants (3)
- Bruno Friedmann
- Detlev Conrad Mielczarek
- jdd@dodin.org