[Bug 461277] New: Upgrade to 11.1 GM broke RAID5 XFS
https://bugzilla.novell.com/show_bug.cgi?id=461277

          Summary: Upgrade to 11.1 GM broke RAID5 XFS
          Product: openSUSE 11.1
          Version: Final
         Platform: x86-64
       OS/Version: openSUSE 11.1
           Status: NEW
         Severity: Major
         Priority: P5 - None
        Component: Installation
       AssignedTo: bnc-team-screening@forge.provo.novell.com
       ReportedBy: quentin.jackson@exclamation.co.nz
        QAContact: jsrain@novell.com
         Found By: Customer

I have no idea how yet, but upgrading degraded the array and it didn't even mount degraded. The array is comprised of 4 x 1TB disks: /dev/sdb, c, d and e.

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=461277

[Comments #1 to #31 (posted by Quentin Jackson, Stephan Binner, Kenn de Mello, Alexander Orlovskyy, Christoph Pfeiler, Michal Marek, Neil Brown, Jeff Mahoney and Tejun Heo) are not preserved in this archive extract; only their headers survived.]
https://bugzilla.novell.com/show_bug.cgi?id=461277
User teheo@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=461277#c32
--- Comment #32 from Tejun Heo
> Thanks Tejun, what I'm saying is, if there's a safe way of me verifying this without having to purchase more disks to copy 3TB of data onto, then I'm happy to help. So I'm asking for advice; I have no idea at this level.
The safest way to test I can think of is reserving small parts of each disk (several gigabytes should do, I think), building a separate array out of them, and mixing a random read workload on the original array (but do NOT mount it, just access the raw md device read-only) with a heavy write workload on the separate test array.
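A dry-run sketch of that test setup might look like the following. Every device name is a hypothetical placeholder (/dev/sdb5 etc. assume a small spare partition has been carved off the end of each member disk), and the script only prints the commands so nothing is executed by accident:

```shell
#!/bin/sh
# Dry-run sketch only: all device names below are placeholders.
# Assumes a small (~5 GB) spare partition exists on each member disk.
PARTS="/dev/sdb5 /dev/sdc5 /dev/sdd5 /dev/sde5"

plan() {
    # Build a throwaway RAID5 out of the small partitions:
    echo "mdadm --create /dev/md1 --level=5 --raid-devices=4 $PARTS"
    # Random read workload on the original array: raw md device,
    # read-only, never mounted:
    echo "dd if=/dev/md0 of=/dev/null bs=1M count=64 iflag=direct"
    # Heavy write workload on the scratch array:
    echo "mkfs.xfs /dev/md1"
    echo "mount /dev/md1 /mnt/scratch"
    echo "dd if=/dev/zero of=/mnt/scratch/big bs=1M count=4096 oflag=direct"
}

plan    # review the printed commands, then run them by hand if they match your setup
```

The idea is that the scratch array exercises the same spindles and power supply under heavy writes while the original data is only ever read, so a reproduced reset cannot corrupt it.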
> I agree with what you're saying about power, but I can still say the problem did happen in 11.1 GM and nothing else. The software change you're talking about sounds likely, but I really have no clue.
Power-related issues can get very nondeterministic. The fact that I got three almost identical reports recently gives me a pretty strong hint, but yes, until each case is actually verified it's at best only a likely theory.
> I have purchased a better case in terms of cooling because I know my drives were quite hot, plus I went back to ext3 because it seems I don't understand how to recover from failures on XFS properly, and this problem seemed to be a filesystem corruption.
ext3 doesn't enable barriers by default. Dunno what Sabayon does, but that could easily be the difference.

Also, after this kind of failure, RAID or not, recovery becomes very difficult. Lost data in the write buffer is actually okay, but the problem is that after the disk comes back, the filesystem merrily goes ahead thinking what it wrote before is still on the disk. If the data at the wrong places is lost (mixing old and new journals, for example), it will easily render the whole filesystem inaccessible, and fsck can easily worsen the situation further. I had a similar failure while hotplugging a hard drive, and I had to resort to a hex editor and grep. Fun that was. :-(
> So I guess you're saying this is to do with RAID, and what FS I'm running wouldn't matter?
Well, to be safe with data, barriers are necessary, and the md code recently added support, which is all good and dandy: your fs on the RAID array can now guarantee consistency when power is suddenly lost. However, the improvement changed the power usage pattern on your hardware and exposed an existing deficiency. XFS enables barriers by default; ext3 doesn't, but SUSE does. AFAIK, most other distros don't. So, everything seems to add up.
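For reference, the defaults Tejun describes correspond to mount options; here is a hedged /etc/fstab sketch (the device /dev/md0 and mountpoint /raid are placeholders, and the option names are as used by kernels of this era):

```
# Hypothetical /etc/fstab entries -- pick the one matching your filesystem:
/dev/md0  /raid  xfs   defaults            0 0   # XFS: barriers on by default ("nobarrier" disables them)
/dev/md0  /raid  ext3  defaults,barrier=1  0 0   # ext3: barrier=1 enables barriers explicitly
```

On distros where ext3 defaults to barriers off, adding barrier=1 is what brings it in line with the SUSE kernel behaviour described above.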
> I agree with Kenn above; it happens when writing.
> I could try replicating this with some other sized drives, taking these ones out, if that would help? Its potential is quite serious, so you're right in wanting to give it priority.
The best way would be to keep the known failing configuration and try to verify the problem first, so that the number of parameters is controlled. But, yeah, it'll be a hassle to test when you have a lot of data to carry around. I'm afraid I can't think of an easy way to verify. Thanks.
https://bugzilla.novell.com/show_bug.cgi?id=461277
User kdemello@gmail.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=461277#c33
--- Comment #33 from Kenn de Mello
> Kenn, most people don't believe me when I tell them they're likely to be experiencing power issues, but it's sort of surprising even to me how often that diagnosis turns out to be correct. So far I've had four or five separate reports of the exact same symptom, and in all cases power was the problem. There was a recent upstream one too:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=10480#c29
>
> So, given the recent barrier changes and the symptom, I think a power problem is most likely. Of course it could be something else, but a power problem is the most likely cause and is easy to rule out. So, please try to verify it.
>
> Thanks.
As a test I've been doing a similar rsync/restore from USB using a subset of the data I wanted to restore (2 directories, about 20GB each, all files between 500M and 4G in size). I set this up in a while-true loop and it's been running overnight so far (rsync, remove, rsync, remove, repeat). It probably would have failed by now.

What's different between now and the time my disks were reset is that now I have the computer plugged into a UPS, so I'm not vulnerable to fluctuations in utility power (before, everything: computer, monitor, speakers, printer, etc. was plugged into daisy-chained power strips). So signs point to power.

I'm going to let the loop run for about 24 hours, but I'm pretty confident that I won't see a problem; it's already run about long enough to restore all of my data twice.
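Kenn's loop can be sketched as a small script. Everything here is hypothetical (paths, pass count), and cp stands in for rsync so the sketch stays self-contained on systems where rsync may be absent:

```shell
#!/bin/sh
# Sketch of the stress loop described above. All paths are placeholders.
# stress_loop SRC DST PASSES: copy SRC into DST, wipe DST, repeat.
stress_loop() {
    src=$1; dst=$2; passes=$3
    i=0
    while [ "$i" -lt "$passes" ]; do
        # In the real test this would be: rsync -a "$src"/ "$dst"/
        cp -a "$src"/. "$dst"/
        rm -rf "${dst:?}"/*    # ${dst:?} guards against wiping / if dst is empty
        i=$((i + 1))
    done
    echo "completed $i passes"
}

# Self-contained demo on throwaway directories:
demo=$(mktemp -d)
mkdir "$demo/src" "$demo/dst"
dd if=/dev/zero of="$demo/src/file" bs=1024 count=64 2>/dev/null
stress_loop "$demo/src" "$demo/dst" 3    # prints: completed 3 passes
rm -rf "$demo"
```

Pointing SRC at the USB backup and DST at the RAID array, and letting it run for many passes, reproduces the write-heavy workload that originally triggered the resets.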
https://bugzilla.novell.com/show_bug.cgi?id=461277
User teheo@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=461277#c34
--- Comment #34 from Tejun Heo
> As a test I've been doing a similar rsync/restore from USB using a subset of the data I wanted to restore (2 directories, about 20GB each, all files between 500M and 4G in size). I set this up in a while-true loop and it's been running overnight so far (rsync, remove, rsync, remove, repeat). It probably would have failed by now.
Nice.
> What's different between now and the time my disks were reset is that now I have the computer plugged into a UPS, so I'm not vulnerable to fluctuations in utility power (before, everything: computer, monitor, speakers, printer, etc. was plugged into daisy-chained power strips).
I don't have any first-hand experience with power source problems (the power source quality here is pretty good), but yeah, I can imagine that. The only related experience I had was an EMI verification test done at a company developing an external RAID rig. It wasn't voltage fluctuation per se: they inserted high-frequency interference into the power line, which doesn't cause noticeable voltage fluctuation at the output side but does cause quite some amount of EMI inside the machine. The serial connection for the information display and the SATA signals were highly susceptible to it. Well, it's a gigahertz signal running along a relatively long unshielded cable, after all.

Anyways, for the brief-loss-of-power problem, in most cases it seems dependent on the specific power supply. I suspect it's about how much capacitance they have at the output side, but my electrical knowledge is severely limited, so it's just a wild guess. Rated wattage really doesn't have much bearing on the issue. I test many drives in varied configurations, and hotplugging a drive into a loaded power supply is always a good litmus test for such problems; for some unknown reason the cheap no-name 300W power supply I have is the best behaving one.

Now that hotplugging is a common thing, I'm afraid many people are experiencing such issues. It's disturbing, but I can't think of a good software way to work around it, as transmission failures (which aren't uncommon due to the high susceptibility to EMI) and the brief power loss look identical to the host. I hope some people test power supplies for things like this, but it never seems to be the focus of benchmarking or verification.
> So signs point to power. I'm going to let the loop run for about 24 hours, but I'm pretty confident that I won't see a problem; it's already run about long enough to restore all of my data twice.
Dunno about the quality of your power source, but it might be a good idea to put half of the disks on a separate power supply just for safety. It's perfectly safe to use different power sources for the disks and the host; SATA signal lines don't actually connect to each other. Thanks.
https://bugzilla.novell.com/show_bug.cgi?id=461277
User kdemello@gmail.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=461277#c35
--- Comment #35 from Kenn de Mello [comment body not preserved in this archive extract]
participants (1)
-
bugzilla_noreply@novell.com