https://bugzilla.novell.com/show_bug.cgi?id=461277
User teheo@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=461277#c32
Tejun Heo
Thanks Tejun. What I'm saying is, if there's a safe way for me to verify this without having to purchase more disks to copy 3TB of data onto, then I'm happy to help. So I'm asking for advice; I have no idea at this level.
The safest way to test that I can think of is reserving small parts of each disk (several gigabytes should do, I think), building a separate array out of them, and mixing a random read workload on the original array (but do NOT mount it, just access the raw md device read-only) with a heavy write workload on the separate test array.
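For what it's worth, here is a minimal sketch of that workload mix in Python. It is only an illustration of the idea, not a tested tool: the device names /dev/md0 (original array) and /dev/md1 (small test array), the block size, and the test-array size are all assumptions you would adjust to your setup, and it must be run as root.

    # Sketch of the suggested test: random reads from the original md device
    # (opened read-only, never mounted) mixed with a heavy, fsync-backed write
    # load on a small throwaway array built from spare slices of each disk.
    # Device names and sizes below are assumptions, not taken from this bug.
    import os
    import random

    ORIG_ARRAY = "/dev/md0"      # existing array: read-only, do NOT mount
    TEST_ARRAY = "/dev/md1"      # small test array built from reserved space
    BLOCK = 1024 * 1024          # 1 MiB per I/O
    TEST_SIZE = 2 * 1024 ** 3    # assume ~2 GiB usable on the test array

    def run(iterations=10000):
        rd = os.open(ORIG_ARRAY, os.O_RDONLY)
        wr = os.open(TEST_ARRAY, os.O_WRONLY)
        orig_size = os.lseek(rd, 0, os.SEEK_END)
        buf = os.urandom(BLOCK)
        try:
            for i in range(iterations):
                # random read somewhere on the original array
                off = random.randrange(0, orig_size - BLOCK, BLOCK)
                os.lseek(rd, off, os.SEEK_SET)
                os.read(rd, BLOCK)
                # heavy writes on the test array, synced periodically so the
                # disks actually see the load (and the power draw it causes)
                os.lseek(wr, (i * BLOCK) % (TEST_SIZE - BLOCK), os.SEEK_SET)
                os.write(wr, buf)
                if i % 64 == 0:
                    os.fsync(wr)
        finally:
            os.close(rd)
            os.close(wr)

    if __name__ == "__main__":
        run()

If the problem is power-related, the expectation would be that drives start dropping out under this load (watch the kernel log) while the original array is never written to.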
I agree with what you're saying about power, but I can still say the problem did happen in 11.1 GM and nothing else. The software change you're talking about sounds likely but I really have no clue.
Power-related issues can get very nondeterministic. The fact that I got three almost identical reports recently gives me a pretty strong hint, but yes, until each case is actually verified, it's at best only a likely theory.
I have purchased a better case in terms of cooling because I know my drives were running quite hot. I also went back to ext3 because it seems I don't understand how to recover from failures on XFS properly, and this problem seemed to be filesystem corruption.
ext3 doesn't enable barriers by default. Dunno what Sabayon does, but that could easily be the difference. Also, after this kind of failure, RAID or not, recovery becomes very difficult. Losing the data in the write buffer is actually okay; the problem is that after the disk comes back, the filesystem merrily goes ahead thinking what it wrote before is still on the disk. If the data at the right places is lost (mixing old and new journals, for example), it will easily render the whole filesystem inaccessible, and fsck can easily worsen the situation. I had a similar failure while hot-plugging a hard drive, and I had to resort to a hex editor and grep. Fun that was. :-(
So I guess you're saying this is to do with RAID and what FS I'm running wouldn't matter?
Well, to be safe with data, barrier support is necessary, and the md code recently added it, which is all good and dandy, so your filesystem on the RAID array can now guarantee consistency when power is suddenly lost. However, the improvement changed the power usage pattern on your hardware and exposed an existing deficiency. XFS enables barriers by default. ext3 doesn't, but SUSE does. AFAIK, most other distros don't. So everything seems to add up.
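If it helps to see which of your mounts advertise a barrier option, here is a small sketch in Python that just scans /proc/mounts. One assumption to be aware of: an option that isn't listed there is not proof that barriers are off, since a distro default may not be reported as an explicit mount option.

    # Sketch: list ext3/xfs mounts and whether a barrier-related option is
    # visible in /proc/mounts.  Assumption: an absent option does not
    # necessarily mean barriers are disabled (defaults may not be shown).
    def barrier_report(mounts_path="/proc/mounts"):
        with open(mounts_path) as f:
            for line in f:
                device, mountpoint, fstype, options = line.split()[:4]
                if fstype not in ("ext3", "xfs"):
                    continue
                opts = options.split(",")
                barrier = next(
                    (o for o in opts if o.startswith("barrier") or o == "nobarrier"),
                    "not shown",
                )
                print("%-20s %-5s barrier option: %s" % (mountpoint, fstype, barrier))

    if __name__ == "__main__":
        barrier_report()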
I agree with Kenn above; it happens when writing.
I could try replicating this with some other-sized drives, taking these ones out, if that would help? Its potential impact is quite serious, so you're right in wanting to give it priority.
The best way would be to keep the known failing configuration and try to verify the problem first, so that the number of parameters is controlled. But, yeah, it'll be a hassle to test when you have a lot of data to carry around. I'm afraid I can't think of an easy way to verify. Thanks.