https://bugzilla.novell.com/show_bug.cgi?id=461277
User teheo@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=461277#c32
Tejun Heo
Thanks Tejun. What I'm saying is, if there's a safe way for me to verify this without having to purchase more disks to copy 3TB of data onto, then I'm happy to help. So I'm asking for advice; I have no idea at this level.
The safest way to test that I can think of is reserving small parts of each disk (several gigabytes should do, I think), building a separate array out of them, and mixing a random read workload on the original array (but do NOT mount it, just access the raw md device read-only) with a heavy write workload on the separate test array.
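For what it's worth, here is a minimal sketch of that workload mix in Python. It is only an illustration of the idea, not a tested tool: the device names /dev/md0 (original array) and /dev/md1 (small test array), the block size, and the test-array size are all assumptions you would adjust to your setup, and it must be run as root.

    # Sketch of the suggested test: random reads from the original md device
    # (opened read-only, never mounted) mixed with a heavy, fsync-backed write
    # load on a small throwaway array built from spare slices of each disk.
    # Device names and sizes below are assumptions, not taken from this bug.
    import os
    import random

    ORIG_ARRAY = "/dev/md0"      # existing array: read-only, do NOT mount
    TEST_ARRAY = "/dev/md1"      # small test array built from reserved space
    BLOCK = 1024 * 1024          # 1 MiB per I/O
    TEST_SIZE = 2 * 1024 ** 3    # assume ~2 GiB usable on the test array

    def run(iterations=10000):
        rd = os.open(ORIG_ARRAY, os.O_RDONLY)
        wr = os.open(TEST_ARRAY, os.O_WRONLY)
        orig_size = os.lseek(rd, 0, os.SEEK_END)
        buf = os.urandom(BLOCK)
        try:
            for i in range(iterations):
                # random read somewhere on the original array
                off = random.randrange(0, orig_size - BLOCK, BLOCK)
                os.lseek(rd, off, os.SEEK_SET)
                os.read(rd, BLOCK)
                # heavy writes on the test array, synced periodically so the
                # disks actually see the load (and the power draw it causes)
                os.lseek(wr, (i * BLOCK) % (TEST_SIZE - BLOCK), os.SEEK_SET)
                os.write(wr, buf)
                if i % 64 == 0:
                    os.fsync(wr)
        finally:
            os.close(rd)
            os.close(wr)

    if __name__ == "__main__":
        run()

If the problem is power-related, the expectation would be that drives start dropping out under this load (watch the kernel log) while the original array is never written to.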
I agree with what you're saying about power, but I can still say the problem did happen in 11.1 GM and nothing else. The software change you're talking about sounds likely but I really have no clue.
Power-related issues can get very nondeterministic. The fact that I got three almost identical reports recently gives me a pretty strong hint, but yes, until each case is actually verified, it's at best only a likely theory.
I have purchased a better case in terms of cooling because I know my drives were running quite hot. I also went back to ext3 because it seems I don't understand how to recover from failures on XFS properly, and this problem seemed to be filesystem corruption.
ext3 doesn't enable barriers by default. Dunno what Sabayon does, but that could easily be the difference. Also, after this kind of failure, RAID or not, recovery becomes very difficult. Losing the data in the write buffer is actually okay; the problem is that after the disk comes back, the filesystem merrily goes ahead thinking what it wrote before is still on the disk. If the data at the right places is lost (mixing old and new journals, for example), it will easily render the whole filesystem inaccessible, and fsck can easily worsen the situation. I had a similar failure while hot-plugging a hard drive, and I had to resort to a hex editor and grep. Fun that was. :-(
So I guess you're saying this is to do with RAID and what FS I'm running wouldn't matter?
Well, to be safe with data, barrier support is necessary, and the md code recently added it, which is all good and dandy, so your filesystem on the RAID array can now guarantee consistency when power is suddenly lost. However, the improvement changed the power usage pattern on your hardware and exposed an existing deficiency. XFS enables barriers by default. ext3 doesn't, but SUSE does. AFAIK, most other distros don't. So everything seems to add up.
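If it helps to see which of your mounts advertise a barrier option, here is a small sketch in Python that just scans /proc/mounts. One assumption to be aware of: an option that isn't listed there is not proof that barriers are off, since a distro default may not be reported as an explicit mount option.

    # Sketch: list ext3/xfs mounts and whether a barrier-related option is
    # visible in /proc/mounts.  Assumption: an absent option does not
    # necessarily mean barriers are disabled (defaults may not be shown).
    def barrier_report(mounts_path="/proc/mounts"):
        with open(mounts_path) as f:
            for line in f:
                device, mountpoint, fstype, options = line.split()[:4]
                if fstype not in ("ext3", "xfs"):
                    continue
                opts = options.split(",")
                barrier = next(
                    (o for o in opts if o.startswith("barrier") or o == "nobarrier"),
                    "not shown",
                )
                print("%-20s %-5s barrier option: %s" % (mountpoint, fstype, barrier))

    if __name__ == "__main__":
        barrier_report()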
I agree with Kenn above; it happens when writing.
I could try replicating this with some other-sized drives, taking these ones out, if that would help? Its potential impact is quite serious, so you're right in wanting to give it priority.
The best way would be to keep the known failing configuration and try to verify the problem first, so that the number of parameters is controlled. But, yeah, it'll be a hassle to test when you have a lot of data to carry around. I'm afraid I can't think of an easy way to verify. Thanks.