[kernel-bugs] [Bug 1175105] New: mdadm doesn't accept partition for --add because of "dirt" in that partition of a fresh gpt table
http://bugzilla.opensuse.org/show_bug.cgi?id=1175105

            Bug ID: 1175105
           Summary: mdadm doesn't accept partition for --add because of
                    "dirt" in that partition of a fresh gpt table
    Classification: openSUSE
           Product: openSUSE Distribution
           Version: Leap 15.1
          Hardware: Other
                OS: Other
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: Kernel
          Assignee: kernel-bugs@opensuse.org
          Reporter: ralf@czekalla.com
        QA Contact: qa-bugs@suse.de
          Found By: ---
           Blocker: ---

I'm about to migrate an important central private Nextcloud server to 15.2. I
use RAID1 for all my partitions and benefit from easy cloning with the help of
mdadm: grow the number of raid-devices, then add a refreshed SSD that has a
new GPT partition table and a cloned partition layout. This worked for five of
the six partitions, but one failed. (This is not the first installation I have
migrated this way; I have been through this process several times.)

The faulty one showed weird error messages I couldn't find a good reason for.

On the console:

# mdadm --verbose /dev/md2 --add /dev/sdc5
mdadm: add new device failed for /dev/sdc5 as 4: Invalid argument

# mdadm -E /dev/sdc5
/dev/sdc5:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x9
     Array UUID : 94bbbcd3:ad4d1b0b:dcd4d548:1af16050
           Name : any:2
  Creation Time : Fri Feb 2 20:09:03 2018
     Raid Level : raid1
   Raid Devices : 3

 Avail Dev Size : 83892192 sectors (40.00 GiB 42.95 GB)
     Array Size : 41945984 KiB (40.00 GiB 42.95 GB)
  Used Dev Size : 83891968 sectors (40.00 GiB 42.95 GB)
   Super Offset : 83892208 sectors
   Unused Space : before=0 sectors, after=224 sectors
          State : active
    Device UUID : c9e17312:0069d580:42782958:8817e0f7

Internal Bitmap : -16 sectors from superblock
    Update Time : Mon Aug 10 18:09:21 2020
  Bad Block Log : 512 entries available at offset -8 sectors - bad blocks present.
       Checksum : 7b5d72a - correct
         Events : 0

    Device Role : spare
    Array State : AA. ('A' == active, '.' == missing, 'R' == replacing)

First I was blinded by the "bad blocks present", but all devices checked out
perfectly healthy in SMART, and extended device scans didn't show any problems
either. It took me several hours to work through all this.

dmesg:

md: sdc5 does not have a valid v1.0 superblock, not importing!
md: md_import_device returned -22

Of course I also tried to --zero-superblock the partition. That changed
nothing; the above error message still appeared in dmesg.

In the end I found a surprising hint on serverfault.com
(https://serverfault.com/questions/696392/how-to-add-disk-back-to-raid-and-re...)
from Oct. 2015, describing similar behavior, which suggested cleaning up the
partition first by writing zeros with dd (dd if=/dev/zero of=/dev/sdc5
status=progress) and then trying again to add the seemingly faulty partition.
Long story short: this really worked.

Somehow mdadm - and md behind it - chokes on some dirt inside the partition's
blocks, left over from before the disk/SSD was given a fresh GPT partition
table. (Of course, with SSDs you try to avoid unnecessary writes and don't
wipe every block of a disk first.)

I'm using metadata version 1.0 here (in contrast to the serverfault.com case),
where the RAID/md superblock is stored at the end of the partition. I had to
wipe exactly this area at the end of the partition before it could be added
successfully. As expected, wiping the beginning of the partition (as mentioned
in the serverfault.com case, where metadata version 1.2 stores its data) had
no effect.

I think after 5 years this might need a clean-up in mdadm or in md device
management. At the very least, the error message should suggest cleaning the
partition first instead of the cryptic dmesg message above. Also,
--zero-superblock should wipe out the dirt beneath the superblock instead of
seemingly letting some of it leak through.

Thanks
Ralf

-- 
You are receiving this mail because: You are the assignee for the bug.
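The targeted tail wipe described above can be sketched in shell. This is a
hedged illustration, not a command sequence to run on real disks: it uses a
throwaway image file instead of /dev/sdc5, the partition size dsize is an
assumed value chosen to be consistent with the "Super Offset" shown in the
mdadm -E output, and the offset formula is, to my reading of mdadm's
super1.c, the one used for metadata 1.0 (superblock about 8 KiB before the
end of the device, 4 KiB aligned):

```shell
# Sketch only: work on a throwaway image file, never a live partition.
img=$(mktemp)

# Assumed partition size in 512-byte sectors; chosen so the computed offset
# matches "Super Offset : 83892208 sectors" in the mdadm -E output above.
dsize=83892224

# Metadata 1.0 superblock offset: 16 sectors (8 KiB) before the end of the
# device, rounded down to a 4 KiB (8-sector) boundary.
sb_offset=$(( (dsize - 16) & ~7 ))
echo "superblock at sector $sb_offset"   # 83892208

# Zero only the last 1 MiB (2048 sectors) instead of all 40 GiB - enough to
# cover the superblock, internal bitmap and bad-block log at the tail.
truncate -s $(( dsize * 512 )) "$img"
dd if=/dev/zero of="$img" bs=512 seek=$(( dsize - 2048 )) count=2048 \
   conv=notrunc status=none
rm -f "$img"
```

On a real system the same dd line pointed at the partition would wipe just
the metadata tail instead of every block of the SSD; it is destructive, so
the device name needs to be triple-checked first.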
http://bugzilla.opensuse.org/show_bug.cgi?id=1175105#c1
--- Comment #1 from Ralf Czekalla
http://bugzilla.opensuse.org/show_bug.cgi?id=1175105#c5
--- Comment #5 from Ralf Czekalla
http://bugzilla.opensuse.org/show_bug.cgi?id=1175105#c6
--- Comment #6 from Coly Li
> Seriously guys?
>
> => Misleading error messages by an important tool - we are talking about
> mdadm here, right? First issue to be fixed! At the very least mdadm should
> print an appropriate error message and point out that the problem can be
> fixed with wipefs.
Adding a proper error message might be a practical fix that could be
acceptable to the upstream maintainer. Could you please provide a suggested
error message, to avoid misunderstanding in this case?
> => wipefs should be called automatically by mdadm here! Second necessary
> fix: if mdadm recognizes that it cannot cope with the leftover bits and
> bytes of dirt during md creation, it should run wipefs on the device it is
> supposed to use!
This is hard and complicated. Making mdadm correctly recognize the error
condition is really not easy, and erasing the metadata automatically is risky
and won't get full support from all developers. I need your help: please give
your opinion and suggestions on how to present the error information, and a
preferred message content (at least the information you think would help),
then I can try to compose a patch. Thanks in advance.

Coly Li
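One concrete way to make such a message actionable might be to detect stale
signature bytes explicitly before giving up. The md superblock magic is
a92b4efc, stored little-endian on disk, so leftover metadata shows up as the
byte sequence fc 4e 2b a9 at the superblock offset. A hypothetical sketch of
such a check, demonstrated on a throwaway file rather than a real device:

```shell
img=$(mktemp)

# Simulate stale metadata: the md superblock magic 0xa92b4efc is stored
# little-endian, i.e. as the bytes fc 4e 2b a9 on disk.
printf '\xfc\x4e\x2b\xa9' > "$img"

# Dump the first 4 bytes; on a real partition one would instead seek to the
# v1.0 superblock offset near the end of the device before reading.
found=$(od -An -tx1 -N4 "$img" | tr -d ' ')
echo "$found"   # fc4e2ba9 -> a stale md superblock magic is present
rm -f "$img"
```

A check along these lines could let mdadm (or an admin) print "stale RAID
superblock found at the metadata offset; consider wipefs or a targeted dd"
instead of the bare "Invalid argument".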
http://bugzilla.opensuse.org/show_bug.cgi?id=1175105#c7
--- Comment #7 from Ralf Czekalla
participants (1): bugzilla_noreply@suse.com