Bug ID | 1175105 |
---|---|
Summary | mdadm doesn't accept partition for --add because of "dirt" in that partition under a fresh GPT table |
Classification | openSUSE |
Product | openSUSE Distribution |
Version | Leap 15.1 |
Hardware | Other |
OS | Other |
Status | NEW |
Severity | Normal |
Priority | P5 - None |
Component | Kernel |
Assignee | kernel-bugs@opensuse.org |
Reporter | ralf@czekalla.com |
QA Contact | qa-bugs@suse.de |
Found By | --- |
Blocker | --- |
I'm about to migrate an important central private Nextcloud server to 15.2. I use RAID1 for all my partitions and benefit from easy cloning with the help of mdadm: I grow the number of raid-devices and add a refreshed SSD carrying a new GPT partition table and cloned partition descriptions. (This is not the first installation I have transformed this way; I have been through the process several times already.) It worked for all of the six partitions but one.

The faulty one showed weird error messages I couldn't find a good reason for.

On the console:

```
# mdadm --verbose /dev/md2 --add /dev/sdc5
mdadm: add new device failed for /dev/sdc5 as 4: Invalid argument

# mdadm -E /dev/sdc5
/dev/sdc5:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x9
     Array UUID : 94bbbcd3:ad4d1b0b:dcd4d548:1af16050
           Name : any:2
  Creation Time : Fri Feb 2 20:09:03 2018
     Raid Level : raid1
   Raid Devices : 3

 Avail Dev Size : 83892192 sectors (40.00 GiB 42.95 GB)
     Array Size : 41945984 KiB (40.00 GiB 42.95 GB)
  Used Dev Size : 83891968 sectors (40.00 GiB 42.95 GB)
   Super Offset : 83892208 sectors
   Unused Space : before=0 sectors, after=224 sectors
          State : active
    Device UUID : c9e17312:0069d580:42782958:8817e0f7

Internal Bitmap : -16 sectors from superblock
    Update Time : Mon Aug 10 18:09:21 2020
  Bad Block Log : 512 entries available at offset -8 sectors - bad blocks present.
       Checksum : 7b5d72a - correct
         Events : 0

    Device Role : spare
    Array State : AA. ('A' == active, '.' == missing, 'R' == replacing)
```

At first I was misled by "bad blocks present", but all devices checked out perfectly healthy in SMART, and extended device scans didn't show any problems either. It took me several hours to work through all this.

dmesg:

```
md: sdc5 does not have a valid v1.0 superblock, not importing!
md: md_import_device returned -22
```

Of course I also tried --zero-superblock on the partition. No change in behavior, and the dmesg error above remained.

In the end I found a surprising hint on serverfault.com (https://serverfault.com/questions/696392/how-to-add-disk-back-to-raid-and-replace-removed) from Oct. 2015, describing similar behavior, which suggested cleaning up the partition first by writing zeros with dd (dd if=/dev/zero of=/dev/sdc5 status=progress) and then trying again to add the seemingly faulty partition. Long story short: this really worked.

Somehow mdadm - and md behind it - chokes on "dirt" inside the partition's blocks, i.e. old content left over from before the disk/SSD was given a fresh GPT partition table. (With SSDs you naturally try to avoid unnecessary writes, so you don't wipe every block of a disk first.)

I'm using md superblock format 1.0 here (in contrast to the serverfault.com case), where the RAID/md setup data is stored at the end of the partition. I had to wipe exactly this area at the end of the partition before it could be added successfully. Wiping the beginning of the partition (as mentioned in the serverfault.com case, where format 1.2 stores the md setup data) had, of course, no effect.

I think after 5 years this might need a clean-up in mdadm or in the md device management. At the very least, the mdadm error message should suggest cleaning the partition first instead of leaving only the cryptic message in dmesg. Also, --zero-superblock should wipe out the dirt beneath the superblock and not, as it seems, let some stale data leak through here.

Thanks
Ralf
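
PS: For reference, the cloning workflow described above looks roughly like this. This is only a sketch with example device names; the sgdisk step is my assumption for how the GPT layout and partition descriptions get cloned, so adapt it to your own setup:

```
# Copy the GPT layout from the source disk to the refreshed SSD and
# randomize the disk/partition GUIDs so both disks can coexist
# (sgdisk usage is an assumption; device names are examples):
sgdisk -R /dev/sdc /dev/sda
sgdisk -G /dev/sdc

# Grow the RAID1 to three members and add the cloned partition:
mdadm --grow /dev/md2 --raid-devices=3
mdadm --verbose /dev/md2 --add /dev/sdc5
```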
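
Since wiping a whole SSD partition with dd is exactly the kind of unnecessary write I wanted to avoid, a targeted wipe of just the tail of the partition should be enough for the format 1.0 case. A minimal sketch, assuming the last 16 MiB safely cover the superblock, internal bitmap, bad block log and whatever stale data sits behind them:

```
# Zero only the end of the partition, where the v1.0 superblock lives
# (the 16 MiB margin is an assumption, not a value from mdadm):
PART=/dev/sdc5
SECTORS=$(blockdev --getsz "$PART")   # partition size in 512-byte sectors
TAIL=$(( 16 * 1024 * 2 ))             # 16 MiB expressed as 512-byte sectors
dd if=/dev/zero of="$PART" bs=512 \
   seek=$(( SECTORS - TAIL )) count="$TAIL" conv=fsync status=progress
```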
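
It might also be worth a dry run with wipefs from util-linux to see which stale signatures are still visible on the partition. Since --zero-superblock didn't help here, I wouldn't expect wipefs --all to behave differently, but the listing can at least show what is lingering:

```
# List the signatures libblkid still finds, without writing anything:
wipefs --no-act /dev/sdc5

# Erase all detected signatures (probably no better than
# --zero-superblock in this particular case, but harmless to try):
wipefs --all /dev/sdc5
```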