Bug ID 1175105
Summary mdadm doesn't accept partition for --add because of "dirt" in that partition of a fresh gpt table
Classification openSUSE
Product openSUSE Distribution
Version Leap 15.1
Hardware Other
OS Other
Status NEW
Severity Normal
Priority P5 - None
Component Kernel
Assignee kernel-bugs@opensuse.org
Reporter ralf@czekalla.com
QA Contact qa-bugs@suse.de
Found By ---
Blocker ---

I'm about to migrate an important central private nextcloud server to 15.2.

I use RAID1 for all my partitions and benefit from it by making easy clones
with the help of mdadm: growing the number of raid-devices and adding a
refreshed SSD that gets a new gpt partition table and a cloned partition
layout.
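
For illustration, the workflow looks roughly like this (a sketch with
hypothetical device names; sgdisk is my assumption for the GPT copy, the
report doesn't name the exact tools):

# sgdisk -R=/dev/sdc /dev/sda            # copy the GPT layout from the old SSD to the refreshed one
# sgdisk -G /dev/sdc                     # give the copy new, random disk/partition GUIDs
# mdadm --grow /dev/md2 --raid-devices=3
# mdadm --verbose /dev/md2 --add /dev/sdc5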

This worked for five of the six partitions, but one failed. (It's not the
first installation I've migrated this way; I've already gone through this
process several times.)

This faulty one produced weird error messages I couldn't find a good reason
for, like ...

...on console:
# mdadm --verbose /dev/md2 --add /dev/sdc5
mdadm: add new device failed for /dev/sdc5 as 4: Invalid argument

# mdadm -E /dev/sdc5
/dev/sdc5:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x9
     Array UUID : 94bbbcd3:ad4d1b0b:dcd4d548:1af16050
           Name : any:2
  Creation Time : Fri Feb  2 20:09:03 2018
     Raid Level : raid1
   Raid Devices : 3

 Avail Dev Size : 83892192 sectors (40.00 GiB 42.95 GB)
     Array Size : 41945984 KiB (40.00 GiB 42.95 GB)
  Used Dev Size : 83891968 sectors (40.00 GiB 42.95 GB)
   Super Offset : 83892208 sectors
   Unused Space : before=0 sectors, after=224 sectors
          State : active
    Device UUID : c9e17312:0069d580:42782958:8817e0f7

Internal Bitmap : -16 sectors from superblock
    Update Time : Mon Aug 10 18:09:21 2020
  Bad Block Log : 512 entries available at offset -8 sectors - bad blocks present.
       Checksum : 7b5d72a - correct
         Events : 0


   Device Role : spare
   Array State : AA. ('A' == active, '.' == missing, 'R' == replacing)

At first I was misled by the "bad blocks present", but all devices checked out
perfectly healthy in SMART, and extended device scans didn't show any problems
either. It took me several hours to work through all of this.
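
The checks were along these lines (my reconstruction; the report doesn't list
the exact commands):

# smartctl -H /dev/sdc          # overall health self-assessment
# smartctl -t long /dev/sdc     # extended self-test
# smartctl -a /dev/sdc          # full attributes and self-test log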

dmesg:
md: sdc5 does not have a valid v1.0 superblock, not importing!
md: md_import_device returned -22

Of course I also tried to --zero-superblock the partition. No change in
behavior, and the above error message in dmesg remained.
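
For reference, that attempt was presumably along these lines (my
reconstruction of the commands, not a verbatim transcript):

# mdadm --zero-superblock /dev/sdc5
# mdadm --verbose /dev/md2 --add /dev/sdc5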

In the end I found a surprising hint on serverfault.com
(https://serverfault.com/questions/696392/how-to-add-disk-back-to-raid-and-replace-removed)
from Oct. 2015, describing similar behavior, which suggested first cleaning up
the partition by writing zeros with dd (dd if=/dev/zero of=/dev/sdc5
status=progress) and then trying again to add the seemingly faulty
partition.
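
Roughly (device names as above; the bs= is only added to speed things up, it
wasn't part of the original hint):

# dd if=/dev/zero of=/dev/sdc5 bs=1M status=progress
# mdadm --verbose /dev/md2 --add /dev/sdc5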

And long story short, this really worked.

Somehow mdadm - and the md driver behind it - is choking on some dirt inside
the partition's blocks, left over from old content that was there before the
disk/SSD was given a fresh gpt partition table. (Of course, with SSDs you try
to avoid unnecessary writes, so you don't wipe every block of the disk first.)

I'm using md metadata version 1.0 here (in contrast to the serverfault.com
case), where the RAID/md superblock is stored at the end of the partition. I
had to wipe exactly this area at the end of the partition before the --add
finally succeeded. Wiping the beginning of the partition (as mentioned in the
serverfault.com case, where metadata version 1.2 stores the superblock) had,
as expected, no effect.
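
A sketch of the targeted wipe (device names as above; the 4 MiB tail size is
my own generous choice, large enough to cover the 1.0 superblock, internal
bitmap and bad block log near the end of the partition):

# SECT=$(blockdev --getsz /dev/sdc5)                                     # partition size in 512-byte sectors
# dd if=/dev/zero of=/dev/sdc5 bs=512 seek=$((SECT - 8192)) count=8192   # zero the last 4 MiB
# mdadm --verbose /dev/md2 --add /dev/sdc5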

I think after 5 years this might need a clean up in mdadm or md device
management.

At least the error message should suggest cleaning the partition first instead
of the cryptic dmesg message above. Also, --zero-superblock should wipe out
the dirt around the superblock area and not let old data seemingly leak
through here.

Thanks
Ralf

