Re: [SLE] Strange raid init sequence - revisited

24 Jun 2003

      Some time ago, on 03.02.25 we were talking about a strange sequence
of events when mounting a software raid:
...
...
The 03.02.24 at 21:44, Paul Uiterlinden wrote:
On Monday 24 February 2003 15:11, Carlos E. R. wrote:
[...]
The sequence is this: the kernel loads, initialises a lot of things, like
IDE, checks the partitions, sees some of them are raid, and tries to load
them:
[but fails the first time]
[...]
Now the root filesystem has been really mounted, and the raid is again
considered:
<6>Freeing unused kernel memory: 168k freed
<6>md: Autodetecting RAID arrays.
<6> [events: 00000344]
<6> [events: 00000344]
<6>md: autorun ...
<6>md: considering hda11 ...
<6>md:  adding hda11 ...
<6>md:  adding hdb14 ...
<6>md: created md0
<6>md: bind
<6>md: bind
<6>md: running: <hda11><hdb14>
<6>md: hda11's event counter: 00000344
<6>md: hdb14's event counter: 00000344
<6>md: RAID level 1 does not need chunksize! Continuing anyway.
<6>md: raid1 personality registered as nr 3
<6>md0: max total readahead window set to 508k
<6>md0: 1 data-disks, max readahead per data-disk: 508k
<6>raid1: device hda11 operational as mirror 0
<6>raid1: device hdb14 operational as mirror 1
<6>raid1: raid set md0 active with 2 out of 2 mirrors
<6>md: updating md0 RAID superblock on device
<6>md: hda11 [events: 00000345]<6>(write) hda11's sb offset: 3124544
<6>md: hdb14 [events: 00000345]<6>(write) hdb14's sb offset: 3124544
<6> [events: 0f227faa]
<3>md: invalid raid superblock magic on md0
<4>md: md0 has invalid sb, not importing!
<4>md: no nested md device found
<6>md: ... autorun DONE.
Now I'm noticing that it says "invalid raid superblock magic on md0". I [...]

Well, that's an extract of what we were saying then. I have discovered some
new info. I was watching console 10 while halting the system, and
noticed a message about md being stopped, and something that failed -
unfortunately, the system powers off at that precise moment, and I cannot
read it.

I have edited "/etc/init.d/halt" right at the end. First, I added a
"sleep 20":

  echo "----------- real halt coming in 20\" -----------"
  sleep 20
  echo "----------- real halt now ----------------------"

  # Now talk to kernel
  exec $command -d -f

I do see those lines on the output, and I read a message about md being
stopped. But that's all! The error message I am after is printed after that
"exec $command -d -f", so I still cannot read it. The command is probably
"halt -p -d -f":

    -f     Force halt or reboot, don't call shutdown(8).
    -d     Don't write the wtmp record. The -n flag implies -d.
    -p     When halting the system, do a poweroff. This is the default
           when halt is called as poweroff.

So I removed the "-p" (not shown), and this is what I see on console
number 10, copied by hand:

  analog.c [...] gameport
  md: recovery thread got woken up ...
  md: recovery thread finished ...
---> kernel halt is called here
  md: stopping all md devices
  md: marking sb clean
  md: updating md0 RAID superblock on device
  md: hda11 [events: 000004c2]<6>(write) hda11's sb offset: 3124544
  md: hdb14 [events: 000004c2]<6>(write) hdb14's sb offset: 3124544
  md: md0 switched to read_only mode
  flushing ide devices: hda hdb hdc hdd
 System halted

My guess is that the software raid is trying to do something as a result
of the halt command, but cannot finish writing it to disk. Thus, on the
next boot, there are always errors:

  <6>md: created md0
  <6>md: bind
  <6>md: bind
  <6>md: running: <hda11><hdb14>
  <6>md: hda11's event counter: 000004c2
  <6>md: hdb14's event counter: 000004c2
  <6>md: RAID level 1 does not need chunksize! Continuing anyway.
  <6>md: raid1 personality registered as nr 3
  <6>md0: max total readahead window set to 508k
  <6>md0: 1 data-disks, max readahead per data-disk: 508k
  <6>raid1: device hda11 operational as mirror 0
  <6>raid1: device hdb14 operational as mirror 1
  <6>raid1: raid set md0 active with 2 out of 2 mirrors
  <6>md: updating md0 RAID superblock on device
  <6>md: hda11 [events: 000004c3]<6>(write) hda11's sb offset: 3124544
  <6>md: hdb14 [events: 000004c3]<6>(write) hdb14's sb offset: 3124544
  <6> [events: d20feae5]
  <3>md: invalid raid superblock magic on md0
  <4>md: md0 has invalid sb, not importing!
  <4>md: no nested md device found
  <6>md: ... autorun DONE.
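
In excerpts like the one above, the number in angle brackets is the kernel
log priority: <6> is informational, <4> is a warning and <3> is an error.
A minimal sketch for picking out only the warnings and errors from a saved
copy of the log (the function name and the file name "boot.log" are made up
for illustration):

```shell
#!/bin/sh
# Hypothetical helper: keep only kernel log lines of priority 0 to 4
# (emergency through warning). Leading blanks are tolerated because
# pasted excerpts are often indented.
md_warnings() {
    grep -E '^[[:space:]]*<[0-4]>'
}

# Example, assuming the boot messages were saved to a file named boot.log:
#   md_warnings < boot.log
```

On the excerpt above this leaves just the three lines about the invalid
superblock on md0 and the missing nested md device.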

Interesting... I could be wrong: the event counter read at boot (000004c2)
is the same one written at halt, and it is then incremented to 000004c3
(or perhaps that is only so because this time I had disabled the power
off), so something did get written (and the offset is exactly the same
every time). But a few more people have reported those "invalid raid
superblock" and "events" messages, so something weird is still happening.
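
To make that comparison less error-prone than reading hex by eye, here is a
small sketch that pulls the per-device event counters out of the log and
computes the difference between two of them. Both function names are
invented for this example, and it only assumes the message format shown in
the excerpts above:

```shell
#!/bin/sh
# Hypothetical helper: print "device counter" for every line of the form
#   md: hda11's event counter: 000004c2
# read from standard input.
md_event_counters() {
    sed -n "s/.*md: \([a-z0-9]*\)'s event counter: \([0-9a-fA-F]*\).*/\1 \2/p"
}

# Hypothetical helper: difference between two hexadecimal event counters
# (second minus first), printed as a decimal number.
md_counter_delta() {
    echo $(( 0x$2 - 0x$1 ))
}

# Example with the values from this mail:
#   md_counter_delta 000004c2 000004c2   # halt value vs. value read at boot
#   md_counter_delta 000004c2 000004c3   # value read at boot vs. after activation
```

With the numbers above, the first call gives 0 and the second gives 1, which
is what I would expect if the superblock written at halt really reached both
disks.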

-- 
Cheers,
       Carlos Robinson