Booting a dual Opteron system, I am getting the following lockup error as it attempts to set up the software RAID 5 array:

md: syncing RAID array md0
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwith (but not more than 200000 KB/sec) .
md: using 128k window, over a total of 143363904 blocks.
NMI Watchdog detected LOCKUP on CPU0, registers:
CPU 0
Pid: 742, comm: md0_resync Not tainted 2.6.4-54.5-smp
RIP: 0010:[<ffffffffa0098415>] <ffffffffa0098415>{:raid5:.text.lock.raid5+145}
RSP: 0018:ffffffff804eaeb8 EFLAGS: 00000086
RAX: 00000101ff074640 RBX: 00000101ff074500 RCX: 0000000000000006
RDX: 0000000000000002 RSI: 0000000000001000 RDI: 00000101ff074500
RBP: 00000101ffc55880 R08: 0000000000000001 R09: 0000010010362c40
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
R13: 00000101ffc55880 R14: 0000000000000000 R15: 0000010010283200
FS: 0000002a9588d6e0(0000) GS:ffffffff804e6d00(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a956fa1a0 CR3: 0000000000101000 CR4: 00000000000006e0

Any ideas on how to proceed? It's a SUSE 9.1 x86_64 install with an LSI SCSI card and 7 drives, 6 of which participate in this array. All drives are identical 143 GByte Seagate drives. 16 GByte RAM.

--
----------------------------------------------------------------
Johannes Ullrich jullrich@euclidian.com
pgp key: http://johannes.homepc.org/PGPKEYS
contact: http://johannes.homepc.org/contact.htm
----------------------------------------------------------------
There are two kinds of system administrators:
The first kind solves problems with little shell scripts.
The second kind are the problem.
----------------------------------------------------------------
NMI Watchdog detected LOCKUP on CPU0, registers:
CPU 0
Pid: 742, comm: md0_resync Not tainted 2.6.4-54.5-smp
RIP: 0010:[<ffffffffa0098415>] <ffffffffa0098415>{:raid5:.text.lock.raid5+145}
RSP: 0018:ffffffff804eaeb8 EFLAGS: 00000086
RAX: 00000101ff074640 RBX: 00000101ff074500 RCX: 0000000000000006
RDX: 0000000000000002 RSI: 0000000000001000 RDI: 00000101ff074500
RBP: 00000101ffc55880 R08: 0000000000000001 R09: 0000010010362c40
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
R13: 00000101ffc55880 R14: 0000000000000000 R15: 0000010010283200
FS: 0000002a9588d6e0(0000) GS:ffffffff804e6d00(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a956fa1a0 CR3: 0000000000101000 CR4: 00000000000006e0
You snipped the interesting part - the backtrace. -Andi
Sorry... this time I got it all (I hope ;-| )

After a few different reinstalls during the day, this looks like it is linked to raid5. I have yet to try a 2.6.6 vanilla kernel. All 9.1 YOU updates are applied.

NMI Watchdog detected LOCKUP on CPU0, registers:
CPU 0
Pid: 337, comm: md0_resync Not tainted 2.6.4-54.5-smp
RIP: 0010:[<ffffffffa008b403>] <ffffffffa008b403>{:raid5:.text.lock.raid5+127}
RSP: 0018:ffffffff804eaee8 EFLAGS: 00000086
RAX: 00000101ffd363e8 RBX: 0000000000000005 RCX: 0000000000000006
RDX: 0000000000000005 RSI: 0000000000000001 RDI: 00000101ffd363e8
RBP: 00000101ffd36080 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 000001000807bc80
R13: 0000000000000000 R14: 0000000000000000 R15: 00000103ffc4fb00
FS: 0000002a9588d6e0(0000) GS:ffffffff804e6d00(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a956fa1a0 CR3: 0000000000101000 CR4: 00000000000006e0
Process md0_resync (pid: 337, stackpage=100fbc9d440)
Stack: 00000103ffe23400 0000000000000297 00000101ffd363e8 0000000000000000
       000000000001e000 ffffffff8018180f 00f90100fbdca800 0000000000001000
       00000101ffd363e8 ffffffff802679da
Call Trace:<IRQ> <ffffffff8018180f>{bio_endio+95} <ffffffff802679da>{__end_that
       <ffffffffa0005d9d>{:scsi_mod:scsi_end_request+61} <ffffffffa00060ae>{:sc
       <ffffffffa0000446>{:scsi_mod:scsi_finish_command+214}
       <ffffffffa0000daa>{:scsi_mod:scsi_softirq+234} <ffffffff8013e09b>{do_sof
       <ffffffff80113d5d>{do_IRQ+317} <ffffffff80110cad>{ret_from_intr+0} <EOI>
       <ffffffff80138f90>{autoremove_wake_function+0}
       <ffffffffa0005af8>{:scsi_mod:scsi_request_fn+632} <ffffffffa000599f>{:sc
       <ffffffff80268d9e>{generic_unplug_device+62} <ffffffff802b6bed>{md_unplu
       <ffffffffa00882c9>{:raid5:get_active_stripe+233} <ffffffffa008a1ef>{:rai
       <ffffffff802ba3a7>{md_do_sync+887} <ffffffff8013b5ab>{reparent_to_init+5
       <ffffffff8013b726>{daemonize+358} <ffffffff802bc924>{md_thread+404}
       <ffffffff80133600>{default_wake_function+0} <ffffffff802ba030>{md_do_syn
       <ffffffff80111257>{child_rip+8} <ffffffff802ba030>{md_do_sync+0}
       <ffffffff802bc790>{md_thread+0} <ffffffff8011124f>{child_rip+0}

Code: 41 80 bc 24 ac 00 00 00 00 7e f3 e9 86 ea ff ff f3 90 80 bd

console shuts up ...
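For readers who want to pull the frames out of a trace like the one above, here is a minimal sketch in Python. It is not part of the original exchange. The function name `parse_trace` and the `"kernel"` label for symbols without a module prefix are illustrative assumptions, and entries cut off by the console width (such as `{__end_that`) are deliberately skipped rather than guessed at.

```python
import re

# Matches complete entries such as "<ffffffff8018180f>{bio_endio+95}" or
# "<ffffffffa00882c9>{:raid5:get_active_stripe+233}". Entries truncated by
# the console (no "+offset}" tail) do not match and are skipped.
ENTRY = re.compile(r"<([0-9a-f]{16})>\{(?::(\w+):)?([\w.]+)\+(\d+)\}")

def parse_trace(text):
    """Return a list of (address, module, symbol, offset) tuples.

    'parse_trace' and the "kernel" placeholder are hypothetical names used
    for illustration only; they do not come from any kernel tool.
    """
    frames = []
    for addr, module, symbol, offset in ENTRY.findall(text):
        frames.append((int(addr, 16), module or "kernel", symbol, int(offset)))
    return frames

# A few entries copied from the trace in this thread, including one
# truncated entry that the parser should ignore.
sample = (
    "<ffffffff8018180f>{bio_endio+95} <ffffffff802679da>{__end_that "
    "<ffffffffa0005d9d>{:scsi_mod:scsi_end_request+61} "
    "<ffffffffa00882c9>{:raid5:get_active_stripe+233} "
    "<ffffffff802ba3a7>{md_do_sync+887}"
)

for addr, module, symbol, offset in parse_trace(sample):
    print(f"{addr:#018x} {module:10s} {symbol}+{offset}")
```

Grouping frames by module this way makes it easier to see that the resync thread was inside raid5 code, with SCSI completion handling running in interrupt context on top of it.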
On Thu, Jun 03, 2004 at 05:51:55PM -0400, Johannes B. Ullrich wrote:
Sorry... this time I got it all (I hope ;-| )
doing a few different reinstalls during the day, this looks like it is linked to raid5. I have yet to try a 2.6.6 vanilla kernel.
All 9.1 YOU updates are applied.
[...]
RIP: 0010:[<ffffffffa008b403>] <ffffffffa008b403>{:raid5:.text.lock.raid5+127}
Can you please test if the 2.6.5 kernel in ftp.suse.com/pub/people/aj/Kernel-AMD64/ fixes the problem? It could be a race in MD queue handling that's already fixed in the newer 2.6.5 kernel. -Andi
participants (2)
- Andi Kleen
- Johannes B. Ullrich