Hi,
in a critical setting I'm still running a SuSE 9.3 server (sorry), which
typically becomes unusable after 70-80 days of uptime (only SysRq still works
then). Unfortunately, these crashes(?) have lately left corrupt LDAP databases
behind, even when I reboot with the Alt-SysRq [S] [U] [B] sequence. All that
mess is still running 2.6.11.4-21.14-smp.
Since the users also complain about 1-10 second hangs in a terminal-based
order management system, I thought it would be a good idea to try moving
the kernel to 2.6.24.1, with all the fancy (I/O) scheduling, the BKL changes,
etc. (I just rebuilt
Kernel:/HEAD/openSUSE_Factory/kernel-default-2.6.24.1-35.1 with rpmbuild on
that system).
The first attempts consistently resulted in an oops during initrd, similar to:
BUG: unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip: c01e5374 *pde = 00000000
Oops: 0000 [#1] SMP
last sysfs file: /block/md0/dev
Modules linked in: xfs ide_cd cdrom ide_disk pata_amd amd74xx ide_core raid456 async_xor async_memcpy async_tx xor sata_sil24 libata 3w_9xxx sd_mod scsi_mod
Pid: 710, comm: udev_volume_id Tainted: G N (2.6.24.1-35.1-default #1)
EIP: 0060:[<c01e5374>] EFLAGS: 00010046 CPU: 0
EIP is at __rb_erase_color+0x19/0x13f
EAX: 00000000 EBX: f7fcd3a8 ECX: 00000000 EDX: 00000000
ESI: c2029f9c EDI: f7fcd3a0 EBP: f750be28 ESP: f750be10
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process udev_volume_id (pid: 710, ti=f750a000 task=f7fcd370 task.ti=f750a000)
Stack: f7525a20 c2029f80 c011f6fe c2029f80 f7525a20 f7509040 f750be38 c011f841
f75259f0 c02f65a0 f750be48 c01200bd c202cd80 f75259f0 f750be58 c0120146
f75259f0 00000000 f750be7c c02eaeb3 f7509040 f75259f0 c202cd80 f7fcd4d8
Call Trace:
[<c011f6fe>] dequeue_entity+0x2c/0x39
[<c011f841>] dequeue_task_fair+0x18/0x2d
[<c01200bd>] dequeue_task+0xd/0x18
[<c0120146>] deactivate_task+0x1f/0x2b
[<c02eaeb3>] __sched_text_start+0xe3/0x379
[<c02eb8bf>] __mutex_lock_interruptible_slowpath+0x6e/0x9f
[<c02eb7d0>] mutex_lock_interruptible+0x1b/0x21
[<c0278f77>] md_open+0x1f/0x50
[<c019cbbf>] do_open+0x1b6/0x248
[<c019cd00>] blkdev_open+0x27/0x51
[<c017a91c>] __dentry_open+0xd1/0x184
[<ffffff9c>] 0xffffff9c
DWARF2 unwinder stuck at 0xffffff9c
Leftover inexact backtrace:
[<c017aab2>] nameidata_to_filp+0x23/0x32
[<c017aa0f>] do_filp_open+0x40/0x48
[<c017ab6a>] get_unused_fd_flags+0x59/0xc3
[<c017acac>] do_sys_open+0x48/0xc9
[<c017ad47>] sys_open+0x1a/0x1c
[<c0104fa2>] syscall_call+0x7/0xb
=======================
Code: 01 0f 84 75 ff ff ff 8b 45 00 83 08 01 5b 5e 5f 5d c3 56 89 ce 53 89 d3 e9 19 01 00 00 8b 53 08 39 c2 0f 85 84 00 00 00 8b 4b 04 <8b> 01 a8 01 75 14 83 c8 01 89 f2 89 01 89 d8 83 23 fe e8 7d fe
EIP: [<c01e5374>] __rb_erase_color+0x19/0x13f SS:ESP 0068:f750be10
---[ end trace 92ebfcce66e192a6 ]---
and:
BUG: unable to handle kernel NULL pointer dereference at virtual address 00000040
printing eip: c011f96c *pde = 00000000
Oops: 0000 [#1] SMP
last sysfs file: /block/sde/sde1/dev
Modules linked in: sata_sil24 libata 3w_9xxx sd_mod scsi_mod
Pid: 538, comm: udev Not tainted (2.6.24.1-35.1-default #1)
EIP: 0060:[<c011f96c>] EFLAGS: 00010046 CPU: 0
EIP is at pick_next_task_fair+0x15/0x23
EAX: 00000000 EBX: f75d11f0 ECX: c202cdd0 EDX: 00000000
ESI: 00000000 EDI: 00000001 EBP: f7481f08 ESP: f7481f08
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process udev (pid: 538, ti=f7480000 task=f75d11f0 task.ti=f7480000)
Stack: f7481f2c c02eaeea c202a1cc 00000000 c202cd80 f75d1358 f7481f44 00000001
00000001 f7481f9c c02eb95e 00000001 00000000 00000000 c013af1e f7ffc944
00000000 00000000 73e8b1d5 00000005 c013aeba c202a1cc 00000001 c02eb953
Call Trace:
[<c02eaeea>] __sched_text_start+0x11a/0x379
[<c02eb95e>] do_nanosleep+0x3c/0x67
=======================
Code: 39 c8 73 0c 5e 89 f8 5b 5e 5f 5d e9 9b f6 ff ff 5b 5b 5e 5f 5d c3 55 83 c0 34 31 d2 83 78 08 00 89 e5 74 11 e8 15 fe ff ff 89 c2 <8b> 40 40 85 c0 75 f2 83 ea 30 5d 89 d0 c3 55 89 e5 53 89 d3 83
EIP: [<c011f96c>] pick_next_task_fair+0x15/0x23 SS:ESP 0068:f7481f08
---[ end trace 18a67066b954c85e ]---
Looking into the requirements, I noticed that 2.6.24 needs at least udev 081,
while the system uses 053-15.4. As it also still has the plain old static /dev
setup, I figured I only need udev during boot, i.e. in the initrd. Since that
initrd setup differs considerably, I created the initrd on an openSUSE 10.2
system with udev-103-12 and a matching mkinitrd. Note that all I want is to get
past the initrd oops, but still no luck:
BUG: unable to handle kernel NULL pointer dereference at virtual address 00000040
printing eip: c011f96c *pde = 00000000
Oops: 0000 [#1] SMP
last sysfs file: /block/md0/dev
Modules linked in: xfs ide_cd cdrom ide_disk pata_amd amd74xx ide_core raid456 async_xor async_memcpy async_tx xor sata_sil24 libata 3w_9xxx sd_mod scsi_mod
Pid: 538, comm: udev Tainted: G N (2.6.24.1-35.1-default #1)
EIP: 0060:[<c011f96c>] EFLAGS: 00010046 CPU: 0
EIP is at pick_next_task_fair+0x15/0x23
EAX: 00000000 EBX: f75961b0 ECX: c202cdd0 EDX: 00000000
ESI: 00000000 EDI: 00000001 EBP: f765ff08 ESP: f765ff08
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process udev (pid: 538, ti=f765e000 task=f75961b0 task.ti=f765e000)
Stack: f765ff2c c02eaeea c202a1cc 00000000 c202cd80 f7596318 f765ff44 00000001
00000001 f765ff9c c02eb95e 00000001 00000000 00000000 c013af1e f7657f44
00000000 00000000 d00530e1 00000006 c013aeba c202a1cc 00000001 c02eb953
Call Trace:
[<c02eaeea>] __sched_text_start+0x11a/0x379
[<c02eb95e>] do_nanosleep+0x3c/0x67
=======================
Code: 39 c8 73 0c 5e 89 f8 5b 5e 5f 5d e9 9b f6 ff ff 5b 5b 5e 5f 5d c3 55 83 c0 34 31 d2 83 78 08 00 89 e5 74 11 e8 15 fe ff ff 89 c2 <8b> 40 40 85 c0 75 f2 83 ea 30 5d 89 d0 c3 55 89 e5 53 89 d3 83
EIP: [<c011f96c>] pick_next_task_fair+0x15/0x23 SS:ESP 0068:f765ff08
---[ end trace b79f2f543ef32b7a ]---
As it stands, it crashes consistently in pick_next_task_fair, even with an
initrd that meets the requirements.
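To compare oopses across several boots, a small script along these lines could
tally the faulting functions from a saved console log. This is only a sketch:
the name faulting_sites and the sample text are made up for illustration, and
the regex assumes the i386 "EIP is at" line format seen in the traces above.

```python
import re
from collections import Counter

# Matches the summary line the kernel prints on i386, e.g.
# "EIP is at pick_next_task_fair+0x15/0x23"
EIP_RE = re.compile(r"EIP is at (\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)")


def faulting_sites(log_text):
    """Return (symbol, offset, function_size) for every 'EIP is at' line."""
    return [
        (m.group(1), int(m.group(2), 16), int(m.group(3), 16))
        for m in EIP_RE.finditer(log_text)
    ]


# Hypothetical excerpt; in practice this would be read from a captured log.
sample = """\
EIP is at __rb_erase_color+0x19/0x13f
EIP is at pick_next_task_fair+0x15/0x23
EIP is at pick_next_task_fair+0x15/0x23
"""

# Count how often each function shows up as the faulting site.
counts = Counter(symbol for symbol, _, _ in faulting_sites(sample))
for symbol, n in counts.most_common():
    print(symbol, n)
```

The same offset inside the same function on every boot points at one
reproducible fault rather than random memory corruption.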
Today I noticed Greg's 2.6.24.3 announcement, with two hrtimer-related fixes
which could fit the picture. Another problem in this setup, and the reason for
asking first, is that I normally only have a chance to fiddle with such
things on Sundays, if at all. Before I spam LKML, I thought I'd ask the SUSE
kernel people here.
Do these oopses look familiar to anybody? Do I have a chance of getting past
them with the 2.6.24.3 patches?
If you want more info, just ask.
Thanks in advance,
Pete