[opensuse-kernel] 2.6.24.1-35.1 udev crashes in initrd
Hi,

in a critical setting, I'm still using a SuSE 9.3 server (sorry), where the system typically becomes unusable after 70-80 days of uptime (only SysRq works then). Unfortunately, these crashes(?) have lately been leaving corrupt LDAP databases behind, even when I use the Alt-SysRq [S] [U] [B] sequence. All that mess is still running 2.6.11.4-21.14-smp.

Since the users also complain about 1-10 second hangs in a terminal-based order management system, I thought it would be a good idea to move the kernel to 2.6.24.1, with all the fancy (I/O) scheduling, engaging BKL, etc. (I just ran rpmbuild on Kernel:/HEAD/openSUSE_Factory/kernel-default-2.6.24.1-35.1 on that system.)

The first tries consistently resulted in an Oops during initrd, similar to:

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip: c01e5374 *pde = 00000000
Oops: 0000 [#1] SMP
last sysfs file: /block/md0/dev
Modules linked in: xfs ide_cd cdrom ide_disk pata_amd amd74xx ide_core raid456 async_xor async_memcpy async_tx xor sata_sil24 libata 3w_9xxx sd_mod scsi_mod
Pid: 710, comm: udev_volume_id Tainted: G N (2.6.24.1-35.1-default #1)
EIP: 0060:[<c01e5374>] EFLAGS: 00010046 CPU: 0
EIP is at __rb_erase_color+0x19/0x13f
EAX: 00000000 EBX: f7fcd3a8 ECX: 00000000 EDX: 00000000
ESI: c2029f9c EDI: f7fcd3a0 EBP: f750be28 ESP: f750be10
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process udev_volume_id (pid: 710, ti=f750a000 task=f7fcd370 task.ti=f750a000)
Stack: f7525a20 c2029f80 c011f6fe c2029f80 f7525a20 f7509040 f750be38 c011f841
       f75259f0 c02f65a0 f750be48 c01200bd c202cd80 f75259f0 f750be58 c0120146
       f75259f0 00000000 f750be7c c02eaeb3 f7509040 f75259f0 c202cd80 f7fcd4d8
Call Trace:
 [<c011f6fe>] dequeue_entity+0x2c/0x39
 [<c011f841>] dequeue_task_fair+0x18/0x2d
 [<c01200bd>] dequeue_task+0xd/0x18
 [<c0120146>] deactivate_task+0x1f/0x2b
 [<c02eaeb3>] __sched_text_start+0xe3/0x379
 [<c02eb8bf>] __mutex_lock_interruptible_slowpath+0x6e/0x9f
 [<c02eb7d0>] mutex_lock_interruptible+0x1b/0x21
 [<c0278f77>] md_open+0x1f/0x50
 [<c019cbbf>] do_open+0x1b6/0x248
 [<c019cd00>] blkdev_open+0x27/0x51
 [<c017a91c>] __dentry_open+0xd1/0x184
 [<ffffff9c>] 0xffffff9c
DWARF2 unwinder stuck at 0xffffff9c
Leftover inexact backtrace:
 [<c017aab2>] nameidata_to_filp+0x23/0x32
 [<c017aa0f>] do_filp_open+0x40/0x48
 [<c017ab6a>] get_unused_fd_flags+0x59/0xc3
 [<c017acac>] do_sys_open+0x48/0xc9
 [<c017ad47>] sys_open+0x1a/0x1c
 [<c0104fa2>] syscall_call+0x7/0xb
=======================
Code: 01 0f 84 75 ff ff ff 8b 45 00 83 08 01 5b 5e 5f 5d c3 56 89 ce 53 89 d3 e9 19 01 00 00 8b 53 08 39 c2 0f 85 84 00 00 00 8b 4b 04 <8b> 01 a8 01 75 14 83 c8 01 89 f2 89 01 89 d8 83 23 fe e8 7d fe
EIP: [<c01e5374>] __rb_erase_color+0x19/0x13f SS:ESP 0068:f750be10
---[ end trace 92ebfcce66e192a6 ]---

and:

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000040
printing eip: c011f96c *pde = 00000000
Oops: 0000 [#1] SMP
last sysfs file: /block/sde/sde1/dev
Modules linked in: sata_sil24 libata 3w_9xxx sd_mod scsi_mod
Pid: 538, comm: udev Not tainted (2.6.24.1-35.1-default #1)
EIP: 0060:[<c011f96c>] EFLAGS: 00010046 CPU: 0
EIP is at pick_next_task_fair+0x15/0x23
EAX: 00000000 EBX: f75d11f0 ECX: c202cdd0 EDX: 00000000
ESI: 00000000 EDI: 00000001 EBP: f7481f08 ESP: f7481f08
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process udev (pid: 538, ti=f7480000 task=f75d11f0 task.ti=f7480000)
Stack: f7481f2c c02eaeea c202a1cc 00000000 c202cd80 f75d1358 f7481f44 00000001
       00000001 f7481f9c c02eb95e 00000001 00000000 00000000 c013af1e f7ffc944
       00000000 00000000 73e8b1d5 00000005 c013aeba c202a1cc 00000001 c02eb953
Call Trace:
 [<c02eaeea>] __sched_text_start+0x11a/0x379
 [<c02eb95e>] do_nanosleep+0x3c/0x67
=======================
Code: 39 c8 73 0c 5e 89 f8 5b 5e 5f 5d e9 9b f6 ff ff 5b 5b 5e 5f 5d c3 55 83 c0 34 31 d2 83 78 08 00 89 e5 74 11 e8 15 fe ff ff 89 c2 <8b> 40 40 85 c0 75 f2 83 ea 30 5d 89 d0 c3 55 89 e5 53 89 d3 83
EIP: [<c011f96c>] pick_next_task_fair+0x15/0x23 SS:ESP 0068:f7481f08
---[ end trace 18a67066b954c85e ]---

Looking into the requirements, I noticed that 2.6.24 needs udev 081, while the system uses 053-15.4. As it also still has the plain old static /dev setup, I figured I only need udev during boot, i.e. in the initrd. Since that initrd setup differs considerably, I created the initrd on an openSUSE 10.2 system with udev-103-12 and a matching mkinitrd. Note that all I want is to get past the initrd oops, but still no deal:

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000040
printing eip: c011f96c *pde = 00000000
Oops: 0000 [#1] SMP
last sysfs file: /block/md0/dev
Modules linked in: xfs ide_cd cdrom ide_disk pata_amd amd74xx ide_core raid456 async_xor async_memcpy async_tx xor sata_sil24 libata 3w_9xxx sd_mod scsi_mod
Pid: 538, comm: udev Tainted: G N (2.6.24.1-35.1-default #1)
EIP: 0060:[<c011f96c>] EFLAGS: 00010046 CPU: 0
EIP is at pick_next_task_fair+0x15/0x23
EAX: 00000000 EBX: f75961b0 ECX: c202cdd0 EDX: 00000000
ESI: 00000000 EDI: 00000001 EBP: f765ff08 ESP: f765ff08
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process udev (pid: 538, ti=f765e000 task=f75961b0 task.ti=f765e000)
Stack: f765ff2c c02eaeea c202a1cc 00000000 c202cd80 f7596318 f765ff44 00000001
       00000001 f765ff9c c02eb95e 00000001 00000000 00000000 c013af1e f7657f44
       00000000 00000000 d00530e1 00000006 c013aeba c202a1cc 00000001 c02eb953
Call Trace:
 [<c02eaeea>] __sched_text_start+0x11a/0x379
 [<c02eb95e>] do_nanosleep+0x3c/0x67
=======================
Code: 39 c8 73 0c 5e 89 f8 5b 5e 5f 5d e9 9b f6 ff ff 5b 5b 5e 5f 5d c3 55 83 c0 34 31 d2 83 78 08 00 89 e5 74 11 e8 15 fe ff ff 89 c2 <8b> 40 40 85 c0 75 f2 83 ea 30 5d 89 d0 c3 55 89 e5 53 89 d3 83
EIP: [<c011f96c>] pick_next_task_fair+0x15/0x23 SS:ESP 0068:f765ff08
---[ end trace b79f2f543ef32b7a ]---

As it stands, it crashes consistently in pick_next_task_fair, even with an initrd within constraints.

Today I noticed Greg's 2.6.24.3 announcement, with two hrtimer-related fixes, which could fit the picture.

Another problem in this setup, and the reason for this deliberate inquiry, is that I normally only have a chance to fiddle with such things on Sundays, if at all. Before I spam LKML, I thought I'd ask the SUSE kernel people here. Do these oopses look familiar to anybody? Do I have a chance of getting past them with the 2.6.24.3 patches?

If you want more info, just ask.

Thanks in advance,
Pete

--
To unsubscribe, e-mail: opensuse-kernel+unsubscribe@opensuse.org
For additional commands, e-mail: opensuse-kernel+help@opensuse.org
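
The fault addresses reported in the oopses above can be cross-checked against the register dump and the "Code:" line, where the byte in angle brackets marks the start of the faulting instruction. The following is only a minimal Python sketch of that check, not a disassembler: the function names are made up for illustration, and it handles just the one instruction form seen in these traces (opcode 0x8b, a 32-bit load through a plain base register with no displacement or an 8-bit one).

```python
# 32-bit x86 general-purpose registers, indexed by their 3-bit ModRM number.
REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def faulting_bytes(code_line):
    """Return the instruction bytes starting at the <xx>-marked byte of a 'Code:' line."""
    tokens = code_line.split()
    if tokens and tokens[0] == "Code:":
        tokens = tokens[1:]
    for i, tok in enumerate(tokens):
        if tok.startswith("<") and tok.endswith(">"):
            return [int(tok[1:-1], 16)] + [int(t, 16) for t in tokens[i + 1:]]
    raise ValueError("no <xx> marker in Code: line")


def decode_mov_load(insn):
    """Decode opcode 0x8b (load a 32-bit register from memory).

    Only ModRM forms with a plain base register and either no displacement
    (mod 00) or an 8-bit displacement (mod 01) are handled; SIB bytes and
    absolute addressing are rejected. Returns (dest_reg, base_reg, disp).
    """
    if len(insn) < 2 or insn[0] != 0x8B:
        raise ValueError("not a 0x8b mov load")
    mod, reg, rm = insn[1] >> 6, (insn[1] >> 3) & 7, insn[1] & 7
    if mod not in (0, 1) or rm == 4 or (mod == 0 and rm == 5):
        raise ValueError("addressing form not handled by this sketch")
    disp = 0
    if mod == 1:
        if len(insn) < 3:
            raise ValueError("truncated instruction")
        # The 8-bit displacement is sign-extended.
        disp = insn[2] - 0x100 if insn[2] >= 0x80 else insn[2]
    return REGS[reg], REGS[rm], disp


def fault_address(code_line, regs):
    """Compute the address the marked instruction tried to read (32-bit wrap)."""
    _dest, base, disp = decode_mov_load(faulting_bytes(code_line))
    return (regs[base] + disp) & 0xFFFFFFFF


if __name__ == "__main__":
    # Excerpts of the bytes around the marker, with the register values
    # taken from the corresponding dumps above.
    print(hex(fault_address("Code: 89 c2 <8b> 40 40 85 c0 75 f2", {"eax": 0})))
    print(hex(fault_address("Code: 8b 4b 04 <8b> 01 a8 01 75 14", {"ecx": 0})))
```

With EAX and ECX both zero in the dumps, the two computed addresses agree with the 00000040 and 00000000 that the kernel printed, which fits a NULL pointer being followed inside the scheduler's data structures rather than a random wild access.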
On Mon, 25 Feb 2008, Hans-Peter Jansen wrote:
> printing eip: c011f96c *pde = 00000000
> Oops: 0000 [#1] SMP
> last sysfs file: /block/sde/sde1/dev
> Modules linked in: sata_sil24 libata 3w_9xxx sd_mod scsi_mod
> Pid: 538, comm: udev Not tainted (2.6.24.1-35.1-default #1)
> EIP: 0060:[<c011f96c>] EFLAGS: 00010046 CPU: 0
> EIP is at pick_next_task_fair+0x15/0x23

This looks like a bug in the CFS scheduler. Could you please try the latest vanilla kernel (2.6.25-rc3), or at least a KOTD snapshot [1] of our HEAD kernel? If the problem persists, please report the bug upstream (to Ingo Molnar, with a CC to lkml).

[1] ftp://ftp.suse.com/pub/projects/kernel/kotd/HEAD

Thanks,

--
Jiri Kosina
SUSE Labs
On Mon, 25 Feb 2008, Jiri Kosina wrote:
> > printing eip: c011f96c *pde = 00000000
> > Oops: 0000 [#1] SMP
> > last sysfs file: /block/sde/sde1/dev
> > Modules linked in: sata_sil24 libata 3w_9xxx sd_mod scsi_mod
> > Pid: 538, comm: udev Not tainted (2.6.24.1-35.1-default #1)
> > EIP: 0060:[<c011f96c>] EFLAGS: 00010046 CPU: 0
> > EIP is at pick_next_task_fair+0x15/0x23
>
> This looks like a bug in the CFS scheduler. Could you please try the latest vanilla kernel (2.6.25-rc3), or at least a KOTD snapshot [1] of our HEAD kernel? If the problem persists, please report the bug upstream (to Ingo Molnar, with a CC to lkml).

FYI, someone has just raised a very similar-looking issue on LKML: http://lkml.org/lkml/2008/2/26/459

--
Jiri Kosina
SUSE Labs
On Wednesday, 27 February 2008, Jiri Kosina wrote:
> On Mon, 25 Feb 2008, Jiri Kosina wrote:
> > > printing eip: c011f96c *pde = 00000000
> > > Oops: 0000 [#1] SMP
> > > last sysfs file: /block/sde/sde1/dev
> > > Modules linked in: sata_sil24 libata 3w_9xxx sd_mod scsi_mod
> > > Pid: 538, comm: udev Not tainted (2.6.24.1-35.1-default #1)
> > > EIP: 0060:[<c011f96c>] EFLAGS: 00010046 CPU: 0
> > > EIP is at pick_next_task_fair+0x15/0x23
> >
> > This looks like a bug in the CFS scheduler. Could you please try the latest vanilla kernel (2.6.25-rc3), or at least a KOTD snapshot [1] of our HEAD kernel? If the problem persists, please report the bug upstream (to Ingo Molnar, with a CC to lkml).
>
> FYI, someone has just raised a very similar-looking issue on LKML: http://lkml.org/lkml/2008/2/26/459

Jiri, thanks for the notice. I've piggybacked my issue onto that thread.

Pete
On Monday, 25 February 2008, Hans-Peter Jansen wrote:
> Since the users also complain about 1-10 second hangs in a terminal-based order management system, I thought it would be a good idea to move the kernel to 2.6.24.1, with all the fancy (I/O) scheduling, engaging BKL, etc. (I just ran rpmbuild on Kernel:/HEAD/openSUSE_Factory/kernel-default-2.6.24.1-35.1 on that system.)
>
> The first tries consistently resulted in an Oops during initrd, similar to:
>
> BUG: unable to handle kernel NULL pointer dereference at virtual address 00000040
> printing eip: c011f96c *pde = 00000000
> Oops: 0000 [#1] SMP
> last sysfs file: /block/sde/sde1/dev
> Modules linked in: sata_sil24 libata 3w_9xxx sd_mod scsi_mod
> Pid: 538, comm: udev Not tainted (2.6.24.1-35.1-default #1)
> EIP: 0060:[<c011f96c>] EFLAGS: 00010046 CPU: 0
> EIP is at pick_next_task_fair+0x15/0x23
> EAX: 00000000 EBX: f75d11f0 ECX: c202cdd0 EDX: 00000000
> ESI: 00000000 EDI: 00000001 EBP: f7481f08 ESP: f7481f08
> DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> Process udev (pid: 538, ti=f7480000 task=f75d11f0 task.ti=f7480000)
> Stack: f7481f2c c02eaeea c202a1cc 00000000 c202cd80 f75d1358 f7481f44 00000001
>        00000001 f7481f9c c02eb95e 00000001 00000000 00000000 c013af1e f7ffc944
>        00000000 00000000 73e8b1d5 00000005 c013aeba c202a1cc 00000001 c02eb953
> Call Trace:
>  [<c02eaeea>] __sched_text_start+0x11a/0x379
>  [<c02eb95e>] do_nanosleep+0x3c/0x67
> =======================
> Code: 39 c8 73 0c 5e 89 f8 5b 5e 5f 5d e9 9b f6 ff ff 5b 5b 5e 5f 5d c3 55 83 c0 34 31 d2 83 78 08 00 89 e5 74 11 e8 15 fe ff ff 89 c2 <8b> 40 40 85 c0 75 f2 83 ea 30 5d 89 d0 c3 55 89 e5 53 89 d3 83
> EIP: [<c011f96c>] pick_next_task_fair+0x15/0x23 SS:ESP 0068:f7481f08
> ---[ end trace 18a67066b954c85e ]---
>
> As it stands, it crashes consistently in pick_next_task_fair, even with an initrd within constraints.
>
> Today I noticed Greg's 2.6.24.3 announcement, with two hrtimer-related fixes, which could fit the picture.

I can confirm that 2.6.24.3 overcomes the reported initrd problem (even using the native mkinitrd/udev setup from 9.3). With a few kernel config touches (NO_HZ disabled, switched to HZ_1000) and a few NFS-related package rebuilds from Factory, it's finally running in production now. If it survives today, I'll feel much better.

Hopefully the latency problems have diminished, but I reported some not-so-funny-looking numbers from a different setup to LKML, gathered with latencytop (which hopefully didn't induce some Heisenberg uncertainty relation problem). Should I report such problems here too, given that 2.6.24.3 isn't easily available for SUSE setups ATM?

OTOH, I still noticed some problems with that kernel on openSUSE 10.2 (probably related to inet interface renaming?), which resulted in:

- a hard freeze immediately after initializing lo in RL 3 (forces a manual reset)
- a failing rename, which left the major device with some obscure ethxx(?) name

Pete
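
The two config touches mentioned in the message above (NO_HZ disabled, HZ_1000 selected) can be verified against a kernel config file before rebooting into a rebuilt kernel. The Python sketch below is only an illustration: the helper names are invented, and where the config text comes from (for example a /boot/config-<release> file, which is a distribution convention) is an assumption left to the caller. CONFIG_NO_HZ and CONFIG_HZ_1000 are the option names used by kernels of this generation.

```python
def parse_kconfig(text):
    """Parse kernel .config text into a dict of option name -> value.

    Lines of the form '# CONFIG_FOO is not set' are recorded as 'n';
    other comments and blank lines are ignored.
    """
    not_set = " is not set"
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(not_set):
            opts[line[2:-len(not_set)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            key, value = line.split("=", 1)
            opts[key] = value
    return opts


def matches_reported_setup(opts):
    """True if tickless mode is off and the 1000 Hz timer tick is selected."""
    return opts.get("CONFIG_NO_HZ", "n") == "n" and opts.get("CONFIG_HZ_1000") == "y"


if __name__ == "__main__":
    # Hypothetical config fragment, for illustration only.
    sample = "# CONFIG_NO_HZ is not set\nCONFIG_HZ_1000=y\nCONFIG_HZ=1000\n"
    print(matches_reported_setup(parse_kconfig(sample)))
```

An option that is missing from the file is treated the same as one that is explicitly not set, which matches how a generated .config drops options whose dependencies are not met.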
participants (2)
- Hans-Peter Jansen
- Jiri Kosina