[Bug 224778] New: softirq.c#cpu_callback BUG_ON
https://bugzilla.novell.com/show_bug.cgi?id=224778

           Summary: softirq.c#cpu_callback BUG_ON
           Product: openSUSE 10.2
           Version: RC 4
          Platform: Other
        OS/Version: Other
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: Kernel
        AssignedTo: kernel-maintainers@forge.provo.novell.com
        ReportedBy: dhecht@vmware.com
         QAContact: qa@suse.de

While installing SuSE 10.2 alpha4 32-bit release on VMware, we hit the
following kernel BUG_ON:

<0>kernel BUG at kernel/softirq.c:577!

We've confirmed the bug is still in RC1.

Note that this was also reported in:
https://bugzilla.novell.com/show_bug.cgi?id=210931.

Also note that the kernel race leading to this BUG_ON is not specific to
running in a VM (granted, it may be hard to reproduce on native hardware).

The race is between the init path and the timer interrupt. Before init thread
completes spawn_ksoftirqd(), the timer interrupt fires, calling
update_process_times -> rcu_check_callbacks -> tasklet_schedule ->
__tasklet_schedule, which does __get_cpu_var(tasklet_vec).list = t. This
causes spawn_ksoftirqd -> cpu_callback to BUG_ON since the tasklet_vec list is
no longer empty.

I'm not sure who is using RCU so early to cause rcu_pending() to return true
before ksoftirqd is spawned.

Detailed description of the race:

The BUG_ON encountered is:

<4>CPU0: AMD Dual Core AMD Opteron(tm) Processor 275 stepping 02
<6>Total of 1 processors activated (4409.57 BogoMIPS).
<4>ENABLING IO-APIC IRQs
<6>..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
<0>------------[ cut here ]------------
<0>kernel BUG at kernel/softirq.c:577!
<0>invalid opcode: 0000 [#1]
<0>SMP
<0>last sysfs file:
<4>Modules linked in:
<0>CPU: 0
<4>EIP: 0060:[<c0124a6d>] Not tainted VLI
<4>EFLAGS: 00010286 (2.6.18-9-default #1)
<0>EIP is at cpu_callback+0x45/0x238
<0>eax: c03e606c ebx: 00000000 ecx: 00000000 edx: 00e1f100
<0>esi: 00000000 edi: 00000000 ebp: 00000000 esp: c1263f84
<0>ds: 007b es: 007b ss: 0068
<0>Process swapper (pid: 1, ti=c1262000 task=c12615f0 task.ti=c1262000)
<0>Stack: c01002fc 00000000 00000000 00000000 00000000 00000000 c03c5a71 c01002fc
<0> c0100344 c12615f0 c12042e0 c03b1fcc c0103ca6 00000202 c01002fc 00000000
<0> 00000000 00000000 00000000 00000000 00000000 0000007b c01002fc 00000000
<0>Call Trace:
<4> [<c03c5a71>] spawn_ksoftirqd+0x1c/0x3b
<4> [<c0100344>] init+0x48/0x2bc
<4> [<c0102005>] kernel_thread_helper+0x5/0xb
<4>DWARF2 unwinder stuck at kernel_thread_helper+0x5/0xb
<4>Leftover inexact backtrace:
<0>Code: 00 00 83 fa 04 0f 84 ae 00 00 00 83 fa 07 0f 85 ff 01 00 00 e9 d3 00 00 00 8b 14 8d 80 cb 35 c0 b8 6c 60 3e c0 83 3c 10 00 74 08 <0f> 0b 41 02 66 cc 2c c0 b8 70 60 3e c0 83 3c 10 00 74 08 0f 0b
<0>EIP: [<c0124a6d>] cpu_callback+0x45/0x238 SS:ESP 0068:c1263f84
<4>
<0>Kernel panic - not syncing: Attempted to kill init!
<4>

The BUG_ON statement hit is in kernel/softirq.c#cpu_callback:

BUG_ON(per_cpu(tasklet_vec, hotcpu).list);

We debugged this further to confirm it is a kernel race. We instrumented all
the places that prepend to tasklet_vec[cpu].list with a BUG_ON checking that
cpu_callback(...CPU_UP_PREPARE) had already had a chance to execute, and found
that the path that wins the race to update tasklet_vec[cpu].list is:

checking if image is initramfs... it is
------------[ cut here ]------------
kernel BUG at kernel/softirq.c:356!
invalid opcode: 0000 [#1]
SMP
last sysfs file:
Modules linked in:
CPU: 0
EIP: 0060:[<c012539b>] Not tainted VLI
EFLAGS: 00010046 (2.6.18-9-vanilla #1)
EIP is at __tasklet_schedule+0x41/0x8c
eax: c03de078 ebx: 00000000 ecx: c03de06c edx: 00e27100
esi: cfb82000 edi: 00000046 ebp: 00000000 esp: cfb83784
ds: 007b es: 007b ss: 0068
Process swapper (pid: 1, ti=cfb82000 task=cfb815f0 task.ti=cfb82000)
Stack: cfb815f0 00000000 00000000 c0128f72 cfb837fc 00000000 00000000 c0107871
 c02fca20 c0146fba cfb837fc c034ce28 c034ce00 00000000 cfb83868 c0147075
 000000af cfb837fc c02fca20 00000000 cfb837fc 00000000 cfb83868 c0106966
Call Trace:
 [<c0128f72>] update_process_times+0x4d/0x5c
 [<c0107871>] timer_interrupt+0x4b/0x72
 [<c0146fba>] handle_IRQ_event+0x23/0x49
 [<c0147075>] __do_IRQ+0x95/0xee
 [<c0106966>] do_IRQ+0x71/0x83
 [<c0104e1a>] common_interrupt+0x1a/0x20
DWARF2 unwinder stuck at common_interrupt+0x1a/0x20
Leftover inexact backtrace:
 [<c017332e>] do_path_lookup+0x106/0x25f
 [<c017214d>] getname+0x59/0xb0
 [<c0173bfb>] __user_walk_fd+0x2f/0x40
 [<c016d6eb>] vfs_lstat_fd+0x16/0x3d
 [<c016d726>] sys_newlstat+0x14/0x28
 [<c03b1ef2>] clean_path+0x19/0x4e
 [<c03b2d8b>] do_header+0x1a9/0x1b3
 [<c03b27bd>] do_name+0x7f/0x1c2
 [<c03b1a1b>] write_buffer+0x1a/0x28
 [<c03b1a8a>] flush_window+0x61/0xaf
 [<c03b1e6a>] inflate_codes+0x392/0x3f7
 [<c03b327d>] inflate_dynamic+0x4e8/0x548
 [<c03b37c0>] unpack_to_rootfs+0x4e3/0x8dc
 [<c01002fc>] init+0x0/0x2bc
 [<c01002fc>] init+0x0/0x2bc
 [<c03b3c34>] populate_rootfs+0x7b/0xe2
 [<c01002fc>] init+0x0/0x2bc
 [<c01002fc>] init+0x0/0x2bc
 [<c010032b>] init+0x2f/0x2bc
 [<c0103ca6>] ret_from_fork+0x6/0x20
 [<c01002fc>] init+0x0/0x2bc
 [<c01002fc>] init+0x0/0x2bc
 [<c0102005>] kernel_thread_helper+0x5/0xb
Code: 8b 14 9d 80 4b 35 c0 8b 14 11 89 10 8b 14 9d 80 4b 35 c0 89 04 11 8b 56 10 b8 78 e0 3d c0 8b 14 95 80 4b 35 c0 83 3c 10 00 75 08 <0f> 0b 64 01 94 59 2c c0 b8 80 c3 3d c0 83 0c 10 20 89 e2 81 e2
EIP: [<c012539b>] __tasklet_schedule+0x41/0x8c SS:ESP 0068:cfb83784
<0>Kernel panic - not syncing: Fatal exception in interrupt

Note that 0xc0128f72 is the instruction after the call to rcu_check_callbacks
from update_process_times. So, it is rcu_check_callbacks that is calling:

tasklet_schedule(&per_cpu(rcu_tasklet, cpu));

before spawn_ksoftirqd() had a chance to execute, leading to the original
BUG_ON.

-- 
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
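
The ordering problem described in the report can be illustrated with a small userspace model. This is only a sketch: the struct and function names below (cpu_state, model_timer_tick, model_cpu_up_prepare, model_boot) are hypothetical stand-ins and not kernel code, and the model returns false at the point where the kernel would hit the BUG_ON instead of crashing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel objects named in the report. */
struct tasklet {
    struct tasklet *next;
};

struct cpu_state {
    struct tasklet *tasklet_list; /* models per_cpu(tasklet_vec, cpu).list */
    bool ksoftirqd_spawned;       /* set once the CPU_UP_PREPARE step has run */
};

/* Models __tasklet_schedule(): prepend t to the per-CPU list. */
static void model_tasklet_schedule(struct cpu_state *cpu, struct tasklet *t)
{
    assert(cpu != NULL && t != NULL);
    t->next = cpu->tasklet_list;
    cpu->tasklet_list = t;
}

/* Models the tick path: update_process_times -> rcu_check_callbacks
 * -> tasklet_schedule, taken only when RCU has work pending. */
static void model_timer_tick(struct cpu_state *cpu, struct tasklet *rcu_tasklet,
                             bool rcu_pending)
{
    if (rcu_pending)
        model_tasklet_schedule(cpu, rcu_tasklet);
}

/* Models cpu_callback(CPU_UP_PREPARE) with the emptiness check.
 * Returns false where the kernel would hit the BUG_ON. */
static bool model_cpu_up_prepare(struct cpu_state *cpu)
{
    if (cpu->tasklet_list != NULL)
        return false;
    cpu->ksoftirqd_spawned = true;
    return true;
}

/* Runs one boot ordering; tick_first selects whether the timer
 * interrupt wins the race against spawn_ksoftirqd(). */
static bool model_boot(bool tick_first)
{
    struct cpu_state cpu = { NULL, false };
    struct tasklet rcu_tasklet = { NULL };
    bool ok;

    if (tick_first)
        model_timer_tick(&cpu, &rcu_tasklet, true);
    ok = model_cpu_up_prepare(&cpu);
    if (!tick_first)
        model_timer_tick(&cpu, &rcu_tasklet, true);
    return ok;
}
```

Calling model_boot(false) models the ordering in which spawn_ksoftirqd() runs before the first tick that has RCU work pending, so the check passes. model_boot(true) models the ordering from the second trace, in which the list is already non-empty when the check runs.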
https://bugzilla.novell.com/show_bug.cgi?id=224778

------- Comment #1 from zach@vmware.com 2006-12-01 00:57 MST -------
Created an attachment (id=107805)
 --> (https://bugzilla.novell.com/attachment.cgi?id=107805&action=view)
Fix for softirq race

This is the fix Andrew Morton accepted for 2.6.19-stable. It would be great to
apply it to OpenSUSE, as the bug affects all virtual machine vendors as well as
potentially occurring on native hardware. The fix has been confirmed to work
fine. It is also trivially reviewable and unable to introduce a regression, as
it just deprecates some broken BUG_ON statements.

http://lkml.org/lkml/2006/11/30/192
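
Comment #1 describes the fix as deprecating some broken BUG_ON statements. The following userspace sketch assumes this means the emptiness check is simply dropped from the CPU_UP_PREPARE step; the attachment itself is not reproduced here. All names (cpu_state, model_cpu_up_prepare_fixed, model_run_tasklets, model_boot_fixed) are hypothetical. The sketch shows why a tasklet queued before ksoftirqd exists is harmless in this model: it stays on the list and is run on the next processing pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; not kernel code. */
struct tasklet {
    struct tasklet *next;
    int runs; /* how many times this tasklet's handler has run */
};

struct cpu_state {
    struct tasklet *tasklet_list; /* models per_cpu(tasklet_vec, cpu).list */
    bool ksoftirqd_spawned;
};

/* Models __tasklet_schedule(): prepend t to the per-CPU list. */
static void model_tasklet_schedule(struct cpu_state *cpu, struct tasklet *t)
{
    assert(cpu != NULL && t != NULL);
    t->next = cpu->tasklet_list;
    cpu->tasklet_list = t;
}

/* Assumed shape of the fixed CPU_UP_PREPARE step: no emptiness check,
 * so an already-queued tasklet is left in place. */
static void model_cpu_up_prepare_fixed(struct cpu_state *cpu)
{
    cpu->ksoftirqd_spawned = true;
}

/* Models one tasklet-processing pass: detach the whole list, then run
 * each entry. Returns the number of tasklets that were run. */
static int model_run_tasklets(struct cpu_state *cpu)
{
    struct tasklet *list = cpu->tasklet_list;
    int ran = 0;

    cpu->tasklet_list = NULL;
    while (list != NULL) {
        struct tasklet *t = list;

        list = list->next;
        t->next = NULL;
        t->runs++;
        ran++;
    }
    return ran;
}

/* Boot ordering in which the timer tick wins the race. Returns the
 * number of tasklets run, or -1 if the early tasklet was lost. */
static int model_boot_fixed(void)
{
    struct cpu_state cpu = { NULL, false };
    struct tasklet rcu_tasklet = { NULL, 0 };
    int ran;

    model_tasklet_schedule(&cpu, &rcu_tasklet); /* tick fires first */
    model_cpu_up_prepare_fixed(&cpu);           /* no check, no crash */
    ran = model_run_tasklets(&cpu);             /* pending work is drained */

    if (cpu.tasklet_list != NULL || rcu_tasklet.runs != 1)
        return -1;
    return ran;
}
```

In this model model_boot_fixed() reports that exactly one tasklet was run. The model does not distinguish between tasklets run on interrupt exit and tasklets run by ksoftirqd.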
https://bugzilla.novell.com/show_bug.cgi?id=224778

------- Comment #2 from zach@vmware.com 2006-12-01 01:06 MST -------
See also bug 210931 which was marked closed/invalid, and is related to this
bug. It also affects multiple virtual machine vendors including Xen, as
reported on LKML:

http://lkml.org/lkml/2006/10/19/55
https://bugzilla.novell.com/show_bug.cgi?id=210931
https://bugzilla.novell.com/show_bug.cgi?id=224778

gregkh@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #107805|application/octet-stream    |text/plain
          mime type|                            |
https://bugzilla.novell.com/show_bug.cgi?id=224778

gregkh@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO
      Info Provider|                            |zach@vmware.com

------- Comment #3 from gregkh@novell.com 2006-12-01 09:43 MST -------
Any idea why we aren't seeing this on our Xen virtual machines?

And is this patch in 2.6.19? Should it also be included in 2.6.18-stable?
https://bugzilla.novell.com/show_bug.cgi?id=224778

gregkh@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |madjari@inode.at

------- Comment #4 from gregkh@novell.com 2006-12-01 09:45 MST -------
*** Bug 210931 has been marked as a duplicate of this bug. ***
https://bugzilla.novell.com/show_bug.cgi?id=224778

zach@vmware.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |zach@vmware.com

------- Comment #5 from zach@vmware.com 2006-12-01 10:17 MST -------
(In reply to comment #3)
> Any idea why we aren't seeing this on our Xen virtual machines?
The timing conditions are different. To reproduce it more easily, you can
increase the size of the compressed init ramdisk. This causes more time to be
spent decompressing during boot, which widens the window in which timer
interrupts are allowed before ksoftirqd is started. Eventually, within that
window, you get a timer interrupt that schedules an RCU callback via a tasklet.
The ksoftirqd startup code then hits this BUG_ON, even though there is actually
no bug.
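
The effect of a longer decompression phase can be put into rough numbers. The helper below is a hypothetical sketch, not kernel code: it assumes a purely periodic timer with ticks at multiples of period_us and counts how many of them fall inside a window of a given length. The 4000 microsecond period used in the note afterwards corresponds to an assumed HZ=250 and is not taken from the report.

```c
#include <assert.h>
#include <stdint.h>

/* Counts periodic timer ticks (at 0, period_us, 2*period_us, ...) that land
 * in the half-open window [start_us, start_us + len_us). The window stands
 * for the time between interrupts being enabled and ksoftirqd being spawned. */
static uint64_t ticks_in_window(uint64_t start_us, uint64_t len_us,
                                uint64_t period_us)
{
    uint64_t end_us = start_us + len_us;
    uint64_t before_end;
    uint64_t before_start;

    assert(period_us > 0);

    /* The number of tick times strictly below x is ceil(x / period_us). */
    before_end = (end_us + period_us - 1) / period_us;
    before_start = (start_us + period_us - 1) / period_us;
    return before_end - before_start;
}
```

With a 4000 microsecond period, a 2000 microsecond window starting at 1000 microseconds contains no tick at all, while a 50000 microsecond window starting at the same point contains 12. Only a tick on which rcu_pending() returns true would actually queue the tasklet, so this count is an upper bound on the chances to hit the race, not a probability.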
I believe S. Caglar Onur
> And is this patch in 2.6.19? Should it also be included in 2.6.18-stable?
It is in -mm right now, probably going into 2.6.19.1; it should be in
2.6.18-stable if that is still being maintained. (Apologies again, I don't know
what the expiration policy for -stable is on a new release.)
https://bugzilla.novell.com/show_bug.cgi?id=224778

zach@vmware.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |ASSIGNED
      Info Provider|zach@vmware.com             |

------- Comment #6 from zach@vmware.com 2006-12-01 10:18 MST -------
Pong
https://bugzilla.novell.com/show_bug.cgi?id=224778

gregkh@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|kernel-                     |gregkh@novell.com
                   |maintainers@forge.provo.nove|
                   |ll.com                      |

------- Comment #7 from gregkh@novell.com 2006-12-01 10:31 MST -------
The -stable maintainers will do at least one more release for 2.6.18-stable, so
please send it to them.

I'll add this to our tree and see what happens...
https://bugzilla.novell.com/show_bug.cgi?id=224778

gregkh@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|gregkh@novell.com           |ak@novell.com
             Status|ASSIGNED                    |NEW

------- Comment #8 from gregkh@novell.com 2006-12-01 10:37 MST -------
Oops, andi already beat me to this. I'll just reassign it to him, as he's
handled it by checking it into our tree.
https://bugzilla.novell.com/show_bug.cgi?id=224778

srihan@vmware.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |srihan@vmware.com

------- Comment #9 from srihan@vmware.com 2006-12-15 17:12 MST -------
Looks like this didn't make it into OpenSUSE 10.2. Can we expect it to be in
OpenSUSE 10.3?
https://bugzilla.novell.com/show_bug.cgi?id=224778

------- Comment #10 from gregkh@novell.com 2006-12-18 12:35 MST -------
As 10.3 will be based on a newer upstream kernel.org kernel, yes, it will be in
there.

And also, this fix should be in the next 10.2 kernel update, which should be
coming soon (pending testing from our QA group...)
https://bugzilla.novell.com/show_bug.cgi?id=224778

ak@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

------- Comment #11 from ak@novell.com 2007-01-10 05:42 MST -------
Fixed in the current 10.2 tree