[Bug 685276] New: Machine crashes and doesn't respond to any commands
https://bugzilla.novell.com/show_bug.cgi?id=685276#c0

Summary: Machine crashes and doesn't respond to any commands
Classification: openSUSE
Product: openSUSE 11.4
Version: Factory
Platform: x86-64
OS/Version: openSUSE 11.4
Status: NEW
Severity: Major
Priority: P5 - None
Component: Xen
AssignedTo: jdouglas@novell.com
ReportedBy: beto.rvs@gmail.com
QAContact: qa@suse.de
Found By: ---
Blocker: ---
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.15) Gecko/20110303 Ubuntu/10.10 (maverick) Firefox/3.6.15

[273411.336945] general protection fault: 0000 [#1] SMP
[273411.336961] last sysfs file: /sys/devices/xen-backend/vif-137-0/uevent
[273411.336966] CPU 3
[273411.336968] Modules linked in: st xt_mac nfs lockd fscache nfs_acl auth_rpcgss sunrpc tun ip6table_filter ip6_tables usbbk gntdev netbk blkbk blkback_pagemap blktap xenbus_be evtchn edd bridge 8021q garp stp llc bonding arptable_filter arp_tables xt_esp ipt_ah xt_physdev xt_multiport xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables sr_mod 8250_pnp usb_storage uas joydev i7core_edac bnx2 dcdbas edac_core domctl sg ses enclosure pcspkr iTCO_wdt button iTCO_vendor_support ghes 8250 serial_core power_meter hed usbhid hid linear uhci_hcd ehci_hcd usbcore dm_snapshot dm_mod xenblk cdrom xennet fan processor thermal thermal_sys hwmon megaraid_sas
[273411.337048]
[273411.337054] Pid: 18798, comm: tapdisk Not tainted 2.6.37.1-1.2-xen #1 0N582M/PowerEdge M610
[273411.337059] RIP: e030:[<ffffffffa0606277>]  [<ffffffffa0606277>] blktap_clear_pte+0xa7/0x320 [blktap]
[273411.337073] RSP: e02b:ffff88015fe99c98  EFLAGS: 00010246
[273411.337077] RAX: 000000000000dead RBX: 0408438348000000 RCX: 0000000000000000
[273411.337086] RDX: 0000000000000000 RSI: 00007f48527b2000 RDI: 0000000000000000
[273411.337091] RBP: ffff880723a57348 R08: ffff88098945b080 R09: 0000000000000000
[273411.337095] R10: 000000000000dead R11: 0000000000000000 R12: ffff880165c87d90
[273411.337100] R13: ffff880165c87d90 R14: ffff880195098c00 R15: 00007f4852800000
[273411.337110] FS:  00007f4852913700(0000) GS:ffff880ba04f6000(0000) knlGS:0000000000000000
[273411.337116] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[273411.337120] CR2: 00007fe7ec00ef30 CR3: 0000000989409000 CR4: 0000000000002660
[273411.337125] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[273411.337130] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[273411.337136] Process tapdisk (pid: 18798, threadinfo ffff88015fe98000, task ffff8801618fc240)
[273411.337141] Stack:
[273411.337144]  ffff880b760c4080 ffffffff8003b9eb ffff880b802b5e98 0000000080114e4a
[273411.337151]  0000000000011210 0000000000000001 ffff88015fe99cf8 ffff880b5323b108
[273411.337158]  00007f48527b2000 ffff880ba04fb5c0 ffff880165c87d90 ffff88015fe99e70
[273411.337165] Call Trace:
[273411.337197]  [<ffffffff800f645a>] zap_pte_range+0x63a/0x700
[273411.337209]  [<ffffffff800f6786>] unmap_page_range+0x266/0x370
[273411.337217]  [<ffffffff800f7141>] unmap_vmas+0x151/0x210
[273411.337225]  [<ffffffff800fc46a>] unmap_region+0xda/0x190
[273411.337233]  [<ffffffff800fd774>] do_munmap+0x1d4/0x2c0
[273411.337241]  [<ffffffff800fe93d>] sys_munmap+0x4d/0x80
[273411.337251]  [<ffffffff80007448>] system_call_fastpath+0x16/0x1b
[273411.337262]  [<00007f4851a7d3b7>] 0x7f4851a7d3b7
[273411.337267] Code: 14 50 48 98 48 8b 04 c7 29 d3 89 da 89 c3 48 c1 e8 10 83 e3 1f 41 89 c2 8d 3c 9b 8d 1c 7b 8d 14 13 4a 8b 1c d5 40 1b 62 a0 89 d7 <4c> 8b 34 fb 48 8b 3d 9e 6c 34 e0 f0 41 80 66 01 fb 49 8b 98 98
[273411.337307] RIP  [<ffffffffa0606277>] blktap_clear_pte+0xa7/0x320 [blktap]
[273411.337315] RSP <ffff88015fe99c98>
[273411.449029] ---[ end trace ee79595f6d977d18 ]---
[273416.737140] device tap-tec7350.0 entered promiscuous mode
[273416.737208] virtbr: port 9(tap-tec7350.0) entering forwarding state
[273416.737226] virtbr: port 9(tap-tec7350.0) entering forwarding state
[273416.836880] vif168.0 renamed to tec7350.0 by ip [31616]
[273416.917711] device tec7350.0 entered promiscuous mode
[273416.924730] virtbr: port 10(tec7350.0) entering forwarding state
[273416.924762] virtbr: port 10(tec7350.0) entering forwarding state
[273427.228937] tap-tec7350.0: no IPv6 routers present
[273427.580043] tec7350.0: no IPv6 routers present
[273473.696493] virtbr: port 9(tap-tec7350.0) entering forwarding state
[273473.756188] device tap-tec7350.0 left promiscuous mode
[273473.756207] virtbr: port 9(tap-tec7350.0) entering disabled state
[273474.598715] virtbr: port 10(tec7350.0) entering forwarding state
[273474.620281] virtbr: port 10(tec7350.0) entering disabled state
[273671.079872] virtbr: port 36(tec7348.0) entering forwarding state
[273671.124188] virtbr: port 36(tec7348.0) entering disabled state
[273672.565099] virtbr: port 35(tap-tec7348.0) entering forwarding state
[273672.608208] device tap-tec7348.0 left promiscuous mode
[273672.608224] virtbr: port 35(tap-tec7348.0) entering disabled state
[273678.656967] device tap-tec7348.0 entered promiscuous mode
[273678.657071] virtbr: port 9(tap-tec7348.0) entering forwarding state
[273678.657087] virtbr: port 9(tap-tec7348.0) entering forwarding state
[273678.788354] vif169.0 renamed to tec7348.0 by ip [2152]
[273678.874670] device tec7348.0 entered promiscuous mode
[273678.879232] virtbr: port 10(tec7348.0) entering forwarding state
[273678.879262] virtbr: port 10(tec7348.0) entering forwarding state
[273689.196022] tap-tec7348.0: no IPv6 routers present
[273689.884028] tec7348.0: no IPv6 routers present
[273714.750569] virtbr: port 9(tap-tec7348.0) entering forwarding state
[273714.784189] device tap-tec7348.0 left promiscuous mode
[273714.784206] virtbr: port 9(tap-tec7348.0) entering disabled state
[273715.492106] virtbr: port 10(tec7348.0) entering forwarding state
[273715.512182] virtbr: port 10(tec7348.0) entering disabled state
[275053.000282] virtbr: port 85(tec7282.0) entering forwarding state
[275053.020396] virtbr: port 85(tec7282.0) entering disabled state
[275056.393867] device tap-tec7282.0 entered promiscuous mode
[275056.394015] virtbr: port 9(tap-tec7282.0) entering forwarding state
[275056.394015] virtbr: port 9(tap-tec7282.0) entering forwarding state
[275056.508614] vif170.0 renamed to tec7282.0 by ip [8135]
[275056.573149] device tec7282.0 entered promiscuous mode
[275056.577493] virtbr: port 10(tec7282.0) entering forwarding state
[275056.577524] virtbr: port 10(tec7282.0) entering forwarding state
[275066.900024] tap-tec7282.0: no IPv6 routers present
[275067.100069] tec7282.0: no IPv6 routers present
[275092.476831] virtbr: port 9(tap-tec7282.0) entering forwarding state
[275092.512158] device tap-tec7282.0 left promiscuous mode
[275092.512176] virtbr: port 9(tap-tec7282.0) entering disabled state
[275093.165007] virtbr: port 10(tec7282.0) entering forwarding state
[275093.176175] virtbr: port 10(tec7282.0) entering disabled state

Reproducible: Sometimes

Steps to Reproduce:
1. Start more than 40 VMs
2. Reboot all the VMs after 3 hours
3.

Actual Results: Need configure a iptables for any VM.
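For reference, the reproduction described above boils down to something like the following shell loop (a sketch; the /etc/xen/vm/tec* config paths are an assumption based on the domain names in the log, not taken from the report):

    # start the >40 guests, then cycle them after 3 hours (sketch, paths assumed)
    for cfg in /etc/xen/vm/tec*; do
        xm create "$cfg"
    done
    sleep $((3 * 3600))
    # reboot every running guest except Dom0
    for vm in $(xm list | awk 'NR>1 && $1!="Domain-0" {print $1}'); do
        xm reboot "$vm"
    done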
https://bugzilla.novell.com/show_bug.cgi?id=685276#c
--- Comment from Charles Arnold
https://bugzilla.novell.com/show_bug.cgi?id=685276#c1
--- Comment #1 from Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=685276#c2
--- Comment #2 from Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=685276#c3
--- Comment #3 from Roberto Scudeller
https://bugzilla.novell.com/show_bug.cgi?id=685276#c4
--- Comment #4 from Roberto Scudeller
https://bugzilla.novell.com/show_bug.cgi?id=685276#c5
--- Comment #5 from Roberto Scudeller
https://bugzilla.novell.com/show_bug.cgi?id=685276#c6
--- Comment #6 from Roberto Scudeller
> Also, how important are the exact numbers you specified above (40 VMs, 3 hours) for reproducing the problem?
> Finally, I'm not clear what "Need configure a iptables for any VM" is supposed to tell us.
I run 53 VMs and have a cron job that issues "xm reboot" for these VMs every 3 hours. iptables is used for an anti-spoofing filter and basic rules (for example, allowing TCP ports >1024, stateful). Thanks for your help.
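For context, an anti-spoofing plus stateful high-ports rule set of the kind described above might look like this (a sketch; the bridge port name and guest IP are hypothetical, not the reporter's actual rules):

    # drop bridged frames from the guest's port that don't carry its assigned IP
    iptables -A FORWARD -m physdev --physdev-in tec7350.0 ! -s 192.168.1.50 -j DROP
    # stateful baseline: let reply traffic through
    iptables -A FORWARD -m state --state ESTABLISHED,RELATED -j ACCEPT
    # allow new TCP connections to unprivileged ports (>1024)
    iptables -A FORWARD -p tcp --dport 1025:65535 -m state --state NEW -j ACCEPT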
https://bugzilla.novell.com/show_bug.cgi?id=685276#c7
--- Comment #7 from Jan Beulich
> I run 53 VMs and have a cron job that issues "xm reboot" for these VMs every 3 hours.
That doesn't answer the question: Is it important to have 40 (or 53, or any other particular number of) VMs, or is this reproducible also with just a single VM? Similarly - does rebooting after exactly 3 hours really matter? Is the rebooting part of the description relevant at all?
> iptables is used for an anti-spoofing filter and basic rules (for example, allowing TCP ports >1024, stateful).
Again, for me this doesn't answer the question: it's still unclear what relation, if any, iptables has to a crash in the blktap driver. Finally, I'm still waiting for the hypervisor log (#4 and #5 provide xend logs, which we are unlikely to need here).
https://bugzilla.novell.com/show_bug.cgi?id=685276#c
--- Comment from Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=685276#c8
--- Comment #8 from Roberto Scudeller
(In reply to comment #6)
> > I run 53 VMs and have a cron job that issues "xm reboot" for these VMs every 3 hours.
> That doesn't answer the question: Is it important to have 40 (or 53, or any other particular number of) VMs, or is this reproducible also with just a single VM? Similarly - does rebooting after exactly 3 hours really matter? Is the rebooting part of the description relevant at all?
The number of VMs doesn't matter; as long as I keep the load average low, the problem doesn't appear. When I run numerous VMs with a high load average (for example, Linux VMs running CPU and I/O benchmarks), the bug appears.
> > iptables is used for an anti-spoofing filter and basic rules (for example, allowing TCP ports >1024, stateful).
> Again, for me this doesn't answer the question: it's still unclear what relation, if any, iptables has to a crash in the blktap driver.
> Finally, I'm still waiting for the hypervisor log (#4 and #5 provide xend logs, which we are unlikely to need here).
What other "hypervisor log" are needed? I send messages, xend-debug and xend.log. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=685276#c
--- Comment from Roberto Scudeller
https://bugzilla.novell.com/show_bug.cgi?id=685276#c9
--- Comment #9 from Jan Beulich
> The number of VMs doesn't matter; as long as I keep the load average low, the problem doesn't appear. When I run numerous VMs with a high load average (for example, Linux VMs running CPU and I/O benchmarks), the bug appears.
With this I'd think that running Dom0 and the guest(s) on distinct sets of physical CPUs should be a usable workaround.
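A sketch of that workaround with the tools used in this report (CPU numbers are hypothetical and assume a 16-CPU host with Dom0 confined to CPUs 0-3):

    # Xen boot options (menu.lst): give Dom0 four dedicated, pinned vCPUs
    #   kernel /boot/xen.gz dom0_max_vcpus=4 dom0_vcpus_pin ...
    # keep each guest off Dom0's CPUs, either in the DomU config file:
    #   cpus = "4-15"
    # or at runtime for an already-running guest (here: all vCPUs of tec7350):
    xm vcpu-pin tec7350 all 4-15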
What other "hypervisor log" are needed? I send messages, xend-debug and xend.log.
Either the data collected on the serial console, or the output of "xm dmesg" (if the Dom0 kernel is still usable after the oops).
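For reference, capturing both of those usually means booting Xen with serial logging enabled; a sketch with hypothetical menu.lst settings for COM1 at 115200 baud:

    # menu.lst: make the hypervisor log to the serial line, so crash state
    # survives even a complete Dom0 death
    #   kernel /boot/xen.gz com1=115200,8n1 console=com1,vga loglevel=all ...
    # while Dom0 is still alive, the same hypervisor log is readable with:
    xm dmesg
    # on the serial console itself, Ctrl-A (three times) switches input to the
    # hypervisor, after which '0' and 'd' dump Dom0 state and CPU registers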
https://bugzilla.novell.com/show_bug.cgi?id=685276#c10
--- Comment #10 from Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=685276#c11
--- Comment #11 from Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=685276#c12
--- Comment #12 from Roberto Scudeller
> Would you be able to give this a try in a build of your own, or do you depend on us providing you with a test kernel package?
I tried to apply this patch, but this error appeared:

/usr/src/linux # patch -p1
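When a patch refuses to apply, a dry run against the tree shows which hunks fail without touching anything (a sketch; fix.patch is a hypothetical name for the patch from the comment above):

    cd /usr/src/linux
    # report what would happen without modifying any files
    patch -p1 --dry-run < fix.patch
    # apply for real once the dry run is clean
    patch -p1 < fix.patch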
https://bugzilla.novell.com/show_bug.cgi?id=685276#c13
--- Comment #13 from Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=685276#c14
--- Comment #14 from Roberto Scudeller
> Presumably those sources are too old. Try the ones under ftp://ftp.suse.com/pub/projects/kernel/kotd/openSUSE-11.4/.
I tried a new test with this patch, but Dom0 stopped responding after I started 10 VMs. Each VM runs two benchmarks, I/O and CPU. Dom0 froze and didn't respond to any commands; I rebooted it after waiting 30 minutes. My DomUs use tapdisk2 for their disk configuration. Example (xm list --long):

(device
    (tap2
        (uuid 35a301e3-b67e-64e7-787f-ad773e6b63b4)
        (bootable 0)
        (dev xvda:disk)
        (uname tap:tapdisk:aio:/nfs/vm-test/xvda)
        (mode w)
        (backend 0)
    )
)

I attached the last 300 lines of my messages log. Do you need more information? Thanks.
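For reference, that s-expression typically corresponds to a disk line like the following in the xm domain config file (a sketch; the uname string is copied from the listing above, but whether the prefix should be tap: or tap2: depends on the toolstack version):

    # blktap2-backed disk for the DomU (sketch)
    disk = [ 'tap:tapdisk:aio:/nfs/vm-test/xvda,xvda,w' ]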
https://bugzilla.novell.com/show_bug.cgi?id=685276#c15
--- Comment #15 from Roberto Scudeller
https://bugzilla.novell.com/show_bug.cgi?id=685276#c16
--- Comment #16 from Roberto Scudeller
Created an attachment (id=424721) --> (http://bugzilla.novell.com/attachment.cgi?id=424721): last 300 lines of the messages log.
I recompiled the kernel from kernel-source-2.6.37.6-0.0.17.fbf0cf7.src.rpm without this fix, and the problem continues. During these "Dom0 deaths" nothing is printed to any log. Dom0 freezes and doesn't respond to any xm or xl commands. After 15 minutes, ps, top and tail (on the log files) don't respond either. What information is needed?
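For reference, rebuilding that source RPM usually looks roughly like this (a sketch; the spec file name and build directory follow common SUSE kernel packaging of that era and are assumptions):

    # install the source RPM, then build the Xen flavor
    rpm -ivh kernel-source-2.6.37.6-0.0.17.fbf0cf7.src.rpm
    cd /usr/src/packages/SPECS
    # drop or reverse the suspect patch here before building, then:
    rpmbuild -bb kernel-xen.spec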
https://bugzilla.novell.com/show_bug.cgi?id=685276#c17
--- Comment #17 from Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=685276#c18
--- Comment #18 from Roberto Scudeller
> Your statement is sort of inconsistent - if Dom0 freezes, you can't even issue any commands anymore. So the question is what state your system really is in.
Sorry for the incomplete information. When I start 10 VMs (running benchmarks), after a while the xm or xl commands stop responding, but top and ps still respond for a few minutes. After that, Dom0 doesn't respond to any commands.
> If Dom0 is indeed dead, obtaining state through Xen's serial console is going to be the only option (send '0' and/or 'd' as a first step).
> If Dom0 is only partially unusable, SysRq-t may also provide insight on the hung process(es).
> Also I'm assuming you assigned this bug to yourself in error; I'm reverting this.
I don't get any logs. Dom0 dies silently.
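For completeness, the SysRq-t dump Jan suggests can also be requested from within Dom0 while it is still partially responsive (a sketch; assumes a shell is still usable):

    # enable the magic SysRq key, then dump all task states to the kernel log
    echo 1 > /proc/sys/kernel/sysrq
    echo t > /proc/sysrq-trigger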
https://bugzilla.novell.com/show_bug.cgi?id=685276#c19
--- Comment #19 from Jan Beulich
> I recompiled the kernel from kernel-source-2.6.37.6-0.0.17.fbf0cf7.src.rpm without this fix, and the problem continues.
Did you also try using the kernel binary RPM from the same URL as-is (to exclude problems specific to your rebuild)?
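A sketch of that suggestion (the exact file name and directory layout under the KOTD tree are assumptions, not taken from this report):

    # fetch and install the prebuilt KOTD Xen kernel instead of a local rebuild
    wget ftp://ftp.suse.com/pub/projects/kernel/kotd/openSUSE-11.4/x86_64/kernel-xen-2.6.37.6-0.0.17.x86_64.rpm
    rpm -ivh --oldpackage kernel-xen-2.6.37.6-0.0.17.x86_64.rpm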
https://bugzilla.novell.com/show_bug.cgi?id=685276#c20
--- Comment #20 from Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=685276#c21
--- Comment #21 from Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=685276#c22
--- Comment #22 from Marcus Meissner
https://bugzilla.novell.com/show_bug.cgi?id=685276#c23
--- Comment #23 from Swamp Workflow Management
https://bugzilla.novell.com/show_bug.cgi?id=685276#c24
--- Comment #24 from Swamp Workflow Management
https://bugzilla.novell.com/show_bug.cgi?id=685276#c25
--- Comment #25 from Swamp Workflow Management
https://bugzilla.novell.com/show_bug.cgi?id=685276#c26
--- Comment #26 from Swamp Workflow Management
https://bugzilla.novell.com/show_bug.cgi?id=685276#c27
--- Comment #27 from Swamp Workflow Management
https://bugzilla.novell.com/show_bug.cgi?id=685276#c28
--- Comment #28 from Swamp Workflow Management
https://bugzilla.novell.com/show_bug.cgi?id=685276#c29
--- Comment #29 from Swamp Workflow Management
https://bugzilla.novell.com/show_bug.cgi?id=685276#c30
--- Comment #30 from Swamp Workflow Management
https://bugzilla.novell.com/show_bug.cgi?id=685276#c31
--- Comment #31 from Swamp Workflow Management
https://bugzilla.novell.com/show_bug.cgi?id=685276#c32
--- Comment #32 from Swamp Workflow Management
https://bugzilla.novell.com/show_bug.cgi?id=685276#c33
--- Comment #33 from Swamp Workflow Management
https://bugzilla.novell.com/show_bug.cgi?id=685276#c34
--- Comment #34 from Marcus Meissner
https://bugzilla.novell.com/show_bug.cgi?id=685276#c35
--- Comment #35 from Swamp Workflow Management
https://bugzilla.novell.com/show_bug.cgi?id=685276#c36
--- Comment #36 from Swamp Workflow Management
https://bugzilla.novell.com/show_bug.cgi?id=685276#c37
--- Comment #37 from Swamp Workflow Management
https://bugzilla.novell.com/show_bug.cgi?id=685276#c38
--- Comment #38 from Swamp Workflow Management
https://bugzilla.novell.com/show_bug.cgi?id=685276#c39
--- Comment #39 from Swamp Workflow Management
https://bugzilla.novell.com/show_bug.cgi?id=685276#c
--- Comment from Swamp Workflow Management
https://bugzilla.novell.com/show_bug.cgi?id=685276#c40
--- Comment #40 from Swamp Workflow Management
https://bugzilla.novell.com/show_bug.cgi?id=685276#c41
--- Comment #41 from Swamp Workflow Management