[Bug 825510] New: "shutdown -r now" fails @ Xen 4.3.0 Dom0 -- systems halts, but does not restart
https://bugzilla.novell.com/show_bug.cgi?id=825510
https://bugzilla.novell.com/show_bug.cgi?id=825510#c0

Summary: "shutdown -r now" fails @ Xen 4.3.0 Dom0 -- systems halts, but does not restart
Classification: openSUSE
Product: openSUSE 12.3
Version: Final
Platform: x86-64
OS/Version: openSUSE 12.3
Status: NEW
Severity: Major
Priority: P5 - None
Component: Xen
AssignedTo: jdouglas@suse.com
ReportedBy: ar16@imapmail.org
QAContact: qa-bugs@suse.de
Found By: ---
Blocker: ---

User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:21.0) Gecko/20100101 Firefox/21.0

per advice @ https://bugzilla.novell.com/show_bug.cgi?id=825501#c1, splitting this here,

I'm running Xen 4.3.0 on openSUSE 12.3,

rpm -qa | grep -i ^xen
xen-devel-4.3.0_03-251.1.x86_64
xen-4.3.0_03-251.1.x86_64
xen-tools-4.3.0_03-251.1.x86_64
xen-libs-4.3.0_03-251.1.x86_64

lsb_release -a
LSB Version: n/a
Distributor ID: openSUSE project
Description: openSUSE 12.3 (x86_64)
Release: 12.3
Codename: Dartmouth

uname -rm
3.7.10-1.11-xen x86_64

if booted @ kernel-xen Dom0, exec'ing a `shutdown -r now` fails -- the system halts, but does not restart.

if booted @ kernel-default, seems to work as expected.

I'd tested this system and it was working without such problems ~ 2 weeks ago. I suspect, but have not yet tracked down, an update within that timeframe.

I've attached systemd status & dmesg after boot @ the referenced bug,

https://bugzilla.novell.com/attachment.cgi?id=544545

Not clear if that diagnostic is sufficient. Can provide add'l info @ request.

Reproducible: Always

Steps to Reproduce:
1.
2.
3.

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=825510
https://bugzilla.novell.com/show_bug.cgi?id=825510#c2
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=825510
https://bugzilla.novell.com/show_bug.cgi?id=825510#c3
A R
https://bugzilla.novell.com/show_bug.cgi?id=825510
https://bugzilla.novell.com/show_bug.cgi?id=825510#c4
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=825510
https://bugzilla.novell.com/show_bug.cgi?id=825510#c5
A R
we'd need a serial log taken of the shutdown operation, so we can see whether kernel or hypervisor crash in any way, or how far shutdown proceeds.
@ `shutdown -r now`, tail of what i *think* is needed/relevant at serial console output:

---------------------------
...
[ OK ] Reached target Unmount All Filesystems.
[ OK ] Stopped target Local File Systems (Pre).
Stopping Remount Root and Kernel File Systems...
[ OK ] Stopped Remount Root and Kernel File Systems.
Starting Save Random Seed...
Starting Update UTMP about System Shutdown...
Stopping Replay Read-Ahead Data...
[ OK ] Stopped Replay Read-Ahead Data.
Stopping Collect Read-Ahead Data...
[ OK ] Stopped Collect Read-Ahead Data.
Stopping LSB: Start LVM2...
[ OK ] Started Save Random Seed.
[ OK ] Started Update UTMP about System Shutdown.
[ OK ] Stopped LSB: Start LVM2.
Stopping LSB: Multiple Device RAID...
[ OK ] Stopped LSB: Multiple Device RAID.
[ OK ] Reached target Shutdown.
(XEN) [2013-06-27 18:04:39] mm.c:618:d0 Could not get page ref for pfn fec00
(XEN) [2013-06-27 18:04:39] mm.c:618:d0 Could not get page ref for pfn fec00
(XEN) [2013-06-27 18:04:39] mm.c:618:d0 Could not get page ref for pfn fec00
(XEN) [2013-06-27 18:04:39] mm.c:618:d0 Could not get page ref for pfn fec00
(XEN) [2013-06-27 18:04:39] mm.c:618:d0 Could not get page ref for pfn fec00
(XEN) [2013-06-27 18:04:39] mm.c:618:d0 Could not get page ref for pfn fec00
(XEN) [2013-06-27 18:04:39] mm.c:618:d0 Could not get page ref for pfn fec00
(XEN) [2013-06-27 18:04:39] mm.c:618:d0 Could not get page ref for pfn fec00
Sending SIGTERM to remaining processes...
Sending SIGKILL to remaining processes...
(XEN) [2013-06-27 18:04:43] mm.c:618:d0 Could not get page ref for pfn fec00
(XEN) [2013-06-27 18:04:43] mm.c:618:d0 Could not get page ref for pfn fec00
Hardware watchdog 'SP5100 TCO timer', version 0
(XEN) [2013-06-27 18:04:43] mm.c:618:d0 Could not get page ref for pfn fec00
Unmounting file systems.
Unmounting /var/lib/dhcp/proc.
Unmounting /var/run.
Unmounting /dev/mqueue.
All filesystems unmounted.
Deactivating swaps.
All swaps deactivated.
Detaching loop devices.
All loop devices detached.
Detaching DM devices.
Detaching DM 253:7.
Detaching DM 253:6.
Detaching DM 253:5.
Detaching DM 253:4.
Detaching DM 253:3.
Detaching DM 253:2.
Detaching DM 253:0.
Not all DM devices detached, 1 left.
(XEN) [2013-06-27 18:04:43] mm.c:618:d0 Could not get page ref for pfn fec00
Detaching DM devices.
Not all DM devices detached, 1 left.
Cannot finalize remaining file systems and devices, giving up.
(XEN) [2013-06-27 18:04:45] mm.c:618:d0 Could not get page ref for pfn fec00
[ 1399.988852] Restarting system.
---------------------------

at this point it just sits, and goes no further. the system does NOT poweroff.
Does normal shutdown work, or does it also halt the machine without turning it off?
manual/cold reboot, then @ `shutdown -h now`, it *DOES* successfully poweroff. Here's the similar, serial console tail:

---------------------------
[ OK ] Reached target Unmount All Filesystems.
[ OK ] Stopped target Local File Systems (Pre).
Stopping Remount Root and Kernel File Systems...
[ OK ] Stopped Remount Root and Kernel File Systems.
Starting Save Random Seed...
Starting Update UTMP about System Shutdown...
Stopping Replay Read-Ahead Data...
[ OK ] Stopped Replay Read-Ahead Data.
Stopping Collect Read-Ahead Data...
[ OK ] Stopped Collect Read-Ahead Data.
Stopping LSB: Start LVM2...
[ OK ] Started Save Random Seed.
[ OK ] Started Update UTMP about System Shutdown.
[ OK ] Stopped LSB: Start LVM2.
Stopping LSB: Multiple Device RAID...
[ OK ] Stopped LSB: Multiple Device RAID.
[ OK ] Reached target Shutdown.
Sending SIGTERM to remaining processes...
Sending SIGKILL to remaining processes...
Unmounting file systems.
Unmounting /var/lib/dhcp/proc.
Unmounting /var/lib/nfs/rpc_pipefs.
Unmounting /var/run.
Unmounting /dev/mqueue.
All filesystems unmounted.
Deactivating swaps.
All swaps deactivated.
Detaching loop devices.
All loop devices detached.
Detaching DM devices.
Detaching DM 253:7.
Detaching DM 253:6.
Detaching DM 253:5.
Detaching DM 253:4.
Detaching DM 253:3.
Detaching DM 253:2.
Detaching DM 253:0.
Not all DM devices detached, 1 left.
Detaching DM devices.
Not all DM devices detached, 1 left.
Cannot finalize remaining file systems and devices, giving up.
(XEN) [2013-06-27 18:16:05] mm.c:618:d0 Could not get page ref for pfn fec00
[ 256.208915] Power down.
(XEN) [2013-06-27 18:16:07] Preparing system for ACPI S5 state.
(XEN) [2013-06-27 18:16:07] Disabling non-boot CPUs ...
(XEN) [2013-06-27 18:16:07] Breaking affinity for d0v1
(XEN) [2013-06-27 18:16:07] Breaking affinity for d0v2
(XEN) [2013-06-27 18:16:08] Breaking affinity for d0v3
(XEN) [2013-06-27 18:16:08] Entering ACPI S5 state.
---------------------------

and, at this point, it's successfully powered-off.
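To compare the reboot and poweroff tails, it may help to reduce each captured serial log to its hypervisor messages only. The following is a minimal sketch, assuming the serial output was saved to a file with one message per line; the function name `xen_msg_summary` is made up for illustration and is not part of Xen or its tools.

```shell
# Summarise hypervisor messages in a serial log read from stdin.
# Keeps only lines starting with "(XEN) [timestamp] ", strips that prefix,
# then counts identical messages (most frequent first).
xen_msg_summary() {
    sed -n 's/^(XEN) \[[^]]*] //p' | sort | uniq -c | sort -rn
}

# Example on a few lines copied from the reboot log above.
printf '%s\n' \
    '[ OK ] Reached target Shutdown.' \
    '(XEN) [2013-06-27 18:04:39] mm.c:618:d0 Could not get page ref for pfn fec00' \
    '(XEN) [2013-06-27 18:04:43] mm.c:618:d0 Could not get page ref for pfn fec00' \
    '[ 1399.988852] Restarting system.' |
    xen_msg_summary
```

With a saved capture, the same function would be fed from the file, e.g. `xen_msg_summary < reboot-serial.log` (file name hypothetical).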
with 12.3 not shipping with Xen 4.3, we'd want you to test with the shipped version of Xen (and, in case you updated that too, kernel).
Pending
with you apparently knowing that it worked before a recent update, narrowing down which update this was would also help.
Pending
with the native kernel working, attaching the boot log of the native kernel (to see eventual log messages regarding applied workarounds) would be as helpful as providing exact hardware details (namely DMI information).
not entirely sure what 'boot log' is being asked for in a systemd world, since boot.*msg no longer appears. here's `journalctl -b | grep -i kernel`: http://pastebin.com/raw.php?i=khT1Da6T
https://bugzilla.novell.com/show_bug.cgi?id=825510
https://bugzilla.novell.com/show_bug.cgi?id=825510#c6
Jan Beulich
https://bugzilla.novell.com/show_bug.cgi?id=825510
https://bugzilla.novell.com/show_bug.cgi?id=825510#c7
--- Comment #7 from A R
To simplify the analysis, it would be desirable if you tried this with disabled secondary CPUs ("nosmp" on the Xen command line).
I've found different advice on how best to achieve this.

Reading,

http://osdir.com/ml/xen-users/2007-11/msg00697.html
> Is it possible to disable the SMP function on the dom0 and assign each single domU to have direct access on the multicore processor like Intel Quadcore?
"Yes, Edit /etc/xend/xend-config.sxp and use (dom0-cpus x) to assign a single cpu to dom0"

and

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=637308#22
> it’s also recommended that the dom0 is restricted to one
> CPU only, for example by booting with the kernel parameter nosmp."
Ian Campbell ==> "Ignoring whether or not this is good advice (I expect it very much depends on your workload) this can be better achieved by adding dom0_max_vcpus=1 to your hypervisor command line or by hot-unplugging vCPUS once the system has booted ..."

mod'ing in my grub

- ... dom0_max_vcpus=4 ...
+ ... dom0_max_vcpus=1 ...

iiuc, appears to do the trick; after reboot, just one CPU appears active

cat /proc/interrupts | head -n 5
     CPU0
1:   4   Phys-fasteoi   i8042
6:   3   Phys-fasteoi   floppy
7:   0   Phys-fasteoi   parport0
8:   0   Phys-fasteoi   rtc0

Is this sufficient for your request?
So with it not crashing, but just hanging, issuing the 'd' debug key from the serial console ought to still work, and should give us insight into what's going on.
Atm, when booting to Xen, I'm *unable* to get the server to recognize any commands issued from my serial terminal (no cmd keys, not even seeing a login prompt, etc). This *used* to work pre-systemd. Something's possibly off in my serial config. I'm trying to get *that* straightened out @ http://lists.opensuse.org/opensuse-virtual/2013-06/msg00015.html
That said - did you try the various "reboot=" hypervisor command line options, and _none_ of them worked?
I hadn't tried any of them; tbh, not even aware of them.

Reading

http://xenbits.xen.org/docs/unstable/misc/xen-command-line.html

The options appear to be:

reboot
= b[ios] | t[riple] | k[bd] | n[o] [, [w]arm | [c]old]
Default: 0
Specify the host reboot method.
warm instructs Xen to not set the cold reboot flag.
cold instructs Xen to set the cold reboot flag.
bios instructs Xen to reboot the host by jumping to BIOS. This is only available on 32-bit x86 platforms.
triple instructs Xen to reboot the host by causing a triple fault.
kbd instructs Xen to reboot the host via the keyboard controller.
acpi instructs Xen to reboot the host using RESET_REG in the ACPI FADT.

with no further explanation(s).

Reading

http://lists.xen.org/archives/html/xen-devel/2011-09/msg00942.html

suggests that @ 'Default', a *sequence* of the reboot options is attempted

"Summing up, both Linux 3.1 and Xen 4.1 both do the following sequence by default: ACPI, KBD, ACPI, KBD, TRIPLE, KBD, TRIPLE, KBD, ..."

setting, instead, individual reboot= grub options, testing `shutdown -r now` in each case

reboot=cold
reboot=warm
reboot=triple
reboot=kbd
reboot=acpi
(reboot=bios, not applicable. this is x86_64.)

exec of `shutdown -r now` hangs, as reported above, @

(XEN) [2013-06-28 16:32:06] Domain 0 shutdown: rebooting machine.
https://bugzilla.novell.com/show_bug.cgi?id=825510
https://bugzilla.novell.com/show_bug.cgi?id=825510#c8
--- Comment #8 from Jan Beulich
+ ... dom0_max_vcpus=1 ... Is this sufficient for your request?
No, it's not. This - as the name says - restricts Dom0's number of vCPU-s, not the number of pCPU-s that Xen uses. Yet the latter is what I'd like to be restricted. Also - as you mention it, and regardless of how much I dislike you scattering all sorts of information here - restricting Dom0's number of vCPU-s to 1 is (at the very least with using xend) _not_ recommended, regardless of what you may have found elsewhere.
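Since `nosmp` is a hypervisor option while `dom0_max_vcpus` only limits Dom0, it can be worth confirming which options actually reached the Xen command line after editing grub. Below is a minimal sketch, assuming the command line is available as a single string; the helper name `xen_opt` and the sample command line are made up for illustration.

```shell
# Look up one option in a space-separated command-line string.
# Prints the value for "opt=value", prints "set" for a bare flag,
# and returns non-zero if the option is absent.
xen_opt() {
    cmdline=$1
    opt=$2
    for word in $cmdline; do
        case $word in
            "$opt")   echo "set"; return 0 ;;
            "$opt"=*) echo "${word#*=}"; return 0 ;;
        esac
    done
    return 1
}

# Hypothetical hypervisor command line, for illustration only.
sample='dom0_mem=4G dom0_max_vcpus=1 console=com1'
xen_opt "$sample" dom0_max_vcpus
xen_opt "$sample" nosmp || echo "nosmp not set"
```

On a running Dom0 the string would presumably come from the toolstack, e.g. `xl info | sed -n 's/^xen_commandline *: //p'` (or `xm info` under xend); the exact field layout is an assumption here. Feeding the helper `$(cat /proc/cmdline)` checks the Dom0 kernel command line instead, which is a separate command line from the hypervisor's.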
Atm, when booting to Xen, I'm *unable* to get the server to recognize any commands issued from my serial terminal (no cmd keys, not even seeing a login prompt, etc). This *used* to work pre-systemd.
Sorry, but the expectation is that you have this working. And I very much doubt that systemd has any effect on Xen receiving input (it may very well have an effect on Dom0 receiving input, but that's two different modes to run the serial console in).
"Summing up, both Linux 3.1 and Xen 4.1 both do the following sequence by default:
ACPI, KBD, ACPI, KBD, TRIPLE, KBD, TRIPLE, KBD, ..."
setting, instead, individual reboot= grub options, testing `shutdown -r now` in each case
reboot=cold reboot=warm reboot=triple reboot=kbd reboot=acpi (reboot=bios, not applicable. this is x86_64.)
exec of `shutdown -r now` hangs, as reported above, @
Very interesting (and odd). And you added these to the hypervisor command line, not the kernel one? If so, you might want to try "reboot=pci", if you're able to rebuild the hypervisor for yourself with the patch at http://lists.xenproject.org/archives/html/xen-devel/2013-06/msg02128.html applied. Failing that, we will need to see the result of the 'd' debug key as per #6.
https://bugzilla.novell.com/show_bug.cgi?id=825510
https://bugzilla.novell.com/show_bug.cgi?id=825510#c9
A R
https://bugzilla.novell.com/show_bug.cgi?id=825510
https://bugzilla.novell.com/show_bug.cgi?id=825510#c10
--- Comment #10 from Jan Beulich