[Bug 1105302] New: [Build 20180813] openQA test fails randomly due to 'watchdog: BUG: soft lockup - CPU#0 stuck for 91s!'
http://bugzilla.opensuse.org/show_bug.cgi?id=1105302

Bug ID: 1105302
Summary: [Build 20180813] openQA test fails randomly due to 'watchdog: BUG: soft lockup - CPU#0 stuck for 91s!'
Classification: openSUSE
Product: openSUSE Tumbleweed
Version: Current
Hardware: aarch64
URL: https://openqa.opensuse.org/tests/737735/modules/hostname/steps/3
OS: Other
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Kernel
Assignee: kernel-maintainers@forge.provo.novell.com
Reporter: guillaume.gardet@arm.com
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

## Observation

openQA test in scenario opensuse-Tumbleweed-DVD-aarch64-create_hdd_gnome@aarch64 fails in [hostname](https://openqa.opensuse.org/tests/737735/modules/hostname/steps/3)

The error is due to:
watchdog: BUG: soft lockup - CPU#0 stuck for 91s! [appstream-util:3699]

In the gnome test (https://openqa.opensuse.org/tests/734902#step/prepare_system_for_update_test...), in prepare_system_for_update_tests, we have:
watchdog: BUG: soft lockup - CPU#0 stuck for 98s! [AsHelper:4083]

## Reproducible

Fails since (at least) Build [20180603](https://openqa.opensuse.org/tests/686547)

The error appears randomly in various tests. Most of the time, restarting the test is enough to get it to pass.

## Expected result

Last good: [20180530](https://openqa.opensuse.org/tests/685290) (or more recent)

## Further details

Always latest result in this scenario: [latest](https://openqa.opensuse.org/tests/latest?version=Tumbleweed&arch=aarch64&distri=opensuse&machine=aarch64&flavor=DVD&test=create_hdd_gnome)

--
You are receiving this mail because:
You are on the CC list for the bug.
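Since the failure shows up at random in different test modules, it can help to scan console output for the watchdog line itself rather than rely on which module failed. The following is only a minimal sketch in Python: the regular expression is derived from the two messages quoted in this report, and the names `SoftLockup` and `find_soft_lockups` are hypothetical helpers, not part of openQA.

```python
import re
from typing import List, NamedTuple

# Matches lines such as:
#   watchdog: BUG: soft lockup - CPU#0 stuck for 91s! [appstream-util:3699]
SOFT_LOCKUP_RE = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]+):(?P<pid>\d+)\]"
)


class SoftLockup(NamedTuple):
    cpu: int       # CPU number reported by the watchdog
    seconds: int   # how long the CPU was reported stuck
    comm: str      # name of the task that was running
    pid: int       # PID of that task


def find_soft_lockups(log_text: str) -> List[SoftLockup]:
    """Return every soft lockup report found in a block of console output."""
    return [
        SoftLockup(int(m["cpu"]), int(m["secs"]), m["comm"], int(m["pid"]))
        for m in SOFT_LOCKUP_RE.finditer(log_text)
    ]
```

Feeding it the text of a serial log would yield one entry per lockup, which makes it easy to see whether the stuck task or the reported duration varies between runs.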
http://bugzilla.opensuse.org/show_bug.cgi?id=1105302
Guillaume GARDET
http://bugzilla.opensuse.org/show_bug.cgi?id=1105302#c1
--- Comment #1 from Guillaume GARDET
http://bugzilla.opensuse.org/show_bug.cgi?id=1105302
Guillaume GARDET
http://bugzilla.opensuse.org/show_bug.cgi?id=1105302
Guillaume GARDET
http://bugzilla.opensuse.org/show_bug.cgi?id=1105302
Stefan Brüns
http://bugzilla.opensuse.org/show_bug.cgi?id=1105302#c2
--- Comment #2 from Oliver Kurz
http://bugzilla.opensuse.org/show_bug.cgi?id=1105302
Oliver Kurz
http://bugzilla.opensuse.org/show_bug.cgi?id=1105302#c3
--- Comment #3 from Dirk Mueller
http://bugzilla.opensuse.org/show_bug.cgi?id=1105302#c4
Oliver Kurz
http://bugzilla.opensuse.org/show_bug.cgi?id=1105302#c5
--- Comment #5 from Richard Palethorpe
http://bugzilla.opensuse.org/show_bug.cgi?id=1105302#c6
--- Comment #6 from Guillaume GARDET
http://bugzilla.opensuse.org/show_bug.cgi?id=1105302#c7
--- Comment #7 from Richard Palethorpe
http://bugzilla.opensuse.org/show_bug.cgi?id=1105302#c8
Guillaume GARDET
As a workaround, I added 'EXTRABOOTPARAMS= rcupdate.rcu_task_stall_timeout=0' to aarch64 machine settings to disable RCU task stall messages, using grub.
It is not working since there is a reboot after the installation. It seems GRUB_KERNEL_OPTION_APPEND would be a better candidate, as it is used in 'grub_test', after the reboot that follows the installation. GRUB_KERNEL_OPTION_APPEND should probably also be added to the 'boot_to_desktop' test.
http://bugzilla.opensuse.org/show_bug.cgi?id=1105302#c9
--- Comment #9 from Richard Palethorpe
It is not working since there is a reboot after the installation. It seems GRUB_KERNEL_OPTION_APPEND would be a better candidate as it is used in 'grub_test', after the reboot of the installation. GRUB_KERNEL_OPTION_APPEND should probably also be added to 'boot_to_desktop' test.
For after installation you possibly want to use GRUB_PARAM as well. See add_custom_grub_entries in lib/bootloader_setup.pm. You probably need to use both parameters to cover all scenarios.
http://bugzilla.opensuse.org/show_bug.cgi?id=1105302#c10
--- Comment #10 from Richard Palethorpe
http://bugzilla.opensuse.org/show_bug.cgi?id=1105302#c11
--- Comment #11 from Guillaume GARDET
For after installation you possibly want to use GRUB_PARAM as well. See add_custom_grub_entries in lib/bootloader_setup.pm. You probably need to use both parameters to cover all scenarios.
It is currently only used in tests/kernel/install_ltp.pm but could be added to other tests.
(In reply to Richard Palethorpe from comment #10)
Apparently there is also GRUB_CMDLINE_LINUX_DEFAULT
This is a GRUB variable, not an openQA one.
http://bugzilla.opensuse.org/show_bug.cgi?id=1105302#c12
--- Comment #12 from Richard Palethorpe
http://bugzilla.opensuse.org/show_bug.cgi?id=1105302#c13
--- Comment #13 from Stefan Brüns
http://bugzilla.opensuse.org/show_bug.cgi?id=1105302#c14
--- Comment #14 from Richard Palethorpe
Just another idea - the snapshotted RAM also includes buffers and caches, correct? Can we just tell the kernel to drop these prior to the snapshot?
Interestingly, the first snapshot, which has the most problems, is taken directly after restarting the machine, so the guest kernel will have dropped most of the RAM before the snapshot is taken. However, QEMU does not seem to be aware of this, and I am not sure whether there is a way for QEMU to know (without the guest agent) that the guest kernel has stopped using a page of RAM. Maybe we could use -no-reboot, which should cause QEMU to exit instead of rebooting, and then restart QEMU. This will certainly drop all the RAM...
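For reference, the idea quoted at the top of this comment (asking the guest kernel to drop buffers and caches before a snapshot) could be sketched in Python roughly as below. Writing 1, 2 or 3 to /proc/sys/vm/drop_caches after a sync is the kernel's documented interface; the function name `drop_caches` and the overridable `path` argument are assumptions made for illustration, and, as discussed in this comment, it is not clear that QEMU would notice the freed pages.

```python
import os


def drop_caches(mode: int = 3, path: str = "/proc/sys/vm/drop_caches") -> None:
    """Ask the kernel to drop clean caches (needs root inside the guest).

    mode 1 drops the page cache, 2 drops dentries and inodes, 3 drops both.
    The path is a parameter only so the function can be exercised without root.
    """
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2 or 3")
    # Flush dirty pages first; drop_caches only discards clean objects.
    os.sync()
    with open(path, "w") as handle:
        handle.write(f"{mode}\n")
```

This would have to run inside the guest right before the snapshot is requested.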
http://bugzilla.opensuse.org/show_bug.cgi?id=1105302#c15
--- Comment #15 from Richard Palethorpe
http://bugzilla.opensuse.org/show_bug.cgi?id=1105302#c16
--- Comment #16 from Stefan Brüns
http://bugzilla.opensuse.org/show_bug.cgi?id=1105302#c17
--- Comment #17 from Richard Palethorpe
http://bugzilla.opensuse.org/show_bug.cgi?id=1105302#c18
--- Comment #18 from Richard Palethorpe
http://bugzilla.opensuse.org/show_bug.cgi?id=1105302#c19
--- Comment #19 from Alexander Graf
http://bugzilla.opensuse.org/show_bug.cgi?id=1105302#c20
--- Comment #20 from Richard Palethorpe
QEMU's RAM migration code treats all-0 pages special, so maybe it's enough to just make as many pages as we can contain all zeros?
This looks similar to a balloon, but where the pages are zeroed instead of just locked inside the balloon :-p
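To make the zero-page idea concrete, the Python sketch below counts how many pages of a raw memory buffer are entirely zero. It is not QEMU's code and makes no claim about how QEMU encodes such pages; it only illustrates the quantity the suggestion tries to maximize. The function name `count_zero_pages` and the 4096-byte default page size are assumptions.

```python
from typing import Tuple


def count_zero_pages(memory: bytes, page_size: int = 4096) -> Tuple[int, int]:
    """Return (zero_pages, total_pages) for a raw memory buffer.

    A trailing partial page is counted as a page and compared against
    zeros of its own length.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    zero_page = bytes(page_size)  # page_size bytes, all 0x00
    zero_pages = 0
    total_pages = 0
    for offset in range(0, len(memory), page_size):
        page = memory[offset:offset + page_size]
        total_pages += 1
        if page == zero_page[:len(page)]:
            zero_pages += 1
    return zero_pages, total_pages
```

The higher the share of zero pages, the less page content a snapshot that special-cases them would need to store.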
http://bugzilla.opensuse.org/show_bug.cgi?id=1105302#c21
--- Comment #21 from Richard Palethorpe
(In reply to Alexander Graf from comment #19)
QEMU's RAM migration code treats all-0 pages special, so maybe it's enough to just make as many pages as we can contain all zeros?
This looks similar to a balloon, but where the pages are zeroed instead of just locked inside the balloon :-p
Although I have added it to the progress issue anyway, in case someone else wants to try it or I get time.
http://bugzilla.opensuse.org/show_bug.cgi?id=1105302#c22
--- Comment #22 from Guillaume GARDET
rcupdate.rcu_task_stall_timeout=0' to aarch64 machine settings to disable RCU task stall messages, using grub.
'rcupdate.rcu_task_stall_timeout=0' did not disable the RCU messages. Let's try 'rcupdate.rcu_cpu_stall_suppress=1' for the next snapshot.
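Because an earlier attempt was lost across the reboot that follows the installation, it may be worth confirming that a parameter really reached the running kernel before judging its effect. The following Python sketch parses a command line such as the contents of /proc/cmdline; the helper name `kernel_param_value` is hypothetical, and the parser deliberately ignores quoted values containing spaces.

```python
from typing import Optional


def kernel_param_value(cmdline: str, key: str) -> Optional[str]:
    """Look up a parameter on a kernel command line.

    Returns the value for `key`, an empty string for a bare flag such as
    `quiet`, or None if the parameter is absent. If the key appears more
    than once, the last occurrence is returned.
    """
    value = None
    for token in cmdline.split():
        name, _sep, val = token.partition("=")
        if name == key:
            value = val
    return value
```

On the system under test this could be called as `kernel_param_value(open("/proc/cmdline").read(), "rcupdate.rcu_cpu_stall_suppress")`; a result of `None` would mean the setting never made it into the boot entry that was actually used.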
http://bugzilla.opensuse.org/show_bug.cgi?id=1105302#c24
--- Comment #24 from Guillaume GARDET
http://bugzilla.opensuse.org/show_bug.cgi?id=1105302#c27
Guillaume GARDET