Mailinglist Archive: opensuse-bugs (4655 mails)

[Bug 1042933] kernel panic due to nmi caused by systemd-watchdog test
  • From: bugzilla_noreply@xxxxxxxxxx
  • Date: Wed, 14 Jun 2017 07:30:14 +0000
  • Message-id: <bug-1042933-21960-vOkjj4VgWo@http.bugzilla.suse.com/>
http://bugzilla.suse.com/show_bug.cgi?id=1042933
http://bugzilla.suse.com/show_bug.cgi?id=1042933#c7

--- Comment #7 from Thomas Blume <thomas.blume@xxxxxxxx> ---
(In reply to Borislav Petkov from comment #6)
> Well, someone had the brilliant idea to panic the system unconditionally
> in the hpwdt driver:
>
> hpwdt_pretimeout
> |-> nmi_panic
>
> nmi_panic(regs, "An NMI occurred. Depending on your system the reason "
>                 "for the NMI is logged in any one of the following "
>                 "resources:\n"
>                 "1. Integrated Management Log (IML)\n"
>                 "2. OA Syslog\n"
>                 "3. OA Forward Progress Log\n"
>                 "4. iLO Event Log");
>
> Apparently, that driver goes down into the BIOS to ask what the NMI
> reason was but since the test is causing the NMI and doesn't do any
> special dancing to tell the BIOS that it is a test running and that when
> asked, the BIOS should reply something so that the watchdog doesn't
> panic the system, this happens.
>
> But from looking at your testcase again, you only want to ping it once:
>
> ioctl(watchdog_fd, WDIOC_KEEPALIVE, 0);
>
> and then close it. Right?

No, sorry, I left out the parts of the code that come after the call that
causes the crash.
The full code is:

-->
r = watchdog_set_timeout(&t);
if (r < 0)
        printf("Failed to open watchdog");

for (i = 0; i < 5; i++) {
        printf("Pinging...");
        r = watchdog_ping();
        if (r < 0)
                printf("Failed to ping watchdog: %m", r);

        usleep(t/2);
}

watchdog_close(true);
return 0;
--<
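
For readers who don't know the helpers called above: on Linux, setting the
timeout and pinging usually come down to two ioctls on the watchdog device.
The following is only a minimal sketch of that idea, not the systemd
implementation. The names wd_set_timeout and wd_ping are hypothetical, and
fd is assumed to be an already opened /dev/watchdog.

```c
#include <errno.h>
#include <linux/watchdog.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helper: ask the driver for a timeout of `seconds`.
 * Returns 0 on success or a negative errno value on failure. */
int wd_set_timeout(int fd, int seconds) {
        /* WDIOC_SETTIMEOUT takes a pointer to an int holding seconds. */
        if (ioctl(fd, WDIOC_SETTIMEOUT, &seconds) < 0)
                return -errno;
        return 0;
}

/* Hypothetical helper: ping the watchdog once so the timer restarts.
 * Returns 0 on success or a negative errno value on failure. */
int wd_ping(int fd) {
        if (ioctl(fd, WDIOC_KEEPALIVE, 0) < 0)
                return -errno;
        return 0;
}
```

Called on a descriptor that is not a watchdog device, both helpers should
simply return a negative errno value, so they can be tried without arming
real hardware.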

> If so, from looking at the code, it apparently expects a magical 'V'
> written to /dev/watchdog so that when you close /dev/watchdog (or your
> test exits) it will stop the timer and won't trigger an NMI.
>
> So, IINM, if you write a 'V' at the end of open_watchdog(), it should
> work.
> Alternatively, you can simply send it the WDIOS_DISABLECARD flag with
> WDIOC_SETOPTIONS and it'll stop the watchdog timer too.

Thanks for the hints, this seems to be included in the watchdog_close function:

-->
void watchdog_close(bool disarm) {
[...]
        if (disarm) {
                int flags;

                /* Explicitly disarm it */
                flags = WDIOS_DISABLECARD;
                r = ioctl(watchdog_fd, WDIOC_SETOPTIONS, &flags);
                if (r < 0)
                        log_warning_errno(errno, "Failed to disable hardware watchdog: %m");

                /* To be sure, use magic close logic, too */
                for (;;) {
                        static const char v = 'V';

                        if (write(watchdog_fd, &v, 1) > 0)
                                break;

                        if (errno != EINTR) {
                                log_error_errno(errno, "Failed to disarm watchdog timer: %m");
                                break;
                        }
                }
--<

Still, the kernel panics in watchdog_set_timeout, i.e. before watchdog_close
is reached.
I saw another message on the console like:

"unexpected close, not stopping watchdog"

So, I guess this is the reason why the watchdog timer expires, triggering the
panic.
However, this seems to be a bug in the systemd testcode, not a kernel issue.
Thanks for the background, I'm taking back this bug.
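
As a side note on the "unexpected close" message: drivers that implement
magic close typically print something like it when the device is closed
without the 'V' having been written, and they then leave the timer running.
Whether a driver advertises magic close can be queried with
WDIOC_GETSUPPORT. A minimal sketch, with hypothetical helper names and fd
assumed to be an open watchdog device:

```c
#include <errno.h>
#include <linux/watchdog.h>
#include <sys/ioctl.h>

/* Hypothetical helper: 1 if the option bits advertise magic close, else 0. */
int options_have_magic_close(unsigned int options) {
        return (options & WDIOF_MAGICCLOSE) != 0;
}

/* Hypothetical helper: query the driver's capabilities.
 * Returns 1 or 0 as above, or a negative errno value if the ioctl fails. */
int wd_supports_magic_close(int fd) {
        struct watchdog_info info;

        if (ioctl(fd, WDIOC_GETSUPPORT, &info) < 0)
                return -errno;

        return options_have_magic_close(info.options);
}
```

If this returns 1, closing the descriptor without first writing 'V' (or
disabling the card) should be expected to leave the watchdog armed.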

> I still fail to see what this is then even testing but whatever...

The test was introduced with this commit:

-->
commit e96d6be763014be75d480fde503d0b77f41194a0
Author: Lennart Poettering <lennart@xxxxxxxxxxxxxx>
Date: Thu Apr 5 22:08:10 2012 +0200

systemd: add hardware watchdog support

This adds minimal hardware watchdog support to PID 1. The idea is that
PID 1 supervises and watchdogs system services, while the hardware
watchdog is used to supervise PID 1.

This adds two hardware watchdog configuration options, for the runtime
watchdog and for a shutdown watchdog. The former is active during normal
operation, the latter only at reboots to ensure that if a clean reboot
times out we reboot nonetheless.

If the runtime watchdog is enabled PID 1 will automatically wake up at
half the configured interval and write to the watchdog daemon.
[...]
--<
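
The "wake up at half the configured interval" rule from the commit message
is the same thing the test does with usleep(t/2). Roughly sketched below,
not taken from systemd: all names are hypothetical, and count_ping is just a
stand-in for a real ping so the loop can be tried without hardware.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical callback type: returns 0 on success, negative on failure. */
typedef int (*ping_fn)(void *userdata);

/* Ping at half the configured timeout so that one late wakeup
 * does not immediately let the timer expire. */
uint64_t ping_interval_usec(uint64_t timeout_usec) {
        return timeout_usec / 2;
}

/* Stand-in for a real ping: only counts how often it was called. */
int count_ping(void *userdata) {
        (*(unsigned *) userdata)++;
        return 0;
}

/* Call `ping` `count` times, sleeping half the timeout after each call.
 * Stops early and returns the error if a ping fails. */
int ping_loop(uint64_t timeout_usec, unsigned count, ping_fn ping, void *userdata) {
        uint64_t interval = ping_interval_usec(timeout_usec);
        struct timespec ts = {
                .tv_sec = (time_t) (interval / 1000000),
                .tv_nsec = (long) (interval % 1000000) * 1000,
        };

        for (unsigned i = 0; i < count; i++) {
                int r = ping(userdata);
                if (r < 0)
                        return r;

                /* May return early if interrupted by a signal;
                 * a sketch like this simply pings sooner then. */
                nanosleep(&ts, NULL);
        }

        return 0;
}
```

With a real device, the callback would issue the keepalive ioctl instead of
counting.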
