[opensuse] watchdog, anyone?
Hi list, is any of you using a (hardware) watchdog to reset the computer if the OS gets stuck? If so, what HW do you use, and what SW on the Linux side? I saw TW (only) has freeipmi-bmc-watchdog, which seems a bit overkill(?) - the 'simple' watchdog daemon is only available via home: repos... I had tried the iTCO watchdog of my Skylake computer here, but while it reset the machine, the boot would hang forever in POST. Seems not too uncommon :( Hints/tips highly welcome :) -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
03.02.2018 17:22, Peter Suetterlin wrote:
Hi list,
is any of you using a (hardware) watchdog to reset the computer if the OS gets stuck? If so, what HW do you use, and what SW on the Linux side? I saw TW
It's not like you really have a choice. If you talk about hardware watchdog, this is whatever your hardware implements. Or you can use softdog.
(only) has freeipmi-bmc-watchdog, which seems a bit overkill(?) - the 'simple' watchdog daemon is only available via home: repos...
You probably misunderstand how a watchdog works. You need something to periodically poke it. This "something" needs to know how to talk to the watchdog. Either you have a kernel driver that implements the standardized interface, in which case you can really use a "simple" daemon (although in this case you could simply enable the watchdog in systemd, which is always running anyway), or you need a dedicated program that knows how to access the watchdog. bmc-watchdog is obviously useful only if your system actually has a BMC (under whatever name) with watchdog support. Do you have one?
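For illustration, the "periodically poke it" part can be sketched in a few lines of Python. This is only a minimal sketch, assuming a kernel driver that exposes the standard /dev/watchdog character device; the function names `keep_alive` and `disarm`, the interval, and the number of rounds are made up for the example. Writing the character 'V' just before closing (the "magic close") asks the driver to stop the timer, but only drivers that support it honor it, and not when "nowayout" is set.

```python
import time


def keep_alive(dev, interval_s, rounds, sleep=time.sleep):
    """Poke the watchdog `rounds` times, pausing `interval_s` seconds between pokes.

    `dev` is any writable binary file object; on a real system it would be
    /dev/watchdog opened by root (opening the device arms the timer).
    """
    for _ in range(rounds):
        dev.write(b"\0")  # any write resets the hardware countdown
        dev.flush()
        sleep(interval_s)


def disarm(dev):
    """Magic close: a 'V' written right before close requests a clean stop."""
    dev.write(b"V")
    dev.flush()


# Hypothetical usage on a real machine (needs root, arms the watchdog):
#
#   with open("/dev/watchdog", "wb", buffering=0) as wd:
#       try:
#           keep_alive(wd, interval_s=10, rounds=6)
#       finally:
#           disarm(wd)
```

The interval has to stay well below the timeout configured in the driver, otherwise the machine resets although everything is healthy.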
I had tried the iTCO watchdog of my Skylake computer here, but while it reset the machine, the boot would hang forever in POST. Seems not too uncommon :(
You may try to play with turn_SMI_watchdog_clear_off parameter.
Andrei Borzenkov wrote:
03.02.2018 17:22, Peter Suetterlin wrote:
Hi list,
is any of you using a (hardware) watchdog to reset the computer if the OS gets stuck? If so, what HW do you use, and what SW on the Linux side? I saw TW
It's not like you really have a choice. If you talk about hardware watchdog, this is whatever your hardware implements. Or you can use softdog.
I saw small external counter devices that you can connect to the reset line of the MB. So if I understand correctly, for those I'd need my own ping-daemon, or rather a driver that 'connects' it as some /dev/watchdog<n>? The softdog doesn't help if the system really freezes....
(only) has freeipmi-bmc-watchdog, which seems a bit overkill(?) - the 'simple' watchdog daemon is only available via home: repos...
You probably misunderstand how watchdog works. You need something to periodically poke it. This "something" needs to know how to speak with watchdog.
Yes, sure. But the whole IPMI stuff is rather for (real) server boards with built-in hardware that does (much) more like (HW) health monitoring etc.
Either you have a kernel driver that implements the standardized interface, in which case you can really use a "simple" daemon (although in this case you could simply enable the watchdog in systemd, which is always running anyway), or you need a dedicated program that knows how to access the watchdog. bmc-watchdog is obviously useful only if your system actually has a BMC (under whatever name) with watchdog support. Do you have one?
Nope :( So indeed the ipmi stuff is not for me, unless I buy appropriate hardware. One thing answered, good!
I had tried the iTCO watchdog of my Skylake computer here, but while it reset the machine, the boot would hang forever in POST. Seems not too uncommon :(
You may try to play with turn_SMI_watchdog_clear_off parameter.
Aah! Thanks! Yes, I'm going to play with that :)
03.02.2018 20:50, Peter Suetterlin wrote:
I saw small external counter devices that you can connect to the reset line of the MB. So if I understand correct for those I'd need my own ping-daemon or rather a driver that 'connects' it as some /dev/watchdog<n>?
Correct. Or custom program that knows how to speak to device (but you likely will need some kernel driver anyway in which case it would be easier if kernel driver also implemented standard access via /dev/watchdog).
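To see whether a driver has actually registered a device with the kernel's watchdog framework, one can look at /sys/class/watchdog. The following is a rough sketch only: the function name `list_watchdogs` is invented for the example, and the `identity` attribute is assumed to be present only when the kernel was built with watchdog sysfs support, so the code treats it as optional.

```python
from pathlib import Path


def list_watchdogs(sysfs_root="/sys/class/watchdog"):
    """Map each registered watchdog (e.g. 'watchdog0') to its identity string.

    The value is None when the `identity` attribute is not exposed.
    An empty dict means no driver registered a watchdog device.
    """
    root = Path(sysfs_root)
    found = {}
    if not root.is_dir():
        return found
    for entry in sorted(root.iterdir()):
        ident = entry / "identity"
        found[entry.name] = ident.read_text().strip() if ident.is_file() else None
    return found
```

Calling `list_watchdogs()` without arguments inspects the running system; an entry there normally corresponds to a /dev/watchdog<n> node that a keep-alive daemon can open.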
On 02/03/2018 08:22 AM, Peter Suetterlin wrote:
Hi list,
is any of you using a (hardware) watchdog to reset the computer if the OS gets stuck? If so, what HW do you use, and what SW on the Linux side? I saw TW (only) has freeipmi-bmc-watchdog, which seems a bit overkill(?) - the 'simple' watchdog daemon is only available via home: repos...
I had tried the iTCO watchdog of my Skylake computer here, but while it reset the machine, the boot would hang forever in POST. Seems not too uncommon :(
Hints/tips highly welcome :)
Peter, I have a 2-CPU SuperMicro board (a SuperMicro H8DM8E-2) that apparently was a prototype board for the now defunct 3Leaf.inc's early attempts to link 32 Opteron processors together as a single server; see the 2009 article on the mega-server idea: http://www.eetimes.com/document.asp?doc_id=1172150 (it was a free dual quad-core Opteron box with 32G, so I can't complain). Being only one board of what would have been a linked 16-board single server with the AQUA technology, it has some lingering effects on boot. At times (let's say 1 out of 5 boots) it will hang during PCI initialization with various spurious errors, but the native watchdog timer takes care of the problem 30-60 seconds later, allowing the box to boot normally. I don't know what watchdog software this was (I always thought it was part of the kernel or something that was just there by default for just this circumstance). Are you looking for something different that would act like this watchdog timer, but could be controlled and configured from user-space?
-- David C. Rankin, J.D.,P.E.
David C. Rankin wrote:
I have a 2-CPU SuperMicro board (a SuperMicro H8DM8E-2) that apparently was a prototype board for the now defunct 3Leaf.inc's early attempts to link 32 Opteron processors together as a single server; see the 2009 article on the mega-server idea: http://www.eetimes.com/document.asp?doc_id=1172150 (it was a free dual quad-core Opteron box with 32G, so I can't complain)
Hehe, fancy :)
I don't know what watchdog software this was (I always thought it was part of the kernel or something that was just there by default for just this circumstance).
Are you looking for something different that would act like this watchdog-timer, but could be controlled and configured from user-space?
As Andrei mentioned, you need a hardware (with driver) plus some software to 'keep it alive'. The alternative (kernel-space softdog) doesn't help if the system itself crashes, and that's the case I'd like to handle. (it's definitely not-so-nice to have someone drive up a snowy mountain to press a reset button....) I'm currently exploring options, that's why I tried to find what other people use (if any...). But most stuff I found is targeted at embedded systems....
On Monday, 2018-02-05 at 11:43 -0000, Peter Suetterlin wrote:
As Andrei mentioned, you need a hardware (with driver) plus some software to 'keep it alive'. The alternative (kernel-space softdog) doesn't help if the system itself crashes, and that's the case I'd like to handle.
(it's definitely not-so-nice to have someone drive up a snowy mountain to press a reset button....)
Indeed!
I'm currently exploring options, that's why I tried to find what other people use (if any...). But most stuff I found is targeted at embedded systems....
I need something to reboot my router: sometimes it locks and dies, so no external access to do anything remotely. I'm considering some small microcomputer that pings it, and if it doesn't respond in a minute, activate a relay to power cycle it. I have seen very expensive solutions ready made, and relatively cheap ones that I'd have to implement myself completely.
-- Cheers, Carlos E. R. (from openSUSE 42.3 x86_64 "Malachite" at Telcontar)
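The "ping it, and power cycle after a minute of silence" logic could look roughly like the Python sketch below. It is not a ready-made solution: `power_cycle` is a placeholder callback for whatever actually switches the relay or power plug (a GPIO pin, a USB-controlled socket, ...), all names and default timings are invented for the example, and `ping_once` assumes the Linux iputils `ping` with its `-c` and `-W` options.

```python
import subprocess
import time


def ping_once(host, timeout_s=2):
    """Return True if a single ICMP echo request to `host` gets a reply."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def monitor(host, power_cycle, grace_s=60, poll_s=10, max_polls=None,
            ping=ping_once, clock=time.monotonic, sleep=time.sleep):
    """Poll `host`; call `power_cycle()` once it has been silent for `grace_s`.

    Runs forever unless `max_polls` is given. Returns the number of power cycles.
    """
    last_ok = clock()
    polls = 0
    cycles = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        if ping(host):
            last_ok = clock()
        elif clock() - last_ok >= grace_s:
            power_cycle()
            cycles += 1
            last_ok = clock()  # restart the grace period while the router boots
        sleep(poll_s)
    return cycles
```

Resetting the grace period after each power cycle keeps the monitor from toggling the relay again while the router is still booting; a router that needs longer than `grace_s` to come up would need a larger value.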
Carlos E. R. wrote:
On Monday, 2018-02-05 at 11:43 -0000, Peter Suetterlin wrote:
I'm currently exploring options, that's why I tried to find what other people use (if any...). But most stuff I found is targeted at embedded systems....
I need something to reboot my router: sometimes it locks and dies, so no external access to do anything remotely.
Yes, that's also a well-known problem for me, though in my case the affected router is only needed when I'm around....
I'm considering some small microcomputer that pings it, and if it doesn't respond in a minute, activate a relay to power cycle it.
I have seen very expensive solutions ready made, and relatively cheap ones that I'd have to implement myself completely.
Yes, I'm meanwhile also inclined to get some USB-controlled power plug and have a Pi controlling it....
On 2018-02-06 11:52, Peter Suetterlin wrote:
Carlos E. R. wrote:
On Monday, 2018-02-05 at 11:43 -0000, Peter Suetterlin wrote:
I'm currently exploring options, that's why I tried to find what other people use (if any...). But most stuff I found is targeted at embedded systems....
I need something to reboot my router: sometimes it locks and dies, so no external access to do anything remotely.
Yes, that's also a well-known problem for me, though in my case the affected router is only needed when I'm around....
I'm considering some small microcomputer that pings it, and if it doesn't respond in a minute, activate a relay to power cycle it.
I have seen very expensive solutions ready made, and relatively cheap ones that I'd have to implement myself completely.
Yes, I'm meanwhile also inclined to get some USB-controlled power plug and have a Pi controlling it....
If you find a detailed howto, tell me ;-) -- Cheers / Saludos, Carlos E. R. (from 42.2 x86_64 "Malachite" at Telcontar)
participants (4)
- Andrei Borzenkov
- Carlos E. R.
- David C. Rankin
- Peter Suetterlin