lost interrupts problem
Yesterday logcheck sent me a message about lost interrupts "kernel: hda: lost interrupt", yet all seemed to be working well. Google threw up many reports of this error, but no consistent explanation or context for it occurring. On balance, software errors (perhaps in the kernel) or configuration issues seemed to outnumber hardware ones in the proffered explanations. But I am none the wiser. My machine then locked up when I was experimenting with some unsupported software in CrossOver. On rebooting it went into a loop involving lost interrupts, so I reset the machine and went into Windows to see if that was working, which it was, with no problems (I realise that that's not necessarily much of a guide, but it did rule out catastrophic hardware problems). On rebooting Linux, the boot process proceeded normally and the system seems to be fine at the moment, though I do have "submountd: resmgr: server response code 200" messages, and I remember that resmgr was upgraded the previous night (I have no idea whether there could be a connection between this and the lost interrupts). I'd appreciate some guidance on how to deal with this. The system, which has operated fine for the last 6 months, is an MSI K8T Master dual Opteron, with up-to-date BIOS, 4GB memory, one Maxtor drive on the first IDE channel, a DVD drive on the second IDE channel, and a second Maxtor on a Promise PCI card, 2.6.5-7.111-smp, 9.1 Prof. - Richard. -- Richard Kimber http://www.psr.keele.ac.uk/
On Wednesday 24 November 2004 1:14 pm, rkimber@ntlworld.com wrote:
Yesterday logcheck sent me a message about lost interrupts "kernel: hda: lost interrupt" yet all seemed to be working well
I'm seeing those logged occasionally too, usually in conjunction with kernel: hda: DMA interrupt recovery. Not sure what it all means but there is no noticeable impact on the system. Scott -- POPFile, the OpenSource EMail Classifier http://popfile.sourceforge.net/ Linux 2.6.5-7.111-default x86_64
On Wed, 24 Nov 2004 18:12:07 -0800
Scott Leighton
I'm seeing those logged occasionally too, usually in conjunction with kernel: hda: DMA interrupt recovery. Not sure what it all means but there is no noticeable impact on the system.
Have you needed to reboot at any time? Booting up failed for me once because of this. I haven't dared to try it again. - Richard -- Richard Kimber http://www.psr.keele.ac.uk/
On Thursday 25 November 2004 3:20 am, rkimber@ntlworld.com wrote:
On Wed, 24 Nov 2004 18:12:07 -0800
Scott Leighton wrote:
I'm seeing those logged occasionally too, usually in conjunction with kernel: hda: DMA interrupt recovery. Not sure what it all means but there is no noticeable impact on the system.
Have you needed to reboot at any time? Booting up failed for me once because of this. I haven't dared to try it again.
No, if it weren't for logwatch, I wouldn't even notice these entries were showing up in the log. They are random and occasional but don't seem to impact me in any way that I can detect. Scott -- POPFile, the OpenSource EMail Classifier http://popfile.sourceforge.net/ Linux 2.6.5-7.111-default x86_64
On Wed, 24 Nov 2004 18:12:07 -0800
Scott Leighton
On Wednesday 24 November 2004 1:14 pm, rkimber@ntlworld.com wrote:
Yesterday logcheck sent me a message about lost interrupts "kernel: hda: lost interrupt" yet all seemed to be working well
I'm seeing those logged occasionally too, usually in conjunction with kernel: hda: DMA interrupt recovery. Not sure what it all means but there is no noticeable impact on the system.
Another point I meant to emphasise is that these errors are recent. In the preceding six months they had not occurred at all. This suggests some change in either hardware or, more likely, software. My guess is that there is a bug in some recently produced update. - Richard -- Richard Kimber http://www.psr.keele.ac.uk/
Another point I meant to emphasise is that these errors are recent. In the preceding six months they had not occurred at all. This suggests some change in either hardware or, more likely, software. My guess is that there is a bug in some recently produced update.
More likely hardware actually. Lost interrupt means that the hard disk didn't reply to a command in time. I would check the SMART statistics using smartctl and your cables. -Andi
Andi Kleen wrote:
Another point I meant to emphasise is that these errors are recent. In the preceding six months they had not occurred at all. This suggests some change in either hardware or, more likely, software. My guess is that there is a bug in some recently produced update.
More likely hardware actually. Lost interrupt means that the hard disk didn't reply to a command in time. I would check the SMART statistics using smartctl and your cables.
-Andi
Hmm, I have tested it with a new SATA hard disk and new cables. Same problem. Now I'm fighting with the console to capture the kernel panic. Friendly regards, Andreas
On Thu, 25 Nov 2004 14:12:25 +0100
Andi Kleen
More likely hardware actually. Lost interrupt means that the hard disk didn't reply to a command in time. I would check the SMART statistics using smartctl and your cables.
OK thanks. I'll check, but when I googled, a lot of the discussion seemed software related. - Richard -- Richard Kimber http://www.psr.keele.ac.uk/
On Thu, 25 Nov 2004 14:12:25 +0100
Andi Kleen
Another point I meant to emphasise is that these errors are recent. In the preceding six months they had not occurred at all. This suggests some change in either hardware or, more likely, software. My guess is that there is a bug in some recently produced update.
More likely hardware actually. Lost interrupt means that the hard disk didn't reply to a command in time. I would check the SMART statistics using smartctl and your cables.
In fact, both disks passed the 'smartctl -a' test when run from the command line. The documentation on SMART doesn't always help in the interpretation of output, though. For example, for the disk that does NOT lose interrupts smartd has returned:- /dev/hde, SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 249 to 250. If I've understood the man page correctly, this refers to bit 3: SMART status check returned "DISK FAILING". Does such a small performance change really precede a failure? And how could a failing disk pass the test (above)? Or does it regard anything that happens as 'pre-failure', which in a sense is logical but useless? The evidence so far does not support the idea that it's failing hardware, but we'll see how it develops. - Richard. -- Richard Kimber http://www.psr.keele.ac.uk/
In fact, both disks passed the 'smartctl -a' test
That's not a test, that's querying the disk's status. Run smartctl -s on -o on -S on once per disk. This enables SMART and the disk's offline self-tests. Any bad sectors found here show up as offline_uncorrectable. There was an excellent discussion on this on the smartmontool-users list over the last few days. Run the short and long tests, -t short and -t long, waiting for each to finish. Then run smartctl -a and check for errors.
/dev/hde, SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 249 to 250
Taken together, these attributes are a measure of the disk's health. Each one by itself is meaningless unless it's right at the diabolical end. You'll probably find this one cycles between 249 and 250. smartd just logs each change. Bruce Allen is always looking for people to program more intelligence into smartd ;) Volker -- Volker Kuhlmann is possibly list0570 with the domain in header http://volker.dnsalias.net/ Please do not CC list postings to me.
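The sequence Volker describes (enable SMART once per disk, run the short and long self-tests one after the other, then read the results) can be written down as a small dry-run helper. This is only a minimal sketch and not something posted in the thread: the function name 'smart_check_commands' is hypothetical, and it merely builds the smartctl command lines from the flags quoted above without running them, since the real commands need root and the self-tests take time to complete.

```python
def smart_check_commands(device):
    """Return the smartctl command lines for one disk, in the order described above.

    Hypothetical helper for illustration; nothing is executed here.
    """
    return [
        # Enable SMART, automatic offline testing and attribute autosave (once per disk).
        ["smartctl", "-s", "on", "-o", "on", "-S", "on", device],
        # Start the short self-test; wait for it to finish before the next step.
        ["smartctl", "-t", "short", device],
        # Start the long self-test; again wait for it to finish.
        ["smartctl", "-t", "long", device],
        # Print all SMART information and check it for logged errors.
        ["smartctl", "-a", device],
    ]


if __name__ == "__main__":
    # /dev/hda and /dev/hde are the two disks mentioned in the thread.
    for dev in ("/dev/hda", "/dev/hde"):
        for cmd in smart_check_commands(dev):
            print(" ".join(cmd))
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; each line can be pasted into a root shell once the previous test has finished.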
On Fri, 26 Nov 2004 09:54:32 +1300
Volker Kuhlmann
In fact, both disks passed the 'smartctl -a' test
That's not a test, that's querying the disk's status.
Thanks. I was confused by the output, which said:- "SMART overall-health self-assessment test result: PASSED". But presumably this is reported because of some previous test.
Run smartctl -s on -o on -S on once per disk. This enables smart and the disk's offline selftests. Any bad sectors found here show up as offline_uncorrectable. There was an excellent discussion on this on the smartmontool-users list over the last days.
Thanks. I'll join the list.
Run the short and long tests, -t short and -t long, waiting for each to finish. Then run smartctl -a and check for errors.
Both tests completed without error. Perhaps it's not a disk error but a controller or driver error (??) Would that show up in these tests? Thanks for your help. - Richard -- Richard Kimber http://www.psr.keele.ac.uk/
Thanks. I was confused by the output, which said:- "SMART overall-health self-assessment test result: PASSED"
The sum total of what the disk thinks of itself. If it ever changes to FAILED, you try to save what you can before going for lunch. In my limited (3 cases) experience, disks are *$@@ed well before this goes FAILED.
Run the short and long tests, -t short and -t long, waiting for each to finish. Then run smartctl -a and check for errors.
Both tests completed without error.
The disk doesn't find anything much wrong with itself. If the reallocated sector count is also 0, you should have nothing to worry about.
Perhaps it's not a disk error but a controller or driver error
More likely now.
Would that show up in these tests?
No, unless it corrupts some transfer such that the smart command "tell me your status" becomes "do a low-level format". Volker -- Volker Kuhlmann is possibly list0570 with the domain in header http://volker.dnsalias.net/ Please do not CC list postings to me.
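The points made in this exchange (a "Pre-fail" attribute is a category, not a verdict, and the reallocated sector count is worth checking) can be illustrated by parsing the attribute table that smartctl prints. The following is a minimal sketch under stated assumptions: the sample rows are invented for illustration (only the Seek_Time_Performance value of 250 comes from the thread, the thresholds and other figures are made up), and the helper names 'parse_attributes' and 'failing' are hypothetical. It relies on the usual smartctl rule that an attribute counts as failed only when its normalized VALUE is at or below THRESH.

```python
# Invented sample in the column layout of the smartctl attribute table.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   253   253   063    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0027   250   249   187    Pre-fail  Always       -       0
"""


def parse_attributes(text):
    """Parse attribute rows into a dict keyed by attribute name.

    Lines that do not start with a numeric attribute ID (such as the header) are skipped.
    """
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs[fields[1]] = {
            "id": int(fields[0]),
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "type": fields[6],
            "raw": " ".join(fields[9:]),
        }
    return attrs


def failing(attrs):
    """Names of attributes whose normalized value is at or below a non-zero threshold."""
    return sorted(
        name for name, a in attrs.items()
        if a["thresh"] > 0 and a["value"] <= a["thresh"]
    )


if __name__ == "__main__":
    attrs = parse_attributes(SAMPLE)
    print("failing attributes:", failing(attrs))
    print("reallocated sectors (raw):", attrs["Reallocated_Sector_Ct"]["raw"])
```

In this invented sample a value moving between 249 and 250 stays far above its threshold, so nothing is flagged; as noted above, none of this would reveal a controller or driver fault.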
On Fri, 26 Nov 2004 12:04:01 +1300
Volker Kuhlmann
Perhaps it's not a disk error but a controller or driver error
More likely now.
I see I have in the logs:-
kernel: hda: dma_timer_expiry: dma status ==0x24
kernel: PDC202XX: Primary channel reset.
kernel: hda: DMA interrupt recovery
kernel: hda: lost interrupt
This is what I got repeats of on the occasion it would not boot up. Is there any mileage in upgrading to 9.2, which has a different kernel? I tried the Live DVD recently and that worked fine, though of course I didn't check for any error messages. - Richard -- Richard Kimber http://www.psr.keele.ac.uk/
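To judge whether messages like these are occasional or recurring, it can help to tally them per device from the system log. The following is a minimal sketch, not a tool mentioned in the thread: the function name 'count_ide_messages' and the regular expression are assumptions, and the input is just the four log lines quoted above rather than a real log file.

```python
import re
from collections import Counter

# Assumed pattern: kernel IDE messages of the form "kernel: hdX: <message>".
IDE_MSG = re.compile(r"kernel: (hd[a-z]): (.+)")

# The four log lines quoted in the message above.
LOG_LINES = [
    "kernel: hda: dma_timer_expiry: dma status ==0x24",
    "kernel: PDC202XX: Primary channel reset.",
    "kernel: hda: DMA interrupt recovery",
    "kernel: hda: lost interrupt",
]


def count_ide_messages(lines):
    """Count IDE kernel messages per (device, message kind)."""
    counts = Counter()
    for line in lines:
        match = IDE_MSG.search(line)
        if not match:
            continue  # e.g. the controller-level PDC202XX line names no hdX device
        device, message = match.group(1), match.group(2).strip()
        # Keep only the text before the first colon so variable detail
        # (such as the DMA status code) does not split the counts.
        counts[(device, message.split(":")[0])] += 1
    return counts


if __name__ == "__main__":
    for (device, kind), n in sorted(count_ide_messages(LOG_LINES).items()):
        print(f"{device}: {kind}: {n}")
```

Fed with the lines of a real log file instead of LOG_LINES, the same function would show how often each event recurs per disk, which gives a baseline to compare against after a kernel upgrade.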
participants (5)
- Andi Kleen
- Andreas Wahlert
- rkimber@ntlworld.com
- Scott Leighton
- Volker Kuhlmann