[Bug 238572] New: SCSI Subsystem Locks-up Machine
https://bugzilla.novell.com/show_bug.cgi?id=238572

           Summary: SCSI Subsystem Locks-up Machine
           Product: openSUSE 10.2
           Version: Final
          Platform: x86
        OS/Version: Other
            Status: NEW
          Severity: Major
          Priority: P5 - None
         Component: Kernel
        AssignedTo: kernel-maintainers@forge.provo.novell.com
        ReportedBy: jim@edwardsj7.com
         QAContact: qa@suse.de

My system has the root, "/", and "/home" directories on two different SCSI
drives on the same chain of an Adaptec 29320 (U320) card and they are using the
AIC79xx driver version 3.0. From time to time the system will cease to function
and the SCSI activity indicator LED will stay on at 100% (usually it only
flashes very briefly) and the machine becomes unresponsive. Maybe after a long
time it will seem to return, but really the SCSI drives are off-line and even a
restart fails back to what looks like an "init 3" screen with "login"
displayed, but nothing is accepted. A system hard RESET, or power off and on,
is required to restore functionality.

The ensuing reboot displays a large number of transactions being replayed for
the "/" RFS, 800+ is not unusual, but nothing for the ext3 (when it was RFS it
had some transactions).

It seems that this lock-up is triggered by things that try to move a lot of
data over the SCSI bus, like "mkisofs" or some other large file transfers.
Moving "/home" to an IDE drive did not help. "/" is on an RFS version 3.6
partition and "/home" is on an ext3 partition (it was RFS, and changed to ext3
with no effect).

The same system hardware (AMD Athlon XP 2500+ etc.) has been running Suse 9.0
successfully with no such problems, and still can, so it seems to me to be
likely to be related to the newer drivers (version 9.0's are much older,
version 1.3.something, I think), possibly in association with RFS ???

For info ...
jim@linux:~> cat /proc/scsi/aic79xx/4
Adaptec AIC79xx driver version: 3.0
Adaptec 29320 Ultra320 SCSI adapter
aic7902: Ultra320 Wide Channel B, SCSI Id=7, PCI 33 or 66Mhz, 512 SCBs
Allocated SCBs: 68, SG List Length: 128

Serial EEPROM:
0x17c8 0x17c8 0x17c8 0x17c8 0x17c8 0x17c8 0x17c8 0x17c8
0x17c8 0x17c8 0x17c8 0x17c8 0x17c8 0x17c8 0x17c8 0x17c8
0x09f4 0x01c7 0x2807 0x0010 0xffff 0xffff 0xffff 0xffff
0xffff 0xffff 0xffff 0xffff 0xffff 0xffff 0x0410 0xb458

Target 0 Negotiation Settings
        User: 320.000MB/s transfers (160.000MHz RDSTRM|DT|IU|QAS, 16bit)
        Goal: 320.000MB/s transfers (160.000MHz DT|IU|QAS, 16bit)
        Curr: 320.000MB/s transfers (160.000MHz DT|IU|QAS, 16bit)
        Channel A Target 0 Lun 0 Settings
                Commands Queued 47423
                Commands Active 0
                Command Openings 32
                Max Tagged Openings 32
                Device Queue Frozen Count 0
Target 1 Negotiation Settings
        User: 320.000MB/s transfers (160.000MHz RDSTRM|DT|IU|QAS, 16bit)
        Goal: 320.000MB/s transfers (160.000MHz DT|IU|QAS, 16bit)
        Curr: 320.000MB/s transfers (160.000MHz DT|IU|QAS, 16bit)
        Channel A Target 1 Lun 0 Settings
                Commands Queued 105578
                Commands Active 0
                Command Openings 32
                Max Tagged Openings 32
                Device Queue Frozen Count 0
Target 2 Negotiation Settings
        User: 320.000MB/s transfers (160.000MHz RDSTRM|DT|IU|QAS, 16bit)
Target 3 Negotiation Settings
        User: 320.000MB/s transfers (160.000MHz RDSTRM|DT|IU|QAS, 16bit)
Target 4 Negotiation Settings
        User: 320.000MB/s transfers (160.000MHz RDSTRM|DT|IU|QAS, 16bit)
Target 5 Negotiation Settings
        User: 320.000MB/s transfers (160.000MHz RDSTRM|DT|IU|QAS, 16bit)
Target 6 Negotiation Settings
        User: 320.000MB/s transfers (160.000MHz RDSTRM|DT|IU|QAS, 16bit)
        Goal: 80.000MB/s transfers (40.000MHz, 16bit)
        Curr: 80.000MB/s transfers (40.000MHz, 16bit)
        Channel A Target 6 Lun 0 Settings
                Commands Queued 32843
                Commands Active 0
                Command Openings 1
                Max Tagged Openings 0
                Device Queue Frozen Count 0
Target 7 Negotiation Settings
        User: 320.000MB/s transfers (160.000MHz RDSTRM|DT|IU|QAS, 16bit)
Target 8 Negotiation Settings
        User: 320.000MB/s transfers (160.000MHz RDSTRM|DT|IU|QAS, 16bit)
Target 9 Negotiation Settings
        User: 320.000MB/s transfers (160.000MHz RDSTRM|DT|IU|QAS, 16bit)
Target 10 Negotiation Settings
        User: 320.000MB/s transfers (160.000MHz RDSTRM|DT|IU|QAS, 16bit)
Target 11 Negotiation Settings
        User: 320.000MB/s transfers (160.000MHz RDSTRM|DT|IU|QAS, 16bit)
Target 12 Negotiation Settings
        User: 320.000MB/s transfers (160.000MHz RDSTRM|DT|IU|QAS, 16bit)
Target 13 Negotiation Settings
        User: 320.000MB/s transfers (160.000MHz RDSTRM|DT|IU|QAS, 16bit)
Target 14 Negotiation Settings
        User: 320.000MB/s transfers (160.000MHz RDSTRM|DT|IU|QAS, 16bit)
Target 15 Negotiation Settings
        User: 320.000MB/s transfers (160.000MHz RDSTRM|DT|IU|QAS, 16bit)

Targets "0" and "1" are the two SCSI hard drives.

Thanks for any help

Jim

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
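
For anyone comparing adapters, the per-target "Curr:" rates in the listing above can be pulled out mechanically. The following is only a rough sketch: it assumes the output has one field per line, as the driver prints it, and the function name `current_rates` and the shortened `SAMPLE` text are made up for illustration.

```python
import re

# Shortened excerpt of the /proc/scsi/aic79xx output from this report
# (illustration only; real output has more targets and fields).
SAMPLE = """\
Target 0 Negotiation Settings
        User: 320.000MB/s transfers (160.000MHz RDSTRM|DT|IU|QAS, 16bit)
        Goal: 320.000MB/s transfers (160.000MHz DT|IU|QAS, 16bit)
        Curr: 320.000MB/s transfers (160.000MHz DT|IU|QAS, 16bit)
        Channel A Target 0 Lun 0 Settings
                Commands Queued 47423
Target 2 Negotiation Settings
        User: 320.000MB/s transfers (160.000MHz RDSTRM|DT|IU|QAS, 16bit)
Target 6 Negotiation Settings
        User: 320.000MB/s transfers (160.000MHz RDSTRM|DT|IU|QAS, 16bit)
        Goal: 80.000MB/s transfers (40.000MHz, 16bit)
        Curr: 80.000MB/s transfers (40.000MHz, 16bit)
"""

# "Target N Negotiation Settings" opens a new target section.
TARGET_RE = re.compile(r"^\s*Target (\d+) Negotiation Settings")
# "Curr:" carries the rate actually negotiated with the device.
CURR_RE = re.compile(r"^\s*Curr:\s*([\d.]+)MB/s")


def current_rates(text):
    """Map target id -> negotiated rate in MB/s, for targets that have one."""
    rates = {}
    target = None
    for line in text.splitlines():
        match = TARGET_RE.match(line)
        if match:
            target = int(match.group(1))
            continue
        match = CURR_RE.match(line)
        if match and target is not None:
            rates[target] = float(match.group(1))
    return rates


if __name__ == "__main__":
    for target, rate in sorted(current_rates(SAMPLE).items()):
        print(f"target {target}: {rate:.0f} MB/s")
```

Targets with only a "User:" line have no device attached and are skipped, so on this machine only targets 0, 1 and 6 would be listed.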
https://bugzilla.novell.com/show_bug.cgi?id=238572

lmb@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|kernel-                     |hare@novell.com
                   |maintainers@forge.provo.nove|
                   |ll.com                      |
https://bugzilla.novell.com/show_bug.cgi?id=238572

hare@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO
      Info Provider|                            |jim@edwardsj7.com

------- Comment #1 from hare@novell.com 2007-01-25 08:01 MST -------
Do you have any logfiles? Does /var/log/messages show anything?
https://bugzilla.novell.com/show_bug.cgi?id=238572

------- Comment #2 from jim@edwardsj7.com 2007-01-25 16:12 MST -------
OK, there are tons of sequential errors from the SCSI subsystem, but they are
associated with a DVD drive that is on the other channel of the controller card
and only occur once. I am sure they are not associated with the lock-ups :-(

The two SCSI drives are denoted sdb and sdc and there are no error messages
about sdc (where "/", RFS, is located) or sdb (where "/home", ext3, is). SMART
monitoring is enabled on the drives and it says they are healthy.

Scanning the log file shows some cases where there appears to be an abnormal
start, based upon syslog-ng starting without an associated "going down" on the
line before. Here are a couple of the sets of lines that are like this :-

Jan 18 21:34:20 linux smartd[4125]: Device: /dev/hda, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 54 to 55
Jan 18 21:34:20 linux smartd[4125]: Device: /dev/hda, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 54 to 55
Jan 18 21:47:50 linux syslog-ng[2887]: syslog-ng version 1.6.11 starting

Jan 19 16:30:47 linux kernel: klogd 1.4.1, ---------- state change ----------
Jan 19 16:38:14 linux smartd[4122]: Device: /dev/hda, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 54 to 53
Jan 19 16:38:14 linux smartd[4122]: Device: /dev/hda, SMART Usage Attribute: 194 Temperature_Celsius changed from 32 to 36
Jan 19 16:38:14 linux smartd[4122]: Device: /dev/hda, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 54 to 53
Jan 19 16:53:28 linux syslog-ng[2929]: syslog-ng version 1.6.11 starting

As you can see, the log entry prior to the restart was not immediately before
it. Usually I did not allow the system to hang for a long time once I realized
what was going on !

I think that as the SCSI drives went off-line there was no record of what was
happening :-(

When I get a chance I have to look and see if I have a suitably sized, unused
partition on either an IDE drive, or the other SCSI drive, to move "/" to as a
test.

Jim
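
The manual scan described in this comment (a syslog-ng start with no "going down" entry on the line before) could be automated roughly as follows. This is a sketch under stated assumptions: the helper names are invented, the shutdown marker is taken to be the literal text "going down" as quoted above, and the year is assumed to be 2007 because syslog timestamps do not carry one.

```python
from datetime import datetime

ASSUMED_YEAR = 2007  # assumption: syslog lines have no year; taken from the thread dates
SHUTDOWN_MARKER = "going down"  # assumption: wording quoted in the comment above

# Lines copied from the comment above (shortened to the relevant entries).
SAMPLE_LOG = """\
Jan 18 21:34:20 linux smartd[4125]: Device: /dev/hda, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 54 to 55
Jan 18 21:47:50 linux syslog-ng[2887]: syslog-ng version 1.6.11 starting
Jan 19 16:30:47 linux kernel: klogd 1.4.1, ---------- state change ----------
Jan 19 16:38:14 linux smartd[4122]: Device: /dev/hda, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 54 to 53
Jan 19 16:53:28 linux syslog-ng[2929]: syslog-ng version 1.6.11 starting
"""


def parse_stamp(line):
    """Parse the leading 'Mon DD HH:MM:SS' syslog timestamp of a log line."""
    return datetime.strptime(f"{ASSUMED_YEAR} {line[:15]}", "%Y %b %d %H:%M:%S")


def unclean_starts(log_text):
    """Return (last entry time, start time) for each syslog-ng start that
    is not directly preceded by a line containing the shutdown marker."""
    results = []
    previous = None
    for line in log_text.splitlines():
        if not line.strip():
            continue
        is_start = "syslog-ng" in line and line.rstrip().endswith("starting")
        if is_start and previous is not None and SHUTDOWN_MARKER not in previous:
            results.append((parse_stamp(previous), parse_stamp(line)))
        previous = line
    return results


if __name__ == "__main__":
    for last_entry, start in unclean_starts(SAMPLE_LOG):
        print(f"unclean start at {start}, last entry {start - last_entry} earlier")
```

The gap between the last entry and the restart only bounds when the hang began; as noted above, nothing is written once the drives go off-line.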
https://bugzilla.novell.com/show_bug.cgi?id=238572

------- Comment #3 from hare@novell.com 2007-02-06 01:21 MST -------
Sorry, but the above doesn't really help. SMART is for IDE drives only, so any
error there won't explain the SCSI lockup ...

I really have to have some logfiles which show the SCSI error. Otherwise
there's not much I can do here, sorry.
https://bugzilla.novell.com/show_bug.cgi?id=238572

------- Comment #4 from jim@edwardsj7.com 2007-02-06 16:06 MST -------
I would love to send you a log file, but once the SCSI system locks up all
writing to anything stops, so the logs show nothing.

The issue is very hard to understand: it happened twice yesterday almost
immediately after booting and then not for the rest of the evening. My feeling
is that heavy traffic on the SCSI bus is what provokes it, but it is only a
feeling; I have no hard evidence to back it up. For example, I can almost
guarantee to cause a lock-up by trying to make an "iso" file from my
/home/USER directory (~2.4GB), which takes data from a SCSI drive and then
sends the iso image back to the same drive ... Lots of traffic !

During a re-boot the zen update software runs and it accesses the drives a
lot; maybe if I am adding traffic to that by starting applications (browsers
in yesterday's case) it is overloaded. Maybe the timing in the driver is too
aggressive ?

As I said, Suse 9.0 has no such problem with the same hardware, but its AIC79xx
driver is much older. In fact I see that Adaptec has only up to a version 2
driver for 2.6 kernels, while 10.2 loads a version 3 ??? Could I downgrade the
driver to the Adaptec one ?

Anyway, if there is nothing that can be done, that's it. I will wait in the
hope that there will be an update to the driver that will fix the issue by
coincidence ;-)

Jim
https://bugzilla.novell.com/show_bug.cgi?id=238572

hare@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW
      Info Provider|jim@edwardsj7.com           |

------- Comment #5 from hare@novell.com 2007-02-07 01:19 MST -------
There actually is a chance that I have a patch which might help. If you have a
machine with a reasonable amount of memory (say 4G or somesuch) the adaptec
driver might become confused as to how much memory is actually available and
might switch to the wrong DMA mask.
https://bugzilla.novell.com/show_bug.cgi?id=238572

hare@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
https://bugzilla.novell.com/show_bug.cgi?id=238572

------- Comment #6 from hare@novell.com 2007-02-07 01:20 MST -------
Created an attachment (id=117772)
 --> (https://bugzilla.novell.com/attachment.cgi?id=117772&action=view)
aic79xx-use-dma-required-mask

Patch to use correct dma_get_required_mask() macros when calculating the DMA
mask.
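
The attachment itself is not reproduced in this thread, so the following is only a toy model of the general idea behind picking a DMA mask from the memory that actually has to be addressed. It is not the patch and not kernel code: the function names are invented, and the 32/39/64-bit tiers are an assumption about how the aic79xx driver sizes its addressing.

```python
def dma_bit_mask(bits):
    """All-ones mask covering `bits` address bits."""
    return (1 << bits) - 1


def required_mask(highest_phys_addr):
    """Smallest all-ones mask that covers the highest physical address.

    Loosely models what the kernel's dma_get_required_mask() reports.
    """
    return (1 << highest_phys_addr.bit_length()) - 1


def choose_addressing(highest_phys_addr):
    """Pick the narrowest addressing mode (in bits) that still reaches all RAM.

    Assumed tiers: 32-bit by default, 39-bit and 64-bit only when needed.
    """
    required = required_mask(highest_phys_addr)
    if required > dma_bit_mask(39):
        return 64
    if required > dma_bit_mask(32):
        return 39
    return 32


if __name__ == "__main__":
    gib = 2 ** 30
    for label, top in [("1.5 GiB", int(1.5 * gib) - 1),
                       ("8 GiB", 8 * gib - 1),
                       ("1 TiB", 1024 * gib - 1)]:
        print(f"{label}: {choose_addressing(top)}-bit addressing")
```

In this model a machine whose memory tops out below 4 GiB always stays on the 32-bit mask; a wider mask only comes into play once physical memory is mapped above that boundary.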
https://bugzilla.novell.com/show_bug.cgi?id=238572

------- Comment #7 from jim@edwardsj7.com 2007-02-07 19:25 MST -------
Thanks :-) I do not have 4GB of RAM in my current machine, only 1.5GB, but the
system drive will be transferred to a new system with 4GB in the future, so
it's worth a go :-)

From the addressing in the diff file I presume I can apply it as a patch from
the /usr/src/linux directory (it's been a while since I played with diff files,
back on HP-UX 9 I think !).

Maybe as root ...

cd /usr/src/linux/drivers/scsi/aic7xxx
patch -b
https://bugzilla.novell.com/show_bug.cgi?id=238572

------- Comment #8 from jim@edwardsj7.com 2007-02-17 20:52 MST -------
The patch was applied and I thought it had worked, the system stayed up for
over 4 days with no lock-ups, but in the end it did lock-up and it has done so
twice more since :-(

Jim
https://bugzilla.novell.com/show_bug.cgi?id=238572

------- Comment #9 from mkeys@catoosa.k12.ga.us 2007-02-23 14:31 MST -------
Jim, take a look at my report. It sounds very similar...

https://bugzilla.novell.com/show_bug.cgi?id=248448
https://bugzilla.novell.com/show_bug.cgi?id=238572

------- Comment #10 from jim@edwardsj7.com 2007-02-23 15:42 MST -------
It is just a feeling, with no statistical data to back it up, but I think that
since the DMA update the incidence of lock-ups is lower. At one point I was
getting one or more a day some days, now I am often going several days without
any misbehaviour (although last night it crashed and trashed a VPN
configuration file, the first data loss I can attribute thanks to ext3 and
RFS).
https://bugzilla.novell.com/show_bug.cgi?id=238572

------- Comment #11 from hare@novell.com 2007-02-26 00:48 MST -------
Hmm. Given the evidence from the other referenced bug, it might not be SCSI
related, but rather something memory / DMA related.
https://bugzilla.novell.com/show_bug.cgi?id=238572

------- Comment #12 from jim@edwardsj7.com 2007-03-12 18:54 MST -------
I ran the latest kernel upgrade (to 2.6.18.8-0.1) and the lock-up situation got
much worse, with 3+ in 24 hours, compared to maybe 0 or 1 in the same period
before. I checked and the new kernel had replaced the SCSI drivers with the
same ones as before patching, so I have reapplied the patch and we'll see what
happens. I think it will get better again, but still not perfect.

Jim
https://bugzilla.novell.com/show_bug.cgi?id=238572

------- Comment #13 from jim@edwardsj7.com 2007-03-28 20:51 MST -------
The re-patched SCSI files continue to work most of the time.

I found a sure-fire way to get a lock-up: install the latest version of Nero
(CD/DVD Writer) and try running it. It gets as far as the window saying it is
"Scanning for Drives" and freezes the machine with the SCSI active LED on,
requiring a reboot to recover !

With Suse 9.0 Nero would take my SCSI converted IDE DVD-RAM drive (LG GSA-4163
with Acard 7722) off line until a reboot, but the rest of the machine would
continue to work.
https://bugzilla.novell.com/show_bug.cgi?id=238572#c14
Hannes Reinecke
https://bugzilla.novell.com/show_bug.cgi?id=238572#c15
Arthur Edwards
https://bugzilla.novell.com/show_bug.cgi?id=238572#c16
--- Comment #16 from Arthur Edwards
https://bugzilla.novell.com/show_bug.cgi?id=238572#c17
Hannes Reinecke
https://bugzilla.novell.com/show_bug.cgi?id=238572#c18
--- Comment #18 from Arthur Edwards
https://bugzilla.novell.com/show_bug.cgi?id=238572#c19
Arthur Edwards
https://bugzilla.novell.com/show_bug.cgi?id=238572#c20
--- Comment #20 from Arthur Edwards
https://bugzilla.novell.com/show_bug.cgi?id=238572
Arthur Edwards
https://bugzilla.novell.com/show_bug.cgi?id=238572#c21
--- Comment #21 from Arthur Edwards
https://bugzilla.novell.com/show_bug.cgi?id=238572#c22
Stephan Kulow
https://bugzilla.novell.com/show_bug.cgi?id=238572
User hare@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=238572#c23
Hannes Reinecke
https://bugzilla.novell.com/show_bug.cgi?id=238572
User jim@edwardsj7.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=238572#c24
--- Comment #24 from Arthur Edwards