[opensuse-kernel] 2.6.34.8-15
Hi,

I am deploying 2.6.34.8-15 on many machines successfully. With about 10 HP Workstations I regularly get the following message in the kernel log:

[ 1751.292702] ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x4010000 action 0xe frozen
[ 1751.292706] ata1.00: irq_stat 0x00400040, connection status changed
[ 1751.292710] ata1: SError: { PHYRdyChg DevExch }
[ 1751.292712] ata1.00: failed command: FLUSH CACHE EXT
[ 1751.292719] ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[ 1751.292720] res 40/00:04:83:02:ee/00:00:02:00:00/40 Emask 0x10 (ATA bus error)
[ 1751.292723] ata1.00: status: { DRDY }
[ 1751.292728] ata1: hard resetting link
[ 1756.044876] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 1756.047118] ata1.00: configured for UDMA/100
[ 1756.047122] ata1.00: retrying FLUSH 0xea Emask 0x10
[ 1756.047250] ata1: EH complete

I also experience severe data loss whenever such an error occurs. Typically I cannot issue "make config" in the kernel sources anymore on an affected system due to binary garbage in C-Sources.

Any hints? Shall I open a bugzilla entry?

BTW: I have already tried all the relevant BIOS options, including an upgrade to the most recent BIOS and different energy saving options for SATA.
The hardware of the affected systems looks like:

00:00.0 Host bridge: Intel Corporation 5520/5500/X58 I/O Hub to ESI Port (rev 13)
00:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 (rev 13)
00:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 3 (rev 13)
00:07.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7 (rev 13)
00:10.0 PIC: Intel Corporation 5520/5500/X58 Physical and Link Layer Registers Port 0 (rev 13)
00:10.1 PIC: Intel Corporation 5520/5500/X58 Routing and Protocol Layer Registers Port 0 (rev 13)
00:11.0 PIC: Intel Corporation 5520/5500 Physical and Link Layer Registers Port 1 (rev 13)
00:11.1 PIC: Intel Corporation 5520/5500 Routing & Protocol Layer Register Port 1 (rev 13)
00:14.0 PIC: Intel Corporation 5520/5500/X58 I/O Hub System Management Registers (rev 13)
00:14.1 PIC: Intel Corporation 5520/5500/X58 I/O Hub GPIO and Scratch Pad Registers (rev 13)
00:14.2 PIC: Intel Corporation 5520/5500/X58 I/O Hub Control Status and RAS Registers (rev 13)
00:15.0 PIC: Intel Corporation 5520/5500/X58 Trusted Execution Technology Registers (rev 13)
00:1a.0 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #4
00:1a.1 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #5
00:1a.2 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #6
00:1a.7 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #2
00:1b.0 Audio device: Intel Corporation 82801JI (ICH10 Family) HD Audio Controller
00:1c.0 PCI bridge: Intel Corporation 82801JI (ICH10 Family) PCI Express Root Port 1
00:1c.5 PCI bridge: Intel Corporation 82801JI (ICH10 Family) PCI Express Root Port 6
00:1d.0 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #1
00:1d.1 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #2
00:1d.2 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #3
00:1d.7 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #1
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)
00:1f.0 ISA bridge: Intel Corporation 82801JIR (ICH10R) LPC Interface Controller
00:1f.2 RAID bus controller: Intel Corporation 82801 SATA RAID Controller
01:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5764M Gigabit Ethernet PCIe (rev 10)
0f:00.0 VGA compatible controller: nVidia Corporation G94 [Quadro FX 1800] (rev a1)
37:05.0 FireWire (IEEE 1394): Agere Systems FW322/323 (rev 70)
3f:00.0 Host bridge: Intel Corporation Xeon 5500/Core i7 QuickPath Architecture Generic Non-Core Registers (rev 05)
3f:00.1 Host bridge: Intel Corporation Xeon 5500/Core i7 QuickPath Architecture System Address Decoder (rev 05)
3f:02.0 Host bridge: Intel Corporation Xeon 5500/Core i7 QPI Link 0 (rev 05)
3f:02.1 Host bridge: Intel Corporation Xeon 5500/Core i7 QPI Physical 0 (rev 05)
3f:03.0 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Controller (rev 05)
3f:03.1 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Controller Target Address Decoder (rev 05)
3f:03.4 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Controller Test Registers (rev 05)
3f:04.0 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Controller Channel 0 Control Registers (rev 05)
3f:04.1 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Controller Channel 0 Address Registers (rev 05)
3f:04.2 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Controller Channel 0 Rank Registers (rev 05)
3f:04.3 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Controller Channel 0 Thermal Control Registers (rev 05)
3f:05.0 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Controller Channel 1 Control Registers (rev 05)
3f:05.1 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Controller Channel 1 Address Registers (rev 05)
3f:05.2 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Controller Channel 1 Rank Registers (rev 05)
3f:05.3 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Controller Channel 1 Thermal Control Registers (rev 05)
3f:06.0 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Controller Channel 2 Control Registers (rev 05)
3f:06.1 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Controller Channel 2 Address Registers (rev 05)
3f:06.2 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Controller Channel 2 Rank Registers (rev 05)
3f:06.3 Host bridge: Intel Corporation Xeon 5500/Core i7 Integrated Memory Controller Channel 2 Thermal Control Registers (rev 05)

Regards,
Martin Konold

Robert Bosch GmbH
Automotive Electronics (RtP2/TEF72)
Postfach 13 42
72703 Reutlingen
GERMANY
www.bosch.com
external.martin.konold@de.bosch.com

Sitz: Stuttgart, Registergericht: Amtsgericht Stuttgart, HRB 14000; Aufsichtsratsvorsitzender: Hermann Scholl; Geschäftsführung: Franz Fehrenbach, Siegfried Dais; Bernd Bohr, Rudolf Colm, Volkmar Denner, Wolfgang Malchow, Peter Marks, Peter Tyroller; Stefan Asenkerschbaumer, Uwe Raschke, Wolf-Henning Scheider
--
To unsubscribe, e-mail: opensuse-kernel+unsubscribe@opensuse.org
For additional commands, e-mail: opensuse-kernel+help@opensuse.org
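For readers following the kernel log in the message above, here is a minimal Python sketch of how the reported value SErr 0x4010000 breaks down into the two flags the kernel printed, { PHYRdyChg DevExch }. The bit positions (16 and 26) are inferred from that log line and are an assumption of this sketch; all other SError bits are deliberately left undecoded, and the function name decode_serr is hypothetical.

```python
# Bit positions are inferred from the log itself: the kernel printed
# "SErr 0x4010000" and decoded it as { PHYRdyChg DevExch }.
# Only these two flags are listed; any other bit is reported as unknown.
SERR_FLAGS = {
    1 << 16: "PHYRdyChg",  # PHY ready state changed
    1 << 26: "DevExch",    # link saw the device being exchanged
}


def decode_serr(value):
    """Return (known flag names, remaining undecoded bits) for an SErr value."""
    names = [name for bit, name in sorted(SERR_FLAGS.items()) if value & bit]
    known_mask = sum(SERR_FLAGS)  # sum of the dict keys, i.e. all known bits
    return names, value & ~known_mask


names, unknown = decode_serr(0x4010000)
print(names, hex(unknown))
```

Both flags describe the link itself dropping and coming back, not a media error on the platters, which fits the "hard resetting link" line that follows in the log.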
On Wed, Mar 09, 2011 at 05:27:52PM +0100, EXTERNAL Konold Martin (Firma, RtP2/TEF72) wrote:
Hi,
I am deploying 2.6.34.8-15 on many machines successfully.
Is this the kernel from openSUSE 11.3?
With about 10 HP Workstations I regularly get the following message in the kernel log:
[ 1751.292702] ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x4010000 action 0xe frozen
[ 1751.292706] ata1.00: irq_stat 0x00400040, connection status changed
[ 1751.292710] ata1: SError: { PHYRdyChg DevExch }
[ 1751.292712] ata1.00: failed command: FLUSH CACHE EXT
[ 1751.292719] ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[ 1751.292720] res 40/00:04:83:02:ee/00:00:02:00:00/40 Emask 0x10 (ATA bus error)
[ 1751.292723] ata1.00: status: { DRDY }
[ 1751.292728] ata1: hard resetting link
[ 1756.044876] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 1756.047118] ata1.00: configured for UDMA/100
[ 1756.047122] ata1.00: retrying FLUSH 0xea Emask 0x10
[ 1756.047250] ata1: EH complete
I also experience severe data loss whenever such an error occurs.
Perhaps you have a failing drive?
Typically I cannot issue "make config" in the kernel sources anymore on an affected system due to binary garbage in C-Sources.
What do you mean by this?

thanks,
greg k-h
Hi Greg,
From: Greg KH [mailto:gregkh@suse.de]
I am deploying 2.6.34.8-15 on many machines successfully.
Is this the kernel from openSUSE 11.3?
Yes, it is from http://download.opensuse.org/repositories/Kernel:/openSUSE-11.3/openSUSE_11....
With about 10 HP Workstations I regularly get the
I also experience severe data loss whenever such an error occurs.
Perhaps you have a failing drive?
Yes, this was also my first idea, but I can observe the problem (about once a day) distributed over nearly all 10 machines, which makes me doubt that a single failing drive is the cause. In addition, all these machines previously ran OpenSUSE 10.2 without such a problem.
Typically I cannot issue "make config" in the kernel sources anymore on an affected system due to binary garbage in C-Sources.
What do you mean by this?
Whenever the above problem is visible in the kernel log I notice the following strange thing:

rt-z9856:/usr/src/linux # file init/*
init/calibrate.c: ASCII C program text
init/do_mounts.c: ASCII C program text
init/do_mounts.h: ASCII C program text
init/do_mounts_initrd.c: ASCII C program text
init/do_mounts_md.c: ASCII C program text
init/do_mounts_rd.c: ASCII C program text
init/initramfs.c: ASCII C program text
init/Kconfig: ASCII English text
init/main.c: ASCII C program text
init/Makefile: ASCII English text
init/noinitramfs.c: ASCII C program text
init/version.c: ASCII C program text

/proc/config.gz is fine.

"make cloneconfig" works if the above error is not in the logs, but it does not work otherwise. In the latter case files like init/Kconfig contain binary garbage.

I must admit that I have no clue why those files get corrupted and how this is related to the observed kernel log entries.

Best regards
Martin Konold
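The manual check with file init/* described above can be widened to a whole source tree. The following is a rough Python sketch, not what was actually run on the affected machines: it treats any file containing a NUL byte as suspect, which is only a heuristic (legitimate binary files in the tree would also be flagged). The function names looks_binary and find_suspect_files are hypothetical.

```python
import os


def looks_binary(path, chunk_size=65536):
    """Heuristic: a text source file should not contain NUL bytes."""
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                return False  # reached end of file without seeing a NUL byte
            if b"\0" in chunk:
                return True


def find_suspect_files(root):
    """Walk root and return the sorted paths of regular files containing NUL bytes."""
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and looks_binary(path):
                suspects.append(path)
    return sorted(suspects)


# Example call (path as used in the thread):
#   for path in find_suspect_files("/usr/src/linux/init"):
#       print(path)
```

Running such a scan right after an ata exception appears in the log, and again on a machine without the exception, would show whether the garbage is limited to recently written files.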
Hi Greg,
From: Greg KH [mailto:gregkh@suse.de]
I also experience severe data loss whenever such an error occurs.
Perhaps you have a failing drive?
I hooked up another WD drive, a WDC WD5002ABYS-02B1B0 from a different family, in place of the failing drive WDC WD3200AAJS-60Z0A0 on one of the failing machines and am doing endless kiwi builds. So far it has been working for about 4 hours with an otherwise unmodified setup.

Can the observed error be explained with a firmware bug? All failing machines have drives from the same family, either WDC WD2500AAJS-60M0A0 or WDC WD3200AAJS-60Z0A0. (Yes, I validated that a failing drive had the most recent available firmware, 03.03E03.)

Best regards
Martin Konold
On Thu, Mar 10, 2011 at 1:38 PM, EXTERNAL Konold Martin (Firma, RtP2/TEF72) <external.Martin.Konold@de.bosch.com> wrote:
Hi Greg,
From: Greg KH [mailto:gregkh@suse.de]
I also experience severe data loss whenever such an error occurs.
Perhaps you have a failing drive?
I hooked up another WD drive WDC WD5002ABYS-02B1B0 from a different family instead of the failing drive WDC WD3200AAJS-60Z0A0 to one of the failing machines and doing endless kiwi builds.
So far it is working since about 4 hours with an otherwise unmodified setup.
Can the observed error be explained with a firmware bug?
All failing machines have drives from the same family either WDC WD2500AAJS-60M0A0 or WDC WD3200AAJS-60Z0A0.
(Yes I validated that a failing drive had the most recent available firmware 03.03E03)
Best regards
Martin Konold
I haven't followed this thread, but there have been reported situations where a firmware bug caused smartd to actually trigger data loss. I don't recall if that was a rotating or SSD drive. That is, when smartd queried the drive for diagnostic info, the drive would somehow lose data.

If that might be your situation, you could attempt to disable smartd for a period of time and see if your issue goes away.

Thanks
Greg
On Thu, Mar 10, 2011 at 07:38:49PM +0100, EXTERNAL Konold Martin (Firma, RtP2/TEF72) wrote:
Hi Greg,
From: Greg KH [mailto:gregkh@suse.de]
I also experience severe data loss whenever such an error occurs.
Perhaps you have a failing drive?
I hooked up another WD drive WDC WD5002ABYS-02B1B0 from a different family instead of the failing drive WDC WD3200AAJS-60Z0A0 to one of the failing machines and doing endless kiwi builds.
So far it is working since about 4 hours with an otherwise unmodified setup.
Can the observed error be explained with a firmware bug?
Possibly.
All failing machines have drives from the same family either WDC WD2500AAJS-60M0A0 or WDC WD3200AAJS-60Z0A0.
(Yes I validated that a failing drive had the most recent available firmware 03.03E03)
Sounds like a bad drive family :(

As you have fixed it by changing the drive, I suggest not using those other models anymore.

thanks,
greg k-h
On 3/10/2011 2:05 PM, Greg KH wrote:
On Thu, Mar 10, 2011 at 07:38:49PM +0100, EXTERNAL Konold Martin (Firma, RtP2/TEF72) wrote:
Hi Greg,
From: Greg KH [mailto:gregkh@suse.de]
I also experience severe data loss whenever such an error occurs.
Perhaps you have a failing drive?
I hooked up another WD drive WDC WD5002ABYS-02B1B0 from a different family instead of the failing drive WDC WD3200AAJS-60Z0A0 to one of the failing machines and doing endless kiwi builds.
So far it is working since about 4 hours with an otherwise unmodified setup.
Can the observed error be explained with a firmware bug?
Possibly.
All failing machines have drives from the same family either WDC WD2500AAJS-60M0A0 or WDC WD3200AAJS-60Z0A0.
(Yes I validated that a failing drive had the most recent available firmware 03.03E03)
Sounds like a bad drive family :(
As you have fixed it by changing the drive, I suggest not using those other models anymore.
He had no problems at all on 10.2.

--
bkw
Brian K. White wrote:
On 3/10/2011 Greg KH wrote:
On Thu, Mar 10, 2011 EXTERNAL Konold Martin wrote:
I hooked up another WD drive WDC WD5002ABYS-02B1B0 from a different family instead of the failing drive WDC WD3200AAJS-60Z0A0 to one of the failing machines and doing endless kiwi builds.
Is this the WD TLER problem/ripoff? The originals are 'desktop' while the replacement is 'enterprise'.
So far it is working since about 4 hours with an otherwise unmodified setup. [snip] Sounds like a bad drive family :(
I just don't buy WD drives anymore :)
As you have fixed it by changing the drive, I suggest not using those other models anymore.
He had no problems at all on 10.2 .
Martin, at this point I really would suggest taking it to the linux-ide list. Did you also try a new kernel?

Cheers,
Dave
Hi Dave,
I hooked up another WD drive WDC WD5002ABYS-02B1B0 from a
of the failing drive WDC WD3200AAJS-60Z0A0 to one of the failing
Is this the WD TLER problem/ripoff? The originals are 'desktop' while the replacement is 'enterprise'.
I have no clue about a desktop/enterprise difference. Where is such a list maintained? The faulty drives are directly shipped from HP with the z400 and xw4600 workstations.
He had no problems at all on 10.2 .
I verified that the problem is not triggered with 11.0 either.
Martin, at this point I really would suggest taking it to the linux-ide list.
You mean it is not appropriate for opensuse-kernel anymore?
Did you also try a new kernel?
Yes, I tried with OpenSUSE 11.4 DVD install just this morning. During the very first installation of OpenSUSE 11.4 on a problematic machine I got the following crash: [ 74.278186] EDD information not available. [ 74.855799] REISERFS (device sda2): found reiserfs format "3.6" with standard journal [ 74.855809] REISERFS (device sda2): using ordered data mode [ 74.855811] reiserfs: using flush barriers [ 74.861775] REISERFS (device sda2): journal params: device sda2, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30 [ 74.861916] REISERFS (device sda2): checking transaction log (sda2) [ 74.908725] REISERFS (device sda2): Using r5 hash to sort names [ 75.038521] REISERFS (device sda3): found reiserfs format "3.6" with standard journal [ 75.038534] REISERFS (device sda3): using ordered data mode [ 75.038536] reiserfs: using flush barriers [ 75.045230] REISERFS (device sda3): journal params: device sda3, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30 [ 75.045387] REISERFS (device sda3): checking transaction log (sda3) [ 75.379076] REISERFS (device sda3): Using r5 hash to sort names [ 75.422596] BUG: unable to handle kernel NULL pointer dereference at 00000004 [ 75.422607] IP: [<c031fe99>] shrink_dcache_for_umount_subtree+0xf9/0x1b0 [ 75.422617] *pde = 00000000 [ 75.422623] Oops: 0002 [#1] SMP [ 75.422628] last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/dev [ 75.422635] Modules linked in: reiserfs dm_mod multipath raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 parport_pc parport nls_utf8 usb_storage arc4 ecb acpi_cpufreq mperf fan thermal nfs nfs_acl lockd fscache auth_rpcgss sunrpc nls_iso8859_1 nls_cp437 af_packet st usbhid hid sr_mod sg cdrom uhci_hcd ahci libahci rtc_cmos rtc_core libata tg3 button ehci_hcd usbcore processor rtc_lib thermal_sys hwmon squashfs loop 
[last unloaded: parport] [ 75.422675] [ 75.422681] Pid: 4221, comm: umount Not tainted 2.6.37.1-1.2-default #1 Hewlett-Packard HP Z400 Workstation/0B4Ch [ 75.422689] EIP: 0060:[<c031fe99>] EFLAGS: 00010246 CPU: 6 [ 75.422695] EIP is at shrink_dcache_for_umount_subtree+0xf9/0x1b0 [ 75.422701] EAX: c031e350 EBX: f67f9c6c ECX: f67f9c98 EDX: 00000000 [ 75.422706] ESI: 00000000 EDI: f67f9ca8 EBP: f0ce2000 ESP: f0ce3edc [ 75.422712] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 [ 75.422717] Process umount (pid: 4221, ti=f0ce2000 task=f0c9e530 task.ti=f0ce2000) [ 75.422723] Stack: [ 75.422727] f67f9c6c 00000000 f8a6ea40 f0ce2000 c031e822 f67f9c6c f67f9c74 f6c85e00 [ 75.422735] f67f9c6c f8a6ea40 f0ce2000 f8a6cbd9 f6c85e00 f6c85e70 f8a57944 f0ce3f40 [ 75.422743] f5d3b400 c03216d6 f5d3b3f8 c032196e f6c85e58 f6c85e70 f0ce3f40 c0322920 [ 75.422751] Call Trace: [ 75.422766] [<f8a6cbd9>] reiserfs_xattr_shutdown+0x59/0x80 [reiserfs] [ 75.422810] [<f8a57944>] reiserfs_put_super+0x14/0x130 [reiserfs] [ 75.422830] [<c030f1d7>] generic_shutdown_super+0x57/0xd0 [ 75.422837] [<c030f272>] kill_block_super+0x22/0x40 [ 75.422844] [<c030f6c5>] deactivate_locked_super+0x35/0x50 [ 75.422852] [<c03262c1>] sys_umount+0x61/0xc0 [ 75.422859] [<c0326337>] sys_oldumount+0x17/0x20 [ 75.422867] [<c05fc265>] syscall_call+0x7/0xb [ 75.422875] [<b77e926d>] 0xb77e926d [ 75.422880] Code: e8 0d e9 ff ff 85 f6 74 77 8b 5e 3c 8d 46 3c 39 c3 75 e1 89 f3 8b 03 85 c0 75 73 8b 73 1c 39 de 74 68 f0 ff 0e 8b 53 34 8b 43 38 <89> 42 04 89 10 8b 53 10 c7 43 34 00 01 10 00 c7 43 38 00 02 20 [ 75.422908] EIP: [<c031fe99>] shrink_dcache_for_umount_subtree+0xf9/0x1b0 SS:ESP 0068:f0ce3edc [ 75.422916] CR2: 0000000000000004 [ 75.422921] ---[ end trace cb1d9af19b0474a1 ]--- [ 75.422927] ------------[ cut here ]------------ [ 75.422933] WARNING: at /usr/src/packages/BUILD/kernel-default-2.6.37.1/linux-2.6.37/kernel/exit.c:910 do_exit+0x2a4/0x360() [ 75.422940] Hardware name: HP Z400 Workstation [ 75.422944] Modules linked 
in: reiserfs dm_mod multipath raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 parport_pc parport nls_utf8 usb_storage arc4 ecb acpi_cpufreq mperf fan thermal nfs nfs_acl lockd fscache auth_rpcgss sunrpc nls_iso8859_1 nls_cp437 af_packet st usbhid hid sr_mod sg cdrom uhci_hcd ahci libahci rtc_cmos rtc_core libata tg3 button ehci_hcd usbcore processor rtc_lib thermal_sys hwmon squashfs loop [last unloaded: parport] [ 75.422983] Pid: 4221, comm: umount Tainted: G D 2.6.37.1-1.2-default #1 [ 75.422988] Call Trace: [ 75.422996] [<c02060e3>] try_stack_unwind+0x173/0x190 [ 75.423004] [<c0204e8f>] dump_trace+0x3f/0xe0 [ 75.423011] [<c020614b>] show_trace_log_lvl+0x4b/0x60 [ 75.423017] [<c0206178>] show_trace+0x18/0x20 [ 75.423024] [<c05f9945>] dump_stack+0x6d/0x72 [ 75.423032] [<c0243998>] warn_slowpath_common+0x78/0xb0 [ 75.423040] [<c02439eb>] warn_slowpath_null+0x1b/0x20 [ 75.423046] [<c0247794>] do_exit+0x2a4/0x360 [ 75.423053] [<c05fd3b7>] oops_end+0x87/0xc0 [ 75.423061] [<c0226072>] no_context+0xc2/0x150 [ 75.423069] [<c02262eb>] bad_area+0x3b/0x50 [ 75.423076] [<c05ff50b>] do_page_fault+0x3db/0x430 [ 75.423084] [<c05fc8fa>] error_code+0x5a/0x60 [ 75.423092] [<c031fe99>] shrink_dcache_for_umount_subtree+0xf9/0x1b0 [ 75.423103] [<f8a6cbd9>] reiserfs_xattr_shutdown+0x59/0x80 [reiserfs] [ 75.423142] [<f8a57944>] reiserfs_put_super+0x14/0x130 [reiserfs] [ 75.423161] [<c030f1d7>] generic_shutdown_super+0x57/0xd0 [ 75.423168] [<c030f272>] kill_block_super+0x22/0x40 [ 75.423175] [<c030f6c5>] deactivate_locked_super+0x35/0x50 [ 75.423182] [<c03262c1>] sys_umount+0x61/0xc0 [ 75.423188] [<c0326337>] sys_oldumount+0x17/0x20 [ 75.423195] [<c05fc265>] syscall_call+0x7/0xb [ 75.423203] [<b77e926d>] 0xb77e926d [ 75.423208] ---[ end trace cb1d9af19b0474a2 ]--- [ 75.537152] REISERFS (device sda4): found reiserfs format "3.6" with standard journal [ 75.537181] REISERFS (device sda4): using ordered data mode [ 75.537191] reiserfs: 
using flush barriers

I then cleaned the hdd using dd and tried a second install. This time it went well, and I have now been compiling the kernel in an endless loop for 45 minutes.

Best regards
Martin Konold
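An endless kernel compile only shows corruption indirectly, when a build breaks. As a complement, here is a minimal Python sketch of a direct write-and-verify check. It is an illustration under stated assumptions, not a tool used in this thread: the function name write_and_verify is hypothetical, and the re-read will normally be served from the page cache, so to test what actually reached the disk the verification would have to happen after the cache has been dropped or after a reboot.

```python
import hashlib
import os


def write_and_verify(path, size=1 << 20):
    """Write random bytes to path, force them out with fsync, then
    re-read the file and compare SHA-256 digests.

    Returns True when the digest of the re-read data matches.
    """
    data = os.urandom(size)
    expected = hashlib.sha256(data).hexdigest()
    with open(path, "wb") as fh:
        fh.write(data)
        fh.flush()
        # fsync asks the kernel to push the data to the device; with write
        # barriers enabled this typically results in a cache flush command.
        os.fsync(fh.fileno())
    with open(path, "rb") as fh:
        actual = hashlib.sha256(fh.read()).hexdigest()
    return actual == expected
```

Calling this in a loop on the suspect drive while watching the kernel log would tie a failed FLUSH CACHE EXT more closely to a specific write than a multi-hour build does.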
EXTERNAL Konold Martin (Firma, RtP2/TEF72) wrote:
Hi Dave,
I hooked up another WD drive WDC WD5002ABYS-02B1B0 from a
of the failing drive WDC WD3200AAJS-60Z0A0 to one of the failing
Is this the WD TLER problem/ripoff? The originals are 'desktop' while the replacement is 'enterprise'.
I have no clue about a desktop/enterprise difference. Where is such a list maintained? The faulty drives are directly shipped from HP with the z400 and xw4600 workstations.
I don't want to say too much, because my opinions as to WD's behaviour are unprintable! 'TLER' is WD's name for a mandatory ATA-8 feature called 'Error Recovery Control' that they do not implement in some drives. So I suggest you google for more details using the words I quoted.
He had no problems at all on 10.2 .
I verified that the problem is not triggered with 11.0 either.
Martin, at this point I really would suggest taking it to the linux-ide list.
You mean it is not appropriate for opensuse-kernel anymore?
No, I'm sorry I didn't state it well; I meant no criticism of this list. It's just that I think there are several people on the linux-ide list who are very familiar with the libata subsystem development and familiar with many different drives who may recognize your exact symptoms and so provide a definitive solution quickly and easily.
Did you also try a new kernel?
Yes, I tried with OpenSUSE 11.4 DVD install just this morning.
During the very first installation of OpenSUSE 11.4 on a problematic machine I got the following crash:
[snip]
[ 75.422596] BUG: unable to handle kernel NULL pointer dereference at 00000004
[ 75.422607] IP: [<c031fe99>] shrink_dcache_for_umount_subtree+0xf9/0x1b0
[ 75.422617] *pde = 00000000
[ 75.422623] Oops: 0002 [#1] SMP
[snip]
I then cleaned the hdd using dd and then tried a second install. This time it went well and I am currently compiling the kernel in an endless loop (since 45 min).
OK. There was no ata error in the log you posted, just the kernel oops. So perhaps the drive is working correctly with the new kernel whilst a previous failure had corrupted the data, causing the oops. That would be good news :)

I believe the current 11.4 kernel is also the current stable one, which I think is 2.6.37.3. Though your subject line says 2.6.37.1-1.2; I guess that is what is on the DVD. Did you run an initial upgrade?

So the implication would seem to be that your problem has probably been fixed by some patch between 2.6.34.8-15 and 2.6.37.1-1.2. Which means somebody on the linux-ide list probably can suggest exactly which patch solved the problem and so you can be confident of the fix.

Cheers,
Dave
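The distinction drawn above, libata exceptions in the 2.6.34 log versus only a kernel oops in the 2.6.37 log, can be checked mechanically on saved dmesg output. The following Python sketch assumes one log message per line; the two regular expressions are modelled only on the messages quoted in this thread and the function name classify_log is hypothetical.

```python
import re

# Patterns modelled on the log lines quoted in this thread.
ATA_EXCEPTION = re.compile(r"\bata\d+(?:\.\d+)?: exception Emask")
KERNEL_OOPS = re.compile(r"BUG: unable to handle kernel|\bOops: ")


def classify_log(text):
    """Count libata exception lines and oops-related lines in dmesg text."""
    lines = text.splitlines()
    return {
        "ata_exceptions": sum(1 for line in lines if ATA_EXCEPTION.search(line)),
        "oops_lines": sum(1 for line in lines if KERNEL_OOPS.search(line)),
    }


old_log = (
    "[ 1751.292702] ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x4010000 action 0xe frozen\n"
    "[ 1751.292728] ata1: hard resetting link\n"
)
new_log = (
    "[ 75.422596] BUG: unable to handle kernel NULL pointer dereference at 00000004\n"
    "[ 75.422623] Oops: 0002 [#1] SMP\n"
)
print(classify_log(old_log))
print(classify_log(new_log))
```

A zero count for ata exceptions over a long test run on the newer kernel would support the reading that the link errors are gone, while the oops would still need to be tracked separately.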
On 03/11/2011 07:55 AM, EXTERNAL Konold Martin (Firma, RtP2/TEF72) wrote:
[ 74.278186] EDD information not available. [ 74.855799] REISERFS (device sda2): found reiserfs format "3.6" with standard journal [ 74.855809] REISERFS (device sda2): using ordered data mode [ 74.855811] reiserfs: using flush barriers [ 74.861775] REISERFS (device sda2): journal params: device sda2, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30 [ 74.861916] REISERFS (device sda2): checking transaction log (sda2) [ 74.908725] REISERFS (device sda2): Using r5 hash to sort names [ 75.038521] REISERFS (device sda3): found reiserfs format "3.6" with standard journal [ 75.038534] REISERFS (device sda3): using ordered data mode [ 75.038536] reiserfs: using flush barriers [ 75.045230] REISERFS (device sda3): journal params: device sda3, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30 [ 75.045387] REISERFS (device sda3): checking transaction log (sda3) [ 75.379076] REISERFS (device sda3): Using r5 hash to sort names [ 75.422596] BUG: unable to handle kernel NULL pointer dereference at 00000004 [ 75.422607] IP: [<c031fe99>] shrink_dcache_for_umount_subtree+0xf9/0x1b0 [ 75.422617] *pde = 00000000 [ 75.422623] Oops: 0002 [#1] SMP [ 75.422628] last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda/dev [ 75.422635] Modules linked in: reiserfs dm_mod multipath raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 parport_pc parport nls_utf8 usb_storage arc4 ecb acpi_cpufreq mperf fan thermal nfs nfs_acl lockd fscache auth_rpcgss sunrpc nls_iso8859_1 nls_cp437 af_packet st usbhid hid sr_mod sg cdrom uhci_hcd ahci libahci rtc_cmos rtc_core libata tg3 button ehci_hcd usbcore processor rtc_lib thermal_sys hwmon squashfs loop [last unloaded: parport] [ 75.422675] [ 75.422681] Pid: 4221, comm: umount Not tainted 2.6.37.1-1.2-default #1 Hewlett-Packard HP Z400 Workstation/0B4Ch [ 75.422689] 
EIP: 0060:[<c031fe99>] EFLAGS: 00010246 CPU: 6 [ 75.422695] EIP is at shrink_dcache_for_umount_subtree+0xf9/0x1b0 [ 75.422701] EAX: c031e350 EBX: f67f9c6c ECX: f67f9c98 EDX: 00000000 [ 75.422706] ESI: 00000000 EDI: f67f9ca8 EBP: f0ce2000 ESP: f0ce3edc [ 75.422712] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 [ 75.422717] Process umount (pid: 4221, ti=f0ce2000 task=f0c9e530 task.ti=f0ce2000) [ 75.422723] Stack: [ 75.422727] f67f9c6c 00000000 f8a6ea40 f0ce2000 c031e822 f67f9c6c f67f9c74 f6c85e00 [ 75.422735] f67f9c6c f8a6ea40 f0ce2000 f8a6cbd9 f6c85e00 f6c85e70 f8a57944 f0ce3f40 [ 75.422743] f5d3b400 c03216d6 f5d3b3f8 c032196e f6c85e58 f6c85e70 f0ce3f40 c0322920 [ 75.422751] Call Trace: [ 75.422766] [<f8a6cbd9>] reiserfs_xattr_shutdown+0x59/0x80 [reiserfs] [ 75.422810] [<f8a57944>] reiserfs_put_super+0x14/0x130 [reiserfs] [ 75.422830] [<c030f1d7>] generic_shutdown_super+0x57/0xd0 [ 75.422837] [<c030f272>] kill_block_super+0x22/0x40 [ 75.422844] [<c030f6c5>] deactivate_locked_super+0x35/0x50 [ 75.422852] [<c03262c1>] sys_umount+0x61/0xc0 [ 75.422859] [<c0326337>] sys_oldumount+0x17/0x20 [ 75.422867] [<c05fc265>] syscall_call+0x7/0xb [ 75.422875] [<b77e926d>] 0xb77e926d [ 75.422880] Code: e8 0d e9 ff ff 85 f6 74 77 8b 5e 3c 8d 46 3c 39 c3 75 e1 89 f3 8b 03 85 c0 75 73 8b 73 1c 39 de 74 68 f0 ff 0e 8b 53 34 8b 43 38 <89> 42 04 89 10 8b 53 10 c7 43 34 00 01 10 00 c7 43 38 00 02 20 [ 75.422908] EIP: [<c031fe99>] shrink_dcache_for_umount_subtree+0xf9/0x1b0 SS:ESP 0068:f0ce3edc [ 75.422916] CR2: 0000000000000004 [ 75.422921] ---[ end trace cb1d9af19b0474a1 ]--- [ 75.422927] ------------[ cut here ]------------ [ 75.422933] WARNING: at /usr/src/packages/BUILD/kernel-default-2.6.37.1/linux-2.6.37/kernel/exit.c:910 do_exit+0x2a4/0x360() [ 75.422940] Hardware name: HP Z400 Workstation [ 75.422944] Modules linked in: reiserfs dm_mod multipath raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 parport_pc parport nls_utf8 usb_storage 
arc4 ecb acpi_cpufreq mperf fan thermal nfs nfs_acl lockd fscache auth_rpcgss sunrpc nls_iso8859_1 nls_cp437 af_packet st usbhid hid sr_mod sg cdrom uhci_hcd ahci libahci rtc_cmos rtc_core libata tg3 button ehci_hcd usbcore processor rtc_lib thermal_sys hwmon squashfs loop [last unloaded: parport] [ 75.422983] Pid: 4221, comm: umount Tainted: G D 2.6.37.1-1.2-default #1 [ 75.422988] Call Trace: [ 75.422996] [<c02060e3>] try_stack_unwind+0x173/0x190 [ 75.423004] [<c0204e8f>] dump_trace+0x3f/0xe0 [ 75.423011] [<c020614b>] show_trace_log_lvl+0x4b/0x60 [ 75.423017] [<c0206178>] show_trace+0x18/0x20 [ 75.423024] [<c05f9945>] dump_stack+0x6d/0x72 [ 75.423032] [<c0243998>] warn_slowpath_common+0x78/0xb0 [ 75.423040] [<c02439eb>] warn_slowpath_null+0x1b/0x20 [ 75.423046] [<c0247794>] do_exit+0x2a4/0x360 [ 75.423053] [<c05fd3b7>] oops_end+0x87/0xc0 [ 75.423061] [<c0226072>] no_context+0xc2/0x150 [ 75.423069] [<c02262eb>] bad_area+0x3b/0x50 [ 75.423076] [<c05ff50b>] do_page_fault+0x3db/0x430 [ 75.423084] [<c05fc8fa>] error_code+0x5a/0x60 [ 75.423092] [<c031fe99>] shrink_dcache_for_umount_subtree+0xf9/0x1b0 [ 75.423103] [<f8a6cbd9>] reiserfs_xattr_shutdown+0x59/0x80 [reiserfs] [ 75.423142] [<f8a57944>] reiserfs_put_super+0x14/0x130 [reiserfs] [ 75.423161] [<c030f1d7>] generic_shutdown_super+0x57/0xd0 [ 75.423168] [<c030f272>] kill_block_super+0x22/0x40 [ 75.423175] [<c030f6c5>] deactivate_locked_super+0x35/0x50 [ 75.423182] [<c03262c1>] sys_umount+0x61/0xc0 [ 75.423188] [<c0326337>] sys_oldumount+0x17/0x20 [ 75.423195] [<c05fc265>] syscall_call+0x7/0xb [ 75.423203] [<b77e926d>] 0xb77e926d [ 75.423208] ---[ end trace cb1d9af19b0474a2 ]--- [ 75.537152] REISERFS (device sda4): found reiserfs format "3.6" with standard journal [ 75.537181] REISERFS (device sda4): using ordered data mode [ 75.537191] reiserfs: using flush barriers
Well. That's not supposed to happen. Can you file a report?

-Jeff

--
Jeff Mahoney
SUSE Labs
Hi Jeff,
[ 75.422596] BUG: unable to handle kernel NULL pointer dereference at 00000004 [ 75.422607] IP: [<c031fe99>] shrink_dcache_for_umount_subtree+0xf9/0x1b0
2.6.37.1-1.2-default #1 Hewlett-Packard HP Z400 Workstation/0B4Ch
[ 75.422689] EIP: 0060:[<c031fe99>] EFLAGS: 00010246 CPU: 6 [ 75.422695] EIP is at shrink_dcache_for_umount_subtree+0xf9/0x1b0
Well. That's not supposed to happen. Can you file a report?
I filed https://bugzilla.novell.com/show_bug.cgi?id=680073 in the meantime.

Best regards
Martin Konold
Robert Bosch GmbH, Automotive Electronics (RtP2/TEF72)
EXTERNAL Konold Martin (Firma, RtP2/TEF72) wrote:
I am deploying 2.6.34.8-15 on many machines successfully.
With about 10 HP Workstations I regularly get the following message in the kernel log:
[ 1751.292702] ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x4010000 action 0xe frozen
[ 1751.292706] ata1.00: irq_stat 0x00400040, connection status changed
[ 1751.292710] ata1: SError: { PHYRdyChg DevExch }
[ 1751.292712] ata1.00: failed command: FLUSH CACHE EXT
[ 1751.292719] ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[ 1751.292720]          res 40/00:04:83:02:ee/00:00:02:00:00/40 Emask 0x10 (ATA bus error)
[ 1751.292723] ata1.00: status: { DRDY }
[ 1751.292728] ata1: hard resetting link
[ 1756.044876] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 1756.047118] ata1.00: configured for UDMA/100
[ 1756.047122] ata1.00: retrying FLUSH 0xea Emask 0x10
[ 1756.047250] ata1: EH complete
I also experience severe data loss whenever such an error occurs.
I can suggest two things to do:

(1) Extract the SMART log from the drive just after the problem occurs to see what it thinks has happened.

(2) Try a recent kernel on an affected machine (e.g. either install opensuse Kernel:STABLE or just run Knoppix) and see if it cures the problem. Then you would have both a workaround and a strategy to isolate the problem.

Also, how does the hardware of the affected machines differ from similar machines where the problem does not occur? Disk type or number of disks? Motherboard? etc.

Cheers, Dave
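As an aid to reading the log quoted above: the SErr value is a bitmask, and the kernel prints the names of the set bits on the following "SError:" line. The sketch below is a minimal Python illustration, assuming the bit positions of the SATA SError register and the short labels that libata prints (the label table is written from memory and the function name `decode_serr` is hypothetical, not part of any tool mentioned in this thread).

```python
# Bit positions follow the SATA SError register layout; the labels mimic the
# short names libata prints in the kernel log (assumed, written from memory).
SERR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serr(value):
    """Return the labels of all known bits set in an SError value, lowest bit first."""
    return [name for bit, name in sorted(SERR_BITS.items()) if value & (1 << bit)]


if __name__ == "__main__":
    # Value taken from the "SErr 0x4010000" field in the log above.
    print(decode_serr(0x4010000))
```

For 0x4010000 this yields PHYRdyChg and DevExch, matching the "SError: { PHYRdyChg DevExch }" line in the log, i.e. the link reported a PHY-ready change and a device exchange rather than, say, a CRC error.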
Hi Dave,
From: Dave Howorth [mailto:dhoworth@mrc-lmb.cam.ac.uk]
[ 1751.292702] ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x4010000 action 0xe frozen
[ 1751.292706] ata1.00: irq_stat 0x00400040, connection status changed
[ 1751.292710] ata1: SError: { PHYRdyChg DevExch }
[ 1751.292712] ata1.00: failed command: FLUSH CACHE EXT
[ 1751.292719] ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[ 1751.292720]          res 40/00:04:83:02:ee/00:00:02:00:00/40 Emask 0x10 (ATA bus error)
[ 1751.292723] ata1.00: status: { DRDY }
[ 1751.292728] ata1: hard resetting link
[ 1756.044876] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 1756.047118] ata1.00: configured for UDMA/100
[ 1756.047122] ata1.00: retrying FLUSH 0xea Emask 0x10
[ 1756.047250] ata1: EH complete
(1) Extract the SMART log from the drive just after the problem occurs to see what it thinks has happened.
In the log I found:

Mar 10 11:38:26 rt-z9857 smartd[4761]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 7 Seek_Error_Rate changed from 200 to 100

The output from smartctl after the problem appeared in the log. (I triggered the issue with a large kiwi build)

rt-z9857:~ # smartctl -a -d ata /dev/sda
smartctl 5.39.1 2010-01-28 r3054 [i686-pc-linux-gnu] (openSUSE RPM)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Blue Serial ATA family
Device Model:     WDC WD3200AAJS-60Z0A0
Serial Number:    WD-WCAV2S489213
Firmware Version: 03.03E03
User Capacity:    320.072.933.376 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Thu Mar 10 11:09:33 2011 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (5760) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  70) minutes.
SCT capabilities:              (0x303f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   162   133   021    Pre-fail  Always       -       2900
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       117
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       929
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       115
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   063   048   040    Old_age   Always       -       37
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       36
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       117
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I am running some further smart tests now.
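When comparing many machines, it can help to check the attribute table mechanically rather than by eye. The following is a rough Python sketch, assuming the column layout of the smartctl attribute table shown above and the usual smartmontools convention that an attribute counts as failed when its normalized VALUE is less than or equal to THRESH. The function names and the `SAMPLE` excerpt are illustrative only.

```python
import re

# A few rows copied from the smartctl output above, used as sample input.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   162   133   021    Pre-fail  Always       -       2900
  7 Seek_Error_Rate         0x002f   200   200   051    Pre-fail  Always       -       0
190 Airflow_Temperature_Cel 0x0022   063   048   040    Old_age   Always       -       37
"""

# One attribute row: ID, name, flag, value, worst, threshold, type,
# updated, when_failed, and the leading digits of the raw value.
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]+\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"(?P<type>\S+)\s+\S+\s+\S+\s+(?P<raw>\d+)"
)


def parse_attributes(text):
    """Parse attribute rows into dicts; lines that are not rows are skipped."""
    attrs = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            attrs.append({
                "id": int(m.group("id")),
                "name": m.group("name"),
                "value": int(m.group("value")),
                "worst": int(m.group("worst")),
                "thresh": int(m.group("thresh")),
                "type": m.group("type"),
                "raw": int(m.group("raw")),
            })
    return attrs


def at_or_below_threshold(attrs):
    """Names of attributes whose normalized value is <= its threshold."""
    return [a["name"] for a in attrs if a["value"] <= a["thresh"]]


if __name__ == "__main__":
    print(at_or_below_threshold(parse_attributes(SAMPLE)))
```

On the rows above nothing is flagged. Under the same assumption, the Seek_Error_Rate value of 100 that smartd reported would still sit above its threshold of 51.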
(2) Try a recent kernel on an affected machine (e.g. either install opensuse Kernel:STABLE or just run Knoppix) and see if it cures the problem. Then you would have both a workaround and a strategy to isolate the problem.
Also, how does the hardware of the affected machines differ from similar machines where the problem does not occur? Disk type or number of disks? Motherboard?
The number of disks is always one.

The other machines (about 20) are very different, e.g. Fujitsu-Siemens MB instead of HP. The problem definitely correlates with the hardware setup.

Best regards
Martin Konold
EXTERNAL Konold Martin (Firma, RtP2/TEF72) wrote:
(I triggered the issue with a large kiwi build)
Reproducible is good. :)
SMART Error Log Version: 1 No Errors Logged
I think that's very significant. The drive didn't even see the bus error.
I am running some further smart tests now.
Definitely worth doing, and I predict they'll come back clean.
The number of disks is always one.
The other machines (about 20) are very different, e.g. Fujitsu-Siemens MB instead of HP. The problem definitely correlates with the hardware setup.
It does sound to me like a hardware incompatibility. I would definitely try a recent kernel. I suspect there's a reasonable chance that the problem has already been fixed. And even if it isn't, you'll need to report the symptoms against a new kernel to get a fix!

BTW, if you haven't searched the linux-ide list archives, that might be worth doing.

Oh, the other question will be, are you sure there's enough power and that all power and data cables are good.

Cheers, Dave
Hi,

On one machine I got during a kiwi build (does a lot of IO)

[ 6062.355535] ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x4010000 action 0xe frozen
[ 6062.355539] ata1.00: irq_stat 0x00400040, connection status changed
[ 6062.355543] ata1: SError: { PHYRdyChg DevExch }
[ 6062.355545] ata1.00: failed command: FLUSH CACHE EXT
[ 6062.355552] ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[ 6062.355553]          res 40/00:4c:53:60:ed/00:00:02:00:00/40 Emask 0x10 (ATA bus error)
[ 6062.355556] ata1.00: status: { DRDY }
[ 6062.355561] ata1: hard resetting link
[ 6066.852503] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 6066.854707] ata1.00: configured for UDMA/100
[ 6066.854711] ata1.00: retrying FLUSH 0xea Emask 0x10
[ 6066.854826] ata1: EH complete
[ 6068.093583] ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x4010000 action 0xe frozen
[ 6068.093586] ata1.00: irq_stat 0x00400040, connection status changed
[ 6068.093588] ata1: SError: { PHYRdyChg DevExch }
[ 6068.093590] ata1.00: failed command: FLUSH CACHE EXT
[ 6068.093593] ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[ 6068.093594]          res 40/00:5c:de:7e:07/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
[ 6068.093596] ata1.00: status: { DRDY }
[ 6068.093599] ata1: hard resetting link
[ 6072.538468] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 6072.540723] ata1.00: configured for UDMA/100
[ 6072.540727] ata1.00: retrying FLUSH 0xea Emask 0x10
[ 6072.540847] ata1: EH complete
[ 7700.291971] BUG: unable to handle kernel NULL pointer dereference at 0000000f
[ 7700.291976] IP: [<f8590c88>] prepare_error_buf+0x418/0x510 [reiserfs]
[ 7700.291988] *pdpt = 000000001c5d6001 *pde = 0000000000000000
[ 7700.291992] Oops: 0000 [#1] PREEMPT SMP
[ 7700.291995] last sysfs file: /sys/devices/pci0000:3f/0000:3f:06.3/modalias
[ 7700.291998] Modules linked in: memainUSB nvidia(P) dm_mod sg iTCO_wdt iTCO_vendor_support button reiserfs loop af_packet sr_mod tg3 [last unloaded: cdrom]
[ 7700.292011]
[ 7700.292014] Pid: 6104, comm: rm Tainted: P 2.6.34.8-15.2-ccs #1 0B4Ch/HP Z400 Workstation
[ 7700.292017] EIP: 0060:[<f8590c88>] EFLAGS: 00010286 CPU: 3
[ 7700.292027] EIP is at prepare_error_buf+0x418/0x510 [reiserfs]
[ 7700.292028] EAX: 0000001e EBX: ffffffff ECX: 00000000 EDX: f85b0daf
[ 7700.292030] ESI: dc593e40 EDI: dc593e3c EBP: f85b09be ESP: dc593d7c
[ 7700.292031] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 7700.292033] Process rm (pid: 6104, ti=dc592000 task=dd96f1f0 task.ti=dc592000)
[ 7700.292034] Stack:
[ 7700.292034] 00000026 f8712000 f859f46b 00000082 c0827580 00000006 00000004 00000008
[ 7700.292037] <0> c028c3d7 00000082 00000002 f637c2e8 dc593eb0 f85b0dbe f6f95400 f85b09a0
[ 7700.292041] <0> f85975ab 00000001 00000282 00000000 c084fdc0 d2921558 c7783ce0 c0827500
[ 7700.292044] Call Trace:
[ 7700.292059] [<f8590fb8>] __reiserfs_error+0x28/0xf0 [reiserfs]
[ 7700.292070] [<f859914d>] reiserfs_do_truncate+0x4ed/0x600 [reiserfs]
[ 7700.292086] [<f8599298>] reiserfs_delete_object+0x38/0x80 [reiserfs]
[ 7700.292098] [<f8584a96>] reiserfs_delete_inode+0xb6/0xf0 [reiserfs]
[ 7700.292105] [<c02fd308>] generic_delete_inode+0x68/0xf0
[ 7700.292108] [<c02fc834>] iput+0x44/0x50
[ 7700.292112] [<c02f457d>] do_unlinkat+0xdd/0x160
[ 7700.292115] [<c02030d0>] sysenter_do_call+0x12/0x26
[ 7700.292119] [<b78e8424>] 0xb78e8424
[ 7700.292120] Code: 08 89 4c 24 14 c7 44 24 04 5c 6e 5a f8 89 2c 24 e8 5e b0 e2 c7 8b 7c 24 40 e9 35 fd ff ff 8b 1f 8d 77 04 85 db 0f 84 d2 00 00 00 <0f> b6 43 10 bf 3a 98 5a f8 84 c0 74 1c 3c 03 74 6a 3c 02 bf 41
[ 7700.292137] EIP: [<f8590c88>] prepare_error_buf+0x418/0x510 [reiserfs] SS:ESP 0068:dc593d7c
[ 7700.292143] CR2: 000000000000000f
[ 7700.292145] ---[ end trace 4fbbb5b503c00782 ]---
[ 7700.292146] ------------[ cut here ]------------
[ 7700.292149] WARNING: at /kiwi/packages/BUILD/kernel-ccs-2.6.34.8/linux-2.6.34/kernel/exit.c:918 do_exit+0x2fd/0x350()
[ 7700.292151] Hardware name: HP Z400 Workstation
[ 7700.292151] Modules linked in: memainUSB nvidia(P) dm_mod sg iTCO_wdt iTCO_vendor_support button reiserfs loop af_packet sr_mod tg3 [last unloaded: cdrom]
[ 7700.292159] Pid: 6104, comm: rm Tainted: P D 2.6.34.8-15.2-ccs #1
[ 7700.292160] Call Trace:
[ 7700.292164] [<c0206ba3>] try_stack_unwind+0x173/0x190
[ 7700.292167] [<c02057ef>] dump_trace+0x3f/0xe0
[ 7700.292169] [<c0206c0b>] show_trace_log_lvl+0x4b/0x60
[ 7700.292172] [<c0206c38>] show_trace+0x18/0x20
[ 7700.292176] [<c05af9a7>] dump_stack+0x6d/0x72
[ 7700.292179] [<c0243cfe>] warn_slowpath_common+0x6e/0xb0
[ 7700.292181] [<c0243d53>] warn_slowpath_null+0x13/0x20
[ 7700.292184] [<c02477ed>] do_exit+0x2fd/0x350
[ 7700.292187] [<c0206d76>] oops_end+0x86/0xc0
[ 7700.292191] [<c0223f8f>] bad_area_nosemaphore+0xf/0x20
[ 7700.292194] [<c0224477>] do_page_fault+0x2d7/0x360
[ 7700.292197] [<c05b2cda>] error_code+0x66/0x6c
[ 7700.292203] [<f8590c88>] prepare_error_buf+0x418/0x510 [reiserfs]
[ 7700.292213] [<f8590fb8>] __reiserfs_error+0x28/0xf0 [reiserfs]
[ 7700.292224] [<f859914d>] reiserfs_do_truncate+0x4ed/0x600 [reiserfs]
[ 7700.292239] [<f8599298>] reiserfs_delete_object+0x38/0x80 [reiserfs]
[ 7700.292251] [<f8584a96>] reiserfs_delete_inode+0xb6/0xf0 [reiserfs]
[ 7700.292256] [<c02fd308>] generic_delete_inode+0x68/0xf0
[ 7700.292258] [<c02fc834>] iput+0x44/0x50
[ 7700.292261] [<c02f457d>] do_unlinkat+0xdd/0x160
[ 7700.292263] [<c02030d0>] sysenter_do_call+0x12/0x26
[ 7700.292266] [<b78e8424>] 0xb78e8424
[ 7700.292267] ---[ end trace 4fbbb5b503c00783 ]---
[ 7700.292269] note: rm[6104] exited with preempt_count 1

rt-z9857:/var/log # dmesg | ksymoops
ksymoops 2.4.11 on i686 2.6.34.8-15.2-ccs. Options used
     -V (default)
     -k /proc/kallsyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.6.34.8-15.2-ccs/ (default)
     -m /boot/System.map-2.6.34.8-15.2-ccs (default)

Warning: You did not tell me where to find symbol information. I will assume that the log matches the kernel and modules that are running right now and I'll use the default options above for symbol resolution. If the current kernel and/or modules do not match the log, you can get more accurate output by telling me the kernel version and where to find map, modules, ksyms etc. ksymoops -h explains the options.

Warning (read_ksyms): no kernel symbols in ksyms, is /proc/kallsyms a valid ksyms file?
No modules in ksyms, skipping objects
No ksyms, skipping lsmod
Error (regular_file): read_system_map stat /boot/System.map-2.6.34.8-15.2-ccs failed
ksymoops: No such file or directory
Warning (merge_maps): no symbols in merged map
[ 0.000000] ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
[ 0.000000] ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1])
[ 0.000000] ACPI: LAPIC_NMI (acpi_id[0x03] high edge lint[0x1])
[ 0.000000] ACPI: LAPIC_NMI (acpi_id[0x04] high edge lint[0x1])
[ 0.000000] ACPI: LAPIC_NMI (acpi_id[0x05] high edge lint[0x1])
[ 0.000000] ACPI: LAPIC_NMI (acpi_id[0x06] high edge lint[0x1])
[ 0.000000] ACPI: LAPIC_NMI (acpi_id[0x07] high edge lint[0x1])
[ 0.000000] ACPI: LAPIC_NMI (acpi_id[0x08] high edge lint[0x1])
[ 0.691966] ehci_hcd 0000:00:1a.7: debug port 1
[ 0.706972] ehci_hcd 0000:00:1d.7: debug port 1
[ 7700.291971] BUG: unable to handle kernel NULL pointer dereference at 0000000f

3 warnings and 1 error issued. Results may not be reliable.
-----Original Message----- From: Dave Howorth [mailto:dhoworth@mrc-lmb.cam.ac.uk]
SMART Error Log Version: 1 No Errors Logged
I think that's very significant. The drive didn't even see the bus error.
I used smartctl to run some tests (offline, short):

Mar 10 10:08:26 rt-z9857 smartd[4759]: Device: /dev/sda [SAT], state written to /var/lib/smartmontools/smartd.WDC_WD3200AAJS_60Z0A0-WD_WCAV2S489213.ata.state
Mar 10 10:08:26 rt-z9857 smartd[4761]: smartd has fork()ed into background mode. New PID=4761.
Mar 10 10:38:26 rt-z9857 smartd[4761]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 74 to 64
Mar 10 11:08:27 rt-z9857 smartd[4761]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 153 to 162
Mar 10 11:08:27 rt-z9857 smartd[4761]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 64 to 63
Mar 10 11:38:26 rt-z9857 smartd[4761]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 7 Seek_Error_Rate changed from 200 to 100
Mar 10 12:08:26 rt-z9857 smartd[4761]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 162 to 169
Mar 10 13:08:26 rt-z9857 smartd[4761]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 64
Mar 10 13:08:26 rt-z9857 smartd[4761]: Device: /dev/sda [SAT], SMART Usage Attribute: 198 Offline_Uncorrectable changed from 100 to 200
Mar 10 13:08:26 rt-z9857 smartd[4761]: Device: /dev/sda [SAT], SMART Usage Attribute: 200 Multi_Zone_Error_Rate changed from 100 to 200
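To see at a glance which attributes moved and in which direction, the "changed from X to Y" lines in the smartd log above can be extracted with a short script. This is only a sketch in Python, assuming the smartd message format exactly as it appears in this log; the function name and the sample excerpt are hypothetical. Note that smartd reports normalized values here, where (by the usual smartmontools convention) a higher number is generally the healthier one, so a rise such as 100 to 200 is not by itself a sign of degradation.

```python
import re

# Three lines copied from the smartd log above, used as sample input.
SAMPLE_LOG = """\
Mar 10 11:38:26 rt-z9857 smartd[4761]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 7 Seek_Error_Rate changed from 200 to 100
Mar 10 12:08:26 rt-z9857 smartd[4761]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 162 to 169
Mar 10 13:08:26 rt-z9857 smartd[4761]: Device: /dev/sda [SAT], SMART Usage Attribute: 198 Offline_Uncorrectable changed from 100 to 200
"""

CHANGE = re.compile(
    r"SMART (?P<kind>Prefailure|Usage) Attribute: (?P<id>\d+) (?P<name>\S+) "
    r"changed from (?P<old>\d+) to (?P<new>\d+)"
)


def attribute_changes(log_text):
    """Return (id, name, old, new) for every attribute-change line in the log."""
    changes = []
    for line in log_text.splitlines():
        m = CHANGE.search(line)
        if m:
            changes.append(
                (int(m.group("id")), m.group("name"),
                 int(m.group("old")), int(m.group("new")))
            )
    return changes


if __name__ == "__main__":
    for attr_id, name, old, new in attribute_changes(SAMPLE_LOG):
        direction = "down" if new < old else "up"
        print(f"{attr_id:>3} {name}: {old} -> {new} ({direction})")
```

In the sample, only Seek_Error_Rate moves downward (200 to 100); the other two changes are increases in the normalized value.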
I am running some further smart tests now.
Definitely worth doing, and I predict they'll come back clean.
Not really, see above.
It does sound to me like a hardware incompatibility. I would definitely try a recent kernel.
These are OpenSUSE 11.3 machines. What do you mean with recent kernel? Please give me a hint which kernel I shall try.
I suspect there's a reasonable chance that the problem has already been fixed.
Oh, the other question will be, are you sure there's enough power and that all power and data cables are good.
Yes, definitely. This is industrial quality cabling. Power is fine.

Best regards
Martin Konold
EXTERNAL Konold Martin (Firma, RtP2/TEF72) wrote:
On one machine I got during a kiwi build (does a lot of IO)
[snip]
[ 7700.291971] BUG: unable to handle kernel NULL pointer dereference at 0000000f
[ 7700.291976] IP: [<f8590c88>] prepare_error_buf+0x418/0x510 [reiserfs]
[ 7700.291988] *pdpt = 000000001c5d6001 *pde = 0000000000000000
[ 7700.291992] Oops: 0000 [#1] PREEMPT SMP
That looks nasty.
These are OpenSUSE 11.3 machines. What do you mean with recent kernel? Please give me a hint which kernel I shall try.
I'm not an expert in these areas. But I recently had to do something similar because of a problem I had. So hopefully somebody more knowledgeable will jump in if I say something wrong.

What I was suggesting is to use the appropriate version of the kernel that you will find at <https://build.opensuse.org/project/show?project=Kernel%3AHEAD>

There's also the kernel at <http://download.opensuse.org/repositories/Kernel:/stable/openSUSE_11.3/> which is not quite as recent but is a lot newer than the stock 11.3 one and which may run if the one from HEAD doesn't work.

Just add the repo and install. I'd suggest also setting the multiversion option for yast, so that the original kernel is still installed.

In my case, doing that let me know that it was a software problem that had been fixed, and I was able to dig down and find the particular patch so I could work out what upgrade would work (the standard kernel in 11.4 cures my problem, for example).

Cheers, Dave
participants (6)
- Brian K. White
- Dave Howorth
- EXTERNAL Konold Martin (Firma, RtP2/TEF72)
- Greg Freemyer
- Greg KH
- Jeff Mahoney