[Bug 775746] New: mdadm degraded array on boot, random device partition missing
https://bugzilla.novell.com/show_bug.cgi?id=775746#c0

           Summary: mdadm degraded array on boot, random device partition
                    missing
    Classification: openSUSE
           Product: openSUSE 12.1
           Version: Final
          Platform: x86-64
        OS/Version: openSUSE 12.1
            Status: NEW
          Severity: Critical
          Priority: P5 - None
         Component: Other
        AssignedTo: bnc-team-screening@forge.provo.novell.com
        ReportedBy: j.langley@gmx.net
         QAContact: qa-bugs@suse.de
          Found By: ---
           Blocker: ---

User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:14.0) Gecko/20100101 Firefox/14.0.1

I have two software RAID devices configured with YaST. Regularly, but not
always, I get a degraded array after boot. One random partition is missing
from one array; in most cases it is sdb4 or sda4 - they appear to alternate.
Sometimes one partition is missing in both arrays.

I searched the web and found similar problems, but not exactly the same one. I
also booted with raid=noautodetect, as some Ubuntu forums suggested, but
without success. /tmp/initrd/lib/udev/rules.d/64-md-raid.rules is in place. I
also tried putting it in /etc/udev/rules.d/64-md-raid.rules, but that made no
difference.

The random partition is kicked even though it is clean, so I suspect a bug in
mdadm used with udev and systemd.

linux-dioz:/ # mdadm --examine /dev/sdb4
/dev/sdb4:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : 909d58d0:3a4ee94e:8897cfe8:d7aefeea
           Name : linux-99ig:1
  Creation Time : Fri Oct 21 21:36:33 2011
     Raid Level : raid1
   Raid Devices : 2
 Avail Dev Size : 878198512 (418.76 GiB 449.64 GB)
     Array Size : 878198512 (418.76 GiB 449.64 GB)
   Super Offset : 878198768 sectors
          State : clean
    Device UUID : 46360291:b7806b6c:ce1f892e:fa56fc78
Internal Bitmap : -8 sectors from superblock
    Update Time : Mon Aug 13 22:24:16 2012
       Checksum : 5c2804a9 - correct
         Events : 93354
    Device Role : Active device 1
    Array State : AA ('A' == active, '.' == missing)

linux-dioz:~ # mdadm -V
mdadm - v3.2.2 - 17th June 2011

linux-dioz:~ # uname -r
3.1.10-1.16-desktop

linux-dioz:~ # cat /proc/cmdline
root=/dev/md0 noresume splash=silent quiet vga=795 raid=noautodetect

linux-dioz:~ # cat /etc/fstab
/dev/md0  /      xfs  defaults  1 1
/dev/md1  /home  xfs  defaults  1 2

linux-dioz:~ # cat /etc/mdadm.conf
DEVICE containers partitions
ARRAY /dev/md/0 UUID=7c60e2b2:804071ee:1ef2019b:e3fce998
ARRAY /dev/md/1 UUID=909d58d0:3a4ee94e:8897cfe8:d7aefeea

linux-dioz:~ # cat /tmp/initrd/etc/mdadm.conf
AUTO -all
ARRAY /dev/md0 metadata=1.0 name=linux:0 UUID=7c60e2b2:804071ee:1ef2019b:e3fce998

linux-dioz:~ # mdadm --detail /dev/md0
/dev/md0:
        Version : 1.0
  Creation Time : Fri Oct 21 19:56:37 2011
     Raid Level : raid1
     Array Size : 41945016 (40.00 GiB 42.95 GB)
  Used Dev Size : 41945016 (40.00 GiB 42.95 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent
  Intent Bitmap : Internal
    Update Time : Mon Aug 13 22:04:19 2012
          State : active
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
           Name : linux:0
           UUID : 7c60e2b2:804071ee:1ef2019b:e3fce998
         Events : 5124

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       8       18        1      active sync   /dev/sdb2

linux-dioz:~ # mdadm --detail /dev/md1
/dev/md1:
        Version : 1.0
  Creation Time : Fri Oct 21 21:36:33 2011
     Raid Level : raid1
     Array Size : 439099256 (418.76 GiB 449.64 GB)
  Used Dev Size : 439099256 (418.76 GiB 449.64 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent
  Intent Bitmap : Internal
    Update Time : Mon Aug 13 22:05:20 2012
          State : active
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
           Name : linux-99ig:1
           UUID : 909d58d0:3a4ee94e:8897cfe8:d7aefeea
         Events : 93354

    Number   Major   Minor   RaidDevice State
       0       8        4        0      active sync   /dev/sda4
       1       8       20        1      active sync   /dev/sdb4

linux-dioz:~ # mount
/dev/md0 on / type xfs (rw,relatime,attr2,delaylog,noquota)
/dev/md1 on /home type xfs (rw,relatime,attr2,delaylog,noquota)
linux-dioz:~ # dmesg | grep md
[ 0.000000] Command line: root=/dev/md0 noresume splash=silent quiet vga=795 raid=noautodetect
[ 0.000000] Kernel command line: root=/dev/md0 noresume splash=silent quiet vga=795 raid=noautodetect
[ 1.354683] ata1: SATA max UDMA/133 cmd 0x9f0 ctl 0xbf0 bmdma 0xe000 irq 21
[ 1.354687] ata2: SATA max UDMA/133 cmd 0x970 ctl 0xb70 bmdma 0xe008 irq 21
[ 1.355625] ata3: SATA max UDMA/133 cmd 0x9e0 ctl 0xbe0 bmdma 0xcc00 irq 20
[ 1.355629] ata4: SATA max UDMA/133 cmd 0x960 ctl 0xb60 bmdma 0xcc08 irq 20
[ 3.024577] md: bind<sda2>
[ 3.075631] md: bind<sdb2>
[ 3.078527] md: raid1 personality registered for level 1
[ 3.078799] md/raid1:md0: active with 2 out of 2 mirrors
[ 3.078988] created bitmap (1 pages) for device md0
[ 3.079187] md0: bitmap initialized from disk: read 1/1 pages, set 0 of 641 bits
[ 3.149541] md0: detected capacity change from 0 to 42951696384
[ 3.151403] md0: unknown partition table
[ 3.392302] md: raid0 personality registered for level 0
[ 3.396557] md: raid10 personality registered for level 10
[ 3.524158] md: raid6 personality registered for level 6
[ 3.524162] md: raid5 personality registered for level 5
[ 3.524164] md: raid4 personality registered for level 4
[ 3.720134] XFS (md0): Mounting Filesystem
[ 3.861353] XFS (md0): Ending clean mount
[ 4.612088] systemd[1]: systemd 37 running in system mode. (+PAM +LIBWRAP +AUDIT +SELINUX +SYSVINIT +LIBCRYPTSETUP; suse)
[ 5.005999] systemd[1]: Set hostname to <linux-dioz.site>.
[ 9.145112] EDAC amd64: DRAM ECC disabled.
[ 9.145123] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
[ 10.461748] md: md1 stopped.
[ 10.645936] md: bind<sdb4>
[ 10.675967] md: bind<sda4>
[ 10.676014] md: kicking non-fresh sdb4 from array!
[ 10.676020] md: unbind<sdb4>
[ 10.679034] md: export_rdev(sdb4)
[ 10.833257] md/raid1:md1: active with 1 out of 2 mirrors
[ 10.854573] created bitmap (4 pages) for device md1
[ 10.854830] md1: bitmap initialized from disk: read 1/1 pages, set 113 of 6701 bits
[ 10.900278] md1: detected capacity change from 0 to 449637638144
[ 10.900540] boot.md[462]: Starting MD RAID mdadm: /dev/md/1 has been started with 1 drive (out of 2).
[ 11.002196] md1: unknown partition table
[ 11.399485] boot.md[462]: ..done
[ 11.500712] systemd-fsck[934]: /sbin/fsck.xfs: XFS file system.
[ 11.548664] XFS (md1): Mounting Filesystem
[ 12.068339] XFS (md1): Ending clean mount
[ 5458.784139] ata2.00: cmd 61/01:00:e8:2f:20/00:00:05:00:00/40 tag 0 ncq 512 out
[ 9758.290839] md: bind<sdb4>
[ 9758.339918] md: recovery of RAID array md1
[ 9758.339922] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[ 9758.339927] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[ 9758.339934] md: using 128k window, over a total of 439099256k.
[ 9897.447202] md: md1: recovery done.

linux-dioz:~ # dmesg | grep sdb
[ 2.302027] sd 1:0:0:0: [sdb] 976773168 512-byte logical blocks: (500 GB/465 GiB)
[ 2.302230] sd 1:0:0:0: [sdb] Write Protect is off
[ 2.302234] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[ 2.302284] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 2.328030] sdb: sdb1 sdb2 sdb3 sdb4
[ 2.328465] sd 1:0:0:0: [sdb] Attached SCSI disk
[ 3.075631] md: bind<sdb2>
[ 10.645936] md: bind<sdb4>
[ 10.676014] md: kicking non-fresh sdb4 from array!
[ 10.676020] md: unbind<sdb4>
[ 10.679034] md: export_rdev(sdb4)
[ 11.177828] Adding 1051644k swap on /dev/sdb1. Priority:0 extents:1 across:1051644k
[ 5458.784082] dhfis 0x1 dmafis 0x1 sdbfis 0x0
[ 5458.784093] ata2: tag : dhfis dmafis sdbfis sactive
[ 9758.290839] md: bind<sdb4>
[ 9758.339510] disk 1, wo:1, o:1, dev:sdb4
[ 9897.575473] disk 1, wo:0, o:1, dev:sdb4

Reproducible: Sometimes

Steps to Reproduce:
1. Create an array and verify it is running well; move some data around.
2. Reboot.
3. Check the arrays again with "mdadm --detail /dev/md1".

Actual Results:
[ 10.645936] md: bind<sdb4>
[ 10.676014] md: kicking non-fresh sdb4 from array!

Expected Results:
[ 3.075631] md: bind<sdb2>
[ 10.645936] md: bind<sdb4>

I have had data loss several times now, since the missing partition
alternates!

-- 
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
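[Editorial note] The "kicking non-fresh sdb4" message above comes from md comparing the per-member Events counters shown by mdadm --examine. A hedged, self-contained sketch of that comparison; the here-docs stand in for real --examine output, and the lagging value 93200 is invented for illustration:

```shell
# md marks a mirror member "non-fresh" when its Events counter lags the
# other member's. events_of extracts the counter from --examine output.
events_of() { awk -F: '/Events/ { gsub(/ /, "", $2); print $2 }'; }

a=$(events_of <<'EOF'
         Events : 93354
EOF
)
b=$(events_of <<'EOF'
         Events : 93200
EOF
)
if [ "$a" -eq "$b" ]; then
    echo "members in sync"
else
    echo "event counters differ ($a vs $b); the stale member gets kicked"
fi
```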
https://bugzilla.novell.com/show_bug.cgi?id=775746#c1

--- Comment #1 from John Langley <j.langley@gmx.net> 2012-08-13 22:35:49 UTC ---

"mdadm /dev/md1 --add /dev/sdb4" starts the array again, and it works
perfectly for hours - until the next reboot.

After the reboot today I ended up in a shell:

[ 11.263189] input: Venus USB2.0 Camera as /devices/pci0000:00/0000:00:0b.1/usb1/1-6/1-6:1.0/input/input13
[ 11.263360] usbcore: registered new interface driver uvcvideo
[ 11.263363] USB Video Class driver (1.1.1)
[ 11.343139] Adding 1051644k swap on /dev/sda1. Priority:0 extents:1 across:1051644k
[ 11.354278] md: bind<sda4>
[ 11.418756] Adding 1051644k swap on /dev/sdb1. Priority:0 extents:1 across:1051644k
[ 11.515482] nvidia 0000:03:00.0: PCI INT A -> Link[APC5] -> GSI 16 (level, low) -> IRQ 16
[ 11.515492] nvidia 0000:03:00.0: setting latency timer to 64
[ 11.515498] vgaarb: device changed decodes: PCI:0000:03:00.0,olddecodes=io+mem,decodes=none:owns=io+mem
[ 11.515806] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 295.59 Wed Jun 6 21:19:40 PDT 2012
[ 11.550355] md: could not open unknown-block(8,20).
[ 11.550455] md: md_import_device returned -16
[ 11.550766] md: could not open unknown-block(8,20).
[ 11.550857] md: md_import_device returned -16
[ 11.601393] boot.md[448]: Starting MD RAID mdadm: /dev/md/1 is already in use.
[ 11.601940] boot.md[448]: ..failed
[ 11.602543] systemd[1]: md.service: control process exited, code=exited status=1
[ 11.608358] systemd[1]: Unit md.service entered failed state.
[ 97.101581] systemd[1]: Job dev-md1.device/start timed out.
[ 97.101819] systemd[1]: Job remote-fs-pre.target/start failed with result 'dependency'.
[ 97.101826] systemd[1]: Job local-fs.target/start failed with result 'dependency'.
[ 97.101831] systemd[1]: Triggering OnFailure= dependencies of local-fs.target.
[ 97.102766] systemd[1]: Job home.mount/start failed with result 'dependency'.
[ 97.102774] systemd[1]: Job dev-md1.device/start failed with result 'timeout'.
[ 97.390477] systemd[1]: Startup finished in 5s 50ms 326us (kernel) + 1min 32s 340ms 36us (userspace) = 1min 37s 390ms 362us.
[ 351.569324] md/raid1:md1: active with 1 out of 2 mirrors
[ 351.569509] created bitmap (4 pages) for device md1
[ 351.569729] md1: bitmap initialized from disk: read 1/1 pages, set 7 of 6701 bits
[ 351.605682] md1: detected capacity change from 0 to 449637638144
[ 369.175955] md1: unknown partition table

But the individual disks of the array were actually fine:

/dev/sda4:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : 909d58d0:3a4ee94e:8897cfe8:d7aefeea
           Name : linux-99ig:1
  Creation Time : Fri Oct 21 21:36:33 2011
     Raid Level : raid1
   Raid Devices : 2
 Avail Dev Size : 878198512 (418.76 GiB 449.64 GB)
     Array Size : 878198512 (418.76 GiB 449.64 GB)
   Super Offset : 878198768 sectors
          State : clean
    Device UUID : 82cdb9ef:19aaa29b:1d7b1d7c:ea687817
Internal Bitmap : -8 sectors from superblock
    Update Time : Mon Aug 13 22:46:08 2012
       Checksum : d62737a5 - correct
         Events : 93354
    Device Role : Active device 0
    Array State : AA ('A' == active, '.' == missing)

/dev/sdb4:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : 909d58d0:3a4ee94e:8897cfe8:d7aefeea
           Name : linux-99ig:1
  Creation Time : Fri Oct 21 21:36:33 2011
     Raid Level : raid1
   Raid Devices : 2
 Avail Dev Size : 878198512 (418.76 GiB 449.64 GB)
     Array Size : 878198512 (418.76 GiB 449.64 GB)
   Super Offset : 878198768 sectors
          State : clean
    Device UUID : 46360291:b7806b6c:ce1f892e:fa56fc78
Internal Bitmap : -8 sectors from superblock
    Update Time : Mon Aug 13 22:46:08 2012
       Checksum : 5c2809c9 - correct
         Events : 93354
    Device Role : Active device 1
    Array State : AA ('A' == active, '.' == missing)

"mdadm --run /dev/md1" started the array.
"mdadm /dev/md1 --add /dev/sdb4" fixed the array, and it works well again.

The array only becomes degraded on reboot. Mostly sdb4 is missing, but I have
also seen sda4 (md1), or sda2 or sdb2 (md0). Always just one partition!
https://bugzilla.novell.com/show_bug.cgi?id=775746

kk zhang <kkzhang@suse.com> changed:

  CC: added kkzhang@suse.com
  AssignedTo: bnc-team-screening@forge.provo.novell.com -> nfbrown@suse.com
https://bugzilla.novell.com/show_bug.cgi?id=775746#c2

--- Comment #2 from John Langley <j.langley@gmx.net> 2012-08-19 23:32:07 UTC ---

Since my previous comment everything has worked well - no problems on boot.
But today my dmesg log showed this:

[43299.808066] ata2: EH in SWNCQ mode,QC:qc_active 0x1 sactive 0x1
[43299.808076] ata2: SWNCQ:qc_active 0x1 defer_bits 0x0 last_issue_tag 0x0
[43299.808086] ata2: ATA_REG 0x40 ERR_REG 0x0
[43299.808090] ata2: tag : dhfis dmafis sdbfis sactive
[43299.808096] ata2: tag 0x0: 1 1 0 1
[43299.808114] ata2.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
[43299.808123] ata2.00: failed command: WRITE FPDMA QUEUED
[43299.808137] ata2.00: cmd 61/01:00:e8:2f:20/00:00:05:00:00/40 tag 0 ncq 512 out
[43299.808146] ata2.00: status: { DRDY }
[43299.808156] ata2: hard resetting link
[43299.808160] ata2: nv: skipping hardreset on occupied port
[43300.262069] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[43300.265222] ata2.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
[43300.265232] ata2.00: revalidation failed (errno=-5)
[43305.262059] ata2: hard resetting link
[43305.262068] ata2: nv: skipping hardreset on occupied port
[43305.716090] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[43305.723383] ata2.00: configured for UDMA/133
[43305.723395] ata2.00: device reported invalid CHS sector 0
[43305.723409] ata2: EH complete
[50448.736069] ata1: EH in SWNCQ mode,QC:qc_active 0x1 sactive 0x1
[50448.736079] ata1: SWNCQ:qc_active 0x1 defer_bits 0x0 last_issue_tag 0x0
[50448.736090] ata1: ATA_REG 0x40 ERR_REG 0x0
[50448.736094] ata1: tag : dhfis dmafis sdbfis sactive
[50448.736099] ata1: tag 0x0: 1 1 0 1
[50448.736118] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
[50448.736127] ata1.00: failed command: WRITE FPDMA QUEUED
[50448.736141] ata1.00: cmd 61/01:00:e8:2f:20/00:00:05:00:00/40 tag 0 ncq 512 out
[50448.736150] ata1.00: status: { DRDY }
[50448.736160] ata1: hard resetting link
[50448.736164] ata1: nv: skipping hardreset on occupied port
[50449.190060] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[50449.193226] ata1.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
[50449.193237] ata1.00: revalidation failed (errno=-5)
[50454.190038] ata1: hard resetting link
[50454.190043] ata1: nv: skipping hardreset on occupied port
[50454.644053] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[50454.650338] ata1.00: configured for UDMA/133
[50454.650352] ata1: EH complete
https://bugzilla.novell.com/show_bug.cgi?id=775746#c3

--- Comment #3 from John Langley <j.langley@gmx.net> 2012-08-19 23:38:58 UTC ---

Created an attachment (id=502803)
 --> (http://bugzilla.novell.com/attachment.cgi?id=502803)
dmesg output

This is the full dmesg output.
https://bugzilla.novell.com/show_bug.cgi?id=775746#c4

Neil Brown <nfbrown@suse.com> changed:

  Status: NEW -> RESOLVED
  Resolution: FIXED

--- Comment #4 from Neil Brown <nfbrown@suse.com> 2012-08-20 07:21:57 UTC ---

I think the fix to your problem is the patch below. If the line that is added
is already in your udev rules files, or if adding it (to
/lib/udev/rules.d/64-md-raid.rules and running mkinitrd) doesn't fix the
problem, please re-open.

diff --git a/udev-md-raid.rules b/udev-md-raid.rules
index f564f70..814c897 100644
--- a/udev-md-raid.rules
+++ b/udev-md-raid.rules
@@ -28,7 +28,7 @@
 ENV{DEVTYPE}=="partition", GOTO="md_ignore_state"
 # never leave state 'inactive'
 ATTR{md/metadata_version}=="external:[A-Za-z]*", ATTR{md/array_state}=="inactive", GOTO="md_ignore_state"
 TEST!="md/array_state", GOTO="md_end"
-ATTR{md/array_state}=="|clear|inactive", GOTO="md_end"
+ATTR{md/array_state}=="|clear|inactive", ENV{SYSTEMD_READY}="0", GOTO="md_end"
 LABEL="md_ignore_state"
 IMPORT{program}="/sbin/mdadm --detail --export $tempnode"

This ensures systemd waits for the array to be properly assembled. This patch
is in the mdadm in Factory.

The errors in comment 2 look like some sort of hardware problem, maybe a
loose or bad cable or something. They are not directly related to RAID.
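[Editorial note] For readers applying comment 4's advice, a hedged sketch of checking whether the added ENV{SYSTEMD_READY}="0" rule is already present before regenerating the initrd. A temp file stands in for /lib/udev/rules.d/64-md-raid.rules so the check is self-contained; the status messages are illustrative, not mdadm output:

```shell
# Check a 64-md-raid.rules file for the SYSTEMD_READY fix from the patch.
RULES=$(mktemp)
cat > "$RULES" <<'EOF'
ATTR{md/array_state}=="|clear|inactive", ENV{SYSTEMD_READY}="0", GOTO="md_end"
EOF

if grep -q 'SYSTEMD_READY' "$RULES"; then
    status="fix present - just confirm the initrd copy matches"
else
    status="fix missing - apply the patch and run mkinitrd"
fi
echo "$status"
rm -f "$RULES"
```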
https://bugzilla.novell.com/show_bug.cgi?id=775746#c5

John Langley <j.langley@gmx.net> changed:

  Status: RESOLVED -> REOPENED
  Resolution: FIXED -> (cleared)

--- Comment #5 from John Langley <j.langley@gmx.net> 2012-09-04 18:44:13 UTC ---

The suggested patch is already in the udev files:

/lib/udev/rules.d/64-md-raid.rules
/tmp/initrd/lib/udev/rules.d/64-md-raid.rules
(after extracting the initrd with "zcat /boot/initrd-3.1.10-1.16-desktop | cpio -idv")

Since my last message the problem had suddenly disappeared - until a few days
ago. Same problem as described before.
https://bugzilla.novell.com/show_bug.cgi?id=775746#c6

--- Comment #6 from John Langley <j.langley@gmx.net> 2012-09-08 12:54:51 UTC ---

Created an attachment (id=504952)
 --> (http://bugzilla.novell.com/attachment.cgi?id=504952)
dmesg output, hard drive not recognized

In very few cases one hard drive is missing completely. I get the boot message
"ata2.00: revalidation failed (errno=-19)"; please refer to the attached file.
The system also states that it is doing a fast boot - I don't know whether
that could be related.

By the way, I put the same hard drives into a different computer with new SATA
cables, and the problem persists.
https://bugzilla.novell.com/show_bug.cgi?id=775746#c7

Neil Brown <nfbrown@suse.com> changed:

  CC: added nfbrown@suse.com
  Component: Other -> Kernel
  AssignedTo: nfbrown@suse.com -> kernel-maintainers@forge.provo.novell.com

--- Comment #7 from Neil Brown <nfbrown@suse.com> 2012-09-12 02:49:17 UTC ---

This is really looking like a problem with the Nvidia SATA driver rather than
with md. This:

[ 11.550355] md: could not open unknown-block(8,20).
[ 11.550455] md: md_import_device returned -16
[ 11.550766] md: could not open unknown-block(8,20).
[ 11.550857] md: md_import_device returned -16

should have alerted me to that. It suggests that mdadm found /dev/sdb4, but by
the time the md kernel driver tried to open it, it had disappeared. Other
error messages seem to confirm that.

The drives seem to work fine once boot has completed, but maybe while
everything is booting and trying to identify all the devices, some race is
causing confusion.

Unfortunately I cannot help with the nv SATA driver, so I'm reassigning to the
default assignee. It should probably be reassigned to someone who knows about
SATA drivers or libata or something like that.
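[Editorial note] The race described here - one component still holding a device while md tries to open it exclusively - can be imitated with an ordinary advisory file lock. This is an illustration only, not the kernel's actual open path: flock(1) from util-linux and a temp file stand in for the exclusive open of /dev/sdb4:

```shell
# A "holder" process keeps the device locked for a moment while a second
# non-blocking open attempt fails, the same shape of failure as
# "md_import_device returned -16" (EBUSY).
DEV=$(mktemp)

# The holder (think udev/mdadm) takes the lock and keeps it briefly.
( flock -n 9 && sleep 1 ) 9>"$DEV" &
holder=$!
sleep 0.3

# Meanwhile "md" tries to grab the device exclusively and fails.
if flock -n "$DEV" -c true; then
    result="open succeeded"
else
    result="device busy (EBUSY), as in md_import_device returned -16"
fi
echo "$result"
wait "$holder"
rm -f "$DEV"
```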
https://bugzilla.novell.com/show_bug.cgi?id=775746#c8

--- Comment #8 from John Langley <j.langley@gmx.net> 2012-09-13 10:50:00 UTC ---

I don't think the Nvidia driver causes the problems, since the other machine
didn't use sata_nv. But when searching for "ata2.00: revalidation failed
(errno=-19)" I read elsewhere about a race condition between the SATA and USB
drivers. See also
http://johnbokma.com/mexit/2008/08/05/fixing-the-vostro-hang-issue.html
They suggest forcing the SATA driver to load before USB. I will try that.
https://bugzilla.novell.com/show_bug.cgi?id=775746

Jeff Mahoney <jeffm@suse.com> changed:

  CC: added jeffm@suse.com
  AssignedTo: kernel-maintainers@forge.provo.novell.com -> jlee@suse.com
https://bugzilla.novell.com/show_bug.cgi?id=775746#c9

--- Comment #9 from Joey Lee <jlee@suse.com> 2012-09-26 04:25:10 UTC ---

Per the return error from md_import_device:

[ 11.550355] md: could not open unknown-block(8,20).
[ 11.550455] md: md_import_device returned -16

-16 is EBUSY, "Device or resource busy", so this looks like a race condition:
something grabs the block device before the md driver does. We need to find a
way (maybe a log?) to catch which component grabs it.
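[Editorial note] For reference when reading these dmesg lines, a small sketch mapping the negative return codes seen in this bug to their errno names. This covers only the three codes that appear in the thread, not the full errno table:

```shell
# Translate the negative codes md and libata print in dmesg into errno
# names: EIO=5, EBUSY=16, ENODEV=19.
errno_name() {
    case "$1" in
        -5)  echo "EIO: I/O error" ;;
        -16) echo "EBUSY: device or resource busy" ;;
        -19) echo "ENODEV: no such device" ;;
        *)   echo "unknown ($1)" ;;
    esac
}

errno_name -16   # the md_import_device failure
errno_name -19   # the "revalidation failed" boot message
```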
https://bugzilla.novell.com/show_bug.cgi?id=775746#c10

Joey Lee <jlee@suse.com> changed:

  Status: REOPENED -> NEEDINFO
  InfoProvider: nfbrown@suse.com

--- Comment #10 from Joey Lee <jlee@suse.com> 2012-09-26 07:13:48 UTC ---

Hi Neil,

I found a patchset "[PATCH 0/3] Fix mdadm vs udev race in Incremental and
Assemble" from Jes Sorensen:
http://www.digipedia.pl/usenet/thread/19071/35509/

[PATCH 1/3] Remove race for starting container devices.
    5fc8cff3a4177dfbab594947283117620b4b8c9c
[PATCH 2/3] Don't tell sysfs to launch the container as we are doing it ourselves
    382afe49b10cf3e5a4764cee74649d1cd8c91813
[PATCH 3/3] Hold the map lock while performing Assemble to avoid races with udev
    eafa60fd6ec35ac7c0a01a17c3018af4c90046ef

Those patches are already merged in mdadm git, but not included in mdadm
v3.2.2 in openSUSE 12.1. Do you think they are worth a try? If so, I will
backport those patches for testing. Thanks a lot!
https://bugzilla.novell.com/show_bug.cgi?id=775746#c11

Neil Brown <nfbrown@suse.com> changed:

  InfoProvider: nfbrown@suse.com -> j.langley@gmx.net

--- Comment #11 from Neil Brown <nfbrown@suse.com> 2012-09-26 07:32:35 UTC ---

If it is a race with mdadm, it can probably be fixed by adding "udev-trigger"
to the end of the line that starts "# Should-Start:" in /etc/init.d/boot.md.
That fixes similar problems.

However, the fact that it says "unknown-block(8,20)", and that we see errors
like:

[43299.808156] ata2: hard resetting link

makes it look like there is some recurrent hardware issue. I thought it might
be a driver issue, but maybe not.

John, can you try making that change to boot.md and see if it improves the
situation? It ensures that all udev triggers have fired before boot.md tries
to assemble anything.
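[Editorial note] The edit suggested in comment 11 can be scripted. A hedged sketch; a temp file stands in for the real /etc/init.d/boot.md, and the sed expression is one possible way to perform the change, not taken from the thread:

```shell
# Append "udev-trigger" to the "# Should-Start:" line of an LSB init
# script header, idempotently.
SCRIPT=$(mktemp)
cat > "$SCRIPT" <<'EOF'
# Should-Start: boot.scsidev boot.multipath
EOF

# Only append if the token is not already present.
grep -q 'udev-trigger' "$SCRIPT" || \
    sed -i 's/^# Should-Start:.*/& udev-trigger/' "$SCRIPT"

line=$(cat "$SCRIPT")
echo "$line"
rm -f "$SCRIPT"
```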
https://bugzilla.novell.com/show_bug.cgi?id=775746#c12

--- Comment #12 from Joey Lee <jlee@suse.com> 2012-09-26 08:30:55 UTC ---

(In reply to comment #11)
> If it is a race with mdadm it can probably be fixed by adding
> 'udev-trigger' to the end of the line that start '# Should-Start:' in
> /etc/init.d/boot.md.

I just checked and found that 'udev-trigger' is not in openSUSE 12.2; one
needs to use '/sbin/udevadm trigger' instead.
https://bugzilla.novell.com/show_bug.cgi?id=775746#c13

--- Comment #13 from Joey Lee <jlee@suse.com> 2012-09-26 08:43:51 UTC ---

(In reply to comment #12)
> (In reply to comment #11)
> > If it is a race with mdadm it can probably be fixed by adding
> > 'udev-trigger' to the end of the line that start '# Should-Start:' in
> > /etc/init.d/boot.md.
>
> I just checked and found the 'udev-trigger' is not in openSUSE 12.2, need
> use '/sbin/udevadm trigger'

And in openSUSE 12.1 as well.
https://bugzilla.novell.com/show_bug.cgi?id=775746#c14

--- Comment #14 from Neil Brown <nfbrown@suse.com> 2012-09-26 11:07:23 UTC ---

You looked in the wrong place. It is in
/lib/systemd/system/udev-trigger.service. You only need that extra bit in
boot.md when systemd is being used; that is why it is "Should-Start", not
"Required-Start".
https://bugzilla.novell.com/show_bug.cgi?id=775746#c15

--- Comment #15 from John Langley <j.langley@gmx.net> 2012-10-06 16:45:15 UTC ---

I put this into /etc/init.d/boot.md:

# Should-Start: boot.scsidev boot.multipath udev-trigger

But the problems persist. Should it read "udev-trigger" or
"udev-trigger.service"?
https://bugzilla.novell.com/show_bug.cgi?id=775746#c16

--- Comment #16 from Neil Brown <nfbrown@suse.com> 2012-10-07 22:41:21 UTC ---

"udev-trigger" is correct.

I still think it is some hardware-related issue. Do you still get "could not
open unknown-block..."? That indicates something odd.

Can you attach the complete boot messages (dmesg) immediately after boot?
https://bugzilla.novell.com/show_bug.cgi?id=775746#c17

--- Comment #17 from John Langley <j.langley@gmx.net> 2012-11-09 15:09:52 UTC ---

Created an attachment (id=512689)
 --> (http://bugzilla.novell.com/attachment.cgi?id=512689)
dmesg log 16.10.2012

Attached is another dmesg file; I will add two more later.

Regarding a hardware error: I swapped all parts (different PC, different ATA
cables) except the hard disks, and the errors persist. But I ordered two new
hard disks, which will arrive this week; I will let you know whether the
errors persist with them. In fact, one hard disk (sda) has been reporting a
smartd error for some days (refer to the dmesg output).

Is it possible that the errors are related to the unusual way I set up my RAID
arrays? Instead of having one partition per hard disk configured as a RAID
device, with the RAID device then partitioned for file systems - as done with
hardware RAID - I did it the other way around: each hard disk has four
partitions, and only (sda2,sdb2) and (sda4,sdb4) respectively are assembled
into RAID arrays: (sda2,sdb2) => md0, (sda4,sdb4) => md1.

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1            2048     2105343     1051648   82  Linux swap / Solaris
/dev/sda2   *     2105344    85995519    41945088   fd  Linux raid autodetect
/dev/sda3        85995520    98574335     6289408   a0  IBM Thinkpad hibernation
/dev/sda4        98574336   976773119   439099392   fd  Linux raid autodetect
https://bugzilla.novell.com/show_bug.cgi?id=775746#c18

--- Comment #18 from John Langley <j.langley@gmx.net> 2012-11-09 15:11:17 UTC ---

Created an attachment (id=512690)
 --> (http://bugzilla.novell.com/attachment.cgi?id=512690)
dmesg output 24.10.2012
https://bugzilla.novell.com/show_bug.cgi?id=775746#c19

--- Comment #19 from John Langley <j.langley@gmx.net> 2012-11-09 15:12:01 UTC ---

Created an attachment (id=512691)
 --> (http://bugzilla.novell.com/attachment.cgi?id=512691)
dmesg output 8.11.2012
https://bugzilla.novell.com/show_bug.cgi?id=775746#c20

Herbert Meier <herbert@women-at-work.org> changed:

  CC: added herbert@women-at-work.org

--- Comment #20 from Herbert Meier <herbert@women-at-work.org> 2012-12-07 21:35:07 UTC ---

I collected the following dmesg output yesterday on a fully updated openSUSE
12.2 release (kernel 3.4.11-2.16-desktop). sdb1/sdb5 are the root/swap
partitions; while sdb6 triggers the unknown-block error, sdb3 can be added to
the second RAID device. There are also no ata-related errors.

md0 : active raid5 sda7[0] sdb6[2] sdc2[1]
md126 : active raid1 sda3[0] sdb3[1]

[ 2.209089] EXT4-fs (sdb1): mounted filesystem with ordered data mode. Opts: (null)
...
[ 12.443459] md: bind<sda7>
[ 12.446701] md: bind<sda3>
[ 12.485187] md: bind<sdc2>
[ 13.723996] Adding 1959892k swap on /dev/sdb5. Priority:-1 extents:1 across:1959892k
[ 15.845492] scsi_verify_blk_ioctl: 502 callbacks suppressed
[ 15.845496] mdadm: sending ioctl 1261 to a partition!
[ 15.845498] mdadm: sending ioctl 1261 to a partition!
[ 16.133358] mdadm: sending ioctl 800c0910 to a partition!
[ 16.133362] mdadm: sending ioctl 800c0910 to a partition!
[ 16.133365] mdadm: sending ioctl 1261 to a partition!
[ 16.133367] mdadm: sending ioctl 1261 to a partition!
[ 16.133521] mdadm: sending ioctl 1261 to a partition!
[ 16.133524] mdadm: sending ioctl 1261 to a partition!
[ 16.133674] mdadm: sending ioctl 1261 to a partition!
[ 16.133676] mdadm: sending ioctl 1261 to a partition!
[ 16.600163] md: md127 stopped.
[ 16.600755] md: bind<sdb6>
[ 16.602197] md: could not open unknown-block(8,22).
[ 16.602347] md: md_import_device returned -16
[ 16.602641] md: could not open unknown-block(8,22).
[ 16.602787] md: md_import_device returned -16
[ 16.611152] md: bind<sdb3>
[ 16.666080] md: raid1 personality registered for level 1
[ 16.666211] bio: create slab <bio-1> at 1
[ 16.666253] md/raid1:md126: active with 2 out of 2 mirrors
[ 16.666266] md126: detected capacity change from 0 to 15011020800
[ 16.667302] md126: unknown partition table
[ 19.470740] EXT4-fs (sdc1): mounted filesystem with ordered data mode. Opts: (null)
[ 19.547976] EXT4-fs (sdc4): mounted filesystem with ordered data mode. Opts: (null)
[ 19.664781] kjournald starting. Commit interval 5 seconds
[ 19.665127] EXT3-fs (sda5): using internal journal
[ 19.665129] EXT3-fs (sda5): mounted filesystem with ordered data mode
[ 19.839153] EXT4-fs (sda6): mounted filesystem with ordered data mode. Opts: (null)
[ 19.981646] EXT4-fs (sda8): mounted filesystem with ordered data mode. Opts: (null)
[ 20.070497] EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null)
[ 20.172760] EXT4-fs (md126): mounted filesystem with ordered data mode. Opts: (null)
[ 21.308920] EXT4-fs (sdb7): mounted filesystem with ordered data mode. Opts: (null)
[ 21.505698] EXT4-fs (sdb2): mounted filesystem with ordered data mode. Opts: (null)
https://bugzilla.novell.com/show_bug.cgi?id=775746#c21

Jeffrey Cheung <jcheung@suse.com> changed:

  Status: NEEDINFO -> CLOSED
  CC: added jcheung@suse.com
  InfoProvider: (cleared)
  Resolution: WONTFIX

--- Comment #21 from Jeffrey Cheung <jcheung@suse.com> 2014-02-07 03:53:20 UTC ---

With the release of the gnumeric update on January 27th, 2014, the SUSE
sponsored maintenance of openSUSE 12.2 ended. openSUSE 12.2 is now officially
discontinued and out of support by SUSE. This bug was created against openSUSE
12.1, so it is closed as WONTFIX.
participants (1)
-
bugzilla_noreply@novell.com