Mailinglist Archive: opensuse-support (97 mails)

Re: [opensuse-support] RAID1 disk pending sectors
Felix Miata composed on 2018-11-09 20:01 (UTC-0500):

Felix Miata composed on 2018-11-08 00:07 (UTC-0500):

Felix Miata composed on 2018-11-07 06:39 (UTC-0500):

Carlos E. R. composed on 2018-11-07 10:35 (UTC+0100):

On 07/11/2018 04.48, Felix Miata wrote:

# journalctl -b -e
...
Nov 06 21:54:55 00srv smartd[937]: Device: /dev/sdc [SAT], 8 Currently unreadable (pending) sectors
# fdisk -l /dev/sdc; smartctl -x /dev/sdc
http://fm.no-ip.com/Tmp/Hardware/Disk/smartctlx-msi85-hgst1000.txt
shows Current_Pending_Sector raw value is 8 after only 1660 power on
hours. :-(
...
Basically, your disk fails consistently on LBA 142446713, read error.
You have to find out what is there and rewrite that sector.

# smartctl -t long /dev/sdc
is now running

I missed the section pointing to the LBA. 142446713 is on sdc8, which happens
to be part of md3, which is mounted on /home. What remains is to identify the
file or structure that uses 142446713.

That same section has this puzzling line:
#10 Short offline Completed without error 00% 43301 -
That's a self-test logged at a lifetime of 43301 hours on a disk that appears
to have only 1660 power-on hours.

# cat /proc/mdstat
...
md3 : active raid1 sdb8[0] sdc8[1]
73727872 blocks super 1.0 [2/2] [UU]
bitmap: 1/1 pages [4KB], 65536KB chunk
...
# fdisk -l /dev/sdc
...
Device Start End Sectors Size Type
...
/dev/sdc8 61171712 208627711 147456000 70.3G Linux RAID
...
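
The arithmetic behind "142446713 is on sdc8" can be sketched as below. This is only an illustration: the function name is made up, the 4096-byte filesystem block size is an assumption (the thread does not state it), and it assumes the md data begins at the start of the partition, which is what I would expect with 1.0 metadata stored at the end of the member. The 512-byte sector size matches the fdisk listing (147456000 sectors for 70.3G).

```shell
# Sketch: locate an absolute disk LBA inside a partition.
# Assumptions (not from the thread): a 4096-byte filesystem block and
# md data starting at the partition start.
SECTOR_SIZE=512
FS_BLOCK_SIZE=4096

# Print the sector offset of LBA $3 inside the partition [$1, $2],
# or return non-zero when the LBA lies outside it.
lba_offset_in_partition() {
    start=$1 end=$2 lba=$3
    if [ "$lba" -ge "$start" ] && [ "$lba" -le "$end" ]; then
        echo $((lba - start))
    else
        return 1
    fi
}

# sdc8 start and end sectors from the fdisk listing above.
offset=$(lba_offset_in_partition 61171712 208627711 142446713)
fs_block=$((offset * SECTOR_SIZE / FS_BLOCK_SIZE))
echo "sector offset in sdc8: $offset"
echo "filesystem block (assumed 4 KiB): $fs_block"
```

If /home turned out to be ext3 or ext4 (the thread does not say), the `icheck` and `ncheck` requests of debugfs could then map that block to an inode and a path.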

Shouldn't the following process force LBA 142446713 to be reallocated?

fail sdc8 from md3
remove sdc8 from md3
dd if=/dev/zero of=/dev/sdc8 bs=32768
add sdc8 to md3

It seems it would be simpler than trying to figure out which file or inode
uses it.
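
A narrower alternative to zeroing the whole partition would be to overwrite only the one unreadable sector. The sketch below shows the dd invocation on a scratch file rather than a real disk; `zero_one_sector` is a made-up helper, and 512-byte sectors are assumed. Pointed at a real device it would destroy that sector's contents, so the member would have to be out of the array first, as in the steps above.

```shell
# Overwrite exactly one 512-byte sector of a target (file or block device).
# conv=notrunc keeps a regular file from being truncated after the write.
zero_one_sector() {
    target=$1
    sector=$2
    dd if=/dev/zero of="$target" bs=512 count=1 seek="$sector" conv=notrunc 2>/dev/null
}

# Safe demonstration on a scratch file standing in for the disk.
scratch=$(mktemp)
# Fill 8 sectors (4096 bytes) with the letter A.
head -c 4096 /dev/zero | tr '\0' 'A' > "$scratch"
zero_one_sector "$scratch" 3

# Count the bytes that are still non-zero and the total size.
remaining=$(tr -d '\0' < "$scratch" | wc -c | tr -d ' ')
total=$(wc -c < "$scratch" | tr -d ' ')
echo "non-zero bytes left: $remaining of $total"
rm -f "$scratch"
```

On the real disk the target would be /dev/sdc and the sector the absolute LBA reported by smartctl, or /dev/sdc8 with the LBA minus the partition's start sector.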

I didn't see a direct response, so I tried:

smartctl -a /dev/sdc | less
cat /proc/mdstat
Mnt
mdadm --manage /dev/md3 --fail /dev/sdc8
mdadm --manage /dev/md3 --remove /dev/sdc8
cat /proc/mdstat
smartctl -a /dev/sdb | less
cat /proc/mdstat
dd if=/dev/zero of=/dev/sdc8 bs=32768
smartctl -a /dev/sdc | less
mdadm --manage /dev/md3 --add /dev/sdc8
cat /proc/mdstat
smartctl -a /dev/sdc | less
cat /proc/mdstat

Subsequently:
smartctl -t long /dev/sdc
(5 hours later)
smartctl -a /dev/sdc | less
to find:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
...
  5 Reallocated_Sector_Ct   0x0033 100   100   005    Pre-fail Always  -           3
...
197 Current_Pending_Sector  0x0022 100   100   000    Old_age  Always  -           5

SMART Error Log Version: 0
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       70%      1724         605782533
# 2  Extended offline    Completed: read failure       70%      1719         605782533
# 3  Extended offline    Completed: read failure       90%      1668         142446713
# 4  Extended offline    Completed: read failure       90%      1591         142446713
# 5  Extended offline    Completed: read failure       90%      1590         142446713
# 6  Extended offline    Completed: read failure       90%      1175         142446713
# 7  Extended offline    Completed: read failure       90%       919         142446713
# 8  Extended offline    Completed: read failure       90%       249         142446713
# 9  Extended offline    Completed: read failure       90%        81         142446713
#10  Short offline       Completed without error       00%        18         -
#11  Short offline       Completed without error       00%        12         -
#12  Short offline       Completed without error       00%         8         -
#13  Short offline       Completed without error       00%     43301         -
...
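
The self-test log now names a second LBA, and it is worth checking where it sits. A rough sketch follows: `failing_lbas` is a made-up helper that expects unwrapped `smartctl -l selftest` output on stdin, the sample rows are copied from the log above, and 208627711 is the end sector of sdc8 from the earlier fdisk listing.

```shell
# Print each distinct LBA_of_first_error from a SMART self-test log.
# Assumes one log entry per line, with the LBA as the last field.
failing_lbas() {
    awk '/read failure/ { print $NF }' | sort -un
}

# Sample rows taken from the self-test log above.
sample=$(cat <<'EOF'
# 1  Extended offline    Completed: read failure       70%      1724         605782533
# 3  Extended offline    Completed: read failure       90%      1668         142446713
# 4  Extended offline    Completed: read failure       90%      1591         142446713
#10  Short offline       Completed without error       00%        18         -
EOF
)

lbas=$(printf '%s\n' "$sample" | failing_lbas)
sdc8_end=208627711
for lba in $lbas; do
    if [ "$lba" -gt "$sdc8_end" ]; then
        echo "$lba is past the end of sdc8"
    else
        echo "$lba is not past the end of sdc8"
    fi
done
```

Since 605782533 is beyond sdc8's last sector, zeroing sdc8 alone could not have touched it, which would be consistent with pending sectors still being reported afterwards.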
journalctl | tail -n(#)
Nov 09 14:26:00 00srv kernel: CIFS VFS: bogus file nlink value 0
Nov 09 14:40:04 00srv smartd[962]: Device: /dev/sdc [SAT], 5 Currently unreadable (pending) sectors
Nov 09 15:10:03 00srv smartd[962]: Device: /dev/sdc [SAT], 5 Currently unreadable (pending) sectors
Nov 09 15:40:03 00srv smartd[962]: Device: /dev/sdc [SAT], 5 Currently unreadable (pending) sectors
Nov 09 15:40:03 00srv smartd[962]: Device: /dev/sdc [SAT], previous self-test completed with error (read test element)
Nov 09 15:40:04 00srv smartd[962]: Device: /dev/sdc [SAT], Self-Test Log error count increased from 8 to 9
Nov 09 16:10:04 00srv smartd[962]: Device: /dev/sdc [SAT], 5 Currently unreadable (pending) sectors
Nov 09 16:40:04 00srv smartd[962]: Device: /dev/sdc [SAT], 5 Currently unreadable (pending) sectors
Nov 09 17:10:04 00srv smartd[962]: Device: /dev/sdc [SAT], 5 Currently unreadable (pending) sectors
Nov 09 17:40:03 00srv smartd[962]: Device: /dev/sdc [SAT], 5 Currently unreadable (pending) sectors
Nov 09 18:10:03 00srv smartd[962]: Device: /dev/sdc [SAT], 5 Currently unreadable (pending) sectors
Nov 09 18:40:04 00srv smartd[962]: Device: /dev/sdc [SAT], 5 Currently unreadable (pending) sectors
Nov 09 19:10:04 00srv smartd[962]: Device: /dev/sdc [SAT], 5 Currently unreadable (pending) sectors
Nov 09 19:40:04 00srv smartd[962]: Device: /dev/sdc [SAT], 5 Currently unreadable (pending) sectors

:-(
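
Since smartd repeats the same count every half hour, a small filter that prints only the changes makes the trend easier to see. This is just a sketch: `pending_changes` is a made-up name, it assumes unwrapped journal lines for a single device on stdin, and the sample lines are copied from this thread.

```shell
# Print "month day time count" each time the pending-sector count
# reported by smartd differs from the previous report.
pending_changes() {
    awk '/Currently unreadable \(pending\) sectors/ {
        # The count is the field immediately before the word "Currently".
        for (i = 2; i <= NF; i++)
            if ($i == "Currently") n = $(i - 1)
        if (!seen || n != last) print $1, $2, $3, n
        seen = 1
        last = n
    }'
}

# Sample smartd lines taken from the journal excerpts in this thread.
sample=$(cat <<'EOF'
Nov 06 21:54:55 00srv smartd[937]: Device: /dev/sdc [SAT], 8 Currently unreadable (pending) sectors
Nov 09 14:40:04 00srv smartd[962]: Device: /dev/sdc [SAT], 5 Currently unreadable (pending) sectors
Nov 09 15:10:03 00srv smartd[962]: Device: /dev/sdc [SAT], 5 Currently unreadable (pending) sectors
EOF
)

changes=$(printf '%s\n' "$sample" | pending_changes)
printf '%s\n' "$changes"
```

On a live system the input could come from `journalctl` piped into the function instead of the sample text.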

I guess now I need to fail the entirety of sdc, zero fill with ddrescue instead
of dd, and smartctl -x again before either replacing or trying to add back to
RAID?
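
Failing the entirety of sdc means finding every array that has a member on it. A minimal sketch, assuming /proc/mdstat-style text on stdin: `md_members_on_disk` is a made-up helper, the md3 line comes from the mdstat output earlier in the thread, and the md9 line is hypothetical, included only to show an array that is skipped. The script prints the mdadm commands and does not run them.

```shell
# List "array member" pairs for every md array with a partition of $1.
md_members_on_disk() {
    disk=$1
    awk -v disk="$disk" '$2 == ":" && $3 == "active" {
        for (i = 4; i <= NF; i++) {
            member = $i
            sub(/\[.*/, "", member)   # drop the "[1]" slot suffix and any flags
            if (member ~ ("^" disk "[0-9]+$")) print $1, member
        }
    }'
}

# md3 is from the thread; md9 is a hypothetical array not involving sdc.
sample=$(cat <<'EOF'
md3 : active raid1 sdb8[0] sdc8[1]
md9 : active raid1 sda1[0] sdb1[1]
EOF
)

# Build the fail/remove commands for review; nothing is executed here.
plan=$(printf '%s\n' "$sample" | md_members_on_disk sdc |
    while read -r array member; do
        echo "mdadm --manage /dev/$array --fail /dev/$member"
        echo "mdadm --manage /dev/$array --remove /dev/$member"
    done)
printf '%s\n' "$plan"
```

Feeding it the real /proc/mdstat instead of the sample would list one fail/remove pair per sdc partition that is still in an array.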

Used mdadm to fail and remove all sdc members, then
dd if=/dev/zero of=/dev/sdc
Now:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
...
  5 Reallocated_Sector_Ct   0x0033 100   100   005    Pre-fail Always  -           13
...
196 Reallocated_Event_Count 0x0032 100   100   000    Old_age  Always  -           13

SMART Error Log Version: 0
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1736         -
# 2  Extended offline    Completed: read failure       70%      1724         605782533
# 3  Extended offline    Completed: read failure       70%      1719         605782533
# 4  Extended offline    Completed: read failure       90%      1668         142446713
# 5  Extended offline    Completed: read failure       90%      1591         142446713
# 6  Extended offline    Completed: read failure       90%      1590         142446713
# 7  Extended offline    Completed: read failure       90%      1175         142446713
# 8  Extended offline    Completed: read failure       90%       919         142446713
# 9  Extended offline    Completed: read failure       90%       249         142446713
#10  Extended offline    Completed: read failure       90%        81         142446713
#11  Short offline       Completed without error       00%        18         -
#12  Short offline       Completed without error       00%        12         -
#13  Short offline       Completed without error       00%         8         -
#14  Short offline       Completed without error       00%     43301         -
9 of 9 failed self-tests are outdated by newer successful extended offline self-test # 1

# journalctl | tail
Nov 09 21:23:42 00srv kernel: md/raid1:md4: Disk failure on sdc9, disabling device.
md/raid1:md4: Operation continuing on 1 devices.
Nov 09 21:23:50 00srv kernel: md/raid1:md5: Disk failure on sdc10, disabling device.
md/raid1:md5: Operation continuing on 1 devices.
Nov 09 21:40:05 00srv smartd[962]: Device: /dev/sdc [SAT], 5 Currently unreadable (pending) sectors
Nov 09 22:21:56 00srv systemd-udevd[8125]: Process '/sbin/mdadm -If sdc10 --path pci-0000:00:1f.2-ata-5' failed with exit code 1.
Nov 10 03:37:33 00srv kernel: CIFS VFS: bogus file nlink value 0
Nov 10 03:37:33 00srv kernel: CIFS VFS: bogus file nlink value 0
Nov 10 03:37:36 00srv kernel: CIFS VFS: bogus file nlink value 0
Nov 10 03:37:36 00srv kernel: CIFS VFS: bogus file nlink value 0

I guess I can only see how well it can do by adding it back and seeing what
happens. :-p I guess if I see more pendings show up, I'll put in the spare and
file an RMA claim.
--
Evolution as taught in public schools is religion, not science.

Team OS/2 ** Reg. Linux User #211409 ** a11y rocks!

Felix Miata *** http://fm.no-ip.com/