Mailinglist Archive: opensuse (626 mails)

Re: [opensuse] Login weirdness
On 02/11/2018 15.27, Michael Fischer wrote:
On Thu, Nov 01, Carlos E. R. wrote:
On 01/11/2018 16.07, Michael Fischer wrote:

smartctl in this case.


...

I append the output of `smartctl -a /dev/sda`. I meant to
run the --test=long last night, but fell asleep without triggering it... (bah)

It happens :-)


I note that `smartctl -a` basically said "PASSED" but that there were a few
read errors. I've no idea what (more) to make of them.

I'll look.

FWIW, I realized that /dev/sda is my /home and /tmp on rust, not my
/ on an SSD, so it is perhaps _less_ likely to be the cause of my login
weirdness (I hope), and more amenable to a clean reinstall/upgrade.

Ah, yes. Better run the test on all disks.
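
A minimal sketch of starting the long test on several disks in one go. The device names in the usage line are only examples, and the `DRY_RUN` variable is an invention of this sketch (not a smartctl feature) so the commands can be previewed first; `smartctl -t long` is the short form of `--test=long`.

```shell
# Start an extended (long) SMART self-test on every device given as argument.
# With DRY_RUN=1 (a convention of this sketch) the commands are only printed.
start_long_tests() {
    for dev in "$@"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "smartctl -t long $dev"
        else
            smartctl -t long "$dev"
        fi
    done
}
```

For example, `DRY_RUN=1 start_long_tests /dev/sda /dev/sdb` prints the two commands. Without `DRY_RUN`, and as root, it starts the tests, and the results later show up in `smartctl -l selftest /dev/sda`.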


smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.14.6-1.g45f120a-default] (SUSE RPM)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
...


=== START OF READ SMART DATA SECTION ===
...
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   074   063   006    Pre-fail  Always       -       26493824
  3 Spin_Up_Time            0x0003   096   096   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       46
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0

Ok, but watch this parameter.

  7 Seek_Error_Rate         0x000f   077   060   045    Pre-fail  Always       -       57728258
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       7752

Not an old disk.

 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       41
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   099   099   000    Old_age   Always       -       1 1 1
189 High_Fly_Writes         0x003a   099   099   000    Old_age   Always       -       1
190 Airflow_Temperature_Cel 0x0022   060   050   040    Old_age   Always       -       40 (Min/Max 30/41)
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       363
194 Temperature_Celsius     0x0022   040   016   000    Old_age   Always       -       40 (0 16 0 0 0)
195 Hardware_ECC_Recovered  0x001a   008   004   000    Old_age   Always       -       26493824
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       16
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       16

Ah. Yes, this is important.

199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       7716h+03m+29.370s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       6351787553
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       352717284

SMART Error Log Version: 1
ATA Error Count: 1
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 7752 hours (323 days + 0 hours)
When the command that caused the error occurred, the device was in an unknown state.

This section is beyond my skills, sorry. These are errors internal to
the disk firmware. And this one is very recent, at 7752 hours.

...

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short captive       Interrupted (host reset)      60%      7752         -
# 2  Short offline       Completed without error       00%      7752         -
# 3  Extended offline    Completed: read failure       50%      7136         1001593016
# 4  Extended offline    Completed: read failure       50%      6296         1001593016
# 5  Extended offline    Completed: read failure       50%      5624         1001593016
# 6  Extended offline    Completed: read failure       50%      4784         1001593016
# 7  Extended offline    Completed: read failure       50%      4112         1001593016
# 8  Extended offline    Completed: read failure       50%      3440         1001593016
# 9  Extended offline    Completed: read failure       50%      2600         1001593016
#10  Extended offline    Completed: read failure       50%      1929         1001593016
#11  Extended offline    Completed: read failure       50%      1257         1021004240
#12  Extended offline    Completed: read failure       50%       585         1001593008
#13  Short offline       Completed without error       00%         0         -


Well, you have to do the long test to be sure. Note that you can run
the test while you use the computer; it may just become sluggish or
stop responding for a while. Do not power it off if that happens. Of
course, the test will take longer if the computer is busy.

Parameter 197.

All hard disks develop errors. Operating systems know that, and can mark
the bad sectors so that they are simply not used. For many years now,
disks have also been able to remap bad sectors to spare sectors reserved
for that purpose at manufacture. The firmware does this automatically
when something writes to the bad sector. This parameter says that there
are a number of sectors (16 here) that have not been remapped yet.
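
To keep an eye on just these counters, something like the following may do. It is a sketch that assumes the usual one-row-per-attribute layout of `smartctl -A`, where the raw value is the tenth column; `sector_health` is a made-up helper name.

```shell
# Print name=raw_value for the attributes that track bad sectors:
# 5 (reallocated), 197 (pending remap), 198 (offline uncorrectable).
# Reads `smartctl -A` output on standard input.
sector_health() {
    awk '$1 == 5 || $1 == 197 || $1 == 198 { print $2 "=" $10 }'
}
```

Usage would be `smartctl -A /dev/sda | sector_health`. With the values quoted in this thread it reports Reallocated_Sector_Ct=0, Current_Pending_Sector=16 and Offline_Uncorrectable=16.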

Related to this, notice that there are several "extended offline" tests
that stopped with a read failure, nearly all at the same LBA
(1001593016). I would rewrite that LBA.
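
The failing addresses can be pulled out of the self-test log with a one-liner along these lines. A sketch only: it assumes each test occupies a single row, as smartctl prints it, with the LBA as the last field of every "read failure" row; `failing_lbas` is a hypothetical helper name.

```shell
# List each distinct LBA at which a self-test reported a read failure.
# Reads `smartctl -l selftest` output on standard input.
failing_lbas() {
    awk '/read failure/ { print $NF }' | sort -un
}
```

Usage would be `smartctl -l selftest /dev/sda | failing_lbas`. On the log in this thread it lists 1001593008, 1001593016 and 1021004240, so two other addresses have also failed at some point besides the one that keeps repeating.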

You could try to find out which file that LBA belongs to, recover the
file if possible or replace it with a backup copy, and then write to
that LBA. Not trivial. The write operation should trigger the remap.
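
The first step, mapping the LBA to a filesystem block, is plain arithmetic. A sketch under assumptions this thread does not confirm: 512-byte logical sectors, a 4096-byte filesystem block, and a partition that starts at sector 2048 (read the real start from `fdisk -l`). `lba_to_fs_block` is a hypothetical helper name.

```shell
# Convert an absolute disk LBA into a block number inside a partition's filesystem.
# Arguments: LBA, partition start sector, sector size in bytes (default 512,
# an assumption), filesystem block size in bytes (default 4096, an assumption).
lba_to_fs_block() {
    lba=$1
    part_start=$2
    sector_size=${3:-512}
    block_size=${4:-4096}
    if [ "$lba" -lt "$part_start" ]; then
        echo "LBA $lba lies before the partition start $part_start" >&2
        return 1
    fi
    # Byte offset inside the partition, divided by the filesystem block size.
    echo $(( (lba - part_start) * sector_size / block_size ))
}
```

With those assumed numbers, `lba_to_fs_block 1001593016 2048` prints 125198871. If the filesystem is ext2/3/4 (again an assumption, as is the partition name), `debugfs -R "icheck 125198871" /dev/sda1` should name the inode that uses the block, and `debugfs -R "ncheck <inode>" /dev/sda1` the path of the file.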

Then run the long test again to see whether it stops at another LBA,
and repeat until it completes without error.


You can also run "badblocks" on that disk. This test takes many hours
(even days) and has to be done while the disk is unmounted, thus from
rescue media. Sometimes this is enough to clear those bad sectors,
sometimes they appear again days later. If the command produces a list
of bad sectors, then write to them to force a remap.
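
A sketch of a read-only badblocks run that first refuses to touch a mounted disk. `is_mounted` and `scan_readonly` are made-up helper names and the output file name is arbitrary; `-s` shows progress, `-v` is verbose and `-o` saves the list of bad blocks. Without `-n` or `-w`, badblocks only reads.

```shell
# Succeed if the device, or anything whose name starts with it (its
# partitions), appears in the mounts table. The prefix match is
# deliberately conservative. The second argument defaults to /proc/mounts.
is_mounted() {
    dev=$1
    mounts=${2:-/proc/mounts}
    awk -v d="$dev" 'index($1, d) == 1 { found = 1 } END { exit !found }' "$mounts"
}

# Read-only surface scan of a whole disk; the list of bad blocks is
# written to a file in the current directory.
scan_readonly() {
    dev=$1
    if is_mounted "$dev"; then
        echo "$dev is mounted; unmount it or boot rescue media first" >&2
        return 1
    fi
    badblocks -sv -o "badblocks-$(basename "$dev").txt" "$dev"
}
```

Usage would be `scan_readonly /dev/sda` as root. Keep in mind that badblocks counts in its own block size (1024 bytes by default), so the numbers it lists are not the 512-byte LBAs that smartctl reports.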

One method is to overwrite the entire partition with zeros or whatever,
then restore the data from backup.

Eventually, if you see the number of bad sectors grow (shown by
parameter 5), the only solution is to replace the disk. Some people
"panic" at the first bad sector and replace the disk. I have had disks
in use that got some bad sectors very soon; I did as above and heard of
no more bad sectors for years, until I replaced them with bigger disks
because I wanted more space.

The first hard disk I bought came with a bad sector list printed on
paper by the manufacturer. All disks came with that at that time. It was
a 30-megabyte disk, which was huge at that time.


--
Cheers / Saludos,

Carlos E. R.
(from 42.3 x86_64 "Malachite" at Telcontar)
