Re: [opensuse] Login weirdness
On 03/11/2018 06.25, David Haller wrote:
Hello,

On Fri, 02 Nov 2018, Liam Proven wrote:
On 02/11/2018 15:57, Carlos E. R. wrote:
[..]
This parameter says that there
are a number of sectors that have not yet been remapped.

To me, that is a danger sign. I don't know exactly what it means or
why, but it's worrying.

It means that the drive was at least _once_ unable to read from
that sector (see the list at the end of 'smartctl -a' output).
Thus, that sector is *pending* reallocation, but has not been
remapped yet. The remapping happens when you write to that sector,
which can be done with hdparm.

==== man hdparm ====
--write-sector
Writes zeros to the specified sector number. VERY DANGEROUS.
The sector number must be given (base10) after this option.
hdparm will issue a low-level write (completely bypassing the
usual block layer read/write mechanisms) to the specified sector.
This can be used to force a drive to repair a bad sector
(media error).
====
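
In practice, the procedure looks roughly like this (a sketch only;
/dev/sdX and sector 10 are examples, get the real LBA from the
smartctl self-test log first):

====
# verify that reading the suspect sector really fails (read-only)
hdparm --read-sector 10 /dev/sdX

# overwrite it with zeros, forcing the drive to reallocate it
# (this destroys whatever data was on that sector!)
hdparm --write-sector 10 --yes-i-know-what-i-am-doing /dev/sdX
====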

AND YES, ALL DATA ON THAT SECTOR WILL BE GONE!

That sector will then disappear from the "pending/offline
uncorrectable" counts and appear as "reallocated" instead (as long
as there are spare sectors left to remap to).

So you can still write to that sector number, but it will
(physically) be a different sector than the original one.

E.g.: sector 10 is reported as bad; you have "pending: 1, offline
uncorrectable: 1, reallocated: 0", and the test log (from a smartctl
long test, IIRC) shows "LBA_of_first_error: 10".
Say you then use hdparm to write to that sector. Then you'll get:

"pending: 0, offline unc.: 0, reallocated: 1".

Yes, this is what I was suggesting.


The error log does not change, but a further test will not fail at
sector 10 again, as that sector has now been reallocated.
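
Running such a further test and reading its log (again assuming
/dev/sdX):

====
smartctl -t long /dev/sdX      # start an extended self test
smartctl -l selftest /dev/sdX  # read the results once it is done
====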

Besides the reallocated count going up, the error seems to
disappear. Until you run out of spare sectors to reallocate to,
that is; and of course the data that was on the unreadable sectors
is gone for good.

It might be that the disk can scrape the data off the sector by
reading it multiple times, but it will still mark it as "pending"
and reallocate it when written to.

I have a disk (since replaced) that had such bad sectors, but SMART
would not say at which LBA. So I had to run badblocks... which
somehow found nothing. And a run of the long smart test would then
also find nothing, for some days... then the errors would appear
again. So I wrote the entire disk with zeroes. Some time later other
bad sectors would appear. Repeat. Then I noticed the remapped count
going up and up on each test... and I finally decided to replace the
disk.
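
For reference, a sketch of that kind of run (device name and exact
flags are examples; the dd pass destroys everything on the disk):

====
# read-only surface scan: -s shows progress, -v lists bad blocks
badblocks -sv /dev/sdX

# zero-fill the whole drive, rewriting every sector
# (ALL data on /dev/sdX is lost!)
dd if=/dev/zero of=/dev/sdX bs=1M status=progress
====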



Concurrent to this, notice that there are several "extended offline"
tests that did not complete, all at the same LBA. I would rewrite that LBA.
[...]

I am afraid I must disagree.

You can also run "badblocks" on that disk[...]

OK, I must disagree more.

Care to elaborate as to why? The following?

[..]
All hard drives have some bad sectors, it's true. Most develop more
during their operational life, also true.

But they have a pretty large reserved area (maybe 10-15%; it varies
a lot with the model, and makers don't like to disclose it, but more
than 1% of the blocks, anyway) and failed blocks are replaced from
the spare blocks.

I doubt that it's that much. Might have been in days long gone.

This remapping is normal and invisible. The OS never knows there was
a read error; the sector is just switched on the fly.

See above.

Years ago, you could even _hear_ drives (esp. Seagates) trying to
re-re-re-re-re-read a sector....
*gnuuiii*gnuuiii*gnuuiii*gnuuiii*schloink*

Taking _minutes_...

Yes. But if it was doing a "write", after the firmware (not the OS)
decides the sector is bad, it remaps it and writes the waiting data
to the new sector instead. The operating system knows nothing, only
that the disk took way longer than usual. From that point on, writes
to that sector will be as fast as usual - except that the block is
not contiguous with the rest, needing one head movement there and
one back. Taking milliseconds.

The "minutes" part was because the operating system retried on disk
errors. I think it was 10 times.




If the OS can see errors, that means that either [a] the disk's
replacement blocks are used up, meaning it has millions of bad blocks,
or [b] the disk is defective in some other way.

smartctl != the OS ;)
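
That is, the two see different things (a sketch, assuming a Linux
box and /dev/sdX):

====
# errors the OS saw: the block layer giving up after its retries
dmesg | grep -i 'i/o error'

# errors the drive itself has been quietly accounting for
smartctl -A /dev/sdX
====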

In either instance, I would regard that as a failing drive and replace
it immediately.

Agreed. Esp. if the drive is not brand new. Cue the bathtub curve.

Some drives may come with, or quickly develop, a few bad blocks but
then be stable for years. But if the drive is older, developing bad
blocks is a sure warning sign.

Don't waste time trying to rescue it. Get any remaining data off it,
ASAP. Return it for warranty replacement, if possible. If not, send it
for recycling, or take it to bits if you're curious.

Do not waste time trying to fix it, and never use it for anything other
than test purposes again.

Or use it for ephemeral stuff: news spool, download caches, whatnot,
where you don't really care if you lose the data and/or need to
fetch it again.

I never discard a disk "fast"; I give it at least a second chance.
If it doesn't develop more bad sectors, it stays in place.

So far, I have not lost data that way, in some decades... :-)

Of course, that was luck. A single transient bad sector may destroy an
important file.

--
Cheers / Saludos,

Carlos E. R.
(from 42.3 x86_64 "Malachite" at Telcontar)
