Re: [opensuse] smartctl - Help with smartctl output - should I be concerned?

21 Jan 2010

      Carlos E. R. said the following on 01/21/2010 03:50 PM:
...
> On Thursday, 2010-01-21 at 12:31 -0000, Dave Howorth wrote:

...
> No, the operating system doesn't know a thing, because this is completely 
> internal to the HD firmware.
It is now.
It didn't used to be, and I suspect it isn't always.
...
> I don't know the details, that is, I haven't
> seen a paper from a manufacturer explaining how exactly they do it.
You go on to make a remarkably good guess.
...
> From 
> what I gathered, when the HD attempts to write to a sector and it fails, 
> and determines (somehow?) that that sector is bad and not recoverable,
And that's the important point - is it recoverable?
See later.
...
> it decides to write the data to another sector, a spare sector defined as 
> such during design by the manufacturer. Somehow, somewhere, external to 
> the filesystem data, it stores that any read/write operation destined to 
> the "bad" sector will happen instead on the remapped sector: meaning that 
> the head has to move there, and operation is a tad slower.
> All the system notices is that the original write operation went slower. 
> The HD disk reports success... nothing happened. Afterward, if you run 
> smartctl, you see the remap counter has gone one up, that's all.
...
> It is different, though, if the problem occurs during a read. The system 
> will probably get a read failure code, but the HD will do no remapping; I 
> guess because it doesn't know what the correct data to write should be.
I don't know if that's the case now, but see below.
...
> Again, it is possible that there is a protocol defined (perhaps it is 
> manufacturer dependent) for the operating system to intervene and trigger 
> a remap. I haven't heard of such, but certainly, in case of a raid, it 
> would be very interesting to have.
That would be interesting ..

Anyway:

Back at the beginning of the 1980s I was working for a UNIX "OEM" shop.
Mostly we were porting UNIX to the new microprocessors.  If you recall,
that was when there were lots of 16-bit micros coming onto the market,
most of which aren't around today.  There were also a lot of
manufacturers trying to do a computer-in-a-box, and they wanted an OS, a
*real* OS, not a single user thing like MS-DOS.  Just as DEC had found a
niche below IBM, they were finding niches below DEC/DG.

I took an idea that a colleague had sketched out and wrote a disk driver
for the PDP-11 under UNIX Version 7, you know, the tapes with "Love,
Denis".  I also back ported it to Version 6 for a Northern Telecom site.
I still have the backup tapes of that project but they are probably
unreadable now.

The big difference between what you described and what I built was that
those old huge drives had a CRC checksum at the end of each sector, and
it contained enough information to perform at least one bit of
correction.  If you trusted the CRC absolutely you could correct a few
more bits and do some run-length correction.  However, all this was
carried out during an interrupt, so you didn't want to do too much
computation.
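
To give a feel for that kind of single-bit correction, here is a rough
sketch in Python.  It is not the PDP-11 driver and not the code those
drives used: it assumes zlib's CRC-32 as a stand-in for the sector
checksum, and the function name correct_single_bit is made up for the
illustration.  It simply tries each bit in turn until the checksum
matches, which is the brute-force version of what a syndrome lookup does
much faster.

```python
import zlib
from typing import Optional


def correct_single_bit(data: bytes, stored_crc: int) -> Optional[bytes]:
    """Return the sector with at most one bit repaired, or None.

    Assumption: stored_crc is the CRC-32 (zlib) of the sector as it was
    originally written, and the checksum itself was read back intact.
    """
    if zlib.crc32(data) == stored_crc:
        return data  # clean read, nothing to fix

    buf = bytearray(data)
    for i in range(len(buf) * 8):
        byte, bit = divmod(i, 8)
        buf[byte] ^= 1 << bit              # try flipping this one bit
        if zlib.crc32(buf) == stored_crc:
            return bytes(buf)              # that flip repaired the sector
        buf[byte] ^= 1 << bit              # undo it and try the next bit
    return None                            # more than one bit is wrong
```

The loop is linear in the sector size, which is fine for a demonstration
but is exactly the sort of work you would not want inside an interrupt
handler.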

Once a bad read was detected and corrected, the corrected sector was
written to one of the spare sectors and the mapping table updated.
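
The bookkeeping for that can be sketched roughly as follows.  This is an
illustration only, with an in-memory dictionary standing in for the disk;
the class name RemappedDisk and its methods are invented here, and the
layout (spares placed after the normal sectors) is an assumption, not how
any particular drive or driver arranged them.

```python
class RemappedDisk:
    """Toy disk with a pool of spare sectors and a remap table."""

    def __init__(self, n_sectors: int, n_spares: int):
        self.sectors = {}   # physical sector number -> data
        self.remap = {}     # bad logical sector -> spare physical sector
        # Assumption: spares live just past the last normal sector.
        self.free_spares = list(range(n_sectors, n_sectors + n_spares))

    def physical(self, logical: int) -> int:
        """Translate a logical sector, following the remap table if needed."""
        return self.remap.get(logical, logical)

    def read(self, logical: int) -> bytes:
        return self.sectors[self.physical(logical)]

    def write(self, logical: int, data: bytes) -> None:
        self.sectors[self.physical(logical)] = data

    def remap_sector(self, logical: int, corrected: bytes) -> int:
        """Store the corrected data in a spare and update the table."""
        if not self.free_spares:
            raise RuntimeError("no spare sectors left")
        spare = self.free_spares.pop(0)
        self.sectors[spare] = corrected
        self.remap[logical] = spare
        return spare
```

After remap_sector(5, data), every later read or write of logical sector
5 lands on the spare, and the caller never needs to know.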

That early correction and early re-mapping is the basic difference
between what you described and what I built.  I strongly suspect that
modern on-board drive controllers do much the same.

Well, OK it was a bit more than that.  If every slight read error
resulted in a remap those old drives would be remapped to hell and back!
No, a first error caused a re-read.  On some controllers you could
pre-program the retries, so that by the time the s/w driver saw the error
the hardware & microprogramming had already given up.
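
The retry-first policy looks something like this in outline.  Again a
sketch under assumptions: read_once is a hypothetical callback returning
(data, ok), the default of three retries is arbitrary, and
make_flaky_reader exists only to simulate a sector that fails a given
number of times before reading cleanly.

```python
def read_with_retries(read_once, sector, max_retries=3):
    """Read a sector, re-reading on error before declaring it bad.

    Returns (data, attempts, persistent).  persistent is True only when
    the first read and every retry failed, i.e. the case that would go
    on to correction and remapping.
    """
    for attempt in range(1, max_retries + 2):   # first read plus retries
        data, ok = read_once(sector)
        if ok:
            return data, attempt, False         # transient error, no remap
    return data, attempt, True                  # error survived all retries


def make_flaky_reader(failures):
    """Simulated reader (illustration only): fails `failures` times."""
    state = {"left": failures}

    def read_once(sector):
        if state["left"] > 0:
            state["left"] -= 1
            return b"", False
        return b"sector %d" % sector, True

    return read_once
```

A sector that fails twice and then reads cleanly comes back on the third
attempt with persistent set to False, so it is never remapped.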

Of course an on-board drive controller has better low-level access to
the drive and the raw signals than my host-based driver.

However, I recall when using AIX with a large IBM RAID array which was
supporting an extensive DB2 database that I once had to update the
microcode on each and every disk drive in the array -- while it was
running.  (Yes, there were 'pauses' in performance.)  This makes me
think that at least some modern machines are well integrated with the
internals of the disk drives, which addresses your final point.

-- 
The scientific name for an animal that doesn't either run from or fight
its enemies is lunch.
  - Michael Friedman


Anton Aylward