Re: [opensuse] Error's on raid disk
"Carlos E. R." <robin.listas@telefonica.net> 2007-05-08 12:30:55
The Tuesday 2007-05-08 at 10:22 +0200, Wilfred van Velzen wrote:
I started the test at about 18:00 (local time), and now at 10:15, it says:
Self-test execution status: ( 243) Self-test routine in progress... 30% of test remaining.
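For readers wondering what the number in parentheses means: as far as I know, it is the raw self-test status byte, where the upper four bits hold a status code (15 means a test is running) and the lower four bits hold the remaining part of the test in tenths. A minimal sketch of that decoding follows; the function name is made up for illustration.

```shell
#!/bin/sh
# Decode the "Self-test execution status" byte printed by smartctl.
# High nibble: status code (15 = self-test routine in progress).
# Low nibble: tenths of the test still remaining.
# decode_selftest_status is a hypothetical helper, not part of smartmontools.
decode_selftest_status() {
    status=$1
    code=$(( status >> 4 ))
    remaining=$(( (status & 15) * 10 ))
    if [ "$code" -eq 15 ]; then
        echo "in progress, ${remaining}% remaining"
    else
        echo "status code ${code}, ${remaining}% remaining"
    fi
}

decode_selftest_status 243   # the value quoted above
```

For 243 this gives code 15 and 30% remaining, which agrees with the text smartctl printed next to it.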
Wow, that's a large disk. Or busy. Usually, it's about two hours or so.
It's big: 750GB
Not too busy, but busy enough during work hours, because the 30% hasn't moved yet...
The performance of the server seems to be ok, so I let it run for now...
Mine crawls while doing the surface test part. On my older disks I can continue working almost transparently.
None of the users are complaining, so it's fast enough! ;-)
You can also look at the smart log of the disk. If there was an uncorrectable error and there was a write attempt to that sector, it will already be remapped, and thus it will not show again on tests.
There is nothing in the logs.
Not the system log, but the smart log that resides in the disk; you can dig it out with "smartctl -a device".
Yes, that was what I meant. I checked with:
smartctl -l error /dev/sdb
and:
smartctl -l selftest /dev/sdb
But that shows the same output as the -a option...
Only the remap counter should show it (Reallocated_Sector_Ct).
For /dev/sdb:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       4
/dev/sda:
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
The RAW_VALUE is different on the disk that has the "problem", so is this the value that you should look at?
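To pull out just that number for a script, something like the following might do. It is only a sketch: it assumes the attribute table layout shown in this thread (RAW_VALUE as the tenth column), and the helper name is invented here.

```shell
#!/bin/sh
# Print the RAW_VALUE (10th column) of Reallocated_Sector_Ct from
# attribute-table text supplied on standard input.
# reallocated_raw is a hypothetical helper name.
reallocated_raw() {
    awk '$2 == "Reallocated_Sector_Ct" { print $10 }'
}

# Sample rows copied from the thread. On a live system one would pipe
# in the real table instead:  smartctl -A /dev/sdb | reallocated_raw
printf '%s\n' \
  'ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE' \
  '  5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 4' \
  | reallocated_raw
```

Matching on the attribute name rather than on ID 5 is a choice made for readability; either should identify the same row.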
Right. If I interpret it correctly, your sda has four sectors remapped. It
sdb!
probably can work like this for years without problems, but watch it, and if they keep increasing, you should think about replacing the HD.
I'll keep a close eye on it!
Disks are designed to survive bad sectors; it's a normal occurrence, and they are prepared for that. But if they keep growing, then it becomes a problem or a symptom of failure.
I'll advise the one who controls the money to order a spare one in advance, so we can replace it if necessary. It's one of the disks in a raid 1 configuration, so it shouldn't be an immediate problem if one disk fails...
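"Watch it" could be as simple as comparing the current remap count with the one noted last time. A rough sketch, assuming both counts are plain integers obtained elsewhere; the helper name and the second example value (9) are hypothetical and not taken from the thread.

```shell
#!/bin/sh
# Compare the current remap count with the last recorded one.
# Arguments: previous count, current count (both plain integers).
# remap_trend is a hypothetical helper name.
remap_trend() {
    prev=$1
    cur=$2
    if [ "$cur" -gt "$prev" ]; then
        echo "growing: $prev -> $cur"
    else
        echo "stable at $cur"
    fi
}

remap_trend 4 4   # nothing new since the last check
remap_trend 4 9   # hypothetical later reading that would deserve attention
```

How often to check and how much growth is too much are left open here; the thread only says to keep an eye on it.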
This isn't something that can be fixed on short notice ;), so I hope you will see this message!
Yep, I noticed, because you also sent a CC to me: in those cases Pine shows a yellow mark :-)
I will keep doing this, then... ;)

Met vriendelijke groet / Best regards,
Wilfred van Velzen
--
SERCOM Regeltechniek b.v.
Heereweg 9
2161 AB Lisse
Nederland
+31 (0)252 416530 (voice)
+31 (0)252 419481 (fax)
<http://www.sercom.nl/>

All our quotations, all orders placed with us, and all agreements concluded with us are subject to the METAALUNIEVOORWAARDEN (the Metaalunie terms and conditions), filed with the Registry of the District Court of Rotterdam, as they read in the text most recently filed there. The terms of delivery will be sent to you on request.
--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
For additional commands, e-mail: opensuse+help@opensuse.org
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

The Tuesday 2007-05-08 at 13:09 +0200, Wilfred van Velzen wrote:
Wow, that's a large disk. Or busy. Usually, it's about two hours or so.
It's big: 750GB
Not too busy, but busy enough during work hours, because the 30% hasn't moved yet...
Probably busy enough that the test doesn't progress (much). If the test is well designed, normal disk activity has priority. You probably just have to wait longer and see.

There is a trick, although you may not like it. If the raid is in software, you can deactivate one of the hard disks (simulate a failure). The other disk(s) take over the load, the failed one goes idle, and the test can happily progress on that one. However, if the other disk goes down in the interval... ouch :-(
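As an illustration of that trick, here is a sketch that only prints the commands rather than running them. It assumes a Linux md array managed with mdadm, which the thread does not actually confirm, and the device names /dev/md0, /dev/sdb1 and /dev/sdb are placeholders. The function name is invented for this example.

```shell
#!/bin/sh
# Print (do NOT run) the commands for temporarily taking one member out
# of a Linux software RAID so a long self-test can run on an idle disk.
# Array, member and disk names are placeholders, not values from the thread.
# plan_idle_selftest is a hypothetical helper name.
plan_idle_selftest() {
    array=$1
    member=$2
    disk=$3
    echo "mdadm $array --fail $member"     # mark the member as failed
    echo "mdadm $array --remove $member"   # detach it from the array
    echo "smartctl -t long $disk"          # start the extended self-test
    echo "mdadm $array --add $member"      # once the test is done: re-add (starts a resync)
}

plan_idle_selftest /dev/md0 /dev/sdb1 /dev/sdb
```

Printing instead of executing is deliberate: while the member is out, the array has no redundancy, which is exactly the "ouch" case mentioned above, so each step is better reviewed and run by hand.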
Not the system log, but the smart log that resides in the disk; you can dig it out with "smartctl -a device".
Yes, that was what I meant. I checked with:
smartctl -l error /dev/sdb
and:
smartctl -l selftest /dev/sdb
But that shows the same output as the -a option...
Ah... I expected something like this (I see it with -a):

SMART Error Log Version: 1
ATA Error Count: 251 (device log contains only the most recent five errors)
...
Error 251 occurred at disk power-on lifetime: 3734 hours (155 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 e8 f6 83 f0  Error: ICRC, ABRT at LBA = 0x0083f6e8 = 8648424

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 f0 f9 f5 83 f0 00      00:04:45.606  READ DMA EXT
  25 00 f0 f9 f5 83 f0 00      00:04:44.706  READ DMA EXT
  10 00 3f 00 00 00 f0 00      00:04:44.705  RECALIBRATE [OBS-4]
  25 00 f0 f9 f5 83 f0 00      00:04:44.421  READ DMA EXT
  25 00 f0 f9 f5 83 f0 00      00:04:44.248  READ DMA EXT

I think these logs depend on the disk manufacturer.
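In case it helps with reading such an entry: the LBA on the error line can be rebuilt from the register bytes, assuming the usual 28-bit layout (low four bits of DH, then CH, CL, SN). A small sketch with an invented function name:

```shell
#!/bin/sh
# Rebuild a 28-bit LBA from the ATA registers shown in the error log.
# Arguments are hex bytes without a 0x prefix, in the order SN CL CH DH.
# Assumes the 28-bit layout; lba28_from_registers is a hypothetical helper.
lba28_from_registers() {
    sn=$1
    cl=$2
    ch=$3
    dh=$4
    echo $(( ((0x$dh & 0x0f) << 24) | (0x$ch << 16) | (0x$cl << 8) | 0x$sn ))
}

lba28_from_registers e8 f6 83 f0   # registers from the error entry above
```

With the bytes from the log this yields 8648424, the same LBA smartctl reports on the "Error:" line.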
Right. If I interpret it correctly, your sda has four sectors remapped. It
sdb!
Right, sdb, I got confused.
I'll advise the one who controls the money to order a spare one in advance, so we can replace it if necessary. It's one of the disks in a raid 1 configuration, so it shouldn't be an immediate problem if one disk fails...
In the case of a production server that you consider important enough to have a raid, it should always be important to have a spare disk at hand, errors or not ;-)

Also, you know that you can have an "active spare" inside the raid. If there is a problem, the raid will immediately activate it and switch over. The disadvantage is, obviously, that the spare is powered up, although idle. In those cases, I would have a spare outside, too - maybe I'm too paranoid ;-)
This isn't something that can be fixed on short notice ;), so I hope you will see this message!
Yep, I noticed, because you also sent a CC to me: in those cases Pine shows a yellow mark :-)
I will keep doing this, then... ;)
No problem. Just remember that some people here do not like those at all - I really don't mind, my filters work nicely ;-)

- --
Cheers,
Carlos E. R.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Made with pgp4pine 1.76

iD8DBQFGQG3+tTMYHG2NR9URAopmAJwPH+9oifhx6UZdRmWYdBcM7UA3+gCeKaYn
wHv5e9D4vePAc5Kw8eyTKPU=
=lHY7
-----END PGP SIGNATURE-----
participants (2): Carlos E. R., Wilfred van Velzen