[opensuse] Ever see a disk-controller just die?
All,

Sunday, I experienced a server failure and I am trying to determine why. No storms, electrical interruptions, nothing... I received an automated text from the other boxes in my office that the server was down. I happened to have an open ssh session on my laptop and the connection was still up. So I typed

$ uptime
Bus error (core dumped)

WTF?

It is an older server (MSI K9N2 SLI Platinum) w/Phenom 9850. Always been rock solid. Spinning 6 drives on the native controllers: 2 primary drives, 1T Caviar Black, mdraid0, 4 partitions; 2 secondary drives, Seagate 250G, with dmraid (fake raid), 4 partitions (old install); and finally 2 1.5T Seagates that were just spare miscellaneous storage attached to eSATA 5/6.

Suspecting a drive failure, I pulled all but the primary drives. Boot hangs at the "Detecting Hard Drives" POST/BIOS point. Disconnected all drives and it boots fine ("No discs connected - insert system disk"). Suspecting one or the other of the primaries, I removed the second drive on the primary channel (stuck at "Detecting Hard Drives"), so I reversed the config and removed the primary drive and reconnected the secondary (stuck at "Detecting Hard Drives"). Huh? Removed the secondary, leaving all drives removed (boots just fine to "No discs connected - insert system disk").

So this has me scratching my head. Unless I had simultaneous failure of both drives, it appears that when anything is connected to the primary controller, boot hangs at "Detecting Hard Drives". (I have not tried discs on the secondary controller channel alone.) eSATA is not bootable. The DVD is detected just fine, and I've pulled all cards and memory and reseated them just in case there was a stray bit of resistance somewhere.

Has anyone experienced anything similar? If so, any pointers? I did have 1 SATA cable go bad a year or so ago, but given my diagnostics, I can't see 2 cables going bad at once.

Any thoughts from the brain-trust are appreciated.

-- David C. Rankin, J.D., P.E.
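When the disk behind the root filesystem drops out, anything that has to be loaded from disk (like /usr/bin/uptime) fails with "Bus error" or "Input/output error", but shell builtins and the kernel's /proc interface live in RAM. A minimal sketch of what can still be checked from a surviving ssh session, assuming a bash shell and that the relevant binaries happen to already be cached:

# /proc is in RAM, and 'read'/'echo' are bash builtins - no disk access needed
read up idle < /proc/uptime && echo "up for ${up}s"

# the kernel ring buffer usually shows the controller or link dropping out;
# dmesg/cat may still run if they are already in the page cache
dmesg | tail -n 30
cat /proc/mdstat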
On August 9, 2015 9:27:29 PM PDT, "David C. Rankin" wrote:
Any thoughts from the brain-trust are appreciated.
If it is an on board controller, as opposed to a board in a slot, look for bad caps. Bulging capacitors. They could be anywhere on the mobo but I'd start looking close to the chipset of the controller.
On 08/10/2015 12:13 AM, John Andersen wrote:
If it is an on board controller, as opposed to a board in a slot, look for bad caps. Bulging capacitors. They could be anywhere on the mobo but I'd start looking close to the chipset of the controller.
Thanks John,

Yes, that was my first thought. I've had caps go bad before on Gigabyte boards, and I've found a shop (badcaps.net) that will fix them. Oh, brother, this is one of those "damn! still such a good box!" situations. 8G of PC85 RAM, Phenom 9850, but hell, it's all AM2+ (yesteryear in computer terms). It's hard to justify the trouble of buying a replacement motherboard. Looks like if I can't get this thing going, it's just another pile of pretty shelf art (and it is pretty, with all the finned copper cooling pipes, etc...).

I'll look over the caps again with a real close eye. If there are any that seem puffy at all, then I'll have to weigh the option of sending it in to badcaps.net and hope for the best.

Funny, I've been through 2 (now maybe 3) servers in the past 7 years, but the original one I built for my office in late 2000 - early 2001 (with that feisty AMD Tbird processor) is still running 24-7, happily serving and receiving faxes via hylafax/avantfax with openSuSE 11.0... (they don't make them like that anymore)

-- David C. Rankin, J.D., P.E.
John Andersen composed on 2015-08-09 22:13 (UTC-0700):
"David C. Rankin" wrote:
Any thoughts from the brain-trust are appreciated..
If it is an on board controller, as opposed to a board in a slot, look for bad caps. Bulging capacitors. They could be anywhere on the mobo but I'd start looking close to the chipset of the controller.
Don't just look at the motherboard. Check the PS caps too. OST caps often fail without any telltale signs. What year was it made?

-- "The wise are known for their understanding, and pleasant words are persuasive." Proverbs 16:21 (New Living Translation) Team OS/2 ** Reg. Linux User #211409 ** a11y rocks! Felix Miata *** http://fm.no-ip.com/
On 08/10/2015 12:27 AM, David C. Rankin wrote:
So this has me scratching my head. Unless I had simultaneous failure of both drives, it appears when anything is connected to the primary controller, boot hangs at "Detecting Hard Drives". (I have not tried discs alone on the secondary controller channel alone) ESATA is not bootable.
Boot from USB?
The DVD is detected just fine and I've pulled all cards and memory and reseated just in case there was a stray bit of resistance somewhere.
Boot from (Live) DVD?
Has anyone experienced anything similar? If so, any pointers? I did have 1 SATA cable go bad a year or so ago, but given my diagnostics, I can't see 2 cables going bad at once.
You need to establish basic motherboard integrity and CPU integrity. There are many self-tests, memory tests and so on you can use. My Dell has short-form and long-form test BIOS settings. Check your BIOS for options. It may give you details.

You say "native controller", implying that it is integrated into the motherboard. Perhaps the issue isn't that the disk controller has died, but that the motherboard has aged out. Is it from the Years of The Bad Capacitors?

These fully integrated motherboards represent a particular approach to mass production economics. Sometimes it's a bit of a lowest-common-denominator set of decisions. Often you can disable the on-board and make use of a plug-in board. But crank the numbers.

On the one hand ... I once had a mobo with a pretty useless SiS video. I got a decent ATI Radeon from a friend for $20. That was worth it for a home "hobby" system.

On the other hand ... A production server goes down and it's costing $mucho$ per hour, so replacing the mobo, heck, replacing the whole 1U, is the cheapest approach. It's all fibre-channelled RAID data anyway, so it's a Line Replaceable Unit. The replacements are all preconfigured. In fact it's not worth even diagnosing what was wrong with the pulled unit. It's not that the cost of the day or so of work by a $30/hr tech isn't worth it, but you do have to consider (a) the bureaucratic/managerial overhead makes that closer to $250/hr and (b) can't he be doing something productive instead of 'firefighting'?

I learnt this LRU technique (discard & replace rather than stop & repair) in the military. You can, I'm sure, find videos on Youtube showing fast turn-around service of aircraft on carriers using this technique. When we were doing it we didn't even test the LRUs, we just replaced them anyway. Believe me, that was the best economics. A certified box was better than one that had been stressed on a previous mission and left in place.

The sub-text here is "Disaster Planning". Having those LRU 1U boxes preconfigured. One site I served at approached this in what I thought was an idiotic manner. They had a spare 'Net line, spare router, spare switch sitting next to the 'live' one, ready to cut over. 100% redundancy, inactive. The thing is, though, no-one had done any analysis as to load, bandwidth; what would it be like to run both? If one failed the other was there without any switchover time. You could take one down for maintenance. You could argue it both ways, but no-one ever had; no-one had ever looked at the economics.

So you want to find out what went wrong, David. But what is your time worth? What is the down-time worth? What is the cost of a new motherboard? What is the cost of just having a spare driver board, video board to hand?

I'm semi-retired and my role isn't (most of the time) on the bleeding edge. But my often-referred-to 'Closet of Anxieties' very often serves as a "Keep Things Working" resource. A number of people are running Linux on machines that could only run XP, but since they only need email/browser and are happy with Firefox/Thunderbird they never see the Linux layer. Perhaps one day they will get a W/10 machine and complain bitterly about the menus and all the things Microsoft changes between releases, while their old (Linux) system stayed consistent. The point is they can keep working. They thank me for this. Eventually some have asked about Linux. "Oh, it looks pretty much like the Word I'm used to, not the thing Microsoft are doing now." That, too, is "economics".
The "sameness" means they don't have to go off on re-training courses. My thoughts, David, run to economics. The cost of DR planning, the cost of preparedness vs other costs. I'm sure someone is going to say that "economics" and "DR planning" is OT on a technical forum like this. But if it comes down to "how can I get my system running", you do need to consider if its easier/cheaper/faster to just replace the mobo. -- A: Yes. > Q: Are you sure? >> A: Because it reverses the logical flow of conversation. >>> Q: Why is top posting frowned upon? -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
On 08/10/2015 09:02 AM, Anton Aylward wrote:
My thoughts, David, run to economics. The cost of DR planning, the cost of preparedness vs other costs.
I'm sure someone is going to say that "economics" and "DR planning" is OT on a technical forum like this. But if it comes down to "how can I get my system running", you do need to consider if its easier/cheaper/faster to just replace the mobo.
Anton,

I also worked in a "jerk and jam" facility, the so-called first-line maintenance. We got the systems up and running and then sent the boards/assemblies/whatever to a repair facility. Each group then maximized their effort for the corp. At a vendor-run equipment maintenance school there were a couple of guys from the FAA who had to know the trouble resolution down to the component level, which let me know that the fed gov't wasn't running very efficiently, but hey, big surprise, eh? What really bothered us was the "bad off the shelf" new parts that had their own set of failure indications, which could, and often did, run us around for hours trying to resolve the problems.

David, I second the motion to swap and toss. "You might be an engineer if you've ever tried to fix a $7.00 radio". Or a toaster, or ...

My son-in-law (bless his heart) once bought a $29 printer and was griping when he found that replacement ink was $40. I said, "Throw away the printer and buy another," but he said it was still good, to which I replied: how could it be, it doesn't print any more. He never could get it and bought the ink. Doofus.

Fred
On 08/10/2015 10:30 AM, Stevens wrote:
Anton, I also worked in a "jerk and jam" facility, the so-called first line maintenance. We got the systems up and running and then sent the boards,assemblies/whatever to a repair facility. Each group then maximized their effort for the corp. At a vendor-run equip maint school there were a couple of guys from the FAA who had to know the trouble resolution down to component level, which let me know that the fed gov't wasn't running very efficient but hey, big surprise, eh? What really bothered us was the "bad off the shelf" new parts that had their own set of failure indications which could, and often did, run us around for hours trying to resolve the problems.
I like that term!

Believe me, an aircraft carrier can carry a LOT of spares! And they'd better be 'burnt in' before they are put on the shelf.

MIL-SPEC is a strange world in many ways. Not only can much of the electronics survive in conditions where humans cannot, but the attitudes towards development & documentation are very different. It isn't so much that the equipment has to be maintained by people with an IQ of 30, so much as it has to be maintained by untrained people in extremes of weather and temperature who are under fire, wearing thick gloves and body armour, at night, possibly in a swamp or in a desert sandstorm, with no tools other than a (large) knife and the butt of their machine pistol. The documentation has to reflect this. OBTW, the documentation was left behind because of weight considerations, in favour of extra ammunition.

I was told that on the 'carriers, during the quiet time, the "jerked" modules get tested. Nobody would waste time repairing them, but maybe they weren't shot up so much and can be re-used. But it is a low-priority issue. When a module has a bullet through it or similar, there is very little doubt about why it failed.

Once we got a module back that baffled us ... until we put it in a low-pressure chamber and saw a couple of the caps ... expand and push away from the circuit board. And I mean low pressure, not mile-high Denver pressure. That and pulling a high-G turn ... well, not the sort of thing you find in a server room. Leastways not often.

And one day I'll get around to unblocking that printer ....
On 08/10/2015 09:02 AM, Anton Aylward wrote:
You need to establish basic motherboard integrity, Cpu integrity. There are many self-test, memory test and so on you can use. My dell has short-form and long form test BIOS settings. Check you BIOS for options. It may give you details.
You guys are great! Always good to smile at all the imaginable/unimaginable situations...

CPU integrity seems fine. With drives pulled, it POSTs fine and gets to the point of "Insert Operating System Disk and press Any Key". Pop the install CD in, hit the "Any Key" and bam, I'm running Linux just fine. This really seems limited to disk detection/disk controller - for whatever reason.

This motherboard is also equipped with 16 status LEDs; 4-11 (in 4 groups of 2) give the POST status and then go all green when booting the OS. With no drives attached, they flow through the light sequences, ending all green. Popping in the CD and booting works like a champ. When the POST fails, they are stuck green, green, red, red, which according to the manual indicates "Initializing Hard Drive Controller" (makes sense).

I have inserted spare drives into the empty bays to confirm that the PSU is providing power (it is). The PSU is a fairly reliable HEC Zephyr 750, so it looks to be powering the drives fine. However, the only thing common to all drives, aside from the controller, is power. If it is a power issue, then it has to be dirty power or partial power that allows the disk to spin but may be preventing it from properly powering all its circuitry to respond to the drive initialization. Pretty bizarre.

The board looks intact. I've been over all the caps and there are none puffy/bulging/etc. that I can see (I've seen plenty of bad caps before). The fact that the board runs completely through the POST and operates fine when booted from the CD really adds to the puzzle. Partial PSU failure in one of the power pins to the board? Partial PSU failure in the power leg that is powering the SATA connections?

Like I said in my original post, nothing eventful happened; it was just like somebody turned a switch to the disk controller off (or at least that is how it appears), causing the bus error/core dump. It's happily running now, booted from the CD. I think I'll give everything a sniff test. There is no pronounced "I'm burned alive" smell, but it does smell like a box that has been operating 24-7 for 7 years or so.

If you have any other thoughts, let me know.

-- David C. Rankin, J.D., P.E.
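Since the box boots and runs fine from the CD, a few checks from that live system can show whether the kernel still sees the on-board controller and its ports at all. A rough sketch only, assuming the live image carries pciutils and smartmontools (device names are examples, not the actual layout of this box):

# is the on-board SATA controller still visible on the PCI bus?
lspci | grep -i -e sata -e ide

# detection and link-level errors from the SATA driver during boot
dmesg | grep -i -e ahci -e sata_nv -e 'ata[0-9]'

# if any drive shows up at all, ask it for its own health report
smartctl -a /dev/sda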
Just a word: you said at one point that eSATA can't boot. This is not true, at least not everywhere; I very often use an external eSATA dock to test new drives. Just in case, maybe you could also add a SATA card and connect the disks to it?

jdd
On 08/10/2015 11:44 AM, jdd wrote:
Just in case, maybe you could also add a SATA card and connect the disks to it?
If I can't find anything else bad, then I'll buy a SATA card and disable the onboard controllers. Even if I end up buying a new board, processor and RAM, I'll probably still try a new controller. It would be a complete waste to dump this board and processor over a bad disk controller. The build speed on this box was phenomenal. Older P4 3.33GHz boxes would take 7-8 hours for a full KDE3 build; this box with the Phenom 9650 Black Edition would do it in less than 3. If a $25 disk controller will fix it, then that's money well spent. Thanks.

Any favorite controller brands? I'll need one with primary and secondary channels plus 2 eSATA. (that seems pretty standard these days)

-- David C. Rankin, J.D., P.E.
On 2015-08-10 18:34, David C. Rankin wrote:
When the POST fails, they are stuck green, green, red, red, which according to the manual indicates "Initializing Hard Drive Controller" (makes sense). I have inserted spare drives into the empty bays to confirm that the PSU is providing power (it is). The PSU is a fairly reliable HEC Zephyr 750, so it looks to be powering the drives fine. However, the only thing common to all drives, aside from the controller, is power. If it is a power issue, then it has to be a dirty power or partial power that allows the disk to spin, but may be preventing it from properly powering all its circuitry to respond to the drive initialization.
I think you should try with another PSU. -- Cheers / Saludos, Carlos E. R. (from 13.1 x86_64 "Bottle" at Telcontar)
David C. Rankin composed on 2015-08-10 11:34 (UTC-0500):
The PSU is a fairly reliable HEC Zephyr 750, so it looks to be powering the drives fine. However, the only thing common to all drives, aside from the controller, is power. If it is a power issue, then it has to be a dirty power or partial power that allows the disk to spin, but may be preventing it from properly powering all its circuitry to respond to the drive initialization.
According to http://www.badcaps.net/forum/printthread.php?t=6390 that HEC is somewhat unusual, with 4 12V rails, besides using Teapo caps. I don't understand how multiple rails differ from all 12V power on one rail, but it wouldn't surprise me if, in a more complicated than usual design, some sort of fallible synchronization mechanism is built in, or if only one rail is supplying the power reaching the controller, so I'd definitely not rule out PS trouble if it's 7 years old.

Newer power supplies seem to have mostly switched back to using only a single 12V rail, or at most two. No idea whether this would be about cost rather than functional efficacy or reliability.

-- Felix Miata
I see no reason to worry about a 7 year old power supply. As long as it was not affected by the bad caps era, these things can last forever, with maybe a fan replacement.

Swapping drives seemed to suggest it wasn't a drive problem, as I read the thread.
On 2015-08-10 21:35, John Andersen wrote:
I see no reason to worry about a 7 year old power supply. As long it was not affected by the bad caps era, these things can last forever, with maybe a fan replacement.
Plastics and several insulators degrade. -- Cheers / Saludos, Carlos E. R. (from 13.1 x86_64 "Bottle" at Telcontar)
On August 10, 2015 2:41:43 PM PDT, "Carlos E. R." wrote:
On 2015-08-10 21:35, John Andersen wrote:
I see no reason to worry about a 7 year old power supply. As long it was not affected by the bad caps era, these things can last forever, with maybe a fan replacement.
Plastics and several insulators degrade.
But the disk swaps David did.....???
On 08/10/2015 04:59 PM, John Andersen wrote:
But the disk swaps David did.....???
That's where I'm still up in the air. Testing the primary onboard controller:

2 disks connected - no POST
disk 1 only - no POST
disk 2 only - no POST

Both disks spin up and are running.

No disks - POSTs fine, boots from CD and runs fine.

So to me, it looks like the only culprits common to all disks are either:

a) power to one of the 24 pins in the ATX connector
b) disk controller (or a cap on the board related to it)
c) power to the disks themselves.

Since they all spin up when plugged into the SATA power connector and sound normal, (c) looks like a long shot. (a) or (b) look the most probable.

I think I have a new PS in the spare parts bin. Failing that, it looks like a new system is in order, along with the stand-alone SATA controller to fix the old one as time permits.

-- David C. Rankin, J.D., P.E.
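For (a) and (c), one rough way to sanity-check the rails without a multimeter is to read the board's hardware monitor from the CD-booted system. A sketch only, assuming lm_sensors is present and the board's monitoring chip is supported (labels and accuracy vary a lot by board):

# voltage rails as the Super I/O monitoring chip sees them; watch +5V and +12V
sensors

# or read the raw millivolt values straight from sysfs
grep . /sys/class/hwmon/hwmon*/in*_input 2>/dev/null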
On 08/10/2015 05:23 PM, David C. Rankin wrote:
On 08/10/2015 04:59 PM, John Andersen wrote:
But the disk swaps David did.....???
That's where I'm still up in the air. Testing the primary onboard controller.
2 disks connected - no POST disk 1 only - no POST disk 2 only - no POST
both disks spin up and are running.
no disks - POSTS fine, boots from CD and runs fine.
So to me, it looks like the only culprits common to all disks is either:
a) power to one of the 24-pins in the ATX connector b) disk controller (or cap on board related to it) c) power to the disks themselves.
Since the all spin up when plugged into the SATA power connector and sound normal, (c) looks like a long shot. (a) or (b) look the most probable.
I think I have a new PS in the spare parts bin. Failing that, it looks like a new system is in order along with the stand-alone SATA controller to fix the old one as time permits.
For the curious, I've copied the page from the manual concerning the LED POST sequence. With any disk attached, the sequence stops at box 13 (Initializing Hard Drive Controller). With no disks connected, it flows through the boot sequence without a hitch, ending all green.

http://nirvana.3111skyline.com/dl/img/ss/opensuse/msi_k9n2_SLI_LED_seq.jpg

Will report back with PS swap results. Found a new-in-box BFG 550W PS. (What a good company - lifetime warranties, no questions asked. It's the honest companies that go bust...)

-- David C. Rankin, J.D., P.E.
On 2015-08-11 01:06, David C. Rankin wrote:
For the curious, I've copied the page from the manual concerning the LED POST sequence. With any disk attached, the sequence stops at box 13 (Initializing Hard Drive Controller). With no disks connected, it flows though the boot sequence without a hitch ending all green.
Well, with no disks, it can not really initialize the hard disk controller. It can not do all tests.

-- Cheers / Saludos, Carlos E. R. (from 13.1 x86_64 "Bottle" (Minas Tirith))
On 08/10/2015 07:43 PM, Carlos E. R. wrote:
On 2015-08-11 01:06, David C. Rankin wrote:
For the curious, I've copied the page from the manual concerning the LED POST sequence. With any disk attached, the sequence stops at box 13 (Initializing Hard Drive Controller). With no disks connected, it flows though the boot sequence without a hitch ending all green. Well, with no disks, it can not really initialize the hard disk controller. It can not do all tests.
And the winner is...... bad disk controller or bad capacitor related to it.

Replaced the power supply with a new 550W supply. Reconnected everything, fired it up and ... exact same behavior. With drives attached to the controller, POST hangs at "Detecting Hard Drives" (although I did get it to POST with 1 drive connected, it never saw the drive, and booted straight to the CD).

So motherboard, processor, and RAM shopping. First round with UEFI (damn, not looking forward to that). ASUS still makes a board with 1394; other than that, options are Gigabyte or MSI. Any other favorites?

Thanks for all the suggestions. (sure was hoping it was the PSU :)

-- David C. Rankin, J.D., P.E.
David C. Rankin composed on 2015-08-10 20:49 (UTC-0500):
And the winner is......
bad disk controller or bad capacitor related to it.
Replaced the power-supply with a new 550W supply. Reconnected everything, fired it up and ... exact same behavior. With drives attached to the controller, POST hangs at "Detecting Hard Drives" (although I did get it to post with 1 drive connected, it never saw the drive, and booted straight to the CD)
So motherboard, processor, and RAM shopping. First round with EUFI (damn, not looking forward to that). ASUS still makes a board with 1394, other than that, options are Gigabyte, or MSI. Any other favorites?
Thanks for all the suggestions. (sure was hoping it was the PSU :)
Maybe a tiny likelihood that all SATA cables (how many?) failed together, but trying others ought to be easier than mobo shopping. You never know if you don't try. ;-)

One other thing before shopping: look at the bottom of the mobo. Could be a tiny vermin nest, or a long-lost unnoticed screw created a short.

Not knowing the actual age of this mobo, and guessing 8+, I'm voting RoHS as the, or an, underlying cause if the mobo has no OST caps.

-- Felix Miata
On 08/10/2015 09:40 PM, Felix Miata wrote:
Maybe tiny likelihood that all SATA cables (how many?) failed together, but trying other ought to be easier than mobo shopping. You never know if you don't try. ;-)
One other thing before shopping: look at the bottom of the mobo. Could be a tiny vermin nest or a long lost unnoticed screw created a short.
Not knowing actual age of this mobo, and guessing 8+, I'm voting RoHS the or an underlying cause if the mobo has no OST caps.
All great words of wisdom.

As far as cables, there were 6 SATA drives spinning in the box (and had been for years). Six cables going toes up? I'll go with the small possibility that both from the primary controller went bad, but if swapping in the other 4 doesn't change anything, then I will be more than satisfied it isn't a cable.

If I really think about what the probable trigger was, we are undergoing August in Texas, and we have had our hottest days the past 3 days. Sunday was 105 w/heat indexes close to 115. During the weekend the A/C is raised to 85 degrees, and the server closet with its vent probably gets to 90. That little bit of added heat load/stress probably caused whatever weak component died -- to die.

I suspect a capacitor, but the particular capacitors on this board have the fully-formed aluminum sheaths over them with no seam at all on the top. They have 1 indentation ring about 1/4 up from the bottom and are capped on the bottom. They all look perfectly in shape, although 1 between the south bridge and the top PCIe x16 slot is leaning slightly toward the CPU socket (it could have been soldered like that originally). I'll have to pull the board from the case to get access to the back side.

This is a big Antec case. One of the best I ever bought. 6 full bays up top, with 2 being a 2-hard-drive quick-connect chassis with its own 120mm fan. The case then has another 4-drive chassis at the bottom front with its own 120mm fan. It then has 2 more 120mm fans in the back and top (all 3 speed-adjustable and thermostatically controlled). Hell, this thing has its own 12V light on a flexible boom stored under the top to aid in installs/repairs.

I'll pull the board and survey the back side. The case was completely together at the time, so it is hard to see what could possibly have come loose. I'm meticulous with plastic wire-ties to secure/route any excess lengths of wires to prevent movement and promote better air-flow in the case. Gremlins?

So James was wrong this time, no "Warp Core Breach", but it looks like the problem occurred with the Dilithium Crystal at the point of Warp Core injection -- alignment was good, so that points to the injection controller or matter/anti-matter containment field. We'll know more once I get the crystal, Oh - I mean board, pulled :p

-- David C. Rankin, J.D., P.E.
David C. Rankin composed on 2015-08-11 01:23 (UTC-0500):
Felix Miata wrote: ...
You never know if you don't try. ;-) ... As far as cables, there were 6 sata drives spinning in the box (and had been for years). Six cables going toes up? I'll go with the small possibility that both from the primary controller went bad, but if swapping in the other 4 doesn't change anything, then I will be more than satisfied it isn't a cable.
The fact that two different drives were tried individually, with probably two different cables, probably confirmed that it isn't a cable problem. Still, when drive number was dropped to one at a time, were all other cables detached from mobo port connectors? Maybe with failing old cables, a single partial short or leak could be visible to other ports?
If I really think about what the probable trigger was, we are undergoing August in Texas, and we have had our hottest days they past 3 days. Sunday was 105 w/heat indexes close to 115. During the weekend the A/C raises to 85 degrees, and the server closet with its vent probably gets to 90. That little bit of added heat load/stress probably caused whatever weak component died -- to die.
Had it been shut down more than a few minutes while the heat was elevated, with no fans running, allowing ordinarily cooler components to see extra heat soak or humidity?

Slim possibility of all 6, of course. Nevertheless, never say never, and never say always. If all 6 are from an identical batch, could elevated heat have tipped them all over an edge together? Maybe not as close to zero probability as one would expect. (I'm still thinking RoHS, aka lead-free, solder joints, but there are parts on a mobo that are known to occasionally fail besides solder and electrolytics. Horse probably dead too.)

-- Felix Miata
On 08/11/2015 01:52 AM, Felix Miata wrote:
Had it been shutdown more than a few minutes while the heat was elevated, with no fans running, allowing ordinarily cooler components to see extra heat soak or humidity? Slim possibility of all 6, of course. Nevertheless, never say never, and never say always. If all 6 are from an identical batch, could elevated heat have tipped them all over an edge together? Maybe not as close to zero probability as one would expect. (I'm still thinking RoHS aka lead-free solder joints, but there are parts on a mobo that are known to occasionally fail besides solder and electrolytics. Horse probably dead too.)
No, this box runs 24-7/365 (only UPS exhaustion prompts shutdown (or kernel update) -- neither happened Sunday).

I don't know what the actual failure was -- either solder or bad cap, but I'm going to give the guys at badcaps.net a crack at it. I have it up and running and networked presently, and it has been running over 24 hours w/o issue, so the failure seems limited to the controller. I'll report back on the issue after I get the board back. It will either work, or I've tossed $70 into finding out it won't :)

See my next post: i5-4590 Haswell or FX-8350 Black Edition?

-- David C. Rankin, J.D., P.E.
On 08/11/2015 02:23 AM, David C. Rankin wrote:
If I really think about what the probable trigger was, we are undergoing August in Texas, and we have had our hottest days they past 3 days. Sunday was 105 w/heat indexes close to 115. During the weekend the A/C raises to 85 degrees, and the server closet with its vent probably gets to 90. That little bit of added heat load/stress probably caused whatever weak component died -- to die.
What? No thermal shut-down?

I had an otherwise nice Compaq laptop that had insufficient fan/cooling. On my lap while sitting on the patio one summer I felt it overheating ... on my lap. Indoors, on an angle with an auxiliary cooling pad it managed OK, but in warm weather without the cooling pad it managed to fry a couple of battery packs. I had set the thermal shut-down to 90 degrees. Perhaps it should have been less, but then, perhaps, I'd never get any work done. Even so, after one shut-down the screen had a !FAIL!, so something got stressed beyond tolerance.

Obviously not MIL-SPEC rated components.
On 08/11/2015 07:47 AM, Anton Aylward wrote:
What? No thermal shut-down?
I had an otherwise nice Compaq laptop that had insufficient fan/cooling. On my lap while sitting on the patio one summer if felt it overheating ... on my lap. Indoors, on a angle with an auxiliary cooling pad it managed OK, but in warm weather without the cooling pad it managed to fry a couple of battery packs. I had set the thermal shut-down to 90 degrees. Perhaps it should have been less, but then, perhaps, I'd never get any work done. Even so, after one shut-down the screen had a !FAIL!, so something got stressed beyond tollerance.
Obviously not MIL-SPEC rated components.
Grinning... I did have thermal shutdown enabled, but that was just to protect the processor. I don't recall if there was an ambient shutdown sensor on the board itself. I was a little bit torqued because I usually leave the door cracked a bit when it is really hot, but one of the office tenants closed it after I left Friday.

Been in the same building since 2003, so I suspect that it was a combination of poor solder/bad caps/bad controller and time. Of course..., the Abit KT7 w/Athlon Tbird sitting next to it just keeps humming along happily as it has since, what, 2001?

-- David C. Rankin, J.D., P.E.
David C. Rankin composed on 2015-08-11 15:13 (UTC-0500):
Been in the same building since 2003, so I suspect that it was a combination of poor solder/bad caps/bad controller and time.
'03 was the transition from pre-RoHS into RoHS, which of course was compounded by the caps plague, but since yours seems to have been built entirely with polys, all that's left with substantial likelihood is solder joints. 12 years non-stop is very respectable from out of that period.

-- Felix Miata
On 11/08/2015 01:06, David C. Rankin wrote:
(Initializing Hard Drive Controller). With no disks connected, it flows though the boot sequence without a hitch ending all green.
Don't you have some other disk to test with - new (or different) hardware? I mean a disk never seen by this system?

jdd
John Andersen composed on 2015-08-10 14:59 (UTC-0700):
Carlos E. R. wrote:
On 2015-08-10 21:35, John Andersen wrote:
I see no reason to worry about a 7 year old power supply. As long it
was not affected by the bad caps era, these things can last forever, with maybe a fan replacement.
Plastics and several insulators degrade.
But the disk swaps David did.....???
The way I remember it, drive motors run on 12V, controllers on 5V. If all he did was plug them in and confirm they spun up, I have my doubts he proved anything by inserting "spare drives into the empty bays". PS failure tops the list of PC failure causes, yet nothing I've seen him report yet serves well to prove PS is not his problem. If he hasn't another PS to try, he should at least try booting from some other HD connected to the mobo controller.

Are any of his SATA cables red? Old red cables apparently have a higher incidence of failure than others[1], and regardless of color, they are a known failure vector. Newer ones usually have snaps to help ensure they stay well connected.

David C. Rankin composed on 2015-08-10 17:23 (UTC-0500):
Testing the primary onboard controller.
2 disks connected - no POST disk 1 only - no POST disk 2 only - no POST
both disks spin up and are running.
no disks - POSTS fine, boots from CD and runs fine.
So to me, it looks like the only culprits common to all disks is either:
a) power to one of the 24-pins in the ATX connector b) disk controller (or cap on board related to it) c) power to the disks themselves.
Since the all spin up when plugged into the SATA power connector and sound normal, (c) looks like a long shot. (a) or (b) look the most probable.
I think I have a new PS in the spare parts bin. Failing that, it looks like a new system is in order along with the stand-alone SATA controller to fix the old one as time permits.
Did I miss it, or have you only ever used the same original SATA cables? If so, there should be a d), although I still suspect power trouble, as more than one SATA problem manifesting at the same time seems too big a stretch.

[1] https://lists.ubuntu.com/archives/kubuntu-users/2011-January/053282.html

-- Felix Miata
On 2015-08-10 23:59, John Andersen wrote:
On August 10, 2015 2:41:43 PM PDT, "Carlos E. R." wrote:
On 2015-08-10 21:35, John Andersen wrote:
I see no reason to worry about a 7 year old power supply. As long it was not affected by the bad caps era, these things can last forever, with maybe a fan replacement.
Plastics and several insulators degrade.
But the disk swaps David did.....???
Well, I don't mean in this particular case, the cause is yet unknown. I mean in general, in electronics. I have some valve radios still working (made around 1950, perhaps earlier), or rather, they worked last time I tried, ten years ago. I reviewed them when I was a student, replacing several of the capacitors and some browned resistors. Apparently electronics last for ever, but it is not so. Metal lasts, if it doesn't rust, but plastic insulators crack. Paper also degrades (there were paper capacitors, and paper layers in transformers). And paper is one of the most durable materials. It can be eaten, too! :-)

-- Cheers / Saludos, Carlos E. R. (from 13.1 x86_64 "Bottle" (Minas Tirith))
On 08/10/2015 08:51 PM, Carlos E. R. wrote:
I reviewed them when I was a student, replacing several of the capacitors and some browned resistors. Apparently electronics last for ever, but it is not so.
One problem vacuum tube equipment had was heat, which caused a lot of components to fail prematurely. We had a tube TV when I was a kid and the top of it was *HOT*! Properly rated devices don't normally fail on their own. It generally takes something else, such as a power surge, to degrade them.
On 2015-08-11 03:08, James Knott wrote:
On 08/10/2015 08:51 PM, Carlos E. R. wrote:
I reviewed them when I was a student, replacing several of the capacitors and some browned resistors. Apparently electronics last for ever, but it is not so.
One problem vacuum tube equipment had was heat, which caused a lot of components to fail prematurely.
Yes. For that reason they were typically built in a box with a metal grid, horizontal, some centimetres from the bottom. The valves were above, and the passive electronics and wiring below, where it was cooler. And the components were built for the heat: those things could easily survive half a century.

Do you know that valve electronics work better than solid state under radiation? Like in space. Or in a nuclear plant.
We had a tube TV when I was a kid and the top of it was *HOT*! Properly rated devices don't normally fail on their own. It generally takes something else, such as a power surge to degrade them.
Oh, they do now. Things are not built to last. Planned obsolescence.

-- Cheers / Saludos, Carlos E. R. (from 13.1 x86_64 "Bottle" (Minas Tirith))
On 08/10/2015 09:08 PM, James Knott wrote:
On 08/10/2015 08:51 PM, Carlos E. R. wrote:
I reviewed them when I was a student, replacing several of the capacitors and some browned resistors. Apparently electronics last for ever, but it is not so.
One problem vacuum tube equipment had was heat, which caused a lot of components to fail prematurely. We had a tube TV when I was a kid and the top of it was *HOT*! Properly rated devices don't normally fail on their own. It generally takes something else, such as a power surge to degrade them.
Well, there is a question of what is "properly rated?" Mil-spec usually requires a large derating (to 50% or less) for resistors and capacitors, such that a bypass cap on a 5VDC line would have to be rated at a minimum of 10V. A resistor dissipating ½ Watt would need to be rated at 1W. It is unlikely that much commercial gear, particularly of Asian origin, holds to this specification.

In addition, it is widely known by this time that a lot of Asian electrolytic caps fail; whether this is due to lack of derating, heat, aging, or something else, it is a logical place to look first when failure occurs.

Finally, _everything_ has a "mean time before failure" -- MTBF -- no matter what derating was imposed on the design. That is to say, it is _expected_ that at some point the device _will_ fail.

--doug, retired electronics engineer
On 08/10/2015 05:41 PM, Carlos E. R. wrote:
Plastics and several insulators degrade.
The long-term !FAIL! was solder joints. That was one reason for a) component density, ultimately to 'chips', and b) flow-soldered PCBs, to get better 1) solder mix and 2) consistency. "Yes it used to be but we changed all that." Once that was addressed, what was next on the list?

A couple of years ago I came across a speculative paper from the ... was it the Bell System Journal or some IBM paper ... anyway ... It talked of future computers being 'hot fuzzy golf-balls': golf-ball sized 'cos of the level of integration, fuzzy 'cos of the wires to do the IO, hot 'cos of the power density. Nobody, it seems, thought that even though UNIX came out of Bell, it would be powering the "not a hot fuzzy golf-ball" that cell phones are. Which are mostly plastic.

Many of our older beliefs about material engineering get overthrown. I've had mobos fail, but never from blown capacitors. I've had PSUs fail, but it turned out to be fuses in an inaccessible place. I've had memory fail, but !not! through handling without a static guard.

Sometimes I wish David Cheriton's V system had been developed so we could have RAICs - Redundant Arrays of CPUs - with the OS really distributed across the network, rather than the stamp-and-repeat method we have now. We don't tolerate that kind of thing in programming any more; why should we tolerate it in hardware?

Ah, dream on!
On 08/10/2015 06:27 AM, David C. Rankin wrote:
$ uptime Bus error (core dumped)
This is a software error: https://en.wikipedia.org/wiki/Bus_error Your "uptime" program is broken. Check the core-file with gdb as shown in the wikipedia article.
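For reference, the usual way to inspect such a core file, assuming core dumps are enabled (ulimit -c unlimited) and the core actually landed somewhere writable - on a machine whose disk controller has just vanished, it often cannot:

# open the core alongside the binary that produced it
gdb /usr/bin/uptime core

# inside gdb: where did it fault?
(gdb) bt
(gdb) info registers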
On 08/11/2015 04:01 AM, Florian Gleixner wrote:
On 08/10/2015 06:27 AM, David C. Rankin wrote:
$ uptime Bus error (core dumped)
This is a software error:
https://en.wikipedia.org/wiki/Bus_error
Your "uptime" program is broken. Check the core-file with gdb as shown in the wikipedia article.
Thanks Florian,

Yes... and No... The Bus Error (core dumped) is a software error, granted. However, it was the result of the disk controller failure, which, I'm fairly certain, resulted from the processor attempting to retrieve info that had paged to swap. When the OS core dumped, various parts of the OS were left running (such as the login session providing the prompt). When I went to my open ssh terminal to the server and typed uptime, it attempted to load /usr/bin/uptime, hit the core dump condition, and responded with the software error.

No question at all by now -- it was a hardware failure that was the root cause. (failure analysis conducted :)

-- David C. Rankin, J.D., P.E.
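One hedged way to back that failure analysis up from the logs, once the old root filesystem can be mounted read-only somewhere (paths below are examples; a classic /var/log/messages is assumed, and a journal only if one was configured as persistent):

# classic syslog from the dead install, mounted at /mnt
grep -iE 'ata[0-9].+(error|failed)|i/o error' /mnt/var/log/messages | tail -n 40

# or, if a persistent systemd journal exists on the old install
journalctl -D /mnt/var/log/journal -k | grep -iE 'ata|i/o error' | tail -n 40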
participants (10):
- Anton Aylward
- Carlos E. R.
- David C. Rankin
- doug
- Felix Miata
- Florian Gleixner
- James Knott
- jdd
- John Andersen
- Stevens