JW wrote:
Hello,
This is probably the start of a long thread.... I'm quite tired so I'll only post the most important stuff now. The moral of the story is: we just purchased a (nearly) $50,000 PowerEdge 6450 from Dell. Quad Xeon CPUs, 8GB of RAM, 6x36 GB RAID5 Ultra160.
Server's Dell home page: http://www.dell.com/us/en/bsd/products/model_pedge_1_pedge_6450.htm
It came with RedHat 7.1 installed. I promptly replaced it with SuSE 8.0. Dell will not support SuSE of course so I'm on my own.
Please help - I don't want to have to put RH back on it :-/
Anyway, we took it to the co-lo where it ran without issue for a week while I tried to tune the MySQL server for such a large system. No problems at all.
The purpose of this enormous server was to run _nothing_ but MySQL (with InnoDB - mysql-Max) as a replacement for our poor PowerEdge 2450 Dual PII that was straining under the heavy load.
When we "switched" all the service from the "old" 2450 to the new server everything seemed fine... until about 3 hours later when monitors alerted us to the fact that it was locked up cold.
We had the ISP reboot it. After that it started locking up randomly... at first it was as often as every 10 minutes... then it settled down to a couple of hours between lockups.
What unnerves me is the fact that there's nothing obvious -- not obvious to me anyway -- going wrong.
I have brought the server back to my office here and I'm monitoring it, but I just don't know what to look for.
Quite unfortunately it hasn't locked up since we brought it back. That's computers for ya isn't it? But I haven't put a simulated SQL load on it yet. I'm going to in the morning - I suspect that it will bring the bugs back up.
The only clue I have is the following:
In /var/log/warn I see the following a few times per day:
May 2 14:07:48 cp001 kernel: scsi : aborting command due to timeout : pid 4100, scsi0, channel 2, id 0, lun 0 0x2a 00 04 22 f3 b9 00 00 08 00
May 2 14:07:56 cp001 kernel: scsi : aborting command due to timeout : pid 4245, scsi0, channel 2, id 0, lun 0 0x2a 00 04 22 f3 c1 00 00 08 00
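For anyone trying to read these entries: the bytes at the end of each message look like the SCSI command block that timed out. Below is a rough Python sketch that pulls the fields out of such a line and decodes the command, assuming the standard 10-byte SCSI layout (opcode, flags, 4-byte block address, group, 2-byte length, control). The regular expression is inferred from the two sample lines only, and the function name `decode_timeout` is made up for illustration.

```python
import re

# Pattern inferred from the two log lines quoted above (an assumption,
# not taken from any kernel documentation).
LINE_RE = re.compile(
    r"aborting command due to timeout : pid (?P<pid>\d+), "
    r"scsi(?P<host>\d+), channel (?P<channel>\d+), "
    r"id (?P<id>\d+), lun (?P<lun>\d+) "
    r"(?P<cdb>0x[0-9a-f]{2}(?: [0-9a-f]{2})+)"
)

# Two common 10-byte SCSI opcodes; not an exhaustive table.
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}


def decode_timeout(line):
    """Return a dict describing one timeout message, or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    cdb = bytes(int(b, 16) for b in m.group("cdb").split())
    info = {k: int(m.group(k)) for k in ("pid", "host", "channel", "id", "lun")}
    info["opcode"] = OPCODES.get(cdb[0], hex(cdb[0]))
    if len(cdb) == 10:
        # Bytes 2-5: logical block address, bytes 7-8: transfer length.
        info["lba"] = int.from_bytes(cdb[2:6], "big")
        info["blocks"] = int.from_bytes(cdb[7:9], "big")
    return info


LOG = [
    "May 2 14:07:48 cp001 kernel: scsi : aborting command due to timeout : "
    "pid 4100, scsi0, channel 2, id 0, lun 0 0x2a 00 04 22 f3 b9 00 00 08 00",
    "May 2 14:07:56 cp001 kernel: scsi : aborting command due to timeout : "
    "pid 4245, scsi0, channel 2, id 0, lun 0 0x2a 00 04 22 f3 c1 00 00 08 00",
]

for entry in LOG:
    print(decode_timeout(entry))
```

Read this way, both entries are WRITE(10) commands of 8 blocks each, and the second starts exactly 8 blocks after the first. So the timeouts are hitting ordinary sequential writes to the same device; that by itself does not say whether the controller, the cabling, or interrupts are at fault.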
Searching for that string on Google gives inconclusive results. I did see one thing about shared PCI interrupts; I'll investigate that more in the morning.
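One quick way to check the shared-interrupt angle is to look at /proc/interrupts and see whether the SCSI controller is listed on the same IRQ line as another device. A small Python sketch of that check follows. The sample text and the device names in it (`scsi_hba`, `eth0`, `usb-uhci`) are invented for illustration, and the parsing assumes the usual layout of one count column per CPU, then the controller type, then a comma-separated device list.

```python
def shared_irqs(text):
    """Map IRQ number -> device list for IRQs used by more than one device."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        label = label.strip()
        if not sep or not label.isdigit():
            continue  # skip NMI:, ERR: and similar summary rows
        tail = " ".join(rest.split()[ncpu:])  # drop the per-CPU counts
        parts = [p.strip() for p in tail.split(",")]
        first = parts[0].split()  # "<controller type> <first device>"
        if len(first) < 2:
            continue
        devices = [first[-1]] + parts[1:]
        if len(devices) > 1:
            shared[int(label)] = devices
    return shared


# Hypothetical sample, NOT taken from the server in question.
SAMPLE = """\
           CPU0       CPU1
  0:     123456          0    IO-APIC-edge  timer
 16:       4100       4245   IO-APIC-level  scsi_hba, eth0
 17:         10         12   IO-APIC-level  usb-uhci
NMI:          0          0
"""

print(shared_irqs(SAMPLE))
```

On the real box the input would be the contents of /proc/interrupts, e.g. `shared_irqs(open("/proc/interrupts").read())`. Sharing an IRQ is not necessarily a fault, but if the RAID controller turns up in the result it is worth trying a different slot.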
My boss thinks it might be a RAM error. This is very possible, as the RAM consists of sixteen 512MB sticks on a giant daughtercard. Sort of scary looking. Plenty of chances for there to be a bad byte in there. Even a bad socket.
There are 2 things in support of this. First, the crashes didn't start until we put the server under the heavy MySQL load. MySQL was configured to use lots of memory, so if one of the higher sticks of RAM is bad, MySQL may simply have climbed high enough in memory usage to reach it. And notice that before and after the load was on it (i.e. while no load is on it) the crashes are not happening at all (yet). Although, it did lock up one last time at the co-lo while it was sitting idle, waiting for us to pick it up.
The second thing that might support the bad-memory theory is that when I boot into the memtest86 program, the program locks up cold - I have to hit reset. It also looks like it's only seeing 1/4 of the memory -- though I'm not sure I'm reading it right.
In the morning I'll take a screen shot of the memtest page for accuracy, but for tonight, here's a quick list of some of the figures it shows:
L1 CACHE     32K   8914MB/s
L2 CACHE   2048K   3995MB/s
Memory     3584M    386MB/s
Cacheable  3584M
And it just locks up and sits there, won't do anything. Makes me a little suspicious.
Well... I'm going to sleep on it but any ideas would be really appreciated..... Thanks.
I have had nothing but bad experiences with these boxes. One thing to try is to make sure that the MPS setting in the BIOS is set to rev 1.1 and not 1.4.
Another thing (and I know this sounds strange): don't put any PCI cards in slots 3 and 5. Slot 1 is the only 32-bit slot. We ended up returning 2 of these boxes to Dell because of the slot 3 and 5 thing. We had used RH-7.2 and SuSE-7.3, no difference.
We got NO support from Dell other than that they did come out and replace the MB in one of the boxes to see if it would help. It didn't. Both of these machines did the same thing, and Dell didn't want anything to do with our problem because we weren't using PCI cards purchased from Dell. Like I said, they got them back and we will not purchase any more of them.
Super-Micro makes a very nice quad MB now that holds P4 Xeons at 1.6GHz and up. Those 6450s are only P3s and a max of 900MHz.
Good luck,
Mark