On 1/11/2012 7:07 AM, Basil Chupin wrote:
On 11/01/12 04:17, Brian K. White wrote:
On 1/9/2012 9:20 PM, Basil Chupin wrote:
I had the need to go to the memtest86.com site to grab a copy of the test to check the condition of my RAM and came across a patch for the kernel which purports to use a feature of the kernel to map bad RAM bytes and not use them thus avoiding the need to replace RAM (perhaps unnecessarily). The patches do not go beyond kernel 2.6.x.
I was wondering if this patch is now built into the kernel, so that patching the kernel is no longer necessary (I am using kernel 3.2, for example)?
Anybody know the answer, please?
BC
There is a built-in feature that can do the same thing. Just use the memmap=size$start kernel command line option.
From /boot/grub/menu.lst on an old 10.1 box of mine:
# server is randomly crashing
# memtest 4.00 reports 0x0006703d144,0xffffffffffc
###Don't change this comment - YaST2 identifier: Original name: linux###
title SUSE Linux 10.1
    root (hd0,1)
    kernel /boot/vmlinuz root=/dev/sda2 resume=/dev/sda1 showopts memmap=1K$0x0006703d144
    initrd /boot/initrd
So, memtest reported a very small discrepancy starting at address 0x0006703d144.
So, I told the kernel via memmap=size$start to ignore 1K starting at that same address.
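As a rough sketch of that arithmetic (not something from the original setup), the option string can be built from the lowest and highest bad address memtest shows. The helper names `format_size` and `memmap_option` and the `granule` parameter are made up for illustration. The sketch assumes the kernel reads the K/M/G suffixes as powers of two (1K = 1024 bytes), which is how its size parser behaves as far as I know.

```python
def format_size(size_bytes):
    """Render a byte count with the largest K/M/G suffix that divides it evenly."""
    for suffix, unit in (("G", 1 << 30), ("M", 1 << 20), ("K", 1 << 10)):
        if size_bytes >= unit and size_bytes % unit == 0:
            return f"{size_bytes // unit}{suffix}"
    return str(size_bytes)  # fall back to a plain byte count


def memmap_option(lowest_bad, highest_bad, granule=1024):
    """Return a memmap=size$start option covering lowest_bad..highest_bad inclusive.

    The size is rounded up to a multiple of `granule` bytes (hypothetical
    default of 1024, matching the 1K used in the menu.lst example).
    """
    if highest_bad < lowest_bad:
        raise ValueError("highest_bad must be >= lowest_bad")
    span = highest_bad - lowest_bad + 1
    size = -(-span // granule) * granule  # ceiling to a multiple of granule
    return f"memmap={format_size(size)}${lowest_bad:#x}"


# Single bad address from the memtest comment above.
print(memmap_option(0x0006703d144, 0x0006703d144))
# memmap=1K$0x6703d144
```

The leading zeros in the memtest output do not change the value, so the result is the same option as in the menu.lst entry.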
You have to let memtest run through all tests for a day or two, and at full hot temperature (that is, with the case closed and the machine in whatever hole you would normally run it in, not with the case off and out on a bench), to really be sure you are catching all the marginal addresses.
Then depending on the results, maybe the errors are all close together and you can encompass them in one memmap range that isn't too wastefully large, or you may want to use a few different smaller ranges.
In the server above there were a few bad addresses, but they were all very close to each other and that tiny 1K range covers them all.
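The grouping decision described above (one range if the errors are close together, several smaller ranges otherwise) can be sketched roughly as follows. This is only an illustration: the function names, the `max_gap` threshold of 64 KiB and the sample addresses other than 0x6703d144 are assumptions, not values from any real memtest run. Sizes are emitted as plain byte counts, which the kernel should accept just like the K/M/G forms.

```python
def cluster_ranges(bad_addresses, max_gap=64 * 1024, granule=1024):
    """Group bad addresses into (start, size) ranges.

    A new range is started whenever the gap to the previous bad address
    exceeds max_gap bytes. Each size is rounded up to a multiple of granule.
    """
    groups = []  # list of [lowest, highest] pairs
    for addr in sorted(set(bad_addresses)):
        if groups and addr - groups[-1][1] <= max_gap:
            groups[-1][1] = addr          # extend the current group
        else:
            groups.append([addr, addr])   # start a new group
    ranges = []
    for low, high in groups:
        span = high - low + 1
        size = -(-span // granule) * granule  # ceiling to a multiple of granule
        ranges.append((low, size))
    return ranges


def memmap_options(bad_addresses, **kwargs):
    """Return the memmap options, space separated, for the kernel command line."""
    return " ".join(
        f"memmap={size}${start:#x}"
        for start, size in cluster_ranges(bad_addresses, **kwargs)
    )


# Hypothetical readings: three errors close together, one far away.
sample = [0x6703d144, 0x6703d148, 0x6703d2a0, 0x9a000010]
print(memmap_options(sample))
# memmap=1024$0x6703d144 memmap=1024$0x9a000010
```

If this produces a long list of ranges scattered over the whole address space, that is a sign the problem is not a few bad cells and memmap is the wrong tool.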
And that server no longer crashes. The change was made about a year ago and at that time it was crashing once or twice a week. So that really did fix it.
It's a stock opensuse 10.1 kernel (oss & updates repos).
I think it still works the same at least as of 11.4. A few months ago I had a new box with new faulty ram that I had to use until the RMA came in.
Another box would give errors but only because it was a desktop motherboard in a 1u rackmount case, which means the airflow was all wrong and the cpu got no air at all. The memory errors were pretty random, all over the map the longer the tests ran. Switching to a different case with different fan and power-supply arrangements made the memory errors go away on that one.
Many thanks, Brian, for this. I am now wading thru what memtest86+ is producing for me (I just got 2 different results from 2 runs[#] which hasn't helped in working out what the heck is going on :-( ). I just read the skimpy docs on how to use memtest to see how I can get it to display for me a range of bad memory bytes so that I can put that memmap thingie into menu.lst. BTW, how does one handle more than one range, or a range that needs to be greater than 1K (and is 1K = 1,000 or 1,024 or larger? :-) )?
[#] The first run gave me 56 errors in a range of addresses in the 4GB+ range which is impossible as I only have 3GB of RAM, while the second run gave me only 23 errors but I didn't note the address range. Also, which is a nuisance, the errors scroll off the screen.
BC
There is a summary option in one or the other or both memtest versions (there are two versions of memtest: one is memtest86, one is memtest86+). The summary option displays a single summary on the screen that never scrolls off, but you can't see the individual addresses, just the lowest and highest. That might be useless if there is one error near the bottom and one error near the top; you don't exactly want to exclude 90% of your RAM.
You handle more than one range of bad addresses by just putting more memmap options on the kernel command line:
memmap=size1$start1 memmap=size2$start2 memmap=size3$start3
with different sizes and start addresses. The start addresses can be copied right from the memtest screen as per my example: use the lowest one you see out of all, then make the size big enough to go just past the highest one you see of all. Or, instead of "of all", "of all in a particular close grouping".
As for the 1K in my case, I didn't care if it was 1000 or 1024, because I really only had a few addresses and any interpretation of "1K" would more than cover them, and any interpretation of "1K" is small enough that it's silly to even try to go smaller just to save 500 bytes of RAM out of 4 gigs.
If you have many errors that scroll off the screen, and they aren't all in the same place or just a few places but really all over your memory space, then you have too many errors to think about merely using memmap to fix it. You have a bigger problem that needs a proper fix: failing capacitors on the motherboard, too-hot CPU or RAM, etc.
--
bkw