Re: [opensuse-kernel] BadRAM patch for kernel

11 Jan 2012

      On 11/01/12 04:17, Brian K. White wrote:
...
On 1/9/2012 9:20 PM, Basil Chupin wrote:
...
I had the need to go to the memtest86.com site to grab a copy of the
test to check the condition of my RAM and came across a patch for the
kernel which purports to use a feature of the kernel to map bad RAM
bytes and not use them thus avoiding the need to replace RAM (perhaps
unnecessarily). The patches do not go beyond kernel 2.6.x.
I was wondering if this patch is now built into the kernel, so that
patching the kernel is no longer necessary (I am using kernel 3.2,
for example)?
Does anybody know the answer, please?
BC
There is a built-in feature that can do the same thing.
Just use the memmap=size$start kernel command line option.
From /boot/grub/menu.lst on an old 10.1 box of mine:
# server is randomly crashing
# memtest 4.00 reports 0x0006703d144,0xffffffffffc
###Don't change this comment - YaST2 identifier: Original name: linux###
title SUSE Linux 10.1
    root (hd0,1)
    kernel /boot/vmlinuz root=/dev/sda2 resume=/dev/sda1 showopts memmap=1K$0x0006703d144
    initrd /boot/initrd
So, memtest said that a very small discrepancy happens starting at 
address 0x0006703d144, and I told the kernel via memmap=size$start to 
ignore 1K starting at that same address.
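
To make the size$start form concrete, here is a minimal Python sketch
that works out which byte range such an option covers. It assumes the
K, M and G suffixes are the binary multipliers (2^10, 2^20, 2^30) that
the kernel parameter documentation describes; the helper names
parse_num and reserved_range are made up for illustration and are not
part of any tool.

```python
import re

# Binary multipliers, as the kernel docs describe the [KMG] suffixes.
SUFFIX = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_num(text):
    """Parse '1K', '4096' or '0x0006703d144' into a number of bytes."""
    m = re.fullmatch(r"(0[xX][0-9a-fA-F]+|[0-9]+)([KMG]?)", text)
    if m is None:
        raise ValueError(f"not a size or address: {text!r}")
    digits, suffix = m.groups()
    base = 16 if digits.lower().startswith("0x") else 10
    return int(digits, base) * SUFFIX[suffix]


def reserved_range(option):
    """Return (start, end) for 'memmap=size$start'; end is exclusive."""
    prefix = "memmap="
    if not option.startswith(prefix) or "$" not in option:
        raise ValueError(f"not a memmap=size$start option: {option!r}")
    size_text, start_text = option[len(prefix):].split("$", 1)
    start = parse_num(start_text)
    return start, start + parse_num(size_text)


start, end = reserved_range("memmap=1K$0x0006703d144")
print(f"{start:#x} .. {end:#x} ({end - start} bytes)")
```

For the option in the menu.lst above, this reports a 1024-byte range
beginning at the address memtest gave.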
You have to let memtest run through all tests for a day or two, and at 
full operating temperature (that is, with the case closed and the 
machine in whatever hole you would normally run it in, not open on a 
bench), to really be sure you are catching all marginal addresses.
Then depending on the results, maybe the errors are all close together 
and you can encompass them in one memmap range that isn't too 
wastefully large, or you may want to use a few different smaller ranges.
In the server above there were a few bad addresses, but they were all 
very close to each other and that tiny 1K range covers them all.
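
The choice between one range and several can be sketched roughly as
follows. This is only an illustration under stated assumptions:
the address list is invented, the 64K merge threshold is arbitrary,
and rounding each range out to 4K page boundaries is my own
conservative choice, not something the setup above did (it used an
unaligned 1K range). The names group_addresses and memmap_options are
hypothetical.

```python
PAGE = 4096  # assumed page size; a choice made for this sketch


def group_addresses(addresses, max_gap=64 * 1024):
    """Group failing addresses into page-aligned (start, end) ranges.

    Ranges separated by no more than max_gap bytes are merged, which
    trades a little wasted RAM for fewer memmap options.
    """
    ranges = []
    for addr in sorted(set(addresses)):
        lo = addr - addr % PAGE   # round down to the page start
        hi = lo + PAGE            # exclusive end of that page
        if ranges and lo - ranges[-1][1] <= max_gap:
            ranges[-1][1] = max(ranges[-1][1], hi)
        else:
            ranges.append([lo, hi])
    return [tuple(r) for r in ranges]


def memmap_options(ranges):
    """Format each (start, end) range as a memmap=size$start option."""
    return [f"memmap={(hi - lo) // 1024}K${lo:#x}" for lo, hi in ranges]


# Hypothetical memtest results: three nearby errors and one far away.
bad = [0x6703D144, 0x6703D148, 0x6703D3F0, 0x7A000010]
print(" ".join(memmap_options(group_addresses(bad))))
```

Each resulting option would go on the kernel line as a separate
memmap= argument. With GRUB 2, unlike the legacy menu.lst shown here,
the kernel documentation warns that the $ may need escaping.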
And that server no longer crashes. The change was made about a year 
ago and at that time it was crashing once or twice a week. So that 
really did fix it.
It's a stock openSUSE 10.1 kernel (oss & updates repos).
I think it still works the same at least as of 11.4. A few months ago 
I had a new box with new faulty RAM that I had to use until the RMA 
came in.
Another box would give errors but only because it was a desktop 
motherboard in a 1U rackmount case, which means the airflow was all 
wrong and the CPU got no air at all. The memory errors were pretty 
random, showing up all over the map the longer the tests ran. Switching 
to a different case with different fan and power-supply arrangements 
made the memory errors go away on that one.

Many thanks, Brian, for this. I am now wading through what memtest86+ 
is producing for me (I just got 2 different results from 2 runs[#], 
which hasn't helped in working out what the heck is going on :-( ). I 
just read the skimpy docs on how to use memtest to see how I can get it 
to display a range of bad memory bytes, so that I can put that memmap 
thingie into menu.lst. BTW, how does one handle more than one range, or 
a range that needs to be greater than 1K (and is 1K = 1,000 or 1,024, 
or larger? :-) ).

[#] The first run gave me 56 errors in a range of addresses in the 4GB+ 
range which is impossible as I only have 3GB of RAM, while the second 
run gave me only 23 errors but I didn't note the address range. Also, 
which is a nuisance, the errors scroll off the screen.

BC

-- 
It is easy to convince people of something, but hard to keep them convinced.
                              Niccolo Machiavelli
