Mailinglist Archive: opensuse-kernel (85 mails)

Re: [opensuse-kernel] BadRAM patch for kernel
On 11/01/12 04:17, Brian K. White wrote:
On 1/9/2012 9:20 PM, Basil Chupin wrote:
I had the need to go to the memtest86.com site to grab a copy of the
test to check the condition of my RAM and came across a patch for the
kernel which purports to use a feature of the kernel to map bad RAM
bytes and not use them thus avoiding the need to replace RAM (perhaps
unnecessarily). The patches do not go beyond kernel 2.6.x.

I was wondering if this patch is now built into the kernel
automatically, so that patching the kernel is no longer necessary (I am
using kernel 3.2, for example)?

Anybody know the answer, please?

BC


There is a built-in feature that can do the same thing.
Just use the memmap=size$start kernel command line option.

From /boot/grub/menu.lst on an old 10.1 box of mine:

# server is randomly crashing
# memtest 4.00 reports 0x0006703d144,0xffffffffffc
###Don't change this comment - YaST2 identifier: Original name: linux###
title SUSE Linux 10.1
root (hd0,1)
kernel /boot/vmlinuz root=/dev/sda2 resume=/dev/sda1 showopts memmap=1K$0x0006703d144
initrd /boot/initrd


So, memtest said that a very small discrepancy happens starting at address 0x0006703d144.

So I told the kernel via memmap=size$start to ignore 1K starting at that same address.
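To make the size$start form concrete, here is a minimal sketch of how such an option could be put together from a start address and a size. The helper name memmap_arg is made up for illustration. It assumes the kernel's usual suffix parsing, where K means 1024 bytes (not 1000), and it falls back to a plain byte count when the size is not a whole number of K.

```python
def memmap_arg(start, size_bytes):
    """Build a memmap=size$start option (hypothetical helper).

    start      -- first physical address to reserve
    size_bytes -- number of bytes to reserve from that address
    """
    if start < 0 or size_bytes <= 0:
        raise ValueError("start must be >= 0 and size_bytes must be > 0")
    # Assumption: the kernel treats the K suffix as 1024 bytes.
    if size_bytes % 1024 == 0:
        size = f"{size_bytes // 1024}K"
    else:
        size = str(size_bytes)  # plain byte count, no suffix
    return f"memmap={size}${start:#x}"


# The address memtest reported on the 10.1 box above, with a 1K range.
print(memmap_arg(0x0006703d144, 1024))  # memmap=1K$0x6703d144
```

The leading zeros in the address are dropped in the output, which does not change the value. Depending on the boot loader, the $ may need escaping in the config file; that is not handled here.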

You have to let memtest run through all tests for a day or two, at full operating temperature (that is, with the machine closed up in whatever hole it normally runs in, not with the case off and out on a bench), to really be sure you are catching all the marginal addresses.

Then depending on the results, maybe the errors are all close together and you can encompass them in one memmap range that isn't too wastefully large, or you may want to use a few different smaller ranges.

In the server above there were a few bad addresses, but they were all very close to each other and that tiny 1K range covers them all.

And that server no longer crashes. The change was made about a year ago and at that time it was crashing once or twice a week. So that really did fix it.
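The grouping step described above (one range if the errors are close together, a few smaller ranges otherwise) could be sketched roughly like this. It is only an illustration: the function name, the max_gap and align defaults, and all sample addresses except the first are made up. It also assumes that several memmap= options can be given on one kernel command line, which is how I understand it to work.

```python
def group_into_ranges(addresses, max_gap=4096, align=1024):
    """Cluster bad addresses into (start, size) ranges (hypothetical helper).

    Addresses no more than max_gap bytes apart end up in the same range.
    Each range is widened outward to multiples of align bytes.
    Keep max_gap >= align so the widened ranges cannot overlap.
    """
    clusters = []
    for addr in sorted(set(addresses)):
        if clusters and addr - clusters[-1][1] <= max_gap:
            clusters[-1][1] = addr          # extend the current cluster
        else:
            clusters.append([addr, addr])   # start a new cluster
    ranges = []
    for first, last in clusters:
        start = first - first % align       # round down
        end = (last // align + 1) * align   # round up past the last bad byte
        ranges.append((start, end - start))
    return ranges


# First address is from the menu.lst above; the others are invented.
bad = [0x6703d144, 0x6703d148, 0x6703d200, 0x12345678]
ranges = group_into_ranges(bad)
print(" ".join(f"memmap={size // 1024}K${start:#x}" for start, size in ranges))
```

With these sample addresses the three neighbouring errors collapse into one 1K range and the distant one gets its own, so the result is two memmap options rather than one wastefully large range.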

It's a stock openSUSE 10.1 kernel (oss & updates repos).

I think it still works the same at least as of 11.4. A few months ago I had a new box with new, faulty RAM that I had to use until the RMA came in.

Another box would give errors, but only because it was a desktop motherboard in a 1U rackmount case, which means the airflow was all wrong and the CPU got no air at all. The memory errors were pretty random, and spread all over the map the longer the tests ran. Switching to a different case with different fan and power-supply arrangements made the memory errors go away on that one.

Many thanks, Brian, for this. I am now wading through what memtest86+ is producing for me (I just got 2 different results from 2 runs[#], which hasn't helped in working out what the heck is going on :-( ). I just read the skimpy docs on how to use memtest to see how I can get it to display a range of bad memory bytes so that I can put that memmap thingie into menu.lst. BTW, how does one handle more than one range, or a range that needs to be greater than 1K (and is 1K = 1,000, or 1,024, or larger? :-) )?

[#] The first run gave me 56 errors in a range of addresses in the 4GB+ range which is impossible as I only have 3GB of RAM, while the second run gave me only 23 errors but I didn't note the address range. Also, which is a nuisance, the errors scroll off the screen.

BC

--
It is easy to convince people of something, but hard to keep them convinced.
Niccolo Machiavelli
