[opensuse-kernel] BadRAM patch for kernel

Basil Chupin

10 Jan 2012

02:20

I had the need to go to the memtest86.com site to grab a copy of the test to check the condition of my RAM, and came across a patch for the kernel which purports to use a feature of the kernel to map bad RAM bytes and not use them, thus avoiding the need to replace RAM (perhaps unnecessarily). The patches do not go beyond kernel 2.6.x.

I was wondering if this patch is now built into the kernel, so that patching the kernel is no longer necessary (I am using kernel 3.2, for example)?

Anybody know the answer, please?

BC

--
What religion were Adam and Eve?
--
To unsubscribe, e-mail: opensuse-kernel+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse-kernel+owner@opensuse.org


Jean Delvare

10 Jan

08:56

Hi Basil, On Tuesday 10 January 2012 03:20:43 am Basil Chupin wrote:

...

I had the need to go to the memtest86.com site to grab a copy of the test to check the condition of my RAM

Note that memtest86+ (from memtest.org) is packaged for openSUSE. Just install it and it will show in your boot menu.

...

and came across a patch for the kernel which purports to use a feature of the kernel to map bad RAM bytes and not use them thus avoiding the need to replace RAM (perhaps unnecessarily). The patches do not go beyond kernel 2.6.x.

I was wondering if this patch is now automatically built-in into the kernel and patching the kernel is therefore no longer necessary (I am using kernel 3.2 for example)?

Anybody know the answer, please?

It would be easier to answer the question if you would point us to the patch in question.

My wild guess is that the patch still applies to current kernels. There was no architectural change between 2.6.39 and 3.0; it's really only a marketing (or convenience) move. As people did not expect it to happen, there are still many references to "kernel 2.6.x" left, which really mean "kernel 2.6 or later".

--
Jean Delvare
Suse L3

Basil Chupin

10:35

On 10/01/12 19:56, Jean Delvare wrote:

...

Hi Basil,

On Tuesday 10 January 2012 03:20:43 am Basil Chupin wrote:

...
I had the need to go to the memtest86.com site to grab a copy of the test to check the condition of my RAM

Note that memtest86+ (from memtest.org) is packaged for openSUSE. Just install it and it will show in your boot menu.

You mention memtest.org as the source of the tester in openSUSE which may explain why the copy I downloaded of memtest86 is version 4.0a which is claimed to be the latest version (released 20 August 2011), and it came from memtest86.com and not memtest.org. All very confusing to say the least :-) . My confusion continues below......

...

and came across a patch for the kernel which purports to use a feature of the kernel to map bad RAM bytes and not use them thus avoiding the need to replace RAM (perhaps unnecessarily). The patches do not go beyond kernel 2.6.x.

I was wondering if this patch is now automatically built-in into the kernel and patching the kernel is therefore no longer necessary (I am using kernel 3.2 for example)?

Anybody know the answer, please?

It would be easier to answer the question if you would point us to the patch in question.

I was going to supply a URL, but the memtest86.com site doesn't have a different URL for its different pages, and the one which would be relevant only takes you back to the memtest86.com site... It's easier if you look for yourself :-) . The URL is: http://home.zonnet.nl/vanrein/badram

...

My wild guess is that the patch still applies to current kernels. There was no architectural change between 2.6.39 and 3.0, it's really only a marketing (or convenience) move. As people did not expect it to happen, there are still many references to "kernel 2.6.x" left, which really mean "kernel 2.6 or later".

I am no expert so I shall leave it up to you to work this one out :-) .

(My confusion was further compounded when I looked at memtest.org, which appears to have evolved from memtest86.com - but then, memtest86 v4.0a was only released last August...)

BC

--
What religion were Adam and Eve?

Michal Kubeček

12:44

On Tuesday 10 of January 2012 21:35EN, Basil Chupin wrote:

...

You mention memtest.org as the source of the tester in openSUSE which may explain why the copy I downloaded of memtest86 is version 4.0a which is claimed to be the latest version (released 20 August 2011), and it came from memtest86.com and not memtest.org. All very confusing to say the least :-) .

memtest86 and memtest86+ are two different projects. openSUSE contains memtest86+ (from memtest.org) while you have memtest86 (from memtest86.com).

Michal Kubeček

Basil Chupin

11 Jan

11:56

On 10/01/12 23:44, Michal Kubeček wrote:

...

On Tuesday 10 of January 2012 21:35EN, Basil Chupin wrote:

...
You mention memtest.org as the source of the tester in openSUSE which may explain why the copy I downloaded of memtest86 is version 4.0a which is claimed to be the latest version (released 20 August 2011), and it came from memtest86.com and not memtest.org. All very confusing to say the least :-) .

memtest86 and memtest86+ are two different projects. OpenSuSE contains memtest86+ (memtest.org) while you have memtest (from memtest86.com).

Michal Kubeček

Thanks for your response, but I am not sure that you are correct in your claim that the software is from 2 different projects, because if one clicks on the links given on memtest.org one is taken to memtest86.com - the original source of the software.

All very confusing to say the least... :-( .

BC

--
It is easy to convince people of something, but hard to keep them convinced.
Niccolo Machiavelli

Michal Hocko

10 Jan

13:04

Hi,

I wasn't involved in any BadRAM discussions, so I can only make these statements from reading mailing list discussions[1].

On Tue 10-01-12 13:20:43, Basil Chupin wrote:
[...]

...

and came across a patch for the kernel which purports to use a feature of the kernel to map bad RAM bytes and not use them thus avoiding the need to replace RAM (perhaps unnecessarily). The patches do not go beyond kernel 2.6.x.

The patch hasn't been accepted yet (whether it gets accepted at all is questionable). There are some concerns about the interface for specifying bad addresses.

...

I was wondering if this patch is now automatically built-in into the kernel and patching the kernel is therefore no longer necessary (I am using kernel 3.2 for example)?

No, it is not part of the upstream (vanilla) kernel nor of our openSUSE kernels. The patch would need to be accepted into upstream (or at least there would need to be a consensus that it _will_ be, in a compatible form) before we take it.

[1] https://lkml.org/lkml/2011/6/22/140

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

Basil Chupin

11 Jan

11:58

On 11/01/12 00:04, Michal Hocko wrote:

...

Hi, I wasn't involved in any BadRAM discussions so I can only do these statements from reading mailing list discussions[1].

On Tue 10-01-12 13:20:43, Basil Chupin wrote: [...]

...
and came across a patch for the kernel which purports to use a feature of the kernel to map bad RAM bytes and not use them thus avoiding the need to replace RAM (perhaps unnecessarily). The patches do not go beyond kernel 2.6.x.

The patch hasn't been accepted yet (if it gets accepted at all is questionable). There are some concerns about the interface for bad addresses specification.

...
I was wondering if this patch is now automatically built-in into the kernel and patching the kernel is therefore no longer necessary (I am using kernel 3.2 for example)?

No, it is not a part of the upstream (vanilla) nor our opensuse kernels. The patch would need to be accepted (or at least there was a consensus that it _will_ be in a compatible form) into upstream before we will take it.

[1] https://lkml.org/lkml/2011/6/22/140

Thanks for your response. The patch has been around for some time, I believe, but then things move slowly when one is having fun :-) .

BC

--
It is easy to convince people of something, but hard to keep them convinced.
Niccolo Machiavelli

Brian K. White

10 Jan

17:17

On 1/9/2012 9:20 PM, Basil Chupin wrote:

...

I had the need to go to the memtest86.com site to grab a copy of the test to check the condition of my RAM and came across a patch for the kernel which purports to use a feature of the kernel to map bad RAM bytes and not use them thus avoiding the need to replace RAM (perhaps unnecessarily). The patches do not go beyond kernel 2.6.x.

I was wondering if this patch is now automatically built-in into the kernel and patching the kernel is therefore no longer necessary (I am using kernel 3.2 for example)?

Anybody know the answer, please?

BC

There is a built-in feature that can do the same thing. Just use the memmap=size$start kernel command line option.

From /boot/grub/menu.lst on an old 10.1 box of mine:

# server is randomly crashing
# memtest 4.00 reports 0x0006703d144,0xffffffffffc
###Don't change this comment - YaST2 identifier: Original name: linux###
title SUSE Linux 10.1
    root (hd0,1)
    kernel /boot/vmlinuz root=/dev/sda2 resume=/dev/sda1 showopts memmap=1K$0x0006703d144
    initrd /boot/initrd

So, memtest said that a very small discrepancy happens starting at address 0x0006703d144.

So, I told the kernel via memmap=size$start to ignore 1K starting at that same address.

You have to let memtest run through all tests for a day or two, and at full hot temperature (that is, not with the case off and out on a bench, but in whatever hole you would normally run it), to really be sure you are getting all marginal addresses.

Then, depending on the results, maybe the errors are all close together and you can encompass them in one memmap range that isn't too wastefully large, or you may want to use a few different smaller ranges.

In the server above there were a few bad addresses, but they were all very close to each other and that tiny 1K range covers them all.

And that server no longer crashes. The change was made about a year ago, and at that time it was crashing once or twice a week. So that really did fix it.

It's a stock openSUSE 10.1 kernel (oss & updates repos).

I think it still works the same at least as of 11.4. A few months ago I had a new box with new faulty RAM that I had to use until the RMA came in.

Another box would give errors, but only because it was a desktop motherboard in a 1U rackmount case, which means the airflow was all wrong and the CPU got no air at all. The memory errors were pretty random, all over the map the longer the tests ran. Switching to a different case with different fan and power-supply arrangements made the memory errors go away on that one.

--
bkw
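The size/start arithmetic in the message above can be sketched in a few lines of Python. This is only an illustration and not part of the original post: the function name memmap_option is hypothetical, the rounding up to whole kilobytes is a choice made for this sketch, and it assumes the kernel reads the K suffix as 1024 bytes (which, as far as I know, is how it parses size suffixes on the command line).

```python
def memmap_option(bad_addresses):
    """Build one memmap=size$start option covering all given bad addresses.

    Hypothetical helper for illustration: start is the lowest failing
    address, and the size is rounded up to a whole number of KiB so that
    the range reaches just past the highest failing address.
    """
    start = min(bad_addresses)
    span = max(bad_addresses) - start + 1   # bytes needed, inclusive
    size_kib = -(-span // 1024)             # ceiling division to whole KiB
    return f"memmap={size_kib}K${start:#x}"


# The single address from the menu.lst example above.
print(memmap_option([0x0006703d144]))
```

With the one address from the example this yields a 1K range starting at that address, matching the option used in menu.lst (apart from the leading zeros in the address).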

Basil Chupin

11 Jan

12:07

On 11/01/12 04:17, Brian K. White wrote:

...

Many thanks, Brian, for this.

I am now wading thru what memtest86+ is producing for me (I just got 2 different results from 2 runs[#], which hasn't helped in working out what the heck is going on :-( ).

I just read the skimpy docs on how to use memtest to see how I can get it to display for me a range of bad memory bytes so that I can put that memmap thingie into menu.lst.

BTW, how does one handle it if there is more than one range, or if the range should be greater than 1K (and is 1K = 1,000 or 1,024 or larger? :-) )?

[#] The first run gave me 56 errors in a range of addresses in the 4GB+ range, which is impossible as I only have 3GB of RAM, while the second run gave me only 23 errors but I didn't note the address range. Also, which is a nuisance, the errors scroll off the screen.

BC

--
It is easy to convince people of something, but hard to keep them convinced.
Niccolo Machiavelli

Brian K. White

17:22

On 1/11/2012 7:07 AM, Basil Chupin wrote:

...

There is a summary option in one or the other or both memtest versions (there are two versions of memtest: one is memtest86, one is memtest86+).

The summary option will display a single summary on the screen that never scrolls off, but you can't see the individual addresses, just the lowest and highest. That might be useless if there is one error near the bottom and one error near the top; you don't exactly want to exclude 90% of your RAM.

You handle more than one range of bad addresses by just putting more memmap options on the kernel command line...

memmap=size1$start1 memmap=size2$start2 memmap=size3$start3

...with different sizes and start addresses. The start addresses can be copied right from the memtest screen as per my example: use the lowest one you see out of all, then make the size big enough to go just past the highest one you see of all. Or instead of "of all", "of all in a particular close grouping".

As for the 1K in my case, I didn't care if it was 1000 or 1024, because I really only had a few addresses and any interpretation of "1K" would more than cover them, and any interpretation of "1K" is more than small enough that it's silly to even try to get smaller just to save 500 bytes of RAM out of 4 gigs.

If you have many errors that scroll off the screen, and they aren't all in the same place or just a few places but really all over your memory space, then you have too many errors to think about merely using memmap to fix it. You have a bigger problem that needs a proper fix: failing capacitors on the motherboard, too-hot CPU or RAM, etc.

--
bkw
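The "one memmap option per close grouping" idea above can be sketched in Python as follows. This is an illustration under stated assumptions rather than anything from the original message: the names cluster_addresses and memmap_options and the max_gap threshold are hypothetical, the addresses would have to be copied by hand from the memtest screen, and the K suffix is assumed to mean 1024 bytes.

```python
def cluster_addresses(addresses, max_gap=4096):
    """Group sorted addresses into clusters.

    Two neighbouring addresses land in the same cluster when they are at
    most max_gap bytes apart (max_gap is an arbitrary illustrative value).
    """
    clusters = []
    for addr in sorted(set(addresses)):
        if clusters and addr - clusters[-1][-1] <= max_gap:
            clusters[-1].append(addr)
        else:
            clusters.append([addr])
    return clusters


def memmap_options(addresses, max_gap=4096):
    """Return one memmap=size$start option per cluster, space separated."""
    options = []
    for cluster in cluster_addresses(addresses, max_gap):
        start = cluster[0]
        span = cluster[-1] - start + 1      # bytes needed, inclusive
        size_kib = -(-span // 1024)         # round up to whole KiB
        options.append(f"memmap={size_kib}K${start:#x}")
    return " ".join(options)


# Two nearby failures plus one far away give two separate ranges.
print(memmap_options([0x6703d144, 0x6703d148, 0x90000000]))
```

The resulting string can be appended to the kernel line in menu.lst. If the clusters end up covering a large share of memory, that is the "bigger problem" case described above, and no amount of memmap options is a sensible fix.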

Basil Chupin

12 Jan

04:49

On 12/01/12 04:22, Brian K. White wrote:

...

Again, thanks, Brian, for your response.

But this subject is now becoming off topic for this list, methinks, so I will only respond briefly here (but if you also read the offtopic list, I did raise this issue of memtest there and some discussion is going on there about it).

I ran memtest86+ again a short while ago and it showed a contiguous area of 84 errors, but it did not show what the range of addresses was (I do wish it could output its results to a file so that one could examine the results closely[#]). I then ran the test again with a different output report setting to see if I could pick up the address range and... I got NO errors!

At this point I am giving up (at least temporarily :-) ) on this memtest thingie while I am still not a blubbering and sobbing mess :-) .

What I would like to know is: if there are errors, which of the 3 RAM sticks are they on, so that I can have that stick replaced.

[#] I recall using some memory checker way back in the early 90s or late 80s which actually graphically displayed the RAM chips and whether it/they was/were OK or failed the test.

BC

--
It is easy to convince people of something, but hard to keep them convinced.
Niccolo Machiavelli
