can MTRR mapping cause system lockups?
Hi,
I have a dual Opteron file server with 4GB RAM, and just "upgraded" to SLES9 x86-64 (was previously running SUSE 9.3 x86-64 without problem). The system keeps crashing, and I am starting to suspect mtrr.
Some other posts on the subject indicate that mtrr is used to remap the memory that is hidden by the PCI bus to a region above the existing physical memory, so it can be accessible.
The system on boot always reports something like this:
mtrr: 0xfd000000,0x800000 overlaps existing 0xfd000000,0x400000
mtrr: 0xfd000000,0x800000 overlaps existing 0xfd000000,0x400000
And after running a while it suddenly locks up, requiring a cold start.
Something else that I thought funny is that there are 2 PCI-X cards in it that work fine, but attempting to put in another (an Intel dual GigE card) prevents the system from booting or even passing POST (even with the other cards removed). The same card works in another dual Opteron machine (different motherboard) running SLES9 without problem.
So, if erroneous memory mapping by mtrr is causing the lockups, how can I remap or disable memory mapping?
Thanks for any clues
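For readers trying to interpret the boot message quoted above: each pair of numbers appears to be a base address and a size in hex, so the message seems to say that an 8 MB range was requested at an address where a 4 MB range was already registered. The Python sketch below only does that arithmetic; the function names `parse_overlap` and `describe` are made up for illustration and are not part of any kernel tool.

```python
import re

# Pattern for the message quoted in the post:
#   mtrr: BASE,SIZE overlaps existing BASE,SIZE
# Assumption: both pairs are (base address, size in bytes) in hex.
MSG = re.compile(
    r"mtrr: (0x[0-9a-fA-F]+),(0x[0-9a-fA-F]+) "
    r"overlaps existing (0x[0-9a-fA-F]+),(0x[0-9a-fA-F]+)"
)

def parse_overlap(line):
    """Return ((new_base, new_size), (old_base, old_size)) or None."""
    m = MSG.search(line)
    if m is None:
        return None
    new_base, new_size, old_base, old_size = (int(g, 16) for g in m.groups())
    return (new_base, new_size), (old_base, old_size)

def describe(line):
    """Render both ranges as start-end addresses with their size in MiB."""
    parsed = parse_overlap(line)
    if parsed is None:
        return None
    out = []
    for label, (base, size) in zip(("requested", "existing"), parsed):
        end = base + size - 1  # last byte covered by the range
        out.append(f"{label}: 0x{base:08x}-0x{end:08x} ({size // (1024 * 1024)} MiB)")
    return "\n".join(out)

if __name__ == "__main__":
    print(describe("mtrr: 0xfd000000,0x800000 overlaps existing 0xfd000000,0x400000"))
```

Both ranges start at 0xfd000000, which is why the second request collides with the first. This says nothing by itself about whether the message is related to the lockups.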
I'm investigating random lockups as well, on a SuperMicro motherboard using SuSE 10.0. I get exactly the same message, but I'm not convinced the mtrr module is the culprit. I was about to try booting with mtrr disabled when I decided to go home and try it Monday. There's a boot command argument to disable mtrr. Also, I just read that the xorg.conf file has an option ("Option mtrr = off") under the "Section Device" pertinent to the graphics driver. I think the problem is graphics related. Sometimes the system runs for days before locking up, sometimes for a few hours. The screen saver has been disabled too.
I am also experiencing lockups on a Supermicro system with an nForce4-based Supermicro H8DCE motherboard with dual Opteron 246's (stepping CG), 4G RAM, Areca PCIe RAID card (currently with no drives attached), nVidia 7800GT PCIe (and previously an nVidia 6200 PCIe) running SuSE 10.0 Eval x86_64. The system also includes a Sound Blaster Audigy 2 and a Hauppauge 550 dual tuner card.
I am using the most recent H8DCE BIOS (9/14/05). The memory seems to be good (diagnostics ran all night and found no errors). The problems have persisted (although varied somewhat) with 3 different nVidia video cards. The problems have also persisted with different SATA drives, and with the drives attached to the Areca controller and to the motherboard nForce SATA.
Before describing the SuSE lockup problems let me say that, because of the lockups, I also installed Ubuntu Breezy 5.10 x86_64 (2.6.12+). With Breezy I experienced the lockups with an nVidia 6200 PCIe card installed using the nvidia driver, but experienced no lockups with the nv driver, and experienced no lockups with the nVidia 7800GT PCIe card installed using either the nv or nvidia driver, BUT the performance is very poor compared to SuSE 10.0.
As to SuSE 10.0... the system will go anywhere from a few seconds after login to an hour before it locks up tight. It seems to lock up even if there is no console X activity but there is ssh login activity. There is generally nothing in the logs to indicate what happened, although I twice saw some kind of kernel panic CPU "oops" message.
I did discover in my boot.msg log messages like:
"Bios doesnt leave a aperture memory hole; Please enable IOMMU option in BIOS..."
and further down several messages like:
"mtrr: type mismatch for b0000000, 10000000," ...
I checked the BIOS settings and changed
MTRR mapping from default "continuous" to "discrete"
IOMMU from default "best fit" to "absolute"
IOMMU aperture from default "32M" to "64M"
Upon rebooting there were no longer any "mtrr: type mismatch ..." messages, but I still get an IOMMU error, although a slightly different one about "cpu0 only allocating 32M please turn on IOMMU in BIOS..."
In the BIOS settings, under Boot settings, there is an option "OS installation" with options [Other | Linux 2.6.9]. The description seems to indicate this is a workaround for ONLY 2.6.9, but does not really say what it works around; does anyone know if this option is actually for "Linux 2.6.9 and higher"?
Despite trying various different BIOS settings I still get frequent hard lockups when running SuSE.
(detailed boot.msg, etc. available if/as needed).
I would greatly appreciate any help, guidance, and/or commiseration. This is getting very, very frustrating and I'm not quite sure how to determine just what the problem is.
R.Parr, RHCE, Temporal Arts
This is interesting information. You've obviously dug a lot deeper than me. I'm evaluating this particular motherboard for possible inclusion in our equipment and I have the ear of an applications engineer at SuperMicro who's trying to get our account. I'll pass this along next week. Our motherboard (H8SSL-i) is a single-CPU type though.
On Thursday 24 November 2005 00:44, John Craig wrote:
Hi,
I have a dual Opteron File server with 4GB ram, and just "upgraded" to SLES9 x86-64 (was previously running SUSE 9.3 x86-64 without problem). The system keeps crashing, and I am starting to suspect mtrr.
Some other posts on the subject indicate that mtrr is used to remap the memory that is hidden by the PCI bus to a region above the existing physical memory, so it can be accessible.
No, MTRRs (Memory Type Range Registers) cannot remap memory. They only define whether a range of memory (or non-memory address space) can be cached or not. There are CPU-specific mechanisms for remapping memory above the PCI hole, though. One problem is that MTRRs are quite limited, so memory remapping sometimes results in a configuration that cannot be correctly covered with MTRRs; and sometimes it could be, but the BIOS just gets it wrong.
Something else that I thought funny is that there are 2 PCI-X cards in it that work fine, but attempting to put in another (an Intel dual GigE card) prevents the system from booting or even passing POST (even with the other cards removed). The same card works in another dual Opteron machine (different motherboard) running SLES9 without problem.
It's probably a BIOS problem in that it sets up the MTRRs incorrectly in some cases, which can lead to such lockups. The kernel cannot fix it up for the BIOS unfortunately.
So, if erroneous memory mapping by mtrr is causing the lockups, how can I remap or disable memory mapping?
There are usually options in the BIOS setup for "memory remapping", with one option to turn it off.
-Andi
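To see what the BIOS actually programmed, the MTRR table can be read from /proc/mtrr on Linux. The sketch below is only an illustration of Andi's point that each MTRR assigns a cache type to an address range: it parses entries and lists which ranges overlap. The regular expression covers the two /proc/mtrr line layouts I am aware of, which is an assumption about your kernel; the sample text is made up for the example and does not come from any poster's machine; `parse_mtrr` and `overlapping` are hypothetical helper names.

```python
import re
from itertools import combinations

# Assumed /proc/mtrr layouts (older and newer kernels respectively):
#   reg00: base=0x00000000 (   0MB), size= 128MB: write-back, count=1
#   reg00: base=0x000000000 (    0MB), size= 2048MB, count=1: write-back
ENTRY = re.compile(
    r"reg(\d+): base=(0x[0-9a-fA-F]+) \(\s*\d+MB\), size=\s*(\d+)([KM])B"
    r"(?:: ([\w-]+), count=\d+|, count=\d+: ([\w-]+))"
)

def parse_mtrr(text):
    """Parse /proc/mtrr-style text into dicts with reg, base, size (bytes), type."""
    entries = []
    for m in ENTRY.finditer(text):
        reg, base, size, unit, type_old, type_new = m.groups()
        scale = 1024 if unit == "K" else 1024 * 1024
        entries.append({
            "reg": int(reg),
            "base": int(base, 16),
            "size": int(size) * scale,
            "type": type_old or type_new,
        })
    return entries

def overlapping(entries):
    """Return (reg_a, reg_b) pairs whose address ranges intersect."""
    pairs = []
    for a, b in combinations(entries, 2):
        if a["base"] < b["base"] + b["size"] and b["base"] < a["base"] + a["size"]:
            pairs.append((a["reg"], b["reg"]))
    return pairs

# Made-up sample: write-back over the low 4 GB, uncachable for a PCI hole.
SAMPLE = """\
reg00: base=0x00000000 (   0MB), size=4096MB: write-back, count=1
reg01: base=0xc0000000 (3072MB), size=1024MB: uncachable, count=1
"""

if __name__ == "__main__":
    entries = parse_mtrr(SAMPLE)  # or: parse_mtrr(open("/proc/mtrr").read())
    for e in entries:
        print(f"reg{e['reg']:02d} 0x{e['base']:09x} +{e['size'] >> 20} MiB {e['type']}")
    print("overlapping registers:", overlapping(entries))
```

An overlap is not automatically a fault: as I understand it, an uncachable range laid over a write-back range is a common BIOS layout. The output is meant as a starting point for comparing a working and a failing configuration, not as a diagnosis.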
participants (4)
- Andi Kleen
- John Craig
- Pierre Patino
- Randall J. Parr