Re: [SLE] Hard disk causing crashes in SuSE 9.1?
On 2004-08-15 at 15:23 -0700, Ben Sheron wrote:
You forgot to email to the list.
How do I enable the kernel log? Can I do this in Yast?
No. It is controlled by the syslog daemon, configured via "/etc/syslog.conf". The line for full kernel log would be:
kern.* -/var/log/kernel
Further info: "man syslog.conf"
--
Regards,
Carlos Robinson
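A minimal sketch of how one might check for that rule before adding it, assuming a classic syslogd with the file layout named above. The function name `has_kern_rule` is hypothetical, not part of any SuSE tool.

```shell
# Sketch: report whether a syslog.conf-style file already contains a
# "kern.*" rule that writes to a file. has_kern_rule is a hypothetical name.
has_kern_rule() {
    # $1: path to the configuration file (normally /etc/syslog.conf)
    grep -Eq '^[[:space:]]*kern\.\*[[:space:]]+-?/' "$1"
}

# Usage (as root): append the rule only if it is missing. syslogd must then
# reread its configuration, for example by restarting it.
# has_kern_rule /etc/syslog.conf || echo 'kern.* -/var/log/kernel' >> /etc/syslog.conf
```

The leading "-" in front of the path asks syslogd not to sync the file after every message, as described in "man syslog.conf".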
OK, sorry about not sending that to the list... Anyway, I enabled logging, and checked /var/log/kernel right after rebooting after a crash. Here's the part of the log from the beginning to when the kernel starts to print boot messages:
Aug 16 10:19:20 leartes kernel: klogd 1.4.1, log source = /proc/kmsg started.
Aug 16 10:19:20 leartes kernel: Inspecting /boot/System.map-2.6.4-52-smp
Aug 16 10:19:20 leartes kernel: Loaded 23948 symbols from /boot/System.map-2.6.4-52-smp.
Aug 16 10:19:20 leartes kernel: Symbols match kernel version 2.6.4.
Aug 16 10:19:20 leartes kernel: No module symbols loaded - kernel modules not enabled.
Aug 16 10:49:36 leartes kernel: kjournald starting. Commit interval 5 seconds
Aug 16 10:49:36 leartes kernel: EXT3 FS on hda1, internal journal
Aug 16 10:49:36 leartes kernel: EXT3-fs: mounted filesystem with ordered data mode.
Aug 16 10:56:42 leartes kernel: mtrr: 0xf0000000,0x4000000 overlaps existing 0xf0000000,0x1000000
Aug 16 11:06:46 leartes kernel: klogd 1.4.1, log source = /proc/kmsg started.
Aug 16 11:06:46 leartes kernel: Inspecting /boot/System.map-2.6.4-52-smp
Aug 16 11:06:46 leartes kernel: Loaded 23948 symbols from /boot/System.map-2.6.4-52-smp.
Aug 16 11:06:46 leartes kernel: Symbols match kernel version 2.6.4.
It seems to be a similar error. I did some Googling and found some information suggesting that this could be a video card problem, but I'm not sure...
Ben
On Mon, 2004-08-16 at 01:58, Carlos E. R. wrote:
On Monday 2004-08-16 at 11:13 -0700, Ben Sheron wrote:
OK, sorry about not sending that to the list... Anyway, I enabled logging, and checked /var/log/kernel right after rebooting after a crash. Here's the part of the log from the beginning to when the kernel starts to print boot messages:
Aug 16 10:19:20 leartes kernel: klogd 1.4.1, log source = /proc/kmsg started.
Aug 16 10:19:20 leartes kernel: Inspecting /boot/System.map-2.6.4-52-smp
Aug 16 10:19:20 leartes kernel: Loaded 23948 symbols from /boot/System.map-2.6.4-52-smp.
Aug 16 10:19:20 leartes kernel: Symbols match kernel version 2.6.4.
Aug 16 10:19:20 leartes kernel: No module symbols loaded - kernel modules not enabled.
Aug 16 10:49:36 leartes kernel: kjournald starting. Commit interval 5 seconds
Aug 16 10:49:36 leartes kernel: EXT3 FS on hda1, internal journal
Aug 16 10:49:36 leartes kernel: EXT3-fs: mounted filesystem with ordered data mode.
Aug 16 10:56:42 leartes kernel: mtrr: 0xf0000000,0x4000000 overlaps existing 0xf0000000,0x1000000
Aug 16 11:06:46 leartes kernel: klogd 1.4.1, log source = /proc/kmsg started.
Aug 16 11:06:46 leartes kernel: Inspecting /boot/System.map-2.6.4-52-smp
Aug 16 11:06:46 leartes kernel: Loaded 23948 symbols from /boot/System.map-2.6.4-52-smp.
Aug 16 11:06:46 leartes kernel: Symbols match kernel version 2.6.4.
It seems to be a similar error. I did some Googling and found some information suggesting that this could be a video card problem, but I'm not sure...
The above are typical boot messages, I think; I cannot see anything there related to a crash or anything seriously wrong. The only one that looks strange to me is:
Aug 16 10:56:42 leartes kernel: mtrr: 0xf0000000,0x4000000 overlaps existing 0xf0000000,0x1000000
There is something wrong there.
The normal kernel log stop and start sequence is like this:
Aug 16 16:18:57 nimrodel kernel: Kernel logging (proc) stopped.
Aug 16 16:18:57 nimrodel kernel: Kernel log daemon terminating.
Aug 16 20:17:40 nimrodel kernel: klogd 1.4.1, log source = /proc/kmsg started.
Aug 16 20:17:40 nimrodel kernel: Inspecting /boot/System.map-2.6.5-7.104-default
The first two lines belong to the halt sequence, the other two to the next boot. Notice the time difference. And the log you posted starts with the line marking the start of the boot sequence as well... unless it crashed right after the mtrr line. Is that it?
--
Cheers,
Carlos Robinson
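The stop/start pattern described above can be checked mechanically. The following is only a sketch, assuming the klogd message texts shown in this thread; the function name `find_unclean_boots` is hypothetical. It reads a kernel log on standard input and reports every klogd start that was not preceded by the "Kernel log daemon terminating" line, together with the last line logged before it.

```shell
# Sketch: list klogd restarts that were not preceded by a clean shutdown.
# Reads a kernel log on stdin. find_unclean_boots is a hypothetical name.
find_unclean_boots() {
    awk '
        /Kernel log daemon terminating/ { clean = 1; last = $0; next }
        /klogd .* started/ {
            # The first start in the file has nothing before it to judge.
            if (seen && !clean) {
                print "unclean restart, last line before it was:"
                print "  " last
            }
            seen = 1; clean = 0
        }
        { last = $0 }
    '
}

# Usage: find_unclean_boots < /var/log/kernel
```

The reported "last line" is a hint about what the machine was doing just before it went down, not proof of the cause.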
The mtrr line was the last line before I rebooted, right after the crash. I'm not exactly sure what the mtrr is, though.
Cheers,
Ben
On Mon, 2004-08-16 at 17:54, Carlos E. R. wrote:
On Tuesday 2004-08-17 at 05:38 -0700, Ben Sheron wrote:
Aug 16 10:56:42 leartes kernel: mtrr: 0xf0000000,0x4000000 overlaps existing 0xf0000000,0x1000000
The mtrr line was the last line before I rebooted, right after the crash. I'm not exactly sure what the mtrr is, though.
Me neither, not much. The 'apropos' command yields some info:
mtrr_add - Add a memory type region
...
Memory type region registers control the caching on newer Intel and non Intel processors. This function allows drivers to request an MTRR is added. The details and hardware specifics of each processor's implementation are hidden from the caller, but nevertheless the caller should expect to need to provide a power of two size on an equivalent power of two boundary.
And then, there is the kernel documentation on it:
/usr/src/linux-2.6.5-7.75/Documentation/mtrr.txt
Do you have a Voodoo card? See this note:
|Some cards (especially Voodoo Graphics boards) need this 4 kB area
|excluded from the beginning of the region because it is used for
|registers.
It would then be an error in the driver programming... can you try another driver? I'm shooting in the dark.
--
Cheers,
Carlos Robinson
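The "power of two size on an equivalent power of two boundary" rule quoted above is easy to test with shell arithmetic. A minimal sketch, assuming a shell with 64-bit arithmetic; the function name `is_valid_mtrr_region` is hypothetical, and real hardware may impose further limits that this does not check.

```shell
# Sketch: check that a region has a power-of-two size and that its base is
# aligned to that size. Arguments may be hex (0x...) or decimal.
# is_valid_mtrr_region is a hypothetical name.
is_valid_mtrr_region() {
    base=$(($1)); size=$(($2))
    [ "$size" -gt 0 ] || return 1
    # A power of two has exactly one bit set.
    [ $((size & (size - 1))) -eq 0 ] || return 1
    # The base must be a multiple of the size.
    [ $((base % size)) -eq 0 ]
}

# Usage, with the region from the log line in this thread:
# is_valid_mtrr_region 0xf0000000 0x4000000 && echo "region is well formed"
```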
I'm using an NVidia GeForce4 card, with the driver that came with SuSE. I'm going to check around, though. There might be a different driver I'd have some success with. Thanks for the info, I'll keep you informed.
Ben
On Tue, 2004-08-17 at 13:08, Carlos E. R. wrote:
On Tuesday 2004-08-17 at 10:04 -0700, Ben Sheron wrote:
I'm using an NVidia GeForce4 card, with the driver that came with SuSE. I'm going to check around, though. There might be a different driver I'd have some success with. Thanks for the info, I'll keep you informed.
Ah, me too. There are two drivers: the free one and the non-free one, 'nv' and 'nvidia' respectively.
Also, it occurs to me you may try changing the "AGP aperture" setting in the BIOS; I think it is related. Also, the nvidia driver has some options:
Option "NvAGP" "3" # try 2 then 1
#Option "NvAGP" "2" # use agpgart
#Option "NvAGP" "1" # use nvidia agp
#Option "NvAGP" "0" # disable agp
Finally, the nvidia driver comes with a long file (or was it on the nvidia web page? :-?) explaining the different options.
Assuming then that the problem is video related, you will probably be able to boot into runlevel 3 (text mode). Simply type a 3 at the grub prompt.
--
Cheers,
Carlos Robinson
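To see which of the two drivers is currently configured, one can read the Driver line of the "Device" section in the X configuration file. This is a sketch only: the function name `x_device_drivers` is hypothetical, and the file location (/etc/X11/XF86Config here) is an assumption that may differ on your installation.

```shell
# Sketch: print the Driver value of every "Device" section in an X
# configuration file, skipping InputDevice and other sections.
# x_device_drivers is a hypothetical name.
x_device_drivers() {
    # $1: path to the X configuration file
    awk '
        $1 == "Section"    { in_dev = ($2 == "\"Device\""); next }
        $1 == "EndSection" { in_dev = 0; next }
        in_dev && $1 == "Driver" { gsub(/"/, "", $2); print $2 }
    ' "$1"
}

# Usage (the path is an assumption):
# x_device_drivers /etc/X11/XF86Config
```

If it prints nvidia, the NvAGP option shown above would go in that same "Device" section.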
[ Lines like the following in your /var/log/messages ]
Aug 16 10:56:42 leartes kernel: mtrr: 0xf0000000,0x4000000 overlaps existing 0xf0000000,0x1000000
This means that the video driver is trying to access memory addresses that are not mapped by the kernel (or something like that). I had the same problem on one of my machines, except my log message had a different address and sizes:
Jul 15 05:23:13 obelisk kernel: mtrr: 0xd0000000,0x8000000 overlaps existing 0xd0000000,0x800000
This occurred each time the X display server started. Although I had not experienced any actual failures with those warnings, I read that others were encountering various crashes with this problem.
I fixed it by putting the following lines in my /etc/init.d/boot.local script to force the kernel to map the correct amount of memory space for the video frame buffer each time it boots:
if [ -f /proc/mtrr ]
then
    # Fix video framebuffer size in mtrr
    echo "disable=2" >/proc/mtrr
    echo "base=0xd0000000 size=0x8000000 type=write-combining" >/proc/mtrr
fi
The first "echo" line disables the mapping for the video frame buffer, and the second "echo" line remaps it with the correct size and type.
You shouldn't use the above as-is, because it is specific to my hardware. Basically, if you do a "cat /proc/mtrr" it will show you the current mapping. In my case, the reg02 line pertained to the video frame buffer with the starting address of 0xd0000000, but it mapped only 8MB (0x800000) of memory, while the video driver wants 128MB (0x8000000). Prior to the above "fix", my "cat /proc/mtrr" output looked like this:
$ cat /proc/mtrr
reg00: base=0x00000000 ( 0MB), size=1024MB: write-back, count=1
reg01: base=0x3f800000 (1016MB), size= 8MB: uncachable, count=1
reg02: base=0xd0000000 (3328MB), size= 8MB: write-combining, count=2
After applying the above fix in boot.local, the output now looks like this:
$ cat /proc/mtrr
reg00: base=0x00000000 ( 0MB), size=1024MB: write-back, count=1
reg01: base=0x3f800000 (1016MB), size= 8MB: uncachable, count=1
reg02: base=0xd0000000 (3328MB), size= 128MB: write-combining, count=2
Voila, no more mtrr messages in my log file.
You need to look at your "cat /proc/mtrr" output and the log message and make the appropriate remapping to get things right.
-Ti
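Reading the hex sizes out of the warning by eye is error-prone, so here is a minimal sketch that does the conversion. It assumes the exact message wording shown in this thread; the function name `describe_mtrr_warning` is hypothetical.

```shell
# Sketch: extract the requested and the existing region from an mtrr warning
# line and print both sizes in MB. describe_mtrr_warning is a hypothetical name.
describe_mtrr_warning() {
    # $1: one log line containing "mtrr: BASE,SIZE overlaps existing BASE,SIZE"
    set -- $(printf '%s\n' "$1" |
        sed -n 's/.*mtrr: \(0x[0-9a-f]*\),\(0x[0-9a-f]*\) overlaps existing \(0x[0-9a-f]*\),\(0x[0-9a-f]*\).*/\1 \2 \3 \4/p')
    [ $# -eq 4 ] || return 1
    # 1048576 bytes = 1 MB
    echo "requested: base=$1 size=$(( $2 / 1048576 ))MB"
    echo "existing:  base=$3 size=$(( $4 / 1048576 ))MB"
}

# Usage:
# describe_mtrr_warning "$(grep 'mtrr:.*overlaps existing' /var/log/messages | tail -n 1)"
```

For the obelisk line above this gives 128MB requested against 8MB existing, which matches the figures in the explanation.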
Thanks for your response. The output of cat /proc/mtrr is the following on my machine:
reg00: base=0x00000000 ( 0MB), size= 512MB: write-back, count=1
reg01: base=0xf0000000 (3840MB), size= 16MB: write-combining, count=1
reg05: base=0xe8000000 (3712MB), size= 128MB: write-combining, count=1
So, would I then change your script to this?
if [ -f /proc/mtrr ]
then
    # Fix video framebuffer size in mtrr
    echo "disable=1" >/proc/mtrr
    echo "base=0xf0000000 size=0x8000000 type=write-combining" >/proc/mtrr
fi
Also, can I just stick this anywhere within my /etc/init.d/boot.local script?
Once again, thank you very much.
Ben
On Tue, 2004-08-17 at 21:28, Ti Kan wrote:
Ben Sheron writes:
Thanks for your response. The output of cat /proc/mtrr is the following on my machine:
reg00: base=0x00000000 ( 0MB), size= 512MB: write-back, count=1
reg01: base=0xf0000000 (3840MB), size= 16MB: write-combining, count=1
reg05: base=0xe8000000 (3712MB), size= 128MB: write-combining, count=1
So, would I then change your script to this?
if [ -f /proc/mtrr ]
then
    # Fix video framebuffer size in mtrr
    echo "disable=1" >/proc/mtrr
    echo "base=0xf0000000 size=0x8000000 type=write-combining" >/proc/mtrr
fi
Since your warning message says:
Aug 16 10:56:42 leartes kernel: mtrr: 0xf0000000,0x4000000 overlaps existing 0xf0000000,0x1000000
Your second echo command should look like this:
echo "base=0xf0000000 size=0x4000000 type=write-combining" >/proc/mtrr
As root, you should run the two echo commands at the shell prompt by hand first, and then do a "cat /proc/mtrr" to see if the change looks reasonable. If it looks ok, then add it to your /etc/init.d/boot.local script.
Also, can I just stick this anywhere within my /etc/init.d/boot.local script?
Yes. This script is run early in the boot process, before most system services are started. By default SuSE provides an empty boot.local file (except for some comments). You can just add your stuff to the end of the file.
-Ti
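To avoid picking the wrong register number by hand, the two commands can be derived from the /proc/mtrr listing. The sketch below only prints the commands and never writes to /proc/mtrr itself, so the result can be reviewed first. The function name `mtrr_fix_commands` is hypothetical, and the listing format is assumed to be the one shown in this thread.

```shell
# Sketch: print (not run) the disable/remap commands for the register whose
# base matches the warning. Reads /proc/mtrr-style text on stdin.
# mtrr_fix_commands is a hypothetical name.
mtrr_fix_commands() {
    # $1: base address from the warning, $2: size the driver asked for
    reg=$(sed -n "s/^reg0*\([0-9][0-9]*\): base=$1 .*/\1/p" | head -n 1)
    [ -n "$reg" ] || { echo "no register with base=$1" >&2; return 1; }
    echo "echo \"disable=$reg\" >/proc/mtrr"
    echo "echo \"base=$1 size=$2 type=write-combining\" >/proc/mtrr"
}

# Usage: review the output, then run the two printed lines as root.
# mtrr_fix_commands 0xf0000000 0x4000000 < /proc/mtrr
```

With the listing posted earlier in the thread, this selects reg01 and the 0x4000000 size taken from the warning.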
Thanks, it looks good now. I'm going to try to crash my box a few more times to test this out. Also, just out of curiosity, could I increase the memory beyond 64 megs for better performance?
Ben
On Wed, 2004-08-18 at 18:54, Ti Kan wrote:
Ben Sheron writes:
... Thanks, it looks good now. I'm going to try to crash my box a few more times to test this out. Also, just out of curiosity, could I increase the memory beyond 64 megs for better performance?
No, you should not map more than what is needed. Doing so would not give you more performance anyway.
-Ti
participants (3)
- Ben Sheron
- Carlos E. R.
- ti@amb.org