[Bug 213321] New: memory allocation/deallocation causes hard lockup in Thinkpad R60e Centrino Duo
https://bugzilla.novell.com/show_bug.cgi?id=213321

           Summary: memory allocation/deallocation causes hard lockup in Thinkpad R60e Centrino Duo
           Product: SUSE Linux 10.1
           Version: Final
          Platform: i686
        OS/Version: Linux
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: Kernel
        AssignedTo: kernel-maintainers@forge.provo.novell.com
        ReportedBy: geraldweber@terra.com.br
         QAContact: qa@suse.de

The following piece of code, compiled with g++:

#include <valarray>
#include <iostream>

int main(void)
{
    size_t N=1000000;
    while(true)
    {
        std::valarray<double> mat(N);
        std::cout << "." << std::flush; //Just to tell you that it is still alive
    }
}

causes a hard lockup after running for a few minutes on a Thinkpad R60e, Centrino Duo T2300 system with 2 x 1 GB DDR2 RAM. The system does not respond to SysRq, and nmi_watchdog=1 had no effect either. Basically, the code allocates an array of N doubles on each pass through the loop, which is deallocated again at the end of that pass (no increase in memory usage is observed while the program runs).

This was tested with the kernels kernel-smp-2.6.16.21-0.25, kernel-default and kernel-debug (all installed via YaST2), in runlevels 1, 3 and 5, and also with the "Failsafe" boot parameters. In all cases I get a hard lockup, and no problem shows up in the logs. I also compiled this with gcc 4.0.2 instead of gcc 4.1.0, again with the same hard lockup. If I run it through valgrind, no lockup is observed, and valgrind reports no leaks.

The system has been updated to the latest BIOS firmware version 1.04 (7EET18WW), but the symptoms are unaffected. The memory modules have been tested exhaustively with the memory test included on the SuSE Linux 10.1 DVD. hwinfo and dmidecode output will be attached to this bug report.

Interestingly, apart from this problem, this system runs flawlessly. The same code was tested on a Thinkpad R50e (Pentium-M) and an R40e (Celeron), which would run it for hours without problems.
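To make the amount of memory traffic per loop iteration concrete: the std::valarray<double>(N) constructor value-initializes its elements, so each pass writes N zero doubles (8,000,000 bytes for N = 1000000 with 8-byte doubles) before freeing them. The sketch below bounds the reporter's loop so it can be run a fixed number of times. The helper names bytes_per_iteration and churn are hypothetical and chosen for illustration only:

```cpp
#include <cstddef>
#include <valarray>

// Bytes written by one loop iteration: n doubles, each zero-initialized.
std::size_t bytes_per_iteration(std::size_t n) {
    return n * sizeof(double);
}

// Hypothetical helper: repeat the reporter's allocate/free pattern a bounded
// number of times and return a checksum so the work cannot be optimized away.
double churn(std::size_t n, int iterations) {
    if (n == 0) return 0.0;  // nothing to allocate or touch
    double checksum = 0.0;
    for (int i = 0; i < iterations; ++i) {
        std::valarray<double> mat(n);          // allocates n doubles, all 0.0
        mat[n - 1] = static_cast<double>(i);   // touch the last element
        checksum += mat.sum();                 // reads every element back
    }                                          // mat is freed here each pass
    return checksum;
}
```

Because only the last element is non-zero, the checksum equals the sum of the iteration indices. What matters is that every element is written and then read back on each pass. This keeps memory traffic high while the resident size stays flat, which matches the observation that memory usage does not grow.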
I am reporting this in the hope that it may be useful for investigating hard lockups on similar systems. In addition, I would be grateful for any suggestions for further tests or workarounds, or any feedback at all.

Many thanks,
Gerald

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
------- Comment #1 from geraldweber@terra.com.br 2006-10-18 08:23 MST -------
Created an attachment (id=101901) --> (https://bugzilla.novell.com/attachment.cgi?id=101901&action=view)
hwinfo for Thinkpad R60e
------- Comment #2 from geraldweber@terra.com.br 2006-10-18 08:24 MST -------
Created an attachment (id=101903) --> (https://bugzilla.novell.com/attachment.cgi?id=101903&action=view)
dmidecode for Thinkpad R60e
gregkh@novell.com changed:

           What    |Removed                                    |Added
----------------------------------------------------------------------------
                 CC|                                           |gregkh@novell.com
         AssignedTo|kernel-maintainers@forge.provo.novell.com  |trenn@novell.com

------- Comment #3 from gregkh@novell.com 2006-10-18 16:59 MST -------
That program just sits and spins on the cpu, possibly driving the processor to overheat...

Thomas, could this be an ACPI issue with the overheating problem?
trenn@novell.com changed:

           What    |Removed                                    |Added
----------------------------------------------------------------------------
         AssignedTo|trenn@novell.com                           |kernel-maintainers@forge.provo.novell.com

------- Comment #4 from trenn@novell.com 2006-10-19 03:25 MST -------
No. If the machine gets too hot, it would:
- shut down properly if the critical temperature is reached
- switch off power hard if ACPI is not functioning correctly, the machine heats up too quickly, or no critical temperature is defined

Still, it's strange that this only happens on the R60e...
I'd recommend to run a memtest.
------- Comment #5 from geraldweber@terra.com.br 2006-10-19 06:08 MST -------
(In reply to comment #3)
> That program just sits and spins on the cpu, possibly driving the processor to overheat...
That was my first thought when I first saw this problem. I've been monitoring the temperature ever since, and it always stays below 65°C, which is well below the critical 97°C. Also, this program runs on only one CPU. Most of the time I run CPU-intensive programs on both CPUs, sometimes for days, and I have never run into any overheating problems. So it is definitely not an overheating problem.
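For logging temperature alongside the test, 2.6-era kernels typically report the ACPI thermal zone in a file under /proc/acpi/thermal_zone/, as a line like "temperature: 65 C". The sketch below assumes that line format, which is not taken from this report. The zone directory name varies from machine to machine, and the helper names are hypothetical:

```cpp
#include <optional>
#include <sstream>
#include <string>

// Parse a line of the assumed form "temperature:   65 C" into degrees Celsius.
// Returns std::nullopt if the line does not match that layout.
std::optional<int> parse_acpi_temperature(const std::string& line) {
    std::istringstream in(line);
    std::string label, unit;
    int value = 0;
    if (!(in >> label >> value >> unit)) return std::nullopt;
    if (label != "temperature:" || unit != "C") return std::nullopt;
    return value;
}

// Degrees left before the critical trip point (97 C on the reporter's machine).
int headroom(int current_c, int critical_c) {
    return critical_c - current_c;
}
```

With the figures above, a reading of 65 C against a 97 C critical point leaves 32 degrees of headroom. One way to log it during a run is to read the file periodically with std::ifstream and std::getline.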
> Thomas, could this be an ACPI issue with the overheating problem?

I've also tried several times with acpi=off, and again it results in a hard lockup.

(In reply to comment #4)
> I'd recommend to run a memtest.

I've done that several times, the last time for about 12 hours, and no errors were reported.
------- Comment #6 from geraldweber@terra.com.br 2006-10-19 07:27 MST -------
I've done a few other tests which perhaps may shed some light on the problem:
Consider this C code:
#include
geraldweber@terra.com.br changed:

           What    |Removed                                    |Added
----------------------------------------------------------------------------
             Status|NEW                                        |RESOLVED
         Resolution|                                           |INVALID

------- Comment #7 from geraldweber@terra.com.br 2006-10-20 06:09 MST -------
It seems to be hardware related after all. I took one of the memory modules out, and the machine showed no sign of a hard lockup. I then swapped the modules, and again the machine worked fine. Only the two modules together cause the lockup. Note that even with both modules installed the machine works fine in normal use. The problem only shows up under heavy memory activity, such as recompiling the complete gcc collection. One possibility is that the memory modules themselves overheat, since they sit one on top of the other.
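The fault appeared only with both modules installed and under heavy memory traffic. A user-space write-then-verify pass over a large buffer is one cheap way to try to reproduce that condition. This is a minimal sketch, not the reporter's test and not a replacement for a boot-time memory tester. The function names and the index-derived pattern are assumptions made for illustration:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical pattern: mixes the word index with a per-pass seed so that
// neighbouring words hold different values, which can help expose aliasing
// between addresses as well as flipped bits.
std::uint64_t pattern_word(std::size_t i, std::uint64_t seed) {
    return (static_cast<std::uint64_t>(i) * 0x9E3779B97F4A7C15ULL) ^ seed;
}

// Write the pattern into `words` 64-bit words, read it back, and count
// mismatches. A non-zero result would point at a memory-path fault.
std::size_t pattern_pass(std::size_t words, std::uint64_t seed) {
    std::vector<std::uint64_t> buf(words);
    // Access through a volatile pointer so the compiler must perform every
    // store and load instead of folding the comparison away.
    volatile std::uint64_t* p = buf.data();
    for (std::size_t i = 0; i < words; ++i) {
        p[i] = pattern_word(i, seed);
    }
    std::size_t errors = 0;
    for (std::size_t i = 0; i < words; ++i) {
        if (p[i] != pattern_word(i, seed)) ++errors;
    }
    return errors;
}

// Run several passes, alternating a seed with its bitwise complement so every
// bit of every word is written both as 0 and as 1 across each pair of passes.
std::size_t run_passes(std::size_t words, unsigned passes) {
    std::size_t total = 0;
    for (unsigned pass = 0; pass < passes; ++pass) {
        std::uint64_t base = pass / 2;
        std::uint64_t seed = (pass % 2 == 0) ? base : ~base;
        total += pattern_pass(words, seed);
    }
    return total;
}
```

To approach the "huge memory activity" condition, the word count would need to cover a good share of the installed RAM. On a 32-bit i686 system, a single process's address space also limits how large one buffer can be.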