suse 10.1 runs out of memory, then kernel panic
Hello,

I've installed suse 10.1 (amd64) on a new system and it hangs after a few hours of being up. The hardware is dual Opteron with 4 gigs of ram. After booting the system there are no applications running but my available memory (given by top) eventually goes from 4 gigs down to zero, giving this message on the console: "Kernel panic - not syncing: Out of memory and no killable processes...".

I'm running kernel version 2.6.16.13-4-smp and have also tried 2.6.16-20-smp with the same results. Does anyone know what could be causing this problem?

thanks,
Kevin
On Saturday 03 June 2006 02:01, Kevin Lewandowski wrote:
> Hello, I've installed suse 10.1 (amd64) on a new system and it hangs after a few hours of being up.
> The hardware is dual Opteron with 4 gigs of ram.
> After booting the system there are no applications running but my available memory (given by top) eventually goes from 4 gigs down to zero, giving this message on the console: "Kernel panic - not syncing: Out of memory and no killable processes...".
> I'm running kernel version 2.6.16.13-4-smp and have also tried 2.6.16-20-smp with the same results. Does anyone know what could be causing this problem?
Sounds like a memory leak somewhere. Can you save a copy of /proc/slabinfo every few hours and send it on? lsmod output might be also good. Is the system doing something special?

-Andi
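For readers who want to collect the snapshots requested above, here is a minimal sketch in Python. It is not something either poster used. The function names (`snapshot`, `parse_slabinfo`, `growth`) and the `slab-snapshots` directory are made up for illustration. It assumes the usual /proc/slabinfo layout, where each data line starts with the cache name followed by the active object count. On many kernels reading /proc/slabinfo requires root.

```python
import time
from pathlib import Path


def snapshot(src="/proc/slabinfo", dest_dir="slab-snapshots"):
    """Copy the current contents of src into a timestamped file in dest_dir."""
    out_dir = Path(dest_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = out_dir / f"slabinfo-{stamp}.txt"
    dest.write_text(Path(src).read_text())
    return dest


def parse_slabinfo(text):
    """Return {cache_name: active_objs} from slabinfo-formatted text."""
    counts = {}
    for line in text.splitlines():
        # Skip blank lines, the version line and the '# name ...' header.
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[1].isdigit():
            counts[fields[0]] = int(fields[1])
    return counts


def growth(old_text, new_text):
    """List (cache, delta) for caches whose active object count grew, largest first."""
    old = parse_slabinfo(old_text)
    new = parse_slabinfo(new_text)
    deltas = [(name, count - old.get(name, 0)) for name, count in new.items()]
    return sorted((d for d in deltas if d[1] > 0), key=lambda d: -d[1])
```

Calling `snapshot()` from a cron job every few hours and then running `growth()` on the oldest and newest files would show which slab caches keep growing, which is usually the first clue in a kernel memory leak.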
> Can you save a copy of /proc/slabinfo every few hours and send it on?
> lsmod output might be also good. Is the system doing something special?
Thanks for the reply! The system is not doing anything special. This is just a minimal install with no apps running. I've tried to save /proc/slabinfo via a cron job and quite often at the time I try to access it the system crashes with the following on the console:

general protection fault: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:09.0/0000:04:04.1/power/state
CPU 1
Modules linked in: edd cpufreq_ondemand cpufreq_userspace cpufreq_powersave powernow_k8 freq_table ipv6 button battery ac loop dm_mod shpchp ehci_hcd ohci_hcd ide_cd cdrom pci_hotplug usbcore tg3 floppy ext3 jbd fan thermal processor i2o_block i2o_core serverworks ide_disk ide_core
Pid: 3558, comm: cat Not tainted 2.6.16.13-4-smp #1
RIP: 0010:[<ffffffff80178c63>] <ffffffff80178c63>{s_show+189}
RSP: 0018:ffff810076041e58 EFLAGS: 00010016
RAX: 000000000000001e RBX: ffff810080005d40 RCX: f000eef300000001
RDX: f000eef300000001 RSI: 000000000000001e RDI: ffff810080005d80
RBP: 0000000000000001 R08: 00000000fffffffe R09: ffff810076041c78
R10: 0000000000000001 R11: 0000000000000000 R12: ffff810037d946c0
R13: ffffffff802f78a8 R14: 0000000000000000 R15: ffff8100efc24f40
FS: 00002ab10d1e16d0(0000) GS:ffff8100f40557c0(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000506000 CR3: 00000000efc14000 CR4: 00000000000006e0
Process cat (pid: 3558, threadinfo ffff810076040000, task ffff81007c95d080)
Stack: 0000000000000000 000000000005c0b1 0000000000003117 0000000000000001
       0000000000000000 ffff8100efc24f40 ffff810037d946c0 0000000000000504
       0000000000001000 0000000000000000
Call Trace: <ffffffff801986fa>{seq_read+469} <ffffffff8017b402>{vfs_read+203}
       <ffffffff8017b7de>{sys_read+69} <ffffffff8010a7be>{system_call+126}

Code: 48 8b 0a 0f 18 09 48 8d 43 10 48 39 c2 75 c7 48 8b 03 eb 40
RIP <ffffffff80178c63>{s_show+189} RSP <ffff810076041e58>
<3>Debug: sleeping function called from invalid context at include/linux/rwsem.h:43
in_atomic():0, irqs_disabled():1

Call Trace: <ffffffff80131da1>{profile_task_exit+21} <ffffffff801333ed>{do_exit+32}
       <ffffffff8010bfa4>{kernel_math_error+0} <ffffffff802d178b>{do_general_protection+254}
       <ffffffff8010b4b9>{error_exit+0} <ffffffff80178c63>{s_show+189}
       <ffffffff80178c33>{s_show+141} <ffffffff801986fa>{seq_read+469}
       <ffffffff8017b402>{vfs_read+203} <ffffffff8017b7de>{sys_read+69}
       <ffffffff8010a7be>{system_call+126}

NMI Watchdog detected LOCKUP on CPU 1
CPU 1
Modules linked in: edd cpufreq_ondemand cpufreq_userspace cpufreq_powersave powernow_k8 freq_table ipv6 button battery ac loop dm_mod shpchp ehci_hcd ohci_hcd ide_cd cdrom pci_hotplug usbcore tg3 floppy ext3 jbd fan thermal processor i2o_block i2o_core serverworks ide_disk ide_core
Pid: 0, comm: swapper Not tainted 2.6.16.13-4-smp #1
RIP: 0010:[<ffffffff802d0bf6>] <ffffffff802d0bf6>{.text.lock.spinlock+2}
RSP: 0018:ffff8100f407be50 EFLAGS: 00000086
RAX: 0000000000000001 RBX: 0000000000000020 RCX: 000000007e0d4800
RDX: ffff8100f4064c00 RSI: 0000000000000020 RDI: ffff810080005d80
RBP: ffff810037d946c0 R08: ffff810037d98000 R09: ffff8100f4199100
R10: 0000000000000292 R11: ffffffff88041d0a R12: ffff810080005d40
R13: ffff8100f4064c00 R14: 000000000000003c R15: ffff810037d946c0
FS: 00002abc80f8c6d0(0000) GS:ffff8100f40557c0(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000506000 CR3: 00000000f32e6000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffff810037d98000, task ffff8100f40660c0)
Stack: ffffffff80177212 00000020f407bec8 0000000000000020 ffff810037d946c0
       0000000000000246 ffff81007e09e000 0000000000000000 000000007e0d4800
       ffffffff80176cf3 000000007e0d4800
Call Trace: <IRQ> <ffffffff80177212>{cache_alloc_refill+109}
       <ffffffff80176cf3>{kmem_cache_alloc+127} <ffffffff88041eb5>{:i2o_core:i2o_exec_reply+427}
       <ffffffff8803f8e2>{:i2o_core:i2o_driver_dispatch+123}
       <ffffffff8804108c>{:i2o_core:i2o_pci_interrupt+41}
       <ffffffff801564dc>{handle_IRQ_event+41} <ffffffff801565a6>{__do_IRQ+153}
       <ffffffff8010ccd4>{do_IRQ+59} <ffffffff801097b4>{default_idle+0}
       <ffffffff8010ad20>{ret_from_intr+0} <EOI> <ffffffff801097df>{default_idle+43}
       <ffffffff8010989f>{cpu_idle+151} <ffffffff8011823b>{start_secondary+1240}

Code: 83 3f 00 7e f9 e9 55 fe ff ff e8 df 9b f1 ff e9 65 fe ff ff
console shuts up ...
<3>Debug: sleeping function called from invalid context at include/linux/rwsem.h:43
in_atomic():1, irqs_disabled():1

Call Trace: <NMI> <ffffffff80131da1>{profile_task_exit+21}
       <ffffffff801333ed>{do_exit+32} <ffffffff802d12f9>{__die+0}
       <ffffffff802d1a85>{nmi_watchdog_tick+260} <ffffffff802d181a>{default_do_nmi+134}
       <ffffffff802d1b60>{do_nmi+69} <ffffffff802d108b>{nmi+127}
       <ffffffff88041d0a>{:i2o_core:i2o_exec_reply+0}
       <ffffffff802d0bf6>{.text.lock.spinlock+2} <EOE> <IRQ> <ffffffff80177212>{cache_alloc_refill+109}
       <ffffffff80176cf3>{kmem_cache_alloc+127} <ffffffff88041eb5>{:i2o_core:i2o_exec_reply+427}
       <ffffffff8803f8e2>{:i2o_core:i2o_driver_dispatch+123}
       <ffffffff8804108c>{:i2o_core:i2o_pci_interrupt+41}
       <ffffffff801564dc>{handle_IRQ_event+41} <ffffffff801565a6>{__do_IRQ+153}
       <ffffffff8010ccd4>{do_IRQ+59} <ffffffff801097b4>{default_idle+0}
       <ffffffff8010ad20>{ret_from_intr+0} <EOI> <ffffffff801097df>{default_idle+43}
       <ffffffff8010989f>{cpu_idle+151} <ffffffff8011823b>{start_secondary+1240}

Kernel panic - not syncing: Aiee, killing interrupt handler!
On Monday 05 June 2006 04:41, Kevin Lewandowski wrote:
>> Can you save a copy of /proc/slabinfo every few hours and send it on?
>> lsmod output might be also good. Is the system doing something special?
> Thanks for the reply! The system is not doing anything special. This is just a minimal install with no apps running.
> I've tried to save /proc/slabinfo via a cron job and quite often at the time I try to access it the system crashes with the following on the console:
> general protection fault: 0000 [1] SMP
A slab list is corrupted. That might also explain the out of memory issues. Either something is corrupting memory or the kernel is otherwise confused.

First I would run memtest86 overnight at least to make sure the memory is ok.

To rule out NUMA problems can you try booting with numa=off? If that helps please send boot.msg from a boot without that option.

Otherwise most likely some device driver is corrupting memory. You could disable e.g. network and other devices and see if that makes a difference.

-Andi
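When testing the numa=off suggestion, it helps to confirm that the option actually reached the kernel before drawing conclusions. The sketch below is only an illustration. The helper names `boot_options` and `numa_disabled` are hypothetical. It splits the command line on whitespace, so quoted values that contain spaces are not handled.

```python
def boot_options(cmdline):
    """Split a kernel command line into {key: value}; bare flags map to None."""
    opts = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        opts[key] = value if sep else None
    return opts


def numa_disabled(cmdline):
    """Return True if the command line contains numa=off."""
    return boot_options(cmdline).get("numa") == "off"


# Typical use on the running system (reads the kernel's own record of its
# boot parameters):
#   with open("/proc/cmdline") as fh:
#       print(numa_disabled(fh.read()))
```

If this reports False after a reboot, the boot loader entry was probably not the one edited, and the test result says nothing about NUMA.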