[Bug 253968] New: freeze issue with the H8SSLi AM2 HT1000
https://bugzilla.novell.com/show_bug.cgi?id=253968

Summary: freeze issue with the H8SSLi AM2 HT1000
Product: openSUSE 10.2
Version: Final
Platform: i386
OS/Version: All
Status: NEW
Severity: Blocker
Priority: P5 - None
Component: Kernel
AssignedTo: kernel-maintainers@forge.provo.novell.com
ReportedBy: martin@tuxadero.com
QAContact: qa@suse.de

To whom it may concern,

We use the H8SSL-I2 AM2 board from Supermicro with 4GB RAM, two SATA HDs on the onboard HT1000 controller in a software RAID-1 configuration, and the 1218 HE AMD CPU. The firmware version is "ServerWorks Serial ATA Controller MMIO BIOS V 3.0.0015.3h 09-14-2006". The OS is SUSE 10.2 with the 2.6.18.2-34-bigsmp kernel.

Every time we start our initial hardware burn-in test, the system hangs after a few hours; our main test tool is stress (http://weather.ou.edu/%7Eapw/projects/stress/).

The following symptoms are shown:

After a hang the system is still pingable and network sockets are open, but no userspace process responds. In one instance we had "top" open and almost all of the userspace processes were in the iowait state. The login console still responds (typed characters are echoed back) but no login process is spawned. The magic SysRq key still works.

If we use a 3ware RAID controller card instead of the onboard controller, the system runs stable. If we use 2GB instead of 4GB, the hangs are less frequent. If we use the 2.6.20 kernel from kernel.org, the system runs stable.

I think all of the above points indicate that it is not a hardware issue but a software issue, in particular a timing/locking issue in the SATA driver (sata_svw).

Our main problem is that we cannot switch to the 2.6.20 kernel; we must use the standard SUSE kernel 2.6.18.2-34-bigsmp. Is there any workaround, fix, or new firmware known to fix this freeze issue?

Any help would be appreciated. Thanks in advance.
Martin

I also have a few kernel dumps:

SysRq : HELP : loglevel0-8 reBoot Crashdump tErm Full kIll saK showMem Nice powerOff showPc unRaw Sync showTasks Unmount
SysRq : Show Memory
Mem-info:
DMA per-cpu:
cpu 0 hot: high 0, batch 1 used:0
cpu 0 cold: high 0, batch 1 used:0
cpu 1 hot: high 0, batch 1 used:0
cpu 1 cold: high 0, batch 1 used:0
DMA32 per-cpu: empty
Normal per-cpu:
cpu 0 hot: high 186, batch 31 used:44
cpu 0 cold: high 62, batch 15 used:53
cpu 1 hot: high 186, batch 31 used:155
cpu 1 cold: high 62, batch 15 used:49
HighMem per-cpu:
cpu 0 hot: high 186, batch 31 used:0
cpu 0 cold: high 62, batch 15 used:44
cpu 1 hot: high 186, batch 31 used:4
cpu 1 cold: high 62, batch 15 used:16
Free pages: 12292kB (712kB HighMem)
Active:349955 inactive:634280 dirty:13444 writeback:299620 unstable:0 free:3073 slab:46787 mapped:378 pagetables:1838
DMA free:4728kB min:68kB low:84kB high:100kB active:0kB inactive:0kB present:16384kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 880 4368
DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 880 4368
Normal free:6852kB min:3756kB low:4692kB high:5632kB active:285560kB inactive:385388kB present:901120kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 27904
HighMem free:712kB min:512kB low:4236kB high:7960kB active:1114260kB inactive:2151732kB present:3571712kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 48*4kB 31*8kB 16*16kB 6*32kB 2*64kB 1*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 4728kB
DMA32: empty
Normal: 429*4kB 64*8kB 15*16kB 3*32kB 7*64kB 4*128kB 3*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 6852kB
HighMem: 0*4kB 1*8kB 12*16kB 2*32kB 1*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 712kB
Swap cache: add 14757813, delete 14539413, find 43886/46204, race 2+5
Free swap = 171284kB
Total swap = 1052152kB
Free swap: 171284kB
1122304 pages of RAM
892928 pages of HIGHMEM
83665 reserved pages
142409 pages shared
218401 pages swap cached
13444 pages dirty
299620 pages writeback
378 pages mapped
46787 pages slab
1838 pages pagetables
SysRq : HELP : loglevel0-8 reBoot Crashdump tErm Full kIll saK showMem Nice powerOff showPc unRaw Sync showTasks Unmount
SysRq : Show Regs

Pid: 13466, comm: stress
EIP: 0073:[<0804b0a9>] CPU: 1
EIP is at 0x804b0a9
ESP: 007b:bfe8984c EFLAGS: 00000292 Tainted: G U (2.6.18.2-34-bigsmp #1)
EAX: 46cd6543 EBX: b7f284e0 ECX: b7f28328 EDX: b7f28088
ESI: 00000000 EDI: 00000000 EBP: bfe89868 DS: 007b ES: 007b
CR0: 80050033 CR2: 80105000 CR3: 1ff323e0 CR4: 000006f0
SysRq : Show Regs

Pid: 13466, comm: stress
EIP: 0073:[<0804b0b5>] CPU: 1
EIP is at 0x804b0b5
ESP: 007b:bfe89850 EFLAGS: 00000292 Tainted: G U (2.6.18.2-34-bigsmp #1)
EAX: 36a8a2b3 EBX: b7f284e0 ECX: b7f28328 EDX: b7f28084
ESI: 00000000 EDI: 00000000 EBP: bfe89868 DS: 007b ES: 007b
CR0: 80050033 CR2: 80105000 CR3: 1ff323e0 CR4: 000006f0
SysRq : Show Regs

Pid: 13466, comm: stress
EIP: 0073:[<0804b0b5>] CPU: 1
EIP is at 0x804b0b5
ESP: 007b:bfe89850 EFLAGS: 00000292 Tainted: G U (2.6.18.2-34-bigsmp #1)
EAX: 4aa2797d EBX: b7f284e0 ECX: b7f28328 EDX: b7f280d4
ESI: 00000000 EDI: 00000000 EBP: bfe89868 DS: 007b ES: 007b
CR0: 80050033 CR2: 80105000 CR3: 1ff323e0 CR4: 000006f0
SysRq : Show Regs

Pid: 13466, comm: stress
EIP: 0073:[<0804b0b5>] CPU: 1
EIP is at 0x804b0b5
ESP: 007b:bfe89850 EFLAGS: 00000292 Tainted: G U (2.6.18.2-34-bigsmp #1)
EAX: 4904a9a3 EBX: b7f284e0 ECX: b7f28328 EDX: b7f280d8
ESI: 00000000 EDI: 00000000 EBP: bfe89868 DS: 007b ES: 007b
CR0: 80050033 CR2: 80105000 CR3: 1ff323e0 CR4: 000006f0

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
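The per-zone free-list lines in the Show Memory output above can be cross-checked by hand: each `count*size` term is a number of free blocks of that size, and the terms should add up to the total after the `=`. A minimal Python sketch of that arithmetic follows. The function name `buddy_total_kb` is made up for illustration, and the parser assumes the exact `N*SkB` layout seen in this dump.

```python
import re

def buddy_total_kb(line):
    """Sum the 'count*sizekB' terms of one free-list line from a SysRq
    Show Memory dump and return (computed_kb, reported_kb).

    Assumes the layout seen in this report, e.g.
    'DMA: 48*4kB 31*8kB ... = 4728kB'. Hypothetical helper, not a kernel tool.
    """
    terms = re.findall(r"(\d+)\*(\d+)kB", line)
    computed = sum(int(count) * int(size) for count, size in terms)
    reported = re.search(r"=\s*(\d+)kB", line)
    return computed, int(reported.group(1)) if reported else None

# DMA zone line copied from the dump above.
dma = ("DMA: 48*4kB 31*8kB 16*16kB 6*32kB 2*64kB 1*128kB 0*256kB "
       "1*512kB 1*1024kB 1*2048kB 0*4096kB = 4728kB")
computed, reported = buddy_total_kb(dma)
print(computed, reported)
```

For the DMA line both numbers come out as 4728, so the dump is internally consistent for that zone; the same check can be run on the Normal and HighMem lines.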
https://bugzilla.novell.com/show_bug.cgi?id=253968

teheo@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
https://bugzilla.novell.com/show_bug.cgi?id=253968

teheo@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |NEEDINFO
      Info Provider|                            |martin@tuxadero.com

------- Comment #1 from teheo@novell.com 2007-03-13 06:05 MST -------
Can you please...

1. post the result of 'hwinfo --all'.
2. try KOTD ftp://ftp.suse.com/pub/projects/kernel/kotd/10.2-x86_64/SL102_BRANCH/

Thanks.
https://bugzilla.novell.com/show_bug.cgi?id=253968

------- Comment #2 from martin@tuxadero.com 2007-03-13 06:41 MST -------
Created an attachment (id=124030)
 --> (https://bugzilla.novell.com/attachment.cgi?id=124030&action=view)
Output of hwinfo --all
https://bugzilla.novell.com/show_bug.cgi?id=253968

martin@tuxadero.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |ASSIGNED
      Info Provider|martin@tuxadero.com         |

------- Comment #3 from martin@tuxadero.com 2007-03-13 06:44 MST -------
We tried different BIOS settings for the SATA mode of the HT1000, and it seems that when we change the mode from MMIO to IDE the system runs stable. I have started a new burn-in test with the recommended kernel and with MMIO mode activated.
https://bugzilla.novell.com/show_bug.cgi?id=253968

------- Comment #4 from martin@tuxadero.com 2007-03-13 09:51 MST -------
Hello,

I used the kernel 2.6.18.8-SL102_BRANCH_20070312010320-bigsmp because the x86_64 build would not install on my 32-bit system. I still have the same problem.

SysRq : Show Regs5 0 1106m 931m 160 R 23 23.0 5:32.92 stress
16450 root 15 0 0 0 0 S 6 0.0 0:00.74 pdflush
Pid: 14508, comm: stress D 1 0.0 1:22.18 md2_raid1
EIP: 0073:[<0804b0b3>] CPU: 1 60 32 S 0 0.0 0:02.00 init
EIP is at 0x804b0b3 0 0 0 0 S 0 0.0 0:00.00 migration/0
ESP: 007b:bf8c5290 EFLAGS: 00000296 Tainted: G U (2.6.18.8-SL102_BRANCH_20070312010320-bigsmp #1)
EAX: 194eb3e3 EBX: b7ec94e0 ECX: b7ec9328 EDX: b7ec90bc00.01 migration/1
ESI: 00000000 EDI: 00000000 EBP: bf8c52a8 DS: 007b ES: 007b0 ksoftirqd/1
CR0: 80050033 CR2: b7e41008 CR3: 364657c0 CR4: 000006f000.00 events/0
SysRq : Show Regs0 -5 0 0 0 S 0 0.0 0:00.00 events/1
8 root 20 -5 0 0 0 S 0 0.0 0:00.00 khelper
Pid: 14508, comm: stress
EIP: 0073:[<b7dcb790>] CPU: 1
EIP is at 0xb7dcb790
ESP: 007b:bf8c5250 EFLAGS: 00000246 Tainted: G U (2.6.18.8-SL102_BRANCH_20070312010320-bigsmp #1)
EAX: 00000000 EBX: b7ec8ff4 ECX: b7ec9328 EDX: b7ec906c
ESI: bf8c5274 EDI: b7ec9064 EBP: bf8c5264 DS: 007b ES: 007b
CR0: 80050033 CR2: b7e41008 CR3: 364657c0 CR4: 000006f0
SysRq : Show Regs

Pid: 14508, comm: stress
EIP: 0073:[<0804b0b3>] CPU: 1
EIP is at 0x804b0b3
ESP: 007b:bf8c5290 EFLAGS: 00000296 Tainted: G U (2.6.18.8-SL102_BRANCH_20070312010320-bigsmp #1)
EAX: 4e29169e EBX: b7ec94e0 ECX: b7ec9328 EDX: b7ec90b0
ESI: 00000000 EDI: 00000000 EBP: bf8c52a8 DS: 007b ES: 007b
CR0: 80050033 CR2: b7e41008 CR3: 364657c0 CR4: 000006f0
https://bugzilla.novell.com/show_bug.cgi?id=253968

gregkh@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|Blocker                     |Major

------- Comment #5 from gregkh@novell.com 2007-03-14 12:47 MST -------
bugs on products that have already shipped can't be "blocker"...
https://bugzilla.novell.com/show_bug.cgi?id=253968

jeffm@novell.com changed:

           What    |Removed                                   |Added
----------------------------------------------------------------------------
         AssignedTo|kernel-maintainers@forge.provo.novell.com |teheo@novell.com
             Status|ASSIGNED                                  |NEW
https://bugzilla.novell.com/show_bug.cgi?id=253968

teheo@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO
      Info Provider|                            |martin@tuxadero.com

------- Comment #6 from teheo@novell.com 2007-03-17 08:41 MST -------
I'm a bit confused.

* Are you still seeing the problem with the BIOS setting changed?
* Can you post the result of 'hwinfo --all' with the BIOS setting changed? I'm wondering whether the same driver is used.
* Do you have another machine so that you can set up netconsole? If so, please set it up and post the kernel messages after the hang. (I need to look at what libata reports to determine what's going on.)

You can download the netconsole tools from the following URL.

http://download.opensuse.org/distribution/10.2/repo/oss/suse/noarch/netconso...

After installing the rpm, run "netconsole-server -v -d eth0 IP_of_another_machine:6666" followed by "dmesg -n 8". On the other machine, execute "netcat -l -u -p 6666 | tee kernel.log". After the hang, you can post kernel.log here.

Thanks.
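The receiving side of the netconsole setup described above is just a UDP listener that appends whatever arrives to a file. Where netcat is not available, the same role could be filled by a short Python script. This is a sketch under assumptions: port 6666 and the file name kernel.log come from the comment above, while the function names `capture_udp` and `open_listener` and the `max_packets` limit are invented here so the example can terminate.

```python
import socket

def open_listener(port=6666, host="0.0.0.0"):
    """Bind a UDP socket on the port used in the netconsole example."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock

def capture_udp(sock, log_path, max_packets=None):
    """Append every datagram received on `sock` to `log_path`.

    Rough stand-in for 'netcat -l -u -p 6666 | tee kernel.log'.
    `max_packets` is a made-up knob so the loop can end; None means
    run until interrupted.
    """
    received = 0
    with open(log_path, "ab") as log:
        while max_packets is None or received < max_packets:
            data, _sender = sock.recvfrom(65535)
            log.write(data)
            log.flush()  # keep the file current while the other box hangs
            print(data.decode("utf-8", errors="replace"), end="")
            received += 1
    return received

# Typical use on the logging machine (runs until Ctrl-C):
#   capture_udp(open_listener(), "kernel.log")
```

Flushing after every datagram keeps kernel.log usable even if the capture is interrupted right after the hang.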
https://bugzilla.novell.com/show_bug.cgi?id=253968

------- Comment #7 from teheo@novell.com 2007-03-21 23:58 MST -------
PING
https://bugzilla.novell.com/show_bug.cgi?id=253968

martin@tuxadero.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW
      Info Provider|martin@tuxadero.com         |

------- Comment #8 from martin@tuxadero.com 2007-03-22 15:33 MST -------
I got the dmesg logs as you requested.

When I saw that the machine was out of memory, I changed my swap configuration. My initial setup was swap on an md device, configured as a RAID1 with a size of 1GB. I changed the setup to use the devices sda2 and sdb2 as swap with the same priority; the swap size is now 2GB. Now the machine runs stable.

My question is: why is the OOM killer not functioning properly? If the machine is in an OOM situation it should kill the process and recover, but it doesn't. Why?

# oom-killer: gfp_mask=0xd0, order=0
 [<c014b768>] out_of_memory+0x2d/0x14d
 [<c014ccdd>] __alloc_pages+0x207/0x2c8
 [<c016189d>] cache_alloc_refill+0x29d/0x491
 [<c01615f6>] kmem_cache_alloc+0x41/0x4b
 [<c01715fe>] getname+0x1a/0xb0
 [<c01730ce>] __user_walk_fd+0x12/0x40
 [<c016cc48>] vfs_stat_fd+0x19/0x40
 [<c02a81de>] do_page_fault+0x346/0x630
 [<c016ccfc>] sys_stat64+0xf/0x23
 [<c01630e3>] filp_close+0x52/0x59
 [<c02a7e98>] do_page_fault+0x0/0x630
 [<c0103ddd>] sysenter_past_esp+0x56/0x79
Mem-info:
DMA per-cpu:
cpu 0 hot: high 0, batch 1 used:0
cpu 0 cold: high 0, batch 1 used:0
cpu 1 hot: high 0, batch 1 used:0
cpu 1 cold: high 0, batch 1 used:0
DMA32 per-cpu: empty
Normal per-cpu:
cpu 0 hot: high 186, batch 31 used:175
cpu 0 cold: high 62, batch 15 used:53
cpu 1 hot: high 186, batch 31 used:164
cpu 1 cold: high 62, batch 15 used:50
HighMem per-cpu:
cpu 0 hot: high 186, batch 31 used:24
cpu 0 cold: high 62, batch 15 used:23
cpu 1 hot: high 186, batch 31 used:20
cpu 1 cold: high 62, batch 15 used:57
Free pages: 134040kB (124644kB HighMem)
Active:225122 inactive:710378 dirty:88879 writeback:324447 unstable:0 free:33541 slab:63174 mapped:140 pagetables:1671
DMA free:3936kB min:68kB low:84kB high:100kB active:0kB inactive:4kB present:16384kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 880 4368
DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 880 4368
Normal free:5956kB min:3756kB low:4692kB high:5632kB active:66436kB inactive:533124kB present:901120kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 27904
HighMem free:111764kB min:512kB low:4236kB high:7960kB active:879136kB inactive:2276372kB present:3571712kB pages_scanned:2176 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 1*4kB 2*8kB 2*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3604kB
DMA32: empty
Normal: 43*4kB 17*8kB 2*16kB 2*32kB 4*64kB 2*128kB 0*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 3476kB
HighMem: 1*4kB 4085*8kB 749*16kB 1827*32kB 1*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 103580kB
Swap cache: add 62751133, delete 62521135, find 240316/252271, race 0+22
Free swap = 0kB
Total swap = 1052152kB
Free swap: 0kB
1122304 pages of RAM
892928 pages of HIGHMEM
83665 reserved pages
193554 pages shared
229999 pages swap cached
88879 pages dirty
324447 pages writeback
177 pages mapped
63174 pages slab
1671 pages pagetables
Out of Memory: Kill process 13310 (stress) score 17746 and children.
Out of memory: Killed process 13311 (stress).
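The summary counters at the end of the dump above are given in pages, which makes them hard to compare with the kB figures next to them. The sketch below converts a few of them, assuming the 4 kB page size of i386. The helper names are invented for illustration and the numbers are copied from the dump.

```python
PAGE_KB = 4  # assumed page size in kB (i386 without large pages)

def pages_to_kb(pages):
    """Convert a page count from the Mem-info dump to kB."""
    return pages * PAGE_KB

def swap_used_kb(total_kb, free_kb):
    """Swap in use, from the 'Total swap' and 'Free swap' lines."""
    return total_kb - free_kb

# Values copied from the OOM dump in this comment.
writeback_kb = pages_to_kb(324447)     # 'pages writeback'
dirty_kb = pages_to_kb(88879)          # 'pages dirty'
ram_kb = pages_to_kb(1122304)          # 'pages of RAM'
used_swap = swap_used_kb(1052152, 0)   # 'Free swap = 0kB'

print(f"writeback: {writeback_kb} kB ({writeback_kb / ram_kb:.0%} of RAM)")
print(f"dirty:     {dirty_kb} kB")
print(f"swap used: {used_swap} kB of 1052152 kB")
```

Under that page-size assumption, well over 1 GB of pages were under writeback while swap was completely used, which would be consistent with I/O to the swap device not completing, although the dump alone does not prove it.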
https://bugzilla.novell.com/show_bug.cgi?id=253968

teheo@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|teheo@novell.com            |kernel-maintainers@forge.provo.novell.com
https://bugzilla.novell.com/show_bug.cgi?id=253968

gregkh@novell.com changed:

           What    |Removed                                   |Added
----------------------------------------------------------------------------
         AssignedTo|kernel-maintainers@forge.provo.novell.com |npiggin@novell.com
https://bugzilla.novell.com/show_bug.cgi?id=253968

npiggin@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |andrea@novell.com

------- Comment #10 from npiggin@novell.com 2007-03-29 20:45 MST -------
So it looks like the correct process does get targeted for OOM kill; however, for some reason the system isn't recovering afterwards.

Andrea is looking at a similar problem, and it can get quite difficult if network filesystems are involved, but it sounds like you're just running on local filesystems, which should be much less prone to deadlock.

Can you get a sysrq+t trace after the OOM killer gets invoked and the system gets stuck?
https://bugzilla.novell.com/show_bug.cgi?id=253968

npiggin@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO
      Info Provider|                            |martin@tuxadero.com
https://bugzilla.novell.com/show_bug.cgi?id=253968

------- Comment #11 from martin@tuxadero.com 2007-04-04 08:02 MST -------
Created an attachment (id=128936)
 --> (https://bugzilla.novell.com/attachment.cgi?id=128936&action=view)
Output of the sysrq+t
https://bugzilla.novell.com/show_bug.cgi?id=253968

martin@tuxadero.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW
      Info Provider|martin@tuxadero.com         |

------- Comment #12 from martin@tuxadero.com 2007-04-04 08:13 MST -------
Hi,

I attached the needed information. We found a workaround that solves the problem for us: in our first setup we had the swap on an md device in a RAID1 configuration; now we use the two hard disk partitions directly as swap and the system runs stable.

The sysrq+t output is from the swap-on-md configuration.
https://bugzilla.novell.com/show_bug.cgi?id=253968

User npiggin@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=253968#c13

Nick Piggin <npiggin@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |andrea@kernel.org

--- Comment #13 from Nick Piggin <npiggin@novell.com> 2008-01-10 22:19:12 MST ---
It's interesting with this bug: some things (including the md_raid thread) are stuck in mempool_alloc -> page reclaim. I guess the system may not be stuck, just livelocked. And this could be a contributing factor in why systems sometimes go unresponsive when they are near OOM.

I think the mempool allocation code could be enhanced to do a fair (e.g. FIFO) distribution of pool elements to allocators. I also think it should probably not do much work in page reclaim, especially if it is having problems freeing stuff. Perhaps we don't even want it to enter page reclaim at all (but kick kswapd if needed).

On the other hand, it is nice to get some parallelism in page reclaim, and we don't necessarily want to leave that work up to the non-mempool allocs.

I don't really know, but it is something I'm thinking about...
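The idea of handing pool elements to waiters in FIFO order can be illustrated with a toy model. The Python sketch below is not kernel code and does not reflect how mempool is implemented; the class `FifoPool` and its methods are invented here purely to show the fairness property under discussion: when an element is returned, it goes to whichever allocator has been waiting longest.

```python
from collections import deque

class FifoPool:
    """Toy fixed-size pool that serves blocked allocators in arrival order.

    Illustration only: names and behaviour are assumptions, not the
    kernel's mempool API.
    """

    def __init__(self, elements):
        self.free = deque(elements)   # elements currently available
        self.waiters = deque()        # allocators waiting, oldest first
        self.granted = []             # (allocator, element) in grant order

    def alloc(self, who):
        """Give `who` an element now, or queue it behind earlier waiters."""
        if self.free and not self.waiters:
            elem = self.free.popleft()
            self.granted.append((who, elem))
            return elem
        self.waiters.append(who)
        return None

    def release(self, elem):
        """Return an element; the longest-waiting allocator gets it."""
        if self.waiters:
            who = self.waiters.popleft()
            self.granted.append((who, elem))
        else:
            self.free.append(elem)

pool = FifoPool(["e0"])
pool.alloc("md_raid1")   # gets e0 immediately
pool.alloc("pdflush")    # pool empty: queued first
pool.alloc("kswapd")     # pool empty: queued second
pool.release("e0")       # md_raid1 returns e0; it goes to pdflush
pool.release("e0")       # pdflush returns e0; it goes to kswapd
print(pool.granted)
```

The grant order follows arrival order, so no waiter can be starved by later arrivals; whether that is worth the extra bookkeeping in the real allocator is exactly the open question raised in the comment.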