[opensuse-kernel] Spurious crash since 4.2 kernel series
Hi All,

I'm trying to understand why, since the 4.2.x series arrived, one of my machines keeps getting kernel crashes ... Today I was lucky enough to grab one trace.

Oct 04 18:39:59 yoda kernel: IP: [<ffffffffa082b4b6>] __nf_conntrack_alloc+0x76/0x320 [nf_conntrack]
Oct 04 18:39:59 yoda kernel: PGD 2219067 PUD 80a6af063 PMD 80a6b0063 PTE 800000017c56e161
Oct 04 18:39:59 yoda kernel: Oops: 0003 [#3] PREEMPT SMP
Oct 04 18:39:59 yoda kernel: Modules linked in: act_police cls_basic cls_flow cls_fw cls_u32 sch_fq_codel sch_tbf sch_prio sch_htb sch_hfsc sch_ingress sch_sfq ip6t_MASQUERADE nf_nat_masque
Oct 04 18:39:59 yoda kernel: nf_conntrack_proto_udplite nf_conntrack_proto_sctp nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_netbios_ns nf_conntrack_broadcast
Oct 04 18:39:59 yoda kernel: asus_wmi sparse_keymap rfkill kvm_amd kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw edac_core pcspkr gf
Oct 04 18:39:59 yoda kernel: raid6_pq raid10 raid1 raid0 md_mod dm_snapshot dm_bufio dm_mirror dm_region_hash dm_log dm_mod
Oct 04 18:39:59 yoda kernel: CPU: 0 PID: 3998 Comm: named Tainted: G D 4.2.3-1.gef1562d-default #1
Oct 04 18:39:59 yoda kernel: Hardware name: To be filled by O.E.M. To be filled by O.E.M./CROSSHAIR V FORMULA-Z, BIOS 2201 03/23/2015
Oct 04 18:39:59 yoda kernel: task: ffff8807dce660c0 ti: ffff8807dce3c000 task.ti: ffff8807dce3c000
Oct 04 18:39:59 yoda kernel: RIP: 0010:[<ffffffffa082b4b6>] [<ffffffffa082b4b6>] __nf_conntrack_alloc+0x76/0x320 [nf_conntrack]
Oct 04 18:39:59 yoda kernel: RSP: 0018:ffff8807dce3f908 EFLAGS: 00010282
Oct 04 18:39:59 yoda kernel: RAX: ffff88017c56e240 RBX: 0000000000000000 RCX: ffff8807dce3f9c8
Oct 04 18:39:59 yoda kernel: RDX: 0000000000000000 RSI: ffffe8ffffc13be8 RDI: 0000000000000202
Oct 04 18:39:59 yoda kernel: RBP: ffff8807dce3f948 R08: 0000000000000020 R09: 0000000099b8b092
Oct 04 18:39:59 yoda kernel: R10: 0000000000000024 R11: 000000001e5d2ac0 R12: ffffffff81ed3e40
Oct 04 18:39:59 yoda kernel: R13: ffff8807dce3f9a0 R14: ffff8807dce3f9c8 R15: ffff88017c56e240
Oct 04 18:39:59 yoda kernel: FS: 00007f8f203fd700(0000) GS:ffff88082ec00000(0000) knlGS:0000000000000000
Oct 04 18:39:59 yoda kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 04 18:39:59 yoda kernel: CR2: ffff88017c56e244 CR3: 00000007e626b000 CR4: 00000000000406f0
Oct 04 18:39:59 yoda kernel: Stack:
Oct 04 18:39:59 yoda kernel: ffff8807dce3f948 99b8b092a0829dfa ffff8807dce3f948 ffff8800c5b45b00
Oct 04 18:39:59 yoda kernel: 0000000000000002 ffffffff81ed3e40 ffffffffa07b6120 0000000000000000
Oct 04 18:39:59 yoda kernel: ffff8807dce3fa28 ffffffffa082be54 ffffffffa07b6120 ffffffffa083c260
Oct 04 18:39:59 yoda kernel: Call Trace:
Oct 04 18:39:59 yoda kernel: [<ffffffffa082be54>] nf_conntrack_in+0x6d4/0xc00 [nf_conntrack]
Oct 04 18:39:59 yoda kernel: [<ffffffffa07b4754>] ipv4_conntrack_local+0x54/0x60 [nf_conntrack_ipv4]
Oct 04 18:39:59 yoda kernel: [<ffffffff8159fa39>] nf_iterate+0x79/0x90
Oct 04 18:39:59 yoda kernel: [<ffffffff8159fabf>] nf_hook_slow+0x6f/0xc0
Oct 04 18:39:59 yoda kernel: [<ffffffff815ab1b5>] __ip_local_out_sk+0x95/0xa0
Oct 04 18:39:59 yoda kernel: [<ffffffff815ab1db>] ip_local_out_sk+0x1b/0x40
Oct 04 18:39:59 yoda kernel: [<ffffffff815acf4a>] ip_send_skb+0x1a/0x50
Oct 04 18:39:59 yoda kernel: [<ffffffff815d2bad>] udp_send_skb+0x9d/0x270
Oct 04 18:39:59 yoda kernel: [<ffffffff815d3af5>] udp_sendmsg+0x305/0x9e0
Oct 04 18:39:59 yoda kernel: [<ffffffff815e0fff>] inet_sendmsg+0x7f/0xb0
Oct 04 18:39:59 yoda kernel: [<ffffffff81550568>] sock_sendmsg+0x38/0x50
Oct 04 18:39:59 yoda kernel: [<ffffffff81550eea>] ___sys_sendmsg+0x29a/0x2b0
Oct 04 18:39:59 yoda kernel: [<ffffffff815518a2>] __sys_sendmsg+0x42/0x80
Oct 04 18:39:59 yoda kernel: [<ffffffff815518f2>] SyS_sendmsg+0x12/0x20
Oct 04 18:39:59 yoda kernel: [<ffffffff81667e32>] entry_SYSCALL_64_fastpath+0x16/0x75
Oct 04 18:39:59 yoda kernel: DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x16/0x75
Oct 04 18:39:59 yoda kernel:
Oct 04 18:39:59 yoda kernel: Leftover inexact backtrace:
Oct 04 18:39:59 yoda kernel: Code: d0 0f 82 4b 02 00 00 49 8b bc 24 38 0b 00 00 44 89 c6 44 89 4d cc e8 0a 9e 99 e0 48 85 c0 49 89 c7 44 8b 4d cc 0f 84 fc 00 00 00 <c7> 40 04 00 00 00 00 49
Oct 04 18:39:59 yoda kernel: RIP [<ffffffffa082b4b6>] __nf_conntrack_alloc+0x76/0x320 [nf_conntrack]
Oct 04 18:39:59 yoda kernel: RSP <ffff8807dce3f908>
Oct 04 18:39:59 yoda kernel: CR2: ffff88017c56e244
Oct 04 18:39:59 yoda kernel: ---[ end trace 4e6ecd64fa91ff32 ]---

To sum up, this configuration (13.1 + Kernel-Stable) worked perfectly until 4.2; with 4.1.6 there were no problems.

What I've found so far is that if I boot in rescue mode and don't activate any network interface, there is no crash. Unfortunately, this is my main server and router, so running without network is not an option :-))

All BIOS and firmware have been updated.

I've put the journald extract here:
https://dav.ioda.net/index.php/s/yDvDAJjahAyoX2n/download

The network setup is not very complicated:
  enp3s0 (motherboard internal e1000) is connected to the DSL router, with NAT
  enp12s0 (Intel PT Pro 1000) is connected to the LAN
  br0 (IPv4 and IPv6, routable /56) sits on top of enp12s0
  sit1 is a sixx IPv4/IPv6 tunnel to sixx.net

I've tried to set up kdump as described in the wiki, but never got a trace in /var/crash.

So any help finding the root cause, or at least advice on how best to write a good bug report at boo, would be really appreciated.

--
Bruno Friedmann
Ioda-Net Sàrl www.ioda-net.ch
openSUSE Member & Board, fsfe fellowship
GPG KEY : D5C9B751C4653227
irc: tigerfoot
--
To unsubscribe, e-mail: opensuse-kernel+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse-kernel+owner@opensuse.org
On Sun, Oct 04, 2015 at 07:20:19PM +0200, Bruno Friedmann wrote:
Hi All, I'm trying to understand why since the 4.2x series hit the kernel I've one machine getting kernel crash ...
Today I was enough lucky the grab one trace. ... Oct 04 18:39:59 yoda kernel: CPU: 0 PID: 3998 Comm: named Tainted: G D 4.2.3-1.gef1562d-default #1
In general, it's preferable to show the first oops, as the others may be just follow-ups. However, in this case the log shows that the first oops looks the same.
Oct 04 18:39:59 yoda kernel: RIP: 0010:[<ffffffffa082b4b6>] [<ffffffffa082b4b6>] __nf_conntrack_alloc+0x76/0x320 [nf_conntrack]
Oct 04 18:39:59 yoda kernel: RSP: 0018:ffff8807dce3f908 EFLAGS: 00010282
Oct 04 18:39:59 yoda kernel: RAX: ffff88017c56e240 RBX: 0000000000000000 RCX: ffff8807dce3f9c8
Oct 04 18:39:59 yoda kernel: RDX: 0000000000000000 RSI: ffffe8ffffc13be8 RDI: 0000000000000202
Oct 04 18:39:59 yoda kernel: RBP: ffff8807dce3f948 R08: 0000000000000020 R09: 0000000099b8b092
Oct 04 18:39:59 yoda kernel: R10: 0000000000000024 R11: 000000001e5d2ac0 R12: ffffffff81ed3e40
Oct 04 18:39:59 yoda kernel: R13: ffff8807dce3f9a0 R14: ffff8807dce3f9c8 R15: ffff88017c56e240
Oct 04 18:39:59 yoda kernel: FS: 00007f8f203fd700(0000) GS:ffff88082ec00000(0000) knlGS:0000000000000000
Oct 04 18:39:59 yoda kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 04 18:39:59 yoda kernel: CR2: ffff88017c56e244 CR3: 00000007e626b000 CR4: 00000000000406f0
This is strange... it happened here:

        ct = kmem_cache_alloc(net->ct.nf_conntrack_cachep, gfp);
        if (ct == NULL) {
                atomic_dec(&net->ct.count);
                return ERR_PTR(-ENOMEM);
        }
        spin_lock_init(&ct->lock);

Apparently ct is not null but points to an unmapped page. This looks like some corruption of the slab cache. You might try mainline commit

  9cf94eab8b30 netfilter: conntrack: use nf_ct_tmpl_free in CT/synproxy error paths
  (http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=9c...)

but I can't say if the problem addressed by it could cause this kind of outcome.

Michal Kubeček
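As a cross-check of the numbers in the trace, here is a minimal sketch in C that decodes the page-fault error code printed after "Oops:", the flag bits of the PTE printed on the "PGD ... PTE" line, and the distance between CR2 and a base register. The bit positions are the standard x86-64 ones; the struct and function names (fault_summary, summarize_fault) are made up for this illustration and are not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* x86 page-fault error code bits (the hex value printed after "Oops:"). */
#define PF_PROT   0x1u  /* set: fault on a present page; clear: page not present */
#define PF_WRITE  0x2u  /* set: the faulting access was a write */
#define PF_USER   0x4u  /* set: the access came from user mode */

/* x86-64 page-table entry flag bits. */
#define PTE_PRESENT   (UINT64_C(1) << 0)
#define PTE_WRITABLE  (UINT64_C(1) << 1)
#define PTE_NX        (UINT64_C(1) << 63)

/* Hypothetical helper type for this example, not a kernel structure. */
struct fault_summary {
    bool on_present_page;      /* from the error code */
    bool was_write;            /* from the error code */
    bool from_user;            /* from the error code */
    bool pte_present;          /* from the PTE */
    bool pte_writable;         /* from the PTE */
    bool pte_no_exec;          /* from the PTE */
    uint64_t offset_from_base; /* cr2 - base, e.g. CR2 - RAX */
};

/* Decode the values copied by hand from an oops report. */
static struct fault_summary summarize_fault(unsigned int error_code, uint64_t pte,
                                            uint64_t cr2, uint64_t base)
{
    struct fault_summary s;

    assert(cr2 >= base); /* the offset is only meaningful for cr2 >= base */

    s.on_present_page  = (error_code & PF_PROT) != 0;
    s.was_write        = (error_code & PF_WRITE) != 0;
    s.from_user        = (error_code & PF_USER) != 0;
    s.pte_present      = (pte & PTE_PRESENT) != 0;
    s.pte_writable     = (pte & PTE_WRITABLE) != 0;
    s.pte_no_exec      = (pte & PTE_NX) != 0;
    s.offset_from_base = cr2 - base;
    return s;
}
```

Fed with the values from the trace (error code 0003, PTE 800000017c56e161, CR2 ffff88017c56e244, RAX ffff88017c56e240), this reports a kernel-mode write to a page whose PTE is present but not writable, 4 bytes past RAX. That would fit the marked instruction bytes `c7 40 04 00 00 00 00` (a 32-bit store of 0 to 4(%rax)), and it may be a useful detail for a bug report, but it does not by itself say how the page ended up read-only.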
On Monday 05 October 2015 14.59:38 Michal Kubecek wrote:
This is strange... it happened here:
        ct = kmem_cache_alloc(net->ct.nf_conntrack_cachep, gfp);
        if (ct == NULL) {
                atomic_dec(&net->ct.count);
                return ERR_PTR(-ENOMEM);
        }
        spin_lock_init(&ct->lock);
Apparently ct is not null but points to an unmapped page. This looks like some corruption of the slab cache. You might try mainline commit
9cf94eab8b30 netfilter: conntrack: use nf_ct_tmpl_free in CT/synproxy error paths (http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=9c...)
but I can't say if the problem addressed by it could cause this kind of outcome.
Michal Kubeček
Thanks for the pointer. The error

  nf_conntrack: table full, dropping packet

was seen a lot with the 4.2.x kernels I've tried when starting the server normally. Here the crash happened in named, which was the only network service running.

So the good news is that there's already a fix, and somebody else has seen that kind of problem. I'm not sure I'll be able to test it before next weekend (that includes rebuilding the kernel). If this commit makes its way into 4.2.4 in the meantime, I can wait until it hits build.o.o.

Do you think it is still worthwhile to report it upstream, or at least to open a bug on b.o.o (just to keep a pointer too)?

--
Bruno Friedmann
Ioda-Net Sàrl www.ioda-net.ch
openSUSE Member & Board, fsfe fellowship
GPG KEY : D5C9B751C4653227
irc: tigerfoot
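Since the "table full, dropping packet" message relates to the number of tracked connections reaching the configured limit, it may help to record both values alongside a bug report. The following is only a sketch: it assumes the usual sysctl files /proc/sys/net/netfilter/nf_conntrack_count and /proc/sys/net/netfilter/nf_conntrack_max are present (they only exist while nf_conntrack is loaded), and the helper names are invented for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed locations of the conntrack counters on a running system. */
#define CT_COUNT_PATH "/proc/sys/net/netfilter/nf_conntrack_count"
#define CT_MAX_PATH   "/proc/sys/net/netfilter/nf_conntrack_max"

/*
 * Read one unsigned decimal value from a sysctl-style file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
static int read_ulong_file(const char *path, unsigned long *out)
{
    FILE *f;
    int parsed;

    assert(path != NULL && out != NULL);

    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    parsed = (fscanf(f, "%lu", out) == 1);
    fclose(f);
    return parsed ? 0 : -1;
}

/*
 * Share of the conntrack table in use, in percent.
 * Returns a negative value when max is 0, since no ratio can be computed.
 */
static double conntrack_usage_percent(unsigned long count, unsigned long max)
{
    if (max == 0)
        return -1.0;
    return 100.0 * (double)count / (double)max;
}
```

A caller would read CT_COUNT_PATH and CT_MAX_PATH with read_ulong_file() and pass both values to conntrack_usage_percent(); a result that stays close to 100 while only named is running would be worth mentioning in the report.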
participants (2): Bruno Friedmann, Michal Kubecek