[Bug 296242] New: resume application triggers kernel bug in kernel/power/snapshot.c:464
https://bugzilla.novell.com/show_bug.cgi?id=296242#c1

Summary: resume application triggers kernel bug in kernel/power/snapshot.c:464
Product: openSUSE 10.3
Version: Alpha 6
Platform: x86-64
OS/Version: Other
Status: NEW
Severity: Critical
Priority: P5 - None
Component: Kernel
AssignedTo: pavel@novell.com
ReportedBy: trenn@novell.com
QAContact: qa@suse.de
CC: hare@novell.com, seife@novell.com, agraf@novell.com
Found By: Development

After successful installation, the rebooted kernel hangs. It looks like the resume program, which tries to find a resume image in swap, is causing this. Here is the log:

running boot/05-usb.sh
preping boot/08-shell.sh
------------[ cut here ]------------
kernel BUG at kernel/power/snapshot.c:464!
invalid opcode: 0000 [1] SMP
last sysfs file: /kernel/uevent_seqnum
CPU 3
Modules linked in: ohci_hcd sd_mod usbcore edd reiserfs fan thermal processor sata_sil pata_amd libata scsi_mod
Pid: 1039, comm: resume Tainted: G N 2.6.22-5-default #1
RIP: 0010:[<ffffffff80250a06>] [<ffffffff80250a06>] memory_bm_find_bit+0x20/0x78
RSP: 0018:ffff8100f6781d98 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff810000003480 RCX: ffff8100f6781dac
RDX: ffff8100f6781da0 RSI: 00000000000f7ff0 RDI: ffff8101fc6c6300
RBP: 00000000000f7ff1 R08: ffff8100f6781da0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8101fc6c6300
R13: ffff8101fbf61a48 R14: ffff8101fc68c280 R15: ffff8101fac8dd80
FS: 00000000006d5850(0063) GS:ffff8101fbc4a8c0(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000043c590 CR3: 00000001fa89a000 CR4: 00000000000006e0
Process resume (pid: 1039, threadinfo ffff8100f6780000, task ffff8100f69d8040)
Stack: ffffffff80250a8f ffff8101fadeb018 0000003ffac8dd80 ffffffff8025170f
 ffff8101fc68c280 0000000000000000 ffff8101fbf61a48 ffffffff80252bff
 ffffffff80427140 0000000000000000 ffff8101fc68c280 ffffffff8035238f
Call Trace:
 [<ffffffff80250a8f>] memory_bm_set_bit+0x11/0x20
 [<ffffffff8025170f>] create_basic_memory_bitmaps+0xd7/0x128
 [<ffffffff80252bff>] snapshot_open+0x60/0xff
 [<ffffffff8035238f>] misc_open+0x144/0x1bd
 [<ffffffff8028b6e8>] chrdev_open+0x157/0x17c
 [<ffffffff8028b591>] chrdev_open+0x0/0x17c
 [<ffffffff8028757e>] __dentry_open+0xd9/0x1aa
 [<ffffffff80287706>] do_filp_open+0x2d/0x3d
 [<ffffffff8028775a>] do_sys_open+0x44/0xc1
 [<ffffffff80209c2e>] system_call+0x7e/0x83
Code: 0f 0b eb fe 48 3b 70 08 72 ee 48 3b 70 10 73 e8 48 89 47 10
RIP [<ffffffff80250a06>] memory_bm_find_bit+0x20/0x78
 RSP <ffff8100f6781d98>
running boot/08-shell.sh
preping boot/11-devinit_done.sh
running boot/11-devAttempting manual resume
------------------------

Unpacking the initrd reveals that the culprit is triggered by 04-udev.sh, in udev_discover_resume().

Can someone tell me how the variables in the boot/XY_name.sh scripts are passed, please? For example, [ "$resume_mode" != "off" ] is probably set because no resume= boot parameter was there, but where do the parameters get parsed, in init itself?

Removing resume=/dev/sda2 lets the machine boot. Adding it again should force the machine into the hang state again (I could always reproduce the hang before removing resume=).

I wonder why this only happens on this machine; another one booted fine.

-- 
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=296242#c2
--- Comment #2 from Stefan Seyfried
https://bugzilla.novell.com/show_bug.cgi?id=296242#c3
--- Comment #3 from Stefan Seyfried
https://bugzilla.novell.com/show_bug.cgi?id=296242#c4
--- Comment #4 from Thomas Renninger
So could you try a suspend.rpm from alpha5 or alpha4?
Yep, but give me some time, please... If you have some spare time :) you can just use berg and try it yourself (ping me in chat if you run into any problem).
https://bugzilla.novell.com/show_bug.cgi?id=296242#c5
--- Comment #5 from Rafael Wysocki
https://bugzilla.novell.com/show_bug.cgi?id=296242#c6
--- Comment #6 from Thomas Renninger
https://bugzilla.novell.com/show_bug.cgi?id=296242#c7
--- Comment #7 from Rafael Wysocki
https://bugzilla.novell.com/show_bug.cgi?id=296242#c8
Thomas Renninger
https://bugzilla.novell.com/show_bug.cgi?id=296242#c9
--- Comment #9 from Thomas Renninger
https://bugzilla.novell.com/show_bug.cgi?id=296242#c10
--- Comment #10 from Rafael Wysocki
Forgot to mention, that the machine is booting fine now.
Good.
Shouldn't this check be added anyway?
Yes, it should. I'm sending the patch upstream in a while.
Beside the fact that we should also dig why this happens at all.
Well, I was afraid that we had a more serious bug and the patch wouldn't work, but since it works, I think I understand what happens. :-) Namely, you have the reserved memory between 00000000fbf70000 and 0000000100000000, which most probably doesn't belong to any zone. Now, x86_64's e820_mark_nosave_regions() registers all PFNs lower than end_pfn, because at this point we can't tell which of them will be regarded as valid. Thus, we should verify them before trying to set the corresponding bits in the memory bitmap created by the hibernation core. I wonder if hibernation works on this box, though.
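The failure and the fix described above can be modelled in a few lines. This is a toy sketch in Python, not kernel code: MemoryBitmap, covers() and mark_nosave() are hypothetical names, 4 KiB pages are assumed, and the zone layout (a single zone ending exactly where the reserved range begins) is an assumption made only for illustration.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages, as on x86-64


class MemoryBitmap:
    """Toy stand-in for the hibernation core's memory bitmap (hypothetical)."""

    def __init__(self, zones):
        # zones: half-open (start_pfn, end_pfn) ranges the bitmap can address
        self.zones = list(zones)
        self.bits = set()

    def covers(self, pfn):
        """Return True if pfn lies inside one of the bitmap's zones."""
        return any(start <= pfn < end for start, end in self.zones)

    def set_bit(self, pfn):
        # Models the failure in the oops: a PFN outside every zone is fatal.
        if not self.covers(pfn):
            raise RuntimeError(f"BUG: pfn {pfn:#x} is outside every zone")
        self.bits.add(pfn)


def mark_nosave(bitmap, start_pfn, end_pfn, check_valid):
    """Mark [start_pfn, end_pfn) as nosave; return how many bits were set."""
    marked = 0
    for pfn in range(start_pfn, end_pfn):
        if check_valid and not bitmap.covers(pfn):
            continue  # the added check: silently skip PFNs outside every zone
        bitmap.set_bit(pfn)
        marked += 1
    return marked


# Reserved range quoted in the thread, converted from addresses to PFNs.
start_pfn = 0xfbf70000 >> PAGE_SHIFT
end_pfn = 0x100000000 >> PAGE_SHIFT
bitmap = MemoryBitmap([(0, start_pfn)])  # assumed zone layout

try:
    mark_nosave(bitmap, start_pfn, end_pfn, check_valid=False)
    unchecked_failed = False
except RuntimeError:
    unchecked_failed = True

marked_with_check = mark_nosave(bitmap, start_pfn, end_pfn, check_valid=True)
```

Under these assumptions the unchecked call fails on the first PFN of the reserved range, while the checked call skips the whole range and sets no bits.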
https://bugzilla.novell.com/show_bug.cgi?id=296242#c11
--- Comment #11 from Thomas Renninger
https://bugzilla.novell.com/show_bug.cgi?id=296242#c12
--- Comment #12 from Rafael Wysocki
It's an 8-core server, I doubt it makes sense to try...
Well, in that case the "resume=" command line thing is useless. :-)
https://bugzilla.novell.com/show_bug.cgi?id=296242#c13
--- Comment #13 from Thomas Renninger
https://bugzilla.novell.com/show_bug.cgi?id=296242#c14
Thomas Renninger
https://bugzilla.novell.com/show_bug.cgi?id=296242#c15
--- Comment #15 from Andi Kleen
Is this a NUMA system? Yes. How important is the need to look at why resume/suspend tries to access invalid mem (it normally should never try to?). Doing a suspend on NUMA systems is not really that important, but there is still a bug sitting around there -> lower severity to normal (will add the
https://bugzilla.novell.com/show_bug.cgi?id=296242#c16
Thomas Renninger
https://bugzilla.novell.com/show_bug.cgi?id=296242#c17
--- Comment #17 from Rafael Wysocki
I gave a suspend a try and again got thousands of lines: "We hit invalid memory" (Which as I said I added to an else path of your patch)
This is normal and the "We hit invalid memory" is not needed (unless you want your logs to be spammed ;-)). As I said in Comment #10, the PFNs that trigger this line are regarded as "nosave" and are to be marked as such by the hibernation core ("nosave" PFNs are treated by the hibernation core as inaccessible). However, if they are invalid, they won't be used anyway. The _only_ bug here was that we didn't check if the PFN was actually valid before marking it as inaccessible to the hibernation core, but it wouldn't be used in either case.
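The argument above, that skipping invalid PFNs changes nothing for the snapshot, can be checked with a small set comparison. This is a sketch under stated assumptions: is_valid() and saveable() are hypothetical helpers, 4 KiB pages are assumed, and the model treats every PFN in the reserved range from Comment #10 as invalid, which the thread only describes as most probable.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def is_valid(pfn):
    # Hypothetical layout: everything below the reserved range is valid.
    return pfn < (0xfbf70000 >> PAGE_SHIFT)


# PFNs of the reserved range quoted in Comment #10.
reserved = range(0xfbf70000 >> PAGE_SHIFT, 0x100000000 >> PAGE_SHIFT)

# Conceptual "mark everything" set versus the set after the validity check.
nosave_unchecked = set(reserved)
nosave_checked = {pfn for pfn in reserved if is_valid(pfn)}


def saveable(window, nosave):
    """PFNs the snapshot would consider: valid and not marked nosave."""
    return [pfn for pfn in window if is_valid(pfn) and pfn not in nosave]


# Compare both variants over a window that straddles the boundary.
window = range(0xfbf00, 0x100000)
same_result = saveable(window, nosave_unchecked) == saveable(window, nosave_checked)

# One log line per invalid PFN if the else path prints a message.
invalid_hits = sum(1 for pfn in reserved if not is_valid(pfn))
```

In this model same_result is True, and invalid_hits is 16528, which is consistent with the "thousands of lines" reported when a message is printed for each skipped PFN.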
Last lines I see on serial console after that:
We hit invalid memory
We hit invalid memory
swsusp: Basic memory bitmaps created
Bswsusp: Basic memory bitmaps freed
then the console is back without doing a suspend, looks functional...
OK
https://bugzilla.novell.com/show_bug.cgi?id=296242#c18
--- Comment #18 from Rafael Wysocki
Is this a NUMA system? Yes. How important is the need to look at why resume/suspend tries to access invalid mem (it normally should never try to?). Doing a suspend on NUMA systems is not really that important, but there is still a bug sitting around there -> lower severity to normal (will add the patch later today).
Please see Comment #17: the bug has been fixed by the patch from Comment #7, the rest is just normal behavior.
https://bugzilla.novell.com/show_bug.cgi?id=296242#c19
Thomas Renninger
https://bugzilla.novell.com/show_bug.cgi?id=296242#c20
--- Comment #20 from Thomas Renninger
https://bugzilla.novell.com/show_bug.cgi?id=296242#c21
--- Comment #21 from Rafael Wysocki