[Bug 697699] New: Memory corruption after hibernate/resume on Thinkpad X61s
https://bugzilla.novell.com/show_bug.cgi?id=697699
https://bugzilla.novell.com/show_bug.cgi?id=697699#c0

           Summary: Memory corruption after hibernate/resume on Thinkpad X61s
    Classification: openSUSE
           Product: openSUSE 11.4
           Version: Final
          Platform: i686
        OS/Version: openSUSE 11.4
            Status: NEW
          Severity: Critical
          Priority: P5 - None
         Component: Kernel
        AssignedTo: kernel-maintainers@forge.provo.novell.com
        ReportedBy: ptesarik@novell.com
         QAContact: qa@suse.de
                CC: rgoldwyn@novell.com
          Found By: L3
           Blocker: ---

Symptoms:
Sometimes, after a hibernate/resume cycle, applications fail mysteriously, often with SIGSEGV or SIGILL.

Analysis:
A closer look revealed that some dynamic libraries were corrupted in the page cache. The corruption disappears, and I get correct file contents again, after I drop caches using:

    echo 3 > /proc/sys/vm/drop_caches

Other Info:
The disk was tested using SMART's long test. No errors found.
RAM was tested using memtest86+. Two successful passes.
openSUSE 11.3 had been running flawlessly for many months on the same HW.

All files were on the root partition (/dev/sda6). This is an ext3 filesystem, mounted like this:

    /dev/sda6 on / type ext3 (rw,relatime,errors=continue,user_xattr,acl,commit=15,barrier=1,data=writeback)

The machine is an Intel Core Duo with 2GB RAM, i.e. a 32-bit machine with some high memory.

Corruption traces:
All corrupted files so far follow a similar pattern:
- it's always the first 32 bytes in a page
- the correct contents are usually replaced with all zeroes
- one case was different, the corruption had this pattern:

    36b000 aa aa aa 00 aa aa aa 00 aa aa aa 00 aa aa aa 00
    36b010 aa aa aa 00 aa aa aa 00 aa aa aa 00 aa aa aa 00

Note that the corruption size is 32 bytes, but the L1 cache line size is 64 bytes, so it doesn't look like a problem with the CPU caches.

-- 
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
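The corruption pattern described in the report (only the first 32 bytes of a page differ) can be checked mechanically. The following is a minimal sketch, not the reporter's tooling: the function name and parameters are hypothetical, and it assumes you already have two byte strings, one read through the page cache and one known-good reference (for example, read after dropping caches or taken from the package).

```python
PAGE_SIZE = 4096   # x86 page size
HEAD_LEN = 32      # length of the corrupted prefix seen in the report


def corrupted_page_heads(cached, reference,
                         page_size=PAGE_SIZE, head_len=HEAD_LEN):
    """Return offsets of pages whose first head_len bytes differ
    from the reference while the rest of the page is identical."""
    hits = []
    for off in range(0, min(len(cached), len(reference)), page_size):
        got = cached[off:off + page_size]
        want = reference[off:off + page_size]
        head_differs = got[:head_len] != want[:head_len]
        tail_matches = got[head_len:] == want[head_len:]
        if head_differs and tail_matches:
            hits.append(off)
    return hits
```

Any offset it returns matches the "first 32 bytes in a page" signature; pages that differ elsewhere are deliberately not reported, since they would point to a different kind of problem.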
https://bugzilla.novell.com/show_bug.cgi?id=697699#c1

--- Comment #1 from Petr Tesařík <ptesarik@novell.com> 2011-06-02 14:38:59 UTC ---
Forgot uname -a output:

    Linux azariah 2.6.37.6-0.5-default #1 SMP 2011-04-25 21:48:33 +0200 i686 i686 i386 GNU/Linux
https://bugzilla.novell.com/show_bug.cgi?id=697699#c2

--- Comment #2 from Petr Tesařík <ptesarik@novell.com> 2011-06-02 17:21:39 UTC ---
I've just experienced another such corruption. The corrupted file was /usr/lib/libplc4.so, and drop_caches didn't help, because the corrupted page was still mapped into the address space of several processes (NetworkManager, openvpn, pidgin and ekiga).

I ran debugfs on /dev/sda6 to find the on-disk location of /usr/lib/libplc4.so:

    azariah:~ # debugfs /dev/sda6
    debugfs 1.41.14 (22-Dec-2010)
    debugfs: stat /usr/lib/libnspr4.so
    Inode: 1452959 Type: regular Mode: 0755 Flags: 0x0
    Generation: 2481375563 Version: 0x00000000
    User: 0 Group: 0 Size: 17992
    File ACL: 0 Directory ACL: 0
    Links: 1 Blockcount: 40
    Fragment: Address: 0 Number: 0 Size: 0
    ctime: 0x4dc0f174 -- Wed May 4 08:25:56 2011
    atime: 0x4de73193 -- Thu Jun 2 08:45:39 2011
    mtime: 0x4d642eb8 -- Tue Feb 22 22:46:32 2011
    Size of extra inode fields: 4
    BLOCKS:
    (0-4):5814738-5814742
    TOTAL: 5

The /dev/sda6 filesystem has 4K blocks, so the first block translates to byte offset 5814738*4096, or 0x58b9d2000. I ran hed on /dev/sda6 and saw:

    58b9d2000 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 :ELF:::.........
    58b9d2010 03 00 03 00 01 00 00 00 90 10 00 00 34 00 00 00 :.:.:...::..4...
    58b9d2020 e8 41 00 00 00 00 00 00 34 00 20 00 07 00 28 00 :A......4. .:.(.
    58b9d2030 1c 00 1b 00 01 00 00 00 00 00 00 00 00 00 00 00 :.:.:...........

Then I looked up where the file was mapped into the address space of process 3569 using /proc/3569/smaps:

    b5e47000-b5e4b000 r-xp 00000000 08:06 1452959 /usr/lib/libplc4.so

I could see that the in-memory copy is corrupted with gdb /proc/3569/exe 3569:

    (gdb) x/32x 0xb5e47000
    0xb5e47000: 0x00000000 0x00000000 0x00000000 0x00000000
    0xb5e47010: 0x00000000 0x00000000 0x00000000 0x00000000
    0xb5e47020: 0x000041e8 0x00000000 0x00200034 0x00280007
    ..
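The block-to-offset step above is plain arithmetic. As a small sketch (the helper name is made up for illustration; the block number and the 4K block size are the ones quoted in the comment):

```python
BLOCK_SIZE = 4096  # ext3 block size stated in the comment


def block_to_byte_offset(block, block_size=BLOCK_SIZE):
    """Byte offset on the block device of a filesystem block number."""
    return block * block_size


# First data block of the file, from the debugfs BLOCKS line.
offset = block_to_byte_offset(5814738)
print(hex(offset))  # 0x58b9d2000
```

The same offset can then be passed to any hex viewer or to a seek on the block device; reading the raw device bypasses the page cache entry of the file, which is why the on-disk copy still looked correct here.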
Next, I ran crash to find out where the page was located in physical memory:

    crash> vtop 0xb5e47000
    VIRTUAL PHYSICAL
    b5e47000 7ab3a000

    PAGE DIRECTORY: f2aac000
    PGD: f2aacb5c => 7d1c4067
    PMD: f2aacb5c => 7d1c4067
    PTE: 7d1c491c => 7ab3a005
    PAGE: 7ab3a000

    PTE PHYSICAL FLAGS
    7ab3a005 7ab3a000 (PRESENT|USER)

    VMA START END FLAGS FILE
    f2b388b4 b5e47000 b5e4b000 8000075 /usr/lib/libplc4.so

    PAGE PHYSICAL MAPPING INDEX CNT FLAGS
    f6900740 7ab3a000 f4229ee8 0 4 8004002c

I then looked up the physical memory offset (using /dev/crash, which is similar to /dev/mem but can map high memory), and indeed, the page is corrupted:

    007ab3a000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ...............
    007ab3a010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ...............
    007ab3a020 e8 41 00 00 00 00 00 00 34 00 20 00 07 00 28 00 :A......4. .:.(.
    007ab3a030 1c 00 1b 00 01 00 00 00 00 00 00 00 00 00 00 00 :.:.:..........
    ..

Last, I killed all processes using /usr/lib/libplc4.so, dropped the caches, and got the correct file contents with hed /usr/lib/libplc4.so:

    0000 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 :ELF:::.........
    0010 03 00 03 00 01 00 00 00 90 10 00 00 34 00 00 00 :.:.:...::..4...
    0020 e8 41 00 00 00 00 00 00 34 00 20 00 07 00 28 00 :A......4. .:.(.
    0030 1c 00 1b 00 01 00 00 00 00 00 00 00 00 00 00 00 :.:.:...........
    ..
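To make the vtop output above easier to follow, here is a rough sketch of how the PTE value splits into a physical page address and flag bits. It assumes classic 32-bit x86 paging without PAE and 4 KiB pages (which is what the identical PGD and PMD lines suggest); the function name and the flag table are illustrative, and only the three lowest flag bits are decoded.

```python
PAGE_MASK = 0xFFFFF000  # upper 20 bits of a non-PAE PTE: page frame address
PTE_FLAGS = {0x001: "PRESENT", 0x002: "RW", 0x004: "USER"}  # x86 PTE bits 0-2


def decode_pte(pte):
    """Split a 32-bit non-PAE PTE into (physical page address, flag names)."""
    phys = pte & PAGE_MASK
    flags = [name for bit, name in PTE_FLAGS.items() if pte & bit]
    return phys, flags


phys, flags = decode_pte(0x7AB3A005)
print(hex(phys), "|".join(flags))  # 0x7ab3a000 PRESENT|USER
```

The missing RW bit is expected for a read-only, executable library mapping (r-xp in smaps).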
https://bugzilla.novell.com/show_bug.cgi?id=697699#c3

Petr Tesařík <ptesarik@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|kernel-maintainers@forge.pr |rjw@novell.com
                   |ovo.novell.com              |

--- Comment #3 from Petr Tesařík <ptesarik@novell.com> 2011-06-02 17:31:12 UTC ---
Since the corruption doesn't happen on every hibernate/resume, I tried to find out what was different. It seems I had a good hibernate/resume cycle when swap was completely unused, but a corrupted cycle when some swap pages were actually used before hibernating.

Rafael, any idea?
https://bugzilla.novell.com/show_bug.cgi?id=697699#c4

--- Comment #4 from Petr Tesařík <ptesarik@novell.com> 2011-06-02 17:32:05 UTC ---
For the record, I've never experienced a corruption when suspending to RAM.
https://bugzilla.novell.com/show_bug.cgi?id=697699#c5

--- Comment #5 from Petr Tesařík <ptesarik@novell.com> 2011-06-06 09:05:30 UTC ---
I set up my system to boot a second OS and copy the swap partition to a file before resuming, so when I experienced another corruption today, I still had the contents of the suspend image. It turns out the corruption happens during resume:

libgtk-x11-2.0.so.0.2200.1 at 28f000 (from page cache):

    28f000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    28f010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    28f020 00 00 e8 a9 29 f5 ff 85 f6 74 1b 8b 16 85 d2 90 ..::)::::t::::::
    28f030 74 04 39 02 74 42 89 44 24 04 89 34 24 e8 ba 8a t:9:tB:D$::4$:::
    ..

disk contents:

    28f000 55 89 e5 57 56 53 83 ec 2c 8b 7d 08 e8 46 b6 dc U::WVS::,:}::F::
    28f010 ff 81 c3 e3 5f 1d 00 8b 75 0c 85 ff 0f 84 06 01 ::::_:.:u:::::::
    28f020 00 00 e8 a9 29 f5 ff 85 f6 74 1b 8b 16 85 d2 90 ..::)::::t::::::
    28f030 74 04 39 02 74 42 89 44 24 04 89 34 24 e8 ba 8a t:9:tB:D$::4$:::
    ..

PID 3298) has:

    b7324000-b7785000 r-xp 00000000 08:06 778148 /usr/lib/libgtk-x11-2.0.so.0.2200.1

=> the faulty page is at 0xb75b3000

    crash> vtop 0xb75b3000
    VIRTUAL PHYSICAL
    b75b3000 7ab38000
    ..

=> PFN = 0x7ab38

Suspend image:

    struct swsusp_info {
      ..
      image_pages = 0x2fb5a,
      pages = 0x2fc1a,
      ..

=> nr_meta_pages = 191 (0xbf)

PFN 0x7ab38 found at suspend image offset 0xb9094
map starts at 0x1000
=> page idx in suspend image is (0xb9094-0x1000)/4 = 0x2e025
first page data starts at 0xc0000
page is at 0xc0000+0x2e025000 = 0x2e0e5000

Suspend image contents:

    2e0e5000 55 89 e5 57 56 53 83 ec 2c 8b 7d 08 e8 46 b6 dc U::WVS::,:}::F::
    2e0e5010 ff 81 c3 e3 5f 1d 00 8b 75 0c 85 ff 0f 84 06 01 ::::_:.:u:::::::
    2e0e5020 00 00 e8 a9 29 f5 ff 85 f6 74 1b 8b 16 85 d2 90 ..::)::::t::::::
    2e0e5030 74 04 39 02 74 42 89 44 24 04 89 34 24 e8 ba 8a t:9:tB:D$::4$:::
    ..

This is the CORRECT page contents.

Since I used openSUSE 11.4's resume to extract and decompress the suspend image from the saved swap partition (only changing "/dev/snapshot" to a local file name and removing the IOCTLs), the same content was also written to the snapshot device during the real resume, hence something went wrong on the kernel side.

Rafael, I'm out of ideas here. What else can I do to debug this issue?
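The address arithmetic in the comment above can be replayed step by step. This is only a sketch of the calculation as described there: the function names are hypothetical, the 4-byte PFN entry size comes from the "/4" in the comment, and the relation pages = image_pages + nr_meta_pages + 1 (one extra page for the header) is an assumption that happens to reproduce the quoted 191.

```python
PAGE_SIZE = 4096
PFN_ENTRY_SIZE = 4  # bytes per PFN map entry, from the "/4" in the comment


def mapped_vaddr(vma_start, vma_file_offset, file_offset):
    """Virtual address of a file offset inside a mapping."""
    return vma_start + (file_offset - vma_file_offset)


def nr_meta_pages(pages, image_pages):
    """Assumed layout: total pages = data pages + meta pages + 1 header page."""
    return pages - image_pages - 1


def page_offset_in_image(entry_offset, map_start, data_start):
    """Offset of the page data whose PFN entry sits at entry_offset."""
    idx = (entry_offset - map_start) // PFN_ENTRY_SIZE
    return data_start + idx * PAGE_SIZE


vaddr = mapped_vaddr(0xB7324000, 0x0, 0x28F000)
pfn = 0x7AB38000 // PAGE_SIZE
meta = nr_meta_pages(0x2FC1A, 0x2FB5A)
data_start = (1 + meta) * PAGE_SIZE  # header page plus meta pages
where = page_offset_in_image(0xB9094, 0x1000, data_start)
print(hex(vaddr), hex(pfn), meta, hex(data_start), hex(where))
```

Each printed value can be compared against the corresponding "=>" line in the comment.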
https://bugzilla.novell.com/show_bug.cgi?id=697699#c6

--- Comment #6 from Petr Tesařík <ptesarik@novell.com> 2011-06-06 09:10:24 UTC ---
I forgot one more piece of information, it seems:

    struct image_header_info {
      pages = 0x2fc1a,
      flags = 0x12,
      ..
    }

The flags (0x12) translate to PLATFORM_SUSPEND | IMAGE_COMPRESSED.
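For readers who want to see how 0x12 breaks down into two flags, here is an illustrative decoder. The two bit values are an assumption: 0x12 has exactly two bits set (0x10 and 0x02), and the assignment of names to bits below should be checked against the flag definitions in the userspace suspend sources before relying on it.

```python
# Assumed bit values; only the two names quoted in the comment are listed.
ASSUMED_FLAGS = {
    0x02: "IMAGE_COMPRESSED",
    0x10: "PLATFORM_SUSPEND",
}


def decode_flags(flags):
    """Render a flags word as 'NAME | NAME', highest bit first.
    Bits without an assumed name are appended as a hex remainder."""
    names = [name
             for bit, name in sorted(ASSUMED_FLAGS.items(), reverse=True)
             if flags & bit]
    unknown = flags & ~sum(ASSUMED_FLAGS)
    if unknown:
        names.append(hex(unknown))
    return " | ".join(names)


print(decode_flags(0x12))  # PLATFORM_SUSPEND | IMAGE_COMPRESSED
```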
https://bugzilla.novell.com/show_bug.cgi?id=697699#c7

--- Comment #7 from Petr Tesařík <ptesarik@novell.com> 2011-06-06 10:18:28 UTC ---
Seems to be related to https://bugzilla.kernel.org/show_bug.cgi?id=13811

Yes, this Thinkpad uses Intel GM965 graphics (with the i915 driver).
https://bugzilla.novell.com/show_bug.cgi?id=697699#c8

Petr Tesařík <ptesarik@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED

--- Comment #8 from Petr Tesařík <ptesarik@novell.com> 2011-06-09 10:44:23 UTC ---
Rafael, since both upstream commits related to the closed kernel.org bug are applied in my kernel, it seems the bug was not solved. Would you mind re-opening the upstream bug? Or should I rather clone it, because another bug might have been fixed by those commits?

And, could you give a hint what I can do now to help with nailing down the bug?
https://bugzilla.novell.com/show_bug.cgi?id=697699#c9

--- Comment #9 from Petr Tesařík <ptesarik@novell.com> 2011-06-10 16:59:24 UTC ---
Never mind, I have cloned the upstream bug to https://bugzilla.kernel.org/show_bug.cgi?id=37142
https://bugzilla.novell.com/show_bug.cgi?id=697699#c10

--- Comment #10 from Rafael Wysocki <rjw@novell.com> 2011-06-10 21:29:20 UTC ---
Hi, sorry for the lack of response, I've been traveling (and having Internet access issues afterwards).

OK, so I think the problem is related to highmem and the i915 driver. Let's continue debugging in the kernel.org bug entry.
https://bugzilla.novell.com/show_bug.cgi?id=697699#c11

Rafael Wysocki <rjw@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |NEEDINFO
       InfoProvider|                            |ptesarik@novell.com

--- Comment #11 from Rafael Wysocki <rjw@novell.com> 2011-07-11 09:04:18 UTC ---
I have attached a patch for testing to BKO #37142, at:

https://bugzilla.kernel.org/attachment.cgi?id=65202

Please test and report back.
https://bugzilla.novell.com/show_bug.cgi?id=697699#c12

--- Comment #12 from Rafael Wysocki <rjw@suse.com> 2012-04-12 21:25:38 UTC ---
Well, no response.

A fix has been recently applied to the mainline kernel that may be related to this issue. Please let me know if you are still interested in tracking it.
https://bugzilla.novell.com/show_bug.cgi?id=697699#c13

Rafael Wysocki <rjw@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |RESOLVED
       InfoProvider|ptesarik@suse.com           |
         Resolution|                            |DUPLICATE

--- Comment #13 from Rafael Wysocki <rjw@suse.com> 2012-04-12 21:29:20 UTC ---
In fact, I think this bug is a duplicate of bnc#732908. I'm closing it as such, please reopen if that's not the case.

*** This bug has been marked as a duplicate of bug 732908 ***
http://bugzilla.novell.com/show_bug.cgi?id=732908