[Bug 1160019] New: "Metadata corruption detected at xfs_inode_buf_verify"
http://bugzilla.suse.com/show_bug.cgi?id=1160019

Bug ID: 1160019
Summary: "Metadata corruption detected at xfs_inode_buf_verify"
Classification: openSUSE
Product: openSUSE Distribution
Version: Leap 15.1
Hardware: Other
OS: Other
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Kernel
Assignee: kernel-maintainers@forge.provo.novell.com
Reporter: martin.wilck@suse.com
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

This happened on a fresh Leap 15.1 installation on a laptop, SATA SSD, /home file system. Kernel 4.12.14-lp151.28.36-default.
Dec 28 19:02:47 pallas kernel: XFS (dm-0): Metadata corruption detected at xfs_inode_buf_verify+0x72/0xf0 [xfs], xfs_inode block 0x4b860
Dec 28 19:02:47 pallas kernel: XFS (dm-0): Unmount and run xfs_repair
Dec 28 19:02:47 pallas kernel: XFS (dm-0): First 64 bytes of corrupted metadata buffer:
Dec 28 19:02:47 pallas kernel: ffff8801f164f000: 49 4e 41 ed 03 01 00 00 00 00 04 58 00 00 00 64 INA........X...d
Dec 28 19:02:47 pallas kernel: ffff8801f164f010: 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00 ................
Dec 28 19:02:47 pallas kernel: ffff8801f164f020: 5e 01 34 bd 25 b4 38 63 5d ff e7 a6 01 cc 26 9e ^.4.%.8c].....&.
Dec 28 19:02:47 pallas kernel: ffff8801f164f030: 5d ff e7 a6 01 cc 26 9e 00 00 00 00 00 00 00 34 ].....&........4
The same message is repeated a few times; eventually the FS is mounted r/o, and all kinds of weird failures result. AFAICS from the code in fs/xfs/libxfs/xfs_inode_buf.c, the 15.1 kernel tests nothing but the magic number (bytes 0-1, 49 4e) and the version (byte 4, 03), so the dumped pattern should be correct. But this is only the first inode in the buffer, so subsequent inodes might be bad. However, I dumped a full 4k block at offset 0x4b860, and all inode headers seem to be OK so far.

The FS couldn't be unmounted, so I rebooted into single-user mode, unmounted /home, and ran xfs_repair, which showed nothing of interest:
Phase 1 - find and verify superblock...
        - block cache size set to 374008 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 55736 tail block 55736
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 3
        - agno = 2
        - agno = 1
Phase 5 - rebuild AG headers and trees...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
XFS_REPAIR Summary Sat Dec 28 21:02:45 2019
Phase           Start           End             Duration
Phase 1:        12/28 21:02:44  12/28 21:02:45  1 second
Phase 2:        12/28 21:02:45  12/28 21:02:45
Phase 3:        12/28 21:02:45  12/28 21:02:45
Phase 4:        12/28 21:02:45  12/28 21:02:45
Phase 5:        12/28 21:02:45  12/28 21:02:45
Phase 6:        12/28 21:02:45  12/28 21:02:45
Phase 7:        12/28 21:02:45  12/28 21:02:45

Total run time: 1 second
done
A very similar failure was observed some days later, after only a couple of minutes of uptime.
Jan 02 10:35:10 pallas.mittagstun.de kernel: XFS (dm-0): Metadata corruption detected at xfs_inode_buf_verify+0x72/0xf0 [xfs], xfs_inode block 0x746b180
Jan 02 10:35:10 pallas.mittagstun.de kernel: XFS (dm-0): Unmount and run xfs_repair
Jan 02 10:35:10 pallas.mittagstun.de kernel: XFS (dm-0): First 64 bytes of corrupted metadata buffer:
Jan 02 10:35:11 pallas.mittagstun.de kernel: ffff8801dd24d000: 49 4e 81 a4 03 02 00 00 00 00 04 58 00 00 00 64 IN.........X...d
Jan 02 10:35:11 pallas.mittagstun.de kernel: ffff8801dd24d010: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 ................
Jan 02 10:35:11 pallas.mittagstun.de kernel: ffff8801dd24d020: 5e 07 ba 06 1c 4b f6 68 5e 07 ba 04 18 7a e7 4b ^....K.h^....z.K
Jan 02 10:35:11 pallas.mittagstun.de kernel: ffff8801dd24d030: 5e 07 ba 04 18 7a e7 4b 00 00 00 00 00 00 09 e8 ^....z.K........
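The report states that the 15.1 kernel's verifier looks only at the magic number and the version byte. As a rough cross-check, the following sketch decodes the first 20 bytes of both dumps. Only the positions of the magic (bytes 0-1) and the version (byte 4) come from the report; the rest of the field layout (mode, format, uid, gid, nlink) and the accepted version range 1..3 are assumptions based on the documented XFS on-disk inode layout, and the helper names are hypothetical.

```python
import struct

XFS_DINODE_MAGIC = 0x494E  # ASCII "IN", bytes 0-1 of every on-disk inode

# First 20 bytes of the two dumps from the kernel log above.
DUMP_DEC28 = bytes.fromhex(
    "49 4e 41 ed 03 01 00 00 00 00 04 58 00 00 00 64 00 00 00 02")
DUMP_JAN02 = bytes.fromhex(
    "49 4e 81 a4 03 02 00 00 00 00 04 58 00 00 00 64 00 00 00 01")


def decode_header(raw: bytes) -> dict:
    """Decode the leading inode fields (big-endian on disk).

    Assumed layout: magic(2) mode(2) version(1) format(1) onlink(2)
    uid(4) gid(4) nlink(4).
    """
    magic, mode, version, fmt, _onlink, uid, gid, nlink = struct.unpack_from(
        ">HHBBHIII", raw, 0)
    return {"magic": magic, "mode": mode, "version": version,
            "format": fmt, "uid": uid, "gid": gid, "nlink": nlink}


def header_passes(hdr: dict) -> bool:
    """Mimic the two checks named in the report: magic and version.

    The accepted version range 1..3 is an assumption.
    """
    return hdr["magic"] == XFS_DINODE_MAGIC and 1 <= hdr["version"] <= 3


if __name__ == "__main__":
    for name, raw in (("Dec 28", DUMP_DEC28), ("Jan 02", DUMP_JAN02)):
        hdr = decode_header(raw)
        print(name, "passes" if header_passes(hdr) else "FAILS",
              "mode", oct(hdr["mode"]), "version", hdr["version"],
              "uid", hdr["uid"], "gid", hdr["gid"], "nlink", hdr["nlink"])
```

Under those assumptions both headers pass: the first decodes as a directory (mode 040755, nlink 2) and the second as a regular file (mode 0100644, nlink 1), both owned by uid 1112, gid 100. This agrees with the observation that the dumped first inode looks valid.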
I wonder if this is an indication of corrupt hardware, and if yes, which component (SSD? CPU? Memory? Other?). The SSD in the laptop is brand new; the respective statement by the vendor is corroborated by SMART data (power-on hours). I have run a simple "surface test" (write pattern & verify) on the affected logical volume with fio, and found no errors (only a single test pass thus far). Perhaps noteworthy: I also saw some BTRFS checksum errors in the logs (from the root file system):
Dec 28 19:04:17 pallas kernel: BTRFS warning (device sda2): sda2 checksum verify failed on 158384128 wanted D1502CDD found 7F3B5CF3 level 0
Jan 02 10:18:00 pallas.mittagstun.de kernel: BTRFS warning (device sda2): csum failed root 267 ino 232862 off 16384 csum 0xed735e63 expected csum 0x7e1dece1 mirror 1
Jan 02 10:28:27 pallas.mittagstun.de kernel: BTRFS warning (device sda2): csum failed root 267 ino 248644 off 0 csum 0x347357c0 expected csum 0x2e8516b0 mirror 1
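The write-and-verify "surface test" mentioned above was run with fio on the logical volume. The idea behind such a pass can be sketched as follows. This is not what fio does internally; it is a minimal illustration that works on a scratch file, with hypothetical function names and an arbitrarily chosen pattern generator. Note that a read-back right after writing may be served from the page cache rather than from the device unless caches are dropped or direct I/O is used.

```python
import hashlib
import os

BLOCK = 4096  # bytes per test block (arbitrary choice for this sketch)


def pattern_block(index: int) -> bytes:
    """Return a deterministic block whose content depends on its index."""
    seed = hashlib.sha256(index.to_bytes(8, "big")).digest()  # 32 bytes
    return seed * (BLOCK // len(seed))


def write_pattern(path: str, nblocks: int) -> None:
    """Write nblocks pattern blocks to path and flush them to stable storage."""
    with open(path, "wb") as f:
        for i in range(nblocks):
            f.write(pattern_block(i))
        f.flush()
        os.fsync(f.fileno())


def verify_pattern(path: str, nblocks: int) -> list:
    """Re-read path and return the indices of blocks that do not match."""
    bad = []
    with open(path, "rb") as f:
        for i in range(nblocks):
            if f.read(BLOCK) != pattern_block(i):
                bad.append(i)
    return bad
```

A clean pass returns an empty list from `verify_pattern`. Because each block's content depends on its index, a block that lands at the wrong offset would be flagged as well as one with flipped bits.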
http://bugzilla.suse.com/show_bug.cgi?id=1160019#c1
--- Comment #1 from Martin Wilck
http://bugzilla.suse.com/show_bug.cgi?id=1160019#c2
--- Comment #2 from Martin Wilck
http://bugzilla.suse.com/show_bug.cgi?id=1160019#c3
Martin Wilck
http://bugzilla.suse.com/show_bug.cgi?id=1160019#c4
--- Comment #4 from Martin Wilck
Dec 28 19:02:47 pallas kernel: XFS (dm-0): metadata I/O error: block 0x4b860 ("xfs_trans_read_buf_map") error 117 numblks 32
Dec 28 19:02:47 pallas kernel: XFS (dm-0): xfs_imap_to_bp: xfs_trans_read_buf() returned error -117.
Dec 28 19:02:47 pallas kernel: XFS (dm-0): xfs_do_force_shutdown(0x8) called from line 3496 of file ../fs/xfs/xfs_inode.c. Return address = 0xffffffffa0c25c1a
Dec 28 19:02:47 pallas kernel: XFS (dm-0): Corruption of in-memory data detected. Shutting down filesystem
Dec 28 19:02:47 pallas kernel: XFS (dm-0): Please umount the filesystem and rectify the problem(s)
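The log above reports "numblks 32" for the failing buffer. Assuming these are 512-byte XFS basic blocks, the buffer is 16 KiB, so a 4k dump would cover only its first quarter. The sketch below reads the whole buffer and applies the magic/version check to every inode slot. The 512-byte inode size, the accepted version range 1..3, and the function names are assumptions for illustration; the real inode size should be taken from the superblock (for example via xfs_info).

```python
import struct

BBSIZE = 512                # assumed size of an XFS "basic block"
XFS_DINODE_MAGIC = 0x494E   # ASCII "IN"


def read_inode_buffer(device: str, daddr: int, numblks: int) -> bytes:
    """Read numblks basic blocks starting at basic-block address daddr.

    Needs read permission on the block device (typically root).
    """
    with open(device, "rb") as f:
        f.seek(daddr * BBSIZE)
        return f.read(numblks * BBSIZE)


def bad_inode_offsets(buf: bytes, inode_size: int = 512) -> list:
    """Return byte offsets of inode slots failing the magic/version check."""
    bad = []
    for off in range(0, len(buf) - inode_size + 1, inode_size):
        magic, _mode, version = struct.unpack_from(">HHB", buf, off)
        if magic != XFS_DINODE_MAGIC or not 1 <= version <= 3:
            bad.append(off)
    return bad


if __name__ == "__main__":
    # Values taken from the log above; the device name is the one it reports.
    buf = read_inode_buffer("/dev/dm-0", 0x4b860, 32)
    print("failing inode slots at byte offsets:", bad_inode_offsets(buf))
```

With 512-byte basic blocks, block 0x4b860 corresponds to byte offset 158384128 on dm-0. An empty result from `bad_inode_offsets` on data read back from disk would suggest that the on-disk copy is intact and that the buffer was only bad in memory at the time of the check.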
http://bugzilla.suse.com/show_bug.cgi?id=1160019#c5
--- Comment #5 from Martin Wilck
http://bugzilla.suse.com/show_bug.cgi?id=1160019#c6
--- Comment #6 from Martin Wilck
http://bugzilla.suse.com/show_bug.cgi?id=1160019#c7
Anthony Iliopoulos
http://bugzilla.suse.com/show_bug.cgi?id=1160019#c8
--- Comment #8 from Martin Wilck
http://bugzilla.suse.com/show_bug.cgi?id=1160019#c9
Anthony Iliopoulos
> Anthony, thanks. Sorry, I seem to have lost the xfs_metadump output :-(
No worries. The next time it recurs, you can also grab it from the live mounted partition (xfs_metadump -f /dev/dm-0 /tmp/xfs.md should do the trick). I expect that the inode metadata in the dump will be intact (and that it could be fully CRC-checked in xfs_db).
> xfs_scrub is available on Leap, but it doesn't start ("Kernel metadata scrubbing facility is not available."). I guess it would need a kernel compiled with different options.
Right, you'd need to recompile with CONFIG_XFS_ONLINE_SCRUB=y for that functionality. It actually works (won't really eat your data :), but of course xfs upstream is quite conservative about graduating the feature from experimental, so we still keep it disabled/unsupported by default but ship the userspace utils, as interested users may want to experiment with a custom kernel.
> Wrt xfs_repair, the FS had been automatically mounted during boot (that happens for /home with systemd). I saw just ordinary journal replay during mount/umount, nothing out of the ordinary, no errors logged. That made me think that I might be facing a memory corruption problem.
Indeed, it doesn't seem like a consistency issue, and the kernel message "Unmount and run xfs_repair" is really too generic a piece of advice; it wouldn't help with this kind of metadata corruption.
> What I did, meanwhile, was run memtest86 on the system, which passed 4x without errors. Then I made a write/verify test on the entire LV using fio, which would of course overwrite the XFS file system that had shown the issues before. That test completed with no errors, too.
Yes, I was about to suggest memtest86 earlier, but it is still very curious to me that the hexdump (which is dumped from the in-memory inode buffer) is really valid. Wild guess: it could be that when in xorg mode the gpu acceleration is producing too much heat (I had a couple of laptops like that) and overheating/impacting the cpu. I'm really not sure how this could happen other than through bitflips in CPU registers or caches during the comparisons (which I would assume are so rare and random that they wouldn't occur only in the same exact metadata check).
> Finally I created a new XFS on the same LV, copied back all the data (~50GB, mostly photos and videos), no errors. Since then, I haven't seen the issue any more. I guess I'll simply have to observe further.
Sure, let's leave the ticket open and see what happens. I'd be really curious to find out why the metadata checks were getting triggered.
http://bugzilla.suse.com/show_bug.cgi?id=1160019#c10
--- Comment #10 from Martin Wilck
http://bugzilla.suse.com/show_bug.cgi?id=1160019#c11
--- Comment #11 from Martin Wilck
> Right, you'd need to recompile with CONFIG_XFS_ONLINE_SCRUB=y
AFAICS the SLE15-SP1/Leap15.1 kernel doesn't have this option.
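Whether a given kernel build knows about CONFIG_XFS_ONLINE_SCRUB at all can be checked from its configuration file. The sketch below is one way to do that. The candidate paths (/boot/config-<release> and /proc/config.gz) are assumptions about where the running kernel's config is exposed, and the function names are hypothetical.

```python
import gzip
import os
import re


def kconfig_value(config_text: str, option: str):
    """Return the option's value ('y', 'm', ...), 'n' if explicitly unset,
    or None if the option does not appear in the config at all."""
    set_re = re.compile(rf"^{re.escape(option)}=(.*)$")
    unset_line = f"# {option} is not set"
    for line in config_text.splitlines():
        line = line.strip()
        if line == unset_line:
            return "n"
        match = set_re.match(line)
        if match:
            return match.group(1)
    return None


def load_running_config() -> str:
    """Load the running kernel's config from the first path that exists."""
    candidates = [f"/boot/config-{os.uname().release}", "/proc/config.gz"]
    for path in candidates:
        if os.path.exists(path):
            opener = gzip.open if path.endswith(".gz") else open
            with opener(path, "rt") as f:
                return f.read()
    raise FileNotFoundError("no kernel config found in assumed locations")


if __name__ == "__main__":
    print(kconfig_value(load_running_config(), "CONFIG_XFS_ONLINE_SCRUB"))
```

A result of `None` means the option is not known to that kernel's configuration at all, which is different from an option that exists but is explicitly disabled (`'n'`).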
> Wild guess: it could be that when in xorg mode the gpu acceleration is producing too much heat
Well, as I said, this happened right after booting and logging in (actually, the cases I observed all happened no more than ~10min after booting and logging in). If the GPU had overheated, it must have done so very quickly.

Hints for further debugging are highly welcome. Would it make sense to trigger a crash dump when this condition occurs? If yes, I guess we should include a full memory dump?
http://bugzilla.suse.com/show_bug.cgi?id=1160019#c12
--- Comment #12 from Martin Wilck
http://bugzilla.suse.com/show_bug.cgi?id=1160019#c13
--- Comment #13 from Martin Wilck
https://bugzilla.suse.com/show_bug.cgi?id=1160019#c14
Anthony Iliopoulos