Bug ID 1160019
Summary "Metadata corruption detected at xfs_inode_buf_verify"
Classification openSUSE
Product openSUSE Distribution
Version Leap 15.1
Hardware Other
OS Other
Status NEW
Severity Normal
Priority P5 - None
Component Kernel
Assignee kernel-maintainers@forge.provo.novell.com
Reporter martin.wilck@suse.com
QA Contact qa-bugs@suse.de
Found By ---
Blocker ---

This happened on a fresh Leap 15.1 installation on a laptop with a SATA SSD,
on the /home file system. Kernel 4.12.14-lp151.28.36-default.

> Dec 28 19:02:47 pallas kernel: XFS (dm-0): Metadata corruption detected at xfs_inode_buf_verify+0x72/0xf0 [xfs], xfs_inode block 0x4b860
> Dec 28 19:02:47 pallas kernel: XFS (dm-0): Unmount and run xfs_repair
> Dec 28 19:02:47 pallas kernel: XFS (dm-0): First 64 bytes of corrupted metadata buffer:
> Dec 28 19:02:47 pallas kernel: ffff8801f164f000: 49 4e 41 ed 03 01 00 00 00 00 04 58 00 00 00 64  INA........X...d
> Dec 28 19:02:47 pallas kernel: ffff8801f164f010: 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00  ................
> Dec 28 19:02:47 pallas kernel: ffff8801f164f020: 5e 01 34 bd 25 b4 38 63 5d ff e7 a6 01 cc 26 9e  ^.4.%.8c].....&.
> Dec 28 19:02:47 pallas kernel: ffff8801f164f030: 5d ff e7 a6 01 cc 26 9e 00 00 00 00 00 00 00 34  ].....&........4

The same message is repeated a few times; eventually the FS ends up mounted
r/o, and all kinds of weird failures result.

AFAICS from the code in fs/xfs/libxfs/xfs_inode_buf.c, the 15.1 kernel tests
nothing but the magic number (bytes 0-1, 49 4e) and the version (byte 4, 03),
so the dumped pattern should be valid. But this is only the first inode in the
buffer, so subsequent inodes might be bad. However, I dumped a full 4k block at
offset 0x4b860, and all inode headers in it seem to be OK.
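
The check described above can be reproduced offline with a minimal Python
sketch. It decodes a few fields of the dumped header and applies a
magic/version test to every inode slot in a buffer. The field offsets follow
the big-endian on-disk inode layout as I understand it; the accepted version
range (1..3), the helper names, and the inode size passed to bad_inodes() are
assumptions for illustration, not taken from the 15.1 kernel source.

```python
import struct

XFS_DINODE_MAGIC = 0x494E  # ASCII "IN"

# First 64 bytes of the buffer, copied from the kernel log above.
DUMP = bytes.fromhex(
    "49 4e 41 ed 03 01 00 00 00 00 04 58 00 00 00 64 "
    "00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00 "
    "5e 01 34 bd 25 b4 38 63 5d ff e7 a6 01 cc 26 9e "
    "5d ff e7 a6 01 cc 26 9e 00 00 00 00 00 00 00 34"
)


def decode_header(buf, offset=0):
    """Decode a few big-endian fields of one on-disk inode header."""
    magic, mode, version, fmt = struct.unpack_from(">HHBB", buf, offset)
    uid, gid, nlink = struct.unpack_from(">III", buf, offset + 8)
    (size,) = struct.unpack_from(">q", buf, offset + 56)
    return {
        "magic": magic, "mode": mode, "version": version, "format": fmt,
        "uid": uid, "gid": gid, "nlink": nlink, "size": size,
    }


def header_ok(buf, offset=0):
    """Magic/version test only (assumed version range: 1..3)."""
    magic, _mode, version, _fmt = struct.unpack_from(">HHBB", buf, offset)
    return magic == XFS_DINODE_MAGIC and 1 <= version <= 3


def bad_inodes(block, inode_size):
    """Return byte offsets of inode slots in `block` that fail header_ok().

    `inode_size` must match the file system (assumption: supplied by caller).
    """
    last = len(block) - inode_size
    return [off for off in range(0, last + 1, inode_size)
            if not header_ok(block, off)]


if __name__ == "__main__":
    hdr = decode_header(DUMP)
    print("mode:", oct(hdr["mode"]), "version:", hdr["version"])
    print("header ok:", header_ok(DUMP))
```

For the dump above this decodes to a directory inode (mode 040755, version 3)
that passes both tests, consistent with the first header looking valid.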

The FS couldn't be umounted, so I rebooted into single-user mode, umounted
/home, and ran xfs_repair, which showed nothing of interest:

> Phase 1 - find and verify superblock...
>         - block cache size set to 374008 entries
> Phase 2 - using internal log
>         - zero log...
> zero_log: head block 55736 tail block 55736
>         - scan filesystem freespace and inode maps...
>         - found root inode chunk
> Phase 3 - for each AG...
>         - scan and clear agi unlinked lists...
>         - process known inodes and perform inode discovery...
>         - agno = 0
>         - agno = 1
>         - agno = 2
>         - agno = 3
>         - process newly discovered inodes...
> Phase 4 - check for duplicate blocks...
>         - setting up duplicate extent list...
>         - check for inodes claiming duplicate blocks...
>         - agno = 0
>         - agno = 3
>         - agno = 2
>         - agno = 1
> Phase 5 - rebuild AG headers and trees...
>         - agno = 0
>         - agno = 1
>         - agno = 2
>         - agno = 3
>         - reset superblock...
> Phase 6 - check inode connectivity...
>         - resetting contents of realtime bitmap and summary inodes
>         - traversing filesystem ...
>         - agno = 0
>         - agno = 1
>         - agno = 2
>         - agno = 3
>         - traversal finished ...
>         - moving disconnected inodes to lost+found ...
> Phase 7 - verify and correct link counts...
> 
>         XFS_REPAIR Summary    Sat Dec 28 21:02:45 2019
> 
> Phase           Start           End             Duration
> Phase 1:        12/28 21:02:44  12/28 21:02:45  1 second
> Phase 2:        12/28 21:02:45  12/28 21:02:45  
> Phase 3:        12/28 21:02:45  12/28 21:02:45  
> Phase 4:        12/28 21:02:45  12/28 21:02:45  
> Phase 5:        12/28 21:02:45  12/28 21:02:45  
> Phase 6:        12/28 21:02:45  12/28 21:02:45  
> Phase 7:        12/28 21:02:45  12/28 21:02:45  
> 
> Total run time: 1 second
> done

A very similar error was observed some days later, after only a couple of
minutes of uptime.

> Jan 02 10:35:10 pallas.mittagstun.de kernel: XFS (dm-0): Metadata corruption detected at xfs_inode_buf_verify+0x72/0xf0 [xfs], xfs_inode block 0x746b180
> Jan 02 10:35:10 pallas.mittagstun.de kernel: XFS (dm-0): Unmount and run xfs_repair
> Jan 02 10:35:10 pallas.mittagstun.de kernel: XFS (dm-0): First 64 bytes of corrupted metadata buffer:
> Jan 02 10:35:11 pallas.mittagstun.de kernel: ffff8801dd24d000: 49 4e 81 a4 03 02 00 00 00 00 04 58 00 00 00 64  IN.........X...d
> Jan 02 10:35:11 pallas.mittagstun.de kernel: ffff8801dd24d010: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00  ................
> Jan 02 10:35:11 pallas.mittagstun.de kernel: ffff8801dd24d020: 5e 07 ba 06 1c 4b f6 68 5e 07 ba 04 18 7a e7 4b  ^....K.h^....z.K
> Jan 02 10:35:11 pallas.mittagstun.de kernel: ffff8801dd24d030: 5e 07 ba 04 18 7a e7 4b 00 00 00 00 00 00 09 e8  ^....z.K........

I wonder whether this indicates faulty hardware and, if so, which component
(SSD? CPU? Memory? Other?).

The SSD in the laptop is brand new; the vendor's statement to that effect is
corroborated by the SMART data (power-on hours). I have run a simple "surface
test" (write pattern & verify) on the affected logical volume with fio and
found no errors (only a single test pass so far).
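
For illustration, the idea behind such a write-and-verify pass can be sketched
in Python. This is neither the fio job used here nor how fio works internally;
the block size, the helper names, and the SHA-256-derived pattern are
assumptions chosen for the sketch. Note also that a read directly after the
write may be served from the page cache, so a meaningful media test needs
direct I/O or a cache drop between the two phases.

```python
import hashlib
import os

BLOCK = 4096  # assumed test block size in bytes


def pattern_block(index):
    """Deterministic pseudo-random content derived from the block index."""
    seed = index.to_bytes(8, "big")
    out = b""
    counter = 0
    while len(out) < BLOCK:
        out += hashlib.sha256(seed + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:BLOCK]


def write_pattern(path, nblocks):
    """Write nblocks pattern blocks to path and flush them to the device."""
    with open(path, "wb") as f:
        for i in range(nblocks):
            f.write(pattern_block(i))
        f.flush()
        os.fsync(f.fileno())


def verify_pattern(path, nblocks):
    """Return the indices of blocks whose content differs from the pattern."""
    bad = []
    with open(path, "rb") as f:
        for i in range(nblocks):
            if f.read(BLOCK) != pattern_block(i):
                bad.append(i)
    return bad
```

An empty list from verify_pattern() corresponds to "no errors" in the sense
used above. Only point this at a scratch file or a volume whose contents may
be destroyed.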

Perhaps noteworthy: I also saw some BTRFS checksum errors in the logs (from the
root file system):

> Dec 28 19:04:17 pallas kernel: BTRFS warning (device sda2): sda2 checksum verify failed on 158384128 wanted D1502CDD found 7F3B5CF3 level 0
> 
> Jan 02 10:18:00 pallas.mittagstun.de kernel: BTRFS warning (device sda2): csum failed root 267 ino 232862 off 16384 csum 0xed735e63 expected csum 0x7e1dece1 mirror 1
>
> Jan 02 10:28:27 pallas.mittagstun.de kernel: BTRFS warning (device sda2): csum failed root 267 ino 248644 off 0 csum 0x347357c0 expected csum 0x2e8516b0 mirror 1

