Bug ID: 1008107
Summary: Potential XFS Kernel bug - _xfs_buf_find: Block out of range
Classification: openSUSE
Product: openSUSE Distribution
Version: Leap 42.1
Hardware: x86-64
OS: openSUSE 42.1
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Basesystem
Assignee: bnc-team-screening@forge.provo.novell.com
Reporter: david@the-taylor-family.org
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

I originally posted this on the Leap 42.1 support forum, where it was suggested
I post it here, as it may be a kernel bug with respect to XFS.

https://forums.opensuse.org/showthread.php/518978-Leap-42-1-XFS-corrupted-file-problem?p=2798255#post2798255

dcurtisfra posted this in reply to my forum post:
Looking at your report and the Ubuntu Bug Report, it could be that we have a
Leap 42.1 Kernel issue with respect to XFS here.

It may be a good idea to raise a Bug Report <https://bugzilla.opensuse.org/>
containing everything that you've found.

The Ubuntu folks are suggesting that the latest Kernel version may alleviate
this issue, but that Kernel is appearing in the openSUSE distribution with Leap
42.2, which is still in the Release Candidate testing phase.

Here's the Canonical/Ubuntu bug reference:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1576599



Here's the majority of what I posted in the forum thread referenced above:

We had complaints that the Squid proxy was not working for several machines on
the network, so I investigated the Leap 42.1 box that handles proxy services.
It would ping, and its ports responded to Nagios TCP checks (the service checks
were failing), but I couldn't log in and none of the services on the box were
responsive. I power cycled the machine and it came up at the
emergency/maintenance recovery login. Once in maintenance mode, I determined
that the /var file system was corrupted; any attempt to mount it, run
xfs_repair on it, etc. hung the system indefinitely, requiring another power
cycle to recover. Booting from the Leap USB stick, I got a little further but
still could not get /var back. I tried mounting it read-only with norecovery,
but it still refused. Zeroing the log with the -L option to xfs_repair (which
discards any metadata updates still in the log) was the only way to get past
the problem. (I have backups of the logs from the night before, so I only lost
a little syslog data from some other systems; not a big issue here.)

Once I had /var mounted, I was able to look at the messages in the log, which
referred to "_xfs_buf_find: Block out of range" errors. These occurred when
logrotate was trying to swap logs around. The system was still writing to my
main system messages file (I use syslog-ng rather than systemd journal logging)
up until the point I power cycled it, so I guess that file had sufficient space
already allocated and did not need to request more. As Squid had stopped
working, along with logins, etc., I expect the /var file system failure was
preventing other logs from being opened and/or written (hence the apparent
lockup).
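
A minimal sketch of how the "Block out of range" messages could be pulled out
of a syslog file for review. The function names, the regular expression, and
the SAMPLE lines (copied from this report) are illustrative assumptions; the
real log path and format depend on the local syslog configuration.

```python
import re
from collections import Counter

# Matches the XFS warning text quoted in this report. The "[uptime]" stamp
# before "XFS" is optional, so search() is used rather than anchoring.
PATTERN = re.compile(
    r"XFS \((?P<dev>[^)]+)\): _xfs_buf_find: Block out of range: "
    r"block (?P<block>0x[0-9a-fA-F]+), EOFS (?P<eofs>0x[0-9a-fA-F]+)"
)

# Two lines taken verbatim from the log excerpt in this report.
SAMPLE = [
    "Oct 29 01:00:05 shadows kernel: XFS (dm-9): _xfs_buf_find: "
    "Block out of range: block 0x7fffffff8, EOFS 0x1000000",
    "Oct 29 01:00:05 shadows kernel: [665260.471535] XFS (dm-9): "
    "_xfs_buf_find: Block out of range: block 0x7fffffff8, EOFS 0x1000000",
]


def find_out_of_range(lines):
    """Return a (device, block, eofs) tuple for every matching log line."""
    hits = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            hits.append(
                (
                    match.group("dev"),
                    int(match.group("block"), 16),
                    int(match.group("eofs"), 16),
                )
            )
    return hits


def summarize(hits):
    """Count matching messages per device name."""
    return Counter(dev for dev, _block, _eofs in hits)
```

To scan a real file, pass an open file object instead of SAMPLE, for example
`find_out_of_range(open(path))`, where `path` is wherever the kernel messages
are written on the system in question.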

As best I can tell from the SMART results (I have smartd running) and the lack
of any other kernel warnings about disk failures, this does not appear to have
been a disk failure. The Call Trace in the Ubuntu bug includes a grow_inode
call, which is not present in the Call Trace from my crash, so while both list
the block out of range fault, they may not be related.

There were a total of six dumps within several seconds, all referencing the
same PID (logrotate). Here's the first one. I'll attach an edited copy of the
log that contains more details, plus reboot information, for those who need or
want it. I tried to remove unnecessary noise from the attached log, as well as
identifying details that I didn't think needed to be posted publicly.

Oct 29 01:00:05 shadows kernel: XFS (dm-9): _xfs_buf_find: Block out of range: block 0x7fffffff8, EOFS 0x1000000
Oct 29 01:00:05 shadows kernel: [665260.471535] XFS (dm-9): _xfs_buf_find: Block out of range: block 0x7fffffff8, EOFS 0x1000000
Oct 29 01:00:05 shadows kernel: [665260.471581] ------------[ cut here ]------------
Oct 29 01:00:05 shadows kernel: [665260.471626] WARNING: CPU: 3 PID: 4863 at ../fs/xfs/xfs_buf.c:473 _xfs_buf_find+0x2a1/0x2f0 [xfs]()
Oct 29 01:00:05 shadows kernel: [665260.471627] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter bridge stp llc dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio loop af_packet iscsi_ibft iscsi_boot_sysfs joydev hid_generic usbhid snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic intel_rapl snd_hda_intel snd_hda_controller i915 snd_hda_codec snd_hda_core x86_pkg_temp_thermal snd_hwdep intel_powerclamp coretemp video snd_pcm drm_kms_helper iTCO_wdt iTCO_vendor_support snd_timer gpio_ich snd soundcore i2c_i801 drm mei_me kvm mei e1000e lpc_ich ptp mfd_core pps_core serio_raw i2c_algo_bit crct10dif_pclmul ppdev parport_pc tpm_tis tpm 8250_fintek pcspkr processor parport wmi button crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd xfs libcrc32c crc32c_intel sr_mod cdrom ehci_pci ehci_hcd usbcore usb_common dm_mod sg
Oct 29 01:00:05 shadows kernel: [665260.471680] CPU: 3 PID: 4863 Comm: logrotate Not tainted 4.1.31-30-default #1
Oct 29 01:00:05 shadows kernel: [665260.471682] Hardware name: LENOVO 7005AK8/ , BIOS 9HKT46AUS 12/15/2011
Oct 29 01:00:05 shadows kernel: [665260.471684] 0000000000000286 0000000000000000 ffffffff8165ef0d 0000000000000000
Oct 29 01:00:05 shadows kernel: [665260.471687] 0000000000000000 ffffffffa016c24c ffffffff81068961 ffff88042a2b3340
Oct 29 01:00:05 shadows kernel: [665260.471689] 0000000000000008 00000007fffffff8 0000000000000000 0000000000000001
Oct 29 01:00:05 shadows kernel: [665260.471692] Call Trace:
Oct 29 01:00:05 shadows kernel: [665260.471704] [<ffffffff810055cc>] dump_trace+0x8c/0x340
Oct 29 01:00:05 shadows kernel: [665260.471709] [<ffffffff8100597c>] show_stack_log_lvl+0xfc/0x1a0
Oct 29 01:00:05 shadows kernel: [665260.471712] [<ffffffff81006ec1>] show_stack+0x21/0x50
Oct 29 01:00:05 shadows kernel: [665260.471717] [<ffffffff8165ef0d>] dump_stack+0x5d/0x79
Oct 29 01:00:05 shadows kernel: [665260.471722] [<ffffffff81068961>] warn_slowpath_common+0x81/0xb0
Oct 29 01:00:05 shadows kernel: [665260.471747] [<ffffffffa012ab91>] _xfs_buf_find+0x2a1/0x2f0 [xfs]
Oct 29 01:00:05 shadows kernel: [665260.471773] [<ffffffffa012ac07>] xfs_buf_get_map+0x27/0x2c0 [xfs]
Oct 29 01:00:05 shadows kernel: [665260.471800] [<ffffffffa0158871>] xfs_trans_get_buf_map+0x131/0x1e0 [xfs]
Oct 29 01:00:05 shadows kernel: [665260.471826] [<ffffffffa0102abc>] xfs_btree_get_bufs+0x4c/0x60 [xfs]
Oct 29 01:00:05 shadows kernel: [665260.471845] [<ffffffffa00ebaa9>] xfs_alloc_fix_freelist+0x179/0x410 [xfs]
Oct 29 01:00:05 shadows kernel: [665260.471863] [<ffffffffa00ec4e8>] xfs_free_extent+0x88/0x110 [xfs]
Oct 29 01:00:05 shadows kernel: [665260.471886] [<ffffffffa0126d27>] xfs_bmap_finish+0x137/0x190 [xfs]
Oct 29 01:00:05 shadows kernel: [665260.471912] [<ffffffffa013e174>] xfs_itruncate_extents+0x184/0x330 [xfs]
Oct 29 01:00:05 shadows kernel: [665260.471935] [<ffffffffa013e3aa>] xfs_inactive_truncate+0x8a/0x110 [xfs]
Oct 29 01:00:05 shadows kernel: [665260.471957] [<ffffffffa013f218>] xfs_inactive+0x128/0x150 [xfs]
Oct 29 01:00:05 shadows kernel: [665260.471964] [<ffffffff811f9e00>] evict+0xb0/0x170
Oct 29 01:00:05 shadows kernel: [665260.471968] [<ffffffff811f5b70>] __dentry_kill+0x170/0x1e0
Oct 29 01:00:05 shadows kernel: [665260.471973] [<ffffffff811f5d66>] dput+0x186/0x240
Oct 29 01:00:05 shadows kernel: [665260.471982] [<ffffffff811e0bc0>] __fput+0x150/0x1c0
Oct 29 01:00:05 shadows kernel: [665260.471987] [<ffffffff81085057>] task_work_run+0xa7/0xe0
Oct 29 01:00:05 shadows kernel: [665260.471991] [<ffffffff81002f59>] do_notify_resume+0x69/0x90
Oct 29 01:00:05 shadows kernel: [665260.471997] [<ffffffff816658c1>] int_signal+0x12/0x17
Oct 29 01:00:05 shadows kernel: [665260.472004] [<00007f4b210f52d0>] 0x7f4b210f52d
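
The two numbers in the "Block out of range" message can be put into
perspective with a little arithmetic. The sketch below assumes that both values
are in 512-byte basic blocks (the unit XFS uses for disk addresses in this
message) and, as a guess not confirmed anywhere in this report, that the file
system uses 4096-byte blocks and the request was for allocation group 0. Under
those assumptions the requested block equals the XFS "null" AG block marker
(0xFFFFFFFF) converted to a disk address. That is one possible reading of the
numbers, not a diagnosis.

```python
BASIC_BLOCK_BYTES = 512        # unit of the block/EOFS values in the message
ASSUMED_FS_BLOCK_BYTES = 4096  # ASSUMPTION: file system block size, not stated in the report
NULLAGBLOCK = 0xFFFFFFFF       # XFS marker meaning "no block" within an allocation group

# Values copied from the kernel message in this report.
block = 0x7FFFFFFF8
eofs = 0x1000000


def bb_to_bytes(basic_blocks):
    """Convert a count of 512-byte basic blocks to bytes."""
    return basic_blocks * BASIC_BLOCK_BYTES


# Size of the file system implied by EOFS, in GiB.
fs_size_gib = bb_to_bytes(eofs) / 2**30

# Byte offset of the requested block, in TiB (far beyond the end of the device).
requested_offset_tib = bb_to_bytes(block) / 2**40

# Basic blocks per file system block under the assumed block size.
bb_per_fs_block = ASSUMED_FS_BLOCK_BYTES // BASIC_BLOCK_BYTES

# Disk address the null marker would map to in allocation group 0 (ASSUMPTION).
null_as_disk_address = NULLAGBLOCK * bb_per_fs_block

print(f"file system size: {fs_size_gib} GiB")
print(f"requested offset: {requested_offset_tib:.6f} TiB")
print(f"null marker as disk address: {null_as_disk_address:#x}")
print(f"matches requested block: {null_as_disk_address == block}")
```

If the block size or allocation group differs from these assumptions, the match
with the null marker would not hold and the value would need another
explanation.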


A few notes about the attached log, to hopefully reduce any confusion. I had
attached a Samsung 850 SSD to the machine to serve as a storage device for
recovering bits of data written between the nightly backup and the crash. It
is not normally attached (the same goes for the USB stick). The machine was
also disconnected from the network, because I had stood up a VM to take over
proxy services while diagnosing this box and would otherwise have had duplicate
IPs. The machine also normally has a DVD drive attached, which it currently
does not (the Samsung SSD is using the DVD drive's port).

The hardware is a mostly stock Lenovo M91p with an i5 and 16 GB of RAM. There
is nothing fancy in it, and it runs a text console.

