[Bug 1043912] New: kernel 4.4.70-18.9-default is unable to mount some btrfs partitions, unlike previous kernel
http://bugzilla.opensuse.org/show_bug.cgi?id=1043912

Bug ID: 1043912
Summary: kernel 4.4.70-18.9-default is unable to mount some btrfs partitions, unlike previous kernel
Classification: openSUSE
Product: openSUSE Distribution
Version: Leap 42.2
Hardware: Other
OS: Other
Status: NEW
Severity: Critical
Priority: P5 - None
Component: Kernel
Assignee: kernel-maintainers@forge.provo.novell.com
Reporter: fcrozat@suse.com
QA Contact: qa-bugs@suse.de
CC: dsterba@suse.com
Found By: ---
Blocker: ---

The latest maintenance update of the kernel (4.4.70-18.9-default) is causing one of my btrfs partitions to not be mountable at all:

[ 4.125164] BTRFS info (device sdb1): disk space caching is enabled
[ 4.132807] BTRFS error (device sdb1): super_total_bytes 1000203816960 mismatch with fs_devices total_rw_bytes 1000203820544
[ 4.132809] BTRFS error (device sdb1): failed to read chunk tree: -22
[ 4.169764] BTRFS error (device sdb1): open_ctree failed

kernel-default-4.4.62-18.6.1.x86_64 is able to mount it properly.

I ran scrub and balance on it from kernel 4.4.62-18.6.1, no error found:

Checking filesystem on /dev/sdb1
UUID: 072539d3-94cc-4c96-b50d-448c5b223850
checking extents
checking free space cache
checking fs roots
checking csums
checking root refs
found 409700515840 bytes used, no error found
total csum bytes: 399474044
total tree bytes: 536072192
total fs tree bytes: 60981248
total extent tree bytes: 20795392
btree space waste bytes: 55480247
file data blocks allocated: 409166581760
 referenced 409166577664

--
You are receiving this mail because:
You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1043912
Takashi Iwai
http://bugzilla.opensuse.org/show_bug.cgi?id=1043912#c1
Takashi Iwai
http://bugzilla.opensuse.org/show_bug.cgi?id=1043912#c2
Frederic Crozat
http://bugzilla.opensuse.org/show_bug.cgi?id=1043912#c3
Nikolay Borisov
http://bugzilla.opensuse.org/show_bug.cgi?id=1043912#c4
--- Comment #4 from Frederic Crozat
http://bugzilla.opensuse.org/show_bug.cgi?id=1043912#c5
Frederic Crozat
http://bugzilla.opensuse.org/show_bug.cgi?id=1043912#c6
Nikolay Borisov
http://bugzilla.opensuse.org/show_bug.cgi?id=1043912#c7
Frederic Crozat
Can you also provide the output of lsblk -b /dev/sdb.
# lsblk -b /dev/sdb
NAME   MAJ:MIN RM          SIZE RO TYPE MOUNTPOINT
sdb      8:16   0 1000204886016  0 disk
└─sdb1   8:17   0 1000203820544  0 part
http://bugzilla.opensuse.org/show_bug.cgi?id=1043912#c8
--- Comment #8 from Nikolay Borisov
http://bugzilla.opensuse.org/show_bug.cgi?id=1043912#c9
--- Comment #9 from Frederic Crozat
So all the provided information indicates that the actual size of the underlying device is 1000203820544, which is 3584 bytes (3.5 KiB) larger than what the super block thinks. This is also corroborated by both the info in the device extent tree and in the multiple copies of the super block. Independently of the filesystem, lsblk also confirms that the device size is 1000203820544. This is odd indeed, so if I understand correctly this volume worked fine before you upgraded to 42.2?
Correction. This volume is working fine with the 42.2 release kernel and all maintenance kernels up to and including 4.4.62-18.6-default. It is no longer mountable with the latest maintenance kernel.
If so, could you show what number is reported by lsblk on the old kernel, where this used to work?
with 4.4.62-18.6-default kernel:
# lsblk -b /dev/sdb
NAME   MAJ:MIN RM          SIZE RO TYPE MOUNTPOINT
sdb      8:16   0 1000204886016  0 disk
└─sdb1   8:17   0 1000203820544  0 part /mnt/cache
Also have you performed any device replace operations?
No, this is a disk I added to my system a long time ago (either with openSUSE 13.1 or 42.1, I'm not 100% sure), I don't remember doing any replace operations on it.
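The size gap discussed in this comment can be checked with a few lines of arithmetic. This is only a minimal sketch: the two numbers are copied from the kernel error message and the lsblk output quoted in this thread, and the variable names are illustrative, not taken from any tool.

```python
# Sizes in bytes, copied from the thread.
super_total_bytes = 1000203816960  # "super_total_bytes" in the kernel error
partition_size = 1000203820544     # size of sdb1 reported by lsblk -b

# How much larger the partition is than what the super block records.
gap_bytes = partition_size - super_total_bytes

print(gap_bytes)         # gap in bytes
print(gap_bytes / 1024)  # same gap in KiB
```

The gap is smaller than 4096 bytes, which is consistent with one of the two values having been rounded down to a 4 KiB boundary (an assumption about the sector size, not something stated in this comment).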
http://bugzilla.opensuse.org/show_bug.cgi?id=1043912#c10
--- Comment #10 from Nikolay Borisov
http://bugzilla.opensuse.org/show_bug.cgi?id=1043912#c11
--- Comment #11 from Frederic Crozat
Right, then it's apparent why it doesn't want to mount: the check which fails the mount was not present until rpm-4.4.63-1. It's only a consistency check. However, the more pressing question is how the volume got into a situation where the device size doesn't equal what is recorded in the superblock. I guess there are 2 courses of action:
1) Revert the offending patch, but it's supposed to harden the kernel against such "corruption" (the quotes are due to my not being sure that it's actually a corruption)
Would an online resize of the FS be enough to "fix" the issue?
2) If you have enough free space on a different drive, try to recreate the filesystem via btrfs send/receive. Obviously this might work in your case but we don't know how widespread this bug could be; so far it seems no one has complained.
I'm really concerned we might see this from our SP2 customers at some point as L3 (we never know when they will apply maintenance updates).
David, Jeff, what do you make of this?
http://bugzilla.opensuse.org/show_bug.cgi?id=1043912#c12
--- Comment #12 from Jeff Mahoney
http://bugzilla.opensuse.org/show_bug.cgi?id=1043912#c15
--- Comment #15 from Frederic Crozat
I've built a 42.2-based kernel with possible patches which fix the issue here: https://build.suse.de/project/show/home:nikbor:bsc1043912
Can you please test it and confirm that the patches work before I push them for merging?
I will be able to test it only next Sunday. Will report then
http://bugzilla.opensuse.org/show_bug.cgi?id=1043912#c16
--- Comment #16 from Frederic Crozat
I've built a 42.2-based kernel with possible patches which fix the issue here: https://build.suse.de/project/show/home:nikbor:bsc1043912
Can you please test it and confirm that the patches work before I push them for merging?
Still broken:
kernel: Btrfs loaded, crc32c=crc32c-intel, assert=on
kernel: BTRFS: device label cache devid 1 transid 49644 /dev/sdb1
kernel: BTRFS: device fsid 7fd337e9-f933-4cab-8a20-441daf926db2 devid 1 transid 157784 /dev/sda2
kernel: BTRFS info (device sda2): disk space caching is enabled
kernel: BTRFS info (device sda2): disk space caching is enabled
kernel: BTRFS info (device sdb1): disk space caching is enabled
kernel: BTRFS error (device sdb1): super_total_bytes 1000203816960 mismatch with fs_devices total_rw_bytes 1000203820544
kernel: BTRFS error (device sdb1): failed to read chunk tree: -22
kernel: BTRFS error (device sdb1): open_ctree failed
http://bugzilla.opensuse.org/show_bug.cgi?id=1043912#c19
--- Comment #19 from Frederic Crozat
I've built yet another kernel off latest 42.2, it can be found here: https://build.suse.de/project/show/home:nikbor:bsc1043912
It contains a patch which rounds down the values used in the check and should hopefully allow Frederic to mount his system. If everything checks out I will commit it.
I can confirm kernel 4.4.73-1.g16a5f2a-default properly mounts my "problematic" btrfs partition. However, I got the following warning (not sure if it is related):
[ 54.116369] WARNING: CPU: 0 PID: 278 at ../fs/btrfs/ctree.h:1513 btrfs_update_device+0x1ad/0x1c0 [btrfs]()
[ 54.116398] ath9k_hw gpio_ich snd_hda_core snd_hwdep snd_pcm_oss iTCO_wdt iTCO_vendor_support lrw gf128mul ath r8169 snd_pcm mac80211 joydev glue_helper mii ablk_helper lpc_ich mfd_core snd_seq cfg80211 rfkill fjes snd_seq_device snd_timer snd_mixer_oss processor i2c_i801 snd cryptd mei_me shpchp mei soundcore pcspkr btrfs xor hid_microsoft hid_generic usbhid raid6_pq sr_mod cdrom sd_mod crc32c_intel ahci libahci i915 libata video i2c_algo_bit xhci_pci drm_kms_helper ehci_pci syscopyarea sysfillrect xhci_hcd sysimgblt ehci_hcd fb_sys_fops usbcore drm usb_common button sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod efivarfs autofs4
[ 54.116438] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
[ 54.116490] [<ffffffffa04c66fd>] btrfs_update_device+0x1ad/0x1c0 [btrfs]
[ 54.116512] [<ffffffffa04cbf91>] btrfs_finish_chunk_alloc+0x151/0x590 [btrfs]
[ 54.116521] [<ffffffffa04875c7>] btrfs_create_pending_block_groups+0x137/0x280 [btrfs]
[ 54.116539] [<ffffffffa049ff53>] __btrfs_end_transaction+0x93/0x350 [btrfs]
[ 54.116548] [<ffffffffa0488b6f>] flush_space+0xef/0x5f0 [btrfs]
[ 54.116557] [<ffffffffa048918b>] btrfs_async_reclaim_metadata_space+0x11b/0x4b0 [btrfs]
[ 54.117794] WARNING: CPU: 0 PID: 278 at ../fs/btrfs/ctree.h:1513 btrfs_update_device+0x1ad/0x1c0 [btrfs]()
[ 54.117819] gpio_ich snd_hda_core snd_hwdep snd_pcm_oss iTCO_wdt iTCO_vendor_support lrw gf128mul ath r8169 snd_pcm mac80211 joydev glue_helper mii ablk_helper lpc_ich mfd_core snd_seq cfg80211 rfkill fjes snd_seq_device snd_timer snd_mixer_oss processor i2c_i801 snd cryptd mei_me shpchp mei soundcore pcspkr btrfs xor hid_microsoft hid_generic usbhid raid6_pq sr_mod cdrom sd_mod crc32c_intel ahci libahci i915 libata video i2c_algo_bit xhci_pci drm_kms_helper ehci_pci syscopyarea sysfillrect xhci_hcd sysimgblt ehci_hcd fb_sys_fops usbcore drm usb_common button sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod efivarfs autofs4
[ 54.117855] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
[ 54.117884] [<ffffffffa04c66fd>] btrfs_update_device+0x1ad/0x1c0 [btrfs]
[ 54.117896] [<ffffffffa04cbf91>] btrfs_finish_chunk_alloc+0x151/0x590 [btrfs]
[ 54.117905] [<ffffffffa04875c7>] btrfs_create_pending_block_groups+0x137/0x280 [btrfs]
[ 54.117915] [<ffffffffa049ff53>] __btrfs_end_transaction+0x93/0x350 [btrfs]
[ 54.117925] [<ffffffffa0488b6f>] flush_space+0xef/0x5f0 [btrfs]
[ 54.117933] [<ffffffffa048918b>] btrfs_async_reclaim_metadata_space+0x11b/0x4b0 [btrfs]
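The idea of rounding down the values used in the check, as described in this comment, can be illustrated roughly as follows. This is a sketch in Python, not the kernel patch itself: the helper `round_down` and the 4096-byte `SECTORSIZE` are assumptions made for illustration, while the two byte counts come from the error message quoted earlier in the thread.

```python
SECTORSIZE = 4096  # assumption: 4 KiB sector size, not stated in this comment


def round_down(value: int, alignment: int) -> int:
    """Round value down to the nearest multiple of alignment."""
    return value - (value % alignment)


super_total_bytes = 1000203816960  # from "super_total_bytes" in the error
total_rw_bytes = 1000203820544     # from "fs_devices total_rw_bytes"

# A strict comparison of the raw values fails, as in the mount error.
strict_match = super_total_bytes == total_rw_bytes

# Comparing the rounded-down values succeeds.
rounded_match = (
    round_down(super_total_bytes, SECTORSIZE)
    == round_down(total_rw_bytes, SECTORSIZE)
)

print(strict_match, rounded_match)
```

Under that assumption the device size rounds down to exactly the value recorded in the super block, which would explain why a check on rounded values lets the partition mount.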
http://bugzilla.opensuse.org/show_bug.cgi?id=1043912#c22
--- Comment #22 from Frederic Crozat
(In reply to Nikolay Borisov from comment #20)
So yeah, the warning comes from newly-added code whose purpose is to catch offenders which try to store a non-rounded-down value. Your case is such. It's not critical and might actually go away if you run a rebalance operation on this volume, could you actually try this?
I'll try tonight.
rebalance fixed the warning
http://bugzilla.opensuse.org/show_bug.cgi?id=1043912#c26
Frederic Crozat
http://bugzilla.opensuse.org/show_bug.cgi?id=1043912#c27
--- Comment #27 from Nikolay Borisov
http://bugzilla.opensuse.org/show_bug.cgi?id=1043912#c28
--- Comment #28 from Nikolay Borisov
I just upgraded my Leap 42.2 system to the latest kernel-default from maintenance update ( 4.4.74-18.20-default ) and problem is still there:
[ 7.052652] BTRFS info (device sdb1): disk space caching is enabled
[ 7.060731] BTRFS error (device sdb1): super_total_bytes 1000203816960 mismatch with fs_devices total_rw_bytes 1000203820544
[ 7.060733] BTRFS error (device sdb1): failed to read chunk tree: -22
[ 7.110461] BTRFS error (device sdb1): open_ctree failed
It looks like the fix didn't make it or isn't properly applied
So can you perform the following steps:
1. Boot your system with the last kernel I provided you with: https://build.suse.de/project/show/home:nikbor:bsc1043912
2. You should be able to mount the system and will likely see the warnings you described before.
3. Run btrfs balance as before - this time you shouldn't see the warnings.
At this point I'm expecting to see rounded-down values stored on disk. To verify this, again provide a dump of the super block and the device tree for the affected partition with those 2 commands:
btrfs inspect-internal dump-super -af /dev/vdb
btrfs inspect-internal dump-tree -d /dev/vdb
http://bugzilla.opensuse.org/show_bug.cgi?id=1043912#c29
--- Comment #29 from Frederic Crozat
(In reply to Frederic Crozat from comment #26)
I just upgraded my Leap 42.2 system to the latest kernel-default from maintenance update ( 4.4.74-18.20-default ) and problem is still there:
[ 7.052652] BTRFS info (device sdb1): disk space caching is enabled
[ 7.060731] BTRFS error (device sdb1): super_total_bytes 1000203816960 mismatch with fs_devices total_rw_bytes 1000203820544
[ 7.060733] BTRFS error (device sdb1): failed to read chunk tree: -22
[ 7.110461] BTRFS error (device sdb1): open_ctree failed
It looks like the fix didn't make it or isn't properly applied
So can you perform the following steps:
1. Boot your system with the last kernel I provided you with: https://build.suse.de/project/show/home:nikbor:bsc1043912
2. You should be able to mount the system and will likely see the warnings you described before.
I was able to mount the fs with this kernel but no warning at all. I'm attaching both dump tree and super before running balance.
http://bugzilla.opensuse.org/show_bug.cgi?id=1043912#c30
Frederic Crozat
http://bugzilla.opensuse.org/show_bug.cgi?id=1043912#c31
Frederic Crozat
http://bugzilla.opensuse.org/show_bug.cgi?id=1043912#c32
--- Comment #32 from Nikolay Borisov
(In reply to Nikolay Borisov from comment #28)
(In reply to Frederic Crozat from comment #26)
I just upgraded my Leap 42.2 system to the latest kernel-default from maintenance update ( 4.4.74-18.20-default ) and problem is still there:
[ 7.052652] BTRFS info (device sdb1): disk space caching is enabled
[ 7.060731] BTRFS error (device sdb1): super_total_bytes 1000203816960 mismatch with fs_devices total_rw_bytes 1000203820544
[ 7.060733] BTRFS error (device sdb1): failed to read chunk tree: -22
[ 7.110461] BTRFS error (device sdb1): open_ctree failed
It looks like the fix didn't make it or isn't properly applied
So can you perform the following steps:
1. Boot your system with the last kernel I provided you with: https://build.suse.de/project/show/home:nikbor:bsc1043912
2. You should be able to mount the system and will likely see the warnings you described before.
I was able to mount the fs with this kernel but no warning at all.
I'm attaching both dump tree and super before running balance.
Can you attach logs AFTER you have run balance? What I'm seeing in the dumps provided is that the DEV_ITEM still has an unrounded value:
item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 3897 itemsize 98
devid 1 total_bytes 1000203820544 bytes_used 426351001600
My expectation would be that rebalance should store a rounded-down value.
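One way to tell whether a DEV_ITEM still holds an unrounded value is to take its total_bytes modulo the sector size. The sketch below parses the dump line quoted in this comment; the helper name `dev_item_total_bytes` is hypothetical and the 4096-byte sector size is an assumption, since the thread does not state it here.

```python
import re

SECTORSIZE = 4096  # assumption: 4 KiB sector size

# DEV_ITEM line as quoted from the dump-tree output in this comment.
dump_line = "devid 1 total_bytes 1000203820544 bytes_used 426351001600"


def dev_item_total_bytes(line: str) -> int:
    """Extract the total_bytes field from a DEV_ITEM dump line."""
    match = re.search(r"\btotal_bytes (\d+)", line)
    if match is None:
        raise ValueError("no total_bytes field in line")
    return int(match.group(1))


total_bytes = dev_item_total_bytes(dump_line)
remainder = total_bytes % SECTORSIZE

# A non-zero remainder means the stored value is not sector-aligned.
print(total_bytes, remainder, remainder == 0)
```

A remainder of zero after running balance would match the expectation stated above that a rounded-down value gets stored.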
http://bugzilla.opensuse.org/show_bug.cgi?id=1043912#c34
--- Comment #34 from Nikolay Borisov
http://bugzilla.opensuse.org/show_bug.cgi?id=1043912#c35
--- Comment #35 from Frederic Crozat
http://bugzilla.opensuse.org/show_bug.cgi?id=1043912#c36
--- Comment #36 from Frederic Crozat
http://bugzilla.opensuse.org/show_bug.cgi?id=1043912#c39
--- Comment #39 from Frederic Crozat
Latest Leap 42.2 maintenance kernel was still not able to mount the partition. I'll install a TW kernel and test later
Latest kernel for TW (Kernel:stable) 4.12.2-1.g1b6adc0-default doesn't mount the partition:
[ 8.312756] BTRFS error (device sdb1): super_total_bytes 1000203808768 mismatch with fs_devices total_rw_bytes 1000203812864
[ 8.312796] BTRFS error (device sdb1): failed to read chunk tree: -22
[ 8.367709] BTRFS error (device sdb1): open_ctree failed
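For comparing the mismatch across kernels, the two byte counts can be pulled out of the error line with a small parser. This is a sketch that assumes the message keeps the exact wording shown in this thread; `parse_mismatch` is a hypothetical helper, not part of any btrfs tool.

```python
import re
from typing import Tuple

# Pattern for the mismatch message as it appears in the logs above.
MISMATCH_RE = re.compile(
    r"super_total_bytes (\d+) mismatch with fs_devices total_rw_bytes (\d+)"
)


def parse_mismatch(log_line: str) -> Tuple[int, int]:
    """Return (super_total_bytes, total_rw_bytes) from a kernel log line."""
    match = MISMATCH_RE.search(log_line)
    if match is None:
        raise ValueError("line does not contain a size mismatch message")
    return int(match.group(1)), int(match.group(2))


log_line = (
    "[ 8.312756] BTRFS error (device sdb1): super_total_bytes 1000203808768 "
    "mismatch with fs_devices total_rw_bytes 1000203812864"
)

super_bytes, rw_bytes = parse_mismatch(log_line)
print(rw_bytes - super_bytes)  # size of the mismatch in bytes
```

Applied to the line from the 4.12.2 kernel, the two values differ by exactly 4096 bytes, whereas the earlier 4.4 logs differed by 3584 bytes.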
http://bugzilla.opensuse.org/show_bug.cgi?id=1043912#c41
--- Comment #41 from Frederic Crozat
http://bugzilla.opensuse.org/show_bug.cgi?id=1043912#c43
--- Comment #43 from Frederic Crozat
(In reply to Frederic Crozat from comment #41)
When asked to resize, I ran btrfs fi resize 1000203816448 /mnt/cache (which is 1000203820544 - 4096)
Okay, this makes sense and it exposes a missed site which should have been patched by my previous attempt at fixing it. I've updated the patches and pushed them to build at https://build.suse.de/project/show/home:nikbor:bsc1043912. When it's done, can you please test by downloading that kernel, booting into it, and then trying to grow the volume to its maximum capacity:
After rebooting with this kernel: [ 4.469419] BTRFS error (device sdb1): super_total_bytes 1000203808768 mismatch with fs_devices total_rw_bytes 1000203812864 but the partition was properly mounted
btrfs fi resize max /mnt/cache
[ 147.637427] BTRFS info (device sdb1): new size for /dev/sdb1 is 1000203816960
Then report the numbers from these two commands:
btrfs inspect dump-super /mnt/cache | grep total
(proper command needs device name, not mount path)
total_bytes 1000203812864
dev_item.total_bytes 1000203816960
btrfs inspect dump-tree /mnt/cache | grep total
devid 1 total_bytes 1000203816960 bytes_used 426351001600
total bytes 1000203812864
I then unmounted /mnt/cache and re-mounted it (without rebooting):
[ 405.607741] BTRFS error (device sdb1): super_total_bytes 1000203812864 mismatch with fs_devices total_rw_bytes 1000203816960
(but partition was mounted)