[Bug 1189791] New: btrfs filesystem corruptions with HyperPAV
http://bugzilla.opensuse.org/show_bug.cgi?id=1189791

            Bug ID: 1189791
           Summary: btrfs filesystem corruptions with HyperPAV
    Classification: openSUSE
           Product: openSUSE Tumbleweed
           Version: Current
          Hardware: S/390-64
                OS: Other
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: Kernel
          Assignee: kernel-bugs@opensuse.org
          Reporter: azouhr@opensuse.org
        QA Contact: qa-bugs@suse.de
                CC: ada.lovelace@gmx.de
          Found By: ---
           Blocker: ---

For a proof of concept, I created a machine with 22 3390-54 disks attached to a single btrfs. The disks are minidisks with cylinder 0 excluded. I added them in chunks of at least 4 disks at a time, and after adding devices I always rebalanced the btrfs.

When I added disks 15-18, I also enabled HyperPAV (without cylinder 0 on the base devices), which is available as a feature for z/VM 7.2 (the feature has also been added to z/VM). The definition in the directory looks like this:

COMMAND DEFINE HYPERPAVALIAS 01C0 FOR BASE 0100

I also added alias devices at addresses 01C1-01C7.

However, later on, when copying data from that disk, I found quite a number of corruptions within btrfs. After a while the server even crashed, and it is now running without HyperPAV. Corruptions that already happened are obviously still there, but apart from that, the system works without the hangs I experienced before.

By chance (I don't think that is the issue), the first disk with corruptions is the 16th disk in the system (including the system disk). And while thinking about this: the disk with number 100 has not been enabled on the system, although all disks are defined to the same control unit.

HyperPAV was in use during the rebalance of the last 8 disks, because dasdstat displayed workload on the PAV device.
# btrfs device stats /srv | grep corruption_errs
[/dev/dasdb1].corruption_errs 0
[/dev/dasdc1].corruption_errs 0
[/dev/dasdd1].corruption_errs 0
[/dev/dasde1].corruption_errs 0
[/dev/dasdk1].corruption_errs 0
[/dev/dasdj1].corruption_errs 0
[/dev/dasdh1].corruption_errs 0
[/dev/dasdm1].corruption_errs 0
[/dev/dasdf1].corruption_errs 0
[/dev/dasdg1].corruption_errs 0
[/dev/dasdl1].corruption_errs 0
[/dev/dasdi1].corruption_errs 0
[/dev/dasdn1].corruption_errs 0
[/dev/dasdo1].corruption_errs 0
[/dev/dasdp1].corruption_errs 331
[/dev/dasdq1].corruption_errs 303
[/dev/dasds1].corruption_errs 302
[/dev/dasdv1].corruption_errs 399
[/dev/dasdr1].corruption_errs 356
[/dev/dasdt1].corruption_errs 206
[/dev/dasdw1].corruption_errs 279
[/dev/dasdu1].corruption_errs 218

# dmesg | tail
[ 6616.004750] BTRFS warning (device dasdb1): csum failed root 5 ino 40234 off 1818624 csum 0x8941f998 expected csum 0x99cae683 mirror 1
[ 6616.004755] BTRFS error (device dasdb1): bdev /dev/dasdv1 errs: wr 0, rd 0, flush 0, corrupt 401, gen 0
[ 6616.005008] BTRFS warning (device dasdb1): csum failed root 5 ino 40234 off 3637248 csum 0x8941f998 expected csum 0xe55a18c7 mirror 1
[ 6616.005013] BTRFS error (device dasdb1): bdev /dev/dasdv1 errs: wr 0, rd 0, flush 0, corrupt 402, gen 0
[ 6616.005238] BTRFS warning (device dasdb1): csum failed root 5 ino 40234 off 1818624 csum 0x8941f998 expected csum 0x99cae683 mirror 1
[ 6616.005244] BTRFS error (device dasdb1): bdev /dev/dasdv1 errs: wr 0, rd 0, flush 0, corrupt 403, gen 0
[ 6616.005513] BTRFS warning (device dasdb1): csum failed root 5 ino 40234 off 1818624 csum 0x8941f998 expected csum 0x99cae683 mirror 1
[ 6616.005565] BTRFS error (device dasdb1): bdev /dev/dasdv1 errs: wr 0, rd 0, flush 0, corrupt 404, gen 0
[ 6616.882699] BTRFS warning (device dasdb1): csum failed root 5 ino 40224 off 10055680 csum 0x8941f998 expected csum 0x26222530 mirror 1
[ 6616.882715] BTRFS error (device dasdb1): bdev /dev/dasdu1 errs: wr 0, rd 0, flush 0, corrupt 249, gen 0

# uname -a
Linux zlxusr1020 5.12.12-1-default #1 SMP Fri Jun 18 11:07:46 UTC 2021 (0e46a2c) s390x s390x s390x GNU/Linux

--
You are receiving this mail because:
You are on the CC list for the bug.
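The counters and kernel messages in the report can be summarised with a short script. The following is a minimal sketch and not part of the original report: it assumes the output of `btrfs device stats` and `dmesg` has been captured as text, and the function names `corrupted_devices` and `csum_failures` are hypothetical helpers introduced only for illustration.

```python
import re
from collections import Counter

# Matches entries such as "[/dev/dasdp1].corruption_errs 331".
STATS_RE = re.compile(r"\[(/dev/\S+?)\]\.corruption_errs\s+(\d+)")

# Matches the "csum failed" warnings as they appear in the dmesg excerpt.
CSUM_RE = re.compile(
    r"csum failed root (\d+) ino (\d+) off (\d+) "
    r"csum (0x[0-9a-f]+) expected csum (0x[0-9a-f]+)"
)


def corrupted_devices(stats_text):
    """Return {device: count} for devices with a non-zero corruption counter."""
    return {
        dev: int(count)
        for dev, count in STATS_RE.findall(stats_text)
        if int(count) > 0
    }


def csum_failures(dmesg_text):
    """Count warnings per (inode, offset) and collect the computed checksums."""
    per_block = Counter()
    computed = set()
    for _root, ino, off, csum, _expected in CSUM_RE.findall(dmesg_text):
        per_block[(int(ino), int(off))] += 1
        computed.add(csum)
    return per_block, computed


if __name__ == "__main__":
    # Small sample taken from the report; a real run would read captured output.
    stats = (
        "[/dev/dasdo1].corruption_errs 0\n"
        "[/dev/dasdp1].corruption_errs 331\n"
        "[/dev/dasdq1].corruption_errs 303\n"
    )
    log = (
        "BTRFS warning (device dasdb1): csum failed root 5 ino 40234 "
        "off 1818624 csum 0x8941f998 expected csum 0x99cae683 mirror 1\n"
        "BTRFS warning (device dasdb1): csum failed root 5 ino 40224 "
        "off 10055680 csum 0x8941f998 expected csum 0x26222530 mirror 1\n"
    )
    print(corrupted_devices(stats))
    print(csum_failures(log))
```

In the excerpt from the report, all five warnings carry the same computed checksum (0x8941f998) while the expected values differ; collecting the computed values in a set makes that kind of pattern easy to spot.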
http://bugzilla.opensuse.org/show_bug.cgi?id=1189791#c1

Sarah Kriesch <ada.lovelace@gmx.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ihno@suse.com

--- Comment #1 from Sarah Kriesch <ada.lovelace@gmx.de> ---
Has SUSE a HyperPAV available to reproduce?
http://bugzilla.opensuse.org/show_bug.cgi?id=1189791#c2

--- Comment #2 from Berthold Gunreben <azouhr@opensuse.org> ---
(In reply to Sarah Kriesch from comment #1)
> Has SUSE a HyperPAV available to reproduce?

SUSE has HyperPAV available (this is a storage feature, and the devices must be configured in the IOCDS). However, to reproduce this you would need z/VM 7.2, because I am using minidisks without the first cylinder. This is a new feature available with z/VM 7.2.

I did not try with dedicated disks and alias devices.
http://bugzilla.opensuse.org/show_bug.cgi?id=1189791#c3

--- Comment #3 from Berthold Gunreben <azouhr@opensuse.org> ---
Maybe I should add a little detail so that it is easier to reproduce. So, what I did:

1. create btrfs and mount it at /srv (actually 4 devices, but I believe this will not be relevant)
   1.1 put data on that disk, so that it is something like 90% full
2. add extra devices online with btrfs device add
   2.1 use dirm to add a full pack minidisk (without cylinder 0), like
       dirm for <guest> amd 104 3390 autog 65519 pool1 mr
   2.2 vmcp link <host> 101 101 mr
   2.3 use yast to activate and format the disk, then create a raw partition
   2.4 add the disk with btrfs device add
   2.5 rebalance btrfs
3. add a hyperpavalias device like in the first description, also online as maint with
   for <guest> cmd define hyperpavalias 01c0 for base 0100
4. activate the hyperpavalias device with chzdev -e 01c0
5. add more disks like in 2, with rebalance
6. this is now a little complicated:
   6.1 I had a kubic docker registry on /srv and mirrored registry also within that filesystem
   6.2 the errors showed up when I filled the registry with the local copy
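The Linux-side part of the steps above (2.4, 2.5, 4 and 5) can be written down as a command plan. This is only a sketch under stated assumptions: the report says "rebalance btrfs" without giving the exact command, so `btrfs balance start <mountpoint>` is an assumption, the device names are placeholders, and `plan`, `add_and_rebalance` and `enable_alias` are hypothetical helper names. The script only prints the commands; it does not execute anything, and the z/VM directory steps (dirm, vmcp link) are left out.

```python
import shlex

MOUNT_POINT = "/srv"  # mount point used in the report


def add_and_rebalance(devices, mount_point=MOUNT_POINT):
    """Commands for steps 2.4 and 2.5: add each device, then rebalance."""
    cmds = [["btrfs", "device", "add", dev, mount_point] for dev in devices]
    # Assumption: the report does not state the exact rebalance command.
    cmds.append(["btrfs", "balance", "start", mount_point])
    return cmds


def enable_alias(alias="01c0"):
    """Command for step 4: activate the HyperPAV alias device."""
    return [["chzdev", "-e", alias]]


def plan(first_batch, second_batch, alias="01c0"):
    """Full sequence: disks without alias, enable alias, then more disks."""
    return (
        add_and_rebalance(first_batch)
        + enable_alias(alias)
        + add_and_rebalance(second_batch)
    )


if __name__ == "__main__":
    # Placeholder partitions; the real names depend on the system.
    for cmd in plan(["/dev/dasdf1"], ["/dev/dasdp1", "/dev/dasdq1"]):
        print(shlex.join(cmd))
```

Keeping the plan as data rather than running it directly makes it easy to review the sequence before trying it on a test guest.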
http://bugzilla.opensuse.org/show_bug.cgi?id=1189791#c4

--- Comment #4 from Sarah Kriesch <ada.lovelace@gmx.de> ---
(In reply to Berthold Gunreben from comment #3)
> 6.1 I had a kubic docker registry on /srv and mirrored registry also within that filesystem
> 6.2 the errors showed up when I filled the registry with the local copy

Are you using openSUSE Kubic or only the kubic docker registry based on openSUSE Tumbleweed? Both products are based on openSUSE Tumbleweed, but there could be some differences.
http://bugzilla.opensuse.org/show_bug.cgi?id=1189791#c5

--- Comment #5 from Berthold Gunreben <azouhr@opensuse.org> ---
(In reply to Sarah Kriesch from comment #4)
> Are you using openSUSE Kubic or only the kubic docker registry based on openSUSE Tumbleweed? Both products are based on openSUSE Tumbleweed, but there could be some differences.

This is the registry based on Tumbleweed only. However, I don't think that this is relevant, because the btrfs is just the filesystem for that workload. I guess other workloads would trigger the issue as well.
http://bugzilla.opensuse.org/show_bug.cgi?id=1189791

Sarah Kriesch <ada.lovelace@gmx.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P5 - None                   |P3 - Medium
http://bugzilla.opensuse.org/show_bug.cgi?id=1189791#c6

Sarah Kriesch <ada.lovelace@gmx.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tstaudt@de.ibm.com

--- Comment #6 from Sarah Kriesch <ada.lovelace@gmx.de> ---
Hi Thomas,
can IBM please support us in this case of btrfs file corruptions with HyperPAV after the creation of workloads?
http://bugzilla.opensuse.org/show_bug.cgi?id=1189791#c7

--- Comment #7 from Sarah Kriesch <ada.lovelace@gmx.de> ---
In case the IBM developer does not know the Kubic docker registry: we provide container images (also for s390x) at registry.opensuse.org. The list is growing and you can create workloads with mirroring. You can probably reproduce this bug with other workloads, too.
http://bugzilla.opensuse.org/show_bug.cgi?id=1189791#c8

--- Comment #8 from Berthold Gunreben <azouhr@opensuse.org> ---
(In reply to Sarah Kriesch from comment #7)
> In case the IBM developer does not know the Kubic docker registry: we provide container images (also for s390x) at registry.opensuse.org. The list is growing and you can create workloads with mirroring. You can probably reproduce this bug with other workloads, too.

The container registry I am using is described here:
https://kubic.opensuse.org/blog/2019-11-15-private-registry/

However, I did not go for a transactional server, just a standard installation.

Note that I won't be able to use the machine for much longer, because the loan of the IFL that is used in this environment ends this week.
http://bugzilla.opensuse.org/show_bug.cgi?id=1189791#c9

--- Comment #9 from Berthold Gunreben <azouhr@opensuse.org> ---
There has not been a lot of progress on this topic lately, so let me emphasize the possible impact.

The fact that I found the issue with btrfs does not necessarily mean that this is the only filesystem where the issue exists. Because btrfs checksums its data, it detects file corruption quickly. However, it is quite possible that all filesystems are affected, and that on other filesystems the data corruption simply goes unnoticed.

Until this is clarified, the HyperPAV feature as introduced with z/VM 7.2 is not usable for me.

Unfortunately, I do not have z/VM 7.2 available right now, and thus cannot do further tests on my own.
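To check whether a filesystem without data checksums is affected in the same way, the test data would have to be checksummed outside the filesystem. The following is a minimal sketch of that idea and not something taken from the bug report: it records a SHA-256 digest per file before the workload and compares afterwards. The names `build_manifest` and `verify_manifest` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file below root (by relative path) to its digest."""
    root = Path(root)
    return {
        str(path.relative_to(root)): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(root, manifest):
    """Return the relative paths that are missing or whose digest changed."""
    root = Path(root)
    damaged = []
    for rel, expected in manifest.items():
        path = root / rel
        if not path.is_file() or file_digest(path) != expected:
            damaged.append(rel)
    return sorted(damaged)
```

One would typically build the manifest from the source data, copy it to the filesystem under test, and run `verify_manifest` on the copy. Note that a read served from the page cache would not reveal on-disk corruption, so the filesystem should be remounted (or caches dropped) before verifying.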
http://bugzilla.opensuse.org/show_bug.cgi?id=1189791#c10

Sarah Kriesch <ada.lovelace@gmx.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bugproxy@us.ibm.com
              Flags|                            |needinfo?(bugproxy@us.ibm.com)

--- Comment #10 from Sarah Kriesch <ada.lovelace@gmx.de> ---
What is the status of this bug report?
http://bugzilla.opensuse.org/show_bug.cgi?id=1189791#c18

Sarah Kriesch <ada.lovelace@gmx.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |azouhr@opensuse.org
              Flags|                            |needinfo?(azouhr@opensuse.org)

--- Comment #18 from Sarah Kriesch <ada.lovelace@gmx.de> ---
Berthold, have you had time to reproduce this?
http://bugzilla.opensuse.org/show_bug.cgi?id=1189791#c20

--- Comment #20 from Sarah Kriesch <ada.lovelace@gmx.de> ---
Hi Berthold,
nice to see you again. Is it possible to test and reproduce this bug at Datev?