[Bug 1063638] New: btrfs balance renders system unresponsive and eventually even kills WiFi when quota is enabled - review I/O scheduling parameters of btrfsmaintenance
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638

Bug ID: 1063638
Summary: btrfs balance renders system unresponsive and eventually even kills WiFi when quota is enabled - review I/O scheduling parameters of btrfsmaintenance
Classification: openSUSE
Product: openSUSE Tumbleweed
Version: Current
Hardware: Other
OS: Other
Status: NEW
Severity: Major
Priority: P5 - None
Component: Basesystem
Assignee: bnc-team-screening@forge.provo.novell.com
Reporter: okurz@suse.com
QA Contact: qa-bugs@suse.de
CC: Antoine.Saroufim@gmail.com, aschnell@suse.com, auxsvr@gmail.com, dmitry@roshchin.org, dsterba@suse.com, ecsos@schirra.net, enadolski@suse.com, fcrozat@suse.com, fliu@suse.com, friedhelm.stappert@web.de, guenther@mpanrw.de, gweberbh@gmail.com, hannsj_uhl@de.ibm.com, harald.achitz@gmail.com, ingo.goeppert+suse@mailbox.org, jeffm@suse.com, kdejaeger@gmail.com, lnussel@suse.com, lpechacek@suse.com, ndcunliffe@gmail.com, okurz@suse.com, ortsacs@yahoo.es, rgoldwyn@suse.com, richard@nod.at, ronisbr@gmail.com, slindomansilla@suse.com, suse@bugs.jan.ritzerfeld.org, sven.heithecker@web.de, t.rother@netzwissen.de, tneo@gmx.com, ulrich.hobelmann+suse@gmail.com
Depends on: 1017461
Found By: ---
Blocker: ---

## Observation

+++ This bug was initially created as a clone of Bug #1017461 +++

The first part of fixes have been done in bug 1017461 but it seems the btrfs maintenance tasks can still have a significant impact on system responsiveness. Running the btrfs maintenance jobs, e.g.

* /etc/cron.monthly/btrfs-scrub
* /etc/cron.weekly/btrfs-balance
* /etc/cron.weekly/btrfs-trim

can make the clock with seconds displayed stop for some seconds or the mouse cursor to get stuck for seconds

## Reproducible

* Somehow cause lot of "dirty" data that needs balancing/scrubbing
* Run the cron jobs
* Observe the system responsiveness is hindered

## Expected result

Interactive use of a machine should not be impacted

## Suggestion

It looks like /usr/share/btrfsmaintenance/btrfs-scrub.sh checks for the variable "$BTRFS_SCRUB_PRIORITY" equalling "normal" from /etc/sysconfig/btrfsmaintenance but this is set to "idle" for me in the config file which causes that no I/O scheduling parameters are forwarded to the call of scrub at all which does not seem to make sense to me. Should it set `-c 3` for idle or `-c 2 -n 7` for best-effort prio 7 instead?

--
You are receiving this mail because:
You are on the CC list for the bug.
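The mapping asked about in the suggestion above could look roughly like the following. This is a minimal sketch, not the code shipped in btrfs-scrub.sh: the function name `scrub_ioprio_opts` is hypothetical, the choice of `-c 3` for "idle" is just one of the two options raised in the report, and the command is only printed, never executed.

```shell
#!/bin/sh
# Hypothetical helper (not part of btrfsmaintenance): translate the value of
# BTRFS_SCRUB_PRIORITY into I/O priority options for "btrfs scrub start".
scrub_ioprio_opts() {
    case "$1" in
        # Idle I/O class; the report's alternative would be "-c 2 -n 7"
        # (best-effort class, lowest priority level).
        idle) printf '%s\n' "-c 3" ;;
        # Any other value: forward nothing and keep the scrub defaults.
        *)    printf '\n' ;;
    esac
}

BTRFS_SCRUB_PRIORITY="${BTRFS_SCRUB_PRIORITY:-idle}"
opts="$(scrub_ioprio_opts "$BTRFS_SCRUB_PRIORITY")"

# Print the resulting call instead of running it.
echo "would run: btrfs scrub start $opts /"
```

The point of the sketch is only that an "idle" setting results in explicit options being forwarded, which is the behaviour the report found missing.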
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638

David Walker <David@WalkerStreet.info> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |David@WalkerStreet.info
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c5

Antoine Belvire <antoine.belvire@opensuse.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |antoine.belvire@opensuse.org

--- Comment #5 from Antoine Belvire <antoine.belvire@opensuse.org> ---
With the recently added systemd units, btrfs-balance and btrfs-trim cannot run simultaneously:

~> systemctl cat btrfs-balance
# /usr/lib/systemd/system/btrfs-balance.service
[Unit]
Description=Balance block groups on a btrfs filesystem
Documentation=man:btrfs-balance
After=fstrim.service btrfs-trim.service btrfs-scrub.service

[Service]
Type=oneshot
ExecStart=/usr/share/btrfsmaintenance/btrfs-balance.sh
IOSchedulingClass=idle
CPUSchedulingPolicy=idle
~>

(Thanks to the After= field.)
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c8

Aaron Williams <aaron.w2@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |aaron.w2@gmail.com

--- Comment #8 from Aaron Williams <aaron.w2@gmail.com> ---
I was hit by this recently and I killed the rebalancing in the middle because my laptop slowed to a crawl (and I couldn't afford to wait). Unfortunately this left BTRFS in a state where I can only mount it read-only. I'm still trying to repair it by rebuilding the extent and csum trees but it is taking forever (16 hours on an SSD with a 100G root volume). After forcing a reboot (power button, because the system was so unresponsive) I can no longer mount the root filesystem read/write, and btrfs check --repair crashed afterwards.

As far as I'm concerned, due to this bug, BTRFS is nowhere near ready for prime time. I should never have accepted the default choice and should have just used XFS. The fact that it cannot recover from an interrupted rebalancing operation is of extremely grave concern.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c9

--- Comment #9 from Andre Guenther <guenther@mpanrw.de> ---
This is what killed my PC OS last year too. I know you "shouldn't" do that - but the first duty of a file system is to keep its data as safe as possible. After hard reset I could not mount at all. Then I tried several ways to make it work again, in order of ascending severity. I managed to mount it read-only and had to hand-pick my files from the disk because many of them couldn't be read anymore. (No important ones of course, but it was inconvenient.) Then I reinstalled, and I am now making full-partition image backups regularly to prevent it from happening again.

I have a bad feeling about using this on the company's servers :-/. There can of course be reasons for this to happen even on a server (defective hardware ...) and in this case it's bad enough to deal with one problem.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c10

--- Comment #10 from Aaron Williams <aaron.w2@gmail.com> ---
After rebuilding the extent and csum trees I was able to boot again and things went great until it started rebalancing. Then my system basically hung. I couldn't even start a root or sudo session to stop the rebalancing. I was forced to hit the power button again after letting it run for several hours. I booted a rescue flash drive to do it there so at least I can see what the hell is going on. Fortunately, after a delay of a couple of minutes, I was able to mount read/write this time.

There should be no need to do this since there is plenty of space left.

OpenSUSE should not use BTRFS for anything critical like the root filesystem or user data without strong warnings.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c11

Libor Pechacek <lpechacek@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tkoenig@netcologne.de

--- Comment #11 from Libor Pechacek <lpechacek@suse.com> ---
*** Bug 1074924 has been marked as a duplicate of this bug. ***
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638

Goldwyn Rodrigues <rgoldwyn@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wqu@suse.com
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c12

--- Comment #12 from Wenruo Qu <wqu@suse.com> ---
(In reply to Oliver Kurz from comment #0)
> ## Observation
> +++ This bug was initially created as a clone of Bug #1017461 +++
> The first part of fixes have been done in bug 1017461 but it seems the btrfs maintenance tasks can still have a significant impact on system responsiveness. Running the btrfs maintenance jobs, e.g.

IMHO the problem is quota with balance. Unfortunately it's a known bug and at least I don't have a clear plan to fix it.

Would you please try a balance with quota enabled and nothing else running, and check whether the responsiveness improves at all? If quota + balance has acceptable responsiveness, then at least it's not an urgent problem for qgroup.

(In reply to Andre Guenther from comment #9)
> After hard reset I could not mount at all.

This is the real problem, and in fact much more serious than the performance problem IMHO. This, and some recent reports on the mailing list, suggest btrfs is not as safe against power loss as we thought.

The whole concept of btrfs metadata CoW is that, as long as your superblock is updated correctly or not updated at all, whatever happened shouldn't damage your fs. (Metadata should always be fine, CoW data is also fine, while nocowed data is damaged.) All problems caused by hard reset imply a serious problem we should dig into further.

I would start investigating this by introducing a new runtime selftest first. But the problem does not seem easy to fix any time soon.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c32

Gabor Katona <katonag@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |katonag@gmail.com

--- Comment #32 from Gabor Katona <katonag@gmail.com> ---
I am also struggling with this bug. It has happened several times in the past and I am facing it at this very moment.

The current story: balancing started yesterday, rendering my notebook unusable (no disk IO but high CPU). I tried a hard reset (again, I have done it in the past as well). The result is a non-bootable system: it stops in emergency mode and root is read-only. Unfortunately I am familiar with this, it has happened several times (although not always). For me the repair is the following: repeat the boot-shutdown sequence several times (10, 20, 30, who knows) and at some point, by magic, the system boots. This is what happened yesterday evening. I let the notebook run the whole night, for more than 12 hours, but 50% of the balancing was still left. I had to unplug the charger to take the notebook to work, but unfortunately it tried to sleep. After resume it rebooted (erratic BIOS bug) and it started all over: read-only root, several reboots, and now it is "working", which means that btrfs eats 100% CPU and doesn't respond to btrfs balance cancel.

What should I do to avoid this forever? Should I disable quota? OK, but please provide info on what to change in snapper (ranges as I read, but how). Or should I reinstall without using btrfs? It is simply unacceptable for me that a file system renders a system unusable for 10-20 hours. Especially when there are file systems that do not do this.

Two remarks:
1. Opensuse should not use btrfs as default. Or not with quotas if this is the reason. Currently opensuse with btrfs is not even beta, it is unusable.
2. Btrfs doc says that balancing is safe. Well, this is actually totally false. Balancing is quite unsafe. Maybe someone should change it in the docs.

My system is Tumbleweed with latest updates.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c33

--- Comment #33 from Gabor Katona <katonag@gmail.com> ---
Forgot one thing: in my case the unresponsiveness of my notebook is periodic. For 5-20 s I can use it more or less normally, then it stalls for 5-10 s.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638

Jeff Mahoney <jeffm@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #761823|application/x-shellscript   |text/plain
          mime type|                            |
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638

Oleksandr Orlov <oorlov@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |1091933
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c37

David Manca <medzernik1@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |medzernik1@gmail.com

--- Comment #37 from David Manca <medzernik1@gmail.com> ---
Is there any progress on fixing this? I've had several production machines die because of this a few days back. This is a critical bug that needs to be addressed, is there any progress?
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c39 --- Comment #39 from Friedhelm Stappert <friedhelm.stappert@web.de> --- BTW: The problem persists after upgrading to Leap15.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c40 --- Comment #40 from Aaron Williams <aaron.w2@gmail.com> --- Whoever made the choice to make BTRFS the default root filesystem should be FIRED. This abomination needs to be fixed now! I just powered up my laptop to do something that should take no more than 5 minutes and now I get to watch it die since I don't have my charger with me and I can't safely shut down. This shitty filesystem should NEVER have been made anything but experimental. I'm ready to reformat my laptop with XFS to rid myself of this abomination.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c41

Aaron Williams <aaron.w2@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P2 - High                   |P0 - Crit Sit
           Severity|Major                       |Critical

--- Comment #41 from Aaron Williams <aaron.w2@gmail.com> ---
I think this bug should be marked as critical because it renders the system completely unusable and in my experience can lead to data loss. For example, I'm watching the battery on my laptop run down because I can't shut it off while this goes on. The last time I forced it off it rendered my system unbootable and recovery took a couple of days with a lot of very worrying error and warning messages from the broken btrfs fsck tool.

I'm running the latest Leap and this is still broken.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c42 --- Comment #42 from Aaron Williams <aaron.w2@gmail.com> --- With the init 0 command my laptop did eventually shut down, however, now it only boots into single user mode due to this abomination. Fuck BTRFS. There was NO reason for it to go into its check mode to begin with the previous time when it hung with BTRFS because it had been shut down cleanly. No rebalancing was needed. Rebalancing shouldn't hang the whole fucking system either, nor should it prevent a clean shutdown. Now, thanks to this abomination I can't boot back up. All I ask is a filesystem that can reliably store and retrieve files that can quickly be checked after crashes. BTRFS fits none of these criteria.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638

Jeff Mahoney <jeffm@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P0 - Crit Sit               |P2 - High
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c43 --- Comment #43 from David Manca <medzernik1@gmail.com> --- The bug has a severity of Critical. It renders the system unusable. It is critical. Who even thinks if not marking this as critical to fix. Have we dropped standards now?
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c44

--- Comment #44 from Harald Achitz <harald.achitz@gmail.com> ---
The BTRFS problems have been declared solved several times, and it turned out again and again that the problem still exists.

The whole story is not a recommendation for using openSUSE, and I think I will take a break and skip Leap 15. SUSE should not abuse the openSUSE users as beta testers like this.

This feature is nothing for notebooks, and it is doubtful that it is useful for the workstation of the average openSUSE user either. Maybe on servers running 24x7, but even there the performance drop is concerning.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c45

David Manca <medzernik1@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P2 - High                   |P1 - Urgent

--- Comment #45 from David Manca <medzernik1@gmail.com> ---
It's a bug and it **has to get fixed IMMEDIATELY** it's a production-killing thing that will end your system. This *HAS TO GET FIXED*
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c46

--- Comment #46 from Ronan Chagas <ronisbr@gmail.com> ---
I have to agree. This is a very old bug that affects Leap and Tumbleweed. I think I have been seeing this since 2016. At that time, it was said that disabling quotas would fix it. However, quotas are now used by snapper to clean up snapshots, I think. Hence, you can try to disable quotas for now and see if the problem is gone. Note that you will then have to clean up snapshots manually.
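For anyone wanting to try the quota workaround mentioned in comment #46, a minimal dry-run sketch follows. It assumes the root filesystem is the affected btrfs volume; the `run` wrapper and the `DRY_RUN` and `MOUNTPOINT` variables are hypothetical names used only for illustration. By default the commands are printed rather than executed, and the effect on snapper's snapshot cleanup is as described in the comment, not verified here.

```shell
#!/bin/sh
# Hypothetical wrapper: print the command unless DRY_RUN=0 is set explicitly.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Assumption: "/" is the btrfs filesystem with quota enabled.
MOUNTPOINT="${MOUNTPOINT:-/}"

# Show the existing qgroups first, so there is a record of the prior state.
run btrfs qgroup show "$MOUNTPOINT"

# Turn off quota accounting on the filesystem.
run btrfs quota disable "$MOUNTPOINT"
```

Running it with `DRY_RUN=0` as root would actually disable quotas; `btrfs quota enable` is the counterpart for turning them back on.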
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c47 --- Comment #47 from Thomas König <tkoenig@netcologne.de> --- I concur. I got rid of the problem by re-installing my system without btrfs, but that is not a general solution.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638

Oliver Kurz <okurz@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sebastian.chlad@suse.com
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c48

--- Comment #48 from Andre Guenther <guenther@mpanrw.de> ---
Actually there are at least two bugs here:
1. The system gets unresponsive and you can't even shut it down cleanly.
2. Hard reset / power loss will very likely kill your data (even if problem 1 didn't arise yet, but especially then).

As has already been mentioned - the second one is even worse.

Shouldn't it be considered to split them in two, making the second one critical?
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c49

Oliver Kurz <okurz@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P1 - Urgent                 |P2 - High

--- Comment #49 from Oliver Kurz <okurz@suse.com> ---
(In reply to David Manca from comment #43)
> Who even thinks if not marking this as critical to fix. Have we dropped standards now?

Sorry, I don't understand your language here. So you are arguing that this issue is critical and not *just* major? As a reporter I set the severity to "Major" according to https://bugzilla.opensuse.org/page.cgi?id=importance_matrix.html and while I agree that this bug here is about the most important one to handle in the context of "openSUSE as a daily operating system" I would not regard it as critical because so far I have not seen data loss linked to it.

> It's a bug and it **has to get fixed IMMEDIATELY** it's a production-killing thing that will end your system. This *HAS TO GET FIXED*

Please stay calm and objective here :) The priority field is used by teams working on bugs to prioritize their internal backlog so please refrain from changing it without consideration of the according development team. Feel free to bring more people onto this bug and let them express their opinion by "voting". IMHO this is a good way to show how many people are affected and care about the issue without needing to add more comments which do not add objective information which help to actually fix the issues.

I work as a QA engineer at SUSE and try to help with resolving this bug. Currently I have the challenge to find a clear reproducer. So it would help very much if we find one scenario which we can automate the failure reproduction. Providing this to the development teams could help to fix the issue faster. With the help of openQA we can already automate a lot which are very realistic scenarios but I would appreciate some help now :)

Rest assured that the issue *is being worked on* already but without a way to reproduce the issue as is observed on the side of the users it will likely take very long to fix the issue *as you see them*.

(In reply to Andre Guenther from comment #48)
> Actually there are at least two bugs here:
> 1. The system gets unresponsive and you can't even shut it down cleanly.
> 2. Hard reset / power loss will very likely kill your data (even if problem 1 didn't arise yet, but especially then).
> As has already been mentioned - the second one is even worse.
> Shouldn't it be considered to split them in two, making the second one critical?

Yes but only when we identified that the issues are actually different or at least the way how to reproduce are different. I would really appreciate if you could provide steps to reproduce this issue more easily.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c50

--- Comment #50 from David Manca <medzernik1@gmail.com> ---
(In reply to Oliver Kurz from comment #49)
> Sorry, I don't understand your language here. So you are arguing that this issue is critical and not *just* major? As a reporter I set the severity to "Major" according to https://bugzilla.opensuse.org/page.cgi?id=importance_matrix.html and while I agree that this bug here is about the most important one to handle in the context of "openSUSE as a daily operating system" I would not regard it as critical because so far I have not seen data loss linked to it.

You what? There have been people here *complaining* that they had their systems fucked and data lost.
"Critical: Crash, data loss or corruption, severe memory leak, etc."
Crash definitely happens, it's severe enough! It is a critical bug that affects everyone, just read the comments!

> Please stay calm and objective here :) The priority field is used by teams working on bugs to prioritize their internal backlog so please refrain from changing it without consideration of the according development team. Feel free to bring more people onto this bug and let them express their opinion by "voting". IMHO this is a good way to show how many people are affected and care about the issue without needing to add more comments which do not add objective information which help to actually fix the issues

People don't even know that this bugtracker exists, yet they experience it a lot. I had to find the bugtracker on reddit, because I did not know how to find it. People not seeing this thread =/= it doesn't affect them.

> I work as a QA engineer at SUSE and try to help with resolving this bug. Currently I have the challenge to find a clear reproducer. So it would help very much if we find one scenario which we can automate the failure reproduction. Providing this to the development teams could help to fix the issue faster. With the help of openQA we can already automate a lot which are very realistic scenarios but I would appreciate some help now :)

Install opensuse on a laptop, run YAST once or twice, use it and then see the 99% CPU drain and eventual crash of the system at least once a day.

> Rest assured that the issue *is being worked on* already but without a way to reproduce the issue as is observed on the side of the users it will likely take very long to fix the issue *as you see them*.

Thank god for that.

It's a major bug that basically stopped me from deploying openSUSE machines. Hopefully there will be progress, since it's like 1/2 of year from the report of the bug and 0 progress
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c51

--- Comment #51 from Andre Guenther <guenther@mpanrw.de> ---
(In reply to Oliver Kurz from comment #49)
> Yes but only when we identified that the issues are actually different or at least the way how to reproduce are different. I would really appreciate if you could provide steps to reproduce this issue more easily.

Regarding the "stalling" issue: To be honest, since I installed three systems with different versions of LEAP, different hardware and btrfs, I am under the impression that *every* installation suffers from this problem (more or less). Isn't that so?

The second one is very easy to reproduce: Wait until btrfs does its 100% CPU utilisation thing (balance, trim, whatever), press the reset-switch and say goodbye to your filesystem :-)
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c52

--- Comment #52 from Gabor Katona <katonag@gmail.com> ---
(In reply to Oliver Kurz from comment #49)
> I work as a QA engineer at SUSE and try to help with resolving this bug. Currently I have the challenge to find a clear reproducer. So it would help very much if we find one scenario which we can automate the failure reproduction. Providing this to the development teams could help to fix the issue faster. With the help of openQA we can already automate a lot which are very realistic scenarios but I would appreciate some help now :)
> ...
> Yes but only when we identified that the issues are actually different or at least the way how to reproduce are different. I would really appreciate if you could provide steps to reproduce this issue more easily.

Actually the development team has two really important tasks which can be split into two bugs.

The first is to solve this CRITICAL bug. Rendering a system unusable (YES, UNUSABLE) for several hours is more than critical. It is just as if someone came and took the computer away for a few hours. No, a restart does not help, since the balancing continues in the emergency state, and additionally you risk data loss.

The second is just as important but more general. A fundamental system component like a file system should NEVER eat up the CPU or render the system unusable in any other way. Measures should be taken to avoid such a scenario completely. Bugs are always coming and going, but a filesystem should be coded in a way that it does not make the system unusable through 100% CPU usage. It should detect if a process, subcomponent, or anything else gets stuck, eats the CPU, etc.

Currently BTRFS is experimental, the sooner you accept it the faster you provide a solution: SKIP BTRFS for opensuse.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638

Oliver Kurz <okurz@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dheidler@suse.com,
                   |                            |foursixnine@opensuse.org
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c53

Richard Brown <rbrown@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rbrown@suse.com

--- Comment #53 from Richard Brown <rbrown@suse.com> ---
(In reply to Andre Guenther from comment #51)
> The second one is very easy to reproduce: Wait until btrfs does its 100% CPU utilisation thing (balance, trim, whatever), press the reset-switch and say goodbye to your filesystem :-)

I'm sorry, but that's nonsense.

The load condition is reproducible, and I can confirm that pressing the reset-switch during a high-load btrfs condition MAY make the filesystem unmountable, but I have literally dozens of cases where following our documented process [1] fixes such problems and ZERO where it does not.

Therefore these claims of data loss are not valid, and this second issue could be considered Major (because of the disruption) but not Critical.

Note that in my experience the problems with high-load btrfs conditions are often exacerbated by scrubs/balance/trims not being run often enough, which leaves more of a mess to fix when they do actually run. Maybe the solution should be to run them more often, such as via systemd timers, so we can be sure they run more often.

[1] https://en.opensuse.org/SDB:BTRFS#How_to_repair_a_broken.2Funmountable_btrfs...
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c54 --- Comment #54 from Oliver Kurz <okurz@suse.com> --- (In reply to Gabor Katona from comment #52)
Hopefully there will be progress, since it's like 1/2 of year from the report of the bug and 0 progress
The predecessor is https://bugzilla.opensuse.org/show_bug.cgi?id=1017461 so actually the whole story is already older. Of course I am aware that there had been many more reports over different channels but my observation was that none of them really brought fixing the issue at hand any forward as too unspecific. This is why I want to help in this domain more and I think collecting the according information in according bugs can help. (In reply to Andre Guenther from comment #51)
Regarding the "stalling" issue: To be honest since i installed three systems with different versions of LEAP, different hardware and btrfs, i am under the impression that *every* installation suffers from this problem (more or less). Isn't that so?
Yes, I think you are right, e.g. see https://bugzilla.opensuse.org/show_bug.cgi?id=1017461 as a report against openSUSE Leap 42.2 . However since then changes to the different components of the systems - foremost the kernel itself - have introduced a lot of changes which should help. This is what was achieved in before and therefore it is very important to report in which product version (and potentially also which kernel) what problem was observed. Could you test the latest openSUSE Tumbleweed or openSUSE Leap 15.0? I already stated that I have troubles to find a scenario which can clearly reproduce the issue.
The second one is very easy to reproduce: Wait until btrfs does it's 100% CPU utilisation thing (balance, trim, whatever), press the reset-switch and say goodbye to your filesystem :-)
Cannot confirm. I just tried that:
* Installed a recent openSUSE Tumbleweed 20180605 x86_64 on LVM, encrypted root, 90 GB HDD, notebook hardware
* Installed a lot of packages (full plasma session, servers, etc.), started yast2
* copied random data to hard disk, e.g. `for i in {1..1000}; do dd if=/dev/urandom bs=64M count=1 of=/tmp/out_$i.bin ; done`
* started `btrfs scrub start /`, started snapper service
* hard-rebooted the system with `magic sysrq-b`
* system could start up without any problem observed
so not as easy to reproduce as that :(
(In reply to Gabor Katona from comment #52)
[…] The second is just as important but more general. A fundamental system component like a file system should NEVER eat up the CPU or render the system unusable in any other way.
I agree with all your points here.
Currently BTRFS is experimental, the sooner you accept it the faster you provide a solution: SKIP BTRFS for opensuse.
I guess you can achieve the same by disabling qgroups and snapshots. This is also easily possible from the installer. However, important features are missing then which you might want to have by different means, e.g. LVM including snapshot volumes and more backups including your own cleanup strategy for this. -- You are receiving this mail because: You are on the CC list for the bug.
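The data-generation step from the reproduction attempt in comment #54 can be parameterized so that it is easier to rerun with different sizes. The following is only a minimal sketch under stated assumptions: the function name `make_random_files` and its arguments are hypothetical and not part of btrfsmaintenance or any openSUSE tool, and GNU `dd` is assumed.

```sh
#!/bin/sh
# Hypothetical helper (not part of btrfsmaintenance or any openSUSE tool):
# write COUNT files of SIZE_MB MiB of random data each into DIR.
# The body runs in a subshell "( ... )" so its variables stay local.
make_random_files() (
    dir=$1 count=$2 size_mb=$3
    mkdir -p "$dir" || exit 1
    i=1
    while [ "$i" -le "$count" ]; do
        # Use 1 MiB blocks: a single very large read from /dev/urandom
        # may come back short on some kernels.
        dd if=/dev/urandom of="$dir/out_$i.bin" bs=1M count="$size_mb" 2>/dev/null || exit 1
        i=$((i + 1))
    done
)
```

For example, `make_random_files /tmp/btrfs-test 1000 64` would match the file count and nominal file size of the loop in the comment; smaller values make for a quicker trial run.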
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c55 --- Comment #55 from Andre Guenther <guenther@mpanrw.de> --- (In reply to Richard Brown from comment #53)
While the load condition is reproducible, and I can confirm that pressing the reset-switch during a high-load btrfs condition MAY make the filesystem unmountable, but I have literally dozens of cases where following our documented process [1] fixes such problems and ZERO where it does not.
Interesting. It happened to me two times now, but maybe it was bad luck. I would like to test this. Does it make sense to log this in a certain way for debugging purposes?
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c56 --- Comment #56 from Richard Brown <rbrown@suse.com> --- (In reply to Andre Guenther from comment #55)
Interesting. It happened to me two times now, but maybe it was bad luck. I would like to test this. Does it make any sense to protocol this in a certain way for debugging purposes?
If you're following the guide at https://en.opensuse.org/SDB:BTRFS#How_to_repair_a_broken.2Funmountable_btrfs... then for the two times you say the problem has happened you should have filed bugs with logs including the output of "btrfs check". Can you link me to those bugs?
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c57 --- Comment #57 from Andre Guenther <guenther@mpanrw.de> --- (In reply to Richard Brown from comment #56)
If you're following the guide at
I was following a guide (in fact, I cross-checked several to be on the safe side), but not that one. There is of course the possibility that I did it wrong, but I tried to order the steps by riskiness.
Can you link me to those bugs?
I haven't filed bugs for that. I didn't even know there was a bugtracker for that at opensuse.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c60 --- Comment #60 from Ronan Chagas <ronisbr@gmail.com> --- (In reply to Richard Brown from comment #53)
While the load condition is reproducible, and I can confirm that pressing the reset-switch during a high-load btrfs condition MAY make the filesystem unmountable, but I have literally dozens of cases where following our documented process [1] fixes such problems and ZERO where it does not.
Therefore there claims of dataloss are not valid and this second issue could be considered Major (because of the disruption) but not Critical
I reported this problem in 2016 at the opensuse-factory mailing list: https://lists.opensuse.org/opensuse-factory/2016-09/msg00130.html Back in the day, this was affecting a server (with HDDs) and my laptop (with SSDs). Unfortunately, I had to reinstall my entire server once because a power failure during those high loads corrupted the filesystem and I could not manage to recover it... However, that was 2 years ago and things could have improved. I have not seen this since, because my first action after installing openSUSE is now to disable qgroups.
(In reply to Gabor Katona from comment #52)
Currently BTRFS is experimental, the sooner you accept it the faster you provide a solution: SKIP BTRFS for opensuse.
I do not agree. BTRFS has been running here without any problems for 2 years, since I disable qgroups. What should be considered experimental (as warned here https://lists.opensuse.org/opensuse-factory/2016-09/msg00032.html ) is quotas / qgroups. I have no idea if this has been improved, but since people are seeing the problem, I guess it is still the same. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c61 --- Comment #61 from Gabor Katona <katonag@gmail.com> --- (In reply to Ronan Chagas from comment #60)
I do not agree. BTRFS has been running here without any problems for 2 years, since I disable qgroups. What should be considered experimental (as warned here https://lists.opensuse.org/opensuse-factory/2016-09/msg00032.html ) is quotas / qgroups. I have no idea if this has been improved, but since people are seeing the problem, I guess it is still the same.
Yes, you are right, I was not precise. However, despite marking quotas and qgroups as experimental, these are enabled by default in opensuse. And this should be changed immediately. Btrfs as presented in opensuse is experimental without under the hood tweaking. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c62 --- Comment #62 from Oliver Kurz <okurz@suse.com> --- (In reply to Gabor Katona from comment #61)
[…] However, despite marking quotas and qgroups as experimental, these are enabled by default in opensuse. And this should be changed immediately.
Just to relay that message (it is not my responsibility to make that decision): This is very unlikely to change. btrfs including qgroups provides some core functionality which is marketed as part of SUSE Linux Enterprise and is therefore seen to be enterprise-ready and *is* used by enterprise customers on a big scale. Sure, many are also selecting a different filesystem. This is of course possible and also fully supported. However, many users including myself are running btrfs on a plethora of systems from servers to micro-notebooks with no problems which are *specific* to btrfs. I do have problems myself but they are most likely related to the generic Linux behaviour in case of "thrashing", e.g. https://bugzilla.suse.com/show_bug.cgi?id=1087873
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c63 --- Comment #63 from Richard Brown <rbrown@suse.com> --- (In reply to Gabor Katona from comment #61)
Yes, you are right, I was not precise. However, despite marking quotas and qgroups as experimental, these are enabled by default in opensuse. And this should be changed immediately. Btrfs as presented in opensuse is experimental without under the hood tweaking.
quotas and qgroups are not experimental
https://btrfs.wiki.kernel.org/index.php/Status
They are defined as "safe for general use, there are some known problems that do not affect majority of users"
They are the default in SUSE Linux Enterprise and have been for years now. If millions of dollars of enterprise systems are trusted with it without major issue, I struggle to see how it isn't a suitable option for a default in openSUSE.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c64 --- Comment #64 from Gabor Katona <katonag@gmail.com> --- (In reply to Richard Brown from comment #63)
They are the default in SUSE Linux enterprise and have been for years now - if millions of dollars of enterprise systems are trusted with it without major issue, I struggle to see how it isn't a suitable option for a default in openSUSE
The answer to your last question is quite simple. Btrfs in opensuse (!!, not in SUSE Linux) renders several systems unusable for hours. A filesystem. Not some user installed crap. If this is not enough for dropping it as default until the bug is resolved, then nothing is. As far as I see this is not enough. Which is just sad. My longest wait was 14 hours before a hard reset. 14 hours and nothing happened, balance status just showed the same. 14 hours on a 50GB partition. It could have rewritten the whole partition bit by bit several times during this time. Now when I realize that balance is running (it is quite easy to tell) I immediately issue a balance cancel. Usually it takes 3-5 hours just to cancel. Do you think it is suitable for a default? Maybe the issue is some conflict between btrfs and other opensuse subsystems, and this is why it isn't present in SUSE Linux or other distros. But the result is an unusable opensuse system. This seems quite experimental to me and is not suitable for a default.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c65 --- Comment #65 from Eric Schirra <ecsos@schirra.net> --- (In reply to Richard Brown from comment #63)
(In reply to Gabor Katona from comment #61)
Yes, you are right, I was not precise. However, despite marking quotas and qgroups as experimental, these are enabled by default in opensuse. And this should be changed immediately. Btrfs as presented in opensuse is experimental without under the hood tweaking.
quotas and qgroups are not experimental
https://btrfs.wiki.kernel.org/index.php/Status
They are defined as "safe for general use, there are some known problems that do not affect majority of users"
This isn't right. Your link says:
Quotas, qgroups | mostly OK | tbd | mostly OK | qgroups with many snapshots slows down balance
This shows the problems accurately: "mostly" and "slows down"!! For me, this is not stable.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c66 --- Comment #66 from Eric Schirra <ecsos@schirra.net> --- (In reply to Eric Schirra from comment #65)
(In reply to Richard Brown from comment #63)
(In reply to Gabor Katona from comment #61)
Yes, you are right, I was not precise. However, despite marking quotas and qgroups as experimental, these are enabled by default in opensuse. And this should be changed immediately. Btrfs as presented in opensuse is experimental without under the hood tweaking.
quotas and qgroups are not experimental
https://btrfs.wiki.kernel.org/index.php/Status
They are defined as "safe for general use, there are some known problems that do not affect majority of users"
This isn't right. In your link is:
Quotas, qgroups | mostly OK | tbd | mostly OK | qgroups with many snapshots slows down balance
This shows accurate the problems: mostly and slows down!!
For me, this is not stable.
And under known issues (https://btrfs.wiki.kernel.org/index.php/Quota_support): Combining quota with (too many) snapshots of subvolumes can cause performance problems, for example when deleting snapshots.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c67 --- Comment #67 from Eric Schirra <ecsos@schirra.net> --- (In reply to Gabor Katona from comment #64)
Maybe the issue is some conflict between btrfs and other opensuse subsystems, this is why it isn't present in SUSE Linux or other distros. But the result is an unusable opensuse system. This seems quite experimental to me and is not suitable for a default.
Not right: For ca. one year i post this link: https://www.reddit.com/r/btrfs/comments/4qz1qd/problems_with_btrfs_quota/ -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c68 --- Comment #68 from Gabor Katona <katonag@gmail.com> --- (In reply to Eric Schirra from comment #65)
Quotas, qgroups | mostly OK | tbd | mostly OK | qgroups with many snapshots slows down balance
This shows accurate the problems: mostly and slows down!!
For me, this is not stable.
Definitely not stable. And mostly can be anything from experimental to beta, since mostly does not describe the problem when the situation is outside of "mostly". If in some cases there would be a few minutes of performance drop, it could be beta, but with several hours of knock-out this is experimental. Mostly OK, but still experimental. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c69 --- Comment #69 from Gabor Katona <katonag@gmail.com> --- (In reply to Eric Schirra from comment #67)
(In reply to Gabor Katona from comment #64)
Maybe the issue is some conflict between btrfs and other opensuse subsystems, this is why it isn't present in SUSE Linux or other distros. But the result is an unusable opensuse system. This seems quite experimental to me and is not suitable for a default.
Not right: For ca. one year i post this link: https://www.reddit.com/r/btrfs/comments/4qz1qd/problems_with_btrfs_quota/
OK, but it still can be a component in opensource distros not present in SUSE Linux Enterprise, because it seems that somehow this bug does not affect SUSE Linux Enterprise, since companies would not let it ruin their systems. I guess at least. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c70 --- Comment #70 from Richard Brown <rbrown@suse.com> --- (In reply to Gabor Katona from comment #64)
(In reply to Richard Brown from comment #63)
They are the default in SUSE Linux enterprise and have been for years now - if millions of dollars of enterprise systems are trusted with it without major issue, I struggle to see how it isn't a suitable option for a default in openSUSE
The answer is quite simple to your last question. Btrfs in opensuse (!!, not in SUSE Linux) renders several systems unusable for hours.
openSUSE Leap or openSUSE Tumbleweed?
openSUSE Leap has 100% identical code to SUSE Linux Enterprise when it comes to the kernel, btrfs tooling, etc. So I find it hard to accept your assertion that openSUSE has a general problem in this area when you accept that SLE does not.
Tumbleweed has a matching configuration, though obviously with the latest upstream versions. So there is scope for a problem, but you need to be specific as to which Tumbleweed snapshots, so we can identify kernel versions and the like that might be involved.
Less emotion please, more facts; else I'm just going to ignore your comments and focus on those which can help a resolution to this bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c71 --- Comment #71 from Sergio Lindo Mansilla <slindomansilla@suse.com> ---
Oliver mentioned that we use btrfs at SUSE, but he forgot to mention that we use openSUSE Leap (42.3, 15) and Tumbleweed with btrfs, not only SLE, and we also haven't experienced the problems that you describe. In our daily work we also depend on machines with openSUSE, so we do care that openSUSE works. This is the reason why this ticket exists: to collect information from people who have the problem and to be able to fix it. But since it works for us in our daily work, we didn't have any reason not to make it the default and consider it stable enough (anyway, it is not in our hands to decide that). We would need a way to reproduce your issues so we can handle them. A lot of you claim that you have the same problem on every machine, but we were not able to reproduce your problems on any machine. And since the information provided until now doesn't help, I (that's a personal opinion) still think the problem is not btrfs. Please don't take it personally; we just need more than comments and bad experiences to determine whether btrfs is really as unstable as you claim. I hope we can find the problem and solve it.
- Are you using at least Leap 42.3 or newer?
- Are you using the suggested partitioning (and suggested subvolumes)?
- Are you making the machine create an excessive amount of snapshots (like massively installing/uninstalling software) without properly cleaning them up?
- Did you cancel the balancing process a few times before you had this "total corruption problem" (causing the problem yourself)?
- Did you have some of the problems described in https://en.opensuse.org/SDB:BTRFS#How_to_repair_a_broken.2Funmountable_btrfs... that you didn't handle properly as described there before that "total corruption problem" (causing the problem yourself)? (Remember that "btrfs check" and "btrfs check --repair" are not your friends; you should use them as a last resort.)
- Could you try to reproduce it again on a fresh installation (at least Leap 42.3), providing installation logs (https://en.opensuse.org/openSUSE:Report_a_YaST_bug) and each step done on the installed system until that "total corruption problem"?
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c72 --- Comment #72 from Aaron Williams <aaron.w2@gmail.com> --- I run into this problem frequently on my laptop running Tumbleweed. Now I have not done anything to change qgroups or quotas, though I did change how often the rebalancing occurs to monthly instead of weekly. Now I don't boot my laptop all the time and it might go weeks without use. It has a 100GB btrfs root filesystem. When the rebalancing occurs, it's guaranteed to go out to lunch for quite some time. A couple of months ago I had to shut down my laptop by holding the power button down. Afterwards, it could not mount the root filesystem until I ran the fsck tool which spewed a lot of errors and took around 20 hours. Just the other day I booted up my laptop and it started its rebalancing procedure, again rendering it unusable. In this case, I did not have the charger handy and I again was forced to shut it down with init 0. It eventually did so. After I found the charger it would only boot into single user mode until the rebalancing completed after quite a bit of time. I installed Tumbleweed a year or two back and have continually updated it since then. It also has a 1TB SSD drive. I can't say I've had any experience with BTRFS with leap because a few years ago I tried BTRFS and it left a very bad taste in my mouth. The performance was abysmally slow and I switched to XFS. I generally always choose XFS instead of BTRFS due to its stability, performance and tools. I must say I am looking forward to when the next generation XFS comes out. If you want a system that acts up, my laptop does so frequently. Note that my laptop has also hung in the past numerous times requiring holding down the power button in the past, but now it seems stable as long as BTRFS behaves. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c73 Jeff Mahoney <jeffm@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Assignee|dsterba@suse.com |jeffm@suse.com --- Comment #73 from Jeff Mahoney <jeffm@suse.com> --- Guys, we don't need to keep debating whether this is an issue. It is. It just doesn't affect everyone. It could be that users unaffected don't have as many snapshots, don't run balance as frequently, or don't have the 'btrfsmaintenance' package installed so balance isn't run as a regular maintenance task. These days, running balance frequently isn't as needed since we clean up unused block groups automatically in the background (in the kernel), and have for years. Where it does help is if the workload on the file system swings from one extreme to the other (ie: data heavy vs metadata heavy) and we need to relocate a chunk to allocate it for other purposes. We do see reports infrequently enough that I still consider qgroups stable except for the severe performance issues that occur during balancing. I understand there are folks commenting that disagree, but as the person responsible for maintaining btrfs in SLES and openSUSE, I suspect I may have more of a big-picture view. I see reports of users encountering file systems that must be fscked if balance is interrupted. Without specific bug reports with metadata images, those are issues that won't get fixed. One thing I can do is to deploy the workaround that we have in place for relocation recovery due to it essentially hanging the file system since 4.8. That suspends quotas while relocation is recovering and re-enables them afterward. The overhead during relocation is reduced significantly, so it really just extends the heavy i/o period of balance a bit longer, while the overall runtime is substantially reduced. 
If quotas are used with limits, suspending them may not be wanted, so I'll probably need to come up with a way to opt in or out of that behavior. Lastly, the roadmap is this: balance itself already has a back reference resolver that is good and caches very well. I already have prototype code to leverage this mostly into qgroups, but my other responsibilities have limited the time I've had to spend on it. So, enough with the debating. Yes, I get that people are severely affected by this issue when it pops up. Belittling their concerns doesn't stop the issue from happening. Likewise, calling for everyone's heads because you're experiencing this issue doesn't get it fixed any more quickly. Workarounds are to disable quotas or uninstall the btrfsmaintenance package. -- You are receiving this mail because: You are on the CC list for the bug.
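The two workarounds named at the end of comment #73 can be written down as commands. The sketch below is an illustration, not an official procedure: `btrfs quota disable` and `zypper remove` are existing commands, while the wrapper name `btrfs_workaround` and the `APPLY` variable are hypothetical. By default it only prints what it would run. As the comment itself notes, switching quotas off may not be wanted where quota limits are in use.

```sh
#!/bin/sh
# Hypothetical wrapper around the two workarounds from comment #73.
# Prints the command by default; runs it (as root) only when APPLY=1.
btrfs_workaround() (
    run() {
        if [ "${APPLY:-0}" = 1 ]; then "$@"; else echo "$*"; fi
    }
    case $1 in
        # Disable quotas (qgroups) on the given mount point, default "/".
        quota)   run btrfs quota disable "${2:-/}" ;;
        # Remove the package that schedules the periodic balance.
        package) run zypper remove btrfsmaintenance ;;
        *)       echo "usage: btrfs_workaround quota [MOUNTPOINT] | package" >&2
                 exit 2 ;;
    esac
)
```

Running `btrfs_workaround quota /` shows the quota command for the root filesystem; prefixing the call with `APPLY=1` would actually execute it.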
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c74 --- Comment #74 from Thomas König <tkoenig@netcologne.de> --- (In reply to Jeff Mahoney from comment #73)
So, enough with the debating. Yes, I get that people are severely affected by this issue when it pops up. Belittling their concerns doesn't stop the issue from happening. Likewise, calling for everyone's heads because you're experiencing this issue doesn't get it fixed any more quickly. Workarounds are to disable quotas or uninstall the btrfsmaintenance package.
Disabling quotas did not work for me when I experienced the bug. Like I said above, I cannot provide any more data because I chose to re-install my system without btrfs (which I need for gcc development). -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c75 --- Comment #75 from Jeff Mahoney <jeffm@suse.com> --- If disabling quotas didn't work for you, you're experiencing an entirely different issue.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c76 --- Comment #76 from Ludwig Nussel <lnussel@suse.com> --- Short of fixing the root cause for everyone, maybe we can improve the experience for the worst cases. Maybe we can make the balancing more visible and more explicit, specifically on desktops? What's really bad about the current implementation is that those btrfs maintenance tasks hit you unexpectedly. So if you want to get work done and the system suddenly starts to be unresponsive, of course you get grumpy. The background timer does not know if the time is a convenient one for the user. How about, for example, running those btrfs tasks directly after installing updates, as part of what the desktop applet shows? That's something desktop users are expected to do regularly anyway, and installing updates also degrades system performance. In addition there could be e.g. some passive notification that tells the user that some cleanup tasks need to be done, in case the system detects the need for that. Firefox does something like that, for example.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c77 --- Comment #77 from Harald Achitz <harald.achitz@gmail.com> --- #76 is a very good suggestion, and the first productive one in some time, thanks Ludwig! And maybe provide a URL to the wiki in the desktop messages where the tuning options, and their consequences, are explained ;-) since it is one thing to copy-paste a command that disables quotas, which I did on a notebook last year, and another to know what the background is, what the consequences are and so on. And I am speaking from the view of a normally (un)skilled user, not a SUSE developer ;-)
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c78 --- Comment #78 from Frederic Crozat <fcrozat@suse.com> --- (In reply to Ludwig Nussel from comment #76)
The background timer does not know if the time is a convenient one for the user. How about for example running those btrfs tasks directly after installing updates, as part of what the desktop applet shows? That's something desktop users are expected to do regularly anyways and installing updates also degrades system performance.
I'm not so sure it would work: right now, the quota bug (which is the most visible bug) is happening in btrfs transaction. This means it causes ANY IO write to the same btrfs partition to be blocked. If you happen to have /home in that, it will cause freeze of most applications trying to write on the system (in my case, gnome-shell, or evolution or irc client). Having a notification that balance is in effect will not be helpful at all for that case, because either you will not see it or even if you see it, the system will still be in a "frozen" state, with no possible action on it
In addition there could be eg some passive notification that tells the user that some cleanup tasks need to be done, in case the system detects the need for that. Firefox does something like that for example.
If we get to a point where the quota bug doesn't block any more btrfs transaction, balancing shouldn't be a problem anymore (I hope). But it could still be sensible to have a passive notification "system optimization in progress" with some way to pause it, if needed. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c79 --- Comment #79 from Ludwig Nussel <lnussel@suse.com> --- (In reply to Frederic Crozat from comment #78)
(In reply to Ludwig Nussel from comment #76)
The background timer does not know if the time is a convenient one for the user. How about for example running those btrfs tasks directly after installing updates, as part of what the desktop applet shows? That's something desktop users are expected to do regularly anyways and installing updates also degrades system performance.
I'm not so sure it would work:
right now, the quota bug (which is the most visible bug) is happening in btrfs transaction. This means it causes ANY IO write to the same btrfs partition to be blocked. If you happen to have /home in that, it will cause freeze of most applications trying to write on the system (in my case, gnome-shell, or evolution or irc client). Having a notification that balance is in effect will not be helpful at all for that case, because either you will not see it or even if you see it, the system will still be in a "frozen" state, with no possible action on it
That's why balancing should ideally only be done on ACK by the user. Just showing a notification when the timer triggers would be an improvement over the current solution but not fully satisfactory as the user still can't do anything about it. To avoid more and more things bothering the user we could attach the balance job to installing updates as the user resp admin already has to trigger that manually. -- You are receiving this mail because: You are on the CC list for the bug.
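The "only on ACK by the user" idea from comment #79 could look roughly like the sketch below. It is a hypothetical illustration, not how btrfsmaintenance or any desktop applet works: the helper name `run_if_acked` is made up, and the prompt is a plain terminal question rather than a desktop notification.

```sh
#!/bin/sh
# Hypothetical helper: run the given command only after an explicit "y".
# The prompt goes to stderr so the command's own output stays clean.
run_if_acked() {
    printf 'Run "%s" now? The system may be slow while it runs. [y/N] ' "$*" >&2
    # Treat end-of-input (no answer at all) as "no".
    read -r answer || answer=n
    case $answer in
        y|Y) "$@" ;;
        *)   echo "skipped" >&2
             return 1 ;;
    esac
}
```

A call such as `run_if_acked btrfs balance start -dusage=10 /` would then start a balance only after confirmation; the `-dusage=10` filter is an illustrative value, not a recommendation from the thread.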
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c80 Ulrich Derenthal <uli.2001@gmx.de> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |uli.2001@gmx.de --- Comment #80 from Ulrich Derenthal <uli.2001@gmx.de> --- Apparently my laptop has a similar problem (with Opensuse Tumbleweed). It seems to occur weekly (on Mondays) and sometimes lasts for more than an hour. After reading many of the comments here, it remains unclear to me whether there are several different issues (and if so, how to find out which might affect me), whether there is a workaround, and how to proceed. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c81 --- Comment #81 from Eric Schirra <ecsos@schirra.net> --- I have enabled quota in Leap 15.0 again. The hangs are shorter, but not gone. Earlier, transacti and co. hung the PC for several hours. Now it hangs "only" several minutes. The bug is not fixed.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 Chris . <chris@kuta.bid> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |chris@kuta.bid -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c83 Steven Susbauer <ssusbauer@gmail.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |ssusbauer@gmail.com --- Comment #83 from Steven Susbauer <ssusbauer@gmail.com> --- I run into this issue on my laptop, especially when resuming from standby. Thankfully it does not take hours, but there are a few minutes when the machine is mostly unusable. I have seen the suggestion to disable quotas, but other than "reduced functionality" nobody has said what that actually does to snapper - if the normal time-based cleanups are still enabled in the snapper config is that enough? I've found the basic snapper space-aware cleanup info at http://snapper.io/2016/05/18/space-aware-cleanup.html but it doesn't really answer the question. This is an SSD machine running Leap 15.0 with plenty of ram, and I don't think I do anything wild to create high numbers of snapshots. Also, I'm not even sure why these operations were running right now, unless it's related to the standby? It was only overnight. It seems btrfs-balance and btrfs-scrub are running at the same time. 
● btrfs-balance.timer - Balance block groups on a btrfs filesystem
   Loaded: loaded (/usr/lib/systemd/system/btrfs-balance.timer; enabled; vendor>
  Drop-In: /etc/systemd/system/btrfs-balance.timer.d
           └─schedule.conf
   Active: active (waiting) since Thu 2018-06-28 18:35:34 PDT; 2 days ago
  Trigger: Mon 2018-07-02 00:00:00 PDT; 8h left

● btrfs-scrub.timer - Scrub btrfs filesystem, verify block checksums
   Loaded: loaded (/usr/lib/systemd/system/btrfs-scrub.timer; enabled; vendor p>
  Drop-In: /etc/systemd/system/btrfs-scrub.timer.d
           └─schedule.conf
   Active: active (waiting) since Thu 2018-06-28 18:35:34 PDT; 2 days ago
  Trigger: Wed 2018-08-01 00:00:00 PDT; 4 weeks 2 days left

● btrfs-balance.service - Balance block groups on a btrfs filesystem
   Loaded: loaded (/usr/lib/systemd/system/btrfs-balance.service; static; vendo>
   Active: inactive (dead) since Sun 2018-07-01 14:30:56 PDT; 45min ago

● btrfs-scrub.service - Scrub btrfs filesystem, verify block checksums
   Loaded: loaded (/usr/lib/systemd/system/btrfs-scrub.service; static; vendor
   Active: inactive (dead) since Sun 2018-07-01 14:31:13 PDT; 45min ago

Jul 01 14:29:16 thinkpad systemd[1]: Started Scrub btrfs filesystem, verify block checksums.
Jul 01 14:29:16 thinkpad btrfs-scrub.sh[4644]: Running scrub on /
Jul 01 14:31:13 thinkpad btrfs-scrub.sh[4644]: scrub device /dev/mapper/linux-root (id 1) done
Jul 01 14:31:13 thinkpad btrfs-scrub.sh[4644]: scrub started at Sun Jul 1 14:29:16 2018 and finished after 00:01:57
Jul 01 14:31:13 thinkpad btrfs-scrub.sh[4644]: total bytes scrubbed: 8.58GiB with 0 errors
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c85 --- Comment #85 from Daniel Pecka <nettezzaumanaa@gmail.com> --- here is some discussion about btrfs-balance: https://github.com/firehol/netdata/issues/3203 btrfs-balance is a very intensive operation and it's completely situational .. doing that on a weekly basis as a generic default for everybody who has btrfs is utterly and painfully wrong !!! moreover, in this scope I realize that it also significantly increases hard drive utilization, decreasing the hard drive's lifetime, turning soft errors into killers and triggering unhealthy drives to fail earlier and suddenly. in other words, this hammer-style generic default is wrong and insane .. it's a human-factor bug .. I wonder how it could sneak into the distro .. let's please find another solution to this .. regards, daniel
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c92 Keks Dose <cookie170@web.de> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |cookie170@web.de --- Comment #92 from Keks Dose <cookie170@web.de> --- Leap 15.0, fresh install on Thinkpad 450s. Often btrfs cleaner makes the system unresponsive for several minutes. btrfs cleaner starts after each boot and sometimes after hibernate. I've got a fast SSD and even with this it's annoying. What happens to users who have an HDD? I use a USB hard disk for backups with btrfs. If you have an idea how to solve the issue, take into consideration the impact on external drives. Under 42.3 I ruined a snapshot on the backup, probably by killing btrfs cleaner for the HDD, because it took longer than my train would wait for me.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c94 Joachim Banzhaf <joachim.banzhaf@googlemail.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |joachim.banzhaf@googlemail.com --- Comment #94 from Joachim Banzhaf <joachim.banzhaf@googlemail.com> --- Seems I was bitten by this, too, for a long time. Leap 42.3 with updates on an 8 GB RAM laptop with btrfs root fs on SSD. GUI freezes for a long time (minutes). Switching to the console worked once. I could see btrfs-tra... at 100%. This time I could not stop the system before the battery was drained. After reboot, the system went to emergency mode. CPU at 100%, alternating between btrfs and mount of the root fs. On previous occasions the stops were not so long and I always thought it was some hardware-related defect. Now that I had the time, I read many forum posts and bugzilla comments; I know it is not hardware but the fs, and there is no real solution but to wait until the process finishes. For the decision makers: no, a fs that does this is not stable, even if there is no data loss. A fs that needs substantial resources just to stay healthy is crap. I have maintained Linux systems for a long time (started with SuSE 6.x). I very rarely need to go back to a previous system state. So although it sounds cool to have a feature that does this easily, it is by far not worth the trouble it currently causes. And the reason why it does not happen on SLES? My personal experience says: because we run openSUSE on our notebooks and so use other fs where it matters: ext and xfs. Hm, in the meantime it looks like the system has recovered: can I provide some useful info?
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c95 --- Comment #95 from Harald Achitz <harald.achitz@gmail.com> --- this bug has a history back to 2016 and has been declared resolved. I asked whether any meaningful tests had been done; you can look up the answer. https://bugzilla.opensuse.org/show_bug.cgi?id=1017461#c110 It seems either the responsible person does not understand the problem, or they want to hide something. I guess both is the case. But obviously they either don't care, or do not know how to reproduce this behavior to do meaningful development for this problem. I guess both is the case. With the history of this bug and the stubborn ignorance in handling this problem, while providing no meaningful documentation/wiki for this problem at all, I have to say: Too bad to see that SuSE is not something I can recommend anymore to anyone
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c96 --- Comment #96 from Oliver Kurz <okurz@suse.com> --- (In reply to Harald Achitz from comment #95)
this bug has a history back to 2016 and has been declared resolved. I asked whether any meaningful tests had been done; you can look up the answer. https://bugzilla.opensuse.org/show_bug.cgi?id=1017461#c110
It seems either the responsible person does not understand the problem, or they want to hide something. I guess both is the case. But obviously they either don't care, or do not know how to reproduce this behavior to do meaningful development for this problem. I guess both is the case.
I struggle to understand how you can come to this conclusion. You should keep in mind that I am also the reporter of the current bug and I did not give up on this story. Certainly you might want to state that I do not understand the problem, but I understand at least enough of it to keep the discussion running. Accusing me of wanting to hide something is not very nice :( I agree with you that this bug does not receive as much attention as *I* would like it to, but I also trust that the developers who can actually fix it are aware of the issue as well as all others in their backlog, and that they decide based on priorities and feasibility what to work on first and what next. I see your above statement as expressing your frustration, but this will unfortunately most likely not have any helpful impact. It has been expressed that the issue is known and is – despite what the user reports indicate – not easy to reproduce in a consistent manner that can be used to help fix the issue (or issues) more easily. I am of course also running openSUSE with the default btrfs + qgroups + snapshots enabled on my notebook which I use for daily work and (unfortunately) I do not see the issue at all in my environment! If you want to help then try to provide either better tests or proposals for fixes. I have not seen any contributions of that kind rejected. @jeffm, dsterba: I guess you could help here. I would really appreciate it if you could give your current view on this and let the bug status reflect this, e.g. "CONFIRMED" or "IN_PROGRESS"
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c97 --- Comment #97 from Thomas Rother <t.rother@netzwissen.de> --- (In reply to Harald Achitz from comment #95)
this bug has a history back to 2016 and has been declared resolved. I asked whether any meaningful tests had been done; you can look up the answer. https://bugzilla.opensuse.org/show_bug.cgi?id=1017461#c110
It seems either the responsible person does not understand the problem, or they want to hide something. I guess both is the case. But obviously they either don't care, or do not know how to reproduce this behavior to do meaningful development for this problem. I guess both is the case.
With the history of this bug and the stubborn ignorance in handling this problem, while providing no meaningful documentation/wiki for this problem at all, I have to say: Too bad to see that SuSE is not something I can recommend anymore to anyone
I would strongly support Oliver's statement, but I also understand your frustration. I also had this issue on two laptops back in the OpenSUSE 42.3 times. In the office we also have SLES machines with btrfs but without ANY similar issues. I followed some of the published workarounds and since Leap 15 I haven't seen the issue for a long time on both laptops. But I don't really know why it finally disappeared. Some issues, even in the open source field, take a long time to solve, and the best way to help is to describe in detail the setups and situations where it appears, to make it reproducible for others.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c98 --- Comment #98 from Daniel Pecka <nettezzaumanaa@gmail.com> --- (In reply to Oliver Kurz from comment #96)
If you want to help then try to provide either better tests or proposals for fixes. I have not seen any contributions of that kind rejected.
This is ridiculous. So you would like to play the game that the problem doesn't exist unless it hits your computer ? omg I've already (and other ppl as well) provided proposals and insight, all I can do is repeat myself (and MANY other ppl discussing that on the internet, just google that): 1) the requirement for btrfs balance is completely situational and it is normally NOT needed to run that regularly !!! having this hammer-style operation as default is insane and incompetent. 2) it severely decreases the lifetime of hard drives and it makes possibly unhealthy drives fail earlier and suddenly, because it rewrites huge amounts of data unnecessarily and it's a very very intensive operation. It's painfully wrong to have it as a generic default done on a scheduled basis for everybody and - I dare to say - it just confirms the lack of experience and understanding to the problem, exactly as Harald Achitz said. endnote proposal: kick it away, this shall NOT be default. regards, dan ps: ``the requirement for btrfs balance is completely situational and it is normally NOT needed to run that regularly !!! having this hammer-style operation as default is insane and incompetent.'' - I considered this so important that I had to repeat it (I know certain ppl like/need things being repeated)
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c99 --- Comment #99 from Oliver Kurz <okurz@suse.com> --- (In reply to Daniel Pecka from comment #98)
[…] I dare to say - it just confirms the lack of experience and understanding to the problem, exactly as Harald Achitz said.
I did not doubt that. I am *just* a stupid QA engineer, no kernel filesystem hacking expert :) -> see who is reporter and who is assignee -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c100 --- Comment #100 from Daniel Pecka <nettezzaumanaa@gmail.com> --- (In reply to Oliver Kurz from comment #99)
(In reply to Daniel Pecka from comment #98)
[…] I dare to say - it just confirms the lack of experience and understanding to the problem, exactly as Harald Achitz said.
I did not doubt that. I am *just* a stupid QA engineer, no kernel filesystem hacking expert :) -> see who is reporter and who is assignee
if you wish to take it personally THIS way, please make the outcome other than just words .. I have to repeat myself: 1) the requirement for btrfs balance is completely situational and it is normally NOT needed to run that regularly (moreover on a generic basis as default for everybody) 2) it severely decreases the lifetime of hard drives and it makes possibly unhealthy drives fail earlier and suddenly, because it rewrites huge amounts of data unnecessarily and it's a very very intensive operation. ^^ besides being easy to reproduce, point #2 at least is very very easy to understand and adopt dan
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c101 --- Comment #101 from Ronan Chagas <ronisbr@gmail.com> --- Guys, let’s calm down. First, I have been facing this problem since 2016 and I completely support the idea that BTRFS as used by openSUSE is not ready for production. I saw this bug in every setup with both Leap and Tumbleweed. However, the workaround could not be simpler. Just disable quotas. This fixed the problem in 100% of my cases. You will have a stable system while the devs fix the problems. The only downside is that you will lose the snapshot auto cleanup feature, which I, personally, do not care about.
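For readers who want to try the workaround from this comment, it comes down to one `btrfs quota` command. What follows is only a minimal sketch, assuming the affected filesystem is mounted at `/`; the helper names `run` and `disable_quota` are hypothetical and are not part of btrfsmaintenance or snapper. By default the helper only prints the command; setting `RUN=1` executes it, which needs root.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.
# Default is a dry run that prints the command; RUN=1 executes it.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Disable quota (qgroup) accounting on the given mount point (default: /).
disable_quota() {
    mnt="${1:-/}"
    run btrfs quota disable "$mnt"
}
```

Calling `disable_quota /` prints `btrfs quota disable /`. The counterpart `btrfs quota enable /` switches accounting back on; as the comment says, the price of running without quotas is the snapshot auto cleanup that relies on them.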
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c102 Jeff Mahoney <jeffm@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |CONFIRMED --- Comment #102 from Jeff Mahoney <jeffm@suse.com> --- Please see comment #58 before making blind comments about the issue being ignored or whether we're claiming it's already fixed. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c103 --- Comment #103 from Thomas Rother <t.rother@netzwissen.de> --- (In reply to Jeff Mahoney from comment #102)
Please see comment #58 before making blind comments about the issue being ignored or whether we're claiming it's already fixed.
Given this information, the bug should be finally closed and someone should add the information from #58 about fixing this for upgraders ("If you have a system that has been gradually updated, ensure that the old cron.* jobs aren't still installed. fstrim/balance is a bad combination to run simultaneously.") into the upgrade documentation at https://doc.opensuse.org/release-notes/ -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c104 --- Comment #104 from Jeff Mahoney <jeffm@suse.com> --- It shouldn't be closed. The bug is still there. There's still a terrible algorithm at its core that needs to be fixed. This isn't an issue of ignoring the community, it's an issue of finite resources. Since there is a clear workaround, other things move ahead of it in the queue. If someone wants to take a shot at fixing it, I'd be happy to provide some guidance. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c105 --- Comment #105 from Daniel Pecka <nettezzaumanaa@gmail.com> --- and I have to repeat myself ... the requirement for doing balance is situational and it does NOT need to be run on a time-scheduled regular basis !! With quotas and without quotas it still moves unnecessarily huge amounts of data and decreases the lifetime of drives, and it also triggers unhealthy drives or drives in a predictive failure state to die before they normally would !!! It's a broken concept that everybody who uses btrfs just does it periodically by default .. This needs to be removed and not just fixed with tape and crutch by disabling quotas .. regards, dan
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c106 --- Comment #106 from Jeff Mahoney <jeffm@suse.com> --- I agree that it doesn't need to be done automatically anymore. It wasn't always situational. It used to be that btrfs wouldn't clean up empty block groups, so once you used all of your storage, even if you cleaned it up, if your workload changed you'd be out of luck. Balance also needs to be smarter about when it relocates. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c107 --- Comment #107 from Daniel Pecka <nettezzaumanaa@gmail.com> --- (In reply to Jeff Mahoney from comment #106)
I agree that it doesn't need to be done automatically anymore. It wasn't always situational. It used to be that btrfs wouldn't clean up empty block groups, so once you used all of your storage, even if you cleaned it up, if your workload changed you'd be out of luck. Balance also needs to be smarter about when it relocates.
exactly ... it was needed in past, so let's step ahead from past to present :) -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c108 --- Comment #108 from Richard Brown <rbrown@suse.com> --- (In reply to Daniel Pecka from comment #107)
(In reply to Jeff Mahoney from comment #106)
I agree that it doesn't need to be done automatically anymore. It wasn't always situational. It used to be that btrfs wouldn't clean up empty block groups, so once you used all of your storage, even if you cleaned it up, if your workload changed you'd be out of luck. Balance also needs to be smarter about when it relocates.
exactly ... it was needed in past, so let's step ahead from past to present :)
In the light of Jeff's suggestion, I have made the following submission to patterns-base https://build.opensuse.org/request/show/625441 -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c109 --- Comment #109 from Thomas Rother <t.rother@netzwissen.de> --- OK, I understand that this bug should not be closed. But there should be some clear information for "normal users" (those that don't know all the details of the btrfs kernel module code and all the cron jobs running in the background of a normal OpenSUSE install): a) What is the status of this bug? (Answer: in progress.) b) What are the circumstances where it mainly appears? (Answer: updated systems which have old and new cron jobs running, as I understand?) c) What is the workaround ("Just disable quotas") until a really final solution is found, satisfying both the SLES and OpenSUSE users/communities?
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c110 --- Comment #110 from Harald Achitz <harald.achitz@gmail.com> --- what I mostly miss is clear documentation about the issue: how to troubleshoot, what it means, ... and/or the recommendation to not install btrfs if some functionality is not needed. I mean, if I select ext4 in the installer, there is no problem, right? And what do I miss if I do this? Maybe nothing that I would ever use anyway as an average user. So putting this as a default option for openSuSE users, making them beta testers, risking that their SSDs' lifetime shortens and that they might have freezes in moments where there should be none, is a bad option. 'I told you to use CentOS' is nothing I want to hear from colleagues again because you do not provide clear documentation and information! Also, in case of problems there is no advice on which info you would like to have/need. As long as the default install option on a notebook can lead to these freezes, closing this bug would just confirm the neglect of this problem
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c111 --- Comment #111 from Steven Susbauer <ssusbauer@gmail.com> --- (In reply to Thomas Rother from comment #109)
b) What are the circumstances where it appears mainly (Answer: updated systems which have old and new cronjobs running, as I understand?)
The old and new may also be an issue, but this behavior also happens on fresh installs of Leap and Tumbleweed. It seems like it potentially happens anywhere with btrfs and the default settings. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c112 --- Comment #112 from Andre Guenther <guenther@mpanrw.de> --- (In reply to Thomas Rother from comment #109)
b) What are the circumstances where it appears mainly (Answer: updated systems which have old and new cronjobs running, as I understand?)
My 42.3 system was a fresh install and has the problem (SSD). I also tried to replace the btrfs-balance and btrfs-trim scripts with the suggested combined btrfs-balance-trim script, but that didn't change anything. (Mostly there is about 30 minutes of stagnation on Fridays, but sometimes it's several hours.) BTW: I think the best workaround is to remove the scripts altogether and make the balance every 1 or 2 months by hand instead of disabling quotas.
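A balance run by hand, as suggested here, does not have to rewrite the whole filesystem. The following is a minimal sketch, assuming the `-dusage` and `-musage` filters of `btrfs balance start`, which limit the run to data and metadata block groups at or below the given usage percentage. The function name `balance_cmd` and the default of 50 are illustrative choices, not values taken from btrfsmaintenance. The function only prints the command so it can be reviewed before being run as root.

```shell
#!/bin/sh
# Hypothetical helper: print a filtered balance command for review.
#   $1 = mount point (default /)
#   $2 = usage threshold in percent (default 50, an arbitrary example value)
balance_cmd() {
    mnt="${1:-/}"
    usage="${2:-50}"
    printf 'btrfs balance start -dusage=%s -musage=%s %s\n' "$usage" "$usage" "$mnt"
}
```

For example, `balance_cmd / 30` prints `btrfs balance start -dusage=30 -musage=30 /`. A lower threshold touches fewer block groups and should therefore move less data, at the cost of reclaiming less space.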
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c113 --- Comment #113 from Keks Dose <cookie170@web.de> --- (In reply to Jeff Mahoney from comment #104)
It shouldn't be closed. The bug is still there. There's still a terrible algorithm at its core that needs to be fixed. This isn't an issue of ignoring the community, it's an issue of finite resources. Since there is a clear workaround, other things move ahead of it in the queue. If someone wants to take a shot at fixing it, I'd be happy to provide some guidance.
»Since there is a clear workaround...« I'm only a user, and messing with a filesystem really is way above my skills. Sorry. I know, if everything goes wrong, I only have to reinstall the OS. But I installed Leap 15.0 to work with it, because it is supposed to be stable! The urgent need to reinstall because / runs out of space may occur right when I don't have time to deal with it. (During the last days I haven't encountered a btrfs cleaner attack -- has there been an update?)
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c114 --- Comment #114 from Richard Brown <rbrown@suse.com> --- (In reply to Andre Guenther from comment #112)
BTW: I think the best workaround is to remove the scripts altogether and make the balance every 1 or 2 months by hand instead of disabling quotas.
Which is exactly what the submission I made and shared in comment#108 will achieve Can we please all collectively try to reduce the noise on this bug please? -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c115 --- Comment #115 from Richard Brown <rbrown@suse.com> --- (In reply to Keks Dose from comment #113)
(During the last days I haven't encountered a btrfs cleaner attack -- has there been an update?)
Yes, as stated repeatedly in this bug, there have been many updates in this area, and while this bug will remain open until the specific, well documented, well reported issue is resolved, please refrain from adding anything to this bug that doesn't include bug reports or helpful information relevant to this bug. Questions like the above can be asked on the openSUSE Forums, for example.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c116 --- Comment #116 from Keks Dose <cookie170@web.de> --- (In reply to Richard Brown from comment #114)
(In reply to Andre Guenther from comment #112)
Can we please all collectively try to reduce the noise on this bug please?
Reviewing your messages in this thread and elsewhere, you seem to have missed some opportunities to follow your own advice. :-)
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c117 --- Comment #117 from Joachim Banzhaf <joachim.banzhaf@googlemail.com> --- Ok, trying to ignore all political and personal stuff and instead add some info: (repeat: notebook with SSD, 8GB RAM, Leap 42.3, btrfs root fs, originally it was Leap 42.1, iirc I did not change settings on OS level -> SUSE defaults) I did not disable quotas yet (and I don't know if they are enabled; I only found how to disable or enable them, but not yet how to check their status). I did delete all snapshots with YaST (only a few, I did that before). Then I removed the snapper stuff:
rpm -e grub2-snapper-plugin-2.02-10.2.noarch snapper-zypp-plugin-0.5.0-1.1.noarch yast2-snapper-3.2.0-3.5.x86_64 snapper-0.5.0-1.1.x86_64
It did not help. Freezes after starting my notebook today. I do not find systemd timers related to btrfs, just two cron jobs (probably because my notebook still runs on Leap 42.3). I have now disabled the cron jobs. The cron jobs are probably missing some configuration stuff?
cat: /etc/default/btrfsmaintenance: No such file or directory
cat: /etc/system/btrfsmaintenance: No such file or directory
Or is that normal and the script defaults are ok? Where is the output of the script then? I did not find anything in the usual suspects: no /var/log/messages, journalctl | grep btrfs-balance produces an error (Failed to get journal fields: Bad message) and the root mailbox has been empty since December. I don't know how the timers will work, but I did not like the cron mechanism that tends to trigger this stuff at startup. It should run only when the system has been idle for some time and only when not on battery. Only if this strategy fails for some time might it be ok to do this at startup and on battery. Btw. I also don't like the "windows" way of doing stuff at shutdown. Usually, if I shut my notebook down I want it to be off fast. 
If it takes long, I have to close it. It goes to sleep, and when I need it again, it greets me with shutdown tasks still going on and a drained battery due to being asleep instead of switched off. I did not disable quotas yet, because I read the system itself uses them somehow? But this will be the next step in a few days regardless. Finally: Why scrubbing? The idea of scrubbing is to detect failures before you have multiple of them, because then you cannot repair your RAID anymore. Right? On a notebook I have one SSD with no RAID. Also, I do backups of all my important data, so I have a kind of scrubbing where it matters. So, without RAID it seems to have only heavy drawbacks and no gain?
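On the open question in this comment of how to tell whether quotas are enabled at all: one approach is to ask btrfs for the qgroup list, which is expected to fail when quota accounting is off. This is a minimal sketch that assumes that exit-status behaviour of `btrfs qgroup show`; the function name `quota_status` and the `BTRFS` override variable are hypothetical and exist only so the sketch can be tried without touching a real filesystem.

```shell
#!/bin/sh
# Hypothetical helper: report whether qgroup accounting seems to be active.
# Assumption: `btrfs qgroup show <mount>` exits non-zero when quotas are off.
# BTRFS can be set to another command (for example `true` or `false`) for testing.
quota_status() {
    mnt="${1:-/}"
    if ${BTRFS:-btrfs} qgroup show "$mnt" >/dev/null 2>&1; then
        echo "quotas appear to be enabled on $mnt"
    else
        echo "quotas not enabled on $mnt (or the query failed)"
    fi
}
```

Run as root, `quota_status /` prints one of the two messages. A failure can also mean missing permissions or a non-btrfs mount point, which is why the second message is worded cautiously.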
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c118 --- Comment #118 from Daniel Pecka <nettezzaumanaa@gmail.com> --- (In reply to Joachim Banzhaf from comment #117) hello, you'd rather uninstall the btrfsmaintenance package .. snapper with its snapshots is innocent in this .. if you do not wish to uninstall btrfsmaintenance entirely, just disable the related timers (check `systemctl list-timers' or just `rpm -q btrfsmaintenance -l') + probably also systemctl mask them .. personally, I just uninstalled the btrfsmaintenance package (and git cloned it locally afterwards without timers, cron, etc ...) regards, dan
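The disable-and-mask route from this comment could look roughly like the sketch below. It assumes the timer unit names that appear in the status output earlier in this thread (`btrfs-balance.timer`, `btrfs-scrub.timer`); the function name `timer_off_cmds` is hypothetical. It only prints the `systemctl` commands, so nothing changes until the output is reviewed and run as root.

```shell
#!/bin/sh
# Hypothetical helper: print the commands that stop, disable and mask
# each timer unit passed as an argument. Nothing is executed here.
timer_off_cmds() {
    for unit in "$@"; do
        printf 'systemctl disable --now %s\n' "$unit"
        printf 'systemctl mask %s\n' "$unit"
    done
}
```

`systemctl list-timers 'btrfs-*'` shows which of these timers exist on a given system. After that, `timer_off_cmds btrfs-balance.timer btrfs-scrub.timer` prints the four commands, and `systemctl unmask` followed by `systemctl enable --now` reverses them.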
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 nicholas cunliffe <ndcunliffe@gmail.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC|ndcunliffe@gmail.com | -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 Riku Ahonen <rikuah1n@gmail.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |rikuah1n@gmail.com -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 Karsten de Freese <karsten.defreese@posteo.de> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |karsten.defreese@posteo.de -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 Andrei Borzenkov <arvidjaar@gmail.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |arvidjaar@gmail.com -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 Yunhe Guo <i@guoyunhe.me> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |i@guoyunhe.me -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 Michiel Janssens <michiel@nexigon.net> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |michiel@nexigon.net -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c132 Tasik B <bulve.rec@gmail.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |bulve.rec@gmail.com --- Comment #132 from Tasik B <bulve.rec@gmail.com> --- I don't want to create a new report or a new post in the forum; the comments above say a lot. I will describe my situation: openSUSE Leap 15, KDE Plasma, Xeon E5-2665, decent SSD for the root file system. File system defaults as suggested by the installer, so it's a BTRFS root file system with all other disks as ext4. Usually I keep the PC in sleep mode overnight, but when it wakes up and "btrfs balance start -v 50" kicks in, the PC seems frozen, and after a single Ctrl+Alt+Backspace a hard restart happens. Or, if the process starts while I am logged in, it's impossible to do anything; everything freezes and is unusable. I think there is no such issue on any other OS, at least. Maybe this process needs to have lower priority or a slightly different design? I just don't know how much longer I will forgive openSUSE...
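The idea of giving the process a lower priority could be tried with the standard `ionice` and `nice` tools. The sketch below is an experiment, not a known fix. It assumes that the idle I/O class (`ionice -c 3`) is honoured by the I/O scheduler in use, which is not the case for every scheduler, and it assumes that work done on the kernel side of a balance may not inherit the caller's priority at all. The function name `low_prio_cmd` and the `-dusage=50` filter in the usage example are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: prefix a command with idle I/O class and lowest
# CPU priority, and print the result for review instead of running it.
low_prio_cmd() {
    printf 'ionice -c 3 nice -n 19 %s\n' "$*"
}
```

For example, `low_prio_cmd btrfs balance start -dusage=50 /` prints `ionice -c 3 nice -n 19 btrfs balance start -dusage=50 /`. If the freezes persist with this prefix, that would point at work that is not governed by the caller's priority.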
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 Dan Čermák <dcermak@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |dcermak@suse.com
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 Goldwyn Rodrigues <rgoldwyn@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC|rgoldwyn@suse.com |
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 Vadim Krevs <vkrevs@yahoo.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |vkrevs@yahoo.com
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c200 Simcha Lerner <syl-novell-mji@sufrin.org> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |syl-novell-mji@sufrin.org --- Comment #200 from Simcha Lerner <syl-novell-mji@sufrin.org> --- I see a lot of automated posts lately, but they're not particularly informative to the uninitiated. I'd appreciate it if someone could summarize the current status of this bug (these bugs?) in terms of what fixes have been put into place (I'm running the latest Tumbleweed on some systems, Leap 15.1 on others) and what the current roadmap is for further work. Thank you very much.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c201 Wenruo Qu <wqu@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|CONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #201 from Wenruo Qu <wqu@suse.com> --- (In reply to Simcha Lerner from comment #200)
I see a lot of automated posts lately, but they're not particularly informative to the uninitiated.
I'd appreciate it if someone could summarize the current status of this bug (these bugs?) in terms of what fixes have been put into place (I'm running the latest Tumbleweed on some systems, Leap 15.1 on others) and what the current roadmap is for further work.
Thank you very much.
In short, the problem should be solved in upstream after v5.1 and SLE12-SP3/SLE15. The fix is skipping tree blocks if they are not modified after balance. It's still possible if there are a lot of writes along with balance, but even for that case, the load should be much smaller than the old behavior. I forgot to close this bug, sorry for that.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c202 Oliver Kurz <okurz@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |REOPENED Resolution|FIXED |--- --- Comment #202 from Oliver Kurz <okurz@suse.com> --- (In reply to Wenruo Qu from comment #201)
[…] I forgot to close this bug, sorry for that.
Hi Wenruo, thanks for your answer. Unfortunately it seems you overlooked what was described as problem in the initial description: https://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c0 states that the priority as configured in /etc/sysconfig/btrfsmaintenance is not regarded in the triggered maintenance jobs. I could easily check this on my up-to-date openSUSE Leap 15.1 system by calling sudo sh -x /usr/share/btrfsmaintenance/btrfs-balance.sh and observing that the maintenance jobs are started without any nice level and the system responsiveness is impacted. Seeing that the bash script has not been changed to regard e.g. "idle" priority this is understandable. It feels that the changes you have applied actually help to prevent a critical performance degradation so an improvement *is* noticeable but the original problem is still present, at least partially hence reopening.
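One way to double-check the observation in comment #202 is to query the I/O scheduling class of the maintenance process while it is running. The following is only a sketch: it assumes `pgrep` and `ionice` (util-linux) are installed, the helper name `show_ioprio` is hypothetical, and the default process name `btrfs` is an assumption about how the job appears in the process list.

```sh
# Print the I/O scheduling class and priority of every process whose
# name matches exactly (default: "btrfs").
# show_ioprio is a hypothetical helper, not part of btrfsmaintenance.
show_ioprio() {
    name="${1:-btrfs}"
    for pid in $(pgrep -x "$name"); do
        printf '%s: ' "$pid"
        ionice -p "$pid"    # reports the class (and priority, if any) of the PID
    done
}
```

Running `show_ioprio` in a second terminal while the balance script is active shows whether the process runs in the idle class or in another class.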
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c203 --- Comment #203 from Wenruo Qu <wqu@suse.com> --- (In reply to Oliver Kurz from comment #202)
(In reply to Wenruo Qu from comment #201)
[…] I forgot to close this bug, sorry for that.
Hi Wenruo, thanks for your answer. Unfortunately it seems you overlooked what was described as problem in the initial description:
https://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c0 states that the priority as configured in /etc/sysconfig/btrfsmaintenance is not regarded in the triggered maintenance jobs. I could easily check this on my up-to-date openSUSE Leap 15.1 system by calling sudo sh -x /usr/share/btrfsmaintenance/btrfs-balance.sh and observing that the maintenance jobs are started without any nice level and the system responsiveness is impacted. Seeing that the bash script has not been changed to regard e.g. "idle" priority this is understandable. It feels that the changes you have applied actually help to prevent a critical performance degradation so an improvement *is* noticeable but the original problem is still present, at least partially hence reopening.
Thanks for the extra explanation. It indeed looks like a problem, but I'm not yet 100% sure. Would you please do me a favor by disabling quota and retesting? If the problem still exists, then it's 100% sure the problem is not quota related. If the problem is just gone, then it's the old quota problem and I must dig further. BTW, if it's purely balance/scrub/trim related, it would be much better to change the title to remove the quota part. Thanks, Qu
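The retest requested here could look roughly like the following. This is a sketch under assumptions: the file system is mounted read-write at `/`, the commands run as root, and `quota_enabled` and `disable_quota` are hypothetical helper names. The check relies on `btrfs qgroup show` exiting non-zero when quotas are off.

```sh
# Return success if qgroups (quota) are enabled on the given mount point.
# Assumption: `btrfs qgroup show` fails when quotas are disabled.
quota_enabled() {
    btrfs qgroup show "${1:-/}" >/dev/null 2>&1
}

# Disable quota on the mount point (needs root) and report the result.
disable_quota() {
    mnt="${1:-/}"
    btrfs quota disable "$mnt" || return 1
    if quota_enabled "$mnt"; then
        echo "quota still enabled on $mnt"
    else
        echo "quota disabled on $mnt"
    fi
}
```

After `disable_quota /` reports success, the balance script can be rerun to see whether the responsiveness problem persists.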
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c204 --- Comment #204 from Jeff Mahoney <jeffm@suse.com> --- At this point, it should just be scrub. Everything else has ioprio set in the systemd unit files.
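Whether a given unit actually carries an I/O priority can be inspected with `systemctl show`. The sketch below assumes unit names such as `btrfs-balance.service`, `btrfs-scrub.service` and `btrfs-trim.service`; these names are an assumption and should be verified with `systemctl list-unit-files 'btrfs-*'`. `unit_ioprio` is a hypothetical helper.

```sh
# Show the I/O scheduling settings systemd applies to a unit.
# IOSchedulingClass and IOSchedulingPriority are standard unit properties;
# unit_ioprio itself is a hypothetical helper name.
unit_ioprio() {
    systemctl show -p IOSchedulingClass -p IOSchedulingPriority "$1"
}
```

For example, `unit_ioprio btrfs-balance.service` prints the two properties for that unit, which can then be compared against the scrub unit.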
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c205 --- Comment #205 from Oliver Kurz <okurz@suse.com> --- (In reply to Wenruo Qu from comment #203)
[…] Would you please do me a favor by disabling quota and retest?
sorry but I don't even think this is necessary. This is simply about arguments readout in shell scripts.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c207 Jeff Mahoney <jeffm@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|REOPENED |RESOLVED Resolution|--- |FIXED --- Comment #207 from Jeff Mahoney <jeffm@suse.com> --- "idle" isn't interpreted for scrub because it's the default. It's documented in the manpage. Closing as FIXED.
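The behaviour described in comment #207 can be pictured as a small mapping from the configured priority to `btrfs scrub start` options. This is an illustrative sketch, not the actual contents of btrfs-scrub.sh: the function name `scrub_ioprio_flags` and the `-c 2 -n 4` value chosen for "normal" are assumptions. `-c` and `-n` are the I/O priority class and class data options of `btrfs scrub start`.

```sh
# Map a priority name to `btrfs scrub start` I/O priority options.
# Hypothetical helper; the values for "normal" are an assumption.
scrub_ioprio_flags() {
    case "${1:-idle}" in
        idle)
            # Per comment #207, scrub already defaults to the idle class,
            # so no options are emitted.
            : ;;
        normal)
            # Best-effort class (2) with a mid-range priority (4).
            printf '%s\n' "-c 2 -n 4" ;;
        *)
            echo "unknown priority: $1" >&2
            return 1 ;;
    esac
}
```

A caller would expand the result unquoted so that it splits into separate arguments, for example `btrfs scrub start $(scrub_ioprio_flags "$BTRFS_SCRUB_PRIORITY") /`. Passing `-c 3` explicitly, as suggested in the original report, would request the idle class by name instead of relying on the default.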
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c208 --- Comment #208 from Simcha Lerner <syl-novell-mji@sufrin.org> --- (In reply to Wenruo Qu from comment #201)
In short, the problem should be solved in upstream after v5.1 and SLE12-SP3/SLE15.
Is this included in Leap 15.1 and the current version of Tumbleweed?
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c209 --- Comment #209 from Wenruo Qu <wqu@suse.com> --- (In reply to Simcha Lerner from comment #208)
(In reply to Wenruo Qu from comment #201)
In short, the problem should be solved in upstream after v5.1 and SLE12-SP3/SLE15.
Is this included in Leap 15.1 and the current version of Tumbleweed?
Should be included for a while, at least for tumbleweed.
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c265 Samuel DENIS <sam@wekk.io> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |sam@wekk.io --- Comment #265 from Samuel DENIS <sam@wekk.io> --- Hello, I think I am facing this bug with a recent MicroOS install on a really low power laptop (a Xiaomi Mi Air 12.5" from 2016). It makes this computer totally freeze. My only solution is to force power off and restart it, but that doesn't leave me a lot of time to take any action after reboot (even in recovery mode) because the computer freezes quickly. I have something like 5 snapshots (I had more but I was able to remove some once), maybe 6 or 7. I don't think this is a huge amount of snapshots, correct me if I'm wrong. I tried to disable quota by running `sudo btrfs quota disable /` but I get a "readonly filesystem" error. This leads me to two questions: * are quotas required for MicroOS (especially Snapper)? * if no, how can I disable quotas? Thanks for your help!
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c266 --- Comment #266 from Wenruo Qu <wqu@suse.com> --- (In reply to Samuel DENIS from comment #265)
Hello,
I think I am facing this bug with a recent MicroOS install on a really low power laptop (a Xiaomi Mi Air 12.5" from 2016). It makes this computer totally freeze. My only solution is to force power off and restart it, but that doesn't leave me a lot of time to take any action after reboot (even in recovery mode) because the computer freezes quickly.
Unfortunately, latest kernels already have a proper way to skip the heavy workload at balance time. So it's unlikely that's the cause.
I have something like 5 snapshots (I had more but I was able to remove some once), maybe 6 or 7. I don't think this is a huge amount of snapshots, correct me if I'm wrong.
I tried to disable quota by running `sudo btrfs quota disable /` but I get a "readonly filesystem" error.
This looks like the root cause, by somehow your fs is already corrupted, thus it looks like your freeze is caused by some kernel crash.
This leads me to two questions:
* are quotas required for MicroOS (especially Snapper)? * if no, how can I disable quotas?
Thanks for your help!
Do you have any dmesg when the read-only happens? It's better to open another BZ for your bug though.
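Because the machine has to be powered off after the freeze, the relevant messages are more likely to be in the kernel log of the previous boot than in the current `dmesg` output. A minimal sketch for collecting them, assuming systemd-journald keeps a persistent journal (otherwise `-b -1` has nothing to show); `btrfs_kmsg` is a hypothetical helper name.

```sh
# Print kernel messages mentioning btrfs for a given boot.
# Argument: journalctl boot offset (0 = current boot, -1 = previous boot).
# btrfs_kmsg is a hypothetical helper name.
btrfs_kmsg() {
    journalctl -k -b "${1:-0}" --no-pager | grep -i btrfs
}
```

Calling `btrfs_kmsg -1` right after the forced restart would list the btrfs-related kernel messages from the boot in which the file system went read-only.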
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 http://bugzilla.opensuse.org/show_bug.cgi?id=1063638#c267 --- Comment #267 from Samuel DENIS <sam@wekk.io> ---
Unfortunately, latest kernels already have a proper way to skip the heavy workload at balance time.
So it's unlikely that's the cause. […] This looks like the root cause, by somehow your fs is already corrupted, thus it looks like your freeze is caused by some kernel crash.
Thanks for this explanation.
Do you have any dmesg when the read-only happens? It's better to open another BZ for your bug though.
I'll try to look at dmesg, and if I have a particular message I'll look for another bug or create one. I thought it could be this bug as it looks like it and I saw the #264 comment (from 2021-08-11 00:01:08 UTC). If my filesystem is corrupted, any first thought about how to fix it?
http://bugzilla.opensuse.org/show_bug.cgi?id=1063638 Fusion Future <qydwhotmail@gmail.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |qydwhotmail@gmail.com