Hi guys,
Some time ago, a BTRFS developer (Chris Murphy), who was helping me fix a bug, warned that quotas must not be enabled by default in BTRFS because the feature is not stable. He actually sent a message to this mailing list:
https://lists.opensuse.org/opensuse-factory/2016-09/msg00032.html
However, some openSUSE developers contradicted Chris, especially Richard Brown:
https://lists.opensuse.org/opensuse-factory/2016-09/msg00085.html
Hence, nobody took the advice and quotas were enabled by default in Leap 42.2.
I am sending this message because I think it is necessary to rethink this decision. A very annoying bug that I was having on all my openSUSE machines (Leap and Tumbleweed) is actually caused by quotas. Every week, when the maintenance script starts, my systems become unresponsive for almost 30 min. I mentioned it here but received just one answer:
https://lists.opensuse.org/opensuse-factory/2016-09/msg00130.html
Today, I saw a bug reported by Jan Ritzerfeld describing exactly the same behavior and mentioning that disabling quota fixes it:
https://bugzilla.opensuse.org/show_bug.cgi?id=1017461
I tried it, and I can confirm that this workaround does fix the bug on Leap and Tumbleweed. My server, which was recently updated to Leap 42.2, had started to show those freezes. Note that quotas were not enabled in 42.1. Hence, I think we must revisit this subject as soon as possible.
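For anyone who wants to try the same workaround, here is a minimal sketch of how quotas could be switched off (and back on) for a mounted filesystem. It assumes the root filesystem is the one affected and that `btrfs-progs` is installed; the helper names and the dry-run default are purely illustrative and not part of any openSUSE tool.

```python
import subprocess


def quota_disable_cmd(mountpoint="/"):
    """Command that turns off btrfs quotas (qgroups) on a mounted filesystem."""
    return ["btrfs", "quota", "disable", mountpoint]


def quota_enable_cmd(mountpoint="/"):
    """Command that turns quotas back on, to restore the default later."""
    return ["btrfs", "quota", "enable", mountpoint]


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False (needs root)."""
    print(" ".join(cmd))
    if dry_run:
        return None
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Assumption: "/" is the btrfs filesystem showing the freezes.
    run(quota_disable_cmd("/"))
```

With the default `dry_run=True` the script only prints the command, so it can be reviewed before running it for real as root.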
Best regards, Ronan Arraes
This also happens to me on TW, and is incredibly frustrating. So can we please also disable quotas on TW as well?
On 01/03/2017 11:11 AM, Aleksa Sarai wrote:
This also happens to me on TW, and is incredibly frustrating. So can we please also disable quotas on TW as well?
Or we could give the kernel maintainers a couple of weeks to come back from their holidays and see if they can come up with a patch to fix the issue. This report is 1 week old, and given the time of year that's not much time for anyone to deal with the issue. Once someone has had time to look at the bug (or if there is a long period of no one looking at it), then we can talk about changing defaults. "Feature X has a bug, let's disable Feature X" doesn't seem like the best approach; we should try to fix the bug first.
Simon Lees wrote:
Or we could give the kernel maintainers ...
kernel maintainers != btrfs developers; btrfs isn't in maintenance mode and likely won't be for some time. The problem seems to be caused by a user-mode script running. If that can lock up the kernel, wouldn't that be good for a denial-of-service attack?
But why give them any time if it is a bad decision? Why should quotas be on at all on a user system?
I'm not sure I see any conclusive tests here.
There are several factors:
1 - ENOSPC error vs. performance hit
2 - maintenance script with or without quotas
If it is claimed that disabling the quota system is a fix, how is this determined? Wait one week, hope the load case is equal, and see the results?
A proper test would be:
a - balance (sudo /etc/cron.weekly/btrfs-balance)
b - write, say, 10,000 blocks with dd
c - run balance
d - repeat without quotas enabled
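The steps above could be scripted roughly as follows, so that both runs use the same load. This is only a sketch under stated assumptions: the 1M block size, the use of /dev/zero, the scratch file path, and the helper names are not from the thread; only the balance script path and the "10,000 blocks" figure are.

```python
import subprocess
import time

# Weekly maintenance script named in the thread.
BALANCE = ["/etc/cron.weekly/btrfs-balance"]


def dd_cmd(target, blocks=10000, block_size="1M"):
    """dd invocation writing `blocks` blocks; block size and source are assumptions."""
    return ["dd", "if=/dev/zero", f"of={target}",
            f"bs={block_size}", f"count={blocks}", "conv=fsync"]


def build_plan(mountpoint="/", scratch="/var/tmp/quota-test.bin"):
    """Steps a-c with quotas enabled, then the same steps with quotas disabled (step d)."""
    one_pass = [BALANCE, dd_cmd(scratch), BALANCE]
    return ([["btrfs", "quota", "enable", mountpoint]] + one_pass
            + [["btrfs", "quota", "disable", mountpoint]] + one_pass)


def run_plan(plan, dry_run=True):
    """Run each command in order and record its wall-clock duration in seconds."""
    timings = []
    for cmd in plan:
        start = time.monotonic()
        if not dry_run:
            subprocess.run(cmd, check=True)  # needs root
        timings.append((" ".join(cmd), time.monotonic() - start))
    return timings


if __name__ == "__main__":
    for name, seconds in run_plan(build_plan()):
        print(f"{seconds:8.1f}s  {name}")
```

Comparing the duration of the second balance in each pass would give a repeatable number instead of waiting a week for the cron job.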
According to the wiki, quota performance can be impacted by too many snapshots.
Hi Nicholas.
On Tue, 2017-01-03 at 08:58 +0100, nicholas wrote:
I'm not sure I see any conclusive tests here.
There are several factors:
1 - ENOSPC error vs. performance hit
2 - maintenance script with or without quotas
It is not the ENOSPC bug. I had an ENOSPC bug, which we thought might be related to quotas that were enabled by default in Tumbleweed. Because of this, Chris sent the message to this mailing list suggesting that it be disabled in Leap / Tumbleweed. The freezing problem was described by me in September, but since nobody besides Achim Gratz replied, I thought it was something "normal" in BTRFS. One week ago, a user reported the bug because he is experiencing the freeze, and he verified that disabling quotas is a workaround.
If it is claimed that disabling the quota system is a fix, how is this determined? Wait one week, hope the load case is equal, and see the results?
A proper test would be:
a - balance (sudo /etc/cron.weekly/btrfs-balance)
b - write, say, 10,000 blocks with dd
c - run balance
d - repeat without quotas enabled
I did something very similar to that and I can confirm that disabling quotas fixed the freeze. However, you mentioned something important. Maybe the combination of the number of snapshots and quotas is what causes the freeze. I will try to debug that.
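As a starting point for that debugging, a small sketch like the one below could record how many snapshots exist on a machine next to each timing result. It relies on `btrfs subvolume list -s`, which lists only snapshot subvolumes; the helper names are made up for illustration, and counting one snapshot per non-empty output line is an assumption about the output layout.

```python
import subprocess


def list_snapshots_cmd(mountpoint="/"):
    """Command listing only snapshot subvolumes below the given mount point."""
    return ["btrfs", "subvolume", "list", "-s", mountpoint]


def count_snapshots(listing):
    """Count snapshots, assuming one snapshot per non-empty output line."""
    return sum(1 for line in listing.splitlines() if line.strip())


def snapshot_count(mountpoint="/"):
    """Run the listing (needs root) and return the number of snapshots."""
    result = subprocess.run(list_snapshots_cmd(mountpoint), check=True,
                            capture_output=True, text=True)
    return count_snapshots(result.stdout)


if __name__ == "__main__":
    print(snapshot_count("/"))
```

Running the freeze test at several different snapshot counts, with quotas on and off, would show whether the two factors interact.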
Regards, Ronan Arraes
yes - my point is that there are many moving parts here, and it is very difficult to run an objective/repeatable test. It is hard to tell from anecdotes with certainty the root or extent of the problem. It is hard to get bugs fixed without a test case.
There might be many workarounds: changing the frequency and target level of the balance, for example. Or perhaps it is merely the creation/removal of many snapshots, I don't know.
I take snapshots on updates only and have no problems.
On Tuesday, 3 January 2017, 11:50:19 CET, Simon Lees wrote:
[...] Or we could give the kernel maintainers a couple of weeks to come back from their holidays and see if they can come up with a patch to fix the issue. This report is 1 week old, and given the time of year that's not much time for anyone to deal with the issue. [...]
The message well I hear, my faith alone is weak. This is actually happening to bug reports about btrfs: https://lists.opensuse.org/opensuse-factory/2017-01/msg00002.html
Regards, Jan