On 9/4/13 2:22 PM, Jim Henderson wrote:
On Wed, 04 Sep 2013 12:04:58 -0400, Jeff Mahoney wrote:
On Tue, 03 Sep 2013 19:59:06 -0400, Jeff Mahoney wrote:
On 9/3/13 8:36 PM, Jim Henderson wrote:
My point was more that, for the most part, those users who've actually been using btrfs weren't the ones chiming in and claiming that it's not trustworthy yet. It's the ones who're nervous about trying it at all and are being overly conservative to the point of derailing the conversation without data to back their apprehension.
It's difficult to have hard data when you're apprehensive about running it because you've heard that there are still problems with it, or that there's a lack of effective/mature tools for fixing problems.
Even bearing in mind that support venues don't tend to get people saying "everything's just fine, no problems here at all".
I don't think it's "overly conservative" though to be cautious about risking your data. I think it's up to those who believe it's stable to demonstrate that it is, and to assure the users that this is a safe filesystem to use.
If it is, then of course, I'd want to use it. But I don't want to take a bigger risk so there can be more data gathered about unrecoverable problems. Does that make sense?
I agree that it's not overly conservative to want to preserve your data. That's the baseline of what you expect from a file system. What I do object to is people saying "no" without having an actual reason for saying so other than being worried about it. Worry is fine, but depending on worries as a data set is problematic.
The areas in which btrfs has historically been weak include:
- error handling - This is an ongoing effort, mostly completed in v3.4, where certain classes of errors are handled by taking the file system "offline" (read-only).
What are the next steps after taking the filesystem offline in this instance?
Like every file system, it depends on the error case. This is the same reaction to errors that xfs, ext3, ext4, reiserfs, etc. have when they don't want to risk corrupting your data. If it's a disk failure or media access error, that needs to be corrected. If it's corruption, that also needs to be corrected. If it's an ENOMEM, that gets more difficult. Avoiding those is an area where I want to invest more effort, but it's low priority right now. For the most part, though, the allocation requests are small. I've never seen an ENOMEM in the middle of btrfs actually happen. In most cases, if the error can be corrected without a reboot, a simple umount, <fix>, remount cycle will be enough.
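As a rough illustration of what "offline" (read-only) looks like from userspace, the sketch below scans text in the /proc/mounts format and lists btrfs mounts whose options include ro. The function name and the sample data are made up for this example. It only inspects mount options, so a file system that was deliberately mounted read-only shows up as well.

```python
def readonly_btrfs_mounts(mounts_text):
    """Return mount points of btrfs file systems that are mounted read-only.

    Expects text in the /proc/mounts layout:
    device mountpoint fstype options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mountpoint, fstype, options = fields[:4]
        if fstype == "btrfs" and "ro" in options.split(","):
            result.append(mountpoint)
    return result


# Hypothetical sample data, not taken from a real system.
SAMPLE = """\
/dev/sda2 / btrfs rw,relatime,space_cache 0 0
/dev/sdb1 /data btrfs ro,relatime,space_cache 0 0
/dev/sdc1 /boot ext4 rw,relatime 0 0
"""

if __name__ == "__main__":
    print(readonly_btrfs_mounts(SAMPLE))  # prints ['/data']
```

On a live system the same function could be fed the contents of /proc/mounts. The kernel log would still be needed to find out why the file system went read-only.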
- btrfsck
  - In truth, not as powerful as it should be yet.
  - Realistically, there's only so much that can be done without encountering novel broken file systems and collecting metadata images for analysis. There's a chicken/egg problem here in that there's no way to create a fsck tool that is prepared to encounter actual failure cases without seeing the failure cases. Sure, we can write up a tool that can predict certain cases, but there will *always* be surprises. e2fsck is still evolving as well.
  - The difference between btrfs and ext[234] is that we can 'scrub' the file system online to detect errors before they're actually encountered.
Sounds like some room for work here, but I understand what you're saying about predicting the unpredictable, too. Since this is a SUSE effort, have you looked at how you might test this in, say, superlab in Provo? Getting time on the schedule might be difficult (I don't know how busy they are these days), but if you could push out an image to 100-200 machines and run an automated test suite to read/write data and try to stress the filesystems on a relatively large number of machines, that might give you some testing that doesn't involve risking real users' data.
Yeah, we definitely plan on doing wider stress tests in the coming months.
From where I sit, the expectation isn't to eliminate the possibility of errors - but to look for something that's maybe one step better than "good enough".
Agreed. This is an area where we plan to invest more effort.
The scrubbing capability sounds interesting.
- VM image performance
  - Performance is generally regarded as horrible.
  - This is because CoW on what is essentially a block device backing store means a ton of write amplification for each write that the VM issues.
  - The file system supports a 'nodatacow' file attribute: chattr +C <file>. This attribute changes the CoW behavior of file writes such that a write causes a CoW only if there is more than one reference held on the data extent.
  - Caveat: There is currently a strange corner case where nodatacow prevents a reflink copy but allows a snapshot of the subvolume to make a snapshot of the file. (They're essentially the same thing on the back end.)
  - Solution(s):
    1) Remove the distinction between the reflink copy and the snapshot cloning.
    2) Always handle CoW on data extents as overwrites when there is only a single reference on the extent.
  - 1) is probably a bug fix, while 2) may meet with resistance within the file system development community since the CoW behavior also ensures that no parts of the data are overwritten and we already have a way to do this with chattr +C.
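The nodatacow rule described above (overwrite in place unless the extent is shared) can be sketched as a toy model. This is not btrfs code: the Extent and ToyFile classes, the allocator, and the function names are all invented for illustration, and real extents are far more involved.

```python
import itertools

# Hypothetical allocator: hands out a fresh "disk location" per new extent.
_locations = itertools.count(1)


class Extent:
    """A toy data extent: a location plus a reference count."""

    def __init__(self):
        self.location = next(_locations)
        self.refs = 1


class ToyFile:
    """A toy file backed by a single extent."""

    def __init__(self, nodatacow=False):
        self.extent = Extent()
        self.nodatacow = nodatacow


def snapshot(f):
    """Model a snapshot: one more reference on the same extent."""
    f.extent.refs += 1


def write(f):
    """Apply one write and report whether it was an overwrite or a CoW."""
    if f.nodatacow and f.extent.refs == 1:
        return "overwrite"  # sole owner: safe to write in place
    # Shared extent, or a normal CoW file: write to a new extent instead.
    f.extent.refs -= 1
    f.extent = Extent()
    return "cow"


if __name__ == "__main__":
    vm_image = ToyFile(nodatacow=True)
    print(write(vm_image))  # overwrite
    snapshot(vm_image)
    print(write(vm_image))  # cow (extent was shared with the snapshot)
    print(write(vm_image))  # overwrite (the new extent has one reference)
```

In this model a nodatacow file pays the CoW cost once after each snapshot and then goes back to in-place writes, which is why preallocating the image makes no difference to the write pattern.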
Would it make a difference if one used a preallocated disk image rather than a dynamic image?
No. The issue isn't the initial allocation, it's the CoW for writes into the image file.
New features:
- Deduplication
  - SUSE's Mark Fasheh has added an extension to the clone ioctl that allows us to do an 'offline' (read: out-of-band) deduplication of data extents.
  - "Offline" doesn't mean unmounted - it means that the user makes use of an external tool that implements the deduplication policy.
  - Not "perfect" dedupe like "online" (in-band) implementations, but without the I/O amplification behavior that online dedupe has.
I'm happy to discuss this further if there's interest in hearing more.
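To make the "external tool that implements the deduplication policy" idea concrete, here is a minimal sketch of the policy half only: it hashes fixed-size blocks of the given files and reports blocks that occur more than once. The block size, the use of SHA-256, and the function name are assumptions for this example. It stops before any ioctl call and does not share or modify any data.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 128 * 1024  # assumed granularity, chosen only for illustration


def find_duplicate_blocks(paths, block_size=BLOCK_SIZE):
    """Return {digest: [(path, offset), ...]} for blocks seen more than once.

    Only full-size blocks are considered; a short tail block is skipped.
    """
    seen = defaultdict(list)
    for path in paths:
        with open(path, "rb") as fh:
            offset = 0
            while True:
                block = fh.read(block_size)
                if not block:
                    break
                if len(block) == block_size:
                    digest = hashlib.sha256(block).hexdigest()
                    seen[digest].append((path, offset))
                offset += len(block)
    # Keep only digests that appear at two or more locations.
    return {d: locs for d, locs in seen.items() if len(locs) > 1}
```

A real tool would hand the candidate ranges to the kernel interface rather than trusting the hashes alone. The scan can run whenever it is convenient, which is what distinguishes the out-of-band approach from in-band dedupe on the write path.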
That sounds like a cool feature. Has anyone played with this on, say, truecrypt encrypted devices as yet? (I have a very large truecrypt encrypted volume that I know has some duplication of data on it, and scripting to remove the duplicates, while not difficult, is something I haven't taken the time to do yet.)
Not specifically, AFAIK.
- Removal of the strange per-directory hard link limit
  - Due to the backreferences to a single inode needing to fit in a single file system block, there was a limit to the number of hard links in a single directory. It could be quite low.
  - Limit removed by adding a new extended inode ref item, not enabled by default yet since it's a disk format change. Extended inode ref only used when required since it's not as space-efficient as the single node item. There's probably room for discussion within the file system community on whether we'd want to add an "ok to change" bit so that file systems have the ability to use the new extended inode ref items when needed but don't set the incompat bit until they're actually used. The other side of that coin is that it may not be clear to users when/if their file system has become incompatible with older kernels.
Most of that is over my head - what's the bottom line/impact on this?
The bottom line is that w/o this enabled, things that use a lot of hard links in a single directory can run into EMLINK. You can fix the issue by enabling the extended inode ref feature, but it means that you can't mount the file system on older kernels. "Older" in this case means prior to 3.7 IIRC, so oS 12.3.
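A small sketch of how an application might notice the limit: it creates hard links to one file inside a single directory and stops cleanly if the file system answers with EMLINK. The function name and the link naming scheme are hypothetical. How many links succeed depends entirely on the file system and its settings.

```python
import errno
import os


def make_hard_links(target, directory, count):
    """Try to create `count` hard links to `target` inside `directory`.

    Returns (links_created, hit_emlink). Any error other than EMLINK
    is re-raised to the caller.
    """
    created = 0
    for i in range(count):
        link_path = os.path.join(directory, "link-%06d" % i)
        try:
            os.link(target, link_path)
        except OSError as exc:
            if exc.errno == errno.EMLINK:
                return created, True  # the file system's link limit was reached
            raise
        created += 1
    return created, False
```

Running this with a large count in a scratch directory gives a quick read on whether a given file system still has a low per-directory limit.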
Areas that still need work:
- Error handling - Not in the handling failure cases sense, but in the fsfuzzer sense.
- btrfsck - As I mentioned, we need broken file systems to fix in order to improve the tool.
- General performance
  - For a root file system with general user activity, it performs reasonably well. I've asked one of my team to come up with solid performance numbers so that we can 1) demonstrate where the file system is performing relative to the usual suspects, and 2) identify where we need to focus our efforts.
  - Historically, fsync() was a problem spot but that's been mitigated with the introduction of a "tree log" that is similar to a journal but is really just used to accelerate fsync.
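For anyone who wants a first impression of fsync() cost on their own file system, the sketch below times a series of small write-plus-fsync cycles in a given directory. It is a crude probe, not a benchmark: the write count, the write size, and the function name are arbitrary choices for this example, and the results depend heavily on hardware and caching.

```python
import os
import tempfile
import time


def time_fsyncs(directory, writes=20, size=4096):
    """Write `size` bytes and fsync, `writes` times; return latencies in seconds."""
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        payload = b"\0" * size
        for _ in range(writes):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # force the data to stable storage
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.unlink(path)  # remove the scratch file
    return latencies


if __name__ == "__main__":
    samples = time_fsyncs(tempfile.gettempdir())
    print("mean fsync latency: %.6f s" % (sum(samples) / len(samples)))
```

Pointing the directory argument at mounts of different file systems on the same disk gives a rough side-by-side comparison, nothing more.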
Some general performance numbers would be good to see - as well as performance on large files/small files.
Once we have something we can publish, I'll be happy to share them. The baseline off-the-cuff performance shows that performance is similar to ext3 for some workloads, and way off in others, specifically those that are unlink-heavy.
FWIW, I've had a 4 TB btrfs file system with multiple subvolumes running for several years now. It sits on top of a 3 disk MDRAID5 volume. I've never encountered a file system corruption issue with it, though I have seen a few crashes. The last one was well over a year ago. There have been a few power outages as well without ill result. This is my "production" file system as far as that goes for me. It's not high throughput but it does serve as my "everything" volume, serving the local copies of my git trees for work, music, videos, hosts time machine backups for my wife, etc.
It's good to hear success stories. I'm curious - do you back this up, or is most of the data available elsewhere in the event of an unrecoverable issue? (Of course, "unrecoverable" for you is probably different than it would be for me, since you know the filesystem well enough to manually work on it if necessary).
I don't have backups for anything but time machine bundles, my mail mirror, and photos. The music and videos can be reproduced in a time-consuming manner. Mostly it's a matter of backup space costing more than I want to spend. But, yeah, I do have the luxury of knowing where to start to fix it manually if I must.
I know people don't really want to compare SLES with openSUSE, but here's a case in which the story matters. We've been offering official support for btrfs since SLE11 SP2. SP3 was released a few months ago. Many people thought we were insane to do so because OMG BTRFS IS STILL EXPERIMENTAL, but we've crafted a file system implementation that *is* supportable. Between limiting the feature set for which we offer support and our kernel teams aggressively identifying and backporting fixes that may not have been pulled into the mainline kernel yet (more a factor of the maintainers being busy than the patches not being fully baked), we've created a pretty solid file system implementation.
Given that the work for btrfs in SLE and openSUSE is being handled largely by the same people, I think it makes sense to make the comparison.
SLE doesn't yet default to btrfs, though, does it?
SLE11 defaults to ext3 and we don't change the default in a service pack. I can't comment on what the default in SLE12 will be. I'll refer questions about that to our product manager for SLES, Matthias Eckermann. It should be apparent that SUSE is invested in the success of btrfs, though.
_THAT_ is why I do things like suggest that we have a similar "supported feature set" for openSUSE. It's not about limiting choice, though I suppose that's a side effect. It's about making it clear which parts of the file system are mature enough to be trusted and not just assuming that paid enterprise users are the only ones who care about things like that.
That's sensible.
All that said, it's possible that some of the things listed in the "unsupported" list work fine and could be considered mature already. That's a matter of testing and confirming that they are and there are only so many hours in the day to do it. That's also why the guard against unsupported features can be lifted by the user pretty easily.
To be fair, though, most of this conversation from my perspective is about the stability of the file system itself and that's not the whole experience. We still need to focus on things like whether or not snapper is too aggressive in saving snapshots, and I believe there's been work in that area recently in response to user complaints. It's about the whole picture and I don't think two months to release is too late to have that conversation.
It's good to have the conversation - thank you for the detailed explanation of things. That really helps put my mind at ease that this isn't (as we see from time to time) a "throw it over the wall and see what breaks" approach. It sounds like you've really done your homework and stand behind what you and your team have done to make btrfs production quality. While I still have reservations (and probably will until it reaches some sort of critical mass), my concerns are largely addressed.
Exactly. This is a file system which we've seen deployed with SLE11 SP2/3 and with which we've seen pretty good results. It's also something that we put significant effort into improving even after the initial release of the service pack.

-Jeff

--
Jeff Mahoney
SUSE Labs