On 9/3/13 8:36 PM, Jim Henderson wrote:
On Tue, 03 Sep 2013 19:59:06 -0400, Jeff Mahoney wrote:
Well that's the main thrust behind the "allow unsupported" module option. We have the feature set that we've evaluated to be mature and that's what we allow by default.
That seems a little counterintuitive to me. Allowing unsupported features would seem to indicate those features are immature, rather than mature. Am I missing something?
Yes, I'm proposing the opposite of that. The "allow unsupported" option would be disabled by default and is the "guard" in front of those immature features.
When I cast a wide net across forums and mailing lists last month asking for user experiences, I got a lot of uninformed opinion and very little concrete data.
Concrete data might be hard to come by, but I don't know that risking new users' data is the way to get it (I assume upgrades wouldn't be affected - but is that a safe assumption?). It needs to be an opt-in, not a default that may result in users losing data.
My point was more that, for the most part, those users who've actually been using btrfs weren't the ones chiming in and claiming that it's not trustworthy yet. It's the ones who're nervous about trying it at all and are being overly conservative to the point of derailing the conversation without data to back their apprehension.
Most of the negative data was in the area of snapper being too aggressive in creating snapshots and not aggressive enough in cleaning them up. There was some negative opinion WRT the file system itself, but most of it was in the realm of "I heard..." or "I don't trust it" based on too much hearsay and too little experience. It's that kind of rumor-response that is unhelpful in making decisions or improving the pain points with the file system. There were a few reports of people having troubles with the file system itself, but they tended to be with compression or RAID enabled -- the features that we don't entirely trust yet and want to disable so the casual user doesn't become an unwitting beta tester.
So whether it's "considered" unstable or experimental largely depends on what features are being tested and who's doing the testing. A lot of times it involves armchair punditry and no testing at all.
So for users to accept that their data is safe (or at least no less safe than it is with current - more mature - filesystems like ext4), don't set the default, but sell us on the idea. Tell us more about how the filesystem has improved, what the current outstanding issues are, and how they're being addressed.
A lot of individuals aren't willing to test an unproven filesystem because they risk losing data or ending up in a situation where the system has to be reinstalled. Myself, my openSUSE systems are my production work environment - so I need to be confident that I'm not going to lose critical data (and yes, I do back up the most critical data) and that I'm not going to lose billable hours having to rebuild a system because the filesystem became inconsistent. I can certainly put it in a VM, but it's not going to get a thorough "real-world" workout there.
OSS is all about transparency, so let's hear a little more about how btrfs has improved in the past 12-18 months.
Absolutely. That's a completely reasonable request.

The areas in which btrfs has historically been weak include:

- error handling
  - This is an ongoing effort, mostly completed in v3.4, where certain classes of errors are handled by taking the file system "offline" (read-only).

- ENOSPC
  - Case 1: incorrect calculation of reservation size
    - This is the case which Lew Wolfgang encountered in the bug report he mentioned. These cases /should/ be mostly fixed. I haven't seen one in a while, where "a while" is defined in terms of kernel release versions, not linear clock time.
  - Case 2: unable to free space on a full file system
    - This is probably the most infuriating case. The gist is that in a CoW file system, blocks may need to be allocated in order to free other blocks. If we get into a pathological situation where all of the blocks are in use, then we essentially encounter a deadlocked file system where the shared resource is the free block count. This has been fixed with the introduction of a reserved metadata block pool that can only be used for removal operations when ENOSPC has already been encountered. I fixed a case of this last month so that subvolume removal should succeed when the file system is full. I believe these to be pretty much eliminated now. If I'm wrong about that, the good news is that since we already have the reserved metadata block pool implemented, the fix is only about 5 lines as a fallback case for an ENOSPC handler.

- btrfsck
  - In truth, not as powerful as it should be yet.
  - Realistically, there's only so much that can be done without encountering novel broken file systems and collecting metadata images for analysis. There's a chicken/egg problem here in that there's no way to create a fsck tool that is prepared to encounter actual failure cases without seeing the failure cases. Sure, we can write up a tool that can predict certain cases, but there will *always* be surprises. e2fsck is still evolving as well.
  - The difference between btrfs and ext[234] is that we can 'scrub' the file system online to detect errors before they're actually encountered.

- VM image performance
  - Performance is generally regarded as horrible.
  - This is because CoW on what is essentially a block device backing store means a ton of write amplification for each write that the VM issues.
  - The file system supports a 'nodatacow' file attribute: chattr +C <file>. This attribute changes the CoW behavior of file writes such that a write causes a CoW to be performed only if there is more than one reference held on the data extent.
  - Caveat: There is currently a strange corner case where nodatacow prevents a reflink copy but allows a snapshot of the subvolume to make a snapshot of the file. (They're essentially the same thing on the back end.)
  - Solution(s):
    1) Remove the distinction between the reflink copy and the snapshot cloning.
    2) Always handle CoW on data extents as overwrites when there is only a single reference on the extent.
  - 1) is probably a bug fix, while 2) may meet with resistance within the file system development community since the CoW behavior also ensures that no parts of the data are overwritten and we already have a way to do this with chattr +C.

New features:

- Deduplication
  - SUSE's Mark Fasheh has added an extension to the clone ioctl that allows us to do an 'offline' (read: out-of-band) deduplication of data extents.
  - "Offline" doesn't mean unmounted - it means that the user makes use of an external tool that implements the deduplication policy.
  - Not "perfect" dedupe like "online" (in-band) implementations, but without the I/O amplification behavior that online dedupe has. I'm happy to discuss this further if there's interest in hearing more.

- Removal of the strange per-directory hard link limit
  - Due to the backreferences to a single inode needing to fit in a single file system block, there was a limit to the number of hard links in a single directory. It could be quite low.
  - Limit removed by adding a new extended inode ref item, not enabled by default yet since it's a disk format change. Extended inode ref only used when required since it's not as space-efficient as the single node item. There's probably room for discussion within the file system community on whether we'd want to add an "ok to change" bit so that file systems have the ability to use the new extended inode ref items when needed but don't set the incompat bit until they're actually used. The other side of that coin is that it may not be clear to users when/if their file system has become incompatible with older kernels.

- In-place conversion of reiserfs filesystems
  - Similar to the ext[234] converter - converts reiserfs filesystems to btrfs using the free space in the reiserfs filesystem.
  - See home:jeff_mahoney:convert for code and packages (still beta).

Areas that still need work:

- Error handling
  - Not in the "handling failure cases" sense, but in the fsfuzzer sense.

- btrfsck
  - As I mentioned, we need broken file systems to fix in order to improve the tool.

- General performance
  - For a root file system with general user activity, it performs reasonably well. I've asked one of my team to come up with solid performance numbers so that we can 1) demonstrate where the file system is performing relative to the usual suspects, and 2) identify where we need to focus our efforts.
  - Historically, fsync() was a problem spot but that's been mitigated with the introduction of a "tree log" that is similar to a journal but is really just used to accelerate fsync.

FWIW, I've had a 4 TB btrfs file system with multiple subvolumes running for several years now. It sits on top of a 3 disk MDRAID5 volume. I've never encountered a file system corruption issue with it, though I have seen a few crashes. The last one was well over a year ago. There have been a few power outages as well without ill result.
This is my "production" file system as far as that goes for me. It's not high throughput but it does serve as my "everything" volume: it serves the local copies of my git trees for work, music, and videos, hosts Time Machine backups for my wife, etc.

I know people don't really want to compare SLES with openSUSE, but here's a case in which the story matters. We've been offering official support for btrfs since SLE11 SP2. SP3 was released a few months ago. Many people thought we were insane to do so because OMG BTRFS IS STILL EXPERIMENTAL, but we've crafted a file system implementation that *is* supportable. Between limiting the feature set for which we offer support and our kernel teams aggressively identifying and backporting fixes that may not have been pulled into the mainline kernel yet (more a factor of the maintainers being busy than the patches not being fully baked), we've created a pretty solid file system implementation.

_THAT_ is why I do things like suggest that we have a similar "supported feature set" for openSUSE. It's not about limiting choice, though I suppose that's a side effect. It's about making it clear which parts of the file system are mature enough to be trusted and not just assuming that paid enterprise users are the only ones who care about things like that.

Someone asked in another thread about who gets to determine whether or not a feature is mature enough. I'll be honest here. I lead the SUSE Labs Storage and File Systems team. We do, with significant informed feedback from our users. We perform testing on the file system and cooperate with our QA staff to perform more testing on the file system. If we encounter bugs, we fix them. If we perform a ton of testing and don't encounter bugs, we start to lean to the "mature" phase. We depend on experienced users who don't mind playing with untested tech to file bug reports. If being in a Linux support environment for the past 14 years has taught me anything it's that users of software are much more creative about breaking things than the developers are. [This was also true when I was on the other side of the fence in my previous life as a big-UNIX sysadmin. Real support response when encountering a pretty funny error code on a large disk array's RAID controller: "You saw what? You should never see that."]

All that said, it's possible that some of the things listed in the "unsupported" list work fine and could be considered mature already. That's a matter of testing and confirming that they are and there are only so many hours in the day to do it. That's also why the guard against unsupported features can be lifted by the user pretty easily.

To be fair, though, most of this conversation from my perspective is about the stability of the file system itself and that's not the whole experience. We still need to focus on things like whether or not snapper is too aggressive in saving snapshots, and I believe there's been work in that area recently in response to user complaints. It's about the whole picture and I don't think two months to release is too late to have that conversation.

-Jeff

-- 
Jeff Mahoney
SUSE Labs
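
For anyone who wants the "Case 2" ENOSPC situation above spelled out, here is a toy model in Python. It is only a sketch of the idea and not btrfs code: the class name ToyCowFs, the block counts, and the assumption that every removal needs exactly one freshly allocated metadata block are all made up for illustration. It shows why a full CoW file system cannot free anything without a reserve, and how a small pool that only removals may touch breaks the deadlock.

```python
import errno


class ToyCowFs:
    """Toy block accounting for a copy-on-write file system (not btrfs code)."""

    def __init__(self, total_blocks, reserved_blocks=0):
        # Hypothetical reserve: only removals may dip into it, and only
        # once the ordinary free pool is exhausted.
        self.reserved = reserved_blocks
        self.free = total_blocks - reserved_blocks
        self.files = {}  # file name -> number of data blocks held

    def write(self, name, nblocks):
        # Ordinary writes never touch the reserve.
        if nblocks > self.free:
            raise OSError(errno.ENOSPC, "no space left for write")
        self.free -= nblocks
        self.files[name] = self.files.get(name, 0) + nblocks

    def remove(self, name):
        """Remove a file; return True if the reserve had to be used."""
        if name not in self.files:
            raise FileNotFoundError(name)
        # Step 1: CoW must allocate a new metadata block before it can
        # release anything (assumed cost: one block per removal).
        if self.free > 0:
            self.free -= 1
            used_reserve = False
        elif self.reserved > 0:
            self.reserved -= 1
            used_reserve = True
        else:
            raise OSError(errno.ENOSPC, "need a block in order to free blocks")
        # Step 2: release the data blocks plus the superseded metadata block.
        self.free += self.files.pop(name) + 1
        # Step 3: top the reserve back up for the next emergency.
        if used_reserve:
            self.free -= 1
            self.reserved += 1
        return used_reserve
```

With ToyCowFs(8) and an 8-block file, remove() fails with ENOSPC even though removing the file is exactly what would make space. With ToyCowFs(8, reserved_blocks=1) and a 7-block file, the same removal goes through by borrowing from the reserve and then refilling it.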