On Wed, 04 Sep 2013 12:04:58 -0400, Jeff Mahoney wrote:
On 9/3/13 8:36 PM, Jim Henderson wrote:
On Tue, 03 Sep 2013 19:59:06 -0400, Jeff Mahoney wrote:
Well that's the main thrust behind the "allow unsupported" module option. We have the feature set that we've evaluated to be mature and that's what we allow by default.
That seems a little counterintuitive to me. Allowing unsupported features would seem to indicate those features are immature, rather than mature. Am I missing something?
Yes, I'm proposing the opposite of that. The "allow unsupported" option would be disabled by default and is the "guard" in front of those immature features.
OK, that makes sense, thanks for the clarification.
My point was more that, for the most part, those users who've actually been using btrfs weren't the ones chiming in and claiming that it's not trustworthy yet. It's the ones who're nervous about trying it at all and are being overly conservative to the point of derailing the conversation without data to back their apprehension.
It's difficult to have hard data when you're apprehensive about running it because you've heard that there are still problems with it, or that there's a lack of effective/mature tools for fixing problems - even bearing in mind that support venues don't tend to get people saying "everything's just fine, no problems here at all". I don't think it's "overly conservative", though, to be cautious about risking your data. I think it's up to those who believe it's stable to demonstrate that it is, and to assure users that this is a safe filesystem to use. If it is, then of course I'd want to use it. But I don't want to take on a bigger risk just so more data can be gathered about unrecoverable problems. Does that make sense?
OSS is all about transparency, so let's hear a little more about how btrfs has improved in the past 12-18 months.
Absolutely. That's a completely reasonable request.
Thank you. :)
The areas in which btrfs has historically been weak include:
- error handling
  - This is an ongoing effort, mostly completed in v3.4, where certain classes of errors are handled by taking the file system "offline" (read-only).
What are the next steps after taking the filesystem offline in this instance?
- ENOSPC
  - Case 1: incorrect calculation of reservation size
    - This is the case which Lew Wolfgang encountered in the bug report he mentioned. These cases /should/ be mostly fixed. I haven't seen one in a while, where "a while" is measured in kernel release versions, not linear clock time.
  - Case 2: unable to free space on a full file system
    - This is probably the most infuriating case. The gist is that in a CoW file system, blocks may need to be allocated in order to free other blocks. If we get into a pathological situation where all of the blocks are in use, we essentially have a deadlocked file system where the shared resource is the free block count. This has been fixed with the introduction of a reserved metadata block pool that can only be used for removal operations once ENOSPC has already been encountered. I fixed a case of this last month so that subvolume removal should succeed when the file system is full. I believe these to be pretty much eliminated now. If I'm wrong about that, the good news is that since we already have the reserved metadata block pool implemented, the fix is only about 5 lines as a fallback case for an ENOSPC handler. (A reproduction sketch follows below.)
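For the curious, the Case 2 pathology was simple to reproduce on older kernels. A sketch, on a scratch mount you don't care about (/mnt/test is just an illustrative mount point):

    # fill a test btrfs file system until writes fail with ENOSPC
    dd if=/dev/zero of=/mnt/test/fill bs=1M
    # then try to free the space; on affected kernels even the unlink
    # could fail with ENOSPC, because removing the file needs to CoW metadata
    rm /mnt/test/fill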
Sounds like some good progress here.
- btrfsck
  - In truth, not as powerful as it should be yet.
  - Realistically, there's only so much that can be done without encountering novel broken file systems and collecting metadata images for analysis. There's a chicken-and-egg problem here in that there's no way to create a fsck tool that is prepared for actual failure cases without seeing those failure cases. Sure, we can write up a tool that can predict certain cases, but there will *always* be surprises. e2fsck is still evolving as well.
  - The difference between btrfs and ext[234] is that we can 'scrub' the file system online to detect errors before they're actually encountered. (Example below.)
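For anyone who hasn't tried it, scrubbing is a single btrfs-progs invocation against a mounted file system, e.g.:

    # start an online scrub; data and metadata checksums are verified
    btrfs scrub start /mnt
    # check progress and results (errors found, blocks corrected, etc.)
    btrfs scrub status /mnt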
Sounds like some room for work here, but I understand what you're saying about predicting the unpredictable, too. Since this is a SUSE effort, have you looked at how you might test this in, say, superlab in Provo? Getting time on the schedule might be difficult (I don't know how busy they are these days), but if you could push out an image to 100-200 machines and run an automated test suite to read/write data and try to stress the filesystems on a relatively large number of machines, that might give you some testing that doesn't involve risking real users' data.
From where I sit, the expectation isn't to eliminate the possibility of errors - but to look for something that's maybe one step better than "good enough".
The scrubbing capability sounds interesting.
- VM image performance
  - Performance is generally regarded as horrible.
  - This is because CoW on what is essentially a block device backing store means a ton of write amplification for each write that the VM issues.
  - The file system supports a 'nodatacow' file attribute, set with chattr +C <file>. This attribute changes the CoW behavior of file writes such that a write causes a CoW to be performed only if there is more than one reference held on the data extent. (Example usage below.)
  - Caveat: There is currently a strange corner case where nodatacow prevents a reflink copy but allows a snapshot of the subvolume to make a snapshot of the file. (They're essentially the same thing on the back end.)
  - Solution(s):
    1) Remove the distinction between the reflink copy and the snapshot cloning.
    2) Always handle CoW on data extents as overwrites when there is only a single reference on the extent.
  - 1) is probably a bug fix, while 2) may meet with resistance within the file system development community, since the CoW behavior also ensures that no parts of the data are overwritten and we already have a way to opt out with chattr +C.
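As a concrete example of the nodatacow recipe for a new VM image - the paths here are illustrative, and note that the attribute must be set while the file is still empty for it to apply to the data:

    touch /var/lib/images/vm.raw
    chattr +C /var/lib/images/vm.raw         # set the nodatacow attribute
    fallocate -l 20G /var/lib/images/vm.raw  # preallocate the backing file
    lsattr /var/lib/images/vm.raw            # should show the 'C' flag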
Would it make a difference if one used a preallocated disk image rather than a dynamic image?
New features:
- Deduplication
  - SUSE's Mark Fasheh has added an extension to the clone ioctl that allows us to do an 'offline' (read: out-of-band) deduplication of data extents.
  - "Offline" doesn't mean unmounted - it means that the user makes use of an external tool that implements the deduplication policy. (Sketch below.)
  - Not "perfect" dedupe like "online" (in-band) implementations, but without the I/O amplification behavior that online dedupe has.
  - I'm happy to discuss this further if there's interest in hearing more.
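As a sketch of the out-of-band workflow from the user's side - the tool name and flags here are illustrative; the point is that an ordinary userspace program picks the duplicate candidates and submits them to the kernel through the ioctl:

    # scan a tree for duplicate extents and ask the kernel to dedupe them
    duperemove -d -r /mnt/data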
That sounds like a cool feature. Has anyone played with this on, say, truecrypt encrypted devices as yet? (I have a very large truecrypt encrypted volume that I know has some duplication of data on it, and scripting to remove the duplicates, while not difficult, is something I haven't taken the time to do yet.)
- Removal of the strange per-directory hard link limit
  - Because all of the backreferences to a single inode needed to fit in a single file system block, there was a limit to the number of hard links to it within a single directory. It could be quite low. (Demo below.)
  - The limit was removed by adding a new extended inode ref item. It's not enabled by default yet, since it's a disk format change, and the extended inode ref is only used when required, since it's not as space-efficient as the original inode ref item.
  - There's probably room for discussion within the file system community on whether we'd want to add an "ok to change" bit so that file systems have the ability to use the new extended inode ref items when needed but don't set the incompat bit until they're actually used. The other side of that coin is that it may not be clear to users when/if their file system has become incompatible with older kernels.
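For anyone who wants to see the old limit in action, it's trivial to hit on the old format - the exact count you reach depends on the name lengths, since the cap is whatever fits in one block:

    mkdir /mnt/test/links && cd /mnt/test/links
    touch target
    # on the old format this loop stops with EMLINK well short of 10000
    for i in $(seq 1 10000); do ln target link-$i || break; done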
Most of that is over my head - what's the bottom line/impact on this?
- In-place conversion of reiserfs filesystems
  - Similar to the ext[234] converter - converts reiserfs filesystems to btrfs using the free space in the reiserfs filesystem.
  - See home:jeff_mahoney:convert for code and packages (still beta).
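Assuming the reiserfs support ends up in the same btrfs-convert binary as the ext[234] converter, usage would look something like this (/dev/sdXN being your reiserfs partition):

    umount /dev/sdXN            # conversion runs on an unmounted device
    btrfs-convert /dev/sdXN     # convert in place; the original fs image is kept
    btrfs-convert -r /dev/sdXN  # roll back to the original file system if unhappy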
Nice. I can see that being useful for those who have upgraded through several versions.
Areas that still need work:
- Error handling
  - Not in the handling-failure-cases sense, but in the fsfuzzer sense.
- btrfsck
  - As I mentioned, we need broken file systems to fix in order to improve the tool. (If you hit one, a metadata image like the one sketched below is the most useful thing to send along.)
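If you do hit a broken file system, the most useful thing to attach to a bug report is a metadata image - something along these lines, where the -s flag sanitizes file names so that no names or file data leave your machine:

    btrfs-image -c9 -s /dev/sdXN /tmp/broken-fs.img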
- General performance
  - For a root file system with general user activity, it performs reasonably well. I've asked one of my team to come up with solid performance numbers so that we can 1) demonstrate where the file system performs relative to the usual suspects, and 2) identify where we need to focus our efforts.
  - Historically, fsync() was a problem spot, but that's been mitigated with the introduction of a "tree log" that is similar to a journal but is really just used to accelerate fsync. (A quick way to poke at it yourself is sketched below.)
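In the meantime, an fsync-heavy microbenchmark is easy to improvise if you want to exercise the tree log yourself - for example with fio (parameters are illustrative):

    # random 4k writes with an fsync after every write
    fio --name=fsynctest --directory=/mnt/test \
        --rw=randwrite --bs=4k --size=1g --fsync=1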
Some general performance numbers would be good to see - as well as performance on large files/small files.
FWIW, I've had a 4 TB btrfs file system with multiple subvolumes running for several years now. It sits on top of a 3-disk MDRAID5 volume. I've never encountered a file system corruption issue with it, though I have seen a few crashes. The last one was well over a year ago. There have been a few power outages as well without ill result. This is my "production" file system, as far as that goes for me. It's not high throughput, but it does serve as my "everything" volume: it holds the local copies of my git trees for work, music, videos, Time Machine backups for my wife, etc.
It's good to hear success stories. I'm curious - do you back this up, or is most of the data available elsewhere in the event of an unrecoverable issue? (Of course, "unrecoverable" for you is probably different than it would be for me, since you know the filesystem well enough to manually work on it if necessary).
I know people don't really want to compare SLES with openSUSE, but here's a case in which the story matters. We've been offering official support for btrfs since SLE11 SP2. SP3 was released a few months ago. Many people thought we were insane to do so because OMG BTRFS IS STILL EXPERIMENTAL, but we've crafted a file system implementation that *is* supportable. Between limiting the feature set for which we offer support and our kernel teams aggressively identifying and backporting fixes that may not have been pulled into the mainline kernel yet (more a factor of the maintainers being busy than the patches not being fully baked), we've created a pretty solid file system implementation.
Given that the work for btrfs in SLE and openSUSE is being handled largely by the same people, I think it makes sense to make the comparison. SLE doesn't yet default to btrfs, though, does it?
_THAT_ is why I do things like suggest that we have a similar "supported feature set" for openSUSE. It's not about limiting choice, though I suppose that's a side effect. It's about making it clear which parts of the file system are mature enough to be trusted and not just assuming that paid enterprise users are the only ones who care about things like that.
That's sensible.
Someone asked in another thread about who gets to determine whether or not a feature is mature enough. I'll be honest here. I lead the SUSE Labs Storage and File Systems team. We do, with significant informed feedback from our users. We perform testing on the file system and cooperate with our QA staff to perform more testing on the file system. If we encounter bugs, we fix them. If we perform a ton of testing and don't encounter bugs, we start to lean toward the "mature" phase. We depend on experienced users who don't mind playing with untested tech to file bug reports. If being in a Linux support environment for the past 14 years has taught me anything, it's that users of software are much more creative about breaking things than the developers are. [This was also true when I was on the other side of the fence in my previous life as a big-UNIX sysadmin. Real support response when encountering a pretty funny error code on a large disk array's RAID controller: "You saw what? You should never see that."]
Heh, yeah, I've had plenty of circumstances myself where I've seen weirdness never before encountered in the lab (from both sides of the conversation) - users do come up with very creative ways to break things.
All that said, it's possible that some of the things listed in the "unsupported" list work fine and could be considered mature already. That's a matter of testing and confirming that they are and there are only so many hours in the day to do it. That's also why the guard against unsupported features can be lifted by the user pretty easily.
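To illustrate "pretty easily": the final knob name isn't settled, but assuming it ends up as a btrfs module parameter called allow_unsupported, lifting the guard would be something like:

    # one-off, at module load time
    modprobe btrfs allow_unsupported=1
    # or persistently, via a modprobe.d snippet
    echo "options btrfs allow_unsupported=1" > /etc/modprobe.d/99-btrfs.conf

(If btrfs were built in rather than modular, the equivalent would be btrfs.allow_unsupported=1 on the kernel command line.)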
To be fair, though, most of this conversation from my perspective is about the stability of the file system itself and that's not the whole experience. We still need to focus on things like whether or not snapper is too aggressive in saving snapshots, and I believe there's been work in that area recently in response to user complaints. It's about the whole picture and I don't think two months to release is too late to have that conversation.
It's good to have the conversation - thank you for the detailed explanation of things. That really helps put my mind at ease that this isn't (as we see from time to time) a "throw it over the wall and see what breaks" approach. It sounds like you've really done your homework and stand behind what you and your team have done to make btrfs production quality. While I still have reservations (and probably will until it reaches some sort of critical mass), my concerns are largely addressed.

Jim