Re: [opensuse-factory] filesystem query

2 Jun 2011

On 06/02/2011 04:22 PM, Greg Freemyer wrote:
...
On Thu, Jun 2, 2011 at 4:00 PM, jdd <jdd@dodin.org> wrote:
...
Le 02/06/2011 21:27, Carlos E. R. a écrit :
...
Which means run fsck on all opened filesystems.
shouldn't. I usually see only a journal control
jdd
Remember meta-data journaling is fairly common.
Data journaling much less so.
Data journaling will be more robust, so if robustness is your issue,
give it a shot.
I don't know what filesystems offer data journaling, but ext3
definitely does.  From the man page in the ext3 section:
==============
       data={journal|ordered|writeback}
              Specifies the journalling mode for file data. Metadata is
              always journaled. To use modes other than ordered on the
              root filesystem, pass the mode to the kernel as boot
              parameter, e.g. rootflags=data=journal.

              journal
                     All data is committed into the journal prior to
                     being written into the main filesystem.

              ordered
                     This is the default mode. All data is forced
                     directly out to the main file system prior to its
                     metadata being committed to the journal.

              writeback
                     Data ordering is not preserved - data may be
                     written into the main filesystem after its metadata
                     has been committed to the journal. This is rumoured
                     to be the highest-throughput option. It guarantees
                     internal filesystem integrity, however it can allow
                     old data to appear in files after a crash and
                     journal recovery.
================
writeback is the least robust.  Data can be written in any order and
conceivably sit in cache for extended periods.  5+ years ago, I think
this was the normal behavior for most mainstream filesystems.
ext3 now defaults to data=ordered  (Remember the journals are flushed
on every mount, so it is easy to switch from one mode to another.)
I don't know if "data=journal" is any safer than "data=ordered" or not.

The choice between the two isn't one of robustness. It's a choice of
workload. They'll both have your data on disk when fsync() returns and
neither can make any guarantees about data being written before then.
Within the confines of existing APIs, file systems can't make any
promises WRT file contents beyond a chunk of data at a certain offset.
A file system only understands its own metadata.

Writes are still cached before being written to disk. In both cases,
writes can be split into multiple transactions. The writes are split up
into page-sized chunks (along with associated metadata, like bitmaps or
indirect blocks), each of which may be in its own transaction. In
neither case will a 32 MB write() be performed in an atomic chunk. Each
mode will place the blocks on a list that will be flushed during commit.
The mode determines where they will be flushed: the general file system
or the journal.
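
To put a number on the 32 MB case, here is a minimal arithmetic sketch.
It assumes a 4 KiB page size (common, but not universal) and a write
that starts on a page boundary; the helper name pages_for_write is
made up for illustration and is not a kernel interface.

```python
import mmap


def pages_for_write(nbytes, page_size=mmap.PAGESIZE):
    """Return how many page-sized chunks a page-aligned write of nbytes spans."""
    # Ceiling division: a partial trailing page still counts as one chunk.
    return -(-nbytes // page_size)


# A 32 MiB write with an assumed 4 KiB page size.
chunks = pages_for_write(32 * 1024 * 1024, page_size=4096)
print(chunks)  # 8192
```

Each of those 8192 chunks may land in its own transaction, which is why
no journaling mode can promise the whole write() is applied atomically.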

For robustness, use fsync(). That's what it's there for.
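
As a rough illustration of what that looks like from an application,
here is a minimal sketch. The function name write_durably is
hypothetical; os.open, os.write, os.fsync and os.close are the standard
Python wrappers around the corresponding system calls.

```python
import os


def write_durably(path, data):
    """Write bytes to path and return only after fsync() has succeeded."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        # os.write() may write fewer bytes than requested, so loop.
        while view:
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to push the file's data and metadata to stable
        # storage; until this returns, no journaling mode makes a promise.
        os.fsync(fd)
    finally:
        os.close(fd)
```

This only covers the file itself. For a newly created file, a careful
application would usually also fsync() the containing directory so that
the directory entry is durable as well.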

The descriptions of each mode you've pasted give the "what it does"
aspect of each mode, but not the effects.
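
Before weighing the effects, it helps to know which mode a mounted file
system is actually using. The sketch below parses text in the
/proc/mounts layout (device, mount point, type, options, ...). The
function name data_mode is made up for illustration, and the fallback
to "ordered" is an assumption: some kernels do not list the data=
option when the default is in effect.

```python
def data_mode(mounts_text, mount_point, assumed_default="ordered"):
    """Return the data= journaling mode listed for mount_point.

    Returns assumed_default if the mount has no data= option, and None
    if mount_point does not appear in mounts_text at all.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        # Expect at least: device, mount point, fs type, options.
        if len(fields) < 4 or fields[1] != mount_point:
            continue
        for opt in fields[3].split(","):
            if opt.startswith("data="):
                return opt[len("data="):]
        return assumed_default
    return None


# Hypothetical sample in /proc/mounts format, not taken from a real system.
sample = (
    "/dev/sda2 / ext3 rw,relatime,errors=continue 0 0\n"
    "/dev/sda3 /var/spool ext3 rw,relatime,data=journal 0 0\n"
)
print(data_mode(sample, "/var/spool"))  # journal
print(data_mode(sample, "/"))           # ordered (assumed default)
```

On a live system the same function can be fed the contents of
/proc/mounts instead of the sample string.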

data=writeback means that the journal will not stall on large writes
when the journal must be flushed to the general file system or an
fsync() is called. This will perform the best for most write loads but
can introduce corruption at the end of files if the system crashes
after the file has been extended (metadata) but before the file data
itself has been written out (data).

data=ordered means that data writes go directly to the file system and
are guaranteed to hit disk before the transaction commits. This protects
against old file data appearing in sections of a file that have grown
but weren't written yet. It's a bit of a heavy hammer for that purpose
since it writes all of the outstanding writes to the file before the
transaction is committed, not just the ones that fall outside the
boundaries. A side effect of this is that it can stall transaction
commits when there are large writes queued up. There are fairly severe
performance consequences when there is fsync activity on a file system
with a lot of streaming writes. This is because the fsync can't be
honored until the transaction is committed, and there may be other
transactions queued to be committed before it. Even a small write can
stall behind the ordered writeout of a large write list associated with
another file.

data=journal means that _every_ write to the file system must go through
the journal. For streaming workloads, this will usually result in
choppy, bursty performance as the journal overflows again and again and
must be flushed to the filesystem, stalling progress as it does so.
Administrators should be aware that any increase in journal size carries
a corresponding increase in latency when the journal must be flushed. So
you may get longer bursts but they'll be further apart. The flip side of
this is that it also means that for fsync-heavy workloads on small
files, like with a mail spool, the fsync() call can be honored just by
committing the write to the journal. This limits seeking to within the
journal area and allows the file system to write to the general file
system at its leisure, queuing and sequencing writes to minimize seeking.
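
To make the journal-size trade-off for streaming writes concrete, here
is a deliberately crude first-order model. It assumes a constant
incoming write rate, a constant flush rate, and that progress stalls
for the whole flush; real journal checkpointing is more gradual than
this. The function name and all the numbers are invented for
illustration.

```python
def burst_cycle(journal_bytes, incoming_rate, flush_rate):
    """Toy model of data=journal under a streaming write load.

    Returns (seconds to fill the journal, seconds stalled flushing it),
    with both rates given in bytes per second.
    """
    fill_seconds = journal_bytes / incoming_rate
    stall_seconds = journal_bytes / flush_rate
    return fill_seconds, stall_seconds


MiB = 1024 * 1024

# Made-up figures: 50 MiB/s arriving, 100 MiB/s flush to the main fs.
small = burst_cycle(128 * MiB, 50 * MiB, 100 * MiB)
large = burst_cycle(256 * MiB, 50 * MiB, 100 * MiB)
print(small)
print(large)
```

In this model, doubling the journal doubles both figures: the stalls
come half as often but each one lasts twice as long, which is the
"longer bursts but further apart" behavior described above.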

Chris Mason, some time ago, started playing around with the idea of a
data=guarded mode. This mode would only queue up writes that are outside
the current boundaries of the file so that most of the latency
associated with data=ordered would be eliminated. I didn't really follow
what happened with this effort. If I had to guess I'd say that the
overhead associated with making it work well would be too high to bother
with, since it would require an extent mapping of the file waiting to be
written. I'd also bet that an opportunistically created extent map
wouldn't be complete enough to make it worthwhile.

-Jeff

-- 
Jeff Mahoney
SUSE Labs