[opensuse-es] [OT] A new FS is coming with kernel 2.6.30: NILFS

3 Jun 2009

This appeared in Linux Magazine. I am pasting the English text for
those who cannot access it. Apparently its best performance is on
solid-state storage (SSDs):

http://www.linux-mag.com/cache/7345/1.html

NILFS: A File System to Make SSDs Scream

The 2.6.30 kernel is chock full of next-gen file systems. One such
example is NILFS, a new log-structured file system that dramatically
improves write performance.

It’s difficult to write storage articles at this time and not focus
on the upcoming 2.6.30 kernel. Why? This kernel is loaded with a
number of new file systems, some of which we’ve already covered, like
ext4 and btrfs. Another of the hot new file systems that is in 2.6.30
is NILFS. This file system is definitely one that you should be
testing.

NILFS2 (New Implementation of a Log-Structured File System Version 2)
is a very promising new log-structured file system that has continuous
snapshots and versioning of the entire file system. This means that
you can recover files that were deleted or unintentionally modified as
well as perform backups at any time from a snapshot without a
performance penalty normally associated with creating snapshots. In
addition, there is evidence that NILFS has extremely good performance
on SSD drives.

Log-Structured File System?

Log-structured file systems are a bit different from other file
systems, with both good points and bad points. Rather than write to a
tree structure such as a b-tree or an h-tree, either with or without a
journal, a log-structured file system writes all data and metadata
sequentially in a continuous stream that is called a log (actually it
is a circular log).

The concept was developed by John Ousterhout of Tcl fame and Fred
Douglis. The motivation behind log-structured file systems is that
typical file systems lay out data based on spatial locality for
rotating media (hard drives). But rotating media tends to have slow
seek times, which limits write performance. In addition, it was
presumed that most IO would become write dominated (this observation
is supported by a study that was summarized in a recent article). So a
log-structured file system takes a new approach: it treats the file
system as a circular log and writes sequentially to the “head” of the
log (the beginning), never overwriting the existing log. This means
that seeks are kept to a minimum because everything is sequential,
improving write performance.
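
The append-only idea can be sketched in a few lines of Python. This is
a toy model and not NILFS code: the class name ToyLog and its methods
are made up for illustration, and the wrap-around of the circular log
and the reclaiming of space are left out.

```python
class ToyLog:
    """Toy append-only log: every write is appended, nothing is overwritten."""

    def __init__(self):
        self.records = []   # the log itself: (name, data) tuples in write order
        self.index = {}     # name -> position of the newest record for that name

    def write(self, name, data):
        # Append at the head of the log; older versions stay where they are.
        self.records.append((name, data))
        self.index[name] = len(self.records) - 1
        return self.index[name]

    def read(self, name):
        # Follow the index to the newest version of the file.
        return self.records[self.index[name]][1]


log = ToyLog()
log.write("a.txt", b"v1")
log.write("b.txt", b"hello")
log.write("a.txt", b"v2")      # "modifying" a.txt appends a third record

print(log.read("a.txt"))       # newest version of a.txt
print(log.records[0])          # the old version is still present in the log
```

Because the old record is never touched, an earlier state of the file
remains available for as long as that part of the log is kept.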

A log-structured file system, because of its design, makes it very
easy to create snapshots (in NILFS they are called checkpoints) of
both the data and metadata. NILFS can then mount these checkpoints (or
snapshots) alongside the primary NILFS file system. From these
checkpoints, you can recover erased files (if the checkpoint has a
date and time prior to when the file was erased) or you can use it for
backups or even disaster recovery images.

Another benefit of log-structured file systems is that recovering from
a crash is easier than with more typical tree-based file systems (e.g.
ext2, ext3, etc.). After a log-structured file system crashes, when it
is remounted it can reconstruct its state from the last consistent
point in the log. It starts at the head of the circular log and backs
up until the file system is consistent. This point should be very
close to the head so little if any data or metadata will be lost. This
process is extremely fast regardless of the size of the file system.
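
A minimal sketch of that "back up from the head" step, assuming each
log record carries a CRC32 of its payload. The record layout and the
names make_record and recover are invented for this example; the real
NILFS recovery procedure is more involved.

```python
import zlib


def make_record(payload):
    """Pair a payload with its CRC32, as a stand-in for an on-disk record."""
    return (payload, zlib.crc32(payload))


def recover(log):
    """Walk back from the head until the newest record verifies."""
    head = len(log)
    while head > 0:
        payload, checksum = log[head - 1]
        if zlib.crc32(payload) == checksum:
            break              # consistent point found, stop scanning
        head -= 1              # discard the damaged record at the head
    return log[:head]


good = zlib.crc32(b"three")
log = [
    make_record(b"one"),
    make_record(b"two"),
    (b"three", good ^ 1),      # stored checksum is wrong, as after a torn write
]

print(len(recover(log)))       # number of records that survive recovery
```

The loop only touches records near the head, so in this toy the work
depends on how much was damaged, not on how long the log is.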

This bears repeating: a log-structured file system recovers from a
crash extremely fast and the amount of time is independent of the size
of the file system. In contrast, other file systems have to replay
their journal and possibly even walk their data structures to make
sure the file system is consistent (i.e. run “fsck”). Everyone who has
run fsck on a very large file system knows how much time it can take.

One problematic aspect of log-structured file systems is that they
need to include a fairly sophisticated capability of “garbage
collection” to reclaim free space. Free space needs to be reclaimed
from the tail of the log, primarily the old checkpoints, so that the
file system doesn’t become full when the head of the log wraps around
to the tail. There are many techniques for reclaiming space; one is
covered in the Wikipedia article about log-structured file systems.
Without garbage collection of the old checkpoints (snapshots), the
file system would fill up far too quickly.

A Log Structured File System for Linux - NILFS

The Nippon Telegraph and Telephone (NTT) Cyber Space Laboratories has
been developing NILFS (also referred to as NILFS2 since it is
version 2 of the file system) for Linux. It is released under the GPL
2.0 license and is included in the 2.6.30 kernel. It spent a great
deal of time in the -mm kernels and underwent much testing since its
initial announcement.

One of the most noticeable features of NILFS is that it can
“continuously and automatically save instantaneous states of the file
system without interrupting service”. NILFS refers to these as
checkpoints. In contrast, other file systems, such as ZFS, can provide
snapshots, but they have to suspend operation to perform the snapshot
operation. NILFS doesn’t have to do this. The snapshots (checkpoints)
are part of the file system design itself.

One of the really cool features of NILFS is that these checkpoints can
actually be mounted alongside the primary file system. This has many,
many uses, one of which is to mount a checkpoint to recover files that
were unintentionally erased.

In addition to being able to recover recently erased files and
extremely fast crash recovery times, there are a number of other
features of NILFS that are very attractive:

    * The file size and inode numbers are stored as 64-bit fields

    * File sizes of up to 8 EiB (Exbibyte - approximately an Exabyte)

    * Block sizes that are smaller than a page size (e.g. 1 KB to 2 KB).
This can potentially make NILFS much faster for small files than other
file systems.

    * File and inode blocks use a B-tree (the use of B-trees in a
log-structured file system stems from the implementation, which uses
something called segments)

    * NILFS uses 32-bit checksums (CRC32) on data and metadata for
integrity assurance

    * Correctly ordered data and metadata writes

    * Redundant superblock

    * Read-ahead for metadata files as well as data files (helps read
performance)

    * Continuous checkpointing, which can be used for snapshots. These
can be used for backups or they can even be used for recovering files.
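
The 8 EiB figure in the list above is what one would expect if file
offsets are held in a signed 64-bit integer. That reading is an
assumption made here, not something the article states. The arithmetic
can be checked quickly:

```python
EIB = 2 ** 60                        # one exbibyte in bytes
EB = 10 ** 18                        # one (decimal) exabyte in bytes

max_signed_64 = 2 ** 63 - 1          # largest value of a signed 64-bit integer

size_in_eib = (max_signed_64 + 1) // EIB
eib_in_eb = EIB / EB                 # how many exabytes one exbibyte is

print(size_in_eib)                   # limit expressed in EiB
print(round(eib_in_eb, 3))           # an EiB is somewhat larger than an EB
```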

Checkpoints and Snapshots

One of the features that users can really enjoy with NILFS is the
ability to recover erased or modified files. NILFS creates a
checkpoint “every few seconds or per synchronous write basis (unless
there is no change).” (from the kernel documentation). Then the user
can select a checkpoint and convert it into a snapshot. These
snapshots are preserved until they are converted back into
checkpoints. Checkpoints are not preserved for the life of the file
system and after a period of time the garbage collection process will
recover the space in the checkpoint.
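
The life cycle described above can be modelled roughly as follows.
This is only a sketch: the Checkpoint class, the collect function and
the protection_period argument are illustrative names, the clock is a
plain integer of seconds, and the real cleaner works on log segments
and is considerably more complex.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    cno: int              # checkpoint number
    created: int          # creation time in seconds (toy clock)
    mode: str = "cp"      # "cp" = checkpoint, "ss" = snapshot


def collect(checkpoints, now, protection_period):
    """Keep every snapshot, plus checkpoints younger than the period."""
    return [c for c in checkpoints
            if c.mode == "ss" or now - c.created < protection_period]


cps = [Checkpoint(1, 0), Checkpoint(2, 10), Checkpoint(3, 3500)]
cps[1].mode = "ss"        # convert checkpoint 2 into a snapshot

kept = collect(cps, now=4000, protection_period=3600)
print([c.cno for c in kept])
```

Checkpoint 1 is old and unprotected, so it is dropped. Checkpoint 2
survives because it is a snapshot, and checkpoint 3 survives because it
is still recent.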

This means that users can’t recover files from a long time in the
past. But there is no limit to the number of snapshots that can be
created, at least until the file system volume becomes full. There
are many uses for the snapshots including recovery of erased or
modified files or they can be used by administrators for backups.

There are a few user-space commands that help with checkpoints and
snapshots. The NILFS web site explains the process, which is
paraphrased here. The first step is to list the checkpoints using the
lscp command.

$ lscp
       CNO        DATE     TIME  MODE  SKT   NBLKINC       ICNT
         1  2008-05-08 14:45:49  cp     -         11          3
         2  2008-05-08 14:50:22  cp     -     200523         81
         3  2008-05-08 20:40:34  cp     -        136         61
         4  2008-05-08 20:41:20  cp     -     187666       1604
         5  2008-05-08 20:41:42  cp     -         51       1634
         6  2008-05-08 20:42:00  cp     -         37       1653
         7  2008-05-08 20:42:42  cp     -     272146       2116
         8  2008-05-08 20:43:13  cp     -     264649       2117
         9  2008-05-08 20:43:44  cp     -     285848       2117
        10  2008-05-08 20:44:16  cp     -     139876       7357

Notice that the output of lscp lists the date and time of the
checkpoints. The column labeled “MODE” holds either “cp”, which stands
for “checkpoint”, or “ss”, which stands for “snapshot.” If a user does
not want to wait for a checkpoint and wants to create one
immediately, the mkcp command can be used. In general you need to
tell mkcp the device containing a NILFS file system; otherwise it
searches /proc/mounts for NILFS file systems.

To take a checkpoint and create a snapshot, one uses the mkcp
command again. In this case, one uses the command mkcp -s to create
the snapshot from an existing checkpoint. You can also use the chcp
command, which changes a checkpoint into a snapshot or vice versa.
The following example, again from the NILFS web site, creates a
snapshot.

$ sudo chcp ss 2
$ lscp
       CNO        DATE     TIME  MODE  SKT   NBLKINC       ICNT
         1  2008-05-08 14:45:49  cp     -         11          3
         2  2008-05-08 14:50:22  ss     -     200523         81
         3  2008-05-08 20:40:34  cp     -        136         61
         4  2008-05-08 20:41:20  cp     -     187666       1604
         5  2008-05-08 20:41:42  cp     -         51       1634
         6  2008-05-08 20:42:00  cp     -         37       1653
         7  2008-05-08 20:42:42  cp     -     272146       2116
         8  2008-05-08 20:43:13  cp     -     264649       2117
         9  2008-05-08 20:43:44  cp     -     285848       2117
        10  2008-05-08 20:44:16  cp     -     139876       7357
        11  2008-05-08 21:05:23  cp     -         10       7357

Notice that the chcp command changes the second checkpoint into a
snapshot. This is indicated under the “MODE” column where the second
checkpoint is listed as “ss” or snapshot. Now that the checkpoint
is a snapshot, it won’t be deleted during the garbage collection.
However, you can remove the snapshot by using the rmcp command.
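
For scripting, the lscp listing can be picked apart with a few lines of
Python. The sketch below works on a pasted copy of the output shown
above and assumes the seven-column layout of that listing. The function
name parse_lscp is made up for this example.

```python
LSCP_OUTPUT = """\
       CNO        DATE     TIME  MODE  SKT   NBLKINC       ICNT
         1  2008-05-08 14:45:49  cp     -         11          3
         2  2008-05-08 14:50:22  ss     -     200523         81
         3  2008-05-08 20:40:34  cp     -        136         61
"""


def parse_lscp(text):
    """Turn lscp-style text into a list of dicts, one per checkpoint."""
    rows = []
    for line in text.splitlines()[1:]:        # skip the header line
        fields = line.split()
        if len(fields) != 7:                  # ignore anything unexpected
            continue
        cno, date, time, mode, skt, nblkinc, icnt = fields
        rows.append({"cno": int(cno), "date": date, "time": time,
                     "mode": mode, "nblkinc": int(nblkinc),
                     "icnt": int(icnt)})
    return rows


rows = parse_lscp(LSCP_OUTPUT)
snapshots = [r["cno"] for r in rows if r["mode"] == "ss"]
print(snapshots)                              # checkpoint numbers marked "ss"
```

Only entries listed here as “ss” are safe from garbage collection, so a
script like this is one way to check what would survive.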

NILFS implements garbage collection in a unique way. It uses a
user-space daemon to perform the garbage collection. This daemon is
activated when the file system is mounted via the “mount” command.
This also means that garbage collection can be activated at any time
(if the file system is mounted).

Don’t forget that NILFS will delete checkpoints after a certain
period of time unless the checkpoint is converted to a snapshot. The
amount of time a checkpoint is held before being deleted is
controlled by parameters in the /etc/nilfs_cleanerd.conf file. You can
adjust the garbage collection (GC) parameters in the file and restart
the GC daemon so that the new parameter values are used (or unmount
and remount the file system).

You must have root access or at least sudo ability to mount a
snapshot. Also recall that snapshots are mounted as read-only. From the
NILFS web site example, one can mount the snapshot previously created
(it was created from the second checkpoint):

# mount -t nilfs2 -r -o cp=2 /dev/sdb1 /nilfs-cp
# df -t nilfs2
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sdb1             71679996   3203068  64888832   5% /nilfs
/dev/sdb1             71679996   3203068  64888832   5% /nilfs-cp
# mount -t nilfs2
/dev/sdb1 on /nilfs type nilfs2 (rw,gcpid=13296)
/dev/sdb1 on /nilfs-cp type nilfs2 (ro,cp=2)

The snapshot is mounted on /nilfs-cp in a read-only mode (ro).
Depending upon the options and permissions on the /nilfs-cp mount
point, users could copy files from the snapshot. Alternatively, the
root user could restore the file(s) for the user. Also, the snapshot
could easily be used for creating backups. An administrator could
also use the snapshot for creating a disaster recovery image of the
file system. Just as a reminder, the mounted snapshot, while being a
NILFS file system, is mounted read-only, so checkpoints of the
snapshot are not created. After you are finished with the snapshot,
don’t forget to unmount it and either delete the snapshot or convert
it back to a checkpoint and allow garbage collection to recover the
space.

Speed, Glorious Speed

Recall that one reason log-structured file systems were developed was
to increase write performance (assuming the read performance would be
dominated by caching effects). And who doesn’t like increased write
performance?

One of the earliest reviews of NILFS was in 2007 by Chris Samuel. He
did a very comprehensive review of Emerging File Systems (how prescient
was that review?). It covered a number of file systems, including
NILFS, and included benchmarks. The performance
was good for such a young file system but even at that time it had the
best performance by far for Sequential Deletes. It was even better
than ZFS/OpenSolaris for most tests performed.

In Feb. 2008, there was a presentation by Dongjun Shin from Samsung as
part of the Linux Storage & File System Workshop 2008 (LSF ‘08). He
benchmarked NILFS, Btrfs, Ext2, Ext3, Ext4, ReiserFS, and XFS when
running on an SSD device. Granted, the testing is a little old,
but the results are very, very exciting. The benchmark, Postmark,
simulates an email server. Two groups of file sizes were tested: (1)
9 - 15KB (S), and (2) 0.1 - 3MB (L). For each group, two tests were
run, one with a small number of files (S) and one with a larger number
of files (L). Figures 1 and 2 below are the test results.

Figure 1: Postmark Results for Small File Size

Figure 2: Postmark Results for Large File Size

Notice that in both cases, the performance of NILFS exceeds that of
other file systems. For small files NILFS was about 25-38% faster than
the nearest competitor (btrfs). For large files NILFS was about 15-25%
faster than the nearest competitor (reiserfs and/or ext4).

It is pretty amazing to see such a boost in performance from a change
in file system, but it does show you that the coupling of file system
design with hardware, in this case SSDs, can produce a big boost in
performance. But “There Ain’t No Such Thing As A Free Lunch”
(TANSTAAFL). There are some current issues with NILFS and SSDs.

There was a recent posting to the NILFS mailing list about using NILFS
as the root drive for a Linux system. It was pointed out that the root
file system produces a great deal of traffic. Couple this with the
fact that NILFS file system activity can be reasonably write heavy, and
you have the potential for quickly wearing out SSD drives (remember
that the NAND chips that make up SSDs have a limited number of
rewrites). But the developers of NILFS are aware of this and a better
garbage collection (GC) algorithm is under investigation.

There was also a question on the Linux kernel mailing list about the
effect of age on the performance (i.e. would the performance of NILFS
still remain far above others on the Postmark test after it was used
for a few months?). The answer is that the developers don’t believe
the performance suffers after it is used for a period of time, but
there isn’t any data to back up that claim at the present time.
However, virtually all file systems suffer degrading performance with
age.

NILFS - It’s Definitely Worth Testing

NILFS has a great deal going for it in many regards. It is a modern
file system in almost every respect (OK, no built-in RAID, but that
can be worked around). The log-structured design of NILFS means that
its write performance should be very, very good and there is evidence
of this from the performance report that benchmarked Postmark.

Additionally, the fact that NILFS continuously creates checkpoints
that can be used to create snapshots is of great benefit. These
checkpoints can be used to recover erased or modified user files. They
can also be used for backups or creating disaster recovery images of
data. Moreover, creating these checkpoints or snapshots does not result
in decreased performance as it does for file systems such as ZFS.

NILFS holds great promise for Linux. There are many scenarios where it
would work extremely well. In particular it works very well for user
directories or work directories. In the HPC world it would work
extremely well for high-speed storage that is dominated by write
performance. Coupling the performance boosts with the snapshot
features makes NILFS a potential system administrator’s dream file
system. It is well worth trying NILFS on your system.
Juan Erbes