[opensuse-factory] filesystem query
Hello,

Which filesystems in openSUSE 11.4 and later will handle these requirements?

1. Larger disk volumes, i.e. in the 2 to 3 terabyte range.
2. Lots of files on the disk (150,000+).
3. Will handle a system restart when the system stops responding [restart with a power off/on].
4. Have a journalled file system recovery to ensure the filesystem and files are in a healthy state after a system freeze/reboot.

What file systems does the build service use to handle the volumes of data and load?

Thanks
Glenn
--
To unsubscribe, e-mail: opensuse-factory+unsubscribe@opensuse.org
For additional commands, e-mail: opensuse-factory+help@opensuse.org
On Wed, Jun 1, 2011 at 11:41 AM, <doiggl@velocitynet.com.au> wrote:
Hello, Which filesystems in opensuse 11.4 and later will handle these requirements ?
1. -larger disk volumes i.e in the 2 to 3 Terabyte range. 2. -lots of files on the disk > 150,000+ 3. -will handle a system restart when system stops responding [restart with a power off/on] 4. -have a journalled file system recovery to ensure filesystem and files are in a healthy state after a system freeze/reboot ?
What file systems does the build service use to handle the volumes of data and load ? Thanks Glenn
Those are pretty minimal requirements. It may be easier to list the filesystems that don't do that (FAT, ext2, etc.).

Good old ext3 can do up to 8 TB. So if the above is the limit of your desires, I'd probably stick with ext3.

Note that you want filesystems and files to be in a good state after a power issue. The most robust way to get that is to journal BOTH filesystem metadata and file data. Ext3 can do that, but you need to enable the mount flag for file data journalling. It will slow things down some, but it's a tradeoff you need to decide on.

Note that ext4 is where all current R&D is ongoing, so if you need to scale up beyond the above, ext4 might be the place for you. Xfs is also a good, solid fs that a lot of people use to address the above. I don't think either ext4 or xfs support file data journalling?

I use xfs on my fileservers, as do a lot of others. But my choosing xfs involved features not on your list, and I made that choice almost 10 years ago now, so my logic from back then no longer applies. I have had a couple of fs corruption issues in that time. (I've had 3 production fileservers in use continuously since 2004 or so. I keep them on openSUSE, but typically a little behind the new releases, i.e. none are updated to 11.4 yet.)

Good Luck
Greg
Hi, On Wed, Jun 01, Greg Freemyer wrote:
Good old ext3 can do up to 8 TB. So if the above is the limit of your desires, I'd probably stick with ext3.
The filesystem size limit for ext3 is 32 TB.

[/share/MD0_DATA] # df -h .
Filesystem                Size      Used Available Use% Mounted on
/dev/md0                  8.1T      3.5T      4.5T  44% /share/MD0_DATA
[/share/MD0_DATA] # mount | grep MD0
/dev/md0 on /share/MD0_DATA type ext3 (rw,usrjquota=aquota.user,jqfmt=vfsv0,user_xattr,data=ordered,noacl)
[/share/MD0_DATA] #
Greg
Hubert Mantel
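The differing size figures in this exchange depend on the block size. A rough Python sketch of the arithmetic, assuming (this is an assumption, not something stated in the thread) that ext3 addresses blocks with 32-bit block numbers; the helper name `ext3_max_volume_bytes` is made up for illustration:

```python
# Assumption: ext3 uses 32-bit block numbers, so the theoretical maximum
# volume size is 2**32 blocks multiplied by the block size.
TIB = 2**40  # bytes in one tebibyte


def ext3_max_volume_bytes(block_size_bytes: int) -> int:
    """Theoretical upper bound on the volume size for a given block size."""
    return (2**32) * block_size_bytes


# Block sizes in bytes; 8192 is only usable on architectures with 8 KiB pages.
for block_size in (1024, 2048, 4096, 8192):
    limit_tib = ext3_max_volume_bytes(block_size) // TIB
    print(f"{block_size:>5}-byte blocks: up to {limit_tib} TiB")
```

Under that assumption the 32 TB figure corresponds to 8 KiB blocks, while the common 4 KiB block size gives 16 TiB. Limits in the kernel and in e2fsprogs of a given release may be lower than this theoretical bound.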
On 06/01/2011 05:41 PM, doiggl@velocitynet.com.au wrote:
Hello, Which filesystems in opensuse 11.4 and later will handle these requirements ?
I don't know if my answer will match your expectations ...
1. -larger disk volumes i.e in the 2 to 3 Terabyte range.
Any of ext3, ext4, xfs or btrfs. To keep the old GRUB 0.97 working, it's better to keep the boot files below cylinder 1024, e.g. by having a small boot partition.
2. -lots of files on the disk > 150,000+
ext4 or xfs, depending on the size and usage of the files.
3. -will handle a system restart when system stops responding [restart with a power off/on]
? Sorry, I don't understand what you mean.
4. -have a journalled file system recovery to ensure filesystem and files are in a healthy state after a system freeze/reboot ?
ext3, ext4, xfs, btrfs (to be tested, and only the latest version)
What file systems does the build service use to handle the volumes of data and load ?
Mainly ext3 for download.opensuse.org. The VMs that build the packages also use ext3 (take a look at a build log).
Thanks Glenn
--
Bruno Friedmann
Ioda-Net Sàrl www.ioda-net.ch
openSUSE Member & Ambassador
GPG KEY : D5C9B751C4653227
irc: tigerfoot
On 2011-06-01 19:21, Bruno Friedmann wrote:
On 06/01/2011 05:41 PM, doiggl@velocitynet.com.au wrote:
3. -will handle a system restart when system stops responding [restart with a power off/on] ? sorry don't understand what you mean
It is very clear. :-)

What happens when the system suffers a hard reset or power off? How well does the filesystem cope with that situation on the next boot?

--
Cheers / Saludos,
Carlos E. R. (from 11.2 x86_64 "Emerald" at Telcontar)
On 01/06/2011 19:37, Carlos E. R. wrote:
What happens when the system suffers a hard reset or power off, how well does the filesystem cope with that situation on the next boot?
AFAIK, no file system is completely immune in such a situation. I have even seen a computer kill lilo on a rude stop/start (no lilo boot anymore; start the rescue system, type lilo, and all was as good as ever).

Every modern filesystem tries to cope as well as possible, and that has been true since the early days of the first reiserfs.

jdd
--
http://www.dodin.net
http://www.youtube.com/user/jdddodinorg
http://jdd.blip.tv/
Hi, On Wed, 1 Jun 2011, Bruno Friedmann wrote:
On 06/01/2011 05:41 PM, doiggl@velocitynet.com.au wrote:
Which filesystems in opensuse 11.4 and later will handle these requirements ?
I don't know if my answer will match your expectations ...
1. -larger disk volumes i.e in the 2 to 3 Terabyte range.
Any of ext3, ext4, xfs or btrfs. To keep the old GRUB 0.97 working, it's better to keep the boot files below cylinder 1024, e.g. by having a small boot partition.
2. -lots of files on the disk > 150,000+
ext4, xfs depend of the size and usage of files.
Beware of xfs if you really have a lot of files: if an xfs_repair ever becomes necessary, it will die of memory hunger and you will die waiting for it.
3. -will handle a system restart when system stops responding [restart with a power off/on]
? sorry don't understand what you mean
Usually you can decide about some "error behaviour", but there will surely be no way to react with power off/on...
4. -have a journalled file system recovery to ensure filesystem and files are in a healthy state after a system freeze/reboot ?
ext3, ext4, xfs, btrfs(to be tested and only latest version)
What file systems does the build service use to handle the volumes of data and load ?
Mainly ext3 for download.opensuse.org For the vm building package they use ext3 (take a look at a log build)
Best regards
Eberhard Moenkeberg (emoenke@gwdg.de, em@kki.org)
--
Eberhard Moenkeberg
Arbeitsgruppe IT-Infrastruktur
Gesellschaft fuer wissenschaftliche Datenverarbeitung mbH Goettingen (GWDG)
Am Fassberg 11, 37077 Goettingen
URL: http://www.gwdg.de
On Wed, 01 Jun 2011 19:21:48 +0200, Bruno Friedmann <bruno@ioda-net.ch> wrote:
4. -have a journalled file system recovery to ensure filesystem and files are in a healthy state after a system freeze/reboot ? ext3, ext4, xfs, btrfs(to be tested and only latest version)
IIUC, btrfs is not exactly a journaling file system ;-)
--
Stefan Seyfried
"Dispatch war rocket Ajax to bring back his body!"
On 01/06/11 11:41, doiggl@velocitynet.com.au wrote:
4. -have a journalled file system recovery to ensure filesystem and files are in a healthy state after a system freeze/reboot ?
While the filesystem can help by providing some defined behaviour on error, you need proper power backup and a RAID to lower the chances of losing data on a power failure. You are at the mercy of your hardware, which is usually cheap and hence quite buggy.

Cheers.
On 06/01/2011 08:47 PM, Cristian Rodríguez wrote:
On 01/06/11 11:41, doiggl@velocitynet.com.au wrote:
4. -have a journalled file system recovery to ensure filesystem and files are in a healthy state after a system freeze/reboot ?
While the filesystem can help having some sort of behaviour on error, you need to have proper power backup and a RAID to lower the chances of lossing data on power fail.
I have a UPS, but sometimes the kernel crashes. Yep, it does. I hibernate to disk two or three times a day, and eventually it crashes, once or twice per month. I cannot report it because there are no logs and no messages.

Which means running fsck on all open filesystems.

--
Saludos/Cheers
Carlos E.R. (testing 11-4 Celadon on Lilliput)
On 02/06/2011 21:27, Carlos E. R. wrote:
Which means run fsck on all opened filesystems.
It shouldn't. I usually see only a journal check.

jdd
--
http://www.dodin.net
http://www.youtube.com/user/jdddodinorg
http://jdd.blip.tv/
On Thu, Jun 2, 2011 at 4:00 PM, jdd <jdd@dodin.org> wrote:
On 02/06/2011 21:27, Carlos E. R. wrote:
Which means run fsck on all opened filesystems.
shouldn't. I usually see only a journal control
jdd
Remember, metadata journaling is fairly common. Data journaling much less so.

Data journaling will be more robust, so if robustness is your issue, give it a shot.

I don't know what filesystems offer data journaling, but ext3 definitely does. From the man page, in the ext3 section:

==============
data={journal|ordered|writeback}
    Specifies the journalling mode for file data. Metadata is always journaled. To use modes other than ordered on the root filesystem, pass the mode to the kernel as boot parameter, e.g. rootflags=data=journal.

    journal
        All data is committed into the journal prior to being written into the main filesystem.

    ordered
        This is the default mode. All data is forced directly out to the main file system prior to its metadata being committed to the journal.

    writeback
        Data ordering is not preserved - data may be written into the main filesystem after its metadata has been committed to the journal. This is rumoured to be the highest-throughput option. It guarantees internal filesystem integrity, however it can allow old data to appear in files after a crash and journal recovery.
================

writeback is the least robust. Data can be written in any order and conceivably sit in cache for extended periods. 5+ years ago, I think this was the normal behavior for most mainstream filesystems.

ext3 now defaults to data=ordered. (Remember that the journals are flushed on every mount, so it is easy to switch from one mode to another.)

I don't know if "data=journal" is any safer than "data=ordered" or not.

Hope this helps
Greg
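To see which data= mode a mounted filesystem actually uses, the option list printed by `mount` can be inspected. A minimal Python sketch; the helper name `journal_mode` is hypothetical, the sample line is the ext3 mount line posted earlier in the thread, and the parsing assumes the usual "device on dir type fstype (options)" output format:

```python
import re
from typing import Optional


def journal_mode(mount_line: str) -> Optional[str]:
    """Return the value of the data= option from one line of `mount` output.

    Returns None when no data= option is listed, in which case the
    filesystem default applies.
    """
    match = re.search(r"\(([^)]*)\)\s*$", mount_line)
    if match is None:
        return None
    for option in match.group(1).split(","):
        key, _, value = option.partition("=")
        if key == "data":
            return value
    return None


# Sample line taken from the mount output quoted in this thread.
line = ("/dev/md0 on /share/MD0_DATA type ext3 "
        "(rw,usrjquota=aquota.user,jqfmt=vfsv0,user_xattr,data=ordered,noacl)")
print(journal_mode(line))
```

On a live system the same function could be applied to each line of the `mount` output; whether a given kernel lists the default mode explicitly is not something this sketch relies on.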
On 06/02/2011 04:22 PM, Greg Freemyer wrote:
On Thu, Jun 2, 2011 at 4:00 PM, jdd <jdd@dodin.org> wrote:
On 02/06/2011 21:27, Carlos E. R. wrote:
Which means run fsck on all opened filesystems.
shouldn't. I usually see only a journal control
jdd
Remember meta-data journaling is fairly common.
Data journaling much less so.
Data journaling will be more robust, so if robustness is your issue, give it a shot.
I don't know what filesystems offer data journaling, but ext3 definitely does. From the man page, in the ext3 section:
============== data={journal|ordered|writeback} Specifies the journalling mode for file data. Metadata is always journaled. To use modes other than ordered on the root filesystem, pass the mode to the kernel as boot parameter, e.g. rootflags=data=journal.
journal All data is committed into the journal prior to being written into the main filesystem.
ordered This is the default mode. All data is forced directly out to the main file system prior to its metadata being committed to the journal.
writeback Data ordering is not preserved - data may be written into the main filesystem after its metadata has been committed to the journal. This is rumoured to be the highest-throughput option. It guarantees internal filesystem integrity, however it can allow old data to appear in files after a crash and journal recovery. ================
writeback is the least robust. Data can be written in any order and conceivably sit in cache for extended periods. 5+ years ago, I think this was the normal behavior for most mainstream filesystems.
ext3 now defaults to data=ordered (Remember the journals are flushed on every mount, so it is easy to switch from one mode to another.)
I don't know if "data=journal" is any safer than "data=ordered" or not.
The choice between the two isn't one of robustness. It's a choice of workload. They'll both have your data on disk when fsync() returns and neither can make any guarantees about data being written before then. Within the confines of existing APIs, file systems can't make any promises WRT file contents beyond a chunk of data at a certain offset. It only understands its own metadata.

Writes are still cached before being written to disk. In both cases, writes can be split into multiple transactions. The writes are split up into page-sized chunks (along with associated metadata, like bitmaps or indirect blocks), each of which may be in its own transaction. In neither case will a 32 MB write() be performed in an atomic chunk. Each mode will place the blocks on a list that will be flushed during commit. The mode determines where it will be flushed: the general file system or the journal.

For robustness, use fsync(). That's what it's there for.

The descriptions of each mode you've pasted give the "what it does" aspect of each mode, but not the effects.

data=writeback means that the journal will not stall on large writes when the journal must be flushed to the general file system or an fsync() is called. This will perform the best for most write loads but can introduce corruption at the end of files if the system crashes if the file is extended (metadata) before the file data itself is written out (data).

data=ordered means that data writes go directly to the file system and are guaranteed to hit disk before the transaction commits. This protects against old file data appearing in sections of a file that have grown but weren't written yet. It's a bit of a heavy hammer for that purpose since it writes all of the outstanding writes to the file before the transaction is committed, not just the ones that fall outside the boundaries. A side effect of this is that it can stall transaction commits when there are large writes queued up. There are fairly severe performance consequences when there is fsync activity on a file system with a lot of streaming writes. This is because the fsync can't be honored until the transaction is committed, and there may be other transactions queued to be committed before it. Even a small write can stall behind the ordered writeout of a large write list associated with another file.

data=journal means that _every_ write to the file system must go through the journal. For streaming workloads, this will usually result in choppy, bursty performance as the journal overflows again and again and must be flushed to the filesystem, stalling progress as it does so. Administrators should be aware that any increase in journal size carries a corresponding increase in latency when the journal must be flushed. So you may get longer bursts but they'll be further apart. The flip side of this is that it also means that for fsync-heavy workloads on small files, like with a mail spool, the fsync() call can be honored just by committing the write to the journal. This limits seeking to within the journal area and allows the file system to write to the general file system at its leisure, queuing and sequencing writes to minimize seeking.

Chris Mason, some time ago, started playing around with the idea of a data=guarded mode. This mode would only queue up writes that are outside the current boundaries of the file so that most of the latency associated with data=ordered would be eliminated. I didn't really follow what happened with this effort. If I had to guess I'd say that the overhead associated with making it work well would be too high to bother with, since it would require an extent mapping of the file waiting to be written. I'd also bet that an opportunistically created extent map wouldn't be complete enough to make it worthwhile.
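As an illustration of the "use fsync()" advice, here is a minimal Python sketch of an append that only returns once the kernel has been asked to flush the data. The function name `durable_append` and the file name are hypothetical; the calls used are the standard `os.open`, `os.write`, `os.fsync` and `os.close`:

```python
import os
import tempfile


def durable_append(path: str, data: bytes) -> None:
    """Append data to path and return only after fsync() has succeeded."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write() may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to push this file's data and metadata to the device.
        os.fsync(fd)
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    spool = os.path.join(tmp, "spool.log")  # hypothetical file name
    durable_append(spool, b"message 1\n")
    durable_append(spool, b"message 2\n")
    with open(spool, "rb") as handle:
        print(handle.read())
```

Each call pays the fsync latency described above, which is the cost side of the tradeoff; whether the drive's own write cache honours the flush is outside what this sketch can control.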
-Jeff
--
Jeff Mahoney
SUSE Labs
On 02/06/11 17:48, Jeff Mahoney wrote:
For robustness, use fsync(). That's what it's there for.
Unfortunately not even critical tools in the base system do that :-|
On Thursday, 2011-06-02 at 18:32 -0400, Cristian Rodríguez wrote:
For robustness, use fsync(). That's what it's there for.
Unfortunately not even critical tools in the base system do that :-|
Define "critical". It impairs speed a lot.

--
Cheers,
Carlos E. R. (from 11.4 x86_64 "Celadon" at Telcontar)
On Thu, Jun 2, 2011 at 5:48 PM, Jeff Mahoney <jeffm@suse.com> wrote: <snip>
data=writeback means that the journal will not stall on large writes when the journal must be flushed to the general file system or an fsync() is called. This will perform the best for most write loads but can introduce corruption at the end of files if the system crashes if the file is extended (metadata) before the file data itself is written out (data).
I don't recall it being an issue with ext3, but XFS in the 2003 timeframe had a major issue if a system crashed while a config file update was pending.

Userspace (KDE in particular back then) would do something like

open()
truncate_config_file()
write_new_data()
// fail to call fsync()
close()

Then xfs would journal the truncate command and the new post-write file size, but the data itself would sit in cache (apparently for an extended time). When the system crash occurred, you would get the infamous file full of nulls.

They resolved that years ago, but ever since then I've been wary of data=writeback functionality. I want the data on disk before all the metadata starts flying around.

Greg
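The truncate-and-rewrite sequence above is commonly avoided by writing a temporary file, syncing it, and renaming it over the original. A minimal Python sketch of that pattern, not the code any of the tools mentioned actually use; `replace_file_safely` and the file name are hypothetical, and the directory fsync assumes a POSIX system such as Linux:

```python
import os
import tempfile


def replace_file_safely(path: str, data: bytes) -> None:
    """Replace path with data so a crash leaves the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so the rename stays
    # within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()              # push Python's buffer to the kernel
            os.fsync(tmp.fileno())   # push the kernel's cache to the device
        os.replace(tmp_path, path)   # rename over the old file
    except BaseException:
        try:
            os.unlink(tmp_path)      # remove the leftover temporary file
        except FileNotFoundError:
            pass
        raise
    # Sync the directory so the rename itself is recorded (POSIX assumption).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


with tempfile.TemporaryDirectory() as tmp:
    config = os.path.join(tmp, "app.conf")  # hypothetical config file
    replace_file_safely(config, b"colour=blue\n")
    replace_file_safely(config, b"colour=green\n")
    with open(config, "rb") as handle:
        print(handle.read())
```

The design choice is that the old file is never truncated: until the rename happens, readers and crash recovery still see the complete previous version.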
On 06/03/2011 12:34 AM, Greg Freemyer wrote:
On Thu, Jun 2, 2011 at 5:48 PM, Jeff Mahoney<jeffm@suse.com> wrote: <snip>
They resolved that years ago, but ever since then, I've been wary of data=writeback functionality. I want the data on disk prior to all the metadata starts flying around.
You need battery backed cache >:-)

--
Saludos/Cheers
Carlos E.R. (testing 11-4 Celadon on Lilliput)
On Thu, Jun 2, 2011 at 6:41 PM, Carlos E. R. <robin.listas@telefonica.net> wrote:
On 06/03/2011 12:34 AM, Greg Freemyer wrote:
On Thu, Jun 2, 2011 at 5:48 PM, Jeff Mahoney<jeffm@suse.com> wrote: <snip>
They resolved that years ago, but ever since then, I've been wary of data=writeback functionality. I want the data on disk prior to all the metadata starts flying around.
You need battery backed cache >:-)
One of my clients just had to replace their cache battery. It is no longer in production, so it cost them $1K. I don't know where they found it!

But the old battery was on a Y cable, so they could plug in the replacement battery before they disconnected the old one. Now that's what I call paranoia!

Greg
On 02/06/11 18:34, Greg Freemyer wrote:
On Thu, Jun 2, 2011 at 5:48 PM, Jeff Mahoney <jeffm@suse.com> wrote: <snip>
data=writeback means that the journal will not stall on large writes when the journal must be flushed to the general file system or an fsync() is called. This will perform the best for most write loads but can introduce corruption at the end of files if the system crashes if the file is extended (metadata) before the file data itself is written out (data).
I don't recall it being an issue with ext3, but XFS in the 2003 timeframe had a major issue if a system crashed while a config file update was pending.
Userspace (KDE in particular back then) would do something like
open()
truncate_config_file()
write_new_data()
// fail to call fsync()
close()
Bugs of this kind still exist in abundance; mount ext4 with noauto_da_alloc to disable the workarounds.
participants (10)
- Bruno Friedmann
- Carlos E. R.
- Cristian Rodríguez
- doiggl@velocitynet.com.au
- Eberhard Moenkeberg
- Greg Freemyer
- Hubert Mantel
- jdd
- Jeff Mahoney
- Stefan Seyfried