[opensuse] Revisit old topic: Re: Question re filesystem reserved block percent
Lew Wolfgang wrote:
On 02/10/2013 12:43 PM, Dennis Gallien wrote:
Advice, please . . .
The default filesystem reserved-block percentage is 5%. IIRC that goes back a long time to when much smaller drives were in use. Is there a formula or rule-of-thumb now for today's large drives/partitions? I have quite a few 100-300GB partitions which I have tuned down to 3%, but it still seems I'm wasting a lot of space. Suggestions?
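(An illustrative aside: the sketch below is a minimal Python report of the reserved pool as seen through statvfs; f_bfree counts all free blocks while f_bavail counts only those available to non-root users, so the difference approximates the reserved blocks. The mount points are placeholders. On ext2/3/4 the percentage itself is changed with `tune2fs -m`.)

```python
import os

def reserved_block_report(path):
    """Estimate the root-reserved pool on the filesystem holding 'path'."""
    st = os.statvfs(path)
    block = st.f_frsize or st.f_bsize                # fundamental block size
    total = st.f_blocks * block
    reserved = (st.f_bfree - st.f_bavail) * block    # free to root minus free to users
    pct = 100.0 * reserved / total if total else 0.0
    print(f"{path}: {reserved / 2**30:.2f} GiB reserved "
          f"({pct:.1f}% of {total / 2**30:.1f} GiB)")

if __name__ == "__main__":
    for mountpoint in ("/", "/home"):                # placeholder mount points
        reserved_block_report(mountpoint)
```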
Hi Dennis,
As I recall, the reserved blocks were used to improve read/write performance because they make it easier to find contiguous blocks.

Anton Aylward wrote: Certainly not ones that use B-Tree algorithms!
----- Sorry to bring this up so long after the fact, but I wanted to add some *counter* information to the idea of *decreasing* a reserve factor. Instead of decreasing it, there are times when you might want to increase it -- it depends on *your priorities*.

If you need space and don't care so much about speed, shrinking is likely the best way to go, but if you want to keep your disks running in their maximum speed range, I've noticed with 1-1.5TB partitions on a 24TB (12x2TB) RAID that free space should be kept around 20-25%, and that's on a B-Tree based file system (XFS).

That's because my max rate in read/write is about 1GB/s, so even the smallest delays will create a much more noticeable hit in performance than they would out of 1 disk -- i.e. the computer can do many computations while the I/O controller is doing a 64K read/write on 1 disk, but the time it has to do calculations on a 12-spindle RAID is a decimal order of magnitude less. So the CPU needs to search through 12X the data in 1/10th the time to maintain max I/O rates. The block allocator starts having problems finding large contiguous free space on file systems over 75% full; if (like me) you've gone up into the low 90s% usage and back down, you've got files spread throughout the free area. (I could rebuild the fs from scratch and that would give some more speed, until I let it get too full again.)

If your HD was the speed of a floppy, even at 99% usage you wouldn't notice a speed hit because the I/O was so slow, but the faster your HD subsystem, the more you'll notice the lag time caused by the block allocator and the buffering system. Many people with RAIDs have noticed a 30% or greater speed boost by avoiding the buffer cache. As long as CPU and memory speeds were staying 100-1000 times faster than disk, a lot can be done to optimize I/O in memory, but with CPU speeds going down (to conserve power) and disk speeds going up with moves to SSDs and RAIDs, those same algorithms won't work as well at the limits...
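(An illustrative aside: a minimal Python sketch of checking that headroom, using shutil.disk_usage; the mount point and the 20% threshold are assumptions taken from the rule of thumb above, not anything the filesystem enforces.)

```python
import shutil

FREE_TARGET = 0.20  # the ~20-25% free-space rule of thumb described above

def headroom_ok(path):
    """Warn when free space on 'path' drops below the target fraction."""
    usage = shutil.disk_usage(path)          # total, used, free (in bytes)
    free_fraction = usage.free / usage.total
    if free_fraction < FREE_TARGET:
        print(f"{path}: only {free_fraction:.0%} free; the allocator may "
              f"struggle to find large contiguous extents")
    return free_fraction >= FREE_TARGET

if __name__ == "__main__":
    headroom_ok("/srv/raid")                 # placeholder mount point
```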
Linda Walsh said the following on 03/10/2013 02:41 PM:
----- Sorry to bring this up so long after the fact, but I wanted to add some *counter* information to the idea of *decreasing* a reserve factor.
Instead of decreasing it, there are times when you might want to increase it -- it depends on *your priorities*.
If you need space and don't care so much about speed, shrinking is likely the best way to go, but if you want to keep your disks running in their maximum speed range, I've noticed with 1-1.5TB partitions on a 24TB (12x2TB) RAID that free space should be kept around 20-25%, and that's on a B-Tree based file system (XFS).
Context please! I can think of specifics:

* a file system devoted to things that don't change (except for updates), such as the hierarchies /usr/lib, /usr/share/man, /usr/share/icons and many others, so there is no churn and in the limiting case these might be treated as 'read only' file systems. Certainly in a shared environment such as thin workstations that PXE boot[1] or have only a minimal local file system (such as /etc) and NFS mount everything else, administration would have it that everything except the home directory/roving-share is effectively static.

* a file system that contains structured data, such as database files, which are very large and whose internals change but whose size does not, so the issue of free space and allocation and the 'churn' of the free list does not matter. (Some FS based databases on RAID systems used by OLTP applications can be affected by the 'small write' problem.)

* a file system that has a very high 'churn', such as one supporting a development project - editing of source, rcs/subversion, compiling and testing. (Which may also run into the 'small write' problem with some RAID set-ups.)

The rate of churn is an important consideration. The old V7 file system, may it rest in peace, worked very well for the first case above; the file system is laid down allowing for rotational delays and simple reads could be quite fast, given the technology of the day; but the churn of a multi-user environment so randomized it that performance degraded very rapidly. In that world the second case was usually dealt with by putting the database on a raw partition rather than a file system.

In a high churn situation such as the 3rd case or users' home directories, with files being created, edited and deleted at a fair clip, the space allocation is going to try to 'optimise' for something. Some cases involve a complete rewrite of the files (think: using VI), so having ample free space so a file's blocks can be 'grouped' is going to improve the performance not only of the allocator[2] but also of the file access. Of course that makes little sense with small files such as the config files that litter /etc - under 4K or whatever your allocator block size is set to. Oh, right, some B-tree FSs can pack small files into the 'tails' of other files.

Now, given all that, there is a big rider. Some disk managers such as LVM mean that while a file system is optimized, the blocks that make up the partition (Logical Volume in LVM parlance) may be subject to a scatter-gather. HO HO HO!
That's because my max rate in read/write is about 1GB/s, so even the smallest delays will create a much more noticeable hit in performance than they would out of 1 disk -- i.e. the computer can do many computations while the I/O controller is doing a 64K read/write on 1 disk, but the time it has to do calculations on a 12-spindle RAID is a decimal order of magnitude less. So the CPU needs to search through 12X the data in 1/10th the time to maintain max I/O rates. The block allocator starts having problems finding large contiguous free space on file systems over 75% full; if (like me) you've gone up into the low 90s% usage and back down, you've got files spread throughout the free area. (I could rebuild the fs from scratch and that would give some more speed, until I let it get too full again.)
See footnote #2. There are a few assumptions in all that. I once had a machine that had 5-port memory. Yes, I could get 4 DMA channels working in parallel for a RAID-like transfer, but that was because the memory could be addressed at the 512-byte level - the chips were that small. With a modern machine the issue of memory bandwidth is very different. Later I worked on PDP-11s and VAXen where the disk controller could support 8 drives and do seeks on all of them simultaneously, but since there was only one channel to memory that was the limiting factor. Apart from that, your reasoning is correct, so long as we are considering the 'high churn rate' type of file systems.
If your HD was the speed of a floppy, even at 99% usage you wouldn't notice a speed hit because the I/O was so slow, but the faster your HD subsystem, the more you'll notice the lag time caused by the block allocator and the buffering system. Many people with RAIDs have noticed a 30% or greater speed boost by avoiding the buffer cache. As long as CPU and memory speeds were staying 100-1000 times faster than disk, a lot can be done to optimize I/O in memory, but with CPU speeds going down (to conserve power) and disk speeds going up with moves to SSDs and RAIDs, those same algorithms won't work as well at the limits...
Again, there are a lot of unstated assumptions. There are various types of caching to do with a file system. Oh, right, you can say "it's all data" but that's unhelpful. Over the years we've seen how helpful DIFFERENT types of caching can be. Caching inodes separately from file content data was one such step. Certainly if you are opening and closing the same files or a group of files in the same directory or sub-tree, that helps enormously. Remember: *nix makes a clear distinction between the contents of a file and the information about a file (the inode). Unlike some (e.g.) VAX and IBM file systems, the file does not include any control information, such as its length or an end-of-file (EOF) delimiter.

SUN, it was, IIRC, who introduced path name caching as well. If you are developing and doing everything "." relative then you are only dealing with one name segment, and yes, caching that inode is nice, but in reality there is a lot of relative and absolute name resolution going on. Caching something like "/etc" and its inode, or the KDE or any other DM's icons under /usr/share, helps a great deal.

Yes "a lot can be done to optimise I/O in memory" and many of the algorithms are independent of the file system. Whether things like inode caching and name segment caching are used is one. See http://makarevitch.org/rant/bufchint.html, and note the part on how an application can advise the kernel about its expected FS use.

I note, also, that you've omitted a lot of IO tuning, such as the disk elevator algorithm, queue depth and other matters. See http://www.fishpool.org/post/2008/03/31/Optimizing-Linux-I/O-on-hardware-RAI... Other articles point out that file systems should know about the striping of the underlying storage in order to deliver the best performance.

Let's face it: there is no one size, one solution. The choice of file system, the choice of storage medium, the use or not of RAID, the number of cores in your CPU, the amount of memory and 'free memory' (which gets devoted to caches and buffers rather than processing), and the degree of parallelism in your programming or multiprogramming all affect performance. But the number one issue is the nature of the application. I keep saying Context is Everything and there is no 'one size fits all' solution.

Oh, and "time changes everything". Who knows what BtrFS will be delivering a year from now ...

[1] For example http://cloudboot.org/
[2] If you want to see this in action, try using the Windows file system compressor on a DOS (not vfat) FS with less than 15% free space!

--
Sometimes I think your train of thought is carrying a shipment of toxic waste. -- Ozy & Millie, Monday, August 28, 2000
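(An illustrative aside on "how an application can advise the kernel about its expected FS use": the standard mechanism is posix_fadvise(). A minimal Python sketch follows; the file name and chunk size are placeholders.)

```python
import os

def stream_large_file(path="/srv/data/big.log"):      # placeholder file name
    """Read a file sequentially after telling the kernel what to expect."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Declare sequential access so the kernel can read ahead aggressively.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            chunk = os.read(fd, 1 << 20)               # 1 MiB reads
            if not chunk:
                break
            # ... process chunk ...
        # Hint that these pages will not be reused, freeing cache for others.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```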
Anton Aylward wrote:
Linda Walsh said the following on 03/10/2013 02:41 PM:
----- Sorry to bring this up so long after the fact, but I wanted to add some *counter* information to the idea of *decreasing* a reserve factor.
Instead of decreasing it, there are times when you might want to increase it -- it depends on *your priorities*.
If you need space and don't care so much about speed, shrinking is likely the best way to go, but if you want to keep your disks running in their maximum speed range, I've noticed with 1-1.5TB partitions on a 24TB (12x2TB) RAID that free space should be kept around 20-25%, and that's on a B-Tree based file system (XFS).
Context please!
You were part of the original discussion -- I remember your name. In that discussion, the general consensus was that decreasing it to 3% on larger disks was probably fine. I was giving one specific (i.e. context provided) example, yet you go off like I was saying "always do XXYZ".... ?!?!!? Did you not read the whole note before thinking of your counterpoints?
I can think of specifics:
* a file system devoted to things that don't change (except for updates), such as the hierarchies /usr/lib, /usr/share/man, /usr/share/icons and many others, so there is no churn and in the limiting case these might be treated as 'read only' file systems. Certainly in a shared environment such as thin workstations that PXE boot[1] or have only a minimal local file system (such as /etc) and NFS mount everything else, administration would have it that everything except the home directory/roving-share is effectively static.
---- How is that a file system that has been up to 90% full and back down again?
* a file system that contains structured data, such as database files, which are very large and whose internals change but whose size does not, so the issue of free space and allocation and the 'churn' of the free list does not matter. (Some FS based databases on RAID systems used by OLTP applications can be affected by the 'small write' problem.)
---- Not my scenario.
* a file system that has a very high 'churn' such as one supporting a development project - editing of source, rcs/subversion, compiling and testing. (Which may also run into the 'small write' problem with some RAID set-ups.)
---- Much closer.
In a high churn situation such as the 3rd case or users' home directories, with files being created, edited and deleted at a fair clip, the space allocation is going to try to 'optimise' for something. Some cases involve a complete rewrite of the files (think: using VI), so having ample free space so a file's blocks can be 'grouped' is going to improve the performance not only of the allocator[2] but also of the file access.
Right, and it sounded like a home user was asking this question earlier, no? That's not the advice he went off with.
Of course that makes little sense with small files such as the config files that litter /etc - under 4K or whatever your allocator block size is set to. Oh, right, some B-tree FSs can pack small files into the 'tails' of other files.
AFAIK, only ReiserFS has that, and development on that FS doesn't seem to be progressing. EXT4 just added packing content into inodes, which XFS has had for 20 years... (The first white paper I read about its design, replacing the EFS file system, talked about the decisions that were made and the features included, and was dated 1993.)
Now, given all that, there is a big rider. Some disk managers such as LVM mean that while a file system is optimized, the blocks that make up the partition (Logical Volume in LVM parlance) may be subject to a scatter-gather. HO HO HO!
---- ?!? I'm pretty sure you are being vague and confusing. If you have your disks allocated to be contiguous and aren't trying to use any 'thin provisioning', which AFAIK SUSE doesn't support anyway, the file system itself isn't moved around on disk. But how does that tie into file system optimization -- and what do you mean by that? I don't know of many file systems that support free-space defragmenting. Until the past few years XFS was the only one that supported any type of defragmenting util at all (other than Windows)... but it doesn't handle defragging free space.
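(An illustrative aside: the XFS defragmenting utility referred to here is xfs_fsr, which reorganizes file extents but, as noted, does not defragment free space. A minimal sketch of driving it from Python follows; the mount point is a placeholder and the command needs root.)

```python
import subprocess

def defrag_xfs(mountpoint="/srv/raid"):   # placeholder XFS mount point
    """Run xfs_fsr verbosely against one mounted XFS filesystem."""
    subprocess.run(["xfs_fsr", "-v", mountpoint], check=True)

if __name__ == "__main__":
    defrag_xfs()
```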
That's because my max rate in read/write is about 1GB/s, so even the smallest delays will create a much more noticeable hit in performance than they would out of 1 disk -- i.e. the computer can do many computations while the I/O controller is doing a 64K read/write on 1 disk, but the time it has to do calculations on a 12-spindle RAID is a decimal order of magnitude less. So the CPU needs to search through 12X the data in 1/10th the time to maintain max I/O rates. The block allocator starts having problems finding large contiguous free space on file systems over 75% full; if (like me) you've gone up into the low 90s% usage and back down, you've got files spread throughout the free area. (I could rebuild the fs from scratch and that would give some more speed, until I let it get too full again.)
Apart from that, your reasoning is correct, so long as we are considering the 'high churn rate' type of file systems.
If your HD was the speed of a floppy, even at 99% usage you wouldn't notice a speed hit because the I/O was so slow, but the faster your HD subsystem, the more you'll notice the lag time caused by the block allocator and the buffering system. Many people with RAIDs have noticed a 30% or greater speed boost by avoiding the buffer cache. As long as CPU and memory speeds were staying 100-1000 times faster than disk, a lot can be done to optimize I/O in memory, but with CPU speeds going down (to conserve power) and disk speeds going up with moves to SSDs and RAIDs, those same algorithms won't work as well at the limits...
Again, there are a lot of unstated assumptions. There are various types of caching to do with a file system. Oh, right, you can say "it's all data" but that's unhelpful.
---- There are no unstated assumptions. Those are statistical trends. You bring the number of CPUs up to 256 and the GHz down to 1, as on some new Intel machines, then toss in a stack of RAID10 w/30 spindles, and you are easily going to be pushing the limits of the machine's allocators... With single SSDs *boasting* 100s of MB/s, how do they hold up in RAIDs -- some of the newer enterprise SSDs maintain rated speeds without trim (though the rated speeds are not in the highest categories).
Yes "a lot can be done to optimise I/O in memory" and many of the algorithms are independent of the file system. Whether things like inode caching and name segment caching are used is one. See http://makarevitch.org/rant/bufchint.html, and note the part on how an application can advise the kernel about its expected FS use.
The day you can get all the apps to cooperate and tell the OS about their projected I/O usage is the day I'll never need to work again (or hell freezes over, something like that... ;-))... App writers are hard pressed to know themselves how their app will behave in the field under customer load -- let alone predict it well enough to tell the OS about it...
I note, also, that you've omitted a lot of IO tuning, such as the disk elevator algorithm, queue depth and other matters.
Um... The point was how much free space to leave... not a full dissertation on all factors affecting disk speed and optimization.

The [home] user was given information that you now say is wrong; i.e. they had their free space down to 3%, and you said they had plenty of space left...
-------- Original Message --------
Subject: Question re filesystem reserved block percent
Date: Sun, 10 Feb 2013 17:02:44 -0500
From: Anton Aylward
Advice, please . . .
The default filesystem reserved-block percentage is 5%. IIRC that goes back a long time to when much smaller drives were in use. Is there a formula or rule-of-thumb now for today's large drives/partitions? I have quite a few 100-300GB partitions which I have tuned down to 3%, but it still seems I'm wasting a lot of space. Suggestions?
Run 'df' on the partition(s) that those filesystem(s) are on. Unless you're really short of space, you're not wasting space; you've still got plenty to spare.
----------------------------------------
I keep saying Context is Everything and there is no 'one size fits all' solution.
Oh, and "time changes everything". Who knows what BtrFS will be delivering a year from now ...
Yeah... who knows? The claims sound a bit too good to be true... and usually when they sound that way, they are, but occasionally there are diamonds in the rough...
Linda Walsh said the following on 03/11/2013 01:15 AM:
[..]
Context please!
---- You were part of the original discussion -- I remember your name. In that discussion, the general consensus was that decreasing it to 3% on larger disks was probably fine.
Linda: Since you sent that to me personally I've replied personally, since I doubt our argument is of interest to others, but PLEASE do NOT send to BOTH the list and me. It's redundant. I'm sure there's something in the etiquette guide about it.

--
I do not fear computers. I fear the lack of them. -- Isaac Asimov
participants (2)
- Anton Aylward
- Linda Walsh