Basil, this is a pretty technical response and doesn't really address your needs, so feel free to ignore it. (Per has to read it!) Per Jessen <per@computer.org> wrote:
Greg Freemyer wrote:
To repeat:
with dd -> ext4: 75 MB/sec data transfer at 160% CPU load (spread across multiple cores; no single core overloaded)
with dd -> ntfs: 25-30 MB/sec data transfer at only 120% CPU load (one core running mount.ntfs at 75% load)
I actually find that kind of strange, but it was repeatable.
I think it is pretty clear that mount.ntfs is the bottleneck.
I guess mount.ntfs is the FUSE binary?
I don't think FUSE has a binary as such. FUSE itself lives in the kernel and provides a kernel API; there is also a userspace FUSE library (libfuse) that programs can link against.

As you know, when you issue a write(2) in a normal user application, that write has to be mapped to specific disk sectors and the underlying sector write performed. And if the write(2) extended the file, various metadata has to be updated as well. With ext4, all of that logic is handled by the ext4 kernel driver/module. With FUSE, the logic is pushed out to userspace. I believe mount.ntfs is the program that implements the logic required to read/write an NTFS filesystem. It uses the FUSE API, but I would not call it "the FUSE binary".

FYI: I have at least one other tool that uses FUSE: ewfmount. It is a far simpler tool than the NTFS driver, so if you want to see some source that uses the FUSE API, it might be a good place to look. It is in the ewftools subpackage of libewf and is in OBS. (Or I assume the srpm is on the DVD? I do all my source-level work in conjunction with OBS, so I'm not sure where the srpms are.)
The open (and unimportant) question is whether it is just poorly written, or whether being based on FUSE makes it inefficient no matter what.
FUSE causes two context switches per operation, so it is bound to have an impact. A factor of 2.5-3 seems like a lot, though.
The logic needed to implement a filesystem is very complex, even for something like ext2. NTFS is more complex than ext4, probably closer in complexity to btrfs: it is extent-based, has a journal, and (at least on Windows) supports snapshots/shadow copies of files within the filesystem.

In my test I used dd with 1 MB writes. That data has to be copied into the kernel when write(2) is called. I assume that with mount.ntfs the kernel in turn has to copy that data back out to userspace. Once it's there, mount.ntfs has to organize it the way it will exist on disk. When this is done in the kernel, it is all handled in the page cache, and the shuffling is done by manipulating pointers. That only works because disk controllers have DMA circuitry that accepts arrays of pointers telling it where to pull the data from; they call that scatter/gather DMA. Note that by design, disk controllers only read/write data in kernel buffers; they never directly access userspace buffers. So mount.ntfs has no DMA circuit it can call on. Instead, once it has decided how to lay out that 1 MB of data dd sent it, it has to call back into the kernel, which copies the data back down into kernel buffers before it can talk to the disk controller and initiate the actual DMA transfer out to the disk.

While all that is happening, the disk is spinning. Let's say 6000 rpm for ease of math. At that speed it takes 10 msecs for one rotation of the drive. That is a really long time in the world of CPUs. So if, while all the above logic is going on, an extra disk rotation slips in every once in a while, it causes big delays. In the 70's it was not realistic to write data to consecutive sectors as the disk spun, because CPUs were just too slow. The solution was interleaving: you may recall we used to format drives with an interleave factor of 2 or even 3.
With an interleave of 3, when the CPU wanted to write consecutive data out to the drive, it would write to sector 1, then 4, then 7, etc. The drive spins as it writes, so if you had 250 sectors per track, that 6000 rpm drive saw a new sector pass under the head every 40 microseconds. An interleave of 3 gave the CPU 120 microseconds to get the next sector of data out to the drive controller.

For the last 20 years or so, CPUs have been fast enough to do the job in the original 40 microseconds, so sectors are no longer interleaved. Sectors have also gotten physically smaller and smaller, so if there are now 2500 sectors per track, a sector passes every 4 microseconds; but rotation speeds haven't changed, so a full rotation still takes 10 milliseconds. With modern drives, if you miss that 4 microsecond deadline, the disk hands you a huge 10 millisecond penalty, since it has to go all the way around again to let you retry.

Back to mount.ntfs. When you introduce the delays associated with all that extra data movement caused by FUSE, I'm sure there are times the 4 microsecond deadline is missed. Every time that happens, a 10 millisecond delay-of-game penalty is thrown. As you can see it's a complex dance, and the surprise to me is not that there is a 3x slowdown, but that it isn't far worse than that.

Greg
-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.