[opensuse] History repeats itself: disk cache
A couple of years ago I had a problem where a program was streaming data to a file. Unfortunately, the OS cached the contents in RAM and only wrote them out occasionally. The problem was that this caused long periodic delays while the cache was flushed; write calls had to wait until the flush was done. See the following for the original discussion:

https://lists.opensuse.org/opensuse/2013-06/msg00069.html

The solution that worked was to have something like the following running:

while [ 1 ]; do sync; echo 1 > /proc/sys/vm/drop_caches; sleep 60; done &

This really did solve the problem.

We have now updated the OS for this system to Leap 42.3, and the workaround does not seem to have the same effect. One difference I see is that the sync command now takes a very long time, over a minute each time. The fwrite() calls go from taking about 1 msec to write a buffer (which is actually cached in RAM by the kernel until later) to 20 seconds or a minute. While we do buffer data, write delays that long are a problem.

I have looked to see if anything about /proc/sys/vm/drop_caches has changed since we originally used it. I think openSUSE 13.1 was the last release where it worked as expected.

Anyone have a clue about this? Maybe there is a better way to do this now? We are using ext4. IIRC, we tried other file systems at the time, but this behaviour was common to all of them.

-- Roger Oberholtzer
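One possible alternative to the global sync + drop_caches loop is to flush and discard only the stream file's pages from inside the application. A minimal sketch under some assumptions: fd is the descriptor of the stream file (fileno() of the FILE* used with fwrite()), the helper name is made up, and sync_file_range() is Linux-specific, so this would need testing against the real write pattern:

#define _GNU_SOURCE           /* for sync_file_range() */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write back the dirty pages of one file and ask the kernel to drop
 * them from the page cache, instead of a global sync + drop_caches. */
static int flush_and_drop(int fd)
{
    /* Start writeback for the whole file and wait until it completes. */
    if (sync_file_range(fd, 0, 0,
                        SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER) != 0) {
        perror("sync_file_range");
        return -1;
    }
    /* Tell the kernel the cached pages of this file are no longer needed. */
    int err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    if (err != 0) {
        fprintf(stderr, "posix_fadvise: %s\n", strerror(err));
        return -1;
    }
    return 0;
}

Called once a minute, or whenever one of the 2 GB stream files is closed, something like this keeps the amount of dirty data for the stream small without throwing away the cache for the rest of the system.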
On Mon, May 21, 2018 at 12:57 PM, Roger Oberholtzer <roger.oberholtzer@gmail.com> wrote:
Anyone have a clue about this? Maybe there is a better way to do this now? We are using ext4. IIRC, we tried other file systems at the time, but this was a general feature of all file systems.
My mistake. While the OS is on ext4, the disks with the stream files (there are two) are xfs. Both are set up as they were in 13.1, when this last worked.

-- Roger Oberholtzer
On Mon, 21 May 2018 13:14:53 +0200 Roger Oberholtzer <roger.oberholtzer@gmail.com> wrote:
On Mon, May 21, 2018 at 12:57 PM, Roger Oberholtzer <roger.oberholtzer@gmail.com> wrote:
Anyone have a clue about this? Maybe there is a better way to do this now? We are using ext4. IIRC, we tried other file systems at the time, but this was a general feature of all file systems.
My mistake. While the OS is ext4, the disks with the stream files (there are two) are xfs. Both of these are as it was in 13.1 when it last functioned.
Have you also upgraded the filesystems where the data is stored? i.e. are you still using the same old on-disk format or are you running the latest xfs software and the latest xfs on-disk format?
On 2018-05-21 13:14, Roger Oberholtzer wrote:
On Mon, May 21, 2018 at 12:57 PM, Roger Oberholtzer <roger.oberholtzer@gmail.com> wrote:
A couple of years ago I had a problem where a program was streaming data to a file. Unfortunately, the OS cached the contents in RAM and only dealt with it occasionally. The problem was that this would cause periodic delays of a long time while the cache was dealt with. Write calls would have to wait until this task was completed. See the following for the original discussion:
I remember.
Anyone have a clue about this? Maybe there is a better way to do this now? We are using ext4. IIRC, we tried other file systems at the time, but this was a general feature of all file systems.
My mistake. While the OS is ext4, the disks with the stream files (there are two) are xfs. Both of these are as it was in 13.1 when it last functioned.
As Dave mentions, maybe you should also reformat those disks so that the latest updates to the filesystem are available. Or, perhaps better, create a test system with Leap 15.0 and newly formatted xfs disks.

Then I would ask on the xfs mailing list. I don't remember if you did so when the original thread arose.

-- Cheers / Saludos, Carlos E. R. (from 42.3 x86_64 "Malachite" at Telcontar)
On Mon, May 21, 2018 at 1:51 PM, Carlos E. R. <robin.listas@telefonica.net> wrote:
On 2018-05-21 13:14, Roger Oberholtzer wrote:
On Mon, May 21, 2018 at 12:57 PM, Roger Oberholtzer <roger.oberholtzer@gmail.com> wrote:
A couple of years ago I had a problem where a program was streaming data to a file. Unfortunately, the OS cached the contents in RAM and only dealt with it occasionally. The problem was that this would cause periodic delays of a long time while the cache was dealt with. Write calls would have to wait until this task was completed. See the following for the original discussion:
I remember.
Anyone have a clue about this? Maybe there is a better way to do this now? We are using ext4. IIRC, we tried other file systems at the time, but this was a general feature of all file systems.
My mistake. While the OS is ext4, the disks with the stream files (there are two) are xfs. Both of these are as it was in 13.1 when it last functioned.
As Dave mentions, maybe you should also format those disks to have the latest updates to the filesystem available. Or, perhaps better, create a test system with leap 15.0 and newly formatted xfs disks.
We use Leap 42.3 in production, so we want to stay with that here.

We can try re-formatting the disks. I suspect it makes no difference, as this is not a file-system-specific thing: ext4, xfs, etc. do not implement the caching themselves. It is the generic file system layer in the kernel that seems to manage this.
Then I would ask on the xfs thread. I don't remember if you did so when the original thread arose.
I asked when this first happened. We discovered that the same thing happened with all file system types. The memory cache is a generic kernel feature that, it seems, all file systems share.

-- Roger Oberholtzer
On 2018-05-21 14:40, Roger Oberholtzer wrote:
On Mon, May 21, 2018 at 1:51 PM, Carlos E. R. <> wrote:
On 2018-05-21 13:14, Roger Oberholtzer wrote:
On Mon, May 21, 2018 at 12:57 PM, Roger Oberholtzer <> wrote:
As Dave mentions, maybe you should also format those disks to have the latest updates to the filesystem available. Or, perhaps better, create a test system with leap 15.0 and newly formatted xfs disks.
We use Leap 42.3 in production. So we want to stay with that here.
Yes, that's a good reason. But Leap 15.0 has a newer kernel and might behave differently for this task. I would test it. And 42.3 will be EOL in a few months.
We can try re-formatting the disks. I suspect it makes no difference as it is not a file system type thing. Meaning that ext4, xfs, etc do not implement this. It is the general file system stuff in the kernel that seems to manage this.
Then I would ask on the xfs thread. I don't remember if you did so when the original thread arose.
I asked when this first happened. We discovered that the same thing happened for all file system types. The memory cache is a general file system feature that, it seems, all file systems share.
Ah, then if they said that, there is little hope that a difference in the on-disk format would help. I wonder if there is a mount option that would help? A commit timeout?

Nevertheless, the XFS people have been very active and have made changes to the filesystem. Now that I think about it, I do not know whether it is possible to update the on-disk format "online".

-- Cheers / Saludos, Carlos E. R. (from 42.3 x86_64 "Malachite" at Telcontar)
On Mon, May 21, 2018 at 2:58 PM, Carlos E. R. <robin.listas@telefonica.net> wrote:
And 42.3 will be EOL in few months.
IIRC, Jan 2019. We are always a bit behind in the release we use for production. No matter how I try to be more current, it never seems to happen.

We do our system installations from an OEM image that we make with KIWI, so all systems are identical. Even though this makes the OS installation quick and easy, the general system configuration is always a bit more involved. And things like what I am describing here always result in lots of additional testing / troubleshooting. So I can get people to do it at most every third year. We usually start a year in advance of when we need to use the system in production. We are very conservative this way :)

-- Roger Oberholtzer
I see the following write times when making a file:

# time dd if=/dev/zero of=file.txt count=2096576 bs=4096
2096576+0 records in
2096576+0 records out
8587575296 bytes (8.6 GB, 8.0 GiB) copied, 42.8592 s, 200 MB/s

real    0m45.338s
user    0m0.526s
sys     0m5.100s

# time dd if=/dev/zero of=file.txt count=1096576 bs=4096
1096576+0 records in
1096576+0 records out
4491575296 bytes (4.5 GB, 4.2 GiB) copied, 2.69905 s, 1.7 GB/s

real    0m2.708s
user    0m0.333s
sys     0m2.366s

The 8 GB case is perhaps closer to what we are doing. We make files so that each is not over 2 GB, but the writing is sustained. Perhaps this is not a bad write speed; it is faster than our data rate, so we should be keeping up.

It is interesting that the disk sync time shows up in neither user nor sys time for the 8 GB file. But as you can see from the real time, we still have to wait for it to happen.

But something just ain't right. Time to review our application to see if something else has happened. We used to use an Intel library to compress JPEG buffers (memory -> memory). We now use TurboJPEG to do the same. That seems to be as fast, but it is the only thing we have changed in the code. (Famous last words - but I did verify this in our revision control system.)

-- Roger Oberholtzer
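One way to see the stalls from inside the writer is to time each fwrite() call. A rough sketch; the 4 MB buffer, the file name and the 100 ms reporting threshold are arbitrary choices for illustration, not what the real application uses:

#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Time each fwrite() of a large buffer and report the calls that stall,
 * i.e. the moments where kernel writeback makes the writer wait. */
int main(void)
{
    enum { BUFSZ = 4 * 1024 * 1024 };            /* 4 MB per write */
    char *buf = calloc(1, BUFSZ);
    FILE *fp = fopen("stream.dat", "wb");        /* illustrative file name */
    if (buf == NULL || fp == NULL)
        return 1;

    for (int i = 0; i < 2048; i++) {             /* ~8 GB in total */
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (fwrite(buf, 1, BUFSZ, fp) != BUFSZ)
            return 1;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e6;
        if (ms > 100.0)
            printf("write %d took %.1f ms\n", i, ms);
    }
    fclose(fp);
    free(buf);
    return 0;
}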
Roger Oberholtzer wrote:
But something just ain't right. Time to review our application to see if something else has happened. We used to use an Intel library to compress JPEG buffers (memory -> memory). We now use Turbo JPEG to do the same. That seems to be as fast. But it is the only thing we have changed in the code. (Famous last words - but I did verify this in our revision control system.)
I didn't really follow closely, but did the old system already use kernel page table isolation? Do things change if you boot with nopti?

If nonsense -> /dev/null
On 21.05.2018 16:55, Roger Oberholtzer wrote:
I see the following write times when making a file:
# time dd if=/dev/zero of=file.txt count=2096576 bs=4096
2096576+0 records in
2096576+0 records out
8587575296 bytes (8.6 GB, 8.0 GiB) copied, 42.8592 s, 200 MB/s

real    0m45.338s
user    0m0.526s
sys     0m5.100s

# time dd if=/dev/zero of=file.txt count=1096576 bs=4096
1096576+0 records in
1096576+0 records out
4491575296 bytes (4.5 GB, 4.2 GiB) copied, 2.69905 s, 1.7 GB/s

real    0m2.708s
user    0m0.333s
sys     0m2.366s
What are the values of /proc/sys/vm/dirty_ratio and dirty_bytes? And how much memory does the machine have?
The 8 GB is perhaps closer to what we are doing. We make files so that each is not over 2 GB. But the writing is sustained. Perhaps this is not a bad write speed. It is faster than the data rate. So we should be keeping up.
It is interesting that the disk sync time is not in either user or sys time for the 8 GB file. But as you can see from the real time, we still have to wait for it to happen.
I suspect this time is mostly accounted as "IO wait" time.
But something just ain't right. Time to review our application to see if something else has happened. We used to use an Intel library to compress JPEG buffers (memory -> memory). We now use Turbo JPEG to do the same. That seems to be as fast. But it is the only thing we have changed in the code. (Famous last words - but I did verify this in our revision control system.)
On 2018-05-21 15:55, Roger Oberholtzer wrote:
I see the following write times when making a file:
# time dd if=/dev/zero of=file.txt count=2096576 bs=4096
2096576+0 records in
2096576+0 records out
8587575296 bytes (8.6 GB, 8.0 GiB) copied, 42.8592 s, 200 MB/s

real    0m45.338s
user    0m0.526s
sys     0m5.100s
Try playing with options such as "oflag=direct", "nocache" or "dsync". Dsync just makes sure the data is written to disk asap, but it leaves a copy in the cache. Nocache first uses the kernel disk cache mechanism, then flushes it. Direct eliminates the use of the kernel cache altogether.
But something just ain't right. Time to review our application to see if something else has happened. We used to use an Intel library to compress JPEG buffers (memory -> memory). We now use Turbo JPEG to do the same. That seems to be as fast. But it is the only thing we have changed in the code. (Famous last words - but I did verify this in our revision control system.)
Maybe your code could flush each file as it writes it, similarly to how 'dd' does it.

-- Cheers / Saludos, Carlos E. R. (from 42.3 x86_64 "Malachite" at Telcontar)
Carlos E. R. wrote:
Try playing with options such as "oflag=direct",
dd if=/dev/zero of=foo bs=4k count=1K oflag=direct
4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.0689802 s, 60.8 MB/s    << 4k blocksize

dd if=/dev/zero of=foo bs=4M count=1K oflag=direct
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 5.24457 s, 819 MB/s    << 4M blocksize

dd if=/dev/zero of=foo bs=8M count=512 oflag=direct
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 5.04259 s, 852 MB/s    << 8M
I'll second this part, but seriously, 4k at a time?? Do you have to write such small amounts?

dd if=/dev/zero of=foo bs=16M count=256 oflag=direct
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 4.90653 s, 875 MB/s    << 16M

16M is the sweet spot on my system. Yours may vary.

As for your tests:

# time dd if=/dev/zero of=file.txt count=2096576 bs=4096
8587575296 bytes (8.6 GB, 8.0 GiB) copied, 42.8592 s, 200 MB/s
# time dd if=/dev/zero of=file.txt count=1096576 bs=4096
4491575296 bytes (4.5 GB, 4.2 GiB) copied, 2.69905 s, 1.7 GB/s

Those are no good -- you are writing to RAM, which eventually gets full and has to be flushed to disk. It's the pause when flushing to disk that is killing you. Use direct and it won't buffer into memory first.
On 2018-05-21 22:27, Linda Walsh wrote:
Carlos E. R. wrote:
Try playing with options such as "oflag=direct",
--- I'll second this part, but seriously, 4k at a time?? ---- Do you have to write such small amounts?
dd if=/dev/zero of=foo bs=4k count=1K oflag=direct
4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.0689802 s, 60.8 MB/s    << 4k blocksize

dd if=/dev/zero of=foo bs=4M count=1K oflag=direct
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 5.24457 s, 819 MB/s    << 4M blocksize

dd if=/dev/zero of=foo bs=8M count=512 oflag=direct
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 5.04259 s, 852 MB/s    << 8M

dd if=/dev/zero of=foo bs=16M count=256 oflag=direct
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 4.90653 s, 875 MB/s    << 16M

16M is the sweet spot on my system. Yours may vary.
Well, with a small block and direct writing to disk, the kernel cache is disabled and speed suffers. Increasing the size of the write block acts like having a cache, but in the application instead of in the kernel.
As for your tests:
# time dd if=/dev/zero of=file.txt count=2096576 bs=4096
8587575296 bytes (8.6 GB, 8.0 GiB) copied, 42.8592 s, 200 MB/s
# time dd if=/dev/zero of=file.txt count=1096576 bs=4096
4491575296 bytes (4.5 GB, 4.2 GiB) copied, 2.69905 s, 1.7 GB/s
--- those are no good -- you are writing to ram which eventually gets full and needs to flush to disk. It's the pause when flushing to disk that is killing you.
use direct and it won't buffer into memory first.
Not exactly, because direct disables the cache, and you can see how that impacts small writes. Rather, measure the speed when the cache flushes, for instance by forcing the cache to empty at the end of the dd command. Or, instead of "direct", try "nocache", which uses the cache and then empties it, which forces writing to disk. Or write a file ten times bigger than the RAM; there may be a 10% error in the measurement.

The application code can directly flush each file when it is closed. I do not know how to emulate from application code the flags that dd can use: direct, dsync, nocache... If I were the developer of that code I would try to find out and experiment. Maybe a flush for every file has a big impact.

-- Cheers / Saludos, Carlos E. R. (from 42.3 x86_64 "Malachite" at Telcontar)
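For what it's worth, dd's oflag= options map fairly directly onto the C file API, so an application can get the same behaviour. A minimal sketch under a few assumptions: the 16 MB block size and the file name "foo" are only illustrative, O_DIRECT needs block-aligned buffers and sizes, and error handling is reduced to the essentials, so this is not a drop-in change for an existing fwrite()-based writer:

#define _GNU_SOURCE                 /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Rough application-side equivalents of dd's oflag= options:
 *   oflag=direct  -> open(..., O_DIRECT): bypass the page cache entirely
 *   oflag=dsync   -> open(..., O_DSYNC): each write reaches the disk before
 *                    the call returns, but the page cache is still populated
 *   oflag=nocache -> normal buffered write, then
 *                    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED)         */
int main(void)
{
    const size_t bufsz = 16 * 1024 * 1024;        /* 16M, per Linda's sweet spot */
    void *buf;
    if (posix_memalign(&buf, 4096, bufsz) != 0)   /* O_DIRECT wants aligned memory */
        return 1;
    memset(buf, 0, bufsz);

    int fd = open("foo", O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (int i = 0; i < 16; i++)                  /* writes a 256 MB test file */
        if (write(fd, buf, bufsz) != (ssize_t)bufsz) { perror("write"); return 1; }

    close(fd);
    free(buf);
    return 0;
}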
Mystery solved (I think). It seems that our new JPEG compression to memory was reporting the original image size as the size of the compressed buffer. In my case, that is a difference of 28 times. So the program was writing, say, 2.8 MB instead of 100 KB for each image buffer. At our frame rate (50 fps), that will not work.

I have not verified that the workaround I needed before (periodic syncs to ensure that when the system flushed, the delay was not too big) is still being started in the rc init script that systemd runs. That's a different question...

-- Roger Oberholtzer
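For anyone following along: the TurboJPEG API returns the compressed size through an output parameter, and that value, not the source image size, is what has to reach fwrite(). A minimal sketch; the frame dimensions, pixel format, quality setting and function name are made up for illustration:

#include <stdio.h>
#include <stdlib.h>
#include <turbojpeg.h>

/* Compress one RGB frame with TurboJPEG and write only the compressed
 * bytes: the size to write is the jpegSize filled in by tjCompress2(),
 * not width*height*3. */
int compress_and_write(const unsigned char *rgb, int width, int height, FILE *out)
{
    tjhandle tj = tjInitCompress();
    unsigned char *jpegBuf = NULL;       /* let TurboJPEG allocate the buffer */
    unsigned long jpegSize = 0;          /* receives the *compressed* size    */

    if (tjCompress2(tj, rgb, width, 0 /* pitch */, height, TJPF_RGB,
                    &jpegBuf, &jpegSize, TJSAMP_420, 85 /* quality */, 0) != 0) {
        fprintf(stderr, "tjCompress2: %s\n", tjGetErrorStr());
        tjDestroy(tj);
        return -1;
    }

    size_t written = fwrite(jpegBuf, 1, jpegSize, out);

    tjFree(jpegBuf);
    tjDestroy(tj);
    return written == jpegSize ? 0 : -1;
}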
On 2018-05-22 11:29, Roger Oberholtzer wrote:
Mystery solved (I think). It seems that our new JPEG compression to memory was reporting the original image size as being the size of the compressed buffer. In my case, it was a difference of 28 times. So, the program was writing, say, 2.8 MB instead of 100K for each image buffer. At our frame rate (50 fps), this will not work.
Ah! :-)) -- Cheers / Saludos, Carlos E. R. (from 42.3 x86_64 "Malachite" at Telcontar)
Carlos E. R. wrote:
On 2018-05-21 22:27, Linda Walsh wrote:
Carlos E. R. wrote:
Try playing with options such as "oflag=direct",
I'll second this part, but seriously, 4k at a time?? ---- Do you have to write such small amounts?
dd if=/dev/zero of=foo bs=4k count=1K oflag=direct
4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.0689802 s, 60.8 MB/s    << 4k blocksize

dd if=/dev/zero of=foo bs=4M count=1K oflag=direct
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 5.24457 s, 819 MB/s    << 4M blocksize

dd if=/dev/zero of=foo bs=8M count=512 oflag=direct
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 5.04259 s, 852 MB/s    << 8M

dd if=/dev/zero of=foo bs=16M count=256 oflag=direct
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 4.90653 s, 875 MB/s    << 16M

16M is the sweet spot on my system. Yours may vary.
Well, with a small block and direct writing to disk, the kernel cache is disabled and speed suffers. Increasing the size of the write block acts like having a cache, but in the application instead of in the kernel.
Not exactly. Increasing the write size decreases *overhead*, just like sending packets through the network. If you send one packet of 1.5 kB and wait for it to be transmitted and received by the other end, you get very slow performance due to the overhead of sending each packet. Whereas if you do one big write and only need an acknowledgment of the whole thing having been received, you only wait for one reply. Whether you are writing to disk or to a network, the overhead of handling each packet reduces throughput.

It depends on how fast the user's application generates data. It generates video in real time and can't be paused. If it only needs 2.8 MB/s, any of these methods would work, but if it needed 100 times that, then writing 4k blocks makes no sense and wouldn't work even with oflag=nocache. Nocache tells the OS that it can throw away the data -- it doesn't force it to be thrown away. In writing a 428 GB file (then my disk filled), all of memory was filled long before the disk was full, and overall it only averaged 145 MB/s. Compare that to 875 MB/s when it used no OS caching. Using synchronous I/O, which does force the memory to be released at each write, reduced speed to 22 MB/s (with 4k blocks).

In all of these cases, there is no reason to use a 4K I/O size, which with a RAID may result in sub-optimal I/O. With RAID5- or RAID6-based RAID, the results could be abysmal. Even over a local network, a 4K I/O size can result in less than 10% of optimal bandwidth usage. My network Samba IO test shows this -- the test only measures network speed, reading from /dev/zero locally and writing to /dev/null on the far end, so no file-I/O buffering is involved. With a blocksize of 4k:
bs=4096 bin/iotest
Using bs=4.0K, count=524288, iosize=2.0G
R:2147483648 bytes (2.0GB) copied, 93.7515 s, 21.8MB/s
W:2147483648 bytes (2.0GB) copied, 92.1664 s, 22.2MB/s
(vs. its default 16M I/O size):
bin/iotest
Using bs=16.0M, count=128, iosize=2.0G
R:2147483648 bytes (2.0GB) copied, 3.23306 s, 633MB/s
W:2147483648 bytes (2.0GB) copied, 7.37567 s, 278MB/s
The thing that hurts you the most with small block sizes is the per-block overhead.
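Since the stream writer in question uses fwrite(), one low-effort way to cut that per-write overhead is to give stdio a much larger buffer with setvbuf(), so each write() that reaches the kernel is big. A minimal sketch; the 16 MB buffer, 4 kB record size and file name are illustrative only:

#include <stdio.h>
#include <stdlib.h>

/* Give stdio a 16 MB buffer so the kernel sees a few large writes
 * instead of many small ones. */
int main(void)
{
    const size_t bufsz = 16 * 1024 * 1024;
    char *iobuf = malloc(bufsz);
    FILE *fp = fopen("stream.dat", "wb");         /* illustrative file name */
    if (iobuf == NULL || fp == NULL)
        return 1;

    /* Must be called before the first read or write on fp. */
    if (setvbuf(fp, iobuf, _IOFBF, bufsz) != 0)
        return 1;

    char record[4096] = {0};                      /* small application record */
    for (int i = 0; i < 4096; i++)                /* 16 MB of 4k records */
        if (fwrite(record, 1, sizeof record, fp) != sizeof record)
            return 1;

    fclose(fp);                                   /* flushes the stdio buffer */
    free(iobuf);
    return 0;
}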
As for your tests:
# time dd if=/dev/zero of=file.txt count=2096576 bs=4096
8587575296 bytes (8.6 GB, 8.0 GiB) copied, 42.8592 s, 200 MB/s
# time dd if=/dev/zero of=file.txt count=1096576 bs=4096
4491575296 bytes (4.5 GB, 4.2 GiB) copied, 2.69905 s, 1.7 GB/s
--- those are no good -- you are writing to ram which eventually gets full and needs to flush to disk. It's the pause when flushing to disk that is killing you.
use direct and it won't buffer into memory first.
Not exactly,
--- Um...yes. EXACTLY. IF you use direct, it won't buffer it into file-buffer memory first.
because direct disables the cache and you see how that impacts small file writing
Ok... it impacts small file writes, but how does that support your disagreeing with the statement that direct I/O turns off buffering?

Besides, if you are writing video data to disk as fast as possible -- say standard HD, 1920x1080 at 4 bytes/dot, about 8.3 MB/frame, at 60 frames/s uncompressed -- it would take about 475 MB/s. There is no way to do that with 4k writes and it wouldn't make any sense. Writing whatever size is optimal for your disks would make sense. On my 10+ yr old setup, that's a 16M I/O size.
Or, instead of "direct", try "nocache", which uses the cache, then empties it, which thus forces writing to disk.
On a uniprocessor machine that would likely not make much difference vs. using 'sync':
dd if=/dev/zero of=foo bs=4k count=1k oflag=sync
4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.172128 s, 24.4 MB/s
But on a multi-cpu machine, nocache allows the cache to be released in the background:
dd if=/dev/zero of=foo bs=4k count=1k oflag=nocache
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.0348474 s, 120 MB/s
Or, write a file 10 times bigger than the ram. There may be a 10% error in the measurement.
The code of the application can call directly a flush of each file when they are closed. I do not know how to emulate the flags that dd can use: direct, dsync, nocache... If I were the developer of that code I would try to find out and experiment. Maybe a flush for every file impacts a lot.
--- Depends on filesize. What's more important is keeping up with your data rate and having HW that can handle it.
From the above, real-time uncompressed 4K video would take about 1.85 GB/s (1898 MB/s). That would need a large RAID, maybe with SSDs.
Small file writes kill performance. Tbird and FF use 4K I/O on everything -- IMAP, transfer to sendmail, local I/O -- and they all get doggy performance on large files. (Those are the 32-bit versions, BTW -- dunno what the 64-bit versions do.)
Thank you for an excellent analysis and figures, Linda. On 22/05/18 08:35 PM, L A Walsh wrote:
The thing that hurts you the most with small block sizes is the per-block overhead.
We saw this with networks too. Back when networking was over PSTN and unreliable, just like IP over bongo drums or smoke signals (see the relevant RFCs), we had to use smaller packets so that the cost of a retry was low. Later we had Ethernet, which was more reliable, and we increased the size of the packets. The analysis of IP-over-avian-carriers (again, see the RFC) pointed out that even with high latency and transmission delays, reliable, low-distortion packets could be quite large. Of course that leads to Vint Cerf's work on the Interplanetary Internet. https://www.wired.com/2013/05/vint-cerf-interplanetary-internet/

You might also find "The Practical Ramifications of Interstellar Packet Loss" by William Shunn an amusing read. That recently arrived asteroid from another solar system represents a slow but HUGE packet!

-- A: Yes. > Q: Are you sure? >> A: Because it reverses the logical flow of conversation. >>> Q: Why is top posting frowned upon?
On 2018-05-23 02:35, L A Walsh wrote:
Carlos E. R. wrote:
On 2018-05-21 22:27, Linda Walsh wrote:
Carlos E. R. wrote:
Try playing with options such as "oflag=direct",
I'll second this part, but seriously, 4k at a time?? ---- Do you have to write such small amounts?
dd if=/dev/zero of=foo bs=4k count=1K oflag=direct
4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.0689802 s, 60.8 MB/s    << 4k blocksize

dd if=/dev/zero of=foo bs=4M count=1K oflag=direct
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 5.24457 s, 819 MB/s    << 4M blocksize

dd if=/dev/zero of=foo bs=8M count=512 oflag=direct
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 5.04259 s, 852 MB/s    << 8M

dd if=/dev/zero of=foo bs=16M count=256 oflag=direct
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 4.90653 s, 875 MB/s    << 16M

16M is the sweet spot on my system. Yours may vary.
Well, with a small block and direct writing to disk, the kernel cache is disabled and speed suffers. Increasing the size of the write block acts like having a cache, but in the application instead than by the kernel.
Not exactly. Increasing the write size decreases *overhead* just like sending packets through the network. If you send 1 packet of 1.5kB and wait for it to be transmitted & received by the other end, you will get very slow performance due to the overhead of sending each packet. Vs. if you have 1 write and only need an acknowledgment of the whole thing having been received, you only need to wait for 1 reply. Whether you are writing to disk or to a network, the overhead of handling each packet reduces throughput.
While this is true, you forget the impact of "oflag=direct". Without that flag, you see a much smaller difference between writing 1KB or 1MB chunks. Yes, I know that writing small chunks has an impact on performance.
It depends on how fast the user's application generates data. It generates video in real time and can't be paused. If it only needs 2.8MB/s, any of these methods would work, but if it needed 100 times that, then writing 4k blocks makes no sense and wouldn't work even with oflag=nocache. Nocache tells the OS that it can throw away the data -- it doesn't force it to be thrown away. In writing a 428GB file (then my disk filled), all of memory was filled long before it filled the disk and overall, only averaged 145MB/s.
Correct. Anyway, the problem turned out to be an error in the code, so it was something different and is now solved.

-- Cheers / Saludos, Carlos E. R. (from 42.3 x86_64 "Malachite" at Telcontar)
On 05/21/2018 12:57 PM, Roger Oberholtzer wrote:
The solution that worked was to have something like the following running:
while [ 1 ]; do sync; echo 1 > /proc/sys/vm/drop_caches; sleep 60; done &
This really did solve the problem.
We have now updated the OS for this system to Leap 42.3. It does not seem that this is having the same effect. One difference I see is that the sync command seems to take a very long time. Like over a minute each time.
BTW: newer sync (coreutils >= 8.24) can take files as arguments and then syncs only the file systems that contain those files, so the others are not synced:

$ sync --help | grep -- -f
  -f, --file-system    sync the file systems that contain the files

Have a nice day,
Berny
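The application can do the same thing through the Linux-specific syncfs() call (which, as far as I know, is what sync -f uses underneath). A minimal sketch, with the path argument standing in for one of the stream files and the function name made up:

#define _GNU_SOURCE            /* for syncfs() */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Sync only the file system that holds the given file, instead of every
 * mounted file system as a plain sync(1)/sync(2) would. */
int sync_stream_fs(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return -1;
    }
    int ret = syncfs(fd);      /* Linux-specific, glibc >= 2.14 */
    if (ret != 0)
        perror("syncfs");
    close(fd);
    return ret;
}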
On 2018-05-23 07:54, Bernhard Voelker wrote:
On 05/21/2018 12:57 PM, Roger Oberholtzer wrote:
The solution that worked was to have something like the following running:
while [ 1 ]; do sync; echo 1 > /proc/sys/vm/drop_caches; sleep 60; done &
This really did solve the problem.
We have now updated the OS for this system to Leap 42.3. It does not seem that this is having the same effect. One difference I see is that the sync command seems to take a very long time. Like over a minute each time.
BTW: newer sync (>=8.24) supports to take a file system as argument, so the others are not sync'ed:
$ sync --help | grep -- -f
  -f, --file-system    sync the file systems that contain the files
Wow! :-) -- Cheers / Saludos, Carlos E. R. (from 42.3 x86_64 "Malachite" at Telcontar)
Bernhard Voelker wrote:
BTW: newer sync (>=8.24) supports to take a file system as argument, so the others are not sync'ed:
$ sync --help | grep -- -f
  -f, --file-system    sync the file systems that contain the files
Now, if only the variables in /proc/sys/net/ipv4 could be adjusted per interface. The needs of an internal 10 Gb Ethernet are rather different from those of a slower cable interface facing the internet. Things like tcp_low_latency and congestion control have different needs, among others.
Hello, On Wed, 23 May 2018, Linda Walsh wrote:
Now, if only the vars in /proc/sys/net/ipv4 could be adjusted per interface.
/proc/sys/net/ipv4/conf/*

HTH,
-dnh

-- With so many "textbook cases" of single points of failure, you'd think that we'd stop building systems to demonstrate the concept. - Matt Curtin
participants (10):
- Andrei Borzenkov
- Anton Aylward
- Bernhard Voelker
- Carlos E. R.
- Dave Howorth
- David Haller
- L A Walsh
- Linda Walsh
- Peter Suetterlin
- Roger Oberholtzer