On Tue, 2013-06-11 at 12:14 -0400, Greg Freemyer wrote:
On Tue, Jun 11, 2013 at 7:16 AM, Roger Oberholtzer <roger@opq.se> wrote:
Despite being quiet on this, we have not solved the problem. We have:
* Tried other file systems (e.g., ext4)
* Tried faster "server-grade" SATA disks
* Tried the SATA3 interface as well as SATA2
The same thing happens. Periodically, write calls are blocking for 4-5 seconds instead of the usual 20-30 msecs.
I have seen one unexpected thing: when running xosview during all this, the MEM display shows the cache use slowly growing. The machine has 32 GB of RAM. The cache use just grows and grows as the file system is written to. Here is the part I don't get:
* If I close all apps that have a file open on the file system, the cache use remains.
* If I run the sync(1) command, the cache use remains. I would have thought that the cache would be freed as there is nothing left to cache, if not immediately then over a decent amount of time. But this is not the case.
* Only when I unmount the file system does the cache get freed. Immediately.
Why would the cache grow and grow? Since the delay, when it happens, grows and grows, I get the feeling that this file system cache in RAM is slowly getting bigger and bigger, and each time it needs to be flushed, it takes longer and longer. If the cache is being emptied at some reasonable point, why would it continue to grow? Remember that for each mounted file system there is one process writing to a single file. The disk usage remains 100% constant in terms of what is sent to be written.
Is there some policy or setting that controls how the file system deals with file system cache in RAM? More specifically, is there any way to limit its size for a file system?
Is there a way to see how much of the RAM cache for a file system actually contains data waiting to be flushed?
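The closest thing I have found so far is /proc/meminfo: the Dirty and Writeback lines show how many kB are waiting to hit the disk. For the policy question, the vm.dirty_* sysctls under /proc/sys/vm (dirty_background_ratio, dirty_ratio, and friends) look like the relevant knobs, though they appear to be global rather than per file system. A minimal sketch of a poller for those two fields (it assumes only the standard field names):

  /* Print the Dirty and Writeback counters from /proc/meminfo once a
   * second. Stop with Ctrl-C. */
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
      char line[128];

      for (;;) {
          FILE *fp = fopen("/proc/meminfo", "r");
          if (!fp) {
              perror("fopen /proc/meminfo");
              return 1;
          }
          while (fgets(line, sizeof line, fp)) {
              if (strncmp(line, "Dirty:", 6) == 0 ||
                  strncmp(line, "Writeback:", 10) == 0)
                  fputs(line, stdout);
          }
          fclose(fp);
          sleep(1);
      }
  }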
I have seen some reports that using O_SYNC when opening the file makes the write times more even. I guess I could open() a file with this flag and then fdopen() it. fcntl() seems not to support O_SYNC...
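Something along these lines, I imagine (an untested sketch):

  /* Open with O_SYNC at the file-descriptor level, then wrap the fd in
   * a stdio stream. Untested sketch; error handling is minimal. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  FILE *fopen_sync(const char *path)
  {
      int fd = open(path, O_WRONLY | O_CREAT | O_SYNC, 0644);
      FILE *fp;

      if (fd < 0)
          return NULL;
      fp = fdopen(fd, "w");   /* stream writes are now synchronous */
      if (!fp)
          close(fd);
      return fp;
  }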
Roger,
O_SYNC does not bypass the cache; it just flushes continuously. It is not the same as drop_caches. You need O_DIRECT to bypass the cache.
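Note that O_DIRECT comes with alignment rules: the buffer address and the transfer size generally have to be multiples of the device's logical block size (512 bytes classically, 4096 on newer drives). A rough skeleton, with sizes that are illustrative rather than tuned:

  /* Skeleton of an O_DIRECT writer. Sketch only: assumes 4096-byte
   * alignment suits the underlying device. */
  #define _GNU_SOURCE            /* for O_DIRECT */
  #include <fcntl.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  #define ALIGN 4096
  #define CHUNK (64 * 1024)      /* must remain a multiple of ALIGN */

  int write_direct(const char *path, long chunks)
  {
      void *buf;
      long i;
      int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);

      if (fd < 0)
          return -1;
      if (posix_memalign(&buf, ALIGN, CHUNK)) {  /* aligned buffer required */
          close(fd);
          return -1;
      }
      memset(buf, 0, CHUNK);
      for (i = 0; i < chunks; i++)
          if (write(fd, buf, CHUNK) != CHUNK)
              break;
      free(buf);
      close(fd);
      return i == chunks ? 0 : -1;
  }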
If you want a write buffer and not a cache, why don't you just do that? A very basic attempt would be:
I think everyone is misunderstanding the situation. I am not doing anything with, expecting, or manipulating a cache. The cache I see is a totally private thing being done by the OS. The existence of the cache is not the problem. In fact, if there were no cache I would think something was wrong.

The problem I am seeing is that the cache grows and grows until it eats all my memory. In addition, as the cache grows, the periodic writes to disk take longer and longer. 100% reproducible.

To be clear: I do not ask for, manipulate, or in any other way influence the cache through any direct action in my application. I am only writing a single file from a single process. This file is growing at 25 MB a second (more or less). The file is opened with fopen(), written to, and then closed with fclose(). Files can be big, but never more than 2 GB each.

My initial thought was that the file system was doing something that led to the longer write delays. So I asked about XFS, which is the file system we use for this. As I later reported, the issue seems to exist for all block devices (ext4 as well, but not with /dev/null as the file).

I understand that the cache is there so I can possibly read data that has been recently written. However, I do not see how the kernel can just grow this cache until my memory is gone, especially when the bigger cache also results in significantly and increasingly longer delays in write completions.

The workaround that seems to correct the situation is to run this:

  while [ 1 ]
  do
      echo 1 > /proc/sys/vm/drop_caches
      sleep 60
  done &

Obviously a brute-force approach that is really only possible on my system because it does not seem to mess up general usage. The rate of 60 seconds is arbitrary, but each time the loop runs, the cache has grown to almost 3 GB.

I wrote a small app that simulates the problem (outlined below). I will verify that it really does reproduce it, and then I can post the C source (very tiny) if anyone wants to see what happens on their system.
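In outline it is nothing more than a timed write loop, roughly like this (the chunk size and the 100 ms reporting threshold are arbitrary; link with -lrt on older glibc):

  /* Write fixed-size chunks in a loop and report any call that blocks
   * unusually long. Sketch only; sizes and threshold are arbitrary. */
  #include <stdio.h>
  #include <string.h>
  #include <time.h>

  #define CHUNK (256 * 1024)

  static double ms_between(const struct timespec *t0, const struct timespec *t1)
  {
      return (t1->tv_sec - t0->tv_sec) * 1e3 +
             (t1->tv_nsec - t0->tv_nsec) / 1e6;
  }

  int main(int argc, char **argv)
  {
      static char buf[CHUNK];
      struct timespec t0, t1;
      FILE *fp;
      long i;

      if (argc < 2 || !(fp = fopen(argv[1], "w"))) {
          perror("fopen");
          return 1;
      }
      memset(buf, 'x', sizeof buf);
      for (i = 0; ; i++) {
          clock_gettime(CLOCK_MONOTONIC, &t0);
          if (fwrite(buf, 1, sizeof buf, fp) != sizeof buf)
              break;
          clock_gettime(CLOCK_MONOTONIC, &t1);
          if (ms_between(&t0, &t1) > 100.0)
              fprintf(stderr, "write %ld blocked for %.0f ms\n",
                      i, ms_between(&t0, &t1));
      }
      fclose(fp);
      return 0;
  }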
- create a named pipe per output file
- dd if=named_pipe of=file oflag=direct bs=64K
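The producer side then treats the FIFO like an ordinary file, with dd doing the O_DIRECT writing so the application never deals with the alignment rules itself. Something like this (the paths are made up):

  /* Producer side of the named-pipe trick: create a FIFO and write to
   * it as if it were a plain file while dd drains it, e.g.:
   *   dd if=/tmp/capture.pipe of=/data/capture.bin oflag=direct bs=64K
   * Sketch only; error handling is minimal. */
  #include <errno.h>
  #include <stdio.h>
  #include <sys/stat.h>
  #include <sys/types.h>

  int main(void)
  {
      const char *fifo = "/tmp/capture.pipe";
      FILE *fp;

      if (mkfifo(fifo, 0644) != 0 && errno != EEXIST) {
          perror("mkfifo");
          return 1;
      }
      fp = fopen(fifo, "w");    /* blocks until dd opens the other end */
      if (!fp) {
          perror("fopen");
          return 1;
      }
      fputs("data goes here\n", fp);   /* real code streams ~25 MB/s */
      fclose(fp);
      return 0;
  }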
This is an interesting approach to getting direct I/O. I will have to file this for future reference.

Yours sincerely,

Roger Oberholtzer

Ramböll RST / Systems
Office: Int +46 10-615 60 20
Mobile: Int +46 70-815 1696
roger.oberholtzer@ramboll.se

Ramböll Sverige AB
Krukmakargatan 21
P.O. Box 17009
SE-104 62 Stockholm, Sweden
www.rambollrst.se