[opensuse] XFS and openSUSE 12.1
I am using XFS on a 12.1 system. The system records jpeg data to large files in real time. We have used XFS for this for a while since it has as a listed feature that it is well suited to writing streaming media data. We have used this for quite a while on openSUSE 11.2.

We have developed a new version of this system that collects more data. What I have found is that the jpeg data is typically written at the speed I expect. Every once in a while, the write takes 100x longer. Instead of the expected 80 msecs or so to do the compress and write, it takes, say, 4 or 5 seconds. I have looked in all the usual suspect places, and nothing seems to point at anything. For one test, I wrote to /dev/null instead of the real file: the delays do not happen. They do seem to be related to actually writing to the physical disk.

I expect some delay occasionally when disks are physically flushed. There is buffering in our application to allow for this. But 5 seconds is simply wrong.

So, I am curious if anyone has seen performance issues like this with XFS on openSUSE 12.1.

Yours sincerely,
Roger Oberholtzer
Ramböll RST / Systems
Roger Oberholtzer <roger@opq.se> wrote:
I am using XFS on a 12.1 system. The system records jpeg data to large files in real time. We have used XFS for this for a while since it has as a listed feature that it is well suited to writing streaming media data. We have used this for quite a while on openSUSE 11.2.
We have developed a new version of this system that collects more data. What I have found is that the jpeg data is typically written at the speed I expect. Every once in a while, the write takes 100x longer. Instead of the expected 80 msecs or so to do the compress and write, it takes, say, 4 or 5 seconds. I have looked in all the usual suspect places, and nothing seems to point at anything. For one test, I wrote to /dev/null instead of the real file: the delays do not happen. They do seem to be related to actually writing to the physical disk.
I expect some delay occasionally when disks are physically flushed. There is buffering in our application to allow this. But 5 seconds is simply wrong.
So, I am curious if anyone has seen performance issues like this with XFS on openSUSE 12.1.
The xfs mailing list has very knowledgeable people on it, and they address performance questions routinely, so you can ask there: xfs@oss.sgi.com (I think).

Separately, have you tried all of the elevators? I think there are only 3 or 4.

Last, but not least, have you tried O_DIRECT in your open call? That can have a major impact since it disables kernel buffering.

Greg
On Mon, 2013-06-03 at 09:11 -0400, Greg Freemyer wrote:
O_DIRECT
From the open(2) man page:
"The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances."--Linus So of course I will have to try it! -- Yours sincerely, Roger Oberholtzer OPQ Systems / Ramböll RST Office: Int +46 10-615 60 20 Mobile: Int +46 70-815 1696 roger.oberholtzer@ramboll.se ________________________________________ Ramböll Sverige AB Krukmakargatan 21 P.O. Box 17009 SE-104 62 Stockholm, Sweden www.rambollrst.se -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
On Mon, Jun 3, 2013 at 1:39 PM, Roger Oberholtzer <roger@opq.se> wrote:
On Mon, 2013-06-03 at 09:11 -0400, Greg Freemyer wrote:
O_DIRECT
From the open(2) man page:
"The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances."--Linus
So of course I will have to try it!
I said it will have a major impact. I didn't say if it would be a good or bad impact!

In general for streaming i/o loads I think it is a good thing. dd as an example has an option to use O_DIRECT for the i/o. For a normal random i/o workload it is probably horrible.

OTOH, database tools will sometimes want to totally control how caching works, so they use O_DIRECT to get as close to the hard drive as they can.

Greg
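For concreteness, here is a minimal sketch of what an O_DIRECT write path looks like in C (an illustration only, not code from this thread; the file name, 4096-byte alignment and 1 MiB chunk size are all assumptions):

    /* Sketch of an O_DIRECT write path. O_DIRECT requires the user buffer,
     * file offset and transfer size to be aligned, typically to the logical
     * sector size (512 bytes or 4 KiB). */
    #define _GNU_SOURCE                /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK (1 << 20)            /* write in 1 MiB aligned chunks */

    int main(void)
    {
        void *buf;
        if (posix_memalign(&buf, 4096, CHUNK) != 0) {  /* aligned buffer */
            perror("posix_memalign");
            return 1;
        }
        memset(buf, 0, CHUNK);         /* stand-in for a chunk of JPEG data */

        int fd = open("testfile.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Each write must be a multiple of the sector size under O_DIRECT;
         * short or unaligned writes fail with EINVAL on most filesystems. */
        if (write(fd, buf, CHUNK) != CHUNK)
            perror("write");

        close(fd);
        free(buf);
        return 0;
    }

The alignment rule is exactly what makes this awkward for a JPEG stream: the compressor produces variable-size output, so the application has to accumulate it into aligned, fixed-size chunks itself.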
On Mon, 2013-06-03 at 12:52 -0400, Greg Freemyer wrote:
I said it will have a major impact. I didn't say if it would be a good or bad impact!
In general for streaming i/o loads I think it is a good thing. dd as an example has an option to use O_DIRECT for the i/o. For a normal random i/o workload it is probably horrible.
OTOH, database tools will sometimes want to totally control how caching works, so they use O_DIRECT to get as close to the hard drive as they can.
I have three JPEG compression methods that I play with: (1) the standard turbo that comes with openSUSE (used in the systems with the issues described here), (2) a re-implementation of that based on the Intel Performance Primitives (IPP) (same API/ABI), (3) another one based on IPP that deals with the image as an object, via a different API. #2 is 3-4 times faster than #1, and #3 is 6-8 times faster than #1 on our gray scale images of road surfaces.

All of these will of course provide data of different sizes, as the compressed images vary in size. I do not relish the thought of maintaining fixed-size buffers from any of these to manage O_DIRECT disk I/O.

I think I will try a different file system, and then a different disk device in the computer.

I am rather surprised by all of this, as each disk is written to by only one thread, and only one file at a time is written. There is not a great demand on writing. The disk access light turns on periodically, once every second or so. So it would seem that the data is being written rather often.

I wonder if there is some other system dynamic I am missing. There is ample memory, and all buffers are pre-allocated. There are 16 cores, and when I look at the system load, 10 are 99% idle.

-- Yours sincerely,
Roger Oberholtzer
OPQ Systems / Ramböll RST
On Mon, Jun 3, 2013 at 3:07 PM, Roger Oberholtzer <roger@opq.se> wrote:
I have three JPEG compression methods that I play with: (1) the standard turbo that comes with openSUSE (used in the systems with the issues described here), (2) a re-implementation of that based on the Intel Performance Primitives (IPP) (same API/ABI), (3) another one based on IPP that deals with the image as an object, via a different API.
#2 is 3-4 times faster than #1, and #3 is 6-8 times faster than #1 on our gray scale images of road surfaces.
All of these will of course provide data of different sizes as the compressed images vary in size. I do not relish the thought of maintaining fixed sized buffers from any of these to manage O_DIRECT disk I/O.
I think I will try a different file system, and then a different disk device in the computer.
I am rather surprised by all of this as each disk is written to by only one thread, and only one file at a time is written. There is not a great demand on writing. The disk access light turns on periodically, once every second or so. So it would seem that the data is being written rather often.
I wonder if there is some other system dynamic I am missing. There is ample memory, and all buffers are pre-allocated. There are 16 cores, and when I see the system load, 10 are 99% idle.
Seriously, post to the XFS list. They get end-user questions fairly often and are pretty friendly about helping out. I've been really surprised how much help they give people working with raid systems.

Greg
Greg Freemyer wrote:
Seriously, post to the XFS list. They get end-user questions fairly often and are pretty friendly about helping out.
I've been really surprised how much help they give people working with raid systems.
What Greg said. ++

Note that you'll get a much friendlier reception if you post all the information they ask for when you ask your question:

http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_r...

Might be worth skimming the rest of the FAQ too. You'll be talking direct to the devs, and likely getting a response from them.

HTH, Dave
On Tue, 2013-06-04 at 09:33 +0100, Dave Howorth wrote:
What Greg said. ++
Note that you'll get a much friendlier reception if you post all the information they ask for when you ask your question
http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_r...
Might be worth skimming the rest of the FAQ too. You'll be talking direct to the devs, and likely getting a response from them.
I am seeing how to collect the information they may want. Thanks for the pointer.

FTHOI, I changed my app to write to either /dev/null or a file on a tmpfs partition. Oddly, both reported the same 10 msecs to perform the operations. I would have thought that the tmpfs write would take a bit longer, as it really happens, while the /dev/null write is simply discarded. Unless this indicates some problem with my metric.

Yours sincerely,
Roger Oberholtzer
Ramböll RST / Systems
Roger Oberholtzer wrote:
FTHOI, I changed my app to write to either /dev/null or a file on a tmpfs partition. Oddly, both reported the same 10 msecs to perform the operations. I would have thought that the tmpfs write would take a bit longer as it really happens, while the /dev/null write is simply discarded. Unless this indicates some problem with my metric.
Some quirk of the timing system? Try doing a benchmark 1000 times as big?
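If a timing quirk is suspected, one way to rule it out is to time each write against CLOCK_MONOTONIC, which is immune to wall-clock and NTP adjustments. A minimal sketch (illustrative only; the 64 KiB buffer is an arbitrary stand-in for one image):

    /* Time a single write() with a monotonic clock. */
    #define _POSIX_C_SOURCE 200809L
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    static double elapsed_ms(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
    }

    int main(void)
    {
        static char buf[64 * 1024];    /* stand-in for one image's data */
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        write(STDOUT_FILENO, buf, sizeof buf);  /* redirect stdout to the target */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        fprintf(stderr, "write took %.3f ms\n", elapsed_ms(t0, t1));
        return 0;
    }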
Roger Oberholtzer wrote:
I am seeing how to collect the information they may want. Thanks for the pointer.
FTHOI, I changed my app to write to either /dev/null or a file on a tmpfs partition. Oddly, both reported the same 10 msecs to perform the operations. I would have thought that the tmpfs write would take a bit longer as it really happens, while the /dev/null write is simply discarded. Unless this indicates some problem with my metric.
The difference between writing to a character device and writing to a file in memory is probably not really measurable for one operation.

-- Per Jessen, Zürich
On 2013-06-03 18:52, Greg Freemyer wrote:
In general for streaming i/o loads I think it is a good thing. dd as an example has an option to use O_DIRECT for the i/o. For a normal random i/o workload it is probably horrible.
I don't see O_DIRECT mentioned in the dd man page :-?

-- Cheers / Saludos, Carlos E. R.
On 06/03/2013 10:48 PM, Carlos E. R. wrote:
I don't see O_DIRECT mentioned in the dd man page :-?
'direct' is a FLAG for iflag and/or oflag:

    $ man dd | grep direct | head -n 1
           direct     use direct I/O for data

Anyway, for coreutils programs (as for many other GNU projects) the man page is a copy of the --help output, so for details it is better to look into the texinfo manual:

    $ info coreutils 'dd invocation'

    `direct'
         Use direct I/O for data, avoiding the buffer cache.  Note that
         the kernel may impose restrictions on read or write buffer
         sizes.  For example, with an ext4 destination file system and
         a linux-based kernel, using `oflag=direct' will cause writes
         to fail with `EINVAL' if the output buffer size is not a
         multiple of 512.

Have a nice day,
Berny
On 2013-06-03 22:57, Bernhard Voelker wrote:
On 06/03/2013 10:48 PM, Carlos E. R. wrote:
I don't see O_DIRECT mentioned in the dd man page :-?
'direct' is a FLAG for iflag and/or oflag:
$ man dd | grep direct | head -n 1
       direct     use direct I/O for data
Oops! I don't know how it escaped me. I must have mistyped the search string in 'man'.
Anyway, for coreutils programs (as for many other GNU projects) the man page is a copy of the --help output, so for details it is better to look into the texinfo manual:
Yes, it happens.

-- Cheers / Saludos, Carlos E. R.
On Mon, Jun 3, 2013 at 4:48 PM, Carlos E. R. <robin.listas@telefonica.net> wrote:
On 2013-06-03 18:52, Greg Freemyer wrote:
In general for streaming i/o loads I think it is a good thing. dd as an example has an option to use O_DIRECT for the i/o. For a normal random i/o workload it is probably horrible.
I don't see O_DIRECT mentioned in the dd man page :-?
It's not, they call it "direct I/O". As in:

    iflag=direct oflag=direct

You would have to use strace to verify that it causes the O_DIRECT flag to be passed to open, but I'm pretty sure.

Greg
Roger Oberholtzer wrote:
I am using XFS on a 12.1 system. The system records jpeg data to large files in real time. We have used XFS for this for a while since it has as a listed feature that it is well suited to writing streaming media data. We have used this for quite a while on openSUSE 11.2.
We have developed a new version of this system that collects more data. What I have found is that the jpeg data is typically written at the speed I expect. Every once in a while, the write takes 100x longer. Instead of the expected 80 msecs or so to do the compress and write, it takes, say, 4 or 5 seconds.
1) Have you tried using an XFS Real-Time segment? It was designed to prevent this type of lag.

2) How full is the disk & how fragmented is its free space?

3) I get the impression that you are collecting more data for the same records, such that each record (or significantly many of them) is growing beyond the space originally allocated for it. So, while the system starts by buffering while it looks for space to hold the additional information (possibly at the end of the disk), when the buffers fill, the OS kicks in to free space and forces a long wait while new space is sought for each buffer it wants to empty -- forcing a lengthy search for free blocks that won't be near your present data, but most likely at the end of it. Does that sound about right?

If #3 is true, you might get better long-term performance by restructuring your database: copy the files to another partition, and on that partition set allocsize= in your fstab to the largest size your files will become. This will spread out data when the allocator first allocates the files, so later updates won't require finding space that is far from the file.
On Mon, 2013-06-03 at 11:57 -0700, Linda Walsh wrote:
1) Have you tried using an XFS Real-Time segment? It was designed to prevent this type of lag.
I will have to explore this. I am not familiar with it.
2) How full is the disk & how fragmented is its free space?
Newly formatted. No fragmentation.
3) I get the impression that you are collecting more data for the same records, such that each (or significantly many) records are growing beyond the original space allocated for such records. So, while the system starts by buffering while it looks for space to hold the additional information (possibly at the end of the disk), when the buffers fill the OS kicks in to free space, and forces a long wait while new space is sought for each buffer it wants to empty -- forcing a lengthy search for free blocks that won't be near your present data, but most likely at the end of it. Does that sound about right?
It is a binary file that grows and grows (up to 2 GB, which is the max file size we allow). The file contains a stream of JPEG images. One after another. Each image is 1920 x 450. There are 50 of these per second at max speed. The system has no problem doing this. It can work fine for 30 minutes. Then a single compress suddenly takes 4 or 5 seconds. If I write to /dev/null instead of a physical file, the compress per image stays a constant 10 milliseconds. It is only when I fopen/fwrite a real file on an XFS disk that this happens.
If #3 is true, you might get better long-term performance improvement by restructuring your database by copying files to another partition, and on the other partition, set the allocsize= in your fstab, on the new partition to the size of the largest size your files will become. This will spread out data when the allocator first allocates the files so later updates won't require finding space that is far from the file.
-- Yours sincerely,
Roger Oberholtzer
OPQ Systems / Ramböll RST
On 2013-06-03 23:46, Roger Oberholtzer wrote:
On Mon, 2013-06-03 at 11:57 -0700, Linda Walsh wrote:
3) I get the impression that you are collecting more data for the same records, such that each (or significantly many) records are growing beyond the original space allocated for such records. So, while the system starts by buffering while it looks for space to hold the additional information (possibly at the end of the disk), when the buffers fill the OS kicks in to free space, and forces a long wait while new space is sought for each buffer it wants to empty -- forcing a lengthy search for free blocks that won't be near your present data, but most likely at the end of it. Does that sound about right?
It is a binary file that grows and grows (up to 2 GB, which is the max file size we allow). The file contains a stream of JPEG images. One after another. Each image is 1920 x 450. There are 50 of these per second at max speed. The system has no problem doing this. It can work fine for 30 minutes. Then a single compress suddenly takes 4 or 5 seconds. If I write to /dev/null instead of a physical file, the compress per image stays a constant 10 milliseconds. It is only when I fopen/fwrite a real file on an XFS disk that this happens.
It does look as if XFS is doing some restructuring. I would try allocating the 2GB in advance.

Try the XFS mail list. They should know.

-- Cheers / Saludos, Carlos E. R.
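Allocating the space in advance can be done portably with posix_fallocate(3). A minimal sketch of the suggestion (the file name is hypothetical; 2 GB is the maximum file size mentioned earlier in the thread):

    /* Reserve the full 2 GB up front so the block allocator is not
     * searched mid-recording. On XFS and ext4 this is a fast extent
     * preallocation, not a slow write of zeros. */
    #define _FILE_OFFSET_BITS 64
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        off_t size = 2147483648LL;     /* 2 GB, the thread's max file size */
        int fd = open("capture.bin", O_WRONLY | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* Returns 0 on success or an errno value (it does not set errno). */
        int err = posix_fallocate(fd, 0, size);
        if (err != 0)
            fprintf(stderr, "posix_fallocate: error %d\n", err);

        close(fd);
        return 0;
    }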
On Mon, 2013-06-03 at 22:51 +0200, Carlos E. R. wrote:
It does look as if XFS is doing some restructuring. I would try allocating the 2GB in advance.
I was thinking about doing this. There has been a claim that when this happens, the disk activity light flickers a bit more than usual.
Try the XFS mail list. They should know.
Seems the popular thing I should do. Off to join a new list.

-- Yours sincerely,
Roger Oberholtzer
OPQ Systems / Ramböll RST
On 2013-06-04 00:02, Roger Oberholtzer wrote:
On Mon, 2013-06-03 at 22:51 +0200, Carlos E. R. wrote:
It does look as if XFS is doing some restructuring. I would try allocating the 2GB in advance.
I was thinking about doing this.
There has been a claim that when this happens, the disk activity light flickers a bit more than usual.
I guess...
Try the XFS mail list. They should know.
Seems the popular thing I should do. Off to join a new list.
Years ago I had a problem with XFS and they helped. I don't remember how exactly.

-- Cheers / Saludos, Carlos E. R.
Roger Oberholtzer wrote:
On Mon, 2013-06-03 at 11:57 -0700, Linda Walsh wrote:
1) Have you tried using an XFS Real-Time segment? It was designed to prevent this type of lag.
I will have to explore this. I am not familiar with it.
It's basically a way to get you guaranteed I/O speeds, but I think it sacrifices some flexibility -- like maybe requiring pre-allocation of files (a pure guess at what the requirements are, as I haven't used it either).
It is a binary file that grows and grows (up to 2 GB, which is the max file size we allow). The file contains a stream of JPEG images. One after another. Each image is 1920 x 450. There are 50 of these per second at max speed.
---- What I'm not clear on is your earlier statement that you increased the size per image and now are re-writing them? Is there a 'rewrite' involved, or are you simply dumping data to disk as fast as you can?

If it is the latter -- pre-allocate your space, and you will save yourself tons of perf issues ("xfs_alloc_file" or its equivalent calls). If you have a secondary process allocate one of these when the old one gets to 75% full, you shouldn't notice any hiccups.

Second thing -- someone else mentioned it -- and this holds whether you are writing or rewriting, so it's an independent variable: do your writes with O_DIRECT, and do your own buffering to at least 1M, better 16M, boundaries. If you use O_DIRECT you will want to be page & sector aligned (I think the kernel changed, and now you HAVE to be) or you will get an error indication. You will get about a 30% or greater increase in write throughput. This is assuming your app doesn't immediately turn around and need to read the data again, in which case you'd be penalized by not using the buffer cache.

Do you watch your free memory? I have an "xosview" window open with LOAD/CPU/MEM/DISK (and an outside net), so I can see used memory or cache memory becoming tight. Attached is a sample of what you can see. I did a kernel build (make -j) so you could see how it emptied out the cache, for example.
The system has no problem doing this. It can work fine for 30 minutes. Then a single compress suddenly takes 4 or 5 seconds.
4-5 seconds after 30 minutes?... Geez, even I have to catch my breath now and then!

If I write to /dev/null instead of a physical file, the compress per image stays a constant 10 milliseconds. It is only when I fopen/fwrite a real file on an XFS disk that this happens.
On Mon, 2013-06-03 at 17:13 -0700, Linda Walsh wrote:
It's basically a way to get you guaranteed I/O speeds, but I think it sacrifices some flexibility -- like maybe requiring pre-allocation of files (pure guess what the requirements are, as I haven't used it either).
---- What I'm not clear on is your earlier statement that you increased the size per image and now are re-writing them? Is there a 'rewrite' involved, or are you simply dumping data to disk as fast as you can?
Bad description by me. We have been using XFS for this type of application for years. Recently, we changed our cameras to ones with higher resolution, so we are now writing at a higher data rate. It is only a single file being written to. We do not delete files. We would expect fragmentation to be minimal for these two reasons.
If it is the latter -- pre-allocate your space, and you will save yourself tons of perf issues. "xfs_alloc_file" (or its equivalent calls).
If you have a 2ndary process allocate one of these when the old one gets to 75% full, you shouldn't notice any hiccups.
Second thing -- someone else mentioned it -- it sounds like (this is true if you are writing or rewriting, so independent variable), is to do writes with O_DIRECT and do your own buffering to buffer to at least 1M, better 16M boundaries. If you use O_DIRECT you will want to be page & sector (I think the kernel changed, and you now you HAVE to be) aligned or you will get an error indication.
You will get about a 30% or greater increase in write throughput. This is assuming your app doesn't immediately turn around and need to read the data again, in which case, you'd be penalized by not using the buffer cache.
We never read the data when writing it. The only thing we do is track the file size via ftell. This is because none of the jpeg libraries tell how much they have written.
Do you watch your free memory? I have an "xosview" window open with LOAD/CPU/MEM/DISK (and an outside net)... but I can see used memory or cache memory becoming tight. Attached is a sample of what you can see.. I did a kernel build (make -j) so you could see how it emptied out the cache, for example.
We have at least 10 GB free memory. And typically 10 idle CPUs.
The system has no problem doing this. It can work fine for 30 minutes. Then a single compress suddenly takes 4 or 5 seconds.
4-5 seconds after 30 minutes?... Geez, even I have to catch my breath now and then!
But you are not a computer...
If I write to /dev/null instead of a physical file, the compress per image stays a constant 10 milliseconds. It is only when I fopen/fwrite a real file on an XFS disk that this happens.
If #3 is true, you might get better long-term performance improvement by restructuring your database by copying files to another partition, and on the other partition, set the allocsize= in your fstab, on the new partition to the size of the largest size your files will become. This will spread out data when the allocator first allocates the files so later updates won't require finding space that is far from the file.
I have seen suggestions of using allocsize=64m when writing streaming media like I have. I will be trying this.

Yours sincerely,
Roger Oberholtzer
Ramböll RST / Systems
Roger Oberholtzer wrote:
I have seen suggestions of using allocsize=64m when writing streaming media like I have. I will be trying this.
In your situation, if your entire data file is always 2GB (i.e. you write data until you hit the 2GB mark, then close and open a new file), and if it is the only thing you are putting on that disk, you might try making allocsize=1g (the max). Then the allocator will get called twice per file, and on a clean disk it should have no problem finding space.

If it is a raid disk, and assuming you set it up stripe aligned, you'd want to use largeio and swalloc, but I doubt either of those would be at fault.

It sounds more like things build up in memory for 30 minutes, then something is flushing and/or having to work hard for 5 seconds -- like you've written to memory buffers for 30 minutes, then some routine calls to empty buffers because it is out of space. This is where I think you'll benefit most by using O_DIRECT in your open call and either a 2nd thread to handle I/O, or async I/O.

You shouldn't have a 5-second pause like that unless your app is doing lots of little writes and it takes 30 minutes to fill up memory to the point that it has to empty the buffer, but that sounds like a long time. Do you know if your disc is writing continuously while you execute, or do you get disk I/O only every 30 minutes? ;-) The extra cpu's. If you run iostat, does your output disk see continuous I/O, or only periodically? (i.e. "iostat 5 /dev/sdX", where X is the disk you are writing to.)

By writing to /dev/null instead of a disk you are avoiding the kernel's disk-io routines as well as the xfs-io routines. I.e. it could be either one. That's why trying O_DIRECT can help eliminate the kernel's I/O buffering routines.
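The "2nd thread to handle I/O" pattern, reduced to a minimal double-buffer sketch with pthreads (an illustration only, not the poster's code; slot size, file name and shutdown handling are all simplified assumptions):

    /* The capture thread fills one slot while the writer thread flushes
     * the other, so a slow write never blocks capture for long. */
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    #define SLOT (1 << 20)             /* 1 MiB per buffer slot */

    static char slots[2][SLOT];
    static int ready[2];               /* slot filled, awaiting the writer */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

    static void *writer(void *arg)
    {
        FILE *fp = arg;
        for (int i = 0; ; i ^= 1) {    /* alternate between the two slots */
            pthread_mutex_lock(&lock);
            while (!ready[i])
                pthread_cond_wait(&cond, &lock);
            pthread_mutex_unlock(&lock);

            fwrite(slots[i], 1, SLOT, fp);  /* the only call that can stall */

            pthread_mutex_lock(&lock);
            ready[i] = 0;              /* hand the slot back to the producer */
            pthread_cond_broadcast(&cond);
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        FILE *fp = fopen("capture.bin", "wb");
        if (!fp) { perror("fopen"); return 1; }

        pthread_t tid;
        pthread_create(&tid, NULL, writer, fp);

        for (int n = 0, i = 0; n < 8; n++, i ^= 1) {   /* 8 fake "frames" */
            pthread_mutex_lock(&lock);
            while (ready[i])           /* wait until the writer freed the slot */
                pthread_cond_wait(&cond, &lock);
            memset(slots[i], n, SLOT); /* stand-in for compressed JPEG data */
            ready[i] = 1;
            pthread_cond_broadcast(&cond);
            pthread_mutex_unlock(&lock);
        }
        /* A real program would drain the slots, join the writer and
         * fclose(fp) before exiting; omitted here for brevity. */
        return 0;
    }

(As described later in the thread, the application already does something like this with a deeper circular buffer; the sketch just shows the pattern in its smallest form.)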
On 06/04/2013 08:26 PM, Linda Walsh wrote:
In your situation, if your entire data file is always 2GB, (i.e. you write data until you hit the 2GB mark, then close and open a new file), and if it is the only thing you are putting on that disk, you might try [...]
... to use no file system and instead write directly to the raw partition, and thus manage the space there yourself. Just a thought. ;-)

Have a nice day,
Berny
On Tue, 2013-06-04 at 11:26 -0700, Linda Walsh wrote:
It sounds more like things build up in memory for 30 minutes, then something is flushing and/or having to work hard for 5 seconds -- like you've written to memory buffers for 30 minutes then some routine calls to empty buffers because it is out of space.
Any buffers are filled up very quickly when you write 25 MB/Sec. So something happening after 30 minutes is probably not related to general buffering.
This is where I think you'll benefit most by using O_DIRECT in your open call and either a 2nd thread to handle I/O, or async I/O.
You shouldn't have a 5 second pause like that unless your app is doing lots of little writes and it takes 30 minutes to fill up memory to the point that it has to empty the buffer, but that sounds like a long time.
The standard jpeg compression library does indeed do little writes. But these are via fwrite, so they are buffered via the FILE mechanism. I do not change the buffer sizes.
Do you know if your disc is writing continuously while you execute, or do you get disk-i/o only ever 30 minutes? ;-)
The disks are accessed every few seconds.
The extra cpu's. If you run iostat, does your output disk see continuous I/O, or only periodically? (i.e. "iostat 5 /dev/sdX", where X is the disk you are writing to).
By writing to /dev/null instead of a disk you are avoiding the kernel's disk-io routines as well as the xfs-io routines. I.e. it could be either one. That's why trying O_DIRECT can help eliminate the kernel's I/O buffering routines.
I do not think the kernel is the culprit. I think it is the hard disk itself. We will be trying some higher-performance discs.

-- Yours sincerely,
Roger Oberholtzer
OPQ Systems / Ramböll RST
Roger Oberholtzer wrote:
Any buffers are filled up very quickly when you write 25 MB/Sec. So something happening after 30 minutes is probably not related to general buffering.
Probably not on your config... 10GB? That'd take about 400s to fill at that rate, while buffering for the full 1800s would take about 45GB.
The standard jpeg compression library does indeed do little writes. But these are via fwrite, so they are buffered via the FILE mechanism. I go not change the buffer sizes.
Looking at the includes on my system, that implies you are doing pretty small writes -- the default buffer size for fwrite is 8192 bytes. I don't know how to change that, but that is a bit on the small side.
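For reference, the stdio buffer size can in fact be changed with setvbuf(3), which must be called after fopen() and before any other operation on the stream. A minimal sketch (the 4 MiB size and the file name are arbitrary assumptions):

    /* Enlarge the stdio buffer so the jpeg library's many small fwrite()s
     * are coalesced into larger writes to the kernel. */
    #include <stdio.h>
    #include <stdlib.h>

    #define BIGBUF (4 * 1024 * 1024)   /* 4 MiB, an arbitrary example size */

    int main(void)
    {
        FILE *fp = fopen("capture.bin", "wb");
        if (!fp) { perror("fopen"); return 1; }

        char *buf = malloc(BIGBUF);
        if (buf == NULL || setvbuf(fp, buf, _IOFBF, BIGBUF) != 0)
            fprintf(stderr, "keeping the default stdio buffering\n");

        /* ... fwrite() the JPEG stream as before ... */

        fclose(fp);                    /* flushes; buf must outlive the stream */
        free(buf);
        return 0;
    }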
Do you know if your disc is writing continuously while you execute, or do you get disk-i/o only ever 30 minutes? ;-)
The disks are accessed every few seconds.
Well, there ya go -- if you wanted, you could put those extra cpu's to work doing encoding and have 1 process that handles writing to disk.
I do not think the kernel is the culprit. I think it is the hard disk itself. We will be trying some higher performance discs.
echo "$i: $(<$i)" done age_buffer_centisecs: 1500 error_level: 3 filestream_centisecs: 3000 inherit_noatime: 1 inherit_nodefrag: 1 inherit_nodump: 1 inherit_nosymlinks: 0 inherit_sync: 1 irix_sgid_inherit: 0 irix_symlink_mode: 0
--- Well it wouldn't be the kernel by itself, it would be a combination of how the app is making calls. What type of HD are you using? single platter? low rpm? If you are recording images, it sounds like you don't need it to go much faster, as the images only come in at a certain speed... If you look in /proc/sys/fs/xfs, you see the xfs tunables. One of them "speculative_prealloc_lifetime", -- given that you are only writing at 1GB/400 seconds, that lifetime would timeout on my machine (set for 300 centiseconds if it is like the other timeouts).. so upping that, or .. ummm... Have you changed any defaults in there? Right now mine look like: s> for i in *;do panic_mask: 0 rotorstep: 1 speculative_prealloc_lifetime: 300 stats_clear: 0 xfsbufd_centisecs: 100 xfssyncd_centisecs: 3000 --- On low power machines, those might be set higher to turn on the disk less often...? Those are another area you could fiddle with .. er. experiment with..;-) Good luck -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
On Tue, 2013-06-04 at 16:45 -0700, Linda Walsh wrote:
Do you know if your disc is writing continuously while you execute, or do you get disk-i/o only ever 30 minutes? ;-)
The disks are accessed every few seconds.
Well, there ya go -- if you wanted, you could put those extra cpu's to work doing encoding and have 1 process that handles writing to disk.
The compression speed is not the issue. The app buffers 50 raw images from each camera (each managed in a separate thread). During normal use, the compression thread is completing compression of the image before the next arrives. Occasionally, the compression takes longer, which is expected and is why the images are buffered. Things like process scheduling and occasional disk flushes are expected. When the expected delays happen (a couple hundred extra milliseconds compressing/writing out an image), the buffer gets like two or five deep. But it immediately recovers.

These new delays of five or more seconds during a single compress cause the buffer to fill. I have increased the buffer depth (a circular list), but this causes other problems (memory being one of them).

I want to get to the cause of the delay. Hopefully I can correct it. If not, then I want to understand it so I have confidence that any corrective actions are the correct ones and can be expected to work. If we send a measurement system to some other country (this new prototype is mysteriously being tested in production on autobahns in Germany), I need to know the collected images are as expected. Re-measurement is a very costly process. With summer road repairs in full swing, sometimes re-measurement is not possible.

As an aside, maybe if a system passes SUSE offices someone may be interested in seeing what I think is a novel use of openSUSE.

Yours sincerely,
Roger Oberholtzer
Ramböll RST / Systems
Despite being quiet on this, we have not solved the problem. We have:

* Tried other file systems (e.g., ext4)
* Tried faster "server-grade" SATA disks.
* Tried the SATA3 interface as well as SATA2.

The same thing happens. Periodically, write calls block for 4-5 seconds instead of the usual 20-30 msecs.

I have seen one unexpected thing: when running xosview during all this, the MEM usage shows the cache use slowly growing. The machine has 32 GB of RAM. The cache use just grows and grows as the file system is written to. Here is the part I don't get:

* If I close all apps that have a file open on the file system, the cache use remains.
* If I run the 'sync(1)' command, the cache use remains. I would have thought that the cache would be freed as there is nothing left to cache. If not immediately, then over a decent amount of time. But this is not the case.
* Only when I unmount the file system does the cache get freed. Immediately.

Why would the cache grow and grow? Since the delay, when it happens, grows and grows, I get the feeling that this file system cache in RAM is slowly getting bigger and bigger, and each time it needs to be flushed, it takes longer and longer. If the cache is being emptied at some reasonable point, why would it continue to grow? Remember that for each mounted file system there is one process writing to a single file. The disk usage remains 100% constant in terms of what is sent to be written.

Is there some policy or setting that controls how the file system deals with file system cache in RAM? More specifically, is there any way to limit its size for a file system? Is there a way to see how much of the RAM cache for a file system actually contains data waiting to be flushed?

I have seen some reports that using O_SYNC when opening the file makes the write times more even. I guess I could open() a file with this, and then fdopen() it. fcntl() seems not to support O_SYNC...

Yours sincerely,
Roger Oberholtzer
Ramböll RST / Systems
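The open()-then-fdopen() idea from the message above, as a minimal sketch (the file name and mode are illustrative; O_SYNC makes each write reach stable storage before returning, trading throughput for bounded per-write latency):

    /* Open with O_SYNC, then wrap the descriptor in a FILE* so the
     * existing fwrite()-based code keeps working. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("capture.bin", O_WRONLY | O_CREAT | O_SYNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        FILE *fp = fdopen(fd, "w");
        if (!fp) { perror("fdopen"); close(fd); return 1; }

        /* Each stdio flush now blocks until the data is on disk, so the
         * cost is spread over every write instead of one giant flush. */
        const char data[] = "jpeg bytes...";
        fwrite(data, 1, sizeof data - 1, fp);

        fclose(fp);                    /* also closes the underlying fd */
        return 0;
    }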
Roger Oberholtzer wrote:
Despite being quiet on this, we have not solved the problem. ... Is there some policy or setting that controls how the file system deals with file system cache in RAM? More specifically, is there any way to limit it's size for a file system?
Is there a way to see how much of the RAM cache for a file system is actually containing data waiting to be flushed?
I haven't seen you ask about your problem on the XFS list? As before, I'd suggest that is the best place to find a solution and also to answer questions like this.

There is a discussion going on at the moment about delays and flushing.
On Tue, 2013-06-11 at 12:31 +0100, Dave Howorth wrote:
I haven't seen you ask about your problem on the XFS list? As before, I'd suggest that is the best place to find a solution and also to answer questions like this.
There is a discussion going on at the moment about delays and flushing.
Indeed I have not. Instead I have demonstrated that it is independent of the file system, as the same behavior happens with ext4. If the problem were limited to xfs, then I think they could help.

Yours sincerely,
Roger Oberholtzer
Ramböll RST / Systems
On 2013-06-11 13:52, Roger Oberholtzer wrote:
Indeed I have not. Instead I have demonstrated that it is independent of the file system as the same behavior happens with ext4. If the problem was limited to xfs, then I think they could help.
They can help in any case :-)

Even if it affects other filesystems, they will know why it affects their filesystem.

-- Cheers / Saludos, Carlos E. R.
Carlos E. R. said the following on 06/11/2013 08:03 AM:
They can help in any case :-)
Even if it affects other filesystems, they will know why it affects their filesystem.
Of course it affects all file systems. If Roger had continued he'd have found it affects ReiserFS, BtrFS and more.

It's about the way the SYSTEM caches data. It's not the file system.

Of course the cache doesn't go away when he uses sync! Why should it? It's a cache, not a buffer.
On Wed, 2013-06-12 at 06:53 -0400, Anton Aylward wrote:
Of course it affects all file systems. If Roger had continued he'd have found it affects ReiserFS, BtrFS and more.
When I saw it was XFS and EXT4, and that a VM setting affected it, I felt it was not FS-specific.
It's about the way the SYSTEM caches data. It's not the file system.
Of course the cache doesn't go away when he uses sync! Why should it? It's a cache, not a buffer.
Instead, it grows and grows...

Yours sincerely,
Roger Oberholtzer
Ramböll RST / Systems
Roger Oberholtzer said the following on 06/12/2013 10:00 AM:
Of course the cache doesn't go away when he uses sync! Why should it? It's a cache, not a buffer.

Instead, it grows and grows...
As others have said, the cache will grow to use all available memory.

That is a GOOD THING for a cache to do.
On Wed, 2013-06-12 at 18:51 -0400, Anton Aylward wrote:
As others have said, the cache will grow to use all available memory
That is a GOOD THING for a cache to do.
Even when that messes up running app requests for memory? It has been said that if memory is needed by some app, this cache may get smaller to accommodate. I am not sure that is the case. When memory use gets big, some apps die.

Yours sincerely,
Roger Oberholtzer
Ramböll RST / Systems
Roger Oberholtzer wrote:
Even when that messes up running app requests for memory? It has been said that if memory is needed by some app, this cache may get smaller to accommodate. I am not sure that is the case. When memory use gets big, some apps die.
Roger, if that were true, we'd all be in big trouble. The use of spare memory for file system cache is purely opportunistic - if an app needs memory, it will get it. Provided there is free memory, of course - but "free memory" includes memory used for caching.

-- Per Jessen, Zürich
Per Jessen said the following on 06/13/2013 02:42 AM:
Roger, if that were true, we'd all be in big trouble. The use of spare memory for file systems cache is purely opportunistic - if an app needs memory, it will get it. Provided there is free memory, of course - but "free memory" includes memory used for cacheing.
+1

Simple enough to test... That the system runs at all for a long time -- running many scripts, browsing and saving, lots of stuff going through /tmp, lots of browser tabs opening and closing, process space growing and shrinking, lots of... well, lots of...

What differentiated the UNIX model from "what went before" was that process creation was lightweight. That's what made shell programming so successful! The old mainframe processes like CICS were all long lived and needed constant 'tuning'. The UNIX processes spawned by the shell didn't live long enough to be worth tuning, or were spending most of their time sleeping for one reason or another. It wasn't until we started doing things 'the mainframe way', such as running Oracle[1], that we had long-lived processes. Things got worse from there.

One large military project I worked on back in the Cold War era was based around a VAX cluster. The guidelines required that each application locked itself in core in order to get "adequate performance". I saw this as a failure to understand how scheduling and memory use worked. I wrote my app to be very small and modular, so much so that the scheduler never thought to 'swap' it out when there were more productive ways to free up memory.

Analysis is always useful, and matching the strategy to the 'business needs' is so important that it cannot be over-emphasised. Linux, like Windows, doesn't need careful tuning in the common case of an 'office desktop', but there are more and more specialised situations, and they do need tuning. Unlike CICS they can - usually - be tuned once, or until the application base changes.

As I keep pointing out, Roger has what amounts to a write-only situation, and it's running on a huge brute of a machine. This is nothing like "default for desktop/gui/office" out-of-the-box settings.

[1] To be fair... IBM started with DB2 as the same monolithic model it had on the mainframes when it put DB2 up on AIX. Eventually they figured out the 'right way' was to do it the same way other native UNIX DBs worked, such as Progress, and have a number of small cooperating processes and a few more that get spawned and die. The result was more responsive, and it shifted the 'tuning' from the process to the database and disk, and to dealing with matters of IO and caching.
On 2013-06-13 08:29, Roger Oberholtzer wrote:
Even when that messes up running app requests for memory? It has been said that if memory is needed by some app, this cache may get smaller to accommodate. I am not sure that is the case. When memory use gets big, some apps die.
No, no, that can never happen. Apps can not die because all memory is used by the cache. If you can prove it happened to you, it would be a nasty kernel bug which they will be very interested to have in Bugzilla ASAP >:-)

-- Cheers / Saludos, Carlos E. R.
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 2013-06-11 13:16, Roger Oberholtzer wrote:
Despite being quiet on this, we have not solved the problem. We have:
* Tried other file systems (e.g., ext4) * Tried faster "server-grade" SATA disks. * Tried SATA3 interface as well as SATA2.
The same thing happens. Periodically, write calls are blocking for 4-5 seconds instead of the usual 20-30 msecs.
Ah, thus not XFS related.
I have seen one unexpected thing: when running xosview during all this, the MEM usage shows the cache use slowly growing. The machine has 32 GB of RAM. The cache use just grows and grows as the file system is written to. Here is the part I don't get:
Ok, as I understand it, it caches what you write in case something wants to read it again. Applications do not matter. Syncing does not matter, you may still want to read it again. And it grows while there is memory to grow it. Then I assume the older blocks get replaced with new contents. I do not know how to limit it, or tell the system not to cache certain operations. I think I remember the xine people talking of using raw read operations to avoid caching - caching a dvd video while playing makes no real sense as you only read it once. maybe you can do the same. Did the XFS people say something yet? You said you were going to ask them. - -- Cheers / Saludos, Carlos E. R. (from 12.3 x86_64 "Dartmouth" at Telcontar) -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iEYEARECAAYFAlG3C5EACgkQIvFNjefEBxrVFgCg2brzPGctU810cDCCevhVrH7m G4sAnA/BEBuM+odUQ+givYwxJgVqotrt =ag5W -----END PGP SIGNATURE----- -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
Roger Oberholtzer said the following on 06/11/2013 07:16 AM:
* If I close all apps that have a file open on the file system, the cache use remains. * If I run the 'sync(1)' command, the cache use remains. I would have thought that the cache would be freed as there is nothing left to cache. If not immediately, over a decent amount of time. But this is not the case. * Only when I unmount the file system does the cache get freed. Immediately.
I think you're confusing a cache with a write-through buffer. While *YOU* may know that this is being used in write-only mode, the system doesn't. A cache supposes (temporal, at least) locality of access: that the data may be needed again in a short while, and that even if it has been written out (as with 'flush') it may be wanted again shortly. In addition, what may be cached might not be data, it might be metadata for the file system. That an unmount clears the cache makes sense; there is nothing more wanted with respect to that drive. https://en.wikipedia.org/wiki/Cache_%28computing%29 <quote> In computer science, a cache is a component that transparently stores data so that future requests for that data can be served faster. The data that is stored within a cache might be values that have been computed earlier or duplicates of original values that are stored elsewhere. If requested data is contained in the cache (cache hit), this request can be served by simply reading the cache, which is comparatively faster. </quote> and https://en.wikipedia.org/wiki/Cache_%28computing%29#The_difference_between_b... <quote> Buffering, on the other hand, a) serves to reduce the number of transfers for otherwise novel data amongst communicating processes, which serves to amortize overhead involved for several small transfers over fewer, larger transfers ..... </quote> -- How long did the whining go on when KDE2 went on KDE3? The only universal constant is change. If a species can not adapt it goes extinct. That's the law of the universe, adapt or die. -- Billie Walsh, May 18 2013 -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
On Tue, 2013-06-11 at 07:36 -0400, Anton Aylward wrote:
Roger Oberholtzer said the following on 06/11/2013 07:16 AM:
* If I close all apps that have a file open on the file system, the cache use remains. * If I run the 'sync(1)' command, the cache use remains. I would have thought that the cache would be freed as there is nothing left to cache. If not immediately, over a decent amount of time. But this is not the case. * Only when I unmount the file system does the cache get freed. Immediately.
I think you're confusing a cache with a write-through buffer.
While *YOU* may know that this is being used in write-only mode, the system doesn't. A cache supposes (temporal, at least) locality of access: that the data may be needed again in a short while, and that even if it has been written out (as with 'flush') it may be wanted again shortly.
In addition, what may be cached might not be data, it might be metadata for the file system.
That an unmount clears the cache makes sense; there is nothing more wanted with respect to that drive.
https://en.wikipedia.org/wiki/Cache_%28computing%29 <quote> In computer science, a cache is a component that transparently stores data so that future requests for that data can be served faster. The data that is stored within a cache might be values that have been computed earlier or duplicates of original values that are stored elsewhere. If requested data is contained in the cache (cache hit), this request can be served by simply reading the cache, which is comparatively faster. </quote>
and https://en.wikipedia.org/wiki/Cache_%28computing%29#The_difference_between_b... <quote> Buffering, on the other hand, a) serves to reduce the number of transfers for otherwise novel data amongst communicating processes, which serves to amortize overhead involved for several small transfers over fewer, larger transfers ..... </quote>
I have no problem with this. I understand the concept and benefits of a cache. But what I do not understand is why, as the cache grows (seemingly out of my control), the amount that waits to be flushed to the disk grows, making each successive flush take longer. Or at least this is what it looks like is happening. So, when the OS has obtained, say, 24 GB of cache, each time it needs to flush it takes longer as there is so much. Of course, something else entirely may be happening. But this is what it looks like to me. I do not know how much of the cache is things that are not yet physically on the disk. I would like to know how to see that. I did open the file with O_SYNC. That was not good. The write times, although perhaps more even, were extremely slow - by a factor of 10 or so. Which is rather expected, as the system is not caching so much. However, the cache usage still grows the same. It is like there is one set of logic that only looks at the filesystem write rate and available RAM and decides to get more. And an independent set of logic that is flushing as O_SYNC indicates. The former seems 'broken' and the latter seems to work as expected.
-- How long did the whining go on when KDE2 went on KDE3?
The only universal constant is change. If a species can not adapt it goes extinct. That's the law of the universe, adapt or die. -- Billie Walsh, May 18 2013
Yours sincerely, Roger Oberholtzer Ramböll RST / Systems Office: Int +46 10-615 60 20 Mobile: Int +46 70-815 1696 roger.oberholtzer@ramboll.se ________________________________________ Ramböll Sverige AB Krukmakargatan 21 P.O. Box 17009 SE-104 62 Stockholm, Sweden www.rambollrst.se -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
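The "I would like to know how to see that" question has a direct answer: the kernel reports how much cached data is still waiting to reach disk in /proc/meminfo, on the Dirty: and Writeback: lines (values in kB). A minimal sketch that just prints those two lines:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char line[256];
        FILE *f = fopen("/proc/meminfo", "r");
        if (!f)
            return 1;
        /* Dirty: pages modified but not yet queued for disk;
           Writeback: pages being written out right now */
        while (fgets(line, sizeof line, f))
            if (!strncmp(line, "Dirty:", 6) || !strncmp(line, "Writeback:", 10))
                fputs(line, stdout);
        fclose(f);
        return 0;
    }

Watching those two numbers while the recorder runs would show whether the growing cache is mostly clean pages or a backlog of dirty ones.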
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 2013-06-11 14:04, Roger Oberholtzer wrote:
I have no problem with this. I understand the concept of and benefits of a cache. But what I do not understand is why, as the cache grows (seemingly out of my control), the amount that waits to be flushed to the disk grows, making each successive flush take longer. Or at least this is what it looks like is happening. So, when the OS has obtained, say, 24 GB of cache, each time it needs to flush it takes longer as there is so much.
I don't think it is waiting to be flushed. It is cached so that the next _read_ will come from memory. Yes, you will not read, but the kernel doesn't know that. - -- Cheers / Saludos, Carlos E. R. (from 12.3 x86_64 "Dartmouth" at Telcontar) -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iEYEARECAAYFAlG3E28ACgkQIvFNjefEBxoS5ACbB3S5soRj3lGS2rHfa2UgaZsn 5GwAn3Qt0/Tgw2S7WLeI1kGIFGHKHvl6 =n+u3 -----END PGP SIGNATURE----- -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
Carlos E. R. said the following on 06/11/2013 08:09 AM:
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
On 2013-06-11 14:04, Roger Oberholtzer wrote:
I have no problem with this. I understand the concept of and benefits of a cache. But what I do not understand is why, as the cache grows (seemingly out of my control), the amount that waits to be flushed to the disk grows, making each successive flush take longer. Or at least this is what it looks like is happening. So, when the OS has obtained, say, 24 GB of cache, each time it needs to flush it takes longer as there is so much.
I don't think it is waiting to be flushed. It is cached so that the next _read_ will come from memory. Yes, you will not read, but the kernel doesn't know that.
+1 -- How long did the whining go on when KDE2 went on KDE3? The only universal constant is change. If a species can not adapt it goes extinct. That's the law of the universe, adapt or die. -- Billie Walsh, May 18 2013 -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
On Wed, 2013-06-12 at 06:58 -0400, Anton Aylward wrote:
Carlos E. R. said the following on 06/11/2013 08:09 AM:
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
On 2013-06-11 14:04, Roger Oberholtzer wrote:
I have no problem with this. I understand the concept of and benefits of a cache. But what I do not understand is why, as the cache grows (seemingly out of my control), the amount that waits to be flushed to the disk grows, making each successive flush take longer. Or at least this is what it looks like is happening. So, when the OS has obtained, say, 24 GB of cache, each time it needs to flush it takes longer as there is so much.
I don't think it is waiting to be flushed. It is cached so that the next _read_ will come from memory. Yes, you will not read, but the kernel doesn't know that.
+1
No argument here. But that does not mean the kernel should cache things so that there are big delays dealing with these giant caches. You trade one problem for another... Yours sincerely, Roger Oberholtzer Ramböll RST / Systems Office: Int +46 10-615 60 20 Mobile: Int +46 70-815 1696 roger.oberholtzer@ramboll.se ________________________________________ Ramböll Sverige AB Krukmakargatan 21 P.O. Box 17009 SE-104 62 Stockholm, Sweden www.rambollrst.se -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
Roger Oberholtzer said the following on 06/12/2013 10:02 AM:
No argument here. But that does not mean the kernel should cache things so that there are big delays dealing with these giant caches. You trade one problem for another...
That's why, as others have pointed out, the defaults from installation on page ageing & flushing and cache retention are inappropriate for your context. As I keep saying Context is Everything -- How long did the whining go on when KDE2 went on KDE3? The only universal constant is change. If a species can not adapt it goes extinct. That's the law of the universe, adapt or die. -- Billie Walsh, May 18 2013 -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
Roger Oberholtzer said the following on 06/11/2013 07:16 AM:
Why would the cache grow and grow? Since the delay, when it happens, grows and grows, I get the feeling that this file system cache in RAM is slowly getting bigger and bigger, and each time it needs to be flushed, it takes longer and longer. If the cache is being emptied at some reasonable point, why would it continue to grow? Remember that for each mounted file system there is one process writing to a single file. The disk usage remains 100% constant in terms of what is sent to be written.
Is there some policy or setting that controls how the file system deals with file system cache in RAM? More specifically, is there any way to limit its size for a file system?
Is there a way to see how much of the RAM cache for a file system is actually containing data waiting to be flushed?
Your problem is that you are treating the cache as a buffer. Yes, things get written, but the FS doesn't know you are running in 'write-only' mode, so it's retaining the most recently written data and structural metadata just in case you want to read some of it back. *THAT* is what a cache is about. Retaining stuff in case you want it again soon. That you don't, that you are treating the cache as a buffer and it's behaving as a cache, is the root of what you are observing and complaining about. Every now and again the caching algorithm needs to 'flush' the 'least recently used' or sync metadata or somehow reorganise, or something like that, depending on the implementation and algorithm. As an earlier commentator pointed out, writing to a raw disk avoids this :-) Are there ways to control the cache? Probably, but it gets back to my point about you treating the cache as if it was a buffer. You want buffering but not caching because, as far as I can see, you are doing 'write only'. There are other factors such as 'commit times' and matters to do with how the file system reorganizes its b-trees, for example. All these can introduce the delays of which you speak. -- How long did the whining go on when KDE2 went on KDE3? The only universal constant is change. If a species can not adapt it goes extinct. That's the law of the universe, adapt or die. -- Billie Walsh, May 18 2013 -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
On Tue, 2013-06-11 at 07:48 -0400, Anton Aylward wrote:
Roger Oberholtzer said the following on 06/11/2013 07:16 AM:
Why would the cache grow and grow? Since the delay, when it happens, grows and grows, I get the feeling that this file system cache in RAM is slowly getting bigger and bigger, and each time it needs to be flushed, it takes longer and longer. If the cache is being emptied at some reasonable point, why would it continue to grow? Remember that for each mounted file system there is one process writing to a single file. The disk usage remains 100% constant in terms of what is sent to be written.
Is there some policy or setting that controls how the file system deals with file system cache in RAM? More specifically, is there any way to limit its size for a file system?
Is there a way to see how much of the RAM cache for a file system is actually containing data waiting to be flushed?
Your problem is that you are treating the cache as a buffer. Yes, things get written, but the FS doesn't know you are running in 'write-only' mode, so it's retaining the most recently written data and structural metadata just in case you want to read some of it back.
*THAT* is what a cache is about. Retaining stuff in case you want it again soon.
That you don't, that you are treating the cache as a buffer and it's behaving as a cache, is the root of what you are observing and complaining about.
I am not doing this. I am only reporting that it seems that as the file system gets written to, the OS cache grows and grows. As the cache grows, the occasional flushes that are done to the disk seem to take longer and longer. Coincidence? Perhaps. Repeatable? 100%.
Every now and again the caching algorithm needs to 'flush' the 'least recently used' or sync metadata or somehow reorganise, or something like that, depending on the implementation and algorithm.
As an earlier commentator pointed out, writing to a raw disk avoids this :-)
Perhaps. But I wonder about the performance of the raw disk compared to O_SYNC, where things are also not buffered. Maybe the raw disk is better because only bigger blocks get written to the disk, whereas O_SYNC is at the mercy of each variable-sized write(). Sigh. When writing to a disk at < 20% of its sustained write rate, on a system with many idle cores and lots of RAM, you would think that performance would be okay.
Are there ways to control the cache? Probably, but it gets back to my point about you treating the cache as if it was a buffer.
You want buffering but not caching because, as far as I can see, you are doing 'write only'.
There are other factors such as 'commit times' and matters to do with how the file system reorganizes its b-trees, for example. All these can introduce the delays of which you speak.
-- How long did the whining go on when KDE2 went on KDE3?
The only universal constant is change. If a species can not adapt it goes extinct. That's the law of the universe, adapt or die. -- Billie Walsh, May 18 2013
Yours sincerely, Roger Oberholtzer Ramböll RST / Systems Office: Int +46 10-615 60 20 Mobile: Int +46 70-815 1696 roger.oberholtzer@ramboll.se ________________________________________ Ramböll Sverige AB Krukmakargatan 21 P.O. Box 17009 SE-104 62 Stockholm, Sweden www.rambollrst.se -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
Roger Oberholtzer wrote:
Sigh. When writing to a disk at < 20% of its sustained write rate, on a system with many idle cores and lots of RAM, you would think that performance would be okay.
Grasping at straws - write a little program that allocates e.g. 30 GB of memory and keeps trawling through it (to keep it from being paged/swapped out). Add a 1 sec pause per 1 GB to keep it from maxing out a core. That'll keep your filesystem cache fairly low. Alternatively, yank some of those DIMM sticks. -- Per Jessen, Zürich (19.4°C) http://www.dns24.ch/ - free DNS hosting, made in Switzerland. -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
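A minimal sketch of that straw, assuming a 64-bit build and a machine where ~30 GB can actually be committed; the size and the one-second pause per GB are taken straight from the description above:

    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define GB (1024UL * 1024UL * 1024UL)
    #define TOTAL_GB 30UL

    int main(void)
    {
        unsigned char *mem = malloc(TOTAL_GB * GB);
        if (!mem)
            return 1;
        memset(mem, 1, TOTAL_GB * GB);   /* force the pages to be committed */
        for (;;) {
            for (unsigned long g = 0; g < TOTAL_GB; g++) {
                /* touch one byte per 4 KB page to keep the region resident */
                for (unsigned long off = 0; off < GB; off += 4096)
                    mem[g * GB + off]++;
                sleep(1);                /* ~1 sec pause per GB, per the suggestion */
            }
        }
    }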
Roger Oberholtzer said the following on 06/11/2013 08:15 AM:
That you don't, that you are treating the cache as a buffer and it's behaving as a cache, is the root of what you are observing and complaining about. I am not doing this. I am only reporting that it seems that as the file system gets written to, the OS cache grows and grows. As the cache grows, the occasional flushes that are done to the disk seem to take longer and longer. Coincidence? Perhaps. Repeatable? 100%.
What you are describing is the behaviour of a cache. I don't know why you say "I am not doing this". Perhaps you are not actively saying "I have programmed a cache", but the file system is running a cache anyway, because that is what file systems do. It's not you, it's the system you are running. It applies just as much to XFS as to ext4 as to any other FS. But what you *are* doing when you flush with sync etc etc that you described in previous email is _expecting_ the system cache to behave like a buffer. When you flush a buffer it empties. When you 'flush' a cache all you are doing is making sure that the disk is in sync with the cache. The cache does not empty. When you mount a FS the cache is empty, but every time you write or read from the FS it sticks in the cache up to the maximum capacity of the cache. The cache algorithm (probably) supposes that the most recently accessed stuff is going to be re-used so only drops "old" stuff. You might be able to set up a 'write-through' cache, that is, one which still retains material but does write immediately. That will overcome the delays you describe. The only way to not use the cache is to not mount a file system. As other people have said, write to the raw disk. -- How long did the whining go on when KDE2 went on KDE3? The only universal constant is change. If a species can not adapt it goes extinct. That's the law of the universe, adapt or die. -- Billie Walsh, May 18 2013 -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
On Wed, 2013-06-12 at 06:50 -0400, Anton Aylward wrote:
Roger Oberholtzer said the following on 06/11/2013 08:15 AM:
That you don't, that you are treating the cache as a buffer and it's behaving as a cache, is the root of what you are observing and complaining about. I am not doing this. I am only reporting that it seems that as the file system gets written to, the OS cache grows and grows. As the cache grows, the occasional flushes that are done to the disk seem to take longer and longer. Coincidence? Perhaps. Repeatable? 100%.
What you are describing is the behaviour of a cache.
I don't know why you say "I am not doing this". Perhaps you are not actively saying "I have programmed a cache", but the file system is running a cache anyway, because that is what file systems do. It's not you, it's the system you are running. It applies just as much to XFS as to ext4 as to any other FS.
I understand that the kernel is doing this as the result of me making a big file. So of course, ultimately, I am making it happen. What I am trying to say is that I am not telling the kernel to act stupid when making this big file so that all my memory goes to cache and that huge delays are incurred when dealing with the cache.
But what you *are* doing when you flush with sync etc etc that you described in previous email is _expecting_ the system cache to behave like a buffer. When you flush a buffer it empties. When you 'flush' a cache all you are doing is making sure that the disk is in sync with the cache. The cache does not empty.
When you mount a FS the cache is empty, but every time you write or read from the FS it sticks in the cache up to the maximum capacity of the cache. The cache algorithm (probably) supposes that the most recently accessed stuff is going to be re-used so only drops "old" stuff.
"maximum capacity of the cache" - there is the core of the problem. It seems that the kernel thinks the max capacity is all my RAM! When I limit the cache to at most 4 or so GB (the result of my fix), the system acts as I expect. When I let the system alone to deal with the cache, it grows until the system has a problem. I am happy to let the system cache things as it usually does. But I am unable to let it cache things as it seems to want to when a big file is written at a steady rate and there is lots of available RAM for the taking. Something in the logic seems to break. I just do not know what. I am guessing some of the tunable parameters for the kernel VM may help. Yours sincerely, Roger Oberholtzer Ramböll RST / Systems Office: Int +46 10-615 60 20 Mobile: Int +46 70-815 1696 roger.oberholtzer@ramboll.se ________________________________________ Ramböll Sverige AB Krukmakargatan 21 P.O. Box 17009 SE-104 62 Stockholm, Sweden www.rambollrst.se -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 2013-06-12 16:09, Roger Oberholtzer wrote:
On Wed, 2013-06-12 at 06:50 -0400, Anton Aylward wrote:
"maximum capacity of the cache" - there is the core of the problem. It seems that the kernel thinks the max capacity is all my RAM!
Absolutely! Didn't you know that? :-) - -- Cheers / Saludos, Carlos E. R. (from 12.3 x86_64 "Dartmouth" at Telcontar) -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iEYEARECAAYFAlG4jHkACgkQtTMYHG2NR9X6AwCfS1Nn37gv6b8DXZ2+EqTLLGe0 460Anjvv1gCuA21i+HxFM+NI6fZc0Ux3 =10q9 -----END PGP SIGNATURE----- -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
On Wed, 2013-06-12 at 16:58 +0200, Carlos E. R. wrote:
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
On 2013-06-12 16:09, Roger Oberholtzer wrote:
On Wed, 2013-06-12 at 06:50 -0400, Anton Aylward wrote:
"maximum capacity of the cache" - there is the core of the problem. It seems that the kernel thinks the max capacity is all my RAM!
Absolutely! Didn't you know that? :-)
Learning all the time. This particular class has been rather a surprise, in that I did not think what I am doing with the collection is so very unusual. This system is more than one experiment: * Use multiple high-resolution and high-speed GigEVision cameras synchronized with a high-powered xenon gas tube strobe system (we compete with sunlight...). FYI, Allied Vision Technologies (http://www.alliedvisiontec.com/emea/home.html) have great Linux support. I recommend their cameras. * Move this new camera system from a separate computer into our standard measurement system that is doing tons of other things in real time. So, we opted for a very good 16-core CPU and lots of RAM. And it seems to work! I am amazed at how fast things run. It is collecting data from many transducers, as well as controlling location identification in real time with a GPS-enabled inertial navigation system, calculating lots of things so the operator knows all is well. All this runs on openSUSE (currently 12.1 - we take a while to make each jump). The problem I am having here is not typical. Usually, things just sort of work. And this will too! Yours sincerely, Roger Oberholtzer Ramböll RST / Systems Office: Int +46 10-615 60 20 Mobile: Int +46 70-815 1696 roger.oberholtzer@ramboll.se ________________________________________ Ramböll Sverige AB Krukmakargatan 21 P.O. Box 17009 SE-104 62 Stockholm, Sweden www.rambollrst.se -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 2013-06-12 17:30, Roger Oberholtzer wrote:
On Wed, 2013-06-12 at 16:58 +0200, Carlos E. R. wrote:
"maximum capacity of the cache" - there is the core of the problem. It seems that the kernel thinks the max capacity is all my RAM!
Absolutely! Didn't you know that? :-)
Learning all the time. This particular class has been rather a surprise in that I did not think what I am doing with the collection is so very unusual.
This system is more than one experiment:
Ah, that's a job I would love. I had once a somewhat similar one and I miss it. :-)~~ (drooling)
All this runs on openSUSE (currently 12.1 - we take a while to make each jump). The problem I am having here is not typical. Usually, things just sort of work.
And this will too!
I would consider using Real Time features and kernel. Having the time, that is... - -- Cheers / Saludos, Carlos E. R. (from 12.3 x86_64 "Dartmouth" at Telcontar) -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iEYEARECAAYFAlG4utcACgkQtTMYHG2NR9X/BwCeJsT22BQBzxOCRCf9IzKgXMc4 9iwAni1JVIp9FSdYAwescFwF8xxpiAJ3 =nmyR -----END PGP SIGNATURE----- -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
Roger Oberholtzer said the following on 06/12/2013 10:09 AM:
I understand that the kernel is doing this as the result of me making a big file. So of course, ultimately, I am making it happen. What I am trying to say is that I am not telling the kernel to act stupid when making this big file so that all my memory goes to cache and that huge delays are incurred when dealing with the cache.
In effect you are; you are telling it, by not telling it otherwise, that the ageing and cache handling settings that are set at installation 'out of the box' are Ok, BECAUSE YOU HAVE NOT SET THEM UP TO DEAL WITH YOUR CONTEXT!
"maximum capacity of the cache" - there is the core of the problem. It seems that the kernel thinks the max capacity is all my RAM!
Well it is - as others have pointed out. Unless something else - some other process - makes a demand for memory, then using all the available memory for a cache is a sensible DEFAULT strategy. It's the default because you haven't told the system otherwise.
When I limit the cache to at most 4 or so GB (the result of my fix), the system acts as I expect. When I let the system alone to deal with the cache, it grows until the system has a problem.
Right, because the default is well suited to my context: a desktop running in 4G, email, a browser, an editor, an xterm. Real "multi-tasking", varied demands, lots of networking. If I were to try and tune my page ageing I'd be wasting my time. I keep saying Context is Everything, and my context is different from your context. The out-of-the-box defaults are "just fine, thank you" for my context.
I am happy to let the system cache things as it usually does. But I am unable to let it cache things as it seems to want to when a big file is written at a steady rate and there is lots of available RAM for the taking. Something in the logic seems to break. I just do not know what.
I am guessing some of the tunable parameters for the kernel VM may help.
I wish someone would tell us if it's possible to set cache activity by file system. Really, when it comes down to it, your application is write-only. The cache is of no benefit since you aren't reading anything back. Well, OK, it needs to cache the inode segments since it's creating new file nodes, oh, and the directory data. But so long as all you are doing is writing out the files, you don't need to cache data. But you do need the cache for the "other" things about Linux. If and only if both a) you have this data on a separate file system from the 'system', and b) each file system can have its own cache, then setting the cache(data) size to zero for the file system where you are storing the images will achieve what you want. But if you can't use separate caches then shrinking the cache to zero will hurt other parts of the performance. I think being aggressive about page ageing and flushing aged pages is the best approach in that case. -- How long did the whining go on when KDE2 went on KDE3? The only universal constant is change. If a species can not adapt it goes extinct. That's the law of the universe, adapt or die. -- Billie Walsh, May 18 2013 -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 2013-06-13 01:09, Anton Aylward wrote:
If and only if both a) you have this data on a separate file system from the 'system', and b) each file system can have its own cache, then setting the cache(data) size to zero for the file system where you are storing the images will achieve what you want.
I don't think a zero (write) cache is appropriate either. He needs a cache in case the system is doing something else at the moment, so that it can delay the actual writing for some seconds. Normally the cache would all be used for this application, it writes so much. So being able to set aside an appropriate cache for this application or this filesystem (whichever is feasible, if any) would be best. Say 2..4 GB for this, the rest for the system. Is this possible?
But if you can't use separate caches then shrinking the cache to zero will hurt other parts of the performance.
Right. And his real work system has more tasks to do.
I think being aggressive about page ageing and flushing aged pages is the best approach in that case.
Other tasks will be affected. Linda's idea of using "posix_fadvise" and opening the file for write only seems to me a very good one. But she also said "the calls are advisory". - -- Cheers / Saludos, Carlos E. R. (from 12.3 x86_64 "Dartmouth" at Telcontar) -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iEYEARECAAYFAlG5SZoACgkQtTMYHG2NR9Uv3gCfdP+x15ukHQpOolZeN4GXl+LG qC4AniCajinllacGmRk4EaGYjzsdnpPj =MTpC -----END PGP SIGNATURE----- -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
On Thu, 2013-06-13 at 06:24 +0200, Carlos E. R. wrote:
Linda's idea of using "posix_fadvise" and opening the file for write only seems to me a very good one. But she also said "the calls are advisory".
posix_fadvise(POSIX_FADV_DONTNEED) seems not to be advisory. It clears info for the specified file from the cache when it is called. So the app needs to call it regularly. Which is no problem. I just need to figure out the 'start' and 'stop' parameters. This call is finding its way into apps like tar and rsync that, like my app, may deal with big files/data and need not pollute the read cache, since once written the data is no longer needed by the app. Yours sincerely, Roger Oberholtzer Ramböll RST / Systems Office: Int +46 10-615 60 20 Mobile: Int +46 70-815 1696 roger.oberholtzer@ramboll.se ________________________________________ Ramböll Sverige AB Krukmakargatan 21 P.O. Box 17009 SE-104 62 Stockholm, Sweden www.rambollrst.se -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 2013-06-13 08:02, Roger Oberholtzer wrote:
On Thu, 2013-06-13 at 06:24 +0200, Carlos E. R. wrote:
Linda's idea of using "posix_fadvise" and opening the file for write only seems to me a very good one. But she also said "the calls are advisory".
posix_fadvise(POSIX_FADV_DONTNEED) seems not to be advisory. It clears info for the specified file from the cache when it is called. So the app needs to call it regularly. Which is no problem. I just need to figure out the 'start' and 'stop' parameters.
That it works does not deny being advisory :-) Maybe it means that it has been implemented and works.
This call is finding its way into apps like tar and rsync that, like my app, may deal with big files/data and need not pollute the read cache, since once written the data is no longer needed by the app.
Right. By the way, try not to use "fsync". It forces the kernel to write to disk even if not convenient. - -- Cheers / Saludos, Carlos E. R. (from 12.3 x86_64 "Dartmouth" at Telcontar) -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iEYEARECAAYFAlG5d5YACgkQtTMYHG2NR9V2BACgjr3TVSeE/jz0LSMQdHagMsoQ ybEAniSxIPcJ1MKCZW0jTQOIe2ansx6r =IgGq -----END PGP SIGNATURE----- -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
Carlos E. R. wrote:
On 2013-06-13 08:02, Roger Oberholtzer wrote:
Linda's idea of using "posix_fadvise" and opening the file for write only seems to me a very good one. But she also said "the calls are advisory".
On Thu, 2013-06-13 at 06:24 +0200, Carlos E. R. wrote: posix_fadvise(POSIX_FADV_DONTNEED) seems not to be advisory. It clears info for the specified file from the cache when it is called. So the app needs to call it regularly. Which is no problem. I just need to figure out the 'start' and 'stop' parameters.
That it works does not deny being advisory :-)
Maybe it means that it has been implemented and works.
Exactly. Remember that this is a POSIX call. There are many other operating systems than Linux that accept this call. Some of them may not follow the instruction it provides. Hence it is advisory. It is not a requirement on an operating system to obey this instruction in order to conform to POSIX. But apparently Linux does, at least in these circumstances.
By the way, try not to use "fsync". It forces the kernel to write to disk even if not convenient.
I think this is misguided advice. It is very important to use fsync to ensure that the data makes it onto the disk, in order to ensure data or whole files don't disappear. On a rename, for example, it is necessary to fsync the directory as well as the file. The point is that application correctness is more important than kernel or filesystem convenience. But clearly, fsync checkpoints should be chosen wisely - on transaction boundaries, perhaps, not on every write. -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
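As a sketch of the rename case mentioned above (the names and paths here are illustrative, not from this thread): fsync the file itself, rename it into place, then fsync the containing directory so the rename itself is also on disk:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int durable_replace(const char *tmp, const char *final, const char *dir)
    {
        int fd = open(tmp, O_WRONLY);
        if (fd < 0)
            return -1;
        if (fsync(fd) < 0 || close(fd) < 0)   /* file data + metadata to disk */
            return -1;
        if (rename(tmp, final) < 0)           /* atomic replace */
            return -1;
        int dfd = open(dir, O_RDONLY | O_DIRECTORY);
        if (dfd < 0)
            return -1;
        int rc = fsync(dfd);                  /* make the rename itself durable */
        close(dfd);
        return rc;
    }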
On 2013-06-13 12:59, Dave Howorth wrote:
Carlos E. R. wrote:
By the way, try not to use "fsync". It forces the kernel to write to disk even if not convenient.
I think this is misguided advice. It is very important to use fsync to ensure that the data makes it onto the disk in order to ensure data or whole files don't disappear.
Yes, of course the data must be written. But if we force the kernel to write it now, we reduce overall performance. Just my opinion, but you can make a run and compare. -- Cheers / Saludos, Carlos E. R. (from 12.3 x86_64 "Dartmouth" at Telcontar)
Carlos E. R. wrote:
On 2013-06-13 12:59, Dave Howorth wrote:
Carlos E. R. wrote:
By the way, try not to use "fsync". It forces the kernel to write to disk even if not convenient. I think this is misguided advice. It is very important to use fsync to ensure that the data makes it onto the disk in order to ensure data or whole files don't disappear.
Yes, of course the data must be written. But if we force the kernel to write it now, we reduce overall performance. Just my opinion, but you can make a run and compare.
Right but performance matters for nothing if the results are incorrect! Please read up about fsync before encouraging people not to use it. There are lots of links in this thread and in the XFS list discussion. -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
On Thu, 2013-06-13 at 14:52 +0100, Dave Howorth wrote:
Carlos E. R. wrote:
On 2013-06-13 12:59, Dave Howorth wrote:
Carlos E. R. wrote:
By the way, try not to use "fsync". It forces the kernel to write to disk even if not convenient. I think this is misguided advice. It is very important to use fsync to ensure that the data makes it onto the disk in order to ensure data or whole files don't disappear.
Yes, of course the data must be written. But if we force the kernel to write it now, we reduce overall performance. Just my opinion, but you can make a run and compare.
Right but performance matters for nothing if the results are incorrect!
Please read up about fsync before encouraging people not to use it. There are lots of links in this thread and in the XFS list discussion.
But it seems all those discussions are about what happens if the system fails. For some uses I can see that this is a valid concern. I do not think it is a problem for me. If the system should happen to fail, losing the last X minutes of data is the smaller hassle. Repositioning yourself on the road to start again and all the attendant details are more the problem. I am happy to say that system failure in a vehicle has been virtually non-existent. And it is a demanding environment. Of course, we have spent a bit of time getting the power clean and steady. Yours sincerely, Roger Oberholtzer Ramböll RST / Systems Office: Int +46 10-615 60 20 Mobile: Int +46 70-815 1696 roger.oberholtzer@ramboll.se ________________________________________ Ramböll Sverige AB Krukmakargatan 21 P.O. Box 17009 SE-104 62 Stockholm, Sweden www.rambollrst.se -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
On 2013-06-13 16:09, Roger Oberholtzer wrote:
On Thu, 2013-06-13 at 14:52 +0100, Dave Howorth wrote:
Carlos E. R. wrote:
Right but performance matters for nothing if the results are incorrect!
Please read up about fsync before encouraging people not to use it. There are lots of links in this thread and in the XFS list discussion.
But it seems all those discussions are about what happens if the system fails. For some uses I can see that this is a valid concern. I do not think it is a problem for me. If the system should happen to fail, losing the last X minutes of data is the smaller hassle. Repositioning yourself on the road to start again and all the attendant details are more the problem.
Right. I did hear about abusing fsync years ago, from kernel devs probably. Directly out of the horse's mouth.
I am happy to say that system failure in a vehicle has been virtually non-existent. And it is a demanding environment. Of course, we have spent a bit of time getting the power clean and steady.
That's quite an accomplishment :-) -- Cheers / Saludos, Carlos E. R. (from 12.3 x86_64 "Dartmouth" at Telcontar)
On 2013-06-13 15:52, Dave Howorth wrote:
Carlos E. R. wrote:
Right but performance matters for nothing if the results are incorrect!
They will be incorrect only if the system crashes. -- Cheers / Saludos, Carlos E. R. (from 12.3 x86_64 "Dartmouth" at Telcontar)
On Thu, 2013-06-13 at 13:55 +0200, Carlos E. R. wrote:
On 2013-06-13 12:59, Dave Howorth wrote:
Carlos E. R. wrote:
By the way, try not to use "fsync". It forces the kernel to write to disk even if not convenient.
I think this is misguided advice. It is very important to use use fsync to ensure that the data makes it onto the disk in order to ensure data or whole files don't disappear.
Yes, of course the data must be written. But if we force the kernel to write it now, we reduce overall performance. Just my opinion, but you can make a run and compare.
The problem for me is not that the data may not get written. That seems very stable. I would think that fsync would not really be needed, as indicated by my workaround that frees cache pages. Yours sincerely, Roger Oberholtzer Ramböll RST / Systems Office: Int +46 10-615 60 20 Mobile: Int +46 70-815 1696 roger.oberholtzer@ramboll.se ________________________________________ Ramböll Sverige AB Krukmakargatan 21 P.O. Box 17009 SE-104 62 Stockholm, Sweden www.rambollrst.se -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
On Thu, Jun 13, 2013 at 3:41 AM, Carlos E. R. <robin.listas@telefonica.net> wrote:
By the way, try not to use "fsync". It forces the kernel to write to disk even if not convenient.
The fadvise call only trashes cache that has already been written to disk. So you can call it as often as you want prior to the close, but the last thing that is done needs to be: fsync() fadvise(0,len,DONTNEED) close() Greg -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
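A sketch of that teardown order, assuming fd and len come from the recording code; the fsync() makes the pages clean, so the DONTNEED that follows can actually drop them:

    #include <fcntl.h>
    #include <sys/types.h>
    #include <unistd.h>

    int finish_file(int fd, off_t len)
    {
        if (fsync(fd) < 0)              /* push dirty pages to disk */
            return -1;
        /* pages are now clean, so DONTNEED can really free them */
        posix_fadvise(fd, 0, len, POSIX_FADV_DONTNEED);
        return close(fd);
    }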
On Thu, 2013-06-13 at 10:04 -0400, Greg Freemyer wrote:
On Thu, Jun 13, 2013 at 3:41 AM, Carlos E. R. <robin.listas@telefonica.net> wrote:
By the way, try not to use "fsync". It forces the kernel to write to disk even if not convenient.
The fadvise call only trashes cache that has already been written to disk. So you can call it as often as you want prior to the close, but the last thing that is done needs to be:
fsync() fadvise(0,len,DONTNEED)
Can the start,stop parameters be 0,0? Meaning all pages for this file?
close()
Greg
Yours sincerely, Roger Oberholtzer Ramböll RST / Systems Office: Int +46 10-615 60 20 Mobile: Int +46 70-815 1696 roger.oberholtzer@ramboll.se ________________________________________ Ramböll Sverige AB Krukmakargatan 21 P.O. Box 17009 SE-104 62 Stockholm, Sweden www.rambollrst.se -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
On Thu, Jun 13, 2013 at 10:12 AM, Roger Oberholtzer <roger@opq.se> wrote:
On Thu, 2013-06-13 at 10:04 -0400, Greg Freemyer wrote:
On Thu, Jun 13, 2013 at 3:41 AM, Carlos E. R. <robin.listas@telefonica.net> wrote:
By the way, try not to use "fsync". It forces the kernel to write to disk even if not convenient.
The fadvise call only trashes cache that has already been written to disk. So you can call it as often as you want prior to the close, but the last thing that is done needs to be:
fsync() fadvise(0,len,DONTNEED)
Can the start,stop parameters be 0,0? Meaning all pages for this file?
close()
Greg
Roger, I'm guessing you can read a man page as well as I can, but from the posix_fadvise man page: == The advice applies to a (not necessarily existent) region starting at offset and extending for len bytes (or until the end of the file if len is 0) within the file referred to by fd. == So, yes 0,0 means trash all cache from the start of the file to the end. fyi: This is a new call to me and I've never used it, so I have no first hand feedback. Greg -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
On Thu, 2013-06-13 at 10:18 -0400, Greg Freemyer wrote:
I'm guessing you can read a man page as well as I can, but from the posix_fadvise man page:
I know I deserved that. But I used the fadvise man page on the 'net (as there is no openSUSE version of that page), and the values for these were not defined. If I had tried posix_fadvise() I may have fared better.
fyi: This is a new call to me and I've never used it, so I have no first hand feedback.
Same here. After a day of meetings, perhaps I can try it to see what happens. Yours sincerely, Roger Oberholtzer Ramböll RST / Systems Office: Int +46 10-615 60 20 Mobile: Int +46 70-815 1696 roger.oberholtzer@ramboll.se ________________________________________ Ramböll Sverige AB Krukmakargatan 21 P.O. Box 17009 SE-104 62 Stockholm, Sweden www.rambollrst.se -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
On Thu, 2013-06-13 at 10:18 -0400, Greg Freemyer wrote:
fyi: This is a new call to me and I've never used it, so I have no first hand feedback.
Initial tests look promising. When the call is done, the cache drops by whatever the app has caused to be added. Unlike the brute-force approach I first did, the cache for other activities is not affected. More testing will follow. Yours sincerely, Roger Oberholtzer Ramböll RST / Systems Office: Int +46 10-615 60 20 Mobile: Int +46 70-815 1696 roger.oberholtzer@ramboll.se ________________________________________ Ramböll Sverige AB Krukmakargatan 21 P.O. Box 17009 SE-104 62 Stockholm, Sweden www.rambollrst.se -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
On 2013-06-13 18:03, Roger Oberholtzer wrote:
On Thu, 2013-06-13 at 10:18 -0400, Greg Freemyer wrote:
fyi: This is a new call to me and I've never used it, so I have no first hand feedback.
Initial tests look promising. When the call is done, the cache drops by whatever the app has caused to be added. Unlike the brute-force approach I first did, the cache for other activities is not affected.
Let me clarify. The cache initially grows while calls are made, and when the file is closed, it drops. About two gigs, right? Well, that's very good, just what you need. -- Cheers / Saludos, Carlos E. R. (from 12.3 x86_64 "Dartmouth" at Telcontar)
On Thu, 2013-06-13 at 21:09 +0200, Carlos E. R. wrote:
On 2013-06-13 18:03, Roger Oberholtzer wrote:
On Thu, 2013-06-13 at 10:18 -0400, Greg Freemyer wrote:
fyi: This is a new call to me and I've never used it, so I have no first hand feedback.
Initial tests look promising. When the call is done, the cache drops by whatever the app has caused to be added. Unlike the brute-force approach I first did, the cache for other activities is not affected.
Let me clarify.
The cache initially grows while calls are made, and when the file is closed, it drops. About two gigs, right? Well, that's very good, just what you need.
It does not drop when the file is closed. Only if the file is deleted or the file system umounted (neither of which I can reasonably do IRL), or posix_fadvise(fp, 0, 0, POSIX_FADV_DONTNEED) is called. So I now call posix_fadvise() periodically. I do not fiddle with fdatasync, as I do not mind a reasonable bit of buffering. If the cache was really mainly a read cache and not tons of stuff waiting to go to disk (as seems to be implied in various discussions), then I do not understand the occasional >4 second fwrite() delays and, more importantly, why the call to posix_fadvise() helps. That call does not cause a flush. It only arranges that pages that do not contain unique information (in mem and not on disk) are freed. Perhaps I am not meant to understand this mystery :) I am satisfied that I have a solution. I am still not satisfied that the dynamics of the situation are what we think. But life must go on. And testing proceeds. Yours sincerely, Roger Oberholtzer Ramböll RST / Systems Office: Int +46 10-615 60 20 Mobile: Int +46 70-815 1696 roger.oberholtzer@ramboll.se ________________________________________ Ramböll Sverige AB Krukmakargatan 21 P.O. Box 17009 SE-104 62 Stockholm, Sweden www.rambollrst.se -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
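A minimal sketch of that periodic workaround; the 1 MB dummy frame and the 256 MB trim interval are made-up placeholders, not the actual figures from the application:

    #include <fcntl.h>
    #include <unistd.h>

    static char frame[1 << 20];      /* stands in for one compressed jpeg */

    void recording_loop(int fd)
    {
        unsigned long written = 0;
        for (;;) {
            if (write(fd, frame, sizeof frame) != (ssize_t)sizeof frame)
                break;
            written += sizeof frame;
            /* every ~256 MB, drop this file's clean cache pages;
               a len of 0 means "to the end of the file" */
            if (written % (256UL << 20) == 0)
                posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        }
    }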
Greg Freemyer said the following on 06/13/2013 10:04 AM:
The fadvise call only trashes cache that has already been written to disk.
LOL! And what's the call to trash the cache that HASN'T been written to disk? -- Me...a skeptic? I trust you can prove that. -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
On Thu, Jun 13, 2013 at 10:55 AM, Anton Aylward <opensuse@antonaylward.com> wrote:
Greg Freemyer said the following on 06/13/2013 10:04 AM:
The fadvise call only trashes cache that has already been written to disk.
LOL!
And what's the call to trash the cache that HASN'T been written to disk?
Well, drop_caches (preceded by a sync) first gets the dirty cache pages written, then trashes them. One might assume that DONTNEED would also flush any dirty cache pages to disk, then trash them. The man page clearly states that is not true: if they are dirty, they are ignored. Greg -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
Anton Aylward wrote:
Greg Freemyer said the following on 06/13/2013 10:04 AM:
The fadvise call only trashes cache that has already been written to disk.
LOL!
And what's the call to trash the cache that HASN'T been written to disk?
halt_and_catch_fire() IIRC -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
Roger Oberholtzer wrote:
I have seen one unexpected thing: when running xosview during all this, the MEM usage shows the cache use slowly growing. The machine has 32 GB of RAM. The cache use just grows and grows as the file system is written to. Here is the part I don't get:
* If I close all apps that have a file open on the file system, the cache use remains. * If I run the 'sync(1)' command, the cache use remains. I would have thought that the cache would be freed as there is nothing left to cache.
sync flushes changed blocks to disk, but the file system's cache will still hold data cached for reading.
If not immediately, over a decent amount of time. But this is not the case. * Only when I unmount the file system does the cache get freed. Immediately.
Try /proc/sys/vm/drop_caches from https://www.kernel.org/doc/Documentation/sysctl/vm.txt --------------------------------------------------------------------- Writing to this will cause the kernel to drop clean caches, dentries and inodes from memory, causing that memory to become free. To free pagecache: echo 1 > /proc/sys/vm/drop_caches To free dentries and inodes: echo 2 > /proc/sys/vm/drop_caches To free pagecache, dentries and inodes: echo 3 > /proc/sys/vm/drop_caches As this is a non-destructive operation and dirty objects are not freeable, the user should run `sync' first. -------------------------------------------------------------------- -- Per Jessen, Zürich (20.4°C) http://www.dns24.ch/ - free DNS hosting, made in Switzerland. -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 2013-06-11 14:01, Per Jessen wrote:
Roger Oberholtzer wrote:
Try /proc/sys/vm/drop_caches
from https://www.kernel.org/doc/Documentation/sysctl/vm.txt ---------------------------------------------------------------------
Writing to this will cause the kernel to drop clean caches, dentries and
inodes from memory, causing that memory to become free.
To free pagecache: echo 1 > /proc/sys/vm/drop_caches To free dentries and inodes: echo 2 > /proc/sys/vm/drop_caches To free pagecache, dentries and inodes: echo 3 > /proc/sys/vm/drop_caches
As this is a non-destructive operation and dirty objects are not freeable, the user should run `sync' first. --------------------------------------------------------------------
But
as this empties everything, not only the entries he wants, other processes reading from or writing to disk will be delayed. For instance, syslog. Or library loading. Whatever. :-) - -- Cheers / Saludos, Carlos E. R. (from 12.3 x86_64 "Dartmouth" at Telcontar) -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iEYEARECAAYFAlG3EtQACgkQIvFNjefEBxrIvwCcDJGoQ6nKsfVlScAWJcljI0uE +y4AmwUGg+JyBU1Ap2XfTbYXkn112F/4 =tDII -----END PGP SIGNATURE----- -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
On Tue, 2013-06-11 at 14:06 +0200, Carlos E. R. wrote:
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
On 2013-06-11 14:01, Per Jessen wrote:
Roger Oberholtzer wrote:
Try /proc/sys/vm/drop_caches
from https://www.kernel.org/doc/Documentation/sysctl/vm.txt ---------------------------------------------------------------------
Writing to this will cause the kernel to drop clean caches, dentries and
inodes from memory, causing that memory to become free.
To free pagecache: echo 1 > /proc/sys/vm/drop_caches To free dentries and inodes: echo 2 > /proc/sys/vm/drop_caches To free pagecache, dentries and inodes: echo 3 > /proc/sys/vm/drop_caches
As this is a non-destructive operation and dirty objects are not freeable, the user should run `sync' first. --------------------------------------------------------------------
But
as this empties everything, not only the entries he wants, other processes reading from or writing to disk will be delayed. For instance, syslog. Or library loading. Whatever. :-)
In my strange use, I have only one single writer per disk, no reader. No files are deleted. Just one growing file (up to 2GB) on a pristine disk. As a little test, I made a 60 second loop that runs the following each iteration: echo 1 > /proc/sys/vm/drop_caches This results in the cache going away. More interestingly, it seems that this also results in even write times. There are small variations, which I expect. But I have yet to see the longer ones. There is no discernible delay when the command is run. Just less memory given over to cache, and (fingers crossed) less periodic housekeeping as a result. Of course, I have been fooled before. Longer tests are needed. Yours sincerely, Roger Oberholtzer Ramböll RST / Systems Office: Int +46 10-615 60 20 Mobile: Int +46 70-815 1696 roger.oberholtzer@ramboll.se ________________________________________ Ramböll Sverige AB Krukmakargatan 21 P.O. Box 17009 SE-104 62 Stockholm, Sweden www.rambollrst.se -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
On Tuesday 11 of June 2013 14:52:36 Roger Oberholtzer wrote:
In my strange use, I have only one single writer per disk, no reader. No files are deleted. Just one growing file (up to 2GB) on a pristine disk.
As a little test, I made a 60 second loop that runs the following each iteration:
echo 1 > /proc/sys/vm/drop_caches
This results in the cache going away. More interestingly, it seems that this also results in even write times. There are small variations, which I expect. But I have yet to see the longer ones. There is no discernible delay when the command is run. Just less memory given over to cache, and (fingers crossed) less periodic housekeeping as a result.
Have you tried reducing the maximum size of dirty data before it is written to disk? See https://www.kernel.org/doc/Documentation/sysctl/vm.txt , http://serverfault.com/questions/126413/limit-linux-background-flush-dirty-p..., in particular /proc/sys/vm/dirty_ratio or dirty_bytes. On my system, dirty_ratio is at 20%, which is too high for a system with 32 GB of RAM. What is the throughput of your disks?
Yours sincerely,
Roger Oberholtzer
Regards, Peter -- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org
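A sketch of applying that suggestion from inside a program rather than via sysctl; the 256 MB cap is purely illustrative, writing the file needs root, and dirty_bytes is the byte-valued counterpart of dirty_ratio (setting one zeroes the other):

    #include <stdio.h>

    int set_dirty_bytes(long bytes)
    {
        /* cap the amount of dirty data before writeback kicks in */
        FILE *f = fopen("/proc/sys/vm/dirty_bytes", "w");
        if (!f)
            return -1;
        fprintf(f, "%ld\n", bytes);
        return fclose(f);
    }

    /* e.g. set_dirty_bytes(256L << 20); */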
On Wed, 2013-06-12 at 12:33 +0300, auxsvr@gmail.com wrote:
On Tuesday 11 of June 2013 14:52:36 Roger Oberholtzer wrote:
In my strange use, I have only one single writer per disk, no reader. No files are deleted. Just one growing file (up to 2GB) on a pristine disk.
As a little test, I made a 60 second loop that runs the following each iteration:
echo 1 > /proc/sys/vm/drop_caches
This results in the cache going away. More interestingly, it seems that this also results in even write times. There are small variations, which I expect. But I have yet to see the longer ones. There is no discernible delay when the command is run. Just less memory given over to cache, and (fingers crossed) less periodic housekeeping as a result.
Have you tried reducing the maximum amount of dirty data allowed before it is written to disk? See https://www.kernel.org/doc/Documentation/sysctl/vm.txt , http://serverfault.com/questions/126413/limit-linux-background-flush-dirty-p..., in particular /proc/sys/vm/dirty_ratio or dirty_bytes. On my system, dirty_ratio is at 20%, which is too high for a system with 32GB of RAM.
What is the throughput of your disks?
Seems to be 100 MB/s in our tests. The app is generating 25 MB/sec. I will check what values I have for this. They are whatever is the installed default. My cache use goes way above 20%. It gets to 50 or 60% of RAM.

Yours sincerely,
Roger Oberholtzer
On Tue, Jun 11, 2013 at 02:01:05PM +0200, Per Jessen wrote:
Roger Oberholtzer wrote: [ 8< ]
If not immediately, over a decent amount of time. But this is not the case.
* Only when I unmount the file system does the cache get freed. Immediately.
Try /proc/sys/vm/drop_caches
Plus several other settings from /proc/sys/vm/ - vfs_cache_pressure, max_map_count, dirty_expire_centisecs, for example. You already played with /proc/sys/vm/swappiness? The default value is 60.

It might also be of benefit to compare the kernel-desktop vs kernel-default settings. See /boot/config-VERSION-[default,desktop]. See /etc/sysctl.conf with a hint to man pages in the header.

Very likely you'll get much better feedback on the kernel list.

Cheers,
Lars

-- Lars Müller [ˈlaː(r)z ˈmʏlɐ] Samba Team + SUSE Labs SUSE Linux, Maxfeldstraße 5, 90409 Nürnberg, Germany
Per Jessen said the following on 06/11/2013 08:01 AM:
Roger Oberholtzer wrote:
* If I run the 'sync(1)' command, the cache use remains. I would have thought that the cache would be freed as there is nothing left to cache.
sync flushes changed blocks to disk, but the file systems cache will still hold data cached for reading.
+1
Try /proc/sys/vm/drop_caches
from https://www.kernel.org/doc/Documentation/sysctl/vm.txt --------------------------------------------------------------------- Writing to this will cause the kernel to drop clean caches, dentries and inodes from memory, causing that memory to become free.
To free pagecache:
echo 1 > /proc/sys/vm/drop_caches
To free dentries and inodes:
echo 2 > /proc/sys/vm/drop_caches
To free pagecache, dentries and inodes:
echo 3 > /proc/sys/vm/drop_caches
As this is a non-destructive operation and dirty objects are not freeable, the user should run `sync' first. --------------------------------------------------------------------
That sounds like a more sensible path to try.
-- How long did the whining go on when KDE2 went on KDE3? The only universal constant is change. If a species can not adapt it goes extinct. That's the law of the universe, adapt or die. -- Billie Walsh, May 18 2013
On Wed, 2013-06-12 at 06:56 -0400, Anton Aylward wrote:
Per Jessen said the following on 06/11/2013 08:01 AM:
Try /proc/sys/vm/drop_caches
from https://www.kernel.org/doc/Documentation/sysctl/vm.txt --------------------------------------------------------------------- Writing to this will cause the kernel to drop clean caches, dentries and inodes from memory, causing that memory to become free.
To free pagecache:
echo 1 > /proc/sys/vm/drop_caches
To free dentries and inodes:
echo 2 > /proc/sys/vm/drop_caches
To free pagecache, dentries and inodes:
echo 3 > /proc/sys/vm/drop_caches
As this is a non-destructive operation and dirty objects are not freeable, the user should run `sync' first. --------------------------------------------------------------------
That sounds like a more sensible path to try.
It is where my hack that seems to help came from. Thanks Per!

Yours sincerely,
Roger Oberholtzer
On Tue, Jun 11, 2013 at 7:16 AM, Roger Oberholtzer <roger@opq.se> wrote:
Despite being quiet on this, we have not solved the problem. We have:
* Tried other file systems (e.g., ext4)
* Tried faster "server-grade" SATA disks.
* Tried SATA3 interface as well as SATA2.
The same thing happens. Periodically, write calls are blocking for 4-5 seconds instead of the usual 20-30 msecs.
I have seen one unexpected thing: when running xosview during all this, the MEM usage shows the cache use slowly growing. The machine has 32 GB of RAM. The cache use just grows and grows as the file system is written to. Here is the part I don't get:

* If I close all apps that have a file open on the file system, the cache use remains.
* If I run the 'sync(1)' command, the cache use remains. I would have thought that the cache would be freed as there is nothing left to cache. If not immediately, over a decent amount of time. But this is not the case.
* Only when I unmount the file system does the cache get freed. Immediately.
Why would the cache grow and grow? Since the delay, when it happens, grows and grows, I get the feeling that this file system cache in RAM is slowly getting bigger and bigger, and each time it needs to be flushed, it takes longer and longer. If the cache is being emptied at some reasonable point, why would it continue to grow? Remember that for each mounted file system there is one process writing to a single file. The disk usage remains 100% constant in terms of what is sent to be written.
Is there some policy or setting that controls how the file system deals with file system cache in RAM? More specifically, is there any way to limit its size for a file system?
Is there a way to see how much of the RAM cache for a file system is actually containing data waiting to be flushed?
I have seen some reports that using O_SYNC when opening the file makes the write times more even. I guess I could open() a file with this, and then fdopen() it. fcntl() seems not to support O_SYNC...
Roger,

O_SYNC does not bypass the cache, it just flushes continuously, but it is not the same as drop_caches. You need O_DIRECT to bypass the cache.

If you want a write buffer and not a cache, why don't you just do that? A very basic attempt would be:

- create a named pipe per output file
- dd if=named_pipe of=file oflag=direct bs=64K

In your program, have it create the named_pipe, then launch dd as required. Hopefully when you close the named_pipe dd will see that and write out the last partial block.

When I've actually had to have a dedicated buffer in a real scenario, I used mbuffer instead of dd: http://www.maier-komor.de/mbuffer.html

btw, mbuffer is in the openSUSE distribution. My use was writing to tape and I wanted to queue up a GB of data before I started sending any of it to the tape, so I was able to have just one invocation of mbuffer, but I think it would work for you as well. If mbuffer doesn't currently use the O_DIRECT flag in its open call to the destination file, then you should easily be able to add it or whatever other customizations you need - after all, you have the source!

Greg
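A sketch of Greg's named-pipe idea in C - the fifo path, output path and data sizes are all placeholders, and error handling is kept minimal:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    const char *fifo = "/tmp/jpeg_pipe";        /* placeholder paths */

    if (mkfifo(fifo, 0600) != 0 && errno != EEXIST) {
        perror("mkfifo");
        return 1;
    }

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: dd copies fifo -> file with O_DIRECT, so the page
         * cache never sees the payload. */
        execlp("dd", "dd", "if=/tmp/jpeg_pipe", "of=/data/stream.dat",
               "oflag=direct", "bs=64K", (char *)NULL);
        _exit(127);
    }

    FILE *fp = fopen(fifo, "w");                /* blocks until dd opens its end */
    if (!fp) { perror("fopen"); return 1; }

    char chunk[64 * 1024];
    memset(chunk, 0xAB, sizeof chunk);          /* stand-in for jpeg data */
    for (int i = 0; i < 1024; i++)              /* 64 MB of test data */
        fwrite(chunk, 1, sizeof chunk, fp);

    fclose(fp);                                 /* EOF lets dd drain and exit */
    waitpid(pid, NULL, 0);
    return 0;
}

One caveat: a final partial block combined with O_DIRECT can trip up older dd versions, which is one more argument for mbuffer.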
On Tue, 2013-06-11 at 12:14 -0400, Greg Freemyer wrote:
On Tue, Jun 11, 2013 at 7:16 AM, Roger Oberholtzer <roger@opq.se> wrote:
Despite being quiet on this, we have not solved the problem. We have:
* Tried other file systems (e.g., ext4)
* Tried faster "server-grade" SATA disks.
* Tried SATA3 interface as well as SATA2.
The same thing happens. Periodically, write calls are blocking for 4-5 seconds instead of the usual 20-30 msecs.
I have seen one unexpected thing: when running xosview during all this, the MEM usage shows the cache use slowly growing. The machine has 32 GB of RAM. The cache use just grows and grows as the file system is written to. Here is the part I don't get:

* If I close all apps that have a file open on the file system, the cache use remains.
* If I run the 'sync(1)' command, the cache use remains. I would have thought that the cache would be freed as there is nothing left to cache. If not immediately, over a decent amount of time. But this is not the case.
* Only when I unmount the file system does the cache get freed. Immediately.
Why would the cache grow and grow? Since the delay, when it happens, grows and grows, I get the feeling that this file system cache in RAM is slowly getting bigger and bigger, and each time it needs to be flushed, it takes longer and longer. If the cache is being emptied at some reasonable point, why would it continue to grow? Remember that for each mounted file system there is one process writing to a single file. The disk usage remains 100% constant in terms of what is sent to be written.
Is there some policy or setting that controls how the file system deals with file system cache in RAM? More specifically, is there any way to limit its size for a file system?
Is there a way to see how much of the RAM cache for a file system is actually containing data waiting to be flushed?
I have seen some reports that using O_SYNC when opening the file makes the write times more even. I guess I could open() a file with this, and then fdopen() it. fcntl() seems not to support O_SYNC...
Roger,
O_SYNC does not bypass the cache, it just flushes continuously, but it is not the same as drop_caches. You need O_DIRECT to bypass the cache.
If you want a write buffer and not a cache, why don't you just do that? A very basic attempt would be:
I think everyone is misunderstanding the situation. I am not doing anything with or expecting or manipulating a cache. The cache I see is a totally private thing being done by the OS. The existence of the cache is not the problem. In fact, if there was no cache I would think something was wrong.

The problem I am seeing is that the cache is growing and growing to eat all my memory. In addition, as the cache grows, the periodic writes to disk take longer and longer. 100% reproducible.

To be clear: I do not ask for, manipulate or in any other way influence the cache through any direct action in my application. I am only writing a single file by a single process. This file is growing at 25 MB a second (more or less). The file is opened with fopen, written to, and then closed with fclose(). Files can be big, but never more than 2GB each.

My initial thought was that the file system was doing something that led to the longer write delays. So I asked about XFS, which is the file system we use for this. As I later reported, it seems the issue exists for all block devices (ext4, but not /dev/null as the file).

I understand that the cache is there so I can possibly read data that has been recently written. However, I do not see how the kernel can just grow this cache until my memory is gone. Especially if the bigger cache also results in significant and increasingly longer delays in write completions.

The workaround that seems to correct the situation is to run this:

while [ 1 ]
do
    echo 1 > /proc/sys/vm/drop_caches
    sleep 60
done &

Obviously a brute-force approach that is really only possible on my system as it does not seem to mess up general usage. The rate of 60 seconds is arbitrary. But each time this is run, the cache has grown to almost 3 GB.

I wrote a small app that simulates the problem. I will verify that it really does do so and then can post the C source (very tiny) if anyone wants to see what happens on their system.
- create a named pipe per output file
- dd if=named_pipe of=file oflag=direct bs=64K
This is an interesting approach to getting direct I/O. I will have to file this for future reference.

Yours sincerely,
Roger Oberholtzer
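Roger's tiny simulator was not posted at this point in the thread, but a minimal stand-in for this kind of test - one process appending to one file at roughly 25 MB/s, timing every fwrite to catch the multi-second stalls - could look like the sketch below. The file name, chunk size and pacing are assumptions, and it runs until killed:

#include <stdio.h>
#include <string.h>
#include <time.h>

#define CHUNK (1024 * 1024)   /* 1 MB per write; 25 writes/s ~ 25 MB/s */

static double elapsed_ms(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void)
{
    static char buf[CHUNK];
    memset(buf, 0x5A, sizeof buf);

    FILE *fp = fopen("testfile.dat", "w");    /* placeholder path */
    if (!fp) { perror("fopen"); return 1; }

    for (long i = 0; ; i++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        fwrite(buf, 1, sizeof buf, fp);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ms = elapsed_ms(t0, t1);
        if (ms > 500.0)                       /* flag the outliers */
            fprintf(stderr, "write %ld took %.0f ms\n", i, ms);

        struct timespec pace = { 0, 40 * 1000 * 1000 };  /* ~25 Hz */
        nanosleep(&pace, NULL);
    }
}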
Roger Oberholtzer wrote:
On Tue, 2013-06-11 at 12:14 -0400, Greg Freemyer wrote:
I think everyone is misunderstanding the situation. I am not doing anything with or expecting or manipulating a cache. The cache I see is a totally private thing being done by the OS. The existence of the cache is not the problem. In fact, if there was no cache I would think something was wrong.
The problem I am seeing is that the cache is growing and growing to eat all my memory.
It only grows as long as nothing else needs the memory. Using up otherwise unused memory as file systems cache seems quite prudent.
In addition, as the cache grows, the periodic writes to disk take longer and longer. 100% reproducible.
This is more likely the problem.
To be clear: I do not ask for, manipulate or in any other way influence the cache through any direct action in my application. I am only writing a single file by a single process. This file is growing at 25 MB a second (more or less). The file is opened with fopen, written to, and then closed with fclose(). Files can be big, but never more than 2GB each.
If this is a systemic problem, I ought to be able to reproduce it, so yesterday I wrote a little test program to do exactly that. I never saw your 4-5 second delay, but I did see the IO-rate dropping to about half a couple of times. Not regularly though. Much smaller system, single core, 1.5GB RAM, one single harddrive.
I understand that the cache is there so I can possibly read data that has been recently written. However, I do not see how the kernel can just grow this cache until my memory is gone.
If something else needs the memory, the kernel will invalidate the cache and give the memory away.
Especially if the bigger cache also results in significant and increasingly longer delays in write completions.
Yes, that is the odd thing. Especially your 4-5 second delays.
The workaround that seems to correct the situation is to run this:
while [ 1 ]
do
    echo 1 > /proc/sys/vm/drop_caches
    sleep 60
done &
It would seem to substantiate your point about the cache being the issue. How about if you only tried limiting the cache size? (just as another option). If the file systems cache was kept to e.g. 1GB. -- Per Jessen, Zürich (14.6°C)
On Wed, 2013-06-12 at 08:24 +0200, Per Jessen wrote:
It only grows as long as nothing else needs the memory. Using up otherwise unused memory as file systems cache seems quite prudent.
Would that this were the case. The memory use increases until processes start to be killed. Which is the standard way Linux deals with memory shortages. And it begs the question, why should any file system think it is ok to cache, say 16GB? At least I would not expect this as the default. I have checked and the behavior is the same on openSUSE 12.3 as I see on 12.1.
In addition, as the cache grows, the periodic writes to disk take longer and longer. 100% reproducible.
This is more likely the problem.
It is what first caught my attention!
If this is a systemic problem, I ought to be able to reproduce it, so yesterday I wrote a little test program to do exactly that. I never saw your 4-5 second delay, but I did see the IO-rate dropping to about half a couple of times. Not regularly though. Much smaller system, single core, 1.5GB RAM, one single harddrive.
In fact my single process has two files open, each on a separate disk. Maybe that is part of the dynamic. If you let your test app run, and the file it creates grow and grow, how does the cache usage progress? The test app needs to open a new file when the previous one is bigger than the file system file size limit. And just keep doing this.
And the cache usage will grow and grow...
I understand that the cache is there so I can possibly read data that has been recently written. However, I do not see how the kernel can just grow this cache until my memory is gone.
If something else needs the memory, the kernel will invalidate the cache and give the memory away.
I suspect the memory has been invalidated. But it has not been given away. I base that assumption on the fact that my workaround frees the cache immediately. Whether the cache size stops at some reasonable point or not is perhaps a side issue. The question remains: why, as the cache grows, do write calls periodically have longer and longer delays (in the magnitude of seconds)? If the cache is not causing this, then why does freeing it with the workaround result in these delays not happening? -- Roger Oberholtzer
On Wed, 2013-06-12 at 09:37 +0200, Roger Oberholtzer wrote:

I don't want to say it is solved. But with the following I can at least make the system work as intended. The workaround that seems to correct the situation is to run this:

while [ 1 ]
do
    echo 1 > /proc/sys/vm/drop_caches
    sleep 60
done &

Obviously a brute-force approach that is really only possible on my system as it does not seem to mess up general usage. The rate of 60 seconds is arbitrary. But each time this is run, the cache has grown to almost 3 GB.

I have now asked the question on the XFS mailing list, and if I hear anything interesting, and if anyone here is interested, I will inform here. Thanks to all for all the help. Great list here!

-- Roger Oberholtzer
On 2013-06-12 11:04, Roger Oberholtzer wrote:
I have now asked the question on the XFS mailing list, and if I hear anything interesting, and if anyone here is interested, I will inform here.
Yes, please :-) -- Cheers / Saludos, Carlos E. R. (from 12.3 x86_64 "Dartmouth" at Telcontar)
On 2013-06-12 11:04, Roger Oberholtzer wrote:
Obviously a brute-force approach that is really only possible on my system as it does not seem to mess up general usage. The rate of 60 seconds is arbitrary. But each time this is run, the cache has grown to almost 3 GB.
As you generate 25E6 bytes (roughly) per second, after 60" you have generated 1.5GB (not 1.5GiB). The delay you had was 5". In that time the disk system can write 0.5GB at 100MB/s - so that delay is not flushing the entire cache. Not enough time at all. -- Cheers / Saludos, Carlos E. R. (from 12.3 x86_64 "Dartmouth" at Telcontar)
On Wed, 2013-06-12 at 13:02 +0200, Carlos E. R. wrote:
On 2013-06-12 11:04, Roger Oberholtzer wrote:
Obviously a brute-force approach that is really only possible on my system as it does not seem to mess up general usage. The rate of 60 seconds is arbitrary. But each time this is run, the cache has grown to almost 3 GB.
As you generate 25E6 bytes (roughly) per second, after 60" you have generated 1.5GB (not 1.5GiB). The delay you had was 5". In that time the disk system can write 0.5GB at 100MB/s - so that delay is not flushing the entire cache. Not enough time at all.
I know. It is like the cache grows, and a small part is typically flushed. And then, occasionally, a bit more gets flushed, which takes more time. I don't really have a problem with how it flushes things, as long as two things could be met:

* the cache does not grow endlessly to full RAM
* the flush, when it happens, does not take too long

A suggestion on the XFS list is that I may need to tweak some variables, as suggested by Lars Müller and others earlier in the thread. I just have not figured them out enough to know what to change. I am working on that...

Yours sincerely,
Roger Oberholtzer
On 2013-06-12 16:10, Roger Oberholtzer wrote:
On Wed, 2013-06-12 at 13:02 +0200, Carlos E. R. wrote:
As you generate 25E6 bytes (roughly) per second, after 60" you have generated 1.5GB (not 1.5GiB). The delay you had was 5". In that time the disk system can write 0.5GB at 100MB/s - so that delay is not flushing the entire cache. Not enough time at all.
I know. It is like the cache grows, and a small part is typically flushed. And then, occasionally, a bit more gets flushed, which takes more time. I don't really have a problem with how it flushes things, as long as two things could be met:
I don't think it is that. It could be that the filesystem metadata is not written, perhaps the journal. It makes no sense not to write all the data as it gets it. Maybe the metadata is not written till, say, 5 seconds of no activity, or till it can not be delayed longer.
* the cache does not grow endlessly to full RAM
But this is what the Linux cache does by design :-)
* the flush, when it happens, does not take too long.
Normal Linux is not a Real Time Operating System by default, unless you change some options in the kernel and add some utilities. Maybe what you need is that.
A suggestion on the XFS list is that I may need to tweak some variables, as suggested by Lars Müller and others earlier in the thread. I just have not figured them out to know what to change. I am working on that...
Ok... -- Cheers / Saludos, Carlos E. R. (from 12.3 x86_64 "Dartmouth" at Telcontar)
Carlos E. R. wrote:
I don't think it is that. It could be that the filesystem metadata is not written, perhaps the journal. It makes no sense not to write all the data as it gets it.
Maybe the metadata is not written till, say, 5 seconds of no activity, or till it can not be delayed longer.
They write the journal whenever it gets something to write -- the XFS journal only journals metadata -- and they can't change the metadata on disk until the journal has synced with the disk, or they risk fs corruption. A data stream to disk -- one continuous stream -- waits /proc/sys/fs/xfs/filestream_centisecs that long (on my system, it's 3000 of 'em, so 30 seconds). The file system should be synced every 30 seconds as well, from xfssyncd_centisecs in the same dir.
* the cache does not grow endlessly to full RAM
---- If you don't want to use the cache then don't. It is doing what you told it to do -- you could write O_DIRECT and the cache wouldn't grow due to your program at all. But if you keep writing to the cache, it will mark those buffers as containing **COPIES** of what is on disk -- so if something asks for them it can return the info from memory. But 30 seconds is the longest it should normally wait to write out data. It can be set higher -- if you are on a laptop, 60-120 seconds isn't bad... be sure to have a full charge on your battery though.
But this is what the Linux cache does by design :-)
* the flush, when it happens, does not take too long.
--- It takes -- he said less than 120ms/call. It's not flushing to disk every half hour.
Normal Linux is not a Real Time Operating System by default, unless you change some options in the kernel and add some utilities. Maybe what you need is that.
---- It sounded like he was on a not too fast computer that might bog down if it had to scan 10G all at once -- might even take 5 seconds!
A suggestion on the XFS list is that I may need to tweak some variables, as suggested by Lars Müller and others earlier in the thread. I just have not figured them out to know what to change. I am working on that...
--- Oi... It's NOT going to be changed by tweaking XFS vars. XFS isn't what is holding it in memory. It's the kernel. Linux saves up the scanning for more memory and does it all at once, as it is more efficient that way. You can play with values in /proc/sys -- but it will all be a kludge, as what you really need is to avoid the cache altogether.
On 2013-06-13 06:28, Linda Walsh wrote:
Carlos E. R. wrote:
Normal Linux is not a Real Time Operating System by default, unless you change some options in the kernel and add some utilities. Maybe what you need is that. ---- It sounded like he was on a not too fast computer that might bog down if it had to scan 10G all at once -- might even take 5 seconds!
From what I remember of what he said, it is a heavy-weight cruncher: 16 cores and lots of RAM, and I don't remember what more. -- Cheers / Saludos, Carlos E. R. (from 12.3 x86_64 "Dartmouth" at Telcontar)
On Wed, 2013-06-12 at 21:28 -0700, Linda Walsh wrote:
It sounded like he was on a not too fast computer that might bog down if it had to scan 10G all at once -- might even take 5 seconds!
In fact it is on an extremely fast 16 core with oodles of RAM. The processor is not taxed by all the activity.
Oi...It's NOT going to be changed by tweaking XFS vars. XFS isn't what is holding it in memory. It's the kernel.
The suggestion was from the XFS list. But the variables to tweak are VM.
Linux is saving up scanning for more memory for all at once, as it is more efficient that way.
You can play with values in /proc/sys -- but it will all be a kludge, as what you really need is to avoid the cache altogether.
Or just find a way to keep it from thinking that all that tasty RAM is available for its evil purposes :) The cache system needs to curb RAM intake. Go on a diet.

Yours sincerely,
Roger Oberholtzer
On 2013-06-13 08:07, Roger Oberholtzer wrote:
On Wed, 2013-06-12 at 21:28 -0700, Linda Walsh wrote:
It sounded like he was on a not too fast computer that might bog down if it had to scan 10G all at once -- might even take 5 seconds!
In fact it is on an extremely fast 16 core with oodles of RAM. The processor is not taxed by all the activity.
The processor is not the only thing to be taxed - memory i/o can be, for instance ;-) -- Cheers / Saludos, Carlos E. R. (from 12.3 x86_64 "Dartmouth" at Telcontar)
On 2013-06-12 09:37, Roger Oberholtzer wrote:
On Wed, 2013-06-12 at 08:24 +0200, Per Jessen wrote:
It only grows as long as nothing else needs the memory. Using up otherwise unused memory as file systems cache seems quite prudent.
Would that this were the case. The memory use increases until processes start to be killed. Which is the standard way Linux deals with memory shortages.

No, never. I have never seen that.

What I have seen is that if a process requests more for itself, and there is not enough memory, it is taken from the system cache, which reduces in size till almost nil. Then, as the process demands more memory, some processes get killed because there is no free and no cache memory to take from. Not the other way round.
Whether the cache size stops at some reasonable point or not is perhaps a side issue. The question remains: why, as the cache grows, do write calls periodically have longer and longer delays (in the magnitude of seconds)? If the cache is not causing this, then why does freeing it with the workaround result in these delays not happening?
This may be related or coincidental. -- Cheers / Saludos, Carlos E. R. (from 12.3 x86_64 "Dartmouth" at Telcontar)
On Wed, 2013-06-12 at 11:24 +0200, Carlos E. R. wrote:
On 2013-06-12 09:37, Roger Oberholtzer wrote:
On Wed, 2013-06-12 at 08:24 +0200, Per Jessen wrote:
It only grows as long as nothing else needs the memory. Using up otherwise unused memory as file systems cache seems quite prudent.
Would that this were the case. The memory use increases until processes start to be killed. Which is the standard way Linux deals with memory shortages.
No, never. I have never seen that.
What I have seen is that if a process requests more for itself, and there is not enough memory, it is taken from the system cache, which reduces in size till almost nil. Then, as the process demands more memory, some processes get killed because there is no free and no cache memory to take from. Not the other way round.
This could be why the app goes away. But the lack of memory that causes it is this damnable page cache for the file system...
Whether the cache size stops at some reasonable point or not is perhaps a side issue. The question remains: why, as the cache grows, do write calls periodically have longer and longer delays (in the magnitude of seconds)? If the cache is not causing this, then why does freeing it with the workaround result in these delays not happening?
This may be related or coincidental.
But predictable and repeatable.
Yours sincerely,
Roger Oberholtzer
On Wed, 12 Jun 2013 12:21:14 +0200, Roger Oberholtzer <roger@opq.se> wrote:
This could be why the app goes away. But the lack of memory that causes it is this damnable page cache for the file system...
No. If this were the case, it would be a bug. Cache does not cause lack of memory.
Whether the cache size stops at some reasonable point or not is perhaps a side issue. The question remains: why, as the cache grows, do write calls periodically have longer and longer delays (in the magnitude of seconds)?
It was not what you said in your original post. You said "every once in a while the write takes 100x longer". This is very different from what you say now.
If the cache is not causing this,
then why does freeing it with the workaround result in these delays not happening?
Did you EVER try to tune dirty_* VM parameters?
Roger Oberholtzer wrote:
On Wed, 2013-06-12 at 08:24 +0200, Per Jessen wrote:
It only grows as long as nothing else needs the memory. Using up otherwise unused memory as file systems cache seems quite prudent.
Would that this were the case. The memory use increases until processes start to be killed. Which is the standard way Linux deals with memory shortages.
Right, but the memory used for filesystem caching is still available for processes to use.
And it begs the question, why should any file system think it is ok to cache, say 16GB? At least I would not expect this as the default.
In principle it seems okay, I would say, but if the size causes a problem, there ought to be a way of limiting it.
If this is a systemic problem, I ought to be able to reproduce it, so yesterday I wrote a little test program to do exactly that. I never saw your 4-5 second delay, but I did see the IO-rate dropping to about half a couple of times. Not regularly though. Much smaller system, single core, 1.5GB RAM, one single harddrive.
In fact my single process has two files open, each on a separate disk. Maybe that is part of the dynamic.
Certainly possible. I might try that too.
If you let your test app run, and the file it creates grow and grow, how does the cache usage progress?
It stayed at about 1G.
The test app needs to open a new file when the previous one is bigger than the file system file size limit. And just keep doing this.
Yes, that's what my test does:

open file#0
write 2048x1M blocks
close file#0
open file#1
write 2048x1M blocks
close file#1
etc.

I presume your test with two files would look like this:

open file#0
open file#1
do 2048 times
    write 1M block to file#0
    write 1M block to file#1
done
close file#0
close file#1
etc.
And the cache usage will grow and grow...
But my testbox is quite limited in memory, so it doesn't.
I understand that the cache is there so I can possibly read data that has been recently written. However, I do not see how the kernel can just grow this cache until my memory is gone.
If something else needs the memory, the kernel will invalidate the cache and give the memory away.
I suspect the memory has been invalidated. But it has not been given away. I base that assumption on the fact that my workaround frees the cache immediately.
If no other process needs it, there's no reason to give it away.
Whether the cache size stops at some reasonable point or not is perhaps a side issue. The question remains: why, as the cache grows, do write calls periodically have longer and longer delays (in the magnitude of seconds)? If the cache is not causing this, then why does freeing it with the workaround result in these delays not happening?
Good question. There seems to be a direct link between the cache size and the delay. That's why I suggested you write a little program to allocate most of the memory such that the cache is kept small.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int a;
    char *m;
    long sz;

    if (argc < 2)
        return 1;
    sz = atol(argv[1]);                  /* size to grab, in MB */
    m = malloc(sz * 1024 * 1024);
    if (m == NULL)
        return 1;
    a = 1;
    while (a++) {
        /* touch every page so the memory stays resident */
        memset(m, a, sz * 1024 * 1024);
        sleep(5);
    }
    return 0;
}

-- Per Jessen, Zürich (21.4°C)
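For completeness, the write pattern Per outlines above - one 2 GB file after another - could look roughly like this (a sketch; the file name pattern is a placeholder, and it runs until killed):

#include <stdio.h>
#include <string.h>

#define BLOCK (1024 * 1024)

int main(void)
{
    static char buf[BLOCK];
    memset(buf, 0, sizeof buf);

    for (int file = 0; ; file++) {           /* one 2 GB file after another */
        char name[64];
        snprintf(name, sizeof name, "file%04d.dat", file);

        FILE *fp = fopen(name, "w");
        if (!fp) { perror(name); return 1; }
        for (int blk = 0; blk < 2048; blk++) /* 2048 x 1 MB = 2 GB */
            fwrite(buf, 1, sizeof buf, fp);
        fclose(fp);
    }
}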
On Wed, 2013-06-12 at 14:06 +0200, Per Jessen wrote:
Roger Oberholtzer wrote:
If you let your test app run, and the file it creates grow and grow, how does the cache usage progress?
It stayed at about 1G.
Here is my test app, which tries to capture the spirit of what the real app is doing. It demonstrates the cache growing issue on both 12.1 and 12.3.

cc diskio.c -o diskio

It makes big files in the directory from which it is run. It runs until you kill it. The trace statements are periodic listings of the data rate, max and most recent write times. aka, what was interesting to me. I used xosview to monitor the memory cache. Deleting the files it makes frees the cache. Of course, in my real world use, deleting them is not an option...

In the real app, the write calls are whatever the compression library does. We used more than one, and each does this differently: one does lots of little fwrites, and one does big chunks at a time. As they are opened as FILE and thus buffered, I do not think the specific fwrites make a difference. It is the data rate.

Yours sincerely,
Roger Oberholtzer
Roger Oberholtzer wrote:
On Wed, 2013-06-12 at 14:06 +0200, Per Jessen wrote:
Roger Oberholtzer wrote:
If you let your test app run, and the file it creates grow and grow, how does the cache usage progress?
It stayed at about 1G.
Here is my test app, which tries to capture the spirit of what the real app is doing. It demonstrates the cache growing issue on both 12.1 and 12.3.
That is not the real issue though - the file system cache will grow to max available on any system, without causing a problem. I think your problem is described in the link that Andrey Borzenkov sent. Your app needs a sustained, high IO-rate, and that is not quite what the standard Linux settings are set for.
In the real app, the write calls are whatever the compression library does. We used more than one, and each does this differently: one does lots of little fwrites, and one does big chunks at a time. As they are opened as FILE and thus buffered, I do not think the specific fwrites make a difference. It is the data rate.
I agree. See Andrey's posting. -- Per Jessen, Zürich (25.9°C)
On Wed, 2013-06-12 at 20:24 +0200, Per Jessen wrote:
Roger Oberholtzer wrote:
On Wed, 2013-06-12 at 14:06 +0200, Per Jessen wrote:
Roger Oberholtzer wrote:
If you let your test app run, and the file it creates grow and grow, how does the cache usage progress?
It stayed at about 1G.
Here is my test app, which tries to capture the spirit of what the real app is doing. It demonstrates the cache growing issue on both 12.1 and 12.3.
That is not the real issue though - the file system cache will grow to max available on any system, without causing a problem. I think your problem is described in the link that Andrey Borzenkov sent. Your app needs a sustained, high IO-rate, and that is not quite what the standard Linux settings are set for.
Apparently. The surprising thing is that the app has many other files open as well. None are as big as these. But they are not so small either. There is some trigger point that seems to make this happen.

Yours sincerely,
Roger Oberholtzer
On Tue, 11 Jun 2013 13:16:59 +0200, Roger Oberholtzer <roger@opq.se> wrote:
Despite being quiet on this, we have not solved the problem. We have:
* Tried other file systems (e.g., ext4)
* Tried faster "server-grade" SATA disks.
* Tried SATA3 interface as well as SATA2.
The same thing happens. Periodically, write calls are blocking for 4-5 seconds instead of the usual 20-30 msecs.
I have seen one unexpected thing: when running xosview during all this, the MEM usage shows the cache use slowly growing. The machine has 32 GB of RAM. The cache use just grows and grows as the file system is written to. Here is the part I don't get:

* If I close all apps that have a file open on the file system, the cache use remains.
* If I run the 'sync(1)' command, the cache use remains. I would have thought that the cache would be freed as there is nothing left to cache. If not immediately, over a decent amount of time. But this is not the case.
* Only when I unmount the file system does the cache get freed. Immediately.
Why would the cache grow and grow?
Because unused memory is wasted memory. It is better to use it as cache than to not use it at all. Data in cache has low priority and RAM consumed by filesystem cache can be considered "free" for all practical purposes.
Since the delay, when it happens, grows and grows, I get the feeling that this file system cache in RAM is slowly getting bigger and bigger, and each time it needs to be flushed, it takes longer and longer.
This is probably a misinterpretation. What more likely happens is: your program writes to memory - very fast - until the dirty memory threshold kicks in, at which point the system forces writeback to disk.
If the cache is being emptied at some reasonable point, why would it continue to grow? Remember that for each mounted file system there is one process writing to a single file. The disk usage remains 100% constant in terms of what is sent to be written.
It has nothing really to do with cache growing. When you write to a file, data is going to memory cache. If your program writes very fast, faster than data can be written to disk in the background, at some point your program will be suspended until there is enough space.
Is there some policy or setting that controls how the file system deals with file system cache in RAM? More specifically, is there any way to limit it's size for a file system?
Not really. You can try to lower /proc/sys/vm/dirty_background_ratio; it should make the kernel start writeback earlier. But in the end, if you generate data faster than it can be written to disk, you hit the same issue, only later. Or use O_DIRECT as already suggested. Solaris throttles programs writing to UFS files when they are "too fast" ...
Roger Oberholtzer wrote:
On Tue, 2013-06-04 at 11:26 -0700, Linda Walsh wrote: Any buffers are filled up very quickly when you write 25 MB/Sec. So something happening after 30 minutes is probably not related to general buffering.
This is where I think you'll benefit most by using O_DIRECT in your open call and either a 2nd thread to handle I/O, or async I/O
By writing to /dev/null instead of a disk you are avoiding the kernel's disk-io routines as well as the xfs-io routines. I.e. it could be either one. That's why trying O_DIRECT can help eliminate the kernel's I/O buffering routines.
I do not think the kernel is the culprit. I think it is the hard disk itself. We will be trying some higher performance disks.
It sounds like you are seeing the overhead of the buffering in the kernel. If you don't like O_DIRECT, you could try giving the kernel hints with the POSIX calls. I.e., calling posix_fadvise(tfd, 0, 0, POSIX_FADV_DONTNEED); on each file descriptor you are writing might help?
On Wed, Jun 12, 2013 at 4:30 PM, Linda Walsh <suse@tlinx.org> wrote:
If you don't like O_DIRECT, you could try giving the kernel hints with the POSIX calls I.e. calling posix_fadvise(tfd, 0, 0, POSIX_FADV_DONTNEED); on each file descriptor you are writing might help?
Learn something new every day and this is clearly the "right" thing to do.

Roger, the man page says you should call posix_fadvise() after the pages are written to disk.

==
POSIX_FADV_DONTNEED attempts to free cached pages associated with the specified region. This is useful, for example, while streaming large files. A program may periodically request the kernel to free cached data that has already been used, so that more useful cached pages are not discarded instead. Pages that have not yet been written out will be unaffected, so if the application wishes to guarantee that pages will be released, it should call fsync(2) or fdatasync(2) first.
==

Not sure how your dataflow works, but something like this might do the trick:

open()
write()
fsync()
posix_fadvise(fd, 0, len, POSIX_FADV_DONTNEED)
close()

Greg
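Fleshed out a little, that sequence might look like the sketch below. The file name, chunk size and sync interval are assumptions, not anything prescribed by the man page:

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("stream.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    static char buf[1024 * 1024];
    memset(buf, 0x5A, sizeof buf);
    off_t done = 0;

    for (int i = 0; i < 256; i++) {
        if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf) {
            perror("write");
            return 1;
        }
        done += sizeof buf;
        /* Every 64 MB: make the pages clean, then tell the kernel we
         * will not read them back, so it can drop them from the cache. */
        if (done % (64 * 1024 * 1024) == 0) {
            fdatasync(fd);
            posix_fadvise(fd, 0, done, POSIX_FADV_DONTNEED);
        }
    }
    close(fd);
    return 0;
}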
Greg Freemyer wrote:
open()
write()
fsync()
posix_fadvise(fd, 0, len, POSIX_FADV_DONTNEED)
close()
FWIW -- the calls are *advisory*, meaning the kernel may still not do what you want, versus with O_DIRECT you are avoiding the memory buffer and its associated "garbage collection" altogether. So if it works for you, yeay!
On Wed, Jun 12, 2013 at 6:38 PM, Linda Walsh <suse@tlinx.org> wrote:
Greg Freemyer wrote:
open()
write()
fsync()
posix_fadvise(fd, 0, len, POSIX_FADV_DONTNEED)
close()
---- FWIW -- the calls are *advisory*, meaning the kernel may still not do what you want, versus with O_DIRECT you are avoiding the memory buffer and its associated "garbage collection" altogether. So if it works for you, yeay!
Asking if it works seems like an easy question for the xfs list, but it seems like it should be filesystem independent to me. Greg
Greg Freemyer wrote:
On Wed, Jun 12, 2013 at 6:38 PM, Linda Walsh <suse@tlinx.org> wrote:
Greg Freemyer wrote:
open()
write()
fsync()
posix_fadvise(fd, 0, len, POSIX_FADV_DONTNEED)
close()
FWIW -- the calls are *advisory*, meaning the kernel may still not do what you want, versus with O_DIRECT you are avoiding the memory buffer and its associated "garbage collection" altogether. So if it works for you, yeay!
Asking if it works seems like an easy question for the xfs list, but it seems like it should be filesystem independent to me.
It will likely work, but I used that call to fix xfs_fsr to not eat my memory, and submitted the patch. The bit where it reads files from the old position and writes to the new -- it used kernel buffered reads, and I would notice my buffers getting thrashed. So I tried that and it worked great. The xfs people didn't really want to add a posix call, as there was no reason to write through the kernel in the first place, so later one of them came out with a similar patch that used O_DIRECT for the file reading as well as the file writing (which, due to how xfs_fsr reorganizes the disk, was already O_DIRECT). So their preference was to use O_DIRECT, which is probably better for this use case -- known, fixed-size writes occurring at regular intervals. If the write call took any real time to perform (which would likely be negligible), it could be done in a separate thread in background (roll-your-own AIO -- or it could be done with Linux AIO calls, but I doubt either would be necessary for this usage). Since there's no reason to buffer the data in RAM, there doesn't seem to be much benefit to writing through the kernel file-buffers. Anyway, in their use case, O_DIRECT was a more efficient choice. I was just tossing out other options cuz it sounded like he didn't like O_DIRECT. Meh. ;-)
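Since O_DIRECT keeps coming up: its main practical hurdle is that the buffer address, transfer size and file offset all have to be suitably aligned. A minimal sketch - the alignment and sizes are conservative assumptions, and the file name is a placeholder:

#define _GNU_SOURCE              /* for O_DIRECT */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define ALIGN 4096               /* safe for most current disks */
#define CHUNK (1024 * 1024)      /* must stay a multiple of ALIGN */

int main(void)
{
    void *buf;
    if (posix_memalign(&buf, ALIGN, CHUNK)) {
        perror("posix_memalign");
        return 1;
    }
    memset(buf, 0x5A, CHUNK);

    int fd = open("stream.dat", O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (int i = 0; i < 256; i++)            /* 256 MB; the cache stays flat */
        if (write(fd, buf, CHUNK) != CHUNK) {
            perror("write");
            return 1;
        }

    close(fd);
    free(buf);
    return 0;
}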
Roger Oberholtzer wrote:
I am using XFS on a 12.1 system. The system records jpeg data to large files in real time. We have used XFS for this for a while since it has as a listed feature that it is well suited to writing streaming media data. We have used this for quite a while on openSUSE 11.2.
We have developed a new version of this system that collects more data. What I have found is that the jpeg data is typically written at the speed I expect. Every once in a while, the write takes 100x longer. Instead of the expected 80 msecs or so to do the compress and write, it takes, say, 4 or 5 seconds.
This may not be of much use to you, but we have a cctv system that also writes individual jpegs. They're only 640x480, but from a quick glance at the timing, writing one jpeg takes an average of 0.2ms (elapsed). Very occasional jumps to maybe 20ms. The recording system is running 12.2 on a 2.8GHz Celeron with 1.5GB RAM. I guess your jpegs are much bigger - otherwise it's a pretty significant difference. (oh, and we're using JFS, not XFS). -- Per Jessen, Zürich (11.7°C)
On Tue, 2013-06-04 at 11:11 +0200, Per Jessen wrote:
Roger Oberholtzer wrote:
I am using XFS on a 12.1 system. The system records jpeg data to large files in real time. We have used XFS for this for a while since it has as a listed feature that it is well suited to writing streaming media data. We have used this for quite a while on openSUSE 11.2.
We have developed a new version of this system that collects more data. What I have found is that the jpeg data is typically written at the speed I expect. Every once in a while, the write takes 100x longer. Instead of the expected 80 msecs or so to do the compress and write, it takes, say, 4 or 5 seconds.
This may not be of much use to you, but we have a cctv system that also writes individual jpegs. They're only 640x480, but from a quick glance at the timing, writing one jpeg takes an average of 0.2ms (elapsed). Very occasional jumps to maybe 20ms. The recording system is running 12.2 on a 2.8GHz Celeron with 1.5GB RAM. I guess your jpegs are much bigger - otherwise it's a pretty significant difference. (oh, and we're using JFS, not XFS).
The disk system seems to maintain around 40-50 MB/sec sustained writes. Our images should be utilizing around 50% of that. I am trying this on some SSD disks with good old ext4. Just to have a point of comparison.

Yours sincerely,
Roger Oberholtzer
On Mon, 03 Jun 2013 11:14:07 +0200, Roger Oberholtzer <roger@opq.se> wrote:
I am using XFS on a 12.1 system. The system records jpeg data to large files in real time. We have used XFS for this for a while since it has as a listed feature that it is well suited to writing streaming media data. We have used this for quite a while on openSUSE 11.2.
We have developed a new version of this system that collects more data. What I have found is that the jpeg data is typically written at the speed I expect. Every once in a while, the write takes 100x longer. Instead of the expected 80 msecs or so to do the compress and write, it takes, say, 4 or 5 seconds. I have looked in all the usual suspect places, and nothing seems to point at anything. For one test, I wrote to /dev/null instead of the real file. The delays do not happen. They do seem to be related to actually writing to the physical disk.
I expect some delay occasionally when disks are physically flushed. There is buffering in our application to allow this. But 5 seconds is simply wrong.
BTW this may explain one possible cause: http://serverfault.com/questions/126413/limit-linux-background-flush-dirty-p...
Andrey Borzenkov wrote:
On Mon, 03 Jun 2013 11:14:07 +0200, Roger Oberholtzer <roger@opq.se> wrote:
I am using XFS on a 12.1 system. The system records jpeg data to large files in real time. We have used XFS for this for a while since it has as a listed feature that it is well suited to writing streaming media data. We have used this for quite a while on openSUSE 11.2.
We have developed a new version of this system that collects more data. What I have found is that the jpeg data is typically written at the speed I expect. Every once in a while, the write takes 100x longer. Instead of the expected 80 msecs or so to do the compress and write, it takes, say, 4 or 5 seconds. I have looked in all the usual suspect places, and nothing seems to point at anything. For one test, I wrote to /dev/null instead of the real file. The delays do not happen. They do seem to be related to actually writing to the physical disk.
I expect some delay occasionally when disks are physically flushed. There is buffering in our application to allow this. But 5 seconds is simply wrong.
BTW this may explain one possible cause:
http://serverfault.com/questions/126413/limit-linux-background-flush-dirty-p...
Very interesting read; I think Roger might well find some help there. -- Per Jessen, Zürich (23.9°C)
Per Jessen wrote:
BTW this may explain one possible cause:
http://serverfault.com/questions/126413/limit-linux-background-flush-dirty-p...
Very interesting read, I think Roger might well find some help there.
I don't think his system is writing the buffers out only once per 30 minutes. It sounds more like the buffers getting full with data he doesn't need anymore, but the kernel does its normal memory trim, trying to keep as much of it in memory as it can. It's probably walking some memory structure of the file cache looking for things to empty -- because before it can give the buffers to you it likely has to zero them. What mode are you opening the files in, BTW? Write-only or RW?
On Wed, 2013-06-12 at 14:13 -0700, Linda Walsh wrote:
Per Jessen wrote:
BTW this may explain one possible cause:
http://serverfault.com/questions/126413/limit-linux-background-flush-dirty-p...
Very interesting read, I think Roger might well find some help there.
I don't think his system is writing the buffers out only once per 30 minutes. It sounds more like the buffers getting full with data he doesn't need anymore, but the kernel does its normal memory trim, trying to keep as much of it in memory as it can.
It's probably walking some memory structure of the file cache looking for things to empty -- because before it can give the buffers to you it likely has to zero them. What mode are you opening the files in, BTW? Write-only or RW?
fopen with "w+". The file is expected not to exist.

Yours sincerely,
Roger Oberholtzer
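Since the stream here is written once and never read back, a plain write-only "w" open would express the intent; "w+" additionally requests read access that this workload never uses. A minimal sketch, with a placeholder path:

#include <stdio.h>

int main(void)
{
    /* "w": write-only, create or truncate -- no read access requested. */
    FILE *fp = fopen("stream.dat", "w");
    if (!fp) { perror("fopen"); return 1; }

    fputs("jpeg data goes here\n", fp);   /* stand-in payload */
    fclose(fp);
    return 0;
}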
participants (11):
- Andrey Borzenkov
- Anton Aylward
- auxsvr@gmail.com
- Bernhard Voelker
- Carlos E. R.
- Dave Howorth
- Greg Freemyer
- Lars Müller
- Linda Walsh
- Per Jessen
- Roger Oberholtzer