Mailinglist Archive: opensuse (1470 mails)

Re: [opensuse] experiences with bache / logic of caching
  • From: Greg Freemyer <greg.freemyer@xxxxxxxxx>
  • Date: Sun, 14 Feb 2016 14:20:02 -0500
  • Message-id: <>
On Sun, Feb 14, 2016 at 10:54 AM, Anton Aylward
<opensuse@xxxxxxxxxxxxxxxx> wrote:
On 02/13/2016 07:51 PM, Olav Reinert wrote:
Because the cache is small, it only serves to speed up random-access
reads of the active working set, and to speed up most writes (writeback
cache mode).

The net effect is that, even though all data (including user home
directories) is stored on a pair of spinning disks, the cache manages
to hide that fact well enough to make the machine feel very snappy in
use - it's close enough to a real SSD for my needs.

Sorry, I can't give any concrete numbers to back up my experiences.

It's worth considering some aspects of the logic of caching.

What might be an extreme case is one I saw in the last century, discussed
at the North York UNIX Users Group, back in the days when SCO UNIX was
the dominant alternative to Microsoft for SMB and personal use, certainly
if one wanted to support a multi-user environment or run a "proper"
database system.

One SMB manager described the use of a disc accelerator board, which was
really a cache. It did write-back: the system call returned
immediately, and the block was written to disk when convenient.

Now historically, one of the changes from classical V6 UNIX to classical
V7 UNIX back in the 1970s was that the file system changed the order of
the writes to preserve the structural integrity. Data was written to
the disk before the metadata and pointers. I won't go into why that
ensured that FSCK could work properly, google for that yourself. But it
is important that it was done that way. The use of this board could
not guarantee that ordering. A CompSci type from UofT warned of this. A
couple of months later the SMB guy reported a corrupted and unrecoverable
disk. There was an "I told you so" feeling.
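The V7 ordering rule still shows up in userspace today: make the data durable before publishing the pointer to it. A minimal Python sketch of that discipline (the helper name and filenames are illustrative, not from the thread):

```python
import os
import tempfile

def ordered_update(directory: str, name: str, payload: bytes) -> str:
    """Write data durably *before* publishing the metadata that points to it.

    This mimics the V7 rule: data blocks reach the disk before the
    directory entry/pointers do, so a crash leaves either the old state
    or the new state, never a pointer to garbage.
    """
    path = os.path.join(directory, name)
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, payload)
        os.fsync(fd)          # 1. data reaches stable storage first
    finally:
        os.close(fd)
    os.rename(tmp, path)      # 2. only then publish the "pointer"
    dfd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dfd)         # 3. make the directory entry durable too
    finally:
        os.close(dfd)
    return path

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        p = ordered_update(d, "config", b"hello")
        print(open(p, "rb").read())   # b'hello'
```

Reverse steps 1 and 2 (as the accelerator board effectively did) and a crash can leave a published name pointing at unwritten blocks.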

When I was young a dinosaur ate my server, so now I always keep them
in dinosaur proof cages.

I know, you say dinosaurs are extinct, but I say better safe than sorry.

In the modern era (as opposed to the pre-historic era of 40 years
ago), we have huge disk controller caches that work similarly to what
you describe. One solution for at least the last 2 decades is to use
an on-board battery to maintain that disk controller cache.

Around 2000 I was working with always-up (24x7) disk subsystems.
The ones I knew well were made by DEC, (I mean Compaq, (I guess I mean
HP)). The battery could maintain the cache contents for 7 days during
a hardware fault / power outage. They had the battery connected to the
system via a Y-connector. The y-connector was accessible with the
system operational. Thus if the battery died due to old age, you
connected up the new battery before you disconnected the old, ensuring
you didn't have a window of vulnerability during the battery
replacement. Further, if for some reason 7 days wasn't long enough,
you could charge a spare battery offline, then swap batteries and keep
the cache valid indefinitely.

Committing to disk, and committing in the right order is important.

Yes, and in the last 40 years a lot of effort has gone into making
sure the order is right.

Even only 20 years ago, the only status a cache had was clean or dirty.
If it was dirty you had to issue a cache flush command to ensure any
given piece of data was written to disk.

Fortunately, about 15 years ago the ATA command set was enhanced with a
"force unit access" (FUA) write option that allows a specific data
write to not be acknowledged as complete until the data written is on
"stable storage".

If you read a XFS or RAID mailing list you will see they strongly urge
against hardware cache systems that falsely report success for a FUA
write operation. The exception is if the user has a battery backup
system like the above that they trust and therefore the cache itself
is considered stable.
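From userspace on Linux, the everyday analogue is the O_DSYNC open flag: each write() only returns once the data is on stable storage, and on supporting hardware the block layer can satisfy that with FUA writes rather than full cache flushes. A hedged sketch (exact behaviour depends on filesystem and hardware):

```python
import os
import tempfile

def durable_append(path: str, data: bytes) -> None:
    """Append with O_DSYNC: write() blocks until the data (though not
    necessarily all metadata) is durable.  On suitable hardware the
    Linux block layer can implement this with FUA writes."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_DSYNC,
                 0o644)
    try:
        os.write(fd, data)    # returns only once the data is durable
    finally:
        os.close(fd)

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as f:
        target = f.name
    durable_append(target, b"log line 1\n")
    durable_append(target, b"log line 2\n")
    print(open(target, "rb").read().decode(), end="")
    os.unlink(target)
```

The cost is one stable-storage round trip per write, which is exactly why a lying cache that acknowledges early looks so fast, and is so dangerous.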

For my normal activity, I need reliable operation, but not 24x7, so I
can easily accept a hardware cache that requires downtime to swap out
the battery.

The Linux kernel cache definitely allows FUA type writes from the
filesystem level to propagate down the software stack to force a
specific write out to a disk controller.

A write cache that 'accelerates' by returning from the system call while
deferring the actual transfer from the cache to storage simply cannot
guarantee the integrity of the file system in the event of, for example,
a power loss.

Thus the introduction of FUA writes 15 years ago. Hopefully any
modern hardware-based disk caching solution honors FUA write requests.

Certainly an SSD, which will retain content, can, if the
supporting software is there to do the job, ameliorate this. It gets
back to the issue of the window/timing. But there are a few too many
"ifs" in this to inspire confidence in an old cynic like me.

As long as the FUA commands are honored, the filesystem should be
trustworthy even with a large data cache.

Caching reads is also something that should be looked at critically.
All in all, the internal algorithms for memory allocation and mapping
can probably make better use of memory than if it is externalized, such
as the memory on those accelerator boards.
This is true for a number of reasons, but most of all because the OS
knows better what it needs to cache, based on process activity.

Agreed except in the case of a shared external storage subsystem. I
assume you've seen Per's posts about iSCSI systems.

Basically you set up a server with lots of disk and RAM to act as a
dedicated storage server and export iSCSI volumes. Then you have
other machines (typically servers) that import those iSCSI volumes and
mount them just as they would a physical ATA / SCSI disk.

In the case of "virtual memory", more memory means page-out happens
less, and retrieval from the page-out queue is in effect a cache
operation. Late-model *NIX uses memory mapping for just about all open
files, or can. Dynamic linking, along with shared libraries, uses
memory-mapped files with load-on-demand. Such pages may be on the out
queue, but they never need to be re-written.
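That memory-mapped, load-on-demand open is visible from userspace too. A small illustration (not from the thread): map a file read-only, so pages are faulted in only when touched and, being clean, can simply be dropped under memory pressure without ever being written back.

```python
import mmap
import os
import tempfile

# Build a throwaway file to map.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"key=value\n" * 1024)
    path = f.name

fd = os.open(path, os.O_RDONLY)
try:
    # Read-only mapping: pages load on first access (demand paging)
    # and are never dirty, so reclaiming them needs no write-back.
    m = mmap.mmap(fd, 0, prot=mmap.PROT_READ)
    first = m[:10]        # touching this range faults the page in
    m.close()
finally:
    os.close(fd)
os.unlink(path)
print(first)
```

This is the same mechanism the dynamic linker uses for shared libraries: code pages come in on demand and are discarded, never paged out.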


Adding RAM is often the cheapest way to improve performance with a *NIX
system that can make use of it. Again, there are a few "ifs" in that.
This may not be your bottleneck. Profiling is important. You may also
come up against hardware limits. Many PC motherboards have too few
memory slots, or their chipset won't allow the use of larger-capacity
memory modules. But before you spend >$50 on an SSD,
consider spending the same on RAM.

A worthy comment, but RAM is still $7/GB or so and high-speed SSD is
down below $0.50/GB.

I've bought at least 4TB of SSD in the last 6 months. At $7/GB that
would be a $28K expense vs the $2K expense it was.

I'm sure you can find cheaper SSD and more expensive RAM. Speed,
capacity, scale, all factor in.

Data pages, mapped files, are a yes-no-maybe situation. A program might
need to read config files at start-up, but those are never read again
and never written. Caching them is a waste of time, but how do you tell
the cache that?

DPO (Disable Page Out) - yet another low-level ATA / SCSI mechanism
(actually a flag bit used in conjunction with other commands).
It tells the disk that the data is not likely to be re-used, so don't
bother keeping it in cache after you get it to stable storage.

A disk controller cache that sits between the kernel and the stable
storage should honor the DPO flag and not keep that data around after
it is written to disk.
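The same hint exists one level up, for the kernel's own page cache: on Linux, posix_fadvise() with POSIX_FADV_DONTNEED tells the kernel the data won't be reused, much as DPO tells the drive. A sketch (the advice is only a hint; the kernel is free to ignore it):

```python
import os
import tempfile

def read_once(path: str) -> bytes:
    """Read a start-up-config style file, then tell the page cache we
    will not come back for it.  POSIX_FADV_DONTNEED is advisory only."""
    fd = os.open(path, os.O_RDONLY)
    try:
        data = os.read(fd, os.path.getsize(path))
        # Advise over the whole file: offset 0, length 0 means "to EOF".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return data

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"cache=off\n")
        p = f.name
    print(read_once(p))
    os.unlink(p)
```

So the application can, in fact, "tell the cache that" - at least the kernel's cache; whether the hint propagates to a controller cache is another matter.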

The interface to the 'rotating rust' is also a cache in read mode, in
that it reads a whole track at once. File system abstractions don't
always permit this to be useful. It's all very well to have 4K pages and
store files as contiguous blocks, but in reality a file access involves
metadata access as well, and we're talking about a multi-user system
with asynchronous demands. Sometimes I think that the design of file
systems such as ReiserFS, where small files can have the data in the
metadata block, makes sense, especially when dealing with small start-up
config files. Anyway, we are back to the situation where only the
application knows that this "data" file is a "read-once then discard"
start-up config; even the OS doesn't know there is no need to cache the
mapped pages. They don't even need to be put on the out queue; they can
be marked 'free' the moment the file is closed. Oh? Perhaps they are?
But how does the OS communicate this to the SSD bcache or the disk
controller buffer?

I believe the SCSI "SYNCHRONIZE CACHE" commands allow a block range to
be defined (the ATA "FLUSH CACHE" and "FLUSH CACHE EXT" commands flush
the entire write cache).

I further "believe" the DPO feature is actually implemented as a single
bit flag in the read and write commands themselves. So when the kernel
knows data won't be re-read, it can set the DPO bit on those transfers.

I don't know if it does, but I certainly wouldn't be surprised if it did.
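For range-limited flushing, Linux does expose sync_file_range(2) to userspace. It is not wrapped by Python's os module, so this sketch calls it through ctypes (Linux-only; flag values are from the man page, and note the call gives no durability guarantee for metadata or the drive's own cache):

```python
import ctypes
import os
import tempfile

# sync_file_range(2) flags, per the Linux man page.
SYNC_FILE_RANGE_WAIT_BEFORE = 1
SYNC_FILE_RANGE_WRITE = 2
SYNC_FILE_RANGE_WAIT_AFTER = 4

libc = ctypes.CDLL(None, use_errno=True)   # main program symbols incl. libc
libc.sync_file_range.argtypes = [ctypes.c_int, ctypes.c_longlong,
                                 ctypes.c_longlong, ctypes.c_uint]

def flush_range(fd: int, offset: int, nbytes: int) -> None:
    """Ask the kernel to write out just this byte range of the file."""
    flags = (SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE
             | SYNC_FILE_RANGE_WAIT_AFTER)
    if libc.sync_file_range(fd, offset, nbytes, flags) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"x" * 8192)
        f.flush()
        flush_range(f.fileno(), 0, 4096)   # flush only the first 4 KiB
        p = f.name
    os.unlink(p)
```

Whether that range information survives all the way down to the device as anything finer than a flush is up to the lower layers.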

The discipline of the hierarchy of organization, standardization of
interfaces, and separation of function may have led to many good things,
but in the old days of the IBM1040 and the like, where the application
had control all the way down the stack to the disk, a lot of
optimization was possible that now simply isn't, because the detailed
information isn't available from the higher levels all the way down.

You haven't convinced me.

There are low-level ATA/SCSI commands that you seem to be ignoring. If
an app wants to control the caching scheme, it can use the O_DIRECT
open() flag to bypass kernel caching and implement its own.

I'd be surprised if there isn't a way for userspace to trigger FUA /
DPO / FLUSH_CACHE commands.

After all, this is Linux we are talking about.
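O_DIRECT does come with strings attached: the buffer, file offset, and transfer size generally must be aligned to the device's logical block size. A hedged sketch using an anonymous mmap (which is page-aligned) as the buffer; note that some filesystems, e.g. tmpfs, reject O_DIRECT, so the sketch falls back to a buffered write:

```python
import mmap
import os
import tempfile

BLOCK = 4096  # assumption: 4 KiB covers the device's logical block size

def write_direct(path: str, payload: bytes) -> str:
    """Write one block, bypassing the page cache where possible.

    O_DIRECT needs an aligned buffer and an aligned length; an anonymous
    mmap is page-aligned, and the payload is padded out to BLOCK.
    Returns "direct" or "buffered" depending on which path was taken.
    """
    buf = mmap.mmap(-1, BLOCK)            # page-aligned scratch buffer
    buf[:len(payload)] = payload
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
        mode = "direct"
    except OSError:
        # Filesystem (e.g. tmpfs) doesn't support O_DIRECT.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
        mode = "buffered"
    try:
        os.write(fd, buf)
    finally:
        os.close(fd)
        buf.close()
    return mode

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as f:
        p = f.name
    print(write_direct(p, b"hello"))
    os.unlink(p)
```

With the page cache out of the way, the app then owns the caching policy, which is exactly the database-style use case O_DIRECT was added for.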

To unsubscribe, e-mail: opensuse+unsubscribe@xxxxxxxxxxxx
To contact the owner, e-mail: opensuse+owner@xxxxxxxxxxxx
