Re: [opensuse] experiences with bache / logic of caching
On 02/14/2016 02:20 PM, Greg Freemyer wrote:
On Sun, Feb 14, 2016 at 10:54 AM, Anton Aylward
<opensuse@xxxxxxxxxxxxxxxx> wrote:

... [snip] ...
In the modern era (as opposed to the pre-historic era of 40 years
ago), we have huge controller disk caches that work similar to what
you describe. One solution for at least the last 2 decades is to use
an on board battery to maintain that disk controller cache.

I am so glad you decided to follow up further on my note about that.
Yes, one of those vendors of cache boards for SCO UNIX developed a board
with battery backup ... I thought I mentioned that.

Around 2000 I was working with always up (24x365) disk subsystems.
The ones I knew well were made by DEC, (I mean Compaq, (I guess I mean

I know what you mean. I have some equipment in the basement, designed
by DEC, built by Compaq but with a HP label.

The battery could maintain the cache contents for 7 days during
a hardware fault / power outage. They had the battery connected to the
system via a Y-connector. The y-connector was accessible with the
system operational. Thus if the battery died due to old age, you
connected up the new battery before you disconnected the old, ensuring
you didn't have a window of vulnerability during the battery

That's a good mechanism. Was there something that noticed if the battery
died of old age? Was there a means of testing that the mechanism worked?
(How far do you want to regress this?)

Did the mechanism nag, nag, nag endlessly?

'Cos I'm sure I'm not alone when I speak of a set-up like this where the
sysop decided to turn off the alarm, and turn off the alarm .. and
unplug the annoying alarm ... and forgot about it.

What is it they say about idiot proof systems?

It's humans that are the problem :-(

Committing to disk, and committing in the right order is important.

Yes, and in the last 40 years a lot of effort has gone into making
sure the order is right.

Beginning, as I said with the momentous step in the change from the V6
to V7 UNIX and trying hard never to look back.
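
For anyone who wants to see what "committing in the right order" looks
like from an application, here is a minimal sketch in Python of the usual
write, flush, rename, flush-the-directory sequence. It is an illustration,
not anything from the V7 code. The function name and the ".tmp" suffix
are my own assumptions, and it presumes a POSIX system where a directory
can be opened and passed to fsync().

```python
import os


def replace_file_durably(path, data):
    """Replace `path` so that a crash leaves either the old or the new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"  # hypothetical naming scheme for the temporary copy

    # Step 1: write the new contents under a temporary name and flush them.
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)
    finally:
        os.close(fd)

    # Step 2: only after the data has been flushed, switch the name over.
    os.replace(tmp_path, path)

    # Step 3: flush the directory so that the rename itself is recorded.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The point of the ordering is that the name never refers to a half-written
file. Whether the flushes really reach stable storage still depends on
everything below the kernel behaving itself.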

I do wonder, however, just how often mistakes like this are taught in CS
courses. I've noted many times that the #1 and #2 vulnerabilities in
the SANS Top 20 list, SQL Injection and Buffer Overflow, have been
around for more than 20 years. Buffer Overflow, if you recall, was the
root cause of the Morris Worm of 1988 which took down an appreciable
part of the Internet-as-it-then-was. My point here is that when I
interview new intakes of programmers, or even talk with ones who've been
working for my client for some years, even those who are aware of
these tell me their school and college courses never mentioned them.

Even only 20 years ago, the only status of a cache was clean or dirty.
If it was dirty you had to issue a cache flush command to ensure any
given piece of data was written to disk.

Not quite.
The VM mechanisms of which I speak, which grew out of the VM of original
BSD4.1 for the VAX of 1982 era, had to deal with the lack of a few
crucial 'bits' in the VAX VM hardware support. That's one reason for
the different types of ageing/available/"dirty" queues. There are many
internal pointers that give context to a 'dirty'. Only a data page can
be dirty, and there may be a file pointer to a buffer in virtual memory
that is linear, but uses the VM's scatter-gather mechanisms to point to
places in the write-out queue.

Perhaps "on the write-out queue" is more meaningful, more descriptive,
when speaking of VM mechanisms than "dirty".
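
A present-day Linux kernel exposes much the same distinction in
/proc/meminfo, which reports "Dirty" (waiting to be written back) and
"Writeback" (actively being written back) as separate counters. This is
the Linux counterpart, not the BSD4.1 mechanism itself. A minimal
sketch of picking those two numbers out, with a function name of my own
choosing:

```python
def writeback_state(meminfo_text):
    """Return the Dirty and Writeback counters (in kB) from /proc/meminfo text."""
    wanted = {"Dirty", "Writeback"}
    state = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if name in wanted:
            # The value is the first field after the colon, e.g. "132 kB".
            state[name] = int(rest.split()[0])
    return state
```

On a Linux box you would feed it the contents of /proc/meminfo and watch
the two numbers move while something does heavy writing.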

Fortunately, 15 years ago the ATA command set was enhanced to have a
"force unit access" (FUA) command that allowed a specific data write
to not be acknowledged as complete until the data written was on
"stable storage".

And of course every writer of disk controllers unerringly makes use of
that ...
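
For reference, the nearest thing an ordinary application has to asking
for this behaviour is opening the file with O_DSYNC, so that each write()
returns only once the kernel considers the data written through. As far
as I know, whether that turns into a FUA write or into a write followed
by a cache flush depends on the kernel, the filesystem and the drive, so
treat this Python sketch (function name mine) as an assumption about the
top of the stack only:

```python
import os


def write_through(path, record):
    """Append `record`; os.write() returns after the kernel reports it written through."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_DSYNC
    fd = os.open(path, flags, 0o644)
    try:
        return os.write(fd, record)
    finally:
        os.close(fd)
```

Nothing here can tell whether the controller below told the truth, which
is rather the point of the sarcasm above.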

If you read an XFS or RAID mailing list you will see they strongly urge
against hardware cache systems that falsely report success for a FUA
write operation.

Buwee, Sort of. "Strongly urge" seems a bit lily-livered[1].

The exception is if the user has a battery backup
system like the above that they trust and therefore the cache itself
is considered stable.

BAD :-(
The security types have a saying "Trust, but verify".
Unless there is that verification, one that somehow gets past human
failings, I would not trust.

For my normal activity, I need reliable operation, but not 7x365 so I
can easily accept a hardware cache that requires downtime to swap out
the battery.

We have a wide variety of users here. My home system ... I'm not sure I
could justify a hardware cache anyway! But most of my clients have
financial systems or web services where 7x365 *is* required and "Five
Nines" may not be enough, and precautions and checks against human error
or oversight are among the first things considered.

I've read that Wall Street pretty much runs on Linux. They aren't the
only financial hub in the world that uses Linux!

Hopefully any
modern hardware-based disk caching solution honors FUA write requests.


Certainly a SSD, which will retain content, can, if the
supporting software is there to do the job, ameliorate this. It gets
back to the issue of the window/timing. But there are a few too many
"ifs" in this to inspire confidence in an old cynic like me.

As long as the FUA commands are honored, the filesystem should be
trustworthy even with a large data cache.

I agree with your proviso there.
But it is a case of "if".

Adding RAM is often the cheapest way to improve performance with a *NIX
system that can make use of it. Again there are a few "ifs" in that.
This may not be your bottleneck. Profiling is important. You may also
come up against hardware limits. Many PC motherboards have too few
memory slots or their chip set won't allow the use of larger capacity
memory cards. But before you spend >$50 on a SSD
consider spending the same on RAM.

A worthy comment, but RAM is still $7/GB or so and high-speed SSD is
down below $0.50/GB

We've had this argument before with tape storage vs DASD 50 years ago.
It's not the cost.
If it were the cost computers would not use RAM, they would access the
SSD directly.

My point is that the computer itself can make good use of more RAM ahead
of faster disk. It can use the RAM because it's under the immediate and
complete control of the OS. Storage on disk is always "by proxy".

When you've run out of capacity on your motherboard, you look to other
ways. Hmm, my desktop shipped from Dell with a single-core CPU. One of
the first things I did was get a 4-core chip for it. Hmm, that had a
larger CPU cache as well :-) But oh, dear. It's the wrong socket type
for one of the 20-core chips :-(

Data pages, mapped files, ...

DPO - Disable Page Out - Yet another low-level ATA / SCSI type
command (actually a flag bit used in conjunction with other commands).
It tells the disk that data is not likely to be re-used so don't
bother keeping it in cache after you get it to stable storage.

Nice idea.
The problem is that we're getting to the point where there needs to be
an 'elevator' on the side of the call hierarchy between the application,
the libraries, the OS call, the OS interface to file system model, the
file system interface to the device drivers.

I think I have tried to be vocal about keeping work at the right level
in the stack. Let me make it clear again if I haven't.

It's a while since I examined the Virtual File System Switch code,
perhaps I had better do that once again to see if there is such an
'elevator' mechanism. It strikes me that if a file system is mounted
'read only' then all access to that should have the DPO flag set.

I note in passing that the VFS was invented by Sun for SunOS-3.x in the

A disk controller cache that sits between the kernel and the stable
storage should honor the DPO flag and not keep that data around after
it is written to disk.

I'm not saying it shouldn't, I'm just questioning how the information
that a file has been opened RO at the application level gets communicated
all the way down to the device driver, which somehow has to know that
this is an ATA/SCSI device. A key part of the whole UNIX model was always
that it was "device agnostic".
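
The closest hint I know of that an application can give is
posix_fadvise(), which operates on the kernel's page cache and says
nothing about what device is underneath. It is not the DPO flag, and
whether anything below the page cache ever hears about it is exactly the
open question. A minimal Python sketch, with a function name of my own
invention:

```python
import os


def read_once(path, chunk_size=1 << 20):
    """Read a whole file, then tell the kernel its cached pages need not be kept."""
    fd = os.open(path, os.O_RDONLY)
    try:
        chunks = []
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            chunks.append(chunk)
        # Offset 0 with length 0 means "the whole file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        return b"".join(chunks)
    finally:
        os.close(fd)
```

Note that the hint is advisory and device agnostic, which fits the UNIX
model but does not by itself answer how it would reach an ATA/SCSI
command.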

Let's face it, the whole VFS model means that there is no presumption as
to what file system the file being opened RO is on. Heck, you could
have your whole system running on BtrFS, minixFS, qnx4FS or some
experimental-yet-to-be-releasedFS -- which is the whole point, isn't it?

I don't know if it does, but I certainly wouldn't be surprised if it did.

It's worth looking into or raising on a different list.
D**n-it-all-to-h**, another demand on my time that I can't afford.

The discipline of the hierarchy of organization and standardization of
interfaces and separation of function may have led to many good things,
but the old days of the IBM1040 and the like where the application had
control all the way down the stack to the disk had a lot of optimization
that is now simply not possible because the detailed information isn't
available from the higher levels all the way down.

You haven't convinced me.

I haven't convinced myself until I go look.
Back to "Trust, but verify".
I trust I'm right about there not being an 'elevator', but the VFS may
be flexible enough.

Don't expect me to find time to drill down through kernel code to explore
this soon. Right now I have to deal with doing the laundry to have clean
clothes for tomorrow!

There are low-level ATA/SCSI commands that you seem to ignore. If an
app wants to control the caching scheme it can use O_DIRECT open()
flag to bypass the kernel caching and implement its own.

So the app is now taking on responsibility for 'knowing' what the
underlying device is?
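
To make that concrete, here is roughly what an O_DIRECT write involves
from Python on Linux. The buffer has to be aligned and the transfer size
a multiple of the device's block size, so the app does end up knowing
something about the hardware. The 4096-byte block size, the function name
and the fallback to O_DSYNC when the filesystem refuses O_DIRECT are all
assumptions of this sketch. Bear in mind that O_DIRECT bypasses the
kernel's page cache but does not by itself promise the data is on stable
storage.

```python
import errno
import mmap
import os

BLOCK = 4096  # assumed alignment; real code should ask the device for its block size


def write_block_direct(path, payload):
    """Write one zero-padded block, bypassing the page cache when possible.

    Returns True if O_DIRECT was used, False if we fell back to O_DSYNC.
    """
    if len(payload) > BLOCK:
        raise ValueError("payload must fit in one block")
    buf = mmap.mmap(-1, BLOCK)  # anonymous mapping: page-aligned and zero-filled
    buf[:len(payload)] = payload
    base = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    direct = getattr(os, "O_DIRECT", 0)  # not every platform defines it
    try:
        if direct:
            try:
                fd = os.open(path, base | direct, 0o600)
                try:
                    os.write(fd, buf)
                    return True
                finally:
                    os.close(fd)
            except OSError as exc:
                # EINVAL is the usual sign that O_DIRECT is unsupported here.
                if exc.errno != errno.EINVAL:
                    raise
        fd = os.open(path, base | os.O_DSYNC, 0o600)
        try:
            os.write(fd, buf)
        finally:
            os.close(fd)
        return False
    finally:
        buf.close()
```

Having to pick BLOCK at all is the objection in a nutshell.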

I'd be surprised if there isn't a way for userspace to trigger FUA /
DPO / FLUSH_CACHE commands.

After all, this is Linux we are talking about.
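
The portable routes I am sure of are fsync() and fdatasync(), which ask
the kernel to push a file's dirty data out and, on a well-behaved stack,
flush the drive cache as well. Issuing raw FUA or DPO bits yourself
would, I believe, mean a pass-through interface such as SG_IO, which is
not shown here. A minimal Python sketch with names of my own choosing:

```python
import os


def append_batch(path, records, metadata=False):
    """Append several records, then flush once at the end of the batch."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for record in records:
            os.write(fd, record)
        if metadata:
            os.fsync(fd)      # data plus file metadata
        else:
            os.fdatasync(fd)  # data, and only the metadata needed to read it back
    finally:
        os.close(fd)
```

One flush per batch rather than per write is the usual compromise
between safety and throughput.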

There's nothing to say that you can't write an app that accesses some
weird piece of hardware directly via the 'raw' device. Oh wait, isn't
that how things like 'fdisk' and the older versions of X11 work?


Any philosophy that can be put in a nutshell belongs there.
-- Sydney J. Harris