On 02/14/2016 02:20 PM, Greg Freemyer wrote:
On Sun, Feb 14, 2016 at 10:54 AM, Anton Aylward
wrote:
... [snip] ... In the modern era (as opposed to the pre-historic era of 40 years ago), we have huge disk controller caches that work similarly to what you describe. One solution for at least the last 2 decades is to use an on-board battery to maintain that disk controller cache.
I am so glad you decided to follow up further on my note about that. Yes, one of those vendors of cache boards for SCO UNIX developed a board with battery backup ... I thought I mentioned that.
Around 2000 I was working with always up (24x365) disk subsystems. The ones I knew well were made by DEC, (I mean Compaq, (I guess I mean HP)).
I know what you mean. I have some equipment in the basement, designed by DEC, built by Compaq, but with an HP label.
The battery could maintain the cache contents for 7 days during a hardware fault / power outage. They had the battery connected to the system via a Y-connector. The y-connector was accessible with the system operational. Thus if the battery died due to old age, you connected up the new battery before you disconnected the old, ensuring you didn't have a window of vulnerability during the battery replacement.
That's a good mechanism. Was there something that noticed if the battery died of old age? Was there a means for testing that the mechanism worked? (How far do you want to regress this?) Did the mechanism nag, nag, nag endlessly? 'Cos I'm sure I'm not alone when I speak of a set-up like this where the sysop decided to turn off the alarm, and turn off the alarm ... and unplug the annoying alarm ... and forgot about it. What is it they say about idiot-proof systems? It's humans that are the problem :-(
Committing to disk, and committing in the right order is important.
Yes, and in the last 40 years a lot of effort has gone into making sure the order is right.
Beginning, as I said, with the momentous step in the change from V6 to V7 UNIX, and trying hard never to look back. I do wonder, however, just how much mistakes like this are taught in CS courses? I've noted many times that the #1 and #2 vulnerabilities in the SANS Top 20 list, SQL Injection and Buffer Overflow, have been around for more than 20 years. Buffer Overflow, if you recall, was the root cause of the Morris Worm of 1988, which took down an appreciable part of the Internet-as-it-then-was. My point here is that when I interview new intakes of programmers, or even talk with ones who've been working for my client for some years, even those who are aware of these tell me their school and college courses never mentioned them.
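To make the ordering point concrete: much of that 40 years of effort exists to support the classic crash-safe update discipline — write a new copy, get it to stable storage, then atomically rename it over the old one, then persist the directory entry. A minimal sketch in Python (the filenames are invented for illustration):

```python
import os

def atomic_write(path, data: bytes):
    """Replace `path` with `data` so a crash leaves either the old
    contents or the new contents, never a torn mixture."""
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)        # the new data must be stable *before* the rename
    finally:
        os.close(fd)
    os.rename(tmp, path)    # atomic within a single filesystem
    # Persist the directory entry too, or the rename itself may be lost:
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)

atomic_write("settings.conf", b"color=blue\n")
```

Note the two distinct orderings here: data before rename, and rename before the directory fsync. Get either one backwards and a badly timed power cut can leave you with an empty or missing file — which is exactly the class of bug the V7-era changes were fighting.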
Even only 20 years ago, the only status of a cache was clean or dirty. If it was dirty you had to issue a cache flush command to ensure any given piece of data was written to disk.
Not quite. The VM mechanisms of which I speak, which grew out of the VM of the original 4.1BSD for the VAX of the 1982 era, had to deal with the lack of a few crucial 'bits' in the VAX VM hardware support. That's one reason for the different types of ageing/available/"dirty" queues. There are many internal pointers that give context to a 'dirty'. Only a data page can be dirty, and there may be a file pointer to a buffer in virtual memory that is linear, but uses the VM's scatter-gather mechanisms to point to places in the write-out queue. Perhaps "on the write-out queue" is more meaningful, better descriptive, when speaking of VM mechanisms than "dirty".
Fortunately, 15 years ago the ATA command set was enhanced to have a "force unit access" (FUA) command that allowed a specific data write to not be acknowledged as complete until the data written was on "stable storage".
And of course every writer of disk controllers unerringly makes use of that ...
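For what it's worth, ordinary application code rarely issues FUA itself; on Linux the usual route is O_DSYNC on open() (or an fdatasync() at the commit point), which the block layer can translate into FUA writes on drives that honor them — or a full cache flush on drives that don't. A sketch, with an invented log file name:

```python
import os

# O_DSYNC: each write() returns only once the data (and the metadata
# needed to read it back) is reported to be on stable storage. Whether
# that becomes a true FUA write or a whole-cache flush underneath
# depends on the drive and kernel -- exactly the "if" under discussion.
fd = os.open("journal.log",
             os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_APPEND | os.O_DSYNC,
             0o644)
try:
    os.write(fd, b"txn 1 commit\n")
    os.write(fd, b"txn 2 commit\n")
finally:
    os.close(fd)

# The after-the-fact alternative: write normally, buffered, then
# fdatasync() once at the commit point.
fd = os.open("journal.log", os.O_WRONLY | os.O_APPEND)
try:
    os.write(fd, b"txn 3 commit\n")
    os.fdatasync(fd)
finally:
    os.close(fd)
```

Either way, userspace is only expressing a durability *requirement*; whether the controller honestly carries it through to stable storage is the trust question raised below.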
If you read an XFS or RAID mailing list you will see they strongly urge against hardware cache systems that falsely report success for a FUA write operation.
Good. Well, sort of. "Strongly urge" seems a bit lily-livered[1].
The exception is if the user has a battery backup system like the above that they trust and therefore the cache itself is considered stable.
BAD :-( The security types have a saying: "Trust, but verify". Unless there is that verification, one that somehow gets past human failings, I would not trust it.
For my normal activity, I need reliable operation, but not 7x365 so I can easily accept a hardware cache that requires downtime to swap out the battery.
We have a wide variety of users here. My home system ... I'm not sure I could justify a hardware cache anyway! But most of my clients have financial systems or web services where 7x365 *is* required and "Five Nines" may not be enough, and precautions and checks against human error or oversight are one of the first things that are considered. I've read that Wall Street pretty much runs on Linux. They aren't the only financial hub in the world that uses Linux!
Hopefully any modern hardware-based disk caching solution honors FUA write requests.
Hopefully.
Certainly an SSD, which will retain content, can, if the supporting software is there to do the job, ameliorate this. It gets back to the issue of the window/timing. But there are a few too many 'ifs' in this to inspire confidence in an old cynic like me.
As long as the FUA commands are honored, the filesystem should be trustworthy even with a large data cache.
I agree with your proviso there. But it is a case of "if".
Adding RAM is often the cheapest way to improve performance with a *NIX system that can make use of it. Again, there are a few 'ifs' in that. This may not be your bottleneck; profiling is important. You may also come up against hardware limits: many PC motherboards have too few memory slots, or their chipset won't allow the use of larger-capacity memory cards. But before you spend >$50 on an SSD http://www.amazon.com/SanDisk-6-0GB-2-5-Inch-Height-SDSSDP-064G-G25/dp/B007ZWLRSU/ref=sr_1_3?ie=UTF8&qid=1455464006&sr=8-3&keywords=64g+ssd consider spending the same on RAM.
A worthy comment, but RAM is still $7/GB or so and high-speed SSD is down below $0.50/GB
We've had this argument before, with tape storage vs DASD 50 years ago. It's not the cost. If it were the cost, computers would not use RAM; they would access the SSD directly. My point is that the computer itself can make good use of more RAM ahead of faster disk. It can use the RAM because it's under the immediate and complete control of the OS. Storage on disk is always 'by proxy'. When you've run out of capacity on your motherboard, you look to other ways. Hmm, my desktop shipped from Dell with a single-core CPU. One of the first things I did was get a 4-core chip for it. Hmm, that had a larger CPU cache as well :-) But oh dear, it's the wrong socket type for one of the 20-core chips :-(
Data pages, mapped files, ...
DPO - Disable Page Out - Yet another low-level ATA / SCSI type command (actually a flag bit used in conjunction with other commands). It tells the disk that data is not likely to be re-used so don't bother keeping it in cache after you get it to stable storage.
Nice idea. The problem is that we're getting to the point where there needs to be an 'elevator' on the side of the call hierarchy between the application, the libraries, the OS call, the OS interface to the file system model, and the file system interface to the device drivers. I think I have tried to be vocal about keeping work at the right level in the stack; let me make it clear again if I haven't. It's a while since I examined the Virtual File System Switch code; perhaps I'd better do that once again to see if there is such an 'elevator' mechanism. It strikes me that if a file system is mounted read-only then all access to it should have the DPO flag set. I note in passing that the VFS was invented by Sun for SunOS-3.x in the mid-1980s.
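As it happens, there is already a crude userspace lever in this direction: posix_fadvise(2) lets an application tell the *kernel's* page cache that data won't be re-used, which is roughly the DPO idea one level up the stack. Whether any of that advice reaches the controller's cache is a separate question. A sketch, with an invented file name:

```python
import os

# Stream out a chunk of bulk data we never intend to read back
# (think backups or log shipping).
fd = os.open("bulk.dat", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
try:
    os.write(fd, b"\0" * (1 << 20))   # 1 MiB of streaming data
    os.fsync(fd)                      # get it to stable storage first
    # Advise the kernel we will not re-use this: drop the cached pages.
    # Advice only -- the kernel is free to ignore it, and nothing here
    # is guaranteed to propagate to the disk controller's own cache.
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
finally:
    os.close(fd)
```

Note it is the application volunteering the hint per file descriptor; there is no automatic "this mount is read-only, so set DPO everywhere" mechanism visible at this level, which is exactly the missing 'elevator'.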
A disk controller cache that sits between the kernel and the stable storage should honor the DPO flag and not keep that data around after it is written to disk.
I'm not saying it shouldn't; I'm just questioning how the information that a file has been opened RO at the application level gets communicated all the way down to the device driver, somehow knowing that this is an ATA/SCSI device. A key part of the whole UNIX model was always that it was "device agnostic". Let's face it, the whole VFS model means that there is no presumption as to what file system the file being opened RO is on. Heck, you could have your whole system running on BtrFS, minixFS, qnx4FS or some experimental-yet-to-be-releasedFS -- which is the whole point, isn't it?
I don't know if it does, but I certainly wouldn't be surprised if it did.
It's worth looking into or raising on a different list. D**n-it-all-to-h**, another demand on my time that I can't afford.
The discipline of the hierarchy of organization, the standardization of interfaces and the separation of function may have led to many good things, but in the old days of the IBM 1040 and the like, where the application had control all the way down the stack to the disk, a lot of optimization was possible that now simply isn't, because the detailed information isn't available from the higher levels all the way down.
You haven't convinced me.
I haven't convinced myself, and won't until I go look. Back to "Trust, but verify". I trust I'm right about there not being an 'elevator', but the VFS may be flexible enough. Don't expect me to find time to drill down through kernel code to explore this soon. Right now I have to deal with doing the laundry to have clean clothes for tomorrow!
There are low-level ATA/SCSI commands that you seem to ignore. If an app wants to control the caching scheme it can use the O_DIRECT open() flag to bypass the kernel caching and implement its own.
So the app is now taking on responsibility for 'knowing' what the underlying device is?
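In practice, yes, at least partly: O_DIRECT pushes device knowledge up into the app, because the buffer address and transfer size generally have to be aligned to the device's block size, and not every filesystem supports it at all. A sketch (Linux; the 4096-byte alignment and the file name are assumptions, and the fallback is the app discovering its device the hard way):

```python
import mmap
import os

BLOCK = 4096  # assumed alignment; the real requirement is device-dependent

def write_direct(path, payload: bytes):
    """Write `payload` (a multiple of BLOCK) bypassing the kernel page
    cache where the filesystem allows it; fall back to a buffered write."""
    # O_DIRECT wants an aligned buffer; an anonymous mmap is page-aligned.
    buf = mmap.mmap(-1, len(payload))
    buf.write(payload)
    try:
        fd = os.open(path,
                     os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_DIRECT,
                     0o644)
        try:
            os.write(fd, buf)   # goes straight past the page cache
        finally:
            os.close(fd)
    except OSError:
        # e.g. tmpfs: no O_DIRECT here. The app just had to 'know'
        # something about the underlying device -- Anton's point exactly.
        with open(path, "wb") as f:
            f.write(payload)

write_direct("direct.bin", b"A" * BLOCK)
```

Contrast this with the normal buffered path, where the app neither knows nor cares what the device's block size is; that device-agnosticism is precisely what O_DIRECT trades away for control.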
I'd be surprised if there isn't a way for userspace to trigger FUA / DPO / FLUSH_CACHE commands.
After all, this is Linux we are talking about.
Indeed. There's nothing to say that you can't write an app that accesses some weird piece of hardware directly via the 'raw' device. Oh wait, isn't that how things like 'fdisk' and the older versions of X11 work?

[1] http://www.merriam-webster.com/dictionary/lily%E2%80%93livered

--
Any philosophy that can be put in a nutshell belongs there.
  -- Sydney J. Harris

--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse+owner@opensuse.org