On Wed, May 1, 2013 at 2:06 AM, Cristian Rodríguez <crrodriguez@opensuse.org> wrote:
On 01/05/13 00:41, Claudio Freire wrote:
And I mean crash, not even
power outage.
The journal TODO list includes:
-import and delete pstore filesystem content at startup
That means that, on systems with adequate firmware, the journal will import the "last messages" your system produced before rebooting due to a crash, straight from the dead... spooky :-)
No, storage systems don't work like that when you're mapping. Don't think I don't understand how the system works, because I've read the source. I've written patches for postgres (not upstreamed yet), so I kinda know a little bit about this stuff. Let me tell you.

When you mmap files, you tell the kernel's page cache to do all I/O for you. That's fast, because you've got a zero-copy environment like that. But it's also quite out of your control how writes happen in the end.

When you write to a mapped memory location, you make that page dirty. The kernel will flush it whenever it thinks it's necessary, in no particular order. Pages you dirtied 30 minutes ago can still be unwritten when a page that was written 1 second ago is flushed, because somehow the kernel thinks that old page is hot and so won't flush it unless it's told to. You tell it to with msync. Systemd just started using msync for this very reason. But msync flushes everything. And, again, in no particular order. If there's a crash, or a power outage, you don't know which pages have been written and which haven't.

Postgres and most of the reasonably reliable databases you mention will have a WAL to make sure pages are written in the right order, and no torn pages appear. The WAL is a file opened in sync I/O mode. It's slow, but safe. And it's not too slow since it's append-only, which lets the operating system optimize sync I/O quite a bit. In append-only mode, you can have only one torn page: the last one. So WAL entries are checksummed, and when the WAL is replayed, you know everything up to the first bad checksum is OK.

You cannot do that with memory mapped I/O: you have no way to force a specific write order. So you're left with a system that will work most of the time, when the thing crashing isn't the I/O. But if somehow I/O is interrupted (say a bad interface, a kernel crash, a power outage), the binary format will be messed up, with little hope of repair.
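
To make the append-only, checksummed log idea above concrete, here is a minimal sketch in Python. It is not how postgres or the journal actually lay out their files: the record format (length plus CRC32 header), the names `append_record` and `replay`, and the use of `os.fsync` after each append in place of opening the file in sync I/O mode are all assumptions made for illustration.

```python
import os
import struct
import zlib

# Hypothetical record layout: 4-byte payload length, 4-byte CRC32, payload.
HEADER = struct.Struct("<II")


def _checksum(payload):
    # Cover the length as well, so an all-zero header never looks valid.
    return zlib.crc32(struct.pack("<I", len(payload)) + payload)


def append_record(path, payload):
    """Append one checksummed record, then ask the OS to flush it to disk."""
    record = HEADER.pack(len(payload), _checksum(payload)) + payload
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
    try:
        # A single small write; a real log would loop on short writes.
        os.write(fd, record)
        # fsync stands in for sync I/O mode here (an assumption of this sketch).
        os.fsync(fd)
    finally:
        os.close(fd)


def replay(path):
    """Return all records up to the first truncated or corrupt one."""
    with open(path, "rb") as f:
        data = f.read()
    records = []
    offset = 0
    while offset + HEADER.size <= len(data):
        length, crc = HEADER.unpack_from(data, offset)
        start = offset + HEADER.size
        payload = data[start:start + length]
        if len(payload) < length or _checksum(payload) != crc:
            break  # torn or corrupt record: stop, trust nothing after it
        records.append(payload)
        offset = start + length
    return records
```

Because records are only ever appended, a crash can damage at most the tail of the file, and `replay` simply stops at the first record whose checksum does not match.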
Torn pages will have old info interspersed with new info. Not all dirty pages will have been written, and you have no way to know which. If you implemented some kind of page time-stamping, you'd have to be careful to match page size to sector size, and assuming you get that right, you'd lose a lot more info than you would if you hadn't mapped, because you'd have to discard all valid pages after the first corrupted stamp.

So, memory mapped I/O is wrong if you need reliability. It's great read-only, but that's about it.