Re: [opensuse-factory] Re: What is the purpose of the systemd journal service?

1 May 2013

On Wed, May 1, 2013 at 2:06 AM, Cristian Rodríguez
<crrodriguez@opensuse.org> wrote:
...
On 01/05/13 00:41, Claudio Freire wrote:
...
>> And I mean crash, not even
>> power outage.
> The journal TODO list includes :
> -import and delete pstore filesystem content at startup
> That means, in systems that have the adequate firmware, the journal will
> import the "last messages" your system produced before rebooting due to
> crash, straight from the dead..spooky :-)
No, storage systems don't work like that when you're mapping.

Don't think I don't understand how the system works, because I've read
the source. I've written patches for postgres (not upstreamed yet), so
I kinda know a little bit about this stuff. Let me tell you.

When you mmap files, you tell the kernel's page cache to do all I/O
for you. That's fast, because it gives you a zero-copy path. But it
also leaves how writes eventually happen quite out of your control.
When you write to a mapped memory location, you make that page dirty.
The kernel will flush it whenever it thinks it's necessary, in no
particular order.

Pages you dirtied 30 minutes ago can still be unwritten when a page
you dirtied 1 second ago is flushed, because somehow the kernel thinks
that old page is hot and so won't flush it unless it's told to. You
tell it to with msync. Systemd just started using msync for this very
reason.
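
To make that concrete, here is a minimal sketch in Python, whose
mmap.flush() calls msync() on Unix. The names dirty_and_flush and demo
are made up for illustration, and this is not how the journal itself is
written. The point is only that the assignment into the mapping dirties
a page in the page cache, and nothing is guaranteed to be on disk until
the flush returns.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE

def dirty_and_flush(path, offset, payload):
    """Write payload through a shared mapping, then force write-back."""
    with open(path, "r+b") as f:
        # Length 0 maps the whole file; the default mapping is shared and writable.
        with mmap.mmap(f.fileno(), 0) as mm:
            # This store only dirties a page in the page cache.
            mm[offset:offset + len(payload)] = payload
            # On Unix this is msync(): it blocks until the mapping's dirty
            # pages are written back, but says nothing about their order.
            mm.flush()

def demo():
    """Hypothetical driver: create a two-page file, write into page two."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"\0" * (2 * PAGE))
        os.close(fd)
        dirty_and_flush(path, PAGE, b"journal entry")
        with open(path, "rb") as f:
            f.seek(PAGE)
            return f.read(len(b"journal entry"))
    finally:
        os.remove(path)
```

Between the store and the flush, a crash can leave any subset of the
dirty pages on disk.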

But msync flushes every dirty page in the range you hand it. And,
again, in no particular order. If there's a crash, or a power outage,
you don't know which pages have been written and which haven't.
Postgres and most reasonably reliable databases you mention will have
a WAL (write-ahead log) to make sure pages are written in the right
order, and no torn pages appear. The WAL is a file opened in sync I/O
mode. It's slow, but safe. And it's not too slow since it's
append-only, which lets the operating system optimize sync I/O quite
a bit.
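
A rough sketch of that append-only, synchronous write pattern, in
Python. Note the assumptions: append_durably and demo are hypothetical
names, and I call os.fsync() after each append as a stand-in for opening
the file in sync I/O mode. This is not Postgres code, just the shape of
the idea.

```python
import os
import tempfile

def append_durably(path, record):
    """Append one record; return only after the kernel reports it synced."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        view = memoryview(record)
        while view:
            # os.write may write fewer bytes than asked; loop until done.
            written = os.write(fd, view)
            view = view[written:]
        # Stand-in for sync I/O mode: force the data to storage before
        # returning, so later records never land before earlier ones.
        os.fsync(fd)
    finally:
        os.close(fd)

def demo():
    """Hypothetical driver: append two records and read the log back."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "wal.log")
        for rec in (b"first\n", b"second\n"):
            append_durably(path, rec)
        with open(path, "rb") as f:
            return f.read()
```

Because every record is synced before the next one is appended, the
on-disk order matches the logical order, which is exactly what a mapped
file cannot promise.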

In append-only mode, you can have only one torn page: the last one. So
WAL entries are checksummed, and when the WAL is replayed, you know
everything up to the first bad checksum is OK. You cannot do that with
memory-mapped I/O, because you have no way to force a specific write
order.
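
The replay rule can be sketched like this. The record layout (a 4-byte
length and a 4-byte CRC32 in front of each payload) is my own assumption
for illustration, not the format of any real WAL, and frame and replay
are hypothetical names.

```python
import struct
import zlib

# Assumed layout: little-endian payload length, then CRC32 of the payload.
HEADER = struct.Struct("<II")

def frame(payload):
    """Prefix a payload with its length and checksum."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload

def replay(log):
    """Return the payloads up to, but not including, the first bad record."""
    records = []
    pos = 0
    while pos + HEADER.size <= len(log):
        length, crc = HEADER.unpack_from(log, pos)
        if length == 0:
            break  # assumption: a zero length marks unwritten space
        start = pos + HEADER.size
        payload = bytes(log[start:start + length])
        # A short read or checksum mismatch means a torn or corrupt record.
        if len(payload) != length or zlib.crc32(payload) != crc:
            break
        records.append(payload)
        pos = start + length
    return records
```

With a log built from frame(b"a") + frame(b"bb") + frame(b"ccc"),
chopping the last two bytes off makes replay return only the first two
payloads. That works because an append-only log can only be torn at the
tail.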

So you're left with a system that will work most of the time, when the
thing crashing isn't the I/O. But if somehow I/O is interrupted (say a
bad interface, a kernel crash, a power outage), the binary format will
be messed up, with little hope of repair. Torn pages will have old info
interspersed with new info. Not all dirty pages will have been
written, and you have no way to know. If you implemented some kind of
page time-stamping, you'd have to be careful to match page size to
sector size, and assuming you get that right, you'd lose a lot more
info than you would if you hadn't mapped, because you'll have to
discard all valid pages after the first corrupted stamp.

So, memory-mapped I/O is wrong if you need reliability. It's great
for read-only access, but that's about it.