[opensuse] About relatime and lazytime
Hi,

relatively recently, "lazytime" has been added as the default for mounts - from man mount:

    lazytime
        Only update times (atime, mtime, ctime) on the in-memory version of the file inode.

        This mount option significantly reduces writes to the inode table for workloads that perform frequent random writes to preallocated files.

        The on-disk timestamps are updated only when:

        - the inode needs to be updated for some change unrelated to file timestamps
        - the application employs fsync(2), syncfs(2), or sync(2)
        - an undeleted inode is evicted from memory
        - more than 24 hours have passed since the inode was written to disk.

I understand that using this option (which is the default) supersedes using "relatime". However, I noticed that most mounts, not all, are using both options:
Telcontar:~ # mount | grep relatime | grep lazytime
/dev/sdd2 on /data/storage_d type xfs (rw,relatime,lazytime,attr2,inode64,noquota)
/dev/sdb8 on /opt type xfs (rw,relatime,lazytime,attr2,inode64,noquota)
/dev/sdb10 on /data/storage_b type xfs (rw,relatime,lazytime,attr2,inode64,noquota)
/dev/sda13 on /data/storage_a type xfs (rw,relatime,lazytime,attr2,inode64,noquota)
/dev/sdc13 on /var/spool/news type reiserfs (rw,relatime,lazytime,user_xattr,acl)
/dev/sda12 on /data/vmware type xfs (rw,relatime,lazytime,attr2,inode64,noquota)
/dev/sdc14 on /data/homedvl type reiserfs (rw,relatime,lazytime,user_xattr,acl)
/dev/sdc7 on /usr/local type reiserfs (rw,relatime,lazytime,user_xattr,acl)
/dev/sdc12 on /usr/gamedata type reiserfs (rw,relatime,lazytime,user_xattr,acl)
/dev/sdc6 on /usr/src type reiserfs (rw,relatime,lazytime,user_xattr,acl)
/dev/sdc16 on /data/storage_c type xfs (rw,relatime,lazytime,attr2,inode64,noquota)
/dev/sdc5 on /home_aux type xfs (rw,relatime,lazytime,attr2,inode64,noquota)
/dev/sdc8 on /home1 type xfs (rw,relatime,lazytime,attr2,inode64,noquota)
/dev/md0 on /data/raid type xfs (rw,relatime,lazytime,attr2,inode64,logbsize=128k,sunit=256,swidth=512,noquota)
/dev/mapper/cr_cripta on /data/cripta type xfs (rw,relatime,lazytime,attr2,inode64,noquota)
/dev/sda7 on /other/test_a1 type reiserfs (rw,relatime,lazytime,user_xattr,acl)
Telcontar:~ #
Mounts that only use relatime:
Telcontar:~ # mount | grep relatime | grep -v lazytime
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
configfs on /sys/kernel/config type configfs (rw,relatime)
/dev/sde5 on / type ext4 (rw,relatime,data=ordered)
mqueue on /dev/mqueue type mqueue (rw,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw,relatime)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=35,pgrp=1,timeout=0,minproto=5,maxproto=5,direct)
nfsd on /proc/fs/nfsd type nfsd (rw,relatime)
/dev/sde4 on /boot type ext2 (rw,relatime,block_validity,barrier,user_xattr,acl)
/dev/sdc17 on /home type ext4 (rw,relatime,data=ordered)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=817444k,mode=700)
tmpfs on /var/run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=817444k,mode=700)
tmpfs on /run/user/1000 type tmpfs (rw,nosuid,nodev,relatime,size=817444k,mode=700,uid=1000,gid=100)
tmpfs on /var/run/user/1000 type tmpfs (rw,nosuid,nodev,relatime,size=817444k,mode=700,uid=1000,gid=100)
gvfsd-fuse on /run/user/1000/gvfs type fuse.gvfsd-fuse (rw,nosuid,nodev,relatime,user_id=1000,group_id=100)
gvfsd-fuse on /var/run/user/1000/gvfs type fuse.gvfsd-fuse (rw,nosuid,nodev,relatime,user_id=1000,group_id=100)
fusectl on /sys/fs/fuse/connections type fusectl (rw,relatime)
Telcontar:~ #
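A quick way to double-check the pattern in the two listings above is to group the atime-related options by filesystem type. The following is only a sketch: it assumes `mount` output in the format shown above, and the names `parse_mount_line`, `atime_flags_by_fstype` and `ATIME_OPTIONS` are made up for this illustration.

```python
from collections import defaultdict

# Mount options that influence timestamp handling (illustrative selection).
ATIME_OPTIONS = {"relatime", "lazytime", "noatime", "strictatime", "nodiratime"}


def parse_mount_line(line):
    """Split one line of `mount` output into (device, mountpoint, fstype, options).

    Expected shape: "<device> on <mountpoint> type <fstype> (<opt1>,<opt2>,...)"
    """
    head, _, opts = line.rstrip().rpartition(" (")
    device, _, rest = head.partition(" on ")
    mountpoint, _, fstype = rest.rpartition(" type ")
    return device, mountpoint, fstype, opts.rstrip(")").split(",")


def atime_flags_by_fstype(lines):
    """Map each filesystem type to the set of atime-option combinations seen."""
    summary = defaultdict(set)
    for line in lines:
        if not line.strip():
            continue
        _, _, fstype, opts = parse_mount_line(line)
        flags = tuple(sorted(o for o in opts if o in ATIME_OPTIONS))
        summary[fstype].add(flags)
    return dict(summary)


if __name__ == "__main__":
    # Three lines copied from the listings in this message.
    sample = [
        "/dev/sdd2 on /data/storage_d type xfs (rw,relatime,lazytime,attr2,inode64,noquota)",
        "/dev/sdc7 on /usr/local type reiserfs (rw,relatime,lazytime,user_xattr,acl)",
        "/dev/sde5 on / type ext4 (rw,relatime,data=ordered)",
    ]
    for fstype, combos in sorted(atime_flags_by_fstype(sample).items()):
        print(fstype, sorted(combos))
```

Feeding it the full output of `mount` (for example through `subprocess.run(["mount"], capture_output=True, text=True)`) would give the same per-filesystem summary for the whole machine.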
I.e., the ext4 partitions use only relatime, as do the mounts that are not real disk partitions. xfs and reiserfs use both relatime and lazytime. It is not my doing:
Telcontar:~ # grep relatime /etc/fstab
# default is "relatime", no need to add it (or atime,nodiratime)
#/dev/mapper/cr_storage_e /data/test xfs nofail,relatime,noauto 1 5
Telcontar:~ #
    relatime
        Update inode access times relative to modify or change time. Access time is only updated if the previous access time was earlier than the current modify or change time. (Similar to noatime, but it doesn't break mutt or other applications that need to know if a file has been read since the last time it was modified.)

        Since Linux 2.6.30, the kernel defaults to the behavior provided by this option (unless noatime was specified), and the strictatime option is required to obtain traditional semantics. In addition, since Linux 2.6.30, the file's last access time is always updated if it is more than 1 day old.

Should I use "norelatime" manually on fstab entries?

--
Cheers / Saludos,
Carlos E. R.
(from 42.2 x86_64 "Malachite" at Telcontar)
On 26/09/17 02:36 PM, Carlos E. R. wrote:
Hi,
relatively recently, "lazytime" has been added as the default for mounts - from man mount:
lazytime Only update times (atime, mtime, ctime) on the in-memory version of the file inode.
This mount option significantly reduces writes to the inode table for workloads that perform frequent random writes to preallocated files.
1. What is 'relatively' in the "relatively recently"?

2. This "added to mounts" is how? As part of the system call? That would make it part of the kernel. What version of the kernel are you talking about?

3. What, in your opinion, is "frequent random writes to preallocated files"? I ask because in my opinion that means databases, as in updating fields or rows. While I realise that Thunderbird and Firefox make use of small SQLite3 databases, how useful is this to those of us who just use those file systems for 'regular' stuff like word processing and the regular file traffic: postfix, OpenOffice, gimp, or updating the myriad of small configuration dot files in the home directory?

4. If a file is closed, does that constitute its inode being "evicted from memory" in this context? Or is there some caching done on the principle that you _might_ want to open it again soon?

--
A: Yes.
> Q: Are you sure?
>> A: Because it reverses the logical flow of conversation.
>>> Q: Why is top posting frowned upon?

--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse+owner@opensuse.org
On 2017-09-26 21:17, Anton Aylward wrote:
On 26/09/17 02:36 PM, Carlos E. R. wrote:
Hi,
relatively recently, "lazytime" has been added as the default for mounts - from man mount:
lazytime Only update times (atime, mtime, ctime) on the in-memory version of the file inode.
This mount option significantly reduces writes to the inode table for workloads that perform frequent random writes to preallocated files.
1. What is 'relatively' in the "relatively recently"? 2. This "added to mounts" is how? As part of the system call? That would make it part of the kernel. What version of the kernel are you talking about?
I'm talking of Leap. 42.1 did not have it, 42.2 has it. The above is extracted directly from the man page, it is not my text. I don't know if the defaults are applied by the mount command or by systemd.
3. What, in your opinion, is "frequent random writes to preallocated files"? I ask because in my opinion that means databases, as in updating fields or rows. While I realise that Thunderbird and Firefox make use of small SQLite3 databases, how useful is this to those of us who just use those file systems for 'regular' stuff like word processing and the regular file traffic: postfix, OpenOffice, gimp, or updating the myriad of small configuration dot files in the home directory?
Wait. It is not about data writes. It is about writing or updating the timestamp each time some data is written to the file (or read). In traditional Unix, each time you read a bit from a file, one timestamp for that file is changed. I.e., one write operation on the inode table per data read access. Linux inherited this.
4. If a file is closed does that constitute its inode being "evicted from memory" in this context? or is there some caching done on the principle that you _might_ want to open it again soon?
I don't know and it is not related to this issue. -- Cheers / Saludos, Carlos E. R. (from 42.2 x86_64 "Malachite" at Telcontar)
On 26/09/17 03:27 PM, Carlos E. R. wrote:
I'm talking of Leap. 42.1 did not have it, 42.2 has it.
I'm running a notional 42.2 with 4.13.3-2.g38f3021-default and I just tried unmounting a Reiserfs mounted FS and cli remounting it with an additional 'lazytime'. It is now reported to have that option -- OK.

So that makes it a "mount(1)" command option, OK. But I can't see what code there could be in user space that tweaks the inode. It has to be in the "mount(2)" code, that is, the kernel. It would take examining kernel source to verify. Anyone?

But it fails to answer my other question, which you misinterpret.

I understand that this is in-memory. The inode is not flushed just because it has a time field updated.

Now that makes sense in the 'preallocated' sense of a database. What is on disk is a 'slot' in the database and the contents of the 'slot' have changed. The allocation has not. The size has not.

Apart from that singular case the semantics get less obvious.

If I correct a spelling mistake in a document of some form, be it plain text or OpenOffice or HTML, that is just a character inversion such as 'hte' -> 'the' or 'adn' -> 'and', which are remarkably common, or a spelling mistake arising from hitting an adjacent key, such as an 'i' instead of an 'o' or vice versa, there is no change in file length. I realise that, algorithmically, the code could read in the whole of the file, then write out the whole of the file (possibly renaming the old one to a ".bak" if you are DEC VMS obsessed). But does it? Should it? The _file_ *is* preallocated. But what about files that aren't on a filesystem mounted this way?

I understand that 'lazytime' amounts to a deferred flush of the inode. I am assuming that closing the file enforces a flush; there's no way the file semantics make sense otherwise.

The cache is there for what purpose? The inodes of open files are in memory and are staying in memory. PURGE-ing an inode for an open file out of memory makes no sense. So why have an inode cache?

There is the trivial case of directory traversal. If I open

    /home/anton/MyPhotos/ByYear/2017/march/24/DFU00124.jpg

then the system has to opendir() & readdir() internally along the way. It also does a closedir(). Maybe, just maybe, a moment later I will view

    /home/anton/MyPhotos/ByYear/2017/march/24/DFU00125.jpg

If it has cached the directory fragments and inodes then this lookup will be fast. We saw this progression of caching added and refined in UNIX SYSIII, SYSV, and SUN Solaris over the years, so of course it ended up in Linux.

There are quite a few _semantic_ implications that are, in my mind, not resolved by that man page entry.
On 2017-09-27 00:03, Anton Aylward wrote:
On 26/09/17 03:27 PM, Carlos E. R. wrote:
I'm talking of Leap. 42.1 did not have it, 42.2 has it.
I'm running a notional 42.2 with 4.13.3-2.g38f3021-default and I just tried unmounting a Reiserfs mounted FS and cli remounting it with an additional 'lazytime'. It is now reported to have that option -- OK.
The "lazytime" option is active by default in Leap 42.2.
So that makes it a "mount(1)" command option, OK. But I can't see what code there could be in user space that tweaks the inode. It has to be in the "mount(2)" code, that is, the kernel. It would take examining kernel source to verify. Anyone?
The activation is handled by the "mount" command. "/usr/bin/mount". Not the kernel. The handling is of course done by the kernel. This was explained in a thread in the mail list some months ago.
But it fails to answer my other question which you misinterpret.
I understand that this is in-memory. The inode is not flushed just because it has a time field updated.
It is, soon enough.
Now that makes sense in the 'preallocated' sense of a database. What is on disk is a 'slot' in the database and the contents of the 'slot' have changed. The allocation has not. The size has not.
Apart from that singular case the semantics get less obvious.
If I correct a spelling mistake in a document of some form, be it plain text or OpenOffice or HTML, that is just a character inversion such as 'hte' -> 'the' or 'adn' -> 'and', which are remarkably common, or a spelling mistake arising from hitting an adjacent key, such as an 'i' instead of an 'o' or vice versa, there is no change in file length. I realise that, algorithmically, the code could read in the whole of the file, then write out the whole of the file (possibly renaming the old one to a ".bak" if you are DEC VMS obsessed). But does it? Should it? The _file_ *is* preallocated. But what about files that aren't on a filesystem mounted this way?
It doesn't matter. Even if you just change a byte in a single sector of the file, and only that sector is written, still the timestamp of the file changes and has to be written.

In traditional Unix and Linux this action is more or less immediate. With lazytime it is delayed.

https://lwn.net/Articles/621046/

None of that applies to my original post; it is off-topic. The subject is:

If lazytime is the default option, why is relatime also default? Should we remove it? Should I report a bug?

--
Cheers / Saludos,
Carlos E. R.
(from 42.2 x86_64 "Malachite" at Telcontar)
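The point about a small in-place change can be tried out from user space. What follows is a minimal sketch, not anything from the man page: the helper name `overwrite_in_place` and the sample file are invented for the illustration, and the caller is assumed to keep the write inside the existing file length. It overwrites two bytes without changing the file size and returns the `os.stat()` results from before and after, so that size, inode number and mtime can be compared. What it cannot show is when the kernel writes the new timestamp to disk, which is the part lazytime changes.

```python
import os
import tempfile


def overwrite_in_place(path, offset, data):
    """Overwrite len(data) bytes at `offset` without changing the file length.

    Returns the os.stat() results taken before and after the write.
    """
    before = os.stat(path)
    with open(path, "r+b") as f:      # r+b: update in place, no truncation
        f.seek(offset)
        f.write(data)
        f.flush()
        os.fsync(f.fileno())          # ask the kernel to write the file out
    after = os.stat(path)
    return before, after


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "doc.txt")
        with open(path, "wb") as f:
            f.write(b"hte quick brown fox\n")
        before, after = overwrite_in_place(path, 0, b"th")   # 'hte' -> 'the'
        print("size :", before.st_size, "->", after.st_size)
        print("inode:", before.st_ino, "->", after.st_ino)
        print("mtime:", before.st_mtime_ns, "->", after.st_mtime_ns)
```

The `fsync()` call is there because, according to the man page text quoted at the start of the thread, fsync(2) is one of the events that makes lazytime write the timestamps to disk.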
On 26/09/17 07:13 PM, Carlos E. R. wrote:
The "lazytime" option is active by default in Leap 42.2.
So that makes it a "mount(1)" command option, OK. But I can't see what code there could be in user space that tweaks the inode. It has to be in the "mount(2)" code, that is, the kernel. It would take examining kernel source to verify. Anyone?
The activation is handled by the "mount" command. "/usr/bin/mount". Not the kernel.
If you mean that it is in the "-o" option list, then yes, but that's not what I meant.
The handling is of course done by the kernel.
I would call the "-o option" the handling, and what the kernel does the "implementation".
This was explained in a thread in the mail list some months ago.
Do you have a date? Author? URL?
On 26/09/17 07:13 PM, Carlos E. R. wrote:
It doesn't matter. Even if you just change a byte in a single sector of the file, and only that sector is written, still the timestamp of the file changes and has to be written.
In traditional unix and linux this action is more or less immediate. With lazytime it delays.
None of this addresses my question. And it is related to your question.

The man page says

    This mount option significantly reduces writes to the inode table for workloads that perform frequent random writes to preallocated files.

The key there is "preallocated".

The way I read 'relatime' is that, if set, it drags 'access' up when modify is changed, so that the condition

    accesstime < modifytime

can never occur.

The way I read that, it means that if 'lazytime' is set then it doesn't matter one way or another. If 'lazytime' is set then unless the man page conditions are met there will not be a write of the inode.

But... it does say:

    The on-disk timestamps are updated only when:
    - the inode needs to be updated for some change unrelated to file timestamps

Now one view is that the inode contains the file allocation map. So if that alters, then the timestamps (and the map, if you want to retain integrity) get written. Then relatime matters as well.

But if the file is preallocated, as in the example I gave of a database, the allocation map isn't changed as the file is updated by the field being modified. Yes, this constitutes modifying the file, so mtime alters. But surely many database operations are read-modify-write, so the atime is altered anyway?

Which gets back to my question about the other kinds of files -- that is, not database files -- where the conditions for lazytime to apply exist. When you can identify those you can look into the semantics of the interaction of lazytime and relatime.

But as far as I can see all the other apps, even if just changing 'het' to 'the' in a text document, end up modifying the file map 'cos they rewrite the whole file and allocate it anew in one of a few possible ways. So lazytime doesn't apply.

Find me a non-database use case and let's see if it can be write-without-read.

===================

Let me hypothesise.

Suppose you have a FS that allocates 4K blocks at a time, so the smallest allocation is 4K even if the file size is 1 byte. Then:

    echo -n "hello " > hw.txt
    echo "world\n" >> hw.txt

We know that the final "hello world" fits in 4K so there will only be the one block. But will the system append the "world\n" in the original block without altering the map, or will a new file be created and the old, original one discarded as a new one is created?

    $ echo -n "hello " > hw.txt
    $ ls -li hw.txt
    101 -rw-r--r-- 1 aja aja 6 Sep 26 21:14 hw.txt
    $ echo "world\n" >> hw.txt
    $ ls -il hw.txt
    101 -rw-r--r-- 1 aja aja 14 Sep 26 21:14 hw.txt

WOW!

So now we get back to my question about caching.
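The append experiment above can also be scripted, which makes it easy to repeat on different filesystems. This is only a sketch of the same test; the function name `append_experiment` is invented here. It records the inode number and size after creating the file and again after appending to it. As with the `ls -li` output, it says nothing about block allocation or about when the inode reaches the disk.

```python
import os
import tempfile


def append_experiment(path, first, second):
    """Create `path` with `first`, append `second`, and return
    ((inode, size) after creation, (inode, size) after the append)."""
    with open(path, "wb") as f:
        f.write(first)
    st1 = os.stat(path)
    with open(path, "ab") as f:       # "ab" appends to the existing file
        f.write(second)
    st2 = os.stat(path)
    return (st1.st_ino, st1.st_size), (st2.st_ino, st2.st_size)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "hw.txt")
        # Same bytes as the shell session: the backslash-n stays literal,
        # followed by the newline that echo adds.
        created, appended = append_experiment(path, b"hello ", b"world\\n\n")
        print("after create:", created)
        print("after append:", appended)
```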
26.09.2017 21:36, Carlos E. R. wrote:
I understand that using this option [lazytime] (which is the default) supersedes using "relatime".
Neither is it default nor does it supersede relatime.
However, I noticed that most mounts, not all, are using both options:
Two options control different things. relatime controls when inode time is updated and lazytime controls when updated time is written to disk.
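The split described here can be written down as two independent predicates. This is a toy model only, built from the wording of the two man page excerpts quoted at the start of the thread; it is not kernel code, the function names are invented, and the kernel's exact comparisons at the boundaries may differ. Times are plain seconds.

```python
DAY = 24 * 60 * 60  # seconds


def relatime_updates_atime(atime, mtime, ctime, now):
    """relatime: should a read update the (in-memory) access time?

    Per the man page: only if the previous atime is earlier than the current
    mtime or ctime, or if the atime is more than one day old.
    """
    return bool(atime < mtime or atime < ctime or (now - atime) > DAY)


def lazytime_flushes_timestamps(other_inode_change, sync_called,
                                inode_evicted, seconds_since_inode_write):
    """lazytime: should updated timestamps be written to disk now?

    Per the man page: only on an unrelated inode change, on fsync/syncfs/sync,
    on eviction of the inode from memory, or after more than 24 hours.
    """
    return bool(other_inode_change or sync_called or inode_evicted
                or seconds_since_inode_write > DAY)


if __name__ == "__main__":
    # File modified after it was last read: relatime lets the atime move...
    print(relatime_updates_atime(atime=100, mtime=200, ctime=200, now=300))
    # ...but with lazytime the new value stays in memory for the time being.
    print(lazytime_flushes_timestamps(False, False, False, 10))
```

In this model the first function decides whether the timestamp changes at all, and the second decides when a changed timestamp reaches the disk, which is why having both options on the same mount is not a contradiction.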
Should I use "norelatime" manually on fstab entries?
I guess it is up to you.
On 2017-09-27 05:53, Andrei Borzenkov wrote:
26.09.2017 21:36, Carlos E. R. wrote:
I understand that using this option [lazytime] (which is the default) supersedes using "relatime".
Neither is it default nor does it supersede relatime.
However, I noticed that most mounts, not all, are using both options:
Two options control different things. relatime controls when inode time is updated and lazytime controls when updated time is written to disk.
Should I use "norelatime" manually on fstab entries?
I guess it is up to you.
Ok, let's say that I would like the timestamps to be "recent", but not written to disk often :-) -- Cheers / Saludos, Carlos E. R. (from 42.2 x86_64 "Malachite" at Telcontar)
On 27/09/17 10:37 AM, Carlos E. R. wrote:
Ok, let's say that I would like the timestamps to be "recent", but not written to disk often :-)
Then in the generic case 'lazytime' won't help you. As I've pointed out, it only works with preallocated files, and as far as I can tell even in the case of the small file example where there was no alteration to the file allocation map (such an alteration would necessitate an inode write as well to ensure integrity), the inode still gets flushed when the file closes. So if you have a shell script that builds something by lots of appends to a file, it is going to result in a lot of inode flushing.

I _think_ there is a use case in your favour where, if you have a FS that allocates nice big 4k blocks at a time, then opening a file (in C or Perl or Ruby or Python) and doing lots of small writes that are not enough to allocate a new block, then the bytes hit the disk[1], unless you are buffering :-) But if you are buffering then the contents of the buffer aren't going to hit the disk.

In the former case, there will be updates to mtime but not to the allocation map, so lazytime helps. In the latter case mtime gets updated when the buffer is flushed.

Now if you have a 4K disk block and use a 512 byte buffer, you'll get a block allocation and inode write on the first flush, then not for the second through eighth, then the ninth will need a new block, so you update the map and get an inode flush as well. And so it goes.

The issue is that "preallocation" clause in the semantics of lazytime.

Anyway, that's how I interpret it for the non-database situation. I may be wrong about my interpretation. But if I'm not, then that's the best you'll get.

Remember: if you close the file, the inode gets flushed, so a shell script of:

    ....
    echo blah blah blah >> textfile.txt
    echo more blah >> textfile.txt
    ....

will result in the inode being flushed on each occasion.
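The block arithmetic in this argument is easy to check. The sketch below is a deliberately naive model with an invented function name: it assumes every flush appends exactly `buffer_size` bytes and that a new block is allocated only when the file grows past a multiple of `block_size`. Real filesystems (extents, delayed allocation, preallocation) behave differently, so this only illustrates the counting.

```python
def flushes_needing_new_block(block_size, buffer_size, n_flushes):
    """Return the 1-based numbers of the flushes that need a new block."""
    result = []
    blocks_allocated = 0
    written = 0
    for flush_no in range(1, n_flushes + 1):
        written += buffer_size
        blocks_needed = -(-written // block_size)   # ceiling division
        if blocks_needed > blocks_allocated:
            result.append(flush_no)
            blocks_allocated = blocks_needed
    return result


if __name__ == "__main__":
    # 4 KiB blocks, 512-byte flushes: eight flushes fit in one block.
    print(flushes_needing_new_block(4096, 512, 17))   # prints [1, 9, 17]
```

Under these assumptions only every eighth flush touches the allocation map; in this reading, the flushes in between are the ones whose timestamp updates lazytime could keep in memory.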
On 2017-09-27 18:30, Anton Aylward wrote:
On 27/09/17 10:37 AM, Carlos E. R. wrote:
Ok, let's say that I would like the timestamps to be "recent", but not written to disk often :-)
Then in the generic case 'lazytime' won't help you.
Yes it does. It is precisely what I want. The only doubt is whether I also want relatime or not. I think not. -- Cheers / Saludos, Carlos E. R. (from 42.2 x86_64 "Malachite" at Telcontar)
27.09.2017 17:37, Carlos E. R. wrote:
On 2017-09-27 05:53, Andrei Borzenkov wrote:
26.09.2017 21:36, Carlos E. R. wrote:
I understand that using this option [lazytime] (which is the default) supersedes using "relatime".
Neither is it default nor does it supersede relatime.
However, I noticed that most mounts, not all, are using both options:
Two options control different things. relatime controls when inode time is updated and lazytime controls when updated time is written to disk.
Should I use "norelatime" manually on fstab entries?
Contrary to expectation, "norelatime" does *not* disable the relatime flag. Use "strictatime" to do that.
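Whichever option ends up in fstab, the effective behaviour of a mount can be observed from user space by comparing a file's access time before and after reading it. The helper below is a small sketch with an invented name; it only reports what `os.stat()` returns (the value the kernel currently holds for the inode), so it can show the difference between noatime, relatime and strictatime, but it cannot show the delayed disk writes of lazytime.

```python
import os
import tempfile
import time


def atime_before_and_after_read(path):
    """Read the whole file and return its atime (ns) before and after."""
    before = os.stat(path).st_atime_ns
    with open(path, "rb") as f:
        f.read()
    after = os.stat(path).st_atime_ns
    return before, after


if __name__ == "__main__":
    # Create the test file in the current directory so that the filesystem
    # under test is the one you are standing in.
    with tempfile.NamedTemporaryFile(dir=".", delete=False) as tmp:
        tmp.write(b"some data\n")
        path = tmp.name
    try:
        for attempt in (1, 2):
            time.sleep(1)
            before, after = atime_before_and_after_read(path)
            print("read", attempt, "atime changed:", after != before)
    finally:
        os.unlink(path)
```

With strictatime one would expect every read to move the atime, with noatime none, and with relatime only the reads that meet the man page conditions; the script prints what actually happens rather than assuming it.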
I guess it is up to you.
Ok, let's say that I would like the timestamps to be "recent", but not written to disk often :-)
participants (3)
- Andrei Borzenkov
- Anton Aylward
- Carlos E. R.