[opensuse-factory] btrfs fun / disk full
Hello,

I just had an interesting[tm] problem - updating to the latest tumbleweed failed at random places.

It turned out that my btrfs / partition was full. Not with df -h (which reported some GB free), but "btrfs fi show" showed 100% usage. (Sorry for not including the exact output - I didn't save it.)

I moved 15 GB of libvirt images to a different partition and deleted some old snapshots, but neither helped.

After some searching (and temporarily breaking my /, which I could luckily repair with snapper rollback), I found out [1] that I should run

btrfs balance start / -dlimit=3

which freed quite some space. Another run freed even more space, so I decided to run it without the limit:

btrfs balance start /

After that, I'm down to

# btrfs fi show
Label: none  uuid: 9f4a918d-fcd4-45d3-a1dc-7b887300eabc
        Total devices 1 FS bytes used 22.91GiB
        devid    1 size 50.00GiB used 25.31GiB path /dev/mapper/cboltz-root

Even if you add the 15 GB of libvirt images back into the calculation, this still means the rebalance freed about 10 GB = 20% of the partition size.

So the good news is that I could solve the problem. The bad news is that it happened at all.

Should there be a cronjob that does the rebalancing or other btrfs maintenance regularly?

Bonus question: snapper list shows that snapshot #1 ("first root filesystem") is still kept. Is this snapshot needed for something, or can I safely delete it?

BTW, a funny detail:

# btrfs balance start /
WARNING: Full balance without filters requested. This operation is very
intense and takes potentially very long. It is recommended to use the
balance filters to narrow down the balanced data. Use 'btrfs balance start
--full-balance' option to skip this warning.
The operation printf will start in 10 seconds.
              ^^^^^^

I slightly doubt "printf" is the word that was wanted here ;-)

Regards,
Christian Boltz

[1] https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_I_ran_out_of_disk_space....
New Versions usually bring bug fixes and security fixes. ...and new bugs. Funny how people tend to forget about that. [> Michael Melcher and Michal Kubecek in opensuse-factory]
On 23.01.2016 01:33, Christian Boltz wrote:
Hello,
I just had an interesting[tm] problem - updating to the latest tumbleweed failed at random places.
It turned out that my btrfs / partition was full. Not with df -h (which reported some GB free), but "btrfs fi show" showed 100% usage. (Sorry for not including the exact output - I didn't save it.)
I moved 15 GB of libvirt images to a different partition and deleted some old snapshots, but neither helped.
After some searching (and temporarily breaking my /, which I could luckily repair with snapper rollback), I found out [1] that I should run btrfs balance start / -dlimit=3 which freed quite some space. Another run freed even more space, so I decided to run it without the limit: btrfs balance start /
After that, I'm down to
# btrfs fi show
Label: none  uuid: 9f4a918d-fcd4-45d3-a1dc-7b887300eabc
        Total devices 1 FS bytes used 22.91GiB
        devid    1 size 50.00GiB used 25.31GiB path /dev/mapper/cboltz-root
Even if you add the 15 GB of libvirt images back into the calculation, this still means the rebalance freed about 10 GB = 20% of the partition size.
So the good news is that I could solve the problem. The bad news is that it happened at all.
Should there be a cronjob that does the rebalancing or other btrfs maintenance regularly?
There is a btrfsmaintenance package that should be installed by default and provides cron scripts. A somewhat interesting implementation detail: these cron scripts are not installed directly; instead, a service installs them. And *this* service is disabled by default :)

systemctl enable btrfsmaintenance-refresh
systemctl start btrfsmaintenance-refresh

and check /etc/cron.{daily,weekly,monthly}

It is configurable in /etc/sysconfig/btrfsmaintenance
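For reference, the relevant knobs look roughly like this (a sketch of /etc/sysconfig/btrfsmaintenance; the variable names and values here are an assumption from memory, so check the shipped file for the authoritative list):

# /etc/sysconfig/btrfsmaintenance (excerpt, assumed defaults)
BTRFS_BALANCE_MOUNTPOINTS="/"     # filesystems to balance periodically
BTRFS_BALANCE_PERIOD="weekly"     # becomes the cron.weekly symlink
BTRFS_SCRUB_MOUNTPOINTS="/"       # filesystems to scrub periodically
BTRFS_SCRUB_PERIOD="monthly"      # becomes the cron.monthly symlink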
Bonus question: snapper list shows that snapshot #1 ("first root filesystem") is still kept. Is this snapshot needed for something, or can I safely delete it?
You can't. This is *the* root filesystem that you are actually using. If you mount the root of btrfs, you will find that it is practically empty.
Hello,

On Saturday, 23 January 2016, Andrei Borzenkov wrote:
On 23.01.2016 01:33, Christian Boltz wrote:
Should there be a cronjob that does the rebalancing or other btrfs maintenance regularly?
There is a btrfsmaintenance package that should be installed by default and provides cron scripts. A somewhat interesting implementation detail: these cron scripts are not installed directly; instead, a service installs them. And *this* service is disabled by default :)
I'd consider the "disabled by default" a bug - the btrfs maintenance is obviously needed, so having it disabled by default is, well, strange. I just reopened https://bugzilla.opensuse.org/show_bug.cgi?id=904518
systemctl enable btrfsmaintenance-refresh
systemctl start btrfsmaintenance-refresh
and check /etc/cron.{daily,weekly,monthly}
It is configurable in /etc/sysconfig/btrfsmaintenance
I enabled it now (with the default config), which created the cron.weekly/btrfs-balance.sh and cron.monthly/btrfs-scrub.sh symlinks. Thanks for pointing me in the right direction!

Regards,
Christian Boltz

--
"Can I do without a bootloader (lilo or grub) if there are only 2 partitions on the disk?" "Sure you can. Provided you can also do without booting the operating system." [> Wolfgang Erlenkötter and Hartmut Meyer in suse-linux]
On 22.01.2016 at 23:33, Christian Boltz wrote:
Hello,
I just had an interesting[tm] problem - updating to the latest tumbleweed failed at random places.
It turned out that my btrfs / partition was full. Not with df -h (which reported some GB free), but "btrfs fi show" showed 100% usage. (Sorry for not including the exact output - I didn't save it.)
I faced the problem some weeks ago and reported it here, but did not know the cause. I assumed my 50GB SSD was going bad, because zypper could not unpack the rpms it downloaded anymore. df showed enough free space, and I did not know about "btrfs fi show". I did know how to limit snapshots to a smaller-than-default number - so I thought that I would already "know" how to use btrfs. I had no idea something other than maybe bad blocks on the SSD could occupy space that df would not show. Nevertheless my gut feeling told me to delete some stuff, which helped temporarily. But I went back to ext4 at the point I could no longer boot / due to a "half-installed" zypper dup.
I found out [1] that I should run btrfs balance start / -dlimit=3 which freed quite some space. Another run freed even more space, so I decided to run it without the limit: btrfs balance start / After that, I'm down to
# btrfs fi show
Label: none  uuid: 9f4a918d-fcd4-45d3-a1dc-7b887300eabc
        Total devices 1 FS bytes used 22.91GiB
        devid    1 size 50.00GiB used 25.31GiB path /dev/mapper/cboltz-root
snip

To understand, I read https://btrfs.wiki.kernel.org/index.php/Balance_Filters and the FAQ. However, I do not understand much. What is all this "metadata", and what exactly does a balance filter do? I do understand what btrfs balance tries to fix (filesystem-full errors), but not how or why the problem exists. Can someone explain in less technical terms?
Should there be a cronjob that does the rebalancing or other btrfs maintenance regularly?
Andrei wrote in a reply to your mail that there is a cron job, but it is disabled by default. Should this be considered a bug? Shouldn't the default protect the "normal" btrfs user from silently (not shown by df) filling up the filesystem? Or would a meaningful error message be possible? Or a "warning popup" that says "please consider running btrfs balance to free up btrfs-occupied space"?

Thanks Christian and Andrei for the interesting insights so far :)

snip
https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_I_ran_out_of_disk_space....
On 23/01/2016 at 09:35, Thomas Langkamp wrote:
space, and I did not know about "btrfs fi show". I did know how to limit snapshots to a smaller than default number - so I thought that I would
I wonder if, on btrfs installs, we shouldn't trigger the replacement of df by a small script:

#!/bin/sh
btrfs fi show
df

because this problem occurs again and again, and the original df can be useful anyway. Maybe some alias could be enough, but trying this I notice "btrfs" is root-only, so there is a problem there, because df is not.

jdd
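A slightly more defensive variant of that wrapper idea (a sketch only; it assumes the script is installed under a different name than df so the real binary stays reachable, and it degrades to plain df for non-root users):

#!/bin/sh
# show btrfs chunk allocation when we are root, then always run the real df
if [ "$(id -u)" -eq 0 ]; then
    btrfs filesystem show 2>/dev/null
fi
exec /usr/bin/df "$@"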
On Sat, Jan 23, 2016 at 09:35:13AM +0100, Thomas Langkamp wrote:
On 22.01.2016 at 23:33, Christian Boltz wrote:
It turned out that my btrfs / partition was full. Not with df -h (which reported some GB free), but "btrfs fi show" showed 100% usage. (Sorry for not including the exact output - I didn't save it.)
I faced the problem some weeks ago and reported it here, but did not know the cause. I assumed my 50GB SSD was going bad, because zypper could not unpack the rpms it downloaded anymore. df showed enough free space, and I did not know about "btrfs fi show". I did know how to limit snapshots to a smaller-than-default number
The defaults are taken from /etc/snapper/config-templates/default and, for Factory with its regular stream of updates, are anything but optimal. My modified /etc/snapper/configs/root sets:

TIMELINE_LIMIT_HOURLY="10"
TIMELINE_LIMIT_DAILY="5"
TIMELINE_LIMIT_MONTHLY="1"
TIMELINE_LIMIT_YEARLY="0"

I'm no longer sure if that's the conclusion from discussions on this or some different list.

Cheers,
Lars

--
Lars Müller [ˈlaː(r)z ˈmʏlɐ]
Samba Team + SUSE Labs
SUSE Linux, Maxfeldstraße 5, 90409 Nürnberg, Germany
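The same limits can be applied without editing the file by hand, assuming a snapper version that has the set-config subcommand (a sketch, using the config key names quoted above):

snapper -c root set-config \
    TIMELINE_LIMIT_HOURLY=10 \
    TIMELINE_LIMIT_DAILY=5 \
    TIMELINE_LIMIT_MONTHLY=1 \
    TIMELINE_LIMIT_YEARLY=0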
On Sat, Jan 23, 2016 at 1:35 AM, Thomas Langkamp <thomas.lassdiesonnerein@gmx.de> wrote:
df showed enough free space, and I did not know about "btrfs fi show".
Ideally you shouldn't have to. But Btrfs still has some rough edges, and really it's quite different and the regular df command doesn't have the granularity to distinguish between types of free space that Btrfs can have: space only for metadata, only for data, and for either.
To understand, I read https://btrfs.wiki.kernel.org/index.php/Balance_Filters and the FAQ. However, I do not understand much. What is all this "metadata", and what exactly does a balance filter do? I do understand what btrfs balance tries to fix (filesystem-full errors), but not how or why the problem exists.

Can someone explain in less technical terms?
Basically there are two kinds of allocations in Btrfs: extents and chunks. A chunk is kinda like an uberextent or block group. Chunks are contiguously allocated from free space. Metadata (the file system itself) only gets written to metadata chunks, data (contents of files) only gets written to data chunks. The ratio of metadata to data isn't fixed, it depends on file size. So there isn't a way to know in advance the proportion of data to metadata chunks, which is why they're created on demand.

The problem is they aren't deallocated as efficiently as they're allocated, just because users tend not to delete files in the order they added them. As files get added and deleted, there can be many chunks that are partially full. You can end up in a situation where the file system first becomes nearly full (no unallocated space for a chunk to be created), but via file deletion it can seem there's plenty of unused space, but only for one type of chunk. The typical case is when data chunks have free space, but metadata chunks don't.

Because Btrfs is CoW, there is no overwriting. So even file deletion requires free space in a metadata chunk for the delete to happen. That's why it's possible to seemingly have enough space (in data chunks) but be unable to delete, because that requires free space in a metadata chunk.

Currently only completely empty chunks automatically revert to unallocated space (which is somewhat recent kernel code) and can then be allocated to a different chunk type. Still not yet automatic is a limited balance that can migrate extents in nearly empty chunks to full chunks, then delete the now-empty chunks so they can be reallocated as a different chunk type. That's why, if this problem crops up, the solution is still a manual balance.

The filters limit the balance in different ways, again for efficiency. A 'btrfs balance start -dusage=15' will limit the balance to data chunks that are 15% full, or less. That'll be pretty fast even on spinning disks, and yet it'll free up quite a few data chunks back into unallocated space from which metadata chunks can be created. Balance reorganizes the extents in chunks to be more efficiently organized. It's a kind of defragmentation at a chunk level, rather than file defragmentation, which is at an extent level. There are plans to make all of this more automatic on the kernel side. It's a work in progress.

Why have different data and metadata chunks? Data is replicated at the chunk level. If you check out a leaf that contains a bunch of file extent info, all of the lookups are logical addresses. It has nothing to say about what device or physical sector it's on or what the replication level is. That's what the chunk tree and device tree are for. And each chunk type can have different profiles to allocate chunks on multiple devices, with or without replication (single, raid0), and with different kinds of replication (dup, raid1/10/5/6). Balance also manages changing the profile for chunks and doing the proper replication.

I'd expect most general purpose users have only system and application files on Btrfs, being regularly snapshot, with a mix of files including big file downloads on /home on XFS. That's why most people don't have this problem. But if you happen to use virt-manager and qcow2 files, by default those will go in /var/lib/libvirt/images which is Btrfs. And they are subject to snapshots. So deleting the qcow2 doesn't free up much space since a good portion of their extents are anchored in snapshots.
Free up enough snapshots and it may just free up space in data chunks without completely emptying them, so that their space can revert to unallocated space from which a metadata chunk can be allocated, and thus avoid a seemingly bogus out-of-space error.

So my guess is that users using virt-manager should opt for ext4/XFS entirely, or flip things around and use the XFS volume meant for /home and mount it at /var/lib/libvirt/images instead, and use a Btrfs /home. Another way around it: if you're not familiar with GNOME Boxes, it's worth a look. It uses libvirt on the backend, but its qcow2 files are stored in /home.

--
Chris Murphy
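Condensing that into a recipe (a minimal sketch, assuming a nearly full btrfs root mounted at /; the usage thresholds are illustrative, not recommendations):

# check allocation vs. actual use, per chunk type
btrfs filesystem usage /
# relocate only data chunks that are at most 5% full - fast, frees whole chunks
btrfs balance start -dusage=5 /
# widen the filter if too little space returned to "unallocated"
btrfs balance start -dusage=15 /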
On 01/24/2016 03:22 PM, Chris Murphy wrote:
On Sat, Jan 23, 2016 at 1:35 AM, Thomas Langkamp <thomas.lassdiesonnerein@gmx.de> wrote:
df showed enough free space, and I did not know about "btrfs fi show".
Ideally you shouldn't have to. But Btrfs still has some rough edges, and really it's quite different and the regular df command doesn't have the granularity to distinguish between types of free space that Btrfs can have: space only for metadata, only for data, and for either.
To understand, I read https://btrfs.wiki.kernel.org/index.php/Balance_Filters and the FAQ. However, I do not understand much. What is all this "metadata", and what exactly does a balance filter do? I do understand what btrfs balance tries to fix (filesystem-full errors), but not how or why the problem exists.

Can someone explain in less technical terms?
Basically there are two kinds of allocations in Btrfs: extents and chunks. A chunk is kinda like an uberextent or block group. Chunks are contiguously allocated from free space. Metadata (the file system itself) only gets written to metadata chunks, data (contents of files) only get written to data chunks.
As someone who has always been curious about btrfs, thank you for this post, it was very informative! This actually makes sense to me :)
I'd expect most general purpose users have only system and application files on Btrfs, being regularly snapshot, with a mix of files including big file downloads on /home on XFS. That's why most people don't have this problem. But if you happen to use virt-manager and qcow2 files, by default those will go in /var/lib/libvirt/images which is Btrfs. And they are subject to snapshots. So deleting the qcow2 doesn't free up much space since a good portion of their extents are anchored in snapshots. Free up enough snapshots and it may just free up space in data chunks without completely emptying them so that their space can revert to unallocated space from which a metadata chunk can be allocated and thus avoid a seemingly bogus out of space error.
So my guess is that users using virt-manager should opt for ext4/XFS entirely, or flip things around and use the XFS volume meant for /home and mount it at /var/lib/libvirt/images instead, and use a Btrfs /home.
While /var/lib/libvirt/images is on btrfs, it's put in a subvolume by default. Doesn't this mean it won't be part of the regular snapshots? From what I understood, things in subvolumes aren't subject to snapshots users take with snapper.

--
Regards,
Uzair Shamim
On Sun, Jan 24, 2016 at 2:40 PM, Uzair Shamim <uzashamim@gmail.com> wrote:
On 01/24/2016 03:22 PM, Chris Murphy wrote:
On Sat, Jan 23, 2016 at 1:35 AM, Thomas Langkamp <thomas.lassdiesonnerein@gmx.de> wrote:
df showed enough free space, and I did not know about "btrfs fi show".
Ideally you shouldn't have to. But Btrfs still has some rough edges, and really it's quite different and the regular df command doesn't have the granularity to distinguish between types of free space that Btrfs can have: space only for metadata, only for data, and for either.
To understand, I read https://btrfs.wiki.kernel.org/index.php/Balance_Filters and the FAQ. However, I do not understand much. What is all this "metadata", and what exactly does a balance filter do? I do understand what btrfs balance tries to fix (filesystem-full errors), but not how or why the problem exists.

Can someone explain in less technical terms?
Basically there are two kinds of allocations in Btrfs: extents and chunks. A chunk is kinda like an uberextent or block group. Chunks are contiguously allocated from free space. Metadata (the file system itself) only gets written to metadata chunks, data (contents of files) only get written to data chunks.
As someone who has always been curious about btrfs, thank you for this post, it was very informative! This actually makes sense to me :)
There is a "mixed block group" which is a chunk used for both metadata and data. It's available only at mkfs time with the -M flag. This is automatic for 1GiB and smaller file systems, it's recommended for 5GiB and smaller. I'm not sure why there aren't patches yet to make that automatic too. There's some discussion that maybe it's advantageous for anything about the size of a USB stick (16GiB) may be better suited using -M mixed block group. It tends to be smaller file systems that end up in situations where there's space in data chunks but no space left for metadata chunks, and mixed block group avoids this problem. The negative of always doing this is it's less efficient. When mixed, the nodesize must be the same as the sector size (this is a logical sector in Btrfs terms, not a physical sector, and right now it's fixed to pagesize so on x86 that's 4096 bytes). That means less liklihood small files can be stored inline with their metadata. Example: item 75 key (4450 INODE_ITEM 0) itemoff 6425 itemsize 160 inode generation 189 transid 189 size 3964 nbytes 3964 block group 0 mode 100700 links 1 uid 1000 gid 1000 rdev 0 flags 0x0 item 76 key (4450 INODE_REF 907) itemoff 6402 itemsize 23 inode ref index 22 namelen 13 name: uefi-grub.cfg item 77 key (4450 XATTR_ITEM 3817753667) itemoff 6319 itemsize 83 location key (0 UNKNOWN.0 0) type XATTR namelen 16 datalen 37 name: security.selinux data unconfined_u:object_r:unlabeled_t:s0 item 78 key (4450 EXTENT_DATA 0) itemoff 2334 itemsize 3985 inline extent data size 3964 ram 3964 compress 0 This 16KiB node actually contains references for 79 items, four of which are for a small file (4KiB) whose data is stored inline in this node along with its metadata (you can see the selinux label context and filename). Had the nodesize been 4KiB, this file's data extent would be elsewhere rather than inline.
While /var/lib/libvirt/images is on btrfs, it's put in a subvolume by default. Doesn't this mean it won't be part of the regular snapshots? From what I understood, things in subvolumes aren't subject to snapshots users take with snapper.
Ahh, I don't have an openSUSE install handy at the moment. Oops. Yes if the images are contained in a subvolume other than the top level of the file system, then top level snapshots done by snapper do not recursively snapshot the contents of nested subvolumes.

But what does still happen is VM images cause data chunks to be allocated faster than metadata chunks as the VM image grows. So it's possible to fill up a volume with a high data to metadata chunk ratio that isn't released even if most of the VM's are deleted. You'd have a lot of nearly empty data chunks but can still have totally full metadata chunks and end up with ENOSPC.

--
Chris Murphy
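A quick way to see the nesting behavior on an openSUSE install (a sketch assuming the default snapper layout; snapshot 1 is just an example):

# nested subvolumes are listed with their own IDs
btrfs subvolume list / | grep libvirt
# inside a snapper snapshot a nested subvolume shows up as an empty directory
ls /.snapshots/1/snapshot/var/lib/libvirt/images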
Hello all, On 2016-01-24 T 16:25 -0700 Chris Murphy wrote:
On Sun, Jan 24, 2016 at 2:40 PM, Uzair Shamim wrote:
While /var/lib/libvirt/images is on btrfs, it's put in a subvolume by default. Doesn't this mean it won't be part of the regular snapshots? From what I understood, things in subvolumes aren't subject to snapshots users take with snapper.
Ahh, I don't have an openSUSE install handy at the moment. Oops. Yes if the images are contained in a subvolume other than the top level of the file system, then top level snapshots done by snapper do not recursively snapshot the contents of nested subvolumes.
indeed.
But what does still happen is VM images cause data chunks to be allocated faster than metadata chunks as the VM image grows. So it's possible to fill up a volume with a high data to metadata chunk ratio that isn't released even if most of the VM's are deleted. You'd have a lot of nearly empty data chunks but can still have totally full metadata chunks and end up with ENOSPC.
That's correct in theory. However, as a "proactive remedy", subvolumes such as /var/lib/libvirt should be marked as "NoCOW", to reduce fragmentation and metadata overhead. If this is not the case in Tumbleweed yet, I suggest opening a bug report (you can check with "lsattr -a /var/lib/libvirt").

So long -
MgE

--
Matthias G. Eckermann, Director Product Management SUSE Linux Enterprise
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
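Marking a directory NoCOW by hand looks like this (a sketch; note the C attribute only affects files created after it is set, so it should be applied while the directory is still empty):

# new files created below this directory will be NOCOW
chattr +C /var/lib/libvirt/images
# verify: the C attribute appears in the listing
lsattr -d /var/lib/libvirt/images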
On Mon, Jan 25, 2016 at 12:30 AM, Matthias G. Eckermann <mge@suse.com> wrote:
Hello all,
On 2016-01-24 T 16:25 -0700 Chris Murphy wrote:
But what does still happen is VM images cause data chunks to be allocated faster than metadata chunks as the VM image grows. So it's possible to fill up a volume with a high data to metadata chunk ratio that isn't released even if most of the VM's are deleted. You'd have a lot of nearly empty data chunks but can still have totally full metadata chunks and end up with ENOSPC.
That's correct in theory, however as a "proactive remedy", subvolumes such as /var/lib/libvirt should be marked as "NoCOW", to reduce fragmentation and metadata overhead. If this is not the case in Tumbleweed yet, I suggest to open a bugreport (you can check with "lsattr -a /var/lib/libvirt").
The Tumbleweed installation/partitioning overview shows nocow for several subvolumes, including /var/lib/libvirt/images. This is mainly for performance reasons, though, as in short order raw or qcow2 files will have tens of thousands of fragments. This is much less of a problem on an SSD than a HDD. The fragmentation also depends on the guest file system (there are reports on the Btrfs list that NTFS is particularly brutal) and also on the libvirt/qemu caching method.

I've been using Btrfs in the guest, qcow2 file on Btrfs on the host with unsafe caching, for quite some time with good performance, not super hideous fragmentation, and no corruption even when the VM crashes or is force quit. But it's not a configuration recommended for production. Honestly I think it's better to point guest Btrfs to a VG/LV.

A drawback to nocow is it also implies nodatasum and disables compression. So it's a tradeoff.

--
Chris Murphy
Hello,

On Monday, 25 January 2016, 08:30:23 CET, Matthias G. Eckermann wrote:
subvolumes such as /var/lib/libvirt should be marked as "NoCOW", to reduce fragmentation and metadata overhead. If this is not the case in Tumbleweed yet, I suggest opening a bug report (you can check with "lsattr -a /var/lib/libvirt").
# lsattr -a /var/lib/libvirt
---------------- /var/lib/libvirt/.
---------------- /var/lib/libvirt/..
---------------C /var/lib/libvirt/images
---------------- /var/lib/libvirt/boot
---------------- /var/lib/libvirt/filesystems
---------------- /var/lib/libvirt/uml
---------------- /var/lib/libvirt/dnsmasq
---------------- /var/lib/libvirt/network
---------------- /var/lib/libvirt/libxl
---------------- /var/lib/libvirt/qemu
---------------- /var/lib/libvirt/lxc

The initial installation of this Tumbleweed was about 3 months ago. So only the 'images' subdirectory is NoCOW.

I don't know if the other subdirectories would also benefit from NoCOW because I don't use libvirt. Well, actually I had some trouble getting started (might be my bug-finding luck again), and no time to dig deeper.

Regards,
Christian Boltz

--
"Bit collisions mostly happen in pubs and beer tents, I would assume." [H. Bengen] "And too many of them lead to a stomach overflow?" [L. Barth]
On 02/02/2016 04:32 PM, Christian Boltz wrote:
Hello,
On Monday, 25 January 2016, 08:30:23 CET, Matthias G. Eckermann wrote:
subvolumes such as /var/lib/libvirt should be marked as "NoCOW", to reduce fragmentation and metadata overhead. If this is not the case in Tumbleweed yet, I suggest opening a bug report (you can check with "lsattr -a /var/lib/libvirt").
# lsattr -a /var/lib/libvirt
---------------- /var/lib/libvirt/.
---------------- /var/lib/libvirt/..
---------------C /var/lib/libvirt/images
---------------- /var/lib/libvirt/boot
---------------- /var/lib/libvirt/filesystems
---------------- /var/lib/libvirt/uml
---------------- /var/lib/libvirt/dnsmasq
---------------- /var/lib/libvirt/network
---------------- /var/lib/libvirt/libxl
---------------- /var/lib/libvirt/qemu
---------------- /var/lib/libvirt/lxc
The initial installation of this Tumbleweed was about 3 months ago.
So only the 'images' subdirectory is NoCOW.
I don't know if the other subdirectories would also benefit from NoCOW because I don't use libvirt. Well, actually I had some trouble getting started (might be my bug-finding luck again), and no time to dig deeper.
Regards,
Christian Boltz
I doubt it will make much of a difference for the other /var/lib/libvirt subdirectories, because images is the one with very high write traffic to disk (since VM files constantly change as they run), so it's not really useful to have CoW enabled there.

--
Regards,
Uzair Shamim
On 24.01.2016 at 21:22, Chris Murphy wrote:
On Sat, Jan 23, 2016 at 1:35 AM, Thomas Langkamp <thomas.lassdiesonnerein@gmx.de> wrote:
df showed enough free space, and I did not know about "btrfs fi show".
Ideally you shouldn't have to. But Btrfs still has some rough edges, and really it's quite different and the regular df command doesn't have the granularity to distinguish between types of free space that Btrfs can have: space only for metadata, only for data, and for either.
To understand, I read https://btrfs.wiki.kernel.org/index.php/Balance_Filters and the FAQ. However, I do not understand much. What is all this "metadata", and what exactly does a balance filter do? I do understand what btrfs balance tries to fix (filesystem-full errors), but not how or why the problem exists.

Can someone explain in less technical terms?
Basically there are two kinds of allocations in Btrfs: extents and chunks.
snip

Wow :D Very nice read. Thank you for this simple but detailed explanation. This should go into the btrfs-wiki :D
On Sun, Jan 24, 2016 at 1:22 PM, Chris Murphy <lists@colorremedies.com> wrote:

You can end up in a situation where the file system
first becomes nearly full (no unallocated space for a chunk to be created), but via file deletion it can seem there's plenty of unused space, but only for one type of chunk. The typical case is when data chunks have free space, but metadata chunks don't.
A better qualification is to say "when people have bad disk full Btrfs experiences", the typical case is that data chunks have free space but metadata chunks don't. If metadata chunks are completely full, and there's no unallocated space to create more metadata chunks, the file system will ENOSPC and some really unpredictable things can happen, including the inability to delete files.

OK, so today the opposite happened to me. Data chunks became totally full, but there was still unused space in metadata chunks. Result? Graceful failure: I could easily delete some files and didn't even have to reboot. What happened was an intentionally misconfigured VM went a bit crazy, overfilled some qcow2 files on the host, and then this:

[root@f23m images]# btrfs fi usage /
Overall:
    Device size:                  56.00GiB
    Device allocated:             56.00GiB
    Device unallocated:            1.00MiB
    Device missing:                  0.00B
    Used:                         55.33GiB
    Free (estimated):              1.22MiB      (min: 1.22MiB)
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:              128.00MiB      (used: 0.00B)

Data,single: Size:54.97GiB, Used:54.97GiB
   /dev/sda7      54.97GiB

Metadata,single: Size:1.00GiB, Used:370.31MiB
   /dev/sda7       1.00GiB

System,single: Size:32.00MiB, Used:16.00KiB
   /dev/sda7      32.00MiB

Unallocated:
   /dev/sda7       1.00MiB

And of course (regular) df says it's 100% full, which is more true than not. But see, there's 700MiB unused in metadata chunks. So it was not a problem for the file system to delete some big qcow2 files and continue on without further problems. It may be this case is really common; because it's so graceful, no one would complain about it.

Some time later however:

[root@f23m images]# btrfs fi usage /
Overall:
    Device size:                  56.00GiB
    Device allocated:             52.97GiB
    Device unallocated:            3.03GiB
    Device missing:                  0.00B
    Used:                         22.45GiB
    Free (estimated):             32.87GiB      (min: 32.87GiB)
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:              128.00MiB      (used: 0.00B)

Data,single: Size:51.94GiB, Used:22.10GiB
   /dev/sda7      51.94GiB

Metadata,single: Size:1.00GiB, Used:358.73MiB
   /dev/sda7       1.00GiB

System,single: Size:32.00MiB, Used:16.00KiB
   /dev/sda7      32.00MiB

Unallocated:
   /dev/sda7       3.03GiB

A lot of "free" space is not unallocated. Those are data chunks only, and they aren't reverting back to unallocated. Only 3GiB of unallocated space is available for becoming metadata chunks. So if I were to do a bunch of small file writes (very metadata-intensive writing), it's possible I'll end up in the originally described situation: plenty of unused space in data chunks, but full metadata chunks and no more unallocated space to create more metadata chunks. That's asking for what I call a "wedged into a corner" file system.

There is almost always a way out though, which is 'btrfs device add /dev/sdXY /'. By just pointing this at a small USB stick or a partition on it, there can be just enough unallocated space for a metadata chunk; now files can be deleted and the fs can be balanced with a -dusage=15 limit. Then the stick is removed with 'btrfs dev del /dev/sdXY /', and any little bit of data or metadata that has since been put on the stick will get moved back to the original/big partition. Done.

But for now I'm going to pre-empt that:

[root@f23m images]# btrfs balance start -dusage=15 /
Done, had to relocate 25 out of 54 chunks

This took about 20 seconds on an SSD.
And now:

[root@f23m images]# btrfs fi usage /
Overall:
    Device size:                  56.00GiB
    Device allocated:             29.03GiB
    Device unallocated:           26.97GiB
    Device missing:                  0.00B
    Used:                         22.45GiB
    Free (estimated):             32.87GiB      (min: 32.87GiB)
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:              128.00MiB      (used: 0.00B)

Data,single: Size:28.00GiB, Used:22.10GiB
   /dev/sda7      28.00GiB

Metadata,single: Size:1.00GiB, Used:358.55MiB
   /dev/sda7       1.00GiB

System,single: Size:32.00MiB, Used:16.00KiB
   /dev/sda7      32.00MiB

Unallocated:
   /dev/sda7      26.97GiB

That's much better. There's now only 6GiB of unused data chunks, and there's no point doing a more aggressive balance for that anyway. I could even have done -dusage=5 or 10 and any future concern would be gone.

--
Chris Murphy
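The "wedged into a corner" escape hatch from above, condensed (a sketch; /dev/sdc1 stands in for any small spare device such as a USB-stick partition):

# 1. lend the filesystem some unallocated space so CoW deletions can proceed
btrfs device add /dev/sdc1 /
# 2. delete files, then compact mostly-empty data chunks
btrfs balance start -dusage=15 /
# 3. migrate everything back off the spare device and drop it again
btrfs device delete /dev/sdc1 /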
On Tue, 26 Jan 2016 09:18, Chris Murphy wrote:
On Sun, Jan 24, 2016 at 1:22 PM, Chris Murphy wrote:
You can end up in a situation where the file system
first becomes nearly full (no unallocated space for a chunk to be created), but via file deletion it can seem there's plenty of unused space, but only for one type of chunk. The typical case is when data chunks have free space, but metadata chunks don't.
A better qualification is to say "when people have bad disk full Btrfs experiences" the typical case is because data chunks have free space but metadata chunks don't. If metadata chunks are completely full, and there's no unallocated space to create more metadata chunks, the file system will ENOSPC and some really unpredictable things can happen including inability to delete files.
OK so today, the opposite happened to me. Data chunks became totally full, but there was still unused space in metadata chunks. Result? Graceful failure and I could easily delete some files and didn't even have to reboot. What happened was an intentionally misconfigured VM went a bit crazy, overfilled some qcow2 files on the host, and then this:
[root@f23m images]# btrfs fi usage /
Overall:
    Device size:                  56.00GiB
    Device allocated:             56.00GiB
    Device unallocated:            1.00MiB
    Device missing:                  0.00B
    Used:                         55.33GiB
    Free (estimated):              1.22MiB      (min: 1.22MiB)
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:              128.00MiB      (used: 0.00B)
Data,single: Size:54.97GiB, Used:54.97GiB /dev/sda7 54.97GiB
Metadata,single: Size:1.00GiB, Used:370.31MiB /dev/sda7 1.00GiB
System,single: Size:32.00MiB, Used:16.00KiB /dev/sda7 32.00MiB
Unallocated: /dev/sda7 1.00MiB
And of course (regular) df says it's 100% full, which is more true than not. But see there's 700MiB unused in metadata chunks. So it was not a problem for the file system to delete some big qcow2 files and continue on without further problems. It may be this is really common because it's so graceful, no one would complain about it.
Some time later however:
[root@f23m images]# btrfs fi usage /
Overall:
    Device size:                  56.00GiB
    Device allocated:             52.97GiB
    Device unallocated:            3.03GiB
    Device missing:                  0.00B
    Used:                         22.45GiB
    Free (estimated):             32.87GiB      (min: 32.87GiB)
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:              128.00MiB      (used: 0.00B)
Data,single: Size:51.94GiB, Used:22.10GiB /dev/sda7 51.94GiB
Metadata,single: Size:1.00GiB, Used:358.73MiB /dev/sda7 1.00GiB
System,single: Size:32.00MiB, Used:16.00KiB /dev/sda7 32.00MiB
Unallocated: /dev/sda7 3.03GiB
A lot of "free" space is not unallocated. Those are data chunks only, and they aren't reverting back to unallocated. Only 3GiB of unallocated is available for becoming metadata chunks. So if I were to do a bunch of small file writes, very metadata intensive writing, it's possible I'll end up in the originally described situation: Plenty of unused space in data chunks, but full metadata chunks and no more unallocated space to create more metadata chunks. That's asking for what I call a "wedged into a corner" file system. There is almost always a way out though, which is to 'btrfs device add /dev/sdXY /" by just pointing this to a small USB stick or partition on it, it can be just enough unallocated space for a metadata chunk, now files can be deleted and the fs can be balanced with a -dusage=15 limit, and then the stick removed 'btrfs dev del /dev/sdXY /" and any little bit of data or metadata that's since been put on the stick will get moved back to the original/big partition. Done.
But for now I'm going to pre-empt that:
[root@f23m images]# btrfs balance start -dusage=15 /
Done, had to relocate 25 out of 54 chunks
This took about 20 seconds on an SSD. And now:
[root@f23m images]# btrfs fi usage /
Overall:
    Device size:                  56.00GiB
    Device allocated:             29.03GiB
    Device unallocated:           26.97GiB
    Device missing:                  0.00B
    Used:                         22.45GiB
    Free (estimated):             32.87GiB      (min: 32.87GiB)
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:              128.00MiB      (used: 0.00B)
Data,single: Size:28.00GiB, Used:22.10GiB /dev/sda7 28.00GiB
Metadata,single: Size:1.00GiB, Used:358.55MiB /dev/sda7 1.00GiB
System,single: Size:32.00MiB, Used:16.00KiB /dev/sda7 32.00MiB
Unallocated: /dev/sda7 26.97GiB
That's much better. There's now only 6GiB unused data chunks, and there's no point doing a more aggressive balance for that anyway. I could even have done -dusage=5 or 10 and any future concern would be gone.
+1 This gets added to my des-rec-file.

IMHO this post of yours should be part of the btrfs-man-page, the upstream btrfs wiki, and the opensuse btrfs wiki page.

Because in the man page there are too few recipes for disaster recovery like you have shown here.

The inclusion of recipes like this in the aforementioned places would enable help-seeking users to FIND this information (please mention a date and kernel version for easier maintenance).

Thank you very much,
- Yamaban
On Tuesday 26 Jan 2016 13:05:42 Yamaban wrote:
+1 This gets added to my des-rec-file.
IMHO this post of yours should be part of the btrfs-man-page, the upstream btrfs wiki, and the opensuse btrfs wiki page.
Because in the man page there are too few recipes for disaster recovery like you have shown here.
The inclusion of recipes like this in the aforementioned places would enable help-seeking users to FIND this information (please mention a date and kernel version for easier maintenance).
Thank you very much,
- Yamaban
I would even agree to its inclusion as an advanced quick start to btrfs for users who are trying it out for the first time in openSUSE. I was able to free up 3 GB of space thanks to this thread (which is non-trivial on a 40 GB / partition that is already 23 GB full).
On Tue, Jan 26, 2016 at 5:05 AM, Yamaban <foerster@lisas.de> wrote:
On Tue, 26 Jan 2016 09:18, Chris Murphy wrote:
That's much better. There's now only 6GiB unused data chunks, and there's no point doing a more aggressive balance for that anyway. I could even have done -dusage=5 or 10 and any future concern would be gone.
+1 This gets added to my des-rec-file.
IMHO this post of yours should be part of the btrfs-man-page, the upstream btrfs wiki, and the opensuse btrfs wiki page.
Because in the man page there are too few recipes for disaster recovery like you have shown here.
The inclusion of recipes like this in the aforementioned places would enable help-seeking users to FIND this information (please mention a date and kernel version for easier maintenance).
It's not concise enough for that, in my opinion. Plus there's so much heavy development in Btrfs now that anything said about it can become obsolete quickly. If it were possible to not only date content, but set an expiration date for it in advance, then that'd be helpful. Stale content can get people into trouble, and it leads to confusion.

My suggestion is that the bulk of this should be on the Btrfs wiki. Like all things open source, that wiki is a volunteer effort. Maybe it needs some reorganization and indexing, just because of the way it's grown. And then add an examples page. A slight bit of difficulty is that upstream is on very new kernels and btrfs-progs, and while important bug fixes are backported, new features may not be. So there can sometimes be considerable differences in behavior just due to versions.

What would be great would be to have the organization such that a subset of upstream pages have content that is very user focused and applies to all distributions, and then distributions can "subscribe" to that content. It's completely reasonable that most openSUSE users will want to use primarily openSUSE documentation and resources, rather than having to scour the net finding various upstreams for reliable information.

--
Chris Murphy
Chris Murphy wrote:
[root@f23m images]# btrfs balance start -dusage=15 /
Done, had to relocate 25 out of 54 chunks
This took about 20 seconds on an SSD.
I did something similar, ‘btrfs balance start -dusage=10’, on a HDD a few days ago. That took > 5 hours! And the disk wasn’t even close to full – around 80 GiB free (out of about 1.8 TiB).

--
Karl Ove Hufthammer
Hello Karl and all, On 2016-01-26 T 22:48 +0100 Karl Ove Hufthammer wrote:
Chris Murphy wrote:
[root@f23m images]# btrfs balance start -dusage=15 /
Done, had to relocate 25 out of 54 chunks
This took about 20 seconds on an SSD.
I did something similar, ‘btrfs balance start -dusage=10’, on a HDD a few days ago. That took > 5 hours! And the disk wasn’t even close to full – around 80 GiB free (out of about 1.8 TiB).
Well, this means the filesystem is about 96% full, and that is really full. All known filesystems start to become "nervous" if filled more than 90%.

With respect to the 5 hours: yes, if you run a balance for the first time on a full filesystem, this may happen, unfortunately. The scripts in the "btrfsmaintenance" package are meant to run regularly, and doing so will prevent any balance job from running long. Starting the btrfsmaintenance runs with an empty filesystem will lead to a happy experience.

So long -
MgE

--
Matthias G. Eckermann, Director Product Management SUSE Linux Enterprise
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
On Tue, Jan 26, 2016 at 2:48 PM, Karl Ove Hufthammer <karl@huftis.org> wrote:
Chris Murphy wrote:
[root@f23m images]# btrfs balance start -dusage=15 /
Done, had to relocate 25 out of 54 chunks
This took about 20 seconds on an SSD.
I did something similar, ‘btrfs balance start -dusage=10’, on a HDD a few days ago. That took > 5 hours! And the disk wasn’t even close to full – around 80 GiB free (out of about 1,8 TiB).
Well, that is in fact close to full, percentage-wise. It's true that even 1% of such large drives these days translates into a lot of free space remaining, so the file system should still work, of course. But the fs+drive write from the outside to the inside of the platters, and the inside performs dramatically worse than the outside. The writes can be 50% slower. And if the files are above nodesize (16KiB by default) but are still small, and there are many of them, or if they are highly fragmented files, then it can take even longer to do the balance.

It's also possible you've run into some kind of edge-case bug. However, it's probably more of an optimization issue than an overt bug. Before it could be fixed, it would need a lot of modeling: what's the current state of the file system on disk (btrfs-image and/or btrfs-debug-tree), what is the workload at the time other than balance, what sorts of files are being balanced (harder to figure out), and then how to gather kernel data to see if there are unusual delays, i.e. sysrq + t while the balance is running slowly.

And then a developer has to take an active interest in this information, and see about optimizing it. While it might seem they'd sooner choose to develop features, or do bug fixes, sometimes the optimization edge cases expose a problem that's causing other problems. So it's not a bad idea to have all the modeling information in a bug report. But also keep in mind that even at 96% full the file system hasn't face-planted! That's good.

--
Chris Murphy
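The data collection described above, spelled out (a sketch with early-2016 tool names; btrfs-image captures metadata only, no file contents, and is best run while the filesystem is unmounted):

# metadata-only image developers can replay (-c9: max compression, -t4: 4 threads)
btrfs-image -c9 -t4 /dev/sdX1 /tmp/fs.image
# dump of the on-disk trees
btrfs-debug-tree /dev/sdX1 > /tmp/debug-tree.txt
# while the slow balance runs: dump kernel task states into the kernel log
echo t > /proc/sysrq-trigger    # requires sysrq to be enabled
dmesg > /tmp/task-states.txt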
Chris Murphy writes:
A better qualification is to say "when people have bad disk full Btrfs experiences" the typical case is because data chunks have free space but metadata chunks don't. If metadata chunks are completely full, and there's no unallocated space to create more metadata chunks, the file system will ENOSPC and some really unpredictable things can happen including inability to delete files. […]
I just wanted to let you know that this post saved my bacon yesterday, when the root fs ran out of metadata chunks while downloading the update packages and systemd decided it should roll over the logs at the same time. With over 30GiB free according to df, I would not have known what to make of all these ENOSPC errors spewing from dmesg. A rebalance didn't work until I deleted a few old snapshots, and then it took quite some time, but it did recover some 20GiB to "unallocated", so it should be a while before this happens again (I hope).

Regards,
Achim.

--
+<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+
Factory and User Sound Singles for Waldorf Q+, Q and microQ:
http://Synth.Stromeko.net/Downloads.html#WaldorfSounds
On Fri, Jan 22, 2016 at 3:33 PM, Christian Boltz <opensuse@cboltz.de> wrote:
Hello,
I just had an interesting[tm] problem - updating to the latest tumbleweed failed at random places.
It turned out that my btrfs / partition was full. Not with df -h (which reported some GB free), but "btrfs fi show" showed 100% usage. (Sorry for not including the exact output - I didn't save it.)
I'd say this shouldn't happen. But more information is needed to understand what's happening and give an explanation. There are two related issues that come up in these cases.

1. Near the time the volume becomes close to full, large files are being written. This means unallocated space is allocated as data chunks, which means it can't be used for metadata. One or more files are deleted, and then smaller files are written, such as application or system updates, which are metadata-heavy changes. But the existing metadata chunks don't have enough space for these changes. Now the file system is considered full even though there's unused space in data chunks.

2. Btrfs is a copy-on-write file system, which means even file deletion requires space to write the change, since there's no overwriting. So it's even possible to get into a rare but particularly annoying situation where it's not possible to delete files. The workaround for this is to add a small device, delete files, balance with -dusage=15 (other values will work OK too; this is a suggestion that should go pretty fast but also free up a lot of space), then remove the device from the Btrfs volume.

For space-related problems, it's best to include in the post:

df
btrfs filesystem show
btrfs filesystem df
btrfs filesystem usage

It is a bit tedious. The idea is that df should be reliable; lots of discussions have happened on the Btrfs list about it, and the behavior has changed a few times based on those conversations. And 'btrfs fi usage' is meant to be used to get more information than the (normal) df command. The 'fi df' and 'fi show' subcommands are mainly used now for troubleshooting and explaining behavior that doesn't meet expectations from df and 'fi usage'. So usually, for most users, 'fi usage' should be enough.
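To make that report less tedious to assemble, the four commands can be captured in one go (a plain-shell sketch; run as root, and the output file path is arbitrary):

for c in "df -h" "btrfs filesystem show" "btrfs filesystem df /" "btrfs filesystem usage /"; do
    echo "== $c =="
    $c
done > /tmp/btrfs-space-report.txt 2>&1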
I moved 15 GB of libvirt images to a different partition and deleted some old snapshots, but neither helped.
After some searching (and temporarily breaking my /, which I could luckily repair with snapper rollback), I found out [1] that I should run btrfs balance start / -dlimit=3 which freed quite some space.
That's normal right now. The kernel code only deletes chunks when they become completely empty.

Another run freed even more space, so I
decided to run it without the limit: btrfs balance start /
After that, I'm down to
# btrfs fi show
Label: none  uuid: 9f4a918d-fcd4-45d3-a1dc-7b887300eabc
        Total devices 1 FS bytes used 22.91GiB
        devid    1 size 50.00GiB used 25.31GiB path /dev/mapper/cboltz-root
Even if you add the 15 GB of libvirt images back into the calculation, this still means the rebalance freed about 10 GB = 20% of the partition size.
So the good news is that I could solve the problem. The bad news is that it happened at all.
I'd say the behavior you describe is becoming less common, is suboptimal when it happens, and Btrfs is being improved.
Should there be a cronjob that does the rebalancing or other btrfs maintenance regularly?
If something does this, it's really just masking the problem. I think the developers would prefer for there to be some minor problems like this and get user reports, so they can try to fix and fine-tune the behavior, rather than having it masked. It's been a challenging problem to solve. So I can see why the maintenance script is provided but not enabled by default.

--
Chris Murphy
participants (13)
- Achim Gratz
- Andrei Borzenkov
- Chan Ju Ping
- Chris Murphy
- Christian Boltz
- jdd
- Karl Ove Hufthammer
- Lars Müller
- Matthias G. Eckermann
- Thomas Langkamp
- tomtomme
- Uzair Shamim
- Yamaban