f2fs root enablement for VM use case?

Greetings,

I have been engaged in various discussions about having f2fs re-added to Tumbleweed to support the most common use case I see for SUSE-family systems, which is flash-backed VMs. When I started using SUSE-family systems in 2015, most of the systems were physical, and snapper/btrfs were an attractive option coming from Solaris with LiveUpgrade/ZFS. In the past year, though, I haven't worked with any physical SUSE systems, but have worked with hundreds of VMs. At this scale, things like journaling, CoW and disk-backed swap add up to a lot of unneeded IO to the shared backing storage on the storage arrays and HCI used by the virtualization hosts.

As a result, I would like to start proving out f2fs on some VMs in a lab environment mirroring production. Would the openSUSE project be open to adding this filesystem back as a supported option from an upstream-first perspective? If so, are there any steps I could take, such as filing BZs, that could help work toward that goal?

So far, I have opened a few bugs to work toward this. It appears that 1184415 is on track to have the module blacklist lifted, but the other two, 1184416 and 1184417, were closed WONTFIX. The rationale given for those two was: "With YaST we target both distributions and therefore we need consistent support everywhere." What I'm trying to figure out now is whether this means the first opportunity for f2fs enablement in Tumbleweed would be when openSUSE Leap 16 is announced, or whether f2fs root as an option in the installer is off the table indefinitely.

Peter

On Tuesday 2021-04-13 20:41, peter.clark@ngic.com wrote:
You could just transfer the files to a new filesystem after installation. Since your bugzilla reports mentioned a VM environment, doing so should be even easier than on real hardware. In fact, maybe you can consider a KIWI image over yast. Branching kiwi-template-JeOS and stuffing f2fs in there might just be sufficient, and it's a lot faster to install than from a yast-based medium, too.
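A rough sketch of what that branching could look like, assuming an existing osc setup; the openSUSE:Factory project path and the build invocation are guesses, the package name is simply the one given above, and whether KIWI accepts f2fs as a root filesystem is exactly what would need verifying:

  # branch the template package and check it out (names illustrative)
  osc branch openSUSE:Factory kiwi-template-JeOS
  osc checkout home:$USER:branches:openSUSE:Factory kiwi-template-JeOS
  cd home:$USER:branches:openSUSE:Factory/kiwi-template-JeOS
  # in the .kiwi description, change the filesystem attribute of the
  # <type> element to "f2fs" (assuming the installed KIWI supports it),
  # then do a local test build
  osc build   # pick the repository/arch osc offers if it asks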

Jan Engelhardt wrote: [...]
Having not used any of the tooling involving kiwi directly before, would a procedure along the lines of the one described in this Debian guide be a good way to get started? https://howtos.davidsebek.com/debian-f2fs.html I have been running f2fs on a number of distributions whose installers support deploying onto an f2fs root filesystem, so proving out f2fs in general isn't the sticking point; rather, it is being able to eventually purchase support from SUSE for a system deployed that way. If I were to build a Tumbleweed system using a process similar to the one described for Debian, would that bring official f2fs support in Tumbleweed closer to fruition?

On Tuesday 2021-04-13 21:59, peter.clark@ngic.com wrote:
Kinda. It's sprinkled with ancient knowledge and slow feats. If you know in advance you want f2fs, the partition table can be created befitting of the situation from the start. This avoids the use of terribly SLOW USB flash things like the guide advocates. It also avoids the double-copying that would be incurred. Even more, you'd have both filesystems side by side and could switch back if the f2fs one fails to boot.

  nvme0n1p1  /boot       ext2          512M
  nvme0n1p2  notmounted  notformatted  AllRemainingGB
  nvme0n1p3  /           xfs           4? G

dd is one of those stupid commands; people should just use ddrescue already. wipefs is unnecessary (because dd/ddrescue was already run!). The commands for initrd and grub2 are different between distros. Once you have a shell, the incantation would go like..

  mkfs.f2fs nvme0n1p2
  mount nvme0n1p2 /mnt
  rsync -HPXax / /mnt/
  for i in /boot /dev /proc /sys; do mount $i /mnt/$i --bind; done
  blkid
  vi /mnt/etc/fstab; #set new UUID
  chroot /mnt
  mkinitrd -m xfs
  exit; #from chroot
  umount /mnt/* /mnt
  reboot

p3 still exists, so provided the initrd can still mount it (hence the extra -m parameter), one can always go back (given some manual input into the bootloader) and fix the situation with more commands. Once it's all good, p3 can be deleted, p2 enlarged (usually needs a gratuitous reboot for the kernel to pick up the new table fully), then the p2 _fs_ enlarged, and it's done.
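For anyone following along, a sketch of those finishing steps under the layout above; sfdisk and resize.f2fs are only one way to do it, and the exact invocations are illustrative rather than taken from the mail:

  # once the f2fs root on p2 boots cleanly
  sfdisk --delete /dev/nvme0n1 3          # drop the old xfs root (p3)
  echo ', +' | sfdisk -N 2 /dev/nvme0n1   # grow p2 to the end of the disk
  reboot                                  # let the kernel pick up the new table
  # resize.f2fs (f2fs-tools) generally wants the filesystem offline, so
  # grow it from a rescue/live environment rather than the running root
  resize.f2fs /dev/nvme0n1p2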

Jan,

Thanks for listing out the steps! I got that to work on a VM in VirtualBox, with the caveat that the installer required a 5G root partition. I haven't tried the Debian procedure - since it had been written over a year ago and f2fs enablement appears to have a DD sponsor, I figured they might have gotten the installer for their upcoming release to offer it, but no such luck, at least as of alpha3.

That said, I would assume the wipefs command was used because it issues a BLKRRPART ioctl, which presumably would eliminate the need for the additional reboot to pick up partition table changes. I haven't encountered a system with ddrescue installed before - the package description makes it sound like its use applies more to physical machine use cases. Since sysadmins often have to work on systems from many different sources, often in a state where a full suite of tools is unavailable, the first impulse is usually to use whatever is most likely to be portable, present, and understood well enough that unexpected behavior isn't encountered while working against the clock of an SLA during an outage.

Speaking of which, would you happen to be aware of anything that would prevent this procedure from converting some systems from btrfs root to XFS root on SLES 12? That's possibly somewhat OT here since it's an older version of downstream, but very on topic for the present discussion of supporting the VM use case.

On that note, I have considered filing a bug against /etc/sysconfig/btrfsmaintenance to set a default of BTRFS_BALANCE_PERIOD="none", at least when a VM installation is being performed or possibly when flash storage is detected. Of course, that would depend on whether the on-premises VM use case is being targeted by openSUSE, or whether efforts like this would be better directed downstream. For my part, after virtualizing a number of systems with btrfs root, it was quite a surprise to see a thundering herd of noisy neighbors all running btrfs balance at once. Then data corruption started occurring on large VMs, apparently triggered by VMware DRS moving them around due to high IO; the resulting unhandled timeouts left btrfs read-only, and recovery required restores from backup.
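For reference, the setting in question is a one-line default in that sysconfig file (file path and variable name as shipped by the btrfsmaintenance package; making it conditional on a VM or flash install would of course need additional installer logic):

  # /etc/sysconfig/btrfsmaintenance -- proposed default for VM installs
  # the stock default schedules a weekly balance; "none" disables the
  # periodic run so a whole fleet of VMs doesn't balance in lockstep
  BTRFS_BALANCE_PERIOD="none"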

On Wednesday 2021-04-14 20:53, peter.clark@ngic.com wrote:
BLKRRPART can be issued with `blockdev --rereadpt`. You do not need to wipe a disk fully and repeatedly just for that ioctl. But the point is: if /dev/sda3 is currently mounted as per my howto, reloading /dev/sda after adjusting the table will return EBUSY. Hence the reboot. (Else you'd have to switch to a magic tmpfs.. all complicated and not worth pursuing time-wise.)
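Put concretely (the device name is just an example):

  blockdev --rereadpt /dev/nvme0n1   # issues BLKRRPART without rewriting anything
  # if any partition on that disk is still mounted (p3 in the howto above),
  # the ioctl fails with EBUSY -- hence the reboot in the procedure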
Well this is exactly the argument _for_ using ddrescue: not having to figure out bs=, skip= and seek= crap under pressure, and actually getting a time estimate on how long the operation will take (at least for larger transfers).
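As a side-by-side illustration (source/target devices and the mapfile path are made up for the example, not taken from the thread):

  # plain dd: you pick bs=, skip= and seek= yourself and get no time estimate
  dd if=/dev/sda3 of=/dev/sdb3 bs=4M conv=sync,noerror
  # ddrescue: sensible defaults, a live rate/ETA display, and a mapfile
  # that lets an interrupted copy resume where it left off
  ddrescue -f /dev/sda3 /dev/sdb3 /root/sda3-copy.map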
Why prevent? Switching filesystem does not happen automatically, you have to really work towards it.

Jan Engelhardt wrote: [...]
Good to know. The only thing I have used dd for in the past several years has been as a storage IO generator that doesn't require installing anything. I had recently read an article about conv=sync,noerror being used when working with raw disk devices, so the ddrescue alternative is valuable knowledge.
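In case it helps anyone reading the archive, that IO-generator use is just something along these lines (the file path and sizes are arbitrary):

  # quick write-load generator using nothing beyond coreutils
  dd if=/dev/zero of=/var/tmp/ioload bs=1M count=4096 oflag=direct
  rm /var/tmp/ioload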
Sounds like a great tool, and I will give it a try next time I need a dd equivalent and pass along the info, since I suspect not too many sysadmins are aware it exists. To that point, the only ddrescue results Google finds on suse.com are from HW 20 and Packagehub, while this KB article is an obvious case where ddrescue would probably be preferable: https://www.suse.com/support/kb/doc/?id=000018742 Since sysadmins are expected to follow the instructions provided by a vendor, it's hard to see ddrescue getting much general adoption when official documentation exclusively specifies dd.
Why prevent? Switching filesystem does not happen automatically, you have to really work towards it.
I had considered swapping the root disk with one from a template with XFS and applying any needed changes to /etc and /var by hand, because I have seen that work before. This rsync/mkinitrd method using another partition on the same disk looks much cleaner, more straightforward and repeatable. I just wasn't sure whether any of these steps had needed to change since downstream's 2014 GA release, and it will take a fair amount of effort to get everything in place to lab it out.

I have been prepping to work on this over the next several days. It was a nice surprise to see during a dry run that rsync -HPXax / /mnt/ doesn't try to copy /.snapshots to the target. It looks like that applies to all subvolumes, though, so /var, /root, et al. will need to be rsynced individually.

On the ddrescue command - for the typical datacenter environment, this isn't really appropriate now that virtualization has become the norm, since cloning disks is typically done at the virtualization platform level. Using ddrescue would not only generate a lot of storage IO on a shared link, it would also require the backing array to re-compress and deduplicate the data. When a clone operation is performed by the virtualization platform, a copy is instantaneously spun up without the need to read and write an entire device. In other words, it is similar to the reason why btrfs is not appropriate for this use case (especially with a weekly btrfs balance run by default).
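A sketch of those extra per-subvolume copies, on top of the earlier procedure; the list below is only the typical openSUSE default subvolume layout and should be checked against the actual system (btrfs subvolume list /):

  # -x keeps rsync on one filesystem, so each subvolume mounted under /
  # must be copied on its own; /.snapshots is deliberately skipped
  for sub in /home /opt /root /srv /usr/local /var; do
      mkdir -p "/mnt$sub"
      rsync -HPXax "$sub/" "/mnt$sub/"
  done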

On the Tumbleweed lab system where I was testing, I had intentionally picked btrfs in hopes of triggering the same type of corruption seen on the business-critical systems, to help with debugging/troubleshooting. The real problems seem to start at around 64G of RAM per VM, when DRS moves the VMs around due to the high IO generated by btrfs balance. That is not the kind of sizing that makes for an easy request when debugging a filesystem that isn't appropriate to the use case.
participants (3):
- Jan Engelhardt
- peter.clark@ngic.com
- Stefan Seyfried