In the YaST storage area, we have a problem that manifests itself in very
subtle bugs:
Every once in a while, device nodes (in particular for partitions) are not
present after a seemingly harmless call like "parted -l", not even after
calling "udevadm settle".
That means that even though parted just told us that there should be a
/dev/sda1, /dev/sda2 etc., those device nodes are just not there (yet - they
are recreated moments later), and operations on those device nodes (like
trying to determine the filesystem, if any, on them) fail with ENOENT.
After investigating this quite a bit, it turned out that even though it
should be purely read-only, "parted -l" opens the disk device first
read-only, then closes it, then reopens it read-write (for reasons we don't
understand yet) - and in that process the kernel gets the ioctl() call to
re-read the partition table, which causes the device nodes for the
partitions on that disk to be completely removed and then recreated (by
udev AFAIK).
snwint even tried to add a real read-only mode to parted, but so far this was
not successful.
Since our parted maintainer is very busy with other things at this time (and
for the immediate future), here is the call to our developer community:
What can we reasonably do about this? Does anybody have a good idea, or
might even want to contribute?
Likely causes
=============
It is a sporadic problem, but it happens most often in virtual machines. It
might have to do with only one CPU typically being configured, so there is no
real parallel processing; when parted sends the "re-read partition table"
ioctl() to the kernel, the device nodes for the partitions are first removed,
then the partition table is re-read, then udev events are generated to
recreate them.
We suspect that it might be a race condition due to task switches between
udev and YaST at unlucky moments: The kernel would generate more udev events,
but udev is still busy, and there is a task switch to YaST which fires off an
"udevadm settle" call - which returns immediately because right now the udev
event queue is indeed empty, and YaST continues, in the next step trying to
open one of those partition devices, which fails because the device node is
not present yet.
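
To make the symptom easier to catch in a test setup, here is a minimal
Python sketch that runs "udevadm settle" and then reports which of the
expected partition nodes are still missing. The function name and the
injectable "settle" / "exists" parameters are made up for this illustration;
this is not how YaST does it.

```python
import os
import subprocess


def missing_nodes_after_settle(expected_nodes, settle=None, exists=os.path.exists):
    """Run "udevadm settle", then return the expected nodes that are absent.

    expected_nodes: device paths a tool like parted just reported,
                    e.g. ["/dev/sda1", "/dev/sda2"].
    settle, exists: injectable for testing (hypothetical parameters).
    """
    if settle is None:
        # "udevadm settle" waits until the udev event queue is empty.
        settle = lambda: subprocess.run(["udevadm", "settle"], check=False)
    settle()
    # A non-empty result right after settle is the symptom described above.
    return [node for node in expected_nodes if not exists(node)]
```

Calling this right after "parted -l" in a single-CPU VM should, if our theory
is right, occasionally return a non-empty list.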
Real life relevance
===================
It would be bad enough if this only hurt OpenQA, but it is very likely
relevant for customers, too: Virtual machines for web hosters, for example.
Brainstorming approach #1: Suspend udev
=======================================
We talked about suspending udev for a while, maybe at least for purely
passive operations like that "parted -l". Can this realistically be done?
Would any pending udev events get completely lost, or would they remain in
the queue? What would we lose in either case?
We can't do that for every parted call, in particular not for those where we
create or delete partitions, since in those cases we want the updates that
reflect what the partitioning looks like now. We are also not sure whether
the same problem would simply reappear, just a little less likely.
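
For discussion, this is roughly what suspending udev around a passive call
could look like. It is a sketch in Python, assuming the
"udevadm control --stop-exec-queue" and "--start-exec-queue" options (which
need root); the helper name and its "run" parameter are invented for this
example. Whether this actually closes the race is untested.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def udev_queue_stopped(run=subprocess.run):
    """Hold back udev event execution for the duration of the block.

    Uses "udevadm control --stop-exec-queue" / "--start-exec-queue"
    (needs root). The "run" parameter is injectable for testing and is
    an assumption of this sketch.
    """
    run(["udevadm", "control", "--stop-exec-queue"], check=True)
    try:
        yield
    finally:
        # Always restart the queue, even if the wrapped call failed.
        run(["udevadm", "control", "--start-exec-queue"], check=True)
```

A caller would wrap only the read-only call, for example
"with udev_queue_stopped(): subprocess.run(["parted", "-l"])", and leave
partition-modifying calls alone.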
Brainstorming approach #2: Use something other than parted
==========================================================
In the past, we used "fdisk" on i386, and IIRC there were other tools on
other architectures. parted was the promise to unify them all and make the
others obsolete, which it did pretty well - except for this issue. But would
we really want to trade that one problem for all the others we had gotten rid
of when we moved to parted? I don't think so.
Verdict: Not desirable.
Brainstorming approach #3: Tactical sleep
==========================================
We try our best to avoid dysfunctional and broken-by-design approaches like
adding sleep() calls to hope everything has settled down after that time.
Those things tend to accumulate in code and never go away, and while they
cost time constantly, those timeouts are always too short (if we are unlucky)
and too long (the user has to wait) at the same time.
Verdict: Not desirable.
Brainstorming approach #4: Add read-only mode to parted
=======================================================
"parted -l", which we are using in this situation, should already be a
read-only operation. Unfortunately, strace shows that this is not the case:
It starts with opening the disk device read-only, reads information, closes
it - and then for whatever reason opens it again read-write which triggers
the code that sends the ioctl() to make the kernel re-read the partition table.
We consider that a bug (https://bugzilla.suse.com/show_bug.cgi?id=979275),
but it does not seem to be easy to fix. Any contribution to that would be
very welcome.
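
For anybody who wants to reproduce the strace observation, here is a small
Python sketch that filters a trace for writable opens of the disk device.
The function name is made up, and the regular expression assumes the usual
strace line layout for open()/openat(), which may differ between strace
versions.

```python
import re

# Matches open()/openat() lines as strace usually prints them, e.g.
#   openat(AT_FDCWD, "<path>", <FLAGS>) = <fd>
# The exact line format is an assumption of this sketch.
_OPEN_RE = re.compile(
    r'\bopen(?:at)?\((?:[^,"]+,\s*)?"(?P<path>[^"]+)",\s*(?P<flags>[A-Z_|]+)'
)


def writable_opens(strace_lines, device):
    """Return the strace lines that open `device` with O_RDWR or O_WRONLY."""
    hits = []
    for line in strace_lines:
        match = _OPEN_RE.search(line)
        if not match or match.group("path") != device:
            continue
        flags = match.group("flags").split("|")
        if "O_RDWR" in flags or "O_WRONLY" in flags:
            hits.append(line.rstrip("\n"))
    return hits
```

The input could come from something like
"strace -f -e trace=open,openat,ioctl -o parted.trace parted -l"; any hit for
the disk device in a pure listing run is the behavior described above.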
Brainstorming approach #5: [add your idea here]
===============================================
[this area intentionally left blank]
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Related bugs:
https://bugzilla.suse.com/show_bug.cgi?id=979275
https://bugzilla.suse.com/show_bug.cgi?id=978137
Please note that while I will not be in the office during the next week,
Arvin, Ancor and Steffen will be here to follow up on this.
Kind regards
--
Stefan Hundhammer