[yast-devel] Call for ideas: parted vs. udev-events vs. kernel

25 May 2016

In the YaST storage area, we have a problem that manifests itself in very 
subtle bugs:

Every once in a while, device nodes (in particular for partitions) are not 
present after a seemingly harmless call like "parted -l", not even after 
calling "udevadm settle".

That means that even though parted just told us that there should be a 
/dev/sda1, /dev/sda2 etc., those device nodes are just not there (yet - they 
are recreated moments later), and operations on those device nodes (like 
trying to determine the filesystem, if any, on them) fail with ENOENT.

After investigating this quite a bit, it turned out that even though it 
should be purely read-only, "parted -l" opens the disk device first 
read-only, then closes it, then reopens it read-write (for reasons we don't 
understand yet). In that process the kernel gets the ioctl() call to 
re-read the partition table, which causes the device nodes for the 
partitions on that disk to be completely removed and then recreated (by 
udev AFAIK).

snwint even tried to add a real read-only mode to parted, but so far this 
has not been successful.

Since our parted maintainer is very busy with other things at this time (and 
for the immediate future), here is the call to our developer community:

What can we reasonably do about this? Does anybody have a good idea, or 
might even want to contribute?

Likely causes
=============

It is a sporadic problem, but it happens most often in virtual machines. It 
might have to do with only one CPU typically being configured, so there is no 
real parallel processing. When parted sends the "re-read partition table" 
ioctl() to the kernel, the device nodes for the partitions are first removed, 
then the partition table is re-read, then udev events are generated to 
recreate them.

We suspect that it might be a race condition due to task switches between 
udev and YaST at unlucky moments: The kernel would generate more udev events, 
but udev is still busy, and there is a task switch to YaST, which fires off a 
"udevadm settle" call. That call returns immediately because right now the 
udev event queue is indeed empty. YaST then continues and, in the next step, 
tries to open one of those partition devices, which fails because the device 
node is not present yet.
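
To make the suspected failure mode concrete, here is a minimal Python sketch 
of the check that goes wrong: settle first, then see which of the device 
nodes that parted reported are still missing. The function name and the 
injectable "settle" parameter are made up for illustration; this is not how 
YaST does it, and it only detects the symptom, it does not fix the race.

```python
import os
import subprocess


def missing_after_settle(expected_nodes, settle=None):
    """Run 'udevadm settle', then return the expected nodes that do not exist.

    expected_nodes: device paths that parted reported, e.g. "/dev/sda1".
    settle: hypothetical hook so the settle step can be replaced in tests.
    """
    if settle is None:
        def settle():
            # 'udevadm settle' waits until the udev event queue is empty.
            subprocess.run(["udevadm", "settle"], check=False)
    settle()
    # In the race described above, this list can be non-empty even though
    # settle has just returned.
    return [node for node in expected_nodes if not os.path.exists(node)]
```

A non-empty result right after a "parted -l" would be the situation we see in 
the bug reports.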

Real life relevance
===================

It would be bad enough if this only hurt OpenQA, but it is very likely 
relevant for customers, too: virtual machines at web hosters, for example.

Brainstorming approach #1: Suspend udev
=======================================

We talked about suspending udev for a while, maybe at least for purely 
passive operations like that "parted -l". Can this realistically be done? 
Would any pending udev events get completely lost, or would they remain in 
the queue? What would we lose in either case?

We can't do that for every parted call, in particular not for those where we 
create or delete partitions, since in those cases we do want the updates that 
reflect what the partitioning looks like now. We are also not sure whether 
the same problem would simply reappear, just a little less likely.
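
One way this could be tried is with "udevadm control --stop-exec-queue" and 
"--start-exec-queue"; according to the udevadm man page, incoming events are 
queued while execution is stopped. The Python sketch below wraps the two 
calls in a context manager. The name udev_queue_stopped and the injectable 
"run" parameter are made up for illustration, it needs root, and whether it 
actually avoids the missing device nodes is untested.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def udev_queue_stopped(run=subprocess.run):
    """Hold back udev event execution for the duration of the with-block.

    run: hypothetical hook (defaults to subprocess.run) so the udevadm
    calls can be replaced in tests.
    """
    run(["udevadm", "control", "--stop-exec-queue"], check=True)
    try:
        yield
    finally:
        # Always restart the queue, even if the wrapped call raised.
        run(["udevadm", "control", "--start-exec-queue"], check=True)
```

A caller would put only the purely passive call, such as "parted -l", inside 
the with-block, so that the queue is restarted as soon as that call returns.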

Brainstorming approach #2: Use something other than parted
==========================================================

In the past, we had used "fdisk" on i386, and IIRC there were other tools on 
other architectures. parted was the promise to unify them all and make the 
others obsolete, which it did pretty well, except for this issue. But would 
we really want to trade that one problem for all the others we had gotten rid 
of when we moved to parted? I don't think so.

Verdict: Not desirable.

Brainstorming approach #3: Tactical sleep
==========================================

We try our best to avoid dysfunctional and broken-by-design approaches like 
adding sleep() calls to hope everything has settled down after that time. 
Those things tend to accumulate in code and never go away, and while they 
cost time constantly, those timeouts are always too short (if we are unlucky) 
and too long (the user has to wait) at the same time.

Verdict: Not desirable.

Brainstorming approach #4: Add read-only mode to parted
=======================================================

"parted -l", which we are using in this situation, should already be a 
read-only operation. Unfortunately, strace shows that this is not the case: 
It starts by opening the disk device read-only, reads information, closes 
it, and then for whatever reason opens it again read-write, which triggers 
the code that sends the ioctl() to make the kernel re-read the partition table.

We consider that a bug (https://bugzilla.suse.com/show_bug.cgi?id=979275), 
but it does not seem to be easy to fix. Any contribution to that would be 
very welcome.
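
For anybody who wants to check a parted build for the read-write open 
described above, here is a small Python sketch that scans strace output (for 
example a file written with "strace -o FILE parted -l") for open()/openat() 
calls that open a given device writable. The function name and the regular 
expression are assumptions for illustration and only cover the usual strace 
line format.

```python
import re

# Matches open()/openat() lines as strace prints them, for example:
#   open("/dev/sda", O_RDWR) = 3
#   openat(AT_FDCWD, "/dev/sda", O_RDONLY|O_CLOEXEC) = 3
# An optional leading PID (as written with "strace -f -o FILE") is allowed.
OPEN_RE = re.compile(
    r'^(?:\d+\s+)?open(?:at)?\((?:AT_FDCWD, )?"([^"]+)", ([A-Z_|0-9]+)'
)


def writable_opens(strace_lines, device):
    """Return the strace lines that open `device` with O_RDWR or O_WRONLY."""
    hits = []
    for line in strace_lines:
        line = line.strip()
        match = OPEN_RE.match(line)
        if not match or match.group(1) != device:
            continue
        flags = match.group(2).split("|")
        if "O_RDWR" in flags or "O_WRONLY" in flags:
            hits.append(line)
    return hits
```

With a truly read-only "parted -l", the result for the disk device should be 
an empty list.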

Brainstorming approach #5: [add your idea here]
===============================================

     [this area intentionally left blank]

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Related bugs:

     https://bugzilla.suse.com/show_bug.cgi?id=979275
     https://bugzilla.suse.com/show_bug.cgi?id=978137

Please note that while I will not be in the office during the next week, 
Arvin, Ancor and Steffen will be here to follow up on this.

Kind regards
-- 
Stefan Hundhammer  
YaST Developer

SUSE Linux GmbH
GF: Felix Imendörffer, Jane Smithard, Graham Norton; HRB 21284 (AG Nürnberg)
Maxfeldstr. 5, 90409 Nürnberg, Germany
