[Bug 618678] blkback thread hangs after unsuccessful xen domU start

2 Jul 2010

      http://bugzilla.novell.com/show_bug.cgi?id=618678

http://bugzilla.novell.com/show_bug.cgi?id=618678#c14

--- Comment #14 from Kattiganehalli srinivasan  2010-07-02 21:24:28 UTC ---
In all the previous instances where we have seen vifs hanging around, it was
related to references acquired on the interface under heavy network traffic
that would take a long time (sometimes hours) to drain. In this case, though
(on xen75), I have been seeing this problem under absolutely no network load.

netif_disconnect() is the function that does the final teardown of the VIF. In
the case of a normal guest shutdown, this function gets called from the
xenwatch thread as part of handling the guest state change
(XenbusStateClosing). The xm destroy case is a little different in the sense
that we kill the guest without involving the guest. In this case we receive
XenbusStateUnknown and the cleanup path is via netback_remove().

In this version of netback, a new r/w semaphore (teardown_sem) has been
introduced, and it is the root cause of the problem. Given that we have only
one watch thread and all this code gets executed in the context of the watch
thread, it is not clear why we need this semaphore. What is happening is that
we acquire this semaphore in write mode in netback_remove() and, while still
holding it, attempt to acquire it in read mode in netback_uevent(). This
causes the xenbus watch thread to wedge itself even before cleaning up the
vif. Even if the vif were cleaned up it would not matter, since we have lost
the xenwatch thread!
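
To illustrate the self-deadlock described above, here is a minimal user-space
sketch. It is a toy, single-threaded model of a reader/writer semaphore, not
the kernel's rw_semaphore and not the actual netback code; every name in it
(toy_rwsem, toy_remove, toy_uevent and so on) is hypothetical. It uses
non-blocking "try" acquires so that the failure can be observed instead of
hanging; with a blocking acquire the thread would simply sleep at that point.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a reader/writer semaphore. NOT the kernel's
 * struct rw_semaphore; all names here are hypothetical. */
struct toy_rwsem {
    bool writer;   /* true while held in write mode */
    int readers;   /* number of read-mode holders */
};

/* Write mode needs exclusive ownership: no writer and no readers. */
static bool toy_try_write(struct toy_rwsem *s)
{
    if (s->writer || s->readers > 0)
        return false;
    s->writer = true;
    return true;
}

/* Read mode is shared with other readers but excluded by a writer. */
static bool toy_try_read(struct toy_rwsem *s)
{
    if (s->writer)
        return false;
    s->readers++;
    return true;
}

static void toy_release_write(struct toy_rwsem *s)
{
    assert(s->writer);
    s->writer = false;
}

static void toy_release_read(struct toy_rwsem *s)
{
    assert(s->readers > 0);
    s->readers--;
}

/* Stand-in for the uevent step: it wants the semaphore in read mode. */
static bool toy_uevent(struct toy_rwsem *s)
{
    if (!toy_try_read(s))
        return false;   /* a blocking acquire would sleep forever here */
    toy_release_read(s);
    return true;
}

/* Stand-in for the remove path: takes write mode, then runs the uevent
 * step on the same thread while still holding it. Returns whether the
 * nested read-mode acquisition could proceed. */
static bool toy_remove(struct toy_rwsem *s)
{
    bool ok;

    if (!toy_try_write(s))
        return false;
    ok = toy_uevent(s);
    toy_release_write(s);
    return ok;
}
```

In this model, calling toy_remove() on a fresh semaphore returns false: the
nested read-mode request is refused because the caller itself holds write
mode, and nobody else can ever release it. Calling toy_uevent() on its own,
with no writer, succeeds.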

For what it is worth, we did not have this teardown_sem in sles11 sp1. After
the July 4th holiday I will look at why this semaphore was introduced. In the
meantime, I got rid of this semaphore on xen75 and things are working - both
destroy and save/restore.

-- 
Configure bugmail: http://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.

bugzilla_noreply＠novell.com