[opensuse-kernel] Xen kernel bug: new GENERIC_TIME based code breaks live migration
Greetings,

So I've been trying to get together a new xen kernel for Gentoo based on the Suse 2.6.22 xen patches in 2.6.22.17-0.1. So far I've hit two bugs. The easier of the two is below, I'll post the second bug in a second email since they are not related. Note that this is on i386, I haven't even tried x86_64 yet.

Currently domU live migration between two hosts is broken. The guest's system clock (implemented in arch/i386/kernel/time-xen.c) is pulled directly from xen so time will typically change significantly when moving between two xen instances. In older versions of the xen patches this works fine but these 2.6.22 patches have replaced the old do_gettimeofday implementation with the GENERIC_TIME implementation which gets the time from the new xen_clocksource_read function. The problem is that xen_clocksource_read makes sure that time never goes backwards, either returning the new time if it went forwards or the previous read if time appears to have gone backwards. So when the system time jumps backwards during a move to a new physical machine suddenly time stops. I've worked around the issue for now by replacing time-xen.c with an older version from redhat's 2.6.21 xen kernel and disabling GENERIC_TIME. The attached patch does this for i386 but it leaves x86_64 broken since I haven't gotten to that yet.

For kicks I tried letting xen_clocksource_read return a time in the past but that caused the kernel to get lost in an endless loop somewhere. Perhaps the system time should be kept locally within the kernel instead of pulling directly from xen and updated during the timer tick's wall clock update. Then xen_clocksource_read would use that time plus the change in TSC since the last tick instead of xen's time.

Cheers,
--
Michael Marineau
Oregon State University
mike@marineau.org

--
To unsubscribe, e-mail: opensuse-kernel+unsubscribe@opensuse.org
For additional commands, e-mail: opensuse-kernel+help@opensuse.org
On Wed, Feb 27, 2008 at 9:15 PM, Michael Marineau <mike@marineau.org> wrote:
Greetings,
So I've been trying to get together a new xen kernel for Gentoo based on the Suse 2.6.22 xen patches in 2.6.22.17-0.1. So far I've hit two bugs. The easier of the two is below, I'll post the second bug in a second email since they are not related. Note that this is on i386, I haven't even tried x86_64 yet.
Currently domU live migration between two hosts is broken. The guest's system clock (implemented in arch/i386/kernel/time-xen.c) is pulled directly from xen so time will typically change significantly when moving between two xen instances. In older versions of the xen patches this works fine but these 2.6.22 patches have replaced the old do_gettimeofday implementation with the GENERIC_TIME implementation which gets the time from the new xen_clocksource_read function. The problem is that xen_clocksource_read makes sure that time never goes backwards, either returning the new time if it went forwards or the previous read if time appears to have gone backwards. So when the system time jumps backwards during a move to a new physical machine suddenly time stops. I've worked around the issue for now by replacing time-xen.c with an older version from redhat's 2.6.21 xen kernel and disabling GENERIC_TIME. The attached patch does this for i386 but it leaves x86_64 broken since I haven't gotten to that yet.
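To make the stall concrete, here is a minimal user-space sketch of a clamped clock read. It is not the actual xen_clocksource_read code; the names host_time_ns, last_returned_ns and clamped_clock_read are hypothetical and only model the behaviour described above (return the new time if it moved forward, otherwise repeat the previous read).

```c
#include <stdint.h>

/* Hypothetical stand-in for the hypervisor's system time, in nanoseconds. */
static uint64_t host_time_ns;

/* Largest value handed out so far by the clamped reader. */
static uint64_t last_returned_ns;

/* Clamped read: never return a value smaller than an earlier one. */
static uint64_t clamped_clock_read(void)
{
    uint64_t now = host_time_ns;

    if (now < last_returned_ns)
        return last_returned_ns;  /* source went backwards: repeat old value */

    last_returned_ns = now;
    return now;
}
```

If the source drops from 1000 to 400 (as it might when the guest lands on a host with a lower system time), every read keeps returning 1000 until the source has climbed past 1000 again, so from the guest's point of view time stands still for that whole interval.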
For kicks I tried letting xen_clocksource_read return a time in the past but that caused the kernel to get lost in an endless loop somewhere. Perhaps the system time should be kept locally within the kernel instead of pulling directly from xen and updated during the timer tick's wall clock update. Then xen_clocksource_read would use that time plus the change in TSC since the last tick instead of xen's time.
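A rough sketch of that last idea, in plain user-space C rather than kernel code. Everything here is an assumption made for illustration: the struct local_clock layout, the function names, a constant TSC frequency, and in particular the local_clock_rebase step, which the proposal above does not spell out but which seems necessary because the TSC of the new host is unrelated to the old one.

```c
#include <stdint.h>

/* Hypothetical guest-local clock state. */
struct local_clock {
    uint64_t base_ns;   /* guest-local system time at the last tick */
    uint64_t base_tsc;  /* TSC value sampled at that same tick */
    uint64_t tsc_khz;   /* TSC frequency in kHz, assumed constant */
};

/* Convert a small cycle count to nanoseconds.
 * The multiplication can overflow for very large deltas; that is
 * acceptable here because deltas only span one tick interval. */
static uint64_t cycles_to_ns(const struct local_clock *c, uint64_t cycles)
{
    return cycles * 1000000ULL / c->tsc_khz;
}

/* Timer tick: fold the elapsed cycles into the local time and restamp. */
static void local_clock_tick(struct local_clock *c, uint64_t tsc_now)
{
    c->base_ns += cycles_to_ns(c, tsc_now - c->base_tsc);
    c->base_tsc = tsc_now;
}

/* Clocksource read: local time plus the TSC delta since the last tick. */
static uint64_t local_clock_read(const struct local_clock *c, uint64_t tsc_now)
{
    return c->base_ns + cycles_to_ns(c, tsc_now - c->base_tsc);
}

/* After migration: keep the local time, take a fresh TSC stamp. */
static void local_clock_rebase(struct local_clock *c, uint64_t tsc_now)
{
    c->base_tsc = tsc_now;
}
```

With this shape the value returned to generic code depends only on guest-local state, so a jump in xen's system time does not show up in the clocksource. The cost is that the time spent suspended during the migration is not accounted for unless something else adds it back.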
And of course I was dumb and didn't attach my workaround patch, here it is.

--
Michael Marineau
Oregon State University
mike@marineau.org
Currently domU live migration between two hosts is broken. The guest's system clock (implemented in arch/i386/kernel/time-xen.c) is pulled directly from xen so time will typically change significantly when moving between two xen instances. In older versions of the xen patches this works fine but these 2.6.22 patches have replaced the old do_gettimeofday implementation with the GENERIC_TIME implementation which gets the time from the new xen_clocksource_read function. The problem is that xen_clocksource_read makes sure that time never goes backwards, either returning the new time if it went forwards or the previous read if time appears to have gone backwards. So when the system time jumps backwards during a move to a new physical machine suddenly time stops. I've worked around the issue for now by replacing time-xen.c with an older version from redhat's 2.6.21 xen kernel and disabling GENERIC_TIME. The attached patch does this for i386 but it leaves x86_64 broken since I haven't gotten to that yet.
Hmm, indeed, that is a case where insisting on monotonicity is a problem, but ...
For kicks I tried letting xen_clocksource_read return a time in the past but that caused the kernel to get lost in an endless loop somewhere. Perhaps the system time should be kept locally within the kernel instead of pulling directly from xen and updated during the timer tick's wall clock update. Then xen_clocksource_read would use that time plus the change in TSC since the last tick instead of xen's time.
... that is exactly why monotonicity is required. There is a calculation somewhere in generic code (I debugged this a while back, but don't recall where without in-depth checking) that creates huge positive timeouts when time stamps move a tiny bit backwards. Since, as said, this is in generic code that I understand halfway at best, it didn't seem reasonable to change the behavior there.

Going back to the non-generic-time handling is not really a solution here. Instead, I'm already feeling quite nervous about GENERIC_CLOCKEVENTS & Co being suppressed for Xen (but I don't think changing this would help the issue you raise); I simply didn't have a chance to educate myself enough about this new time infrastructure.

So I think the issue should rather be taken care of in the resume path, where (without checking the code) I assume generic code is capable of dealing with time moving backwards.

Jan
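The two points in this reply can be sketched in user-space C. Both parts are assumptions for illustration only: the thread does not identify the generic calculation, so unsigned subtraction is just one plausible way a tiny backwards step becomes a huge positive interval, and clamped_resume is a hypothetical hook, not an existing kernel or Xen function.

```c
#include <stdint.h>

/* Elapsed time as an unsigned difference. If now is even 1 ns behind
 * last, the result wraps around to a value close to 2^64. */
static uint64_t elapsed_ns(uint64_t last, uint64_t now)
{
    return now - last;
}

/* Hypothetical clamped clock with an explicit resume hook. */
struct clamped_clock {
    uint64_t last_ns;  /* largest value returned so far */
};

/* Normal read: never go below a previously returned value. */
static uint64_t clamped_read(struct clamped_clock *c, uint64_t source_ns)
{
    if (source_ns > c->last_ns)
        c->last_ns = source_ns;
    return c->last_ns;
}

/* Resume path: accept the new host's time as the baseline, even if it
 * is lower. This relies on the rest of the resume code coping with the
 * one-off backwards step, which is assumed above without checking. */
static void clamped_resume(struct clamped_clock *c, uint64_t source_ns)
{
    c->last_ns = source_ns;
}
```

The clamp stays in place for ordinary reads, so the wraparound case cannot be triggered by small jitter, while the single large jump at migration is handled once at resume instead of freezing the clock.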
participants (2)
- Jan Beulich
- Michael Marineau