[Bug 473302] New: X.org busy-loops (100% CPU in a select() loop) while desktop is locked
https://bugzilla.novell.com/show_bug.cgi?id=473302 Summary: X.org busy-loops (100% CPU in a select() loop) while desktop is locked Classification: openSUSE Product: openSUSE 11.1 Version: Final Platform: x86 OS/Version: Other Status: NEW Severity: Normal Priority: P5 - None Component: X.Org AssignedTo: bnc-team-xorg-dev@forge.provo.novell.com ReportedBy: matz@novell.com QAContact: xorg-maintainer-bugs@forge.provo.novell.com Found By: Development This bug happens for me since I updated to 11.1 final from 11.1 beta4, and it's extremely annoying for my office mates because my CPU fan is very loud. The error mode is a bit strange. It happens after some time after the desktop is locked (in my case via kdesktop_lock). That first spawns a screen-saver and after some time kills it (to not uselessly use CPU) and switches off the monitor. After the monitor is switched off it takes some more time for the bug to show up. The symptom of it is that X takes 100% CPU without anyone doing anything on the system. The reason is a busy-loop constructed with zero timeout selects, as can be seen in an strace taken from a remote system (i.e. without unlocking the desktop). The relevant excerpt from that strace looks like so: 13:48:02.254152 select(256, [1 3 5 6 12 13 14 15 18 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66], NULL, NULL, {0, 0}) = 0 (Timeout) 13:48:02.254289 clock_gettime(CLOCK_MONOTONIC, {174665, 645523431}) = 0 13:48:02.254383 setitimer(ITIMER_REAL, {it_interval={0, 20000}, it_value={0, 20000}}, NULL) = 0 13:48:02.254483 setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0 13:48:02.254581 clock_gettime(CLOCK_MONOTONIC, {174665, 645815070}) = 0 13:48:02.254675 select(256, [1 3 5 6 12 13 14 15 18 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66], NULL, NULL, {0, 0}) = 0 (Timeout) 13:48:02.254812 clock_gettime(CLOCK_MONOTONIC, {174665, 646046604}) = 0 13:48:02.254906 setitimer(ITIMER_REAL, {it_interval={0, 20000}, it_value={0, 20000}}, NULL) = 0 13:48:02.255006 setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0 13:48:02.255105 clock_gettime(CLOCK_MONOTONIC, {174665, 646338413}) = 0 13:48:02.255198 select(256, [1 3 5 6 12 13 14 15 18 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66], NULL, NULL, {0, 0}) = 0 (Timeout) Note how select() is called with a zero timeout (i.e. return immediately no matter if some FD is active or not). This happens ad nauseam, and causes the 100% CPU load. It goes away as soon as the desktop is unlocked (then Xorg starts using reasonable timeout ala 175 seconds again). It also goes away if someone moves the mouse (which causes the monitor to be switched on, the graphical screensaver to be started again, hence we then _have_ an active X11 client again, which somehow breaks the busy loop). Unfortunately it's slightly annoying to reproduce as it needs some time after the monitor is switched off to materialize. I.e. it doesn't start immediately after the monitor is switched off, but only after some time (Richard tells me that he had to move the mouse like every 20 minutes which would translate to something like 10 minutes after the monitor is switched off again). I'm fairly sure this didn't happen in 11.1 beta4. xorg-x11-server-7.4-17.3 x11-video-nvidiaG02-180.22-4.1 nvidia-gfxG02-kmp-pae-180.22_2.6.27.7_9.1-4.1 (yes, nvidia) -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=473302 User matz@novell.com added comment https://bugzilla.novell.com/show_bug.cgi?id=473302#c5 Michael Matz <matz@novell.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |lnussel@novell.com --- Comment #5 from Michael Matz <matz@novell.com> 2009-02-06 08:21:07 MST --- More data points: Ludwig seems to have the same problem (also KDE3, but intel graphics on his laptop). Not reproducible with just 'xset dpms ...' it seems, though on this particular bug it can also mean that we just didn't wait long enough for it to show up. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=473302 User matz@novell.com added comment https://bugzilla.novell.com/show_bug.cgi?id=473302#c6 --- Comment #6 from Michael Matz <matz@novell.com> 2009-02-06 09:53:27 MST --- I can reproduce it now. From a remote machine I simply do xset dpms force standby KDE3 will notice this, and starts kdesktop_lock and the kss module (the program that paints something nice), immediately SIGSTOPs that kss module (because due to DPMS it wouldn't be visible anyway), and then the countdown to the bug starts. stracing X then reveals that the select() loop starts with a 870 something seconds timeout, which is reduced with each call to select, by the amount it took for the select to return plus 20 milli-seconds. So, for the first ~870 seconds after the screensaver becomes active all is well, as there's still a time-out so select only returns seldomly. After these ~870 seconds the timeout value is 0, and _doesn't_ get reset to something larger again. From then on X will take 100% CPU. I can reproduce this reliably, but it takes some time to show up (exactly the ~870 seconds). I've tried looking at the patches that went in after beta4. The only thing which barely is related to screen-saving seems to be dpms_screensaver.diff, from 2008-11-28. I think this was shortly after beta4 was released. The patch itself looks innocent (just introduces a call to dixSaveScreens()), but perhaps it triggers a bug somewhere else (in the timeout calculations for select). [probably irrelevant: grepping for "select" in all diffs of xorg-x11-server reveals some occurences in xorg-server-xf4vnc.patch and xorg-server-xf4vnc-busyloop.diff, where at least some in xf4vnc.patch also explicitely use a {0,0} timeout. That at least looks dubious, as that usually is not what one wants. But as this is vnc, it's most probably unrelated.] -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=473302 User matz@novell.com added comment https://bugzilla.novell.com/show_bug.cgi?id=473302#c7 --- Comment #7 from Michael Matz <matz@novell.com> 2009-02-06 10:26:56 MST --- Man! WaitFor.c is terrible spaghetti code. I would see several reasons why the select timeout could be set to zero in WaitForSomething. In the SMART_SCHEDULE code, if there are any ClientsWithInput, or in general if there are timers with an expire time in the past. It might be that when the screen is off (DPMS on) that either the timers don't go away (even when they are expired) or that ClientsWithInput never goes empty. Presumably that would make sense when the DPMS code wants to ensure that when the display is off that as few CPU is used as possible, and due to a bug there's possibly too much pruning or early-out so that events which would normally remove expired timers or clients with input don't take action. The dpms_screensaver.diff introduces a call to dixSaveScreens() which wasn't there before (and without error handling) which does fiddle with timers a bit, so maybe that's a lead. Or possibly I'm completely off :-) -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=473302 Stefan Dirsch <sndirsch@novell.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC|sndirsch@novell.com |lverhaegen@novell.com AssignedTo|bnc-team-xorg-bugs@forge.pr |eich@novell.com |ovo.novell.com | -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=473302 User lnussel@novell.com added comment https://bugzilla.novell.com/show_bug.cgi?id=473302#c8 --- Comment #8 from Ludwig Nussel <lnussel@novell.com> 2009-02-09 02:32:09 MST --- I could reproduce using "xset dpms force standby" and patient waiting now. The behavior seen in strace is different for me though. select is called with one second timeout {1, 0} and then all of a sudden the timeout is lower {0, 726000} once and in all following calls zero. Just before the broken select call X does this: writev(40, [{"B\1\232\0\1\0\200\3\0\0\0\0\221\274\r\0\0\0\0\0\217\274\r\0\340\212 \0\1\307\240\267", 32}], 1) = 32 fd 40 is one of the many fds on /tmp/.X11-unix/X0 -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=473302 User matz@novell.com added comment https://bugzilla.novell.com/show_bug.cgi?id=473302#c9 --- Comment #9 from Michael Matz <matz@novell.com> 2009-02-09 03:41:52 MST --- So that means your initial select timeout is 1 second (where it is about 870 for me)? That's fully possible, but still would hint at the same problem. X has a list of timers (with an expire time). The select timeout is calculated in such a way that it times out at or before the next timer has to fire. The list of timers depend on the applications running on the display, so slight changes there are expected. But the general theme is still the same. My assumption is that this list of timers is not cleaned up in the bug case, so that even already expired timers stay in the list. Due to compensation code for negative offsets this would lead exactly to the picture of zero-timeout selects. (the other possibility for that to happen is with the SMART_SCHEDULE code as mentioned above) Btw. Ludwig: You can find out which application is connected to the socket corresponding to FD 40 in Xorg in the following way: # lsof -p $(pidof X) | grep '40u.*unix' X 6191 root 40u unix 0xf7c94280 0t0 20618 /tmp/.X11-unix/X0 # lsof | grep 20617 kopete 6598 matz 4u unix 0xf7c94080 0t0 20617 socket What you need to look out for is a socket filedescriptor with an inode of one less than the inode number of the socket on the X11 side (above 20618 on X side, 20617 on kopete side). This obviously depends on how the kernel invents these inode numbers, but for my kernel (32bit kernel-pae-2.6.27.7-9.1) it uses the above rules. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=473302 User lnussel@novell.com added comment https://bugzilla.novell.com/show_bug.cgi?id=473302#c10 --- Comment #10 from Ludwig Nussel <lnussel@novell.com> 2009-02-09 05:32:55 MST --- Thanks for the hint. The receiving process was kded4. That suspicious write didn't show up again though. Yet if I kill kde4 stuff (kded4, kdeinit4) the busy loop stops. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=473302 User matz@novell.com added comment https://bugzilla.novell.com/show_bug.cgi?id=473302#c11 --- Comment #11 from Michael Matz <matz@novell.com> 2009-02-09 07:43:37 MST --- And you can't reproduce it anymore when kde{d,init}4 isn't running anymore? Then this one triggers the bug in Xorg, maybe a nice hint. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=473302 User captain.magnus@opensuse.org added comment https://bugzilla.novell.com/show_bug.cgi?id=473302#c12 Magnus Boman <captain.magnus@opensuse.org> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |captain.magnus@opensuse.org --- Comment #12 from Magnus Boman <captain.magnus@opensuse.org> 2009-02-09 14:50:12 MST --- This bug sounds related (if not identical) to https://bugzilla.novell.com/show_bug.cgi?id=472046 (I have the exact same trace as described in the description). The patch attached in bug#472046 fixes it for me. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=473302 Stefan Dirsch <sndirsch@novell.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Priority|P5 - None |P3 - Medium CC| |sndirsch@novell.com -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=473302 User sndirsch@novell.com added comment https://bugzilla.novell.com/show_bug.cgi?id=473302#c13 Stefan Dirsch <sndirsch@novell.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |DUPLICATE --- Comment #13 from Stefan Dirsch <sndirsch@novell.com> 2009-02-28 10:07:49 MST --- dup. *** This bug has been marked as a duplicate of bug 472046 *** https://bugzilla.novell.com/show_bug.cgi?id=472046 -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
participants (1)
-
bugzilla_noreply@novell.com