[opensuse-kernel] user/sys ticks for process exceed overall user/sys ticks
Hello everyone! (I recently posted about this on linux-kernel, but got no responses. I would be very grateful for any pointers.)

I am trying to determine what fraction of the whole system's user and sys time is consumed by a specific process over a period of time. Hence, I read the respective tick counts from the first line of /proc/stat and from /proc/<pid>/stat at the start and end of that period and compute deltas.

However, on an otherwise idle system, I often see the ticks consumed by a threaded process exceed those of the whole system. The difference is not just a few ticks, but can amount to over 10%, regardless of how long the observed time period is. There is no virtualization involved. I have been able to reproduce this with the current openSUSE and Ubuntu releases on different x86_64 machines.

How can this be explained? Intuitively, I would expect that a tick accounted to a running task is at the same time accounted to the CPU it is running on, so a process can never be accounted more ticks than the whole system, i.e. all CPUs.

I have attached a simple Python script to reproduce this behavior. Example output:
./ticks.py
delta process: user: 2495 sys: 181
delta all: user: 2283 sys: 258
Thanks, Peter
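(The attached ticks.py is not reproduced in this archive. Below is an untested Python sketch of what such a script could look like; the /proc field positions follow proc(5), and the thread count and MD5 workload mirror the Perl port posted later in the thread.)

#!/usr/bin/env python3
# Untested sketch, not the original attachment.
import hashlib
import threading

def ticks_all():
    # First line of /proc/stat: "cpu  user nice system idle ..."
    with open("/proc/stat") as f:
        fields = f.readline().split()
    return int(fields[1]), int(fields[3])       # user, system

def ticks_process():
    # /proc/self/stat: utime and stime are fields 14 and 15 (1-based),
    # i.e. indexes 13 and 14 after splitting on whitespace.
    with open("/proc/self/stat") as f:
        fields = f.readline().split()
    return int(fields[13]), int(fields[14])     # utime, stime

def dowork():
    d = hashlib.md5()
    d.update(b"nobody inspects")
    for _ in range(10000000):
        d.update(b" the spammish repetition")

before_all, before_proc = ticks_all(), ticks_process()

threads = [threading.Thread(target=dowork) for _ in range(9)]
for t in threads:
    t.start()
for t in threads:
    t.join()

after_all, after_proc = ticks_all(), ticks_process()

print("delta process: user:", after_proc[0] - before_proc[0],
      "sys:", after_proc[1] - before_proc[1])
print("delta all: user:", after_all[0] - before_all[0],
      "sys:", after_all[1] - before_all[1])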
Peter Hofer wrote:
I have attached a simple Python script to reproduce this behavior. Example output:
./ticks.py
delta process: user: 2495 sys: 181
delta all: user: 2283 sys: 258
FWIW, I just ran this on

Linux pcx36 3.2.0-58-generic #88-Ubuntu SMP Tue Dec 3 17:40:43 UTC 2013 i686 i686 i386 GNU/Linux

and saw

$ ./ticks.py
delta process: user: 10539 sys: 112
delta all: user: 10700 sys: 153

So two questions occur to me: in what proportion of runs do you see the phenomenon, and what kernel are you running?
$ ./ticks.py
delta process: user: 10539 sys: 112
delta all: user: 10700 sys: 153
So two questions occur to me: in what proportion of runs do you see the phenomenon, and what kernel are you running?
I just ran the script ten times on my quad-core computer here, which is mostly idle (logged into KDE, with some applications open). The phenomenon occurred in every run, with the process having around 150 to 200 more user ticks than the entire system is reported to have.

I'm running openSUSE 13.1 with the current kernel from the update repository:

$ uname -a
Linux sunflower 3.11.10-7-desktop #1 SMP PREEMPT Mon Feb 3 09:41:24 UTC 2014 (750023e) x86_64 x86_64 x86_64 GNU/Linux
Peter Hofer wrote:
./ticks.py
delta process: user: 2495 sys: 181
delta all: user: 2283 sys: 258
Thanks, Peter
What I'm wondering is why your totals are nearly the same as a single user's ticks. I.e. you have 4 cores, shouldn't the total be about 4x the single user/single-thread ticks?
Linda Walsh wrote:
Peter Hofer wrote:
./ticks.py
delta process: user: 2495 sys: 181
delta all: user: 2283 sys: 258
Thanks, Peter
What I'm wondering is why your totals are nearly the same as a single user's ticks. I.e. you have 4 cores, shouldn't the total be about 4x the single user/single-thread ticks?
Silly me -- I should have read the documentation before responding.

Those are not ticks. /proc/stat returns clock ticks: "Time units are in USER_HZ (typically hundredths of a second)". Vs. the numbers in /proc/self/stat, which are in "jiffies". I don't think there is an established formula for comparing jiffies & clock ticks. Has to do with variable CPU clock speeds. You need to use the system call for getting actual time values back.

More interesting... you are running 8 threads. It's interesting to see how well python does parallelization. Set (in bash):
export TIMEFORMAT="%2Rsec %2Uusr %2Ssys (%P%% cpu)"
Then run with:
time python ticks.py

time python ticks.py
delta process: user: 9677 sys: 3221
delta all: user: 6263 sys: 2370
69.19sec 96.78usr 32.22sys (186.45% cpu)
----
Your 8 threads utilize about 1.9 cores.

You probably didn't get an answer on the kernel list because your initial statement of the problem was wrong (i.e. you hadn't read the documentation)... but that's just a guess -- you never know why people don't answer.
What I'm wondering is why your totals are nearly the same as a single user's ticks. I.e. you have 4 cores, shouldn't the total be about 4x the single user/single-thread ticks?
---- Silly me -- I should have read the documentation before responding.
Those are not ticks. /proc/stat returns clock ticks: "Time units are in USER_HZ (typically hundredths of a second)". Vs. the numbers in /proc/self/stat, which are in "jiffies".
I don't think there is an established formula for comparing jiffies & clock ticks. Has to do with variable CPU clock speeds. You need to use the system call for getting actual time values back.
Different units of measurement sound like a good explanation. I did read proc(5), and assumed that both /proc/stat and /proc/self/stat times are specified in clock ticks, since the manpage gives sysconf(_SC_CLK_TCK) as the scaling factor for both.

I looked up the relevant parts of the kernel sources today and got the impression that at least the accounting happens entirely in jiffies. From what I can gather, account_user_time() in kernel/sched/cputime.c is used to update the times of the task, the task group and the entire CPU, and is always called with arguments of cputime_one_jiffy and cputime_to_scaled(cputime_one_jiffy). The cputime_to_scaled() macro doesn't seem to do anything, however.

Next, I looked at do_task_stat() in fs/proc/array.c, where the /proc/<pid>/stat content is generated. The user and system time are obtained via task_cputime_adjusted() or thread_group_cputime_adjusted(). From what I can gather, the latter is used for the entire process. Either way, the returned values appear to be the ones updated with jiffy counts from account_user_time(), and they are converted via cputime_to_clock_t() for output.

show_stat() in fs/proc/stat.c generates the /proc/stat content and prints the CPU-wide values, updated indirectly from account_user_time(), and also converts them via cputime_to_clock_t().

I can easily have missed something -- and if I have, please tell me -- but it seems to me that user and system times from /proc/stat and /proc/<pid>/stat should have matching units. My next best guess is that there are things attributed to the utime/stime of a process which are attributed to different activities at the CPU level. However, if there is some system call or library function that gives perfectly matching per-process and CPU-wide times, I'd be incredibly thankful as well!
More interesting... you are running 8 threads. It's interesting to see how well python does parallelization. [...] Your 8 threads utilize about 1.9 cores.
Interesting! That's probably because of CPython's global interpreter lock (GIL). I didn't really have full utilization in mind for that script, since the effect seems to occur whenever there's more than one thread involved. Python's multiprocessing package should offer better utilization since it avoids the GIL by spawning processes instead of threads, which in this case would defeat the purpose of the script.
Peter Hofer wrote:
show_stat() in fs/proc/stat.c generates the /proc/stat content and prints the CPU-wide values, updated indirectly from account_user_time(), and also converts them via cputime_to_clock_t().
I can easily have missed something -- and if I have, please tell me -- but it seems to me that user and system times from /proc/stat and /proc/pid/stat should have matching units. My next best guess is that there are things attributed to the utime/stime of a process which are attributed to different activities on the CPU level.
Double and triple sigh! The docs out of date w/r/t the code? Unheard of!! ;^)

But the jiffies would seem to be a questionable indicator of cpu time, since they don't seem to be based on an absolute time, but on an amount of cpu time. That amount would vary based on what speed each of the cores is running at. I don't know about your specific cpu model, but there are things like 'turbo mode', where the clock runs at 1-4 (or more) added multiples of 133MHz when some or all of the other cores are idle. 'cpufreq-info' shows my cpu with:

maximum transition latency: 10.0 us.
hardware limits: 1.60 GHz - 2.79 GHz
available frequency steps: 2.79 GHz, 2.66 GHz, 2.53 GHz, 2.39 GHz, 2.26 GHz, 2.13 GHz, 2.00 GHz, 1.86 GHz, 1.73 GHz, 1.60 GHz

or 12x - 21x the base bus speed of 133MHz. My chip, an X5660, while rated at 2.8GHz, can go as fast as 24x the base bus speed, or 3.2GHz, in turbo mode when only 1 core is active. Also of note is the 10.0 us maximum transition latency -- which can skew the numbers a bit depending on how often the cpu transitions.

Maybe if you disabled power control and forced your cpu into 1 speed the numbers might get closer?

Erratum depressing:
------------------
The worst news to me is how accurate perf events and timer events really are. The processor erratum for the 5600 series includes dropped interrupts for the timer, among others. (See http://www.intel.com/content/dam/www/public/us/en/documents/specification-up... for the full 5-page erratum, which is rather depressing.)

Thread v. Procs & parallelism
-----------------------------
interesting to see how well python does parallelization. [...] Your 8 threads utilize about 1.9 cores.
Interesting! That's probably because of CPython's global interpreter lock (GIL). I didn't really have full utilization in mind for that script, since the effect seems to occur whenever there's more than one thread involved.
Python's multiprocessing package should offer better utilization since it avoids the GIL by spawning processes instead of threads, which in this case would defeat the purpose of the script.
---
But mapping it to procs would be a better mapping to reality on linux, as linux threads are built on procs with varying amounts of memory shared. On linux, there's rarely a use case where threads will outperform procs. On windows it's a different matter -- there thread spawning is cheap vs. process creation. But I think, because linux process creation is so efficient vs. windows, that even building threads on top of procs, linux's thread creation time is still faster than windows' thread-creation time. That wasn't always the case, but more work was put into the kernel to optimize thread creation, as threads were notably "some 'delta'" above the process creation time when they first came out. I think they might be some small percent faster now, but I wouldn't wager one way or the other -- certainly nowhere near the benefit threads are on windows.

Perl, BTW, went the way of linux -- building its threads on procs, which is a likely contributing cause to their efficiency on linux.

----
How timing is calculated...(?!)
--------------------------------
Oh, another "gotcha" is how timing is done. During boot, on my system, I see:

[ 0.000000] hpet clockevent registered
[ 0.000000] tsc: Fast TSC calibration using PIT
[ 0.001000] tsc: Detected 2792.844 MHz processor
[ 0.000004] Calibrating delay loop (skipped), value calculated using timer frequency.. 5585.68 BogoMIPS (lpj=2792844)

To calculate jiffies (I think lpj = loops/jiffy), it appears to be using the Hz of the processor. Then sometime later I see:

[ 1.307104] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0, 0
[ 1.312644] hpet0: 4 comparators, 64-bit 14.318180 MHz counter
[ 1.320680] Switched to clocksource hpet

So the jiffies, calculated via the processor's speed, are counted in terms of the HPET clocksource, which is accurate to about .014ms. @ 5586 jiffies, that equates to a maximum accuracy of ~80 jiffies (79.98) / tick. That alone could account for quite a skew.

Given that skewing, and the erratum problems, I wouldn't expect too much in the way of accurate timing info on one of these processors... *cough*...(sigh)
Linda Walsh <suse@tlinx.org> wrote:
I can easily have missed something -- and if I have, please tell me -- but it seems to me that user and system times from /proc/stat and /proc/pid/stat should have matching units. My next best guess is that there are things attributed to the utime/stime of a process which are attributed to different activities on the CPU level.
Double and triple sigh! The docs out of date w/r/t the code? Unheard of!! ;^) But the jiffies would seem to be a questionable indicator of cpu time, since they don't seem to be based on an absolute time, but on an amount of cpu time. That amount would vary based on what speed each of the cores is running at.
I don't know about your specific cpu model, but there are things like 'turbo mode', where the clock runs at 1-4 (or more) added multiples of 133MHz when some or all of the other cores are idle.
I am testing on an Intel Core i7-4770, a quadcore with hyperthreading and turbo mode.
Maybe if you disabled power control and forced your cpu into 1 speed the numbers might get closer?
I tried a number of approaches today: using the cpufreq "performance" governor to disable frequency scaling, disabling turbo mode in the firmware, disabling hyperthreading, taking cores and hardware threads offline, pinning the process to specific cores and/or hardware threads using taskset, and combinations of these things. The only time the CPU and process times matched was when I was taking all but one virtual CPU (= hardware thread) offline, or when I was pinning the thread to exactly one specific virtual CPU. I didn't see any significant improvement with anything else I tried.
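(For reference, the pinning and offlining steps can also be scripted. A minimal, Linux-only sketch is below; the CPU numbers are arbitrary examples, offlining requires root, and os.sched_setaffinity needs Python 3.3+.)

import os

# Pin the current process -- and any threads it spawns afterwards -- to
# virtual CPU 0 only, roughly equivalent to "taskset -c 0".
os.sched_setaffinity(0, {0})

# Take virtual CPU 3 offline (example number; CPU 0 usually cannot be
# offlined, and writing here requires root).
with open("/sys/devices/system/cpu/cpu3/online", "w") as f:
    f.write("0")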
Erratum depressing: The worst news to me is how accurate perf events and timer events really are. The processor erratum for the 5600 series includes dropped interrupts for the timer, among others. [...]
That does sound bad. But while this could skew the results, I think missed timer interrupts should result in lost jiffies for both the process and CPU counts.
Python's multiprocessing package should offer better utilization since it avoids the GIL by spawning processes instead of threads, which in this case would defeat the purpose of the script.
But mapping it to procs would be a better mapping on linux to reality, as linux threads are built on procs with varying amounts of memory shared.
[...]
Perl, BTW, went the way of linux -- building its threads on procs, which is a likely contributing cause to their efficiency on linux.
That is true; however, the application(s) that I want to observe this way already exist and make extensive use of threads. On the matter of processes vs. threads, though -- it might be that the continuous summation of per-task (i.e. per-thread) counts into the per-process counts is inexact somehow. I'll test tomorrow whether there is a significant deviation between the sum of per-thread utimes and the process utime.
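(An untested sketch of that comparison is below. It assumes the comm field in the stat files contains no spaces, and it ignores time already accumulated from threads that have exited.)

import os

def utime(stat_path):
    # utime is index 13 after a whitespace split (see proc(5)).
    with open(stat_path) as f:
        return int(f.readline().split()[13])

def compare(pid):
    process = utime("/proc/%d/stat" % pid)
    per_thread = sum(utime("/proc/%d/task/%s/stat" % (pid, tid))
                     for tid in os.listdir("/proc/%d/task" % pid))
    print("process utime:", process, "sum of thread utimes:", per_thread)

compare(os.getpid())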
Oh, another "gotcha", is how timing is done. During boot, on my system, I see: [...]
So the jiffies, calculated via the processor's speed, are counted in terms of the HPET clocksource, which is accurate to about .014ms. @ 5586 jiffies, that equates to a maximum accuracy of ~80 jiffies (79.98) / tick.
That, alone could account for quite a skew.
That is significant. However, I would expect that it also skews both the per-process and CPU-wide jiffy counts.
Given that skewing, and the erratum problems, I wouldn't expect too much in the way of accurate timing info on one of these processors... *cough*...(sigh)
Actually, I'm not really after highly accurate timing or resource usage data. What I actually want is to estimate how much particular processes or threads in a guest VM are affected by time stolen by the hypervisor. My idea is to see what fractions of the entire guest's user and system time over short periods are consumed by which tasks, and then attribute the steal time recorded in the same period to them, at the same ratios. The emphasis here being strongly on the word "estimate" ;)
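(Roughly what I have in mind, as an untested sketch: the steal field is the eighth time value on the cpu line of /proc/stat per proc(5), and the proportional split is, as said, only an estimate.)

def cpu_busy_and_steal():
    # cpu line of /proc/stat: user nice system idle iowait irq softirq steal ...
    with open("/proc/stat") as f:
        v = [int(x) for x in f.readline().split()[1:]]
    return v[0] + v[2], v[7]                    # user+system, steal

def task_busy(pid):
    with open("/proc/%d/stat" % pid) as f:
        fields = f.readline().split()
    return int(fields[13]) + int(fields[14])    # utime + stime

# Sample both counters at the start and end of a short interval, then
# attribute the interval's steal ticks to the task in proportion to its
# share of the guest's busy (user+system) ticks.
def estimated_task_steal(cpu_before, cpu_after, task_before, task_after):
    busy_delta = cpu_after[0] - cpu_before[0]
    steal_delta = cpu_after[1] - cpu_before[1]
    task_delta = task_after - task_before
    if busy_delta <= 0:
        return 0.0
    return steal_delta * float(task_delta) / busy_delta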
Peter Hofer wrote:
Actually, I'm not really after highly accurate timing or resource usage data. What I actually want is to estimate how much particular processes or threads in a guest VM are affected by time stolen by the hypervisor. My idea is to see what fractions of the entire guest's user and system time over short periods are consumed by which tasks, and then attribute the steal time recorded in the same period to them, at the same ratios. The emphasis here being strongly on the word "estimate" ;)
I think your best bet is with the hyper-threads off. Hyper-threads aren't real threads for most code, even though the kernel does have code to try to optimize their scheduling. For them to be a benefit, your code has to be executing pretty much entirely out of cache, and only some operations can be done in parallel in another hyperthread.

I could see the possibility of ticks being allocated to hyperthreads that overlap x% with the other hyperthread on the same core, and ending up with more user ticks than overall ticks. Maybe the number of extra ticks with hyperthreads is related to how much extra cpu capacity is made available by hyperthreading (but I think that's optimistic given the benchmark). I.e. you couldn't expect 2x ticks in 1x time from hyperthreads, as they can't utilize the full cpu. Vs. with multiple cores, you get Nx cpu time / real time where N = #cores... so software that shows cpu time/real time as a % will show, for 4 cores, up to 400% usage. The bad habit, mostly introduced by MS, of dividing by the number of cores loses information and precision -- to the point that on a 1024-cpu machine, 10 fully loaded cores would only show 1% cpu usage on an MS machine... something indistinguishable from the noise level.

I suspect your hypervisor overhead will correlate to the number of VMs contending for resources, with the lowest overhead for 1:1 mappings.

Good luck...
Peter Hofer wrote:
./ticks.py
delta process: user: 2495 sys: 181
delta all: user: 2283 sys: 258
Sorry about the multiple replies, but I was curious if perl would do better:
time python ticks.py
delta process: user: 9263 sys: 2987
delta all: user: 6034 sys: 2178
67.35sec 92.64usr 29.89sys (181.94% cpu)
time perl /tmp/pticks
delta process: user: 2917 sys: 3
delta all: user: 2926 sys: 25
3.36sec 29.20usr 0.03sys (870.05% cpu)
If you want to test parallelism, you might want to try perl. For 9 threads:

lang     #thrds   #cores used   %efficiency
python   9        1.82          20.2%
perl     9        8.70          96.7%

(2 CPUs @ 6 cores each @ 1.6-2.8GHz, using the on-demand governor.)

Source attached. I tried to stay as close as possible to the semantics of the python program. I even used python indentation where practical (I did split the prints at the end... something python seems to have problems with...).

#!/usr/bin/perl
use 5.16.0;
use threads;
use P;

sub open_for_read($) {
    open(my $handle, "<$_[0]") or die "opening $_[0]: $!";
    $handle
}

sub ticks_all {
    my $f = open_for_read("/proc/stat");
    return (split ' ', <$f>)[1,3]
}

sub ticks_process() {
    my $f = open_for_read("/proc/self/stat");
    return (split ' ', <$f>)[13,14]
}

sub dowork () {
    use Digest::MD5;
    my $d = Digest::MD5->new;
    $d->add('nobody inspects');
    $d->add(' the spammish repetition') for (0 .. 10_000_000)
}

my ($before_all_user, $before_all_sys) = ticks_all();
my ($before_process_user, $before_process_sys) = ticks_process();

my @threads;
for my $i (0 .. 8) {
    my $t = threads->create(\&dowork);
    push @threads, $t
}
$_->join() foreach @threads;

my ($after_all_user, $after_all_sys) = ticks_all();
my ($after_process_user, $after_process_sys) = ticks_process();

# (note: changing perl defaults for print)
$, = " ";   # put spaces between output fields
$\ = "\n";  # add LF to the end of lines by default

print 'delta process: user:', $after_process_user - $before_process_user,
      ' sys:', $after_process_sys - $before_process_sys;
print 'delta all: user:', $after_all_user - $before_all_user,
      ' sys: ', $after_all_sys - $before_all_sys;
participants (3):
- Dave Howorth
- Linda Walsh
- Peter Hofer