
Hi, I need some advice on profiling. I'm comparing the running times of two simulators, call them algo1 and algo2 (a slightly modified version of algo1). My problem, in summary, is that on one particular platform configuration the runtime of algo2 is almost 3x longer than that of algo1, yet profiling shows that algo2 calls the hotspot function only 22% more often than algo1. In addition, a single call to the hotspot function should cost the same in both algo1 and algo2 (explained later). So the question is: why does the runtime of algo2 increase 3x? If the runtime increased by 20% to, say, 50%, I could understand it, but 300% is a bit difficult to accept.

The tests were performed on three platform configurations:

1. P4 3 GHz (HT enabled, 1 MB cache), SuSE 9.3, glibc-2.3.4-23.4, and gcc-3.3.5-5.
2. Xeon 2.8 GHz (512 KB cache), Fedora Core 3, glibc-2.3.6-0.fc3.1, and gcc-3.4.4-2.fc3.
3. Opteron 2.1 GHz (1 MB cache), Fedora Core 4, glibc-2.3.5-10.3, and gcc-4.0.2-8.fc4.

Each configuration has >= 1 GB of memory, and each run of algo1 or algo2 consumes about 256-300 MB, so overhead due to swapping is not a problem (later confirmed by the 100% CPU utilization reported by /usr/bin/time). Also, only config-3 is a 64-bit platform. The runtime of each program on each configuration is shown below:

- Config-1:
  * algo1: user=1445.93 s, sys=11.33 s, real=23 m 54.02 s
  * algo2: user=4311.42 s, sys=17.92 s, real=1 h 11 m 41 s
- Config-2:
  * algo1: user=1926.66 s, sys=18.85 s, real=32 m 14.45 s
  * algo2: user=1897.31 s, sys=17.94 s, real=31 m 46.54 s
- Config-3:
  * algo1: user=1413.00 s, sys=9.18 s, real=23 m 41.39 s
  * algo2: user=1214.51 s, sys=6.35 s, real=20 m 21.07 s

It can be seen that on config-1 (P4, SuSE), algo2 takes almost 3x longer than algo1. On the other configurations, however, both versions take roughly the same amount of time. Here are some snippets of the gprof profiling results for algo1 and algo2, both on config-1.
/*************************** Algo 1 ******************************/
FLAT PROFILE:
  %    cumulative    self                  self    total
 time    seconds    seconds      calls    s/call  s/call  name
21.22    2323.95    2323.95  1678059784    0.00    0.00   ineq_compare
 9.43    3357.38    1033.43  1538051170    0.00    0.00   biguint_equal
 6.52    4071.34     713.96   140185901    0.00    0.00   getNeighbors1
 ...
 0.00   10954.11       0.00           1    0.00    0.00   initRand
 0.00   10954.11       0.00           1    0.00    0.08   initStateDiagrams

CALL GRAPH:
granularity: each sample hit covers 4 byte(s) for 0.00% of 15130.79 s
index  % time    self  children       called              name
                                                          <spontaneous>
[1]      91.2  143.11  13656.78                           main [1]
               129.54  12024.06    61206914/61206914      processRequest [2]
                94.21    404.78   165336753/165336753     getEvent [21]
 ...
/*****************************************************************/

/*************************** Algo 2 ******************************/
FLAT PROFILE:
  %    cumulative    self                  self    total
 time    seconds    seconds      calls    s/call  s/call  name
21.79    2494.03    2494.03  2057510279    0.00    0.00   ineq_compare
 9.35    3564.45    1070.41  2020502220    0.00    0.00   biguint_equal
 6.77    4339.43     774.98   148414184    0.00    0.00   getNeighbors1
 ...
 0.00   11444.75       0.00           1    0.00    0.00   initRand
 0.00   11444.75       0.00           1    0.00    0.04   initStateDiagrams

CALL GRAPH:
granularity: each sample hit covers 4 byte(s) for 0.00% of 14820.76 s
index  % time    self  children       called              name
                                                          <spontaneous>
[1]      90.8  136.76  13327.52                           main [1]
               129.92  12040.16    44425464/44425464      processRequest [2]
                70.57    283.72   123000389/123000389     getEvent [25]
/*****************************************************************/

Some observations that I've made are as follows. First, the hotspot is ineq_compare. In algo2 it is called 22% more often than in algo1. This means that for algo2 to take almost 3x longer than algo1, there would have to be a significant increase in the cost of a single call to algo2's ineq_compare. Even though gprof does not show whether total_s/call increased in algo2, from the source code I'm sure that self_s/call does not change between algo1 and algo2.
Function ineq_compare takes two big unsigned integers (biguint) as its parameters and compares them. Each biguint is implemented as an array of long long unsigned int. The function remains the same in both versions, and the array in each biguint consists of two elements; hence the conclusion that each call to ineq_compare costs the same in both algo1 and algo2.

Second, gprof shows that main() takes roughly the same amount of time in both algo1 and algo2. I would have expected the profile to show main() in algo2 as significantly longer than in algo1.

Because of the results on the two other platforms, I'm tempted to "blame" either the hardware (P4) or the software platform (the older glibc and gcc) :D, but somehow I'm still disinclined to do that. Any ideas on how to better find out the cause of the slowdown?

Regards,
Verdi

--- Original Message ---
From: "Verdi March" <cincaipatron@gmx.net>
To: suse-programming-e@suse.com
Subject: [suse-programming-e] Profiling problem
Date: Fri, 9 Dec 2005 04:25:44 +0100 (MET)
Even though gprof couldn't show whether total_s/call increased in algo2, from the source code I'm sure that self_s/call does not change between algo1 and algo2.
Should read as: Even though gprof couldn't show whether total_s/call increased in algo2 (it is shown as 0.00 s), from the source code I'm sure that total_s/call does not change between algo1 and algo2.

Regards,
Verdi

Recompiling seems to fix this problem. When compiled with gcc-3.3.5 (SuSE 9.3), the program takes more than one hour to complete on both the P-IV machine (SuSE 9.3) and the Xeon (FC4). The same program, compiled with gcc-3.4.4 (FC4), takes about 30 minutes on both the P-IV and the Xeon. I'll try to recompile on SuSE 10.0 and run the program on SuSE 9.3 to make sure that gcc is really the cause of the performance degradation.

Regards,
Verdi