This sounds like the reason benchmarks are usually run three or more times with the first run thrown out. The first run has to load its data (reading files off disk into the cache). The next run will not have as much disk I/O and, if the test is small enough, may read straight from memory, which is significantly faster than disk. After a reboot or a power-off the cache is empty, so all data has to be read into memory again.

-Alain.

-----Original Message-----
From: fabrice piccini [mailto:fpiccini@nerim.net]
Sent: Thursday, June 30, 2005 12:15 PM
To: suse-amd64@suse.com
Subject: Re: [suse-amd64] Puzzling wild variation of execution time

Sheo Shanker Prasad wrote:
I will greatly appreciate help in understanding the puzzling wild variation in the execution time of small test/benchmark program, as explained below.
(1) The nature of the benchmark/test program:
A floating-point-intensive atmospheric chemistry modeling task. It used to take, consistently, around 2m 37s on my machine when it was running under SuSE 9.1 Pro. When I run the benchmark program, I shut off all other programs, so during its run it is the only active program. The remaining processes belong to root and are sleeping. There is no other user.
(2) The machine:
Dual Opteron 250 with 4 GB of PC3200 (DDR) memory in 4 DIMMs (two on cpu0 and two on cpu1), installed in a Tyan Thunder K8 system board with American Megatrends AMIBIOS V2-04. It now runs SuSE 9.3 Pro with Linux version 2.6.11.4-21.7-smp (geeko@buildhost) (gcc version 3.3.5 20050117 (prerelease) (SUSE Linux)) #1 SMP Thu Jun 2 14:23:14 UTC 2005
(3) Background of the problem:
Only 4 months after the vendor delivered the machine, the motherboard failed. The machine was sent back to the vendor, who replaced the motherboard with a new one and installed SuSE 9.3 Pro to bring the machine up to date, with Linux version 2.6.11.4-21.7-smp (geeko@buildhost) (gcc version 3.3.5 20050117 (prerelease) (SUSE Linux)) #1 SMP Thu Jun 2 14:23:14 UTC 2005
(4) The problem:
After the repair and the upgrade to 9.3 Pro, the execution time of the same test/benchmark program varies widely from run to run. It is never consistent. It ranges from 2m 37s to more than 6m. There is a pattern in the variation: usually, the longest execution time occurs when the machine is started after being shut off for a few hours, and the shortest is obtained after the machine has been on continuously for about 7 to 10 hours.
(5) My puzzle:
While computers are well known to give reproducible results, the wild variation by a factor of almost 3 in the execution time is very disturbing.
What could be the cause? Does it mean that some vital component involved in executing the numerical model (the RAM or the memory on the CPUs) is loose or erratic and may be about to go bad? I am just puzzled. Any clue from you all will be very much appreciated.
Well, I'm not an expert on this kind of problem, but I'm still interested in problem solving...

Everything you said makes me think of some kind of memory leak (stabilization of the execution time after a long period of uptime), supposing your algorithm consumes a lot of memory...

If this is true (memory leak), the next step is to determine whether the culprit is the kernel (9.1 was 2.6.5, 9.3 is 2.6.11; kernel experts, please: are there a lot of changes in memory management between these two versions, especially involving SMP systems?) or maybe the compiler (did you compile your algorithm yourself? did you recompile it after the installation of SuSE 9.3? is it written in C, C++, Fortran, other?).

Another possibility is a hardware/BIOS-related problem (or incorrect BIOS settings), but I don't have any idea how to diagnose a faulty CPU... Tyan hardware experts, please, it's up to you...

Ah, I've just reread your post. You said the longest time is when the machine is started after a long shut-off... What happens if, after a long period of uptime, you reboot the machine and immediately run your algorithm?
- If you get the same good results (2'37"), then you most likely have a hardware problem (a thermal stabilization problem on some component?).
- If you get bad results (the longest time), then it's most likely a software-related problem (BIOS/kernel/compiler).

Sorry not to be able to help you more... that was only to start the discussion... now it's time for the true experts.

cheers,
fabrice
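
One way to check the cold-cache explanation from the first reply is to time several back-to-back runs and report the first one separately from the rest. The following is only a minimal Python sketch under stated assumptions: the function names `time_runs` and `summarize` are made up for illustration, and the command in the example is a trivial placeholder, since the thread never names the benchmark binary.

```python
import statistics
import subprocess
import sys
import time


def time_runs(cmd, runs=4):
    """Run cmd `runs` times back to back; return each wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True makes a failing run raise instead of being timed silently.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Report the first (possibly cold-cache) run apart from the later warm runs."""
    if len(timings) < 2:
        raise ValueError("need at least two runs: one warm-up plus one measured run")
    cold, warm = timings[0], timings[1:]
    return {
        "cold": cold,
        "warm_mean": statistics.mean(warm),
        "warm_min": min(warm),
        "warm_max": max(warm),
    }


if __name__ == "__main__":
    # Placeholder command (an assumption): replace it with the real benchmark,
    # for example ["./my_benchmark"], which is a hypothetical name.
    cmd = [sys.executable, "-c", "pass"]
    for name, seconds in summarize(time_runs(cmd)).items():
        print(f"{name}: {seconds:.3f} s")
```

If only the `cold` figure is slow and the warm runs agree with each other, that would fit the disk-cache explanation. If the warm runs themselves still spread between roughly 2m 37s and 6m, the cache alone probably does not explain the variation, and the kernel, compiler, BIOS and hardware questions raised above remain open.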