Puzzling wild variation of execution time

I would greatly appreciate help in understanding the puzzling, wild variation in the execution time of a small test/benchmark program, as explained below.
(1) The nature of the benchmark/test program:
A floating-point-intensive atmospheric chemistry modeling task. It used to take, consistently, around 2m 37s on my machine when it was running under SuSE 9.1 Pro. When I run the benchmark program, I shut off all other programs, so during its run it is the only active program. The remaining processes belong to root and are sleeping. There is no other user.
(2) The machine:
Dual Opteron 250 with 4 GB of PC3200 (DDR) memory in 4 DIMMs (two on cpu0 and two on cpu1), installed in a Tyan Thunder K8 system board with American Megatrends AMIBIOS V2-04. It now runs SuSE 9.3 Pro with Linux version 2.6.11.4-21.7-smp (geeko@buildhost) (gcc version 3.3.5 20050117 (prerelease) (SUSE Linux)) #1 SMP Thu Jun 2 14:23:14 UTC 2005
(3) Background of the problem:
Only 4 months after the vendor delivered the machine, the motherboard failed. The machine was sent back to the vendor, who replaced the motherboard with a new one and installed SuSE 9.3 Pro to bring the machine up to date, with Linux version 2.6.11.4-21.7-smp (geeko@buildhost) (gcc version 3.3.5 20050117 (prerelease) (SUSE Linux)) #1 SMP Thu Jun 2 14:23:14 UTC 2005
(4) The problem:
After the repair and the upgrade to 9.3 Pro, the execution time of the same test/benchmark program varies widely from run to run. It is never consistent. It ranges from 2m 37s to more than 6m. There is a pattern in the variation: usually, the longest execution time occurs when the machine is started after being shut off for a few hours, and the shortest execution time is obtained after the machine has been on continuously for about 7 to 10 hours.
(5) My puzzle:
Computers are well known to behave reproducibly, so a variation in execution time by a factor of almost 3 is very disturbing.
What could be the cause? Does it mean that some vital component involved in executing the numerical model (the RAM or the memory on the CPUs) is loose or erratic and may be about to go bad? I am just puzzled. Any clue from you all will be very much appreciated.
--
Best regards.
Sheo (Sheo S. Prasad)
Creative Research Enterprises
6354 Camino del Lago
Pleasanton, CA 94566, USA
Voice Phone: (+1) 925 426-9341
Fax Phone: (+1) 925 426-9417
e-mail: ssp@CreativeResearch.org

Well, I'm not an expert on this kind of problem, but I'm still interested in problem solving...
Everything you describe makes me think of some kind of memory leak (execution time stabilizing after a long period of uptime), supposing your algorithm consumes a lot of memory.
If that is true, the next step is to determine whether the culprit is the kernel (9.1 was 2.6.5, 9.3 is 2.6.11; kernel experts, were there many changes in memory management between these two versions, especially involving SMP systems?) or maybe the compiler (did you compile your program yourself? Did you recompile it after the installation of SuSE 9.3? Is it written in C, C++, Fortran, or something else?).
Another possibility is a hardware/BIOS-related problem (or incorrect BIOS settings), but I have no idea how to diagnose a faulty CPU... Tyan hardware experts, it's up to you.
Ah, I've just reread your post. You said the longest time occurs when the machine is started after a long shut-off. What happens if, after a long period of uptime, you reboot the machine and immediately run your program?
- If you get the same good result (2'37"), then you certainly have a hardware problem (thermal stabilization of some component?).
- If you get a bad result (longest time), then it is certainly a software-related problem (BIOS/kernel/compiler).
Sorry not to be able to help more; this was only to start the discussion. Now it's time for the true experts.
cheers,
fabrice
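To compare the cold-start, long-uptime and just-rebooted cases suggested above, it may help to log each run's wall-clock time next to the machine's uptime. The following is only a minimal sketch in Python: the function names are made up for illustration, `./benchmark` stands in for the real program, and it assumes a Linux `/proc/uptime` file is available.

```python
import subprocess
import sys
import time


def read_uptime_seconds(path="/proc/uptime"):
    """Return seconds since boot, or None if the file cannot be read."""
    try:
        with open(path) as f:
            return float(f.read().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def time_command(argv):
    """Run argv once and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    return time.perf_counter() - start


def log_runs(argv, repeats=3):
    """Time argv `repeats` times; return a list of (uptime, elapsed) pairs."""
    results = []
    for _ in range(repeats):
        uptime = read_uptime_seconds()  # uptime at the start of this run
        elapsed = time_command(argv)
        results.append((uptime, elapsed))
    return results


# Hypothetical usage ("./benchmark" is a placeholder for the real program):
#   for uptime, elapsed in log_runs(["./benchmark"]):
#       print(f"uptime={uptime} s  elapsed={elapsed:.1f} s")
```

Collecting these pairs right after a cold boot, after several hours of uptime, and right after a warm reboot would give the data needed to tell the two cases in the list above apart.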

From your description I am not sure how you ensured that there is no interference from other processes. One thing you could try is switching to single-user mode with as many things as possible turned off:
    # /sbin/telinit 1
You said that all the other processes are off, but after you restart the machine you might have some jobs running, such as indexing the file system.
-m
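One way to check for such background jobs just before a run is to look at the process states the kernel reports. The sketch below is an illustration only (the function names are made up): it reads `/proc/<pid>/stat` on Linux and lists processes that are runnable (`R`) or in uninterruptible disk wait (`D`), other than the script itself.

```python
import os


def parse_stat(text):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the LAST ')'.
    head, tail = text.rsplit(")", 1)
    pid, comm = head.split("(", 1)
    return int(pid), comm, tail.split()[0]


def busy_processes(proc="/proc"):
    """List (pid, command, state) for processes in state R or D."""
    busy = []
    for name in os.listdir(proc):
        if not name.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc, name, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were looking
        if state in ("R", "D") and pid != os.getpid():
            busy.append((pid, comm, state))
    return busy
```

An empty list from `busy_processes()` immediately before the benchmark starts would support the claim that nothing else is competing for the CPUs at that moment. It is a snapshot, so short-lived jobs can still be missed.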

On Thu, 30 Jun 2005 11:02:34 -0700 Sheo Shanker Prasad <ssp@creativeresearch.org> wrote:
(5) My puzzle:
While a computer is well known to be reproducible, the wild variation by a factor of almost 3 in the execution is very disturbing.
It could be cache effects. Linux does not do enforced L2 cache coloring while allocating user memory, so in memory-intensive programs some variation can happen depending on how well the cache is used. However, an Opteron has a 16-way associative L2 cache, so such things normally don't happen there.
One difference between 9.1 and 9.3 is that 9.3 allocates memory from the start to the end after boot-up, while 9.1 tended to allocate it from the top down (highest memory first). It is possible that your BIOS has a broken MTRR or somesuch, and that some low page is not cached properly and is slowing things down. If it were a high page, it could be avoided by booting with mem=...M with a number smaller than your memory. However, for low pages that's difficult.
-Andi
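The MTRR settings can be read from `/proc/mtrr` on x86 Linux. The sketch below is a rough illustration, not a definitive check: the helper names are made up, and the regular expression assumes the usual line layouts (`reg00: base=0x... ( 0MB), size=2048MB: write-back, count=1`, or the later variant with `count=` before the type). It lists address ranges below a given limit that no write-back register covers. It ignores overlapping uncachable registers, which take precedence over write-back, and any memory the CPU marks cacheable by other means, so a reported gap is only a hint to look more closely.

```python
import re

# Assumed line layout of /proc/mtrr; both common orderings are accepted.
MTRR_LINE = re.compile(
    r"reg\d+: base=0x([0-9a-fA-F]+) \(\s*\d+MB\), size=\s*(\d+)([KM])B"
    r"(?:, count=\d+)?: ([a-z-]+)"
)
UNIT = {"K": 1024, "M": 1024 * 1024}


def parse_mtrr(text):
    """Return a list of (start, end, type) byte ranges from /proc/mtrr text."""
    regions = []
    for m in MTRR_LINE.finditer(text):
        base = int(m.group(1), 16)
        size = int(m.group(2)) * UNIT[m.group(3)]
        regions.append((base, base + size, m.group(4)))
    return regions


def uncached_gaps(regions, limit):
    """Return [start, end) ranges below `limit` with no write-back MTRR."""
    covered = sorted((s, e) for s, e, kind in regions if kind == "write-back")
    gaps, cursor = [], 0
    for start, end in covered:
        if start > cursor:
            gaps.append((cursor, min(start, limit)))
        cursor = max(cursor, end)
        if cursor >= limit:
            break
    if cursor < limit:
        gaps.append((cursor, limit))
    return gaps


# Hypothetical usage on the machine in question:
#   with open("/proc/mtrr") as f:
#       print(uncached_gaps(parse_mtrr(f.read()), 4096 * 1024 * 1024))
```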

Is that difference twixt allocation protocols a 2.4.x kernel vs. 2.6.n kernel difference (i.e. generic to ANY AMD64 Linux) or something else?
--
William A. Mahaffey III

Is that difference twixt allocation protocols a 2.4.x kernel vs. 2.6.n kernel difference (i.e. generic to ANY AMD64 Linux) or something else?
9.1 and 9.3 both use 2.6 kernels. It was a change at some point in 2.6 that helps I/O clustering, because it makes it more likely that multiple pages lie next to each other in memory and can be submitted to the disk with a single request. 2.4 didn't have it. It's not AMD64-specific.
As an additional side effect, it makes Linux more hardware bug-to-bug compatible with some other popular OS. Previously, machines with broken memory near the end tended to mostly work with that OS as long as memory was not filled up, but failed early with Linux. Now they should mostly work with Linux too. Of course, when memory eventually fills up you still have a problem.
-Andi
participants (5)
- Andi Kleen
- fabrice piccini
- Marcin Zalewski
- Sheo Shanker Prasad
- William A. Mahaffey III