Dear all,

I experience a strange problem with a C-Program. When I run it on different machines, sometimes it produces different results, and sometimes it doesn't.

Basically, there seem to exist two different cases:
a) I compiled the program at the end of January. The compiled binary was copied to different computers, and produces different results on different machines.
b) I recompiled the program recently. With the _new_ binary of April, all machines calculate the same result!

First ideas:
1) Yes, the code was identical for case a) and b).
2) Could it be that the hardware of my computers differs too much? This does not seem to be the case. Also, for the case b) it does not matter on which of the machines I compile before running it on all of them.
3) Could it be that some kernel update (or gcc update?) partially "broke" the compiler, and led to the results of case a) (which was compiled on a SuSE Linux 9.0 with kernel 2.4.21-99). A later kernel update possibly "repaired" this problem (case b) was compiled with SuSE Linux 9.0 with 2.4.21-xxx kernel, with xxx > 166). Also, a check with a SuSE 9.1, kernelversion 2.6.5-7.111 reproduces case b).

Is the idea of problems with online-updates plausible at all? Does anyone have other ideas?

Thank you very much,
Mathias

P.S. Please reply also to duneldaion@web.de in addition to posting to the list. Thanks!
On Monday 02 May 2005 4:10 pm, duneldaion@web.de wrote:
Dear all,
I experience a strange problem with a C-Program. When I run it on different machines, sometimes it produces different results, and sometimes it doesn't.
Basically, there seem to exist two different cases: a) I compiled the program at the end of January. The compiled binary was copied to different computers, and produces different results on different machines. b) I recompiled the program recently. With the _new_ binary of April, all machines calculate the same result!
First ideas: 1) Yes, the code was identical for case a) and b). 2) Could it be that the hardware of my computers differs too much? This does not seem to be the case. Also, for the case b) it does not matter on which of the machines I compile before running it on all of them. 3) Could it be that some kernel update (or gcc update?) partially "broke" the compiler, and led to the results of case a) (which was compiled on a SuSE Linux 9.0 with kernel 2.4.21-99). A later kernel update possibly "repaired" this problem (case b) was compiled with SuSE Linux 9.0 with 2.4.21-xxx kernel, with xxx > 166). Also, a check with a SuSE 9.1, kernelversion 2.6.5-7.111 reproduces case b).
There are several variables, but a properly tested piece of code should be able to provide reproducible results. Some causes could be expected: First, the code is using random numbers. Secondly, the code has a latent bug, such as an uninitialized variable somewhere. Thirdly, if the code is written to depend on behavior that may have changed in 2.6, then you could have trouble.

Also, you don't mention if you are doing integer or floating point math, and what language the code was written in. But, I still would suspect that the code has a latent bug. As an example, I once received a bug report written by the compiler people on an application that had been running for over 5 years with no trouble. On close inspection, I found that this piece of code had a latent bug in it that had existed even in the old AT&T Unix systems, but never caused a problem before.

--
Jerry Feldman <gaf@blu.org>
Boston Linux and Unix user group
http://www.blu.org
PGP key id:C5061EA9
PGP Key fingerprint:053C 73EC 3AC1 5C44 3E14 9245 FB00 3ED5 C506 1EA9
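To make the "latent bug" case concrete, here is a minimal sketch in C of an uninitialized accumulator. The original program was never posted, so the function names (mean_buggy, mean_fixed) are purely hypothetical and only illustrate the pattern; the buggy variant is kept inside a comment so it is never compiled.

```c
#include <stddef.h>

/*
 * Latent bug (illustration only, kept in a comment):
 *
 *     double mean_buggy(const double *v, size_t n)
 *     {
 *         double sum;                  // never initialized
 *         for (size_t i = 0; i < n; i++)
 *             sum += v[i];             // adds to whatever was in 'sum'
 *         return sum / (double)n;
 *     }
 *
 * Reading 'sum' before it is assigned is undefined behavior. The value
 * it starts with can depend on the compiler, the optimization level and
 * what earlier calls left behind, so such code may appear to work for a
 * long time and then give different results after a rebuild or on
 * another machine.
 */

/* Fixed version: the accumulator starts from a defined value. */
double mean_fixed(const double *v, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    /* Guard against division by zero for an empty input. */
    return n ? sum / (double)n : 0.0;
}
```

Compiling with warnings enabled (for example gcc -Wall together with optimization) will often, though not always, flag a variable like 'sum' in the buggy variant.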
Also, you don't mention if you are doing integer or floating point math, and what language the code was written in. But, I still would suspect that the code has a latent bug.
I could bet that this is the problem. There are other possibilities, but this is the most probable cause.

I think the easiest way to find this kind of error is to run your code with some emulator or anything that changes the "memory layout". First step, use valgrind (developer.kde.org/~sewardj/). Then try electric fence and similar tools. This should solve the problem. If it doesn't, well, debug the code on all machines until you find what is wrong.

[]s
Davi de Castro Reis
On Tuesday 03 May 2005 9:40 am, Davi de Castro Reis wrote:
Also, you don't mention if you are doing integer or floating point math, and what language the code was written in. But, I still would suspect that the code has a latent bug.
I could bet that this is the problem. There are other possibilities, but this is the most probable cause.
I think the easiest way to find this kind of error is to run your code with some emulator or anything that changes the "memory layout". First step, use valgrind (developer.kde.org/~sewardj/). Then try electric fence and similar tools. This should solve the problem. If it doesn't, well, debug the code on all machines until you find what is wrong.

One of the best tools available is IBM Rational's Purify Plus. While valgrind and electric fence are also good tools, Purify still beats them hands down.

--
Jerry Feldman <gaf@blu.org>
Jerry Feldman wrote:
On Tuesday 03 May 2005 9:40 am, Davi de Castro Reis wrote:
Also, you don't mention if you are doing integer or floating point math, and what language the code was written in. But, I still would suspect that the code has a latent bug.
I could bet that this is the problem. There are other possibilities, but this is the most probable cause.
I think the easiest way to find this kind of error is to run your code with some emulator or anything that changes the "memory layout". First step, use valgrind (developer.kde.org/~sewardj/). Then try electric fence and similar tools. This should solve the problem. If it doesn't, well, debug the code on all machines until you find what is wrong.
One of the best tools available is IBM Rational's Purify Plus. While valgrind and electric fence are also good tools, Purify still beats them hands down.
Purify is commercial, no? (No problem with that, just checking :-) )

--
William A. Mahaffey III
Remember, ignorance is bliss, but willful ignorance is LIBERALISM !!!!
On Wednesday 04 May 2005 08:12, William A. Mahaffey III wrote:
Jerry Feldman wrote:
On Tuesday 03 May 2005 9:40 am, Davi de Castro Reis wrote:
Also, you don't mention if you are doing integer or floating point math, and what language the code was written in. But, I still would suspect that the code has a latent bug.
I could bet that this is the problem. There are other possibilities, but this is the most probable cause.
I think the easiest way to find this kind of error is to run your code with some emulator or anything that changes the "memory layout". First step, use valgrind (developer.kde.org/~sewardj/). Then try electric fence and similar tools. This should solve the problem. If it doesn't, well, debug the code on all machines until you find what is wrong.
One of the best tools available is IBM Rational's Purify Plus. While valgrind and electric fence are also good tools, Purify still beats them hands down.
Sometimes things are more than they look. Try computing, to the maximum accuracy:

x = 10864
y = 18817
result1 = 9.0*x^4 - y^4 + 2.0*y^2
result2 = 9.0*x^4 - (y^4 - 2.0*y^2)

True result is 1.

Reference: http://www-anp.lip6.fr/cadna/Examples_Dir/ex1.php

Regards,
Colin
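The two expressions above can be tried directly in C. What follows is only a minimal sketch: the function names (exact_result, result1, result2) are illustrative and not taken from the referenced page. Since y^2 = 3*x^2 + 1 for these inputs, the exact value is 1, and every intermediate fits in a 64-bit integer, so an integer version can serve as a reference. Working through the rounding by hand, a build that rounds every operation to 64-bit double (no extended-precision intermediates, no fused multiply-add) should give 2 for result1 and 0 for result2, because y^4 needs 57 bits and is rounded. Builds that keep intermediates in 80-bit x87 registers or contract to fused multiply-add may give other values, which is one way identical source can produce different results under different compilers or flags.

```c
#include <stdint.h>

/* Exact reference: all intermediates (about 1.25e17) fit in int64_t. */
int64_t exact_result(int64_t x, int64_t y)
{
    int64_t x2 = x * x;
    int64_t y2 = y * y;
    return 9 * x2 * x2 - y2 * y2 + 2 * y2;
}

/* result1 = 9.0*x^4 - y^4 + 2.0*y^2, evaluated left to right. */
double result1(double x, double y)
{
    double x2 = x * x;
    double y2 = y * y;
    return 9.0 * x2 * x2 - y2 * y2 + 2.0 * y2;
}

/* result2 = 9.0*x^4 - (y^4 - 2.0*y^2): same formula, regrouped. */
double result2(double x, double y)
{
    double x2 = x * x;
    double y2 = y * y;
    return 9.0 * x2 * x2 - (y2 * y2 - 2.0 * y2);
}
```

Calling these with x = 10864 and y = 18817 from a small main() and printing the doubles with "%.17g" makes any difference between builds visible. With gcc, comparing a build using -mfpmath=387 against one using -mfpmath=sse -msse2 on 32-bit x86 is one way to see whether the intermediate precision is the cause.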
On Tuesday 03 May 2005 6:12 pm, William A. Mahaffey III wrote:
Purify is commercial, no ? (no problem with that, just checking :-) )

Yes, and expensive. You can download a trial version. One of our members was trying to troubleshoot a memory leak in some middleware that runs in a commercial telephone switch. The Open Source tools were not able to locate the problem. At my suggestion he downloaded Purify and was able to locate the problem in 5 minutes. In this case, the problem was not in his company's code, but in the phone company's code. I spent nearly 3 years porting PurifyPlus to the Digital Alpha platform (in a 6 person team). Our member was so pleased he bought the pizzas for the meeting, and his company bought a PurifyPlus license.

--
Jerry Feldman <gaf@blu.org>
On Monday 02 May 2005 22:10, duneldaion@web.de wrote:
I experience a strange problem with a C-Program. When I run it on different machines, sometimes it produces different results, and sometimes it doesn't.
You need to be a lot more specific for any meaningful investigation of that.
Basically, there seem to exist two different cases: a) I compiled the program at the end of January. The compiled binary was copied to different computers, and produces different results on different machines. b) I recompiled the program recently. With the _new_ binary of April, all machines calculate the same result!
There could be all kinds of strange side effects - plain bugs in your code that show up only sometimes, variables that are not initialized properly, all kinds of library function calls that might behave differently in different environments. There is no way of telling without really seeing the sources of that program.

If you post them here (or preferably a link to them to avoid oversized list postings), somebody might be inclined to investigate that further - but of course since this is a free community there is no guarantee for that.

(And no, I don't want the sources as personal mail ;-) )

CU
--
Stefan Hundhammer <sh@suse.de>
Penguin by conviction.
YaST2 Development
SUSE Linux Products GmbH
Nuernberg, Germany
participants (6): Colin Carter, Davi de Castro Reis, duneldaion@web.de, Jerry Feldman, Stefan Hundhammer, William A. Mahaffey III