[opensuse-programming] Why you sometimes need assembler

Just a follow-up on my earlier question about which assembler to choose. I've opted for nasm. I still don't quite like the simplicity idea, but I think I can work around the most annoying "features" with a set of my own macros. However, I've just completed some initial testing, and here is why you need assembler:

I'm processing a dataset of 262176 bytes. Not a lot, but still about 2 million bits. The computation is 99.9% CPU-bound.

In my first version, I used a div instruction in the code, and the computation took 151 minutes (wall-time). In the second, I removed the div, and probably removed one or two other instructions. The new computation took 62 minutes.

/Per Jessen, Zürich

Hello,

On Tue, 12 Jan 2010, Per Jessen wrote:
Just a follow-up on my earlier question about which assembler to choose. I've opted for nasm. I still don't quite like the simplicity idea, but I think I can work around the most annoying "features" with a set of my own macros.
However, I've just completed some initial testing, and here is why you need assembler:
I'm processing a dataset of 262176 bytes. Not a lot, but still about 2 million bits. The computation is 99.9% CPU-bound.
In my first version, I used a div instruction in the code, and the computation took 151 minutes (wall-time). In the second, I removed the div, and probably removed one or two other instructions. The new computation took 62 minutes.
... and replaced it by what? And have you compared to C code compiled with gcc or icc? ;)

-dnh

--
"Put one hand in the freezer and the other on a hot stove. On average, it feels pleasant." [CB's teacher explaining the "average"]

David Haller wrote:
Hello,
On Tue, 12 Jan 2010, Per Jessen wrote:
Just a follow-up on my earlier question about which assembler to choose. I've opted for nasm. I still don't quite like the simplicity idea, but I think I can work around the most annoying "features" with a set of my own macros.
However, I've just completed some initial testing, and here is why you need assembler:
I'm processing a dataset of 262176 bytes. Not a lot, but still about 2 million bits. The computation is 99.9% CPU-bound.
In my first version, I used a div instruction in the code, and the computation took 151 minutes (wall-time). In the second, I removed the div, and probably removed one or two other instructions. The new computation took 62 minutes.
... and replaced it by what? And have you compared to C code compiled with gcc or icc? ;)
The div was a divide by 10 followed by examination of remainder and quotient. Under the circumstances, the number being divided would always be in the range 0-19, which meant I could easily replace it with a subtract-10 and a check of the result. The div was effectively replaced by 3 other instructions, and I think I managed to do away with one more instruction along the way. (I really only looked at the timing.)

As for gcc code, I did start out with C code for a quick prototype. I also had a quick look at what was generated - it would have been horrendously slow in comparison. After all, neither gcc nor icc could have been aware of the special circumstances that I knew about. Hand-optimized assembler will _always_ beat the compiler, but the trade-off is speed vs. maintainability. The latter is (in this project) not a concern for me.

/Per Jessen, Zürich
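Per hasn't posted his actual code, but given the stated constraint (dividend always 0-19, so the quotient is 0 or 1), the substitution he describes could look roughly like this in NASM - a hypothetical sketch with made-up register choices, not code from this thread:

    ; before: general divide by 10, dividend (0-19) in ax
            mov     cl, 10
            div     cl              ; al = quotient, ah = remainder (slow, microcoded)

    ; after: the value is known to be 0-19, so a compare/subtract suffices
            cmp     al, 10          ; is the value 10-19?
            jb      .lt10           ; below 10: quotient 0, remainder already in al
            sub     al, 10          ; 10-19: quotient 1, remainder = al - 10
    .lt10:

The "after" form is exactly three instructions, matching Per's description.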

On Tue, 2010-01-12 at 23:30 +0100, Per Jessen wrote:
Hand-optimized assembler will _always_ beat the compiler
That is a myth which even very good assembly programmers can only live up to in small inner-loop type segments. A good optimising compiler can do global optimisations that a human would find extremely difficult to do. In a sense, of course, there is nothing that a compiler can do that a human programmer can't, but modern global-scale optimisations make it extraordinarily difficult, if not practically impossible, for even the best humans to achieve. Then again, most optimisations are exceeded easily by small algorithmic changes. Hand-optimising code can't change the complexity of an algorithm.

Anders

Anders Johansson wrote:
On Tue, 2010-01-12 at 23:30 +0100, Per Jessen wrote:
Hand-optimized assembler will _always_ beat the compiler
That is a myth which even very good assembly programmers can only live up to in small inner-loop type segments. A good optimising compiler can do global optimisations that a human would find extremely difficult to do.
No myth at all. Perhaps you haven't met many very good assembler programmers? It's not really a topic I have a desire to debate (I know from experience that I am right).

The main reasons for _not_ using assembler are productivity and maintainability, but when speed is paramount, hand-optimized assembler will _always_ beat the compiler.

/Per Jessen, Zürich

On 13/01/10 07:37, Per Jessen wrote:
[...] No myth at all. Perhaps you haven't met many very good assembler programmers? It's not really a topic I have a desire to debate (I know from experience that I am right).
I am (at least with one leg) in the HPC software development business, and I say your point of view is somewhat biased. Nowadays there's hardly any reason for writing assembler code directly. There are, of course, a few occasions where you are right and hand-optimized assembler code will outperform code generated by any compiler. But that's not a statement you can apply globally.

Of course, if you write bad C/Fortran/C++ or whatever code and you hope for the compiler to come up with some magic to create ultra-fast assembler out of it, then you're wrong.

I would like to see the C code you used as baseline reference, information about the compiler, and access to your hand-optimized code to actually have a look at it myself. Without showing evidence, you just make empty statements.

Thomas

Thomas Hertweck wrote:
On 13/01/10 07:37, Per Jessen wrote:
[...] No myth at all. Perhaps you haven't met many very good assembler programmers? It's not really a topic I have a desire to debate (I know from experience that I am right).
I am (at least with one leg) in the HPC software development business and I say your point of view is somewhat biased.
Yeah, that I can't deny.
Nowadays there's hardly any reason for writing assembler code directly. There are, of course, a few occasions where you are right and hand-optimized assembler code will outperform code generated by any compiler. But that's not a statement you can apply globally.
I think it is - what you may have is a situation where the difference is not measurable or not significant, or not worth it when weighed up against the other factors (maintainability, productivity).
Of course if you write bad C/Fortran/C++ or whatever code and you hope for the compiler to come up with some magic to create an ultra-fast assembler out of it, then you're wrong.
Certainly, that would not be a fair comparison.
I would like to see the C code you used as baseline reference and information about the compiler and access to your hand-optimized code to actually have a look at it myself. Without showing evidence, you just make empty statements.
Well, it's up to people who believe otherwise to prove me wrong :-) All I did was some plain wall-time measurements; I didn't use the C code performance as a baseline, I just took a quick look at the generated code and knew I could do better.

AFAIC, the main argument is that if the compiler uses 100 instructions to implement an algorithm and I use 90, that's 10 instructions saved (all instructions being equal). I've been writing assembler code for more than 20 years, and I'm quite certain I can do better.

/Per Jessen, Zürich

On Wednesday January 13 2010, Per Jessen wrote:
...
AFAIC, the main argument is that if the compiler uses 100 instructions to implement an algorithm, and I use 90, that's 10 instructions saved (all instructions being equal). I've been writing assembler code for more than 20 years, and I'm quite certain I can do better.
Another common and important argument is that developer time is, in the vast majority of cases, more valuable than processor time. The relative difficulty of debugging lower-level languages, together with the observation that bug counts correlate with lines of code written rather than with the amount of functionality implemented, skews the economic decision away from using assembly in all but the most narrow and specialized of situations.
/Per Jessen, Zürich
Randall Schulz

Randall R Schulz wrote:
On Wednesday January 13 2010, Per Jessen wrote:
...
AFAIC, the main argument is that if the compiler uses 100 instructions to implement an algorithm, and I use 90, that's 10 instructions saved (all instructions being equal). I've been writing assembler code for more than 20 years, and I'm quite certain I can do better.
Another common and important argument is that developer time is, in the vast majority of cases, more valuable than processor time. The relative difficulty of debugging lower-level languages, together with the observation that bug counts correlate with lines of code written rather than with the amount of functionality implemented, skews the economic decision away from using assembly in all but the most narrow and specialized of situations.
Completely agree, Randall - that's what I meant by "productivity". The "sometimes" in the subject line covers just those narrow and specialized situations where speed is paramount and all other considerations secondary.

/Per Jessen, Zürich

On 13/01/10 18:53, Per Jessen wrote:
[...] I've been writing assembler code for more than 20 years, and I'm quite certain I can do better.
Sure, you are free to waste your time! ;-)

Thomas

On 01/13/2010 02:37 AM, Per Jessen wrote:
Anders Johansson wrote:
On Tue, 2010-01-12 at 23:30 +0100, Per Jessen wrote:
Hand-optimized assembler will _always_ beat the compiler
That is a myth which even very good assembly programmers can only live up to in small inner-loop type segments. A good optimising compiler can do global optimisations that a human would find extremely difficult to do.
No myth at all. Perhaps you haven't met many very good assembler programmers? It's not really a topic I have a desire to debate (I know from experience that I am right).
The main reasons for _not_ using assembler are productivity and maintainability, but when speed is paramount, hand-optimized assembler will _always_ beat the compiler.
/Per Jessen, Zürich
Per, try to hand-write assembler code for IA64 :-). While I would disagree with "_always_", there are times when the optimizers don't do a good job either. The C div() function is a pain, and certainly, when you have numbers where you know certain properties about those numbers, you can do significantly better. div() is sometimes included as a builtin.

Just to add to Anders' comments from my Alpha chip background: we took the output instruction stream from the compiler, broke it up into "basic blocks", then reordered the instructions based on latency tables and other criteria to reduce stalls and improve parallelism. We did this for both C code as well as for the assembler. This is one of the things that the Intel compiler does.

But my bottom line here would be the same as Per Jessen's: in his specific case, I would stick with assembly code.

--
Jerry Feldman <gaf@blu.org>
Boston Linux and Unix
PGP key id: 537C5846
PGP Key fingerprint: 3D1B 8377 A3C0 A5F2 ECBB CA3B 4607 4319 537C 5846
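As a toy illustration of the kind of reordering Jerry describes (not his actual Alpha tooling - a hypothetical NASM sketch with made-up memory operands a, b, c), scheduling within a basic block means separating dependent instructions so independent work can overlap:

    ; dependent chain - each add must wait for the previous result
            mov     eax, [a]
            add     eax, [b]
            add     eax, [c]

    ; scheduled form - both loads can issue early and overlap
            mov     eax, [a]
            mov     edx, [c]
            add     eax, [b]
            add     eax, edx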

Hello,

On Tue, 12 Jan 2010, Per Jessen wrote:
David Haller wrote:
On Tue, 12 Jan 2010, Per Jessen wrote: [..]
I'm processing a dataset of 262176 bytes. Not a lot, but still about 2 million bits. The computation is 99.9% CPU-bound.
In my first version, I used a div instruction in the code, and the computation took 151 minutes (wall-time). In the second, I removed the div, and probably removed one or two other instructions. The new computation took 62 minutes.
... and replaced it by what? And have you compared to C code compiled with gcc or icc? ;)
The div was a divide by 10 followed by examination of remainder and quotient. Under the circumstances, the number being divided would always be in the range 0-19, which meant I could easily replace it with a subtract-10 and a check of the result. The div was effectively replaced by 3 other instructions, and I think I managed to do away with one more instruction along the way. (I really only looked at the timing.)
As for gcc code, I did start out with C code for a quick prototype. I also had a quick look at what was generated - it would have been horrendously slow in comparison. After all, neither gcc nor icc could have been aware of the special circumstances that I knew about. Hand-optimized assembler will _always_ beat the compiler, but the trade-off is speed vs. maintainability. The latter is (in this project) not a concern for me.
Definitely not. Generated code can be surprisingly efficient today (with proper options). But in special cases, of course, *nothing* beats hand-crafted assembler[1] ;) But one should check that it actually does. I'm glad you found a good solution.

Have Fun!

-dnh, more of a perlmonger, and I do run benchmarks there, with at times surprising results :)

[1] just have a look at my /bin/true ;) A 45-byte ELF binary, hand-crafted with nasm. Not by me, mind you. http://www.muppetlabs.com/%7Ebreadbox/software/tiny/true.asm.txt

--
Just read dclp.* - it will leave you speechless. Rolling an armadillo across the keyboard produces better PHP code than what gets posted there. -- R. Huebenthal in darw

On Tue, Jan 12, 2010 at 11:30:02PM +0100, Per Jessen wrote:
David Haller wrote:
... and replaced it by what? And have you compared to C code compiled with gcc or icc? ;)
The div was a divide by 10 followed by examination of remainder and quotient. Under the circumstances, the number being divided would always be in the range 0-19, which meant I could easily replace it with a subtract-10 and a check of the result. The div was effectively replaced by 3 other instructions, and I think I managed to do away with one more instruction along the way. (I really only looked at the timing.)
As for gcc code, I did start out with C code for a quick prototype. I also had a quick look at what was generated - it would have been horrendously slow in comparison. After all, neither gcc nor icc could have been aware of the special circumstances that I knew about.
But an operation on the range 0-19 could have been implemented with a lookup table or a switch statement in C. This is a standard algorithmic optimisation possible in any language.

ciao
Arvin

--
Arvin Schnell, <aschnell@suse.de>
Senior Software Engineer, Research & Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
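For comparison, the table approach Arvin describes (roughly what a C compiler might generate for a switch over 0-19) could also be written directly in NASM - a hypothetical sketch, not code from this thread:

    section .data
    ; precomputed quotient and remainder of n/10 for n = 0..19
    quot_tab:   db 0,0,0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1,1,1
    rem_tab:    db 0,1,2,3,4,5,6,7,8,9, 0,1,2,3,4,5,6,7,8,9

    section .text
    ; value 0-19 in eax
            movzx   ebx, byte [quot_tab + eax]   ; quotient
            movzx   ecx, byte [rem_tab + eax]    ; remainder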

On 01/12/2010 09:05 PM, Per Jessen wrote:
Just a follow-up on my earlier question about which assembler to choose. I've opted for nasm. I still don't quite like the simplicity idea, but I think I can work around the most annoying "features" with a set of my own macros.
However, I've just completed some initial testing, and here is why you need assembler:
I'm processing a dataset of 262176 bytes. Not a lot, but still about 2 million bits. The computation is 99.9% CPU-bound.
In my first version, I used a div instruction in the code, and the computation took 151 minutes (wall-time). In the second, I removed the div, and probably removed one or two other instructions. The new computation took 62 minutes.
/Per Jessen, Zürich
I used to use MASM32 under Windows 98. It had a series of macros for conditional loops, and one called invoke which used a C-like prototype for calling API functions and saved you from having to push and pop values continually; it also had a GUI generator. I used it to make programs to interface with hardware I designed and built - it's the only thing I really missed when changing to Linux. It still exists to this day.

Regards
Dave P
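NASM can approximate that invoke convenience with a multi-line macro. Here is a sketch along the lines of the variadic-macro example in the NASM manual, pushing arguments right-to-left for a cdecl/stdcall-style call (the MessageBoxA usage line is just a hypothetical illustration):

    %macro invoke 1-*
        %rep %0-1
            %rotate -1
            push    dword %1        ; push arguments right-to-left
        %endrep
        %rotate -1
        call    %1                  ; first macro argument is the function
    %endmacro

    ; usage (hypothetical 32-bit Win32 call):
    ;       invoke MessageBoxA, 0, msg, title, 0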
participants (8)
- Anders Johansson
- Arvin Schnell
- Dave Plater
- David Haller
- Jerry Feldman
- Per Jessen
- Randall R Schulz
- Thomas Hertweck