Randall R Schulz wrote:
Kolja,
Performance analysis, let alone optimization, just gets harder and harder as hardware gets more and more sophisticated. Oops. I mean it gets more and more interesting...
I'm not familiar with the breakdown of execution units and how they relate to x86 instructions (let alone to high-level language constructs), nor with how much redundancy there is in execution units and data pathways within a Hyper-Threaded Pentium 4 or Xeon processor. My intuition is that a priori there'd be considerable opportunity for overlap within the CPU itself, and that patterns of primary storage access (especially overall L2 and L3 cache hit rates) are the dominant factors. After all, for lots of common execution patterns, access to RAM is the limiting factor.
To clarify, my knowledge of HT comes primarily from http://arstechnica.com/paedia/h/hyperthreading/hyperthreading-1.html and my own experiments. As far as I know, the P4 architecture has only two parallel execution paths, and only one of them can be used for "complex" instructions, so it's only "one and a half" processors at the best of times. Now, if RAM access is the limiting factor, no SMP solution can help, and the Opteron's NUMA capability is a great improvement.

An optimal HT situation would be one simple process and one more complex process running simultaneously: one operating on a small data set, i.e. one that fits into L1 (which in turn is very small on a P4) with space left over, and the other perhaps on a larger but contiguous data set that allows streaming and SIMD. These would still have to be written and scheduled well to intertwine fruitfully. If, on the other hand, the scheduler sees two (virtual) processors, it will give "one", i.e. half the CPU time, to your one important process while wasting the other half on unimportant processes, e.g. SETI@home.
Randall Schulz
KK