Randall R Schulz wrote:
Kolja,
Performance analysis, let alone optimization, just gets harder and harder as hardware gets more and more sophisticated. Oops. I mean it gets more and more interesting...
I'm not familiar with the breakdown of execution units and how they relate to x86 instructions (let alone to high-level language constructs), nor with how much redundancy there is in execution units and data pathways within a Hyper-Threaded Pentium 4 or Xeon processor. My intuition is that a priori there'd be considerable opportunity for overlap within the CPU itself, and that patterns of primary storage access (especially overall L2 and L3 cache hit rates) are the dominant factors. After all, for lots of common execution patterns, access to RAM is the limiting factor.
To clarify, my knowledge of HT comes primarily from http://arstechnica.com/paedia/h/hyperthreading/hyperthreading-1.html and my own experiments. As far as I know, the P4 architecture has only two parallel execution paths, and only one of them can be used for "complex" instructions, so it's only "one and a half" processors at the best of times. Now, if RAM access is the limiting factor, no SMP solution can help, and the Opteron's NUMA capability is a great improvement.

An optimal HT situation would be one simple process and one more complex process running simultaneously: one operating on a small data set, i.e. one that fits into L1 (which in turn is very small on a P4) with space left over, and the other perhaps on a larger but contiguous data set that allows streaming and SIMD. These would still have to be written and scheduled well to intertwine fruitfully. If, on the other hand, the scheduler sees two (virtual) processors, it will give "one", i.e. half the CPU time, to your one important process while wasting the other half on unimportant processes, e.g. SETI@home.
Randall Schulz
KK