ascotti@email.unc.edu wrote:
--On Wednesday, April 07, 2004 7:00 PM -0500 "William A. Mahaffey III" wrote:

Darrell Shively wrote:
Hi William:
On Wednesday 07 April 2004 05:54, William A. Mahaffey III wrote:
[...]
Hmmmm .... I thought the CPUs talked to each other (at least the 200 & 800 series) through high-speed buses & could shuttle data between each other as fast as direct memory access (except for some small latency to start the proceedings), no?
Turns out no. The HyperTransport connection between the processors *is* very fast, but not as fast as each processor's 128-bit (or wider) memory bus. This is why the processor-affinity feature of NUMA kernels is important; it tries to keep a process on the processor whose RAM contains its data.
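If you want to force affinity by hand rather than trusting the kernel, a minimal sketch using Linux's sched_setaffinity(2) looks something like this (CPU number 0 is an arbitrary choice, and this only pins the scheduler; under the usual first-touch policy, memory the process then allocates should land in that CPU's local RAM):

/* pin the current process to CPU 0 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);                  /* CPU 0: arbitrary choice */

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return EXIT_FAILURE;
    }
    /* from here on the scheduler keeps us on CPU 0, so pages we
     * touch first should be allocated from CPU 0's local memory */
    printf("pinned to CPU 0\n");
    return 0;
}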
I had been leaning toward some of the balanced MP boards (TYAN S2882, Arima HDAMA) on that count.
It depends on your needs. A second processor can be useful even if its memory access is via a HyperTransport link. It depends on what sort of jobs you are running - if stuff fits mostly in the second processor's cache, then there is happiness.
Regards,
- Darrell
Actually, most of the stuff I run would be large jobs requiring a significant fraction of available RAM, too big to fit into cache. I thought the actual data rate of the HyperTransport bus (6.4 GB/s) was similar to that of the memory bus (6.4 GB/s using PC3200 RAM, 5.3 GB/s using PC2700 RAM), although achieved by different means: a 64-bit dual-channel DDR bus at either 166 MHz or 200 MHz for the RAM, versus a 16-bit DDR link at an effective 1600 MHz for HyperTransport. I would also be interested in knowing how SMP is working .... just to help keep the already busy thread going :-).
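My back-of-the-envelope arithmetic for those figures, as a small C program (my reading of the specs, so corrections welcome; note the 6.4 GB/s HyperTransport number counts both directions together, so each direction is only 3.2 GB/s, while the memory-bus numbers are one-way):

/* bandwidth = bus width in bytes * transfers per second */
#include <stdio.h>

int main(void)
{
    /* dual-channel DDR: 2 x 64 bit = 16 bytes per transfer */
    double pc3200 = 16.0 * 400e6;   /* 200 MHz clock, DDR -> 400 MT/s */
    double pc2700 = 16.0 * 333e6;   /* 166 MHz clock, DDR -> 333 MT/s */
    /* HyperTransport: 16 bit = 2 bytes, 800 MHz clock, DDR -> 1600 MT/s */
    double ht_one_way = 2.0 * 1600e6;

    printf("PC3200 dual channel : %.1f GB/s\n", pc3200 / 1e9);     /* ~6.4 */
    printf("PC2700 dual channel : %.1f GB/s\n", pc2700 / 1e9);     /* ~5.3 */
    printf("HT, per direction   : %.1f GB/s\n", ht_one_way / 1e9); /* ~3.2 */
    printf("HT, both directions : %.1f GB/s\n", 2 * ht_one_way / 1e9);
    return 0;
}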
I can relate my experience with a quad Opteron. Using the STREAM benchmark, which estimates bandwidth by timing simple operations on large arrays (where the bottleneck is the streaming speed), I get about 2 GB/s on a single CPU (PC2700: 333 MHz x 16 bytes = 5.3 GB/s theoretical; 1.4 GHz CPU speed). Under a NUMA kernel (e.g. 2.4.21-207, SLES SP3) the collective bandwidth gets as high as 7.5 GB/s running on 4 CPUs, whereas under a non-NUMA kernel (2.4.24) it rarely exceeds 3 GB/s. I have observed similar trends with different codes, all constrained by memory bandwidth.

The bottom line:

1) The theoretical bandwidth is an ephemeral goal. Compilers are just not that well tuned. I can get closer if I start writing my own assembler code, but that takes time.

2) NUMA makes a big difference.

3) Using standard MPICH benchmarks, CPU X to CPU Y bandwidth is on the order of 750 MB/s for large packets. Why so low, I do not know. It could be related to the compiler used to build the MPICH library (pgcc with -fastsse -Mvect=prefetch, v 5.1, 64 bit). I have read elsewhere that pgcc does not produce good code (I think it was for the ATLAS libraries), but that might have to do with the way the ATLAS routines are written. Actually, it would be interesting to hear from other people on the best combination of compiler/switches. On a related note, I have not noticed much difference between compiling in 32 and 64 bit.
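For reference, the kind of loop STREAM times boils down to something like this (a minimal sketch of the "triad" kernel, not the real benchmark; the array size is illustrative, just large enough to defeat the caches):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (10 * 1000 * 1000)    /* 80 MB per array: far beyond any cache */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    long i;
    if (!a || !b || !c) return EXIT_FAILURE;

    for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    clock_t t0 = clock();
    for (i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];       /* triad: a = b + scalar*c */
    double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* the triad streams three arrays: two reads plus one write */
    printf("a[0] = %f, ~%.2f GB/s\n",
           a[0], 3.0 * N * sizeof(double) / sec / 1e9);
    return 0;
}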
Having said that, my BIOS (quartet motherboard, manufactured by Celestica) allows memory addresses to be distributed in round-robin fashion among the CPUs. When this option is enabled, a large array is spread uniformly among the CPUs. While this is bad when running NUMA, it could improve performance with traditional SMP kernels, but I have not tried that.
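If you would rather not flip the BIOS option, libnuma can give a similar round-robin placement per allocation -- something like the sketch below (this assumes libnuma is installed and you link with -lnuma, which may not be the case on every distribution yet):

#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support in this kernel\n");
        return EXIT_FAILURE;
    }
    size_t bytes = 100 * 1024 * 1024;
    /* pages of this array are spread round-robin across all nodes,
     * so every CPU sees roughly the same average latency/bandwidth */
    double *big = numa_alloc_interleaved(bytes);
    if (!big) return EXIT_FAILURE;
    /* ... use the array ... */
    numa_free(big, bytes);
    return 0;
}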
Overall, I am quite happy with the machine. Based on runs of a parallel CFD code, each Opteron CPU is equivalent to about 2.5 Athlon CPUs at the same frequency. It took almost a month to get all the pieces of software and firmware working together, but it was worth the hassle.
Alberto Scotti
I was revisiting this thread in more detail & came up w/ another question based on this response. What is the difference between NUMA & traditional SMP as referred to in this response? Thanks in advance.