ascotti@email.unc.edu wrote:
--On Wednesday, April 07, 2004 7:00 PM -0500 "William A. Mahaffey III" wrote:

Darrell Shively wrote:
Hi William:
On Wednesday 07 April 2004 05:54, William A. Mahaffey III wrote:
[...]
Hmmmm .... I thought the CPUs talked to each other (at least the 200 & 800 series) through high-speed buses & could shuttle data between each other as fast as direct memory access (except for some small latency to start the proceedings), no?
Turns out no. The HyperTransport connection between the processors *is* very fast, but not as fast as each processor's 128-bit (or wider) memory bus. This is why the processor-affinity feature of NUMA kernels is important; it tries to keep a process on the processor whose RAM contains its data.
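If you want to force affinity by hand rather than trusting the kernel, a minimal sketch using Linux's sched_setaffinity(2) looks something like this (CPU number 0 is an arbitrary choice, and this only pins the scheduler; under the usual first-touch policy, memory the process then allocates should land in that CPU's local RAM):

/* pin the current process to CPU 0 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);                  /* CPU 0: arbitrary choice */

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return EXIT_FAILURE;
    }
    /* from here on the scheduler keeps us on CPU 0, so pages we
     * touch first should be allocated from CPU 0's local memory */
    printf("pinned to CPU 0\n");
    return 0;
}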
I had been leaning toward some of the balanced MP boards (TYAN S2882, Arima HDAMA) on that count.
It depends on your needs. A second processor can be useful even if its memory access is via a HyperTransport link. It depends on what sort of jobs you are running - if stuff fits mostly in the second processor's cache, then there is happiness.
Regards,
- Darrell
Actually, most of the stuff I run would be large jobs requiring a significant fraction of available RAM, too big to fit into cache. I thought the actual data rate of the HyperTransport bus (6.4 GB/s) was similar to that of the memory bus (6.4 GB/s using PC3200 RAM, 5.3 GB/s using PC2700 RAM), although achieved by different means: a 64-bit dual-channel DDR bus at either 166 MHz or 200 MHz for the RAM, versus a 16-bit DDR link at an effective 1600 MHz for HyperTransport. I would also be interested in knowing how SMP is working .... just to help keep the already busy thread going :-).
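My back-of-the-envelope arithmetic for those figures, as a small C program (my reading of the specs, so corrections welcome; note the 6.4 GB/s HyperTransport number counts both directions together, so each direction is only 3.2 GB/s, while the memory-bus numbers are one-way):

/* bandwidth = bus width in bytes * transfers per second */
#include <stdio.h>

int main(void)
{
    /* dual-channel DDR: 2 x 64 bit = 16 bytes per transfer */
    double pc3200 = 16.0 * 400e6;   /* 200 MHz clock, DDR -> 400 MT/s */
    double pc2700 = 16.0 * 333e6;   /* 166 MHz clock, DDR -> 333 MT/s */
    /* HyperTransport: 16 bit = 2 bytes, 800 MHz clock, DDR -> 1600 MT/s */
    double ht_one_way = 2.0 * 1600e6;

    printf("PC3200 dual channel : %.1f GB/s\n", pc3200 / 1e9);     /* ~6.4 */
    printf("PC2700 dual channel : %.1f GB/s\n", pc2700 / 1e9);     /* ~5.3 */
    printf("HT, per direction   : %.1f GB/s\n", ht_one_way / 1e9); /* ~3.2 */
    printf("HT, both directions : %.1f GB/s\n", 2 * ht_one_way / 1e9);
    return 0;
}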
I can relate my experience with a quad Opteron. Using the STREAM benchmark, which estimates bandwidth by timing simple operations on large arrays (where the bottleneck is the streaming speed), I get about 2 GB/s on a single CPU (PC2700: 333 MHz x 16 bytes = 5.3 GB/s theoretical; 1.4 GHz CPU speed). Under a NUMA kernel (e.g. 2.4.21-207, SLES SP3) the collective bandwidth gets as high as 7.5 GB/s running on 4 CPUs, whereas under a non-NUMA kernel (2.4.24) it rarely exceeds 3 GB/s. I have observed similar trends with different codes, all constrained by memory bandwidth.

The bottom line:

1) The theoretical bandwidth is an ephemeral goal. Compilers are just not that well tuned. I can get closer if I start writing my own assembler code, but that takes time.

2) NUMA makes a big difference.

3) Using standard MPICH benchmarks, CPU X to CPU Y bandwidth is on the order of 750 MB/s for large packets. Why so low, I do not know. It could be related to the compiler used to build the MPICH library (pgcc with -fastsse -Mvect=prefetch, v 5.1, 64 bit). I have read elsewhere that pgcc does not produce good code (I think it was for the ATLAS libraries), but that might have to do with the way the ATLAS routines are written. Actually, it would be interesting to hear from other people on the best combination of compiler/switches. On a related note, I have not noticed much difference between compiling in 32 and 64 bit.
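For reference, the kind of loop STREAM times boils down to something like this (a minimal sketch of the "triad" kernel, not the real benchmark; the array size is illustrative, just large enough to defeat the caches):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (10 * 1000 * 1000)    /* 80 MB per array: far beyond any cache */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    long i;
    if (!a || !b || !c) return EXIT_FAILURE;

    for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    clock_t t0 = clock();
    for (i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];       /* triad: a = b + scalar*c */
    double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* the triad streams three arrays: two reads plus one write */
    printf("a[0] = %f, ~%.2f GB/s\n",
           a[0], 3.0 * N * sizeof(double) / sec / 1e9);
    return 0;
}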
Having said that, my BIOS (quartet motherboard, manufactured by Celestica) allows memory addresses to be distributed in round-robin fashion among the CPUs. When this option is enabled, a large array is spread uniformly among the CPUs. While this is bad when running NUMA, it could improve performance with traditional SMP kernels, but I have not tried that.
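If you would rather not flip the BIOS option, libnuma can give a similar round-robin placement per allocation -- something like the sketch below (this assumes libnuma is installed and you link with -lnuma, which may not be the case on every distribution yet):

#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support in this kernel\n");
        return EXIT_FAILURE;
    }
    size_t bytes = 100 * 1024 * 1024;
    /* pages of this array are spread round-robin across all nodes,
     * so every CPU sees roughly the same average latency/bandwidth */
    double *big = numa_alloc_interleaved(bytes);
    if (!big) return EXIT_FAILURE;
    /* ... use the array ... */
    numa_free(big, bytes);
    return 0;
}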
Overall, I am quite happy with the machine. Based on runs of a parallel CFD code, each Opteron CPU is equivalent to about 2.5 Athlon CPUs at the same frequency. It took almost a month to get all the pieces of software and firmware working together, but it was worth the hassle.
Alberto Scotti
I was revisiting this thread in more detail & came up w/ another question based on this response. What is the difference between NUMA & traditional SMP as referred to in this response? Thanks in advance.