Actually, most of the stuff I run would be large jobs needing a significant fraction of the available RAM, far too big to fit in cache. I thought the actual data rate of the HyperTransport bus (6.4 GB/s) was similar to that of the memory bus (6.4 GB/s with PC3200 RAM, 5.3 GB/s with PC2700), although achieved by different means: a 64-bit dual-channel DDR bus at either 166 MHz or 200 MHz for the RAM, versus a 16-bit DDR HyperTransport link at 1600 MT/s. I would also be interested in hearing how SMP is working out ... just to help keep the already busy thread going :-).
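For what it's worth, those peak figures fall straight out of width x clock x 2 (for DDR). A throwaway C snippet, purely my own illustration and not from any benchmark, just to make the arithmetic explicit:

#include <stdio.h>

/* Peak bandwidth = bus width (bytes) * clock (MHz) * 2 (DDR), in MB/s.
 * Illustrative arithmetic only. */
static double peak_mb_s(double width_bits, double clock_mhz)
{
    return (width_bits / 8.0) * clock_mhz * 2.0;
}

int main(void)
{
    /* two 64-bit channels of PC3200 (200 MHz DDR): 2 * 8 B * 400 MT/s = 6400 MB/s */
    printf("PC3200 dual channel : %.0f MB/s\n", 2.0 * peak_mb_s(64, 200));
    /* two 64-bit channels of PC2700 (166 MHz DDR): ~5300 MB/s */
    printf("PC2700 dual channel : %.0f MB/s\n", 2.0 * peak_mb_s(64, 166.6));
    /* 16-bit HyperTransport link at 800 MHz DDR: 3200 MB/s per direction,
     * 6400 MB/s counting both directions */
    printf("HT link (one way)   : %.0f MB/s\n", peak_mb_s(16, 800));
    return 0;
}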
I can relate my experience with a quad Opteron. Using the STREAM benchmark, which estimates bandwidth by timing simple operations on arrays large enough that streaming speed is the bottleneck, I get about 2 GB/s on a single CPU (PC2700: 333 MT/s x 16 bytes, about 5.3 GB/s theoretical peak; 1.4 GHz CPUs). Under a NUMA kernel (e.g. 2.4.21-207, SLES SP3) the collective bandwidth on all 4 CPUs gets as high as 7.5 GB/s, whereas under a non-NUMA kernel (2.4.24) it rarely exceeds 3 GB/s. I have observed similar trends with other codes that are constrained by memory bandwidth. The bottom line is: if your jobs are bandwidth-bound, run a NUMA-aware kernel.
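For anyone who hasn't run it: the kernel STREAM times for its Triad number is just a scaled vector add over arrays too big for cache. A stripped-down sketch of that kernel (my own, not the official benchmark, which also reports Copy/Scale/Add and takes the best of several trials) looks roughly like this:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define N (20 * 1000 * 1000)   /* big enough to blow out the caches */

static double now_s(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    double scalar = 3.0;

    /* touch everything first so page faults don't pollute the timing */
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    double t = now_s();
    for (long i = 0; i < N; i++)        /* STREAM "Triad" kernel */
        c[i] = a[i] + scalar * b[i];
    t = now_s() - t;

    /* three arrays of doubles cross the memory bus per iteration */
    double mb = 3.0 * N * sizeof(double) / 1e6;
    printf("Triad: %.1f MB/s\n", mb / t);

    free(a); free(b); free(c);
    return 0;
}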
My STREAM figures are somewhat higher:

  one CPU:   3497.4544 MB/s for STREAM Triad
  four CPUs: 12970.4028 MB/s for STREAM Triad

These are 2.2 GHz CPUs in a 4-way AMD/Celestica 'Quartet' with 333 MHz 512 MB DIMMs. The kernel is 2.4.21-171.4.5qsnet_numa, otherwise it is a standard SLES8 distro. (On a 2-way node with 1.8 GHz CPUs and 333 MHz DIMMs I get 3528.9499 MB/s on 1 CPU and 6968.9103 MB/s for Triad when using both CPUs.)

As you may know, the Opteron memory bus speed is a function of both the CPU speed and the DIMM speed. What CPUs did you have? Also make sure the BIOS is interleaving pairs of DIMMs to get 128-bit access.
3) Using standard MPICH benchmarks, CPU-X-to-CPU-Y bandwidth is of the order of 750 MB/s for large messages. Why it is so low, I do not know. It could be related to the compiler used to build the MPICH library (pgcc 5.1, 64-bit, with -fastsse -Mvect=prefetch).
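For context, the usual MPI bandwidth tests are ping-pong style: one rank sends a buffer, the other bounces it back, and bandwidth is the one-way data volume over half the round-trip time. A bare-bones sketch of that (mine, not the actual mpptest code shipped with MPICH) is:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define NBYTES (4 * 1024 * 1024)   /* 4 MB messages */
#define REPS   50

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Status st;
    char *buf = malloc(NBYTES);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) MPI_Abort(MPI_COMM_WORLD, 1);

    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
            MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = MPI_Wtime() - t0;

    if (rank == 0)
        /* one-way bandwidth: bytes sent one way / (round-trip time / 2) */
        printf("%.1f MB/s\n", (double)NBYTES * REPS / (t / 2.0) / 1e6);

    MPI_Finalize();
    free(buf);
    return 0;
}

Run with both ranks on the same node (mpirun -np 2) and you are measuring exactly the shared-memory path discussed below.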
Not really a compiler issue here. MPI needs to copy data from the user space of one process to the user space of the other, and standard protection rules stop one process from writing directly into the memory space of another. Hence MPICH creates a System V shared-memory segment that one process writes into and the other reads from. So if the memory bus (on each CPU) can be kept saturated at 2 GB/s, you can never get more than half of that with intra-node MPI.

Small messages can take some advantage of the cache, though: I get a best intra-node MPI bandwidth of 1423.49 MB/s, with a message size of 256 kB. This falls to 893 MB/s for, say, MPI_Sends of 4 MB at a time. (Compare this with around 875 MB/s when using MPI between nodes over a Quadrics interconnect.)

--
Yours, Daniel.

--------------------------------------------------------------
Dr. Dan Kidger, Quadrics Ltd.      daniel.kidger@quadrics.com
One Bridewell St., Bristol, BS1 2AA, UK         0117 915 5505
----------------------- www.quadrics.com ---------------------
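To put a rough number on the two-copy cost Dan describes: every byte of an intra-node message crosses the memory bus at least twice, once when the sender copies it into the shared segment and once when the receiver copies it out. A single-process toy (my own illustration; real MPICH does the two copies in two separate processes through a SysV segment) that times the double copy against a plain memcpy shows the halving:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define NBYTES (64 * 1024 * 1024)

static double now_s(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    /* 'shm' stands in for the shared-memory segment MPICH would create */
    char *src = malloc(NBYTES), *shm = malloc(NBYTES), *dst = malloc(NBYTES);
    memset(src, 1, NBYTES); memset(shm, 2, NBYTES); memset(dst, 3, NBYTES);

    double t = now_s();
    memcpy(dst, src, NBYTES);        /* one copy: raw memcpy rate        */
    double one = now_s() - t;

    t = now_s();
    memcpy(shm, src, NBYTES);        /* sender -> shared segment         */
    memcpy(dst, shm, NBYTES);        /* shared segment -> receiver       */
    double two = now_s() - t;

    printf("single copy : %.0f MB/s\n", NBYTES / one / 1e6);
    printf("via 'shm'   : %.0f MB/s\n", NBYTES / two / 1e6);
    free(src); free(shm); free(dst);
    return 0;
}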