Actually, most of the stuff I run would be large jobs needing a significant fraction of the available RAM, far too big to fit in cache. I thought the actual data rate of the HyperTransport bus (6.4 GB/s) was similar to that of the memory bus (6.4 GB/s with PC3200 RAM, 5.3 GB/s with PC2700), although achieved by different means: a 64-bit dual-channel DDR bus at either 166 MHz or 200 MHz for the RAM, versus a 16-bit DDR HyperTransport link at 1600 MT/s. I would also be interested in hearing how SMP is working out ... just to help keep the already busy thread going :-).
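For what it's worth, those peak figures fall straight out of width x clock x 2 (for DDR). A throwaway C snippet, purely my own illustration and not from any benchmark, just to make the arithmetic explicit:

#include <stdio.h>

/* Peak bandwidth = bus width (bytes) * clock (MHz) * 2 (DDR), in MB/s.
 * Illustrative arithmetic only. */
static double peak_mb_s(double width_bits, double clock_mhz)
{
    return (width_bits / 8.0) * clock_mhz * 2.0;
}

int main(void)
{
    /* two 64-bit channels of PC3200 (200 MHz DDR): 2 * 8 B * 400 MT/s = 6400 MB/s */
    printf("PC3200 dual channel : %.0f MB/s\n", 2.0 * peak_mb_s(64, 200));
    /* two 64-bit channels of PC2700 (166 MHz DDR): ~5300 MB/s */
    printf("PC2700 dual channel : %.0f MB/s\n", 2.0 * peak_mb_s(64, 166.6));
    /* 16-bit HyperTransport link at 800 MHz DDR: 3200 MB/s per direction,
     * 6400 MB/s counting both directions */
    printf("HT link (one way)   : %.0f MB/s\n", peak_mb_s(16, 800));
    return 0;
}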
I can relate my experience with a quad Opteron. Using the STREAM benchmark, which estimates bandwidth by timing simple operations on arrays large enough that streaming speed is the bottleneck, I get about 2 GB/s on a single CPU (PC2700: 333 MT/s x 16 bytes, about 5.3 GB/s theoretical peak; 1.4 GHz CPUs). Under a NUMA kernel (e.g. 2.4.21-207, SLES SP3) the collective bandwidth on all 4 CPUs gets as high as 7.5 GB/s, whereas under a non-NUMA kernel (2.4.24) it rarely exceeds 3 GB/s. I have observed similar trends with other codes that are constrained by memory bandwidth. The bottom line is: if your jobs are bandwidth-bound, run a NUMA-aware kernel.
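For anyone who hasn't run it: the kernel STREAM times for its Triad number is just a scaled vector add over arrays too big for cache. A stripped-down sketch of that kernel (my own, not the official benchmark, which also reports Copy/Scale/Add and takes the best of several trials) looks roughly like this:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define N (20 * 1000 * 1000)   /* big enough to blow out the caches */

static double now_s(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    double scalar = 3.0;

    /* touch everything first so page faults don't pollute the timing */
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    double t = now_s();
    for (long i = 0; i < N; i++)        /* STREAM "Triad" kernel */
        c[i] = a[i] + scalar * b[i];
    t = now_s() - t;

    /* three arrays of doubles cross the memory bus per iteration */
    double mb = 3.0 * N * sizeof(double) / 1e6;
    printf("Triad: %.1f MB/s\n", mb / t);

    free(a); free(b); free(c);
    return 0;
}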
My STREAM figures are somewhat higher:

  one CPU:   3497.4544 MB/s for STREAM Triad
  four CPUs: 12970.4028 MB/s for STREAM Triad

These are 2.2 GHz CPUs in a 4-way AMD/Celestica 'Quartet' with 333 MHz 512 MB DIMMs. The kernel is 2.4.21-171.4.5qsnet_numa, otherwise it is a standard SLES8 distro. (On a 2-way node with 1.8 GHz CPUs and 333 MHz DIMMs I get 3528.9499 MB/s on 1 CPU and 6968.9103 MB/s for Triad when using both CPUs.)

As you may know, the Opteron memory bus speed is a function of both the CPU speed and the DIMM speed. What CPUs did you have? Also make sure the BIOS is interleaving pairs of DIMMs to get 128-bit access.
3) Using standard MPICH benchmarks, CPU-X-to-CPU-Y bandwidth is of the order of 750 MB/s for large messages. Why it is so low, I do not know. It could be related to the compiler used to build the MPICH library (pgcc 5.1, 64-bit, with -fastsse -Mvect=prefetch).
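For context, the usual MPI bandwidth tests are ping-pong style: one rank sends a buffer, the other bounces it back, and bandwidth is the one-way data volume over half the round-trip time. A bare-bones sketch of that (mine, not the actual mpptest code shipped with MPICH) is:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define NBYTES (4 * 1024 * 1024)   /* 4 MB messages */
#define REPS   50

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Status st;
    char *buf = malloc(NBYTES);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) MPI_Abort(MPI_COMM_WORLD, 1);

    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
            MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = MPI_Wtime() - t0;

    if (rank == 0)
        /* one-way bandwidth: bytes sent one way / (round-trip time / 2) */
        printf("%.1f MB/s\n", (double)NBYTES * REPS / (t / 2.0) / 1e6);

    MPI_Finalize();
    free(buf);
    return 0;
}

Run with both ranks on the same node (mpirun -np 2) and you are measuring exactly the shared-memory path discussed below.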
Not really a compiler issue here. MPI needs to copy data from the user space of one process to the user space of the other, and standard protection rules stop one process from writing directly into the memory space of another. Hence MPICH creates a System V shared-memory segment that one process writes into and the other reads from. So if the memory bus (on each CPU) can be kept saturated at 2 GB/s, you can never get more than half of that with intra-node MPI.

Small messages can take some advantage of the cache, though: I get a best intra-node MPI bandwidth of 1423.49 MB/s, with a message size of 256 kB. This falls to 893 MB/s for, say, MPI_Sends of 4 MB at a time. (Compare this with around 875 MB/s when using MPI between nodes over a Quadrics interconnect.)

--
Yours, Daniel.

--------------------------------------------------------------
Dr. Dan Kidger, Quadrics Ltd.      daniel.kidger@quadrics.com
One Bridewell St., Bristol, BS1 2AA, UK         0117 915 5505
----------------------- www.quadrics.com ---------------------
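To put a rough number on the two-copy cost Dan describes: every byte of an intra-node message crosses the memory bus at least twice, once when the sender copies it into the shared segment and once when the receiver copies it out. A single-process toy (my own illustration; real MPICH does the two copies in two separate processes through a SysV segment) that times the double copy against a plain memcpy shows the halving:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define NBYTES (64 * 1024 * 1024)

static double now_s(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    /* 'shm' stands in for the shared-memory segment MPICH would create */
    char *src = malloc(NBYTES), *shm = malloc(NBYTES), *dst = malloc(NBYTES);
    memset(src, 1, NBYTES); memset(shm, 2, NBYTES); memset(dst, 3, NBYTES);

    double t = now_s();
    memcpy(dst, src, NBYTES);        /* one copy: raw memcpy rate        */
    double one = now_s() - t;

    t = now_s();
    memcpy(shm, src, NBYTES);        /* sender -> shared segment         */
    memcpy(dst, shm, NBYTES);        /* shared segment -> receiver       */
    double two = now_s() - t;

    printf("single copy : %.0f MB/s\n", NBYTES / one / 1e6);
    printf("via 'shm'   : %.0f MB/s\n", NBYTES / two / 1e6);
    free(src); free(shm); free(dst);
    return 0;
}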