Mailinglist Archive: opensuse-amd64 (274 mails)
Re: [suse-amd64] Opteron Board preference ....
- From: ALBERTO D SCOTTI <ascotti@xxxxxxxxxxxxx>
- Date: Thu, 8 Apr 2004 12:59:02 +0000 (UTC)
- Message-id: <40754C5F.1070902@xxxxxxxxxxxxx>
Dan,
what you tell me makes me think that I still have work to do, especially the fact that both 2.2 GHz and 1.8 GHz CPUs get similar bandwidth.
Here's my configuration:
Quartet motherboard
BIOS upgraded to version PQTDX0-B (9/26/2003). The original BIOS gave about 700 MB/s on 1 CPU!
4 Opteron 840 CPUs (1.4 GHz)
SLES8 SP3
kernel 2.4.21-207-numa
8 x 1 GB 333 MHz PC2700 DIMMs (2 DIMMs per node)
I think that the BIOS is interleaving the DIMMs, but not across nodes.
STREAM compiled with pgf90 -fastsse -Mvect=prefetch
What other options in the BIOS should I watch for? How about 4-bit ECC checking/scrubbing and the like? How about compiler switches?
Thank you
Alberto
Dan Kidger wrote:
Actually most of the stuff I run would be large jobs requiring a
significant fraction of available RAM, too big to fit into cache. I
thought the actual data speed of the hyper-transport bus (6.4 GB/s) was
similar to the memory bus (6.4 GB/s using PC3200 RAM, 5.3 GB/s using
PC2700 RAM), although by different means (64 bit dual-channel DDR bus at
either 166 MHz or 200 MHz for the RAM, 16 bit DDR at 1600 MHz for the
hyper-transport bus). I would also be interested in knowing how SMP is
working .... just to help keep the already busy thread going :-).
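The peak figures quoted above can be reproduced with a little arithmetic. The sketch below is only an illustration: the helper name `peak_bandwidth_gb_s` is made up for this example, and it assumes that the 6.4 GB/s HyperTransport figure counts both directions of a 16-bit link (3.2 GB/s each way), which is my reading rather than something stated in the message.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s, bus_width_bits):
    """Theoretical peak in GB/s: transfers per second times bytes per transfer."""
    return mega_transfers_per_s * 1e6 * (bus_width_bits / 8) / 1e9

# Dual-channel DDR: two 64-bit channels = 128 bits; DDR doubles the clock rate.
pc2700 = peak_bandwidth_gb_s(333.33, 128)   # 166 MHz clock -> about 333 MT/s
pc3200 = peak_bandwidth_gb_s(400.0, 128)    # 200 MHz clock -> 400 MT/s

# HyperTransport: 16 bits per direction at 1600 MT/s (assumption: the quoted
# 6.4 GB/s is the sum of both directions).
ht_one_way = peak_bandwidth_gb_s(1600.0, 16)
ht_both_ways = 2 * ht_one_way

print(f"PC2700 dual channel: {pc2700:.2f} GB/s")
print(f"PC3200 dual channel: {pc3200:.2f} GB/s")
print(f"HyperTransport:      {ht_one_way:.2f} GB/s each way, {ht_both_ways:.2f} GB/s total")
```

These are upper limits only; the STREAM numbers discussed in this thread are measured and come out well below them.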
I can relate my experience with a quad opteron. Using the STREAM benchmark,
which calculates bandwidth by measuring the time required to perform simple
operations on large arrays (where the bottleneck is the streaming speed), I
get about 2 GB/s on a single CPU (PC2700 -> 333 MHz * 16 bytes = 5.3 GB/s peak, 1.4 GHz CPU
speed). Under a NUMA kernel (e.g. 2.4.21-207, SLES8 SP3), the collective
bandwidth gets as high as 7.5 GB/s running on 4 CPUs, whereas under a
non-NUMA kernel (2.4.24) it rarely exceeds 3 GB/s. I have observed similar
trends with different codes, all constrained by the memory bandwidth. The
bottom line is
My STREAM figures are somewhat higher:
one CPU: 3497.4544 MByte/s for STREAM Triad
four CPUs: 12970.4028 MByte/s for STREAM Triad
These are 2.2 GHz CPUs in a 4-way AMD/Celestica 'Quartet' with 333 MHz 512 MB DIMMs.
The kernel is 2.4.21-171.4.5qsnet_numa; otherwise it is a standard SLES8 distro.
(On a 2-way node with 1.8 GHz CPUs and 333 MHz DIMMs I get 3528.9499 MByte/s on one CPU and 6968.9103 MByte/s for Triad when using both CPUs.)
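To compare the figures in this thread on an equal footing, it may help to look at how well each setup scales with the number of CPUs. The sketch below is just arithmetic on the numbers quoted in the messages; the labels and the `scaling` helper are made up for illustration, and the "about 2 GB/s", "7.5 GB/s" and "3 GB/s" entries are the rounded values from the earlier post, so their ratios are approximate.

```python
def scaling(single_mb_s, multi_mb_s, n_cpus):
    """Return (speedup, parallel efficiency) of an n-CPU run versus one CPU."""
    speedup = multi_mb_s / single_mb_s
    return speedup, speedup / n_cpus

# (single-CPU MB/s, all-CPU MB/s, number of CPUs), taken from the thread.
runs = {
    "4 x 2.2 GHz, NUMA kernel": (3497.4544, 12970.4028, 4),
    "2 x 1.8 GHz": (3528.9499, 6968.9103, 2),
    "4 x 1.4 GHz, NUMA kernel (rounded)": (2000.0, 7500.0, 4),
    "4 x 1.4 GHz, non-NUMA kernel (rounded)": (2000.0, 3000.0, 4),
}

for label, (single, multi, n) in runs.items():
    speedup, efficiency = scaling(single, multi, n)
    print(f"{label}: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

On these numbers the NUMA-kernel runs all scale at better than 90% efficiency, so the gap between the two quads is mostly in the single-CPU figure rather than in the scaling.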
As you may know, the Opteron memory bus speed is a function of both the CPU speed and the DIMM speed. What CPUs do you have?
Also make sure the BIOS is interleaving pairs of DIMMs to get 128-bit access.
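For reference, the Triad figures above come from a kernel of the form a[i] = b[i] + q*c[i], with the bandwidth taken as bytes moved divided by the best elapsed time. The sketch below shows only the bookkeeping, in pure Python, so the number it prints reflects interpreter overhead and not memory bandwidth; the real benchmark is compiled C or Fortran. The function names and the array size are chosen for illustration.

```python
import time
from array import array

def triad(a, b, c, q):
    """STREAM-style Triad: a[i] = b[i] + q * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]

def triad_bandwidth_mb_s(n, q=3.0, repeats=3):
    """Run Triad on n doubles; return (MB/s from the best run, result array)."""
    a = array("d", [0.0]) * n
    b = array("d", [2.0]) * n
    c = array("d", [1.0]) * n
    # Two arrays read and one written per element: 3 * 8 bytes = 24 bytes.
    bytes_moved = 3 * a.itemsize * n
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        triad(a, b, c, q)
        best = min(best, time.perf_counter() - start)
    # MB here means 10**6 bytes.
    return bytes_moved / best / 1e6, a

rate, result = triad_bandwidth_mb_s(100_000)
print(f"Triad: {rate:.1f} MB/s (pure Python, illustrative only)")
```

The arrays need to be much larger than the caches for the result to reflect memory rather than cache speed.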
3) Using standard MPICH benchmarks, CPU X to CPU Y bandwidth is of the
order of 750 MB/s, for large packets. Why so low, I do not know. It could
be related to the compiler used to compile the MPICH library (pgcc with
-fastsse -Mvect=prefetch, v 5.1, 64 bit).
Not really a compiler issue here. MPI needs to copy data from the userspace on one CPU to the userspace on the other CPU. Standard security rules stop one process from writing directly into the memory space of the other. Hence MPICH creates a System V shared memory segment that one process writes into and the other reads from. So if the memory bus (on each CPU) can be kept saturated at 2 GB/s, you can never get more than half of this with intra-node MPI. Small messages can take some advantage of the cache, though: I get a best intra-node MPI bandwidth of 1423.49 MB/s with a message size of 256 kBytes. This falls to 893 MB/s for, say, MPI_Sends of 4 MBytes at a time.
(Compare this with around 875 MBytes/s when using MPI between nodes over a Quadrics interconnect.)
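The "never more than half" argument can be put into numbers. This is a rough model only, following the reasoning above: every byte of a message is copied twice (once into the shared segment, once out of it), so the ceiling is the streaming bandwidth divided by the number of copies. The function name is invented for this sketch, and cache effects for small messages are ignored.

```python
def two_copy_bound_mb_s(memory_bw_mb_s, copies=2):
    """Upper bound on message bandwidth when each byte is copied `copies` times."""
    return memory_bw_mb_s / copies

# The 2 GB/s single-CPU streaming rate and the 750 MB/s MPICH figure
# are the values reported earlier in the thread.
bound = two_copy_bound_mb_s(2000.0)
observed = 750.0

print(f"two-copy bound: {bound:.0f} MB/s")
print(f"observed:       {observed:.0f} MB/s ({observed / bound:.0%} of the bound)")
```

Seen this way, 750 MB/s for large messages is about three quarters of what the shared-memory path could deliver at best on that machine.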
--
Alberto Scotti
Asst. Prof.
Dept. of Marine Sciences
CB 3300
University of North Carolina
Chapel Hill, NC 27599-3300
919-962-9454 (w)
919-962-1254 (f)