Support from the manufacturer for the Arima line of motherboards can be found here:
http://www.rioworks.com/Download/HDAMA.htm
http://www.rioworks.com/Service-Download.htm

Regards,
Dan

PS - I'd be delighted if Tyan got their act together; they've always had some of the most appealing designs on the market.

-----Original Message-----
From: Mark Horton [mailto:mark@nostromo.net]
Sent: Tuesday, April 06, 2004 12:26 PM
To: SuSE AMD 64 Mailing List
Subject: Re: [suse-amd64] Opteron Board preference ....

I'll chime in with my 2 cents' worth... I have an Arima HDAMA and a Tyan 2885 and both work great. However, the HDAMA didn't support our RAID card, a MegaRAID SCSI 320-4X. I called Accelertech a couple of times, and they were nice, but they didn't seem to have the resources to track down the problem. As far as I could tell, I was the only person in the world having this particular compatibility issue. They did say that if I sent them the card they would try to fix the issue, which I thought was a good sign. I ended up getting the Tyan because I knew it already supported the card.

From what I understand, some (or all) of the IWill Opteron boards are the same as the Arima boards. The Arima HDAMC is the exact same board as the IWill DK8X.

A couple of nits about the HDAMA. Two of the DIMM slots on the second CPU are very close together: when populated, those two DIMMs are almost touching, while the other slots have a decent space between them. This probably isn't a huge deal, but it worried me a little. The Tyan board has a space between all 8 DIMM slots.

One final nit-pick about the HDAMA: Accelertech's website (www.accelertech.com) has been down for over a week now and is still down. They seemed to provide support, BIOS upgrades, and docs for the Arima boards.

Mark

Miller, Daniel J. wrote:
MSI-9131 and Arima HDAMA have both been good for us. One of our MSIs
has 6 GB of RAM; most of the rest of our systems have 4 GB (or less).
From past personal experience with the Tyan 24XX (and other) boards, I'd stay away from Tyan. Supposedly Tyan boards are now OK (they won't just die after anywhere from a few months to three years), but I'd let someone else prove or disprove that supposition. The only other board line I've ever seen with a failure rate like Tyan's is PCChips.
I hate to be so negative about a line that has so many terrific-looking board designs, but so it goes.
Maybe our distributor happened to store their Tyan boards next to an industrial microwave oven with broken shielding, leaving the boxes pristine but the components inside mortally wounded, and no one but us has had reliability problems with Tyan boards - but I doubt it.
Tyan has been offering attractive designs for a long time. I remember an early dual-Pentium Tyan system (plain Pentium only, no MMX) that had a screwed-up cache controller design limiting its effective memory support to 64 MB. At least that board didn't flat out fail after a year.
They make some very appealing systems, because they're often the first (or the only) company to offer certain configurations, but I've learned to find something else, however appealing the Tyan may look.
I'll probably get seduced into trying them again in about 2 years, but
my most recent experiences (with some Athlon MP boards) will keep me away from Tyan for a while.
Get an Asus, MSI, or Arima board.
-----Original Message-----
From: wam@mail.hiwaay.net [mailto:wam@mail.hiwaay.net] On Behalf Of William A. Mahaffey III
Sent: Saturday, March 27, 2004 8:28 AM
To: SuSE AMD 64 Mailing List
Subject: [suse-amd64] Opteron Board preference ....
.... I am getting close to trying to build an Opteron-based machine as a compute node on my LAN, set up to run w/o monitor, keyboard, or mouse; not a file server, just a (fast) CPU & lots of RAM. I have narrowed it down to either the ASUS SK8V or the TYAN Tomcat K8S (both available @ newegg.com). Does anyone have any strong (preferably firsthand :-)) opinions, yea or nay, about either of these boards? Thanks in advance.
On Tue, 6 Apr 2004 12:41:57 -0500, "Miller, Daniel J." wrote:
Support from the manufacturer for the Arima line of motherboards can be found here:
One word of caution with Arima: while their motherboards seem to be very stable (not many problems reported), some of them, like the HDAMB, are not true NUMA: they have the memory connected to only a single CPU. If you want a well-scaling Opteron system, I would avoid such designs. Other boards from Arima may be OK in this regard. This can normally be checked easily on board pictures - the traces from the DIMM slots to the CPUs are visible. They should lead to both CPU sockets.

-Andi
Andi Kleen wrote:
On Tue, 6 Apr 2004 12:41:57 -0500, "Miller, Daniel J." wrote:
Support from the manufacturer for the Arima line of motherboards can be found here:
One word of caution with Arima: while their motherboards seem to be very stable (not many problems reported), some of them, like the HDAMB, are not true NUMA: they have the memory connected to only a single CPU. If you want a well-scaling Opteron system, I would avoid such designs. Other boards from Arima may be OK in this regard.
This can normally be checked easily on board pictures - the traces from the DIMM slots to the CPUs are visible. They should lead to both CPU sockets.
-Andi
Hmmmm .... I thought the CPUs talked to each other (at least the 200 & 800 series) through high-speed buses & could shuttle data between each other as fast as direct memory access (except for some small latency to start the proceedings), no? I had been leaning toward some of the balanced MP boards (TYAN S2882, Arima HDAMA) on that count.
Hi William:

On Wednesday 07 April 2004 05:54, William A. Mahaffey III wrote:
[...]
Hmmmm .... I thought the CPUs talked to each other (at least the 200 & 800 series) through high-speed buses & could shuttle data between each other as fast as direct memory access (except for some small latency to start the proceedings), no?
Turns out no. The HyperTransport connection between the processors *is* very fast, but not as fast as each processor's 128-bit-wide (or wider) memory bus. This is why the processor affinity feature of NUMA kernels is important; it tries to keep a process on the processor whose RAM contains its data.
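From user space, that affinity mechanism roughly amounts to pinning a process to a CPU so its pages stay on that CPU's node. A minimal sketch using the Linux sched_setaffinity(2) call (the choice of CPU 0 and the terse error handling are illustrative, not anything from this thread):

/* Pin the calling process to CPU 0 so the NUMA kernel keeps it next to
 * the memory it allocates.  Linux-specific; build with gcc. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    CPU_SET(0, &mask);                      /* run only on CPU 0 */

    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return EXIT_FAILURE;
    }

    /* Memory allocated and first touched from here on is normally placed
     * on CPU 0's local node by the kernel's default local policy. */
    puts("pinned to CPU 0");
    return EXIT_SUCCESS;
}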
I had been leaning toward some of the balanced MP boards (TYAN S2882, Arima HDAMA) on that count.
It depends on your needs. A second processor can be useful even if its memory access is via a HyperTransport link. It depends on what sort of jobs you are running - if stuff fits mostly in the second processor's cache then there is happiness.

Regards,
- Darrell

--
sused@mucus.com
"Perfect! ....what am I doing?" -- Washu
Darrell Shively wrote:
Hi William:
On Wednesday 07 April 2004 05:54, William A. Mahaffey III wrote:
[...]
Hmmmm .... I thought the CPUs talked to each other (at least the 200 & 800 series) through high-speed buses & could shuttle data between each other as fast as direct memory access (except for some small latency to start the proceedings), no?
Turns out no. The HyperTransport connection between the processors *is* very fast, but not as fast as each processor's 128-bit-wide (or wider) memory bus. This is why the processor affinity feature of NUMA kernels is important; it tries to keep a process on the processor whose RAM contains its data.
I had been leaning toward some of the balanced MP boards (TYAN S2882, Arima HDAMA) on that count.
It depends on your needs. A second processor can be useful even if its memory access is via a HyperTransport link. It depends on what sort of jobs you are running - if stuff fits mostly in the second processor's cache then there is happiness.
Regards,
- Darrell

--
sused@mucus.com
"Perfect! ....what am I doing?" -- Washu
Actually, most of the stuff I run would be large jobs requiring a significant fraction of available RAM, too big to fit into cache. I thought the actual data speed of the HyperTransport bus (6.4 GB/s) was similar to that of the memory bus (6.4 GB/s using PC3200 RAM, 5.3 GB/s using PC2700 RAM), although achieved by different means (a 64-bit dual-channel DDR bus at either 166 MHz or 200 MHz for the RAM, 16-bit DDR at an effective 1600 MHz for the HyperTransport bus). I would also be interested in knowing how SMP is working .... just to help keep the already busy thread going :-).
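Those peak figures all come from width times transfer rate. A quick sanity-check sketch (assuming the quoted 1600 MHz HyperTransport figure means 1600 MT/s, and noting that the 6.4 GB/s HT number counts both directions of the link):

/* Back-of-the-envelope check of the theoretical bandwidth figures above:
 * bytes-per-transfer x transfers-per-second. */
#include <stdio.h>

static double gbps(double bytes_wide, double mtps)
{
    return bytes_wide * mtps * 1e6 / 1e9;     /* MT/s -> GB/s */
}

int main(void)
{
    /* Dual-channel DDR: 2 x 64 bits = 16 bytes per transfer. */
    printf("PC3200 (200 MHz DDR, 400 MT/s): %.1f GB/s\n", gbps(16, 400));
    printf("PC2700 (166 MHz DDR, 333 MT/s): %.1f GB/s\n", gbps(16, 333));

    /* HyperTransport: 16 bits = 2 bytes per transfer at 1600 MT/s.
     * 3.2 GB/s each way; 6.4 GB/s only when both directions are counted. */
    printf("HT 16-bit @ 1600 MT/s, one way:    %.1f GB/s\n", gbps(2, 1600));
    printf("HT 16-bit @ 1600 MT/s, both ways:  %.1f GB/s\n", gbps(2, 1600) * 2);
    return 0;
}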
--On Wednesday, April 07, 2004 7:00 PM -0500, "William A. Mahaffey III" wrote:
Darrell Shively wrote:
Hi William:
On Wednesday 07 April 2004 05:54, William A. Mahaffey III wrote:
[...]
Hmmmm .... I thought the CPUs talked to each other (at least the 200 & 800 series) through high-speed buses & could shuttle data between each other as fast as direct memory access (except for some small latency to start the proceedings), no?
Turns out no. The HyperTransport connection between the processors *is* very fast, but not as fast as each processor's 128-bit-wide (or wider) memory bus. This is why the processor affinity feature of NUMA kernels is important; it tries to keep a process on the processor whose RAM contains its data.
I had been leaning toward some of the balanced MP boards (TYAN S2882, Arima HDAMA) on that count.
It depends on your needs. A second processor can be useful even if its memory access is via a HyperTransport link. It depends on what sort of jobs you are running - if stuff fits mostly in the second processor's cache then there is happiness.
Regards,
- Darrell

--
sused@mucus.com
"Perfect! ....what am I doing?" -- Washu
Actually, most of the stuff I run would be large jobs requiring a significant fraction of available RAM, too big to fit into cache. I thought the actual data speed of the HyperTransport bus (6.4 GB/s) was similar to that of the memory bus (6.4 GB/s using PC3200 RAM, 5.3 GB/s using PC2700 RAM), although achieved by different means (a 64-bit dual-channel DDR bus at either 166 MHz or 200 MHz for the RAM, 16-bit DDR at an effective 1600 MHz for the HyperTransport bus). I would also be interested in knowing how SMP is working .... just to help keep the already busy thread going :-).
I can relate my experience with a quad Opteron. Using the STREAM benchmark, which calculates bandwidth by measuring the time required to perform simple operations on large arrays (where the bottleneck is the streaming speed), I get about 2 GB/s on a single CPU (PC2700: 333 MHz * 16 bytes = 5.3 GB/s theoretical; 1.4 GHz CPU). Under a NUMA kernel (e.g. 2.4.21-207, SLES SP3), the collective bandwidth gets as high as 7.5 GB/s running on 4 CPUs, whereas running a non-NUMA kernel (2.4.24) it rarely exceeds 3 GB/s. I have observed similar trends with different codes, all constrained by memory bandwidth. The bottom line is:

1) The theoretical bandwidth is an elusive goal. Compilers just are not that tuned. I can get closer if I start writing my own assembler code, but that takes time.

2) NUMA makes a big difference.

3) Using standard MPICH benchmarks, CPU X to CPU Y bandwidth is of the order of 750 MB/s for large packets. Why so low, I do not know. It could be related to the compiler used to compile the MPICH library (pgcc with -fastsse -Mvect=prefetch, v 5.1, 64-bit). I have read elsewhere that pgcc does not produce good code (I think it was for the ATLAS libraries), but that might have to do with the way ATLAS routines are written. Actually, it would be interesting to hear from other people on the best combination of compiler/switches. On a related note, I have not noticed much difference between compiling in 32-bit or 64-bit.

Having said that, my BIOS (Quartet motherboard, manufactured by Celestica) allows memory addresses to be distributed in a round-robin fashion among CPUs. When this option is enabled, a large array is spread uniformly among CPUs. While this is bad when running NUMA, it could improve performance with traditional SMP kernels, but I have not tried that.

Overall, I am quite happy with the machine. Based on runs of a (parallel) CFD code, each Opteron CPU is equivalent to 2.5 Athlon CPUs at the same frequency. It took almost a month to get all the pieces of software and firmware to work together, but it was worth the hassle.

Alberto Scotti
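For anyone who has not seen it, the core of what STREAM measures is small enough to sketch. A stripped-down "triad" loop in C (the array size and repetition count here are arbitrary illustrative values, not the official STREAM defaults):

/* Minimal STREAM-style triad: time a[i] = b[i] + q*c[i] over arrays far
 * larger than cache and report bytes moved per second.
 * Build with: gcc -O2 -std=c99 stream_triad.c */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define N    (20 * 1000 * 1000)   /* ~160 MB per array, ~480 MB total */
#define REPS 10

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    const double q = 3.0;

    if (!a || !b || !c) { perror("malloc"); return 1; }

    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    double t0 = now();
    for (int r = 0; r < REPS; r++)
        for (long i = 0; i < N; i++)
            a[i] = b[i] + q * c[i];
    double secs = now() - t0;

    /* Triad touches three doubles per element per repetition. */
    double bytes = 3.0 * sizeof(double) * (double)N * REPS;
    printf("check: a[0] = %.1f\n", a[0]);
    printf("Triad: %.2f GB/s\n", bytes / secs / 1e9);

    free(a); free(b); free(c);
    return 0;
}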
Actually, most of the stuff I run would be large jobs requiring a significant fraction of available RAM, too big to fit into cache. I thought the actual data speed of the HyperTransport bus (6.4 GB/s) was similar to that of the memory bus (6.4 GB/s using PC3200 RAM, 5.3 GB/s using PC2700 RAM), although achieved by different means (a 64-bit dual-channel DDR bus at either 166 MHz or 200 MHz for the RAM, 16-bit DDR at an effective 1600 MHz for the HyperTransport bus). I would also be interested in knowing how SMP is working .... just to help keep the already busy thread going :-).
I can relate my experience with a quad Opteron. Using the STREAM benchmark, which calculates bandwidth by measuring the time required to perform simple operations on large arrays (where the bottleneck is the streaming speed), I get about 2 GB/s on a single CPU (PC2700: 333 MHz * 16 bytes = 5.3 GB/s theoretical; 1.4 GHz CPU). Under a NUMA kernel (e.g. 2.4.21-207, SLES SP3), the collective bandwidth gets as high as 7.5 GB/s running on 4 CPUs, whereas running a non-NUMA kernel (2.4.24) it rarely exceeds 3 GB/s. I have observed similar trends with different codes, all constrained by memory bandwidth. The bottom line is
My STREAM figures are somewhat higher:

one CPU: 3497.4544 MByte/s for STREAM Triad
four CPUs: 12970.4028 MByte/s for STREAM Triad

These are 2.2 GHz CPUs in a 4-way AMD/Celestica 'Quartet' with 333 MHz 512 MB DIMMs. The kernel is 2.4.21-171.4.5qsnet_numa; otherwise it is a standard SLES8 distro. (On a 2-way node with 1.8 GHz CPUs + 333 MHz DIMMs I get 3528.9499 MByte/s on 1 CPU and 6968.9103 MByte/s for Triad when using both CPUs.)

As you may know, the Opteron memory bus speed is a function of both the CPU speed and the DIMM speed. What CPUs did you have? Also make sure the BIOS is interleaving pairs of DIMMs to get 128-bit access.
3) Using standard MPICH benchmarks, CPU X to CPU Y bandwidth is of the order of 750 MB/s, for large packets. Why so low, I do not know. It could be related to the compiler used to compile the MPICH library (pgcc with -fastsse -Mvect=prefetch, v 5.1, 64 bit).
Not really a compiler issue here. MPI needs to copy data from the userspace of one CPU to the userspace of the other CPU. Standard security rules stop one process from directly writing into the memory space of the other. Hence MPICH creates a System V shared memory segment that one writes into and the other reads from. So if the memory bus (on each CPU) can be kept saturated at 2 GB/s, you can never get more than half of this with intra-node MPI.

Small messages can take some advantage of the cache, though - I get a best intra-node MPI bandwidth of 1423.49 MB/s, with a message size of 256 kBytes. This falls to 893 MB/s for, say, MPI_Sends of 4 MBytes at a time. (Compare this with around 875 MBytes/s when using MPI between nodes over a Quadrics interconnect.)

--
Yours,
Daniel.

--------------------------------------------------------------
Dr. Dan Kidger, Quadrics Ltd.      daniel.kidger@quadrics.com
One Bridewell St., Bristol, BS1 2AA, UK         0117 915 5505
----------------------- www.quadrics.com --------------------
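The kind of measurement being discussed is easy to reproduce. A minimal one-way bandwidth sketch using plain MPI_Send/MPI_Recv (message size, repeat count, and the final acknowledgement message are arbitrary choices; run with two ranks on one node to exercise the shared-memory path):

/* Rank 0 streams large messages to rank 1; rank 0 then reports MB/s.
 * Run e.g.: mpirun -np 2 ./bw */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MSG_BYTES (4 * 1024 * 1024)   /* 4 MB messages */
#define REPS 100

int main(int argc, char **argv)
{
    int rank, size;
    char ack = 0;
    MPI_Status st;
    char *buf = malloc(MSG_BYTES);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "need at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }
    memset(buf, 0, MSG_BYTES);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int r = 0; r < REPS; r++) {
        if (rank == 0)
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
    }

    /* Short acknowledgement so rank 0's timer covers actual delivery. */
    if (rank == 1)
        MPI_Send(&ack, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
    else if (rank == 0)
        MPI_Recv(&ack, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, &st);

    double secs = MPI_Wtime() - t0;
    if (rank == 0)
        printf("one-way bandwidth: %.1f MB/s\n",
               (double)MSG_BYTES * REPS / secs / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}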
Dan, what you tell me makes me think that I still have work to do, especially given that both the 2.2 GHz and the 1.8 GHz CPUs get similar bandwidth. Here's my configuration:

Quartet motherboard, BIOS upgraded to version PQTDX0-B (9/26/2003). The original BIOS gave about 700 MB/s on 1 CPU!
4 Opteron 840 CPUs (1.4 GHz)
SLES8 SP3, kernel 2.4.21-207-numa
8 x 1 GB 333 MHz PC2700 DIMMs (2 DIMMs per node)

I think that the BIOS is interleaving the DIMMs, but not across nodes. STREAM was compiled with pgf90 -fastsse -Mvect=prefetch.

What other options in the BIOS should I watch for? How about 4-bit ECC checking/scrubbing and the like? How about compiler switches?

Thank you,
Alberto

Dan Kidger wrote:
Actually, most of the stuff I run would be large jobs requiring a significant fraction of available RAM, too big to fit into cache. I thought the actual data speed of the HyperTransport bus (6.4 GB/s) was similar to that of the memory bus (6.4 GB/s using PC3200 RAM, 5.3 GB/s using PC2700 RAM), although achieved by different means (a 64-bit dual-channel DDR bus at either 166 MHz or 200 MHz for the RAM, 16-bit DDR at an effective 1600 MHz for the HyperTransport bus). I would also be interested in knowing how SMP is working .... just to help keep the already busy thread going :-).
I can relate my experience with a quad Opteron. Using the STREAM benchmark, which calculates bandwidth by measuring the time required to perform simple operations on large arrays (where the bottleneck is the streaming speed), I get about 2 GB/s on a single CPU (PC2700: 333 MHz * 16 bytes = 5.3 GB/s theoretical; 1.4 GHz CPU). Under a NUMA kernel (e.g. 2.4.21-207, SLES SP3), the collective bandwidth gets as high as 7.5 GB/s running on 4 CPUs, whereas running a non-NUMA kernel (2.4.24) it rarely exceeds 3 GB/s. I have observed similar trends with different codes, all constrained by memory bandwidth. The bottom line is
My STREAM figures are somewhat higher:

one CPU: 3497.4544 MByte/s for STREAM Triad
four CPUs: 12970.4028 MByte/s for STREAM Triad

These are 2.2 GHz CPUs in a 4-way AMD/Celestica 'Quartet' with 333 MHz 512 MB DIMMs. The kernel is 2.4.21-171.4.5qsnet_numa; otherwise it is a standard SLES8 distro.

(On a 2-way node with 1.8 GHz CPUs + 333 MHz DIMMs I get 3528.9499 MByte/s on 1 CPU and 6968.9103 MByte/s for Triad when using both CPUs.)

As you may know, the Opteron memory bus speed is a function of both the CPU speed and the DIMM speed. What CPUs did you have? Also make sure the BIOS is interleaving pairs of DIMMs to get 128-bit access.
3) Using standard MPICH benchmarks, CPU X to CPU Y bandwidth is of the order of 750 MB/s, for large packets. Why so low, I do not know. It could be related to the compiler used to compile the MPICH library (pgcc with -fastsse -Mvect=prefetch, v 5.1, 64 bit).
Not really a compiler issue here. MPI needs to copy data from the userspace of one CPU to the userspace of the other CPU. Standard security rules stop one process from directly writing into the memory space of the other. Hence MPICH creates a System V shared memory segment that one writes into and the other reads from. So if the memory bus (on each CPU) can be kept saturated at 2 GB/s, you can never get more than half of this with intra-node MPI. Small messages can take some advantage of the cache, though - I get a best intra-node MPI bandwidth of 1423.49 MB/s, with a message size of 256 kBytes. This falls to 893 MB/s for, say, MPI_Sends of 4 MBytes at a time. (Compare this with around 875 MBytes/s when using MPI between nodes over a Quadrics interconnect.)
--
Alberto Scotti
Asst. Prof., Dept. of Marine Sciences
CB 3300, University of North Carolina
Chapel Hill, NC 27599-3300
919-962-9454 (w)  919-962-1254 (f)
ascotti@email.unc.edu wrote:
--On Wednesday, April 07, 2004 7:00 PM -0500, "William A. Mahaffey III" wrote:
Darrell Shively wrote:
Hi William:
On Wednesday 07 April 2004 05:54, William A. Mahaffey III wrote:
[...]
Hmmmm .... I thought the CPUs talked to each other (at least the 200 & 800 series) through high-speed buses & could shuttle data between each other as fast as direct memory access (except for some small latency to start the proceedings), no?
Turns out no. The HyperTransport connection between the processors *is* very fast, but not as fast as each processor's 128-bit-wide (or wider) memory bus. This is why the processor affinity feature of NUMA kernels is important; it tries to keep a process on the processor whose RAM contains its data.
I had been leaning toward some of the balanced MP boards (TYAN S2882, Arima HDAMA) on that count.
It depends on your needs. A second processor can be useful even if its memory access is via a HyperTransport link. It depends on what sort of jobs you are running - if stuff fits mostly in the second processor's cache then there is happiness.
Regards,
- Darrell

--
sused@mucus.com
"Perfect! ....what am I doing?" -- Washu
Actually, most of the stuff I run would be large jobs requiring a significant fraction of available RAM, too big to fit into cache. I thought the actual data speed of the HyperTransport bus (6.4 GB/s) was similar to that of the memory bus (6.4 GB/s using PC3200 RAM, 5.3 GB/s using PC2700 RAM), although achieved by different means (a 64-bit dual-channel DDR bus at either 166 MHz or 200 MHz for the RAM, 16-bit DDR at an effective 1600 MHz for the HyperTransport bus). I would also be interested in knowing how SMP is working .... just to help keep the already busy thread going :-).
I can relate my experience with a quad Opteron. Using the STREAM benchmark, which calculates bandwidth by measuring the time required to perform simple operations on large arrays (where the bottleneck is the streaming speed), I get about 2 GB/s on a single CPU (PC2700: 333 MHz * 16 bytes = 5.3 GB/s theoretical; 1.4 GHz CPU). Under a NUMA kernel (e.g. 2.4.21-207, SLES SP3), the collective bandwidth gets as high as 7.5 GB/s running on 4 CPUs, whereas running a non-NUMA kernel (2.4.24) it rarely exceeds 3 GB/s. I have observed similar trends with different codes, all constrained by memory bandwidth. The bottom line is:

1) The theoretical bandwidth is an elusive goal. Compilers just are not that tuned. I can get closer if I start writing my own assembler code, but that takes time.

2) NUMA makes a big difference.

3) Using standard MPICH benchmarks, CPU X to CPU Y bandwidth is of the order of 750 MB/s for large packets. Why so low, I do not know. It could be related to the compiler used to compile the MPICH library (pgcc with -fastsse -Mvect=prefetch, v 5.1, 64-bit). I have read elsewhere that pgcc does not produce good code (I think it was for the ATLAS libraries), but that might have to do with the way ATLAS routines are written. Actually, it would be interesting to hear from other people on the best combination of compiler/switches. On a related note, I have not noticed much difference between compiling in 32-bit or 64-bit.
Having said that, my BIOS (Quartet motherboard, manufactured by Celestica) allows memory addresses to be distributed in a round-robin fashion among CPUs. When this option is enabled, a large array is spread uniformly among CPUs. While this is bad when running NUMA, it could improve performance with traditional SMP kernels, but I have not tried that.
Overall, I am quite happy with the machine. Based on runs of a (parallel) CFD code, each Opteron CPU is equivalent to 2.5 Athlon CPUs at the same frequency. It took almost a month to get all the pieces of software and firmware to work together, but it was worth the hassle.
Alberto Scotti
I was revisiting this thread in more detail & came up w/ another question based on this response. What is the difference between NUMA & traditional SMP as referred to in this response? Thanks in advance.
Hmmmm .... I thought the CPUs talked to each other (at least the 200 & 800 series) through high-speed buses & could shuttle data between each other as fast as direct memory access (except for some small latency to start the proceedings), no?
Turns out no. The HyperTransport connection between the processors *is* very fast, but not as fast as each processor's 128-bit-wide (or wider) memory bus. This is why the processor affinity feature of NUMA kernels is important; it tries to keep a process on the processor whose RAM contains its data.
Even if the HyperTransport link were infinitely fast, you would still want memory local to each CPU. This is because in most cases there is an application running on every CPU (certainly for us scientific users). Each application can typically saturate its local memory bus. If the applications on the other CPUs are also memory-bandwidth hungry, then all the applications slow each other down. This is of course the bane of all those dual-Xeon HPC servers, where one copy of a finite element program takes, say, X minutes, but with two running simultaneously (one per CPU) each now takes 1.4X minutes. Under a NUMA kernel on an Opteron, they do not slow each other down at all.

--
Yours,
Daniel.

--------------------------------------------------------------
Dr. Dan Kidger, Quadrics Ltd.      daniel.kidger@quadrics.com
One Bridewell St., Bristol, BS1 2AA, UK         0117 915 5505
----------------------- www.quadrics.com --------------------
On Thu, 8 Apr 2004 12:05:10 +0100, Dan Kidger wrote:
This is because in most cases there is an application running on every CPU (certainly for us scientific users). Each application can typically saturate its local memory bus. If the applications on the other CPUs are also memory-bandwidth hungry, then all the applications slow each other down. This is of course the bane of all those dual-Xeon HPC servers, where one copy of a finite element program takes, say, X minutes, but with two running simultaneously (one per CPU) each now takes 1.4X minutes. Under a NUMA kernel on an Opteron, they do not slow each other down at all.
The local policy used by the NUMA kernel is not always optimal though - that is why it is sometimes faster to configure node cache-line interleaving in the BIOS. This happens, for example, when the working set of all CPUs exceeds a single node and the workload prefers bandwidth over latency. Or when you have a program running on only a single CPU but it needs all the bandwidth it can get; then interleaving is the best policy, because you combine the bandwidth of all available memory controllers. For most workloads local affinity seems to be pretty good though, so it's a good default.

The 9.1/SLES9 kernel will have a new NUMA API that will allow NUMA policies [local affinity, binding to a specific CPU, page interleaving] to be configured at a fine grain, per process and per memory mapping, without rebooting.

-Andi
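For the curious: per-mapping interleaving of this kind later became reachable through the mbind(2) system call and the numactl/libnuma tools (my reading of where this API ended up, not something stated in the thread). A minimal sketch, assuming a two-node box with libnuma installed:

/* Interleave one large anonymous mapping across NUMA nodes 0 and 1,
 * leaving the rest of the process on the default local policy.
 * Build with: gcc interleave.c -lnuma */
#include <numaif.h>     /* mbind(), MPOL_INTERLEAVE */
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

#define LEN (256UL * 1024 * 1024)   /* 256 MB region */

int main(void)
{
    void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Bit mask of NUMA nodes: bits 0 and 1 -> nodes 0 and 1. */
    unsigned long nodemask = 0x3;

    if (mbind(p, LEN, MPOL_INTERLEAVE, &nodemask,
              sizeof(nodemask) * 8, 0) != 0) {
        perror("mbind");
        return 1;
    }

    /* Pages are now placed round-robin across the two nodes as they are
     * first touched, spreading bandwidth like the BIOS node-interleave
     * option, but only for this one mapping. */
    memset(p, 0, LEN);
    puts("mapping interleaved across nodes 0 and 1");
    return 0;
}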
participants (7):
- ALBERTO D SCOTTI
- Andi Kleen
- ascotti@email.unc.edu
- Dan Kidger
- Darrell Shively
- Miller, Daniel J.
- William A. Mahaffey III