On Thu, 8 Apr 2004 12:05:10 +0100
Dan Kidger
This is because in most cases there is an application running on all CPUs (certainly for us scientific users). Each application can typically saturate the local memory bus. If applications on the other CPUs are also memory-bandwidth hungry, then all applications slow each other down. This is of course the bane of all those dual-Xeon HPC servers, where one copy of a finite element program takes say X minutes, but with two running simultaneously (one per CPU) each now takes 1.4 X minutes to run. Under a NUMA kernel on Opteron they do not slow each other down at all.
The local policy used by the NUMA kernel is not always optimal though - that is why it is sometimes faster to configure node cache line interleaving in the BIOS. This happens for example when the working set of all CPUs exceeds a single node and the workload prefers bandwidth over latency. Or when you only have a program running on a single CPU, but it needs all the bandwidth it can get; then interleaving is the best policy, because you combine the bandwidth of all available memory controllers. For most workloads local affinity seems to be pretty good though, so it's a good default.

The 9.1/SLES9 kernel will have a new NUMA API that allows NUMA policies [local affinity, binding to a specific node, page interleaving] to be configured fine-grained per process and per memory mapping, without rebooting.

-Andi
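For illustration, here is a minimal sketch of what per-mapping policy control looks like through the mbind() call of that API; the node mask and buffer size are made-up values, and you need the libnuma headers (numaif.h) and -lnuma to build it:

    /* Sketch: interleave one mapping across nodes 0 and 1, leave the
     * rest of the process on the default local-affinity policy. */
    #define _GNU_SOURCE
    #include <numaif.h>      /* mbind(), MPOL_* constants */
    #include <sys/mman.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        size_t len = 64 * 1024 * 1024;          /* 64 MB scratch buffer */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        /* Bits 0 and 1 set = nodes 0 and 1.  Interleaving trades some
         * latency for the combined bandwidth of both memory controllers. */
        unsigned long nodemask = 0x3;
        if (mbind(buf, len, MPOL_INTERLEAVE, &nodemask,
                  sizeof(nodemask) * 8, 0) != 0)
            perror("mbind");

        /* Pages are actually placed when they are first touched, so the
         * policy must be set before the buffer is used. */
        memset(buf, 0, len);
        munmap(buf, len);
        return 0;
    }

A process-wide policy can be set the same way with set_mempolicy(), and the numactl command wraps the same calls for unmodified programs.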