Well, last night I looked into the asm code generated by PGI Fortran. Using the guidelines in the AMD software optimization publication for Opteron (pub #25112, p. 125), I tweaked the asm just a little (see the diff below for the copy loop). Essentially, I replaced the prefetch and store operations with their non-temporal counterparts. To my surprise, the bandwidth jumped from 2 GB/s for the compiled code to 3.7 GB/s for the asm-tweaked code. I actually wonder whether the executable that Daniel sent me (which gave 2.8 GB/s, BTW) was tweaked in a similar fashion. 3.7 GB/s is close to the value that AMD reports for this kind of operation. Incidentally, an even better result (15% according to AMD) could be obtained by grouping a bunch of prefetch operations to fill up the L1 cache and then using non-temporal stores, rather than interleaving them...

So it seems that my bandwidth problem was, after all, a compiler issue. It seems that STREAM can actually tell you the maximum bandwidth a compiler can give you (which, a posteriori, is kind of obvious). Compilers necessarily need to be conservative, but it would be nice to have the possibility of using directives (or pragmas in C) to instruct the compiler to use, for example, non-temporal instructions on a given loop. Also, it is crucial that the arrays be aligned on a cache line, or else one gets a segfault. From this point of view, I am still not sure whether allocate() in PGI Fortran does that consistently...

If anybody is interested, I can e-mail the tweaked asm code for STREAM...

Cheers

Alberto

*** 401,413 ****
  #  c(j) = a(j)
  # 30 CONTINUE
        movapd      (%rsi,%rcx),%xmm0
!       prefetcht0  64(%rsi,%rcx)
!       movapd      %xmm0,(%rdx,%rcx)
        movapd      16(%rsi,%rcx),%xmm1
!       movapd      %xmm1,16(%rdx,%rcx)
        addq        $32,%rcx
        subl        $4,%eax
        jg          .LB1_719
  # lineno: 179
  #  t = mysecond() - t
  #  c(n) = c(n) + t
--- 401,414 ----
  #  c(j) = a(j)
  # 30 CONTINUE
        movapd      (%rsi,%rcx),%xmm0
!       prefetchnta 256(%rsi,%rcx)
!       movntpd     %xmm0,(%rdx,%rcx)
        movapd      16(%rsi,%rcx),%xmm1
!       movntpd     %xmm1,16(%rdx,%rcx)
        addq        $32,%rcx
        subl        $4,%eax
        jg          .LB1_719
+       sfence
  # lineno: 179
  #  t = mysecond() - t
  #  c(n) = c(n) + t
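Short of such a compiler directive, the same instruction mix can be requested from C through the SSE2 intrinsics. Below is a minimal sketch of the copy loop along those lines; it is not the STREAM source, and the function name, array size, initialization, and prefetch distance are illustrative only. Only the instruction selection (movapd loads, prefetchnta, movntpd stores, a trailing sfence) mirrors the tweaked asm above, and the array length is assumed to be a multiple of 4 doubles.

#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>
#include <emmintrin.h>   /* SSE2: _mm_load_pd, _mm_stream_pd, _mm_prefetch, _mm_sfence */

/* Copy a[] into c[] with non-temporal stores; both pointers must be
   16-byte aligned, since movapd/movntpd fault on misaligned addresses. */
static void copy_nt(const double *a, double *c, size_t n)
{
    for (size_t j = 0; j < n; j += 4) {
        /* prefetchnta: pull a source cache line ahead of us with a
           non-temporal hint, since each element is read only once */
        _mm_prefetch((const char *)&a[j] + 256, _MM_HINT_NTA);
        __m128d x0 = _mm_load_pd(&a[j]);        /* movapd  */
        __m128d x1 = _mm_load_pd(&a[j + 2]);    /* movapd  */
        _mm_stream_pd(&c[j],     x0);           /* movntpd: store bypassing the cache */
        _mm_stream_pd(&c[j + 2], x1);           /* movntpd */
    }
    _mm_sfence();   /* make the streaming stores globally visible */
}

int main(void)
{
    const size_t n = 8 * 1000 * 1000;           /* 64 MB per array, illustrative */
    double *a, *c;

    /* Allocate on a cache-line (64-byte) boundary; this is the alignment
       requirement mentioned above, which allocate() may or may not honour. */
    if (posix_memalign((void **)&a, 64, n * sizeof(double)) ||
        posix_memalign((void **)&c, 64, n * sizeof(double)))
        return 1;

    for (size_t j = 0; j < n; j++)
        a[j] = 1.0;

    copy_nt(a, c, n);
    printf("c[0] = %f\n", c[0]);

    free(a);
    free(c);
    return 0;
}

The alignment caveat is the same as for the asm: movapd and movntpd raise a fault (seen as a segfault) on addresses that are not 16-byte aligned, hence the posix_memalign calls.

--On Wednesday, April 14, 2004 2:18 PM -0500 Kevin_Gassiot@veritasdgc.com wrote: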
Well, I guess I am in the same boat. I was using a pre-compiled executable provided by someone else. I am trying to compile the code myself to see if I can reproduce the results....
Kevin
Kevin Gassiot Advanced Systems Group Visualization Systems Support
Veritas DGC 10300 Town Park Dr. Houston, Texas 77072 832-351-8978 kevin_gassiot@veritasdgc.com
On 04/13/2004 06:26 PM, ascotti@email.unc.edu wrote (To: Kevin_Gassiot@veritasdgc.com, cc: suse-amd64@suse.com, Subject: Re: [suse-amd64] Bandwidth on Quad Opteron):
Interesting. Daniel Kidger sent me the statically linked STREAM executable that he used on his machine. With that, I get 2.8 GB/s, which is believable, considering the slower CPUs on my box. However, I cannot get the same numbers when I compile STREAM myself. What compiler/compiler switches did you use?
Alberto
--On Tuesday, April 13, 2004 4:14 PM -0500 Kevin_Gassiot@veritasdgc.com wrote:
I just got a quad-CPU, 32 GB system in, and am running STREAM to confirm bandwidth. I just got this machine loaded this afternoon, but I am seeing numbers similar to what Daniel Kidger is seeing. After turning off the node interleaving in the BIOS, I get the following numbers:
single CPU -  3170 MB/sec
4 CPUs     - 12790 MB/sec
I am running the 2.4.21-193-smp SuSE 9.0 professional kernel.
Hardware -
Quartet Motherboard
Phoenix BIOS - 09/26/2003
4 Opteron 846 CPUs
32 GB memory - 16 2 GB DIMMs
BIOS -
Dram Bank Interleave    [AUTO]
Node Memory Interleave  [Disabled]
ECC                     [Enabled]
I will probably take the covers off tomorrow to check out the innards :)
On 04/10/2004 01:14 PM, ascotti@email.unc.edu wrote (To: suse-amd64@suse.com, Subject: [suse-amd64] Bandwidth on Quad Opteron):
Hi all,
I thought I had a pretty good system in terms of bandwidth (2 GB/s on a single node, using the STREAM benchmark), until Daniel Kidger, with a considerable amount of understatement, pointed out that on their system (with faster CPUs but otherwise the same memory type and motherboard), they get a "somewhat" higher value, 3.5 GB/s. A 75% increase in bandwidth is too large not to investigate the matter further. Plus, all of the codes we run on this machine are bandwidth limited, so any improvement translates into days shaved from runs. The system is as follows:
Quartet motherboard, BIOS upgraded to version PQTDX0-B (9/26/2003); the original BIOS gave about 700 MB/s on 1 CPU!
4 Opteron 840 CPUs (1.4 GHz)
SLES8 SP3, kernel 2.4.21-207-numa
8 1 GB 333 MHz PC2700 DIMMs (2 DIMMs per node, INFINEON brand)
STREAM compiled with pgf77 -fastsse -Mvect=prefetch
The BIOS settings are:
Dram Bank Interleave    [AUTO]
Node Memory Interleave  [Disabled]
ECC                     [Enabled]
Enabling Node Memory Interleave does not change the measured bandwidth significantly. I have also taken out 1 DIMM per node (which should ensure that the memory works in standard single-channel mode). The bandwidth drops to 1.5 GB/s. In other words, dual channel seems to give a 35% boost. I have also tried a number of kernels (2.4.19, 2.4.24, 2.4.25, 2.6.4) with similar results. Note that the system scales up well with NUMA kernels (up to about 7.5 GB/s with 4 CPUs).
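For concreteness, a figure like the 2 GB/s above is just the copy-loop traffic (8 bytes read plus 8 bytes written per element) divided by the measured time. The following is only a back-of-the-envelope sketch of that calculation, not the actual STREAM source; the array size, the plain malloc, and the absence of any unrolling, prefetching or parallelism are simplifications of mine.

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

/* Wall-clock timer in the spirit of STREAM's mysecond() */
static double mysecond(void)
{
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double)tp.tv_sec + 1.0e-6 * (double)tp.tv_usec;
}

int main(void)
{
    const size_t n = 20 * 1000 * 1000;      /* ~160 MB per array, well beyond cache */
    double *a = malloc(n * sizeof(double));
    double *c = malloc(n * sizeof(double));
    if (!a || !c)
        return 1;

    for (size_t j = 0; j < n; j++)
        a[j] = 1.0;

    double t = mysecond();
    for (size_t j = 0; j < n; j++)          /* the copy kernel: c(j) = a(j) */
        c[j] = a[j];
    t = mysecond() - t;

    /* 16 bytes of memory traffic per element: one double read, one written */
    printf("Copy: %.1f MB/s (c[0] = %.0f)\n",
           1.0e-6 * 2.0 * sizeof(double) * (double)n / t, c[0]);

    free(a);
    free(c);
    return 0;
}

Of course, the number one gets is only as good as the code the compiler emits for that loop, which is part of what the compiler question below is about.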
My questions are:
1) Can/should I do better?
2) If there is a problem, where should I look for a solution? Is it likely to be a hardware problem (e.g. lousy DIMMs), a BIOS problem (as I mentioned earlier, the original BIOS gave me 700 MB/s), an incorrect BIOS setting (what else is out there, perhaps the SRAT table?), a kernel problem, or a compiler issue?
I'd rather have your input before pestering the vendor and Celestica.
Thank you for your time
Alberto Scotti
-- Check the List-Unsubscribe header to unsubscribe For additional commands, email: suse-amd64-help@suse.com