----- Forwarded by Kevin Gassiot/HOU/VES/VDGC on 04/15/2004 02:33 PM -----
ALBERTO D SCOTTI
Well, I found some references to compiling the code with the following options:

pgf77 -O3 -mp -fastsse -Mnontemporal

I have compiled it this way to see if this gives me the same results. I don't know enough about the PGI compiler to know whether the -Mnontemporal flag will do the same thing as what you did to the asm, but it sounds like it might....
I am running a full lmbench run right now, so I can't try the streams code yet, but it will be interesting to see if this reproduces the desired results :)
Kevin Gassiot
Advanced Systems Group
Visualization Systems Support

Veritas DGC
10300 Town Park Dr.
Houston, Texas 77072
832-351-8978
kevin_gassiot@veritasdgc.com
-----ascotti@email.unc.edu wrote: -----
From: ascotti@email.unc.edu
To: Kevin_Gassiot@veritasdgc.com
Date: 04/14/2004 07:45PM
cc: suse-amd64@suse.com
Subject: Re: [suse-amd64] Bandwidth on Quad Opteron
Well,
last night I looked into the asm code generated by PGI Fortran. Using the guidelines in the AMD software optimization publication for the Opteron (pub # 25112, p. 125), I tweaked the asm just a little (see diff below for the copy loop). Essentially, I replaced the prefetch and store operations with their non-temporal counterparts. To my surprise, the bandwidth jumped from 2 GB/s for the compiled code to 3.7 GB/s for the asm-tweaked code. I actually wonder whether the executable that Daniel sent me (which gave 2.8 GB/s, BTW) was tweaked in a similar fashion. 3.7 GB/s is close to the value that AMD reports for this kind of operation. Incidentally, a better result (15% according to AMD) could be obtained by grouping a bunch of prefetch operations to fill up L1 and then using non-temporal stores, rather than interleaving them.

So it seems that my bandwidth problem was, after all, a compiler issue. It seems that STREAM actually can tell you the maximum bandwidth a compiler can give you (which, a posteriori, is kind of obvious). Compilers necessarily need to be conservative, but it would be nice to have the possibility of using directives (or pragmas in C) to instruct the compiler to use, for example, non-temporal instructions on a given loop. Also, it is crucial that the arrays be aligned on a cache line, or else one gets a segfault. From this point of view, I am still not sure whether allocate() in PGI Fortran does that consistently...

If anybody is interested, I can e-mail the tweaked asm code for stream...
Cheers
Alberto
*** 401,413 ****
  # c(j) = a(j)
  # 30 CONTINUE
  	movapd	(%rsi,%rcx),%xmm0
! 	prefetcht0	64(%rsi,%rcx)
! 	movapd	%xmm0,(%rdx,%rcx)
  	movapd	16(%rsi,%rcx),%xmm1
! 	movapd	%xmm1,16(%rdx,%rcx)
  	addq	$32,%rcx
  	subl	$4,%eax
  	jg	.LB1_719
  # lineno: 179
  # t = mysecond() - t
  # c(n) = c(n) + t
--- 401,414 ----
  # c(j) = a(j)
  # 30 CONTINUE
  	movapd	(%rsi,%rcx),%xmm0
! 	prefetchnta	256(%rsi,%rcx)
! 	movntpd	%xmm0,(%rdx,%rcx)
  	movapd	16(%rsi,%rcx),%xmm1
! 	movntpd	%xmm1,16(%rdx,%rcx)
  	addq	$32,%rcx
  	subl	$4,%eax
  	jg	.LB1_719
+ 	sfence
  # lineno: 179
  # t = mysecond() - t
  # c(n) = c(n) + t
--On Wednesday, April 14, 2004 2:18 PM -0500 Kevin_Gassiot@veritasdgc.com wrote:
> Well, I guess I am in the same boat. I was using a pre-compiled
> executable provided by someone else. I am trying to compile the code
> myself to see if I can reproduce the results....
>
> Kevin
>
> Kevin Gassiot
> Advanced Systems Group
> Visualization Systems Support
>
> Veritas DGC
> 10300 Town Park Dr.
> Houston, Texas 77072
> 832-351-8978
> kevin_gassiot@veritasdgc.com
>
> From: ascotti@email.unc.edu
> To: Kevin_Gassiot@veritasdgc.com
> Date: 04/13/2004 06:26 PM
> cc: suse-amd64@suse.com
> Subject: Re: [suse-amd64] Bandwidth on Quad Opteron
>
> Interesting. Daniel Kidger sent me the statically linked STREAM executable
> that he used on his machine. With that, I get 2.8 GB/s, which is
> believable, considering the slower CPUs on my box. However, I cannot get
> the same numbers when I compile STREAM myself. What compiler/compiler
> switches did you use?
>
> Alberto
>
> --On Tuesday, April 13, 2004 4:14 PM -0500 Kevin_Gassiot@veritasdgc.com
> wrote:
>
>> I just got a quad cpu - 32 GB system in, and am running STREAM to confirm
>> bandwidth. I just got this machine loaded this afternoon, but I am seeing
>> numbers similar to what Daniel Kidger is seeing. After turning off the
>> node interleaving in the BIOS, I get the following numbers:
>>
>> single cpu - 3170 MB/sec
>> 4 cpus - 12790 MB/sec
>>
>> I am running the 2.4.21-193-smp SuSE 9.0 professional kernel.
>>
>> Hardware -
>>
>> Quartet Motherboard
>> Phoenix BIOS - 09/26/2003
>> 4 Opteron 846 cpus
>> 32 GB memory - 16 2 GB DIMMs
>>
>> BIOS -
>>
>> Dram Bank Interleave [AUTO]
>> Node Memory Interleave [Disabled]
>> ECC [Enabled]
>>
>> I will probably take the covers off tomorrow to check out the innards :)
>>
>> Kevin Gassiot
>> Advanced Systems Group
>> Visualization Systems Support
>>
>> Veritas DGC
>> 10300 Town Park Dr.
>> Houston, Texas 77072
>> 832-351-8978
>> kevin_gassiot@veritasdgc.com
>>
>> From: ascotti@email.unc.edu
>> To: suse-amd64@suse.com
>> Date: 04/10/2004 01:14 PM
>> Subject: [suse-amd64] Bandwidth on Quad Opteron
>>
>> Hi all,
>>
>> I thought I had a pretty good system in terms of bandwidth (2 GB/s on a
>> single node, using the STREAM benchmark), until Daniel Kidger, with a
>> considerable amount of understatement, pointed out that on their system
>> (with faster CPUs but otherwise the same memory type and motherboard),
>> they get a "somewhat" higher value, 3.5 GB/s. A 75% increase in bandwidth
>> is too large not to investigate the matter further. Plus, all of the
>> codes we run on this machine are bandwidth limited, so any improvement
>> translates into days shaved from runs. The system is as follows:
>>
>> Quartet Motherboard
>> BIOS upgraded to version PQTDX0-B (9/26/2003). The original BIOS gave
>> about 700 MB/s on 1 CPU!
>> 4 Opteron 840 CPUs (1.4 GHz)
>> SLES8 SP3
>> kernel 2.4.21-207-numa
>> 8 1 GB 333 MHz PC2700 DIMMs (2 DIMMs per node, Infineon brand)
>> STREAM compiled with pgf77 -fastsse -Mvect=prefetch
>>
>> The BIOS settings are:
>>
>> Dram Bank Interleave [AUTO]
>> Node Memory Interleave [Disabled]
>> ECC [Enabled]
>>
>> Enabling Node Memory Interleave does not change the measured bandwidth
>> significantly. I have also taken out 1 DIMM per node (which should ensure
>> that the memory works in standard single-channel mode). The bandwidth
>> drops to 1.5 GB/s. In other words, dual channel seems to give a 35% boost.
>> I have also tried a number of kernels (2.4.19, 2.4.24, 2.4.25, 2.6.4)
>> with similar results. Note that the system scales up well with NUMA
>> kernels (up to about 7.5 GB/s with 4 CPUs).
>>
>> My questions are:
>>
>> 1) Can/should I do better?
>> 2) If there is a problem, where should I look for a solution? Is it
>> likely to be a hardware problem (e.g. lousy DIMMs), a BIOS problem (as I
>> mentioned earlier, the original BIOS gave me 700 MB/s), an incorrect BIOS
>> setting (what else is out there, perhaps the SRAT table?), a kernel
>> problem, or a compiler issue?
>>
>> I'd rather have your input before pestering the vendor and Celestica.
>>
>> Thank you for your time
>>
>> Alberto Scotti
>>
>> --
>> Check the List-Unsubscribe header to unsubscribe
>> For additional commands, email: suse-amd64-help@suse.com
--
Alberto Scotti
Asst. Prof.
Dept. of Marine Sciences
CB 3300
University of North Carolina
Chapel Hill, NC 27599-3300
919-962-9454 (w)
919-962-1254 (f)