Well, last night I looked into the asm code generated by PGI Fortran. Using the guidelines in the AMD software optimization publication for Opteron (pub #25112, p. 125), I tweaked the asm just a little (see the diff below for the copy loop). Essentially, I replaced the prefetch and store operations with their non-temporal counterparts. To my surprise, the bandwidth jumped from 2 GB/s for the compiled code to 3.7 GB/s for the asm-tweaked code. I actually wonder whether the executable that Daniel sent me (which gave 2.8 GB/s, BTW) was tweaked in a similar fashion. 3.7 GB/s is close to the value that AMD reports for this kind of operation. Incidentally, an even better result (15% according to AMD) could be obtained by grouping a bunch of prefetch operations to fill up the L1 cache and then using non-temporal stores, rather than interleaving them...

So it seems that my bandwidth problem was, after all, a compiler issue. It seems that STREAM can actually tell you the maximum bandwidth a compiler can give you (which, a posteriori, is kind of obvious). Compilers necessarily need to be conservative, but it would be nice to have the possibility of using directives (or pragmas in C) to instruct the compiler to use, for example, non-temporal instructions on a given loop. Also, it is crucial that the arrays be aligned on a cache line, or else one gets a segfault. From this point of view, I am still not sure whether allocate() in PGI Fortran does that consistently...

If anybody is interested, I can e-mail the tweaked asm code for STREAM...

Cheers

Alberto

*** 401,413 ****
  #  c(j) = a(j)
  # 30 CONTINUE
        movapd      (%rsi,%rcx),%xmm0
!       prefetcht0  64(%rsi,%rcx)
!       movapd      %xmm0,(%rdx,%rcx)
        movapd      16(%rsi,%rcx),%xmm1
!       movapd      %xmm1,16(%rdx,%rcx)
        addq        $32,%rcx
        subl        $4,%eax
        jg          .LB1_719
  # lineno: 179
  #  t = mysecond() - t
  #  c(n) = c(n) + t
--- 401,414 ----
  #  c(j) = a(j)
  # 30 CONTINUE
        movapd      (%rsi,%rcx),%xmm0
!       prefetchnta 256(%rsi,%rcx)
!       movntpd     %xmm0,(%rdx,%rcx)
        movapd      16(%rsi,%rcx),%xmm1
!       movntpd     %xmm1,16(%rdx,%rcx)
        addq        $32,%rcx
        subl        $4,%eax
        jg          .LB1_719
+       sfence
  # lineno: 179
  #  t = mysecond() - t
  #  c(n) = c(n) + t
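Short of such a compiler directive, the same instruction mix can be requested from C through the SSE2 intrinsics. Below is a minimal sketch of the copy loop along those lines; it is not the STREAM source, and the function name, array size, initialization, and prefetch distance are illustrative only. Only the instruction selection (movapd loads, prefetchnta, movntpd stores, a trailing sfence) mirrors the tweaked asm above, and the array length is assumed to be a multiple of 4 doubles.

#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>
#include <emmintrin.h>   /* SSE2: _mm_load_pd, _mm_stream_pd, _mm_prefetch, _mm_sfence */

/* Copy a[] into c[] with non-temporal stores; both pointers must be
   16-byte aligned, since movapd/movntpd fault on misaligned addresses. */
static void copy_nt(const double *a, double *c, size_t n)
{
    for (size_t j = 0; j < n; j += 4) {
        /* prefetchnta: pull a source cache line ahead of us with a
           non-temporal hint, since each element is read only once */
        _mm_prefetch((const char *)&a[j] + 256, _MM_HINT_NTA);
        __m128d x0 = _mm_load_pd(&a[j]);        /* movapd  */
        __m128d x1 = _mm_load_pd(&a[j + 2]);    /* movapd  */
        _mm_stream_pd(&c[j],     x0);           /* movntpd: store bypassing the cache */
        _mm_stream_pd(&c[j + 2], x1);           /* movntpd */
    }
    _mm_sfence();   /* make the streaming stores globally visible */
}

int main(void)
{
    const size_t n = 8 * 1000 * 1000;           /* 64 MB per array, illustrative */
    double *a, *c;

    /* Allocate on a cache-line (64-byte) boundary; this is the alignment
       requirement mentioned above, which allocate() may or may not honour. */
    if (posix_memalign((void **)&a, 64, n * sizeof(double)) ||
        posix_memalign((void **)&c, 64, n * sizeof(double)))
        return 1;

    for (size_t j = 0; j < n; j++)
        a[j] = 1.0;

    copy_nt(a, c, n);
    printf("c[0] = %f\n", c[0]);

    free(a);
    free(c);
    return 0;
}

The alignment caveat is the same as for the asm: movapd and movntpd raise a fault (seen as a segfault) on addresses that are not 16-byte aligned, hence the posix_memalign calls.

--On Wednesday, April 14, 2004 2:18 PM -0500 Kevin_Gassiot@veritasdgc.com wrote: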
Well, I guess I am in the same boat. I was using a pre-compiled executable provided by someone else. I am trying to compile the code myself to see if I can reproduce the results....
Kevin
Kevin Gassiot Advanced Systems Group Visualization Systems Support
Veritas DGC 10300 Town Park Dr. Houston, Texas 77072 832-351-8978 kevin_gassiot@veritasdgc.com
On 04/13/2004 06:26 PM, ascotti@email.unc.edu wrote (To: Kevin_Gassiot@veritasdgc.com, cc: suse-amd64@suse.com, Subject: Re: [suse-amd64] Bandwidth on Quad Opteron):
Interesting. Daniel Kidger sent me the statically linked STREAM executable that he used on his machine. With that, I get 2.8 GB/s, which is believable, considering the slower CPUs on my box. However, I cannot get the same numbers when I compile STREAM myself. What compiler/compiler switches did you use?
Alberto
--On Tuesday, April 13, 2004 4:14 PM -0500 Kevin_Gassiot@veritasdgc.com wrote:
I just got a quad-CPU, 32 GB system in, and am running STREAM to confirm bandwidth. I just got this machine loaded this afternoon, but I am seeing numbers similar to what Daniel Kidger is seeing. After turning off the node interleaving in the BIOS, I get the following numbers:
single CPU -  3170 MB/sec
4 CPUs     - 12790 MB/sec
I am running the 2.4.21-193-smp SuSE 9.0 professional kernel.
Hardware -
Quartet Motherboard
Phoenix BIOS - 09/26/2003
4 Opteron 846 CPUs
32 GB memory - 16 2 GB DIMMs
BIOS -
Dram Bank Interleave    [AUTO]
Node Memory Interleave  [Disabled]
ECC                     [Enabled]
I will probably take the covers off tomorrow to check out the innards :)
On 04/10/2004 01:14 PM, ascotti@email.unc.edu wrote (To: suse-amd64@suse.com, Subject: [suse-amd64] Bandwidth on Quad Opteron):
Hi all,
I thought I had a pretty good system in terms of bandwidth (2 GB/s on a single node, using the STREAM benchmark), until Daniel Kidger, with a considerable amount of understatement, pointed out that on their system (with faster CPUs but otherwise the same memory type and motherboard), they get a "somewhat" higher value, 3.5 GB/s. A 75% increase in bandwidth is too large not to investigate the matter further. Plus, all of the codes we run on this machine are bandwidth limited, so any improvement translates into days shaved from runs. The system is as follows:
Quartet motherboard, BIOS upgraded to version PQTDX0-B (9/26/2003); the original BIOS gave about 700 MB/s on 1 CPU!
4 Opteron 840 CPUs (1.4 GHz)
SLES8 SP3, kernel 2.4.21-207-numa
8 1 GB 333 MHz PC2700 DIMMs (2 DIMMs per node, INFINEON brand)
STREAM compiled with pgf77 -fastsse -Mvect=prefetch
The BIOS settings are:
Dram Bank Interleave    [AUTO]
Node Memory Interleave  [Disabled]
ECC                     [Enabled]
Enabling Node Memory Interleave does not change the measured bandwidth significantly. I have also taken out 1 DIMM per node (which should ensure that the memory works in standard single-channel mode). The bandwidth drops to 1.5 GB/s. In other words, dual channel seems to give a 35% boost. I have also tried a number of kernels (2.4.19, 2.4.24, 2.4.25, 2.6.4) with similar results. Note that the system scales up well with NUMA kernels (up to about 7.5 GB/s with 4 CPUs).
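For concreteness, a figure like the 2 GB/s above is just the copy-loop traffic (8 bytes read plus 8 bytes written per element) divided by the measured time. The following is only a back-of-the-envelope sketch of that calculation, not the actual STREAM source; the array size, the plain malloc, and the absence of any unrolling, prefetching or parallelism are simplifications of mine.

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

/* Wall-clock timer in the spirit of STREAM's mysecond() */
static double mysecond(void)
{
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double)tp.tv_sec + 1.0e-6 * (double)tp.tv_usec;
}

int main(void)
{
    const size_t n = 20 * 1000 * 1000;      /* ~160 MB per array, well beyond cache */
    double *a = malloc(n * sizeof(double));
    double *c = malloc(n * sizeof(double));
    if (!a || !c)
        return 1;

    for (size_t j = 0; j < n; j++)
        a[j] = 1.0;

    double t = mysecond();
    for (size_t j = 0; j < n; j++)          /* the copy kernel: c(j) = a(j) */
        c[j] = a[j];
    t = mysecond() - t;

    /* 16 bytes of memory traffic per element: one double read, one written */
    printf("Copy: %.1f MB/s (c[0] = %.0f)\n",
           1.0e-6 * 2.0 * sizeof(double) * (double)n / t, c[0]);

    free(a);
    free(c);
    return 0;
}

Of course, the number one gets is only as good as the code the compiler emits for that loop, which is part of what the compiler question below is about.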
My questions are:
1) Can/should I do better?
2) If there is a problem, where should I look for a solution? Is it likely to be a hardware problem (e.g. lousy DIMMs), a BIOS problem (as I mentioned earlier, the original BIOS gave me 700 MB/s), an incorrect BIOS setting (what else is out there, perhaps the SRAT table?), a kernel problem, or a compiler issue?
I'd rather have your input before pestering the vendor and Celestica.
Thank you for your time
Alberto Scotti
-- Check the List-Unsubscribe header to unsubscribe For additional commands, email: suse-amd64-help@suse.com