----- Forwarded by Kevin Gassiot/HOU/VES/VDGC on 04/15/2004 02:33 PM -----
ALBERTO D SCOTTI
Well, I found some references to compiling the code with the following options:

pgf77 -O3 -mp -fastsse -Mnontemporal

I have compiled it this way to see if this gives me the same results. I don't know enough about the PGI compiler to know whether the -Mnontemporal flag will do the same thing as what you did to the asm, but it sounds like it might....
I am running a full lmbench run right now, so I can't try the streams code yet, but it will be interesting to see if this reproduces the desired results :)
Kevin Gassiot
Advanced Systems Group
Visualization Systems Support

Veritas DGC
10300 Town Park Dr.
Houston, Texas 77072
832-351-8978
kevin_gassiot@veritasdgc.com
-----ascotti@email.unc.edu wrote: -----
From: ascotti@email.unc.edu
To: Kevin_Gassiot@veritasdgc.com
Date: 04/14/2004 07:45PM
cc: suse-amd64@suse.com
Subject: Re: [suse-amd64] Bandwidth on Quad Opteron
Well,
last night I looked into the asm code generated by PGI Fortran. Using the guidelines in the AMD software optimization publication for the Opteron (pub # 25112, p. 125), I tweaked the asm just a little (see diff below for the copy loop). Essentially, I replaced the prefetch and store operations with their non-temporal counterparts. To my surprise, the bandwidth jumped from 2 GB/s for the compiled code to 3.7 GB/s for the asm-tweaked code. I actually wonder whether the executable that Daniel sent me (which gave 2.8 GB/s, BTW) was tweaked in a similar fashion. 3.7 GB/s is close to the value that AMD reports for this kind of operation. Incidentally, a better result (15% according to AMD) could be obtained by grouping a bunch of prefetch operations to fill up L1 and then using non-temporal stores, rather than interleaving them.

So it seems that my bandwidth problem was, after all, a compiler issue. It seems that STREAM actually can tell you the maximum bandwidth a compiler can give you (which, a posteriori, is kind of obvious). Compilers necessarily need to be conservative, but it would be nice to have the possibility of using directives (or pragmas in C) to instruct the compiler to use, for example, non-temporal instructions on a given loop. Also, it is crucial that the arrays be aligned on a cache line, or else one gets a segfault. From this point of view, I am still not sure whether allocate() in PGI Fortran does that consistently...

If anybody is interested, I can e-mail the tweaked asm code for stream...
Cheers
Alberto
*** 401,413 ****
  # c(j) = a(j)
  # 30 CONTINUE
  	movapd	(%rsi,%rcx),%xmm0
! 	prefetcht0	64(%rsi,%rcx)
! 	movapd	%xmm0,(%rdx,%rcx)
  	movapd	16(%rsi,%rcx),%xmm1
! 	movapd	%xmm1,16(%rdx,%rcx)
  	addq	$32,%rcx
  	subl	$4,%eax
  	jg	.LB1_719
  # lineno: 179
  # t = mysecond() - t
  # c(n) = c(n) + t
--- 401,414 ----
  # c(j) = a(j)
  # 30 CONTINUE
  	movapd	(%rsi,%rcx),%xmm0
! 	prefetchnta	256(%rsi,%rcx)
! 	movntpd	%xmm0,(%rdx,%rcx)
  	movapd	16(%rsi,%rcx),%xmm1
! 	movntpd	%xmm1,16(%rdx,%rcx)
  	addq	$32,%rcx
  	subl	$4,%eax
  	jg	.LB1_719
+ 	sfence
  # lineno: 179
  # t = mysecond() - t
  # c(n) = c(n) + t
--On Wednesday, April 14, 2004 2:18 PM -0500 Kevin_Gassiot@veritasdgc.com wrote:
> Well, I guess I am in the same boat. I was using a pre-compiled
> executable provided by someone else. I am trying to compile the code
> myself to see if I can reproduce the results....
>
> Kevin
>
> Kevin Gassiot
> Advanced Systems Group
> Visualization Systems Support
>
> Veritas DGC
> 10300 Town Park Dr.
> Houston, Texas 77072
> 832-351-8978
> kevin_gassiot@veritasdgc.com
>
> From: ascotti@email.unc.edu
> To: Kevin_Gassiot@veritasdgc.com
> Date: 04/13/2004 06:26 PM
> cc: suse-amd64@suse.com
> Subject: Re: [suse-amd64] Bandwidth on Quad Opteron
>
> Interesting. Daniel Kidger sent me the statically linked STREAM executable
> that he used on his machine. With that, I get 2.8 GB/s, which is
> believable, considering the slower CPUs on my box. However, I cannot get
> the same numbers when I compile STREAM myself. What compiler/compiler
> switches did you use?
>
> Alberto
>
> --On Tuesday, April 13, 2004 4:14 PM -0500 Kevin_Gassiot@veritasdgc.com
> wrote:
>
>> I just got a quad cpu - 32 GB system in, and am running STREAM to confirm
>> bandwidth. I just got this machine loaded this afternoon, but I am seeing
>> numbers similar to what Daniel Kidger is seeing. After turning off the
>> node interleaving in the BIOS, I get the following numbers:
>>
>> single cpu - 3170 MB/sec
>> 4 cpus - 12790 MB/sec
>>
>> I am running the 2.4.21-193-smp SuSE 9.0 professional kernel.
>>
>> Hardware -
>>
>> Quartet Motherboard
>> Phoenix BIOS - 09/26/2003
>> 4 Opteron 846 cpus
>> 32 GB memory - 16 2 GB DIMMs
>>
>> BIOS -
>>
>> Dram Bank Interleave [AUTO]
>> Node Memory Interleave [Disabled]
>> ECC [Enabled]
>>
>> I will probably take the covers off tomorrow to check out the innards :)
>>
>> Kevin Gassiot
>> Advanced Systems Group
>> Visualization Systems Support
>>
>> Veritas DGC
>> 10300 Town Park Dr.
>> Houston, Texas 77072
>> 832-351-8978
>> kevin_gassiot@veritasdgc.com
>>
>> From: ascotti@email.unc.edu
>> To: suse-amd64@suse.com
>> Date: 04/10/2004 01:14 PM
>> Subject: [suse-amd64] Bandwidth on Quad Opteron
>>
>> Hi all,
>>
>> I thought I had a pretty good system in terms of bandwidth (2 GB/s on a
>> single node, using the STREAM benchmark), until Daniel Kidger, with a
>> considerable amount of understatement, pointed out that on their system
>> (with faster CPUs but otherwise the same memory type and motherboard),
>> they get a "somewhat" higher value, 3.5 GB/s. A 75% increase in bandwidth
>> is too large not to investigate the matter further. Plus, all of the
>> codes we run on this machine are bandwidth limited, so any improvement
>> translates into days shaved from runs. The system is as follows:
>>
>> Quartet Motherboard
>> BIOS upgraded to version PQTDX0-B (9/26/2003). The original BIOS gave
>> about 700 MB/s on 1 CPU!
>> 4 Opteron 840 CPUs (1.4 GHz)
>> SLES8 SP3
>> kernel 2.4.21-207-numa
>> 8 1 GB 333 MHz PC2700 DIMMs (2 DIMMs per node, Infineon brand)
>> STREAM compiled with pgf77 -fastsse -Mvect=prefetch
>>
>> The BIOS settings are:
>>
>> Dram Bank Interleave [AUTO]
>> Node Memory Interleave [Disabled]
>> ECC [Enabled]
>>
>> Enabling Node Memory Interleave does not change the measured bandwidth
>> significantly. I have also taken out 1 DIMM per node (which should ensure
>> that the memory works in standard single-channel mode). The bandwidth
>> drops to 1.5 GB/s. In other words, dual channel seems to give a 35% boost.
>> I have also tried a number of kernels (2.4.19, 2.4.24, 2.4.25, 2.6.4)
>> with similar results. Note that the system scales up well with NUMA
>> kernels (up to about 7.5 GB/s with 4 CPUs).
>>
>> My questions are:
>>
>> 1) Can/should I do better?
>> 2) If there is a problem, where should I look for a solution? Is it
>> likely to be a hardware problem (e.g. lousy DIMMs), a BIOS problem (as I
>> mentioned earlier, the original BIOS gave me 700 MB/s), an incorrect BIOS
>> setting (what else is out there, perhaps the SRAT table?), a kernel
>> problem, or a compiler issue?
>>
>> I'd rather have your input before pestering the vendor and Celestica.
>>
>> Thank you for your time
>>
>> Alberto Scotti
>>
>> --
>> Check the List-Unsubscribe header to unsubscribe
>> For additional commands, email: suse-amd64-help@suse.com
--
Alberto Scotti
Asst. Prof.
Dept. of Marine Sciences
CB 3300
University of North Carolina
Chapel Hill, NC 27599-3300
919-962-9454 (w)
919-962-1254 (f)