Hi,

Using the attached program (mcopy_time.c) we noticed a considerable performance difference between SLES 8 and SLES 9. A 2.2 GHz Opteron gives the following results (both runs executed on the same machine):

*) Compiled against SLES 8 glibc:

$ ./mcopy_time 2200 1000 1048576
Memory to memory copy rate = 2098.644531 MB / sec. Block size = 1048576

*) Compiled against SLES 9 glibc:

$ ./mcopy_time 2200 1000 1048576
Memory to memory copy rate = 1179.235596 MB / sec. Block size = 1048576

I found that the problem is caused by the AMD-specific patches contained in the source RPMs. If I take x86-64-opt-mem.diff from glibc-2.2.5-233 (SLES 8), apply it against glibc-2.3.3-98.38 (SLES 9), and at the same time do *NOT* apply glibc-2.3.3-amd64-string.diff (originally contained in glibc-2.3.3-98.38), performance under SLES 9 is identical to SLES 8: mcopy_time yields 2100 MB/sec. under SLES 9 with my patched glibc. I had to modify x86-64-opt-mem.diff a little so it would apply cleanly against glibc-2.3.3.

I also found that simply not applying glibc-2.3.3-amd64-string.diff (without applying any of the older SLES 8 patches) increases performance slightly, to about 1550 MB/sec.

Has anybody else witnessed this performance drop with SLES 9? Can anybody see a problem with my solution? Everything seems to be working fine, I'd just like to make sure I didn't miss anything.

Attached you can find the patches. I split the patch into 2 files for now, because it was easier to create them that way. Both are based on x86-64-opt-mem.diff (from SLES 8), modified to apply against glibc 2.3.3:

  x86_64-string-new.diff       adds new files that do not exist in glibc 2.3.3
  x86_64-string-modified.diff  modifies already existing files
Thanks and best regards,
-Markus

// Measure how fast we can copy memory

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>

/* timing function: read the CPU time stamp counter (x86-64, GCC inline asm) */
#define rdtscll(val) do { \
    unsigned int a,d; \
    asm volatile("rdtsc" : "=a" (a), "=d" (d)); \
    (val) = ((unsigned long)a) | (((unsigned long)d)<<32); \
} while(0)

int main(int argc, char *argv[])
{
    /* j, s_p and r_p are only used by the commented-out word copy loop */
    int cpu_rate, num_loops, block_size, block_size_lwords, i, j;
    unsigned char *send_block_p, *rcv_block_p;
    unsigned long start_time, end_time;
    float rate;
    unsigned long *s_p, *r_p;

    if (argc != 4) {
        fprintf(stderr,
                "Usage: %s <cpu clk rate (MHz)> <num. iterations> <copy block size>\n",
                argv[0]);
        return 1;    /* do not go on to read missing arguments */
    }

    cpu_rate = atoi(argv[1]);
    num_loops = atoi(argv[2]);
    block_size = atoi(argv[3]);

    /* round the block size down to a whole number of long words */
    block_size_lwords = block_size / sizeof(unsigned long);
    block_size = sizeof(unsigned long) * block_size_lwords;

    send_block_p = malloc(block_size);
    rcv_block_p = malloc(block_size);

    if ((send_block_p == NULL) || (rcv_block_p == NULL)) {
        fprintf(stderr, "Malloc failed to allocate block(s) of size %d.\n",
                block_size);
        return 1;    /* do not copy through a NULL pointer */
    }

    // start_time = clock();
    rdtscll(start_time);

    for (i = 0; i < num_loops; i++) {
        memcpy(rcv_block_p, send_block_p, block_size);

        // s_p = (unsigned long *) send_block_p;
        // r_p = (unsigned long *) rcv_block_p;
        //
        // for (j = 0 ; j < block_size_lwords; j++) {
        //     *(r_p++) = *(s_p++);
        // }
    }

    // end_time = clock();
    rdtscll(end_time);

    /* bytes per cycle * cycles per second (cpu_rate MHz * 1.0E6) / 1.0E6 = MB per second */
    rate = (float) (block_size) * (float) (num_loops) /
           ((float) (end_time - start_time)) *
           ((float) cpu_rate) * 1.0E6 / 1.0E6;

    fprintf(stdout,
            "Memory to memory copy rate = %f MBytes / sec. Block size = %d.\n",
            rate, block_size);

    free(send_block_p);
    free(rcv_block_p);
    return 0;
} /* end main() */
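
In case somebody wants to cross-check the numbers without relying on rdtsc and a hand-supplied CPU clock rate, below is a minimal sketch of the same measurement using clock_gettime(). It is not part of the attached mcopy_time.c. The helper name copy_rate_mb_per_sec, the memset() pre-touch of both buffers, and the GCC/Clang-specific compiler barrier are assumptions added for illustration, so absolute numbers may differ from the rdtsc-based program.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper, not part of mcopy_time.c: returns the memcpy()
 * rate in MB/sec (1 MB = 1.0E6 bytes, as in the original program),
 * or -1.0 if allocation or timing fails. */
double copy_rate_mb_per_sec(size_t block_size, int num_loops)
{
    struct timespec t0, t1;
    unsigned char *src, *dst;
    double seconds;
    int i;

    assert(block_size > 0 && num_loops > 0);

    src = malloc(block_size);
    dst = malloc(block_size);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Assumption: touch both buffers first so that page faults are
     * not counted in the timed loop. */
    memset(src, 0xA5, block_size);
    memset(dst, 0, block_size);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < num_loops; i++) {
        memcpy(dst, src, block_size);
        /* GCC/Clang-specific compiler barrier so the optimizer
         * cannot drop or merge the copies. */
        __asm__ __volatile__("" : : "r" (dst) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    seconds = (double)(t1.tv_sec - t0.tv_sec)
            + (double)(t1.tv_nsec - t0.tv_nsec) / 1.0E9;

    free(src);
    free(dst);

    if (seconds <= 0.0) {
        return -1.0;    /* timer too coarse for this run */
    }
    return (double)block_size * (double)num_loops / seconds / 1.0E6;
}
```

Calling copy_rate_mb_per_sec(1048576, 1000) corresponds to the "./mcopy_time 2200 1000 1048576" runs above, minus the clock rate argument. On older glibc versions clock_gettime() lives in librt, so link with -lrt there.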