[opensuse-ppc] Fwd: glibc segfault analysis
Hey guys,

I've tried to run openSUSE 12.3 on a Freescale e5500 system and it immediately segfaults on every binary. Turns out the root cause is that glibc is configured for power4 compatibility, which hardcodes the cache line size to 128 bytes, while e5500 has a 64 byte cache line size.

There are 2 things I think we should do here:

- compile the non-target-optimized glibc without --with-cpu=power4. We already have accelerated versions for power5 and above, so I don't think it's necessary to optimize the base variant for anything but a generic target
- fix glibc to work on non-128byte cache line size even when --with-cpu=power4 is specified. I doubt there really is any non-negligible performance difference for the memset() case where dcbz gets used. But if there is, why don't we just binary patch the cache line size into memset()'s li instruction and always be fast?

Alex

Begin forwarded message:
From: Yang James-RA8135 <RA8135@freescale.com>
Subject: RE: glibc segfault analysis
Date: 12. September 2013 11:44:20 GMT-05:00
To: Yang James-RA8135 <RA8135@freescale.com>, Graf Alexander-B36701 <B36701@freescale.com>, Yoder Stuart-B08248 <B08248@freescale.com>
Cc: Wienskoski Edmar-RA8797 <RA8797@freescale.com>, Newton Peter-RA3823 <RA3823@freescale.com>
I talked to Alex this morning and he confirms that the glibc is power4-targeted.
Alex also tried hexediting the libc.so, but that still resulted in a segfault.
I took a look at doing it, and I have found that it is not just the libc.so's memset that needs to be patched, but that ld.so itself has its own copy of memset.
Once both are patched to force 64-byte dcbz, the program runs without segfault.
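For anyone wanting to repeat the hexedit, here is a rough sketch of the patch itself. It relies only on the standard PowerPC encoding of "li rD,imm" as "addi rD,0,imm" (primary opcode 14, rA = 0, 16-bit immediate in the low half). The helper name patch_li_immediate is made up for illustration, and the word 0x39000080 is what "addi r8,r0,0x0080" from the imem listing encodes to under that scheme; it is derived here, not read out of the binary. Finding the right instruction in libc.so and ld.so is left out.

```c
#include <assert.h>
#include <stdint.h>

/* PowerPC "li rD,imm" is "addi rD,0,imm": primary opcode 14 in the top
   six bits, then rD (5 bits), rA = 0 (5 bits), and a signed 16-bit
   immediate in the low half of the word. */
#define PPC_OPCODE(insn) ((insn) >> 26)
#define PPC_RA(insn)     (((insn) >> 16) & 0x1fu)
#define PPC_OP_ADDI      14u

/* Hypothetical helper: return insn with its immediate replaced by
   new_line_size, or insn unchanged if it is not an "li". */
static uint32_t patch_li_immediate(uint32_t insn, uint16_t new_line_size)
{
    /* The immediate is signed, so keep the new value positive. */
    assert(new_line_size < 0x8000u);

    if (PPC_OPCODE(insn) != PPC_OP_ADDI || PPC_RA(insn) != 0)
        return insn;
    return (insn & 0xffff0000u) | new_line_size;
}
```

With this, patch_li_immediate(0x39000080, 64) gives 0x39000040, i.e. "li r8,64". Anything that is not an li, such as the dcbz word 7C0037EC, comes back untouched.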
-----Original Message-----
From: Yang James-RA8135
Sent: Thursday, September 12, 2013 3:25 AM
To: Graf Alexander-B36701; Yoder Stuart-B08248
Cc: Wienskoski Edmar-RA8797; Newton Peter-RA3823
Subject: glibc segfault analysis
OK, I think I have figured out the cause of the problem with the segfaulting in SUSE glibc. Edmar, you probably know of this already (and maybe it is also the same reason as the other email thread from Peter's problem from last year).
It boils down to this: the powerpc64 memset in SUSE's libc-2.17.so is one of the powerX versions rather than the generic powerpc64 one. The powerX versions, such as power4, hard-code a 128-byte cache line size instead of reading the global variable that holds the detected cache line size, even though that variable is set to 64 bytes (which is what it is on e5500/e6500). Here are links to the code in glibc top of tree (assuming SUSE sets up glibc to target power4):
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/powerpc/powerpc64/m...
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/powerpc/powerpc64/p...
relevant diff -u:
@@ -160,22 +155,15 @@
 /* If the remaining length is less the 32 bytes, don't bother getting
    the cache line size.  */
 	beq	L(medium)
-	ld	rCLS,.LC0@toc(r2)
-	lwz	rCLS,0(rCLS)
-/* If the cache line size was not set just goto to L(nondcbz) which is
-   safe for any cache line size.  */
-	cmpldi	cr1,rCLS,0
-	beq	cr1,L(nondcbz)
-
+	li	rCLS,128  /* cache line size is 128 */
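To spell out what the diff changes, here is a small C model of the two code paths. It is only a sketch of the decision logic: the function names are invented, and a return value of 0 stands for "take the L(nondcbz) path".

```c
#include <assert.h>
#include <stddef.h>

/* Generic powerpc64 path (the "-" lines): use whatever cache line size
   was detected, and return 0 (meaning "no dcbz, plain stores") if it
   was never set. */
static size_t generic_dcbz_stride(size_t detected_line_size)
{
    /* A cache line size is zero (unset) or a power of two. */
    assert((detected_line_size & (detected_line_size - 1)) == 0);
    return detected_line_size;
}

/* power4 path (the "+" line): the detected value is ignored. */
static size_t power4_dcbz_stride(size_t detected_line_size)
{
    (void) detected_line_size;
    return 128;
}
```

On e5500 the detected size is 64, so the generic path strides by 64 while the power4 path still strides by 128.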
Here is the imem showing the memset loop (dcbz)
instaddr          branch target     #times instruction
000000002df7d8a0< 000000002df7d960>  4 beq- 0x00c0             0.00% 0.00% 0.00% 100.00%
000000002df7d8a4                     4 addi r8,r0,0x0080       // XXX: here is the hard-coded 128-byte stride
000000002df7d8a8                     4 dcbt r0,r6
000000002df7d8ac<                    9 cmpldi cr1,r5,0x0020
000000002df7d8b0                     9 andi. r0,r6,0x007f
000000002df7d8b4  000000002df7d8f0>  9 blt- cr1,0x003c         0.00% 12.50% 25.00% 62.50%
000000002df7d8b8  000000002df7d8d8>  7 beq- 0x0020             16.67% 16.67% 16.67% 50.00%
000000002df7d8bc                     5 addi r6,r6,0x0020
000000002df7d8c0                     5 addi r5,r5,0xffe0
000000002df7d8c4                     5 std r4,-32(r6)
000000002df7d8c8                     5 std r4,-24(r6)
000000002df7d8cc                     5 std r4,-16(r6)
000000002df7d8d0                     5 std r4,-8(r6)
000000002df7d8d4  000000002df7d8ac>  5 b 0xffffffd8            100.00% 0.00% 0.00% 0.00%
000000002df7d8d8<                   30 cmpld cr1,r5,r8
000000002df7d8dc  000000002df7d8f0> 30 blt- cr1,0x0014         0.00% 3.45% 6.90% 89.66%
000000002df7d8e0                    28 dcbz r0,r6
000000002df7d8e4                    28 subf r5,r8,r5
000000002df7d8e8                    28 add r6,r6,r8            // XXX: here is the dcbz stride increment of 128 bytes
000000002df7d8ec  000000002df7d8d8> 28 b 0xffffffec            100.00% 0.00% 0.00% 0.00%
000000002df7d8f0<                    4 rlwinm. r7,r5,0,0,26
000000002df7d8f4  000000002df7d838>  4 b 0xffffff44            100.00% 0.00% 0.00% 0.00%
....
And the dcbz EA (first column of 64-bit hex) at 128-byte stride from tte_dump:
ra8135@squirrel:/mnt$ tte_dump on-squirrel-217.tte | grep dcbz
00000fffb7fe9900 14757 <000000002df7d8e0-7C0037EC> * dcbz r0,r6
00000fffb7fe9980 14763 <000000002df7d8e0-7C0037EC> * dcbz r0,r6
00000fffb7fe9a00 14769 <000000002df7d8e0-7C0037EC> * dcbz r0,r6
00000fffb7fe9a80 14775 <000000002df7d8e0-7C0037EC> * dcbz r0,r6
00000fffb7fe9b00 14781 <000000002df7d8e0-7C0037EC> * dcbz r0,r6
00000fffb7fe9b80 14787 <000000002df7d8e0-7C0037EC> * dcbz r0,r6
00000fffb7fe9c00 14793 <000000002df7d8e0-7C0037EC> * dcbz r0,r6
00000fffb7fe9c80 14799 <000000002df7d8e0-7C0037EC> * dcbz r0,r6
00000fffb7fe9d00 14805 <000000002df7d8e0-7C0037EC> * dcbz r0,r6
00000fffb7fe9d80 14811 <000000002df7d8e0-7C0037EC> * dcbz r0,r6
00000fffb7fe9e00 14817 <000000002df7d8e0-7C0037EC> * dcbz r0,r6
00000fffb7fe9e80 14823 <000000002df7d8e0-7C0037EC> * dcbz r0,r6
00000fffb7fe9f00 14829 <000000002df7d8e0-7C0037EC> * dcbz r0,r6
00000fffb7fe9f80 14835 <000000002df7d8e0-7C0037EC> * dcbz r0,r6
00000fffb7ffe000 20532 <000000002df7d8e0-7C0037EC> * dcbz r0,r6
00000fffb7ffe080 20538 <000000002df7d8e0-7C0037EC> * dcbz r0,r6
00000fffb7ffe100 20544 <000000002df7d8e0-7C0037EC> * dcbz r0,r6
00000fffb7ffe180 20550 <000000002df7d8e0-7C0037EC> * dcbz r0,r6
00000fffb7ffe200 20556 <000000002df7d8e0-7C0037EC> * dcbz r0,r6
00000fffb7ffe280 20562 <000000002df7d8e0-7C0037EC> * dcbz r0,r6
00000fffb7ffe300 20568 <000000002df7d8e0-7C0037EC> * dcbz r0,r6
00000fffb7ffe380 20574 <000000002df7d8e0-7C0037EC> * dcbz r0,r6
00000fffb7ffe400 20580 <000000002df7d8e0-7C0037EC> * dcbz r0,r6
00000fffb7ffe480 20586 <000000002df7d8e0-7C0037EC> * dcbz r0,r6
00000fffb7ffe500 20592 <000000002df7d8e0-7C0037EC> * dcbz r0,r6
00000fffb7ffe580 20598 <000000002df7d8e0-7C0037EC> * dcbz r0,r6
00000fffb7ffe600 20604 <000000002df7d8e0-7C0037EC> * dcbz r0,r6
00000fffb7ffe680 20610 <000000002df7d8e0-7C0037EC> * dcbz r0,r6
In contrast, Debian's eglibc 2.11.3 has a working memset (or at least it's not choosing the power4 one by default):
ra8135@squirrel:/mnt$ tte_dump on-squirrel-211.tte | grep dcbz
00000fffb7fbb200 19348 <00000fffb7fe6530-7C0037EC> * dcbz r0,r6
00000fffb7fbb240 19354 <00000fffb7fe6530-7C0037EC> * dcbz r0,r6
00000fffb7fbb280 19360 <00000fffb7fe6530-7C0037EC> * dcbz r0,r6
00000fffb7fbb2c0 19366 <00000fffb7fe6530-7C0037EC> * dcbz r0,r6
00000fffb7fbb300 19372 <00000fffb7fe6530-7C0037EC> * dcbz r0,r6
00000fffb7fbb340 19378 <00000fffb7fe6530-7C0037EC> * dcbz r0,r6
00000fffb7fbb380 19384 <00000fffb7fe6530-7C0037EC> * dcbz r0,r6
00000fffb7fbb3c0 19390 <00000fffb7fe6530-7C0037EC> * dcbz r0,r6
00000fffb7fbb400 19396 <00000fffb7fe6530-7C0037EC> * dcbz r0,r6
00000fffb7fbb440 19402 <00000fffb7fe6530-7C0037EC> * dcbz r0,r6
00000fffb7fbb480 19408 <00000fffb7fe6530-7C0037EC> * dcbz r0,r6
00000fffb7fbb4c0 19414 <00000fffb7fe6530-7C0037EC> * dcbz r0,r6
00000fffb7fbb500 19420 <00000fffb7fe6530-7C0037EC> * dcbz r0,r6
00000fffb7fbb540 19426 <00000fffb7fe6530-7C0037EC> * dcbz r0,r6
00000fffb7fbb580 19432 <00000fffb7fe6530-7C0037EC> * dcbz r0,r6
00000fffb7fbb5c0 19438 <00000fffb7fe6530-7C0037EC> * dcbz r0,r6
etc etc etc
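The stride can also be read off mechanically from the two dumps. The sketch below just subtracts consecutive effective addresses. The arrays hold the first four EAs copied from each listing above; common_stride and the array names are illustrative and not part of tte_dump.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the common difference between consecutive addresses, or 0 if
   there are fewer than two addresses or the differences are not all
   the same. */
static uint64_t common_stride(const uint64_t *ea, size_t n)
{
    uint64_t stride;
    size_t i;

    assert(ea != NULL);
    if (n < 2)
        return 0;
    stride = ea[1] - ea[0];
    for (i = 2; i < n; i++)
        if (ea[i] - ea[i - 1] != stride)
            return 0;
    return stride;
}

/* First four dcbz EAs from the SUSE glibc 2.17 trace. */
static const uint64_t suse_217_ea[] = {
    UINT64_C(0x00000fffb7fe9900), UINT64_C(0x00000fffb7fe9980),
    UINT64_C(0x00000fffb7fe9a00), UINT64_C(0x00000fffb7fe9a80),
};

/* First four dcbz EAs from the Debian eglibc 2.11.3 trace. */
static const uint64_t debian_211_ea[] = {
    UINT64_C(0x00000fffb7fbb200), UINT64_C(0x00000fffb7fbb240),
    UINT64_C(0x00000fffb7fbb280), UINT64_C(0x00000fffb7fbb2c0),
};
```

common_stride(suse_217_ea, 4) comes out as 128 and common_stride(debian_211_ea, 4) as 64, which matches the e5500 cache line size only in the Debian case.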
Since memset is used to initialize the bss, the broken version zeroes only the first 64 bytes of every 128, i.e. it skips every other cache line. The leaf function that segfaults, __new_exitfn() in stdlib/cxa_atexit.c, walks a linked list of destructor functions in order to register a new one. The initial list is supposed to be empty. It is declared in the same file as "static struct exit_function_list initial;", i.e. it is allocated in bss, which the memset code is supposed to clear but actually doesn't. The first element of the initial list happens to have a non-zero next pointer value of 8, so the loop tries to fetch the next list element from address 8, resulting in the segfault.
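The "every other cache line" effect can be reproduced on any host with a toy model. This is a simulation only: model_dcbz stands in for the real dcbz instruction by zeroing the hardware-sized block that contains the address, and the loop advances by whatever stride the memset code assumes. All names and the 0xAA fill pattern are made up for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_MEM_SIZE 4096u

/* Model of dcbz: zero the whole hardware cache block containing addr. */
static void model_dcbz(uint8_t *mem, size_t addr, size_t hw_line)
{
    memset(mem + (addr - addr % hw_line), 0, hw_line);
}

/* Model of the memset inner loop: one dcbz per `stride` bytes over a
   region of `len` bytes that starts out filled with garbage.  Returns
   how many bytes are still non-zero afterwards. */
static size_t bytes_left_dirty(size_t len, size_t hw_line, size_t stride)
{
    static uint8_t mem[MODEL_MEM_SIZE];
    size_t off, i, dirty = 0;

    assert(hw_line != 0 && stride != 0);
    assert(len <= MODEL_MEM_SIZE && len % stride == 0);
    assert(MODEL_MEM_SIZE % hw_line == 0);

    memset(mem, 0xAA, sizeof mem);          /* pretend bss holds garbage */
    for (off = 0; off < len; off += stride)
        model_dcbz(mem, off, hw_line);
    for (i = 0; i < len; i++)
        dirty += (mem[i] != 0);
    return dirty;
}
```

bytes_left_dirty(1024, 64, 128) returns 512: with 64-byte hardware lines and a 128-byte stride, half of the region is never cleared. With matching sizes (64/64 or 128/128) it returns 0.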
struct exit_function *
__new_exitfn (struct exit_function_list **listp)
{
  struct exit_function_list *p = NULL;
  struct exit_function_list *l;
  struct exit_function *r = NULL;
  size_t i = 0;

  __libc_lock_lock (lock);

  for (l = *listp; l != NULL; p = l, l = l->next)
    {
      for (i = l->idx; i > 0; --i)
        if (l->fns[i - 1].flavor != ef_free)
          break;

      if (i > 0)
        break;

      /* This block is completely unused.  */
      l->idx = 0;
    }
  // ...
0000000b 00000003 0000001c 00000fff b7fd7490 // Operand r28 holds 0x00000fffb7fd7490
00000fffb7fd7490 135663 <00000fffb7e76e90-EBBC0000> * ld r29,0(r28) // l = *listp
0000000b 00000003 0000001d 00000fff b7fe9e60 // Operand r29 holds 0x00000fffb7fe9e60
135664 <00000fffb7e76e94-2FBD0000> * cmpdi cr7,r29,0x0000 // l != NULL?
0000001c 00000001 22000424 // Operand CR holds 0x22000424
00000fffb7e76e9c 135665 <00000fffb7e76e98-419E0154> * beq- cr7,0x0154
0000000b 00000003 0000001d 00000fff b7fe9e60 // Operand r29 holds 0x00000fffb7fe9e60
0000000b 00000003 0000001d 00000fff b7fe9e60 // Operand r29 holds 0x00000fffb7fe9e60
135666 <00000fffb7e76e9c-7FA6EB78> * or r6,r29,r29 // l = *listp copy
135667 <00000fffb7e76ea0-38600000> * addi r3,r0,0x0000
135668 <00000fffb7e76ea4-38800000> * addi r4,r0,0x0000
0000000b 00000003 00000000 00000fff b7e7708c // Operand r0 holds 0x00000fffb7e7708c
135669 <00000fffb7e76ea8-60000000> * ori r0,r0,0x0000
0000000b 00000003 00000000 00000fff b7e7708c // Operand r0 holds 0x00000fffb7e7708c
135670 <00000fffb7e76eac-60000000> * ori r0,r0,0x0000
0000000b 00000003 00000006 00000fff b7fe9e60 // Operand r6 holds 0x00000fffb7fe9e60
00000fffb7fe9e68 135671 <00000fffb7e76eb0-E8E60008> * ld r7,8(r6) // i = l->idx
0000000b 00000003 00000007 00000000 00000000 // Operand r7 holds 000000000000000000
135672 <00000fffb7e76eb4-2FA70000> * cmpdi cr7,r7,0x0000 // i > 0 ?
0000001c 00000001 22000422 // Operand CR holds 0x22000422
00000fffb7e76f0c ! 135673 <00000fffb7e76eb8-419E0054> * beq- cr7,0x0054
0000000b 00000003 00000006 00000fff b7fe9e60 // Operand r6 holds 0x00000fffb7fe9e60
00000fffb7fe9e60 135674 <00000fffb7e76f0c-E9260000> * ld r9,0(r6) // temp_l = l->next
0000000b 00000003 00000004 00000000 00000000 // Operand r4 holds 000000000000000000
0000000b 00000003 00000006 00000fff b7fe9e60 // Operand r6 holds 0x00000fffb7fe9e60
00000fffb7fe9e68 135675 <00000fffb7e76f10-F8860008> * std r4,8(r6) // l->idx = 0
0000000b 00000003 00000006 00000fff b7fe9e60 // Operand r6 holds 0x00000fffb7fe9e60
0000000b 00000003 00000006 00000fff b7fe9e60 // Operand r6 holds 0x00000fffb7fe9e60
135676 <00000fffb7e76f14-7CC33378> * or r3,r6,r6
0000000b 00000003 00000009 00000000 00000008 // Operand r9 holds 000000000000000008 // XXX: temp_l address is 8
135677 <00000fffb7e76f18-2FA90000> * cmpdi cr7,r9,0x0000
0000001c 00000001 22000424 // Operand CR holds 0x22000424
00000fffb7e76f20 135678 <00000fffb7e76f1c-419E009C> * beq- cr7,0x009c
0000000b 00000003 00000009 00000000 00000008 // Operand r9 holds 000000000000000008
0000000b 00000003 00000009 00000000 00000008 // Operand r9 holds 000000000000000008
135679 <00000fffb7e76f20-7D264B78> * or r6,r9,r9
00000fffb7e76eb0 ! 135680 <00000fffb7e76f24-4BFFFF8C> * b 0xffffff8c
0000000b 00000003 00000006 00000000 00000008 // Operand r6 holds 000000000000000008
0000000000000010 135681 <00000fffb7e76eb0-E8E60008> * ld r7,8(r6) // XXX: *temp_l causes segfault
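The addresses in the trace can be cross-checked with a little arithmetic. The sketch below assumes that the list head at 0x00000fffb7fe9e60 is the `initial` object and that the dcbz loop starts on a 128-byte boundary, which is what the EAs in the dcbz dump suggest. The struct only mirrors the first two fields that the trace touches (next at offset 0, idx at offset 8) with fixed-width types, so the offsets are the same on any host. All names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the head of struct exit_function_list on a 64-bit target. */
struct list_head_model {
    uint64_t next;  /* struct exit_function_list *next, "ld r9,0(r6)" */
    uint64_t idx;   /* size_t idx, "ld r7,8(r6)" */
};

/* Nonzero if addr lies in the part of each stride that a single dcbz
   of hw_line bytes leaves untouched. */
static int left_unzeroed(uint64_t addr, uint64_t hw_line, uint64_t stride)
{
    assert(hw_line != 0 && stride >= hw_line);
    return (addr % stride) >= hw_line;
}

/* Effective address of the l->idx load for a given list pointer. */
static uint64_t idx_load_address(uint64_t l)
{
    return l + offsetof(struct list_head_model, idx);
}
```

left_unzeroed(0x00000fffb7fe9e60, 64, 128) is true: the dcbz at 0x00000fffb7fe9e00 clears only up to 0x...9e3f on e5500, so the list head sits in the half that keeps its garbage. idx_load_address(0x8) is 0x10, the faulting EA on the last line of the trace.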
So, yeah, I doubt SUSE necessarily says it supports P5020 or really any powerpc that has 64-byte dcbz, but wouldn't it be nice if glibc had a way of indicating that the detected cache line size doesn't match what it has been tuned for?
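As a sketch of what such an indication could look like (this is not something glibc does, and every name here is hypothetical), a startup check would only need to compare the detected size against the size the build was tuned for:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: the line size a --with-cpu=power4 build assumes. */
#define TUNED_CACHE_LINE_SIZE 128u

typedef enum { CLS_UNKNOWN, CLS_MATCH, CLS_MISMATCH } cls_check;

/* Compare the detected cache line size with the tuned one.  A detected
   size of 0 means the platform did not report one. */
static cls_check check_cache_line_size(size_t detected, size_t tuned)
{
    assert(tuned != 0);
    if (detected == 0)
        return CLS_UNKNOWN;
    return detected == tuned ? CLS_MATCH : CLS_MISMATCH;
}
```

check_cache_line_size(64, TUNED_CACHE_LINE_SIZE) yields CLS_MISMATCH, which is the e5500 case. On Linux a program could obtain the detected value from getauxval(AT_DCACHEBSIZE) in <sys/auxv.h>, which returns 0 when the kernel does not supply that entry. What to do on a mismatch (warn, abort, or fall back to the generic memset) is a separate design question.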
-- To unsubscribe, e-mail: opensuse-ppc+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse-ppc+owner@opensuse.org
Alexander Graf <agraf@suse.de> writes:

> - compile the non-target-optimized glibc without --with-cpu=power4.

This is already the case, otherwise it wouldn't run on G3. So your analysis is flawed.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."
On 13.09.2013 at 05:44, Andreas Schwab <schwab@linux-m68k.org> wrote:

> Alexander Graf <agraf@suse.de> writes:
>
>> - compile the non-target-optimized glibc without --with-cpu=power4.
>
> This is already the case, otherwise it wouldn't run on G3. So your analysis is flawed.

G3 is 32bit which has different defaults :).

Alex

> Andreas.