Oh, btw, removing all the __attribute__((always_inline)) annotations from the TU lets the inline heuristics do their job, and everything gets optimized early. The alternative is to put __attribute__((flatten)) on memsetT and rect_memsetT.
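For illustration, here is a rough sketch of what the flatten variant could look like. The signatures of memsetT and rect_memsetT, the store helper and fill_example are all made up for this example, since the real ones are not shown here. The point is only where the attribute goes: flatten asks the compiler to inline every call inside the marked function's body, if possible, so the callees no longer need always_inline.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical leaf helper. Note: no always_inline here; the inliner decides.
template <typename T>
static inline void store(T *dst, T value) { *dst = value; }

// Assumed signature. flatten requests that the calls in this body
// (here: store) be inlined into memsetT where possible.
template <typename T>
__attribute__((flatten))
void memsetT(T *dst, T value, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i)
        store(dst + i, value);
}

// Assumed signature. Fills a width x height rectangle in a buffer whose
// rows are `stride` elements apart; flatten covers the memsetT call.
template <typename T>
__attribute__((flatten))
void rect_memsetT(T *dst, T value, std::size_t width, std::size_t height,
                  std::size_t stride) {
    for (std::size_t y = 0; y < height; ++y)
        memsetT(dst + y * stride, value, width);
}

// Hypothetical usage: fill a 3x2 rectangle inside a 4-wide, 2-row buffer.
inline std::vector<int> fill_example() {
    std::vector<int> buf(8, 0);
    rect_memsetT(buf.data(), 7, 3, 2, 4);
    return buf;
}
```

Whether this gives the same early optimization as dropping always_inline would have to be checked on the actual testcase, e.g. by comparing the optimized dumps.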