I am doing some numerical optimization on a scientific application. One thing I noticed is that GCC will optimize the call pow(a,2) by compiling it into a*a, but the call pow(a,6) is not optimized and will actually call the library function pow, which greatly slows down performance. (In contrast, the Intel C++ Compiler, executable icc, will eliminate the library call for pow(a,6).)
What I am curious about is that when I replaced pow(a,6) with a*a*a*a*a*a using GCC 4.5.1 and options "-O3 -lm -funroll-loops -msse4", it uses 5 mulsd instructions:
    movapd  %xmm14, %xmm13
    mulsd   %xmm14, %xmm13
    mulsd   %xmm14, %xmm13
    mulsd   %xmm14, %xmm13
    mulsd   %xmm14, %xmm13
    mulsd   %xmm14, %xmm13
while if I write (a*a*a)*(a*a*a), it will produce
    movapd  %xmm14, %xmm13
    mulsd   %xmm14, %xmm13
    mulsd   %xmm14, %xmm13
    mulsd   %xmm13, %xmm13
which reduces the number of multiply instructions to 3. icc has similar behavior.
Why do compilers not recognize this optimization trick?
Because floating-point math is not associative. The way you group the operands in a floating-point multiplication affects the numerical accuracy of the answer, because each intermediate product is rounded separately.
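As a rough illustration of why grouping matters, here is a minimal sketch in C. The function names mul_left and mul_right are made up for this example, and it assumes IEEE 754 doubles evaluated in double precision (for instance SSE2 on x86-64, not x87 extended precision). It uses overflow as a deliberately extreme case; with ordinary values the two groupings would typically differ only in the last bits, if at all.

```c
/* Hypothetical helpers, named only for this illustration. */

/* Groups the product as (x * y) * z. */
double mul_left(double x, double y, double z)
{
    return (x * y) * z;
}

/* Groups the same product as x * (y * z). */
double mul_right(double x, double y, double z)
{
    return x * (y * z);
}

/*
 * With x = 1e200, y = 1e200, z = 1e-200:
 *   mul_left  computes 1e200 * 1e200 first, which overflows to +infinity,
 *             and infinity times 1e-200 is still infinity.
 *   mul_right computes 1e200 * 1e-200 first, which is roughly 1,
 *             so the final result stays finite, near 1e200.
 */
```

The two functions are mathematically identical, yet they return different values for these inputs, so a compiler that silently regrouped one into the other would change the program's result.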
As a result, most compilers are very conservative about reordering floating-point calculations unless they can be sure that the answer will stay the same, or unless you tell them you don't care about numerical accuracy. For example, gcc's -fassociative-math option allows gcc to reassociate floating-point operations, and the -ffast-math option allows even more aggressive tradeoffs of accuracy against speed.
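To check what these options do on your own compiler, a small test file along these lines may help. This is only a sketch: the file name pow6.c and the function names are assumptions made for the example, and the generated assembly will depend on the GCC version and target.

```c
/* pow6.c: hypothetical file name, used only for this illustration. */

/* Plain chain. Under strict IEEE semantics the compiler has to keep the
   left-to-right order ((((a*a)*a)*a)*a)*a, which is five multiplications. */
double pow6_chain(double a)
{
    return a * a * a * a * a * a;
}

/* Grouped by hand: the cube is computed once (two multiplications)
   and then squared (one more), three multiplications in total. */
double pow6_grouped(double a)
{
    double a3 = a * a * a;
    return a3 * a3;
}
```

Compiling with "gcc -O3 -S pow6.c" and again with "gcc -O3 -ffast-math -S pow6.c", then comparing the mulsd counts in the two pow6.s outputs, should show whether your GCC reassociates pow6_chain once it is allowed to. The two functions can also round differently for some inputs, which is exactly the difference the default mode is preserving.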