This is a question that came to mind while reading the brilliant answer by Mysticial to the question: why is it faster to process a sorted array than an unsorted array?
Context for the types involved:
const unsigned arraySize = 32768;
int data[arraySize];
long long sum = 0;
In his answer he explains that the Intel Compiler (ICC) optimizes this:
for (int i = 0; i < 100000; ++i)
    for (int c = 0; c < arraySize; ++c)
        if (data[c] >= 128)
            sum += data[c];
...into something equivalent to this:
for (int c = 0; c < arraySize; ++c)
    if (data[c] >= 128)
        for (int i = 0; i < 100000; ++i)
            sum += data[c];
The optimizer is recognizing that these are equivalent and is therefore exchanging the loops, moving the branch outside the inner loop. Very clever!
But why doesn't it do this?
for (int c = 0; c < arraySize; ++c)
    if (data[c] >= 128)
        sum += 100000 * data[c];
Hopefully Mysticial (or anyone else) can give an equally brilliant answer. I've never learned about the optimizations discussed in that other question before, so I'm really grateful for this.
If sum were volatile, then the loop interchange would be an invalid optimization as well.
The compiler can't generally transform
for (int c = 0; c < arraySize; ++c)
    if (data[c] >= 128)
        for (int i = 0; i < 100000; ++i)
            sum += data[c];
into
for (int c = 0; c < arraySize; ++c)
    if (data[c] >= 128)
        sum += 100000 * data[c];
because the latter could lead to overflow of signed integers where the former doesn't. Even with guaranteed wrap-around behaviour for overflow of signed two's complement integers, it would change the result (if data[c] is 30000, the product would become -1294967296 for typical 32-bit ints with wrap-around, while adding 30000 to sum 100000 times would, if that doesn't overflow, increase sum by 3000000000). Note that the same holds for unsigned quantities, with different numbers: overflow of 100000 * data[c] would typically introduce a reduction modulo 2^32 that must not appear in the final result.
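To make those numbers concrete, here is a minimal sketch (my own illustration, not from the question; it assumes a 32-bit int and simulates the typical two's-complement wrap-around with unsigned arithmetic so the demonstration itself has no undefined behaviour):

#include <cstdint>
#include <iostream>

int main() {
    int value = 30000;

    // 100000 repeated additions into a long long accumulator: no overflow.
    long long sum = 0;
    for (int i = 0; i < 100000; ++i)
        sum += value;                       // sum ends up as 3000000000

    // A single 32-bit multiplication 100000 * value would overflow a signed
    // int (undefined behaviour). To show what two's-complement wrap-around
    // would produce, compute the product in unsigned arithmetic and convert
    // it to a signed 32-bit value:
    std::uint32_t product = 100000u * static_cast<std::uint32_t>(value);    // 3000000000
    std::int32_t as_signed = static_cast<std::int32_t>(product);            // -1294967296 on typical targets

    std::cout << sum << " vs " << as_signed << '\n';
}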
It could transform it into
for (int c = 0; c < arraySize; ++c)
    if (data[c] >= 128)
        sum += 100000LL * data[c];  // resp. 100000ull
though, if, as usual, long long is sufficiently larger than int.
Why it doesn't do that, I can't tell; I guess it's what Mysticial said: "apparently, it does not run a loop-collapsing pass after loop-interchange".
Note that the loop-interchange itself is not generally valid (for signed integers), since
for (int c = 0; c < arraySize; ++c)
    if (condition(data[c]))
        for (int i = 0; i < 100000; ++i)
            sum += data[c];
can lead to overflow where
for (int i = 0; i < 100000; ++i)
    for (int c = 0; c < arraySize; ++c)
        if (condition(data[c]))
            sum += data[c];
wouldn't. It's kosher here, since the condition ensures all data[c] that are added have the same sign, so if one overflows, both do.
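To see why the sign condition matters, here is a small sketch of my own (hypothetical mixed-sign data, not from the answer above). It tracks the partial sums in a wide type so the demonstration itself never overflows, and shows how the interchanged order drives the partial sum far past INT_MAX while the original order keeps it tiny:

#include <algorithm>
#include <climits>
#include <iostream>

int main() {
    const int data[] = {22000, -22000};   // hypothetical data where the condition holds for both signs
    const int repeats = 100000;

    long long s = 0, peakOriginal = 0, peakInterchanged = 0;

    // Original nesting: one element per pass, so the signs cancel every pass.
    for (int i = 0; i < repeats; ++i)
        for (int c = 0; c < 2; ++c) {
            s += data[c];
            peakOriginal = std::max(peakOriginal, s < 0 ? -s : s);
        }

    // Interchanged nesting: 100000 consecutive additions of the same value.
    s = 0;
    for (int c = 0; c < 2; ++c)
        for (int i = 0; i < repeats; ++i) {
            s += data[c];
            peakInterchanged = std::max(peakInterchanged, s < 0 ? -s : s);
        }

    std::cout << "peak |partial sum|, original:     " << peakOriginal << '\n'      // 22000
              << "peak |partial sum|, interchanged: " << peakInterchanged << '\n'  // 2200000000, exceeds INT_MAX
              << "INT_MAX:                          " << INT_MAX << '\n';
}

With a narrower signed accumulator (say, a plain int), the interchanged order would therefore overflow while the original order would not.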
I wouldn't be too sure that the compiler took that into account, though (@Mysticial, could you try with a condition like data[c] & 0x80, or something similar that can be true for positive and negative values?). I have had compilers make invalid optimisations (for example, a couple of years ago an ICC (11.0, iirc) used a signed-32-bit-int-to-double conversion in 1.0/n where n was an unsigned int. It was about twice as fast as gcc's output, but wrong: a lot of values were larger than 2^31, oops).
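For readers who want to see what goes wrong in that kind of miscompilation, here is a minimal sketch of my own (not the actual ICC output) contrasting a correct unsigned-to-double conversion with one that goes through a signed 32-bit intermediate:

#include <iostream>

int main() {
    unsigned int n = 3000000000u;                              // larger than 2^31

    double correct = 1.0 / static_cast<double>(n);             // proper unsigned -> double conversion
    // Going through int first changes the value (the out-of-range conversion is
    // implementation-defined; on typical targets it wraps to a negative number):
    double wrong = 1.0 / static_cast<double>(static_cast<int>(n));

    std::cout << correct << " vs " << wrong << '\n';           // e.g. 3.33333e-10 vs -7.72221e-10
}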
This answer does not apply to the specific case linked, but it does apply to the question title and may be interesting to future readers:
Due to finite precision, repeated floating-point addition is not equivalent to multiplication. Consider:
#include <iostream>

int main() {
    float const step = 1e-15;
    float const init = 1;
    long int const count = 1000000000;

    float result1 = init;
    for (int i = 0; i < count; ++i) result1 += step;  // each addition rounds away: 1.0f + 1e-15f == 1.0f

    float result2 = init;
    result2 += step * count;  // step * count is about 1e-6, which float can represent

    std::cout << (result1 - result2) << '\n';  // prints a non-zero difference (about -1e-06)
}
The compiler contains various passes that perform the optimizations. Usually, each pass does either optimizations on statements or loop optimizations. At present there is no pass that optimizes the loop body based on the loop headers; this is hard to detect and less common.
The optimization that was done here is loop-invariant code motion, which can be implemented using a number of techniques.
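For illustration, here is a minimal sketch of loop-invariant code motion (the function names and expressions are mine, purely illustrative):

// What the programmer writes: y * z does not depend on i, yet it appears
// inside the loop.
void fill(int *a, int n, int y, int z) {
    for (int i = 0; i < n; ++i)
        a[i] = y * z + i;
}

// What loop-invariant code motion effectively produces: the invariant
// expression is hoisted out of the loop and computed once.
void fill_hoisted(int *a, int n, int y, int z) {
    int t = y * z;
    for (int i = 0; i < n; ++i)
        a[i] = t + i;
}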
It does now -- at least, clang does:
long long add_100k_signed(int *data, int arraySize)
{
    long long sum = 0;
    for (int c = 0; c < arraySize; ++c)
        if (data[c] >= 128)
            for (int i = 0; i < 100000; ++i)
                sum += data[c];
    return sum;
}
compiles with -O1 to
add_100k_signed:                        # @add_100k_signed
        test    esi, esi
        jle     .LBB0_1
        mov     r9d, esi
        xor     r8d, r8d
        xor     esi, esi
        xor     eax, eax
.LBB0_4:                                # =>This Inner Loop Header: Depth=1
        movsxd  rdx, dword ptr [rdi + 4*rsi]
        imul    rcx, rdx, 100000
        cmp     rdx, 127
        cmovle  rcx, r8
        add     rax, rcx
        add     rsi, 1
        cmp     r9, rsi
        jne     .LBB0_4
        ret
.LBB0_1:
        xor     eax, eax
        ret
Integer overflow has nothing to do with it; if there is integer overflow that causes undefined behavior, it can happen in either case. Here's the same kind of function using int instead of long long:
int add_100k_signed(int *data, int arraySize)
{
    int sum = 0;
    for (int c = 0; c < arraySize; ++c)
        if (data[c] >= 128)
            for (int i = 0; i < 100000; ++i)
                sum += data[c];
    return sum;
}
compiles with -O1 to
add_100k_signed:                        # @add_100k_signed
        test    esi, esi
        jle     .LBB0_1
        mov     r9d, esi
        xor     r8d, r8d
        xor     esi, esi
        xor     eax, eax
.LBB0_4:                                # =>This Inner Loop Header: Depth=1
        mov     edx, dword ptr [rdi + 4*rsi]
        imul    ecx, edx, 100000
        cmp     edx, 127
        cmovle  ecx, r8d
        add     eax, ecx
        add     rsi, 1
        cmp     r9, rsi
        jne     .LBB0_4
        ret
.LBB0_1:
        xor     eax, eax
        ret
Well, I'd guess that some compilers might do this sort of optimization, assuming that we are talking about integer arithmetic.
At the same time, some compilers might refuse to do it because replacing repeated addition with multiplication might change the overflow behavior of the code. For unsigned integer types, it shouldn't make a difference since their overflow behavior is fully specified by the language. But for signed ones it might (though probably not on a two's-complement platform). It is true that signed overflow actually leads to undefined behavior in C, meaning that it should be perfectly OK to ignore those overflow semantics altogether, but not all compilers are brave enough to do that. It often draws a lot of criticism from the "C is just a higher-level assembly language" crowd. (Remember what happened when GCC introduced optimizations based on strict-aliasing semantics?)
Historically, GCC has shown itself as a compiler that has what it takes to take such drastic steps, but other compilers might prefer to stick with the perceived "user-intended" behavior even if it is undefined by the language.
There's a conceptual barrier to this kind of optimization. Compiler authors spend a lot of effort on strength reduction -- for instance, replacing multiplications with adds and shifts. They get used to thinking that multiplies are bad. So a case where one ought to go the other way is surprising and counterintuitive. So nobody thinks to implement it.
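As a small illustration of that usual direction (my own example, not from the answer):

// The classic direction of strength reduction: replace a multiplication
// with cheaper shifts and adds.
unsigned scale_by_9(unsigned x) {
    return x * 9;              // what the programmer writes
}

unsigned scale_by_9_reduced(unsigned x) {
    return (x << 3) + x;       // what a compiler may emit: x*8 + x
}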
The people who develop and maintain compilers have a limited amount of time and energy to spend on their work, so they generally want to focus on what their users care about the most: turning well-written code into fast code. They don't want to spend their time trying to find ways to turn silly code into fast code—that's what code review is for. In a high-level language, there may be "silly" code that expresses an important idea, making it worth the developers' time to make that fast—for example, short cut deforestation and stream fusion allow Haskell programs structured around certain kinds of lazily produced data structures to be compiled into tight loops that don't allocate memory. But that sort of incentive simply does not apply to turning looped addition into multiplication. If you want it to be fast, just write it with multiplication.
ADD.W A6,$A000, forgetting that word operations with address registers sign-extend the word to 32 bits before the add. Took a while to troubleshoot, since the only thing the code did between that ADD and the next time it popped A6 off the stack was to restore the caller's registers it had saved to that frame...
MyArray[0] = 4; I could check the address of MyArray and look at that location before and after the statement executed; it wouldn't change. The code was something like move.B @A3,#4 and A3 was supposed to always point to MyArray any time that instruction executed, but it didn't. Fun.