Is if (a < 901)
faster than if (a <= 900)
?
Not exactly as in this simple example, but there are slight performance changes on loop complex code. I suppose this has to do something with generated machine code in case it's even true.
<
is two times faster than typing <=
.
No, it will not be faster on most architectures. You didn't specify, but on x86, all of the integral comparisons will be typically implemented in two machine instructions:
A test or cmp instruction, which sets EFLAGS
And a Jcc (jump) instruction, depending on the comparison type (and code layout):
jne - Jump if not equal --> ZF = 0
jz - Jump if zero (equal) --> ZF = 1
jg - Jump if greater --> ZF = 0 and SF = OF
(etc...)
Example (Edited for brevity) Compiled with $ gcc -m32 -S -masm=intel test.c
if (a < b) {
// Do something 1
}
Compiles to:
mov eax, DWORD PTR [esp+24] ; a
cmp eax, DWORD PTR [esp+28] ; b
jge .L2 ; jump if a is >= b
; Do something 1
.L2:
And
if (a <= b) {
// Do something 2
}
Compiles to:
mov eax, DWORD PTR [esp+24] ; a
cmp eax, DWORD PTR [esp+28] ; b
jg .L5 ; jump if a is > b
; Do something 2
.L5:
So the only difference between the two is a jg
versus a jge
instruction. The two will take the same amount of time.
I'd like to address the comment that nothing indicates that the different jump instructions take the same amount of time. This one is a little tricky to answer, but here's what I can give: In the Intel Instruction Set Reference, they are all grouped together under one common instruction, Jcc
(Jump if condition is met). The same grouping is made together under the Optimization Reference Manual, in Appendix C. Latency and Throughput.
Latency — The number of clock cycles that are required for the execution core to complete the execution of all of the μops that form an instruction.
Throughput — The number of clock cycles required to wait before the issue ports are free to accept the same instruction again. For many instructions, the throughput of an instruction can be significantly less than its latency
The values for Jcc
are:
Latency Throughput
Jcc N/A 0.5
with the following footnote on Jcc
:
Selection of conditional jump instructions should be based on the recommendation of section Section 3.4.1, “Branch Prediction Optimization,” to improve the predictability of branches. When branches are predicted successfully, the latency of jcc is effectively zero.
So, nothing in the Intel docs ever treats one Jcc
instruction any differently from the others.
If one thinks about the actual circuitry used to implement the instructions, one can assume that there would be simple AND/OR gates on the different bits in EFLAGS
, to determine whether the conditions are met. There is then, no reason that an instruction testing two bits should take any more or less time than one testing only one (Ignoring gate propagation delay, which is much less than the clock period.)
Edit: Floating Point
This holds true for x87 floating point as well: (Pretty much same code as above, but with double
instead of int
.)
fld QWORD PTR [esp+32]
fld QWORD PTR [esp+40]
fucomip st, st(1) ; Compare ST(0) and ST(1), and set CF, PF, ZF in EFLAGS
fstp st(0)
seta al ; Set al if above (CF=0 and ZF=0).
test al, al
je .L2
; Do something 1
.L2:
fld QWORD PTR [esp+32]
fld QWORD PTR [esp+40]
fucomip st, st(1) ; (same thing as above)
fstp st(0)
setae al ; Set al if above or equal (CF=0).
test al, al
je .L5
; Do something 2
.L5:
leave
ret
Historically (we're talking the 1980s and early 1990s), there were some architectures in which this was true. The root issue is that integer comparison is inherently implemented via integer subtractions. This gives rise to the following cases.
Comparison Subtraction
---------- -----------
A < B --> A - B < 0
A = B --> A - B = 0
A > B --> A - B > 0
Now, when A < B
the subtraction has to borrow a high-bit for the subtraction to be correct, just like you carry and borrow when adding and subtracting by hand. This "borrowed" bit was usually referred to as the carry bit and would be testable by a branch instruction. A second bit called the zero bit would be set if the subtraction were identically zero which implied equality.
There were usually at least two conditional branch instructions, one to branch on the carry bit and one on the zero bit.
Now, to get at the heart of the matter, let's expand the previous table to include the carry and zero bit results.
Comparison Subtraction Carry Bit Zero Bit
---------- ----------- --------- --------
A < B --> A - B < 0 0 0
A = B --> A - B = 0 1 1
A > B --> A - B > 0 1 0
So, implementing a branch for A < B
can be done in one instruction, because the carry bit is clear only in this case, , that is,
;; Implementation of "if (A < B) goto address;"
cmp A, B ;; compare A to B
bcz address ;; Branch if Carry is Zero to the new address
But, if we want to do a less-than-or-equal comparison, we need to do an additional check of the zero flag to catch the case of equality.
;; Implementation of "if (A <= B) goto address;"
cmp A, B ;; compare A to B
bcz address ;; branch if A < B
bzs address ;; also, Branch if the Zero bit is Set
So, on some machines, using a "less than" comparison might save one machine instruction. This was relevant in the era of sub-megahertz processor speed and 1:1 CPU-to-memory speed ratios, but it is almost totally irrelevant today.
jge
, which test both the zero and sign/carry flags.
<=
test can be implemented in one instruction with swapping the operands and testing for not <
(equivalent to >=
) This is the desired <=
with swapped operands: cmp B,A; bcs addr
. That's the reasoning this test was omitted by Intel, they considered it redundant and you couldn't afford redundant instructions at those times :-)
Assuming we're talking about internal integer types, there's no possible way one could be faster than the other. They're obviously semantically identical. They both ask the compiler to do precisely the same thing. Only a horribly broken compiler would generate inferior code for one of these.
If there was some platform where <
was faster than <=
for simple integer types, the compiler should always convert <=
to <
for constants. Any compiler that didn't would just be a bad compiler (for that platform).
<
nor <=
have speed until the compiler decides which speed they'll have. This is a very simple optimisation for compilers when you consider that they generally already perform dead code optimisation, tail call optimisation, loop hoisting (and unrolling, on occasions), automatic parallelisation of various loops, etc... Why waste time pondering premature optimisations? Get a prototype running, profile it to determine where the most significant optimisations lie, perform those optimisations in order of significance and profile again along the way to measure progress...
(a < C)
to (a <= C-1)
(for some constant C
) causes C
to be more difficult to encode in the instruction set. For example, an instruction set may be able to represent signed constants from -127 to 128 in a compact form in comparisons, but constants outside that range have to loaded using either a longer, slower encoding, or another instruction entirely. So a comparison like (a < -127)
may not have a straightforward transformation.
a > 127
to a > 128
because you have no choice there, you use the one you need. We're comparing a > 127
to a >= 128
, which can't require different encoding or different instructions because they have the same truth table. Any encoding of one is equally an encoding of the other.
<=
to <
for constants". As far as I know, that transformation involves changing the constant. E.g., a <= 42
is compiled as a < 43
because <
is faster. In some edge cases, such a transformation wouldn't be fruitful because the new constant may require more or slower instructions. Of course a > 127
and a >= 128
are equivalent and a compiler should encode both forms in the (same) fastest way, but that's not inconsistent with what I said.
I see that neither is faster. The compiler generates the same machine code in each condition with a different value.
if(a < 901)
cmpl $900, -4(%rbp)
jg .L2
if(a <=901)
cmpl $901, -4(%rbp)
jg .L3
My example if
is from GCC on x86_64 platform on Linux.
Compiler writers are pretty smart people, and they think of these things and many others most of us take for granted.
I noticed that if it is not a constant, then the same machine code is generated in either case.
int b;
if(a < b)
cmpl -4(%rbp), %eax
jge .L2
if(a <=b)
cmpl -4(%rbp), %eax
jg .L3
if(a <=900)
to demonstrate that it generates exactly the same asm :)
For floating point code, the <= comparison may indeed be slower (by one instruction) even on modern architectures. Here's the first function:
int compare_strict(double a, double b) { return a < b; }
On PowerPC, first this performs a floating point comparison (which updates cr
, the condition register), then moves the condition register to a GPR, shifts the "compared less than" bit into place, and then returns. It takes four instructions.
Now consider this function instead:
int compare_loose(double a, double b) { return a <= b; }
This requires the same work as compare_strict
above, but now there's two bits of interest: "was less than" and "was equal to." This requires an extra instruction (cror
- condition register bitwise OR) to combine these two bits into one. So compare_loose
requires five instructions, while compare_strict
requires four.
You might think that the compiler could optimize the second function like so:
int compare_loose(double a, double b) { return ! (a > b); }
However this will incorrectly handle NaNs. NaN1 <= NaN2
and NaN1 > NaN2
need to both evaluate to false.
fucomip
sets ZF and CF.
cr
is the equivalent to flags like ZF
and CF
on the x86. (Although the CR is more flexible.) What the poster is talking about is moving the result to a GPR: which takes two instructions on PowerPC, but x86 has a conditional move instruction.
Maybe the author of that unnamed book has read that a > 0
runs faster than a >= 1
and thinks that is true universally.
But it is because a 0
is involved (because CMP
can, depending on the architecture, replaced e.g. with OR
) and not because of the <
.
(a >= 1)
to run slower than (a > 0)
, since the former can be trivially transformed to the latter by the optimizer..
At the very least, if this were true a compiler could trivially optimise a <= b to !(a > b), and so even if the comparison itself were actually slower, with all but the most naive compiler you would not notice a difference.
NOT
is just made by other instruction (je
vs. jne
)
a<=b
does not mean the same thing as !(a>b)
.
TL;DR answer
For most combinations of architecture, compiler and language, <
will not be faster than <=
.
Full answer
Other answers have concentrated on x86 architecture, and I don't know the ARM architecture (which your example assembler seems to be) well enough to comment specifically on the code generated, but this is an example of a micro-optimisation which is very architecture specific, and is as likely to be an anti-optimisation as it is to be an optimisation.
As such, I would suggest that this sort of micro-optimisation is an example of cargo cult programming rather than best software engineering practice.
Counterexample
There are probably some architectures where this is an optimisation, but I know of at least one architecture where the opposite may be true. The venerable Transputer architecture only had machine code instructions for equal to and greater than or equal to, so all comparisons had to be built from these primitives.
Even then, in almost all cases, the compiler could order the evaluation instructions in such a way that in practice, no comparison had any advantage over any other. Worst case though, it might need to add a reverse instruction (REV) to swap the top two items on the operand stack. This was a single byte instruction which took a single cycle to run, so had the smallest overhead possible.
Summary
Whether or not a micro-optimisation like this is an optimisation or an anti-optimisation depends on the specific architecture you are using, so it is usually a bad idea to get into the habit of using architecture specific micro-optimisations, otherwise you might instinctively use one when it is inappropriate to do so, and it looks like this is exactly what the book you are reading is advocating.
They have the same speed. Maybe in some special architecture what he/she said is right, but in the x86 family at least I know they are the same. Because for doing this the CPU will do a substraction (a - b) and then check the flags of the flag register. Two bits of that register are called ZF (zero Flag) and SF (sign flag), and it is done in one cycle, because it will do it with one mask operation.
This would be highly dependent on the underlying architecture that the C is compiled to. Some processors and architectures might have explicit instructions for equal to, or less than and equal to, which execute in different numbers of cycles.
That would be pretty unusual though, as the compiler could work around it, making it irrelevant.
You should not be able to notice the difference even if there is any. Besides, in practice, you'll have to do an additional a + 1
or a - 1
to make the condition stand unless you're going to use some magic constants, which is a very bad practice by all means.
When I wrote the first version of this answer, I was only looking at the title question about < vs. <= in general, not the specific example of a constant a < 901
vs. a <= 900
. Many compilers always shrink the magnitude of constants by converting between <
and <=
, e.g. because x86 immediate operand have a shorter 1-byte encoding for -128..127.
For ARM, being able to encode as an immediate depends on being able to rotate a narrow field into any position in a word. So cmp r0, #0x00f000
would be encodeable, while cmp r0, #0x00efff
would not be. So the make-it-smaller rule for comparison vs. a compile-time constant doesn't always apply for ARM. AArch64 is either shift-by-12 or not, instead of an arbitrary rotation, for instructions like cmp
and cmn
, unlike 32-bit ARM and Thumb modes.
< vs. <= in general, including for runtime-variable conditions
In assembly language on most machines, a comparison for <=
has the same cost as a comparison for <
. This applies whether you're branching on it, booleanizing it to create a 0/1 integer, or using it as a predicate for a branchless select operation (like x86 CMOV). The other answers have only addressed this part of the question.
But this question is about the C++ operators, the input to the optimizer. Normally they're both equally efficient; the advice from the book sounds totally bogus because compilers can always transform the comparison that they implement in asm. But there is at least one exception where using <=
can accidentally create something the compiler can't optimize.
As a loop condition, there are cases where <=
is qualitatively different from <
, when it stops the compiler from proving that a loop is not infinite. This can make a big difference, disabling auto-vectorization.
Unsigned overflow is well-defined as base-2 wrap around, unlike signed overflow (UB). Signed loop counters are generally safe from this with compilers that optimize based on signed-overflow UB not happening: ++i <= size
will always eventually become false. (What Every C Programmer Should Know About Undefined Behavior)
void foo(unsigned size) {
unsigned upper_bound = size - 1; // or any calculation that could produce UINT_MAX
for(unsigned i=0 ; i <= upper_bound ; i++)
...
Compilers can only optimize in ways that preserve the (defined and legally observable) behaviour of the C++ source for all possible input values, except ones that lead to undefined behaviour.
(A simple i <= size
would create the problem too, but I thought calculating an upper bound was a more realistic example of accidentally introducing the possibility of an infinite loop for an input you don't care about but which the compiler must consider.)
In this case, size=0
leads to upper_bound=UINT_MAX
, and i <= UINT_MAX
is always true. So this loop is infinite for size=0
, and the compiler has to respect that even though you as the programmer probably never intend to pass size=0. If the compiler can inline this function into a caller where it can prove that size=0 is impossible, then great, it can optimize like it could for i < size
.
Asm like if(!size) skip the loop;
do{...}while(--size);
is one normally-efficient way to optimize a for( i<size )
loop, if the actual value of i
isn't needed inside the loop (Why are loops always compiled into "do...while" style (tail jump)?).
But that do{}while can't be infinite: if entered with size==0
, we get 2^n iterations. (Iterating over all unsigned integers in a for loop C makes it possible to express a loop over all unsigned integers including zero, but it's not easy without a carry flag the way it is in asm.)
With wraparound of the loop counter being a possibility, modern compilers often just "give up", and don't optimize nearly as aggressively.
Example: sum of integers from 1 to n
Using unsigned i <= n
defeats clang's idiom-recognition that optimizes sum(1 .. n)
loops with a closed form based on Gauss's n * (n+1) / 2
formula.
unsigned sum_1_to_n_finite(unsigned n) {
unsigned total = 0;
for (unsigned i = 0 ; i < n+1 ; ++i)
total += i;
return total;
}
x86-64 asm from clang7.0 and gcc8.2 on the Godbolt compiler explorer
# clang7.0 -O3 closed-form
cmp edi, -1 # n passed in EDI: x86-64 System V calling convention
je .LBB1_1 # if (n == UINT_MAX) return 0; // C++ loop runs 0 times
# else fall through into the closed-form calc
mov ecx, edi # zero-extend n into RCX
lea eax, [rdi - 1] # n-1
imul rax, rcx # n * (n-1) # 64-bit
shr rax # n * (n-1) / 2
add eax, edi # n + (stuff / 2) = n * (n+1) / 2 # truncated to 32-bit
ret # computed without possible overflow of the product before right shifting
.LBB1_1:
xor eax, eax
ret
But for the naive version, we just get a dumb loop from clang.
unsigned sum_1_to_n_naive(unsigned n) {
unsigned total = 0;
for (unsigned i = 0 ; i<=n ; ++i)
total += i;
return total;
}
# clang7.0 -O3
sum_1_to_n(unsigned int):
xor ecx, ecx # i = 0
xor eax, eax # retval = 0
.LBB0_1: # do {
add eax, ecx # retval += i
add ecx, 1 # ++1
cmp ecx, edi
jbe .LBB0_1 # } while( i<n );
ret
GCC doesn't use a closed-form either way, so the choice of loop condition doesn't really hurt it; it auto-vectorizes with SIMD integer addition, running 4 i
values in parallel in the elements of an XMM register.
# "naive" inner loop
.L3:
add eax, 1 # do {
paddd xmm0, xmm1 # vect_total_4.6, vect_vec_iv_.5
paddd xmm1, xmm2 # vect_vec_iv_.5, tmp114
cmp edx, eax # bnd.1, ivtmp.14 # bound and induction-variable tmp, I think.
ja .L3 #, # }while( n > i )
"finite" inner loop
# before the loop:
# xmm0 = 0 = totals
# xmm1 = {0,1,2,3} = i
# xmm2 = set1_epi32(4)
.L13: # do {
add eax, 1 # i++
paddd xmm0, xmm1 # total[0..3] += i[0..3]
paddd xmm1, xmm2 # i[0..3] += 4
cmp eax, edx
jne .L13 # }while( i != upper_limit );
then horizontal sum xmm0
and peeled cleanup for the last n%3 iterations, or something.
It also has a plain scalar loop which I think it uses for very small n
, and/or for the infinite loop case.
BTW, both of these loops waste an instruction (and a uop on Sandybridge-family CPUs) on loop overhead. sub eax,1
/jnz
instead of add eax,1
/cmp/jcc would be more efficient. 1 uop instead of 2 (after macro-fusion of sub/jcc or cmp/jcc). The code after both loops writes EAX unconditionally, so it's not using the final value of the loop counter.
<
or <=
. But sure, test ecx,ecx
/ bt eax, 3
/ jbe
will jump if ZF is set (ecx==0), or if CF is set (bit 3 of EAX==1), causing a partial flag stall on most CPUs because the flags it reads don't all come from the last instruction to write any flags. On Sandybridge-family, it doesn't actually stall, just needs to insert a merging uop. cmp
/test
write all flags, but bt
leaves ZF unmodified. felixcloutier.com/x86/bt
cmp
on AArch64 are simpler than your answer makes it sound: it takes a 12-bit immediate optionally shifted by 12 bits, so you can have 0xYYY
or 0xYYY000
, and you can also effectively negate the immediate by using cmn
instead. This still supports your point, as cmp w0, #0xf000
is encodeable and cmp w0, #0xefff
is not. But the "rotate into any position" phrasing sounds more like a description of the "bitmask" immediates, which AFAIK are only available for bitwise logical instructions: and, or, eor
, etc.
0x1fe
isn't encodeable but 0xff0
is.) When I wrote this I hadn't understood the differences between AArch64 and ARM immediates, or that only bitwise boolean insns could use the bit-range / repeated bit-pattern encoding. (And mov
; or
with the zero reg is one way to take advantage of those encodings.)
You could say that line is correct in most scripting languages, since the extra character results in slightly slower code processing. However, as the top answer pointed out, it should have no effect in C++, and anything being done with a scripting language probably isn't that concerned about optimization.
Only if the people who created the computers are bad with boolean logic. Which they shouldn't be.
Every comparison (>=
<=
>
<
) can be done in the same speed.
What every comparison is, is just a subtraction (the difference) and seeing if it's positive/negative.
(If the msb
is set, the number is negative)
How to check a >= b
? Sub a-b >= 0
Check if a-b
is positive.
How to check a <= b
? Sub 0 <= b-a
Check if b-a
is positive.
How to check a < b
? Sub a-b < 0
Check if a-b
is negative.
How to check a > b
? Sub 0 > b-a
Check if b-a
is negative.
Simply put, the computer can just do this underneath the hood for the given op:
a >= b
== msb(a-b)==0
a <= b
== msb(b-a)==0
a > b
== msb(b-a)==1
a < b
== msb(a-b)==1
and of course the computer wouldn't actually need to do the ==0
or ==1
either.
for the ==0
it could just invert the msb
from the circuit.
Anyway, they most certainly wouldn't have made a >= b
be calculated as a>b || a==b
lol
a
is in a register and b
is a compile-time constant, then x86 can compute a-b
in one instruction (sub rax, 12345
or cmp
) but not b-a
. There is an instruction for reg - imm
but not the other way around. Many other machines have a similar situation.
Only if computation path depends on data:
a={1,1,1,1,1000,1,1,1,1}
while (i<=4)
{
for(j from 0 to a[i]){ do_work(); }
i++;
}
will compute 250 times more than while(i<4)
Real-world sample would be computing mandelbrot-set. If you include a pixel that iterates 1000000 times, it will cause a lag but the coincidence with <=
usage probability is too low.
Success story sharing
jg
andjnle
are the same instruction,7F
:-)