I'm writing some code in Java where, at some point, the flow of the program is determined by whether two int variables, "a" and "b", are non-zero (note: a and b are never negative, and never within integer overflow range).
I can evaluate it with
if (a != 0 && b != 0) { /* Some code */ }
Or alternatively
if (a*b != 0) { /* Some code */ }
Because I expect that piece of code to run millions of times per run, I was wondering which one would be faster. I did the experiment by comparing them on a huge randomly generated array, and I was also curious to see how the sparsity of the array (fraction of data = 0) would affect the results:
long time;
final int len = 50000000;
int arbitrary = 0;
int[][] nums = new int[2][len];
for (double fraction = 0 ; fraction <= 0.9 ; fraction += 0.0078125) {
for(int i = 0 ; i < 2 ; i++) {
for(int j = 0 ; j < len ; j++) {
double random = Math.random();
if(random < fraction) nums[i][j] = 0;
else nums[i][j] = (int) (random*15 + 1);
}
}
time = System.currentTimeMillis();
for(int i = 0 ; i < len ; i++) {
if( /*insert nums[0][i]*nums[1][i]!=0 or nums[0][i]!=0 && nums[1][i]!=0*/ ) arbitrary++;
}
System.out.println(System.currentTimeMillis() - time);
}
And the results show that if you expect "a" or "b" to be equal to 0 more than ~3% of the time, a*b != 0
is faster than a!=0 && b!=0
:
https://i.stack.imgur.com/POHYD.png
I'm curious to know why. Could anyone shed some light? Is it the compiler or is it at the hardware level?
Edit: Out of curiosity... now that I learned about branch prediction, I was wondering what the analog comparison would show for a OR b is non-zero:
https://i.stack.imgur.com/GpJoM.png
We do see the same effect of branch prediction as expected, interestingly the graph is somewhat flipped along the X-axis.
Update
1- I added !(a==0 || b==0)
to the analysis to see what happens.
2- I also included a != 0 || b != 0
, (a+b) != 0
and (a|b) != 0
out of curiosity, after learning about branch prediction. But they are not logically equivalent to the other expressions, because only a OR b needs to be non-zero to return true, so they are not meant to be compared for processing efficiency.
3- I also added the actual benchmark that I used for the analysis, which is just iterating an arbitrary int variable.
4- Some people were suggesting to include a != 0 & b != 0
as opposed to a != 0 && b != 0
, with the prediction that it would behave more closely to a*b != 0
because we would remove the branch prediction effect. I didn't know that &
could be used with boolean variables, I thought it was only used for binary operations with integers.
Note: In the context that I was considering all this, int overflow is not an issue, but that's definitely an important consideration in general contexts.
CPU: Intel Core i7-3610QM @ 2.3GHz
Java version: 1.8.0_45 Java(TM) SE Runtime Environment (build 1.8.0_45-b14) Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
if (!(a == 0 || b == 0))
? Microbenchmarks are notoriously unreliable, this is unlikely to really be measurable (~3% sounds like a margin of error to me).
a != 0 & b != 0
.
a*b!=0
has one less branch
(1<<16) * (1<<16) == 0
yet both are different from zero.
a*b
is zero if one of a
and b
is zero; a|b
is zero only if both are.
I'm ignoring the issue that your benchmarking might be flawed, and taking the result at face value.
Is it the compiler or is it at the hardware level?
That latter, I think:
if (a != 0 && b != 0)
will compile to 2 memory loads and two conditional branches
if (a * b != 0)
will compile to 2 memory loads, a multiply and one conditional branch.
The multiply is likely to be faster than the second conditional branch if the hardware-level branch prediction is ineffective. As you increase the ratio ... the branch prediction is becoming less effective.
The reason that conditional branches are slower is that they cause the instruction execution pipeline to stall. Branch prediction is about avoiding the stall by predicting which way the branch is going to go and speculatively choosing the next instruction based on that. If the prediction fails, there is a delay while the instruction for the other direction is loaded.
(Note: the above explanation is oversimplified. For a more accurate explanation, you need to look at the literature provided by the CPU manufacturer for assembly language coders and compiler writers. The Wikipedia page on Branch Predictors is good background.)
However, there is one thing that you need to be careful about with this optimization. Are there any values where a * b != 0
will give the wrong answer? Consider cases where computing the product results in integer overflow.
UPDATE
Your graphs tend to confirm what I said.
There is also a "branch prediction" effect in the conditional branch a * b != 0 case, and this comes out in the graphs.
If you project the curves beyond 0.9 on the X-axis, it looks like 1) they will meet at about 1.0 and 2) the meeting point will be at roughly the same Y value as for X = 0.0.
UPDATE 2
I don't understand why the curves are different for the a + b != 0
and the a | b != 0
cases. There could be something clever in the branch predictors logic. Or it could indicate something else.
(Note that this kind of thing can be specific to a particular chip model number or even version. The results of your benchmarks could be different on other systems.)
However, they both have the advantage of working for all non-negative values of a
and b
.
I think your benchmark has some flaws and might not be useful for inferring about real programs. Here are my thoughts:
(a|b)!=0 and (a+b)!=0 test if either value is non-zero, whereas a != 0 && b != 0 and (a*b)!=0 test if both are non-zero. So you are not comparing the timing of just the arithmetic: if the condition is true more often, it causes more executions of the if body, which takes more time too.
(a+b)!=0 will do the wrong thing for positive and negative values that sum to zero, so you can't use it in the general case, even if it works here. Also for a=b=0x80000000 (MIN_VALUE), the only set bit will overflow out the top.
Similarly, (a*b)!=0 will do the wrong thing for values that overflow. Random example: 196608 * 327680 is 0 because the true result happens to be divisible by 232, so its low 32 bits are 0, and those bits are all you get if it's an int operation.
The VM will optimize the expression during the first few runs of the outer (fraction) loop, when fraction is 0, when the branches are almost never taken. The optimizer may do different things if you start fraction at 0.5.
Unless the VM is able to eliminate some of the array bounds checks here, there are four other branches in the expression just due to the bounds checks, and that's a complicating factor when trying to figure out what's happening at a low level. You might get different results if you split the two-dimensional array into two flat arrays, changing nums[0][i] and nums[1][i] to nums0[i] and nums1[i].
CPU branch predictors detect short patterns in the data, or runs of all branches being taken or not taken. Your randomly generated benchmark data is the worst-case scenario for a branch predictor. If real-world data has a predictable pattern, or it has long runs of all-zero and all-non-zero values, the branches could cost much less.
The particular code that is executed after the condition is met can affect the performance of evaluating the condition itself, because it affects things like whether or not the loop can be unrolled, which CPU registers are available, and if any of the fetched nums values need to be reused after evaluating the condition. Merely incrementing a counter in the benchmark is not a perfect placeholder for what real code would do.
System.currentTimeMillis() is on most systems not more accurate than +/- 10 ms. System.nanoTime() is usually more accurate.
There are lots of uncertainties, and it's always hard to say anything definite with these sorts of micro-optimizations because a trick that is faster on one VM or CPU can be slower on another. If running the 32-bit HotSpot JVM, rather than the 64-bit version, be aware that it comes in two flavors: with the "Client" VM having different (weaker) optimizations compared to the "Server" VM.
If you can disassemble the machine code generated by the VM, do that rather than trying to guess what it does!
a|b != 0
doesn't have any of the problems of a+b
. Regardless of the footnote about C++ compilers optimizing a||b
into a|b
when it's safe, e.g. NULL deref impossible. (godbolt.org/z/EKnYbo7En) That's something Java implementations could do; something to keep an eye out for in the future if not now.
The answers here are good, though I had an idea that might improve things.
Since the two branches and associated branch prediction are the likely culprit, we may be able to reduce the branching to a single branch without changing the logic at all.
bool aNotZero = (nums[0][i] != 0);
bool bNotZero = (nums[1][i] != 0);
if (aNotZero && bNotZero) { /* Some code */ }
It may also work to do
int a = nums[0][i];
int b = nums[1][i];
if (a != 0 && b != 0) { /* Some code */ }
The reason being, by the rules of short circuiting, if the first boolean is false, the second should not be evaluated. It has to perform an extra branch to avoid evaluating nums[1][i]
if nums[0][i]
was false. Now, you may not care that nums[1][i]
gets evaluated, but the compiler can't be certain that it won't throw an out of range or null ref when you do. By reducing the if block to simple bools, the compiler may be smart enough to realize that evaluating the second boolean unnecessarily won't have negative side effects.
a
and b
had side-effects you would have kept them). You still have &&
so you still have a branch.
aZero = (nums[0][i] == 0)
etc. and if (! (aZero | bZero) )
or something. Or I guess Java without implicit bool->int conversion would need nums[...] == 0 ? 1 : 0
. That might require several asm instructions to actually implement, so would be slower than 2 branches if the first one (or both) predicted well.
bool
type; it's called boolean
.
When we take the multiplication, even if one number is 0, then the product is 0. While writing
(a*b != 0)
It evaluates the result of the product thereby eliminating the first few occurrences of the iteration starting from 0. As a result the comparisons are less than that when the condition is
(a != 0 && b != 0)
Where every element is compared with 0 and evaluated. Hence the time required is less. But I believe that the second condition might give you more accurate solution.
a
is zero then b
needs not to be evaluated since whole expression is already false. So every element is compared is not true.
You are using randomized input data which makes the branches unpredictable. In practice branches are often (~90%) predictable so in real code the branchful code is likely to be faster.
That said. I don't see how a*b != 0
can be faster than (a|b) != 0
. Generally integer multiplication is more expensive than a bitwise OR. But things like this occasionally get weird. See for example the "Example 7: Hardware complexities" example from Gallery of Processor Cache Effects.
&
is not a "bitwise OR" but (in this case) a "logical AND" because both operands are booleans and it's not |
;-)
&
isn't a logical AND, it's a bitwise AND. Same as in C. So a & b
is not equivalent to a && b
. But a|b
is equivalent to a || b
, just because of the difference in how "any bits set in either" vs. "no bits set" vs. "no intersecting bits" works.
&
is logical AND without short-circuit, and &&
is logical AND with short-circuit behavior.
boolean
and int
, it's not just a "logical AND" in quotes, i.e. not a bitwise AND that happens to be used on 1-bit operands like in C++. In Java it's truly a logical AND when used on boolean
operands, and you can't do true & 123
.
I know the question is old, but for curiosity and training I did some tests using JMH. The results were slightly different:
bit-wise OR (a | b != 0) and multiplication (a * b != 0) were the fastest;
normal AND (a!=0 & b!=0) was almost as good as multiplication
short-circuiting AND and OR (a!=0 && b!=0, !(a!=0 || b!=0) were slowest
Disclaimer: I am not even near an expert in JMH.
Here the code, I tried to reproduce the code posted in question, added bit-wise OR:
@Warmup(iterations = 5, time = 100, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 10, time = 100, timeUnit = TimeUnit.MILLISECONDS)
@Fork(value = 3)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class MultAnd {
public static void main(String[] args) throws RunnerException {
Options opt = new OptionsBuilder()
.include(MultAnd.class.getSimpleName())
.build();
new Runner(opt).run();
}
private static final int size = 50_000_000;
@Param({"0.00", "0.10", "0.20", "0.30", "0.40", "0.45",
"0.50", "0.55", "0.60", "0.70", "0.80", "0.90",
"1.00"})
private double fraction;
private int[][] nums;
@Setup
public void setup() {
nums = new int[2][size];
for(int i = 0 ; i < 2 ; i++) {
for(int j = 0 ; j < size ; j++) {
double random = Math.random();
if (random < fraction)
nums[i][j] = 0;
else
nums[i][j] = (int) (random*15 + 1);
}
}
}
@Benchmark
public int and() {
int s = 0;
for (int i = 0; i < size; i++) {
if ((nums[0][i]!=0) & (nums[1][i]!=0))
s++;
}
return s;
}
@Benchmark
public int andAnd() {
int s = 0;
for (int i = 0; i < size; i++) {
if ((nums[0][i]!=0) && (nums[1][i]!=0))
s++;
}
return s;
}
@Benchmark
public int bitOr() {
int s = 0;
for (int i = 0; i < size; i++) {
if ((nums[0][i] | nums[1][i]) != 0)
s++;
}
return s;
}
@Benchmark
public int mult() {
int s = 0;
for (int i = 0; i < size; i++) {
if (nums[0][i]*nums[1][i] != 0)
s++;
}
return s;
}
@Benchmark
public int notOrOr() {
int s = 0;
for (int i = 0; i < size; i++) {
if (!((nums[0][i]!=0) || (nums[1][i]!=0)))
s++;
}
return s;
}
}
And the the results:
REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.
Benchmark (fraction) Mode Cnt Score Error Units
MultAnd.and 0.00 avgt 30 33.238 ± 0.883 ms/op
MultAnd.and 0.10 avgt 30 48.011 ± 1.635 ms/op
MultAnd.and 0.20 avgt 30 48.284 ± 1.797 ms/op
MultAnd.and 0.30 avgt 30 47.969 ± 1.463 ms/op
MultAnd.and 0.40 avgt 30 48.999 ± 2.881 ms/op
MultAnd.and 0.45 avgt 30 47.804 ± 1.053 ms/op
MultAnd.and 0.50 avgt 30 48.332 ± 1.990 ms/op
MultAnd.and 0.55 avgt 30 47.457 ± 1.210 ms/op
MultAnd.and 0.60 avgt 30 127.530 ± 3.104 ms/op
MultAnd.and 0.70 avgt 30 92.630 ± 1.471 ms/op
MultAnd.and 0.80 avgt 30 69.458 ± 1.609 ms/op
MultAnd.and 0.90 avgt 30 55.421 ± 1.443 ms/op
MultAnd.and 1.00 avgt 30 30.672 ± 1.387 ms/op
MultAnd.andAnd 0.00 avgt 30 33.187 ± 0.978 ms/op
MultAnd.andAnd 0.10 avgt 30 110.943 ± 1.435 ms/op
MultAnd.andAnd 0.20 avgt 30 177.527 ± 2.353 ms/op
MultAnd.andAnd 0.30 avgt 30 226.247 ± 1.879 ms/op
MultAnd.andAnd 0.40 avgt 30 266.316 ± 18.647 ms/op
MultAnd.andAnd 0.45 avgt 30 258.120 ± 2.638 ms/op
MultAnd.andAnd 0.50 avgt 30 259.727 ± 3.532 ms/op
MultAnd.andAnd 0.55 avgt 30 248.706 ± 1.419 ms/op
MultAnd.andAnd 0.60 avgt 30 229.825 ± 1.256 ms/op
MultAnd.andAnd 0.70 avgt 30 177.911 ± 2.787 ms/op
MultAnd.andAnd 0.80 avgt 30 121.303 ± 2.724 ms/op
MultAnd.andAnd 0.90 avgt 30 67.914 ± 1.589 ms/op
MultAnd.andAnd 1.00 avgt 30 15.892 ± 0.349 ms/op
MultAnd.bitOr 0.00 avgt 30 33.271 ± 1.896 ms/op
MultAnd.bitOr 0.10 avgt 30 35.597 ± 1.536 ms/op
MultAnd.bitOr 0.20 avgt 30 42.366 ± 1.641 ms/op
MultAnd.bitOr 0.30 avgt 30 58.385 ± 0.778 ms/op
MultAnd.bitOr 0.40 avgt 30 85.567 ± 1.678 ms/op
MultAnd.bitOr 0.45 avgt 30 32.152 ± 1.345 ms/op
MultAnd.bitOr 0.50 avgt 30 32.190 ± 1.357 ms/op
MultAnd.bitOr 0.55 avgt 30 32.335 ± 1.384 ms/op
MultAnd.bitOr 0.60 avgt 30 31.910 ± 1.026 ms/op
MultAnd.bitOr 0.70 avgt 30 31.783 ± 0.807 ms/op
MultAnd.bitOr 0.80 avgt 30 31.671 ± 0.745 ms/op
MultAnd.bitOr 0.90 avgt 30 31.329 ± 0.708 ms/op
MultAnd.bitOr 1.00 avgt 30 30.530 ± 0.534 ms/op
MultAnd.mult 0.00 avgt 30 30.859 ± 0.735 ms/op
MultAnd.mult 0.10 avgt 30 33.933 ± 0.887 ms/op
MultAnd.mult 0.20 avgt 30 34.243 ± 0.917 ms/op
MultAnd.mult 0.30 avgt 30 33.825 ± 1.155 ms/op
MultAnd.mult 0.40 avgt 30 34.309 ± 1.334 ms/op
MultAnd.mult 0.45 avgt 30 33.709 ± 1.277 ms/op
MultAnd.mult 0.50 avgt 30 37.699 ± 4.029 ms/op
MultAnd.mult 0.55 avgt 30 36.374 ± 2.948 ms/op
MultAnd.mult 0.60 avgt 30 100.354 ± 1.553 ms/op
MultAnd.mult 0.70 avgt 30 69.570 ± 1.441 ms/op
MultAnd.mult 0.80 avgt 30 47.181 ± 1.567 ms/op
MultAnd.mult 0.90 avgt 30 33.552 ± 0.999 ms/op
MultAnd.mult 1.00 avgt 30 30.775 ± 0.548 ms/op
MultAnd.notOrOr 0.00 avgt 30 15.701 ± 0.254 ms/op
MultAnd.notOrOr 0.10 avgt 30 68.052 ± 1.433 ms/op
MultAnd.notOrOr 0.20 avgt 30 120.393 ± 2.299 ms/op
MultAnd.notOrOr 0.30 avgt 30 177.729 ± 2.438 ms/op
MultAnd.notOrOr 0.40 avgt 30 229.547 ± 1.859 ms/op
MultAnd.notOrOr 0.45 avgt 30 250.660 ± 4.810 ms/op
MultAnd.notOrOr 0.50 avgt 30 258.760 ± 2.190 ms/op
MultAnd.notOrOr 0.55 avgt 30 258.010 ± 1.018 ms/op
MultAnd.notOrOr 0.60 avgt 30 254.732 ± 2.076 ms/op
MultAnd.notOrOr 0.70 avgt 30 227.148 ± 2.040 ms/op
MultAnd.notOrOr 0.80 avgt 30 180.193 ± 4.659 ms/op
MultAnd.notOrOr 0.90 avgt 30 112.212 ± 3.111 ms/op
MultAnd.notOrOr 1.00 avgt 30 33.458 ± 0.952 ms/op
https://i.stack.imgur.com/WJUZK.png
Success story sharing
if
.a&b
anda|b
). They are, but not perfectly, that is the puzzle.a*b != 0
anda+b != 0
benchmark differently is becausea+b != 0
is not at all equivalent and should never have been benchmarked. For example, witha = 1, b = 0
, the first expression evaluates to false but the second evaluates to true. The multiply acts sort of like an and operator, whereas the add acts sort of like an or operator.n
zeros then the likelihood of botha
andb
being zero increases withn
. In anAND
operation, with highern
the probability of one of them being non-zero increases and the condition is met. This is opposite for anOR
operation (probability of either one of them being zero increases withn
). This is based on a mathematical perspective. I am not sure if that's how the hardware works.