Keywords: Branch Prediction | Performance Optimization | CPU Architecture
Abstract: This article explores why processing a sorted array is faster than an unsorted array, focusing on the branch prediction mechanism in modern CPUs. Through detailed code examples and performance comparisons, it explains how branch prediction works, the cost of misprediction, and variations under different compiler optimizations. It also provides optimization techniques to eliminate branches and analyzes compiler capabilities.
Introduction
In computer programming, a common performance optimization question is: why is processing a sorted array faster than an unsorted one? This seems counterintuitive, as the sorted state of an array should not affect simple element traversal and conditional checks. However, empirical tests show that sorted arrays can be processed several times faster. This article delves into the root cause—branch prediction—through code analysis and architectural principles.
Phenomenon and Code Example
Consider the following C++ code that generates an array of 32768 random integers and sums elements greater than or equal to 128. Tests compare unsorted and sorted arrays.
#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>

int main() {
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // Comment out this line for the unsorted test
    std::sort(data, data + arraySize);

    clock_t start = clock();
    long long sum = 0;
    for (unsigned i = 0; i < 100000; ++i) {
        for (unsigned c = 0; c < arraySize; ++c) {
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << '\n';
    std::cout << "sum = " << sum << '\n';
}

Results show the unsorted array takes about 11.54 seconds to process, while the sorted one takes 1.93 seconds, roughly a 6x speedup. Similar effects occur in Java and other languages, to varying degrees.
Principles of Branch Prediction
Branch prediction is a key technique in modern CPU architectures to optimize instruction pipeline execution. At conditional branches (e.g., if statements), the CPU predicts the branch direction (taken or not taken) to prefetch subsequent instructions. Correct predictions allow continuous pipeline flow; mispredictions require flushing the pipeline and reloading instructions, incurring performance penalties.
Analogous to a railroad switch operator: the operator predicts the train's direction before it arrives. Correct predictions allow uninterrupted passage; errors cause stops, reversals, and restarts. Similarly, CPU branch predictors use historical patterns for predictions, with high accuracy boosting performance.
Sorted Arrays and Branch Prediction
In a sorted array, the values, uniformly distributed between 0 and 255, appear in ascending order. The condition data[c] >= 128 is therefore always false (branch not taken) for roughly the first half of the array and always true (branch taken) for the second half. This predictable pattern enables high prediction accuracy.
Data sequence: 0, 1, 2, ..., 127, 128, 129, ..., 255
Branch behavior: N, N, N, ..., N, T, T, ..., T

where N denotes not taken and T denotes taken. Long runs of identical outcomes make the pattern easy for the predictor to learn.
In unsorted arrays, random data distribution leads to erratic branch behavior:
Data sequence: 226, 185, 125, 158, 198, ...
Branch behavior: T, T, N, T, T, ...

Random patterns prevent effective learning, causing misprediction rates near 50% and a significant performance drop.
Optimization: Eliminating Branches
If the compiler cannot remove the branch automatically, the programmer can eliminate it by hand with bitwise operations. For example, replace:

    if (data[c] >= 128)
        sum += data[c];

with:

    int t = (data[c] - 128) >> 31;
    sum += ~t & data[c];

This hack derives a mask from the sign bit to emulate the condition, avoiding the branch instruction entirely. Benchmarks show it improves performance on unsorted arrays and nearly closes the sorted-unsorted gap.
Compiler Optimization Variations
Different compilers exhibit varying branch optimization capabilities:
- GCC 4.6.1+ with -O3 generates conditional move instructions (CMOV), eliminating the branch and equalizing sorted and unsorted performance.
- Visual C++ 2010 fails to generate conditional moves even with /Ox.
- Intel C++ Compiler (ICC) 11 employs loop interchange to hoist the unpredictable branch out of the inner loop, optimizing performance further.
- Clang and modern GCC can vectorize conditional code using SIMD instructions.
These differences highlight the importance of understanding compiler-specific optimizations for performance tuning.
Performance Benchmarks
Tests on Core i7 920 @ 3.5 GHz:
<table><thead><tr><th>Scenario</th><th>C++ Time (seconds)</th><th>Java Time (seconds)</th></tr></thead><tbody><tr><td>Branch - Random Data</td><td>11.777</td><td>10.933</td></tr><tr><td>Branch - Sorted Data</td><td>2.352</td><td>5.644</td></tr><tr><td>Branchless - Random Data</td><td>2.564</td><td>3.114</td></tr><tr><td>Branchless - Sorted Data</td><td>2.587</td><td>3.186</td></tr></tbody></table>

The data confirms both the benefit of branch prediction on sorted data and the stable performance of branchless code on random data.
Conclusion and Best Practices
Branch prediction is crucial for CPU performance optimization. Avoid data-dependent branches in critical loops to enhance speed. For unavoidable branches, consider branchless techniques or compiler optimizations. Practical advice includes:
- Profile code to identify and optimize high-frequency branches.
- Test performance across compilers and optimization flags.
- Use data preprocessing or branchless algorithms in performance-sensitive contexts.
Understanding branch prediction enables developers to write efficient code that leverages hardware capabilities effectively.