Impact of Cache Alignment and Loop Structure on Performance: An In-depth Analysis on Intel Core 2 Architecture

Nov 23, 2025 · Programming

Keywords: Cache Alignment | False Aliasing | Loop Optimization | Intel Core 2 | Performance Analysis

Abstract: This paper analyzes the performance differences of element-wise addition operations in separated versus combined loops on Intel Core 2 processors. The study identifies cache bank conflicts and false aliasing due to data alignment as primary causes. It details five performance regions and compares memory allocation strategies, providing theoretical and practical insights for loop optimization in high-performance computing.

Introduction

In modern computer architectures, cache performance critically influences program efficiency. This paper investigates the relationship between loop structure optimization and cache behavior through a specific performance case. Experimental data show that on an Intel Core 2 Duo processor, splitting a combined loop into two separate loops reduces execution time from 5.5 seconds to 1.9 seconds, indicating a significant performance improvement.

Experimental Setup and Initial Observations

The experiment uses the following core code snippets, where a1, b1, c1, and d1 are pointers to double-precision floating-point arrays in heap memory:

const int n = 100000;

// Combined loop version
for (int j = 0; j < n; j++) {
    a1[j] += b1[j];
    c1[j] += d1[j];
}

// Separated loop version
for (int j = 0; j < n; j++) {
    a1[j] += b1[j];
}
for (int j = 0; j < n; j++) {
    c1[j] += d1[j];
}

Compiled with Microsoft Visual C++ 10.0 with full optimization and SSE2 enabled, the combined loop takes 5.5 seconds, while the separated loops take only 1.9 seconds. Disassembly analysis reveals that the combined loop version generates more memory access instructions, suggesting potential cache inefficiencies.

Cache Alignment and False Aliasing Mechanisms

The core reason for the performance difference lies in the memory alignment of the data. When the arrays are allocated separately (e.g., each with its own new double[n]), large same-sized allocations tend to be placed at the same offset from a page boundary. With identical page offsets, element j of every array maps to the same L1 cache set, so the four access streams compete for the ways of a single set, leading to cache bank conflicts.

The Intel Core 2's L1 data cache is 8-way set-associative, so a single set can in principle hold lines from all four arrays; in practice, however, sustaining four aliased access streams in one set is measurably slower than sustaining two. In the combined loop, the interleaved accesses to a1, b1, c1, and d1 keep all four streams active at once and exacerbate these conflicts, whereas the separated loops keep at most two streams active at a time.

Additionally, such alignment can trigger false aliasing: the load/store units compare only the low-order address bits, so accesses to different arrays at the same page offset are conservatively treated as if they might touch the same data, forcing loads to wait on earlier stores and stalling the pipeline. Intel processors expose hardware performance counters for partial-address aliasing stalls, which confirm this effect in the combined loop.

Impact of Memory Allocation Strategies

To validate the alignment hypothesis, we compare two memory allocation strategies:

  1. Separate allocation: each of the four arrays is obtained from its own new double[n] call, so all of them tend to share the same page offset.
  2. Contiguous allocation: one block is allocated and partitioned into the four arrays, so their starting page offsets differ.

Experimental results: with separate allocation, the combined loop takes 6.206 seconds and the separated loops 2.116 seconds; with contiguous allocation, the combined loop takes 1.894 seconds and the separated loops 1.993 seconds. Contiguous allocation breaks the shared page alignment, reducing cache conflicts and bringing the two loop structures to nearly identical performance.

Performance Region Analysis

Based on dataset size, performance can be divided into five regions:

  1. Region 1: The dataset is very small, and performance is dominated by looping and branching overhead.
  2. Region 2: As data size increases, relative overhead decreases, and performance saturates. Separated loops are slightly slower due to additional overhead.
  3. Region 3: Data exceeds L1 cache capacity, and performance is limited by L1 to L2 cache bandwidth.
  4. Region 4: The combined loop's performance drops sharply due to alignment-induced false aliasing, while the separated loops are largely unaffected.
  5. Region 5: Data exceeds all cache levels, and performance is bound by memory bandwidth.

Architectural Comparisons and Generality

Although this paper focuses on Intel Core 2, similar phenomena are observable in other processors (e.g., Intel Core i7). Variations in cache size, associativity, and prefetching strategies affect specific thresholds, but the fundamental principles of alignment and false aliasing are universal.

Optimization Recommendations

Based on the analysis, the following optimizations are recommended:

  1. Allocate arrays that are accessed together as one contiguous block, or stagger their starting offsets, so they do not share the same page offset.
  2. When interleaved accesses to multiple identically aligned arrays cause stalls, split the combined loop into separate loops to reduce the number of simultaneously active access streams.
  3. Use hardware performance counters (e.g., partial-address aliasing stalls) to confirm that alignment is the actual bottleneck before and after the change.

Conclusion

This paper demonstrates through empirical analysis that cache alignment and false aliasing are key factors in performance differences between loop structures. Understanding these underlying mechanisms is essential for writing efficient numerical code. Future work could extend to more processor architectures and complex access patterns.
