False Data Dependency of _mm_popcnt_u64 on Intel CPUs: Analyzing Performance Anomalies from 32-bit to 64-bit Loop Counters

Dec 01, 2025 · Programming

Keywords: false data dependency | popcnt performance | Intel microarchitecture | compiler optimization | loop variable type

Abstract: This paper investigates the phenomenon where changing a loop variable from 32-bit unsigned to 64-bit uint64_t causes a 50% performance drop when using the _mm_popcnt_u64 instruction on Intel CPUs. Through assembly analysis and microarchitectural insights, it reveals a false data dependency in the popcnt instruction that propagates across loop iterations, severely limiting instruction-level parallelism. The article details the effects of compiler optimizations, constant vs. non-constant buffer sizes, and the role of the static keyword, providing solutions via inline assembly to break dependency chains. It concludes with best practices for writing high-performance hot loops, emphasizing attention to microarchitectural details and compiler behaviors to avoid such hidden performance pitfalls.

When optimizing popcount operations on large datasets, a seemingly minor change—switching the loop variable from unsigned (32-bit) to uint64_t (64-bit)—can lead to a drastic 50% performance degradation. This anomaly reproduces on Intel CPUs like Haswell and is independent of compilers (GCC, Clang), sparking technical debate. Based on in-depth analysis, this article uncovers the root cause as a false data dependency in the popcnt instruction and explores compiler optimizations, microarchitectural nuances, and performance tuning strategies.

Performance Anomaly and Benchmarking

The benchmark code uses the _mm_popcnt_u64 intrinsic to compute the popcount of a random buffer, repeating the run 10,000 times for stable measurement. The only difference between the two versions is the inner loop index type: one uses unsigned, the other uint64_t. On a Haswell Core i7-4770K compiled with GCC (-O3 -march=native), the unsigned version achieves 26.113 GB/s while the uint64_t version drops to 13.8003 GB/s, a nearly 2x gap. Clang exhibits the same trend, ruling out a bug in any single compiler.

Core Mechanism of False Data Dependency

The performance difference stems from a false data dependency in the popcnt instruction on Intel CPUs. Although the instruction only writes to its destination register, on microarchitectures such as Sandy Bridge, Ivy Bridge, and Haswell it also waits for the destination register to be ready, as if it read it, which serializes what should be independent operations. This behavior is documented in Intel errata (e.g., HSD146 for Haswell) and impairs instruction-level parallelism.

The false dependency propagates across loop iterations, forming chains that the out-of-order engine cannot break. For instance, the slow version's iterations may form a popcnt-add-popcnt-popcnt chain, with each popcnt stalled on the previous result, whereas the fast version's popcnt-popcnt chains let consecutive iterations overlap.

Impact of Compiler Optimizations and Register Allocation

The loop variable type (unsigned vs. uint64_t) does not cause the issue directly; it influences the compiler's register allocation. Allocation decisions determine which variables map to which registers, and thus the structure of the false dependency chains. When multiple popcnt operations share the same destination register, the chain lengthens and performance degrades; assigning different destination registers breaks it.

Inline assembly experiments confirm this: with every popcnt sharing one destination register, performance falls to 8.49 GB/s; with distinct destination registers, it rises to 18.62 GB/s; and explicitly zeroing the destination with xor before each popcnt breaks the chain, restoring performance to 17.89 GB/s. This validates the central role of the false dependency.
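The xor-based chain break can be sketched as GCC-style extended inline assembly. The wrapper below is illustrative (the helper name and constraints are ours, not the original experiment's code), and the asm path applies to x86-64 only, with a portable fallback elsewhere:

```cpp
#include <cstdint>

// Zero the destination immediately before popcnt. The xor-zeroing idiom
// is recognized by the renamer as dependency-breaking, so the popcnt no
// longer waits on the register's previous value.
static inline uint64_t popcnt_nodep(uint64_t x) {
#if defined(__x86_64__)
    uint64_t r;
    __asm__("xorq %0, %0\n\t"      // break the false dependency on %0
            "popcntq %1, %0"
            : "=&r"(r)             // early-clobber: %0 must not alias %1
            : "rm"(x));
    return r;
#else
    return static_cast<uint64_t>(__builtin_popcountll(x)); // portable fallback
#endif
}
```

The early-clobber constraint (`=&r`) matters: because %0 is written by the xor before %1 is read, the compiler must not place both operands in the same register.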

Anomalous Behavior with Constant vs. Non-Constant Buffer Sizes

Changing the buffer size from a command-line argument (non-constant) to a compile-time constant (1 << 20) yields unexpected shifts: in GCC, the unsigned version drops from 26 GB/s to 20 GB/s while the uint64_t version improves to 20 GB/s, equalizing the two; in Clang, both drop to 15 GB/s. Constant-driven code generation, in other words, is not always beneficial.

This "deoptimization" arises because, with a known constant, compilers choose different code generation strategies, such as altered loop unrolling or register allocation, which can inadvertently exacerbate the false dependency. Assembly comparisons show the constant version uses immediate comparisons (e.g., cmp $0x100000,%rdx) where the non-constant version compares registers (cmp %rbp,%rcx), changing the layout of the dependency chains.
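The two size declarations the article contrasts can be written as follows (names are hypothetical; the shift by 20 converts a megabyte count into bytes, matching the 1 << 20 constant above):

```cpp
#include <cstdint>
#include <cstdlib>

// Non-constant: the size is only known at run time, from argv, so the
// loop bound lives in a register (cmp %rbp,%rcx style comparisons).
uint64_t size_from_argv(const char* arg) {
    return static_cast<uint64_t>(atol(arg)) << 20;
}

// Constant: the size is a compile-time value the optimizer can fold
// into an immediate comparison (cmp $0x100000,%rdx style).
constexpr uint64_t kSize = 1ULL << 20;
```

Which form wins depends on how the resulting comparison and register pressure interact with the false dependency chains, which is why the constant version can be slower.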

Role of the static Keyword

Adding the static keyword to the buffer size variable (e.g., static uint64_t size=atol(argv[1])<<20;) further alters performance: in GCC, the unsigned version maintains 26 GB/s, while uint64_t improves from 13 GB/s to 20 GB/s; on a colleague's CPU, uint64_t becomes even faster. Clang remains unaffected by static.

static changes the variable's storage duration, which can prompt different compiler decisions, such as treating its value as effectively constant or adjusting register allocation. This underscores how sensitive these optimizations are to microarchitectural details.

Solutions and Performance Tuning Recommendations

To reliably achieve optimal performance, consider these strategies:

  1. Use Inline Assembly to Break Dependency Chains: Explicitly zero destination registers (e.g., xor %rax, %rax) or use distinct registers to avoid false dependencies. Example code demonstrates improving performance from 8.49 GB/s to nearly 18 GB/s.
  2. Compiler Awareness and Updates: GCC 4.9.2 and later recognize this false dependency and generate compensating code, but Clang, MSVC, and others do not yet support it. Use the latest compilers and test different optimization flags.
  3. Microarchitecture-Specific Optimizations: The issue is fixed in Intel Cannon Lake and later CPUs; AMD CPUs lack this false dependency. Consider platform differences when writing portable code.
  4. Comprehensive Performance Testing: In hot loops, test various variable types, storage classes, and compiler settings, as minor changes can cause significant performance fluctuations.

Conclusion and Insights

This case study highlights a crucial lesson in high-performance computing: microarchitectural details and compiler optimizations can impact performance far beyond expectations. Hidden issues like false data dependencies, even in seemingly unrelated code (e.g., a loop variable's type), can create bottlenecks. Developers should read the generated assembly, consult microarchitecture errata, and benchmark alternative formulations rather than assume that source-level equivalence implies performance equivalence.

By combining low-level insights with empirical approaches, such performance pitfalls can be effectively avoided, enabling stable and efficient systems.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.