Performance Difference Analysis of GROUP BY vs DISTINCT in HSQLDB: Exploring Execution Plan Optimization Strategies

Keywords: SQL performance optimization | GROUP BY vs DISTINCT difference | HSQLDB query execution plan

Abstract: This article examines the large performance difference observed between GROUP BY and DISTINCT queries run against the same data in HSQLDB. By analyzing execution plans, memory optimization strategies, and hash table mechanisms, it explains why GROUP BY can be 90 times faster than DISTINCT in specific scenarios. It also presents test data, compares the behavior with MySQL, and offers practical advice for optimizing query performance.

Introduction

In database query optimization, GROUP BY and DISTINCT are two common ways to deduplicate values. When GROUP BY is used without aggregate functions, the two should return identical result sets. In practice, however, their performance can differ drastically. This article is based on a case study: an HSQLDB table with 500,000 records, 5,000 distinct business keys, and no indexes. A DISTINCT query on this table took approximately 90 seconds, while the equivalent GROUP BY query required only 1 second. This gap prompts a closer technical investigation.
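The case-study data set can be approximated with a short script. The sketch below is only an illustration under stated assumptions: it uses SQLite from the Python standard library as a stand-in engine (not HSQLDB), and the table name `records` and the key format `key-N` are hypothetical. It shows that the two query forms return the same set of keys; it says nothing about HSQLDB timings.

```python
import sqlite3

# Stand-in engine: in-memory SQLite, NOT HSQLDB.
# The table name "records" and the "key-N" format are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER, business_key TEXT)")  # no indexes

TOTAL_ROWS = 500_000
DISTINCT_KEYS = 5_000

# Cycle through 5,000 key values so that each key appears 100 times.
rows = ((i, f"key-{i % DISTINCT_KEYS}") for i in range(TOTAL_ROWS))
conn.executemany("INSERT INTO records VALUES (?, ?)", rows)

distinct_rows = conn.execute(
    "SELECT DISTINCT business_key FROM records"
).fetchall()
grouped_rows = conn.execute(
    "SELECT business_key FROM records GROUP BY business_key"
).fetchall()

# Row order is not guaranteed by either query, so compare as sets.
same_result = set(distinct_rows) == set(grouped_rows)
print(len(distinct_rows), len(grouped_rows), same_result)
```

Comparing the results as sets matters here, because neither query carries an ORDER BY clause and an engine is free to return the keys in any order.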

Execution Plan Analysis

Running both queries through the EXPLAIN PLAN FOR command reveals similar plan structures, with the key differences appearing in flags such as isDistinctSelect and isGrouped. For the DISTINCT query, the plan reports isAggregated=[false], which is consistent with a sort-based deduplication method. In contrast, the GROUP BY query shows isGrouped=[true] with specified groupColumns, indicating that the engine may choose a different execution strategy. The flags do not name the algorithm themselves, so the strategies described below are inferences from the plans and the measured timings.

Specifically, the DISTINCT execution plan likely involves three steps: first, copying all business_key values to a temporary table; second, sorting the temporary table; and finally, scanning the sorted table and returning each item that differs from the previous one. This method keeps memory usage low, which suits large datasets or memory-constrained environments, but the sort can be costly in time, especially without indexes.
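The three steps described above can be sketched in a few lines. This is a minimal in-memory illustration of sort-based deduplication in general, not HSQLDB's actual implementation; the function name `distinct_by_sorting` and the sample keys are hypothetical.

```python
def distinct_by_sorting(values):
    """Deduplicate by sorting, then keeping each value that differs from its predecessor."""
    temp = list(values)  # step 1: copy all values into a temporary structure
    temp.sort()          # step 2: sort the copy, O(n log n) comparisons
    result = []
    for value in temp:   # step 3: single pass over the sorted copy
        if not result or value != result[-1]:
            result.append(value)
    return result


keys = ["k3", "k1", "k3", "k2", "k1"]
print(distinct_by_sorting(keys))  # ['k1', 'k2', 'k3']
```

Note that the cost of the sort depends on the total number of rows, not on the number of distinct keys, so the work is the same whether the input holds 5,000 or 500,000 distinct values.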

The GROUP BY execution plan, on the other hand, may use a hash table: it scans the full table, stores each business_key value in a hash table, and returns the keys once the scan completes. When memory is sufficient, this is much faster, because it avoids the cost of sorting and detects duplicates with hash lookups that take constant time on average. However, if the number of distinct keys is very large, the hash table may consume substantial memory, and performance can suffer.
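A hash-based pass can be sketched in the same way. Again, this is a generic illustration of the technique rather than HSQLDB's code; the function name `distinct_by_hashing` and the sample keys are hypothetical, and Python's built-in `set` plays the role of the hash table.

```python
def distinct_by_hashing(values):
    """Deduplicate in one scan, using a hash set to remember keys already seen."""
    seen = set()   # the hash table: grows with the number of DISTINCT keys
    result = []
    for value in values:       # single full scan of the input
        if value not in seen:  # average O(1) membership test
            seen.add(value)
            result.append(value)
    return result


keys = ["k3", "k1", "k3", "k2", "k1"]
print(distinct_by_hashing(keys))  # ['k3', 'k1', 'k2']
```

The memory held by `seen` is proportional to the number of distinct keys: small when 500,000 rows share 5,000 keys, and as large as the input when every key is unique.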

Performance Testing and Comparison

Further testing supports this interpretation. When all 500,000 records have distinct business keys, the DISTINCT query improves to 3 seconds, while GROUP BY slows to 9 seconds. This is consistent with the hash table method slowing down under high uniqueness due to memory pressure. In MySQL, the two queries perform similarly: with 5,000 distinct keys, both take 0.5 seconds; with all keys distinct, DISTINCT takes 11 seconds and GROUP BY 13 seconds. This suggests that the large gap is specific to HSQLDB's optimization strategy.

These results highlight the complexity of database query optimizers: they must balance memory usage against execution speed. In HSQLDB, for scenarios with few distinct keys, the hash table strategy of GROUP BY outperforms the sort-based strategy of DISTINCT, leading to a 90-fold performance gap. But as key uniqueness increases, the sort method may become more efficient, as it handles large datasets better without relying on extensive memory.
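To make the comparison easier to read, the timings reported in this section can be reduced to DISTINCT/GROUP BY ratios. The numbers below are copied from the text (the 90-second figure is approximate); the dictionary layout and variable names are illustrative choices.

```python
# (engine, scenario) -> (DISTINCT seconds, GROUP BY seconds), as reported in the text.
timings = {
    ("HSQLDB", "5,000 distinct keys"): (90.0, 1.0),
    ("HSQLDB", "500,000 distinct keys"): (3.0, 9.0),
    ("MySQL", "5,000 distinct keys"): (0.5, 0.5),
    ("MySQL", "500,000 distinct keys"): (11.0, 13.0),
}

# A ratio above 1 means GROUP BY was faster; below 1 means DISTINCT was faster.
ratios = {
    case: distinct_s / group_by_s
    for case, (distinct_s, group_by_s) in timings.items()
}

for (engine, scenario), ratio in ratios.items():
    print(f"{engine:7s} {scenario:22s} DISTINCT/GROUP BY = {ratio:.2f}")
```

The ratios are 90 and roughly 0.33 for HSQLDB, against 1 and roughly 0.85 for MySQL: the HSQLDB results swing by more than two orders of magnitude with key cardinality, while the MySQL results stay close to parity.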

Core Knowledge Points Summary

From a technical perspective, DISTINCT and GROUP BY are semantically equivalent when GROUP BY lists the selected columns and no aggregate functions are used, but the choice of execution plan can cause performance variations. Key factors include the number of distinct keys, the available memory, and the database system's optimization strategies. In this case, HSQLDB appears to use a hash table for GROUP BY, which gives fast deduplication when the keys fit in memory, and a memory-conserving sort for DISTINCT.

To optimize query performance, developers should consider three steps: first, analyzing the data distribution, for example by counting distinct keys; second, adding an index on the deduplicated column where possible, since an index may let the engine read keys in order instead of sorting them; and third, inspecting execution plans with tools like EXPLAIN to understand query behavior. In practice, if similar performance issues arise, try switching between the two query forms or adjusting the database configuration, such as increasing the memory allocation.
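The first piece of advice, analyzing the data distribution, amounts to comparing the number of distinct keys with the number of rows. The helper below is a hypothetical sketch working on an in-memory iterable of keys; against a real database the same figures would normally come from a COUNT query rather than from application code.

```python
def key_uniqueness(keys):
    """Return (distinct_count, total_count, distinct/total ratio) for an iterable of keys."""
    distinct = set()
    total = 0
    for key in keys:
        distinct.add(key)
        total += 1
    ratio = len(distinct) / total if total else 0.0
    return len(distinct), total, ratio


# The two scenarios from the case study, with synthetic integer keys.
few_keys = key_uniqueness(i % 5_000 for i in range(500_000))
all_unique = key_uniqueness(range(500_000))
print(few_keys)    # (5000, 500000, 0.01)
print(all_unique)  # (500000, 500000, 1.0)
```

A ratio near 0 corresponds to the scenario where the hash-based plan was fastest in the tests above, and a ratio near 1 to the scenario where the sort-based plan came out ahead.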

Conclusion

Through a case analysis, this article has explored the performance differences between GROUP BY and DISTINCT in HSQLDB. The divergence in execution strategies (sorting versus a hash table) is the most likely reason for the 90-fold speed difference. This finding underscores the importance of database query optimization and reminds developers to consider underlying execution strategies when writing SQL. Future research could extend the comparison beyond HSQLDB and MySQL to see how general these optimization behaviors are. Overall, understanding these mechanisms helps improve application performance and avoid potential bottlenecks.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.
