Keywords: data stream | median computation | heap data structure
Abstract: This paper thoroughly examines the problem of computing the running median from a stream of integers, with a focus on the two-heap algorithm based on max-heap and min-heap structures. It explains the core principles, implementation steps, and time complexity analysis, demonstrating through code examples how to maintain two heaps for efficient median tracking. Additionally, the paper discusses the algorithm's applicability, challenges under memory constraints, and potential extensions, providing comprehensive technical guidance for median computation in streaming data scenarios.
Introduction
In real-time data processing and stream computing, dynamically computing the running median is a common yet challenging problem. Traditional methods like sorting or simple statistics are inefficient in streaming scenarios, as they require recomputation with each new data arrival. This paper presents an efficient two-heap algorithm based on heap data structures, capable of processing each new element in O(log n) time while maintaining O(n) space complexity.
Algorithm Principle
The core idea of the two-heap algorithm is to use two heaps to maintain a partially ordered state of the data stream: a max-heap for the smaller half of the data and a min-heap for the larger half. Through carefully designed insertion and balancing strategies, the median can be quickly retrieved from the heap roots at any moment.
Implementation Steps
The specific implementation of the algorithm consists of two main steps: inserting new elements and balancing the heap structures.
Insertion Strategy
For each newly arrived integer, the insertion location is determined based on its comparison with the current max-heap root:
if (newElement <= maxHeap.peek()) {
maxHeap.add(newElement);
} else {
minHeap.add(newElement);
}
This strategy ensures that all elements in the max-heap are no greater than any element in the min-heap, maintaining the semi-ordered property of the data.
Balancing Mechanism
After insertion, the sizes of the two heaps may become unbalanced. To ensure correct median computation, a balancing operation is performed:
if (maxHeap.size() - minHeap.size() > 1) {
minHeap.add(maxHeap.poll());
} else if (minHeap.size() - maxHeap.size() > 1) {
maxHeap.add(minHeap.poll());
}
By moving root elements, the size difference between the two heaps is guaranteed not to exceed 1, which is crucial for median calculation.
Median Calculation
Based on the balanced heap structure, the median can be efficiently obtained using the following logic:
if (maxHeap.size() == minHeap.size()) {
median = (maxHeap.peek() + minHeap.peek()) / 2.0;
} else if (maxHeap.size() > minHeap.size()) {
median = maxHeap.peek();
} else {
median = minHeap.peek();
}
This calculation method has a time complexity of O(1), making it highly suitable for real-time applications.
Algorithm Analysis
The two-heap algorithm performs at most two heap insertions and two heap deletions per new element, each with O(log n) time complexity, resulting in an overall time complexity of O(log n). The space complexity is O(n), used to store all processed elements.
Extended Discussion
While the two-heap algorithm is applicable to most data types, optimizations can be considered in specific scenarios. For example, for integer data, counting sort can achieve an O(1) time complexity approximate algorithm. If exact median is not required, probability density estimation methods can further reduce computational overhead. Additionally, for memory-constrained environments, sampling techniques or sliding window strategies can be combined to reduce storage requirements.
Conclusion
The two-heap algorithm provides an efficient and general solution for computing the running median from data streams. By cleverly leveraging the properties of heap data structures, this algorithm achieves a good balance between time and space efficiency. In practical applications, suitable variants or optimization strategies can be selected based on specific data characteristics and performance requirements.