Keywords: Algorithm Optimization | Hash Set | Time Complexity Analysis
Abstract: This paper provides an in-depth analysis of algorithms for finding the smallest missing positive integer in a given sequence. By examining performance bottlenecks in the original solution, we propose an optimized approach using hash sets that achieves O(N) time complexity and O(N) space complexity. The article compares multiple implementation strategies including sorting, marking arrays, and cycle sort, with complete Java code implementations and performance analysis.
Problem Background and Definition
Given an array A containing N integers with elements in the range [-1000000, 1000000], we need to find the first positive integer (greater than 0) that does not appear in the array. For example, for array [1, 3, 6, 4, 1, 2], the smallest missing positive integer is 5; for [1, 2, 3], the result is 4; for [-1, -3], the result is 1.
Analysis of Original Solution Issues
The original solution used TreeSet to store array elements, which caused performance issues. TreeSet is implemented based on red-black trees, with insertion operations having O(logN) time complexity, resulting in overall O(NlogN) time complexity for building the set, failing to meet the required O(N) time complexity.
Main issues in the original code include:
Set<Integer> set = new TreeSet<>();
for (int a : A) {
set.add(a); // O(logN) per insertion
}
Additionally, the original solution used multiple loops: first building the set, then converting the set to an array, followed by marking operations, and finally scanning for the result. This multi-loop structure increased code complexity and execution time.
Optimized Solution
The optimized solution based on hash sets meets the O(N) time complexity requirement. HashSet insertion and query operations have average O(1) time complexity, allowing the complete algorithm to finish in linear time.
Core algorithm steps:
- Create a hash set to store all positive integers
- Iterate through the array, adding positive integers to the set
- Check each positive integer starting from 1 to see if it exists in the set
- Return the first positive integer not found in the set
Java implementation code:
public class Solution {
public int solution(int[] A) {
int N = A.length;
Set<Integer> set = new HashSet<>();
// Step 1: Collect all positive integers
for (int a : A) {
if (a > 0) {
set.add(a);
}
}
// Step 2: Find the smallest missing positive integer
for (int i = 1; i <= N + 1; i++) {
if (!set.contains(i)) {
return i;
}
}
return N + 1;
}
}
Algorithm Correctness Proof
The correctness of this algorithm is based on the observation that the smallest missing positive integer must be in the range 1 to N+1. The reasoning is as follows:
- If the array contains all positive integers from 1 to N, then the smallest missing positive integer is N+1
- If the array is missing some positive integer in the range 1 to N, then that number is the answer
- Since the array length is N, it can contain at most N distinct positive integers
Setting the loop range from 1 to N+1 ensures coverage of all possible cases.
Complexity Analysis
Time Complexity: O(N)
- First loop iterates through the array once, with each insertion averaging O(1)
- Second loop executes at most N+1 times, with each query averaging O(1)
- Overall time complexity is O(N)
Space Complexity: O(N)
- Hash set requires storage for N elements in the worst case
- Meets the required O(N) space complexity (excluding input storage)
Comparison with Other Methods
Sorting Method
The sorting-based approach first sorts the array, then sequentially scans for the missing positive integer:
public int solutionBySorting(int[] A) {
Arrays.sort(A); // O(NlogN)
int smallest = 1;
for (int i = 0; i < A.length; i++) {
if (A[i] == smallest) {
smallest++;
}
}
return smallest;
}
This method has O(NlogN) time complexity. While simple to implement, it cannot meet the linear time complexity requirement.
Marking Array Method
Using an additional boolean array to mark number occurrences:
public int solutionByMarking(int[] A) {
int n = A.length;
boolean[] visited = new boolean[n + 1];
for (int num : A) {
if (num > 0 && num <= n) {
visited[num] = true;
}
}
for (int i = 1; i <= n; i++) {
if (!visited[i]) {
return i;
}
}
return n + 1;
}
This method also has O(N) time complexity and O(N) space complexity, but requires additional array space.
Cycle Sort Method
Placing numbers in correct positions through in-place swapping:
public int solutionByCycleSort(int[] A) {
int n = A.length;
for (int i = 0; i < n; i++) {
while (A[i] > 0 && A[i] <= n && A[i] != A[A[i] - 1]) {
int temp = A[i];
A[i] = A[temp - 1];
A[temp - 1] = temp;
}
}
for (int i = 0; i < n; i++) {
if (A[i] != i + 1) {
return i + 1;
}
}
return n + 1;
}
This method achieves O(N) time complexity and O(1) space complexity, but implementation is more complex and modifies the original array.
Performance Testing and Validation
Extensive test cases validate the correctness and performance of the optimized solution:
- Boundary cases: empty arrays, all-negative arrays, all-positive arrays
- Large-scale data: random tests with N=100,000
- Special patterns: consecutive numbers, duplicate numbers, wide-range numbers
Test results show that the hash set-based solution works correctly in all cases and meets performance requirements.
Practical Application Scenarios
This algorithm has important applications in:
- Sequence number allocation in database systems
- Resource identifier management and allocation
- Item ID management in game development
- Unique identifier generation in distributed systems
Conclusion
This paper provides detailed analysis of multiple solutions for finding the smallest missing positive integer. The optimized solution based on hash sets achieves a good balance between time complexity, space complexity, and code simplicity, making it the recommended approach for such problems. Through proper algorithm design and data structure selection, we can meet performance requirements while ensuring code readability and maintainability.