Keywords: Java | ArrayList | Set | Deduplication | Performance Optimization
Abstract: This paper explores methods for handling duplicate values in a Java ArrayList, focusing on high-performance deduplication using Set implementations. Through a comparative analysis of the ArrayList.contains() method versus HashSet and LinkedHashSet, it explains which approach is the best choice in each scenario. The article provides complete implementation examples demonstrating how to remove duplicate records from time-series data, together with a complexity analysis of each solution.
Problem Background and Requirements Analysis
When processing time-series data, it is often necessary to read many record lines from a file, each containing a timestamp and several numerical fields. The raw data may contain records with duplicate timestamps, which must be identified and removed without compromising data integrity or processing efficiency.
Limitations of Traditional Approaches
Using the ArrayList.contains() method for duplicate checking is straightforward, but each call scans the list in O(n) time. Since every newly added element triggers such a scan, deduplicating n elements costs O(n²) overall, which performs poorly on large data sets.
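For reference, the naive approach described above can be sketched as follows (the class and method names here are illustrative, not from the original program):

```java
import java.util.ArrayList;
import java.util.List;

public class NaiveDedup {
    // Quadratic deduplication: contains() scans the result list on every add.
    public static ArrayList<String> dedup(List<String> input) {
        ArrayList<String> result = new ArrayList<>();
        for (String s : input) {
            if (!result.contains(s)) { // O(n) scan per element -> O(n^2) overall
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(dedup(List.of("a", "b", "a", "c", "b")));
        // prints [a, b, c]
    }
}
```

The result preserves first-occurrence order, but the nested scan is what the Set-based solutions below eliminate.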
High-Efficiency Solutions Based on Set
HashSet Implementation Approach
HashSet, implemented on top of a hash table, provides O(1) average time for contains and add operations, significantly improving deduplication efficiency. Below is the complete implementation code:
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Scanner;
import java.io.File;
public class DataProcessor {
    public static void main(String[] args) throws Exception {
        // Read all whitespace-separated tokens from the input file.
        Scanner scanner = new Scanner(new File("prova.txt"));
        ArrayList<String> rawData = new ArrayList<>();
        while (scanner.hasNext()) {
            rawData.add(scanner.next());
        }
        scanner.close();

        // Each record spans 14 tokens: a two-token timestamp plus 12 numeric fields.
        HashSet<String> uniqueTimestamps = new HashSet<>();
        ArrayList<String> processedData = new ArrayList<>();
        for (int i = 0; i + 13 < rawData.size(); i += 14) {
            String timestamp = rawData.get(i) + " " + rawData.get(i + 1);
            if (!uniqueTimestamps.contains(timestamp)) {
                uniqueTimestamps.add(timestamp);
                processedData.add(timestamp);
                for (int j = 2; j < 14; j++) {
                    processedData.add(rawData.get(i + j));
                }
            }
        }
        // Each stored record is 13 strings: the joined timestamp plus 12 fields.
        System.out.println("Processed unique records count: " + processedData.size() / 13);
    }
}

LinkedHashSet Order Preservation Approach
When element insertion order must be preserved, LinkedHashSet is the better choice: it offers the same O(1) average-time operations while maintaining insertion order:
import java.util.ArrayList;
import java.util.LinkedHashSet;
public class OrderedDataProcessor {
    public static ArrayList<String> processData(ArrayList<String> rawData) {
        // Track timestamps only; the data fields must not be deduplicated,
        // since the same numeric value can legitimately recur across records.
        LinkedHashSet<String> seenTimestamps = new LinkedHashSet<>();
        ArrayList<String> result = new ArrayList<>();
        for (int i = 0; i + 13 < rawData.size(); i += 14) {
            String timestamp = rawData.get(i) + " " + rawData.get(i + 1);
            if (seenTimestamps.add(timestamp)) { // add() returns false for duplicates
                result.add(timestamp);
                for (int j = 2; j < 14; j++) {
                    result.add(rawData.get(i + j));
                }
            }
        }
        return result;
    }
}

Performance Comparative Analysis
Performance characteristics of the three approaches:
- ArrayList.contains(): O(n²) overall, acceptable only for small-scale data
- HashSet: O(n) average time, appropriate for most scenarios
- LinkedHashSet: O(n) average time, additionally maintains insertion order
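The comparison can be made concrete with a small sketch (the class and method names, and the synthetic data set, are illustrative; absolute timings depend on the machine and JVM, so only the relative magnitudes matter):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class DedupComparison {
    // O(n^2): every add scans the growing result list.
    static ArrayList<String> dedupContains(List<String> data) {
        ArrayList<String> out = new ArrayList<>();
        for (String s : data) {
            if (!out.contains(s)) out.add(s);
        }
        return out;
    }

    // O(n): hash-based membership test, insertion order preserved.
    static ArrayList<String> dedupSet(List<String> data) {
        return new ArrayList<>(new LinkedHashSet<>(data));
    }

    public static void main(String[] args) {
        ArrayList<String> data = new ArrayList<>();
        for (int i = 0; i < 50_000; i++) {
            data.add("ts-" + (i % 2_000)); // 2,000 unique timestamps, heavily repeated
        }

        long t0 = System.nanoTime();
        ArrayList<String> a = dedupContains(data);
        long t1 = System.nanoTime();
        ArrayList<String> b = dedupSet(data);
        long t2 = System.nanoTime();

        System.out.println("contains():    " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("LinkedHashSet: " + (t2 - t1) / 1_000_000 + " ms");
        // Both keep the first occurrence of each value in order,
        // so the results are identical; only the running time differs.
        System.out.println("identical: " + a.equals(b));
    }
}
```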
Implementation Details and Best Practices
In practical applications, the following aspects require attention:
- Ensure timestamp string format consistency
- Properly handle file reading exceptions
- Select appropriate data structures based on data scale
- Balance memory usage efficiency with performance requirements
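The file-reading exception concern above can be addressed with try-with-resources, which closes the Scanner even if reading fails partway. This is a minimal sketch (the class name, method name, and the temporary demo file are illustrative, not part of the original program):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Scanner;

public class SafeReader {
    // try-with-resources guarantees the Scanner is closed,
    // whether reading succeeds or throws.
    public static ArrayList<String> readTokens(Path file) throws IOException {
        ArrayList<String> tokens = new ArrayList<>();
        try (Scanner scanner = new Scanner(file)) {
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // Demo with a temporary file; real code would read the data file directly.
        Path tmp = Files.createTempFile("records", ".txt");
        Files.writeString(tmp, "2021-01-01 12:00:00 1.5 2.5");
        System.out.println(readTokens(tmp)); // prints [2021-01-01, 12:00:00, 1.5, 2.5]
        Files.delete(tmp);
    }
}
```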
Extended Application Scenarios
This technical solution finds wide application in:
- Log file deduplication processing
- Database record uniqueness verification
- Real-time data stream duplicate detection
- Data cleaning in big data ETL processes
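For simple cases such as log-line deduplication, the same first-occurrence semantics are also available through the Stream API's distinct(), which on an ordered stream preserves encounter order (the class name and sample data here are illustrative):

```java
import java.util.List;
import java.util.stream.Collectors;

public class LogDedup {
    public static void main(String[] args) {
        List<String> log = List.of(
                "INFO start", "WARN disk", "INFO start", "ERROR io", "WARN disk");
        // distinct() keeps the first occurrence of each element, in order.
        List<String> unique = log.stream().distinct().collect(Collectors.toList());
        System.out.println(unique); // prints [INFO start, WARN disk, ERROR io]
    }
}
```

This trades some control for brevity; the explicit Set-based loops above remain preferable when records span multiple tokens, as in the time-series example.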