Keywords: Java String Processing | LinkedHashSet | Duplicate Character Removal
Abstract: This paper explores several approaches for removing duplicate characters from strings in Java. The primary focus is the LinkedHashSet-based solution, which runs in expected O(n) time while preserving character insertion order. Alternative methods, including traditional loops and the Stream API, are compared in terms of performance characteristics, memory usage, and applicable scenarios. Complete code examples and complexity analysis provide a technical reference for developers.
Problem Context and Requirements Analysis
String manipulation is a common task in Java development, and duplicate character removal is a representative problem. In the user-provided examples, the input string "aabbccdef" must be transformed to "abcdef", while "abcdabcd" should become "abcd". The core requirement is to keep only the first occurrence of each character while preserving the original order of the characters that remain.
Diagnosis of Initial Implementation Issues
The user's initial code contains logical flaws:
public class test {
    public static void main(String[] args) {
        String input = new String("abbc");
        String output = new String();
        for (int i = 0; i < input.length(); i++) {
            for (int j = 0; j < output.length(); j++) {
                if (input.charAt(i) != output.charAt(j)) {
                    output = output + input.charAt(i);
                }
            }
        }
        System.out.println(output);
    }
}
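The main issue is that the duplicate check and the append are tangled together inside the inner loop. Because output starts empty, the inner loop body never executes, so no character is ever appended and the program prints an empty string. Even if output were seeded with a first character, appending on every mismatch would still be wrong: the decision is made once per comparison, before all characters already in output have been checked, so a character can be appended several times or appended despite a match later in output. The append must happen at most once per input character, after the inner loop has finished without finding a match.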
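The fix described above can be sketched as follows. This is a minimal illustration of a corrected loop-based version, not the user's code; the class name NestedLoopDedup is hypothetical, and a StringBuilder replaces repeated string concatenation.

```java
// Hypothetical class name; a sketch of a corrected loop-based approach.
class NestedLoopDedup {
    static String removeDuplicates(String input) {
        StringBuilder output = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            boolean found = false;
            // Scan everything emitted so far before deciding anything.
            for (int j = 0; j < output.length(); j++) {
                if (output.charAt(j) == c) {
                    found = true;
                    break;
                }
            }
            // Append once, and only after the whole scan found no match.
            if (!found) {
                output.append(c);
            }
        }
        return output.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeDuplicates("abbc"));      // prints abc
        System.out.println(removeDuplicates("aabbccdef")); // prints abcdef
    }
}
```

The inner scan costs time proportional to the number of distinct characters seen so far, so this version runs in O(n * k) for k distinct characters. That is acceptable for short strings, but it is the cost that a hash-based set avoids.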
Optimized Solution Using LinkedHashSet
LinkedHashSet, part of the Java Collections Framework, combines HashSet's expected constant-time lookup with a predictable iteration order (the order in which elements were inserted), which makes it a good fit for this problem.
Core Implementation Principles
LinkedHashSet extends HashSet and additionally maintains a doubly-linked list running through all of its entries; this list defines the iteration order. When add is called, the hash table is checked first. If the element is absent, it is stored in the table and linked at the end of the list; if it is already present, the set and its order are left unchanged.
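The behavior described above can be observed directly. The following sketch uses a hypothetical helper, trace, that records the boolean returned by add for each character ("+" for newly inserted, "-" for already present) and then shows the set's iteration order.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical demo class; it only illustrates add() and iteration order.
class LinkedHashSetOrderDemo {
    static String trace(String input) {
        Set<Character> seen = new LinkedHashSet<>();
        StringBuilder log = new StringBuilder();
        for (char c : input.toCharArray()) {
            // add() returns true only when the element was not already present.
            log.append(c).append(seen.add(c) ? "+" : "-").append(' ');
        }
        // Iteration (and therefore toString) follows first-insertion order.
        return log.toString().trim() + " => " + seen;
    }

    public static void main(String[] args) {
        System.out.println(trace("abca")); // prints a+ b+ c+ a- => [a, b, c]
    }
}
```

The second 'a' is rejected and does not move the existing entry, which is exactly the "keep the first occurrence" behavior the problem requires.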
Complete Code Implementation
import java.util.LinkedHashSet;
import java.util.Set;

public class StringDuplicateRemover {
    public static String removeDuplicates(String input) {
        // Null and empty inputs have nothing to deduplicate.
        if (input == null || input.isEmpty()) {
            return input;
        }
        char[] chars = input.toCharArray();
        // LinkedHashSet ignores repeated adds and iterates in insertion order.
        Set<Character> charSet = new LinkedHashSet<>();
        for (char c : chars) {
            charSet.add(c);
        }
        // Rebuild the string from the unique characters, first occurrence first.
        StringBuilder sb = new StringBuilder();
        for (Character character : charSet) {
            sb.append(character);
        }
        return sb.toString();
    }
    public static void main(String[] args) {
        String test1 = "aabbccdef";
        String test2 = "abcdabcd";
        System.out.println("Original: " + test1 + " -> Result: " + removeDuplicates(test1));
        System.out.println("Original: " + test2 + " -> Result: " + removeDuplicates(test2));
    }
}
Complexity Analysis
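Time Complexity: O(n) expected, where n is the input string length. Each add operation on a LinkedHashSet takes O(1) time on average.
Space Complexity: O(min(n, m)) for the set, where m denotes the character set size (m = 128 for ASCII, 256 for Latin-1, and at most 65,536 for Java's 16-bit char values). In the worst case every character is unique and all of them must be stored. The toCharArray call adds a temporary O(n) copy of the input, which can be avoided by iterating with charAt.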
Comparative Analysis of Alternative Approaches
Stream API Method
Java 8's Stream API offers a functional-style solution:
// Requires: import java.util.stream.Collectors;
public static String removeDuplicatesWithStream(String input) {
    return input.chars()
            // For an ordered stream, distinct() keeps the first occurrence of each value.
            .distinct()
            .mapToObj(c -> String.valueOf((char) c))
            .collect(Collectors.joining());
}
Advantages: Concise code, aligns with functional programming paradigms.
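Disadvantages: The pipeline adds overhead compared with a plain loop, including a String object for every distinct character and the bookkeeping that distinct() needs to track values already seen; unlike the LinkedHashSet version, it also throws a NullPointerException on null input.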
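One way to reduce the per-character String creation is to collect directly into a StringBuilder. The sketch below is a variation on the method above, not a replacement endorsed by the original text; the class name StreamDedup is hypothetical. It also switches from chars() to codePoints(), an assumption made here so that characters outside the Basic Multilingual Plane are treated as single units.

```java
// Hypothetical class name; a Stream-based variant that avoids
// building one String per character.
class StreamDedup {
    static String removeDuplicates(String input) {
        return input.codePoints()
                // Keeps the first occurrence of each code point, in order.
                .distinct()
                // Mutable reduction: supplier, accumulator, combiner.
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(removeDuplicates("aabbccdef")); // prints abcdef
    }
}
```

The three-argument collect is the IntStream form of mutable reduction; the combiner is only used if the stream is run in parallel.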
Boolean Array Method
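An optimized solution for inputs whose characters all have codes below 256 (ASCII or Latin-1):
public static String removeDuplicatesWithArray(String input) {
    // One flag per character code 0..255.
    boolean[] seen = new boolean[256];
    StringBuilder result = new StringBuilder();
    for (int i = 0; i < input.length(); i++) {
        char c = input.charAt(i);
        if (c >= seen.length) {
            // Without this check, seen[c] would throw ArrayIndexOutOfBoundsException.
            throw new IllegalArgumentException("Character outside supported range: " + c);
        }
        if (!seen[c]) {
            result.append(c);
            seen[c] = true;
        }
    }
    return result.toString();
}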
Advantages: High space efficiency, constant-level space complexity.
Disadvantages: Limited to specific character sets, doesn't support full Unicode character set.
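If the flag-per-character idea is attractive but the input is not restricted to a small character set, java.util.BitSet is one possible substitute for the fixed array. The following is a sketch under that assumption, with a hypothetical class name; it indexes by Unicode code point, and the BitSet grows on demand instead of being sized up front.

```java
import java.util.BitSet;

// Hypothetical class name; a flag-based variant that is not limited
// to character codes below 256.
class BitSetDedup {
    static String removeDuplicates(String input) {
        BitSet seen = new BitSet(); // grows as higher code points are set
        StringBuilder result = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (!seen.get(cp)) {
                seen.set(cp);
                result.appendCodePoint(cp);
            }
            // Advance by 1 or 2 chars depending on the code point.
            i += Character.charCount(cp);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeDuplicates("abcdabcd")); // prints abcd
    }
}
```

The trade-off is memory that depends on the highest code point encountered rather than on the number of distinct characters, at one bit per code point.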
Performance Testing and Benchmark Comparison
The methods can be compared with JMH benchmarks; qualitatively, they are characterized as follows:
- LinkedHashSet Method: Balanced performance in order preservation and generality
- Stream API Method: Code conciseness for small-scale data, slight performance degradation at scale
- Boolean Array Method: Optimal performance choice within known character set boundaries
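For readers who want a quick look before setting up JMH, the sketch below is a rough System.nanoTime harness, not a JMH benchmark. It is subject to JIT warm-up and garbage collection noise, so its numbers are only indicative. The class name, method names, input shape (repeating a to z), and round count are all assumptions made for illustration; the three methods restate the approaches from this paper in compact form.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Hypothetical harness; NOT a substitute for JMH.
class RoughDedupTiming {
    static String viaSet(String s) {
        Set<Character> set = new LinkedHashSet<>();
        for (char c : s.toCharArray()) set.add(c);
        StringBuilder sb = new StringBuilder();
        for (char c : set) sb.append(c);
        return sb.toString();
    }

    static String viaStream(String s) {
        return s.chars().distinct()
                .mapToObj(c -> String.valueOf((char) c))
                .collect(Collectors.joining());
    }

    // Assumes every character code is below 256 (true for sampleInput).
    static String viaArray(String s) {
        boolean[] seen = new boolean[256];
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!seen[c]) { seen[c] = true; sb.append(c); }
        }
        return sb.toString();
    }

    // Average nanoseconds per call after an equally long warm-up phase.
    static long time(UnaryOperator<String> f, String input, int rounds) {
        for (int i = 0; i < rounds; i++) f.apply(input);
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) f.apply(input);
        return (System.nanoTime() - start) / rounds;
    }

    // Builds "abc...zabc..." of the requested length.
    static String sampleInput(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) sb.append((char) ('a' + i % 26));
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = sampleInput(100_000);
        System.out.println("set    ns/op: " + time(RoughDedupTiming::viaSet, input, 200));
        System.out.println("stream ns/op: " + time(RoughDedupTiming::viaStream, input, 200));
        System.out.println("array  ns/op: " + time(RoughDedupTiming::viaArray, input, 200));
    }
}
```

Results will vary by JVM, hardware, and input distribution; any conclusion drawn from a harness like this should be confirmed with a proper JMH run.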
Practical Application Scenario Recommendations
Recommended LinkedHashSet usage scenarios:
- Processing strings with arbitrary character sets
- High requirements for code readability and maintainability
- Medium-scale data processing (string length between 10^4 and 10^6)
Consider alternative approaches when:
- Known ASCII character set with pursuit of ultimate performance: Boolean array method
- Priority on code conciseness with small data scale: Stream API method
Extended Considerations and Optimization Directions
For ultra-large string processing (length exceeding 10^6), consider these optimization strategies:
- Character encoding-specific compression techniques
- Chunk processing strategies to reduce memory pressure
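- Parallel stream processing on multi-core machines, keeping in mind that distinct() on an ordered parallel stream must still preserve encounter order, which is relatively expensive and may offset the gain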
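The chunk processing strategy can be sketched as follows, assuming the text arrives through a java.io.Reader rather than as one in-memory String. The class name, method signature, and chunk size are hypothetical. Only the current chunk and the set of distinct characters are held in memory at any time.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical class name; a sketch of chunked, streaming deduplication.
class ChunkedDedup {
    static String removeDuplicates(Reader source, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        Set<Character> seen = new LinkedHashSet<>();
        char[] buffer = new char[chunkSize];
        try {
            int read;
            // read() fills at most chunkSize chars and returns -1 at end of input.
            while ((read = source.read(buffer)) != -1) {
                for (int i = 0; i < read; i++) {
                    seen.add(buffer[i]);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder sb = new StringBuilder(seen.size());
        for (char c : seen) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // StringReader stands in for a file or network source.
        System.out.println(removeDuplicates(new StringReader("abcdabcd"), 3)); // prints abcd
    }
}
```

Because the set can never hold more than 65,536 distinct char values, memory use stays bounded regardless of input length. Like the main solution, this sketch works on 16-bit char values, so surrogate pairs are deduplicated per char rather than per code point.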
The LinkedHashSet solution presented in this paper strikes a good balance between generality, performance, and code readability, and is the recommended choice for most practical application scenarios.