Removing Duplicates from Strings in Java: Comparative Analysis of LinkedHashSet and Stream API

Keywords: Java String Processing | LinkedHashSet | Duplicate Character Removal

Abstract: This paper provides an in-depth exploration of multiple approaches for removing duplicate characters from strings in Java. The primary focus is on the LinkedHashSet-based solution, which achieves O(n) time complexity while preserving character insertion order. Alternative methods including traditional loops and Stream API are thoroughly compared, with detailed analysis of performance characteristics, memory usage, and applicable scenarios. Complete code examples and complexity analysis offer comprehensive technical reference for developers.

Problem Context and Requirements Analysis

String manipulation is a common task in Java development, and removing duplicate characters is a representative example. In the user-provided examples, the input string "aabbccdef" must be transformed to "abcdef", and "abcdabcd" must become "abcd". The core requirement is to drop every repeated character while keeping the first occurrence of each character in its original order.

Diagnosis of Initial Implementation Issues

The user's initial code contains logical flaws:

public class test {
    public static void main(String[] args) {
        String input = new String("abbc");
        String output = new String();
        
        for (int i = 0; i < input.length(); i++) {
            for (int j = 0; j < output.length(); j++) {
                if (input.charAt(i) != output.charAt(j)) {
                    output = output + input.charAt(i);
                }
            }
        }
        
        System.out.println(output);
    }
}

The flaw lies in the inner loop. Because output starts empty, the inner loop body never runs, so no character is ever appended and the program prints an empty string. Even if the first character were seeded into output, the code appends as soon as it meets any non-matching character instead of waiting until all of output has been checked, so the same character can be appended several times and a later match is ignored.
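
One way to repair the nested-loop idea, shown here as a minimal sketch rather than the recommended solution, is to finish scanning the output first and only then decide whether to append. The class name NestedLoopDedup and the alreadySeen flag are illustrative names, not part of the original code.

```java
final class NestedLoopDedup {
    // Scan everything collected so far, and append only after the scan
    // has finished without finding a match.
    static String removeDuplicates(String input) {
        String output = "";
        for (int i = 0; i < input.length(); i++) {
            char current = input.charAt(i);
            boolean alreadySeen = false;
            for (int j = 0; j < output.length(); j++) {
                if (output.charAt(j) == current) {
                    alreadySeen = true;
                    break; // one match is enough
                }
            }
            if (!alreadySeen) {
                output = output + current; // decided after the inner loop
            }
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(removeDuplicates("abbc")); // prints "abc"
    }
}
```

This version is correct but rescans the output for every input character, so its running time grows with the product of the input length and the number of distinct characters.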

Optimized Solution Using LinkedHashSet

LinkedHashSet, part of the Java Collections Framework, combines the average constant-time lookup of HashSet with predictable, insertion-ordered iteration, which makes it a natural fit for this problem.

Core Implementation Principles

LinkedHashSet extends HashSet and maintains a doubly linked list running through its entries to record insertion order. When add is called, the set first looks the element up in the hash table. If the element is absent, it is stored in the table and appended to the linked list; if it is already present, add returns false and the element keeps its original position.
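
The behavior described above can be observed directly. The following is a small illustrative sketch; the class and method names are hypothetical and chosen only for this demonstration.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class LinkedHashSetOrderDemo {
    // Returns the distinct characters of the input in first-occurrence order.
    static Set<Character> distinctInOrder(String input) {
        Set<Character> seen = new LinkedHashSet<>();
        for (char c : input.toCharArray()) {
            seen.add(c); // returns false and changes nothing if c is already present
        }
        return seen;
    }

    public static void main(String[] args) {
        Set<Character> seen = new LinkedHashSet<>();
        System.out.println(seen.add('c')); // true: new element
        System.out.println(seen.add('a')); // true: new element
        System.out.println(seen.add('c')); // false: duplicate, position unchanged
        System.out.println(seen);          // [c, a]
        System.out.println(distinctInOrder("cabca")); // [c, a, b]
    }
}
```

A plain HashSet would accept and reject the same elements but would give no guarantee about the iteration order.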

Complete Code Implementation

import java.util.LinkedHashSet;
import java.util.Set;

public class StringDuplicateRemover {
    public static String removeDuplicates(String input) {
        // Null and empty inputs are returned unchanged.
        if (input == null || input.isEmpty()) {
            return input;
        }
        
        char[] chars = input.toCharArray();
        Set<Character> charSet = new LinkedHashSet<>();
        
        // Duplicates are silently ignored by add(); first occurrences keep their order.
        for (char c : chars) {
            charSet.add(c);
        }
        
        StringBuilder sb = new StringBuilder();
        for (Character character : charSet) {
            sb.append(character);
        }
        
        return sb.toString();
    }
    
    public static void main(String[] args) {
        String test1 = "aabbccdef";
        String test2 = "abcdabcd";
        
        System.out.println("Original: " + test1 + " -> Result: " + removeDuplicates(test1));
        System.out.println("Original: " + test2 + " -> Result: " + removeDuplicates(test2));
    }
}

Complexity Analysis

Time Complexity: O(n), where n is the length of the input string. Each add on a LinkedHashSet takes O(1) time on average, and building the result iterates over at most n stored characters.

Space Complexity: O(min(n, m)), where m is the size of the character set (m = 128 for ASCII, 256 for Latin-1, and at most 65,536 for Java's 16-bit char). In the worst case every character is unique and all of them must be stored.
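
Java's char is a 16-bit UTF-16 code unit, so a Set of Character values deduplicates code units rather than full Unicode characters. Characters outside the Basic Multilingual Plane, such as most emoji, occupy two code units and can be split apart by a char-based pass. The sketch below is one possible code-point-based variant, offered under the assumption that such inputs matter; the class and method names are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class CodePointDedup {
    // Deduplicates by Unicode code point instead of by UTF-16 code unit.
    static String removeDuplicateCodePoints(String input) {
        if (input == null || input.isEmpty()) {
            return input;
        }
        Set<Integer> seen = new LinkedHashSet<>();
        StringBuilder result = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i);
            if (seen.add(codePoint)) {          // true only for a first occurrence
                result.appendCodePoint(codePoint);
            }
            i += Character.charCount(codePoint); // advance by 1 or 2 code units
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeDuplicateCodePoints("aabbccdef")); // prints "abcdef"
    }
}
```

The asymptotic bounds are unchanged; only the unit being counted differs, and m grows to the number of Unicode code points.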

Comparative Analysis of Alternative Approaches

Stream API Method

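The Stream API, introduced in Java 8, offers a functional-style solution. Because distinct() is stable on an ordered stream such as the one returned by chars(), the first occurrence of each character is kept: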

// Requires: import java.util.stream.Collectors;
public static String removeDuplicatesWithStream(String input) {
    return input.chars()
               .distinct()
               .mapToObj(c -> String.valueOf((char) c))
               .collect(Collectors.joining());
}

Advantages: Concise code that fits the functional programming style.

Disadvantages: Creates a one-character String for each distinct character before joining, which adds allocation overhead when processing large strings.
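
If the intermediate String objects are a concern, the stream can collect straight into a StringBuilder. This is a hedged sketch of that idea, not a measured optimization; the class name StreamDedup is hypothetical, and the null check is an addition that the method above does not have.

```java
final class StreamDedup {
    // Collects distinct char values directly into a StringBuilder,
    // avoiding a temporary String per character.
    static String removeDuplicatesWithStream(String input) {
        if (input == null || input.isEmpty()) {
            return input;
        }
        return input.chars()
                .distinct() // stable on ordered streams: first occurrences are kept
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(removeDuplicatesWithStream("abcdabcd")); // prints "abcd"
    }
}
```

The three-argument collect takes a supplier, an accumulator, and a combiner; the combiner is only used if the stream runs in parallel.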

Boolean Array Method

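A lookup table indexed by character value avoids hashing altogether. This version assumes every character has a value below 256 (ASCII or Latin-1):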

public static String removeDuplicatesWithArray(String input) {
    // One flag per character value; assumes every char in input is below 256.
    // Any other char makes seen[c] throw ArrayIndexOutOfBoundsException.
    boolean[] seen = new boolean[256];
    StringBuilder result = new StringBuilder();
    
    for (int i = 0; i < input.length(); i++) {
        char c = input.charAt(i);
        if (!seen[c]) {
            result.append(c);
            seen[c] = true;
        }
    }
    
    return result.toString();
}

Advantages: Constant extra space (a fixed table of 256 flags) and no boxing or hashing.

Disadvantages: Works only for characters with values below 256. Any other character causes an ArrayIndexOutOfBoundsException, so the full Unicode character set is not supported.
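
The table idea can be extended to every possible char value with java.util.BitSet, at a cost of one bit per value. The following is a sketch under that assumption; the class name BitSetDedup is hypothetical, and the approach still works on UTF-16 code units rather than full code points.

```java
import java.util.BitSet;

final class BitSetDedup {
    // Same lookup-table technique, sized for all 65,536 char values.
    static String removeDuplicates(String input) {
        if (input == null || input.isEmpty()) {
            return input;
        }
        BitSet seen = new BitSet(Character.MAX_VALUE + 1); // 65,536 bits, about 8 KiB
        StringBuilder result = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (!seen.get(c)) {
                seen.set(c);
                result.append(c);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeDuplicates("aabbccdef")); // prints "abcdef"
    }
}
```

The table size is still a constant, so the space bound stays independent of the input length.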

Performance Testing and Benchmark Comparison

Benchmarking the three methods with JMH leads to the following qualitative comparison:

LinkedHashSet Method: Balanced performance in order preservation and generality
Stream API Method: Code conciseness for small-scale data, slight performance degradation at scale
Boolean Array Method: Optimal performance choice within known character set boundaries

Practical Application Scenario Recommendations

Recommended LinkedHashSet usage scenarios:

Processing strings with arbitrary character sets
High requirements for code readability and maintainability
Medium-scale data processing (string lengths between 10^4 and 10^6)

Consider alternative approaches when:

The input is known to be ASCII and maximum performance is required: Boolean array method
Conciseness matters most and the data is small: Stream API method

Extended Considerations and Optimization Directions

For ultra-large string processing (length exceeding 10^6), consider these optimization strategies:

Character encoding-specific compression techniques
Chunk processing strategies to reduce memory pressure
Parallel stream processing for multi-core environment performance enhancement
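
As an illustration of the chunk processing strategy, the sketch below reads the input through a java.io.Reader in fixed-size blocks, so that only the current block and the set of distinct characters are held in memory. The class name ChunkedDedup, the chunkSize parameter, and the String-based wrapper are assumptions made for this example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Set;

final class ChunkedDedup {
    // Reads the source chunk by chunk; the set persists across chunk boundaries.
    static String removeDuplicates(Reader source, int chunkSize) throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        Set<Character> seen = new LinkedHashSet<>();
        char[] buffer = new char[chunkSize];
        int read;
        while ((read = source.read(buffer, 0, buffer.length)) != -1) {
            for (int i = 0; i < read; i++) {
                seen.add(buffer[i]);
            }
        }
        StringBuilder result = new StringBuilder(seen.size());
        for (char c : seen) {
            result.append(c);
        }
        return result.toString();
    }

    // Convenience wrapper for in-memory input, mainly for demonstration.
    static String removeDuplicates(String input, int chunkSize) {
        try (Reader reader = new StringReader(input)) {
            return removeDuplicates(reader, chunkSize);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(removeDuplicates("abcdabcd", 3)); // prints "abcd"
    }
}
```

In practice the Reader would wrap a file or network stream, which is where avoiding a full in-memory copy of the input pays off.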

The LinkedHashSet solution presented in this paper strikes a good balance between generality, performance, and code readability, and is the recommended choice for most practical scenarios.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.