Optimized Implementation for Detecting and Counting Repeated Words in Java Strings

Nov 27, 2025 · Programming

Keywords: Java | String Processing | Duplicate Detection | HashMap | Word Counting

Abstract: This article provides an in-depth exploration of effective methods for detecting repeated words in Java strings and counting their occurrences. By analyzing the structural characteristics of HashMap and LinkedHashMap, it details the complete process of word segmentation, frequency statistics, and result output. The article demonstrates how to maintain word order through code examples and compares performance in different scenarios, offering practical technical solutions for handling duplicate elements in text data.

Problem Background and Core Requirements

In practical programming scenarios, there is often a need to handle duplicate elements in text data. Taking the example string "House, House, House, Dog, Dog, Dog, Dog" as reference, the core requirements can be broken down into two key tasks: first, identifying and removing duplicate words to generate a new list without duplicates; second, accurately counting the occurrences of each word in the original string and storing the results in an appropriate data structure.

Technical Implementation Scheme Analysis

The Java Collections Framework provides powerful tools to address such problems. Implementations of the Map interface can complete the word counting task efficiently. HashMap is the usual first choice thanks to its average-case O(1) lookup and insertion, but when the insertion order of elements must be maintained, LinkedHashMap has a distinct advantage.

Core Algorithm Implementation Details

The input string first needs preprocessing using the split method with specified delimiters. In the example, commas are used as word separators, but real-world applications may need to consider more complex delimiter scenarios.
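For messier real-world input, split can take a regular expression instead of a fixed literal. The sketch below (class and method names are illustrative) tolerates varying amounts of whitespace around the commas:

```java
import java.util.Arrays;

public class DelimiterDemo {
    // Split on a comma with optional surrounding whitespace, rather than
    // the exact ", " literal used in the clean example input.
    static String[] tokenize(String input) {
        return input.trim().split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        String messy = "House ,House,  Dog , Dog";
        System.out.println(Arrays.toString(messy.split(", ")));      // literal split mishandles this
        System.out.println(Arrays.toString(tokenize(messy)));        // [House, House, Dog, Dog]
    }
}
```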

The word counting phase employs the following algorithm logic:

import java.util.HashMap;
import java.util.Map;

String inputString = "House, House, House, Dog, Dog, Dog, Dog";
Map<String, Integer> wordCountMap = new HashMap<>();
String[] words = inputString.split(", ");
for (String word : words) {
    // First occurrence: initialize to 1; otherwise increment the stored count.
    Integer currentCount = wordCountMap.get(word);
    if (currentCount == null) {
        wordCountMap.put(word, 1);
    } else {
        wordCountMap.put(word, currentCount + 1);
    }
}

This code implements precise counting logic by checking whether a word already exists in the map. For newly encountered words, the count is initialized to 1; for existing words, the count is incremented based on the previous value.
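Since Java 8, the explicit null check can be collapsed into a single Map.merge call with identical results. A minimal sketch (the class name is illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class MergeCount {
    // merge(word, 1, Integer::sum) inserts 1 for a new key, or applies
    // Integer::sum to the old value and 1 for an existing key.
    static Map<String, Integer> count(String input) {
        Map<String, Integer> wordCountMap = new HashMap<>();
        for (String word : input.split(", ")) {
            wordCountMap.merge(word, 1, Integer::sum);
        }
        return wordCountMap;
    }

    public static void main(String[] args) {
        System.out.println(count("House, House, House, Dog, Dog, Dog, Dog"));
    }
}
```

The behavior matches the explicit version; the choice between them is purely a matter of readability preference.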

Order-Preserving Optimization Solution

When application scenarios require maintaining the original occurrence order of words, LinkedHashMap provides an ideal solution:

Map<String, Integer> orderedWordCount = new LinkedHashMap<>();
// Same counting logic, but insertion order is preserved

This implementation is particularly useful when results need to be output in the order of first occurrence, such as in report generation or user interface displays.
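Put together, the order-preserving variant only differs from the HashMap version in the constructor call. A runnable sketch (class and method names are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OrderedCount {
    // Same counting logic as before, but LinkedHashMap iterates in
    // insertion order, so words come out in order of first appearance.
    static Map<String, Integer> countInOrder(String input) {
        Map<String, Integer> orderedWordCount = new LinkedHashMap<>();
        for (String word : input.split(", ")) {
            orderedWordCount.merge(word, 1, Integer::sum);
        }
        return orderedWordCount;
    }

    public static void main(String[] args) {
        // Prints {House=3, Dog=4}: first-appearance order is preserved.
        System.out.println(countInOrder("House, House, House, Dog, Dog, Dog, Dog"));
    }
}
```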

Result Extraction and Formatting

After completing the counting process, all unique words can be obtained through the keySet method:

Set<String> uniqueWords = wordCountMap.keySet();
List<Integer> countValues = new ArrayList<>(wordCountMap.values());

For the example input, this produces the unique word list ["House", "Dog"] and the corresponding counts [3, 4]. Note that this first-appearance ordering is only guaranteed when the counting was done with a LinkedHashMap; a plain HashMap makes no promise about iteration order.
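When words and counts are needed together, iterating entrySet pairs each unique word with its count in a single pass, avoiding separate keySet and values lookups. A small formatting sketch (class and method names are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ResultOutput {
    // Render each word/count pair on its own line, in map iteration order.
    static String format(Map<String, Integer> counts) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            sb.append(entry.getKey()).append(": ").append(entry.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("House", 3);
        counts.put("Dog", 4);
        System.out.print(format(counts));
        // House: 3
        // Dog: 4
    }
}
```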

Cross-Platform Technology Comparison

Excel's approach to counting duplicate values offers an instructive comparison: though implemented in a different environment, it shares the same core algorithmic concept. Excel combines the UNIQUE and COUNTIF functions to achieve similar functionality, which from another perspective validates the universality of the methods described in this article. In data processing, whether in programming languages or spreadsheet software, hash-based counting represents the classic solution for handling duplicate elements.

Performance Considerations and Best Practices

In practical applications, several optimization factors need consideration. For large-scale text processing, attention should be paid to memory usage efficiency, avoiding unnecessary object creation. Meanwhile, input data should be normalized before processing, such as by standardizing case and removing leading/trailing spaces, to ensure statistical accuracy.
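Normalization can be folded directly into the counting loop. The sketch below (class and method names are illustrative) assumes that lower-casing is acceptable for the application, so "House", "house " and " HOUSE" all land in the same bucket:

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class NormalizedCount {
    // Trim surrounding whitespace and standardize case before counting;
    // empty tokens (e.g. from a trailing comma) are skipped.
    static Map<String, Integer> countNormalized(String input) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String raw : input.split(",")) {
            String word = raw.trim().toLowerCase(Locale.ROOT);
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countNormalized("House, house , HOUSE, Dog"));
        // {house=3, dog=1}
    }
}
```

Passing Locale.ROOT avoids surprises from locale-sensitive case rules (the well-known Turkish dotless-i problem).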

Exception handling is also an important consideration in production environments, including handling empty inputs, invalid delimiters, and other scenarios to ensure program robustness.
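A defensive variant can guard the entry point so that null or blank input yields an empty map rather than a NullPointerException or a spurious entry keyed by the empty string. A minimal sketch (class and method names are illustrative):

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class SafeCount {
    // Null or blank input short-circuits to an immutable empty map;
    // otherwise count words as before.
    static Map<String, Integer> countSafely(String input) {
        if (input == null || input.trim().isEmpty()) {
            return Collections.emptyMap();
        }
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : input.split(", ")) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countSafely(null));       // {}
        System.out.println(countSafely("Dog, Dog")); // {Dog=2}
    }
}
```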

Application Scenario Extensions

The techniques described in this article can be widely applied in numerous fields including log analysis, text mining, and data cleaning. Through appropriate modifications, support can be extended for more complex tokenization rules, multilingual text processing, and other advanced features, providing a reliable technical foundation for various text processing tasks.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.