Keywords: Java | String Matching | Regular Expressions | Word Boundaries | Apache Commons
Abstract: This article explores efficient methods for exact whole word matching in Java strings. Combining regular expressions with word boundaries and the StringUtils utility from Apache Commons Lang makes it possible to match several keywords in a single pass while tracking the position of each match. Performance considerations and optimization tips for large-scale text processing are also discussed.
Introduction
String manipulation is a common task in Java programming, with exact whole word matching being particularly essential in scenarios like text analysis, keyword extraction, or syntax highlighting. For instance, distinguishing between "woods" and "123woods" requires more than basic substring checks, as methods like String.contains() can lead to false positives.
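The false positive mentioned above can be reproduced with a small sketch. The class and method names here (ContainsVsWholeWord, containsSubstring, containsWholeWord) are illustrative and not part of the original article; only JDK classes are used.

```java
import java.util.regex.Pattern;

class ContainsVsWholeWord {
    // Naive substring check: true whenever the characters appear anywhere in the text.
    static boolean containsSubstring(String text, String word) {
        return text.contains(word);
    }

    // Whole-word check: the word must be delimited by word boundaries on both sides.
    // Pattern.quote() makes the word a literal, so regex metacharacters are not interpreted.
    static boolean containsWholeWord(String text, String word) {
        return Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "I will come and meet you at the 123woods";
        System.out.println(containsSubstring(text, "woods"));    // true (false positive)
        System.out.println(containsWholeWord(text, "woods"));    // false
        System.out.println(containsWholeWord(text, "123woods")); // true
    }
}
```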
Core Concept: Word Boundaries
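The \b metacharacter in regular expressions denotes a word boundary: a zero-width position where a word character (a letter, digit, or underscore) sits next to a non-word character such as a space or punctuation mark, or next to the start or end of the string. Wrapping a keyword in \b...\b therefore matches it only as a complete word. For example, in the string "I will come and meet you at the 123woods", \bwoods\b does not match inside "123woods": the "w" is preceded by the digit "3", which is itself a word character, so there is no boundary between them.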
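To make the boundary rule concrete, the following sketch checks \bwoods\b against a few ASCII inputs. The BoundaryCases class and the sample strings are made up for illustration; the point is that digits and underscores count as word characters, while punctuation and hyphens do not.

```java
import java.util.regex.Pattern;

class BoundaryCases {
    private static final Pattern WOODS = Pattern.compile("\\bwoods\\b");

    // True if "woods" occurs in the text as a complete word.
    static boolean hasWholeWord(String text) {
        return WOODS.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(hasWholeWord("woods"));           // true: string edges act as boundaries
        System.out.println(hasWholeWord("into the woods.")); // true: space before, period after
        System.out.println(hasWholeWord("woods-edge"));      // true: hyphen is a non-word character
        System.out.println(hasWholeWord("123woods"));        // false: digit before 'w' is a word character
        System.out.println(hasWholeWord("woods_trail"));     // false: underscore is a word character
    }
}
```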
Implementation Approach
The approach uses regular expressions with the Pattern and Matcher classes. First, construct a regex pattern that lists all keywords separated by |, the regex alternation (logical OR) operator, and wrap the group in word boundaries. The StringUtils.join() method from Apache Commons Lang simplifies building the alternation from a list of keywords.
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.commons.lang3.StringUtils;

    public class WordMatcher {
        public static void main(String[] args) {
            String text = "I will come and meet you at the woods 123woods and all the woods";

            // Keywords to search for as whole words
            List<String> tokens = new ArrayList<>();
            tokens.add("123woods");
            tokens.add("woods");

            // Builds \b(123woods|woods)\b
            String patternString = "\\b(" + StringUtils.join(tokens, "|") + ")\\b";
            Pattern pattern = Pattern.compile(patternString);
            Matcher matcher = pattern.matcher(text);

            // find() advances to the next match on each call
            while (matcher.find()) {
                System.out.println("Matched keyword: " + matcher.group(1) + ", position: " + matcher.start());
            }
        }
    }

In this code, patternString is built as \b(123woods|woods)\b, so only whole words are matched. Each call to Matcher.find() advances to the next match; matcher.group(1) returns the matched keyword and matcher.start() returns its start index. For the sample text, the program reports woods at index 32, 123woods at index 38, and woods at index 59. The keywords are inserted into the pattern verbatim, so this version assumes they contain no regex metacharacters.
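If the keywords may contain characters such as . or +, one possible variant escapes each keyword with Pattern.quote() before joining. The sketch below uses only the JDK (Collectors.joining in place of StringUtils.join); the QuotedWordMatcher class, its method names, and the "keyword@index" result format are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class QuotedWordMatcher {
    // Builds \b(?:kw1|kw2|...)\b with every keyword quoted as a literal.
    static Pattern compileWholeWords(List<String> keywords) {
        String alternation = keywords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b");
    }

    // Returns each match as "keyword@startIndex", in order of appearance.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            hits.add(matcher.group() + "@" + matcher.start());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "I will come and meet you at the woods 123woods and all the woods";
        Pattern pattern = compileWholeWords(Arrays.asList("123woods", "woods"));
        System.out.println(findAll(pattern, text)); // [woods@32, 123woods@38, woods@59]
    }
}
```

Quoting does not change how \b behaves: a boundary still requires a word character on one side, so a keyword that begins or ends with punctuation may need a different delimiter than \b.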
Performance Analysis and Optimization
Regex matching is fast enough for most workloads, provided the pattern is compiled once and reused rather than rebuilt for every input. For extremely large texts, specialized libraries like StringSearch offer high-performance search algorithms. Compared to alternatives like StringTokenizer or split()-based approaches, regex scans the text in place and does not create an intermediate token for every word, which reduces memory overhead.
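One common optimization is to compile the pattern once and reuse it across inputs. The sketch below is a minimal illustration of that idea, not a benchmark; the PrecompiledMatcher class and countMatches method are hypothetical names. It also reuses a single Matcher through Matcher.reset(), which is safe only within one thread.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrecompiledMatcher {
    // Compiled once; Pattern instances are immutable and can be shared.
    private static final Pattern WHOLE_WORDS = Pattern.compile("\\b(?:123woods|woods)\\b");

    // Counts whole-word matches over many lines without recompiling the pattern.
    static int countMatches(Iterable<String> lines) {
        Matcher matcher = WHOLE_WORDS.matcher(""); // Matcher is not thread-safe; keep it local
        int count = 0;
        for (String line : lines) {
            matcher.reset(line); // point the same Matcher at the next input
            while (matcher.find()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatches(
                Arrays.asList("the woods", "123woods and woods", "woodshed"))); // 3
    }
}
```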
Comparison with Alternative Methods
Other commonly suggested approaches include using String.matches() for a full-string check, or splitting the string with split() and testing whether the resulting list contains the keyword. Both return only a yes/no answer: they do not report where a match occurs, and they require a separate pass for each keyword, which makes them less suitable for complex matching needs.
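For comparison, the two alternatives might look like the sketch below. The AlternativeChecks class and its method names are illustrative. Because String.matches() must match the entire input, the keyword is surrounded by .* here, with (?s) so that . also matches line breaks.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class AlternativeChecks {
    // Full-string check: yes/no only, no match position.
    static boolean viaMatches(String text, String word) {
        return text.matches("(?s).*\\b" + Pattern.quote(word) + "\\b.*");
    }

    // Split on whitespace and test list membership; attached punctuation stays on the token.
    static boolean viaSplit(String text, String word) {
        return Arrays.asList(text.split("\\s+")).contains(word);
    }

    public static void main(String[] args) {
        System.out.println(viaMatches("meet you at the woods.", "woods")); // true
        System.out.println(viaMatches("meet you at the 123woods", "woods")); // false
        System.out.println(viaSplit("meet you at the woods", "woods"));  // true
        System.out.println(viaSplit("meet you at the woods.", "woods")); // false: token is "woods."
    }
}
```

The last case shows a practical limitation of whitespace splitting: a trailing period stays attached to the token, so the containment test misses a word that \b would find.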
Conclusion
Integrating regular expressions with word boundaries provides a robust solution for whole word matching in Java. It supports multiple keywords, accurate positioning, and efficient processing, ideal for diverse text analysis applications. Developers should optimize by precompiling patterns or employing high-performance libraries based on specific requirements.