Keywords: Java | HTML Decoding | Character Entities | Apache Commons | Jsoup | Performance Optimization
Abstract: This article provides an in-depth exploration of various methods for decoding HTML character entities in Java. It begins with the StringEscapeUtils.unescapeHtml4() method from Apache Commons Text, which serves as the standard solution. Alternative approaches using the Jsoup library are then examined, including the text() method for plain text extraction and unescapeEntities() for direct entity decoding. For performance-critical scenarios, a detailed analysis of a custom unescapeHtml3() implementation is presented, covering core algorithms, character mapping mechanisms, and optimization strategies. Through complete code examples and comparative analysis, developers can select the most suitable decoding approach based on specific requirements.
Overview of HTML Character Entity Decoding
In web development and data processing, HTML character entity decoding is a common requirement. HTML uses character entities to represent special characters, such as &nbsp; for a non-breaking space and &gt; for the greater-than sign. These entities ensure that characters display correctly in HTML documents, but when the content is processed as plain text they need to be converted back to the original characters.
Apache Commons Text Solution
The Apache Commons Text library provides the StringEscapeUtils.unescapeHtml4() method, which is the standard solution for HTML 4.0 entity decoding. This method converts strings containing entity escapes into corresponding Unicode characters.
import org.apache.commons.text.StringEscapeUtils;

public class HtmlDecoder {
    public static String decodeHtml(String html) {
        return StringEscapeUtils.unescapeHtml4(html);
    }

    public static void main(String[] args) {
        // The input holds escaped markup; decoding restores the literal angle brackets
        String encoded = "&lt;p&gt;This is a sample.&lt;/p&gt;";
        String decoded = decodeHtml(encoded);
        System.out.println(decoded); // Output: <p>This is a sample.</p>
    }
}
This method supports the complete HTML 4.0 entity set, including named entities (e.g., &nbsp;) and numeric entities (e.g., &#8211;). For most application scenarios, this is the simplest and most reliable solution.
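To make the numeric form concrete, the following is a minimal standard-library sketch of how a single numeric reference maps to a Unicode code point. The class and method names (NumericEntityDemo, decodeNumeric) are hypothetical and are not part of Apache Commons Text; the library's own handling may differ in edge cases.

```java
public class NumericEntityDemo {
    // Decodes one numeric character reference such as "&#8211;" or "&#x2013;".
    // Returns the input unchanged if it is not a well-formed numeric reference.
    public static String decodeNumeric(String entity) {
        if (entity == null || entity.length() < 4
                || !entity.startsWith("&#") || !entity.endsWith(";")) {
            return entity;
        }
        String body = entity.substring(2, entity.length() - 1);
        int radix = 10;
        if (body.startsWith("x") || body.startsWith("X")) {
            radix = 16;               // "&#x...;" is the hexadecimal form
            body = body.substring(1);
        }
        try {
            int codePoint = Integer.parseInt(body, radix);
            if (codePoint <= 0 || !Character.isValidCodePoint(codePoint)) {
                return entity;
            }
            // toChars yields one char for the BMP and a surrogate pair above U+FFFF
            return new String(Character.toChars(codePoint));
        } catch (NumberFormatException ex) {
            return entity;            // e.g. "&#x;" or "&#abc;"
        }
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(decodeNumeric("&#8211;").codePointAt(0)));   // 2013
        System.out.println(Integer.toHexString(decodeNumeric("&#x2013;").codePointAt(0)));  // 2013
        System.out.println(Integer.toHexString(decodeNumeric("&#128512;").codePointAt(0))); // 1f600
        System.out.println(decodeNumeric("&#128512;").length());                            // 2
    }
}
```

The decimal and hexadecimal forms resolve to the same code point, and a code point above U+FFFF occupies two Java chars, which is why decoders need to handle surrogate pairs.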
Alternative Approaches with Jsoup
Jsoup is a Java library specifically designed for HTML processing. In addition to powerful HTML parsing capabilities, it also provides entity decoding functionality.
import org.jsoup.Jsoup;
import org.jsoup.parser.Parser;

public class JsoupDecoder {
    // Method 1: Extract plain text using text() method
    public static String decodeWithText(String html) {
        return Jsoup.parse(html).text();
    }

    // Method 2: Direct entity decoding using unescapeEntities
    public static String decodeEntities(String html) {
        return Parser.unescapeEntities(html, true);
    }

    public static void main(String[] args) {
        String encoded = "<p>This is a sample. &quot;Granny&quot; Smith &#8211;.</p>";
        String textResult = decodeWithText(encoded);
        System.out.println(textResult); // Output: This is a sample. "Granny" Smith –.
        String entityResult = decodeEntities(encoded);
        System.out.println(entityResult); // Output: <p>This is a sample. "Granny" Smith –.</p>
    }
}
The text() method removes all HTML tags and decodes entities, making it suitable for plain text extraction. The unescapeEntities() method only decodes entities while preserving HTML tag structure.
Custom High-Performance Implementation
For performance-sensitive applications, the Apache Commons implementation may be too heavyweight. The following is an optimized custom implementation specifically for HTML 3.x entities:
import java.io.StringWriter;
import java.util.HashMap;

public class CustomHtmlDecoder {
    private static final int MIN_ESCAPE = 2;
    private static final int MAX_ESCAPE = 6;

    private static final String[][] ESCAPES = {
        {"\"", "quot"}, {"&", "amp"}, {"<", "lt"}, {">", "gt"},
        {"\u00A0", "nbsp"}, {"\u00A9", "copy"}, {"\u00AE", "reg"}
        // Simplified entity mapping; full HTML 3.x entity set required in practice
    };

    private static final HashMap<String, CharSequence> lookupMap;
    static {
        lookupMap = new HashMap<String, CharSequence>();
        for (String[] escape : ESCAPES) {
            lookupMap.put(escape[1], escape[0]);
        }
    }

    public static String unescapeHtml3(String input) {
        if (input == null) return null;
        StringWriter writer = null;
        int len = input.length();
        int i = 1;
        int st = 0;
        while (true) {
            // Find ampersand
            while (i < len && input.charAt(i-1) != '&') {
                i++;
            }
            if (i >= len) break;

            // Find semicolon
            int j = i;
            while (j < len && j < i + MAX_ESCAPE + 1 && input.charAt(j) != ';') {
                j++;
            }
            if (j == len || j < i + MIN_ESCAPE || j >= i + MAX_ESCAPE + 1) {
                i++;
                continue;
            }

            // Process numeric entities
            if (input.charAt(i) == '#') {
                int k = i + 1;
                int radix = 10;
                char firstChar = input.charAt(k);
                if (firstChar == 'x' || firstChar == 'X') {
                    k++;
                    radix = 16;
                }
                try {
                    int entityValue = Integer.parseInt(input.substring(k, j), radix);
                    if (writer == null) {
                        writer = new StringWriter(input.length());
                    }
                    writer.append(input.substring(st, i - 1));
                    if (entityValue > 0xFFFF) {
                        char[] chrs = Character.toChars(entityValue);
                        writer.write(chrs[0]);
                        writer.write(chrs[1]);
                    } else {
                        writer.write(entityValue);
                    }
                } catch (NumberFormatException ex) {
                    i++;
                    continue;
                }
            } else {
                // Process named entities
                CharSequence value = lookupMap.get(input.substring(i, j));
                if (value == null) {
                    i++;
                    continue;
                }
                if (writer == null) {
                    writer = new StringWriter(input.length());
                }
                writer.append(input.substring(st, i - 1));
                writer.append(value);
            }
            st = j + 1;
            i = st;
        }
        if (writer != null) {
            writer.append(input.substring(st, len));
            return writer.toString();
        }
        return input;
    }
}
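As a point of comparison, the same scan-and-replace idea can be sketched with a StringBuilder in place of StringWriter (which is backed by a synchronized StringBuffer). This is an illustrative variant under stated assumptions, not the implementation above: the class name BuilderHtmlDecoder, the MAX_BODY limit of 8, and the five-entry table are all hypothetical choices. Whether it is actually faster would need to be measured.

```java
import java.util.HashMap;
import java.util.Map;

public class BuilderHtmlDecoder {
    // Hypothetical, deliberately small entity table; extend as needed.
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("quot", "\"");
        ENTITIES.put("amp", "&");
        ENTITIES.put("lt", "<");
        ENTITIES.put("gt", ">");
        ENTITIES.put("nbsp", "\u00A0");
    }

    // Longest text accepted between '&' and ';' (assumed limit, e.g. "#x1F600").
    private static final int MAX_BODY = 8;

    public static String unescape(String input) {
        if (input == null) return null;
        int amp = input.indexOf('&');
        if (amp < 0) return input;              // fast path: nothing to decode, no allocation
        StringBuilder out = new StringBuilder(input.length());
        int copied = 0;                          // start of the region not yet copied to out
        while (amp >= 0) {
            // Bounded search for ';' so a stray '&' never triggers a full rescan
            int semi = -1;
            int limit = Math.min(input.length(), amp + 2 + MAX_BODY);
            for (int k = amp + 1; k < limit; k++) {
                if (input.charAt(k) == ';') { semi = k; break; }
            }
            String replacement = (semi < 0) ? null : resolve(input.substring(amp + 1, semi));
            if (replacement != null) {
                out.append(input, copied, amp).append(replacement);
                copied = semi + 1;
                amp = input.indexOf('&', copied);
            } else {
                amp = input.indexOf('&', amp + 1);   // not an entity; the '&' stays literal
            }
        }
        return out.append(input, copied, input.length()).toString();
    }

    // Returns the decoded text for an entity body such as "lt" or "#8211", or null if unknown.
    private static String resolve(String body) {
        if (body.length() > 1 && body.charAt(0) == '#') {
            boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
            int radix = hex ? 16 : 10;
            String digits = body.substring(hex ? 2 : 1);
            // Reject empty bodies and leading signs, which Integer.parseInt would accept
            if (digits.isEmpty() || Character.digit(digits.charAt(0), radix) < 0) return null;
            try {
                int cp = Integer.parseInt(digits, radix);
                return (cp > 0 && Character.isValidCodePoint(cp))
                        ? new String(Character.toChars(cp)) : null;
            } catch (NumberFormatException ex) {
                return null;
            }
        }
        return ENTITIES.get(body);
    }

    public static void main(String[] args) {
        System.out.println(unescape("a &lt; b &amp;&amp; c &gt; d")); // a < b && c > d
        System.out.println(unescape("AT&T; R&D"));                    // AT&T; R&D
    }
}
```

Unrecognized sequences such as "&T;" are left untouched, and input without any ampersand is returned as the same String instance.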
Performance Comparison and Selection Guidelines
Different solutions offer varying advantages in performance and functionality:
- Apache Commons Text: Feature-complete, supports HTML 4.0 standard, suitable for most general scenarios
- Jsoup: Provides comprehensive HTML parsing in addition to entity decoding, ideal for complex HTML documents
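- Custom Implementation: Potentially the fastest option with the least allocation (it returns the input unchanged when no entity is found), but it decodes only the entities in its table and only references with at most six characters between & and ; (so &#x1F600; is left as-is); worth considering for high-concurrency or resource-constrained environments after measuring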
In practical projects, it is recommended to choose the appropriate solution based on specific requirements. If a project already uses Apache Commons or Jsoup, utilizing their provided decoding methods is the most convenient approach. For scenarios with extreme performance demands, custom implementations should be considered.
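Before replacing a library call with custom code, it helps to measure on representative input. The following is a rough timing sketch using only the standard library; the DecodeTimer class, its averageNanos method, and the stand-in decoder are hypothetical. It is not a rigorous benchmark, and a dedicated harness such as JMH is better suited to trustworthy numbers.

```java
import java.util.function.UnaryOperator;

public class DecodeTimer {
    // Runs the decoder `iterations` times after an equal warm-up
    // and returns the average nanoseconds per call.
    public static double averageNanos(UnaryOperator<String> decoder, String input, int iterations) {
        int sink = 0; // consume results so the calls are not trivially discarded
        for (int i = 0; i < iterations; i++) {
            sink += decoder.apply(input).length();   // warm-up pass
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += decoder.apply(input).length();   // timed pass
        }
        long elapsed = System.nanoTime() - start;
        if (sink == 42) System.out.print("");        // keep `sink` observable
        return (double) elapsed / iterations;
    }

    public static void main(String[] args) {
        String sample = "a &lt; b &amp;&amp; c &gt; d";
        // Stand-in decoder for illustration only; in a real project, pass the
        // candidates instead, e.g. StringEscapeUtils::unescapeHtml4.
        UnaryOperator<String> standIn =
                s -> s.replace("&lt;", "<").replace("&gt;", ">").replace("&amp;", "&");
        System.out.printf("stand-in: %.1f ns/call%n", averageNanos(standIn, sample, 100_000));
    }
}
```

Results depend heavily on input size, entity density, and JVM warm-up, so the comparison is only meaningful on data that resembles production content.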
Practical Application Examples
HTML entity decoding is a fundamental yet crucial step in scenarios such as web resource extraction, content processing, and data analysis. Proper decoding ensures data consistency and accuracy, preventing errors caused by character encoding issues.