Keywords: XML | Java | DOM parsing | invalid characters | Unicode
Abstract: This technical article delves into the common error of invalid XML characters during Java DOM parsing, focusing on Unicode 0xc. It explains the underlying XML character set rules, provides insights into why such errors occur, and offers practical solutions including code examples to sanitize input before parsing.
Parsing XML documents with Java's Document Object Model (DOM) parser is a routine task in many applications, but it can be interrupted by errors such as org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0xc) was found. This exception indicates the presence of characters that violate the XML specification's character set constraints.
XML Character Set Rules
The World Wide Web Consortium (W3C) defines strict rules for allowable characters in XML documents. In XML 1.0, every code point below 0x20 is disallowed except 0x9 (horizontal tab), 0xA (line feed), and 0xD (carriage return). The surrogate range 0xD800 to 0xDFFF and the code points 0xFFFE and 0xFFFF are excluded as well. Character 0xc, the form feed, falls in the disallowed control range, so a conforming parser must reject any document that contains it.
Understanding the Error
When an XML parser encounters an invalid character like 0xc, it throws a SAXParseException because the document is not well-formed XML. This often occurs when XML data is generated from sources that include control characters or when files are corrupted.
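The failure is easy to reproduce in isolation. The sketch below is illustrative only: the class name FormFeedDemo, the helper parseError, and the sample document are assumptions made for this example, not part of any library. It feeds the standard JDK DOM parser a small document containing a form feed and returns the location and message carried by the resulting SAXParseException.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class FormFeedDemo {

    /** Returns a description of the parse error, or null if the input parsed cleanly. */
    static String parseError(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            // The exception records where in the input the parser gave up.
            return "line " + e.getLineNumber()
                + ", column " + e.getColumnNumber()
                + ": " + e.getMessage();
        } catch (Exception e) {
            // Parser configuration or I/O problems are unrelated to the demo.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // '\f' is the Java escape for the form feed, U+000C.
        String xml = "<note>page one\fpage two</note>";
        System.out.println(parseError(xml));
    }
}
```

The line and column numbers reported by the exception are usually the quickest way to locate the offending character in a large file.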
Solutions and Best Practices
To prevent this error, developers should ensure that XML data is sanitized before parsing. If you control the XML generation, avoid including disallowed characters. For existing XML files, pre-process the data to remove or replace invalid characters.
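When replacing is preferable to removing, for example to keep word boundaries where a form feed separated two pieces of text, a regular expression is one compact option. The following is a minimal sketch, assuming the whole input fits in memory as a String; the class name XmlCharReplacer and the method replaceInvalid are hypothetical names chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlCharReplacer {

    // Matches any code point outside the XML 1.0 Char production:
    // 0x9 | 0xA | 0xD | 0x20-0xD7FF | 0xE000-0xFFFD | 0x10000-0x10FFFF
    private static final Pattern INVALID_XML_CHARS = Pattern.compile(
        "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    /** Replaces every disallowed character in the input with the given replacement. */
    static String replaceInvalid(String in, String replacement) {
        if (in == null) return "";
        // quoteReplacement keeps '$' and '\' in the replacement literal.
        return INVALID_XML_CHARS.matcher(in)
            .replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // The form feed between the two words becomes a single space.
        System.out.println(replaceInvalid("page one\fpage two", " "));
    }
}
```

Java's regex engine matches by code point, so a valid surrogate pair is tested as one supplementary character and left in place, while an unpaired surrogate is replaced.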
Code Example: Filtering Invalid Characters
The following Java method strips every character outside the XML 1.0 allowed set from a string before parsing:
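public String stripNonValidXMLCharacters(String in) {
    if (in == null || in.isEmpty()) return "";
    StringBuilder out = new StringBuilder(in.length());
    // Walk the string by code point rather than by char. A Java char is
    // 16 bits, so it can never fall in the 0x10000-0x10FFFF range;
    // supplementary characters are stored as surrogate pairs and would
    // otherwise be dropped by the 0xD7FF/0xE000 bounds.
    for (int i = 0; i < in.length(); ) {
        int current = in.codePointAt(i);
        i += Character.charCount(current);
        if ((current == 0x9) ||
            (current == 0xA) ||
            (current == 0xD) ||
            ((current >= 0x20) && (current <= 0xD7FF)) ||
            ((current >= 0xE000) && (current <= 0xFFFD)) ||
            ((current >= 0x10000) && (current <= 0x10FFFF)))
            out.appendCodePoint(current);
    }
    return out.toString();
}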
This function iterates through the input string and retains only characters that are valid according to XML standards, effectively filtering out problematic ones like 0xc.
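To show the filter in context, here is a hedged end-to-end sketch that sanitizes a raw string and then hands it to the DOM parser. The class SanitizeThenParse and its helpers isXmlChar, strip, and rootText are names invented for this example, and the filter is written with a code point stream purely for brevity; it applies the same ranges as the method above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class SanitizeThenParse {

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Drops every code point that XML 1.0 does not allow. */
    static String strip(String in) {
        return in.codePoints()
            .filter(SanitizeThenParse::isXmlChar)
            .collect(StringBuilder::new,
                     StringBuilder::appendCodePoint,
                     StringBuilder::append)
            .toString();
    }

    /** Sanitizes the raw XML, parses it, and returns the root element's text. */
    static String rootText(String rawXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(strip(rawXml))));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("XML failed to parse after sanitizing", e);
        }
    }

    public static void main(String[] args) {
        // The form feed ('\f', U+000C) is removed before the parser sees it.
        System.out.println(rootText("<note>page one\fpage two</note>"));
        // prints "page onepage two"
    }
}
```

Because the character is removed rather than replaced, the two words run together in the output; substituting a space instead is a reasonable alternative when the control character acted as a separator.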
In conclusion, handling invalid XML characters requires adherence to the XML specification and proactive sanitization of data. By understanding the rules and implementing appropriate filters, developers can avoid parsing errors and ensure robust XML processing in Java applications.