In-depth Analysis and Solution for XML Parsing Error "White spaces are required between publicId and systemId"

Dec 07, 2025 · Programming

Keywords: XML parsing | DOCTYPE error | Java DOM

Abstract: This article explores the "White spaces are required between publicId and systemId" error encountered during Java DOM XML parsing. Through a case study of a cross-domain AJAX proxy implemented in JSP, it shows that the error stems from a missing system identifier (systemId) in the DOCTYPE declaration rather than from a literal missing space. The article details the structural requirements of the XML document type declaration, provides specific code fixes, and discusses how to handle XML documents containing a DOCTYPE so that parsing exceptions are avoided.

Problem Background and Error Phenomenon

In web development, cross-domain data access is often implemented via server-side proxies. The case discussed involves using JSP as a proxy to request remote XML data through jQuery AJAX. The developer encountered an HTTP 500 error with the specific message "White spaces are required between publicId and systemId." This error occurred when the JSP page used the DocumentBuilder.parse() method to parse XML content fetched from a remote server.

Root Cause Analysis

At first glance, the error message suggests that a space is missing between publicId and systemId in the DOCTYPE declaration. The real problem is usually not a missing space but an incomplete DOCTYPE: the XML specification requires that a DOCTYPE with a public identifier (publicId) also specify a system identifier (systemId), even if the latter is an empty string.

An example of an incorrect DOCTYPE declaration is:

<!DOCTYPE persistence PUBLIC "http://java.sun.com/xml/ns/persistence/persistence_1_0.xsd">

This declaration lacks a systemId, causing the parser to throw a misleading error. The correct declaration should include a systemId, even if empty:

<!DOCTYPE persistence PUBLIC "http://java.sun.com/xml/ns/persistence/persistence_1_0.xsd" "">

In this case, the remote XML response likely contained an incomplete DOCTYPE, which the proxy code passed to the parser without any handling.
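To see the behaviour in isolation, the following minimal sketch feeds both declarations to the JDK's default DocumentBuilder from in-memory strings. The class name DoctypeCheck and the helper isWellFormed are illustrative and not part of the original proxy. The load-external-dtd feature is a Xerces-specific setting, assumed here to be supported by the parser in use (the JDK's built-in parser accepts it); it keeps the parser from trying to fetch a DTD for the empty system identifier.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Illustrative helper, not part of the original proxy code.
class DoctypeCheck {
    static final String PUBLIC_ID =
            "http://java.sun.com/xml/ns/persistence/persistence_1_0.xsd";

    // Public identifier only: not well-formed.
    static final String BROKEN =
            "<!DOCTYPE persistence PUBLIC \"" + PUBLIC_ID + "\"><persistence/>";

    // Public identifier followed by an empty system identifier: well-formed.
    static final String FIXED =
            "<!DOCTYPE persistence PUBLIC \"" + PUBLIC_ID + "\" \"\"><persistence/>";

    /** Returns true if the parser accepts the document, false on a SAX parse error. */
    public static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            // Xerces-specific feature (assumption: the parser in use supports it).
            // It stops the non-validating parser from loading an external DTD.
            dbf.setFeature(
                    "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            DocumentBuilder db = dbf.newDocumentBuilder();
            db.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Covers the SAXParseException raised for the missing system identifier.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("broken accepted: " + isWellFormed(BROKEN));
        System.out.println("fixed accepted:  " + isWellFormed(FIXED));
    }
}
```

The only difference between the two inputs is the trailing empty literal, which makes it a convenient regression check when the remote feed changes.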

Technical Details and XML Specifications

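The XML 1.0 specification defines the syntax of the DOCTYPE declaration with two productions:

doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)? S? ('[' intSubset ']' S?)? '>'
ExternalID  ::= 'SYSTEM' S SystemLiteral | 'PUBLIC' S PubidLiteral S SystemLiteral

Here S stands for white space. The external identifier as a whole is optional, but once PUBLIC is used, the public identifier must be followed by white space and a system literal. This explains the wording of the error: after reading the public identifier, the parser expects white space and then the system identifier, and finds '>' instead. Parsers such as Apache Xerces raise a SAXParseException for documents that violate this rule. The exact message varies by implementation, but the underlying problem is the missing system identifier.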

In Java, DocumentBuilder.parse() delegates to an underlying parser (such as Xerces, or the legacy Crimson parser) that checks the document for well-formedness. When the input contains an incomplete DOCTYPE, the parser cannot build the document tree and throws an exception.
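When diagnosing such failures, it helps to log where the parser gave up. The sketch below catches SAXParseException and reports the line and column it carries. The class ParseDiagnostics and the method describe are hypothetical names chosen for illustration, and the message text itself depends on the parser implementation and locale.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Illustrative helper, not part of the original proxy code.
class ParseDiagnostics {
    /** Returns "ok" on success, or a one-line description of the failure. */
    public static String describe(String xml) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            db.parse(new InputSource(new StringReader(xml)));
            return "ok";
        } catch (SAXParseException e) {
            // Line and column point at the position where parsing stopped.
            return "parse error at line " + e.getLineNumber()
                    + ", column " + e.getColumnNumber() + ": " + e.getMessage();
        } catch (SAXException | ParserConfigurationException | IOException e) {
            return "failure: " + e;
        }
    }

    public static void main(String[] args) {
        String broken = "<!DOCTYPE persistence PUBLIC "
                + "\"http://java.sun.com/xml/ns/persistence/persistence_1_0.xsd\">"
                + "<persistence/>";
        System.out.println(describe(broken));
        System.out.println(describe("<persistence/>"));
    }
}
```

In a proxy, returning this description in the HTTP 500 body (or writing it to the server log) makes it clear that the fault lies in the remote document rather than in the proxy logic.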

Solutions and Code Implementation

For the JSP proxy code in the case, the issue may stem from the remote XML response containing an incomplete DOCTYPE. Solutions include modifying the proxy code to handle such cases or ensuring the remote server returns normalized XML.

First, inspect the parsing logic in the proxy code. The original code calls db.parse(urlToQuery), which asks the parser to fetch the URL itself and gives the proxy no chance to inspect or repair the response. Parsing an InputStream obtained from URLConnection instead, via db.parse(in), gives more control: the proxy can set timeouts, check the response, or preprocess the content before parsing. Switching to a stream does not by itself fix a malformed DOCTYPE.

An example of a modified JSP code snippet:

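// Requires java.io.InputStream, java.net.URLConnection, javax.xml.parsers.*, org.w3c.dom.Document
URLConnection conn = url.openConnection(); // url: java.net.URL of the remote XML
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setValidating(false); // The default; turns off DTD validation, not well-formedness checks
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc;
try (InputStream in = conn.getInputStream()) { // Close the stream once parsing is done
    doc = db.parse(in); // Parse the input stream instead of the URL string
}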

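Note that setValidating(false) is already the default and only turns off DTD validation. A DOCTYPE with a public identifier but no system identifier is a well-formedness error, which the parser reports even when validation is off, so this setting alone cannot be relied on to suppress the exception. The DOCTYPE has to be corrected at the source or repaired before parsing.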

Another approach is to preprocess the XML content to ensure DOCTYPE completeness. For example, use string operations to add an empty systemId:

String xmlContent = readStreamToString(in); // Custom helper that reads the stream into a String
// If the DOCTYPE has a public identifier followed directly by '>', append an empty systemId.
// The pattern does not match when a system identifier is already present.
xmlContent = xmlContent.replaceFirst(
        "(<!DOCTYPE\\s+\\S+\\s+PUBLIC\\s+\"[^\"]*\")\\s*>", "$1 \"\">");
Document doc = db.parse(new InputSource(new StringReader(xmlContent)));

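This method gives the proxy full control over the content, but it is fragile. A simple pattern like this one only covers a double-quoted public identifier followed directly by '>', and reading the response into a String requires decoding it with the correct character encoding.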
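A slightly more defensive variant of the string repair can be sketched as follows. It is anchored to the DOCTYPE declaration, accepts single-quoted or double-quoted public identifiers, and also covers a declaration that continues with an internal subset. The class DoctypeRepair and the method addEmptySystemId are hypothetical names, and the pattern is an illustration rather than a complete DOCTYPE parser.

```java
import java.util.regex.Pattern;

// Illustrative helper, not part of the original proxy code.
class DoctypeRepair {
    // Group 1: "<!DOCTYPE name PUBLIC <quoted public id>".
    // Group 2: the '>' or '[' that follows directly, i.e. no system literal is present.
    private static final Pattern PUBLIC_WITHOUT_SYSTEM = Pattern.compile(
            "(<!DOCTYPE\\s+\\S+\\s+PUBLIC\\s+(?:\"[^\"]*\"|'[^']*'))(\\s*[>\\[])");

    /** Inserts an empty system identifier if the DOCTYPE lacks one; otherwise returns the input. */
    public static String addEmptySystemId(String xml) {
        return PUBLIC_WITHOUT_SYSTEM.matcher(xml).replaceFirst("$1 \"\"$2");
    }

    public static void main(String[] args) {
        System.out.println(addEmptySystemId(
                "<!DOCTYPE persistence PUBLIC \"x\"><persistence/>"));
    }
}
```

Because a declaration that already carries a system literal does not match the pattern, the method can be applied to every response without first checking whether a repair is needed.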

Extended Discussion and Best Practices

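Beyond the direct fixes, security deserves attention. In a cross-domain proxy the server parses XML from a remote source, so XML External Entity (XXE) attacks should be prevented by configuring DocumentBuilderFactory to disable external entity processing. Note that the first feature below, disallow-doctype-decl, makes the parser reject any document that contains a DOCTYPE at all, so enable it only when the remote XML is not expected to carry one: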

dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
dbf.setFeature("http://xml.org/sax/features/external-general-entities", false);
dbf.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
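The effect of these settings can be checked with a small sketch. The class HardenedParsing and its methods are hypothetical names used for illustration, and the sketch assumes the parser in use recognises the three feature URIs, as the JDK's built-in parser does.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Illustrative helper, not part of the original proxy code.
class HardenedParsing {
    /** Builds a factory with the three features from the text applied. */
    public static DocumentBuilderFactory hardenedFactory() {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            dbf.setFeature("http://xml.org/sax/features/external-general-entities", false);
            dbf.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
            return dbf;
        } catch (ParserConfigurationException e) {
            // Raised when the parser does not support one of the features.
            throw new IllegalStateException("parser does not support these features", e);
        }
    }

    /** Returns true if the hardened parser accepts the document. */
    public static boolean accepts(String xml) {
        try {
            hardenedFactory().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // Any DOCTYPE is rejected, well-formed or not.
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(accepts("<persistence/>"));
        System.out.println(accepts("<!DOCTYPE persistence PUBLIC \"x\" \"\"><persistence/>"));
    }
}
```

With disallow-doctype-decl enabled, even a repaired DOCTYPE is refused. A proxy that must accept documents with a DOCTYPE would keep only the two external-entity features, or strip the declaration before parsing.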

Conclusion

The "White spaces are required between publicId and systemId" error looks like a formatting complaint but actually points to a violated rule of the XML specification: a public identifier must be followed by a system identifier. Once that requirement is understood, the fix is straightforward: correct the DOCTYPE at the source, or repair the content in the proxy before parsing. Combining this with clear error handling and a securely configured parser makes XML processing considerably more robust, and the case underlines the importance of strict adherence to standards when integrating third-party data.
