Keywords: iTextSharp | HTML to PDF | C# Programming | PDF Generation | XMLWorker
Abstract: This article explores converting HTML content to PDF documents with the iTextSharp library. It first explains how the rendering models of HTML and PDF differ, then compares the HTMLWorker and XMLWorker parsers that ship with iTextSharp. Code examples demonstrate conversion with each parser. The article also covers CSS style support, memory stream handling, and options for PDF output.
Fundamental Differences Between HTML and PDF
HTML and PDF, though created around the same era, differ fundamentally in their design philosophies and rendering mechanisms. HTML, as a markup language, primarily aims to convey high-level structural information such as paragraphs and tables. Its rendering results are highly dependent on browser environments and display devices, meaning the same HTML document may present completely different layouts on various devices.
In contrast, a PDF document must look the same on every rendering device. Because of this requirement, PDF has no notion of abstract structures such as paragraphs or tables. Instead, it places text, lines, and images at exact positions in a page coordinate system. Whoever generates the PDF, whether the developer or a library such as iTextSharp acting on the developer's behalf, has to compute the position and dimensions of each element, because no browser performs the layout automatically.
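To make the coordinate model concrete, the sketch below places one line of text and one rule at fixed positions by writing to the page's direct content. It assumes iTextSharp 5.x; the class name AbsolutePositionDemo and the chosen coordinates are illustrative only. Coordinates are in points, measured from the bottom-left corner of the page.

```csharp
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

// Hypothetical demo class, not part of iTextSharp.
public static class AbsolutePositionDemo
{
    public static byte[] Build()
    {
        using (var ms = new MemoryStream())
        {
            using (var doc = new Document(PageSize.A4))
            using (var writer = PdfWriter.GetInstance(doc, ms))
            {
                doc.Open();

                // The direct content layer accepts low-level drawing operations.
                PdfContentByte cb = writer.DirectContent;
                BaseFont font = BaseFont.CreateFont(
                    BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED);

                // Text: baseline starts 72pt from the left edge, 770pt from the bottom.
                cb.BeginText();
                cb.SetFontAndSize(font, 12f);
                cb.ShowTextAligned(PdfContentByte.ALIGN_LEFT, "Placed at (72, 770)", 72f, 770f, 0f);
                cb.EndText();

                // A horizontal line 10pt below the text baseline.
                cb.MoveTo(72f, 760f);
                cb.LineTo(300f, 760f);
                cb.Stroke();

                doc.Close();
            }
            // ToArray still works after the writer has closed the stream.
            return ms.ToArray();
        }
    }
}
```

Nothing here reflows: if the text were longer than the page, it would simply run off the edge. The HTML parsers discussed next exist to do this placement work automatically.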
iTextSharp Parser Selection
iTextSharp offers two main HTML parsers: HTMLWorker and XMLWorker. HTMLWorker is the older parser. It supports only the most basic CSS properties, applied through inline styles, handles complex HTML tags poorly, and is no longer updated. XMLWorker is its replacement. It takes more effort to configure, but it provides better CSS support, more comprehensive tag processing, and greater extensibility.
Basic Conversion Implementation
The following code demonstrates basic HTML conversion using HTMLWorker:
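// Requires: using System; using System.IO; using iTextSharp.text; using iTextSharp.text.pdf;
// Build the PDF in memory first, then decide what to do with the bytes.
using (var ms = new MemoryStream())
{
    using (var doc = new Document())
    {
        // Bind the document to the stream so that everything added to it is written there.
        using (var writer = PdfWriter.GetInstance(doc, ms))
        {
            doc.Open();

            // In a verbatim string (@"..."), a literal double quote is written as "".
            var example_html = @"<p>This <em>is </em><span class=""headline"" style=""text-decoration: underline;"">some</span> <strong>sample <em> text</em></strong><span style=""color: red;"">!!!</span></p>";

            // HTMLWorker parses the markup and adds the resulting elements to the document.
            using (var htmlWorker = new iTextSharp.text.html.simpleparser.HTMLWorker(doc))
            {
                using (var sr = new StringReader(example_html))
                {
                    htmlWorker.Parse(sr);
                }
            }

            doc.Close();
        }
    }

    // Read the bytes only after the document has been closed; before that the PDF is incomplete.
    byte[] bytes = ms.ToArray();

    // Save the result to the desktop for inspection.
    var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");
    System.IO.File.WriteAllBytes(testFile, bytes);
}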
XMLWorker Advanced Features
XMLWorker provides more robust CSS support, including style sheets supplied separately from the HTML. The following fragment passes the HTML content and the CSS rules as two streams. It belongs between doc.Open() and doc.Close() in the previous listing, in place of the HTMLWorker block:
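// Assumes 'writer' and 'doc' are the open PdfWriter and Document from the previous listing.
// In a verbatim string (@"..."), a literal double quote is written as "".
var example_html = @"<p>This <em>is </em><span class=""headline"" style=""text-decoration: underline;"">some</span> <strong>sample <em> text</em></strong><span style=""color: red;"">!!!</span></p>";

// The class selector targets the span marked class="headline" above.
var example_css = @".headline{font-size:200%}";

// ParseXHtml reads both inputs as streams, so wrap each string in a MemoryStream.
using (var msCss = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(example_css)))
{
    using (var msHtml = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(example_html)))
    {
        iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, msHtml, msCss);
    }
}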
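For reference, here is a sketch of how the XMLWorker calls might be wrapped into one self-contained helper. It assumes iTextSharp 5.x together with the separate XMLWorker assembly (itextsharp.xmlworker); the class XhtmlToPdf and the method Render are hypothetical names. When no CSS is given, it falls back to the ParseXHtml overload that takes a TextReader and relies on inline styles only.

```csharp
using System.IO;
using System.Text;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.tool.xml;

// Hypothetical helper, not part of iTextSharp.
public static class XhtmlToPdf
{
    // Converts a well-formed XHTML fragment (and optional CSS rules) to PDF bytes.
    public static byte[] Render(string xhtml, string css = null)
    {
        using (var ms = new MemoryStream())
        {
            using (var doc = new Document())
            using (var writer = PdfWriter.GetInstance(doc, ms))
            {
                doc.Open();

                if (css == null)
                {
                    // HTML only: inline styles are still honored.
                    using (var srHtml = new StringReader(xhtml))
                    {
                        XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, srHtml);
                    }
                }
                else
                {
                    // HTML plus a separate style sheet, both passed as streams.
                    using (var msCss = new MemoryStream(Encoding.UTF8.GetBytes(css)))
                    using (var msHtml = new MemoryStream(Encoding.UTF8.GetBytes(xhtml)))
                    {
                        XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, msHtml, msCss);
                    }
                }

                doc.Close();
            }
            // The stream is closed by now, but ToArray still returns its contents.
            return ms.ToArray();
        }
    }
}
```

A call such as XhtmlToPdf.Render("<p>Hello</p>", ".headline{font-size:200%}") would then return the finished document as a byte array.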
Memory Stream Handling and Output
Writing the PDF to a MemoryStream keeps the whole document in memory during generation and avoids temporary files. Once the document is closed, the resulting byte array can be delivered in several ways: written to a file on disk, sent to a client in an HTTP response, or attached to an email.
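Because all three destinations can be reached through a Stream, one possible design is a small helper that writes the finished bytes to whatever stream the caller supplies. The sketch below uses only the .NET base class library; PdfOutput, LooksLikePdf, and WriteTo are hypothetical names introduced for illustration.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of iTextSharp.
public static class PdfOutput
{
    // Cheap sanity check: every PDF file begins with the ASCII signature "%PDF-".
    public static bool LooksLikePdf(byte[] bytes)
    {
        return bytes != null
            && bytes.Length >= 5
            && bytes[0] == (byte)'%'
            && bytes[1] == (byte)'P'
            && bytes[2] == (byte)'D'
            && bytes[3] == (byte)'F'
            && bytes[4] == (byte)'-';
    }

    // Writes the PDF to any writable stream: a FileStream, a web response body, and so on.
    public static void WriteTo(byte[] pdfBytes, Stream target)
    {
        if (target == null) throw new ArgumentNullException(nameof(target));
        if (!LooksLikePdf(pdfBytes)) throw new ArgumentException("Not a PDF document.", nameof(pdfBytes));

        target.Write(pdfBytes, 0, pdfBytes.Length);
        target.Flush();
    }
}
```

For disk output the target would be a FileStream; in a web application it would be the response body stream, with the content type set to application/pdf; for email, the bytes can be wrapped in a new MemoryStream and handed to the mail API as an attachment.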
Practical Application Scenarios
In real-world development, HTML-to-PDF conversion is commonly used to generate reports, invoices, contracts, and other documents that require consistent formatting. A typical approach is to assemble the HTML with StringBuilder from application data and then pass it to one of the iTextSharp parsers. Because XMLWorker expects well-formed XHTML, every tag must be closed and any text inserted into the markup should be HTML-encoded.
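As an illustration of the StringBuilder approach, the following sketch builds a small invoice table. The class InvoiceHtml, its Build method, and the shape of the input data are assumptions made for this example. User-supplied text goes through WebUtility.HtmlEncode so that characters such as & and < cannot break the markup.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Net;
using System.Text;

// Hypothetical helper, not part of iTextSharp.
public static class InvoiceHtml
{
    // Builds a well-formed XHTML fragment from (description, amount) pairs.
    public static string Build(string customer, IEnumerable<KeyValuePair<string, decimal>> items)
    {
        var sb = new StringBuilder();
        sb.Append("<h1 class=\"headline\">Invoice for ")
          .Append(WebUtility.HtmlEncode(customer))
          .Append("</h1>");
        sb.Append("<table><tr><th>Item</th><th>Amount</th></tr>");

        decimal total = 0m;
        foreach (var item in items)
        {
            total += item.Value;
            sb.Append("<tr><td>")
              .Append(WebUtility.HtmlEncode(item.Key))
              .Append("</td><td>")
              // Invariant culture keeps the decimal separator stable across machines.
              .Append(item.Value.ToString("0.00", CultureInfo.InvariantCulture))
              .Append("</td></tr>");
        }

        sb.Append("<tr><td>Total</td><td>")
          .Append(total.ToString("0.00", CultureInfo.InvariantCulture))
          .Append("</td></tr></table>");
        return sb.ToString();
    }
}
```

The returned string can be passed to either parser in place of the example_html literal used in the listings above.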
Technology Development Trends
As the W3C's CSS Fragmentation Module Level 3 (css-break-3), which defines how content breaks across pages, matures, HTML-to-PDF conversion should become more standardized. Meanwhile, solutions built on the Chromium engine (such as IronPDF) render HTML more faithfully, but that accuracy has to be weighed against their higher performance and resource costs.