Keywords: .NET Core | Word to PDF | Cross-Platform
Abstract: This article explores a cross-platform method for converting Word .doc and .docx files to PDF in .NET Core environments without relying on Microsoft.Office.Interop.Word. By combining Open XML SDK and DinkToPdf libraries, it implements a conversion pipeline from Word documents to HTML and then to PDF, addressing server-side document display needs in platforms like Azure or Docker containers. The article details key technical aspects, including handling images and links, with complete code examples and considerations.
Introduction
In modern web applications, displaying Word documents in the browser is a common requirement. Browsers have no native support for Word formats, and legal or privacy concerns may rule out third-party services such as Google Docs or Microsoft Office 365. Converting the documents to PDF on the server is a viable alternative. The traditional approach relies on Microsoft.Office.Interop.Word, which is not feasible in .NET Core or on non-Windows hosts because it depends on Windows and an installed copy of Office. This article, based on a top-rated community answer (score 10.0), describes a cross-platform conversion approach that does not use Office Interop.
Technical Background and Challenges
Word documents come in two main formats: .doc (a proprietary binary format) and .docx (Open XML-based). A .docx file is a ZIP package of XML parts defined by an open standard, which makes it much easier to handle in non-Windows environments. The core challenge lies in extracting the document content and generating a PDF while preserving formatting. In .NET Core, no single mature library covers the whole path from Word to PDF, so the solution combines several tools.
Solution Overview
This solution employs a two-step process: first convert the Word document to HTML, then convert the HTML to PDF. It uses the Open XML SDK for .NET Standard to process .docx files and the DinkToPdf library (a wrapper around libwkhtmltox) for the HTML-to-PDF step. Because .doc files use a proprietary binary format, they should be converted to .docx first; this article focuses primarily on .docx handling.
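Since the pipeline treats .doc and .docx differently, it can help to decide which path a file takes from its content rather than its extension. The following is a minimal sketch, not part of the original solution: the names WordFormat and WordFormatSniffer are hypothetical. It relies on two well-known signatures: ZIP packages (which .docx files are) start with the bytes 50 4B 03 04, and legacy Office binary files start with the OLE2 compound-file signature D0 CF 11 E0 A1 B1 1A E1.

```csharp
using System;
using System.IO;

// Hypothetical helper types, introduced only for this illustration.
public enum WordFormat { Unknown, Doc, Docx }

public static class WordFormatSniffer
{
    // ZIP local file header signature ("PK\x03\x04"); a .docx is a ZIP package.
    private static readonly byte[] ZipMagic = { 0x50, 0x4B, 0x03, 0x04 };

    // OLE2 compound file signature used by legacy binary Office files such as .doc.
    private static readonly byte[] OleMagic = { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Reads up to 8 bytes from the current position and classifies the container.
    // Caveat: this only identifies the container type. Any ZIP (e.g. .xlsx) is
    // reported as Docx and any OLE2 file (e.g. .xls) as Doc, so treat the result
    // as a routing hint, not as validation.
    public static WordFormat Detect(Stream stream)
    {
        byte[] header = new byte[8];
        int read = 0;
        while (read < header.Length)
        {
            int n = stream.Read(header, read, header.Length - read);
            if (n == 0) break; // end of stream
            read += n;
        }

        if (StartsWith(header, read, OleMagic)) return WordFormat.Doc;
        if (StartsWith(header, read, ZipMagic)) return WordFormat.Docx;
        return WordFormat.Unknown;
    }

    private static bool StartsWith(byte[] buffer, int length, byte[] magic)
    {
        if (length < magic.Length) return false;
        for (int i = 0; i < magic.Length; i++)
        {
            if (buffer[i] != magic[i]) return false;
        }
        return true;
    }
}
```

A caller could open the uploaded file, pass the stream to WordFormatSniffer.Detect, and send Doc results through a .doc-to-.docx preprocessing step before Step 1.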
Detailed Implementation Steps
Step 1: Converting .docx to HTML Using Open XML SDK
The Open XML SDK supports .NET Standard and can read and manipulate .docx files. For conversion to HTML, we use a fork of the OpenXMLSDK-PowerTools library, which provides the WmlToHtmlConverter class. Below is a core code example demonstrating conversion of documents with images and links:
using System;
using System.Drawing.Imaging;
using System.IO;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;
using OpenXmlPowerTools;

public static string ParseDOCX(FileInfo fileInfo)
{
    try
    {
        byte[] byteArray = File.ReadAllBytes(fileInfo.FullName);
        // Work on an in-memory copy so the source file on disk is never modified
        using (MemoryStream memoryStream = new MemoryStream())
        {
            memoryStream.Write(byteArray, 0, byteArray.Length);
            using (WordprocessingDocument wDoc = WordprocessingDocument.Open(memoryStream, true))
            {
                WmlToHtmlConverterSettings settings = new WmlToHtmlConverterSettings()
                {
                    AdditionalCss = "body { margin: 1cm auto; max-width: 20cm; padding: 0; }",
                    PageTitle = fileInfo.FullName,
                    ImageHandler = imageInfo =>
                    {
                        // Embed each image in the HTML as a base64 data URI
                        string extension = imageInfo.ContentType.Split('/')[1].ToLower();
                        ImageFormat imageFormat = GetImageFormat(extension); // Helper: maps the subtype to an ImageFormat
                        if (imageFormat == null) return null; // Unsupported image type: skip it
                        string base64 = ConvertToBase64(imageInfo.Bitmap, imageFormat);
                        string mimeType = GetMimeType(imageFormat);
                        string imageSource = string.Format("data:{0};base64,{1}", mimeType, base64);
                        return new XElement(Xhtml.img,
                            new XAttribute(NoNamespace.src, imageSource),
                            imageInfo.ImgStyleAttribute,
                            imageInfo.AltText != null ? new XAttribute(NoNamespace.alt, imageInfo.AltText) : null);
                    }
                };
                XElement htmlElement = WmlToHtmlConverter.ConvertToHtml(wDoc, settings);
                var html = new XDocument(new XDocumentType("html", null, null, null), htmlElement);
                return html.ToString(SaveOptions.DisableFormatting);
            }
        }
    }
    catch (Exception ex)
    {
        // NOTE: this returns a plain message, not HTML. Callers that pass the
        // result to Step 2 should check for failure (or rethrow here instead).
        return "Conversion failed: " + ex.Message;
    }
}

This code handles the text, styles, and images in the document. Documents that contain malformed hyperlink URIs can cause WordprocessingDocument.Open to throw an OpenXmlPackageException; in that case the invalid URIs need to be repaired before retrying, for example with a helper such as the UriFixer.FixInvalidUri method. Note that the System.Drawing.Common NuGet package is required for image format handling, and that GetImageFormat, ConvertToBase64, and GetMimeType are helper methods you must supply yourself.
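The ImageHandler above calls three helpers (GetImageFormat, ConvertToBase64, GetMimeType) that the article does not define. One possible shape for them is sketched below, assuming the System.Drawing.Common package is referenced. The class name ImageHelpers is hypothetical, and mapping TIFF and WMF images to PNG is a choice made here for browser compatibility, not something the original code prescribes.

```csharp
using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using System.Linq;

// Hypothetical container class; the methods could equally live next to ParseDOCX.
public static class ImageHelpers
{
    // Maps the subtype of a content type (the part after "image/") to the
    // ImageFormat used when re-encoding. Returns null for unsupported types.
    public static ImageFormat GetImageFormat(string extension)
    {
        switch (extension)
        {
            case "png": return ImageFormat.Png;
            case "gif": return ImageFormat.Gif;
            case "bmp": return ImageFormat.Bmp;
            case "jpeg": return ImageFormat.Jpeg;
            // Assumption: browsers rarely render TIFF or WMF, so re-encode as PNG.
            case "tiff": return ImageFormat.Png;
            case "x-wmf": return ImageFormat.Png;
            default: return null;
        }
    }

    // Re-encodes the bitmap in the given format and returns it as base64 text.
    public static string ConvertToBase64(Bitmap bitmap, ImageFormat format)
    {
        using (var buffer = new MemoryStream())
        {
            bitmap.Save(buffer, format);
            return Convert.ToBase64String(buffer.ToArray());
        }
    }

    // Looks up the MIME type registered for the format, with a generic fallback.
    public static string GetMimeType(ImageFormat format)
    {
        ImageCodecInfo codec = ImageCodecInfo.GetImageDecoders()
            .FirstOrDefault(c => c.FormatID == format.Guid);
        return codec != null ? codec.MimeType : "application/octet-stream";
    }
}
```

To call these unqualified, as the ImageHandler does, either place them in the same class as ParseDOCX or add a `using static ImageHelpers;` directive.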
Step 2: Converting HTML to PDF Using DinkToPdf
DinkToPdf is a cross-platform HTML-to-PDF conversion library that wraps libwkhtmltox. The native libwkhtmltox library for the target platform (a .dll, .so, or .dylib file) must be included in the project and deployed with it. Below is a configuration example:
using System.IO;
using DinkToPdf;

var converter = new BasicConverter(new PdfTools());
var doc = new HtmlToPdfDocument()
{
    GlobalSettings = {
        ColorMode = ColorMode.Color,
        Orientation = Orientation.Portrait,
        PaperSize = PaperKind.A4,
    },
    Objects = {
        new ObjectSettings() {
            HtmlContent = htmlString, // HTML content obtained from Step 1
            WebSettings = { DefaultEncoding = "utf-8" },
            HeaderSettings = { FontSize = 9, Right = "Page [page] of [toPage]", Line = true },
            FooterSettings = { FontSize = 9, Right = "Page [page] of [toPage]" }
        }
    }
};
// Render the document and write the resulting PDF bytes to disk
byte[] pdfBytes = converter.Convert(doc);
File.WriteAllBytes("output.pdf", pdfBytes);

This generates a PDF file that preserves the layout and styles of the HTML. BasicConverter is intended for single-threaded use; in a web server or other multi-threaded host, use DinkToPdf's SynchronizedConverter instead and share a single instance. On Linux or in Docker, make sure libgdiplus is installed: it is needed by System.Drawing.Common for the image handling in Step 1.
Deployment and Considerations
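For cross-platform deployment, note the following. For .docx-to-HTML conversion, the OpenXMLSDK-PowerTools fork may need to be built from a specific branch, with its dependencies managed manually. For HTML-to-PDF conversion, DinkToPdf requires the native libwkhtmltox library to be deployed with the application, and the binary must match the target operating system and process architecture (32-bit or 64-bit). In Docker containers, native dependencies can be installed in the Dockerfile, for example by running apt-get install libgdiplus in Debian-based images. Be aware that starting with .NET 6, System.Drawing.Common is supported only on Windows; on .NET 6 it can still be enabled on Linux and macOS through the System.Drawing.EnableUnixSupport runtime switch, which was removed in .NET 7.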
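Because a missing or mismatched native library only surfaces at the first conversion, it can be useful to check for it at startup. The sketch below picks the expected libwkhtmltox file name for the current OS and fails fast if the file is absent. The class name WkHtmlToXLocator and the native/<architecture>/ folder layout are assumptions made for this example; adjust them to wherever your build actually copies the binaries.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

// Hypothetical startup helper; not part of DinkToPdf.
public static class WkHtmlToXLocator
{
    // Returns the native library file name expected on the current OS.
    public static string GetLibraryFileName()
    {
        if (RuntimeInformation.IsOSPlatform(OSPlatform.Windows)) return "libwkhtmltox.dll";
        if (RuntimeInformation.IsOSPlatform(OSPlatform.OSX)) return "libwkhtmltox.dylib";
        return "libwkhtmltox.so"; // Linux and other Unix-like systems
    }

    // Builds the expected path under an assumed layout:
    //   <baseDirectory>/native/<x86|x64|arm|arm64>/<library file>
    public static string GetLibraryPath(string baseDirectory)
    {
        string arch = RuntimeInformation.ProcessArchitecture.ToString().ToLowerInvariant();
        return Path.Combine(baseDirectory, "native", arch, GetLibraryFileName());
    }

    // Throws early, with a descriptive message, if the library was not deployed.
    public static string EnsurePresent(string baseDirectory)
    {
        string path = GetLibraryPath(baseDirectory);
        if (!File.Exists(path))
        {
            throw new FileNotFoundException(
                "libwkhtmltox was not found for this OS/architecture.", path);
        }
        return path;
    }
}
```

Calling WkHtmlToXLocator.EnsurePresent(AppContext.BaseDirectory) during application startup turns a confusing runtime failure into an immediate, readable one. How the library is then loaded (default probing or an explicit load) is left to the application.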
Performance and Optimization
The conversion process can be slow, especially for large documents or documents with many images. Possible optimizations include caching conversion results so the same document is not processed twice, running the conversion asynchronously so it does not block the thread handling the request, and storing the intermediate HTML in a file system or database and serving it directly instead of converting to PDF each time, which also reduces bandwidth usage. In the tests behind this article, the generated PDFs were visually consistent with the original Word documents and suitable for most display needs; documents with complex layouts should be checked individually.
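As an illustration of the caching suggestion, the following sketch stores each generated PDF on disk under a key derived from the SHA-256 hash of the source document, so identical uploads are converted only once. The PdfCache class and the convert delegate are hypothetical; the delegate stands in for the two conversion steps described earlier. The sketch ignores concurrent requests for the same document and cache eviction, both of which a production version would need to handle.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Threading.Tasks;

// Hypothetical file-system cache keyed by the content hash of the source document.
public sealed class PdfCache
{
    private readonly string _cacheDirectory;

    public PdfCache(string cacheDirectory)
    {
        _cacheDirectory = cacheDirectory;
        Directory.CreateDirectory(cacheDirectory); // no-op if it already exists
    }

    // Lower-case hex SHA-256 of the content; identical documents share a key.
    public static string ComputeKey(byte[] content)
    {
        using (SHA256 sha = SHA256.Create())
        {
            return BitConverter.ToString(sha.ComputeHash(content))
                .Replace("-", string.Empty)
                .ToLowerInvariant();
        }
    }

    // Returns the cached PDF if present; otherwise runs the conversion,
    // stores the result, and returns it.
    public async Task<byte[]> GetOrCreateAsync(byte[] docxBytes, Func<byte[], byte[]> convert)
    {
        string path = Path.Combine(_cacheDirectory, ComputeKey(docxBytes) + ".pdf");
        if (File.Exists(path))
        {
            return await File.ReadAllBytesAsync(path);
        }

        // Task.Run moves the work off the caller's thread; it still occupies
        // a thread-pool thread for the duration of the conversion.
        byte[] pdf = await Task.Run(() => convert(docxBytes));
        await File.WriteAllBytesAsync(path, pdf);
        return pdf;
    }
}
```

Hashing the content rather than the file name means a re-uploaded, edited document with the same name gets a fresh conversion, while duplicates under different names share one cached PDF.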
Extensions and Alternatives
Because .doc is a proprietary binary format, the libraries used here cannot read it directly. One option is to convert .doc files to .docx on the server with a command-line tool such as LibreOffice, keeping licensing and performance in mind. Other PDF generation libraries also exist, such as QuestPDF (free for companies with under $1M in annual revenue), which offers a more intuitive way to build PDFs in code but does not convert Word documents directly. The wkhtmltopdf command-line tool can also be used as a standalone executable, although integrating it into a .NET Core application may be more complex.
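The LibreOffice preprocessing step could be driven from .NET roughly as follows. This is a sketch under several assumptions: LibreOffice is installed on the server, its soffice executable is on the PATH (override sofficePath otherwise), and the class name LibreOfficeConverter and the 60-second default timeout are illustrative choices. It uses LibreOffice's headless mode with the --convert-to and --outdir options.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical wrapper around the LibreOffice command line.
public static class LibreOfficeConverter
{
    // Builds: soffice --headless --convert-to docx --outdir <outputDir> <docPath>
    public static ProcessStartInfo BuildStartInfo(string docPath, string outputDir, string sofficePath = "soffice")
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = sofficePath,
            UseShellExecute = false,
            RedirectStandardOutput = true,
            RedirectStandardError = true,
            CreateNoWindow = true
        };
        // ArgumentList escapes each argument, so paths with spaces are safe.
        startInfo.ArgumentList.Add("--headless");
        startInfo.ArgumentList.Add("--convert-to");
        startInfo.ArgumentList.Add("docx");
        startInfo.ArgumentList.Add("--outdir");
        startInfo.ArgumentList.Add(outputDir);
        startInfo.ArgumentList.Add(docPath);
        return startInfo;
    }

    // Runs the conversion and returns the path of the generated .docx file.
    public static string ConvertDocToDocx(string docPath, string outputDir, int timeoutMs = 60000)
    {
        using (Process process = Process.Start(BuildStartInfo(docPath, outputDir)))
        {
            // Read both pipes asynchronously so a full buffer cannot stall the child.
            var stdout = process.StandardOutput.ReadToEndAsync();
            var stderr = process.StandardError.ReadToEndAsync();

            if (!process.WaitForExit(timeoutMs))
            {
                process.Kill();
                throw new TimeoutException("LibreOffice did not finish within the timeout.");
            }
            if (process.ExitCode != 0)
            {
                throw new InvalidOperationException("LibreOffice failed: " + stderr.Result);
            }
        }

        string result = Path.Combine(outputDir, Path.GetFileNameWithoutExtension(docPath) + ".docx");
        if (!File.Exists(result))
        {
            throw new FileNotFoundException("LibreOffice reported success but produced no output.", result);
        }
        return result;
    }
}
```

The returned path can then be passed to ParseDOCX via new FileInfo(path). Starting a LibreOffice process per document is expensive, so this step pairs naturally with caching.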
Conclusion
By combining Open XML SDK and DinkToPdf, we can achieve Word-to-PDF conversion in .NET Core without Microsoft.Office.Interop. This method supports cross-platform deployment, making it suitable for cloud environments like Azure or Docker containers. While primarily targeting .docx format, it can be extended to handle .doc files with preprocessing. Developers should adapt the code based on specific needs and address edge cases, such as corrupted documents or complex formatting.