The Office Open XML “docx” format used by Microsoft Word is far more accessible to developers than its predecessor. Whereas the previous “doc” format was a proprietary binary format, a “docx” file is a ZIP package of XML source files, so developers can generate dynamic Word documents or edit their content directly. This opens the door to tools such as text replacement algorithms and native mail-merge. Along with the opportunities, however, come several challenges and caveats that need to be handled in development.
Before writing any code, it is useful to look at the source format of the documents. Take any “docx” Word document and unzip it. If your zip software does not open “docx” files directly, change the file extension to “zip” and extract it with the built-in Windows tool. This creates a folder containing all the files that define the document. The main body text is normally in the “document.xml” file under the “word” subfolder; headers, footers, and footnotes are stored in separate XML files in the same subfolder.
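The same inspection can be done from code without extracting anything to disk. The following is a minimal sketch, assuming .NET 4.5 or later (for System.IO.Compression.ZipArchive) and assuming the main part sits at “word/document.xml”, as it does in files saved by Word. The class and method names are illustrative, not part of any library.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical helper for peeking inside a docx package.
public static class DocxInspector
{
    // Returns the raw XML of the main document part from an open stream.
    public static string ReadDocumentXml(Stream docxStream)
    {
        // leaveOpen: true so the caller keeps ownership of the stream.
        using (var archive = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            // Assumption: the body lives at word/document.xml (true for Word-saved files).
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml was not found in the package.");

            using (var reader = new StreamReader(entry.Open()))
            {
                return reader.ReadToEnd();
            }
        }
    }

    // Convenience overload that reads from a file path.
    public static string ReadDocumentXml(string docxPath)
    {
        using (FileStream file = File.OpenRead(docxPath))
        {
            return ReadDocumentXml(file);
        }
    }
}
```

Calling DocxInspector.ReadDocumentXml with the path to any “docx” file returns the XML that the rest of this article works with.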
The next step is to interface with the DocX files using C#. An efficient set of libraries for manipulating Open XML documents is available under the “DocumentFormat.OpenXml” namespace. These libraries are not part of the .NET Framework itself; they ship with the Open XML SDK, which must be installed separately. Projects targeting .NET 3.5 should use the Open XML 2.0 SDK, while projects on newer versions of .NET can use the Open XML 2.5 SDK.
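As a first contact with the SDK, the sketch below opens a document read-only and returns the plain text of each paragraph. It assumes the Open XML SDK assembly is referenced by the project; the DocxReader class and its method name are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper; only the SDK types it calls come from the library.
public static class DocxReader
{
    // Returns the plain text of each paragraph in the main document body.
    public static List<string> ReadParagraphs(string docxPath)
    {
        // Second argument false = open the package read-only.
        using (WordprocessingDocument doc = WordprocessingDocument.Open(docxPath, false))
        {
            Body body = doc.MainDocumentPart.Document.Body;

            // A paragraph's visible text is spread over one or more Text elements,
            // so the pieces are joined back together per paragraph.
            return body.Descendants<Paragraph>()
                       .Select(p => string.Join("", p.Descendants<Text>().Select(t => t.Text).ToArray()))
                       .ToList();
        }
    }
}
```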
The Open XML SDK provides classes to parse the XML documents. The challenge comes in performing text search or replacement on them. Word writes clean XML when text is typed steadily from start to finish, but any later edit tends to split the text into many separate fragments (“runs”), so a phrase that looks contiguous on screen may be scattered across several XML elements. In addition, spelling and grammar markers are encoded in the XML source, along with rendered page breaks, comments, bookmarks, and revision identifiers. Before the files can be manipulated programmatically, the document needs to be cleaned up.
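To make the problem concrete, the sketch below parses a hand-written, simplified paragraph of the kind Word can produce after editing. The sample XML is an illustration, not output captured from a real file: the mail-merge placeholder “FirstName” has been split into two runs and wrapped in spelling markers. A search that looks at one text element at a time misses the placeholder, while a search over the combined paragraph text finds it. Only System.Xml.Linq is used, so no SDK is required to try it.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical demo class showing why naive per-element search fails.
public static class RunSplitDemo
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hand-written sample: "Dear FirstName" with the placeholder split across runs.
    public const string SampleParagraph =
@"<w:p xmlns:w=""http://schemas.openxmlformats.org/wordprocessingml/2006/main"" w:rsidR=""00A1B2C3"">
  <w:r><w:t xml:space=""preserve"">Dear </w:t></w:r>
  <w:proofErr w:type=""spellStart""/>
  <w:r><w:t>First</w:t></w:r>
  <w:r w:rsidR=""00D4E5F6""><w:t>Name</w:t></w:r>
  <w:proofErr w:type=""spellEnd""/>
</w:p>";

    // True if any single w:t element contains the whole term.
    public static bool FoundInSingleTextNode(XElement paragraph, string term)
    {
        return paragraph.Descendants(W + "t").Any(t => t.Value.Contains(term));
    }

    // True if the text of all w:t elements, joined in document order, contains the term.
    public static bool FoundInParagraphText(XElement paragraph, string term)
    {
        string text = string.Join("", paragraph.Descendants(W + "t").Select(t => t.Value).ToArray());
        return text.Contains(term);
    }
}
```

With XElement.Parse(RunSplitDemo.SampleParagraph) as input and “FirstName” as the term, the first method reports no match and the second reports a match, which is exactly the gap the clean-up step is meant to close.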
An excellent open source library, aptly named “Open XML PowerTools,” is available to help with the XML clean-up. Its MarkupSimplifier class handles the heavy lifting of removing excess tags from the document. The developer specifies which kinds of markup to remove, and the MarkupSimplifier strips them and merges neighboring text fragments that share the same formatting, preparing the document for an efficient text replace.
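A clean-up followed by a simple replace might look like the sketch below. It assumes both the Open XML SDK and Open XML PowerTools are referenced. The SimplifyMarkupSettings property names are those found in the PowerTools releases this sketch was written against, and the available flags vary between versions, so they should be checked against the installed copy. The DocxReplacer class is hypothetical. The file is opened twice on purpose: PowerTools rewrites the part's XML directly, so reopening avoids mixing its changes with an already-loaded SDK object tree.

```csharp
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
using OpenXmlPowerTools;

// Hypothetical helper combining PowerTools clean-up with an SDK text replace.
public static class DocxReplacer
{
    // Simplifies the markup, then replaces 'search' inside individual text elements.
    // Returns the number of text elements that were changed.
    public static int SimplifyAndReplace(string docxPath, string search, string replacement)
    {
        // Pass 1: strip markup that fragments the text.
        using (WordprocessingDocument doc = WordprocessingDocument.Open(docxPath, true))
        {
            // Assumption: these flags exist in the installed PowerTools version.
            var settings = new SimplifyMarkupSettings
            {
                RemoveProof = true,                  // spelling and grammar markers
                RemoveRsidInfo = true,               // revision identifiers
                RemoveLastRenderedPageBreak = true,  // cached page-break hints
                RemoveComments = true,
                RemoveSmartTags = true
            };
            MarkupSimplifier.SimplifyMarkup(doc, settings);
        }

        // Pass 2: reopen so the SDK loads the simplified XML, then replace.
        int changed = 0;
        using (WordprocessingDocument doc = WordprocessingDocument.Open(docxPath, true))
        {
            foreach (Text text in doc.MainDocumentPart.Document.Body.Descendants<Text>().ToList())
            {
                if (text.Text.Contains(search))
                {
                    text.Text = text.Text.Replace(search, replacement);
                    changed++;
                }
            }
            doc.MainDocumentPart.Document.Save();
        }
        return changed;
    }
}
```

This only replaces text that ends up inside a single text element after simplification, which is why the clean-up pass matters.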
One important caveat: if the text targeted for search or replace changes formatting midway, this technique will not work. This might happen if half of a word is underlined and the other half is in italics. Open XML stores the two halves as separate runs regardless of the MarkupSimplifier. Search operations are still possible by ignoring the formatting tags and comparing the combined text; replace operations, however, are more challenging in this scenario.
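A format-agnostic search can be sketched as follows, assuming the Open XML SDK is referenced. The DocxSearch class is illustrative. It joins the text of every run in a paragraph before comparing, so a term that is half underlined and half italic is still found. It deliberately stops at search: a replace would also have to decide which run's formatting the new text should inherit.

```csharp
using System.Collections.Generic;
using System.Linq;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper for searching text regardless of run formatting.
public static class DocxSearch
{
    // Returns the zero-based indexes of paragraphs whose combined run text contains the term.
    public static List<int> FindParagraphs(Body body, string term)
    {
        var hits = new List<int>();
        int index = 0;
        foreach (Paragraph paragraph in body.Descendants<Paragraph>())
        {
            // Join the text of all runs, ignoring their formatting properties.
            string text = string.Join("", paragraph.Descendants<Text>().Select(t => t.Text).ToArray());
            if (text.Contains(term))
                hits.Add(index);
            index++;
        }
        return hits;
    }
}
```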
One final tip: it is prudent to use the MemoryStream class to manipulate Open XML documents in memory rather than writing directly to files. Especially in web applications, temp files can cause problems for both security and scalability. Unless the document is very large, it is a good idea to keep it in memory where possible. The MemoryStream can be used throughout the ASP.NET page and finally sent to the client directly through the Response object.
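The in-memory approach can be sketched as below, assuming the Open XML SDK is referenced; the InMemoryDocx class and its method are hypothetical. One detail worth noting is that the template bytes are written into an empty MemoryStream rather than passed to the MemoryStream(byte[]) constructor, because a stream created from an existing array cannot grow when the edited document becomes larger.

```csharp
using System.IO;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper that edits a docx entirely in memory.
public static class InMemoryDocx
{
    // Replaces 'search' in the template and returns the resulting document bytes.
    public static byte[] FillTemplate(byte[] templateBytes, string search, string replacement)
    {
        using (var stream = new MemoryStream())
        {
            // Copy into an expandable stream so the package can grow while being edited.
            stream.Write(templateBytes, 0, templateBytes.Length);

            using (WordprocessingDocument doc = WordprocessingDocument.Open(stream, true))
            {
                foreach (Text text in doc.MainDocumentPart.Document.Body.Descendants<Text>().ToList())
                {
                    if (text.Text.Contains(search))
                        text.Text = text.Text.Replace(search, replacement);
                }
                doc.MainDocumentPart.Document.Save();
            } // Disposing the document writes the package back to the stream.

            return stream.ToArray();
        }
    }
}
```

In an ASP.NET page, the returned bytes could then be sent with Response.BinaryWrite after setting Response.ContentType to “application/vnd.openxmlformats-officedocument.wordprocessingml.document”, so no temp file is ever created on the server.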
Written by Andrew Palczewski
About the Author
Andrew Palczewski is CEO of apHarmony, a Chicago software development company. He holds a Master's degree in Computer Engineering from the University of Illinois at Urbana-Champaign and has over ten years' experience in managing development of software projects.