What Do Your Documents Say About You? Automated Testing of Microsoft Word Documents
Posted by Dan Zambonini on 8th Apr 2010
Why Test Documents?
Details matter.
People evaluate the overall quality of items by their minutiae. Detail-obsessed companies such as Apple spend a disproportionate amount of time on the sides, back, and even inside of their physical products. Why? Because the discernible quality of something affects our trust in it and ultimately, as Apple know all too well, its perceived value and our willingness to pay for it.
Sometimes we might not notice it, but even subtle inconsistencies and inaccuracies can subconsciously influence our opinion.
To ensure quality in our output, we write unit tests for our code and load test our web servers. We test our branding via focus groups and track the analytics of our transactional process. We spell check our blog posts and test the usability of our search results pages.
Documents, however, rarely get tested with anything more than the built-in grammar and spelling tools.
But documents are vitally important. The quality of your resume can affect your entire career. The quality of your proposal can affect the success of your business. Documents communicate our message, and explicitly represent our personal quality standards.
How Can We Test Documents?
Say what you want about Microsoft, but:
- Microsoft Word, clunky though it may be at times, is arguably still the best word processor available.
- Microsoft’s Open XML format, as used by Word 2007 onwards (.docx files), is a well-considered, practical, open standard for storing documents.
What this means is that the common .docx file – which is essentially a zip archive of XML files – is easy to programmatically read, run a series of tests against, and fix.
At Box UK, we create a lot of documentation – for our products, client projects, and proposals. These are often created by multiple authors in different offices, on different platforms, and on tight deadlines. In my experience of this kind of work, there are a number of time-consuming tests that I’d like to automate:
- House Style. Not formatting, but the consistent spelling, usage and capitalisation of terms and phrases. For example, website (one word) rather than web site (two words), or SCRUM (capitalised) rather than Scrum (sentence case).
- Track Changes. A number of embarrassing government documents have been released over the years that included editing changes; documents released externally should rarely include this potentially damaging history of edits.
- Properties. It’s all too easy to forget about the properties of a document: the metadata – such as Author and Subject – that often gives away the origin of the file, or reveals that something that appears personalised has clearly been re-purposed from something more generic.
- Typography Consistency. In theory, the Styles feature should enable you to apply consistent text styling throughout a document. However, embedded paragraph-level bespoke styles and other complexities often result in a document style being inconsistent throughout, especially if it has been amalgamated from various sources.
As a Mac user, with PHP already bundled into the operating system, I decided to quickly write a prototype PHP script that I could run against docx files to test for these rules.
Here’s the test document I used. At first glance, everything looks fine:

Look closer, as someone assessing your quality standards might do, and you’ll start to notice problems. Let’s use PHP to identify them.
You can open a docx file easily using the PHP ZipArchive feature, and read the individual files that exist inside the docx zip container:
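$za = new ZipArchive();
// open() returns true on success, or an error code on failure
if ($za->open($test_file) !== true)
{
    exit("Could not open $test_file as a zip archive");
}
// getFromName() returns the contents of the named entry as a string
$doc_txt = $za->getFromName('word/document.xml');
$style_txt = $za->getFromName('word/styles.xml');
$core_txt = $za->getFromName('docProps/core.xml');
$za->close();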
The document.xml file contains the main text of the document, including any bespoke styles that are embedded. The styles.xml defines the global styles for the document, and the core.xml file the properties of the document, i.e. the fields you see when you view File > Properties in Word.
Checking for, and fixing, house style (word preferences) is as easy as running a regular expression against the document.xml contents.
Finding deleted content that is still recoverable via Track Changes is just as easy, using another regular expression or XPath expression, e.g.
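As a rough illustration of the house style check, here is a minimal sketch. The $house_style rules and the findHouseStyleViolations() helper are hypothetical names of my own, not part of any library, and the approach assumes it is acceptable to strip the XML tags first so that a phrase split across several runs is still matched.

```php
<?php
// Hypothetical house-style rules: regular expression => preferred term.
$house_style = array(
    '/\bweb site\b/i' => 'website',
    '/\bScrum\b/'     => 'SCRUM',
);

// Returns one entry per house-style violation found in the visible text.
function findHouseStyleViolations($doc_txt, array $rules)
{
    // Mark paragraph ends, then strip the tags so that a phrase split across
    // several runs is still matched as continuous text.
    $plain = strip_tags(str_replace('</w:p>', "\n", $doc_txt));
    $violations = array();
    foreach ($rules as $pattern => $preferred)
    {
        if (preg_match_all($pattern, $plain, $matches))
        {
            foreach ($matches[0] as $found)
            {
                $violations[] = array('found' => $found, 'preferred' => $preferred);
            }
        }
    }
    return $violations;
}
```

Calling findHouseStyleViolations($doc_txt, $house_style) gives a list you can print or count. Fixing the terms in place is a little harder than reporting them: a plain preg_replace() against the raw XML only works when the whole phrase sits inside a single run.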
$doc_xml = simplexml_load_string($doc_txt);
// Each w:delText element holds text that was deleted while Track Changes was on
$del_nodes = $doc_xml->xpath('.//w:delText');
foreach ($del_nodes as $node)
{
    // Do something with the node here, e.g. delete it, wipe it or fix it.
}
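One way to act on those nodes is to strip the tracked deletions out altogether. The following is a minimal sketch rather than a complete Track Changes cleaner: removeTrackedDeletions() is a hypothetical helper name, it uses PHP's DOM extension instead of SimpleXML because removing nodes is more direct there, and it only handles deleted text (not insertions, moves or formatting changes).

```php
<?php
// WordprocessingML main namespace, as declared on the root of word/document.xml.
const W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

// Returns array(cleaned XML string, number of deletions removed).
function removeTrackedDeletions($doc_txt)
{
    $dom = new DOMDocument();
    $dom->loadXML($doc_txt);
    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('w', W_NS);

    // Only target w:del wrappers that actually contain deleted text.
    $deleted = iterator_to_array($xpath->query('//w:del[.//w:delText]'));
    foreach ($deleted as $node)
    {
        $node->parentNode->removeChild($node);
    }
    return array($dom->saveXML(), count($deleted));
}
```

The cleaned string could then be written back into the archive, for example with ZipArchive::addFromString() using the 'word/document.xml' entry name, ideally on a copy of the file rather than the original.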
Similarly, displaying / fixing / removing the Properties of the document is simple:
$core_xml = simplexml_load_string($core_txt);
$properties = $core_xml->xpath('/cp:coreProperties/*');
foreach ($properties as $node)
{
    $prop_value = (string) $node;
    $prop_name = $node->getName();
    // Do something with node here, e.g. delete it, wipe it, or fix it
}
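To make the "wipe it" option concrete, here is a small sketch that blanks a chosen set of properties. blankProperties() is a hypothetical helper, and the property names passed in (such as creator and lastModifiedBy) are the local element names I would expect to find in core.xml; check them against your own files.

```php
<?php
// Blanks the named core properties.
// Returns array(cleaned XML string, map of property name => old value).
function blankProperties($core_txt, array $names)
{
    $core_xml = simplexml_load_string($core_txt);
    $changed = array();
    foreach ($core_xml->xpath('/cp:coreProperties/*') as $node)
    {
        // getName() returns the local name, without the namespace prefix
        $name = $node->getName();
        if (in_array($name, $names, true) && (string) $node !== '')
        {
            $changed[$name] = (string) $node;
            // Assigning to index 0 replaces the text of the element itself
            $node[0] = '';
        }
    }
    return array($core_xml->asXML(), $changed);
}
```

Returning the old values alongside the cleaned XML makes it easy to report what was removed, for example blankProperties($core_txt, array('creator', 'lastModifiedBy')).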
Finally, checking the consistency of typography styles is a little trickier, but not too difficult if you follow a logical process. Start by parsing the global styles of the document, using something like:
$style_xml = simplexml_load_string($style_txt);
$ns = $style_xml->getNamespaces(true);
$aStyle = array();
foreach ($style_xml->children($ns['w'])->style as $style_node)
{
    $id = (string) $style_node->attributes($ns['w'])->styleId;
    $default = (string) $style_node->attributes($ns['w'])->default;
    $type = (string) $style_node->attributes($ns['w'])->type;
    $name = (string) $style_node->name->attributes($ns['w'])->val;
    // Not every style sets these, so default them to null
    $size = null;
    $basis = null;
    $font = null;
    // Note that w:sz stores the font size in half-points, so 24 means 12pt
    if ($size_node = $style_node->rPr->sz) $size = (string) $size_node->attributes($ns['w'])->val;
    if ($basis_node = $style_node->basedOn) $basis = (string) $basis_node->attributes($ns['w'])->val;
    if ($font_node = $style_node->rPr->rFonts) $font = (string) $font_node->attributes($ns['w'])->ascii;
    // Record into an array
    $aStyle["$id"] = array('default' => $default,
                           'name' => $name,
                           'size' => $size,
                           'basis' => $basis,
                           'type' => $type,
                           'font' => $font);
}
Then you can run through the actual styles embedded in the main document file, which exist at two levels: at paragraph level, and at ‘run’ level. A ‘run’ in Word XML can be any group of characters. Think of a ‘run’ like a <span> in HTML.
// Get all paragraph blocks - we then need to get the style at block level, and
// then any that are overridden in 'runs' within each block.
$run_counter = 0;
$aRunStyle = array();
foreach ($doc_xml->children($ns['w'])->body->p as $para_node)
{
    // We only care if this paragraph has got any 'text' in it, so check for that first
    $text_nodes = $para_node->xpath('.//w:t');
    if (count($text_nodes) > 0)
    {
        // This paragraph node has visible text in it, so we should analyze it
        // These attributes will be overwritten, in turn, by 1) Style, 2) Paragraph, 3) Run
        // 1. Start off with the default paragraph style
        $aParaStyle = getDefaultParagraphStyle($aStyle);
        // 2. Next, get the global style definition for this paragraph, if one has been set
        if ($para_node->pPr->pStyle)
        {
            // Look up the ID of the global style applied to this paragraph
            $style_id = (string) $para_node->pPr->pStyle->attributes($ns['w'])->val;
            // Now, any settings specifically for the global style
            if (isset($aStyle[$style_id]['size'])) $aParaStyle['size'] = $aStyle[$style_id]['size'];
            if (isset($aStyle[$style_id]['font'])) $aParaStyle['font'] = $aStyle[$style_id]['font'];
        }
        // 3. Next, get the 'paragraph' level style definition, if one has been set
        overrideStyle($aParaStyle, $para_node->pPr->rPr, $ns);
        // 4. Finally, the lower level styles of any 'runs' inside the paragraph
        foreach ($para_node->r as $run_node)
        {
            $aRunStyle[$run_counter] = $aParaStyle;
            if ($run_node->rPr)
            {
                overrideStyle($aRunStyle[$run_counter], $run_node->rPr, $ns);
            }
            // Add the text of every run to the style array (whether or not it has
            // its own rPr), for easier output later
            $aRunStyle[$run_counter]['text'] = (string) $run_node->t;
            $run_counter++;
        }
    }
}
function getDefaultParagraphStyle($aStyle)
{
    foreach ($aStyle as $aStyleDef)
    {
        if (($aStyleDef['default'] == 1) && ($aStyleDef['type'] == 'paragraph'))
        {
            return $aStyleDef;
        }
    }
    // No default paragraph style was defined, so start from an empty style
    return array('font' => null, 'size' => null);
}
function overrideStyle(&$aStyleContainer, $rPr, $ns)
{
    // Nothing to override if this paragraph or run has no run properties
    if (!$rPr) return;
    if ($rPr->rFonts) $aStyleContainer['font'] = (string) $rPr->rFonts->attributes($ns['w'])->ascii;
    if ($rPr->sz) $aStyleContainer['size'] = (string) $rPr->sz->attributes($ns['w'])->val;
}
What this gives you is an array of the different ‘runs’ (sections of text) within the document – $aRunStyle – which records the font and size of each. Rather than just recording the style information, you can of course decide to remove run and paragraph-level style definitions (so that all text defaults to the global styles), or update the styles to be more consistent.
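To turn that array into a quick consistency report, you could count how many runs use each font and size combination. This is only a sketch: summariseRunStyles() is a hypothetical helper, and it assumes each entry has the 'font' and 'size' keys built above, with the size in half-points as Word stores it.

```php
<?php
// Counts runs per "font size" combination, most common first.
function summariseRunStyles(array $aRunStyle)
{
    $counts = array();
    foreach ($aRunStyle as $run)
    {
        $font = isset($run['font']) ? $run['font'] : '(no font set)';
        // w:sz values are half-points, so halve them to get points
        $size = isset($run['size']) ? ($run['size'] / 2) . 'pt' : '(no size set)';
        $key = $font . ' ' . $size;
        $counts[$key] = isset($counts[$key]) ? $counts[$key] + 1 : 1;
    }
    // Sort by count, highest first, keeping the style labels as keys
    arsort($counts);
    return $counts;
}
```

The first entry is the dominant style; anything further down the list, especially a style used by only one or two runs, is a good candidate for a closer look.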
Using code similar to that above, I created a simple script that I ran against my test document, to produce the following output:

As we can now see, the document wasn’t as high-quality as it first appeared. For a start, it didn’t consistently use house style terminology. It also contained some potentially dangerous Track Changes edits, which anyone could have accessed. The properties also contained information that the end user shouldn’t see, and the typography exhibited small inconsistencies.
So now I can – in seconds – automatically run all company documents through a test script to ensure that terminology is used consistently, versioned edits are removed, properties are correctly set and font styles are consistent throughout even the longest document.
Start testing your documents: there’s no excuse not to!

Comments
Carl Morris said on 8th Apr 2010
Time for a new web app?
You'd have to assure people you're not keeping the sensitive stuff though!
Dan Zambonini said on 8th Apr 2010
@Carl - maybe with the free version, we keep (and laugh at) the sensitive stuff, and with the paid version we don't!
Edward Yarnold said on 12th Apr 2010
Wow, enlightening stuff. I didn't even consider 'validating' .docx documents in this manner, as I didn't realise how useful (or accessible) the new format was (having looked at .doc in the past and been mildly horrified). I presume similar validation could be done on .xlsx documents too?
Radu Prisacaru said on 7th May 2010
Keep up the good work! I invite you to see my post, I hope you will find interesting too.
mathew.llewellyn said on 15th Jun 2010
Show off ! :)
Anthony Green said on 27th Jul 2010
Dan, this is an eye-opener, thanks for sharing with us. It's tough to keep complex Word docs consistent when many people are using the templates and all the while adding their changes. I'm not sure we'll go so far as to create a web app, but we will definitely clean up the XML for the templates.
Would be interested to read more insights from you like this.