xmltextnorm: A Simple XML Text Normalizer to Support Diff for DocBook, XHTML, HTML etc.

The difficulty with diffing DocBook Sources

As you may know, Nicholas Tollervey (@ntoll) and I have been working on an O’Reilly book called Getting Started with Fluidinfo, which goes to print next week.

When you write for O’Reilly, you have a choice of using DocBook, an XML format, or AsciiDoc as your source format. We chose DocBook, which is what O’Reilly uses for production. For the purposes of this article, DocBook is like a richer, more powerful version of HTML or XHTML that can be transformed to produce output in multiple formats, including PDF and ePub.

One frustration with that process, for me, is the lack of a good way of viewing changes made between versions. The two main options O’Reilly suggests are either to use some kind of diff tool on the sources, or to use something like pdfdiff to look at differences in the formatted output.

Unfortunately, pdfdiff doesn’t work well if text moves between pages, which it tends to do with all but the most trivial editing.

Line-based tools like Unix diff or graphical equivalents (opendiff, xdiff etc.) are to some extent inherently unsuitable because they focus on lines, and line breaks have no special significance in DocBook, just as they don’t in HTML: they are just whitespace. Either, like me, you break the lines in convenient places (in my case, usually using Emacs’s fill function, M-q), or you use long lines. Neither is very satisfactory for a line-based diff tool: the first strategy makes small changes look large, because refilling a paragraph can move every later line break, and the second hides changes in long lines that are hard to read in the diff output.
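
To make the first problem concrete, here is a small illustration (not part of xmltextnorm) using Python 3's standard difflib and textwrap modules. The sample paragraph, the 40-column width and the helper name changed_lines are all invented for the example, and textwrap.wrap merely stands in for an editor's fill command.

```python
import difflib
import textwrap

# Invented sample text; the second version adds a single word.
old = ("Line breaks have no special significance in DocBook, just as they "
       "do not in HTML, so an author is free to refill a paragraph "
       "whenever it is edited.")
new = old.replace("special", "special semantic")


def changed_lines(a, b):
    """Count the items a line-based diff reports as removed or added."""
    diff = difflib.ndiff(a, b)
    return sum(1 for line in diff if line.startswith(("- ", "+ ")))


# Refill both versions to a narrow width, roughly as a fill command would.
old_filled = textwrap.wrap(old, width=40)
new_filled = textwrap.wrap(new, width=40)

# The one-word edit shifts the line breaks, so many lines differ.
print("changed lines after refilling:", changed_lines(old_filled, new_filled))

# Comparing the same text word by word shows only the real change.
print("changed words:", changed_lines(old.split(), new.split()))
```

The refilled versions differ on several lines even though only one word was added, which is exactly the noise a line-based diff then has to report.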

A third option, in principle, is to use an XML diff tool. There are some, the most interesting of which looks to be diffxml, but unfortunately the output from those appears not to be primarily targeted at humans.

A partial solution: xmltextnorm.py

Today I did what I should have done at the start of the process, and wrote a simple script to normalize the text in an XML source file in such a way that line-based diff tools will be more useful. The idea is that diffing the normalized text from two XML source files should produce a meaningful diff (of the text) which is relatively insensitive to changes to the source files that won’t affect the formatted PDF, HTML or whatever.
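
The general idea can be sketched in a few lines of Python 3 with ElementTree. This is a minimal sketch and not the script itself: the function name normalize_text and the choice to put one sentence on each output line are assumptions made for illustration, and the real xmltextnorm.py may well break lines differently.

```python
import re
import xml.etree.ElementTree as ET


def normalize_text(xml_source):
    """Return the document's text with markup dropped and whitespace normalized."""
    root = ET.fromstring(xml_source)
    # itertext() yields every text fragment in document order, skipping tags.
    text = "".join(root.itertext())
    # Collapse every run of whitespace, including newlines, to one space.
    text = " ".join(text.split())
    # Assumption: start a new line after sentence-ending punctuation, so the
    # output lines depend on the wording and not on the source line breaks.
    return re.sub(r"(?<=[.!?]) ", "\n", text)


# Two invented sources with identical text but different line breaks.
version_a = ("<chapter><para>Fluidinfo is a shared\n"
             "   datastore. It has <emphasis>tags</emphasis>.</para></chapter>")
version_b = ("<chapter><para>Fluidinfo is a shared datastore.\n"
             "It has <emphasis>tags</emphasis>.</para></chapter>")

print(normalize_text(version_a) == normalize_text(version_b))
```

Because the two sources normalize to the same text, a line-based diff of the normalized output would report no changes, whereas a diff of the sources would flag both lines.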

Of course, this is only part of the story: if you want to see changes to the XML markup, you’ll need something else entirely. But by running this tool on each chapter of the book, I was able to see very quickly exactly what changes our copy editor had made, something I had been completely unable to do before.

The script is a very short Python program (requiring Python 2.7, or an older Python with a modern version of ElementTree), and it is available either direct from Fluidinfo at:

or from GitHub.

Usage is simple. The command line has the form

python xmltextnorm.py [infile.xml [outfile.txt]]

which will cause xmltextnorm to write the normalized text from infile.xml to outfile.txt.

If you don’t specify an outfile, it will use the same path as infile, changing the extension to .txt. If you don’t specify either, it will read from stdin and write to stdout, just like a well-behaved Unix utility.
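
The defaulting rules just described could be expressed along the following lines. The function name resolve_paths and the use of None to mean stdin or stdout are hypothetical choices for this sketch (written for Python 3); the actual script may organize its argument handling differently.

```python
import os


def resolve_paths(argv):
    """Map command-line arguments to (infile, outfile).

    None in either position means "use stdin" or "use stdout".
    """
    args = argv[1:]
    if len(args) > 2:
        raise SystemExit(
            "usage: python xmltextnorm.py [infile.xml [outfile.txt]]")
    if not args:
        # No paths given: behave as a filter.
        return None, None
    infile = args[0]
    if len(args) == 2:
        return infile, args[1]
    # Only infile given: same path with the extension changed to .txt.
    base, _ext = os.path.splitext(infile)
    return infile, base + ".txt"


print(resolve_paths(["xmltextnorm.py", "ch01.xml"]))
```

Keeping this logic in a function that takes the argument list, instead of reading sys.argv directly, makes the three cases easy to check in isolation.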

There’s actually nothing DocBook-specific about it, and I suspect it will be just as useful for looking at textual changes to HTML (as long as it is well-formed XML) or similar sources.

It’s MIT licensed.

Entities

One point worth noting is that ElementTree doesn’t like non-standard XML entities (reasonably enough). So there’s a dictionary called ENTITIES near the start of the code that allows you to specify any non-standard entities used in your XML input, and something to translate them to. (It doesn’t really matter what you translate them to.) I’ve included &mdash; and &hellip; since they occur in our book, but you can add others if you need to do so.
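
One way such a table could be applied is to substitute the entities in the raw source before it reaches the parser. The sketch below (Python 3) takes that approach; the dictionary contents, the helper name replace_entities and the regular expression are assumptions for illustration, and the script itself may handle ENTITIES differently. As a simplification, it also rewrites matches inside comments and CDATA sections.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical contents; the real ENTITIES dictionary may differ.
ENTITIES = {
    "mdash": "\u2014",   # em dash
    "hellip": "\u2026",  # horizontal ellipsis
}

_ENTITY_RE = re.compile(r"&(\w+);")


def replace_entities(xml_source, entities=ENTITIES):
    """Replace named entities listed in `entities` before parsing."""
    def substitute(match):
        # Names not in the table (such as amp or lt) are left untouched
        # so that the XML parser can resolve them itself.
        return entities.get(match.group(1), match.group(0))
    return _ENTITY_RE.sub(substitute, xml_source)


source = "<para>Wait&hellip; really&mdash;yes &amp; no</para>"

try:
    ET.fromstring(source)
except ET.ParseError as err:
    print("unprocessed source fails to parse:", err)

para = ET.fromstring(replace_entities(source))
print(para.text)
```

Since the goal is only a readable text diff, the replacement strings just need to be applied consistently to both versions being compared.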