XML and cows

Because I've seen it quoted, misquoted, and usually not attributed at all… “Converting PDF to XML is a bit like converting hamburgers into cows.” That is the quote I know of. It comes from revered XML developer Michael Kay on the xml-dev mailing list in July 2006.

It's possible Kay got this from somewhere else, but I've never seen an earlier attribution. (Comments are open if I'm wrong.)

I hear all sorts of chest-beating about attribution in data circles, often for good and sufficient reason. I think we can stand to get our quotes and their authors right.



Totally off-topic I suppose, but exactly what came to mind reading this:

I've been out of the XML conversion business for a few years, but that quote is an apt description of how difficult it is to perform that very conversion.

PDFs don't inherently contain hierarchical structures that you can hook into with automation. They were generally very labor-intensive to work with, which sucked because we were in the data conversion business and the publishing houses we worked with were often keener on sending us print-ready PDFs than something more useful, like the original DTP files.

As a matter of fact, I worked for a small publishing-services bureau as a markup specialist for a couple of years. The round-trippable XML-based editing/typesetting workflow was the Holy Grail, and it often fell down because idiot publishers insisted that the PDF was the "archival version."

Renear and Salo 2003 makes a point about editable vs. non-editable versions of documents. If you have any plans for that text beyond human beings reading it onscreen, don't archive it only in PDF!

The non-tl;dr version: AMEN.