XML and cows

Because I've seen it quoted, misquoted, and usually not attributed at all… “Converting PDF to XML is a bit like converting hamburgers into cows.” That is the quote I know of. It comes from revered XML developer Michael Kay on the xml-dev mailing list in July 2006.

It's possible Kay got this from somewhere else, but I've never seen an earlier attribution. (Comments are open if I'm wrong.)

I hear all sorts of chest-beating about attribution in data circles, often for good and sufficient reason. I think we can stand to get our quotes and their authors right.

Totally off-topic I suppose, but exactly what came to mind reading this:

I've been out of the XML conversion business for a few years, but that quote is an apt description of how difficult it is to perform that very conversion.

PDFs don't inherently contain hierarchical structures that you can hook into with automation. They were generally very labor-intensive to work with, which sucked because we were in the data conversion business and the publishing houses we worked with were often keener on sending us print-ready PDFs than something more useful, like the original DTP files.

As a matter of fact, I worked for a small publishing-services bureau as a markup specialist for a couple of years. The round-trippable XML-based editing/typesetting workflow was the Holy Grail, and it often fell down because idiot publishers insisted that the PDF was the "archival version."

Renear and Salo 2003 makes a point about editable vs. non-editable versions of documents. If you have any plans for that text beyond human beings reading it onscreen, don't archive it only in PDF!

The non-tl;dr version: AMEN.