Because I've seen it quoted, misquoted, and usually not attributed at all… "Converting PDF to XML is a bit like converting hamburgers into cows." That is the quote I know of. It comes from revered XML developer Michael Kay on the xml-dev mailing list in July 2006.
It's possible Kay got this from somewhere else, but I've never seen an earlier attribution. (Comments are open if I'm wrong.)
I hear all sorts of chest-beating about attribution in data circles, often for good and sufficient reason. I think we can stand to get our quotes and their authors right.
Totally off-topic I suppose, but exactly what came to mind reading this:
I've been out of the XML conversion business for a few years, but that quote is an apt description of how difficult it is to perform that very conversion.
PDFs don't inherently contain hierarchical structures that you can hook into with automation. They were generally very labor-intensive to work with, which sucked because we were in the data conversion business and the publishing houses we worked with were often keener on sending us print-ready PDFs than something more useful, like the original DTP files.
As a matter of fact, I worked for a small publishing-services bureau as a markup specialist for a couple of years. The round-trippable XML-based editing/typesetting workflow was the Holy Grail, and it often fell down because idiot publishers insisted that the PDF was the "archival version."
Renear and Salo 2003 makes a point about editable vs. non-editable versions of documents. If you have any plans for that text beyond human beings reading it onscreen, don't archive it only in PDF!
The non-tl;dr version: AMEN.