XML and cows

By dsalo on July 23, 2009.

Because I've seen it quoted, misquoted, and usually not attributed at all… “Converting PDF to XML is a bit like converting hamburgers into cows.” That is the quote I know of. It comes from revered XML developer Michael Kay on the xml-dev mailing list in July 2006.

It's possible Kay got this from somewhere else, but I've never seen an earlier attribution. (Comments are open if I'm wrong.)

I hear all sorts of chest-beating about attribution in data circles, often for good and sufficient reason. I think we can stand to get our quotes and their authors right.


Totally off-topic I suppose, but exactly what came to mind reading this:

I've been out of the XML conversion business for a few years, but that quote is an apt description of how difficult it is to perform that very conversion.

PDFs don't inherently contain hierarchical structures that you can hook into with automation. They were generally very labor-intensive to work with, which sucked because we were in the data conversion business and the publishing houses we worked with were often keener on sending us print-ready PDFs than something more useful, like original DTP files.

As a matter of fact, I worked for a small publishing-services bureau as a markup specialist for a couple of years. The round-trippable XML-based editing/typesetting workflow was the Holy Grail, and it often fell down because idiot publishers insisted that the PDF was the "archival version."

Renear and Salo 2003 makes a point about editable vs. non-editable versions of documents. If you have any plans for that text beyond human beings reading it onscreen, don't archive it only in PDF!

The non-tl;dr version: AMEN.


