When is text in a PDF not text?

By dsalo on September 9, 2009.

I see this confusion so often it seems worth addressing.

If you scan a page of text, what you have is a picture. A computer sees it not as letters, numbers, and punctuation but as pixels: bits of light and shade and color, just like the pixels in your favorite family photo on Flickr.
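To make the difference concrete, here is a small Python sketch. The tiny 5x5 bitmap and every name in it are invented for illustration (a real scan has millions of pixels), but the point carries over: the same letter can be stored as a character or as a grid of dark and light dots, and only the first can be searched.

```python
# The same letter stored two ways. The bitmap is a hypothetical 5x5 "scan".
as_text = "T"

as_pixels = [
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]  # 1 = dark pixel, 0 = light pixel


def contains_letter(data, letter):
    """Naive search: look for the character itself in the data."""
    if isinstance(data, str):
        return letter in data
    # A picture is only numbers; there is no character in it to find.
    return any(letter == value for row in data for value in row)


print(contains_letter(as_text, "T"))    # True
print(contains_letter(as_pixels, "T"))  # False
```

A human squinting at the grid sees a "T" right away. The computer sees twenty-five numbers.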

You can't search for, extract, highlight, or cut-and-paste such "text." It doesn't matter whether you embed the picture in a PDF; you still can't search it. Ceci n'est pas un texte!

Compare this to creating a PDF from a word-processing or page-layout document. The computer already thinks of the text in these documents as text, so it can embed the text in the PDF as text. The text is thus searchable, extractable, and all that good stuff. (Within limits. PDF is horrible for text-mining, for reasons I may decide to discuss sometime.)

To make the text in a scanned picture searchable, you must use Optical Character Recognition (OCR) technology on the picture. OCR tools look at the picture and try to figure out what letters, numbers, and punctuation it contains. Once you've OCRed the picture, you may embed the text in the PDF along with the picture, whereupon you may be able to search and extract it.
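For the curious, the "try to figure out" step can be sketched as a toy in Python. This is emphatically not how any real OCR product works; the 3x3 letter shapes, the three-letter alphabet, and the function names are all assumptions made up for this example. It compares each scanned shape against known letter templates and picks the closest match, which turns pixels back into a string you can search.

```python
# Toy OCR by template matching. The 3x3 glyph shapes are invented for
# illustration; real OCR engines are far more sophisticated.
TEMPLATES = {
    "I": ("010",
          "010",
          "010"),
    "L": ("100",
          "100",
          "111"),
    "T": ("111",
          "010",
          "010"),
}


def differences(a, b):
    """Count the pixels on which two same-sized glyphs disagree."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognize(glyph):
    """Guess the letter whose template differs from the glyph the least."""
    return min(TEMPLATES, key=lambda letter: differences(TEMPLATES[letter], glyph))


def ocr(glyphs):
    """Turn a sequence of scanned glyphs into a searchable string."""
    return "".join(recognize(g) for g in glyphs)


# A "T" with one stray dark pixel, as if a fleck of dirt were on the page.
smudged_t = ("111",
             "010",
             "011")

scanned_page = [TEMPLATES["T"], TEMPLATES["I"], TEMPLATES["L"], smudged_t]
text = ocr(scanned_page)
print(text)          # TILT
print("IL" in text)  # True
```

Note that the recognizer only ever guesses. The smudged "T" happens to land closest to the right template here, but a bigger smudge could just as easily tip it toward the wrong letter.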

But no OCR, no text, as far as computers are concerned.

Was that clear?

The only thing I think needs to be added to this explanation is the fact that OCR is astonishingly unreliable. God help you if your document contains the word "bum"; most OCR software will render that as "burn". Flecks of dirt will be interpreted as punctuation or as stray bits of letters. Poor-quality or old-fashioned type (which is more crowded than modern type, usually) will also contribute scads of "scannoes". Nobody has ever managed to write a program that can reliably recover the clean, abstract letter-sequences that underlie the blurry blobs of ink that make up real printing.

Project Gutenberg has been crowd-sourcing the proofreading of scanned texts for years, and it's an eye-opening exercise. (Virtuous, too, if you'll forgive the plug.) Every page has OCR errors.
