When is text in a PDF not text?

I see this confusion so often it seems worth addressing.

If you scan a page of text, what you have is a picture. A computer sees it not as letters, numbers, and punctuation—but as pixels, bits of light and shade and color, just like the pixels in your favorite family photo on Flickr.

You can't search for, extract, highlight, or cut-and-paste such "text." It doesn't matter whether you embed the picture in a PDF; you still can't search it. Ceci n'est pas un texte!

Compare this to creating a PDF from a word-processing or page-layout document. The computer already thinks of the text in these documents as text, so it can embed the text in the PDF as text. The text is thus searchable, extractable, and all that good stuff. (Within limits. PDF is horrible for text-mining, for reasons I may decide to discuss sometime.)
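To see the difference in practice, here is a minimal sketch using the third-party pypdf library (the filename is hypothetical). Run it on a PDF exported from a word processor and you get the text back; run it on a bare scan and you get empty strings.

```python
def extract_pdf_text(path):
    """Pull the embedded text layer out of a PDF, page by page.

    Only returns something useful when the PDF actually contains text
    (e.g. it was made from a word-processing document, or OCR text was
    embedded later); a bare scan yields empty strings.
    """
    # Imported here so the sketch still loads if pypdf isn't installed.
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(path)
    return [page.extract_text() for page in reader.pages]

# Hypothetical usage:
# print(extract_pdf_text("report.pdf")[0])
```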

To make the text in a scanned picture searchable, you must use Optical Character Recognition (OCR) technology on the picture. OCR tools look at the picture and try to figure out what letters, numbers, and punctuation it contains. Once you've OCRed the picture, you may embed the text in the PDF along with the picture, whereupon you may be able to search and extract it.
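In practice this usually means command-line OCR tooling. A sketch, assuming the open-source tesseract and ocrmypdf tools are installed (the filenames are hypothetical):

```shell
# Recognize the characters in one scanned page image; tesseract writes
# the result to out.txt (it appends the .txt extension itself).
tesseract page-001.png out

# Or do the whole job at once: OCR every page of a scanned PDF and
# embed the recognized text underneath the page images, producing a
# searchable PDF.
ocrmypdf scanned.pdf searchable.pdf
```

The second command produces exactly the arrangement described above: the picture stays visible, and the invisible text layer behind it is what searching and cut-and-paste operate on.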

But no OCR, no text, as far as computers are concerned.

Was that clear?

The only thing I think needs to be added to this explanation is the fact that OCR is astonishingly unreliable. God help you if your document contains the word "bum"; most OCR software will render it as "burn". Flecks of dirt will be interpreted as punctuation or as stray bits of letters. Poor-quality or old-fashioned type (which is usually more crowded than modern type) will also contribute scads of "scannos". Nobody has ever managed to write a program that can reliably recover the clean, abstract letter sequences that underlie the blurry blobs of ink that make up real printing.
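You can see the shape of these errors with nothing but the Python standard library. A small sketch (the sample strings are invented) that compares a proofread line against its OCR output and reports the spans that differ:

```python
from difflib import SequenceMatcher

def scanno_diff(truth, scanned):
    """List the spans where OCR output diverges from the proofread text."""
    matcher = SequenceMatcher(None, truth, scanned)
    return [
        (op, truth[i1:i2], scanned[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]

# The classic scanno: "bum" misread as "burn".
print(scanno_diff("the bum on the bench", "the burn on the bench"))
# → [('replace', 'm', 'rn')]
```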

Project Gutenberg has been crowd-sourcing the proofreading of scanned texts for years, and it's an eye-opening exercise. (Virtuous, too, if you'll forgive the plug.) Every page has OCR errors.