I don’t think Optical Character Recognition (OCR) works that well, frankly. But it can be done and it can be better than retyping piles of text.

It does seem to work nicely when the text is nice and clean on nice clean white paper with a good contrast between ink and background and no garbage on the page. But in my experience, when I have those conditions, it is because I have an electronic version already! When I have a PDF file that consists of scans of photocopies, OCR tends to see flecks of yeck as accents (or entire letters) and things get messy.


Nonetheless, it can be a useful technology and it works well in Linux. One of the things you do in Linux that is different from, say, Windows, is to use brute force and hands-on processing with OCR. This is better than most other solutions because it allows you to make more adjustments and have more control over the process. It takes more mucking around, but you get better results, can define a workflow for your particular needs, and have more fun.

I mean, seriously, how much more fun can you have than running OCR from the command line???
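To give a flavor of what that hands-on workflow can look like, here is a minimal sketch in Python that drives two command line tools. This is an assumption on my part for illustration, not necessarily what the overview linked below uses: `pdftoppm` (from poppler-utils) turns each PDF page into a grayscale image, and `tesseract` runs OCR on each image. The function names, the `ocr_work` directory, and the 300 dpi default are all made up for the example.

```python
import subprocess
from pathlib import Path


def rasterize_command(pdf, prefix, dpi=300):
    """Command that renders every page of a PDF to a grayscale PNG.

    Assumes pdftoppm is installed; output files are named <prefix>-N.png.
    """
    return ["pdftoppm", "-r", str(dpi), "-gray", "-png", str(pdf), str(prefix)]


def ocr_command(image, lang="eng"):
    """Command that runs tesseract on one page image.

    tesseract takes an output base name and appends ".txt" itself,
    so the image path minus its extension is passed as the base.
    """
    image = Path(image)
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang]


def ocr_pdf(pdf, work_dir="ocr_work", dpi=300, lang="eng"):
    """Rasterize a PDF, OCR each page, and return the combined text."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_command(pdf, work / "page", dpi), check=True)
    # Page images are processed in filename order.
    pages = sorted(work.glob("page-*.png"))
    for image in pages:
        subprocess.run(ocr_command(image, lang), check=True)
    return "\n".join(p.with_suffix(".txt").read_text(encoding="utf-8") for p in pages)
```

Calling `ocr_pdf("scans.pdf")` (a hypothetical file name) would leave the page images and text files in `ocr_work`, which is where the control comes in: you can clean up or re-scan a bad page image and rerun just that page, or change the resolution and try again.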

I bring all this up because I came across a reasonable overview of how to do it and wanted to share it with you. It is here.

Comments

  1. #1 Rocky
    June 26, 2008

    I work for the Postal Service (U.S.) and we make use of some pretty sophisticated OCR tech to read the addresses of the letters in the mailstream. And in only a few hundred microseconds (usually). Of course it is only trying to read three to five lines but it actually does a pretty amazing job (about 90%+ is read on the fly). There are problems, of course; certain fonts, grandmother’s handwriting, the goofy way people will put their addresses on a letter, etc. Overall, though, the process simply amazes me. I’ve watched it evolve over twenty years (it’s my job to take care of this stuff on a local basis).

    The problem for the standard user is simply one of technology. To do what the Postal Service does requires some pretty sophisticated computers and programs, far beyond what the average user could deploy. Good OCR can be done, we just can’t afford it.

  2. #2 Greg Laden
    June 26, 2008

    Rocky … right, but … I get 90 percent too with my 80 dollar scanner and free software. The problem is that 90 percent for the post office in this context is fantastic because nine out of ten of the things you need to do are done by the machine. That is impressive. But when you want a typed up document, 90 percent is a horrible rate of accuracy.

    What I’m saying is that in one context 90 percent is “good OCR” and in another 90 percent is “not good enough OCR.”

    (I hasten to add that I’m sure the post office is scanning things at a very high rate of speed, which makes that even more impressive)
