Seed Media Group

Greg Laden's Blog

Evolution, Life Sciences, Science Education, Human Evolution, and Stuff


Optical Character Recognition in Linux

Category: Computer Tricks, Linux
Posted on: June 25, 2008 2:15 PM, by Greg Laden

I don't think Optical Character Recognition (OCR) works that well, frankly. But it can be done and it can be better than retyping piles of text.

It does seem to work nicely when the text is clean, on clean white paper, with good contrast between ink and background and no garbage on the page. But in my experience, when I have those conditions, it is because I already have an electronic version! When I have a PDF file that consists of scans of photocopies, OCR tends to see flecks of yeck as accents (or entire letters) and things get messy.

Nonetheless, it can be a useful technology and it works well in Linux. One thing you do differently in Linux than in, say, Windows, is use brute force and hands-on processing with OCR. This is better than most other solutions because it lets you make more adjustments and gives you more control over the process. It takes more mucking around, but you get better results, can define a workflow for your particular needs, and have more fun.

I mean, seriously, how much more fun can you have than running OCR from the command line???
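For a rough idea of what that hands-on workflow can look like, here is a minimal Python sketch. It rests on assumptions that go beyond this post: it uses pdftoppm (from poppler-utils) to turn PDF pages into images and Tesseract to do the OCR, neither of which is named above. The function names, the working directory layout, and the 300 dpi setting are illustrative choices only.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_rasterize_cmd(pdf: Path, prefix: Path, dpi: int = 300) -> List[str]:
    """Build the pdftoppm command that writes one PNG per PDF page.

    Assumption: poppler's pdftoppm is the rasterizer; 300 dpi is a guess
    at a reasonable scan resolution, not a tuned value.
    """
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def build_ocr_cmd(image: Path, lang: str = "eng") -> List[str]:
    """Build the Tesseract command for a single page image.

    Tesseract takes an output base name and appends ".txt" itself.
    """
    out_base = image.with_suffix("")
    return ["tesseract", str(image), str(out_base), "-l", lang]


def ocr_pdf(pdf: Path, workdir: Path) -> str:
    """Rasterize a scanned PDF, OCR each page, and return the joined text."""
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} is not installed or not on PATH")

    workdir.mkdir(parents=True, exist_ok=True)
    prefix = workdir / "page"
    subprocess.run(build_rasterize_cmd(pdf, prefix), check=True)

    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    pages = sorted(workdir.glob("page-*.png"))
    for image in pages:
        # This is where per-page cleanup (despeckle, deskew, crop) would go
        # before handing the image to the OCR engine.
        subprocess.run(build_ocr_cmd(image), check=True)

    return "\n".join(
        p.with_suffix(".txt").read_text(encoding="utf-8") for p in pages
    )
```

Calling `ocr_pdf(Path("scan.pdf"), Path("ocr_work"))` (both names hypothetical) would leave the page images and per-page text files in `ocr_work`, so you can inspect and redo individual pages. Keeping each step as a separate command is the point: that gap between rasterizing and recognizing is where the mucking around happens.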

I bring all this up because I came across a reasonable overview of how to do it and wanted to share it with you. It is here.

Comments

I work for the Postal Service (U.S.) and we make use of some pretty sophisticated OCR tech to read the addresses of the letters in the mailstream. And in only a few hundred microseconds (usually). Of course it is only trying to read three to five lines, but it actually does a pretty amazing job (about 90%+ is read on the fly). There are problems, of course: certain fonts, grandmother's handwriting, the goofy way people will put their addresses on a letter, etc. Overall, though, the process simply amazes me. I've watched it evolve over twenty years (it's my job to take care of this stuff on a local basis).

The problem for the standard user is simply one of technology. To do what the Postal Service does requires some pretty sophisticated computers and programs, far beyond what the average user could deploy. Good OCR can be done, we just can't afford it.

Posted by: Rocky | June 26, 2008 10:00 AM

Rocky ... right, but .... I get 90 percent too with my 80 dollar scanner and free software. The difference is that 90 percent for the post office in this context is fantastic, because nine out of ten of the things you need to do are done by the machine. That is impressive. But when you want a typed-up document, 90 percent is a horrible rate of accuracy.

What I'm saying is that in one context 90 percent is "good OCR" and in another 90 percent is "not good enough OCR".

(I hasten to add that I'm sure the post office is scanning things at a very high rate of speed, which makes that even more impressive)

Posted by: Greg Laden | June 26, 2008 10:10 AM
