Using our powers for good - how web security software can help to transcribe old books

What would you do if someone asked you to help transcribe an old book onto a website? Chances are, you'd say no on the basis that you have other things to do, or simply that it just doesn't sound very interesting. And yet, millions of people every day are helping with precisely this task, and most are completely unaware that they're helping out.

It's all thanks to a computer program developed by Luis von Ahn and colleagues at Carnegie Mellon University. Their goal was to slightly alter a simple task that all web users encounter and convert it from wasted time into something productive. That task - and you will all have done this before - is to look at an image of a distorted word and type what it is in a box. It often turns up when you're trying to post on a blog or sign up for an account.

The distorted word is called a CAPTCHA and, playing fast and loose with the spirit of acronyms, it stands for "Completely Automated Public Turing test to tell Computers and Humans Apart". The point is to make users prove that they are human, because modern computer programs cannot discern the distorted letters as well as humans can. CAPTCHAs are visual sentinels that protect against automated programs that would otherwise buy up tickets for resale at inflated prices, set up millions of fake email accounts for spamming, or inundate polls, forums and blogs with comments.

They have become so commonplace that von Ahn estimates that people type in over 100 million CAPTCHAs every day. And even though the goal of improving web security is a worthwhile one, these efforts add up to hundreds of thousands of hours that are effectively wasted on a daily basis. Now, von Ahn's team have found a way of tapping this effort and putting it to better use - to help decipher scanned words, and usher old printed books into the digital age.

Reverse-Turing tests

As von Ahn writes, the goal of these projects is to "preserve human knowledge and to make information more accessible to the world." Digitising books makes them simpler to search and store, but doing so is easier said than done. Books can be scanned and their words decoded by optical character recognition software, but these programmes are still far from perfect. And any weaknesses they have are exacerbated by the faded ink and yellowing paper of the very texts that the team are most interested in preserving.

So recognition software is automated but only about 80% accurate. Humans are far more accurate; if two fleshy scribes work independently and check any discrepancies in their transcripts, they can achieve an accuracy of over 99%. We, however, are far from automated and usually quite expensive to hire.

The new system, aptly named reCAPTCHA, combines the best of both worlds by asking people to decipher words that software cannot, while solving CAPTCHAs. Instead of random words or characters, it creates CAPTCHAs using words from scanned texts that recognition software has struggled to read.

Two different recognition programmes scour the texts in question and when their readings differ, words are classified as "suspicious". These are placed alongside a "control" word that is already known. The pair is distorted even further, and used to make a CAPTCHA. The user has to solve both words to prove their humanity - if they get the control word right, the system assumes that they are genuine and gains a bit of confidence that their guess for the suspicious word is also right.
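To make the idea concrete, here is a minimal sketch in Python of the two steps just described: flagging words where two recognition programmes disagree, and only trusting a user's guess if they get the control word right. The function names are invented for illustration, and the sketch assumes that the two readings line up word for word and that answers are compared without regard to case. None of this is taken from the reCAPTCHA code itself.

```python
def find_suspicious(reading_a, reading_b):
    """Return the positions where two OCR readings of the same text differ.

    Assumes (for illustration) that both readings are lists of words
    of equal length, aligned position by position.
    """
    return [i for i, (a, b) in enumerate(zip(reading_a, reading_b)) if a != b]


def check_answer(control_word, typed_control, typed_suspicious):
    """Check a two-word challenge.

    The user passes only if the known control word is typed correctly.
    Their guess for the suspicious word is kept only in that case.
    Case-insensitive matching is an assumption of this sketch.
    """
    passed = typed_control.strip().lower() == control_word.lower()
    guess = typed_suspicious.strip().lower() if passed else None
    return passed, guess


# Two hypothetical OCR readings of the same scanned line.
ocr_one = ["the", "qnick", "brown", "fox"]
ocr_two = ["the", "quick", "brown", "fox"]
suspicious_positions = find_suspicious(ocr_one, ocr_two)
```

In this toy example only the second word would be flagged, so it would be paired with a control word and shown to users.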

Every suspicious word is sent to multiple users and if the first three people to see it all provide the same guess, it shunts over to the pool of control words. If the humans disagree, a voting system kicks in and the most popular answer is taken as the right one. Users have an option to discard the word if it's illegible, and if this happens six times without any guesses being made, the word is marked as "unreadable" and discarded.
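The rules above can be sketched in a few lines of Python. This is only one reading of the description, not the team's actual implementation: the three-in-a-row rule is applied to the first three guesses received, the "unreadable" rule needs six discards and no guesses at all, and the number of guesses needed before a majority vote is accepted (`MIN_VOTES`) is a made-up value, since the post doesn't say when voting stops.

```python
from collections import Counter

UNANIMOUS_FIRST = 3   # first three guesses agree -> becomes a control word
MAX_DISCARDS = 6      # six discards with no guesses -> unreadable
MIN_VOTES = 5         # hypothetical: guesses needed before a vote is accepted


def resolve(responses):
    """Decide the fate of one suspicious word.

    `responses` is a list in arrival order. Each entry is a guess (str),
    or None if the user chose to discard the word as illegible.
    Returns a (status, word) tuple.
    """
    guesses = [r for r in responses if r is not None]
    discards = len(responses) - len(guesses)

    if not guesses and discards >= MAX_DISCARDS:
        return ("unreadable", None)

    first = guesses[:UNANIMOUS_FIRST]
    if len(first) == UNANIMOUS_FIRST and len(set(first)) == 1:
        return ("control", first[0])

    if len(guesses) >= MIN_VOTES:
        # Ties are broken by whichever guess arrived first.
        word, _count = Counter(guesses).most_common(1)[0]
        return ("resolved", word)

    return ("pending", None)
```

So `resolve(["upon", "upon", "upon"])` would promote "upon" to the control pool, while a word that six users in a row refuse to guess would be thrown out.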

At first, von Ahn's team tested the reCAPTCHA system using 50 scanned articles from the New York Times archive, dating from as far back as 1860 and totalling just over 24,000 words. The reCAPTCHA system achieved an excellent accuracy of 99.1%, getting only 216 words wrong and far outstripping the meagre 83.5% rate managed by standard recognition software.

Human transcription services guarantee an accuracy of 99% or better, so reCAPTCHA certainly lives up to that exacting standard. Indeed, when humans were asked to do the same task, they made 189 errors, just 27 fewer than the programme. The neck-and-neck nature of the two scores is all the more impressive because unlike a human reader, reCAPTCHA cannot make use of context to decode a word's identity.
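For anyone who wants to check the sums, here is a quick back-of-the-envelope calculation in Python. It rounds the test set to 24,000 words (the post only says "just over"), so the figures are approximate, and the number of words that the recognition software got wrong is inferred from its 83.5% rate rather than reported directly.

```python
TOTAL_WORDS = 24_000  # approximation of "just over 24,000 words"


def accuracy(errors, total=TOTAL_WORDS):
    """Fraction of words transcribed correctly."""
    return 1 - errors / total


recaptcha_pct = round(accuracy(216) * 100, 1)   # reCAPTCHA: 216 errors
human_pct = round(accuracy(189) * 100, 1)       # human transcribers: 189 errors
gap = 216 - 189                                 # difference in error counts

# Inferred, not reported: errors implied by an 83.5% accuracy rate.
ocr_errors = round(TOTAL_WORDS * (1 - 0.835))

print(recaptcha_pct, human_pct, gap, ocr_errors)
```

On these rounded numbers, reCAPTCHA comes out at about 99.1% and the human transcribers at about 99.2%, while the recognition software would have misread roughly 4,000 words.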

Virtual security

That's all well and good, but are there selfish reasons for a website to use reCAPTCHA, if its goal of preserving its own security (quite understandably) outweighs any interest in text conservation? Certainly, according to the researchers. Because the new system only uses words that are unrecognisable to current optical character recognition software, it's actually more secure than current CAPTCHAs are.

Conventional CAPTCHAs use a small number of predictable rules to distort a set of characters and various groups have developed learning programmes that can solve them with over 90% accuracy. But the same techniques always fail to solve reCAPTCHAs because on top of the usual twists, this system has two extra levels of 'encryption' - the random fading of the underlying text and 'noisy' distortion caused by the scanning process. There's a certain irony in making something state-of-the-art out of the old and the inaccurate.

It's an interesting advance - von Ahn was in fact the person responsible for developing CAPTCHAs in their current form, so it's perhaps unsurprising that his team have developed the next escalation of this technology.

Some might suggest that CAPTCHAs are a bit annoying anyway, so having to fill two out would seem like too onerous a task for today's short attention spans. Not so - most CAPTCHAs are strings of random characters and these take just as long to solve as two actual English words.

Recycling effort

These guarantees, along with the prospect of doing something worthy, have already turned reCAPTCHA into a bit of an online hit. It's being used by over 40,000 websites and it's already making an impact. In its first year, web users solved over 1.2 billion reCAPTCHAs and deciphered over 440 million words - the equivalent of 17,600 books. At the moment, the programme is deciphering over 4 million suspicious words (about 160 books) every day. For human scribes to do the same task in the same time-frame, you'd need a workforce of over 1,500 people working 40-hour weeks.
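The book figures hang together, as a short Python check shows. It uses only the rounded numbers quoted above. The average book length is not stated anywhere in the post and is simply what those numbers imply.

```python
words_first_year = 440_000_000   # words deciphered in the first year
books_first_year = 17_600        # quoted equivalent in books
words_per_day = 4_000_000        # current daily rate

# Implied average book length (derived, not quoted).
words_per_book = words_first_year // books_first_year

# Daily output expressed in books.
books_per_day = words_per_day / words_per_book

print(words_per_book, books_per_day)
```

That works out at 25,000 words per book, which turns 4 million words a day into the 160 books quoted.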

It's a fantastic idea - turning web users into unwitting satellite processors, and making constructive use of a necessary but ultimately unproductive activity. This ethos, of treating human processing power as a resource that should be conserved just as electricity or gas should be, underlies a lot of the team's other work. They have developed online games that can analyse photos and audio recordings, and their work has inspired another group to create Fold It, a game in which people compete to work out the ideal structure of a protein.

Even pictures of cats can be put to good use. A Microsoft programme called ASIRRA uses images of cats and dogs as CAPTCHAs. Users have to select all the images of one or the other, but the twist is that all the photos come from animal shelters and users who take a liking to one of the animals can adopt it.

Now if only someone could harness the countless hours of effort wasted on trolling or posting comments on YouTube, we'd all be laughing.

Reference: Science doi: 10.1126/science.1160379

Thank you for this post. I did my first two words today, I look forward to many more.

Nice post, I'd heard a few months ago that CAPTCHA had essentially been beaten by some program. I never trust OCR, we use it in my office to scan to word docs (works mostly) and excel (never gets the correct formatting, and I mean never). I always tell my employees, if they want a document digital and don't need to edit it - scan it to a PDF, no need for OCR.

So far all these uses are good, perfect examples of the power of collaboration. I thoroughly agree. If I may slightly misquote Tom Paine, it reflects all that is best about society, working together to achieve what is impossible individually. But there is a slight suspicion that sometime somewhere a powerful individual in charge of a large collaborative network will secretly use these techniques, and before we know it, we have all together helped to build a new deadly virus?

But there is a slight suspicion that sometime somewhere a powerful individual in charge of a large collaborative network will secretly use these techniques, and before we know it, we have all together helped to build a new deadly virus?

The most important weakness of deadly viruses is that they kill the host (computer). To be sure a virus in development is deadly, you have to watch it kill hosts (testers' computers). Those hosts are then presumably unavailable for future tests.

A major strength of distributed computing is that viruses will get caught early by the people most qualified to recognize them, power users. Most power users have access to more than one computer. People would compare notes and realize what was going on. This assumes, by the way, that existing anti-virus programs don't already flag the program as malicious based on its behavior.

If the virus was particularly complex, then it is tempting to think that different users in the distribution would test small parts of the program in isolation. That may be possible. But at some point you have to put the pieces together to test larger subprocesses, and any that cause harm to users' computers would undoubtedly be detected long before the full virus could be constructed.

In order to effectively fool users and bypass existing defense mechanisms, you'd need an innovative virus almost as complicated as the one you envision producing in the distributed environment.

In other words, good scary sci-fi, but not plausible.

By speedwell (not verified) on 15 Aug 2008

It shouldn't take long until bad guys are using a variation of this system to defeat CAPTCHA.

Example: An owner of a porn site requires users to solve a CAPTCHA each time they download a file. The words to be solved are actually redirected in realtime from a queue being used by an automated program that is trying to get around a CAPTCHA blocking its access to another site.