Posted by: Rowan | June 14, 2009

Things you don’t know you’re doing

reCaptcha in use at Last.fm

I know captchas are a pain and cause accessibility issues, but they’re one way of keeping the bots out of places meant for humans.

I came across a March 2009 presentation from the guy who invented them, one Luis von Ahn from Carnegie Mellon University, which might cast them in a different light for you. The video (down below) is 12 minutes long if you want to watch it, but here it is in a nutshell.

Let’s start with the New York Times. Their archive stretches back to 1851, but it wasn’t until the 1980s that they had digital records of their content. Right now they’re part way through digitising all that older content and expect to be finished later this year. They make a scanned image of their paper pages and use OCR software to extract the words into a searchable electronic record, no surprise there. But they’re finding that the reliability of OCR for old material that may have yellowed or faded can be as low as 70%. Which leaves a whole lot of work to be done manually, on roughly 130 years’ worth of NYT. No small task.

Enter Recaptcha. The next time you come across something like the above example from Last.fm’s sign-up screen, here’s what’s happening.

An image of a word that the NYT’s OCR software can’t decipher gets sent across to Recaptcha via a web service. Recaptcha munges up the unknown word a bit more and pairs it up with a known (munged) word that it can verify, then sends the pair out as a two-word captcha to places like Last.fm – or Twitter or Facebook or 100,000 other sites. Along comes Rowan Smith to open a Last.fm account. I enter the two words in the captcha and send them off for verification. Recaptcha confirms I’m human by matching the known word with what I typed in, and hands me back to Last.fm to complete the account setup.

The other word I typed in was the one that NYT’s OCR software couldn’t recognise. Well, I just decoded it for them. Recaptcha thanks me immensely and sends the human-transcribed answer back to the NYT. Just like that. Dead simple. Brilliant, actually. And NYT is only one example – other digitisation efforts, like those at Google and The Internet Archive, can and do use the same web service.
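For the code-minded, here’s a rough Python sketch of the flow described above. It’s a toy model only, not Recaptcha’s real web service or API: the class and method names (`ToyRecaptcha`, `make_challenge`, `verify`) are made up for illustration, plain strings stand in for the munged images, and the case-insensitive matching is my own assumption.

```python
from __future__ import annotations

import random
from dataclasses import dataclass, field


@dataclass
class Challenge:
    """A two-word captcha: one control word plus one word the OCR gave up on."""
    known_image: str    # stand-in for the munged image of the known word
    known_answer: str   # the text already verified for that image
    unknown_image: str  # stand-in for the image the OCR could not read
    known_first: bool   # whether the known word is shown first


@dataclass
class ToyRecaptcha:
    """Hypothetical in-memory model of the service (an assumption, not the real thing)."""
    known_words: dict[str, str]   # image id -> verified text
    unknown_images: list[str]     # image ids waiting for a human reading
    transcriptions: dict[str, list[str]] = field(default_factory=dict)

    def make_challenge(self, rng: random.Random) -> Challenge:
        # Pair one known word with one unknown word, in a random order,
        # so the person typing cannot tell which one is being checked.
        known_image, known_answer = rng.choice(sorted(self.known_words.items()))
        unknown_image = rng.choice(self.unknown_images)
        return Challenge(known_image, known_answer, unknown_image, rng.random() < 0.5)

    def verify(self, challenge: Challenge, typed: tuple[str, str]) -> bool:
        first, second = typed
        control, guess = (first, second) if challenge.known_first else (second, first)
        # Only the known word can be checked. Matching it is what
        # "confirms I'm human" in this sketch.
        if control.strip().lower() != challenge.known_answer.lower():
            return False
        # The other word is the human reading of the undeciphered image,
        # ready to be passed back to whoever submitted it.
        self.transcriptions.setdefault(challenge.unknown_image, []).append(guess.strip())
        return True


# Example run with made-up words and image ids.
service = ToyRecaptcha({"img-known": "morning"}, ["img-unknown"])
challenge = service.make_challenge(random.Random(0))
typed = ("morning", "overlooks") if challenge.known_first else ("overlooks", "morning")
print(service.verify(challenge, typed))
print(service.transcriptions)
```

The point the sketch tries to capture is that only the known word is ever checked. The answer for the unknown word is simply collected once the known word has shown there’s a human at the keyboard.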

The captcha might be a pain, but I do like the warm fuzzies I get from doing my bit for humanity, in the time it takes to deal with one. I hope the 400,000,000 other people who have contributed so far, at a rate of 35,000,000 newly digitised words per day, do too.

If you can handle watching a computer science type for 11:50, here’s the man himself. Don’t be put off – it’s actually quite good.
