Last week we held our quarterly A2G book club discussion. The A2G book group is definitely unique in the world of business book clubs. When a coworker picks a title, the only requirement is that it somehow relates to marketing or our company culture, which means we rarely stick to traditional business books. Titles in the past year have included Into Thin Air, Good to Great and Losing My Virginity.
This quarter we read The Meaning of Everything: The Story of the Oxford English Dictionary by Simon Winchester. The Meaning of Everything tells the personality-filled history of the OED and the long, strange 72-year journey it took to get into print. On the surface it’s a book about words and the people who love them, but if you dig a little deeper it’s the story of a group of innovators who saw a great need and jumped in to fill the hole, no matter how trying or arduous the task. I think that’s something that anyone in any field can relate to.
It’s a book club tradition to open the meeting with each member’s favorite passage. This time we changed things up a bit and instead went around the room volunteering our favorite (apoplectic, yes, hoi polloi and onomatopoeia) and least favorite (moist, no, racked and buoy) words. We use the English language every day to communicate with each other, but we often take for granted that words themselves can be as interesting and individual as the people using them. As we each offered our words, it quickly became obvious that we were all a bunch of word nerds!
During our wordy discussion, we got on the subject of Project Gutenberg and archive.org. If you aren’t familiar, these websites offer some of the largest collections of books and documents online. Both sites take text that isn’t subject to copyright, upload it and distribute it on the web for free. Like the creation of the OED, uploading all that information seems to be an almost impossible task. Are the books simply scanned? I did some research and it turns out, yes…and no. Most texts are photographically scanned and then converted into text using “Optical Character Recognition” (OCR) to keep the files a manageable size. This could be a completely automated system, but unfortunately OCR isn’t perfect and there are many words that a computer cannot read. That’s where the folks at reCAPTCHA come in.
As you probably know, a CAPTCHA is a program that can tell whether its user is a human or a computer. It usually consists of two mildly distorted words at the bottom of a web registration form.
If a human encounters a CAPTCHA, they enter the words and gain access to the desired information. Spamming programs aren’t able to read distorted text, so they can’t get through. reCAPTCHA takes this one step further. The words it uses in its spam blockers are actually text that OCR can’t make out.
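For the programmers in the room, here is a rough sketch of how I picture the two-word trick working. It assumes, as I understand the original reCAPTCHA design, that one of the two words is a "control" word whose answer is already known, while the other is the word OCR couldn't read. The function names, the sample words and the three-vote threshold are all made up for illustration; this is not reCAPTCHA's actual code.

```python
from collections import Counter


def check_captcha(control_answer, typed_control, typed_unknown, votes):
    """Pass the user only if the known control word is typed correctly.

    When they pass, their reading of the unknown word is recorded as one vote.
    (Hypothetical helper, for illustration only.)
    """
    if typed_control.strip().lower() != control_answer.lower():
        return False  # failed the human test, so ignore their other answer
    votes[typed_unknown.strip().lower()] += 1
    return True


def agreed_reading(votes, min_votes=3):
    """Return the most common reading once enough people agree, else None."""
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count >= min_votes else None


# Four visitors see the same pair: control word "morning" plus one unread word.
votes = Counter()
attempts = [
    ("morning", "apoplectic"),
    ("morning", "apoplectic"),
    ("mourning", "apoplexy"),   # wrong control word, so this guess is discarded
    ("morning", "apoplectic"),
]
results = [check_captcha("morning", c, u, votes) for c, u in attempts]
print(agreed_reading(votes))
```

The point of the design is that nobody has to know the right answer for the unread word in advance; agreement among people who passed the control word stands in for it.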
So every time you encounter a reCAPTCHA form on the internet and enter the words, you are helping these projects decipher text from books, old newspapers, and classic radio shows, ensuring they get digitized to share with future generations. According to reCAPTCHA, over 200 million CAPTCHAs are solved every day around the world, each taking the user about 10 seconds. That is not a lot of time for any one person, but added together it amounts to more than 150,000 hours of work each day. If every CAPTCHA on the internet helped digitize books, these projects would become a lot more manageable. Too bad the editors of the first Oxford English Dictionary didn’t have access to reCAPTCHA. It probably would have taken them a fraction of the 72 years to complete their masterpiece!
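Being word nerds and not math nerds, we wanted to sanity-check those numbers. Here is a quick back-of-the-envelope calculation, assuming every one of the 200 million daily CAPTCHAs takes the full 10 seconds (the variable names are just mine).

```python
CAPTCHAS_PER_DAY = 200_000_000   # figure quoted by reCAPTCHA
SECONDS_PER_CAPTCHA = 10         # rough time each person spends

total_seconds = CAPTCHAS_PER_DAY * SECONDS_PER_CAPTCHA
total_hours = total_seconds / 3600  # 3,600 seconds in an hour

print(f"{total_hours:,.0f} hours per day")  # prints "555,556 hours per day"
```

Taken at face value, that works out to well over half a million hours a day, so the 150,000 hours quoted above looks like a conservative floor and not an exact total.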
For more information about reCAPTCHA you can read here and here.
For more information about how you can add reCAPTCHA to your website click here.

