The next time a Web site asks you to read a string of crooked letters as a security precaution, don't grimace. You could be helping to digitize a deteriorating historical document. A team of computer scientists has taken a common Internet tool for screening out spam and adapted it to help convert text from old books and manuscripts into electronic files. The effort might not put professional transcribers out of business, but it could cut the cost of creating digital libraries.
In the battle between Web security designers and spammers, programs called Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA) have proven an effective foil. The programs require online users to read a distorted word or line of text and retype it in a designated box--something that few optical scanners or digital-text readers can do. Insidious programs deployed by spammers can penetrate sites such as Gmail and lift their e-mail address lists. CAPTCHAs block the attempt by requiring an extra step before providing access. They are used online about 200 million times every day.
Computer scientist Luis von Ahn of Carnegie Mellon University in Pittsburgh, Pennsylvania, and colleagues thought all that effort could be put to another use, too. "Since each [CAPTCHA] takes about 10 seconds of human time," von Ahn says, "we figured humanity as a whole was wasting about 500,000 hours every day typing." And that much time constituted a valuable resource in efforts to digitize old books with deteriorating pages and faded text.
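Von Ahn's estimate can be checked with quick arithmetic. The short sketch below uses only the two figures quoted in this article (about 200 million CAPTCHAs a day, about 10 seconds each); the variable names are illustrative, not taken from the team's work.

```python
# Back-of-the-envelope check of the "500,000 hours" figure.
# Both inputs are the approximate numbers quoted in the article.
CAPTCHAS_PER_DAY = 200_000_000   # "about 200 million times every day"
SECONDS_PER_CAPTCHA = 10         # "about 10 seconds of human time"
SECONDS_PER_HOUR = 3600

hours_per_day = CAPTCHAS_PER_DAY * SECONDS_PER_CAPTCHA / SECONDS_PER_HOUR
print(f"{hours_per_day:,.0f} hours per day")  # prints "555,556 hours per day"
```

The result, roughly 556,000 hours, is in line with the quoted "about 500,000 hours every day."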
The team developed a new program, called reCAPTCHA, which collects words flagged as unreadable by optical scanners as they digitize texts. Those words, in the form of computer optical scans, are then sent to cooperating Web sites and used in place of random CAPTCHAs. The software presents one optically unreadable word and one "control" CAPTCHA word. Getting the control word right identifies the user as a human, and the program records his or her response to the unreadable word and adds it to a database. To improve accuracy even further, reCAPTCHA sends the most difficult words to multiple users and selects the consensus response as correct. This process can peg more than 99% of words accurately, the team reports online today in Science.
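The two-word check and the consensus step described above can be illustrated with a minimal sketch. It is not the team's actual code: the function names, the case-insensitive comparison, and the thresholds (at least three votes, at least 60% agreement) are all assumptions chosen for illustration.

```python
from collections import Counter

def normalize(text):
    """Compare answers without regard to case or stray spaces (an assumption)."""
    return text.strip().lower()

def grade_challenge(control_answer, control_truth, unknown_answer, votes):
    """Accept the user as human only if the control word is typed correctly.

    When the control word matches, the user's reading of the unreadable
    word is recorded as one vote. Returns True if the user passed.
    """
    if normalize(control_answer) != normalize(control_truth):
        return False  # failed the control word: discard the other answer
    votes[normalize(unknown_answer)] += 1
    return True

def consensus(votes, min_votes=3, min_share=0.6):
    """Return the agreed reading, or None if there is no consensus yet.

    min_votes and min_share are hypothetical thresholds, not published values.
    """
    total = sum(votes.values())
    if total < min_votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count / total >= min_share else None

# Usage: four users see the same unreadable word next to the control word "morning".
votes = Counter()
results = [
    grade_challenge("morning", "morning", "Upon", votes),   # passes, votes "upon"
    grade_challenge("mourning", "morning", "open", votes),  # fails control, ignored
    grade_challenge("morning", "morning", "upon", votes),   # passes, votes "upon"
    grade_challenge("Morning", "morning", "upon ", votes),  # passes, votes "upon"
]
resolved = consensus(votes)
```

In this run, three of the four users pass the control word and all three agree, so `resolved` is "upon"; the answer from the user who missed the control word never enters the tally.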
The reCAPTCHA system now automatically collects about 4 million responses every day from 40,000 Web sites, the equivalent of 1500 people working full-time and transcribing 60 words per minute, von Ahn says. The service, available at www.recaptcha.net, is free to any Web site that requests it. After a year of operation, reCAPTCHA has helped resolve about 440 million words for client users that are digitizing newspaper and document archives; von Ahn says his team just completed the entire 1908 archive from The New York Times, for example.
Information scientist Paul Kantor of Rutgers University in New Brunswick calls reCAPTCHA "an ingenious idea." It creates the opportunity to sell the labor of third parties to interested customers, he says, all at the cost of "at most, a fraction of a second more" to generate the benefit--"wow!" And cryptographer Josh Benaloh of Microsoft Research in Redmond, Washington, says the approach "is simple, brilliant, and makes people who hear about the idea smile while asking themselves, 'Why didn't I think of this?' "

