Decoding reCAPTCHA

Project team:

Curtis Jones
Jacob Russell

Problem:

As more services are hosted on the Internet, the threat of abuse and misconduct grows as well. Some abuse comes from automated programs, a.k.a. "bots", that sign up for services solely to abuse them. A common example is a bot that registers fraudulent email accounts to spread advertisements and spam.

These bots are deterred using an automated verification system called CAPTCHA, which stands for "Completely Automated Public Turing test to Tell Computers and Humans Apart". The CAPTCHA images considered here are composed of two or more words, distorted in a way that makes them very difficult for a computer to read but still relatively easy for a human. The following are examples from Google's reCAPTCHA service.

The problem now becomes: given an image containing two sequences of characters, can the two sequences be identified by an automated application? And, through this decoding, can we improve CAPTCHA performance to prevent automated abuse?

Background:

Many approaches have been applied to breaking CAPTCHAs, but there are many types of CAPTCHA, so different approaches address different parts of the problem. We focus on reCAPTCHAs because, to the best of our knowledge, they have not yet been broken. Mori and Malik[5] use the Shape Context algorithm to differentiate the text they want to solve from other text. reCAPTCHAs do not contain extraneous text, so although that approach may still hold, we would rather use a more dependable one. For reCAPTCHAs, Beede[2] gives a high-level description of an approach, including a good explanation of preprocessing. Baecher et al.[1] and Huang et al.[3] use Color Filling Segmentation to segment the image. According to Wilkins[6], segmentation is the most important step because OCR performs better after segmentation. After segmentation, Baecher et al.[1] apply a holistic classifier at the word level to perform OCR. Bursztein et al.[7] survey techniques for solving CAPTCHAs and recommend Support Vector Machines or K-Nearest Neighbors classifiers for classifying characters.

Method:

The process of decoding a CAPTCHA image will be broken down into three major steps: preprocessing, segmentation, and recognition.

Preprocessing

Binarize. Reduce all color and grayscale to black and white.
De-noise. Sharpen blurry characters and remove noise from the edges of words to improve recognition.
De-skew. Sequences that are bowed or skewed will be straightened to improve recognition.
Ellipse removal. Many reCAPTCHA images include an ellipse that inverts a portion of the word. Detection will be implemented to remove the ellipse and preserve as much of the original word as possible to improve recognition.
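
The binarization step can be sketched as follows. This is a minimal illustration only: the proposal does not name a thresholding method, so the use of Otsu's method here is an assumption, as are the function names `otsu_threshold` and `binarize` and the representation of an image as a list of rows of 0-255 gray values. De-noising, de-skewing and ellipse removal are not covered.

```python
def otsu_threshold(gray):
    """Pick the gray level that maximizes between-class variance (Otsu's method).

    `gray` is a list of rows, each a list of ints in 0..255 (assumed format).
    """
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels with value <= t
    sum0 = 0    # sum of gray values of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance is undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray, threshold=None):
    """Return a 0/1 image where 1 marks dark ("ink") pixels."""
    if threshold is None:
        threshold = otsu_threshold(gray)
    return [[1 if value <= threshold else 0 for value in row] for row in gray]


if __name__ == "__main__":
    sample = [[10, 200, 10],
              [200, 200, 10]]
    for row in binarize(sample):
        print(row)
```

A global threshold like this is only a starting point; a locally adaptive threshold may work better on images with uneven backgrounds.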

Segmentation

The image will be classified and then segmented to improve character recognition. The first segmentation will attempt to isolate the two words in the CAPTCHA image. Subsequent segmentation will aim to isolate individual characters.
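
One simple way to illustrate the two-level segmentation is a vertical projection profile: columns with no ink separate words (wide gaps) and characters (narrow gaps). This is a sketch under assumptions, not the method the proposal commits to; the names `column_profile` and `split_columns`, the `min_gap` parameter, and the 0/1 image format (1 = ink) are all hypothetical. Characters that touch or overlap will not be separated by this approach.

```python
def column_profile(binary):
    """Count ink pixels (value 1) in each column of a 0/1 image."""
    if not binary:
        return []
    return [sum(column) for column in zip(*binary)]


def split_columns(binary, min_gap=1):
    """Return (start, end) column spans, end exclusive.

    Spans are separated by at least `min_gap` consecutive empty columns.
    A large `min_gap` tends to isolate words; `min_gap=1` tends to isolate
    characters.
    """
    spans = []
    start = None      # first ink column of the current span
    last_ink = None   # most recent ink column seen
    gap = 0           # length of the current run of empty columns
    for x, count in enumerate(column_profile(binary)):
        if count > 0:
            if start is None:
                start = x
            last_ink = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                spans.append((start, last_ink + 1))
                start = None
    if start is not None:
        spans.append((start, last_ink + 1))
    return spans


if __name__ == "__main__":
    image = [[1, 1, 0, 1, 0, 0, 0, 1, 1]]
    print(split_columns(image, min_gap=3))  # word-level spans
    print(split_columns(image, min_gap=1))  # character-level spans
```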

Recognition

Character recognition will be done using a combination of support vector machine (SVM) and k-nearest neighbor (KNN) techniques. Traditional OCR trained on generic images of text fails on modern CAPTCHA images; training on images from the CAPTCHA domain should help.
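
The KNN half of the recognition step can be illustrated with a small pure-Python sketch. It assumes each segmented character has already been scaled to a fixed size and is represented as a 0/1 pixel grid; the names `flatten` and `knn_classify`, the squared Euclidean distance, and the default `k=3` are illustrative choices rather than part of the proposal. The SVM half and the way the two classifiers would be combined are not shown.

```python
from collections import Counter


def flatten(glyph):
    """Turn a 2-D 0/1 character image into a flat feature vector."""
    return [pixel for row in glyph for pixel in row]


def knn_classify(sample, training, k=3):
    """Label `sample` by majority vote among its k nearest training vectors.

    `training` is a list of (feature_vector, label) pairs whose vectors have
    the same length as `sample`.
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    nearest = sorted(training,
                     key=lambda item: squared_distance(sample, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    # On a tie, the label of the closest neighbor among the tied labels wins.
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy 2x2 "glyphs"; real training data would come from hand-tagged images.
    training = [
        (flatten([[0, 0], [0, 0]]), "a"),
        (flatten([[0, 0], [0, 1]]), "a"),
        (flatten([[1, 1], [1, 1]]), "b"),
        (flatten([[1, 1], [1, 0]]), "b"),
        (flatten([[1, 1], [0, 1]]), "b"),
    ]
    print(knn_classify(flatten([[0, 0], [1, 0]]), training, k=3))
```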

Data set:

A training data set will be built from 1,000 images downloaded from Google's reCAPTCHA service. All images will be hand-tagged with the correct label. A data set of exactly 1,000 images is available in [2], but the reCAPTCHA service has since been updated to include more advanced features.

Milestone goal:

The goal for the milestone is 50% accuracy in identifying text-only CAPTCHA images. It is difficult to estimate what percentage is likely to be decoded correctly because CAPTCHAs are constantly updated to be more difficult and data from scholarly results is not available for exact comparison. Mori and Malik[5] report a success rate of 92%, but they used EZ-Gimpy and Gimpy CAPTCHAs, not reCAPTCHAs. On more recent CAPTCHAs, Wilkins[6] reports 17.5% success, but used a very small data set (200 images) and was attempting to give guidelines for improving CAPTCHAs, not to break them.
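
For concreteness, the accuracy figure could be computed as in the sketch below. The proposal does not define the metric precisely, so this assumes that an image counts as correct only when every word in the prediction matches the hand-tagged label exactly (case-sensitive, ignoring extra whitespace). The function name `captcha_accuracy` is hypothetical.

```python
def captcha_accuracy(predictions, labels):
    """Fraction of images whose predicted text matches the label word for word."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(
        1 for predicted, truth in zip(predictions, labels)
        if predicted.split() == truth.split()
    )
    return correct / len(labels)


if __name__ == "__main__":
    predicted = ["hello world", "foo bar", "abc  def", "x y"]
    tagged = ["hello world", "foo baz", "abc def", "x z"]
    print(captcha_accuracy(predicted, tagged))
```

Under this strict definition a single wrong character fails the whole image, so per-character accuracy would need to be well above the 50% target.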

References:

[1] P. Baecher, N. Buscher, M. Fischlin, and B. Milde, "Breaking reCAPTCHA: A Holistic Approach via Shape Recognition," in Future Challenges in Security and Privacy for Academia and Industry, vol. 354, pp. 56-67, 2011.
[2] R. Beede, "Analysis of reCAPTCHA Effectiveness," Class Project, Univ. Colorado, 2010. Available: http://www.rodneybeede.com/downloads/CSCI5722 - Computer Vision, Final Paper, Rodney Beede, Fall 2010.pdf
[3] S. Y. Huang, Y. K. Lee, G. Bell, and Z. H. Ou, "An Efficient Segmentation Algorithm for CAPTCHAs with line cluttering and character warping," Multimedia Tools and Applications, vol. 48(2), pp. 267-289, 2010.
[4] S. Khanna, "Breaking the Multi-Colored Box: A Study of CAPTCHA," M.S. thesis, University of Tennessee at Chattanooga, 2009.
[5] G. Mori and J. Malik, "Recognizing Objects in Adversarial Clutter: Breaking a Visual CAPTCHA," Proceedings of the 2003 IEEE Conference on Computer Vision, 2003, pp. 134-141.
[6] J. Wilkins, "Strong Captcha Guidelines v1.2," 2009. Available: http://www.bitland.net/captcha.pdf
[7] E. Bursztein, M. Martin, and J. C. Mitchell, "Text-based CAPTCHA Strengths and Weaknesses," ACM Computer and Communication Security 2011, 2011.