TensorFlow CAPTCHA solver

Posted January 03, 2019 by Rokas Balsys

TensorFlow CAPTCHA solver introduction

Most people on the Internet are more or less familiar with CAPTCHAs: those annoying images containing text that you have to type in before you can access a website. CAPTCHA stands for 'Completely Automated Public Turing test to tell Computers and Humans Apart'. Its main purpose is to stop bots from automating actions on the Internet. In other words, it is a test used in computing to determine whether or not the user is human.

A CAPTCHA is just text with added noise, varying colors, rotated symbols or other distortions that make it harder for a computer to recognize. Sometimes it is hard even for a human to read what is written in the image, so building bots that can break these images is quite a difficult task.

If you are reading this tutorial, you probably know that, with the rise of deep learning and computer vision, even hard CAPTCHAs can already be solved. But you may not know how to do that, so keep reading to find out.

No matter how much CAPTCHAs evolve, there will always be people who come up with methods to break them. One of the best-known methods is machine learning, and our main focus is going to be a specific type of neural network called a Convolutional Neural Network (CNN).

A CNN works in a way loosely similar to how our brain recognizes things and differentiates one object from another. To build some intuition: when you look at the picture below, you can immediately tell that these two animals are not the same species. But if you ask how you know, the answer would be: that's obvious...

dog and cat

It comes from the fact that we have seen possibly a million pictures of dogs, cats and other animals, as well as seeing them in real life. When we were kids, we were told that they are different. Over time, our brain learned the distinctions between these two animals. Our memories give us the ability to correctly recognize which one is a dog and which one is a cat by noticing the many differences between them.

We are going to apply the same concept to our CAPTCHA detection neural network. Well, not exactly the same, because a computer does not perceive a picture the way we do. It sees a grid of numbers, each indicating the intensity of color at a particular pixel. If we have an RGB image, one way to represent it is as an array of red, green and blue values for every pixel (RGBA if there is also an alpha channel). Layers in CNNs are special in that they are organized in 3 dimensions: width, height and depth. This allows us to feed a picture into the network. The final layer, which is a fully connected layer, tells us what the network predicts.
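
The idea of a picture as numbers arranged by width, height and depth can be sketched with plain Python lists. This is only an illustration: the tiny 3x2 image and the helper name `to_rows` are made up here, and a real image would of course have many more pixels.

```python
# Hypothetical 3x2 image (width=3, height=2) stored as a flat list of
# (R, G, B) tuples, row by row from the top-left corner.
width, height = 3, 2
flat_pixels = [
    (255, 0, 0), (0, 255, 0), (0, 0, 255),
    (0, 0, 0), (128, 128, 128), (255, 255, 255),
]

def to_rows(pixels, width, height):
    """Regroup a flat pixel list so that result[y][x] is one (R, G, B) tuple."""
    if len(pixels) != width * height:
        raise ValueError("pixel count does not match width * height")
    return [pixels[y * width:(y + 1) * width] for y in range(height)]

rows = to_rows(flat_pixels, width, height)

# Three dimensions: height (rows), width (columns), depth (color channels).
shape = (len(rows), len(rows[0]), len(rows[0][0]))
print(shape)       # (2, 3, 3)
print(rows[1][2])  # (255, 255, 255), the bottom-right pixel
```

The depth of 3 here corresponds to the red, green and blue channels; a grayscale image would have a depth of 1.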

To make this clearer, here is an example of what we see and what the computer sees in the same image. In this photo we see a puppy:


If we would like to see the picture the way the computer does, we can run this simple script on the image:

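from PIL import Image

# Open the photo and force RGB mode so every pixel is an (R, G, B) tuple
im = Image.open("puppy.jpg").convert("RGB")
# Flat list of pixels, row by row from the top-left corner
pix_val = list(im.getdata())
print(pix_val)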

As a result we get thousands of numbers, which represent every pixel of the photo as an RGB value:


Now that we have a basic understanding of what a CNN does, we will use this method to break CAPTCHAs and see how accurately we can solve them. We used an R-CNN in my previous tutorial, when we tried to detect Counter-Strike enemies and shoot them.

Creating a structured model to break CAPTCHA:

Let’s look at the CAPTCHA again. Let’s assume that it is made up of a combination of the 26 letters of the English alphabet and the digits 0-9. In the end, our method will be able to solve CAPTCHAs with different numbers of symbols.
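
To make that assumption concrete, here is a minimal sketch of how the 36 possible symbols could be mapped to class indices for a classifier. The names `CHARSET`, `encode`, `decode` and `one_hot` are invented for this example, and it assumes the CAPTCHA is case-insensitive (uppercase only), which the tutorial does not specify.

```python
import string

# Assumed alphabet: 26 uppercase letters followed by the digits 0-9.
CHARSET = string.ascii_uppercase + string.digits
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(CHARSET)}

def encode(text):
    """Turn a CAPTCHA string into a list of class indices."""
    return [CHAR_TO_INDEX[ch] for ch in text.upper()]

def decode(indices):
    """Turn a list of class indices back into a string."""
    return "".join(CHARSET[i] for i in indices)

def one_hot(index, size=len(CHARSET)):
    """Vector of zeros with a single 1 at the position of the class."""
    vec = [0] * size
    vec[index] = 1
    return vec

print(len(CHARSET))            # 36
print(encode("T4B"))           # [19, 30, 1]
print(decode([19, 30, 1]))     # T4B
```

Because each symbol is encoded on its own, the same mapping works for CAPTCHAs of any length.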

To use any machine learning system, we need to collect training data. To break a CAPTCHA system, we want a trained model that works like this:


Once we have our training data, we could use it to train a convolutional neural network that looks like this:


With enough training data, this approach should work, but we can make the problem even simpler to solve. The simpler the problem, the less training data, computational power and time we’ll need to solve it. We know that CAPTCHA images are always made up of a number of separate symbols. If we could somehow split the image apart so that each letter is a separate image, then we only need to train the neural network to recognize a single letter at a time:
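
One simple way to picture such a split is to look for empty columns between symbols. The sketch below is an assumption for illustration, not the method used later in this series: it expects a clean black-and-white image in which symbols do not touch, given as a list of rows where 1 means ink and 0 means background. The function names are made up.

```python
def split_columns(binary_image):
    """Return (start, end) column ranges, end exclusive, that contain ink."""
    width = len(binary_image[0])
    # A column "has ink" if any row has a 1 in that column.
    has_ink = [any(row[x] for row in binary_image) for x in range(width)]
    segments, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                    # a symbol begins here
        elif not ink and start is not None:
            segments.append((start, x))  # the symbol ended at the gap
            start = None
    if start is not None:                # symbol touching the right edge
        segments.append((start, width))
    return segments

def crop(binary_image, start, end):
    """Cut out the columns start..end-1 from every row."""
    return [row[start:end] for row in binary_image]

# Toy 8x3 "CAPTCHA" with three symbols separated by empty columns.
image = [
    [1, 1, 0, 0, 1, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 1, 0, 1, 1],
]
segments = split_columns(image)
letters = [crop(image, s, e) for s, e in segments]
print(segments)  # [(0, 2), (4, 5), (6, 8)]
```

Real CAPTCHAs with noise, overlapping or rotated symbols will not split this cleanly, which is one reason to let a detection model find the symbols instead.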


So we are teaching our CNN to detect single letters in a CAPTCHA rather than the full string at once; this way we’ll need far less training data. I will talk more about training data in the second tutorial. For now, this is the result we would expect to get:


From the image above you can see that we give our CAPTCHA image to the trained CNN, and as output it gives us another CAPTCHA image with detections. But detection confidence is not always 100%; it depends on the training data we use. Moreover, our CNN may detect more symbols than there are in the CAPTCHA image, and to solve this problem we must use some kind of filter. As you can see in the picture above, our model saw the letter I with 60% confidence in place of the letter T, but after using the filter we still get T. We will discuss and develop the filter in the last steps of the tutorial.
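
The actual filter is developed later in the series, so the following is only a guess at what such a filter might look like. It assumes each detection carries a label, a confidence score and horizontal bounds; when two detections overlap heavily, only the more confident one is kept, and the survivors are read from left to right. All names, the 0.5 threshold and all values except the 60% score for I are made up.

```python
def overlap_ratio(a, b):
    """Horizontal overlap of two detections, relative to the narrower one."""
    inter = min(a["x_max"], b["x_max"]) - max(a["x_min"], b["x_min"])
    if inter <= 0:
        return 0.0
    narrower = min(a["x_max"] - a["x_min"], b["x_max"] - b["x_min"])
    return inter / narrower

def filter_detections(detections, max_overlap=0.5):
    """Drop weaker overlapping detections and return the text left to right."""
    kept = []
    # Visit detections from most to least confident.
    for det in sorted(detections, key=lambda d: d["score"], reverse=True):
        if all(overlap_ratio(det, k) <= max_overlap for k in kept):
            kept.append(det)
    kept.sort(key=lambda d: d["x_min"])
    return "".join(d["label"] for d in kept)

# Hypothetical detections: "I" is found inside the area of "T".
detections = [
    {"label": "T", "score": 0.93, "x_min": 10, "x_max": 30},
    {"label": "I", "score": 0.60, "x_min": 14, "x_max": 26},
    {"label": "4", "score": 0.88, "x_min": 35, "x_max": 55},
    {"label": "B", "score": 0.97, "x_min": 60, "x_max": 80},
]
print(filter_detections(detections))  # T4B
```

The weaker I detection is discarded because it lies entirely inside the more confident T detection.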


By the way, when I searched for other out-of-the-box CAPTCHA solving models, I couldn’t find any, so I decided to make one myself. When I finish this tutorial series, you will be able to download the full code, and if you have all the TensorFlow libraries on your computer, you will be able to give a CAPTCHA to this model and receive the result.

That’s it for this part. I believe we now have a good understanding of our approach. Next we will work with our CAPTCHA image dataset and train a CNN using TensorFlow. I will go through it step by step so that everyone can train their own CAPTCHA breaking model or use mine.