TensorFlow CAPTCHA solver introduction

In this tutorial series, we will detect every symbol in a CAPTCHA image, check the order of the symbols, their detection accuracy, and any overlap between them, and then combine these components into a final out-of-the-box CAPTCHA solver.

Most people on the Internet are more or less familiar with CAPTCHAs: those annoying images containing text that you have to type in before accessing a website. CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart. In other words, it is a test used in computing to determine whether or not the user is human, and its primary purpose is to keep bots from automating actions on the internet.

A CAPTCHA is just text with added noise, varying colors, rotated symbols, or other distortions that make it harder for a computer to recognize. Sometimes it is hard even for a human to identify what is written in the image, so it isn't easy to build bots that can break these images.

If you are reading this tutorial, you probably know that, with the rise of deep learning and computer vision, it is already possible to solve a captcha no matter how hard it is. But you probably don't know how to do that, so keep reading to find out.

No matter how much CAPTCHA evolves, there will always be people who come up with methods to break it. One of the best-known methods is machine learning, and our main focus will be a specific type of neural network called a Convolutional Neural Network (CNN).

A CNN works similarly to how our brain recognizes things and differentiates one object from another. To build some intuition: when you look at the picture below, you can immediately tell that these two animals are not the same species. But if you are asked how you know, the answer would be that it is simply obvious…

That comes from the fact that we have seen possibly a million pictures of dogs, cats, and other animals, and have also seen them in real life. When we were kids, we were told that they are different. Then, our brains slowly learned the distinctions between these two animals. Our memories allow us to correctly recognize which one is a dog and which one is a cat by the many differences between them.

Using the same concept, we are going to do something similar for our CAPTCHA detection neural network. Well, not exactly the same, because a computer does not perceive a picture the way we do. It sees a bunch of numbers, each indicating the intensity of a color at a particular pixel. If we have an RGB image, one way to represent it is as an array of red, green, and blue values for every pixel. Layers in CNNs are special in that they are organized in three dimensions: width, height, and depth. This allows us to feed a picture into the network. The final layer, which is the fully connected layer, tells us what the network predicts.

To make this more concrete, here is an example of a photo as we see it and what the computer sees in the same image. In this photo, we see a puppy:

If we would like to see the picture the way a computer does, we can run this simple script on the image:

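from PIL import Image

# Open the photo and make sure every pixel is a 3-channel RGB value
im = Image.open("puppy.jpg").convert("RGB")
# Flatten the image into a row-by-row list of (R, G, B) tuples
pix_val = list(im.getdata())
print(pix_val)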

As a result, we receive thousands of numbers, which represent every pixel from the photo as an RGB value:
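The flat list printed by the script loses the picture's two-dimensional layout. As a minimal sketch of how width, height, and depth fit together, the snippet below groups such a list back into rows. The tiny 3x2 image and the `to_rows` helper are made up for illustration; with Pillow, the real width and height would come from `im.size`.

```python
# Hypothetical stand-in for list(im.getdata()): a 3x2 image (width 3, height 2)
# flattened row by row, one (R, G, B) tuple per pixel.
width, height = 3, 2
flat_pixels = [
    (255, 0, 0), (0, 255, 0), (0, 0, 255),
    (0, 0, 0), (128, 128, 128), (255, 255, 255),
]


def to_rows(pixels, width, height):
    """Group a flat, row-major pixel list into `height` rows of `width` pixels."""
    if len(pixels) != width * height:
        raise ValueError("pixel count does not match width * height")
    return [pixels[r * width:(r + 1) * width] for r in range(height)]


rows = to_rows(flat_pixels, width, height)
# rows[y][x] is the pixel at column x of row y; each pixel holds 3 channel values.
print(len(rows), len(rows[0]), len(rows[0][0]))  # height, width, depth
```

Indexing as `rows[y][x][channel]` mirrors the height, width, and depth layout that a CNN input layer expects.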

Now that we understand what a CNN does, we will use this method to break CAPTCHAs and see how accurately we can solve them. We used an R-CNN in my previous tutorial, where we tried to detect Counter-Strike enemies and shoot them.

Creating a structured model to break CAPTCHA:

Let's look at the CAPTCHA again. Let's assume that it consists of a combination of the 26 letters of the English alphabet and the digits 0–9. In the end, our method will be able to solve CAPTCHAs with different numbers of symbols.
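Under that assumption, the model has 26 + 10 = 36 classes to tell apart. The sketch below shows one possible way to map symbols to class ids and back. The names `CHARSET`, `encode`, and `decode` are illustrative, and treating letters as case-insensitive is an assumption of this example, not something the tutorial specifies.

```python
import string

# Assumed character set: uppercase A-Z plus the digits 0-9 (36 classes).
CHARSET = string.ascii_uppercase + string.digits

# Lookup tables between a symbol and its integer class id.
char_to_id = {ch: i for i, ch in enumerate(CHARSET)}
id_to_char = {i: ch for ch, i in char_to_id.items()}


def encode(text):
    """Turn a CAPTCHA string into a list of class ids (case-insensitive)."""
    return [char_to_id[ch] for ch in text.upper()]


def decode(ids):
    """Turn a list of class ids back into a CAPTCHA string."""
    return "".join(id_to_char[i] for i in ids)


print(len(CHARSET))
print(decode(encode("a7x")))
```

Because each symbol is encoded on its own, the same mapping works for CAPTCHAs of any length.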

To use any machine learning system, we need to collect training data. To break a CAPTCHA system, we want a trained model that works like this:

When we have our training data, we can use it to train a convolutional neural network that looks like this:

With enough training data collected, this approach should work, but we can make the problem even simpler to solve. The simpler the problem, the less training data, computational power, and time we'll need to solve it. We know that CAPTCHA images are always made up of some number of separate symbols. If we could somehow split the image apart so that each letter became a separate image, then we would only need to train the neural network to recognize a single letter at a time:

So we are teaching our CNN to detect a single letter from a captcha at a time, not the entire string. This way, we'll need far less training data. I will talk more about training data in the second tutorial. For now, this is the result we expect to get:

From the image above, you can see that we give our CAPTCHA image to the trained CNN, and as output, it gives us another CAPTCHA image with detections. But detections are not always 100% confident. The confidence depends on the training data we use. Moreover, our CNN may detect more symbols than there actually are in the CAPTCHA image. To solve this problem, we must use some filters. As you can see from the picture above, our model also detected the letter "I" with 60% confidence in place of the letter "T", but after applying the filter, we still receive "T". We will discuss and develop these filters in the last steps of the tutorial.
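To give a rough idea of what such filtering could look like, here is a minimal sketch; it is not the filter developed later in this series. It drops low-confidence detections, lets the more confident detection win when two boxes overlap, and then reads the survivors from left to right. The `Detection` type, the `solve` function, the two thresholds, and the sample values are all assumptions made for this example.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    label: str
    score: float   # confidence between 0 and 1
    x_min: float   # left edge of the bounding box
    x_max: float   # right edge of the bounding box


def overlap_ratio(a, b):
    """Horizontal overlap divided by the width of the narrower box."""
    inter = min(a.x_max, b.x_max) - max(a.x_min, b.x_min)
    narrower = min(a.x_max - a.x_min, b.x_max - b.x_min)
    return max(inter, 0.0) / narrower if narrower > 0 else 0.0


def solve(detections, min_score=0.5, max_overlap=0.5):
    """Filter raw detections and return the CAPTCHA text, left to right."""
    kept = []
    # Visit the most confident detections first so they win any conflict.
    for det in sorted(detections, key=lambda d: d.score, reverse=True):
        if det.score < min_score:
            continue  # too uncertain to trust
        if all(overlap_ratio(det, k) <= max_overlap for k in kept):
            kept.append(det)
    kept.sort(key=lambda d: d.x_min)  # restore reading order
    return "".join(d.label for d in kept)


# Made-up detections, deliberately out of order.
detections = [
    Detection("K", 0.88, 60, 80),
    Detection("I", 0.60, 12, 28),  # overlaps "T" with a lower score
    Detection("T", 0.95, 10, 30),
    Detection("1", 0.20, 56, 60),  # below the confidence threshold
    Detection("4", 0.91, 35, 55),
]
print(solve(detections))  # T4K
```

Sorting by confidence before checking overlap is what lets "T" beat the competing "I" in the same position, and sorting by `x_min` afterwards recovers the order of the symbols.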

Conclusion:

By the way, while searching for other out-of-the-box CAPTCHA-solving models, I couldn't find any, so I decided to make one myself. When I finish this tutorial series, you will be able to download the entire code. If you have all the TensorFlow libraries installed on your computer, you will be able to give a captcha to this model and receive the result.

That's it for this part. I believe we now have a good understanding of our approach. Next, we will work with our CAPTCHA image dataset and train a CNN using TensorFlow. I will go step by step through training my CAPTCHA-breaking model, or you can simply use my trained model.