Convolutional Neural Networks (CNN) explained

Posted May 08, 2019 by Rokas Balsys

Convolutional Neural Networks: Introduction:

Take a moment to observe and look around you. Even if you are sitting still on your chair or lying on your bed, your brain is constantly trying to analyze the dynamic world around you. Without your conscious effort your brain is continuously making predictions and acting upon them.


After just a brief look at this photo you identified that there is a restaurant at the beach. You immediately identified some of the objects in the scene as a plate, a table, lights, etc. You probably also guessed that the weather is excellent for a night walk. How were you able to make those predictions? How did you identify the numerous objects in the picture?

It took nature millions of years of evolution to achieve this remarkable feat. Our eye and our brain work in perfect harmony to create such beautiful visual experiences. The system which makes this possible for us is the eye, our visual pathway and the visual cortex inside our brain.

There is a system inside us which allows us to make sense of the picture above, the text in this article and all the other visual recognition tasks we perform every day.

We’ve been doing this since our childhood. We were taught to recognize a dog, a cat or a human being. Can we teach computers to do so? Can we make a machine which can see and understand as well as humans do?

The answer is yes! Similar to how a child learns to recognize objects, we need to show an algorithm millions of pictures before it is able to generalize from the input and make predictions for images it has never seen before.

Computers “see” the world in a different way than we do. They can only “see” anything in the form of numbers, something like this:


Teaching computers to make sense of this array of numbers is a challenging task. Computer scientists have spent decades building systems, algorithms and models which can understand images. Today, in the era of Artificial Intelligence and Machine Learning, we have been able to achieve remarkable success in identifying objects in images, identifying the context of an image, detecting emotions, etc. One of the most popular algorithms used in computer vision today is the Convolutional Neural Network, or CNN.
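
To make the "array of numbers" idea concrete, here is a minimal sketch in plain Python. The pixel values are made up for illustration; a real photo would simply be a much larger grid (and, for a colour image, one grid per channel).

```python
# A tiny 4x4 grayscale "image": each number is one pixel's brightness,
# from 0 (black) to 255 (white). The values are invented for illustration.
image = [
    [  0,   0, 255, 255],
    [  0,   0, 255, 255],
    [255, 255,   0,   0],
    [255, 255,   0,   0],
]

height = len(image)
width = len(image[0])
brightest = max(max(row) for row in image)

print(f"{height}x{width} image, brightest pixel = {brightest}")
```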

Convolutional Neural Networks:

Convolutional Neural Networks have a different architecture than regular Neural Networks. Regular Neural Networks transform an input by putting it through a series of hidden layers. Every layer is made up of a set of neurons, where each neuron is fully connected to all neurons in the layer before. Finally, there is a last fully-connected layer (the output layer) that represents the predictions.

Convolutional Neural Networks are a bit different. First of all, the layers are organized in 3 dimensions: width, height and depth. Further, the neurons in one layer do not connect to all the neurons in the next layer but only to a small region of it. Lastly, the final output will be reduced to a single vector of probability scores, organized along the depth dimension.

A CNN is composed of two major parts:

  • Feature Extraction:
    In this part, the network will perform a series of convolutions and pooling operations during which the features are detected. If you had a picture of a zebra, this is the part where the network would recognize its stripes, two ears, and four legs.
  • Classification:
    Here, the fully connected layers will serve as a classifier on top of these extracted features. They will assign a probability that the object in the image is what the algorithm predicts it to be.

There are squares and lines inside the red dotted region, which we will break down later. The green circles inside the blue dotted region named classification are the neural network, or multi-layer perceptron, which acts as a classifier. The inputs to this network come from the preceding part, named feature extraction.

Feature extraction is the part of the CNN architecture from which the network derives its name. Convolution is the mathematical operation that is central to the efficacy of this algorithm. Let's understand at a high level what happens inside the red enclosed region. The input to the red region is the image we want to classify, and the output is a set of features. Think of features as attributes of the image: for instance, an image of a cat might have features like whiskers, two ears and four legs. A handwritten digit image might have features such as horizontal and vertical lines, or loops and curves. Later we'll see how we extract such features from the image.

Feature Extraction: Convolution:

Convolution in a CNN is performed on an input image using a filter, also called a kernel. To understand filtering and convolution, imagine scanning the screen: start at the top left, move right across the width of the screen, move down a little, and repeat the same process until you have scanned the whole screen.

For instance, if the input image and the filter look like the following:


The filter (green) slides over the input image (blue) one pixel at a time starting from the top left. The filter multiplies its own values with the overlapping values of the image while sliding over it and adds all of them up to output a single value for each overlap until the entire image is traversed:


In the above animation the value 4 (top left) in the output matrix (red) corresponds to the filter overlap on the top left of the image which is computed as:


$$ (1 \times 1 + 0 \times 1 + 1 \times 1) + (0 \times 0 + 1 \times 1 + 1 \times 0) + (1 \times 0 + 0 \times 0 + 1 \times 1) = 4 $$

Similarly, we compute the other values of the output matrix. Note that the top left value, which is 4, in the output matrix depends only on the 9 values (3x3) on the top left of the original image matrix. It does not change even if the rest of the values in the image change. This is the receptive field of this output value or neuron in our CNN. Each value in our output matrix is sensitive to only a particular region in our original image.
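
The sliding multiply-and-sum can be sketched in a few lines of plain Python. This is only an illustration, not an optimized implementation: the function name `convolve2d` is made up, and the image and filter values are assumptions chosen so that the top-left patch agrees with the terms of the equation above. As in most deep-learning libraries, the filter is applied without flipping it.

```python
def convolve2d(image, kernel):
    """Slide a square kernel over the image one pixel at a time (no padding)
    and return the matrix of multiply-and-sum results."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for a in range(k):
                for b in range(k):
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output

# Assumed example values (only the top-left patch is implied by the equation).
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = convolve2d(image, kernel)
print(feature_map)  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]

# Receptive field: changing a pixel outside the top-left 3x3 patch
# leaves the top-left output value untouched.
changed = [row[:] for row in image]
changed[4][4] = 9
print(convolve2d(changed, kernel)[0][0])  # 4
```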

In the case of images with multiple channels (e.g. RGB), the kernel has the same depth as the input image. Element-wise multiplication and summation is performed between each kernel slice $K_n$ and the matching image channel $I_n$ $([K_1, I_1], [K_2, I_2], [K_3, I_3])$, and all the results are summed with the bias to give us a squashed, one-channel-deep convolved feature output:
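
A rough sketch of the multi-channel case, again in plain Python: each kernel slice is slid over its own channel, and the per-channel results plus the bias are accumulated into one output map. The function name `convolve_multichannel` and all the numbers are assumptions for illustration only.

```python
def convolve_multichannel(channels, kernels, bias=0):
    """channels and kernels are lists of 2D lists with the same depth.
    Returns a single 2D feature map (one channel deep)."""
    k = len(kernels[0])
    out_h = len(channels[0]) - k + 1
    out_w = len(channels[0][0]) - k + 1
    # Start every output cell at the bias, then add each channel's contribution.
    output = [[bias] * out_w for _ in range(out_h)]
    for channel, kernel in zip(channels, kernels):
        for i in range(out_h):
            for j in range(out_w):
                for a in range(k):
                    for b in range(k):
                        output[i][j] += channel[i + a][j + b] * kernel[a][b]
    return output

# Made-up 3x3 "RGB" image and a 2x2x3 kernel.
red   = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]
green = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
blue  = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

k_red   = [[1, 0], [0, 1]]
k_green = [[0, 1], [1, 0]]
k_blue  = [[1, 1], [1, 1]]

result = convolve_multichannel([red, green, blue], [k_red, k_green, k_blue], bias=1)
print(result)  # [[9, 5], [5, 9]]
```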


Neighbouring neurons in the output matrix have overlapping receptive fields. The animation below will give you a better sense of what happens in convolution. Conventionally, the first convolutional layer is responsible for capturing low-level features such as edges, color and gradient orientation. With added layers, the architecture adapts to high-level features as well, giving us a network with a holistic understanding of the images in the dataset, similar to how we would.


Feature Extraction: padding

There are two types of results to the operation: one in which the convolved feature is reduced in dimensionality compared to the input, and one in which the dimensionality is either increased or remains the same. This is done by applying Valid Padding in the case of the former, or Same Padding in the case of the latter. In the above example our padding is 1.

In our example, when we pad the 5x5x1 image into a 7x7x1 image and then apply the 3x3x1 kernel over it, the convolved matrix turns out to have dimensions 5x5x1. This means our output image has the same dimensions as our input image (Same Padding).

On the other hand, if we perform the same operation without padding, in the output we'll receive an image with reduced dimensions. So our (5x5x1) image will become (3x3x1).
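
The sizes quoted above follow from a simple rule: with the filter moving one pixel at a time, an n x n image padded by p on every side and convolved with an f x f filter gives an output of size n + 2p - f + 1. A small sketch, with the helper names `zero_pad` and `output_size` made up for illustration:

```python
def zero_pad(image, p):
    """Surround a 2D image with a border of p zeros on every side."""
    width = len(image[0]) + 2 * p
    padded = [[0] * width for _ in range(p)]
    for row in image:
        padded.append([0] * p + list(row) + [0] * p)
    padded += [[0] * width for _ in range(p)]
    return padded

def output_size(n, f, p=0):
    """Output width/height when the filter moves one pixel at a time."""
    return n + 2 * p - f + 1

image = [[1] * 5 for _ in range(5)]      # a 5x5 image of ones
padded = zero_pad(image, 1)

print(len(padded), len(padded[0]))       # 7 7
print(output_size(5, 3, p=1))            # 5  (Same Padding)
print(output_size(5, 3, p=0))            # 3  (Valid Padding, i.e. no padding)
```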

Feature Extraction: example

Let's say we have a handwritten digit image like the one below. We want to extract only the horizontal edges or lines from the image. We will use a filter or kernel which, when convolved with the original image, dims out all those areas which do not have horizontal edges:


Notice how the output image only has the horizontal white line and rest of the image is dimmed. The kernel here is like a peephole which is a horizontal slit. Similarly for a vertical edge extractor the filter is like a vertical slit peephole and the output would look like:
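
Here is a small sketch of the "slit" idea on a toy image. The article's actual kernel values are not given, so the two line-detector kernels below are an assumption, as is the 5x5 image containing a single horizontal line.

```python
def convolve2d(image, kernel):
    """Slide a square kernel over the image one pixel at a time (no padding)."""
    k = len(kernel)
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(k) for b in range(k))
            for j in range(len(image[0]) - k + 1)
        ]
        for i in range(len(image) - k + 1)
    ]

# Toy image: one bright horizontal line on a dark background.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

# Assumed kernels: a horizontal "slit" and its vertical counterpart.
horizontal_kernel = [[-1, -1, -1], [2, 2, 2], [-1, -1, -1]]
vertical_kernel = [[-1, 2, -1], [-1, 2, -1], [-1, 2, -1]]

horizontal_response = convolve2d(image, horizontal_kernel)
vertical_response = convolve2d(image, vertical_kernel)

print(horizontal_response)  # [[-3, -3, -3], [6, 6, 6], [-3, -3, -3]]
print(vertical_response)    # [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
```

The horizontal kernel responds strongly where the slit lines up with the line, while the vertical kernel sees nothing of interest in this image.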


Feature Extraction: Non-Linearity

After sliding our filter over the original image, the output we get is passed through another mathematical function called an activation function. The activation function used in most cases in CNN feature extraction is ReLU, which stands for Rectified Linear Unit. It simply converts all of the negative values to 0 and keeps the positive values the same:


After passing the outputs through the ReLU function, they look like:


So by convolving a single image with multiple filters we can get multiple output images. For the handwritten digit here we applied a horizontal edge extractor and a vertical edge extractor and got two output images. We can apply several other filters to generate more such output images, which are also referred to as feature maps.
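
The ReLU step described in this section is a one-liner. In this sketch the input feature map is a made-up example containing negative values, of the kind an edge filter can produce:

```python
def relu(feature_map):
    """Replace every negative value with 0 and keep positive values unchanged."""
    return [[max(0, value) for value in row] for row in feature_map]

# Made-up feature map with negative responses.
feature_map = [
    [-3, -3, -3],
    [ 6,  6,  6],
    [-3, -3, -3],
]

activated = relu(feature_map)
print(activated)  # [[0, 0, 0], [6, 6, 6], [0, 0, 0]]
```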

Feature Extraction: Pooling

After a convolution layer produces the feature maps, it is common to add a pooling (or sub-sampling) layer. Similar to the convolutional layer, the pooling layer is responsible for reducing the spatial size of the convolved feature. This reduces the computational power required to process the data. Furthermore, it is useful for extracting dominant features that are largely invariant to small shifts and rotations, which helps the model train effectively. Pooling also shortens the training time and helps control over-fitting.

There are two types of Pooling, Max Pooling and Average Pooling:

  • Max Pooling returns the maximum value from the portion of the image covered by the Kernel.
    Max Pooling also acts as a noise suppressant. It discards noisy activations altogether, performing de-noising along with dimensionality reduction.

  • Average Pooling returns the average of all the values from the portion of the image covered by the Kernel.
    Average Pooling simply performs dimensionality reduction as its noise-suppressing mechanism. For this reason, Max Pooling often performs better than Average Pooling in practice.
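
Both kinds of pooling can be sketched with one small function. The 2x2 window that moves two pixels at a time is an assumption (a common choice, but not stated above), and the function name `pool2d` and the feature map values are made up for illustration.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Split the feature map into non-overlapping size x size windows and
    keep either the maximum or the average of each window."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [
                feature_map[i + a][j + b]
                for a in range(size)
                for b in range(size)
            ]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled

feature_map = [
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
]

max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="average")

print(max_pooled)  # [[6, 5], [7, 9]]
print(avg_pooled)  # [[3.5, 2.0], [2.5, 6.0]]
```

Either way, the 4x4 feature map shrinks to 2x2, which is the dimensionality reduction described above.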

The Convolutional Layer and the Pooling Layer together form the i-th layer of a Convolutional Neural Network. Depending on the complexity of the images, the number of such layers may be increased to capture low-level details even further, but at the cost of more computational power.

After going through the above process, we have successfully enabled the model to understand the features. Moving on, we are going to flatten the final output and feed it to a regular Neural Network for classification purposes.

Classification: Fully Connected Layer (FC Layer):

Adding a Fully-Connected layer is a (usually) cheap way of learning non-linear combinations of the high-level features as represented by the output of the convolutional layer. The Fully-Connected layer is learning a possibly non-linear function in that space. Example of a CNN:


Now that we have converted our input image into a suitable form, we flatten it into a column vector. The flattened output is fed to a feed-forward neural network, and backpropagation is applied on every iteration of training. Over a series of epochs, the model learns to distinguish between dominating and certain low-level features in images and to classify them using the Softmax classification technique.
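
As a rough sketch of the forward pass only (no training or backpropagation), flattening, one fully connected layer and Softmax could look like this. The feature maps, weights and biases are made-up numbers, and the helper names `flatten`, `dense` and `softmax` are chosen for illustration; in a real network the weights would be learned.

```python
import math

def flatten(feature_maps):
    """Turn a list of 2D feature maps into one flat list of values."""
    return [value for fmap in feature_maps for row in fmap for value in row]

def dense(inputs, weights, biases):
    """One fully connected layer: each output is a weighted sum of all inputs."""
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + b
        for neuron_weights, b in zip(weights, biases)
    ]

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(scores)                      # subtracted for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Two made-up 2x2 feature maps, e.g. the outputs of pooling.
feature_maps = [
    [[6, 5], [7, 9]],
    [[0, 1], [2, 0]],
]

# Hypothetical weights for a 2-class classifier (8 inputs per class).
weights = [
    [0.1, 0.1, 0.1, 0.1, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.1, 0.1, 0.1, 0.1],
]
biases = [0.0, 0.0]

flat = flatten(feature_maps)
scores = dense(flat, weights, biases)
probabilities = softmax(scores)

print(flat)            # [6, 5, 7, 9, 0, 1, 2, 0]
print(probabilities)   # two probabilities; the first class gets the larger one
```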

So now we have all the pieces required to build a CNN: Convolution, ReLU and Pooling. The output of max pooling is fed into the classifier we discussed initially, which is usually a multi-layer perceptron. In CNNs these layers are usually used more than once, i.e. Convolution -> ReLU -> Max-Pool -> Convolution -> ReLU -> Max-Pool and so on. We won't go into more detail on the fully connected layer right now.
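
To see one Convolution -> ReLU -> Max-Pool stage working end to end, here is a compact sketch in plain Python. The toy 6x6 image, the horizontal line-detector kernel and the 2x2 pooling window are all assumptions made for illustration.

```python
def convolve2d(image, kernel):
    """Slide a square kernel over the image one pixel at a time (no padding)."""
    k = len(kernel)
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(k) for b in range(k))
            for j in range(len(image[0]) - k + 1)
        ]
        for i in range(len(image) - k + 1)
    ]

def relu(fmap):
    """Set negative values to 0."""
    return [[max(0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    return [
        [
            max(fmap[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, len(fmap[0]) - size + 1, size)
        ]
        for i in range(0, len(fmap) - size + 1, size)
    ]

# Toy 6x6 image with one horizontal line, and an assumed line-detector kernel.
image = [[0] * 6, [0] * 6, [1] * 6, [0] * 6, [0] * 6, [0] * 6]
kernel = [[-1, -1, -1], [2, 2, 2], [-1, -1, -1]]

convolved = convolve2d(image, kernel)   # 4x4, with negative values around the line
activated = relu(convolved)             # negatives become 0
pooled = max_pool(activated)            # 2x2 summary

print(pooled)  # [[6, 6], [0, 0]]
```

Stacking more stages simply means feeding `pooled` into the next convolution.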


The CNN is a very powerful algorithm which is widely used for image classification and object detection. Its hierarchical structure and powerful feature extraction capabilities make it a very robust algorithm for various image and object recognition tasks.

There are various CNN architectures available which have been key in building the algorithms that power, and will continue to power, AI as a whole in the foreseeable future. Some of them are LeNet, AlexNet, VGGNet, GoogLeNet, ResNet and ZFNet.

If you liked this, or have some feedback or follow-up questions, please comment below.

Thanks for Reading!

In my next tutorial we'll start building our first CNN model with TensorFlow.