YOLOv4 with TensorFlow 2

Posted August 12 by Rokas Balsys

YOLOv4 custom object detection with TensorFlow 2

To be a state-of-the-art model, YOLOv4 needs to be at the leading-edge of deep learning. YOLOv4 authors have worked on techniques to improve the accuracy of the model while training and in post-processing.

These techniques are called bag-of-freebies and bag-of-specials. In my previous tutorial, we mostly covered the YOLOv4 backbone, the neck and part of bag-of-specials (BoS) for the detector. So, in this part, we will continue bag-of-specials (BoS) and we'll cover bag-of-freebies (BoF).

Here's the complete list, from the paper:

  • BoS for backbone: Mish activation, Multi-input weighted residual connections (MiWRC), cross-stage partial connections (Talked in the previous tutorial);
  • BoS for detector: Mish activation, SPP-block, SAM-block, PAN path-aggregation block, DIoU-NMS;
  • BoF for backbone: CutMix and Mosaic data augmentation, DropBlock regularization, Class label smoothing;
  • BoF for detector: CIoU-loss, CmBN, Self-Adversarial Training, Eliminate grid sensitivity, Using multiple anchors for single ground truth, Cosine annealing scheduler, Optimal hyperparameters, and Random training shapes.

Bag of Specials (BoS) for backbone

Mish activation:

If you would look at the YOLOv4 backbone, you may see that it uses Mish activation function in the backbone. Mish is another activation function with a close similarity to ReLU. As claimed by the paper, Mish can outperform them in many deep networks across different datasets:


Mish is a novel smooth and non-monotonic neural activation function which can be defined as:


Turns out that this activation function shows very promising results. For example, using a Squeeze Excite Network with Mish (on CIFAR-100 dataset) resulted in an increase in Top-1 test accuracy by 0.494% and 1.671% as compared to the same network with Swish and ReLU respectively.

Multi-input weighted residual connections (MiWRC):

In recent years, researchers have paid a lot of attention to feature maps and what will be fed into the CNN layer. Almost all recent object detection models start to break away from the tradition that only the previous layer is used for the next layer.

How layers are connected in object detection becomes more and more important now. I already mentioned that in my previous tutorial while talking about FPN and PAN. Now I will show the diagram (d) below that shows another neck design called BiFPN that has better accuracy and efficiency trade-offs according to the BiFPN paper:


In YOLOv4 paper, authors compare its performance with the EfficientDet which is considered as one of the state-of-the-art technology by YOLOv4. As shown below, EfficientDet uses the EfficientNet as the backbone feature extractor and BiFPN as the neck:


As a reference, the diagram below is the architecture for the EfficientNet which builds on the MBConv layers that compose of the inverted residual block:


The key advantage of this is, that it requires much lower computation resources than the traditional convolution layer.

In many ML and DL problems, we learn a low-dimensional representation of the input. We extract the core information of the data by creating an “information” bottleneck. That forces us to discover the most important information which is the core principle of learning. Following this principle, an Inverted residual block takes a low-dimensional representation as input and manipulate it with convolution (linear operation) and non-linear operations. But there is a major issue on the non-linear parts like ReLU. Non-linear operations stretch or compress regions un-proportionally. In such compression, inputs may map to the same regions/points. For example, ReLU may collapse the channel in this low dimensional space and inevitably lose information.

To address that, we can temporarily expand the dimension (the number of channels). With the hope that we have lots of channels now and information may still be preserved in some channels after the non-linear operations. Here is the detail of an inverted residual block:


As shown above, the low-dimensional representation is first expanded to tk channels. Then, it is filtered with lightweight 3 × 3 depthwise convolution. Features are then subsequently reduced back to a low-dimensional at the end of the module. Non-linear operations are added when it remains in the higher-dimensional space:


A residual connection is added from the beginning to the end of the module. The diagram on the left is the traditional residual block and the right is the described inverted residual block:


The main contribution of EfficientDet on YOLOv4 paper is the Multi-input weighted residual connections. In the EfficientDet paper, it observes that different input features are at different resolutions and it contributes to the output feature unequally. But I already mentioned in my previous tutorial that we add those features equally, instead of weighting input features differently.

Bag of Specials (BoS) for detector

I already talked about Mish activation above and I covered SPP, SAM and PAN path-aggregation blocks in my previous tutorial, so I don’t want to repeat my self. So, I will take a brief look at DIoU-NMS.


We already should have heard about NMS and that they filter out other boundary boxes that predict the same object and retains one with the highest confidence:


DIoU (will cover it later together with CIoU-loss) is employed as a factor in non-maximum suppression (NMS). This method takes IoU and the distance between the central points of two bounding boxes when suppressing redundant boxes. This makes it more robust for the cases with occlusions.

Bag of Freebies (BoS) for backbone

CutMix data augmentation:

Best way to explain CutMix data augmentation is with the following image from CutMix paper:


From the image above we can see that Cutout data augmentation removes a region of an image. This forces the model not to be overconfident on specific features in making classifications. However, a portion of the image is filled with useless information and this is a waste. In CutMix a portion of an image is cut-and-paste over another image. The ground truth labels are readjusted proportionally to the area of the patches, e.g. 0.6 like a dog and 0.4 like a cat.

Conceptually, CutMix has a broader view of what an object may compose of. The cutout area forces the model to learn object classification with different sets of features. This avoids overconfidence. Since that area is replaced with another image, the amount of information in the image and the training efficient will not be impacted significantly also. While reading YOLOv4 I first time faces such a technique, that was quite interesting to me.

Mosaic data augmentation:

Another technique is a Mosaic data augmentation method that combines 4 training images into one for training (instead of 2 in CutMix). This enhances the detection of objects outside their normal context. In addition, each mini-batch contains a large variant of image (4×) and therefore, reduces the need for large mini-batch sizes in estimating the mean and the variance:


DropBlock regularization:

We used to know what is Dropout, but in YOLOv4, authors applied fully-connected layer dropoff to force the model to learn from a variety of features instead of being too confident on a few. However, this may not work for convolution layers. Neighboring positions are highly correlated. So even some of the pixels are dropped (the middle diagram below), the spatial information remains detectable. DropBlock regularization builds on a similar concept that works on convolution layers:


So, instead of dropping individual pixels, a block of block_size × block_size of pixels is dropped.

Class label smoothing:

In machine learning or deep learning, we usually use a lot of regularization techniques, such as L1, L2, dropout, etc., to prevent our model from overfitting. In classification problems, sometimes our model would learn to predict the training examples extremely confidently. This is not good for generalization.

Label smoothing adjusts the target upper bound of the prediction to a lower value say 0.9. And it will use this value instead of 1.0 in calculating the loss. This concept mitigates overfitting:


So, label smoothing is a regularization technique for classification problems to prevent the model from predicting the labels too confidently during training and generalizing poorly. More about label smoothing you can read on this paper.

Bag of Freebies (BoS) for detector


A loss function tells us how to adjust weights to reduce total cost. So in situations where we make wrong predictions, we expect it to give us direction on where to move. But this is not happening when we use IoU, when the ground truth box and the prediction do not overlap. Let's consider two predictions (both do not overlap with the ground truth), in this situation the IoU loss function cannot tell which prediction is better, even if one may be closer to the ground truth than the other:


Here is another example from IoU paper:


Generalized IoU (GIoU) fixes above problem by refining the loss as:


But this loss function tends to expand the prediction boundary box first until it is overlapped with the ground truth. Then it shrinks to increase IoU. This process requires more iterations than theoretically needed. To solve this, Distance-IoU Loss (DIoU) was introduced:


where b and bᵍᵗ denote the central points of B and Bᵍᵗ, ρ is the Euclidean distance, and c is the diagonal length of the smallest enclosing box covering the two boxes. And then the DIoU loss function can be defined as:


Finally, Complete IoU Loss (CIoU) was introduced to:

  • Increase the overlapping area of the ground truth box and the predicted box;
  • Minimize their central point distance;
  • Maintain the consistency of the boxes aspect ratio.

This is the final definition from the paper:


CmBN (Cross mini-Batch Normalization):

The original Batch normalization collects the mean and the variance of the samples within a mini-batch to whiten the layer input. However, if the mini-batch size is small, these estimations will be noisy. One solution is to estimate them among many mini-batches. However, as weights are changing in each iteration, the statistics collected under those weights may become inaccurate under the new weight. In Cross-Iteration Batch Normalization (CBM), it estimates those statistics from k previous iterations with the following adjustment:


CmBN is a modified representation of CBN, which assumes that a batch contains four mini-batches.

CmBN (Cross mini-Batch Normalization):

Self-Adversarial Training (SAT) is a data augmentation technique. First, it performs a forward pass on a training sample. Traditionally, in the backpropagation, we adjust the model weights to improve the detector in detecting objects in a specific image. While in SAT, we change the image such that it can degrade the detector performance the most. i.e. it creates an adversarial attack targeted for the current model even though the new image may look visually the same, as in the following example:


Then the model is trained with this new image with the original boundary box and class label. This helps to generalize the model and to reduce overfitting.

Eliminate grid sensitivity:

I couldn’t find a lot of information about this step, but it’s known that old YOLO models do not do a good job of making predictions right around the boundaries of anchor box regions. So, it is useful to define box coordinates slightly differently to avoid this problem. So, this technique was presented in YOLOv4. We know that the boundary box is computed as follows:


For the case bₓ = cₓ and bₓ = cₓ+1, we need tₓ to have a huge negative and positive value, respectively. But we can multiply σ with a scaling factor (>1.0) to make it easier. Here is the source for proposed changes:


Cosine annealing scheduler:

I already using this technique from my YOLOv3 implementation, but let's cover it again. The cosine schedule adjusts the learning rate according to a cosine function. It starts by reducing the large learning rate slowly. Then it reduces the learning rate quickly halfway and finally ends up with a tiny slope in reducing the learning rate:


In the diagram above is shown how the learning rate is decay (learning rate warmup is also applied) and its impact on the mAP. It may be not very obvious, but this method has more constant progress rather than plateau for a long while before making progress again.

Optimal hyperparameters:

I can’t find a lot of information related to this topic, but as I understood authors offered to select hyperparameters by using genetic algorithms. For example, we select 100 sets of hyperparameters randomly. Then, we use them to train 100 models. Later, we select the top 10 performed models. For each selected model, then it’s created 10 slightly mutated hyperparameters according to its original. And after that, we retrain the models with the new hyperparameters and select the best models again. As we keep the iterations, we should find the best set of hyperparameters. Alternatively, we can start with the default hyperparameters and then we start the mutations. But anyway this method takes an enormous amount of time…

Random training shapes:

I can’t find a lot of information about this part, but it’s known that most single-stage object detectors are trained with a fixed input image shape. To improve generalization, the authors offered to train the model with different image sizes. (Multi-Scale Training in YOLO)


So in summary, the authors offered a state-of-the-art detector which is faster in terms of Frames per Second (FPS) and more accurate on MS COCO AP50:::95 and AP50 than all available alternative detectors. Also, what is nice that YOLOv4 can be trained and used on a conventional GPU with 8–16GB VRAM which are broadly available. I must say a big well done to the authors for having verified a large number of features and selected for use such as them for improving the accuracy of both the classifier and the detector. These features can be used as best-practice for future studies and developments. I also implemented YOLOv4 architecture on top of my YOLOv3 implementation, to run model within TensorFlow 2. But sadly I didn’t have time to implement all these Bag-Of-Freebies to improve the training process… Maybe later I’ll find time to do that, but now I leave it as it is. So, you can clone my GitHub repository and test it out! How to train it written in the readme file.