Yolo v3 with TensorFlow 2

Posted May 2 by Rokas Balsys

##### Yolo v3 Introduction to object detection with TensorFlow 2

When I started learning YOLO v3, I found it really difficult to understand both the concept and the implementation. Even though there are tons of blog posts and GitHub repositories about it, most of them present it through complex architectures.

I am not going to cover how YOLO works in theory step by step. Instead, I’ll cover in more detail the parts from my past tutorial, along with the parts I missed last time. If you are interested, you can check that tutorial, where I tried to explain the whole theory. I implemented it in TensorFlow 1.15, but more and more people write to me about errors because they try to run my code with TF 2.0 or above, so I decided it’s time to write a YOLO v3 implementation for TensorFlow 2.1.

Based on that experience, I will try to write the code in this tutorial so that it is easy and reusable for beginners who have just started learning object detection. Without over-complicating things, this tutorial will let you implement YOLO v3 in TensorFlow 2.1. In this tutorial series, I will also show how to train the YOLO model on your own custom dataset, locally or even on Google Colab (I received a lot of requests for this).

#### Prerequisites

• Familiarity with Python 3;
• Understanding of object detection and Convolutional Neural Networks (CNN);
• Basic TensorFlow usage.

##### Introduction to YOLO algorithm

In 2015, Redmon J. et al. proposed the YOLO network, which is characterized by combining candidate box generation and classification/regression into a single step. When predicting, the feature map is divided into 7x7 cells and a prediction is made for each cell, which greatly reduces the computational complexity and speeds up detection, with a frame rate of up to 45 fps!

A year later, Redmon J. proposed YOLOv2. Compared with the previous generation, the mAP on the VOC2007 test set increased from 63.4% to 78.6%. However, because a cell is only responsible for predicting a single object, recognition was still not good enough when objects overlap.

Finally, in April 2018, the author released the third version, YOLOv3. The mAP-50 on the COCO dataset increased from 44.0% for YOLOv2 to 57.9%. For comparison, RetinaNet reaches 61.1% mAP, but with an input size of 500 × 500 its detection speed is about 98 ms/frame, while YOLOv3 has a detection speed of 29 ms/frame with an input size of 416 × 416.

The above picture is enough to show that YOLOv3 achieves very high accuracy while still ensuring speed.

##### YOLO v3 idea

In the YOLO algorithm, the author treats object detection as a regression problem and divides the image into an S × S grid. If the center of a target falls into a grid cell, that cell is responsible for detecting the target.

Each grid cell outputs a bounding box, a confidence score, and a class probability map, where:

• The bounding box contains 4 values: x, y, w, h. (x, y) represents the center of the box and (w, h) represents its width and height;
• Confidence indicates the probability that the prediction box contains an object, which is actually the IoU value between the prediction box and the ground-truth box;
• The class probability indicates the probability of each class for the object. YOLOv3 treats every class as a separate two-class (binary) problem, using an independent logistic classifier per class.
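To make the "responsible cell" rule more concrete, here is a minimal sketch that maps an object center (in pixels) to its grid cell. The function name `responsible_cell` and the sample numbers are my own and only serve as an illustration:

```python
def responsible_cell(center_x, center_y, image_size, grid_size):
    """Return (row, col) of the grid cell that contains the object center."""
    cell_size = image_size / grid_size          # pixels covered by one cell
    col = int(center_x // cell_size)
    row = int(center_y // cell_size)
    # Clamp in case the center lies exactly on the right or bottom border
    col = min(col, grid_size - 1)
    row = min(row, grid_size - 1)
    return row, col


# A 416 x 416 image split into a 13 x 13 grid: every cell covers 32 x 32 pixels
print(responsible_cell(center_x=200, center_y=100, image_size=416, grid_size=13))
# (3, 6)
```

Only the cell in row 3, column 6 would be asked to predict this object; all other cells ignore it.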

##### YOLO v3 architecture

For those of you who don’t have a lot of experience with YOLO v3 or other object detectors, I recommend reading my past tutorials to understand how the algorithm actually works.

As its name suggests, YOLO (You Only Look Once) applies a single forward pass of a neural network to the whole image and predicts the bounding boxes together with their class probabilities. This technique makes YOLO quite fast without losing much accuracy.

As mentioned in the original paper, YOLOv3 uses a backbone with 53 convolutional layers called Darknet-53, shown in the following figure, which is mainly composed of Convolutional and Residual structures. Note that the last three layers (Avgpool, Connected, and Softmax) are used for classification training on the ImageNet dataset. When we use Darknet-53 to extract features from the picture, these three layers are not used.

Darknet-53 implemented in code:

def darknet53(input_data):
    input_data = convolutional(input_data, (3, 3,  3,  32))
    input_data = convolutional(input_data, (3, 3, 32,  64), downsample=True)

    for i in range(1):
        input_data = residual_block(input_data,  64,  32, 64)

    input_data = convolutional(input_data, (3, 3,  64, 128), downsample=True)

    for i in range(2):
        input_data = residual_block(input_data, 128,  64, 128)

    input_data = convolutional(input_data, (3, 3, 128, 256), downsample=True)

    for i in range(8):
        input_data = residual_block(input_data, 256, 128, 256)

    route_1 = input_data
    input_data = convolutional(input_data, (3, 3, 256, 512), downsample=True)

    for i in range(8):
        input_data = residual_block(input_data, 512, 256, 512)

    route_2 = input_data
    input_data = convolutional(input_data, (3, 3, 512, 1024), downsample=True)

    for i in range(4):
        input_data = residual_block(input_data, 1024, 512, 1024)

    return route_1, route_2, input_data
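
One way to see where the number 53 comes from is to count the layers in the function above. This is only a counting sketch (the constant and function names are mine), assuming each residual block holds two convolutions, as in `residual_block`:

```python
# Number of residual blocks in each of the five stages of darknet53()
RESIDUAL_BLOCKS_PER_STAGE = [1, 2, 8, 8, 4]


def count_darknet53_convs(blocks_per_stage):
    conv_layers = 1                      # first 3x3 convolution with 32 filters
    for blocks in blocks_per_stage:
        conv_layers += 1                 # stride-2 downsampling convolution
        conv_layers += 2 * blocks        # each residual block: 1x1 conv + 3x3 conv
    return conv_layers


conv_layers = count_darknet53_convs(RESIDUAL_BLOCKS_PER_STAGE)
print(conv_layers)        # 52 convolutions used for feature extraction
print(conv_layers + 1)    # 53 once the fully connected classification layer is counted
```

The feature extractor itself therefore has 52 convolutions; the 53rd layer is the Connected layer that we drop when using Darknet-53 as a backbone.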


How strong is Darknet-53? Looking at the picture below, the author’s comparison shows that Darknet-53 is comparable in accuracy to the most advanced classifiers, while needing fewer floating-point operations and running the fastest. Compared with ResNet-101, Darknet-53 is 1.5 times faster; ResNet-152 has similar performance, but takes more than twice as long.

In addition, Darknet-53 also achieves the highest measured floating-point operations per second, which means that the network structure makes better use of the GPU, making it more efficient and faster.

But from the Darknet-53 architecture figure above, it’s nearly impossible to understand or imagine how YOLO v3 works, so here is another figure with the YOLO v3 architecture:

From the above architecture image, you can see that YOLO makes detections at 3 different scales in order to accommodate different object sizes, using strides of 32, 16, and 8. This means that if we feed an input image of size 416 x 416, YOLOv3 will make detections on grids of 13 x 13, 26 x 26, and 52 x 52.

For the first scale, YOLOv3 downsamples the input image into 13 x 13 and makes a prediction at the 82nd layer. The 1st detection scale yields a 3-D tensor of size 13 x 13 x 255.

After that, YOLOv3 takes the feature map from layer 79 and applies one convolutional layer before upsampling it by a factor of 2 to have a size of 26 x 26. This upsampled feature map is then concatenated with the feature map from layer 61. The concatenated feature map is then subjected to a few more convolutional layers until the 2nd detection scale is performed at layer 94. The second prediction scale produces a 3-D tensor of size 26 x 26 x 255.

The same design is applied one more time to predict the 3rd scale. The feature map from layer 91 is passed through one convolutional layer, upsampled, and then concatenated with the feature map from layer 36. The final prediction is made at layer 106, yielding a 3-D tensor of size 52 x 52 x 255. In summary, YOLO predicts at 3 different scales, so if we feed an image of size 416 x 416, it produces 3 output tensors of shapes 13 x 13 x 255, 26 x 26 x 255, and 52 x 52 x 255.
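
The three output shapes can be derived with a few lines of arithmetic. The helper below is a small sketch of my own (not part of the project code), assuming the 80 COCO classes and 3 anchors per scale, which is where 255 = 3 x (80 + 5) comes from:

```python
def yolo_output_shapes(input_size, num_classes, strides=(32, 16, 8), anchors_per_scale=3):
    """Return the (height, width, channels) of each detection scale."""
    # Every anchor predicts 4 box values + 1 confidence + one score per class
    channels = anchors_per_scale * (num_classes + 5)
    return [(input_size // s, input_size // s, channels) for s in strides]


print(yolo_output_shapes(416, 80))
# [(13, 13, 255), (26, 26, 255), (52, 52, 255)]
```

With a custom dataset the channel count changes accordingly, for example 3 x (2 + 5) = 21 channels for two classes.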

But still, looking at the Darknet-53 and YOLO v3 structures, we can’t fully understand all the layers. This is why I have one more figure with the overall architecture of the YOLOv3 network. In the picture below, we can see that the input picture of size 416x416 splits into 3 branches after entering the Darknet-53 network. These branches undergo a series of convolutions, upsampling, merging, and other operations. Three feature maps of different sizes are finally obtained, with shapes of [13, 13, 255], [26, 26, 255] and [52, 52, 255]:

Implemented in code:

import tensorflow as tf

def YOLOv3(input_layer, NUM_CLASS):
    # After the input layer enters the Darknet-53 network, we get three branches
    route_1, route_2, conv = darknet53(input_layer)
    # See the orange module (DBL) in the figure above: five convolution operations in total
    conv = convolutional(conv, (1, 1, 1024,  512))
    conv = convolutional(conv, (3, 3,  512, 1024))
    conv = convolutional(conv, (1, 1, 1024,  512))
    conv = convolutional(conv, (3, 3,  512, 1024))
    conv = convolutional(conv, (1, 1, 1024,  512))
    conv_lobj_branch = convolutional(conv, (3, 3, 512, 1024))

    # conv_lbbox is used to predict large-sized objects, shape = [None, 13, 13, 255]
    conv_lbbox = convolutional(conv_lobj_branch, (1, 1, 1024, 3*(NUM_CLASS + 5)), activate=False, bn=False)

    conv = convolutional(conv, (1, 1,  512,  256))
    # upsample here uses nearest neighbor interpolation, which has the advantage that the
    # upsampling process does not need to learn, thereby reducing the number of network parameters
    conv = upsample(conv)

    conv = tf.concat([conv, route_2], axis=-1)
    conv = convolutional(conv, (1, 1, 768, 256))
    conv = convolutional(conv, (3, 3, 256, 512))
    conv = convolutional(conv, (1, 1, 512, 256))
    conv = convolutional(conv, (3, 3, 256, 512))
    conv = convolutional(conv, (1, 1, 512, 256))
    conv_mobj_branch = convolutional(conv, (3, 3, 256, 512))

    # conv_mbbox is used to predict medium-sized objects, shape = [None, 26, 26, 255]
    conv_mbbox = convolutional(conv_mobj_branch, (1, 1, 512, 3*(NUM_CLASS + 5)), activate=False, bn=False)

    conv = convolutional(conv, (1, 1, 256, 128))
    conv = upsample(conv)

    conv = tf.concat([conv, route_1], axis=-1)
    conv = convolutional(conv, (1, 1, 384, 128))
    conv = convolutional(conv, (3, 3, 128, 256))
    conv = convolutional(conv, (1, 1, 256, 128))
    conv = convolutional(conv, (3, 3, 128, 256))
    conv = convolutional(conv, (1, 1, 256, 128))
    conv_sobj_branch = convolutional(conv, (3, 3, 128, 256))

    # conv_sbbox is used to predict small-sized objects, shape = [None, 52, 52, 255]
    conv_sbbox = convolutional(conv_sobj_branch, (1, 1, 256, 3*(NUM_CLASS + 5)), activate=False, bn=False)

    return [conv_sbbox, conv_mbbox, conv_lbbox]
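
If you wonder where the input channel counts 768 and 384 in the function above come from, the sketch below reproduces the shape arithmetic of one upsample-and-concatenate step. The helper `merge_branch` is hypothetical and only tracks shapes, it does not touch any tensors:

```python
def merge_branch(size, channels, route_channels):
    """Shape after a 2x nearest-neighbor upsample followed by a channel concat."""
    upsampled_size = size * 2                      # upsampling doubles height and width
    merged_channels = channels + route_channels    # concat on the last axis adds channels
    return upsampled_size, merged_channels


# Medium branch: 13x13x256 is upsampled and joined with route_2 (26x26x512)
print(merge_branch(13, 256, 512))   # (26, 768)
# Small branch: 26x26x128 is upsampled and joined with route_1 (52x52x256)
print(merge_branch(26, 128, 256))   # (52, 384)
```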


##### Residual module

The most significant feature of the residual module is the use of a shortcut mechanism (somewhat similar to a short circuit in an electrical circuit) to alleviate the vanishing gradient problem caused by increasing the depth of the neural network, thereby making the network easier to optimize. It uses identity mapping to establish a direct channel between input and output so that the network can concentrate on learning the residual between them.

Implemented in code:

def residual_block(input_layer, input_channel, filter_num1, filter_num2):
    short_cut = input_layer
    conv = convolutional(input_layer, filters_shape=(1, 1, input_channel, filter_num1))
    conv = convolutional(conv       , filters_shape=(3, 3, filter_num1,   filter_num2))

    residual_output = short_cut + conv
    return residual_output
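
To see why the shortcut makes learning easier, here is a toy sketch on plain Python lists instead of tensors. The function `residual_add` and the lambda "layers" are stand-ins I made up for the two convolutions:

```python
def residual_add(inputs, transform):
    """Return inputs + transform(inputs), element by element."""
    outputs = transform(inputs)
    return [x + f for x, f in zip(inputs, outputs)]


features = [0.5, -1.0, 2.0]

# If the convolutions output zeros, the block passes its input through unchanged
print(residual_add(features, lambda xs: [0.0 for _ in xs]))   # [0.5, -1.0, 2.0]

# Otherwise the block only has to learn a small correction on top of the input
print(residual_add(features, lambda xs: [0.1 * x for x in xs]))
```

In other words, the worst a residual block can do is behave like an identity mapping, which is why stacking many of them does not hurt optimization.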


##### Extract features

To understand the prediction process of YOLO in detail, it is necessary to first understand what feature maps and embeddings (feature vectors) are.

#### Feature map

When we talk about CNNs, we always hear the term feature map, also called feature mapping. Simply put, the input image is convolved with a convolution kernel to obtain image features.

Generally speaking, as a CNN extracts features from the image layer by layer, the number of feature maps (in fact, the number of convolution kernels) increases, the spatial information decreases, and the features become more and more abstract. For example, in the famous VGG16 network, the feature map changes like this:

The feature map gets smaller and smaller in spatial size, but deeper and deeper in channel size. This is characteristic of VGG16.
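
The same progression can be written down in a few lines. This is a sketch assuming the standard VGG16 layout with a 224 x 224 input, where each of the five convolution blocks ends with a 2x2 max pooling; the names are mine:

```python
# (number of 3x3 convolutions, output channels) for the five VGG16 blocks
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def vgg16_feature_map_shapes(input_size=224):
    """Return the feature map shape (height, width, channels) after each block."""
    shapes = []
    size = input_size
    for _, channels in VGG16_BLOCKS:
        size //= 2          # the max pooling at the end of the block halves the size
        shapes.append((size, size, channels))
    return shapes


for shape in vgg16_feature_map_shapes():
    print(shape)
# (112, 112, 64)
# (56, 56, 128)
# (28, 28, 256)
# (14, 14, 512)
# (7, 7, 512)
```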

#### Feature vector

Feature vectors are often mentioned in the field of face recognition. Generally speaking, a feature vector is a feature map that has been turned into a vector by the last fully connected layer. As early as 2006, Hinton, a pioneer of deep learning, published a paper in “Science” in which, for the first time, an autoencoder network was used to extract feature vectors (2D or 3D vectors) from MNIST handwritten digits. It is worth mentioning that this paper also opened the road to the rise of deep learning:

When a CNN extracts features from an image, the resulting feature map generally gets smaller in spatial size and deeper in channel size:

This is related to the mapping of an ROI (region of interest) to the feature map. In the above picture, after an ROI in the original image is mapped through the CNN, its spatial size on the feature map becomes smaller, possibly just a single point, but the channel information of this point is very rich. This information is the feature representation of the image content in the ROI. Since neighboring pixels in an image are closely related, there is great redundancy in space. Therefore, we often eliminate this redundancy by reducing the spatial dimensions and increasing the channel dimension, trying to capture the most essential features in the smallest dimension:

For example, the red ROI in the upper left corner of the original image is mapped by the CNN to just one point in the feature map space, but this point has 85 channels. So, the dimensions of the ROI have changed from the original [32, 32, 3] to 85 dimensions.

This is actually an 85-dimensional feature vector obtained after the CNN performs feature extraction on the ROI. The first 4 dimensions of this feature vector represent the candidate box information, the fifth dimension represents the probability that an object is present, and the remaining 80 dimensions represent the classification probabilities for the 80 categories.
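
As a quick illustration of that layout, the sketch below slices one 85-dimensional vector into its three parts. The function name and the dummy values are placeholders of mine, not real network output:

```python
def split_prediction(vector, num_classes=80):
    """Split one anchor's raw prediction into box, objectness and class scores."""
    assert len(vector) == 5 + num_classes
    box = vector[0:4]              # x, y, w, h
    objectness = vector[4]         # is there an object in this box?
    class_scores = vector[5:]      # one score per category
    return box, objectness, class_scores


# Dummy 85-dimensional vector; the values are placeholders only
prediction = [float(i) for i in range(85)]
box, objectness, class_scores = split_prediction(prediction)
print(len(box), objectness, len(class_scores))   # 4 4.0 80
```

This is the same slicing that the `decode` function further down applies to the whole output tensor at once.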

#### Multi-scale detection

YOLO divides the input image into coarse, medium, and fine grids to enable the prediction of large, medium, and small objects, respectively. If the size of the input picture is 416x416, then the coarse, medium, and fine grid sizes are 13x13, 26x26, and 52x52 respectively. That is, the input is downscaled by 32, 16, and 8 times in length and width, respectively:

#### Dimensions of the Bounding Box

The outputs of the three branches of the YOLOv3 network are sent to the decode function, which decodes the channel information of the feature map. In the following picture, the black dotted box represents the a priori box (anchor), and the blue box represents the prediction box. The dimensions of the bounding box are predicted by applying a log-space transformation to the output and then multiplying by an anchor:

• b (b_w, b_h) denotes the width and height of the prediction box, and p (p_w, p_h) denotes the width and height of the a priori box.
• t (t_x, t_y) represents the offset of the center of the object from the upper left corner of the grid cell, and c (c_x, c_y) represents the coordinates of the upper left corner of the grid cell.
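
Written out, these are the box equations from the YOLOv3 paper, where σ is the sigmoid function and t_w, t_h are the raw width and height outputs of the network:

```latex
\begin{aligned}
b_x &= \sigma(t_x) + c_x \\
b_y &= \sigma(t_y) + c_y \\
b_w &= p_w \, e^{t_w} \\
b_h &= p_h \, e^{t_h}
\end{aligned}
```

The results are in grid units; multiplying by the stride of the corresponding scale converts them to pixels of the input image.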

Implemented in code:

import tensorflow as tf

# STRIDES and ANCHORS are configuration constants defined elsewhere in the project
def decode(conv_output, NUM_CLASS, i=0):
    # where i = 0, 1 or 2 to correspond to the three grid scales
    conv_shape       = tf.shape(conv_output)
    batch_size       = conv_shape[0]
    output_size      = conv_shape[1]

    conv_output = tf.reshape(conv_output, (batch_size, output_size, output_size, 3, 5 + NUM_CLASS))

    conv_raw_dxdy = conv_output[:, :, :, :, 0:2] # offset of center position
    conv_raw_dwdh = conv_output[:, :, :, :, 2:4] # prediction box width and height offset
    conv_raw_conf = conv_output[:, :, :, :, 4:5] # confidence of the prediction box
    conv_raw_prob = conv_output[:, :, :, :, 5: ] # category probability of the prediction box

    # Next, we need to draw the grid, where output_size is equal to 13, 26 or 52
    y = tf.range(output_size, dtype=tf.int32)
    y = tf.expand_dims(y, -1)
    y = tf.tile(y, [1, output_size])
    x = tf.range(output_size, dtype=tf.int32)
    x = tf.expand_dims(x, 0)
    x = tf.tile(x, [output_size, 1])

    xy_grid = tf.concat([x[:, :, tf.newaxis], y[:, :, tf.newaxis]], axis=-1)
    xy_grid = tf.tile(xy_grid[tf.newaxis, :, :, tf.newaxis, :], [batch_size, 1, 1, 3, 1])
    xy_grid = tf.cast(xy_grid, tf.float32)

    # Calculate the center position of the prediction box:
    pred_xy = (tf.sigmoid(conv_raw_dxdy) + xy_grid) * STRIDES[i]
    # Calculate the width and height of the prediction box:
    pred_wh = (tf.exp(conv_raw_dwdh) * ANCHORS[i]) * STRIDES[i]

    pred_xywh = tf.concat([pred_xy, pred_wh], axis=-1)
    pred_conf = tf.sigmoid(conv_raw_conf) # predicted confidence of the box
    pred_prob = tf.sigmoid(conv_raw_prob) # predicted probability of each category

    # Join box coordinates, confidence and category probabilities into one tensor
    return tf.concat([pred_xywh, pred_conf, pred_prob], axis=-1)
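
To check the formulas by hand, here is a scalar version of the same decoding for a single prediction, written with the standard `math` module only. The function `decode_box` and the sample numbers are my own illustration; the anchor is given in grid units, as the `decode` function above expects:

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def decode_box(t_x, t_y, t_w, t_h, cell_x, cell_y, anchor_w, anchor_h, stride):
    """Decode one raw prediction into (center x, center y, width, height) in pixels."""
    center_x = (sigmoid(t_x) + cell_x) * stride
    center_y = (sigmoid(t_y) + cell_y) * stride
    width = math.exp(t_w) * anchor_w * stride
    height = math.exp(t_h) * anchor_h * stride
    return center_x, center_y, width, height


# Raw outputs of zero: the center sits in the middle of cell (6, 3) on the 13 x 13 grid
# and the box has exactly the size of the anchor (illustrative anchor values)
print(decode_box(0.0, 0.0, 0.0, 0.0, cell_x=6, cell_y=3,
                 anchor_w=3.625, anchor_h=2.8125, stride=32))
# (208.0, 112.0, 116.0, 90.0)
```

Because of the sigmoid, the predicted center can never leave its own grid cell, while the exponential keeps the width and height positive.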


#### NMS processing

Non-Maximum Suppression, as the name implies, suppresses elements that are not maximal. NMS removes bounding boxes that overlap heavily with another box and have a lower score. The NMS algorithm is straightforward, and the iterative process is as follows:

• Process 1: Determine whether the number of bounding boxes is greater than 0; if not, end the iteration;
• Process 2: Select the bounding box A with the highest score, keep it as a result, and remove it from the candidates;
• Process 3: Calculate the IoU of bounding box A with all remaining bounding boxes, remove those whose IoU value is higher than the threshold, and repeat the above steps.

Implemented in code:

import numpy as np

# bboxes_iou is a helper defined elsewhere in the project
def nms(bboxes, iou_threshold, sigma=0.3, method='nms'):
    # bboxes: array with rows of (xmin, ymin, xmax, ymax, score, class)
    classes_in_img = list(set(bboxes[:, 5]))
    best_bboxes = []

    for cls in classes_in_img:
        cls_mask = (bboxes[:, 5] == cls)
        cls_bboxes = bboxes[cls_mask]
        # Process 1: Determine whether the number of bounding boxes is greater than 0
        while len(cls_bboxes) > 0:
            # Process 2: Select the bounding box A with the highest score
            max_ind = np.argmax(cls_bboxes[:, 4])
            best_bbox = cls_bboxes[max_ind]
            best_bboxes.append(best_bbox)
            cls_bboxes = np.concatenate([cls_bboxes[: max_ind], cls_bboxes[max_ind + 1:]])
            # Process 3: Calculate the IoU of bounding box A with all remaining bounding boxes
            # and remove those whose IoU value is higher than the threshold
            iou = bboxes_iou(best_bbox[np.newaxis, :4], cls_bboxes[:, :4])
            weight = np.ones((len(iou),), dtype=np.float32)

            assert method in ['nms', 'soft-nms']

            if method == 'nms':
                iou_mask = iou > iou_threshold
                weight[iou_mask] = 0.0

            if method == 'soft-nms':
                weight = np.exp(-(1.0 * iou ** 2 / sigma))

            cls_bboxes[:, 4] = cls_bboxes[:, 4] * weight
            score_mask = cls_bboxes[:, 4] > 0.
            cls_bboxes = cls_bboxes[score_mask]

    return best_bboxes
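
The function above relies on NumPy and on the `bboxes_iou` helper, which is not shown in this text. If you just want to experiment with the idea, here is a simplified, dependency-free sketch of plain greedy NMS for a single class. It is not the project’s implementation; `iou` and `simple_nms` are names I made up for this example:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax, ...)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def simple_nms(boxes, iou_threshold):
    """Greedy NMS for one class; each box is (xmin, ymin, xmax, ymax, score)."""
    remaining = sorted(boxes, key=lambda box: box[4], reverse=True)
    kept = []
    while remaining:                       # Process 1: any boxes left?
        best = remaining.pop(0)            # Process 2: take the highest score
        kept.append(best)
        # Process 3: drop boxes that overlap the selected box too much
        remaining = [box for box in remaining if iou(best, box) <= iou_threshold]
    return kept


boxes = [
    (10, 10, 110, 110, 0.9),
    (20, 20, 120, 120, 0.8),    # overlaps the first box heavily (IoU of about 0.68)
    (200, 200, 300, 300, 0.7),  # far away from the others
]
print(simple_nms(boxes, iou_threshold=0.45))
# [(10, 10, 110, 110, 0.9), (200, 200, 300, 300, 0.7)]
```

The second box is suppressed because it covers almost the same area as the first one with a lower score, while the third box survives because it does not overlap anything.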


In the end, all the bounding boxes selected as A are what we want. Let’s take a simple example: suppose the five bounding boxes and their scores are A: 0.9, B: 0.05, C: 0.8, D: 0.6, E: 0.5, the score threshold is set to 0.3, and we take an IoU threshold of, say, 0.5. The calculation steps are as follows:

• Step 1: The number of bounding boxes is 5, satisfying the iteration condition;
• Step 2: Select the bounding box A, which has the highest score, keep it, and remove it from the candidates;
• Step 3: Calculate the IoU of bounding box A with the other four bounding boxes. Assume that the obtained IoU values are B: 0.2, C: 0.7, D: 0.01, E: 0.08; bounding box C exceeds the IoU threshold and is removed;
• Step 4: Now only the bounding boxes B, D, E are left, satisfying the iteration condition;
• Step 5: Select the bounding box D, which now has the highest score, keep it, and remove it from the candidates;
• Step 6: Calculate the IoU of bounding box D with the other two bounding boxes. Assume that the obtained IoU values are B: 0.06, E: 0.8; bounding box E exceeds the IoU threshold and is removed;
• Step 7: Now only the bounding box B is left, satisfying the iteration condition;
• Step 8: Select the bounding box B, the only one left, keep it, and remove it from the candidates;
• Step 9: At this point, the number of bounding boxes is zero, and the iteration ends.

Finally, we get the bounding boxes A, B, and D, but the score of bounding box B is very low (below the score threshold), which indicates that it contains no object, so it should be discarded. In the YOLO algorithm, there are two ways of applying NMS: either all prediction boxes are processed together, or NMS is performed separately for each category of prediction boxes. In the latter case, the same prediction box may be kept for both category A and category B, which suits the case where multiple objects exist in a small cell at the same time.

So, up to this point we have covered all the theory needed for simple use of YOLO v3. In the next tutorial, I’ll cover the other functions needed for training a custom object detector; for now, we can try to implement a simple detection example.

##### Implementation

Since this is only the first tutorial, and to avoid making it too difficult, I’ll do simple YOLOv3 object detection. To make it work with TensorFlow 2.1, we need to do the following steps:

• Construct and compile Yolov3 model in TensorFlow and Keras;
• Transfer weights from original Darknet weights to constructed model;
• Test object detection with image and video.

I will not show the full code in this text tutorial. If you are interested in trying it, you can download it from GitHub.

First, you need to get my project:
git clone https://github.com/pythonlessons/TensorFlow-2.x-YOLOv3.git

Next, you may want to install the required Python packages:
pip install -r ./requirements.txt