YOLOv4 with TensorFlow 2

Posted September 23 by Rokas Balsys


Make YOLO do object detection faster with Multiprocessing

This tutorial is a brief introduction to multiprocessing in Python. At the end of it, I will show how I use multiprocessing to make TensorFlow YOLO object detection work faster.

What is multiprocessing? Multiprocessing refers to the ability of a system to support more than one processor at the same time. Applications in a multiprocessing system are broken into smaller routines that run independently. The operating system allocates these routines to the processors, improving the performance of the system.

[Image: processes distributed across multiple processors]

Why multiprocessing? Consider a computer system with a single processor. If it is assigned several processes at the same time, it will have to interrupt each task and switch briefly to another, to keep all of the processes going.

However, the default Python interpreter was designed with simplicity in mind and has a thread-safe mechanism, the so-called "GIL" (Global Interpreter Lock). To prevent conflicts between threads, it executes only one thread's statements at a time (so-called serial processing, or single-threading). This is how a usual Python script works: we do our tasks linearly, one after another.

In this tutorial, we will see how we can spawn multiple subprocesses to avoid some of the GIL's disadvantages.
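
For illustration, here is a minimal sketch (my own, not from the original post) of spawning a subprocess with the standard library:

```python
import multiprocessing

def worker(name):
    # Runs in a separate process, with its own interpreter and its own GIL
    print(f"Hello from {name}")

if __name__ == '__main__':
    p = multiprocessing.Process(target=worker, args=("subprocess-1",))
    p.start()  # spawn the child process
    p.join()   # wait for it to finish
```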

The multiprocessing module in Python's Standard Library has a lot of powerful features. If you want to read about all the tips, tricks, and details, I would recommend using the official documentation as an entry point. In this tutorial, I will only show concrete examples related to my YOLO object detection tasks.

So there will be two parts:

  • Multiprocessing communication between processes;
  • Using Multiprocessing with YOLO Object Detection in pre-processing and post-processing.

Multiprocessing communication between processes

Effective use of multiple processes usually requires communication between them, so that work can be divided between processes and results can be aggregated. The multiprocessing module supports two types of communication channels between processes: Queue and Pipe.

Queue

If you have basic knowledge of computer data structures, you probably already know about the queue.

[Image: a First-In-First-Out queue]

Python's multiprocessing module provides a Queue class that is exactly a First-In-First-Out data structure. It can store any Python object (though simple ones are best) and is extremely useful for sharing data between processes.

Queues are especially useful when passed as a parameter to a Process' target function, enabling that Process to consume the data. Using the put() method we can insert data into the queue, and using get() we can retrieve items from it. See the following code for a quick communication example:
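
The original snippet isn't reproduced here, so below is a minimal sketch of how such a timing test might look; the writer/reader function names and the None sentinel are my own choices:

```python
import time
import multiprocessing

def writer(queue):
    # Put the numbers 0..1000000 into the queue, then a sentinel value
    for i in range(1000001):
        queue.put(i)
    queue.put(None)  # tells the reader there is nothing more to read

def reader(queue):
    # Keep reading until the sentinel arrives
    while queue.get() is not None:
        pass

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    start = time.time()
    p1 = multiprocessing.Process(target=writer, args=(queue,))
    p2 = multiprocessing.Process(target=reader, args=(queue,))
    p1.start(); p2.start()
    p1.join(); p2.join()
    print("Queue is now empty!", time.time() - start)
```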

With the above code, we start two processes: one that puts data into the queue and one that reads it back. My goal right now is to check how long it takes to put the numbers from 0 to 1000000 into the queue and then read them all back.

The results were the following:
Queue is now empty! 10.352750539779663

Because we will be working with images, let's try sending 100 random images with the following code:
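
Again a sketch under the same assumptions, this time pushing 100 random 416x416x3 NumPy arrays through the queue:

```python
import time
import multiprocessing
import numpy as np

def writer(queue):
    # Put 100 random 416x416x3 "images" into the queue, then a sentinel
    for _ in range(100):
        queue.put(np.random.random((416, 416, 3)))
    queue.put(None)

def reader(queue):
    # Drain the queue until the sentinel arrives
    while queue.get() is not None:
        pass

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    start = time.time()
    p1 = multiprocessing.Process(target=writer, args=(queue,))
    p2 = multiprocessing.Process(target=reader, args=(queue,))
    p1.start(); p2.start()
    p1.join(); p2.join()
    print("Queue is now empty!", time.time() - start)
```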

The results were the following:
Queue is now empty! 1.0990705490112305

Pipe

A pipe can have only two endpoints. Hence, it is preferred over the queue when only two-way communication is required.

3.png

The multiprocessing module provides the Pipe() function, which returns a pair of connection objects connected by a pipe. The two connection objects returned by Pipe() represent the two ends of the pipe. Each connection object has send() and recv() methods (among others).

Same as before, we'll modify the Queue code to use a Pipe. See the following code for a quick communication example:
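
As before, the exact snippet isn't shown here; a sketch of the same test rewritten around Pipe(), with the sentinel convention kept from the Queue version:

```python
import time
import multiprocessing

def writer(conn):
    # Send the numbers 0..1000000 through the pipe, then a sentinel
    for i in range(1000001):
        conn.send(i)
    conn.send(None)
    conn.close()

def reader(conn):
    # Keep receiving until the sentinel arrives
    while conn.recv() is not None:
        pass

if __name__ == '__main__':
    parent_conn, child_conn = multiprocessing.Pipe()
    start = time.time()
    p1 = multiprocessing.Process(target=writer, args=(parent_conn,))
    p2 = multiprocessing.Process(target=reader, args=(child_conn,))
    p1.start(); p2.start()
    p1.join(); p2.join()
    print("Pipe is now empty!", time.time() - start)
```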

The results were the following:
Pipe is now empty! 9.079113721847534

Same as before, because we will be working with images, let's test sending 100 random images through the pipe with the following code:
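
And a sketch of the image test over the pipe, under the same assumptions:

```python
import time
import multiprocessing
import numpy as np

def writer(conn):
    # Send 100 random 416x416x3 "images" through the pipe, then a sentinel
    for _ in range(100):
        conn.send(np.random.random((416, 416, 3)))
    conn.send(None)
    conn.close()

def reader(conn):
    # Drain the pipe until the sentinel arrives
    while conn.recv() is not None:
        pass

if __name__ == '__main__':
    parent_conn, child_conn = multiprocessing.Pipe()
    start = time.time()
    p1 = multiprocessing.Process(target=writer, args=(parent_conn,))
    p2 = multiprocessing.Process(target=reader, args=(child_conn,))
    p1.start(); p2.start()
    p1.join(); p2.join()
    print("Pipe is now empty!", time.time() - start)
```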

The results were the following:
Pipe is now empty! 1.6539678573608398

So, now we have some results; let's put them into one table so we can compare them:

                 Queue      Pipe
  Simple data    10.35 s    9.08 s
  NumPy data     1.10 s     1.65 s

Here, "Simple data" was the list of numbers in the 0-1000000 range, and "NumPy data" was 100 random 416x416x3 images. The results were quite interesting, and it's hard to say why they turned out this way. When sending simple data, the Pipe was around 14% faster than the Queue. But when sending random image data the same way, the Queue was faster by around 50%. I can't explain why there is such a difference between the results, but it doesn't really matter; I will simply choose the method depending on my data type.


Using Multiprocessing with YOLO Object Detection in pre-processing and post-processing

Usually, we want to use multiprocessing to make tasks finish faster. In my YOLO object detection implementation, this is relevant for a few methods: video detection, real-time detection, and object tracking. It's not relevant for image detection, because we usually run it only once, and that doesn't take much time. But for the other methods I mentioned, multiprocessing can save us a large amount of time. When using multiprocessing in real-time tasks, we may get more frames per second and much smoother results.

So, I will work on the video detection process. We can apply this process to both video and real-time detection, because the steps are actually the same; only the source is different. I drew the process as it looks right now, when we execute one function at a time while processing a video recording:

[Image: the sequential video detection loop: read frame, detect objects, post-process, draw boxes, show frame]

As you can see from the above image, first we need a frame in which we want to detect objects. After detection, we must post-process the detection results, then draw the results (bounding boxes) on the frame, and finally show the processed frame before repeating the same steps with the next frame. We run this loop until we reach the last frame.
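
In code, that sequential loop looks roughly like this; detect_objects, postprocess, and draw_boxes are hypothetical stand-ins for the real YOLO helper functions, which are not shown here:

```python
import cv2

def detect_objects(frame):
    # Hypothetical stand-in for YOLO inference on one frame
    return []

def postprocess(predictions):
    # Hypothetical stand-in for filtering + non-max suppression
    return predictions

def draw_boxes(frame, boxes):
    # Hypothetical stand-in for drawing bounding boxes on the frame
    return frame

cap = cv2.VideoCapture("test.mp4")
while True:
    ret, frame = cap.read()
    if not ret:
        break  # reached the last frame
    predictions = detect_objects(frame)
    boxes = postprocess(predictions)
    frame = draw_boxes(frame, boxes)
    cv2.imshow("detection", frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()
```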

Problem. Usually, we use multiprocessing for math tasks, where we can divide the calculations between processes. But here it's harder, because we can only do the next step once the previous step has finished its job. Also, YOLO detection takes most of the time (around 80%), and we can't do anything about that unless we had several GPUs, which usually no one has… lol. So my idea was to give each of these functions its own process. This means that YOLO detection would not be waiting for the other tasks to finish their jobs. It's hard to explain in words, so again I drew what I mean:

[Image: the video detection pipeline divided into parallel processes that share data through queues]

Actually, this is not true parallelism, but it's the best I could come up with for this task. As you can see, each process has its own results queue, which is necessary to share data between the different processes. To achieve this, I took my single-process video detection function and divided it into smaller functions. Here is part of the code from my GitHub repository:
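
The full implementation is in the repository; the sketch below only shows the structure, with hypothetical stand-in functions (detect_objects, draw_boxes) in place of the real YOLO calls:

```python
import multiprocessing
import cv2

def detect_objects(frame):
    # Hypothetical stand-in for YOLO inference; the real code runs the model here
    return []

def draw_boxes(frame, boxes):
    # Hypothetical stand-in for post-processing + drawing bounding boxes
    return frame

def read_process(video_path, frames_q):
    # Process 1: read frames from the video and feed the detection process
    cap = cv2.VideoCapture(video_path)
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        frames_q.put(frame)
    frames_q.put(None)  # sentinel: no more frames
    cap.release()

def detect_process(frames_q, preds_q):
    # Process 2: run detection only; it never waits for drawing or display
    while True:
        frame = frames_q.get()
        if frame is None:
            preds_q.put(None)
            break
        preds_q.put((frame, detect_objects(frame)))

def show_process(preds_q):
    # Process 3: post-process, draw, and display the results
    while True:
        item = preds_q.get()
        if item is None:
            break
        frame, boxes = item
        cv2.imshow("detection", draw_boxes(frame, boxes))
        cv2.waitKey(1)
    cv2.destroyAllWindows()

if __name__ == '__main__':
    frames_q = multiprocessing.Queue()
    preds_q = multiprocessing.Queue()
    workers = [
        multiprocessing.Process(target=read_process, args=("test.mp4", frames_q)),
        multiprocessing.Process(target=detect_process, args=(frames_q, preds_q)),
        multiprocessing.Process(target=show_process, args=(preds_q,)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```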

I am not going to go through the details line by line; this is just an overview to see whether it's worth changing the code to parallel processing. To compare everything, I will check how long it takes to process a video with and without multiprocessing. I also know that YOLO detection takes most of the time, so we'll take this into account in the evaluation.

I chose to run the tests with the YOLOv4 model at a 416x416 input size. My CPU is an i7-7700K, and I have a 1080 Ti GPU. The TensorRT model is converted to FP32 precision to keep the same accuracy as the original model. To get comparable results, I had to modify my original utils.py functions, which can be found at the GitHub gist link.

I measured two time parameters: the YOLO detection time and the total time to process the whole video. To make the detections comparable, I start counting after the first image is detected, because we don't want to measure the model loading time. Keep in mind, though, that loading the TensorRT model takes 3-4 times longer than the TensorFlow model, so model loading time is still worth taking into consideration.

The table below shows my testing results for the test.mp4 video with and without multiprocessing. The whole video is only 14 s long.

[Table: YOLO detection time and total processing time for test.mp4, TensorFlow vs. TensorRT, with and without multiprocessing]

First, let's discuss the TensorFlow YOLOv4 results. As you can see, without multiprocessing the detection time was 23 seconds and the post-processing time was around 7.73 seconds. With multiprocessing, detection was slower by 4.42 seconds, but post-processing was faster by 5.35 seconds, a 325% improvement. Comparing the final results, though, it's only 3% faster. I think that even though I was using the GPU, TensorFlow still requires a lot of CPU resources, and while we use the CPU for post-processing, detection gets slower because there are fewer free CPU resources.

In my previous tutorial, we compared the results of TensorRT and TensorFlow, so I decided it would be interesting to see TensorRT's results with multiprocessing as well. Comparing detection times, there was almost no degradation in detection speed; it dropped by only 0.86 seconds. Post-processing didn't improve as much as with the TensorFlow model, only 208%, but looking at the final results, we can see that the overall improvement was 24%! That's a much better improvement than with the TensorFlow model.

Conclusion:

The results with TensorRT were much better; it seems that the compiled frozen model requires less CPU for detection than the TensorFlow model, and it's also more than twice as fast. So, if you have a GPU, I recommend using only TensorRT models for your final project; for development, you can use TensorFlow because it's faster to debug and modify. The goal of this tutorial was to measure whether there would be a time improvement when using multiprocessing. The final results tell us: yes, it's worth implementing multiprocessing. The code gets messy and harder to debug, but if you need even a few percent of improvement, it's worth doing!

Now that I have positive results, I will try to use multiprocessing in my future projects. See you in the next tutorial; I hope this was useful for you.