YOLO has made its way from youth slang into the world of artificial intelligence and machine learning. There, the acronym has taken on a new meaning: You Only Look Once. Redmon et al. first presented YOLO in 2016 as a new approach to real-time object detection. In this article, I explain how the YOLO network is structured and what its advantages and limitations are.
The YOLO algorithm is based on a convolutional neural network (CNN). As the name suggests, it needs only a single forward pass through the CNN to detect objects. In this pass, the CNN predicts bounding boxes and class probabilities simultaneously. A bounding box describes the spatial extent of an object. A class probability indicates how likely it is that the object belongs to a certain class, e.g. ‘cat’.
In a first step, the image is divided into an S x S grid. If the centre of an object falls into a grid cell, that cell is responsible for detecting the object. For this purpose, each cell predicts a fixed number of bounding boxes, each with a confidence score. The confidence score indicates how certain the model is that the box contains an object and how accurately the box fits it. Formally, it is the probability that an object is present multiplied by the intersection over union (IoU) of the predicted box and the actual object. If a cell contains no object, the confidence score should therefore be zero; otherwise it should equal the IoU. At the same time, class probabilities are predicted for each cell. Predictions whose score exceeds a certain threshold, usually 0.5, are kept as the final detections.
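The steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the original implementation: the function names, the (x_min, y_min, x_max, y_max) box format and the 448-pixel square image are chosen for the example only. Scoring a box by multiplying its confidence with the class probability follows the original paper; the 0.5 threshold is the value mentioned in the text.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def responsible_cell(box, image_size, S):
    """Return (row, col) of the grid cell that contains the centre of the box."""
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    col = min(int(cx / image_size * S), S - 1)
    row = min(int(cy / image_size * S), S - 1)
    return row, col


def keep_detection(confidence, class_prob, threshold=0.5):
    """Keep a box if its class-specific score reaches the threshold."""
    return confidence * class_prob >= threshold


# Hypothetical example values: a ground-truth box and a shifted prediction.
truth = (100, 100, 300, 300)
prediction = (150, 150, 350, 350)

print(responsible_cell(truth, image_size=448, S=7))  # (3, 3)
print(round(iou(prediction, truth), 3))              # 0.391
print(keep_detection(0.8, 0.9))                      # True
```

In this example the centre of the ground-truth box lies at (200, 200), which falls into the cell in row 3, column 3 of a 7 x 7 grid, so that cell would be responsible for the object.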
[Figure: The model]
Advantages of the YOLO network
YOLO treats object detection as a single regression problem, which makes the algorithm extremely fast and removes the need for a complex multi-stage pipeline. As a result, video can be processed in real time with a latency of less than 25 ms. Moreover, YOLO learns generalisable representations of objects, so its performance degrades less when it is applied to new domains or unexpected inputs after training. In terms of detection speed, the YOLO network is clearly ahead of other common networks such as RetinaNet.
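To make the "single regression problem" more concrete, the following sketch computes the size of the tensor that the network regresses to in one pass. The values S = 7, B = 2 and C = 20 are those of the original paper's PASCAL VOC configuration. The order of values inside a cell vector (boxes first, then class probabilities) and the helper name split_cell are assumptions for illustration; real implementations may lay out the values differently.

```python
S, B, C = 7, 2, 20   # grid size, boxes per cell, number of classes

VALUES_PER_BOX = 5   # x, y, w, h and the confidence score
cell_length = B * VALUES_PER_BOX + C
output_shape = (S, S, cell_length)
total_values = S * S * cell_length


def split_cell(cell_vector, B=B, C=C):
    """Split one cell's flat prediction vector into boxes and class probabilities.

    Assumed layout: B consecutive groups of 5 box values, followed by C class
    probabilities that are shared by all boxes of the cell.
    """
    boxes = [tuple(cell_vector[i * 5:(i + 1) * 5]) for i in range(B)]
    class_probs = list(cell_vector[B * 5:B * 5 + C])
    return boxes, class_probs


print(output_shape)   # (7, 7, 30)
print(total_values)   # 1470
```

Because the whole image is mapped to this one fixed-size tensor, no separate proposal or classification stage has to be run per candidate region, which is where the speed advantage comes from.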
But even YOLO is not perfect and has limitations.
Limitations of YOLO
Because each grid cell predicts only a fixed number of boxes, the network often fails to detect small objects that are close together, such as birds in a flock. As a result, YOLO's accuracy is not superior to that of conventional approaches; on average it is even slightly lower.
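The effect of the fixed box budget can be illustrated with a small counting sketch. It assumes a 448-pixel square image, a 7 x 7 grid and two boxes per cell; the function names and the object centres are hypothetical. The count is an optimistic bound, since it assumes every one of the B boxes in a cell can capture a different object.

```python
from collections import Counter


def cell_of(cx, cy, image_size, S):
    """Return (row, col) of the grid cell that contains the point (cx, cy)."""
    col = min(int(cx / image_size * S), S - 1)
    row = min(int(cy / image_size * S), S - 1)
    return row, col


def count_unrepresentable(centres, image_size=448, S=7, B=2):
    """Count the objects that exceed the B boxes available in their grid cell."""
    per_cell = Counter(cell_of(cx, cy, image_size, S) for cx, cy in centres)
    return sum(max(0, n - B) for n in per_cell.values())


def max_boxes(S=7, B=2):
    """Upper limit on the number of boxes predicted for a whole image."""
    return S * S * B


# Five small objects whose centres all fall into the top-left 64 x 64 pixel cell.
flock = [(10, 10), (20, 15), (30, 25), (40, 12), (50, 50)]
# Three objects spread over different cells.
spread = [(10, 10), (100, 10), (200, 10)]

print(count_unrepresentable(flock))   # 3
print(count_unrepresentable(spread))  # 0
print(max_boxes())                    # 98
```

Even in this best case, three of the five clustered objects cannot be assigned a box, whereas the same network has no such problem when the objects are spread across cells.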
Prospects
In the past few years, YOLO has been continuously developed and extended with new features. The approach is still considered state of the art in object detection and remains extremely promising.