Object detection with a sliding window predates the recent rise of machine learning in computer vision. When working with non-technical clients, data science consultants at Datalya are often asked what the sliding window algorithm is. In this article, we will try to explain the sliding window algorithm for everyone.
As with any supervised machine learning approach, the first step in building a sliding window detector is to prepare a labeled training set. Imagine we want to build a car detection algorithm using a sliding window. Our training images (X) will be images with and without a car. We closely crop the cars out of the car images and label the cropped images with $y=1$. The training set also needs negative examples: images that contain only background and no car. For negative examples, we set $y=0$.
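The labeling step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the image is a plain list of pixel rows, boxes are `(top, left, height, width)` tuples, and the helper names `crop_box`, `boxes_overlap`, and `build_training_set` are hypothetical, not part of any library.

```python
from typing import List, Sequence, Tuple

Image = Sequence[Sequence[float]]   # grayscale image as rows of pixel values (assumed format)
Box = Tuple[int, int, int, int]     # (top, left, height, width), an assumed convention
Example = Tuple[List[List[float]], int]


def crop_box(image: Image, box: Box) -> List[List[float]]:
    """Cut the region described by `box` out of the image."""
    top, left, height, width = box
    return [list(row[left:left + width]) for row in image[top:top + height]]


def boxes_overlap(a: Box, b: Box) -> bool:
    """Return True if the two boxes share at least one pixel."""
    a_top, a_left, a_h, a_w = a
    b_top, b_left, b_h, b_w = b
    return (a_top < b_top + b_h and b_top < a_top + a_h
            and a_left < b_left + b_w and b_left < a_left + a_w)


def build_training_set(image: Image,
                       car_boxes: Sequence[Box],
                       background_candidates: Sequence[Box]) -> List[Example]:
    """Label tight car crops with y=1 and car-free background crops with y=0."""
    examples: List[Example] = [(crop_box(image, box), 1) for box in car_boxes]
    for box in background_candidates:
        # Keep a background candidate only if it touches no annotated car.
        if not any(boxes_overlap(box, car) for car in car_boxes):
            examples.append((crop_box(image, box), 0))
    return examples
```

In practice the car boxes would come from manual annotation, and the background candidates could be sampled at random from the same images.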
Now we can feed the labeled training examples to a convolutional network (convnet) so that it learns to tell whether an image contains a car. Once we have trained the convnet on closely cropped training images, the next question is how to detect vehicles in test images, which are not closely cropped. This is where the sliding window algorithm comes to the rescue.
To detect a car in a test image, we start by picking a window of size (x), then slide it across every part of the image and feed each region of size (x) to the trained convnet. The distance the window moves between consecutive positions is called the stride. For each region, the convnet outputs whether it contains a car or not. We repeat this pass several times with different window sizes, from smaller to larger, hoping that one of the window sizes fits the car and allows the convnet to detect it.
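The procedure can be sketched in plain Python as follows. This is a minimal illustration, not a production detector: the image is assumed to be a list of pixel rows, windows are square, and `classify` stands in for the trained convnet. The toy `mean_brightness` scorer is purely hypothetical and exists only so the sketch runs without a real model.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Image = Sequence[Sequence[float]]   # grayscale image as rows of pixel values (assumed format)
Window = Tuple[int, int, int]       # (top, left, size) of a square window


def sliding_windows(height: int, width: int, size: int, stride: int) -> Iterator[Window]:
    """Yield every square window of side `size` that fits inside the image.

    `stride` must be a positive integer.
    """
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            yield top, left, size


def crop(image: Image, window: Window) -> List[List[float]]:
    """Cut the window's region out of the image."""
    top, left, size = window
    return [list(row[left:left + size]) for row in image[top:top + size]]


def detect(image: Image,
           classify: Callable[[List[List[float]]], float],
           sizes: Sequence[int],
           stride: int,
           threshold: float = 0.5) -> List[Window]:
    """Run the classifier on every window, for every window size, and keep the hits."""
    height, width = len(image), len(image[0])
    hits: List[Window] = []
    for size in sizes:  # one full pass over the image per window size
        for window in sliding_windows(height, width, size, stride):
            if classify(crop(image, window)) >= threshold:
                hits.append(window)
    return hits


def mean_brightness(region: List[List[float]]) -> float:
    """Toy stand-in for a convnet: scores a region by its average pixel value."""
    pixels = [value for row in region for value in row]
    return sum(pixels) / len(pixels)
```

With a real model, `classify` would wrap the convnet's forward pass and return the predicted probability that the region contains a car.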
Computational cost is a considerable disadvantage of the sliding window algorithm. We have to crop a large number of regions and run the convnet on each of them individually. Increasing the window size and the stride reduces the number of regions and makes detection faster, but coarser steps can miss objects or frame them poorly, so accuracy drops.
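To get a feel for the cost, we can count how many regions the convnet has to process. The sketch below assumes square windows that must fit entirely inside the image; the function name `count_windows` and the image and window sizes in the usage note are illustrative choices, not values from a real system.

```python
def count_windows(height: int, width: int, size: int, stride: int) -> int:
    """Number of square windows of side `size` that fit in the image at the given stride."""
    if size > height or size > width:
        return 0
    rows = (height - size) // stride + 1   # window positions along the vertical axis
    cols = (width - size) // stride + 1    # window positions along the horizontal axis
    return rows * cols


def total_windows(height: int, width: int, sizes, stride: int) -> int:
    """Total number of classifier evaluations across all window sizes."""
    return sum(count_windows(height, width, size, stride) for size in sizes)
```

For example, on a 480 by 640 image with a 64-pixel window, `count_windows(480, 640, 64, 8)` gives 3869 regions, while a stride of 32 gives only 266. Each region means one full run of the convnet, and the count is multiplied again by the number of window sizes we try.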
Before the recent rise of machine learning, sliding window object detection worked well with simple, often linear classifiers, typically based on hand-engineered features. Such classifiers are cheap to evaluate, so running them on many windows was affordable. However, modern convnets are computationally very demanding, and that makes the sliding window algorithm very slow. Additionally, the sliding window does not localize objects accurately unless the window and the stride are very small.