Object detection using sliding window has existed before recent rise of machine learning in computer vision. Like any machine learning algorithm, first requirement of sliding window algorithm is to prepare labeled training set. Imagine we want to build a car detection algorithm using sliding window. Our training images (X) will be images with and without car object. We will closely crop cars out of car images and label cropped images with $y=1$ . In our training set we also need negative examples where an image does not have any car but background. For negative examples we will set $y=0$.
Now we can feed labeled training examples to convolutional network (convnet) so that it can learn how to detect car in an image. Once we have trained convnet on closely cropped training images, the next question would be how do we go about detecting cars in test images as test images would not be closely cropped. This is where Sliding Window Algorithm comes to rescue.
To detect a car in a test input image, we start by picking sliding window of size (x) and then feeding input region (x) to trained convnet by sliding window over every part of input image. For each input region, convnet outputs whether it has a car or not. We run sliding window multiple times over the image with different window size, from smaller to larger, hoping a window size would fit the car and allow convnet to detect it.
Computational cost is a huge disadvantage of sliding window algorithm. We have to crop so many regions and run convnet for each of them individually. Increasing window and stride size makes it faster but at cost of decreased accuracy.
Until recent rise of machine learning, object detection using sliding window has been working fine for simple and linear classifiers, typically based on hand engineered features. Classifiers with traditional settings usually do not require huge computational resources. However, modern convnets are computational very demanding and that makes sliding window algorithm very slow. Additionally sliding window does not localize object accurately unless sliding window and strides are very small.