In image classification with localization, we train a neural network to detect an object and then localize it by predicting the coordinates of a bounding box around it. While working on computer vision projects, our machine learning consultants are often asked what landmark detection is. In this article, we explain how landmark detection works in deep learning for computer vision applications.
Given an input image, the output of such an algorithm would be:
a) probability of finding an object, and
b) if the object exists, coordinates of the bounding box ($b_{x}$, $b_{y}$, $b_{h}$ and $b_{w}$) around it.
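As a rough sketch, the two outputs above can be packed into a single vector. The layout and names below are illustrative assumptions, not a fixed convention:

```python
import numpy as np

# Illustrative output vector of a classification-with-localization network.
# Assumed layout for this sketch: [p_object, b_x, b_y, b_h, b_w]
y = np.array([0.92, 0.45, 0.50, 0.30, 0.25])

p_object = y[0]          # probability that an object is present
bx, by, bh, bw = y[1:5]  # box center (bx, by), height bh, width bw

if p_object > 0.5:
    print(f"object at ({bx:.2f}, {by:.2f}), size {bh:.2f} x {bw:.2f}")
```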
In many computer vision applications, the neural network needs to recognize essential points of interest (rather than a bounding box) in the input image. We refer to these points as landmarks. In such applications, we want the neural network to output the coordinates ($x$, $y$) of landmark points rather than those of bounding boxes.
Let's take a look at a specific example from face recognition. Imagine we want a neural network to learn the positions of the two corners of each human eye (i.e., four landmarks) and output eight numbers as given below:
- ($l_{1x}$, $l_{1y}$)
- ($l_{2x}$, $l_{2y}$)
- ($l_{3x}$, $l_{3y}$)
- ($l_{4x}$, $l_{4y}$)
But what if we want to localize tens of landmarks along the upper and lower linings of the eyes and mouth, along with a few other essential landmarks on the face?
($l_{1x}$, $l_{1y}$), ($l_{2x}$, $l_{2y}$), ..., ($l_{Nx}$, $l_{Ny}$)
To do so, first we would need to decide upon the positions of the landmarks and then label the training images with them. Putting landmarks on training images can be an uphill task when both landmarks and training images are large in number. It is important to note that the sequence of landmarks must be consistent across all training images. More concretely, if the first landmark is the right corner of the right eye, it should be so across all labeled examples in the training set.
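One way to keep the ordering consistent is to fix a named index for every landmark and always store coordinates in that order. The naming scheme below is hypothetical, just to make the convention concrete:

```python
# Hypothetical labeling convention: index i always refers to the same
# anatomical point in every training image.
LANDMARK_NAMES = [
    "right_eye_right_corner",  # landmark 1 in every image
    "right_eye_left_corner",   # landmark 2
    "left_eye_right_corner",   # landmark 3
    "left_eye_left_corner",    # landmark 4
]

# One labeled example: (x, y) coordinates, normalized to [0, 1],
# stored in the same fixed order as LANDMARK_NAMES.
example_label = [(0.70, 0.35), (0.58, 0.35), (0.42, 0.35), (0.30, 0.35)]

assert len(example_label) == len(LANDMARK_NAMES)
```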
To train our algorithm, we pass training examples through a convolutional network (convnet) so that it can learn features, and then feed those features into fully connected (FC) layers. The FC layers end with $1+2N$ output units: 1 binary unit (person or not) and $2N$ units for the $N$ landmark points ($l_{1x}$, $l_{1y}$), ($l_{2x}$, $l_{2y}$), ..., ($l_{Nx}$, $l_{Ny}$).
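The output head can be sketched in NumPy. This is a minimal illustration of the $1+2N$ layout, not a trainable model: the feature vector stands in for the convnet output, and the weights are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 4                    # number of landmarks (the eye corners in our example)
num_outputs = 1 + 2 * N  # 1 presence unit + (x, y) for each landmark

# Stand-in for convnet features; in a real model these come from conv layers.
features = rng.standard_normal(128)

# One fully connected layer mapping features to the output units.
W = rng.standard_normal((num_outputs, 128)) * 0.01
b = np.zeros(num_outputs)
out = W @ features + b

p_face = 1.0 / (1.0 + np.exp(-out[0]))  # sigmoid: person present or not
landmarks = out[1:].reshape(N, 2)       # row i holds (l_ix, l_iy)

print(p_face, landmarks.shape)          # landmarks.shape is (4, 2)
```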
Landmark detection is a fundamental building block of computer vision applications such as face recognition, pose estimation, emotion recognition, augmented-reality effects like placing a crown on a detected head, and many more.