What is Landmark Detection in Computer Vision?




In image classification with localization, we train a neural network to detect an object and then localize it by predicting the coordinates of a bounding box around it. Given an input image, the output of such an algorithm is a) the probability that an object is present, and b) if an object is present, the coordinates of the bounding box ($b_{x}$, $b_{y}$, $b_{h}$ and $b_{w}$) around it.

In many computer vision applications, a neural network needs to recognize important points of interest in the input image rather than a bounding box. These points are referred to as landmarks. In such applications, we want the neural network to output the coordinates ($x$, $y$) of the landmark points rather than those of a bounding box.

Let's take a look at a specific example from face recognition. Imagine we want a neural network to learn the positions of the two corners of each eye (i.e. 4 landmarks) and output 8 numbers, as given below:

  • ($l_{1x}$, $l_{1y}$)
  • ($l_{2x}$, $l_{2y}$)
  • ($l_{3x}$, $l_{3y}$)
  • ($l_{4x}$, $l_{4y}$)

But what if we want to localize tens of landmarks along the upper and lower outlines of the eyes and mouth, along with a few other important landmarks on the face? For $N$ landmarks, the network then outputs $2N$ numbers:

($l_{1x}$, $l_{1y}$), ($l_{2x}$, $l_{2y}$), ..., ($l_{Nx}$, $l_{Ny}$)

To do so, we first need to decide on the positions of the landmarks and then label the training images with them. Placing landmarks on training images can be a laborious task when there are many landmarks and many training images. It is important to note that the order of the landmarks must be consistent across all training images. More concretely, if the first landmark is the right corner of the right eye, it must be the first landmark in every labeled example in the training set.
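The consistency requirement above can be made concrete with a small labeling helper. The following is a minimal sketch, not a prescribed method: the landmark names, the `encode_landmarks` function, and the choice to normalize coordinates by the image size are all assumptions made for illustration.

```python
# Hypothetical landmark names; the fixed order below is an assumption for illustration.
LANDMARK_ORDER = [
    "right_eye_outer_corner",
    "right_eye_inner_corner",
    "left_eye_inner_corner",
    "left_eye_outer_corner",
]


def encode_landmarks(annotation, image_width, image_height):
    """Flatten named (x, y) pixel landmarks into [l1x, l1y, ..., lNx, lNy].

    Coordinates are divided by the image size so they fall in [0, 1].
    This normalization is an assumed convention, not something the text prescribes.
    """
    missing = [name for name in LANDMARK_ORDER if name not in annotation]
    if missing:
        raise ValueError(f"annotation is missing landmarks: {missing}")

    target = []
    for name in LANDMARK_ORDER:  # always iterate in the same fixed order
        x, y = annotation[name]
        target.append(x / image_width)
        target.append(y / image_height)
    return target


# The annotation may list landmarks in any order; the output order is fixed.
annotation = {
    "left_eye_outer_corner": (150, 50),
    "right_eye_outer_corner": (50, 50),
    "left_eye_inner_corner": (120, 50),
    "right_eye_inner_corner": (80, 50),
}
target = encode_landmarks(annotation, image_width=200, image_height=100)
print(target)
```

Because the helper reads landmarks by name and writes them in one fixed order, two annotators who click the points in a different sequence still produce identical target vectors.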

To train our algorithm, we pass the training examples through a convolutional network (convnet) so that it can learn features, and then feed those features into fully connected (FC) layers. The FC layers end in $1+2N$ output units: 1 binary unit (person or not) and $2N$ units for the $N$ landmark points ($l_{1x}$, $l_{1y}$), ($l_{2x}$, $l_{2y}$), ..., ($l_{Nx}$, $l_{Ny}$).
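To illustrate how the $1+2N$ output units might be interpreted, here is a minimal sketch in plain Python. The function names `split_outputs` and `landmark_loss` are hypothetical, and the loss (binary cross-entropy on the presence unit plus mean squared error on the coordinates, counted only when a person is present) is one common choice assumed here, not something the text specifies.

```python
import math


def split_outputs(raw_outputs, num_landmarks):
    """Split 1 + 2N raw output units into (probability, [(x, y), ...])."""
    expected = 1 + 2 * num_landmarks
    if len(raw_outputs) != expected:
        raise ValueError(f"expected {expected} outputs, got {len(raw_outputs)}")

    # The first unit is squashed with a sigmoid to give a probability.
    probability = 1.0 / (1.0 + math.exp(-raw_outputs[0]))

    # The remaining 2N units are read as consecutive (x, y) pairs.
    coords = raw_outputs[1:]
    landmarks = [(coords[2 * i], coords[2 * i + 1]) for i in range(num_landmarks)]
    return probability, landmarks


def landmark_loss(raw_outputs, is_present, true_landmarks):
    """Assumed loss: cross-entropy on presence, plus MSE on landmarks if present."""
    probability, predicted = split_outputs(raw_outputs, len(true_landmarks))

    eps = 1e-12  # guards against log(0)
    if is_present:
        loss = -math.log(probability + eps)
        squared_errors = [
            (px - tx) ** 2 + (py - ty) ** 2
            for (px, py), (tx, ty) in zip(predicted, true_landmarks)
        ]
        loss += sum(squared_errors) / (2 * len(true_landmarks))
    else:
        # Landmark coordinates are ignored when nothing is present.
        loss = -math.log(1.0 - probability + eps)
    return loss


raw = [0.0, 0.25, 0.5, 0.4, 0.5]  # 1 presence unit + 2 landmarks (N = 2)
probability, landmarks = split_outputs(raw, num_landmarks=2)
print(probability, landmarks)
print(landmark_loss(raw, True, [(0.25, 0.5), (0.4, 0.5)]))
```

In a real system these raw values would come from the final FC layer of the convnet, and the loss would be computed by a deep learning framework over batches of images rather than one example at a time.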

Landmark detection is a basic building block of computer vision applications such as face recognition, pose recognition, emotion recognition, head detection (for example, to place a crown on a person's head), and many more in augmented reality.