Computer vision has seen great success during the last few years, mainly due to advances in deep learning and the availability of training data. Applications like unlocking phones and house doors with face recognition, and classifying pictures in a smartphone's gallery app, are all direct consequences of applying deep learning to computer vision. At Datalya, our machine learning consultants apply deep learning algorithms to computer vision problems such as:

- image recognition
- image classification with localization
- object detection
- neural style transfer

High-dimensional feature vectors (due to large image sizes) are one of the challenges of applying deep learning to computer vision. It is impractical to train a regular fully connected deep neural network on even average-sized images. To address this challenge, the machine learning community has developed an exceptionally successful technique: the convolutional neural network (CNN).

## Convolution Operation

The convolution operation is a fundamental building block of convolutional neural networks. It allows the network to detect horizontal and vertical edges in an image and then, based on those edges, build high-level features (like ears, eyes, etc.) in the following layers of the network.

The convolution operation involves an input matrix and a filter, also known as a kernel. The input matrix can be the pixel values of a grayscale image, whereas the filter is a relatively small matrix that detects edges by highlighting areas of the input image where there are transitions from brighter to darker regions. There can be different types of filters depending on what type of feature we want to detect, e.g. vertical, horizontal, or diagonal edges.

In the convolution operation, we start by superimposing the filter onto the input matrix at the top-left corner, perform element-wise multiplication, and add up all the values. The result is written to the top-left cell of the output matrix. Next we shift the filter one element to the right and repeat the multiplication and addition to get the next output value. When we hit the right boundary of the input matrix, we shift one element down and move all the way back to the left boundary. We continue until the filter reaches the bottom-right of the input matrix.
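The stepping procedure above can be sketched in a few lines of NumPy (a minimal loop-based version, assuming a square grayscale input and the deep-learning convention of not flipping the filter; the `conv2d` name and the example vertical-edge filter are illustrative):

```python
import numpy as np

def conv2d(inp, kernel):
    """Slide `kernel` over `inp` and return the valid convolution output."""
    n, f = inp.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # element-wise multiply the superimposed region, then sum
            out[i, j] = np.sum(inp[i:i + f, j:j + f] * kernel)
    return out

# A 6x6 image with a bright-to-dark transition down the middle
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)

# A classic vertical-edge filter
vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]], dtype=float)

print(conv2d(image, vertical_edge))
# Each output row is [0, 30, 30, 0]: large values where the edge sits
```

Note that the output lights up exactly where the brightness transition occurs, which is how the filter "detects" the edge.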

**Do we need to write the code for the convolution operation ourselves?**

Fortunately, the answer is 'No'. Mainstream machine learning frameworks such as TensorFlow and Keras come with the convolution operation built in. We simply give them the input matrix and filter (along with other parameters we will discuss later) and they give us back the output matrix.

**Do we need to specify values of a filter?**

No, because in deep learning these filter values (weights) are learned by the learning algorithm through backpropagation.

## Padding

Padding is an extension of the convolution operation. In deep neural networks, we convolve with padding to get an output matrix of the same size as the input matrix. Let's say our input matrix is 6x6 and we convolve it with a filter of size 3x3. Without padding, the output matrix would be 4x4, because there are only 4x4 positions where the filter can be superimposed on the input matrix. That gives the following formula for the output size:

$$ (n-f+1) \times (n-f+1) $$

The plain convolution operation has two downsides:

1) it shrinks image each time we convolve, and

2) the corner pixels of the input matrix contribute to only one output value, as the filter touches them only once. By contrast, a middle pixel of the 6x6 input matrix overlaps with the 3x3 filter many times. That means we throw away information from edge pixels by underusing them in the convolution.

Imagine we have a 100-layer neural network and convolve at every layer; eventually we would end up with a very small, shrunken image.

We can fix this problem by padding the input image before applying the convolution operation. Typically, we pad the border with one additional pixel (for a 3x3 filter). This gives us an output matrix of the same size as the input matrix:

$$(n+2p-f+1) \times (n+2p-f+1)$$

**Valid and Same Convolution**

A convolution operation without padding is called a valid convolution. One with padding is known as a same convolution, because the padding makes the output size the same as the input size. Typically, the value of $p$ is chosen by the formula

$$p = \frac{f-1}{2}$$

Typically, the convolution filter size $f$ is odd, because it makes the padding size easier to choose: $\frac{f-1}{2}$ is then an integer.
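The difference between valid and same convolution can be sketched with `numpy.pad` and SciPy's `correlate2d` (which matches the deep-learning convention of not flipping the filter; the variable names and random inputs are illustrative):

```python
import numpy as np
from scipy.signal import correlate2d

n, f = 6, 3
p = (f - 1) // 2                  # p = 1 for a 3x3 filter

image = np.random.rand(n, n)
kernel = np.random.rand(f, f)

# Valid convolution: no padding, output shrinks to (n - f + 1)
valid = correlate2d(image, kernel, mode="valid")

# Same convolution: pad a 1-pixel border of zeros first, then convolve
padded = np.pad(image, p)         # zero-padding on all four sides
same = correlate2d(padded, kernel, mode="valid")

print(valid.shape)                # (4, 4)
print(same.shape)                 # (6, 6) -- same as the input
```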

## Strided Convolution

Strided convolution is another variant of the convolution operation and a basic building block of CNNs. In a strided convolution, we use a stride ($s$) when we shift the filter over the input matrix.

The plain convolution operation has stride $s=1$, as we shift the filter by only one column or row (after hitting the right boundary of the input matrix). In a strided convolution, $s$ is greater than 1, as the filter steps over two or more columns and rows at a time.

We can calculate the output values in the same fashion as in plain convolution, i.e. element-wise products and then the sum of all product values. The dimensions of the output matrix are governed by

$$\left(\left\lfloor\frac{n+2p-f}{s}\right\rfloor + 1\right) \times \left(\left\lfloor\frac{n+2p-f}{s}\right\rfloor + 1\right)$$

where

$n \times n$ - dimensions of the input matrix,

$f \times f$ - dimensions of the filter,

$p$ - padding,

$s$ - stride

If $\frac{n+2p-f}{s}$ is not an integer, we round it down; the filter is applied only at positions where it fits entirely inside the (padded) input.
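This output-size arithmetic is easy to get wrong by hand, so a small helper is worth sketching (the `conv_output_size` name is illustrative; integer division performs the rounding down):

```python
def conv_output_size(n, f, p=0, s=1):
    """Output side length of a square convolution:
    floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))            # 4: valid convolution, 6x6 input
print(conv_output_size(6, 3, p=1))       # 6: same convolution
print(conv_output_size(7, 3, p=0, s=2))  # 3: strided convolution
```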

## Convolution over Volume

Colour RGB images have 3 channels: red, green, and blue. To detect features in such images, we use 3D filters with an additional dimension for the colour channels. The filter itself has 3 layers corresponding to the three colour channels. The dimensions are written as

$$H \times W \times C$$

The number of channels ($C$) in the input image and the filter must match, e.g.

$$6 \times 6 \times 3 * 3 \times 3 \times 3 = 4 \times 4 $$

To compute an output value, we superimpose each filter layer onto the corresponding input layer, take the element-wise products, and then add the 27 numbers up into one. The shift operation is the same as in 2D convolution. You can imagine the 3-layered filter as a volume that is matched against the corresponding input volume and then shrunk to one element of the output matrix (through the sum of 27 products).
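A loop-based sketch of convolution over volume (the `conv3d_single` name is illustrative, and random arrays stand in for real image data):

```python
import numpy as np

def conv3d_single(inp, kernel):
    """Convolve an H x W x C input with an f x f x C filter -> 2D output."""
    n, f = inp.shape[0], kernel.shape[0]
    assert inp.shape[2] == kernel.shape[2], "channel counts must match"
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # the whole f x f x C volume (27 products for a 3x3x3 filter)
            # collapses to a single output number
            out[i, j] = np.sum(inp[i:i + f, j:j + f, :] * kernel)
    return out

x = np.random.rand(6, 6, 3)   # 6x6 RGB input
w = np.random.rand(3, 3, 3)   # one 3x3x3 filter
print(conv3d_single(x, w).shape)  # (4, 4)
```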

## Multiple Filters

So far we have discussed applying a single filter in the convolution operation. Let's say we want to detect N different types of edges in the input image. That would require N filters, each convolved with the input image. If we have two 3x3x3 filters and a 6x6x3 input, the convolution operation gives us two 4x4 output layers, i.e. a 4x4x2 volume.

$$ n \times n \times c * f \times f \times c = (n-f+1) \times (n-f+1) \times n_f $$

where $n_f$ is the number of filters.

The idea of convolutions over volume is very powerful. This allows us to operate on RGB channels and at the same time detect multiple types of edges.
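Stacking the outputs of several filters into a volume can be sketched as follows (the `conv_multi` name is illustrative; the shapes match the two-filter example above):

```python
import numpy as np

def conv_multi(inp, filters):
    """Apply a list of f x f x C filters; stack the 2D outputs into a volume."""
    n, f = inp.shape[0], filters[0].shape[0]
    maps = []
    for w in filters:
        out = np.zeros((n - f + 1, n - f + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(inp[i:i + f, j:j + f, :] * w)
        maps.append(out)
    # output depth equals the number of filters, not the input channels
    return np.stack(maps, axis=-1)

x = np.random.rand(6, 6, 3)
filters = [np.random.rand(3, 3, 3) for _ in range(2)]
print(conv_multi(x, filters).shape)  # (4, 4, 2)
```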

## One Layer of Convolutional Network

We have discussed the core concept of convolution operation and related techniques of padding, strides, and convolution over volume. Now let's see how we can put them together in one layer of the convolutional network.

Let's say we have a 3D input (6x6x3) and are only interested in 2 types of features/edges. Consequently, we decide to convolve the input with two filters (3x3x3), i.e. convolution over volume. This gives us two 4x4 outputs. In a convolutional neural network, we add a bias to each of the two outputs and then pass them through a ReLU non-linearity. The final two output (4x4) matrices, stacked up, give us an output volume (4x4x2).

If you remember, a typical layer of a neural network performs the following two operations (a linear function followed by a ReLU non-linearity):

$$ z^{[1]} = w^{[1]}a^{[0]} + b^{[1]}$$

$$a^{[1]} = g(z^{[1]})$$

In a convolutional network, $a^{[0]}$ is the 3D input matrix and the two filters play the role of $w^{[1]}$. The output volume we get after applying the ReLU non-linearity is the activation $a^{[1]}$.

This is how we go from $a^{[0]}$ (6x6x3) to $a^{[1]}$ (4x4x2): first we apply the convolution operation to compute the linear function ($z^{[1]}$), then add the biases to the output matrices and pass them through the ReLU non-linearity to get $a^{[1]}$.
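Putting the pieces together, a single convolutional layer can be sketched as a naive loop-based implementation (the `conv_layer` name is illustrative; the shapes match the example's 6x6x3 input and two 3x3x3 filters):

```python
import numpy as np

def conv_layer(a_prev, W, b):
    """One conv layer: z = conv(a_prev, W) + b, then a = ReLU(z).
    a_prev: (n, n, c); W: (f, f, c, n_f); b: (n_f,)."""
    n, f, n_f = a_prev.shape[0], W.shape[0], W.shape[3]
    m = n - f + 1
    z = np.zeros((m, m, n_f))
    for k in range(n_f):                      # one feature map per filter
        for i in range(m):
            for j in range(m):
                z[i, j, k] = np.sum(a_prev[i:i + f, j:j + f, :] * W[..., k]) + b[k]
    return np.maximum(z, 0)                   # ReLU non-linearity

a0 = np.random.randn(6, 6, 3)                 # input activation
W1 = np.random.randn(3, 3, 3, 2)              # two 3x3x3 filters
b1 = np.random.randn(2)                       # one bias per filter
a1 = conv_layer(a0, W1, b1)
print(a1.shape)  # (4, 4, 2)
```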

In this example, we have two filters in order to detect two features. In practice, we may have a much larger input matrix and hundreds of filters. As we add more filters, the depth of the output volume increases.

The filters and biases are the parameters of a convolutional layer. For instance, if we have ten 3x3x3 filters, the total number of parameters is 280, where each filter contributes 28 parameters (27 filter values + 1 bias). It is important to note that even if the size of the input image increases, the number of parameters does not change. This makes convolutional neural networks less prone to over-fitting.
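The parameter count can be checked with a one-line helper (the `conv_layer_params` name is illustrative):

```python
def conv_layer_params(f, c, n_f):
    """Parameters of a conv layer: (f*f*c weights + 1 bias) per filter.
    Independent of the input image size."""
    return (f * f * c + 1) * n_f

print(conv_layer_params(3, 3, 10))  # 280: ten 3x3x3 filters
```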