# Convolution

Convolution is one of the most important operations in image processing. In particular, it is fundamental to modern deep learning.

## Fundamentals

Convolution is an operation on an image. For our purposes, an image is simply an n-dimensional array or list of real numbers. For instance, a vector would be a 1-dimensional image, while a matrix is a 2-dimensional image.

For a convolution operation over n dimensions, we take a small n-dimensional image (for instance, a 3x3 matrix in 2D), called the convolution filter or kernel, and slide it across our image. At each position, we take a weighted average of the image values under the filter, using the filter's entries as the weights. Collecting these weighted averages (or sums, depending on how you look at it) gives us our filtered image.

Let's work an example in 1D. Suppose this is our image:

1, 2, 3, 4, 3, 2, 1, 2, 3, 4, 3, 2, 1

Let's say we use the 1D convolution filter of [1/3, 1/3, 1/3]. We would look at the image one 3-cell window at a time:

Starting image:
                             1, 2, 3, 4, 3, 2, 1, 2, 3, 4, 3, 2, 1 ...

[1/3 * 1 + 1/3 * 2 + 1/3 * 3] = 2
   [1/3 * 2 + 1/3 * 3 + 1/3 * 4] = 3
      [1/3 * 3 + 1/3 * 4 + 1/3 * 3] = 3 1/3
         [1/3 * 4 + 1/3 * 3 + 1/3 * 2] = 3
            [1/3 * 3 + 1/3 * 2 + 1/3 * 1] = 2
               [1/3 * 2 + 1/3 * 1 + 1/3 * 2] = 1 2/3
                    ...

Resulting image: 2, 3, 3 1/3, 3, 2, 1 2/3, 2 ...

As we can see, the process of convolution is extremely simple. However, it turns out to be an exceptionally powerful tool.
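
If you prefer to see this in code, here is a minimal sketch of the sliding-window computation in Python (the function name `convolve1d` is just for illustration; exact fractions are used so the output matches the worked example, and since the averaging filter is symmetric, sliding it without reversing it gives the same result as true convolution, which flips the kernel):

```python
from fractions import Fraction

def convolve1d(image, kernel):
    """Slide `kernel` across `image` and return the weighted sums.

    Only positions where the kernel fits entirely inside the image are
    computed ("valid" convolution), so the output is slightly shorter
    than the input.
    """
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, image[i:i + k]))
        for i in range(len(image) - k + 1)
    ]

image = [1, 2, 3, 4, 3, 2, 1, 2, 3, 4, 3, 2, 1]
kernel = [Fraction(1, 3)] * 3   # the averaging filter [1/3, 1/3, 1/3]

print([str(v) for v in convolve1d(image, kernel)])
# ['2', '3', '10/3', '3', '2', '5/3', '2', '3', '10/3', '3', '2']
```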

## 1D Convolution

Consider the following problem: you're using an ultrasonic sensor to detect when someone walks in front of a door, at which point you will turn on a light. You have a constant stream of numbers, giving an echo distance in centimeters. However, the sensor occasionally sends an erratic value. If you simply use the direct sensor data, you might turn on the light when an erratic value is sent, even if no one is actually in front of the door. How can we avoid this issue?

We might keep a running average of the last 3 readings: instead of acting on the latest sensor value directly, we act on the average of the last 3 values. Assuming the readings are otherwise smooth, and that we never get several erratic readings in a row, the two accurate readings in each window pull the average back toward the true distance. For instance:

Actual distance of subject:     5    6    5    4    5    5    5    7   16   18   16   15
Sensor reading:                 5    6    5    4   17    5    5    7   16   18   16   15
Averaged reading:              na   na  5.3    5  8.7  8.7    9  5.7  9.3 13.7 16.7 16.3

In this case, the fifth reading (17) is erratic. Taking a running average blunts its effect: the averaged readings near the glitch are 8.7 and 9 rather than 17, much closer to the true distance of about 5.
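
As a sketch, here is the same running average computed with NumPy's built-in 1D convolution (assuming NumPy is available; the kernel is symmetric, so the flip that convolution performs makes no difference):

```python
import numpy as np

readings = np.array([5, 6, 5, 4, 17, 5, 5, 7, 16, 18, 16, 15], dtype=float)
kernel = np.ones(3) / 3          # the averaging filter [1/3, 1/3, 1/3]

# "valid" mode only produces outputs where a full 3-reading window exists,
# which is why the first two positions were "na" in the table above.
smoothed = np.convolve(readings, kernel, mode="valid")
print(np.round(smoothed, 1))
# [ 5.3  5.   8.7  8.7  9.   5.7  9.3 13.7 16.7 16.3]
```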

It should be clear by now that this running average is really just an instance of 1D convolution, with the filter [1/3, 1/3, 1/3].

There are many other useful 1D convolution filters in addition to the smoothing one we just saw. Here's a list:

### Basic Smoothing

$$ \begin{bmatrix} 1/3 & 1/3 & 1/3 \end{bmatrix} $$

This is the basic averaging filter that we just saw, and it serves to remove erratic changes in value in our image / signal. In the sensor example, it's worth noting that each output lags the input by a couple of readings: if a subject actually does step in front of the ultrasonic sensor, the new distance is only fully registered after 3 readings instead of after 1.

We can make this filter stronger, so that it mitigates larger disturbances, by making it wider, such as $[0.2\ 0.2\ 0.2\ 0.2\ 0.2]$. However, the wider and stronger the filter, the longer the lag.
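
As a rough illustration (the step data here is made up, not from the sensor example), the lag grows with the filter width: a jump from 5 to 15 takes 3 readings to be fully reflected by the 3-tap average and 5 readings by the 5-tap one:

```python
import numpy as np

# An idealized step: the subject moves from 5 cm away to 15 cm away.
step = np.array([5.0] * 6 + [15.0] * 6)

avg3 = np.convolve(step, np.ones(3) / 3, mode="valid")
avg5 = np.convolve(step, np.ones(5) / 5, mode="valid")

print(np.round(avg3, 1))  # [ 5.   5.   5.   5.   8.3 11.7 15.  15.  15.  15. ]
print(np.round(avg5, 1))  # [ 5.  5.  7.  9. 11. 13. 15. 15.]
```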

### Edge Detection

$$ \begin{bmatrix} 1 & -2 & 1 \end{bmatrix} $$

This filter essentially measures how much each point differs from its neighbors, so only sharp changes in value produce outputs of large magnitude. For instance, in our ultrasonic example, suppose we only want to detect motion in front of the sensor, that is, moments when the reading changes. Suppose our readings are perfect, so our data looks like:

5 5 5 5 5 5 15 15 15 15...

Using our convolution filter:

na na 0 0 0 0 10 -10 0 0...

We can see that we get nonzero values only at the change: a positive spike followed by a negative one, marking exactly where the reading jumps.
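
Here is a quick sketch of that computation in NumPy (again, the kernel is symmetric, so the flip that convolution performs changes nothing):

```python
import numpy as np

readings = np.array([5, 5, 5, 5, 5, 5, 15, 15, 15, 15], dtype=float)
kernel = np.array([1.0, -2.0, 1.0])   # the edge-detection filter [1, -2, 1]

edges = np.convolve(readings, kernel, mode="valid")
print(edges)
# [  0.   0.   0.   0.  10. -10.   0.   0.]
```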

## 2D Convolution

See this GIMP page for good visual examples.

### Gaussian Average

$$ \frac{1}{16}\ \begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \\ \end{bmatrix} $$

### Edge Detection

$$ \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \\ \end{bmatrix} $$

### Sharpen

$$ \begin{bmatrix} 0 & -1 & 0 \\ -1 & 5 & -1 \\ 0 & -1 & 0 \\ \end{bmatrix} $$
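
As a sketch of how these kernels are applied, here is a plain-loop 2D sliding window in NumPy run on a made-up 5x5 image with a single bright pixel (all three kernels above are unchanged by the 180-degree flip that true convolution performs, so the loop written here gives the same result as true convolution):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` and take weighted sums at every "valid"
    position (where the kernel fits entirely inside the image). For the
    flip-symmetric kernels used here this equals true convolution."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

gaussian = np.array([[1, 2, 1],
                     [2, 4, 2],
                     [1, 2, 1]]) / 16.0

sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])

# A small test "image": all zeros except one bright pixel in the middle.
image = np.zeros((5, 5))
image[2, 2] = 16.0

print(convolve2d(image, gaussian))  # the bright pixel is spread into a 3x3 blob
print(convolve2d(image, sharpen))   # the bright pixel is boosted to 80, neighbors dip negative
```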

## Connection to Fourier Transforms

Until now, we've been understanding convolution as an operation that happens over images, in the image space. However, a more complete and rigorous understanding is obtained by analyzing convolution in the frequency domain, using Fourier transforms.

You'll notice that convolution filters for the most part fall into two categories: making things pointy or making things smooth. Sharpen and edge detection make things pointy; noise removal and Gaussian blur make things smooth.

The pointy filters are essentially high-pass filters in the frequency domain: they remove the low-frequency (smooth) components of the signal and leave the high-frequency (pointy, noisy) ones. The smooth filters, on the other hand, are essentially low-pass filters, removing high-frequency (noisy) components and leaving low-frequency (smooth) ones. This follows from the convolution theorem: convolving in the image domain is equivalent to multiplying in the frequency domain, so a filter's Fourier transform tells us exactly which frequencies it keeps and which it suppresses.
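
As a small sanity check of this claim (the setup here is my own, not a standard recipe), we can zero-pad a kernel and look at the magnitude of its Fourier transform: the smoothing kernel responds strongly at low frequencies and weakly at high ones, while the edge-detection kernel does the opposite:

```python
import numpy as np

n = 64  # length of the zero-padded signal

# Zero-pad each 3-tap kernel to length n so we get a finely sampled spectrum.
smooth = np.zeros(n); smooth[:3] = 1 / 3        # [1/3, 1/3, 1/3]
edge   = np.zeros(n); edge[:3]   = [1, -2, 1]   # [1, -2, 1]

# Magnitude of the frequency response; index 0 is frequency zero (perfectly
# smooth), the last index is the highest representable (Nyquist) frequency.
smooth_resp = np.abs(np.fft.rfft(smooth))
edge_resp   = np.abs(np.fft.rfft(edge))

print(smooth_resp[0], smooth_resp[-1])  # ~1.0 at low frequency, ~0.33 at high
print(edge_resp[0], edge_resp[-1])      # ~0.0 at low frequency, ~4.0 at high
```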

This connection is very deep and very complicated. I will not cover the mathematical details of this connection, but you can find them here. A proper understanding of this connection to the Fourier transform is necessary to use convolution to its full potential.