Computer Vision

Taylor series of an image

Taylor series is the main reason that the field of computer vision exists*. This may seem like a bold statement, but think about it. Classical computer vision problems like image alignment, optical flow, depth estimation (from stereo) or shape from motion, all were first solved with Taylor series expansion.

And this should not be very surprising. Most computer vision problems are just optimization problems, and it is most natural that the first computer vision researchers used existing literature on optimization, which is most mature for continuous functions.

But this meant that researchers had to represent an image as a function, and a continuous one at that. It is by no means obvious how one can represent an image as a function. Enter Taylor series. This is most easily explained with an example of an optimization problem.

Suppose that you are given an image $I(x,y)$, and the same image but shifted horizontally by an unknown quantity, i.e. $I(x+dx, y)$. How can you find the amount of horizontal shift, $dx$? This is a simplified version of the classical computer vision problem of image alignment.

To apply traditional tools of the optimization literature, one must find a function $f(z)$ whose value is minimal at $z=dx$. Then the problem is translated into the problem of minimizing $f(z)$. Such a function is simply the (squared) norm of the difference between the two images, $$f(z)=\|I(x+z, y)-I(x+dx, y)\|_2^2,$$ because the minimal value of this function is $0$ and is attained with $z=dx$.
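To make the objective concrete, here is a minimal sketch that evaluates $f(z)$ by brute force over integer candidate shifts. It is only an illustration under stated assumptions: the image is synthetic random noise, the boundary is treated as periodic (via `np.roll`), and the names `shift_cost`, `image`, `target` and `true_dx` are hypothetical. It does not handle sub-pixel shifts, which is exactly where the continuous formulation becomes necessary.

```python
import numpy as np

def shift_cost(image, target, z):
    """f(z) = ||I(x+z, y) - target||^2 for an integer shift z (periodic boundary)."""
    # Sampling I at x+z corresponds to rolling the columns left by z.
    shifted = np.roll(image, -z, axis=1)
    return float(np.sum((shifted - target) ** 2))

# Synthetic image (an assumption for illustration): random intensities.
rng = np.random.default_rng(0)
image = rng.random((32, 64))

# The "unknown" shift, used here only to build the second image I(x+dx, y).
true_dx = 5
target = np.roll(image, -true_dx, axis=1)

# Evaluate f(z) for every candidate and keep the minimizer.
candidates = range(-10, 11)
costs = {z: shift_cost(image, target, z) for z in candidates}
best_z = min(costs, key=costs.get)
```

In this toy setting `best_z` recovers `true_dx` and the cost there is exactly zero, matching the claim that the minimum of $f$ is attained at $z=dx$.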

Well, we made progress but we are not quite there. To be able to solve the problem, we need to express $I(x+z,y)$ as a function where the parameter $z$ is explicit; this will allow us to use the traditional tools of function minimization: setting the first derivative w.r.t. $z$ to $0$ and so on. But how can we express $I(x+z,y)$ as a function of $z$? Think about it, the entries of $I(x,y)$ are just pixel intensities for each pixel $(x,y)$. Let’s try to think what this function really is: the intensity of a pixel is determined by your camera’s sensor; it is the activation level of a cell of that sensor. Of course it is also determined by where your camera is looking, the level and direction of illumination, as well as other factors. Now try to express this “function” analytically… Good luck!

This is why it really is hard to overestimate the importance of Taylor series… How on earth can one otherwise hope to express $I(x+z,y)$ as a function of $z$? With Taylor series, the mystery is solved: we can write $I(x+z,y)$ as an infinite series $$I(x+z,y)=I(x,y) + z \frac{\partial I}{\partial x}+ \frac{z^2}{2!} \frac{\partial^2I}{\partial x^2} + \dots$$ where all derivatives are evaluated at $(x,y)$.
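A quick numerical check can show how truncating the series behaves. The sketch below uses $\sin$ as a 1D stand-in for an image row, purely because its derivatives are known in closed form; a real image has no such formula, and the names `taylor_sin`, `x`, `z` and `errors` are hypothetical.

```python
import math

def taylor_sin(x, z, order):
    """Approximate sin(x+z) with Taylor terms up to `order`, expanded around x."""
    # Derivatives of sin cycle through sin, cos, -sin, -cos.
    derivs = [math.sin(x), math.cos(x), -math.sin(x), -math.cos(x)]
    return sum(derivs[n % 4] * z**n / math.factorial(n) for n in range(order + 1))

x, z = 1.0, 0.3
exact = math.sin(x + z)

# Absolute error of the order-0, 1, 2 and 3 truncations.
errors = [abs(taylor_sin(x, z, k) - exact) for k in range(4)]
```

For this small shift each extra term shrinks the error, which is why low-order truncations are usable in practice when the shift is small relative to how fast the signal varies.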

Ta-da! We managed to express $I(x+z,y)$ as a function of $z$. Even better, $z$ appears only as powers $z^n$; thus the derivative of this approximation, which we’ll use to find the minimum of $f(z)$, is computed trivially. Now you may complain that we traded one problem for another, because now we have derivative terms $\frac{\partial^n I}{\partial x^n}$, and it’s fair to ask: How can we compute the derivative of the function $I$ if we don’t know the function $I$? Fortunately, there is a satisfactory answer to this. Recall that the partial derivative of a function is defined as

$$\frac{\partial I}{\partial x}= \lim_{dx \to 0} \frac{I(x+dx, y)-I(x,y)}{dx}$$
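On a pixel grid the limit cannot be taken, but the quotient can be evaluated with a step of one pixel. The sketch below is one possible way to put the pieces together, not a definitive method: it approximates $\frac{\partial I}{\partial x}$ with `np.gradient` (central differences in the interior, a symmetric variant of the quotient above), keeps only the first-order Taylor term, and minimizes the resulting quadratic in $z$, which gives $z = \sum I_x\,(J - I) / \sum I_x^2$ with $J$ denoting the shifted image. The smooth synthetic `scene` function, the sub-pixel shift of 0.4 and all variable names are assumptions made for illustration.

```python
import numpy as np

def scene(X, Y):
    """Smooth synthetic intensity pattern (an assumption, used only to generate data)."""
    return np.sin(2 * np.pi * X / 64) + 0.5 * np.cos(2 * np.pi * Y / 32)

# Pixel grid: 32 rows, 64 columns.
X, Y = np.meshgrid(np.arange(64), np.arange(32))

true_dx = 0.4                      # sub-pixel shift we pretend not to know
image = scene(X, Y)                # I(x, y)
target = scene(X + true_dx, Y)     # I(x + dx, y)

# Finite-difference approximation of dI/dx with a one-pixel step.
Ix = np.gradient(image, axis=1)

# First-order model: I(x+z, y) ~ I + z * Ix.
# Setting d/dz of sum((I + z*Ix - target)^2) to zero gives the estimate below.
z_hat = np.sum(Ix * (target - image)) / np.sum(Ix ** 2)
```

Here `z_hat` lands close to `true_dx` because the pattern is smooth and the shift is small; for larger shifts or noisier images a first-order truncation would be expected to degrade.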

*A slightly less exaggerated and ~undeniably true version of this statement is: Taylor series is the main reason that the field of computer vision was established the way it was established.