The minimum polynomial initially looks awfully similar to the characteristic polynomial, and it is not clear whether learning about it has any practical utility. But it does. In fact, one of the most popular and efficient optimization techniques, namely conjugate gradients, relies on Krylov sequences, which are built upon the concept of the minimum polynomial.

First, let’s see how the characteristic and minimum polynomials are defined. The characteristic polynomial $p(x)$ of a matrix $\mathbf A$ is $$p(x) = \prod_{i=1}^{S}(x-\lambda_i)^{a_i},$$ where $S$ is the number of distinct eigenvalues of $\mathbf A$, and $a_i$ is the algebraic multiplicity of $\lambda_i$—the number of times $\lambda_i$ is repeated. The minimum polynomial $c(x)$, on the other hand, is defined as $$c(x) = \prod_{i=1}^{S}(x-\lambda_i)^{k_i},$$ where $k_i$ is the index of the $i$th eigenvalue (i.e., the size of the largest Jordan block for $\lambda_i$; see p. 592 of Carl D. Meyer). In general, $k_i \leq a_i$, and if a matrix is nonderogatory (p. 645), then $k_i=a_i$. But the most interesting case is when $k_i<a_i$. In other words, it is most interesting to see what happens when an eigenvalue is repeated while its Jordan segment has few 1’s on the superdiagonal; or, in practice, when all the eigenvalues of a matrix $\mathbf A$ fall into a few clusters. Because, as we’ll see, this is the key to efficiency when solving large linear systems.
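To make the distinction concrete, here is a minimal NumPy sketch on a hypothetical $3\times 3$ diagonalizable matrix with a repeated eigenvalue: the characteristic polynomial has degree 3, but because every Jordan block has size 1 (so $k_i = 1 < a_i$ for the repeated eigenvalue), a degree-2 polynomial already annihilates the matrix.

```python
import numpy as np

# Hypothetical example: eigenvalue 2 repeated (a_1 = 2) and eigenvalue 3 (a_2 = 1).
# A is diagonalizable, so every Jordan block has size 1 and k_1 = k_2 = 1.
A = np.diag([2.0, 2.0, 3.0])
I = np.eye(3)

# Characteristic polynomial evaluated at A: (A - 2I)^2 (A - 3I), degree 3
char_at_A = (A - 2 * I) @ (A - 2 * I) @ (A - 3 * I)

# Minimum polynomial evaluated at A: (A - 2I)(A - 3I), degree 2 < 3
min_at_A = (A - 2 * I) @ (A - 3 * I)

print(np.allclose(char_at_A, 0))  # True (Cayley-Hamilton)
print(np.allclose(min_at_A, 0))   # True: the lower-degree polynomial already annihilates A
```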

Before we explain how having few eigenvalues leads to efficiency, we need to define the Krylov sequences mentioned at the beginning. Krylov sequences are a direct extension of the concept of the minimum polynomial to vectors. We say that $v(x)$ is the minimum polynomial for a vector $\mathbf b$ relative to a square matrix $\mathbf A$ if it is the monic polynomial of minimal degree such that $v(\mathbf A)\mathbf b = \mathbf 0$ (p. 646 of Carl D. Meyer). This definition may seem a little obscure, but we don’t need to fully understand it to appreciate the importance of minimum polynomials. All we need to understand is the degree of this polynomial, which is computed as follows. If $\mathcal K_j$ is defined as $$\mathcal K_j = \{\mathbf{b}, \mathbf{Ab}, \mathbf{A^2b}, \dots, \mathbf{A}^{j-1} \mathbf{b}\},$$ then the degree of $v(x)$ is the maximal $j$ such that $\mathcal K_j$ is a linearly independent set of vectors. The set $\mathcal K_j$ is called a Krylov sequence.
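The degree computation can be sketched directly: keep appending $\mathbf A^j \mathbf b$ to a matrix and stop as soon as the new column is already in the span of the previous ones. The function name and tolerance below are ours, not from the book.

```python
import numpy as np

def krylov_degree(A, b, tol=1e-10):
    """Degree of the minimum polynomial of b relative to A: the largest j
    such that {b, Ab, ..., A^(j-1) b} is a linearly independent set."""
    n = len(b)
    K = np.empty((n, 0))
    v = b.astype(float)
    for j in range(1, n + 1):
        K = np.column_stack([K, v])          # K now holds j Krylov vectors
        if np.linalg.matrix_rank(K, tol=tol) < j:
            return j - 1                     # newest vector was already in the span
        v = A @ v
    return n

A = np.diag([2.0, 2.0, 3.0])   # two distinct eigenvalues, both of index 1
b = np.array([1.0, 1.0, 1.0])
print(krylov_degree(A, b))      # 2: A^2 b is already a combination of b and Ab
```

Note that the degree (2) matches the degree of the minimum polynomial of $\mathbf A$ itself, even though $\mathbf A$ is $3\times 3$.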

Now we are in place to put the pieces together and see why the minimum polynomials matter. The most critical theorem is the following:

For any linear system $\mathbf{Ax=b}$, the solution $\mathbf x \in \mathbb{R}^n$ lies in $\text{span}\,\mathcal K_k=\text{span}\{\mathbf{b}, \mathbf{Ab}, \mathbf{A^2b}, \dots, \mathbf{A}^{k-1} \mathbf{b}\}$, where $k$ is the degree of the minimum polynomial of $\mathbf b$ relative to $\mathbf A$.

(p. 650 of Carl D. Meyer)
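We can check the theorem directly on a small hypothetical example: the matrix below has two distinct eigenvalues, so the minimum polynomial of $\mathbf b$ has degree $k \le 2$, and the solution of $\mathbf{Ax=b}$ should lie in $\text{span}\{\mathbf b, \mathbf{Ab}\}$.

```python
import numpy as np

A = np.diag([2.0, 2.0, 3.0])     # two distinct eigenvalues: 2 and 3
b = np.array([1.0, 1.0, 1.0])
x = np.linalg.solve(A, b)        # exact solution of Ax = b

# Express x in the Krylov basis {b, Ab} via least squares
K = np.column_stack([b, A @ b])
coeffs, *_ = np.linalg.lstsq(K, x, rcond=None)
print(np.allclose(K @ coeffs, x))  # True: x is a combination of b and Ab
```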

This theorem is critical because it says that whenever $k < n$, we don’t have to search the entire $\mathbb R^n$ for the solution of $\mathbf{Ax=b}$, but only a $k$-dimensional subspace of it. And $k<n$ is more likely to happen, for any $\mathbf b$, if the matrix $\mathbf A$ has few distinct eigenvalues${}^*$. Take the identity matrix $\mathbf I_n$ as an example. For any $\mathbf b$, the solution to $$\mathbf{Ix=b}$$ always lies in the one-dimensional Krylov space spanned by $\mathcal K_1=\{\mathbf b\}$, and this happens precisely because $\mathbf I_n$ has only one distinct eigenvalue, namely $1$, and therefore $k_i<a_i$ (see the second paragraph). And the conjugate gradients algorithm is the mechanism that takes advantage of this fact — the fact that the solution always lies in the Krylov space, which can be much smaller than the entire $\mathbb R^n$.
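Here is a minimal textbook-style conjugate gradient loop (no preconditioning) run on a hypothetical symmetric positive definite matrix with only two distinct eigenvalues. The matrix is $100\times 100$, but since the minimum polynomial has degree 2, CG should converge in just a couple of iterations (2 in exact arithmetic), not 100.

```python
import numpy as np

n = 100
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal eigenbasis
eigs = np.where(np.arange(n) < n // 2, 1.0, 4.0)   # only two distinct eigenvalues
A = Q @ np.diag(eigs) @ Q.T                        # SPD with clustered spectrum
b = rng.standard_normal(n)

# Standard conjugate gradient iteration
x = np.zeros(n)
r = b - A @ x          # residual
p = r.copy()           # initial search direction
iterations = 0
while np.linalg.norm(r) > 1e-10 and iterations < n:
    Ap = A @ p
    alpha = (r @ r) / (p @ Ap)     # step length along p
    x += alpha * p
    r_new = r - alpha * Ap
    beta = (r_new @ r_new) / (r @ r)
    p = r_new + beta * p           # new A-conjugate direction
    r = r_new
    iterations += 1

print(iterations)            # a couple of iterations, far fewer than n
print(np.allclose(A @ x, b))  # True
```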

In practice, if $\mathbf A$ is collected from real data or is the product of floating-point computations, then it is unlikely that any two eigenvalues of $\mathbf A$ will be identical. But Krylov methods like conjugate gradients can still be efficient if all eigenvalues fall into a small number of distinct clusters (p. 659 of Carl D. Meyer). What’s more interesting is that it is possible to precondition a matrix $\mathbf A$ to have few distinct (clusters of) eigenvalues by “left-multiplying”${}^{**}$ it with a matrix $\mathbf{M}^{-1}$. Of course, it is difficult (and probably impossible) to find preconditioning matrices that work for all linear systems, but still, if the structure of the problem allows, we may think of ways of devising an $\mathbf M^{-1}$ that forces the eigenvalues of $\mathbf{M}^{-1}\mathbf A$ to be distributed in a small number of tight clusters.
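As a sketch of the idea, consider a hypothetical nearly-diagonal matrix whose diagonal entries span several orders of magnitude, preconditioned by the simple diagonal (“Jacobi”) choice $\mathbf M = \text{diag}(\mathbf A)$. We form $\mathbf M^{-1}\mathbf A$ explicitly here only to inspect its eigenvalues; a real solver would never do this.

```python
import numpy as np

n = 50
rng = np.random.default_rng(1)
D = np.diag(10.0 ** rng.uniform(0, 4, n))       # diagonal scales spanning 1 to 10^4
A = D + rng.standard_normal((n, n)) * 0.001     # nearly diagonal matrix
M_inv = np.diag(1.0 / np.diag(A))               # Jacobi preconditioner: a row scaling

# Spread of the eigenvalue magnitudes (max - min)
spread = lambda B: np.ptp(np.abs(np.linalg.eigvals(B)))
print(spread(A))          # eigenvalues spread over several orders of magnitude
print(spread(M_inv @ A))  # eigenvalues of M^{-1} A cluster tightly around 1
```

The preconditioned spectrum sits in one tight cluster near 1, which is exactly the situation in which Krylov methods converge quickly.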

And this, my friends, is precisely why we should care about having few distinct eigenvalues.

Footnotes:
* More precisely, here we need to make the link between the minimum polynomial of the vector $\mathbf b$ relative to $\mathbf A$ and the minimum polynomial of the matrix $\mathbf A$. This link is made on p. 647 of Carl D. Meyer. In a nutshell, if $\mathbf A$ has few distinct eigenvalues (all with small index $k_i$), then it doesn’t matter which $\mathbf b$ the linear system $\mathbf{Ax=b}$ is being solved for; an efficient solution will always be possible. But if this is not the case (e.g., if the matrix $\mathbf A$ is nonderogatory), then an efficient solution in fewer than $n$ steps can still be possible, but now it depends on $\mathbf b$. For example, if $\mathbf b$ is an eigenvector of $\mathbf A$, then the minimum polynomial of $\mathbf b$ relative to $\mathbf A$ will be of degree 1 even if the minimum polynomial of $\mathbf A$ is of degree $n$.
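This dependence on $\mathbf b$ is easy to see numerically. The hypothetical matrix below has $n$ distinct eigenvalues, so its minimum polynomial has full degree $n$; yet the degree for a vector $\mathbf b$ (the rank of the Krylov matrix $[\mathbf b, \mathbf{Ab}, \dots, \mathbf A^{n-1}\mathbf b]$) drops to 1 when $\mathbf b$ is an eigenvector.

```python
import numpy as np

# A has n = 4 distinct eigenvalues, so it is nonderogatory and its
# minimum polynomial has degree 4. The degree for a vector b depends on b.
A = np.diag([1.0, 2.0, 3.0, 4.0])

def krylov_rank(A, b):
    # rank of [b, Ab, ..., A^(n-1) b] = degree of the minimum polynomial of b
    K = np.column_stack([np.linalg.matrix_power(A, j) @ b for j in range(len(b))])
    return np.linalg.matrix_rank(K)

generic_b = np.array([1.0, 1.0, 1.0, 1.0])   # has components along every eigenvector
eigvec_b = np.array([0.0, 1.0, 0.0, 0.0])    # an eigenvector for eigenvalue 2

print(krylov_rank(A, generic_b))  # 4: no shortcut for this b
print(krylov_rank(A, eigvec_b))   # 1: (A - 2I)b = 0, so the degree is 1
```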

** We put quotation marks here because we don’t actually left-multiply the matrix (this would probably be very inefficient, especially for large $n$) but perform the corresponding operation in some other way, e.g., by solving a system with $\mathbf M$ at each iteration.