Optimal low-rank matrix approximation
Suppose you have an m by n matrix A, where m and n are very large, that you'd like to compress. That is, you'd like to come up with an approximation of A that takes less data to describe.
For example, consider a high resolution photo that as a matrix of gray scale values. An approximation to the matrix could be a lower resolution representation that takes less space.
I heard a rumor many years ago that space probes would compress images by interpreting an image as a matrix and sending back a few eigenvalues and eigenvectors. That sounded fascinating, but what about images that aren't square? If a matrix M is not square, then you can't have Mx = Ix for a scalar I because the left and right sides of the equation would have different dimensions.
Truncated SVDApproximate a rectangular matrix requires using something more general than eigenvalues and eigenvectors, and that is singular values and singular vectors.
Suppose our matrix A has the singular value decomposition
This can be written as
where the Ifi are the singular values and p is the number of singular values that are non-zero. The ui and vi are the ith columns of U and V respectively. Singular values are nonnegative, and listed in decreasing order.
We can form an approximation to A by truncating the sum above. Instead of taking all the singular values, and their corresponding left and right singular vectors, we only take the k largest singular values and their corresponding vectors.
This is called the truncated SVD.
We started by assuming A had dimensions m by n and that both were large. Storing A requires mn numbers. Storing Ak requires only k(1 + m + n) numbers. This is a big savings if k is much smaller than m and n.
Eckart-Young theoremSo how good an approximation is Ak ? Turns out it is optimal, in the least squares sense. This is the content of the Eckart-Young theorem. It says that the best least squares (2-norm) approximation of A by a rank k matrix is given by Ak.
Not only that, the theorem says the 2-norm error is given by the first singular value that we didn't use, i.e.
Singular value decomposition and pseudoinverse