Can we trust our computers ?
Few months ago, I gave a talk at the IAS (Princeton) on matrix perturbation (see the video). Here is a brief review.
Let be an matrix with singular values and with corresponding singular vectors . One perturbs by a small matrix ( stands for error). The problem is to measure how this perturbation changes and . If you are not familiar with singular values, just think of eigenvalues (assuming that and are symmetric). In the whole discussion, we consider singular vectors with unit norm.
This is a fundamental question in matrix theory and numerical analysis, and there is a vast literature on the topic. It is easy to imagine why this problem is critical in practice and draws so much attention. Most data in real-life problems are given in matrix form, and performing spectral analysis on these matrices is one of the most, if not the most, common routines done by data analysts. On the other hand, all data contain some degree of noise/error, and that leads to the obvious question
Are the answers we get from our computers good approximations of the truth ?
Let us present the most classical perturbation bounds. For singular values, we have the Weyl’s bound
where is the spectral norm of .
This bound is sharp, and shows that the impact of is continuous, namely it the error caused by decreases to zero as tends to zero. This is expected, the less error, the better answer we are supposed to get.
The computation of the singular vectors, however, hides a big surprise. Consider the following matrix
where is a small positive number.
The perturbed matrix has the form
It is clear that the eigenvalues of are , with eigenvectors . On the other hand, the eigenvalues of are also with eigenvectors . Thus, the eigenvectors of do not approach those of , as . As a matter of fact, they are always 45 degree apart no matter how small is.
Researchers have found out that the reason for this is that the eigenvalues themselves are too close. Let be the singular vectors of . We now present a well known result of Davis-Kahan-Wedin from the 1960s. We adopt a tradition in numerical analysis and measure the distance between unit vectors and by the sine of the angle between them (measured to be between 0 and 180 degree). In this form their result shows
where is the distance from to the closest singular value. Again, this bound is sharp.
Let us now fast forward a half century. Bounds like (1) and (2) are still used very frequently in the literature. However, recent studies in big data (and data analysis in general) provide new challenges and opportunities.
First, one needs to model noise. In most situations, researchers think about noise as random, as it is truly hard to model it any other way. Let us take a toy model (which is rather naive, but it serves well as an illustration) where is a random matrix with iid entries of mean zero and variance one (one can of course renormalize and assume the variance be ; we leave it to the reader to work out the details). Now, how do (1) and (2) perform ?
Results from random matrix theory shows that with overwhelming probability, the spectral norm of is about . Thus, we have
Again, bounds of this type are routinely used in applications. Let us, however, take a further look on the matrix . While (3) and (4) hold for any , in practice we often know something about . A well known fact in data analysis is that data usually possesses some kind of structure. Would we be able to use this extra information to our advantage ? The goal of the lecture is to give an affirmative answer to this question.
One of the most popular assumptions made in applications is that the data matrix has low rank. It is motivated by the following fact: while the size of the matrix can be ernomous, its entries are governed by much fewer parameters. A good example is the Netflix matrix, where the rows are viewers and columns are titles of movies and the entries are ratings. While there are million of viewers and ten of thousands of titles, the ratings are determined by a much smaller number of parameters such as gender and age of the viewer, his/her occupation, gender of the movie, when it was made etc. Thus, the entries are functions of few variables, and the low rank assumption can be seen as a first order approximation. This assumption plays an important part in many algorithms attacking the 1.000.000 dollars Netflix problem.
Our key observation is that when has low rank, say , the true dimension of the problem is , not . Thus, one may hope to improve (3) and (4) to
The main part of the talk showed that we can achieve results close to these hopes, which significantly improve the classical bounds, using tools from random matrix theory and high dimensional geometry. Several applications and open problems are also discussed at the end. See this link for the relevant paper.