Bỏ qua nội dung

Few months ago, I gave a talk at the IAS (Princeton) on matrix perturbation (see the video). Here is a brief review.

Let ${A}$ be an ${n \times n}$ matrix with singular values ${\sigma_1 \ge \dots \ge \sigma_n}$ and with corresponding singular vectors ${u_1, \dots, u_n}$. One perturbs ${A}$ by a small matrix ${E}$ ( ${E}$ stands for error). The problem is to measure how this perturbation changes ${\sigma_i}$ and ${u_i}$. If you are not familiar with singular values, just think of eigenvalues (assuming that ${A}$ and ${E}$ are symmetric). In the whole discussion, we consider singular vectors with unit norm.

This is a fundamental question in matrix theory and numerical analysis, and there is a vast literature on the topic. It is easy to imagine why this problem is critical in practice and draws so much attention. Most data in real-life problems are given in matrix form, and performing spectral analysis on these matrices is one of the most, if not the most, common routines done by data analysts. On the other hand, all data contain some degree of noise/error, and that leads to the obvious question

Are the answers we get from our computers good approximations of the truth ?

Let us present the most classical perturbation bounds. For singular values, we have the Weyl’s bound $\displaystyle | \sigma_i ( A+E) - \sigma_i A | \le \| E \| , \ \ \ \ \ (1)$

where ${\| E \| := \sigma_1 (E)}$ is the spectral norm of ${E}$.
This bound is sharp, and shows that the impact of ${E}$ is continuous, namely it the error caused by ${E}$ decreases to zero as ${\| E \|}$ tends to zero. This is expected, the less error, the better answer we are supposed to get.

The computation of the singular vectors, however, hides a big surprise. Consider the following matrix ${A}$ $\displaystyle \begin{pmatrix} 1+\epsilon & 0 \\ 0 & 1-\epsilon \end{pmatrix} .$

Let ${E}$ be $\displaystyle \begin{pmatrix} -\epsilon & \epsilon \\ \epsilon & \epsilon \end{pmatrix} ,$

where ${\epsilon}$ is a small positive number.
The perturbed matrix ${A+E}$ has the form $\displaystyle \begin{pmatrix} 1 & \epsilon \\ \epsilon & 1 \end{pmatrix}.$

It is clear that the eigenvalues of ${A}$ are ${1\pm \epsilon}$, with eigenvectors ${(1,0), (0,1)}$. On the other hand, the eigenvalues of ${A+E}$ are also ${1 \pm \epsilon}$ with eigenvectors ${\frac{1}{\sqrt 2}(1,1), \frac{1}{\sqrt 2} (1,-1) }$. Thus, the eigenvectors of ${A+E}$ do not approach those of ${A}$, as ${\epsilon \rightarrow 0}$. As a matter of fact, they are always 45 degree apart no matter how small ${\epsilon}$ is.

Researchers have found out that the reason for this is that the eigenvalues themselves are too close. Let ${v_i}$ be the singular vectors of ${A+E}$. We now present a well known result of Davis-Kahan-Wedin from the 1960s. We adopt a tradition in numerical analysis and measure the distance between unit vectors ${u}$ and ${v}$ by the sine of the angle between them (measured to be between 0 and 180 degree). In this form their result shows $\displaystyle \sin \angle (u_i, v_i) \le \frac{ \| E \| }{ \delta_i }, \ \ \ \ \ (2)$

where ${\delta_i}$ is the distance from ${\sigma_i (A)}$ to the closest singular value. Again, this bound is sharp.
\vskip2mm

Let us now fast forward a half century. Bounds like (1) and (2) are still used very frequently in the literature. However, recent studies in big data (and data analysis in general) provide new challenges and opportunities.

First, one needs to model noise. In most situations, researchers think about noise as random, as it is truly hard to model it any other way. Let us take a toy model (which is rather naive, but it serves well as an illustration) where ${E}$ is a random matrix with iid entries of mean zero and variance one (one can of course renormalize and assume the variance be ${\sigma^2}$; we leave it to the reader to work out the details). Now, how do (1) and (2) perform ?

Results from random matrix theory shows that with overwhelming probability, the spectral norm ${\|E \|}$ of ${E}$ is about ${(2+ o(1)) \sqrt n }$. Thus, we have $\displaystyle | \sigma_i ( A+E) - \sigma_i A | \le (2+o(1)) \sqrt n , \ \ \ \ \ (3)$

and $\displaystyle \sin \angle (u_i, v_i) \le \frac{ (2+o(1)) \sqrt n }{ \delta_i }. \ \ \ \ \ (4)$

Again, bounds of this type are routinely used in applications. Let us, however, take a further look on the matrix ${A}$. While (3) and (4) hold for any ${A}$, in practice we often know something about ${A}$. A well known fact in data analysis is that data usually possesses some kind of structure. Would we be able to use this extra information to our advantage ? The goal of the lecture is to give an affirmative answer to this question.

One of the most popular assumptions made in applications is that the data matrix has low rank. It is motivated by the following fact: while the size of the matrix can be ernomous, its entries are governed by much fewer parameters. A good example is the Netflix matrix, where the rows are viewers and columns are titles of movies and the entries are ratings. While there are million of viewers and ten of thousands of titles, the ratings are determined by a much smaller number of parameters such as gender and age of the viewer, his/her occupation, gender of the movie, when it was made etc. Thus, the entries are functions of few variables, and the low rank assumption can be seen as a first order approximation. This assumption plays an important part in many algorithms attacking the 1.000.000 dollars Netflix problem.

Our key observation is that when ${A}$ has low rank, say ${r \ll n}$, the true dimension of the problem is ${r}$, not ${n}$. Thus, one may hope to improve (3) and (4) to $\displaystyle | \sigma_i ( A+E) - \sigma_i A | \le C \sqrt r , \ \ \ \ \ (5)$

and $\displaystyle \sin \angle (u_i, v_i) \le \frac{ C \sqrt r }{ \delta_i }. \ \ \ \ \ (6)$

The main part of the talk showed that we can achieve results close to these hopes, which significantly improve the classical bounds, using tools from random matrix theory and high dimensional geometry. Several applications and open problems are also discussed at the end. See this link for the relevant paper.

From → Chưa phân loại

One Comment