Introduction to Information Retrieval, Chapter 18 notes (in English)
Matrix decompositions and latent semantic indexing
term-document matrix: an M × N matrix C, each of whose rows represents a term and each of whose columns represents a document in the collection.
- develop a class of operations from linear algebra, known as matrix decomposition
- use a special form of matrix decomposition to construct a low-rank approximation to the term-document matrix
- examine the application of such low-rank approximations to indexing and retrieving documents, a technique referred to as latent semantic indexing
Linear algebra review
- eigenvalues of C
For a square M × M matrix C and a vector x that is not all zeros, the values of λ satisfying Cx = λx are called the eigenvalues of C (and such an x is a corresponding right eigenvector).
The eigenvector corresponding to the eigenvalue of largest magnitude is called the principal eigenvector.
In a similar fashion, the left eigenvectors of C are the M-vectors y such that yᵀC = λyᵀ.
The number of nonzero eigenvalues of C is at most rank(C).
Note:
- the effect of small eigenvalues (and their eigenvectors) on a matrix–vector product is small
- For a symmetric matrix S, the eigenvectors corresponding to distinct eigenvalues are orthogonal. Further, if S is both real and symmetric, the eigenvalues are all real.
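Both notes can be checked numerically. A minimal numpy sketch (the symmetric matrix S and the vector x below are made-up examples, not from the book):

```python
import numpy as np

# Made-up real symmetric matrix: one eigenvalue (0.01) is much smaller
# than the others (1 and 3).
S = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.0, 0.0, 0.01]])

vals, vecs = np.linalg.eigh(S)                # eigh is for symmetric matrices
print(vals)                                   # all real: [0.01, 1., 3.]
print(np.allclose(vecs.T @ vecs, np.eye(3)))  # True: orthonormal eigenvectors

# Since Sx = sum_i lambda_i (v_i . x) v_i, dropping the term with the
# tiny eigenvalue changes the matrix-vector product only slightly:
x = np.ones(3)
keep = vals > 0.1                             # keep only the two large eigenvalues
approx = (vecs[:, keep] * vals[keep]) @ (vecs[:, keep].T @ x)
print(np.linalg.norm(S @ x - approx))         # 0.01 -- a small perturbation
```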
Matrix decompositions
a square matrix can be factored into the product of matrices derived from its eigenvectors
- Two theorems
- Let S be a square real-valued M × M matrix with M linearly independent eigenvectors. Then there exists an eigen decomposition S = UΛU⁻¹, where the columns of U are the eigenvectors of S and Λ is a diagonal matrix whose diagonal entries are the eigenvalues of S in decreasing order. If the eigenvalues are distinct, then this decomposition is unique.
- Let S be a square, symmetric real-valued M × M matrix with M linearly independent eigenvectors. Then there exists a symmetric diagonal decomposition S = QΛQᵀ, where the columns of Q are the orthogonal, normalized (unit-length, real) eigenvectors of S, and Λ is the diagonal matrix whose entries are the eigenvalues of S. Further, Q⁻¹ = Qᵀ.
We build on this symmetric diagonal decomposition to construct low-rank approximations to term–document matrices.
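A short numpy sketch of the symmetric diagonal decomposition (the 2 × 2 matrix S is a made-up example):

```python
import numpy as np

S = np.array([[2.0, 1.0],
              [1.0, 2.0]])                 # real and symmetric

vals, Q = np.linalg.eigh(S)                # columns of Q: orthonormal eigenvectors
Lam = np.diag(vals)                        # Λ: diagonal matrix of eigenvalues

print(np.allclose(Q @ Lam @ Q.T, S))       # True: S = Q Λ Q^T
print(np.allclose(Q.T, np.linalg.inv(Q)))  # True: Q^{-1} = Q^T (Q is orthogonal)
```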
Term–document matrices and singular value decompositions
The term-document matrix C is M × N, so in general it is not even square, and a square term-document matrix is still very unlikely to be symmetric; we therefore need a decomposition that applies to an arbitrary matrix.
- Theorem (SVD). Let r be the rank of the M × N matrix C. Then there is a singular value decomposition C = UΣVᵀ, where the columns of the M × M matrix U are the orthogonal eigenvectors of CCᵀ, the columns of the N × N matrix V are the orthogonal eigenvectors of CᵀC, and Σ is an M × N matrix with Σii = σi = √λi for 1 ≤ i ≤ r and zeros elsewhere; the λi are the shared nonzero eigenvalues of CCᵀ and CᵀC, and the σi (in decreasing order) are the singular values of C.
- Illustration of the SVD
there are two cases for the shapes of the factor matrices:
- M > N (more terms than documents)
- M < N (more documents than terms)
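A minimal numpy sketch of the SVD, using a made-up 5 × 3 term-document incidence matrix (the M > N case):

```python
import numpy as np

# Made-up 5-term x 3-document incidence matrix (M = 5 > N = 3).
C = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0],
              [1, 0, 0],
              [0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(C, full_matrices=False)  # reduced SVD
print(U.shape, s.shape, Vt.shape)                 # (5, 3) (3,) (3, 3)

# Columns of U are eigenvectors of C C^T, columns of V (rows of Vt) are
# eigenvectors of C^T C, and the singular values in s are the square
# roots of their shared nonzero eigenvalues.
print(np.allclose(U @ np.diag(s) @ Vt, C))        # True: C = U Σ V^T
```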
Low-rank approximations
- Frobenius norm
Given an M × N matrix C and a positive integer k, we wish to find an M × N matrix Ck of rank at most k, so as to minimize the Frobenius norm of the matrix difference X = C − Ck, defined to be ‖X‖F = √(∑ᵢ₌₁ᴹ ∑ⱼ₌₁ᴺ Xij²).
the Frobenius norm of X measures the discrepancy between Ck and C; our goal is to find a matrix Ck that minimizes this discrepancy
When k is far smaller than r (the rank of C), we refer to Ck as a low-rank approximation.
- The SVD can be used to solve the low-rank matrix approximation problem.
We then derive from it an application to approximating term–document matrices. We invoke the following three-step procedure to this end:
- given C, construct its SVD: C = UΣVᵀ
- derive from Σ the matrix Σk formed by replacing with zeros the r − k smallest singular values on the diagonal of Σ
- compute and output Ck = UΣkVᵀ as the rank-k approximation to C
The rank of Ck is at most k: Σk has at most k nonzero diagonal entries.
By a theorem of Eckart and Young, this procedure yields the matrix of rank at most k with the lowest possible Frobenius error.
- the form of Ck: Ck = ∑ᵢ₌₁ᵏ σi ui viᵀ, where ui and vi are the ith columns of U and V, respectively. Each ui viᵀ is a rank-1 matrix, so we have just expressed Ck as the sum of k rank-1 matrices, each weighted by a singular value.
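A hedged sketch of the three-step procedure, reusing the made-up matrix C from the SVD sketch above (low_rank_approx is a hypothetical helper name, not from the book):

```python
import numpy as np

def low_rank_approx(C, k):
    """Best rank-k approximation to C in Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)  # step 1: construct the SVD
    s_k = s.copy()
    s_k[k:] = 0.0                     # step 2: zero the r-k smallest singular values
    return U @ np.diag(s_k) @ Vt      # step 3: C_k = U Σ_k V^T

C = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0],
              [1, 0, 0],
              [0, 1, 1]], dtype=float)

C2 = low_rank_approx(C, k=2)
print(np.linalg.matrix_rank(C2))      # 2: the rank of C_k is at most k

# The Frobenius error equals the root-sum-square of the discarded
# singular values:
_, s, _ = np.linalg.svd(C, full_matrices=False)
print(np.linalg.norm(C - C2, 'fro'), np.sqrt(np.sum(s[2:] ** 2)))
```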
Latent semantic indexing
The low-rank approximation to C yields a new representation for each document in the collection. We cast queries into this low-rank representation as well, enabling us to compute query–document similarity scores in it. This process is known as latent semantic indexing (LSI):
- use the SVD to construct a low-rank approximation Ck to the term-document matrix
- map each row/column to a k-dimensional space
- use the new k-dimensional LSI representation to compute similarities between vectors
- a query vector q is mapped into its representation in the LSI space by qk = Σk⁻¹ Ukᵀ q (see the sketch below)
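A minimal retrieval sketch following these steps (the term-document matrix and the query are made up; documents are represented by the rows of Vk, consistent with the fold-in formula above):

```python
import numpy as np

C = np.array([[1, 0, 1],              # made-up terms x documents matrix
              [0, 1, 0],
              [1, 1, 0],
              [1, 0, 0],
              [0, 1, 1]], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(C, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

docs_k = Vt_k.T                              # row j: document j in the LSI space

q = np.array([1, 0, 1, 0, 0], dtype=float)  # query over the 5 terms
q_k = (U_k.T @ q) / s_k                      # fold-in: q_k = Σ_k^{-1} U_k^T q

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine(q_k, d) for d in docs_k]
print(scores)                                # rank documents by these similarities
```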
Note:
- The computational cost of the SVD is significant. One approach to this obstacle is to build the LSI representation on a randomly sampled subset of the documents in the collection, following which the remaining documents are folded in.
- A value of k in the low hundreds can actually increase precision on some query benchmarks. This suggests that, for a suitable value of k, LSI addresses some of the challenges of synonymy.
- LSI works best in applications where there is little overlap between queries and documents.
- soft clustering
LSI can be viewed as soft clustering by interpreting each dimension of the reduced space as a cluster and the value that a document has on that dimension as its fractional membership in that cluster.
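One way to make this reading concrete, as a made-up illustration rather than anything the book prescribes: normalize each document's (absolute) coordinates on the k dimensions into fractional memberships:

```python
import numpy as np

C = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0],
              [1, 0, 0],
              [0, 1, 1]], dtype=float)

k = 2
_, _, Vt = np.linalg.svd(C, full_matrices=False)

weights = np.abs(Vt[:k, :].T)          # doc coordinates on the k "cluster" dimensions
membership = weights / weights.sum(axis=1, keepdims=True)
print(membership)                      # row j: doc j's fractional memberships
```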