Introduction to Information Retrieval, Chapter 18 notes (in English)
Matrix decompositions and latent semantic indexing
term-document matrix: an M × N matrix C, each of whose rows represents a term and each of whose columns represents a document in the collection.
- develop a class of operations from linear algebra, known as matrix decomposition
- use a special form of matrix decomposition to construct a low-rank approximation to the term-document matrix
- examine the application of such low-rank approximations to indexing and retrieving documents, a technique referred to as latent semantic indexing
Linear algebra review
- eigenvalues of C
For a square M × M matrix C and a vector x that is not all zeros, the values of λ satisfying Cx = λx are called the eigenvalues of C (and such an x is a corresponding right eigenvector).
The eigenvector corresponding to the eigenvalue of largest magnitude is called the principal eigenvector.
In a similar fashion, the left eigenvectors of C are the M-vectors y such that yᵀC = λyᵀ.
The number of nonzero eigenvalues of C is at most rank(C).
Note:
- the effect of small eigenvalues (and their eigenvectors) on a matrix–vector product is small
- For a symmetric matrix S, the eigenvectors corresponding to distinct eigenvalues are orthogonal. Further, if S is both real and symmetric, the eigenvalues are all real.
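Both notes can be checked numerically. A minimal numpy sketch (the symmetric matrix S and the vector x below are made-up examples, not from the book):

```python
import numpy as np

# Made-up real symmetric matrix: one eigenvalue (0.01) is much smaller
# than the others (1 and 3).
S = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.0, 0.0, 0.01]])

vals, vecs = np.linalg.eigh(S)                # eigh is for symmetric matrices
print(vals)                                   # all real: [0.01, 1., 3.]
print(np.allclose(vecs.T @ vecs, np.eye(3)))  # True: orthonormal eigenvectors

# Since Sx = sum_i lambda_i (v_i . x) v_i, dropping the term with the
# tiny eigenvalue changes the matrix-vector product only slightly:
x = np.ones(3)
keep = vals > 0.1                             # keep only the two large eigenvalues
approx = (vecs[:, keep] * vals[keep]) @ (vecs[:, keep].T @ x)
print(np.linalg.norm(S @ x - approx))         # 0.01 -- a small perturbation
```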
Matrix decompositions
a square matrix can be factored into the product of matrices derived from its eigenvectors
- Two theorems
- Let S be a square real-valued M × M matrix with M linearly independent eigenvectors. Then there exists an eigen decomposition S = UΛU⁻¹, where the columns of U are the eigenvectors of S and Λ is a diagonal matrix whose diagonal entries are the eigenvalues of S in decreasing order. If the eigenvalues are distinct, then this decomposition is unique.
- Let S be a square, symmetric real-valued M × M matrix with M linearly independent eigenvectors. Then there exists a symmetric diagonal decomposition S = QΛQᵀ, where the columns of Q are the orthogonal, normalized (unit-length, real) eigenvectors of S, and Λ is the diagonal matrix whose entries are the eigenvalues of S. Further, Q⁻¹ = Qᵀ.
We build on this symmetric diagonal decomposition to construct low-rank approximations to term–document matrices.
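A short numpy sketch of the symmetric diagonal decomposition (the 2 × 2 matrix S is a made-up example):

```python
import numpy as np

S = np.array([[2.0, 1.0],
              [1.0, 2.0]])                 # real and symmetric

vals, Q = np.linalg.eigh(S)                # columns of Q: orthonormal eigenvectors
Lam = np.diag(vals)                        # Λ: diagonal matrix of eigenvalues

print(np.allclose(Q @ Lam @ Q.T, S))       # True: S = Q Λ Q^T
print(np.allclose(Q.T, np.linalg.inv(Q)))  # True: Q^{-1} = Q^T (Q is orthogonal)
```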
Term–document matrices and singular value decompositions
The term-document matrix C is M × N, so in general it is not even square, and a square term-document matrix is still very unlikely to be symmetric; we therefore need a decomposition that applies to an arbitrary matrix.
- Theorem (SVD). Let r be the rank of the M × N matrix C. Then there is a singular value decomposition C = UΣVᵀ, where the columns of the M × M matrix U are the orthogonal eigenvectors of CCᵀ, the columns of the N × N matrix V are the orthogonal eigenvectors of CᵀC, and Σ is an M × N matrix with Σii = σi = √λi for 1 ≤ i ≤ r and zeros elsewhere; the λi are the shared nonzero eigenvalues of CCᵀ and CᵀC, and the σi (in decreasing order) are the singular values of C.
- Illustration of the SVD
there are two cases for the shapes of the factor matrices:
- M > N (more terms than documents)
- M < N (more documents than terms)
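A minimal numpy sketch of the SVD, using a made-up 5 × 3 term-document incidence matrix (the M > N case):

```python
import numpy as np

# Made-up 5-term x 3-document incidence matrix (M = 5 > N = 3).
C = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0],
              [1, 0, 0],
              [0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(C, full_matrices=False)  # reduced SVD
print(U.shape, s.shape, Vt.shape)                 # (5, 3) (3,) (3, 3)

# Columns of U are eigenvectors of C C^T, columns of V (rows of Vt) are
# eigenvectors of C^T C, and the singular values in s are the square
# roots of their shared nonzero eigenvalues.
print(np.allclose(U @ np.diag(s) @ Vt, C))        # True: C = U Σ V^T
```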
Low-rank approximations
- Frobenius norm
Given an M × N matrix C and a positive integer k, we wish to find an M × N matrix Ck of rank at most k, so as to minimize the Frobenius norm of the matrix difference X = C − Ck, defined to be ‖X‖F = √(∑ᵢ₌₁ᴹ ∑ⱼ₌₁ᴺ Xij²).
the Frobenius norm of X measures the discrepancy between Ck and C; our goal is to find a matrix Ck that minimizes this discrepancy
When k is far smaller than r (the rank of C), we refer to Ck as a low-rank approximation.
- The SVD can be used to solve the low-rank matrix approximation problem.
We then derive from it an application to approximating term–document matrices. We invoke the following three-step procedure to this end:
- given C, construct its SVD: C = UΣVᵀ
- derive from Σ the matrix Σk formed by replacing with zeros the r − k smallest singular values on the diagonal of Σ
- compute and output Ck = UΣkVᵀ as the rank-k approximation to C
The rank of Ck is at most k: Σk has at most k nonzero diagonal entries.
By a theorem of Eckart and Young, this procedure yields the matrix of rank at most k with the lowest possible Frobenius error.
- the form of Ck: Ck = ∑ᵢ₌₁ᵏ σi ui viᵀ, where ui and vi are the ith columns of U and V, respectively. Each ui viᵀ is a rank-1 matrix, so we have just expressed Ck as the sum of k rank-1 matrices, each weighted by a singular value.
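A hedged sketch of the three-step procedure, reusing the made-up matrix C from the SVD sketch above (low_rank_approx is a hypothetical helper name, not from the book):

```python
import numpy as np

def low_rank_approx(C, k):
    """Best rank-k approximation to C in Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)  # step 1: construct the SVD
    s_k = s.copy()
    s_k[k:] = 0.0                     # step 2: zero the r-k smallest singular values
    return U @ np.diag(s_k) @ Vt      # step 3: C_k = U Σ_k V^T

C = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0],
              [1, 0, 0],
              [0, 1, 1]], dtype=float)

C2 = low_rank_approx(C, k=2)
print(np.linalg.matrix_rank(C2))      # 2: the rank of C_k is at most k

# The Frobenius error equals the root-sum-square of the discarded
# singular values:
_, s, _ = np.linalg.svd(C, full_matrices=False)
print(np.linalg.norm(C - C2, 'fro'), np.sqrt(np.sum(s[2:] ** 2)))
```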
Latent semantic indexing
The low-rank approximation to C yields a new representation for each document in the collection. We cast queries into this low-rank representation as well, enabling us to compute query–document similarity scores in it. This process is known as latent semantic indexing (LSI):
- use the SVD to construct a low-rank approximation Ck to the term-document matrix
- map each row/column to a k-dimensional space
- use the new k-dimensional LSI representation to compute similarities between vectors
- a query vector q is mapped into its representation in the LSI space by qk = Σk⁻¹ Ukᵀ q (see the sketch below)
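A minimal retrieval sketch following these steps (the term-document matrix and the query are made up; documents are represented by the rows of Vk, consistent with the fold-in formula above):

```python
import numpy as np

C = np.array([[1, 0, 1],              # made-up terms x documents matrix
              [0, 1, 0],
              [1, 1, 0],
              [1, 0, 0],
              [0, 1, 1]], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(C, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

docs_k = Vt_k.T                              # row j: document j in the LSI space

q = np.array([1, 0, 1, 0, 0], dtype=float)  # query over the 5 terms
q_k = (U_k.T @ q) / s_k                      # fold-in: q_k = Σ_k^{-1} U_k^T q

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine(q_k, d) for d in docs_k]
print(scores)                                # rank documents by these similarities
```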
Note:
- The computational cost of the SVD is significant. One approach to this obstacle is to build the LSI representation on a randomly sampled subset of the documents in the collection, following which the remaining documents are folded in.
- A value of k in the low hundreds can actually increase precision on some query benchmarks. This suggests that, for a suitable value of k, LSI addresses some of the challenges of synonymy.
- LSI works best in applications where there is little overlap between queries and documents.
- soft clustering
LSI can be viewed as soft clustering by interpreting each dimension of the reduced space as a cluster and the value that a document has on that dimension as its fractional membership in that cluster.
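One way to make this reading concrete, as a made-up illustration rather than anything the book prescribes: normalize each document's (absolute) coordinates on the k dimensions into fractional memberships:

```python
import numpy as np

C = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0],
              [1, 0, 0],
              [0, 1, 1]], dtype=float)

k = 2
_, _, Vt = np.linalg.svd(C, full_matrices=False)

weights = np.abs(Vt[:k, :].T)          # doc coordinates on the k "cluster" dimensions
membership = weights / weights.sum(axis=1, keepdims=True)
print(membership)                      # row j: doc j's fractional memberships
```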