Computing text similarity with Lucene

By 阿新 • Published: 2019-01-18

Leveraging term vectors
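A term vector is a multi-dimensional term-frequency vector built for one text field of a document, such as title or body. Each distinct term is one dimension, and the value along that dimension is the term's frequency in the field.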
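To make the idea concrete, here is a minimal plain-Java sketch that builds such a vector from a field's text. It does not use Lucene; the class name `TermVectorSketch` and the whitespace tokenization are assumptions made only for illustration (a real analyzer would tokenize differently).

```java
import java.util.Map;
import java.util.TreeMap;

class TermVectorSketch {
    // Count how often each lower-cased, whitespace-separated token occurs.
    // Each key is one dimension; each value is the term's frequency.
    static Map<String, Integer> termFrequencies(String fieldText) {
        Map<String, Integer> vector = new TreeMap<>();
        for (String token : fieldText.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                vector.merge(token, 1, Integer::sum);
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        // prints {action=1, in=1, lucene=2, search=1}
        System.out.println(termFrequencies("Lucene in Action Lucene search"));
    }
}
```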

To use term vectors, you must enable the term vector option for the field at indexing time:

        Field options for term vectors
        TermVector.YES – record the unique terms that occurred, and their counts, in each document, but do not store any positions or offsets information.
        TermVector.WITH_POSITIONS – record the unique terms and their counts, and also the positions of each occurrence of every term, but no offsets.
        TermVector.WITH_OFFSETS – record the unique terms and their counts, with the offsets (start & end character position) of each occurrence of every term, but no positions.
        TermVector.WITH_POSITIONS_OFFSETS – store unique terms and their counts, along with positions and offsets.
        TermVector.NO – do not store any term vector information.
        If Index.NO is specified for a field, then you must also specify TermVector.NO.

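After indexing, given a document id and a field name, we can read the term vector from the IndexReader (provided term vectors were enabled for that field at indexing time):
        TermFreqVector termFreqVector = reader.getTermFreqVector(id, "subject");
You can iterate over this TermFreqVector to get each term and its frequency. If you chose to store offsets and positions at indexing time, they are available here as well.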

With term vectors available, we can build some interesting applications:

1) Books like this

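The goal is to decide whether two books are similar. Model each book as a document with author and subject fields, and compare the two books through these two fields.
The author field is multi-valued, meaning a book can have several authors, so the first step is to check whether any authors match: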
String[] authors = doc.getValues("author");
BooleanQuery authorQuery = new BooleanQuery();
for (int i = 0; i < authors.length; i++) {
    String author = authors[i];
    authorQuery.add(new TermQuery(new Term("author", author)), BooleanClause.Occur.SHOULD);
}
authorQuery.setBoost(2.0f);

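The last line raises the boost of the author query, marking this condition as important and giving it more weight: if the authors match, the books are very likely similar.
The second step uses the term vector. The usage here is simple: we only check whether the terms in the subject field's term vector match: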

TermFreqVector vector = reader.getTermFreqVector(id, "subject");
BooleanQuery subjectQuery = new BooleanQuery();
for (int j = 0; j < vector.size(); j++) {
    TermQuery tq = new TermQuery(new Term("subject", vector.getTerms()[j]));
    subjectQuery.add(tq, BooleanClause.Occur.SHOULD);
}

2) What category?

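This example is a step up from the previous one: it classifies documents. Again we work with the term vector of each document's subject field.
For two documents, we can compare the angle between their term vectors in the vector space; the smaller the angle, the more similar the two documents.
Since this is classification, there is a training step: we must build a term vector for each category to serve as the reference that other documents are compared against.
Here a map of (term, frequency) pairs implements the term vector, with one entry per dimension. We generate one term vector per category, and a second map links each category to its term vector. The term vector for a category is built as follows: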

Iterate over every document in the category, take the document's term vector, and add it to the category's term vector.

private void addTermFreqToMap(Map vectorMap, TermFreqVector termFreqVector) {
    String[] terms = termFreqVector.getTerms();
    int[] freqs = termFreqVector.getTermFrequencies();
    for (int i = 0; i < terms.length; i++) {
        String term = terms[i];
        if (vectorMap.containsKey(term)) {
            Integer value = (Integer) vectorMap.get(term);
            vectorMap.put(term, new Integer(value.intValue() + freqs[i]));
        } else {
            vectorMap.put(term, new Integer(freqs[i]));
        }
    }
}

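First take the term list and the frequency list from the document's term vector. Then, for each term, look it up in the category's term vector and add the document's term frequency to the stored value (or insert it if the term is new).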
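The same accumulation can be written more compactly with generics and `Map.merge`. The sketch below is a plain-Java variant under stated assumptions: it takes the parallel term and frequency arrays directly instead of a `TermFreqVector`, and the names `CategoryVectorSketch`, `addDocument`, and `vectorFor` are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class CategoryVectorSketch {
    // category name -> (term -> accumulated frequency over all training documents)
    private final Map<String, Map<String, Integer>> categoryMap = new HashMap<>();

    // Add one document's parallel term/frequency arrays to its category's vector.
    void addDocument(String category, String[] terms, int[] freqs) {
        Map<String, Integer> vector =
                categoryMap.computeIfAbsent(category, k -> new HashMap<>());
        for (int i = 0; i < terms.length; i++) {
            // Insert the frequency, or add it to the value already stored.
            vector.merge(terms[i], freqs[i], Integer::sum);
        }
    }

    // Return the accumulated vector, or an empty map for an unknown category.
    Map<String, Integer> vectorFor(String category) {
        return categoryMap.getOrDefault(category, Collections.emptyMap());
    }
}
```

Feeding every training document of a category through `addDocument` yields the reference vector that new documents are later compared against.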

With a term vector for each category, we can now compute the angle between a document's vector and a category's vector.

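cos θ = (A · B) / (|A| |B|)
A · B is the dot product: multiply the two vectors dimension by dimension and sum the results. |A| and |B| are the lengths of the two vectors.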
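As an illustration of the formula with real frequencies (not the 0/1 simplification), here is a minimal plain-Java sketch over two term-frequency maps. The class name `CosineSketch` is hypothetical, and returning 0 when either vector is empty is a convention chosen for this sketch.

```java
import java.util.Map;

class CosineSketch {
    // cos(theta) = (A . B) / (|A| |B|) for two sparse term-frequency vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        long dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                // Only terms present in both vectors contribute to the dot product.
                dot += (long) e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0 || normB == 0) {
            return 0.0; // assumption: an empty vector is similar to nothing
        }
        return dot / (normA * normB);
    }

    // Euclidean length: square root of the sum of squared frequencies.
    static double norm(Map<String, Integer> v) {
        long sum = 0;
        for (int f : v.values()) {
            sum += (long) f * f;
        }
        return Math.sqrt(sum);
    }
}
```

A cosine of 1 means the vectors point in the same direction (angle 0), and 0 means they share no terms (angle 90°); `Math.acos` converts the cosine to an angle when one is needed.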

To keep the calculation simple, assume that each term's frequency in the document is either 0 or 1, meaning the term is absent or present.

private double computeAngle(String[] words, String category) {
    // assume words are unique and only occur once
    Map vectorMap = (Map) categoryMap.get(category);
    int dotProduct = 0;
    int sumOfSquares = 0;
    for (int i = 0; i < words.length; i++) {
        String word = words[i];
        int categoryWordFreq = 0;
        if (vectorMap.containsKey(word)) {
            categoryWordFreq = ((Integer) vectorMap.get(word)).intValue();
        }
        dotProduct += categoryWordFreq; // optimized because we assume frequency in words is 1
        sumOfSquares += categoryWordFreq * categoryWordFreq;
    }
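    double denominator;
    if (sumOfSquares == words.length) {
        // avoid precision issues for special case
        denominator = sumOfSquares; // sqrt x * sqrt x = x
    } else {
        denominator = Math.sqrt(sumOfSquares) * Math.sqrt(words.length);
    }
    if (denominator == 0) {
        // none of the words occur in the category (or words is empty):
        // treat the vectors as orthogonal instead of returning NaN
        return Math.PI / 2;
    }
    double ratio = dotProduct / denominator;
    return Math.acos(ratio);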
}

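This function simply implements the formula above and is fairly straightforward. Note that the category vector's length is computed only over the words that appear in the document.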
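Classification then amounts to computing the angle against every category and picking the smallest. The following self-contained sketch shows one way to do that; `CategoryClassifierSketch`, `train`, and `bestCategory` are hypothetical names, and the zero-denominator guard and the clamp to 1.0 are additions made for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

class CategoryClassifierSketch {
    // category name -> (term -> accumulated frequency)
    private final Map<String, Map<String, Integer>> categoryMap = new HashMap<>();

    void train(String category, Map<String, Integer> vector) {
        categoryMap.put(category, vector);
    }

    // Angle between a binary document vector (each word counts once) and the
    // category vector restricted to the document's words.
    double computeAngle(String[] words, String category) {
        Map<String, Integer> vector = categoryMap.get(category);
        long dotProduct = 0;
        long sumOfSquares = 0;
        for (String word : words) {
            int freq = vector.getOrDefault(word, 0);
            dotProduct += freq; // document frequency is assumed to be 1
            sumOfSquares += (long) freq * freq;
        }
        double denominator = Math.sqrt(sumOfSquares) * Math.sqrt(words.length);
        if (denominator == 0) {
            return Math.PI / 2; // no shared words: treat as orthogonal
        }
        // Clamp to guard against rounding slightly above 1.
        return Math.acos(Math.min(1.0, dotProduct / denominator));
    }

    // Return the category with the smallest angle, or null if none is trained.
    String bestCategory(String[] words) {
        String best = null;
        double bestAngle = Double.MAX_VALUE;
        for (String category : categoryMap.keySet()) {
            double angle = computeAngle(words, category);
            if (angle < bestAngle) {
                bestAngle = angle;
                best = category;
            }
        }
        return best;
    }
}
```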

3) MoreLikeThis

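For finding similar documents, Lucene also provides a more efficient API: MoreLikeThis.

With the approach above, we could compute the cosine for every pair of documents, sort by cosine, and pick the most similar ones. The main problem is the computational cost, which becomes nearly unacceptable when the number of documents is large. There are specialized techniques that optimize the cosine method and greatly reduce the cost; that approach is accurate, but it has a higher barrier to entry.

The principle behind MoreLikeThis is simple. For a given document, extract only its interesting terms (terms with a high tf×idf), then use Lucene to search for documents that contain the same terms and treat those as the similar documents. The advantage is efficiency; the drawback is lower accuracy. The API exposes many parameters that you can configure to control how interesting terms are selected.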
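The term-selection idea can be sketched in plain Java as follows. This is not the MoreLikeThis implementation; it is a simplified illustration in which the class name `InterestingTermsSketch`, the method signature, and the smoothed idf formula `1 + ln(numDocs / (docFreq + 1))` are all assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class InterestingTermsSketch {
    // Rank a document's terms by tf * idf and keep the top maxTerms.
    // termFreqs: term -> frequency in this document
    // docFreqs:  term -> number of documents in the corpus containing the term
    static List<String> interestingTerms(Map<String, Integer> termFreqs,
                                         Map<String, Integer> docFreqs,
                                         int numDocs, int maxTerms) {
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            int df = docFreqs.getOrDefault(e.getKey(), 0);
            // Rare terms get a large idf, very common terms an idf close to 1.
            double idf = 1.0 + Math.log((double) numDocs / (df + 1));
            scores.put(e.getKey(), e.getValue() * idf);
        }
        List<String> terms = new ArrayList<>(scores.keySet());
        // Sort by descending score.
        terms.sort((x, y) -> Double.compare(scores.get(y), scores.get(x)));
        return new ArrayList<>(terms.subList(0, Math.min(maxTerms, terms.size())));
    }
}
```

The selected terms would then be combined into an ordinary query, so the search cost depends on the number of terms kept, not on the number of document pairs.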

MoreLikeThis mlt = new MoreLikeThis(ir);
Reader target = ... // orig source of doc you want to find similarities to
Query query = mlt.like(target);
Hits hits = is.search(query);

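The usage is simple, and this is all it takes to get the similar documents.

The API is also flexible. Instead of calling like directly, you can call

retrieveInterestingTerms(Reader r)

to obtain the interesting terms yourself and then process them however you need.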
