Text Similarity Computation

阿新 • Published: 2017-09-30


Text similarity computation has three stages:
1. Literal (character-level) matching
2. Lexical (word-level) matching
3. Semantic matching

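I. The JaccardSimilarity Method
Segment each text into words and assign every distinct word a unique ID (token), so that texts can be compared as sets of tokens. The Jaccard similarity of two texts is then:
(size of the intersection of the two token sets) / (size of the union of the two token sets)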
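The ratio above can be sketched in a few lines of Python. This is a minimal illustration, not code from the original post: the function name `jaccard_similarity` is hypothetical, whitespace splitting stands in for a real word segmenter, and returning 0.0 for two empty texts is an assumed convention.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Return |A ∩ B| / |A ∪ B| for two token sequences."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        # Assumed convention: two empty texts have similarity 0.0.
        return 0.0
    return len(set_a & set_b) / len(union)


# Whitespace splitting is a stand-in for proper word segmentation.
doc_a = "the intersection graph of paths in trees".split()
doc_b = "graph minors a survey".split()

# The two texts share one token ("graph") out of ten distinct tokens.
print(jaccard_similarity(doc_a, doc_b))
```

Because the inputs are reduced to sets, word order and repeated words have no effect on the score.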
II. Text Vectorization
text -> vector -> a point in vector space -> distance between two points (that is, two texts) -> document similarity
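2.1 Simple Vectorization
Assign every word a unique ID. If there are N distinct words in total, each ID serves as an index into an array of length N. For a given document, the position of each word that occurs in it is set to 1, and all other positions stay 0.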
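As a rough sketch of this scheme, assuming documents are already tokenized, the following builds the word-to-index mapping and the 0/1 vectors. The helper names `build_vocabulary` and `binary_vector` are hypothetical and chosen only for illustration.

```python
def build_vocabulary(documents):
    """Map each distinct word to a unique index, in order of first appearance."""
    vocab = {}
    for tokens in documents:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab


def binary_vector(tokens, vocab):
    """Return a list of length N with 1 at the index of every word present."""
    vec = [0] * len(vocab)
    for tok in tokens:
        idx = vocab.get(tok)
        if idx is not None:  # words outside the vocabulary are ignored
            vec[idx] = 1
    return vec


docs = [["graph", "minors", "survey"], ["graph", "trees"]]
vocab = build_vocabulary(docs)
vectors = [binary_vector(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

Note that this representation records only presence or absence, so a word that occurs ten times looks the same as one that occurs once.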
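2.2 TF-IDF Vectorization
With TF-IDF vectorization, each word is mapped to a real-valued weight rather than 0 or 1, so the vector already carries information about the text. The TF-IDF weight records how important a word is within its document, which is how information from the text is preserved. Combining token IDs with TF-IDF therefore turns a text into a vector that still retains information from the text itself; each such vector is a weighted bag-of-words representation of its document.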
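To make the weighting concrete, here is a minimal sketch of one common TF-IDF variant: raw term count times log2(N / document frequency), followed by L2 normalization. The exact formula is an assumption made for illustration, since the text does not specify one, and the function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {word: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each word occur?
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))

    vectors = []
    for tokens in documents:
        tf = Counter(tokens)  # raw term counts within this document
        weights = {t: c * math.log2(n_docs / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            # Scale to unit length so long and short documents are comparable.
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors


docs = [["human", "interface", "human"], ["user", "interface"], ["graph", "trees"]]
for vec in tfidf_vectors(docs):
    print(vec)
```

In the first document, "human" receives a much larger weight than "interface": it occurs twice there and in no other document, whereas "interface" is shared with a second document.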

Practice: computing document similarity with gensim's corpora, models, and similarities modules:

Training corpus: LDA_test.txt
Human machine interface for lab abc computer applications Human Human
A survey of user opinion of computer system response time
The EPS user interface management with system
System and human system engineering testing of EPS
Relation of user perceived response time to error measurement
The generation of random binary unordered trees
The intersection graph of paths in trees
Graph minors IV Widths of trees and well quasi ordering
Graph minors A survey


from gensim import corpora, models, similarities


if __name__ == '__main__':
    stop_list = set('for a of the and to in'.split())

    # Read one document per line, lowercase it, and drop stop words.
    # Blank lines are skipped so they do not become empty documents.
    with open('./LDA_test.txt', 'r') as f:
        texts = [[word for word in line.strip().lower().split() if word not in stop_list]
                 for line in f if line.strip()]

    # Build the dictionary: it maps every unique word in the corpus to an integer ID.
    dictionary = corpora.Dictionary(texts)
    # Convert each document to bag-of-words form: a list of (word ID, count) pairs.
    corpus = [dictionary.doc2bow(text) for text in texts]

    # Fit the IDF statistics on the corpus, then re-weight every document with TF-IDF.
    tfidf_model = models.TfidfModel(corpus)
    tfidf = tfidf_model[corpus]

    # The query goes through the same pipeline. doc2bow ignores words that are
    # not in the dictionary, so the capitalized 'System' is dropped here.
    query = 'human system with System engineering testing'
    query_bow = dictionary.doc2bow(query.split())
    query_tfidf = tfidf_model[query_bow]

    # Index the TF-IDF corpus for similarity queries; the index is stored in
    # files whose names start with 'Similarity-index'.
    similarity = similarities.Similarity('Similarity-index', tfidf, num_features=len(dictionary))
    similarity.num_best = 3  # return only the three most similar documents

    print('query =', query)
    for doc_id, score in similarity[query_tfidf]:
        # Print each of the top-3 documents together with its similarity score.
        print(' '.join(texts[doc_id]), score)


Output:
document                                                                score
system human system engineering testing eps                             0.775833368301
eps user interface management with system                               0.349639117718
human machine interface lab abc computer applications human human       0.240938946605


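III. Vector Space Model (VSM)
Once a text has been vectorized, it is a point in the vector space. The similarity between two texts is then computed from the distance between their two points, using one of the following measures:
3.1 Euclidean distance
3.2 Cosine similarity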
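Both measures can be sketched with the standard library alone. The function names below are hypothetical, the vectors are assumed to be dense lists of equal length, and treating an all-zero vector as having cosine similarity 0.0 is an assumed convention.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two points; 0 means identical vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumed convention: an all-zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# v is u scaled by 2, e.g. the same text repeated twice.
u = [1.0, 2.0, 0.0]
v = [2.0, 4.0, 0.0]
print(euclidean_distance(u, v))
print(cosine_similarity(u, v))
```

The example shows why the two measures can disagree: the Euclidean distance between `u` and `v` is nonzero because their lengths differ, while their cosine similarity is (up to rounding) 1 because they point in the same direction.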
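IV. LDA Topic Model
The methods above build text vectors mechanically from word occurrences and weights, without using any contextual relationship between words. They therefore stay at the level of surface statistics and do not capture higher-level meaning.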
4.2 LDA
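P(W(word) | D(document)) = sum over all topics T of P(W(word) | T(topic)) * P(T(topic) | D(document))
(1) P(W(word) | D(document)) can be estimated directly by counting.
(2) P(W(word) | T(topic)) is part of the model and has to be learned.
(3) P(T(topic) | D(document)) is the final classification result.
The key to the model is therefore to learn how each word is distributed over the topics. When a new document arrives, the model infers that document's probability distribution over the topics.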
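The decomposition can be checked numerically, reading the formula as a sum over all topics. The sketch below does not train an LDA model; it only evaluates the formula for made-up probability tables, and the topic names, word lists, and numbers are all hypothetical.

```python
def word_given_document(word_given_topic, topic_given_document):
    """Compute P(w | d) = sum over t of P(w | t) * P(t | d) for every word w."""
    vocab = next(iter(word_given_topic.values())).keys()
    return {
        w: sum(word_given_topic[t][w] * p_t for t, p_t in topic_given_document.items())
        for w in vocab
    }


# Hypothetical P(word | topic): each row sums to 1.
word_given_topic = {
    "hci":    {"human": 0.5, "system": 0.4, "graph": 0.1},
    "graphs": {"human": 0.1, "system": 0.1, "graph": 0.8},
}
# Hypothetical P(topic | document) for a single document.
topic_given_document = {"hci": 0.75, "graphs": 0.25}

p_word = word_given_document(word_given_topic, topic_given_document)
print(p_word)
```

Since both inputs are proper probability distributions, the resulting P(word | document) values also sum to 1, and the document's dominant topic ("hci" here) pulls its word probabilities toward that topic's vocabulary.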
