Comparing document similarity with docsim, doc2vec, and LSH
In text processing we often need to decide whether two documents are similar, or, given an input document, find the most similar documents in a collection.
Fortunately, gensim provides tools for exactly this. The general approach is: for Chinese text, first run word segmentation, build a dictionary from the segmented tokens, convert each document into a bag-of-words vector against that dictionary, and then build a similarity index over those vectors. The gensim documentation describes the index as follows:
The main class is Similarity, which builds an index for a given set of documents. Once the index is built, you can perform efficient queries like “Tell me how similar is this query document to each document in the index?”. The result is a vector of numbers as large as the size of the initial set of documents, that is, one float for each index document. Alternatively, you can also request only the top-N most similar index documents to the query.
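Before the full example, it may help to see what the dictionary-and-doc2bow step actually produces. A minimal sketch with two toy token lists (the ids shown in the comments are illustrative; gensim assigns them internally):

from gensim import corpora

texts = [['毒品', '販賣', '毒品'], ['結婚', '登記']]
dictionary = corpora.Dictionary(texts)   # maps each token to an integer id
print(dictionary.token2id)               # e.g. {'毒品': 0, '販賣': 1, '結婚': 2, '登記': 3}
print(dictionary.doc2bow(texts[0]))      # sparse bag-of-words, e.g. [(0, 2), (1, 1)]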
Method 1: using docsim (recommended; its results are the most stable)
Example code (the training samples are numbered so the results below are easier to read):

from gensim import corpora
from gensim.similarities import Similarity

# training samples
raw_documents = [
    '0無償居間介紹買賣毒品的行為應如何定性',
    '1吸毒男動態持有大量毒品的行為該如何認定',
    '2如何區分是非法種植毒品原植物罪還是非法制造毒品罪',
    '3為毒販販賣毒品提供幫助構成販賣毒品罪',
    '4將自己吸食的毒品原價轉讓給朋友吸食的行為該如何認定',
    '5為獲報酬幫人購買毒品的行為該如何認定',
    '6毒販出獄後再次夠買毒品途中被抓的行為認定',
    '7虛誇毒品功效勸人吸食毒品的行為該如何認定',
    '8妻子下落不明丈夫又與他人登記結婚是否為無效婚姻',
    '9一方未簽字辦理的結婚登記是否有效',
    '10夫妻雙方1990年按農村習俗舉辦婚禮沒有結婚證 一方可否起訴離婚',
    '11結婚前對方父母出資購買的住房寫我們二人的名字有效嗎',
    '12身份證被別人冒用無法登記結婚怎麼辦?',
    '13同居後又與他人登記結婚是否構成重婚罪',
    '14未辦登記只舉辦結婚儀式可起訴離婚嗎',
    '15同居多年未辦理結婚登記,是否可以向法院起訴要求離婚'
]
corpora_documents = []
for item_text in raw_documents:
    item_str = util_words_cut.get_class_words_list(item_text)
    corpora_documents.append(item_str)

# build the dictionary and the bag-of-words corpus
dictionary = corpora.Dictionary(corpora_documents)
corpus = [dictionary.doc2bow(text) for text in corpora_documents]
similarity = Similarity('-Similarity-index', corpus, num_features=400)

test_data_1 = '你好,我想問一下我想離婚他不想離,孩子他說不要,是六個月就自動生效離婚'
test_cut_raw_1 = util_words_cut.get_class_words_list(test_data_1)
test_corpus_1 = dictionary.doc2bow(test_cut_raw_1)
similarity.num_best = 5
print(similarity[test_corpus_1])  # the most similar samples, as (index_of_document, similarity) tuples
print('################################')
test_data_2 = '家人因涉嫌運輸毒品被抓,她只是去朋友家探望朋友的,結果就被抓了,還在朋友家收出毒品,可家人的身上和行李中都沒有。現在已經拘留10多天了,請問會被判刑嗎'
test_cut_raw_2 = util_words_cut.get_class_words_list(test_data_2)
test_corpus_2 = dictionary.doc2bow(test_cut_raw_2)
similarity.num_best = 5
print(similarity[test_corpus_2])  # the most similar samples, as (index_of_document, similarity) tuples
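Note that util_words_cut is the author's own segmentation helper and its source is not shown. A rough jieba-based stand-in might look like the following (the filtering rule here is a guess, not the author's actual code):

import jieba

def get_class_words_list(text):
    # segment with jieba and keep only tokens longer than one character
    # (a guessed approximation of the author's helper)
    return [w for w in jieba.lcut(text) if len(w) > 1]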
The run output is as follows:

/usr/bin/python3.4 /data/work/python-workspace/test_doc_similarity.py
Building prefix dict from the default dictionary ...
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model from cache /tmp/jieba.cache
Loading model cost 0.521 seconds.
Loading model cost 0.521 seconds.
Prefix dict has been built succesfully.
Prefix dict has been built succesfully.
adding document #0 to Dictionary(0 unique tokens: [])
built Dictionary(61 unique tokens: ['丈夫', '法院', '結婚', '住房', '出資']...) from 16 documents (total 89 corpus positions)
starting similarity index under -Similarity-index
[(14, 0.40824830532073975), (15, 0.40824830532073975), (10, 0.35355338454246521)]
################################
creating sparse index
creating sparse matrix from corpus
PROGRESS: at document #0/16
created <16x400 sparse matrix of type '<class 'numpy.float32'>' with 86 stored elements in Compressed Sparse Row format>
creating sparse shard #0
saving index shard to -Similarity-index.0
saving SparseMatrixSimilarity object under -Similarity-index.0, separately None
loading SparseMatrixSimilarity object from -Similarity-index.0
[(6, 0.50395262241363525), (2, 0.47140452265739441), (4, 0.33333337306976318), (1, 0.29814240336418152), (5, 0.29814240336418152)]
Process finished with exit code 0
For the first test query, documents 14, 15, and 10 are the most similar; the second element of each tuple is the similarity score.
For the second test query, documents 6, 2, 4, 1, and 5 are the most similar, again with their similarity scores.
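The index above is built directly on raw bag-of-words counts. A common refinement (my suggestion, not part of the original post) is to apply gensim's TfidfModel before indexing, so that frequent tokens are down-weighted. A minimal sketch reusing dictionary, corpus, and test_corpus_1 from above:

from gensim.models import TfidfModel
from gensim.similarities import Similarity

tfidf = TfidfModel(corpus)                      # learn IDF weights from the BoW corpus
corpus_tfidf = [tfidf[bow] for bow in corpus]   # re-weight every training document
sim_tfidf = Similarity('-Similarity-tfidf-index', corpus_tfidf,
                       num_features=len(dictionary))
sim_tfidf.num_best = 5
print(sim_tfidf[tfidf[test_corpus_1]])          # the query must go through the same transform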
Method 2: using doc2vec
I read gensim's official documentation for doc2vec and found it poorly written. Testing with the same data as above, the code and results are as follows:
# similarity with doc2vec
import multiprocessing
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

cores = multiprocessing.cpu_count()
print(cores)

corpora_documents = []
for i, item_text in enumerate(raw_documents):
    words_list = util_words_cut.get_class_words_list(item_text)
    # tag each training document with its index so results can be traced back
    document = TaggedDocument(words=words_list, tags=[i])
    corpora_documents.append(document)
print(corpora_documents[:2])

model = Doc2Vec(size=89, min_count=1, iter=10)  # gensim >= 4.0 renames these to vector_size= and epochs=
model.build_vocab(corpora_documents)
model.train(corpora_documents)  # newer gensim also requires total_examples= and epochs= here
print('#########', model.vector_size)

test_data_1 = '你好,我想問一下我想離婚他不想離,孩子他說不要,是六個月就自動生效離婚'
test_cut_raw_1 = util_words_cut.get_class_words_list(test_data_1)
print(test_cut_raw_1)
inferred_vector = model.infer_vector(test_cut_raw_1)
print(inferred_vector)
sims = model.docvecs.most_similar([inferred_vector], topn=3)  # gensim >= 4.0: model.dv
print(sims)
The console output is as follows:

Pattern library is not installed, lemmatization won't be available.
'pattern' package not found; tag filters are not available for English
Building prefix dict from the default dictionary ...
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model from cache /tmp/jieba.cache
4
Loading model cost 0.513 seconds.
Loading model cost 0.513 seconds.
Prefix dict has been built succesfully.
Prefix dict has been built succesfully.
consider setting layer size to a multiple of 4 for greater performance
collecting all words and their counts
PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
collected 61 word types and 16 unique tags from a corpus of 16 examples and 89 words
min_count=1 retains 61 unique words (drops 0)
min_count leaves 89 word corpus (100% of original 89)
deleting the raw counts dictionary of 61 items
sample=0 downsamples 0 most-common words
downsampling leaves estimated 89 word corpus (100.0% of prior 89)
estimated required memory for 61 words and 89 dimensions: 91828 bytes
constructing a huffman tree from 61 words
built huffman tree with maximum node depth 7
resetting layer weights
training model with 1 workers on 61 vocabulary and 89 features, using sg=0 hs=1 sample=0 negative=0
expecting 16 sentences, matching count from corpus used for vocabulary survey
[TaggedDocument(words=['無償', '居間', '介紹', '買賣', '毒品', '定性'], tags=[0]), TaggedDocument(words=['吸毒', '動態', '持有', '毒品', '認定'], tags=[1])]
worker thread finished; awaiting finish of 0 more threads
training on 890 raw words (1050 effective words) took 0.0s, 506992 effective words/s
under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
######### 89
['離婚', '孩子', '自動', '生效', '離婚']
[ 2.54629389e-03 1.87756249e-03 -9.76708368e-04 -5.15014399e-03
-7.54948880e-04 -3.74549557e-03 5.37392031e-03 3.35739669e-03
-3.50345811e-03 2.63415743e-03 -1.32059853e-03 -4.15759953e-03
-2.39425618e-03 -6.20105816e-03 -1.42006821e-03 -4.64246795e-03
3.78829846e-03 1.47493952e-03 4.49652784e-03 -5.57655795e-03
-1.40081509e-04 -7.10823014e-03 -5.34327468e-04 -4.21888893e-03
-2.96280603e-03 6.52066898e-04 5.98943839e-03 -4.01164964e-03
2.49637989e-03 -9.08742077e-04 4.65002051e-03 9.24886088e-04
1.67128560e-03 -1.93383044e-03 -4.58135502e-03 1.78024184e-03
-9.60796722e-04 7.26479106e-04 4.50814469e-03 2.58095766e-04
-4.53767460e-03 -1.72883295e-03 -3.89566552e-03 4.85864235e-03
5.90517826e-04 4.30173194e-03 3.37816169e-03 -1.08716707e-03
1.85196218e-03 1.94042712e-03 1.20989932e-03 -4.69703926e-03
-5.35873650e-03 -1.35291950e-03 -4.62053996e-03 2.15436472e-03
4.05823253e-03 8.01778078e-05 -3.84314684e-03 1.11574796e-03
-4.36050585e-03 -3.31182266e-03 -2.15692003e-03 -2.09038518e-03
4.50274721e-03 -1.85286190e-04 -5.09306230e-03 -1.12043330e-04
8.25022871e-04 2.60405545e-03 -1.73542544e-03 5.14509249e-03
-9.16058663e-04 1.01291772e-03 -7.90049613e-04 4.20650374e-03
-3.00139328e-03 3.34924040e-03 -2.11520446e-03 4.79168072e-03
2.11459701e-03 -3.07943812e-03 -5.09956060e-03 -2.34926818e-03
7.30032055e-03 -5.31428820e-03 -2.96888268e-03 4.95154131e-03
3.09590902e-03]
[(15, 0.2670447528362274), (14, 0.18831682205200195), (10, 0.07022987306118011)]
precomputing L2-norms of doc weight vectors
The doc2vec results are not very stable. Perhaps I am not using it correctly, but the official documentation did not offer much useful guidance either.
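Part of that instability is expected: infer_vector runs a small stochastic optimization, so repeated calls return different vectors, especially with a corpus this tiny. One common workaround (my suggestion, not from the quoted docs) is to raise the number of inference steps and average several runs. A sketch against the model above (steps= became epochs= in gensim 4.0):

import numpy as np

# average several stochastic inferences into one more stable query vector
runs = [model.infer_vector(test_cut_raw_1, steps=50) for _ in range(10)]
stable_vector = np.mean(runs, axis=0)
print(model.docvecs.most_similar([stable_vector], topn=3))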
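The relevant documentation: https://radimrehurek.com/gensim/models/doc2vec.html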
Method 3: using LSH (for how locality-sensitive hashing works, a web search will turn up plenty of explanations)
scikit-learn provides an LSH implementation (there are other implementations on GitHub as well); what scikit-learn offers is an LSH forest:
LSH Forest: Locality Sensitive Hashing forest [1] is an alternative method for vanilla approximate nearest neighbor search methods. LSH forest data structure has been implemented using sorted arrays and binary search and 32 bit fixed-length hashes. Random projection is used as the hash family which approximates cosine distance.
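To make the quoted "random projection" hash family concrete, here is a toy signed-random-projection hash, written from the definition rather than taken from scikit-learn's internals:

import numpy as np

dim = 100                          # input vector dimensionality (illustrative)
rng = np.random.RandomState(42)
planes = rng.randn(32, dim)        # 32 random hyperplanes -> a 32-bit hash

def srp_hash(v):
    # each bit records which side of a hyperplane v falls on;
    # vectors with small cosine distance agree on most bits
    return (planes.dot(v) > 0).astype(int)

print(srp_hash(rng.randn(dim)))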
Using the same test data again, the code is as follows:
# similarity with LSH
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import LSHForest

tfidf_vectorizer = TfidfVectorizer(min_df=3, max_features=None, ngram_range=(1, 2),
                                   use_idf=1, smooth_idf=1, sublinear_tf=1)
train_documents = []
for item_text in raw_documents:
    # this helper returns the segmented text as a space-separated string
    item_str = util_words_cut.get_class_words_with_space(item_text)
    train_documents.append(item_str)
x_train = tfidf_vectorizer.fit_transform(train_documents)

test_data_1 = '你好,我想問一下我想離婚他不想離,孩子他說不要,是六個月就自動生效離婚'
test_cut_raw_1 = util_words_cut.get_class_words_with_space(test_data_1)
x_test = tfidf_vectorizer.transform([test_cut_raw_1])

lshf = LSHForest(random_state=42)
lshf.fit(x_train.toarray())
distances, indices = lshf.kneighbors(x_test.toarray(), n_neighbors=3)
print(distances)
print(indices)
The console output is below; the nearest neighbors essentially match the docsim results:
[[ 0.42264973 0.42264973 0.48875208]]
[[10 15 14]]
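One caveat: LSHForest has since been deprecated and removed from scikit-learn (it is gone as of version 0.21). On current versions, an exact cosine k-NN query gives comparable results at this corpus size. A sketch reusing x_train and x_test from above:

from sklearn.neighbors import NearestNeighbors

# brute-force cosine k-NN as a drop-in replacement for the removed LSHForest
nn = NearestNeighbors(n_neighbors=3, metric='cosine')
nn.fit(x_train)
distances, indices = nn.kneighbors(x_test)
print(distances)   # cosine distances, comparable to the LSHForest output above
print(indices)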
These are the implementations I have found for comparing text similarity. In general, LSH is best suited to comparing short texts.
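For that short-text case, one of the GitHub implementations mentioned above is the datasketch package, which provides MinHash LSH over token sets. A minimal sketch, assuming datasketch is installed and reusing the segmented token lists from the docsim example (corpora_documents before it was overwritten in the doc2vec section, and test_cut_raw_1):

from datasketch import MinHash, MinHashLSH

def to_minhash(tokens, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for t in tokens:
        m.update(t.encode('utf8'))
    return m

# index each training document under its list index
lsh = MinHashLSH(threshold=0.3, num_perm=128)
for i, tokens in enumerate(corpora_documents):
    lsh.insert(str(i), to_minhash(tokens))

print(lsh.query(to_minhash(test_cut_raw_1)))  # keys of candidate similar documents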