Comparing document similarity with docsim, doc2vec, and LSH
In text processing we often need to decide whether two documents are similar, or, given an input document, find the most similar documents in a collection.
Fortunately, gensim provides tools for exactly this. The general approach is: for Chinese text, first run word segmentation, build a dictionary from the segmented tokens, convert each document into a bag-of-words vector against that dictionary, and then build a similarity index over those vectors. The gensim documentation describes the index as follows:
The main class is Similarity, which builds an index for a given set of documents. Once the index is built, you can perform efficient queries like “Tell me how similar is this query document to each document in the index?”. The result is a vector of numbers as large as the size of the initial set of documents, that is, one float for each index document. Alternatively, you can also request only the top-N most similar index documents to the query.
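Before the full example, it may help to see what the dictionary-and-doc2bow step actually produces. A minimal sketch with two toy token lists (the ids shown in the comments are illustrative; gensim assigns them internally):

from gensim import corpora

texts = [['毒品', '販賣', '毒品'], ['結婚', '登記']]
dictionary = corpora.Dictionary(texts)   # maps each token to an integer id
print(dictionary.token2id)               # e.g. {'毒品': 0, '販賣': 1, '結婚': 2, '登記': 3}
print(dictionary.doc2bow(texts[0]))      # sparse bag-of-words, e.g. [(0, 2), (1, 1)]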
Method 1: using docsim (recommended; its results are the most stable)
Example code (the training samples are numbered so the results below are easier to read):

from gensim import corpora
from gensim.similarities import Similarity

# training samples
raw_documents = [
    '0無償居間介紹買賣毒品的行為應如何定性',
    '1吸毒男動態持有大量毒品的行為該如何認定',
    '2如何區分是非法種植毒品原植物罪還是非法制造毒品罪',
    '3為毒販販賣毒品提供幫助構成販賣毒品罪',
    '4將自己吸食的毒品原價轉讓給朋友吸食的行為該如何認定',
    '5為獲報酬幫人購買毒品的行為該如何認定',
    '6毒販出獄後再次夠買毒品途中被抓的行為認定',
    '7虛誇毒品功效勸人吸食毒品的行為該如何認定',
    '8妻子下落不明丈夫又與他人登記結婚是否為無效婚姻',
    '9一方未簽字辦理的結婚登記是否有效',
    '10夫妻雙方1990年按農村習俗舉辦婚禮沒有結婚證 一方可否起訴離婚',
    '11結婚前對方父母出資購買的住房寫我們二人的名字有效嗎',
    '12身份證被別人冒用無法登記結婚怎麼辦?',
    '13同居後又與他人登記結婚是否構成重婚罪',
    '14未辦登記只舉辦結婚儀式可起訴離婚嗎',
    '15同居多年未辦理結婚登記,是否可以向法院起訴要求離婚'
]
corpora_documents = []
for item_text in raw_documents:
    item_str = util_words_cut.get_class_words_list(item_text)
    corpora_documents.append(item_str)

# build the dictionary and the bag-of-words corpus
dictionary = corpora.Dictionary(corpora_documents)
corpus = [dictionary.doc2bow(text) for text in corpora_documents]
similarity = Similarity('-Similarity-index', corpus, num_features=400)

test_data_1 = '你好,我想問一下我想離婚他不想離,孩子他說不要,是六個月就自動生效離婚'
test_cut_raw_1 = util_words_cut.get_class_words_list(test_data_1)
test_corpus_1 = dictionary.doc2bow(test_cut_raw_1)
similarity.num_best = 5
print(similarity[test_corpus_1])  # the most similar samples, as (index_of_document, similarity) tuples
print('################################')
test_data_2 = '家人因涉嫌運輸毒品被抓,她只是去朋友家探望朋友的,結果就被抓了,還在朋友家收出毒品,可家人的身上和行李中都沒有。現在已經拘留10多天了,請問會被判刑嗎'
test_cut_raw_2 = util_words_cut.get_class_words_list(test_data_2)
test_corpus_2 = dictionary.doc2bow(test_cut_raw_2)
similarity.num_best = 5
print(similarity[test_corpus_2])  # the most similar samples, as (index_of_document, similarity) tuples
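Note that util_words_cut is the author's own segmentation helper and its source is not shown. A rough jieba-based stand-in might look like the following (the filtering rule here is a guess, not the author's actual code):

import jieba

def get_class_words_list(text):
    # segment with jieba and keep only tokens longer than one character
    # (a guessed approximation of the author's helper)
    return [w for w in jieba.lcut(text) if len(w) > 1]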
The run output is as follows:

/usr/bin/python3.4 /data/work/python-workspace/test_doc_similarity.py
Building prefix dict from the default dictionary ...
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model from cache /tmp/jieba.cache
Loading model cost 0.521 seconds.
Loading model cost 0.521 seconds.
Prefix dict has been built succesfully.
Prefix dict has been built succesfully.
adding document #0 to Dictionary(0 unique tokens: [])
built Dictionary(61 unique tokens: ['丈夫', '法院', '結婚', '住房', '出資']...) from 16 documents (total 89 corpus positions)
starting similarity index under -Similarity-index
[(14, 0.40824830532073975), (15, 0.40824830532073975), (10, 0.35355338454246521)]
################################
creating sparse index
creating sparse matrix from corpus
PROGRESS: at document #0/16
created <16x400 sparse matrix of type '<class 'numpy.float32'>' with 86 stored elements in Compressed Sparse Row format>
creating sparse shard #0
saving index shard to -Similarity-index.0
saving SparseMatrixSimilarity object under -Similarity-index.0, separately None
loading SparseMatrixSimilarity object from -Similarity-index.0
[(6, 0.50395262241363525), (2, 0.47140452265739441), (4, 0.33333337306976318), (1, 0.29814240336418152), (5, 0.29814240336418152)]
Process finished with exit code 0
For the first test query, documents 14, 15, and 10 are the most similar; the second element of each tuple is the similarity score.
For the second test query, documents 6, 2, 4, 1, and 5 are the most similar, again with their similarity scores.
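The index above is built directly on raw bag-of-words counts. A common refinement (my suggestion, not part of the original post) is to apply gensim's TfidfModel before indexing, so that frequent tokens are down-weighted. A minimal sketch reusing dictionary, corpus, and test_corpus_1 from above:

from gensim.models import TfidfModel
from gensim.similarities import Similarity

tfidf = TfidfModel(corpus)                      # learn IDF weights from the BoW corpus
corpus_tfidf = [tfidf[bow] for bow in corpus]   # re-weight every training document
sim_tfidf = Similarity('-Similarity-tfidf-index', corpus_tfidf,
                       num_features=len(dictionary))
sim_tfidf.num_best = 5
print(sim_tfidf[tfidf[test_corpus_1]])          # the query must go through the same transform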
Method 2: using doc2vec
I read gensim's official documentation for doc2vec and found it poorly written. Testing with the same data as above, the code and results are as follows:
# similarity with doc2vec
import multiprocessing
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

cores = multiprocessing.cpu_count()
print(cores)

corpora_documents = []
for i, item_text in enumerate(raw_documents):
    words_list = util_words_cut.get_class_words_list(item_text)
    # tag each training document with its index so results can be traced back
    document = TaggedDocument(words=words_list, tags=[i])
    corpora_documents.append(document)
print(corpora_documents[:2])

model = Doc2Vec(size=89, min_count=1, iter=10)  # gensim >= 4.0 renames these to vector_size= and epochs=
model.build_vocab(corpora_documents)
model.train(corpora_documents)  # newer gensim also requires total_examples= and epochs= here
print('#########', model.vector_size)

test_data_1 = '你好,我想問一下我想離婚他不想離,孩子他說不要,是六個月就自動生效離婚'
test_cut_raw_1 = util_words_cut.get_class_words_list(test_data_1)
print(test_cut_raw_1)
inferred_vector = model.infer_vector(test_cut_raw_1)
print(inferred_vector)
sims = model.docvecs.most_similar([inferred_vector], topn=3)  # gensim >= 4.0: model.dv
print(sims)
The console output is as follows:

Pattern library is not installed, lemmatization won't be available.
'pattern' package not found; tag filters are not available for English
Building prefix dict from the default dictionary ...
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model from cache /tmp/jieba.cache
4
Loading model cost 0.513 seconds.
Loading model cost 0.513 seconds.
Prefix dict has been built succesfully.
Prefix dict has been built succesfully.
consider setting layer size to a multiple of 4 for greater performance
collecting all words and their counts
PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
collected 61 word types and 16 unique tags from a corpus of 16 examples and 89 words
min_count=1 retains 61 unique words (drops 0)
min_count leaves 89 word corpus (100% of original 89)
deleting the raw counts dictionary of 61 items
sample=0 downsamples 0 most-common words
downsampling leaves estimated 89 word corpus (100.0% of prior 89)
estimated required memory for 61 words and 89 dimensions: 91828 bytes
constructing a huffman tree from 61 words
built huffman tree with maximum node depth 7
resetting layer weights
training model with 1 workers on 61 vocabulary and 89 features, using sg=0 hs=1 sample=0 negative=0
expecting 16 sentences, matching count from corpus used for vocabulary survey
[TaggedDocument(words=['無償', '居間', '介紹', '買賣', '毒品', '定性'], tags=[0]), TaggedDocument(words=['吸毒', '動態', '持有', '毒品', '認定'], tags=[1])]
worker thread finished; awaiting finish of 0 more threads
training on 890 raw words (1050 effective words) took 0.0s, 506992 effective words/s
under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
######### 89
['離婚', '孩子', '自動', '生效', '離婚']
[ 2.54629389e-03 1.87756249e-03 -9.76708368e-04 -5.15014399e-03
-7.54948880e-04 -3.74549557e-03 5.37392031e-03 3.35739669e-03
-3.50345811e-03 2.63415743e-03 -1.32059853e-03 -4.15759953e-03
-2.39425618e-03 -6.20105816e-03 -1.42006821e-03 -4.64246795e-03
3.78829846e-03 1.47493952e-03 4.49652784e-03 -5.57655795e-03
-1.40081509e-04 -7.10823014e-03 -5.34327468e-04 -4.21888893e-03
-2.96280603e-03 6.52066898e-04 5.98943839e-03 -4.01164964e-03
2.49637989e-03 -9.08742077e-04 4.65002051e-03 9.24886088e-04
1.67128560e-03 -1.93383044e-03 -4.58135502e-03 1.78024184e-03
-9.60796722e-04 7.26479106e-04 4.50814469e-03 2.58095766e-04
-4.53767460e-03 -1.72883295e-03 -3.89566552e-03 4.85864235e-03
5.90517826e-04 4.30173194e-03 3.37816169e-03 -1.08716707e-03
1.85196218e-03 1.94042712e-03 1.20989932e-03 -4.69703926e-03
-5.35873650e-03 -1.35291950e-03 -4.62053996e-03 2.15436472e-03
4.05823253e-03 8.01778078e-05 -3.84314684e-03 1.11574796e-03
-4.36050585e-03 -3.31182266e-03 -2.15692003e-03 -2.09038518e-03
4.50274721e-03 -1.85286190e-04 -5.09306230e-03 -1.12043330e-04
8.25022871e-04 2.60405545e-03 -1.73542544e-03 5.14509249e-03
-9.16058663e-04 1.01291772e-03 -7.90049613e-04 4.20650374e-03
-3.00139328e-03 3.34924040e-03 -2.11520446e-03 4.79168072e-03
2.11459701e-03 -3.07943812e-03 -5.09956060e-03 -2.34926818e-03
7.30032055e-03 -5.31428820e-03 -2.96888268e-03 4.95154131e-03
3.09590902e-03]
[(15, 0.2670447528362274), (14, 0.18831682205200195), (10, 0.07022987306118011)]
precomputing L2-norms of doc weight vectors
The doc2vec results are not very stable. Perhaps I am not using it correctly, but the official documentation did not offer much useful guidance either.
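Part of that instability is expected: infer_vector runs a small stochastic optimization, so repeated calls return different vectors, especially with a corpus this tiny. One common workaround (my suggestion, not from the quoted docs) is to raise the number of inference steps and average several runs. A sketch against the model above (steps= became epochs= in gensim 4.0):

import numpy as np

# average several stochastic inferences into one more stable query vector
runs = [model.infer_vector(test_cut_raw_1, steps=50) for _ in range(10)]
stable_vector = np.mean(runs, axis=0)
print(model.docvecs.most_similar([stable_vector], topn=3))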
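The relevant documentation: https://radimrehurek.com/gensim/models/doc2vec.html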
Method 3: using LSH (for how locality-sensitive hashing works, a web search will turn up plenty of explanations)
scikit-learn provides an LSH implementation (there are other implementations on GitHub as well); what scikit-learn offers is an LSH forest:
LSH Forest: Locality Sensitive Hashing forest [1] is an alternative method for vanilla approximate nearest neighbor search methods. LSH forest data structure has been implemented using sorted arrays and binary search and 32 bit fixed-length hashes. Random projection is used as the hash family which approximates cosine distance.
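To make the quoted "random projection" hash family concrete, here is a toy signed-random-projection hash, written from the definition rather than taken from scikit-learn's internals:

import numpy as np

dim = 100                          # input vector dimensionality (illustrative)
rng = np.random.RandomState(42)
planes = rng.randn(32, dim)        # 32 random hyperplanes -> a 32-bit hash

def srp_hash(v):
    # each bit records which side of a hyperplane v falls on;
    # vectors with small cosine distance agree on most bits
    return (planes.dot(v) > 0).astype(int)

print(srp_hash(rng.randn(dim)))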
Using the same test data again, the code is as follows:
# similarity with LSH
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import LSHForest

tfidf_vectorizer = TfidfVectorizer(min_df=3, max_features=None, ngram_range=(1, 2),
                                   use_idf=1, smooth_idf=1, sublinear_tf=1)
train_documents = []
for item_text in raw_documents:
    # this helper returns the segmented text as a space-separated string
    item_str = util_words_cut.get_class_words_with_space(item_text)
    train_documents.append(item_str)
x_train = tfidf_vectorizer.fit_transform(train_documents)

test_data_1 = '你好,我想問一下我想離婚他不想離,孩子他說不要,是六個月就自動生效離婚'
test_cut_raw_1 = util_words_cut.get_class_words_with_space(test_data_1)
x_test = tfidf_vectorizer.transform([test_cut_raw_1])

lshf = LSHForest(random_state=42)
lshf.fit(x_train.toarray())
distances, indices = lshf.kneighbors(x_test.toarray(), n_neighbors=3)
print(distances)
print(indices)
The console output is below; the nearest neighbors essentially match the docsim results:
[[ 0.42264973 0.42264973 0.48875208]]
[[10 15 14]]
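One caveat: LSHForest has since been deprecated and removed from scikit-learn (it is gone as of version 0.21). On current versions, an exact cosine k-NN query gives comparable results at this corpus size. A sketch reusing x_train and x_test from above:

from sklearn.neighbors import NearestNeighbors

# brute-force cosine k-NN as a drop-in replacement for the removed LSHForest
nn = NearestNeighbors(n_neighbors=3, metric='cosine')
nn.fit(x_train)
distances, indices = nn.kneighbors(x_test)
print(distances)   # cosine distances, comparable to the LSHForest output above
print(indices)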
These are the implementations I have found for comparing text similarity. In general, LSH is best suited to comparing short texts.
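For that short-text case, one of the GitHub implementations mentioned above is the datasketch package, which provides MinHash LSH over token sets. A minimal sketch, assuming datasketch is installed and reusing the segmented token lists from the docsim example (corpora_documents before it was overwritten in the doc2vec section, and test_cut_raw_1):

from datasketch import MinHash, MinHashLSH

def to_minhash(tokens, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for t in tokens:
        m.update(t.encode('utf8'))
    return m

# index each training document under its list index
lsh = MinHashLSH(threshold=0.3, num_perm=128)
for i, tokens in enumerate(corpora_documents):
    lsh.insert(str(i), to_minhash(tokens))

print(lsh.query(to_minhash(test_cut_raw_1)))  # keys of candidate similar documents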