
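Text Similarity and Classification

Text Similarity

  • Measures how similar two texts are
  • Uses word frequencies as the text's features
  • A word's frequency is the number of times it occurs in the text
  • NLTK provides word-frequency counting through FreqDist

Text similarity example: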

import nltk
from nltk import FreqDist

# word_tokenize needs NLTK's tokenizer data, for example nltk.download('punkt')

text1 = 'I like the movie so much '
text2 = 'That is a good movie '
text3 = 'This is a great one '
text4 = 'That is a really bad movie '
text5 = 'This is a terrible movie'

text = text1 + text2 + text3 + text4 + text5
words = nltk.word_tokenize(text)
freq_dist = FreqDist(words)
print(freq_dist['is'])
# Output:
# 4

# Take the n = 5 most common words
n = 5

# Build the "common word list"
most_common_words = freq_dist.most_common(n)
print(most_common_words)
# Output (the order of tied words can vary with the Python version):
# [('movie', 4), ('is', 4), ('a', 4), ('That', 2), ('This', 2)]


def lookup_pos(most_common_words):
    """Map each common word to its position in the list."""
    result = {}
    for pos, (word, _count) in enumerate(most_common_words):
        result[word] = pos
    return result


# Record the positions
std_pos_dict = lookup_pos(most_common_words)
print(std_pos_dict)
# Output:
# {'movie': 0, 'is': 1, 'a': 2, 'That': 3, 'This': 4}

# New text
new_text = 'That one is a good movie. This is so good!'

# Initialize the vector
freq_vec = [0] * n

# Tokenize
new_words = nltk.word_tokenize(new_text)

# Count word frequencies over the "common word list"
for new_word in new_words:
    if new_word in std_pos_dict:
        freq_vec[std_pos_dict[new_word]] += 1

print(freq_vec)
# Output:
# [1, 2, 1, 1, 1]
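The example stops at a frequency vector and does not compute a similarity score. One common way to compare two such vectors is cosine similarity; the source does not name a metric, so this is an illustrative choice. The sketch below is plain Python: the helpers freq_vector and cosine_similarity are hypothetical names, and whitespace splitting stands in for nltk.word_tokenize.

```python
import math
from collections import Counter


def freq_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in tokens."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Vocabulary taken from the example's common-word list
vocabulary = ['movie', 'is', 'a', 'That', 'This']

# Assumption: pre-spaced punctuation plus str.split() replaces nltk.word_tokenize
vec_new = freq_vector('That one is a good movie . This is so good !'.split(), vocabulary)
vec_old = freq_vector('That is a good movie'.split(), vocabulary)

print(vec_new)  # [1, 2, 1, 1, 1]
print(vec_old)  # [1, 1, 1, 1, 0]
print(cosine_similarity(vec_new, vec_old))  # roughly 0.88
```

A score near 1 means the two texts use the common words in similar proportions; a score of 0 means they share none of them.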

 

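Text Classification

TF-IDF (term frequency-inverse document frequency)

  • TF (Term Frequency): how often a word occurs in a given document; in the example below the count is divided by the document's total number of words

  • IDF (Inverse Document Frequency): measures how important a word is across the whole collection; the fewer documents contain the word, the higher its IDF

  • TF-IDF = TF * IDF

  • Worked example:

A document of 100 words contains the word cat 3 times, so TF = 3/100 = 0.03

The collection holds 10,000,000 documents, of which 1,000 contain cat, so IDF = log10(10,000,000/1,000) = 4 (base-10 logarithm)

TF-IDF = TF * IDF = 0.03 * 4 = 0.12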
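The arithmetic of the cat example can be checked with a few lines of Python. This is a minimal sketch: the function names tf and idf are hypothetical, and the base-10 logarithm is an assumption inferred from the stated result of 4.

```python
import math


def tf(term_count, total_words):
    """Term frequency: occurrences of the term divided by the document length."""
    return term_count / total_words


def idf(total_docs, docs_with_term):
    """Inverse document frequency, using a base-10 logarithm as in the worked example."""
    return math.log10(total_docs / docs_with_term)


tf_cat = tf(3, 100)                # 3 occurrences in a 100-word document
idf_cat = idf(10_000_000, 1_000)   # 1,000 of 10,000,000 documents contain "cat"
tf_idf_cat = tf_cat * idf_cat

print(tf_cat, idf_cat, tf_idf_cat)  # approximately 0.03, 4.0, 0.12
```

The base of the logarithm only rescales every score by the same constant factor, so it changes the values but not the ranking of words.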

  • Computing TF-IDF with NLTK

TextCollection.tf_idf()

Example:

from nltk.text import TextCollection

text1 = 'I like the movie so much '
text2 = 'That is a good movie '
text3 = 'This is a great one '
text4 = 'That is a really bad movie '
text5 = 'This is a terrible movie'

# Build the TextCollection object
tc = TextCollection([text1, text2, text3,
                     text4, text5])
new_text = 'That one is a good movie. This is so good!'
word = 'That'
tf_idf_val = tc.tf_idf(word, new_text)
print('TF-IDF value of {}: {}'.format(word, tf_idf_val))

Output:

TF-IDF value of That: 0.02181644599700369
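The printed value equals ln(5/2) / 42, where 42 is the length of new_text in characters. That is consistent with TF being computed as text.count(term) / len(text) on a raw string and IDF as the natural logarithm of the number of texts over the number of texts containing the term. The sketch below reproduces this in plain Python, without NLTK, and contrasts it with the result on token lists. The helpers tf_idf and tokenize are hypothetical, and the regex tokenizer is only a stand-in for nltk.word_tokenize.

```python
import math
import re

docs = [
    'I like the movie so much ',
    'That is a good movie ',
    'This is a great one ',
    'That is a really bad movie ',
    'This is a terrible movie',
]
new_text = 'That one is a good movie. This is so good!'
word = 'That'


def tf_idf(term, text, collection):
    """tf = count / length, idf = ln(N / matching texts); accepts strings or token lists."""
    tf = text.count(term) / len(text)
    matches = sum(1 for doc in collection if term in doc)
    idf = math.log(len(collection) / matches) if matches else 0.0
    return tf * idf


def tokenize(text):
    """Simple regex tokenizer: runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


# Raw strings: substring count divided by the character length (42)
char_based = tf_idf(word, new_text, docs)

# Token lists: token count divided by the number of tokens (12)
token_based = tf_idf(word, tokenize(new_text), [tokenize(d) for d in docs])

print(char_based)   # about 0.0218, the value shown in the output above
print(token_based)  # about 0.0764
```

If word-level frequencies are what you want, tokenizing each text before building the collection is the safer choice, since a substring count would also match 'That' inside longer words.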