
TF-IDF Notes (Hand-Written and Library Call)

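First, TF-IDF stands for term frequency–inverse document frequency. It is a weighting technique commonly used in information retrieval and data mining.

TF is Term Frequency, and IDF is Inverse Document Frequency.

(The definitions above are what a Baidu search gives you.)

My own understanding is that it is used for picking features: it tells you which words are worth using as features.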

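Term frequency (TF): the number of times a word appears in a document divided by the number of words in that document. (occurrences of the word in the document / total words in the document)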
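
As a quick illustration of that definition, here is a minimal sketch. The function name `term_frequency` and the sample sentence are made up for the example.

```python
from collections import Counter

def term_frequency(document):
    """Return {word: count / total words} for one space-separated document."""
    words = document.split()
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

tf = term_frequency("the cat sat on the mat")
print(tf["the"])  # 2 occurrences out of 6 words
```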

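Inverse document frequency (IDF): the base-10 logarithm of the total number of documents divided by the number of documents that contain the word. A bit of a mouthful, haha. lg(total documents / documents containing the word)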
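
A small sketch of the same idea in code, assuming each document is a space-separated string. The function name `inverse_document_frequency` and the toy corpus are invented for illustration.

```python
import math

def inverse_document_frequency(documents):
    """Return {word: log10(N / df)}, where df is the number of documents containing the word."""
    n_docs = len(documents)
    doc_freq = {}
    for document in documents:
        # set() so a word is counted at most once per document
        for word in set(document.split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log10(n_docs / df) for word, df in doc_freq.items()}

corpus = ["apple banana", "apple cherry", "apple banana cherry date"]
idf = inverse_document_frequency(corpus)
print(idf["apple"])  # in every document, so log10(3/3) = 0.0
print(idf["date"])   # in only one document, so log10(3/1)
```

A word that shows up in every document gets an IDF of zero, which is exactly why it is useless as a feature.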

TF-IDF = TF*IDF
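
Written out in symbols, it looks roughly like this. The notation is my own choice, not from any particular source: $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the total number of documents, and $n_t$ is the number of documents containing $t$.

```latex
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log_{10}\frac{N}{n_t}, \qquad
\text{tf-idf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
```

For example, a word that appears 2 times in a 10-word document and occurs in 1 of 100 documents has TF = 2/10 = 0.2, IDF = lg(100/1) = 2, and TF-IDF = 0.2 * 2 = 0.4.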

First, the library-call version:

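from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# CountVectorizer turns the words in the texts into a term-count matrix.
# max_features=1200 keeps only the 1200 most frequent terms;
# min_df=12 drops terms that appear in fewer than 12 documents.
vectorizer = CountVectorizer(max_features=1200, min_df=12)

# TfidfTransformer computes the TF-IDF value of every term counted by the vectorizer
tf_idf_transformer = TfidfTransformer()

# vectorizer.fit_transform() counts how many times each word occurs
# tf_idf_transformer.fit_transform() turns that count matrix into TF-IDF values
# train_features is the training DataFrame; .values.astype('U') casts the column to Unicode strings
tf_idf = tf_idf_transformer.fit_transform(vectorizer.fit_transform(train_features['features'].values.astype('U')))

x_train_weight = tf_idf.toarray()  # TF-IDF weight matrix for the training set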
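
One thing worth knowing: the numbers from the library will not match the plain lg(total documents / documents containing the word) formula. As far as I can tell from the scikit-learn documentation, `TfidfTransformer` with its default settings uses the raw count as TF, a smoothed natural-log IDF of `ln((1 + N) / (1 + df)) + 1`, and then L2-normalizes each row. The sketch below re-implements those defaults in plain Python on a toy count matrix, just to show the arithmetic. The function name `sklearn_style_tfidf` and the counts are made up, and this is an approximation of the library's behavior, not its actual code.

```python
import math

def sklearn_style_tfidf(counts):
    """Mimic TfidfTransformer defaults (smooth_idf=True, norm='l2') on a list-of-lists count matrix."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of rows with a non-zero count in each column
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed natural-log IDF; the +1 keeps terms found in every document from being zeroed out
    idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]
    result = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        # L2-normalize the row (leave all-zero rows as zeros)
        result.append([v / norm if norm > 0 else 0.0 for v in weighted])
    return result

counts = [[2, 1, 0],
          [1, 0, 1]]
weights = sklearn_style_tfidf(counts)
```

Each row of `weights` has unit length, so values are comparable across documents of different sizes, which the hand-written formula does not give you.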

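Then my hand-written version:
The argument format is [word1 word2 word3, word1 word2 word3, word1 word2 word3],
that is, a list of strings (one per document) with the words separated by spaces.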
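
For instance, if the documents are already tokenized into lists of words, they can be joined into that format like this (the sample tokens are invented):

```python
tokenized_docs = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
]
# One string per document, words separated by a single space
list_words = [" ".join(tokens) for tokens in tokenized_docs]
print(list_words)  # ['apple banana apple', 'banana cherry']
```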
import math
import numpy as np

def get_tf_idf(list_words):
    print("-" * 5 + " building TF-IDF weight matrix " + "-" * 5)

    # Build the vocabulary (sorted so the column order is reproducible)
    wordSet = sorted(set(" ".join(list_words).split()))

    # Count how often each vocabulary word occurs in one document
    def count_(words):
        wordDict = dict.fromkeys(wordSet, 0)
        for i in words:
            wordDict[i] += 1
        return wordDict

    # Compute TF: occurrences of a word / total words in the document
    def computeTF(words):
        cnt_dic = count_(words)
        tfDict = {}
        nbowCount = len(words)

        for word, count in cnt_dic.items():
            tfDict[word] = count / nbowCount

        return tfDict

    # Compute IDF: log10(total documents / documents containing the word)
    def get_idf():
        filecont = dict.fromkeys(wordSet, 0)
        for i in wordSet:
            for j in list_words:
                if i in j.split():
                    filecont[i] += 1
        idfDict = dict.fromkeys(wordSet, 0)
        le = len(list_words)
        for word, cont in filecont.items():
            idfDict[word] = math.log10(le / cont)
        return idfDict

    # Compute TF * IDF for every word in every document
    def compute_tf_idf():
        idf_dic = get_idf()
        ret = []
        for words in list_words:
            tf_dic = computeTF(words.split())
            tf_idf_dic = {}
            temp = []
            for word, tf in tf_dic.items():
                tf_idf_dic[word] = tf * idf_dic[word]

            # One column per vocabulary word, in vocabulary order
            for word in wordSet:
                temp.append(tf_idf_dic.get(word, 0))
            ret.append(temp)
        return ret

    return np.array(compute_tf_idf())
Building the TF-IDF matrix:
word_tf_idf = get_tf_idf(features)  # features: the list of space-separated strings described above

It is painfully slow, hahaha.
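
Much of the slowness probably comes from re-splitting every document for every vocabulary word, and from building a full-vocabulary dictionary per document. Here is a sketch of one way to cut that down while keeping the same formulas (TF = count / document length, IDF = lg(total documents / documents containing the word)): split each document once, count document frequencies in a single pass, and only touch the words that actually occur in each document. The name `fast_tf_idf` is made up, and it returns the vocabulary plus plain nested lists rather than a NumPy array.

```python
import math
from collections import Counter

def fast_tf_idf(list_words):
    """Return (vocabulary, matrix); matrix[d][j] is the TF-IDF of vocabulary[j] in document d."""
    docs = [doc.split() for doc in list_words]  # split each document only once
    # Document frequency: in how many documents each word appears
    doc_freq = Counter(word for words in docs for word in set(words))
    vocabulary = sorted(doc_freq)
    column = {word: j for j, word in enumerate(vocabulary)}
    n_docs = len(docs)
    idf = {word: math.log10(n_docs / df) for word, df in doc_freq.items()}

    matrix = []
    for words in docs:
        row = [0.0] * len(vocabulary)
        # Only visit the words present in this document
        for word, count in Counter(words).items():
            row[column[word]] = count / len(words) * idf[word]
        matrix.append(row)
    return vocabulary, matrix

vocabulary, matrix = fast_tf_idf(["apple banana apple", "banana cherry"])
```

If a NumPy array is needed downstream, `np.array(matrix)` should do it. For a large vocabulary a sparse matrix would be the more sensible container.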