
Python: Computing a Dictionary and a Term-Frequency Matrix

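Dictionary construction: each word is mapped to a numeric ID. The order of the words in the words list follows no meaningful rule: the list is built from a set, which is unordered, so the order (and therefore the IDs) can differ from one run to the next.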
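If reproducible IDs are wanted, one option is to sort the distinct words before numbering them. The snippet below is only a minimal sketch of that idea, not what the code in this post does; the two tokenized documents are made-up sample data.

```python
from itertools import chain

# Two already-tokenized documents (hypothetical sample data)
docs = [["graph", "minors", "a", "survey"],
        ["a", "survey", "of", "user", "opinion"]]

# sorted() turns the unordered set into an alphabetically ordered list,
# so the same input always yields the same word order
words = sorted(set(chain.from_iterable(docs)))

# Each word's position in the sorted list becomes its ID
dictionary = {word: idx for idx, word in enumerate(words)}

print(words)
print(dictionary)
```

With sorting, the ID of a word depends only on the vocabulary, not on the hash order of the set.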

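Term-frequency matrix: the number of rows equals the number of words, and the number of columns equals the number of documents.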

from collections import Counter
from itertools import chain
import numpy as np
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
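def word_matrix(documents):
    '''Compute the term-frequency matrix and the word-to-ID dictionary.'''
    # Convert all letters to lowercase
    docs = [d.lower() for d in documents]
    # Tokenize on whitespace
    docs = [d.split() for d in docs]
    # Collect the distinct words; a set is unordered, so the list order is arbitrary
    words = list(set(chain.from_iterable(docs)))
    #print(words)
    # Map each word to an ID so that every word has its own ID
    dictionary = dict(zip(words, range(len(words))))
    #print(dictionary)
    # Create a zero matrix: one row per word, one column per document
    matrix = np.zeros((len(words), len(docs)))
    # Count word frequencies document by document
    for col, d in enumerate(docs):  # col is the matrix column index, d is the tokenized document
        # Counter is a dict subclass whose items are {word: count}
        count = Counter(d)
        for word in count:
            # The word's ID is its row in the matrix; the document gives the column
            row = dictionary[word]
            # Store the frequency in the matrix
            matrix[row, col] = count[word]
    return matrix, dictionary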

matrix, dictionary = word_matrix(documents)
print(matrix,'\n',dictionary)
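The printed matrix is hard to read on its own, because the rows are identified only by ID. The following sketch shows one way to read such a result; the small dictionary and matrix here are invented stand-ins with the same layout (rows are words, columns are documents), not the actual output of the code above.

```python
import numpy as np

# Hypothetical small result with the same layout as the return value above
dictionary = {"graph": 0, "survey": 1, "trees": 2}
matrix = np.array([[1.0, 1.0, 1.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 1.0, 0.0]])

# Counts of one word across all documents: pick the row by the word's ID
trees_counts = matrix[dictionary["trees"]]

# Invert the mapping to get from a row index back to the word
id_to_word = {idx: word for word, idx in dictionary.items()}

# Words that occur in the document stored in column 2
rows_in_doc2 = np.nonzero(matrix[:, 2])[0]
words_in_doc2 = [id_to_word[int(i)] for i in rows_in_doc2]

print(trees_counts)
print(words_in_doc2)
```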

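2. Once the term-frequency matrix has been built, the TF matrix and the IDF matrix can both be derived from it; multiplying the two gives the tf-idf value of every word in every document. My earlier understanding lacked this big picture: the tf and the idf in the tf-idf model do not exist in isolation, they both evolve from the same matrix.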
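The step from the count matrix to tf-idf can be sketched roughly as follows. The post does not say which tf and idf formulas it has in mind, so the ones used here are assumptions: tf is the count divided by the document length, and idf is the unsmoothed log(N / df). The count matrix is toy data, and the IDF matrix is taken to be a diagonal matrix so that an ordinary matrix product works.

```python
import numpy as np

# Toy count matrix: rows are words, columns are documents (hypothetical data)
counts = np.array([[2.0, 1.0, 1.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 3.0, 1.0]])

n_docs = counts.shape[1]

# TF: divide each column by the total number of words in that document
tf = counts / counts.sum(axis=0, keepdims=True)

# Document frequency: in how many documents each word occurs
df = np.count_nonzero(counts, axis=1)

# IDF: one value per word; the unsmoothed log(N / df) form is an assumption
idf = np.log(n_docs / df)

# Matrix product: diagonal IDF matrix (words x words) times TF (words x docs)
tfidf = np.diag(idf) @ tf

# The same result via broadcasting, without building the diagonal matrix
tfidf_broadcast = idf[:, np.newaxis] * tf

print(tfidf)
```

A word that occurs in every document (the first row here) gets an idf of zero under this formula, so its tf-idf is zero everywhere; smoothed idf variants avoid that.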