Bag of words model (詞袋模型)
The bag-of-words model is a simplifying assumption used in natural language processing and information retrieval. In this model, a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order.
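As a minimal sketch of this representation, the snippet below reduces a text to word counts. The tokenizer (lowercasing plus a simple regular expression) and the function name `bag_of_words` are assumptions made for illustration, not part of the model's definition. Two sentences with the same words in a different order end up with the same bag.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return a word -> count mapping; word order and grammar are discarded."""
    # Assumed tokenizer: lowercase, keep runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

bag_a = bag_of_words("John likes movies. Mary likes movies too.")
bag_b = bag_of_words("Mary likes movies too. John likes movies.")

# Same words, different order: the two bags compare equal.
print(bag_a == bag_b)
print(bag_a["likes"], bag_a["too"])
```

A `Counter` keeps how often each word occurs, which is the information a Naive Bayes classifier needs later on.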
The bag-of-words model is used in some methods of document classification. When a Naive Bayes classifier is applied to text, for example, the conditional independence assumption leads to the bag-of-words model.[1] Other methods of document classification that use this model are latent Dirichlet allocation and latent semantic analysis.[2]
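To see why conditional independence leads to a bag of words, the likelihood of a document under Naive Bayes can be written out. The notation is introduced here for illustration: $c$ is a class, $w_1, \dots, w_n$ are the words of the document in order, $V$ is the vocabulary, and $n_w$ is the number of times word $w$ occurs in the document.

```latex
P(w_1, \dots, w_n \mid c)
  = \prod_{i=1}^{n} P(w_i \mid c)
  = \prod_{w \in V} P(w \mid c)^{\,n_w}
```

The right-hand side depends only on the counts $n_w$, so any reordering of the words gives the same likelihood.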
Example: Spam filtering
In Bayesian spam filtering, an e-mail message is modeled as an unordered collection of words selected from one of two probability distributions: one representing spam and one representing legitimate e-mail ("ham"). Imagine that there are two literal bags full of words. One bag is filled with words found in spam messages, and the other bag is filled with words found in legitimate e-mail. While any given word is likely to be found somewhere in both bags, the "spam" bag will contain spam-related words such as "stock", "Viagra", and "buy" much more frequently, while the "ham" bag will contain more words related to the user's friends or workplace.
To classify an e-mail message, the Bayesian spam filter assumes that the message is a pile of words poured out at random from one of the two bags, and uses Bayesian probability to determine which bag it more likely came from.
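The two-bag picture can be sketched as a tiny Naive Bayes classifier. Everything specific here is an assumption for illustration: the four training messages are made up, tokens are split on whitespace, add-one (Laplace) smoothing is used so unseen words do not zero out a score, and log-probabilities are summed to avoid underflow. It is a sketch of the idea, not a production spam filter.

```python
import math
from collections import Counter

# Hypothetical toy training data: (label, message text).
TRAINING = [
    ("spam", "buy stock now"),
    ("spam", "buy viagra"),
    ("ham", "lunch with friends"),
    ("ham", "meeting at work now"),
]

def train(messages):
    """Fill one bag of word counts per label and count messages per label."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for label, text in messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def log_score(words, label, word_counts, label_counts, vocab_size):
    """Log prior plus summed log likelihoods, with add-one smoothing."""
    bag = word_counts[label]
    total = sum(bag.values())
    score = math.log(label_counts[label] / sum(label_counts.values()))
    for w in words:
        score += math.log((bag[w] + 1) / (total + vocab_size))
    return score

def classify(text, word_counts, label_counts):
    """Return the label whose bag is more likely to have produced the text."""
    vocab_size = len(set(word_counts["spam"]) | set(word_counts["ham"]))
    words = text.lower().split()
    return max(
        ("spam", "ham"),
        key=lambda label: log_score(words, label, word_counts, label_counts, vocab_size),
    )

word_counts, label_counts = train(TRAINING)
print(classify("buy stock", word_counts, label_counts))
print(classify("lunch at work", word_counts, label_counts))
```

Because the score is a sum over words, shuffling the words of a message does not change its classification, which is the bag-of-words assumption at work.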