Machine Learning (3): A Probability-Based Classification Method: Naive Bayes

阿新 • Published: 2018-12-14

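Overview

Pros: remains effective when little data is available; handles multi-class problems.
Cons: sensitive to how the input data is prepared.
Works with: nominal values.

The core idea of Bayesian decision theory: choose the decision with the highest probability.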

Classifying with conditional probabilities

For a data point (x, y):

If $P(c_{1} \mid x, y) > P(c_{2} \mid x, y)$, the point belongs to class $c_{1}$.
If $P(c_{1} \mid x, y) < P(c_{2} \mid x, y)$, the point belongs to class $c_{2}$.

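Bayes' rule, $P(c_{i} \mid x, y) = \frac{p(x, y \mid c_{i})\, P(c_{i})}{p(x, y)}$, lets us compute the unknown posterior from three known probabilities: the likelihood $p(x, y \mid c_{i})$, the prior $P(c_{i})$, and the evidence $p(x, y)$.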
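As a small numeric illustration of Bayes' rule, the sketch below computes the posterior of two classes for one observed point. The function name `posterior` and all the numbers are assumptions made up for this example; they do not come from the article.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' rule: P(c|x) = p(x|c) * P(c) / p(x).

    priors      -- dict mapping class -> P(c)
    likelihoods -- dict mapping class -> p(x|c) for one observed point x
    """
    # Numerator of Bayes' rule for each class
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    # The evidence p(x) is the sum of the numerators over all classes
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}


# Hypothetical numbers, for illustration only
priors = {"c1": 0.6, "c2": 0.4}
likelihoods = {"c1": 0.2, "c2": 0.5}

post = posterior(priors, likelihoods)
best = max(post, key=post.get)  # the class with the highest posterior
print(post, best)
```

Because the evidence is the same for every class, comparing the numerators alone is enough to pick the winning class; the division only matters when the actual probability value is needed.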

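Document classification with naive Bayes

Naive Bayes makes two assumptions:

1. Features are independent: the probability that a feature (a word) occurs does not depend on which other words appear next to it.

Statistics tells us that if each feature requires N samples, a vocabulary of 1000 features requires N^1000 samples. If the features are independent of one another, the number of samples needed drops from N^1000 to 1000×N. This reduction is what the independence assumption buys us.

2. Every feature is equally important.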
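To make the sample-count argument concrete, here is a rough sketch that compares the two estimates. The helper `samples_needed` and the choice N = 10 are assumptions for illustration only.

```python
def samples_needed(n_per_feature, n_features, independent):
    """Rough sample-count estimate following the argument above."""
    if independent:
        # Each feature can be estimated on its own: n_features * N
        return n_features * n_per_feature
    # Every combination of feature values must be covered: N ** n_features
    return n_per_feature ** n_features


N = 10  # hypothetical number of samples per feature
full = samples_needed(N, 1000, independent=False)
naive = samples_needed(N, 1000, independent=True)

print(len(str(full)) - 1)  # order of magnitude of the joint estimate
print(naive)
```

Even with a modest N, the joint estimate is far beyond any realistic dataset, while the independent estimate stays manageable.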

Code:

# pysqlite2 is Python 2 only; the standard-library sqlite3 module offers the same DB-API
import sqlite3 as sqlite
import re
import math

def getwords(doc):
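  # \W+ splits on runs of non-word characters; the original \W* can match the
  # empty string, which on Python 3.7+ splits between every character
  splitter=re.compile(r'\W+')
  # Keep lower-cased words of 3 to 19 characters
  words=[s.lower() for s in splitter.split(doc) 
          if len(s)>2 and len(s)<20]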
  
  # Return the unique set of words only
  return dict([(w,1) for w in words])

class classifier:
  def __init__(self,getfeatures,filename=None):
    # Counts of feature/category combinations
    self.fc={}
    # Counts of documents in each category
    self.cc={}
    self.getfeatures=getfeatures
    
  def setdb(self,dbfile):
    self.con=sqlite.connect(dbfile)    
    self.con.execute('create table if not exists fc(feature,category,count)')
    self.con.execute('create table if not exists cc(category,count)')


  def incf(self,f,cat):
    count=self.fcount(f,cat)
    if count==0:
      # Parameterized queries avoid quoting bugs and SQL injection
      self.con.execute("insert into fc values (?,?,1)",(f,cat))
    else:
      self.con.execute(
        "update fc set count=? where feature=? and category=?",
        (int(count)+1,f,cat))
  
  def fcount(self,f,cat):
    res=self.con.execute(
      'select count from fc where feature=? and category=?',
      (f,cat)).fetchone()
    if res is None: return 0
    else: return float(res[0])

  def incc(self,cat):
    count=self.catcount(cat)
    if count==0:
      self.con.execute("insert into cc values (?,1)",(cat,))
    else:
      self.con.execute("update cc set count=? where category=?",
                       (int(count)+1,cat))

  def catcount(self,cat):
    res=self.con.execute('select count from cc where category=?',
                         (cat,)).fetchone()
    if res is None: return 0
    else: return float(res[0])

  def categories(self):
    cur=self.con.execute('select category from cc');
    return [d[0] for d in cur]

  def totalcount(self):
    res=self.con.execute('select sum(count) from cc').fetchone()
    # sum() over an empty table yields a row containing NULL, not an empty result
    if res is None or res[0] is None: return 0
    return res[0]


  def train(self,item,cat):
    features=self.getfeatures(item)
    # Increment the count for every feature with this category
    for f in features:
      self.incf(f,cat)

    # Increment the count for this category
    self.incc(cat)
    self.con.commit()

  def fprob(self,f,cat):
    if self.catcount(cat)==0: return 0

    # The total number of times this feature appeared in this 
    # category divided by the total number of items in this category
    return self.fcount(f,cat)/self.catcount(cat)

  def weightedprob(self,f,cat,prf,weight=1.0,ap=0.5):
    # Calculate current probability
    basicprob=prf(f,cat)

    # Count the number of times this feature has appeared in
    # all categories
    totals=sum([self.fcount(f,c) for c in self.categories()])

    # Calculate the weighted average
    bp=((weight*ap)+(totals*basicprob))/(weight+totals)
    return bp




class naivebayes(classifier):
  
  def __init__(self,getfeatures):
    classifier.__init__(self,getfeatures)
    self.thresholds={}
  
  def docprob(self,item,cat):
    features=self.getfeatures(item)   

    # Multiply the probabilities of all the features together
    p=1
    for f in features: p*=self.weightedprob(f,cat,self.fprob)
    return p

  def prob(self,item,cat):
    catprob=self.catcount(cat)/self.totalcount()
    docprob=self.docprob(item,cat)
    return docprob*catprob
  
  def setthreshold(self,cat,t):
    self.thresholds[cat]=t
    
  def getthreshold(self,cat):
    if cat not in self.thresholds: return 1.0
    return self.thresholds[cat]
  
  def classify(self,item,default=None):
    probs={}
    # Find the category with the highest probability
    best=None
    maxprob=0.0
    for cat in self.categories():
      probs[cat]=self.prob(item,cat)
      if probs[cat]>maxprob: 
        maxprob=probs[cat]
        best=cat

    # No category scored above zero (for example, nothing trained yet)
    if best is None: return default

    # Make sure the probability exceeds threshold*next best
    for cat in probs:
      if cat==best: continue
      if probs[cat]*self.getthreshold(best)>probs[best]: return default
    return best

class fisherclassifier(classifier):
  def cprob(self,f,cat):
    # The frequency of this feature in this category    
    clf=self.fprob(f,cat)
    if clf==0: return 0

    # The frequency of this feature in all the categories
    freqsum=sum([self.fprob(f,c) for c in self.categories()])

    # The probability is the frequency in this category divided by
    # the overall frequency
    p=clf/(freqsum)
    
    return p
  def fisherprob(self,item,cat):
    # Multiply all the probabilities together
    p=1
    features=self.getfeatures(item)
    for f in features:
      p*=(self.weightedprob(f,cat,self.cprob))

    # Take the natural log and multiply by -2
    fscore=-2*math.log(p)

    # Use the inverse chi2 function to get a probability
    return self.invchi2(fscore,len(features)*2)
  def invchi2(self,chi, df):
    m = chi / 2.0
    sum = term = math.exp(-m)
    for i in range(1, df//2):
        term *= m / i
        sum += term
    return min(sum, 1.0)
  def __init__(self,getfeatures):
    classifier.__init__(self,getfeatures)
    self.minimums={}

  def setminimum(self,cat,min):
    self.minimums[cat]=min
  
  def getminimum(self,cat):
    if cat not in self.minimums: return 0
    return self.minimums[cat]
  def classify(self,item,default=None):
    # Loop through looking for the best result
    best=default
    max=0.0
    for c in self.categories():
      p=self.fisherprob(item,c)
      # Make sure it exceeds its minimum
      if p>self.getminimum(c) and p>max:
        best=c
        max=p
    return best


def sampletrain(cl):
  cl.train('Nobody owns the water.','good')
  cl.train('the quick rabbit jumps fences','good')
  cl.train('buy pharmaceuticals now','bad')
  cl.train('make quick money at the online casino','bad')
  cl.train('the quick brown fox jumps','good')
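To see the naive Bayes calculation end to end without a database, the following is a minimal in-memory sketch of the same logic (`fprob`, `weightedprob`, `prob`) trained on the five sample sentences. The class name `TinyNaiveBayes` is hypothetical, plain dictionaries stand in for the SQLite tables, and thresholds are left out; treat it as an illustration under those assumptions rather than a replacement for the listing above.

```python
import re


def getwords(doc):
    # Unique lower-cased words of 3 to 19 characters, as in getwords above
    return {w.lower() for w in re.split(r'\W+', doc) if 2 < len(w) < 20}


class TinyNaiveBayes:
    """In-memory stand-in for the SQLite-backed classes (hypothetical name)."""

    def __init__(self):
        self.fc = {}  # feature -> {category: count}
        self.cc = {}  # category -> number of documents

    def train(self, item, cat):
        for f in getwords(item):
            counts = self.fc.setdefault(f, {})
            counts[cat] = counts.get(cat, 0) + 1
        self.cc[cat] = self.cc.get(cat, 0) + 1

    def fcount(self, f, cat):
        return self.fc.get(f, {}).get(cat, 0)

    def fprob(self, f, cat):
        # Fraction of documents in this category that contain the feature
        n = self.cc.get(cat, 0)
        return self.fcount(f, cat) / n if n else 0.0

    def weightedprob(self, f, cat, weight=1.0, ap=0.5):
        # Blend the assumed probability ap with the observed frequency,
        # weighted by how often the feature has been seen overall
        totals = sum(self.fcount(f, c) for c in self.cc)
        return (weight * ap + totals * self.fprob(f, cat)) / (weight + totals)

    def prob(self, item, cat):
        # Category prior times the product of the per-feature probabilities
        p = self.cc[cat] / sum(self.cc.values())
        for f in getwords(item):
            p *= self.weightedprob(f, cat)
        return p

    def classify(self, item):
        return max(self.cc, key=lambda c: self.prob(item, c))


cl = TinyNaiveBayes()
cl.train('Nobody owns the water.', 'good')
cl.train('the quick rabbit jumps fences', 'good')
cl.train('buy pharmaceuticals now', 'bad')
cl.train('make quick money at the online casino', 'bad')
cl.train('the quick brown fox jumps', 'good')

print(cl.classify('quick rabbit'))
print(cl.classify('quick money'))
```

With many features the product in `prob` can underflow to zero; summing logarithms of the probabilities instead is a common way around that.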
