
Using the naive Bayes classes in scikit-learn

scikit-learn provides three naive Bayes classifiers: GaussianNB (Gaussian naive Bayes), MultinomialNB (multinomial naive Bayes) and BernoulliNB (Bernoulli naive Bayes).
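All three classifiers share the same estimator interface and differ only in the distribution they assume for the features (continuous values, occurrence counts, binary indicators). A minimal sketch on toy data (the data is assumed purely for illustration):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Toy data, assumed purely for illustration.
X = np.array([[1, 2, 3, 4], [1, 3, 4, 4], [2, 4, 5, 5], [2, 5, 6, 5]])
y = np.array([1, 1, 2, 2])

# Same fit/predict API for all three; only the feature model differs.
for Model in (GaussianNB, MultinomialNB, BernoulliNB):
    clf = Model().fit(X, y)
    print(Model.__name__, clf.predict([[1, 2, 4, 4]]))
```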

1. Gaussian naive Bayes: sklearn.naive_bayes.GaussianNB(priors=None)

① Building a simple model with the GaussianNB class

In [1]: import numpy as np
   ...: from sklearn.naive_bayes import GaussianNB
   ...: X = np.array([[-1, -1], [-2, -2], [-3, -3], [-4, -4], [-5, -5], [1, 1], [2, 2], [3, 3]])
   ...: y = np.array([1, 1, 1, 1, 1, 2, 2, 2])
   ...: clf = GaussianNB()  # default priors=None
   ...: clf.fit(X, y)
Out[1]: GaussianNB(priors=None)

② Inspecting the attributes after fitting on the training set

In [2]: clf.priors  # no output, because priors was left as None
In [3]: clf.set_params(priors=[0.625, 0.375])  # set the priors parameter
Out[3]: GaussianNB(priors=[0.625, 0.375])
In [4]: clf.priors  # list of per-class prior probabilities
Out[4]: [0.625, 0.375]

  • priors attribute: the per-class prior probabilities supplied by the user
In [5]: clf.class_prior_
Out[5]: array([ 0.625,  0.375])
In [6]: type(clf.class_prior_)
Out[6]: numpy.ndarray

  • class_prior_ attribute: like priors, the per-class prior probabilities; the difference is that priors returns a list while class_prior_ returns an ndarray
In [7]: clf.class_count_
Out[7]: array([ 5.,  3.])

  • class_count_ attribute: the number of training samples in each class
In [8]: clf.theta_
Out[8]:
array([[-3., -3.],
       [ 2.,  2.]])

  • theta_ attribute: the mean of each feature within each class
In [9]: clf.sigma_
Out[9]:
array([[ 2.00000001,  2.00000001],
       [ 0.66666667,  0.66666667]])

  • sigma_ attribute: the variance of each feature within each class

③ Methods

  • get_params(deep=True): returns the estimator's parameters (here just priors) as a dict
In [10]: clf.get_params(deep=True)
Out[10]: {'priors': [0.625, 0.375]}
In [11]: clf.get_params()
Out[11]: {'priors': [0.625, 0.375]}

  • set_params(**params): sets the estimator's priors parameter
In [3]: clf.set_params(priors=[0.625, 0.375])
Out[3]: GaussianNB(priors=[0.625, 0.375])

  • fit(X, y, sample_weight=None): trains the model; X is the feature matrix, y the class labels, and sample_weight an optional array of per-sample weights
In [12]: clf.fit(X, y, np.array([0.05, 0.05, 0.1, 0.1, 0.1, 0.2, 0.2, 0.2]))  # give the samples different weights
Out[12]: GaussianNB(priors=[0.625, 0.375])
In [13]: clf.theta_
Out[13]:
array([[-3.375, -3.375],
       [ 2.   ,  2.   ]])
In [14]: clf.sigma_
Out[14]:
array([[ 1.73437501,  1.73437501],
       [ 0.66666667,  0.66666667]])

With these unequal sample weights, the mean and variance of class 1 on feature 1 are computed as:

mean = ((-1*0.05) + (-2*0.05) + (-3*0.1) + (-4*0.1) + (-5*0.1)) / (0.05+0.05+0.1+0.1+0.1) = -3.375

variance = ((-1+3.375)**2*0.05 + (-2+3.375)**2*0.05 + (-3+3.375)**2*0.1 + (-4+3.375)**2*0.1 + (-5+3.375)**2*0.1) / (0.05+0.05+0.1+0.1+0.1) = 1.734375, which GaussianNB reports as 1.73437501 after adding a tiny variance-smoothing epsilon
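The same weighted statistics can be reproduced with numpy (a sketch of the hand calculation above; GaussianNB itself additionally adds a small smoothing epsilon to the variance):

```python
import numpy as np

# Class-1 values of feature 1 and their sample weights, from the example above.
x = np.array([-1.0, -2.0, -3.0, -4.0, -5.0])
w = np.array([0.05, 0.05, 0.1, 0.1, 0.1])

mean = np.average(x, weights=w)               # weighted mean
var = np.average((x - mean) ** 2, weights=w)  # weighted variance
print(mean, var)  # → -3.375 1.734375
```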

  • partial_fit(X, y, classes=None, sample_weight=None): incremental training. When the training set is too large to load into memory at once, it can be split into chunks and partial_fit called repeatedly to learn the model parameters online. The classes parameter must be specified on the first call to partial_fit; it may be omitted on subsequent calls.
In [18]: import numpy as np
    ...: from sklearn.naive_bayes import GaussianNB
    ...: X = np.array([[-1, -1], [-2, -2], [-3, -3], [-4, -4], [-5, -5], [1, 1], [2, 2], [3, 3]])
    ...: y = np.array([1, 1, 1, 1, 1, 2, 2, 2])
    ...: clf = GaussianNB()  # default priors=None
    ...: clf.partial_fit(X, y, classes=[1, 2], sample_weight=np.array([0.05, 0.05, 0.1, 0.1, 0.1, 0.2, 0.2, 0.2]))
Out[18]: GaussianNB(priors=None)
In [19]: clf.class_prior_
Out[19]: array([ 0.4,  0.6])
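When the data genuinely does not fit in memory, the same call is made once per chunk. A sketch, with np.array_split standing in for reading chunks from disk (an assumption for illustration):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[-1, -1], [-2, -2], [-3, -3], [-4, -4], [-5, -5], [1, 1], [2, 2], [3, 3]])
y = np.array([1, 1, 1, 1, 1, 2, 2, 2])

clf = GaussianNB()
# classes is required on the first call; passing the same value again is harmless.
for X_chunk, y_chunk in zip(np.array_split(X, 4), np.array_split(y, 4)):
    clf.partial_fit(X_chunk, y_chunk, classes=[1, 2])

print(clf.class_count_)  # per-class counts accumulated across all chunks
```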

  • predict(X): returns the predicted class labels for the test samples
In [20]: clf.predict([[-6, -6], [4, 5]])
Out[20]: array([1, 2])

  • predict_proba(X): returns the predicted probability of each class for the test samples
In [21]: clf.predict_proba([[-6, -6], [4, 5]])
Out[21]:
array([[  1.00000000e+00,   4.21207358e-40],
       [  1.12585521e-12,   1.00000000e+00]])

  • predict_log_proba(X): returns the log of the predicted class probabilities for the test samples
In [22]: clf.predict_log_proba([[-6, -6], [4, 5]])
Out[22]:
array([[  0.00000000e+00,  -9.06654487e+01],
       [ -2.75124782e+01,  -1.12621024e-12]])

  • score(X, y, sample_weight=None): returns the (weighted) mean accuracy of the predictions on the given test data and labels
In [23]: clf.score([[-6, -6], [-4, -2], [-3, -4], [4, 5]], [1, 1, 2, 2])
Out[23]: 0.75
In [24]: clf.score([[-6, -6], [-4, -2], [-3, -4], [4, 5]], [1, 1, 2, 2], sample_weight=[0.3, 0.2, 0.4, 0.1])
Out[24]: 0.59999999999999998
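With sample_weight, score is simply a weighted accuracy. Since the weighted score drops to 0.6, the misclassified sample must be the third one ([-3, -4], weight 0.4). A sketch of the arithmetic:

```python
import numpy as np

correct = np.array([1, 1, 0, 1])         # 1 where prediction matches the true label
weights = np.array([0.3, 0.2, 0.4, 0.1])

# Weighted accuracy: sum of weights of correct predictions / sum of all weights.
print(np.average(correct, weights=weights))  # → 0.6
```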

2. Multinomial naive Bayes: sklearn.naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None). Mainly used for classification with discrete features, e.g. word counts in text classification, where the feature value is the number of occurrences.

Parameters:

alpha: float, optional, default 1.0; the additive (Laplace/Lidstone) smoothing parameter

fit_prior: bool, optional, default True; whether to learn class prior probabilities. If False, all classes get the same prior.

class_prior: array-like of shape (n_classes,), default None; prior probabilities of the classes

① Building a simple model with MultinomialNB

In [2]: import numpy as np
   ...: from sklearn.naive_bayes import MultinomialNB
   ...: X = np.array([[1,2,3,4],[1,3,4,4],[2,4,5,5],[2,5,6,5],[3,4,5,6],[3,5,6,6]])
   ...: y = np.array([1,1,4,2,3,3])
   ...: clf = MultinomialNB(alpha=2.0)
   ...: clf.fit(X,y)
Out[2]: MultinomialNB(alpha=2.0, class_prior=None, fit_prior=True)

② Inspecting the attributes after training

  • class_log_prior_: log of the smoothed class prior probabilities; its value depends on the fit_prior and class_prior parameters

a. If class_prior is specified, then regardless of whether fit_prior is True or False, class_log_prior_ is simply class_prior converted to log values:

In [4]: import numpy as np
   ...: from sklearn.naive_bayes import MultinomialNB
   ...: X = np.array([[1,2,3,4],[1,3,4,4],[2,4,5,5],[2,5,6,5],[3,4,5,6],[3,5,6,6]])
   ...: y = np.array([1,1,4,2,3,3])
   ...: clf = MultinomialNB(alpha=2.0, fit_prior=True, class_prior=[0.3,0.1,0.3,0.2])
   ...: clf.fit(X,y)
   ...: print(clf.class_log_prior_)
   ...: print(np.log(0.3), np.log(0.1), np.log(0.3), np.log(0.2))
   ...: clf1 = MultinomialNB(alpha=2.0, fit_prior=False, class_prior=[0.3,0.1,0.3,0.2])
   ...: clf1.fit(X,y)
   ...: print(clf1.class_log_prior_)
[-1.2039728  -2.30258509 -1.2039728  -1.60943791]
-1.20397280433 -2.30258509299 -1.20397280433 -1.60943791243
[-1.2039728  -2.30258509 -1.2039728  -1.60943791]

b. If fit_prior=False and class_prior=None, every class gets the same prior probability, 1/N, where N is the number of classes:

In [5]: import numpy as np
   ...: from sklearn.naive_bayes import MultinomialNB
   ...: X = np.array([[1,2,3,4],[1,3,4,4],[2,4,5,5],[2,5,6,5],[3,4,5,6],[3,5,6,6]])
   ...: y = np.array([1,1,4,2,3,3])
   ...: clf = MultinomialNB(alpha=2.0, fit_prior=False)
   ...: clf.fit(X,y)
   ...: print(clf.class_log_prior_)
   ...: print(np.log(1/4))
[-1.38629436 -1.38629436 -1.38629436 -1.38629436]
-1.38629436112

c. If fit_prior=True and class_prior=None, the prior of each class is its sample count divided by the total sample count:

In [6]: import numpy as np
   ...: from sklearn.naive_bayes import MultinomialNB
   ...: X = np.array([[1,2,3,4],[1,3,4,4],[2,4,5,5],[2,5,6,5],[3,4,5,6],[3,5,6,6]])
   ...: y = np.array([1,1,4,2,3,3])
   ...: clf = MultinomialNB(alpha=2.0, fit_prior=True)
   ...: clf.fit(X,y)
   ...: print(clf.class_log_prior_)  # output follows class order 1, 2, 3, 4
   ...: print(np.log(2/6), np.log(1/6), np.log(2/6), np.log(1/6))
[-1.09861229 -1.79175947 -1.09861229 -1.79175947]
-1.09861228867 -1.79175946923 -1.09861228867 -1.79175946923

  • intercept_: class_log_prior_ viewed as the intercept of a linear model; its value is identical to class_log_prior_
In [7]: clf.class_log_prior_
Out[7]: array([-1.09861229, -1.79175947, -1.09861229, -1.79175947])
In [8]: clf.intercept_
Out[8]: array([-1.09861229, -1.79175947, -1.09861229, -1.79175947])

  • feature_log_prob_: log of the smoothed conditional probability of each feature given the class; an array of shape (n_classes, n_features)
In [9]: clf.feature_log_prob_
Out[9]:
array([[-2.01490302, -1.45528723, -1.2039728 , -1.09861229],
       [-1.87180218, -1.31218639, -1.178655  , -1.31218639],
       [-1.74919985, -1.43074612, -1.26369204, -1.18958407],
       [-1.79175947, -1.38629436, -1.23214368, -1.23214368]])

The conditional probabilities are computed as follows, taking class 1 (the samples [1,2,3,4] and [1,3,4,4]) as an example:

In [10]: print(np.log((1+1+2)/(1+2+3+4+1+3+4+4+4*2)), np.log((2+3+2)/(1+2+3+4+1+3+4+4+4*2)),
    ...:       np.log((3+4+2)/(1+2+3+4+1+3+4+4+4*2)), np.log((4+4+2)/(1+2+3+4+1+3+4+4+4*2)))
-2.01490302054 -1.45528723261 -1.20397280433 -1.09861228867

Smoothed conditional probability of a feature = (count of that feature in the class + alpha) / (total count of all features in the class + n_features * alpha). Here class 1's feature counts are [2, 5, 7, 8], their sum is 22, and n_features * alpha = 4 * 2 = 8, giving e.g. (2+2)/30 for the first feature.
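This formula can be checked directly against the fitted attributes (a sketch: the smoothed counts come from feature_count_, and summing them along axis 1 automatically adds n_features * alpha to the denominator):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.array([[1,2,3,4],[1,3,4,4],[2,4,5,5],[2,5,6,5],[3,4,5,6],[3,5,6,6]])
y = np.array([1, 1, 4, 2, 3, 3])
alpha = 2.0

clf = MultinomialNB(alpha=alpha).fit(X, y)

counts = clf.feature_count_                  # shape (n_classes, n_features)
smoothed = counts + alpha
# Row sums equal (total feature count per class) + n_features * alpha.
log_prob = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))
print(np.allclose(log_prob, clf.feature_log_prob_))  # → True
```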

  • coef_: feature_log_prob_ viewed as the coefficients of a linear model; its value is identical to feature_log_prob_
In [11]: clf.coef_
Out[11]:
array([[-2.01490302, -1.45528723, -1.2039728 , -1.09861229],
       [-1.87180218, -1.31218639, -1.178655  , -1.31218639],
       [-1.74919985, -1.43074612, -1.26369204, -1.18958407],
       [-1.79175947, -1.38629436, -1.23214368, -1.23214368]])

  • class_count_: the number of training samples in each class, in class order
In [12]: clf.class_count_
Out[12]: array([ 2.,  1.,  2.,  1.])

  • feature_count_: the total count of each feature within each class; an array of shape (n_classes, n_features)
In [13]: clf.feature_count_
Out[13]:
array([[  2.,   5.,   7.,   8.],
       [  2.,   5.,   6.,   5.],
       [  6.,   9.,  11.,  12.],
       [  2.,   4.,   5.,   5.]])
In [14]: print([(1+1), (2+3), (3+4), (4+4)])  # class 1: summing its two samples by hand
[2, 5, 7, 8]


③ Methods

  • fit(X, y, sample_weight=None): fits the model on X and y
In [15]: import numpy as np
    ...: from sklearn.naive_bayes import MultinomialNB
    ...: X = np.array([[1,2,3,4],[1,3,4,4],[2,4,5,5],[2,5,6,5],[3,4,5,6],[3,5,6,6]])
    ...: y = np.array([1,1,4,2,3,3])
    ...: clf = MultinomialNB(alpha=2.0, fit_prior=True)
    ...: clf.fit(X,y)
Out[15]: MultinomialNB(alpha=2.0, class_prior=None, fit_prior=True)

  • get_params(deep=True): returns the classifier's parameters as a dict
In [16]: clf.get_params(True)
Out[16]: {'alpha': 2.0, 'class_prior': None, 'fit_prior': True}

  • partial_fit(X, y, classes=None, sample_weight=None): incremental training for large datasets; learns the model parameters online. X may be array-like or a sparse matrix. The classes parameter must be specified on the first call; it may be omitted on subsequent calls.
In [17]: import numpy as np
    ...: from sklearn.naive_bayes import MultinomialNB
    ...: X = np.array([[1,2,3,4],[1,3,4,4],[2,4,5,5],[2,5,6,5],[3,4,5,6],[3,5,6,6]])
    ...: y = np.array([1,1,4,2,3,3])
    ...: clf = MultinomialNB(alpha=2.0, fit_prior=True)
    ...: clf.partial_fit(X,y)
    ...: clf.partial_fit(X,y,classes=[1,2])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-17-b512d165c9a0> in <module>()
      4 y = np.array([1,1,4,2,3,3])
      5 clf = MultinomialNB(alpha=2.0,fit_prior=True)
----> 6 clf.partial_fit(X,y)
      7 clf.partial_fit(X,y,classes=[1,2])
ValueError: classes must be passed on the first call to partial_fit.

In [18]: import numpy as np
    ...: from sklearn.naive_bayes import MultinomialNB
    ...: X = np.array([[1,2,3,4],[1,3,4,4],[2,4,5,5],[2,5,6,5],[3,4,5,6],[3,5,6,6]])
    ...: y = np.array([1,1,4,2,3,3])
    ...: clf = MultinomialNB(alpha=2.0, fit_prior=True)
    ...: clf.partial_fit(X,y,classes=[1,2])
    ...: clf.partial_fit(X,y)
Out[18]: MultinomialNB(alpha=2.0, class_prior=None, fit_prior=True)

  • predict(X): predicts the target values for the test set X
In [19]: clf.predict([[1,3,5,6],[3,4,5,4]])
Out[19]: array([1, 1])

  • predict_log_proba(X): the log-probability of each test sample belonging to each class
In [22]: import numpy as np
    ...: from sklearn.naive_bayes import MultinomialNB
    ...: X = np.array([[1,2,3,4],[1,3,4,4],[2,4,5,5],[2,5,6,5],[3,4,5,6],[3,5,6,6]])
    ...: y = np.array([1,1,4,2,3,3])
    ...: clf = MultinomialNB(alpha=2.0, fit_prior=True)
    ...: clf.fit(X,y)
Out[22]: MultinomialNB(alpha=2.0, class_prior=None, fit_prior=True)
In [23]: clf.predict_log_proba([[3,4,5,4],[1,3,5,6]])
Out[23]:
array([[-1.27396027, -1.69310891, -1.04116963, -1.69668527],
       [-0.78041614, -2.05601551, -1.28551649, -1.98548389]])

  • predict_proba(X): the probability of each test sample belonging to each class
In [1]: import numpy as np
   ...: from sklearn.naive_bayes import MultinomialNB
   ...: X = np.array([[1,2,3,4],[1,3,4,4],[2,4,5,5],[2,5,6,5],[3,4,5,6],[3,5,6,6]])
   ...: y = np.array([1,1,4,2,3,3])
   ...: clf = MultinomialNB(alpha=2.0, fit_prior=True)
   ...: clf.fit(X,y)
Out[1]: MultinomialNB(alpha=2.0, class_prior=None, fit_prior=True)
In [2]: clf.predict_proba([[3,4,5,4],[1,3,5,6]])
Out[2]:
array([[ 0.27972165,  0.18394676,  0.35304151,  0.18329008],
       [ 0.45821529,  0.12796282,  0.27650773,  0.13731415]])

  • score(X, y, sample_weight=None): returns the mean accuracy of the predictions on the given test data and labels
In [3]: clf.score([[3,4,5,4],[1,3,5,6]],[1,1])
Out[3]: 0.5

  • set_params(**params): sets the estimator's parameters
In [4]: clf.set_params(alpha=1.0)
Out[4]: MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

3. Bernoulli naive Bayes: sklearn.naive_bayes.BernoulliNB(alpha=1.0, binarize=0.0, fit_prior=True, class_prior=None). Like multinomial naive Bayes, it is mainly used for classification with discrete features. The difference from MultinomialNB is that MultinomialNB uses occurrence counts as feature values, while BernoulliNB uses binary/boolean features.

Parameter:

binarize: the threshold used to binarize the feature values

① Building a simple model with BernoulliNB

In [5]: import numpy as np
   ...: from sklearn.naive_bayes import BernoulliNB
   ...: X = np.array([[1,2,3,4],[1,3,4,4],[2,4,5,5]])
   ...: y = np.array([1,1,2])
   ...: clf = BernoulliNB(alpha=2.0, binarize=3.0, fit_prior=True)
   ...: clf.fit(X,y)
Out[5]: BernoulliNB(alpha=2.0, binarize=3.0, class_prior=None, fit_prior=True)

After binarization with binarize=3.0, the input is equivalent to the array X below:

In [7]: X = np.array([[0,0,0,1],[0,0,1,1],[0,1,1,1]])
In [8]: X
Out[8]:
array([[0, 0, 0, 1],
       [0, 0, 1, 1],
       [0, 1, 1, 1]])

② Inspecting the attributes after training

  • class_log_prior_: log of the class prior probabilities; each class's prior equals its sample count divided by the total number of samples
In [9]: clf.class_log_prior_
Out[9]: array([-0.40546511, -1.09861229])

  • feature_log_prob_: log of the smoothed conditional probability of each feature given the class; an array of shape (n_classes, n_features)
In [10]: clf.feature_log_prob_
Out[10]:
array([[-1.09861229, -1.09861229, -0.69314718, -0.40546511],
       [-0.91629073, -0.51082562, -0.51082562, -0.51082562]])

How these values are computed:

With binarized features, each entry is the smoothed log-probability that a feature equals 1 given the class: log P(x_i = 1 | y) = log((N_yi + alpha) / (N_y + 2*alpha)), where N_yi is the number of class-y samples in which feature i equals 1, N_y is the number of class-y samples, and the factor 2 is the number of possible feature values (0 and 1):

In [11]: import numpy as np
    ...: from sklearn.naive_bayes import BernoulliNB
    ...: X = np.array([[1,2,3,4],[1,3,4,4],[2,4,5,5]])
    ...: y = np.array([1,1,2])
    ...: clf = BernoulliNB(alpha=2.0, binarize=3.0, fit_prior=True)
    ...: clf.fit(X,y)
    ...: print(clf.feature_log_prob_)
    ...: print([np.log((2+2)/(2+2*2))*0 + np.log((0+2)/(2+2*2))*1,
    ...:        np.log((2+2)/(2+2*2))*0 + np.log((0+2)/(2+2*2))*1,
    ...:        np.log((1+2)/(2+2*2))*0 + np.log((1+2)/(2+2*2))*1,
    ...:        np.log((0+2)/(2+2*2))*0 + np.log((2+2)/(2+2*2))*1])
[[-1.09861229 -1.09861229 -0.69314718 -0.40546511]
 [-0.91629073 -0.51082562 -0.51082562 -0.51082562]]
[-1.0986122886681098, -1.0986122886681098, -0.69314718055994529, -0.40546510810816444]

  • class_count_: the number of training samples in each class, in class order
In [12]: clf.class_count_
Out[12]: array([ 2.,  1.])

  • feature_count_: per-class sums of the (binarized) feature values, in class order; an array of shape (n_classes, n_features)
In [13]: clf.feature_count_
Out[13]:
array([[ 0.,  0.,  1.,  2.],
       [ 0.,  1.,  1.,  1.]])
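feature_count_ can be reproduced by binarizing X and summing the 0/1 columns within each class (a sketch of the check):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

X = np.array([[1, 2, 3, 4], [1, 3, 4, 4], [2, 4, 5, 5]])
y = np.array([1, 1, 2])
clf = BernoulliNB(alpha=2.0, binarize=3.0).fit(X, y)

# Binarize at the same threshold, then sum the columns per class.
Xb = (X > 3.0).astype(float)
manual = np.array([Xb[y == 1].sum(axis=0), Xb[y == 2].sum(axis=0)])
print(np.array_equal(manual, clf.feature_count_))  # → True
```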