"Statistical Learning Methods": A Naive Bayes Code Implementation
A Xin · Published: 2021-02-28
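### How naive Bayes classification works
Given a training set, naive Bayes first learns the joint probability distribution of inputs and outputs under the assumption that the features are conditionally independent. Then, for a given input $x$, it applies Bayes' theorem to this model and returns the output $y$ with the largest posterior probability.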
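Written out, the rule described above is Bayes' theorem followed by an arg max. The denominator is identical for every class, so it can be dropped when only the predicted label is needed. Here $K$ stands for the number of classes, a symbol introduced only for this formula:

$$P(y_k|x)=\frac{P(x|y_k)P(y_k)}{\sum_{j=1}^{K}P(x|y_j)P(y_j)},\qquad \hat{y}=\arg\max_{y_k}P(x|y_k)P(y_k)$$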
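**Feature independence assumption**: to predict with Bayes' theorem we need the conditional probability $P(x|y_k)=P(x_1,x_2,...,x_n|y_k)$, whose number of parameters grows exponentially with the number of features. If the $i$-th feature can take $T_i$ distinct values and there are $k$ classes, the number of parameters is $k\prod_{i=1}^nT_i$. Estimating that many parameters is clearly infeasible, so naive Bayes assumes that the features are conditionally independent given the class, i.e. $P(x|y_k)=\prod_{i=1}^nP(x_i|y_k)$. The assumption is made essentially to simplify the computation.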
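To get a feel for the difference, the following minimal sketch counts table entries with and without the independence assumption. The function names and the example setting (3 classes, 10 features with 5 values each) are made up for illustration, and the count simply tallies one entry per feature value per class.

```python
from math import prod

def full_joint_param_count(num_classes, values_per_feature):
    """Entries needed to tabulate P(x_1, ..., x_n | y_k) for every class."""
    return num_classes * prod(values_per_feature)

def naive_param_count(num_classes, values_per_feature):
    """Entries needed when each P(x_i | y_k) is tabulated separately."""
    return num_classes * sum(values_per_feature)

# Hypothetical setting: 3 classes, 10 features, each taking 5 discrete values.
k = 3
T = [5] * 10
print(full_joint_param_count(k, T))  # 3 * 5**10 = 29296875
print(naive_param_count(k, T))       # 3 * 50 = 150
```

With independence the count grows with the sum of the $T_i$ rather than their product.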
```python
import numpy as np
import math
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from collections import Counter
```
Load the iris classification dataset from sklearn's datasets:
```python
iris = load_iris()
X, Y = iris.data, iris.target
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)
print('X_train[0]: {}'.format(X_train[0]))
print('Y_train[0]: {}'.format(Y_train[0]))
# Count the training samples in each class
for l in set(Y_train):
    print('label: %s ,count: %d' % (l, len(Y_train[Y_train==l])))
```
Output:
X_train[0]: [5.2 3.5 1.5 0.2]
Y_train[0]: 0
label: 0 ,count: 35
label: 1 ,count: 32
label: 2 ,count: 38
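**Gaussian naive Bayes:**
For continuous-valued features, the estimation method used for discrete features breaks down: most feature values never appear in the training data, so their conditional probability would be 0. We therefore use Gaussian naive Bayes, which assumes that every feature dimension follows a Gaussian distribution:
$$P(x_i | y_k)=\frac{1}{\sqrt{2\pi}\sigma_{y_k,i}}\exp\left(-\frac{(x_i-\mu_{y_k,i})^2}{2\sigma^2_{y_k,i}}\right)$$
Here $\mu_{y_k,i}$ is the mean of the $i$-th feature over the samples of class $y_k$, and $\sigma_{y_k,i}^2$ is its variance.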
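As a small worked example of the formula, the sketch below estimates the mean and standard deviation of one feature from a handful of made-up values and evaluates the density at two points. The data and the helper name `gaussian_pdf` are illustrative assumptions, not part of the original code.

```python
import math
from statistics import fmean, pstdev

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std**2) evaluated at x."""
    exponent = math.exp(-((x - mean) ** 2) / (2 * std ** 2))
    return exponent / (math.sqrt(2 * math.pi) * std)

# Hypothetical values of one feature for the samples of a single class.
feature_values = [1.4, 1.5, 1.3, 1.6, 1.2]

mu = fmean(feature_values)      # class-conditional mean
sigma = pstdev(feature_values)  # population standard deviation (divides by N, like np.std)

print(mu, sigma)
print(gaussian_pdf(1.4, mu, sigma))  # at the mean: the largest density
print(gaussian_pdf(4.5, mu, sigma))  # far from the mean: close to 0
```

Note that these are density values rather than probabilities, so they can exceed 1 when the standard deviation is small. That does not affect the arg max over classes.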
```python
class GaussianNaiveBayes:
    def __init__(self):
        self.parameters = {}  # per-class Gaussian parameters
        self.prior = {}       # per-class prior probabilities

    # Training amounts to estimating the priors and the Gaussian parameters.
    # X: (n_samples, n_features)  Y: (n_samples,)
    def fit(self, X, Y):
        self._get_prior(Y)  # compute the prior probabilities
        labels = set(Y)
        for label in labels:
            samples = X[Y == label]
            # Gaussian parameters: per-feature mean and standard deviation
            means = np.mean(samples, axis=0)
            stds = np.std(samples, axis=0)
            self.parameters[label] = {
                'means': means,
                'stds': stds
            }

    # x: a single sample
    def predict(self, x):
        # sort the (label, score) pairs by score in ascending order
        probs = sorted(self._cal_likelihoods(x).items(), key=lambda item: item[-1])
        return probs[-1][0]

    # Compute the model's accuracy on the test set.
    # X_test: (n_test_samples, n_features)
    def evaluate(self, X_test, Y_test):
        true_pred = 0
        for i, x in enumerate(X_test):
            label = self.predict(x)
            if label == Y_test[i]:
                true_pred += 1
        return true_pred / len(X_test)

    # Compute the prior probability of each class.
    def _get_prior(self, Y):
        cnt = Counter(Y)
        for label, count in cnt.items():
            self.prior[label] = count / len(Y)

    # Gaussian density
    def _gaussian(self, x, mean, std):
        exponent = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(std, 2))))
        return (1 / (math.sqrt(2 * math.pi) * std)) * exponent

    # For each class, compute the prior times the likelihood of sample x
    # (the unnormalized posterior).
    def _cal_likelihoods(self, x):
        likelihoods = {}
        for label, params in self.parameters.items():
            means = params['means']
            stds = params['stds']
            prob = self.prior[label]
            # multiply in the conditional probability of each feature, P(xi|yk)
            for i in range(len(means)):
                prob *= self._gaussian(x[i], means[i], stds[i])
            likelihoods[label] = prob
        return likelihoods
```
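With only four features the product of densities above is well behaved, but with many features it can underflow to 0. A common remedy is to add log densities instead of multiplying densities. The sketch below shows one possible way to do that, and to normalize the scores into posteriors with the log-sum-exp trick. The function names and the numeric parameters are illustrative assumptions; only the dictionary layout (`prior`, and `parameters` with `'means'` and `'stds'`) mirrors the class above.

```python
import math

def gaussian_log_pdf(x, mean, std):
    """Logarithm of the N(mean, std**2) density at x."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - (x - mean) ** 2 / (2 * std ** 2)

def log_joint_scores(x, prior, parameters):
    """log P(y_k) + sum_i log P(x_i | y_k) for every class."""
    scores = {}
    for label, params in parameters.items():
        score = math.log(prior[label])
        for xi, mean, std in zip(x, params['means'], params['stds']):
            score += gaussian_log_pdf(xi, mean, std)
        scores[label] = score
    return scores

def posteriors(scores):
    """Normalize log joint scores into P(y_k | x) with the log-sum-exp trick."""
    top = max(scores.values())
    log_total = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return {label: math.exp(s - log_total) for label, s in scores.items()}

# Hypothetical parameters for two classes and two features.
prior = {0: 0.5, 1: 0.5}
parameters = {
    0: {'means': [5.0, 1.5], 'stds': [0.4, 0.2]},
    1: {'means': [6.0, 4.3], 'stds': [0.5, 0.5]},
}
scores = log_joint_scores([5.1, 1.4], prior, parameters)
print(max(scores, key=scores.get))  # label with the highest log score
print(posteriors(scores))           # values sum to 1
```

Because the logarithm is monotonic, the predicted label is the same as with the product form whenever the product does not underflow.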
Evaluate the classifier on the test set:
```python
gaussian_nb = GaussianNaiveBayes()
gaussian_nb.fit(X_train, Y_train)
print('Prediction for sample [4.4, 3.2, 1.3, 0.2]: %d' % gaussian_nb.predict([4.4, 3.2, 1.3, 0.2]))
print('Test-set accuracy: %f' % gaussian_nb.evaluate(X_test, Y_test))
```
Output:
Prediction for sample [4.4, 3.2, 1.3, 0.2]: 0
Test-set accuracy: 0.955556
### Comparison with the scikit-learn implementation
```python
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X_train, Y_train)
print('(sklearn) Prediction for sample [4.4, 3.2, 1.3, 0.2]: %d' % clf.predict([[4.4, 3.2, 1.3, 0.2]])[0])
print('(sklearn) Test-set accuracy: %f' % clf.score(X_test, Y_test))
```
Output:
(sklearn) Prediction for sample [4.4, 3.2, 1.3, 0.2]: 0
(sklearn) Test-set accuracy: 0