The Not-So-Naive Naive Bayes and Its Python Implementation
As a classifier that sounds very naive indeed, the Naive Bayes Classifier relies on the "attribute conditional independence assumption": all attributes are assumed to be mutually independent given the class. The goal of the classifier is, for any test sample $x = (x_1, x_2, \dots, x_d)$, to choose the class with the largest posterior probability:

$$h(x) = \arg\max_{c} P(c \mid x)$$

To get an intuition for Bayes' theorem: the left-hand side asks, when we are handed a new sample $x$, how likely it is that $x$ belongs to class $c$; the right-hand side rewrites that question in terms of quantities we can estimate from the training set.

Let us look at how Bayes' theorem is computed:

$$P(c \mid x) = \frac{P(c)\,P(x \mid c)}{P(x)} = \frac{P(c)}{P(x)} \prod_{i=1}^{d} P(x_i \mid c)$$

where the second equality uses the independence assumption; the evidence $P(x)$ is identical for every class and can be dropped when comparing them. The problem therefore becomes estimating, among the samples of class $c$, the probability that the $i$-th attribute takes the value $x_i$. For a discrete attribute this is simply the within-class frequency $P(x_i \mid c) = |D_{c,x_i}| / |D_c|$, where $D_c$ is the set of class-$c$ training samples and $D_{c,x_i}$ its subset whose $i$-th attribute equals $x_i$. If the attribute is continuous, a probability density function is used instead; assuming a Gaussian,

$$p(x_i \mid c) = \frac{1}{\sqrt{2\pi}\,\sigma_{c,i}} \exp\!\left(-\frac{(x_i - \mu_{c,i})^2}{2\sigma_{c,i}^2}\right)$$

where $\mu_{c,i}$ and $\sigma_{c,i}$ are the mean and standard deviation of the $i$-th attribute over the class-$c$ samples.
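To make the decision rule concrete, here is a minimal sketch (all numbers invented) that scores one two-attribute sample against two classes under the Gaussian model and picks the larger score:

import numpy as np

def gaussian(x,mu,sigma):  # Gaussian density p(x) with mean mu and std sigma
    return np.exp(-np.square(x-mu)/(2*np.square(sigma)))/(np.sqrt(2*np.pi)*sigma)

x = np.array([0.2,-1.0])  # one test sample with two attributes
priors = [0.6,0.4]  # P(c) for two classes (invented)
means = [np.array([0.0,-1.2]),np.array([1.0,0.5])]  # per-class attribute means
stds = [np.array([1.0,0.8]),np.array([0.5,1.0])]  # per-class attribute standard deviations
scores = [priors[c]*np.prod(gaussian(x,means[c],stds[c])) for c in range(2)]  # P(c)*prod_i p(x_i|c)
print('predicted class index:',int(np.argmax(scores)))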
That covers the whole algorithm; the principle is simple. It is worth noting that although the attribute conditional independence assumption, introduced when applying Bayes' theorem to curb combinatorial explosion and sample sparsity, rarely holds in reality, Naive Bayes classifiers have repeatedly been shown to perform very well in practice. Two explanations are common: first, although Naive Bayes cannot produce accurate probability values, classification only picks the class with the largest conditional probability, so it suffices that the ordering of the probabilities is correct; second, if the dependencies between attributes affect all classes in the same way, or their influences cancel each other out, they do no harm to classification performance, while the assumption cuts the computational cost dramatically.
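A toy illustration of the first point (numbers invented): even if the estimated posteriors are far off in absolute value, the prediction is unchanged as long as their ordering is preserved.

import numpy as np
true_post = np.array([0.7,0.2,0.1])  # hypothetical true posteriors over 3 classes
naive_est = np.array([0.4,0.35,0.25])  # inaccurate Naive Bayes estimates
print(np.argmax(true_post) == np.argmax(naive_est))  # True: same predicted class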
Bayesian methods can be regarded as the other major pillar of machine learning besides deep learning. At KDD China 2016, Zhou Zhihua stressed that deep learning differs in structure from the biological neural networks of the human brain, and that no single machine learning algorithm is the best. At the same time, our brains do seem to perform something like Bayesian inference natively. On this point, readers can judge for themselves.
Python Source Code
Note: the Naive Bayes classifier places no restriction on the number of classes. It is tested here on the three-class Wine dataset with a test-set proportion of 0.4; the accuracy on the 72 test samples is 95%–100% (the variation comes from the randomness of the train/test split), which speaks for its practicality.
#!/usr/bin/env python3
# coding=utf-8
"""
Naive Bayes Classifier
Author :Chai Zheng
Blog :http://blog.csdn.net/chai_zheng/
Github :https://github.com/Chai-Zheng/Machine-Learning
Email :[email protected]
Date :2017.10.7
"""
import math
import time
import numpy as np
from collections import Counter
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
def calcProbDensity(meanLabel,stdLabel,test_X):  # conditional probability density P(xi|c) for one sample
    numAttributes = len(test_X)
    MultiProbDensity = 1.0
    for i in range(numAttributes):  # multiply the Gaussian density of every attribute of test_X
        MultiProbDensity *= np.exp(-np.square(test_X[i]-meanLabel[i])/(2.0*np.square(stdLabel[i])))/(np.sqrt(2.0*math.pi)*stdLabel[i])
    return MultiProbDensity
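# Sanity check (assumption: SciPy installed; not part of the original script): for a
# single attribute the density above matches scipy.stats.norm.pdf, e.g.
#   from scipy.stats import norm
#   calcProbDensity(np.array([0.0]),np.array([1.0]),np.array([0.5]))  # ~= norm.pdf(0.5,0,1)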
def calcPriorProb(Y_train):  # prior probability P(c) of every class
    i,j = 0,0
    global labelValue,classNum
    numSamples = Y_train.shape[0]
    labelValue = np.zeros((numSamples,1))  # only the first classNum rows end up holding the distinct labels
    Y_train_counter = sum(Y_train.tolist(),[])  # flatten Y_train into a hashable data structure
    cnt = Counter(Y_train_counter)  # number of distinct labels and the sample count of each
    for key in cnt:
        labelValue[i] = key
        i += 1
    classNum = i  # total number of classes
    Pc = np.zeros((classNum,1))  # prior probability of each class
    eachLabelNum = np.zeros((classNum,1))  # number of samples in each class
    for key in cnt:
        Pc[j] = cnt[key]/numSamples
        eachLabelNum[j] = cnt[key]
        j += 1
    return labelValue,eachLabelNum,classNum,Pc
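# Example (made-up labels): for Y_train = np.array([[1],[1],[2],[3]]) this returns
# labelValue beginning with rows [1],[2],[3], eachLabelNum = [[2],[1],[1]],
# classNum = 3 and priors Pc = [[0.5],[0.25],[0.25]].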
def trainBayes(X_train,Y_train):
    startTime = time.time()
    numTrainSamples,numAttributes = X_train.shape
    labelValue,eachLabelNum,classNum,Pc = calcPriorProb(Y_train)
    meanlabelX,stdlabelX = [],[]  # per-class mean and standard deviation of every attribute
    for i in range(classNum):
        k = 0
        labelXMatrix = np.zeros((int(eachLabelNum[i,0]),numAttributes))  # gather all samples of class i into a new matrix
        for j in range(numTrainSamples):
            if Y_train[j,0] == labelValue[i,0]:
                labelXMatrix[k] = X_train[j,:]
                k += 1
        meanlabelX.append(np.mean(labelXMatrix,axis=0).tolist())  # column means and unbiased standard deviations of that matrix, appended per class
        stdlabelX.append(np.std(labelXMatrix,ddof=1,axis=0).tolist())
    meanlabelX = np.array(meanlabelX).reshape(classNum,numAttributes)
    stdlabelX = np.array(stdlabelX).reshape(classNum,numAttributes)
    print('---Train completed. Took %f s.'%(time.time()-startTime))
    return meanlabelX,stdlabelX,Pc
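# meanlabelX[i,j] and stdlabelX[i,j] hold the Gaussian parameters of attribute j within
# class labelValue[i]; together with Pc they fully specify the classifier. Caveat (an
# observation, not from the original post): if an attribute is constant within a class,
# its standard deviation is 0 and calcProbDensity divides by zero; real-valued data
# such as Wine makes this unlikely.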
def predict(X_test,Y_test,meanlabelX,stdlabelX,Pc):
    numTestSamples = X_test.shape[0]
    matchCount = 0
    for m in range(numTestSamples):
        x_test = X_test[m,:]  # take the test samples one by one
        pred = np.zeros((classNum,1))  # score of each class
        for i in range(classNum):
            pred[i] = calcProbDensity(meanlabelX[i,:],stdlabelX[i,:],x_test)*Pc[i]  # probability of belonging to each class
        predLabel = labelValue[np.argmax(pred)]  # look up the predicted label
        if predLabel == Y_test[m]:
            matchCount += 1
    accuracy = matchCount/numTestSamples
    return accuracy
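# Note (common variant, not used here): with many attributes the product of densities
# can underflow; summing log-densities, i.e. log P(c) + sum_i log p(x_i|c), is the
# usual numerically stable alternative and leaves the argmax unchanged.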
if __name__ == '__main__':
    print('Step 1. Loading data...')
    # Dataset download: http://download.csdn.net/download/chai_zheng/10009919
    data = np.loadtxt("Wine.txt",delimiter=',')  # load the Wine dataset
    print('---Loading completed.')
    x = data[:,1:14]  # 13 attributes; column 0 holds the class label
    y = data[:,0].reshape(178,1)
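    # Alternative (an assumption, not in the original post): recent versions of
    # scikit-learn bundle the same Wine data, avoiding the manual download:
    #   from sklearn.datasets import load_wine
    #   wine = load_wine()
    #   x,y = wine.data,wine.target.reshape(-1,1)  # labels are 0/1/2 instead of 1/2/3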
    print('Step 2. Splitting and preprocessing data...')
    X_train,X_test,Y_train,Y_test = train_test_split(x,y,test_size=0.4)  # split the dataset
    scaler = preprocessing.StandardScaler().fit(X_train)  # standardize the data
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)
    print('---Splitting completed.\n---Number of training samples:%d\n---Number of testing samples:%d'\
        %(X_train.shape[0],X_test.shape[0]))
    print('Step 3. Training...')
    meanlabelX,stdlabelX,Pc = trainBayes(X_train,Y_train)
    print('Step 4. Testing...')
    accuracy = predict(X_test,Y_test,meanlabelX,stdlabelX,Pc)
    print('---Testing completed. Accuracy:%.3f%%' %(accuracy*100))
Test output (screenshot in the original post): the run reports the split sizes, the training time, and an accuracy in the 95%–100% range noted above.
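As an optional cross-check (a sketch, not part of the original post; it assumes the variables from the script above are still in scope), scikit-learn's GaussianNB implements the same Gaussian Naive Bayes model and should reach a comparable accuracy on the same split:

from sklearn.naive_bayes import GaussianNB  # same model family as the code above
clf = GaussianNB().fit(X_train,Y_train.ravel())  # ravel: sklearn expects a 1-D label vector
print('sklearn GaussianNB accuracy: %.3f%%' % (100*clf.score(X_test,Y_test.ravel())))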