
Implementing a K-Nearest Neighbors Classifier in Python

Collecting the data

31,65,4,1
33,58,10,1
33,60,0,1
34,59,0,2
34,66,9,2

This sample set concerns the survival time of breast cancer patients who underwent surgery. The text file contains 306 samples, each with the following attributes:
1. opAge: the patient's age at the time of the operation
2. opYear: the year of the operation minus 1900; for example, an operation performed in 1970 gives opYear = 70
3. cellNum: the number of positive axillary lymph nodes
4. status: survival status, where status = 1 means the patient survived 5 years or longer, and status = 2 means the patient died within 5 years.
Read the text file with pandas and convert the data into a DataFrame:

import pandas as pd

data = pd.read_table(r"C:\data\Haberman's Survival Data.txt", sep=",", header=None,
                     names=['opAge', 'opYear', 'cellNum', 'status'], engine="python")
data1 = data[data['status'] == 1]  # samples with status == 1
data2 = data[data['status'] == 2]  # samples with status == 2

Checking data.shape confirms the size: (306, 4).

Scatter plot of the data

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(16, 12))
ax = fig.add_subplot(projection="3d")  # 3D axes (fig.gca(projection=...) is deprecated)
ax.scatter(data1['opAge'], data1['opYear'], data1['cellNum'], c='r', s=100,
           marker="*", label="survived 5 years or longer")  # samples with status == 1
ax.scatter(data2['opAge'], data2['opYear'], data2['cellNum'], c='b', s=100,
           marker="^", label="died within 5 years")  # samples with status == 2
ax.set_xlabel("operation age", size=15)
ax.set_ylabel("operation year", size=15)
ax.set_zlabel("cell number", size=15)
ax.set_title("Haberman's Survival Scatter Plot", size=15, weight='bold')
ax.set_zlim(0, 30)
ax.legend(loc="lower right", fontsize=15)
plt.show()

The resulting 3D scatter plot is shown below:

Preparing the data

The principle of KNN: find the K samples nearest to the target sample, and assign the target sample the class that appears most often among those K. Euclidean distance is used here, so attributes with larger numeric ranges would dominate the distance computation and bias the measurement. To remove this effect, the data must be normalized:

import numpy as np

# Normalization: (x - min) / (max - min), mapping each attribute into [0, 1]
def autoNorm(dataSet):
    minVals = dataSet.min(axis=0)  # column-wise minimum, i.e. the minimum of each attribute
    maxVals = dataSet.max(axis=0)  # the maximum of each attribute
    factors = maxVals - minVals  # normalization factors
    sNum = dataSet.shape[0]  # number of rows, i.e. number of samples
    # tile minVals and the factors to the same shape as dataSet, then subtract
    # and divide; every normalized value ends up in [0, 1]
    normDataSet = (dataSet - np.tile(minVals, (sNum, 1))) / np.tile(factors, (sNum, 1))
    return normDataSet
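
The later splitting code operates on a samples DataFrame that the post never constructs explicitly. A plausible sketch, assuming the three feature columns are normalized while the status column is left untouched:

samples = data.copy()
# hypothetical construction: normalize only the feature columns, keep status as-is
samples[['opAge', 'opYear', 'cellNum']] = autoNorm(data[['opAge', 'opYear', 'cellNum']])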

Training the algorithm

Splitting the data set

K-fold cross validation is used to evaluate the classifier's performance: 10% of the samples are drawn at random as the testing data, and the remaining 90% are used for 10-fold cross validation.

import random

testIdxs = random.sample(range(len(samples)), len(samples) // 10)  # randomly pick the testing-data indices
testingSet = samples.loc[testIdxs]  # fetch the testing data from the sample set by index
idxs = list(range(len(samples)))  # the full index sequence
# remove the testing-data indices from the full index sequence
for i in range(len(testIdxs)):
    idxs.remove(testIdxs[i])
trainData = samples.loc[idxs]  # the data used for training

10-fold cross validation is then applied to trainData. First, shuffle trainData's index sequence with random.shuffle(idxs); since idxs is a mutable list and the remove operations above have already run, it now holds only trainData's indices, as shown below.
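
The in-place shuffle, exactly as the text describes:

random.shuffle(idxs)  # shuffle the remaining training indices in place before slicing them into folds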

k nearest neighbor

Euclidean distance is used as the distance metric, i.e. d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + (x3 - y3)^2) over the three normalized attributes:

import operator
import numpy as np
import pandas as pd

# inX: the target sample
# dataSet: the data set searched for the k nearest neighbors; labels holds the
# corresponding class labels, and the indexes of dataSet and labels match one-to-one
def classifyKNN(inX, dataSet, labels, k):
    # Guard against dataSet and labels having indexes that are not consecutive
    # natural numbers starting from 0, which would scramble the argsort step
    # below (argsort returns positions counted from 0). So first rebuild both
    # with indexes running 0, 1, 2, ...
    nDataSet = np.zeros((dataSet.shape[0], dataSet.shape[1]))  # zero matrix with the same shape as dataSet
    j = 0
    for i in dataSet.index:
        nDataSet[j] = dataSet.loc[i]
        j += 1
    nDataSet = pd.DataFrame(nDataSet)

    nLabels = np.zeros(labels.shape[0], dtype=int)  # zero vector as long as labels; int dtype keeps the class labels integral
    h = 0
    for i in labels.index:
        nLabels[h] = labels.loc[i]
        h += 1

    dataSetNum = nDataSet.shape[0]  # number of samples (DataFrame rows)
    diffMat = np.tile(inX, (dataSetNum, 1)) - nDataSet  # attribute-wise difference between the target sample and every reference sample; same shape as nDataSet
    sqDiffMat = diffMat ** 2  # squared differences
    sqDistances = sqDiffMat.sum(axis=1)  # row sums: the squared distance between the target and each sample
    distances = sqDistances ** 0.5  # square root: the Euclidean distance to each sample
    sortedDistanceIdx = distances.argsort()  # positions of the distances sorted ascending; e.g. sortedDistanceIdx[0] == 150 means the smallest distance sits at position 150 of distances
    classCount = {}
    for i in range(k):
        # take the labels of the k smallest distances
        voteLabel = nLabels[int(sortedDistanceIdx[i])]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    # classCount maps each observed label to its vote count
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)  # descending by count
    return sortedClassCount[0][0]  # the most frequent label
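
A minimal usage sketch (not from the original post): classify the first training sample against the remaining ones with k = 3, using the column names defined earlier.

inX = trainData.loc[trainData.index[0], ['opAge', 'opYear', 'cellNum']]  # the target sample
refs = trainData.iloc[1:]  # the reference samples
print(classifyKNN(inX, refs[['opAge', 'opYear', 'cellNum']], refs['status'], 3))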

10-fold cross validation

A single validation run

# returns the error rate on the given validation set
def train(trainingSet, validationSet, kn):
    errorCount = 0
    vIdxs = validationSet.index
    # iterate over the validation set, applying KNN to each sample
    for i in range(len(validationSet)):
        pred = classifyKNN(validationSet.loc[vIdxs[i], ['opAge', 'opYear', 'cellNum']], trainingSet[['opAge', 'opYear', 'cellNum']], trainingSet['status'], kn)
        if pred != validationSet.at[vIdxs[i], 'status']:
            errorCount += 1
    return errorCount / float(len(validationSet))
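
As a quick sanity check (hypothetical snippet, not in the original post), a single fold can be evaluated directly, assuming idxs has already been shuffled as described above:

fold = idxs[:len(idxs) // 10]   # one fold as the validation set
rest = idxs[len(idxs) // 10:]   # the other nine folds as the training set
print("single-fold error rate: %.2f" % train(trainData.loc[rest], trainData.loc[fold], 3))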
Ten validation runs

The average error rate over the 10 validation runs is used to evaluate the KNN classifier's performance:

# dataSet: the data set used for cross validation; idxs is its index sequence
# k: k-fold cross validation
# kn: kn nearest neighbors
def crossValidation(dataSet, idxs, k, kn):
    step = len(idxs) // k  # size of each fold (integer division)
    errorRate = 0
    for i in range(k):
        validationIdx = []
        for j in range(i * step, (i + 1) * step):  # use j so the fold counter i is not shadowed
            validationIdx.append(idxs[j])
        validationSet = dataSet.loc[validationIdx]  # the validation fold
        temp = idxs[:]
        for v in validationIdx:  # drop the validation indices from a copy
            temp.remove(v)
        trainingSet = dataSet.loc[temp]  # the remaining folds form the training set
        errorRate += train(trainingSet, validationSet, kn)
    aveErrorRate = errorRate / float(k)
    return aveErrorRate
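
crossValidation also makes it easy to tune the neighbor count. A hedged sketch (not in the original post) that scans a few odd kn values and keeps the one with the lowest average error:

bestKn, bestErr = None, float('inf')
for kn in range(1, 16, 2):  # odd values avoid ties in the two-class vote
    err = crossValidation(trainData, idxs, 10, kn)
    if err < bestErr:
        bestKn, bestErr = kn, err
print("best kn = %d with average error %.2f" % (bestKn, bestErr))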

Testing the algorithm

After cross validation, use all of trainData to predict the testing data:

def predict(trainingSet, testingSet, kn):
    errorCount = 0
    vIdxs = testingSet.index
    for i in range(len(testingSet)):
        pred = classifyKNN(testingSet.loc[vIdxs[i], ['opAge', 'opYear', 'cellNum']], trainingSet[['opAge', 'opYear', 'cellNum']], trainingSet['status'], kn)
        print("The prediction result is %s" % pred)
        print("The real result is %s" % testingSet.at[vIdxs[i], 'status'])
        if pred != testingSet.at[vIdxs[i], 'status']:
            errorCount += 1
    return errorCount / float(len(testingSet))

print("The cross validation error ratio is %.2f" % crossValidation(trainData, idxs, 10, 3))
# predict against trainData (not the full samples set), as stated above, so the
# test points cannot match themselves
print("The testing data error ratio is %.2f" % predict(trainData, testingSet, 3))

The output is:

The cross validation error ratio is 0.28
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 1
The prediction result is 2
The real result is 2
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 2
The prediction result is 1
The real result is 1
The prediction result is 2
The real result is 1
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 1
The prediction result is 2
The real result is 2
The prediction result is 1
The real result is 2
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 1
The prediction result is 2
The real result is 2
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 2
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 2
The testing data error ratio is 0.17