The K-Nearest Neighbors Algorithm as I See It

阿新 • Published: 2018-09-16


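There is a saying: if you want to understand a person, start with the friends around him.

If the friends he keeps are all people of good character, you can assume his own character is not bad either.

The ancients left many sayings and fables on this point, for example: "Birds of a feather flock together" and "He who stays near vermilion turns red; he who stays near ink turns black."

The K-nearest neighbors algorithm shares this old wisdom: all things are connected, and each is reflected in the others.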

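How K-nearest neighbors works:

We have a training set in which the label of every sample is known. For example, the training samples are a group of people, each with certain features (likes drinking, likes reading, visits brothels, gets into fights, enjoys helping others, and so on), and we know whether each of them is a good person or a bad person. Then a newcomer (a new sample) arrives. The algorithm compares the new sample's features with the corresponding features of every sample in the training set, that is, it checks whose character in the group is closest to the newcomer's. It takes the K most similar samples (this is where the K comes from), looks at their labels, and assigns the most frequent label to the new sample. In other words, we look at the people selected and count whether good or bad people are in the majority; if there are more good people, we decide the newcomer is a good person.

The K-nearest neighbors algorithm has no training phase; it classifies new samples directly.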
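The description above can be sketched in a few lines of plain Python. The feature names (drinks, reads, helps_others), the scores, and the function name `knn_predict` are invented for illustration and do not come from the book's code; treat this as a minimal sketch of the idea rather than a reference implementation.

```python
from collections import Counter

def knn_predict(new_sample, samples, labels, k):
    """Return the majority label among the k training samples closest to new_sample."""
    # Squared Euclidean distance is enough for ranking neighbors.
    distances = [
        sum((a - b) ** 2 for a, b in zip(new_sample, sample))
        for sample in samples
    ]
    # Indices of the training samples, nearest first, cut to the first k.
    nearest = sorted(range(len(samples)), key=lambda i: distances[i])[:k]
    # Majority vote over the labels of those k neighbors.
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical people, scored 0..1 on (drinks, reads, helps_others).
people = [
    (0.9, 0.1, 0.2),
    (0.8, 0.2, 0.1),
    (0.1, 0.9, 0.8),
    (0.2, 0.8, 0.9),
    (0.3, 0.7, 0.7),
]
verdicts = ["bad", "bad", "good", "good", "good"]

newcomer = (0.2, 0.9, 0.7)
print(knn_predict(newcomer, people, verdicts, k=3))
```

Note that nothing is fitted in advance: every call walks over the whole training set, which is exactly what "no training phase" means here.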

The code comes from Machine Learning in Action; it runs on Python 3.7 and is commented in detail:

#coding=utf-8
from numpy import *
import operator
import os,sys

def createDataSet():
    # Build the training points as a NumPy array
    group = array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
    labels = ['A','A','B','B']
    return group,labels

# inx is the sample to classify
def classify0(inx,dataSet,labels,k):
    # shape[0] is the number of rows, shape[1] the number of columns
    dataSetSize = dataSet.shape[0]
    # Repeat inx dataSetSize times along the rows (not along the columns)
    # so that it can be compared with every training sample at once
    # Compute the Euclidean distance
    diffMat = tile(inx,(dataSetSize,1)) - dataSet
    # Square each difference
    sqDiffMat = diffMat**2
    # Sum across each row
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    # argsort sorts in ascending order but returns the indices, not the values
    sortedDistIndices = distances.argsort()
    classCount = {}
    # Look at the k smallest distances
    for i in range(k):
        # Label of the i-th nearest training sample
        voteIlabel = labels[sortedDistIndices[i]]
        # Voting: count how often each label occurs among the k neighbors
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
    # Sort by count to find the most frequent label (note: Python 3 renamed dict.iteritems() -> dict.items())
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1),reverse=True)
    return sortedClassCount[0][0]

def file2matrix(filename):
    # Read all lines; the with block closes the file afterwards
    with open(filename) as fr:
        arrayOLines = fr.readlines()
    # Number of lines in the file
    numberOfLines = len(arrayOLines)
    # Create a zero matrix of shape (numberOfLines,3)
    returnMat = zeros((numberOfLines,3))
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        # Strip leading and trailing whitespace (spaces, newlines)
        line = line.strip()
        listFromLine = line.split('\t')
        # Copy the first three fields into the current row of returnMat
        returnMat[index,:] = listFromLine[0:3]
        # Read the label; string labels have to be converted to int
        if listFromLine[-1] == 'largeDoses':
            classLabelVector.append(3)
        elif listFromLine[-1] == 'smallDoses':
            classLabelVector.append(2)
        elif listFromLine[-1] == 'didntLike':
            classLabelVector.append(1)
        else:
            classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat,classLabelVector

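# Exploring the data
'''
Type the following in the console:
import matplotlib
import matplotlib.pyplot as plt
# Create a figure window
fig = plt.figure()
# 111 means the window is divided into a 1*1 grid and the first cell is used
ax = fig.add_subplot(111)
# Draw a scatter plot
ax.scatter(datingDataMat[:,1],datingDataMat[:,2])
# Use marker size and color to tell the classes apart
ax.scatter(datingDataMat[:,1],datingDataMat[:,2],15.0*array(datingLabels),15.0*array(datingLabels))
plt.show()

'''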
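# The features in a data set often lie on very different scales;
# a feature with large values then dominates the distance and badly hurts accuracy,
# so the feature values are normalized to the range 0~1
# Formula: newValue = (oldValue - min) / (max - min)
def autoNorm(dataSet):
    # Minimum of each column
    minVals = dataSet.min(0)
    # Maximum of each column
    maxVals = dataSet.max(0)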
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals,(m,1))
    normDataSet = normDataSet/tile(ranges,(m,1))
    return normDataSet,ranges,minVals

# Test code for the classifier on the dating-site data
def datingClassTest():
    hoRatio = 0.10
    datingDataMat,datingLabels = file2matrix('datingTestSet.txt')
    # Normalize
    normMat,ranges,minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    # Use 10% of the data set as the test set
    numTestVecs = int(m*hoRatio)
    errorCount = 0.0
    # Classify every test sample, then compute the error rate
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i,:],normMat[numTestVecs:m,:],datingLabels[numTestVecs:m],3)
        print ("the classifier came back with:%d,the real answer is:%d"%(classifierResult,datingLabels[i]))
        if (classifierResult != datingLabels[i]):errorCount += 1.0
    print ("the total error rate is:%f"%(errorCount/float(numTestVecs)))

    
def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    # In Python 3, user input is read with input()
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    classifierResult = classify0((inArr - minVals)/ranges, normMat, datingLabels, 3)
    print ("You will probably like this person: %s" % resultList[classifierResult - 1])
    
    
# Handwritten digit recognition
# Convert a 32*32 matrix into a 1*1024 vector
def img2vector(filename):
    returnVect = zeros((1,1024))
    # The with block closes the file afterwards
    with open(filename) as fr:
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                returnVect[0,32*i+j] = int(lineStr[j])
    return returnVect
    
def handwritingClassTest():
    hwLabels= []
    # List the contents of the directory
    trainingFileList = os.listdir('trainingDigits')
    m = len(trainingFileList)
    trainingMat = zeros((m,1024))
    for i in range(m):
        # Extract the digit from the name of the text file
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainingMat[i,:] = img2vector('trainingDigits/%s' % fileNameStr)
    testFileList = os.listdir('testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest,trainingMat,hwLabels,3)
        # Report each prediction and count the errors
        print ("the classifier came back with:%d,the real answer is:%d" % (classifierResult,classNumStr))
        if (classifierResult != classNumStr):errorCount += 1.0
    print ("\n the total number of errors is:%d" % errorCount)
    print ("\n the total error rate is:%f" % (errorCount/float(mTest)))
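The min-max formula used by autoNorm, newValue = (oldValue - min) / (max - min), can also be written with NumPy broadcasting instead of tile. The sketch below is an alternative under stated assumptions: the function name `min_max_normalize` and the sample rows are made up, and the guard against a constant column is an addition that the listing above does not have.

```python
import numpy as np

def min_max_normalize(data):
    """Scale each column of data to the range [0, 1]."""
    min_vals = data.min(axis=0)
    ranges = data.max(axis=0) - min_vals
    # Assumption: a column whose values are all equal is mapped to 0
    # instead of triggering a division by zero.
    safe_ranges = np.where(ranges == 0, 1.0, ranges)
    # Broadcasting subtracts and divides row by row without tile().
    return (data - min_vals) / safe_ranges, ranges, min_vals

# Made-up rows: (flier miles, % of time gaming, liters of ice cream).
sample = np.array([
    [40000.0, 8.0, 1.0],
    [10000.0, 2.0, 0.5],
    [70000.0, 5.0, 1.5],
])
normalized, ranges, min_vals = min_max_normalize(sample)
print(normalized)
```

After scaling, the flier-miles column no longer outweighs the other two when distances are computed, which is the reason the normalization step exists.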

The algorithm has two main steps:

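(1) Compute the distance between two vectors to measure their similarity. The code uses the Euclidean distance:

　　d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)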

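(2) Sort, pick the K most similar points, and choose the class that occurs most often among them.

　　The code simply sorts all the distances. With a large data set this is expensive, because every query computes and sorts the distance to every training sample; a k-d tree can be used to speed up the nearest-neighbor search.

The most frequent class is selected by voting.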
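The two steps can be sketched compactly with NumPy. This is not the k-d tree mentioned above: it still computes every distance, and only replaces the full sort with `numpy.argpartition`, a partial selection that finds the k smallest values without ordering the rest. The function name `classify_partial` is hypothetical; the four sample points are the ones from createDataSet.

```python
import numpy as np
from collections import Counter

def classify_partial(inx, data_set, labels, k):
    """kNN vote using broadcasting and a partial selection instead of a full sort."""
    # Step 1: Euclidean distance from inx to every row (broadcasting replaces tile).
    distances = np.sqrt(((data_set - inx) ** 2).sum(axis=1))
    # Step 2: indices of the k smallest distances, in no particular order.
    nearest = np.argpartition(distances, k - 1)[:k]
    # Majority vote over the labels of those k neighbors.
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify_partial(np.array([0.0, 0.0]), group, labels, 3))
```

Since the vote only needs to know which samples are among the k nearest, not their exact order, the partial selection is sufficient here.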

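The code above also includes handwritten digit recognition, still based on the Euclidean distance. I had previously built a handwritten digit recognizer trained with a neural network, and I wanted to compare the accuracy of the two approaches.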

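kNN has no training phase and the algorithm is very simple, but in practice I found that it has limitations. Here is what I did.

Recognizing handwriting with kNN:

I first converted the grayscale digit images into 32*32 character files and then ran kNN on them. Accuracy varied widely between test sets. The size and stroke thickness of the digits both affect recognition, so a test set that resembles the training set in these respects scores well and one that does not scores poorly. Using the training set itself as the test set gives about 99% accuracy, but switching to a different test set drops accuracy to around 34% (only a little better than guessing). To improve accuracy, the training set has to be enlarged (ideally covering every handwriting style) and K has to be tuned. But then every single classification has to compare against a huge data set and sort the results to find the nearest ones, which is very inefficient.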
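One way to see why size and position matter so much is a toy experiment with synthetic images. The sketch below does not use the author's data: `bar_image` is a hypothetical helper that draws a single vertical stroke in a 32*32 binary image, and the comparison only illustrates how pixel-wise Euclidean distance reacts to a small shift.

```python
import numpy as np

def bar_image(column):
    """Return a flattened 32x32 binary image containing one vertical bar."""
    image = np.zeros((32, 32))
    image[:, column] = 1
    return image.reshape(-1)

def euclidean(a, b):
    """Pixel-wise Euclidean distance between two flattened images."""
    return np.sqrt(((a - b) ** 2).sum())

bar = bar_image(10)
shifted_bar = bar_image(12)   # the same stroke, moved two pixels to the right
blank = np.zeros(1024)        # an image with no stroke at all

print(euclidean(bar, shifted_bar))  # distance to the same shape, shifted
print(euclidean(bar, blank))        # distance to an empty image
```

In this toy case the shifted copy of the stroke ends up farther away than the empty image, because the two bars share no pixels. A test set written with a different size or thickness can run into the same effect.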

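Recognizing handwriting with a neural network:

Training takes time, but once the model is trained the accuracy is high.

So kNN is best suited to classification on relatively small data sets.

Note: K-nearest neighbors is supervised learning, whereas K-Means is unsupervised learning.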
