Implementing the KNN (k-Nearest Neighbors) Algorithm in Python
阿新 • Published: 2021-01-01
Example: Movie Genre Classification
Obtaining the data
Movie | Fight scenes | Kiss scenes | Genre |
---|---|---|---|
California Man | 3 | 104 | Romance |
He's Not Really into Dudes | 8 | 95 | Romance |
Beautiful Woman | 1 | 81 | Romance |
Kevin Longblade | 111 | 15 | Action |
Robo Slayer 3000 | 99 | 2 | Action |
Amped II | 88 | 10 | Action |
Unknown | 18 | 90 | unknown |
Visualizing the data: can we judge the unknown movie's genre by eye?
```python
from matplotlib import pyplot as plt

# Needed only if the labels below use CJK characters
plt.rcParams["font.sans-serif"] = ["SimHei"]

# Movie names
names = ["California Man", "He's Not Really into Dudes", "Beautiful Woman",
         "Kevin Longblade", "Robo Slayer 3000", "Amped II", "Unknown"]
# Genre labels, one per movie
labels = ["Romance", "Romance", "Romance", "Action", "Action", "Action", "Unknown"]
colors = ["darkblue", "red", "green"]
# sorted() makes the label-to-color mapping deterministic
colorDict = {label: color for (label, color) in zip(sorted(set(labels)), colors)}
print(colorDict)

# Fight counts, kiss counts
X = [3, 8, 1, 111, 99, 88, 18]
Y = [104, 95, 81, 15, 2, 10, 90]

plt.title("Classifying movie genre by fight and kiss counts", fontsize=18)
plt.xlabel("Number of fight scenes in the movie", fontsize=16)
plt.ylabel("Number of kiss scenes in the movie", fontsize=16)

# Plot each movie as a point, colored by genre
for i in range(len(X)):
    plt.scatter(X[i], Y[i], color=colorDict[labels[i]])
    # Annotate each point with the movie name
    plt.text(X[i] + 2, Y[i] - 1, names[i], fontsize=14)
plt.show()
```
Problem analysis: determine the genre of the unknown movie from the labeled data.
Core idea:
The class of an unlabeled sample is decided by the classes of its K nearest neighbors.
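The core idea above can be sketched in a few lines (a minimal illustration, separate from the full implementation later in the article): compute the distance to every labeled point, take the k closest, and return the majority label.

```python
from collections import Counter

def knn_predict(test_point, points, labels, k=3):
    """Classify test_point by majority vote among its k nearest neighbors."""
    # Squared Euclidean distance to every training point
    dists = [sum((p - t) ** 2 for p, t in zip(pt, test_point)) for pt in points]
    # Indices of the k smallest distances
    nearest = sorted(range(len(points)), key=lambda i: dists[i])[:k]
    # Majority vote among those neighbors' labels
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

points = [(3, 104), (8, 95), (1, 81), (111, 15), (99, 2), (88, 10)]
labels = ["Romance", "Romance", "Romance", "Action", "Action", "Action"]
print(knn_predict((18, 90), points, labels, k=3))  # → Romance
```

The three closest movies to (18, 90) are all Romance titles, so the unknown movie is labeled Romance.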
Distance metrics:
Distances are usually computed with the Euclidean distance (the straight-line distance, via the Pythagorean theorem). The Manhattan distance (the sum of the horizontal and vertical distances) and cosine similarity (another way of expressing closeness) can also be used. The Mahalanobis distance is more precise than the above because it accounts for factors such as the scale of each feature, but it requires inverting the covariance matrix, which may not exist, and the computation becomes very cumbersome in three or more dimensions, so it is usually not used here.
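As a quick illustration of the metrics just mentioned (a minimal sketch, not part of the original code), the Euclidean distance, Manhattan distance, and cosine similarity between two feature vectors can be computed with NumPy:

```python
import numpy as np

a = np.array([3, 104])   # California Man (fights, kisses)
b = np.array([18, 90])   # the unknown movie

# Euclidean distance: square root of the sum of squared differences
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Manhattan distance: sum of absolute differences
manhattan = np.sum(np.abs(a - b))

# Cosine similarity: dot product divided by the product of the norms
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean, manhattan, cosine)  # ≈ 20.52, 29, 0.986
```

Note that cosine similarity grows as the vectors get *closer* in direction, the opposite of a distance, so it must be inverted (e.g. `1 - cosine`) before being used as one.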
Background
- Mahalanobis distance: a distance between data points expressed in terms of their covariance
- Variance: the average of the squared distances from each point in a dataset to the mean
- Standard deviation: the square root of the variance
- Covariance cov(x,y): with E the mean, D the variance, x and y two datasets, and xy the dataset of element-wise products:
  cov(x,y) = E(xy) - E(x)·E(y)
  cov(x,x) = D(x)
  cov(x1+x2,y) = cov(x1,y) + cov(x2,y)
  cov(ax,by) = ab·cov(x,y)
- Covariance matrix: the matrix formed from all pairs of dimensions; with three dimensions a, b, c:

  Σ = | cov(a,a) cov(a,b) cov(a,c) |
      | cov(b,a) cov(b,b) cov(b,c) |
      | cov(c,a) cov(c,b) cov(c,c) |
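The covariance matrix Σ above, and the identities cov(x,x) = D(x) and cov(a,b) = cov(b,a), can be checked numerically with NumPy (a small sketch on random data, not from the original article):

```python
import numpy as np

rng = np.random.default_rng(0)
# Three dimensions a, b, c with 100 samples each (one row per dimension)
data = rng.normal(size=(3, 100))

# np.cov treats each row as one variable; the result is the 3x3 matrix Σ
sigma = np.cov(data)
print(sigma.shape)  # (3, 3)

# Diagonal entries are the variances: cov(x, x) = D(x)
assert np.allclose(np.diag(sigma), np.var(data, axis=1, ddof=1))
# The matrix is symmetric: cov(a, b) = cov(b, a)
assert np.allclose(sigma, sigma.T)
```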
Algorithm implementation: Euclidean distance
Code
```python
# Custom implementation: mytest1.py
import numpy as np

# Build the dataset
def createDataSet():
    features = np.array([[3, 104], [8, 95], [1, 81], [111, 15], [99, 2], [88, 10]])
    labels = ["Romance", "Romance", "Romance", "Action", "Action", "Action"]
    return features, labels

def knnClassify(testFeature, trainingSet, labels, k):
    """
    KNN implementation using Euclidean distance.
    :param testFeature: test sample, 1-D ndarray
    :param trainingSet: training set, 2-D ndarray
    :param labels: training labels, one per training sample
    :param k: number of neighbors, int
    :return: predicted label, same type as the elements of labels
    """
    dataSetSize = trainingSet.shape[0]
    # Tile testFeature so it can be subtracted from every training row at once:
    # diffMat holds the per-feature differences used by the Euclidean distance
    testFeatureArray = np.tile(testFeature, (dataSetSize, 1))
    diffMat = testFeatureArray - trainingSet
    # Square each difference
    sqDiffMat = diffMat ** 2
    # Sum the squared differences per row
    sqDistances = sqDiffMat.sum(axis=1)
    # Euclidean distance from testFeature to every training sample
    distances = sqDistances ** 0.5
    # argsort returns the indices that would sort distances ascending,
    # e.g. distances = [5, 9, 2] gives sortedDistances = [2, 0, 1]
    sortedDistances = distances.argsort()
    # Vote among the k nearest neighbors
    classCount = {}
    for i in range(k):
        voteLabel = labels[sortedDistances[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    # Sort the vote counts descending and return the winning label
    sortedClassCount = sorted(classCount.items(), key=lambda x: x[1], reverse=True)
    return sortedClassCount[0][0]

testFeature = np.array([18, 90])  # the unknown movie
features, labels = createDataSet()
res = knnClassify(testFeature, features, labels, 3)
print(res)
```

```python
# Implementation with scikit-learn: mytest2.py
from sklearn.neighbors import KNeighborsClassifier
from mytest1 import createDataSet

features, labels = createDataSet()
k = 5
clf = KNeighborsClassifier(n_neighbors=k)
clf.fit(features, labels)

# Sample to classify
my_sample = [[18, 90]]
res = clf.predict(my_sample)
print(res)
```
Example: Predicting Dating-Site Match Quality
Data source: omitted
Visualizing the data
```python
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the "3d" projection

# Load the data
def loadDatingData(file):
    datingData = pd.read_table(file, header=None)
    datingData.columns = ["FlightDistance", "PlaytimePreweek", "IcecreamCostPreweek", "label"]
    datingTrainData = np.array(datingData[["FlightDistance", "PlaytimePreweek", "IcecreamCostPreweek"]])
    datingTrainLabel = np.array(datingData["label"])
    return datingData, datingTrainData, datingTrainLabel

# Show the data in a 3-D scatter plot
def dataView3D(datingTrainData, datingTrainLabel):
    plt.figure(1, figsize=(8, 3))
    ax = plt.subplot(111, projection="3d")
    # One scatter call per class, using a boolean mask to select its rows
    for label, color in [("smallDoses", "red"), ("didntLike", "green"), ("largeDoses", "blue")]:
        mask = datingTrainLabel == label
        ax.scatter(datingTrainData[mask, 0], datingTrainData[mask, 1],
                   datingTrainData[mask, 2], c=color, label=label)
    ax.set_xlabel("Flight miles", fontsize=16)
    ax.set_ylabel("Video game time (%)", fontsize=16)
    ax.set_zlabel("Ice cream consumed (liters)", fontsize=16)
    ax.legend()
    plt.show()

FILEPATH = "./datingTestSet1.txt"
datingData, datingTrainData, datingTrainLabel = loadDatingData(FILEPATH)
dataView3D(datingTrainData, datingTrainLabel)
```
Problem analysis: hold out the first 10% of the dataset as a test set and classify each test sample against the remaining 90%.
Code
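The split described above (first 10% held out for testing, remaining 90% used for training) comes down to simple array slicing; a minimal sketch on toy data:

```python
import numpy as np

data = np.arange(20).reshape(10, 2)   # 10 samples, 2 features (toy data)
labels = np.array([0, 1] * 5)

ratio = 0.10
total = data.shape[0]
numberTest = int(total * ratio)       # here: 1 sample

test_X, test_y = data[:numberTest], labels[:numberTest]
train_X, train_y = data[numberTest:], labels[numberTest:]

print(len(test_X), len(train_X))  # → 1 9
```

This only gives an unbiased error estimate if the file's rows are in random order; otherwise the data should be shuffled before slicing.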
```python
# Custom implementation
import pandas as pd
import numpy as np

# Load the data
def loadDatingData(file):
    datingData = pd.read_table(file, header=None)
    datingData.columns = ["FlightDistance", "PlaytimePreweek", "IcecreamCostPreweek", "label"]
    datingTrainData = np.array(datingData[["FlightDistance", "PlaytimePreweek", "IcecreamCostPreweek"]])
    datingTrainLabel = np.array(datingData["label"])
    return datingData, datingTrainData, datingTrainLabel

# Min-max normalization
def autoNorm(datingTrainData):
    # Column-wise minimum and maximum of the dataset
    minValues, maxValues = datingTrainData.min(0), datingTrainData.max(0)
    diffValues = maxValues - minValues
    # Build min and range matrices with the same shape as datingTrainData
    m = datingTrainData.shape[0]
    minValuesData = np.tile(minValues, (m, 1))
    diffValuesData = np.tile(diffValues, (m, 1))
    normValuesData = (datingTrainData - minValuesData) / diffValuesData
    return normValuesData

# Core algorithm
def KNNClassifier(testData, trainData, trainLabel, k):
    m = trainData.shape[0]
    testDataArray = np.tile(testData, (m, 1))
    diffDataArray = (testDataArray - trainData) ** 2
    # Euclidean distance to every training sample
    sumDataArray = diffDataArray.sum(axis=1) ** 0.5
    # Indices of the training samples sorted by distance, ascending
    sumDataSortedArray = sumDataArray.argsort()
    classCount = {}
    for i in range(k):
        labelName = trainLabel[sumDataSortedArray[i]]
        classCount[labelName] = classCount.get(labelName, 0) + 1
    # Sort the vote counts descending and return the winning label
    classCount = sorted(classCount.items(), key=lambda x: x[1], reverse=True)
    return classCount[0][0]

# Evaluate on the dataset
def datingTest(file):
    datingData, datingTrainData, datingTrainLabel = loadDatingData(file)
    normValuesData = autoNorm(datingTrainData)
    errorCount = 0
    ratio = 0.10
    total = datingTrainData.shape[0]
    numberTest = int(total * ratio)
    for i in range(numberTest):
        res = KNNClassifier(normValuesData[i], normValuesData[numberTest:total],
                            datingTrainLabel[numberTest:total], 5)
        if res != datingTrainLabel[i]:
            errorCount += 1
    print("The total error rate is : {}\n".format(errorCount / float(numberTest)))

if __name__ == "__main__":
    FILEPATH = "./datingTestSet1.txt"
    datingTest(FILEPATH)
```

```python
# Implementation with scikit-learn
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# loadDatingData and autoNorm are the same as in the custom implementation above

if __name__ == "__main__":
    FILEPATH = "./datingTestSet1.txt"
    datingData, datingTrainData, datingTrainLabel = loadDatingData(FILEPATH)
    normValuesData = autoNorm(datingTrainData)
    errorCount = 0
    ratio = 0.10
    total = normValuesData.shape[0]
    numberTest = int(total * ratio)
    k = 5
    clf = KNeighborsClassifier(n_neighbors=k)
    # Train on the last 90%, test on the first 10%
    clf.fit(normValuesData[numberTest:total], datingTrainLabel[numberTest:total])
    for i in range(numberTest):
        res = clf.predict(normValuesData[i].reshape(1, -1))
        if res != datingTrainLabel[i]:
            errorCount += 1
    print("The total error rate is : {}\n".format(errorCount / float(numberTest)))
```
That concludes this walkthrough of implementing the KNN algorithm in Python. For more material on KNN in Python, please see our other related articles!