
K_Nearest_Neighbor (kNN) Method and Its Python Implementation

  • 1. The Nearest Neighbor Algorithm
    • Given a training dataset and a new input instance, find the k instances in the training set that are nearest to it; whichever class the majority of these k instances belong to, assign the new instance to that class. When k = 1, this is called the nearest neighbor algorithm.
    • The three basic elements of the k-nearest neighbor algorithm:
      • Choice of k: decreasing k makes the overall model more complex and prone to overfitting; increasing k makes the model simpler. At k = N the model completely ignores the wealth of useful information in the training instances (at that point classification is based purely on the training labels: every new input is assigned to the most frequent class in the training set, which is clearly unreliable). In practice k is usually picked by comparing candidates on a held-out validation set, as sketched after this list.
        [Figure: decision boundaries of a 1-NN classifier vs. a 5-NN classifier]
        In the figure above, the NN Classifier is the nearest neighbor algorithm (k = 1). Note that it carves out separate regions for the few green outliers inside the blue cluster, so the model easily overfits and generalizes poorly. The 5-NN classifier on the far right smooths away these green outliers and generalizes better.
      • Distance metric
      • Classification decision rule: majority vote.
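    A minimal sketch of picking k on a held-out validation split (the choose_k helper and the candidate values here are illustrative assumptions, not part of the original post; KNearestNeighbor is the classifier implemented in section 2 below):

    import numpy as np

    def choose_k(model, X_val, Y_val, candidates=(1, 3, 5, 7, 9)):
        """Return the k from candidates with the best validation accuracy.
        Assumes model.train(...) has already been called."""
        best_k, best_acc = candidates[0], -1.0
        for k in candidates:
            acc = np.mean(model.predict(X_val, k=k) == Y_val)
            if acc > best_acc:
                best_k, best_acc = k, acc
        return best_k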
    • When the inputs are two images, they can be flattened into two vectors $I_1$ and $I_2$. Here we first use the L1 distance, $d_1(I_1, I_2) = \sum_p \lvert I_1^p - I_2^p \rvert$, where $p$ ranges over the pixels. The process is visualized as: [Figure: pixel-wise L1 distance between two images]
    • The L2 distance can also be used; its geometric meaning is the Euclidean distance between the two vectors: $d_2(I_1, I_2) = \sqrt{\sum_p \left(I_1^p - I_2^p\right)^2}$. Both metrics are one-liners in NumPy, as sketched below.
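    A minimal sketch of both distances on two flattened images (the random arrays are made-up stand-ins for real image data):

    import numpy as np

    I1 = np.random.randint(0, 256, size=14 * 14).astype(np.float64)  # flattened image 1
    I2 = np.random.randint(0, 256, size=14 * 14).astype(np.float64)  # flattened image 2

    d1 = np.sum(np.abs(I1 - I2))           # L1 (Manhattan) distance
    d2 = np.sqrt(np.sum((I1 - I2) ** 2))   # L2 (Euclidean) distance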
    • Pros and cons of the nearest neighbor algorithm

      • Pro: easy to implement and understand; no training required.
      • Con: prediction (test) time is too long, since every test sample must be compared against the entire training set.
      • Con: when the input is high-dimensional, e.g. large images, metrics like the L2 distance no longer correspond to perceptual similarity.
        [Figure: an original image and three altered versions that are all at the same L2 distance from it]
        Pixel-based distances on high-dimensional data are very unintuitive: in the figure above, the leftmost image is the original, and the three images to its right all have the same L2 distance to it, yet visually and semantically the three have little in common with each other. L1 and L2 distances correlate mostly with image background and overall color distribution. The shift example sketched below makes the same point.
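    One quick way to see this drawback (an illustrative example, not from the original post): shifting an image by two pixels leaves its semantics unchanged, yet the L2 distance to the original can be large.

    import numpy as np

    img = np.random.randint(0, 256, size=(14, 14)).astype(np.float64)
    shifted = np.roll(img, shift=2, axis=1)  # shift 2 pixels right; same content to a human

    print(np.sqrt(np.sum((img - shifted) ** 2)))  # large L2 distance despite identical semantics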
    • 2. Code Implementation
    #!/usr/bin/env python2
    # -*- coding: utf-8 -*-
    """
    Created on Thu Aug  2 09:46:44 2018
    @author: rd
    """
    from __future__ import division

    import numpy as np


    class KNearestNeighbor(object):
        """A kNN classifier with L2 distance."""

        def __init__(self):
            pass

        def train(self, X, Y):
            """In kNN, "training" just means storing the training data."""
            self.X_train = X
            self.Y_train = Y

        def predict(self, X, k=1, num_loops=0):
            if num_loops == 0:
                dists = self.compute_distances_no_loops(X)
            elif num_loops == 1:
                dists = self.compute_distances_one_loop(X)
            elif num_loops == 2:
                dists = self.compute_distances_two_loops(X)
            else:
                raise ValueError('Invalid value %d for num_loops' % num_loops)
            return self.predict_labels(dists, k=k)

        def compute_distances_two_loops(self, X):
            """Squared L2 distances with an explicit double loop."""
            num_test = X.shape[0]
            num_train = self.X_train.shape[0]
            dists = np.zeros((num_test, num_train))
            for i in range(num_test):
                for j in range(num_train):
                    dists[i][j] = np.sum(np.square(self.X_train[j, :] - X[i, :]))
            return dists

        def compute_distances_one_loop(self, X):
            """Squared L2 distances, vectorized over the training set."""
            num_test = X.shape[0]
            num_train = self.X_train.shape[0]
            dists = np.zeros((num_test, num_train))
            for i in range(num_test):
                dists[i] = np.sum(np.square(self.X_train - X[i]), axis=1)
            return dists

        def compute_distances_no_loops(self, X):
            """Fully vectorized squared L2 distances, using the expansion
            (x - y)^2 = x^2 + y^2 - 2xy and broadcasting."""
            squa_sum_X = np.sum(np.square(X), axis=1).reshape(-1, 1)
            squa_sum_Xtr = np.sum(np.square(self.X_train), axis=1)
            inner_prod = np.dot(X, self.X_train.T)
            dists = -2 * inner_prod + squa_sum_X + squa_sum_Xtr
            return dists

        def predict_labels(self, dists, k=1):
            num_test = dists.shape[0]
            y_pred = np.zeros(num_test)
            for i in range(num_test):
                pos = np.argsort(dists[i])[:k]  # indices of the k nearest neighbors
                closest_y = self.Y_train[pos]
                y_pred[i] = np.argmax(np.bincount(closest_y.astype(int)))  # majority vote
            return y_pred


    def load_data(split_ratio):
        """This dataset is part of MNIST, but with only 3 classes,
        classes = {0:'0', 1:'1', 2:'2'}. Images are compressed to 14*14 pixels
        and stored in a matrix together with the corresponding label, so the
        data matrix has shape num_of_images x (14*14 pixels + 1 label)."""
        tmp = np.load("data216x197.npy")
        data = tmp[:, :-1]
        label = tmp[:, -1]
        mean_data = np.mean(data, axis=0)
        train_data = data[int(split_ratio * data.shape[0]):] - mean_data
        train_label = label[int(split_ratio * data.shape[0]):]
        test_data = data[:int(split_ratio * data.shape[0])] - mean_data
        test_label = label[:int(split_ratio * data.shape[0])]
        return train_data, train_label, test_data, test_label


    def main():
        train_data, train_label, test_data, test_label = load_data(0.4)
        knn = KNearestNeighbor()
        knn.train(train_data, train_label)
        Yte = knn.predict(test_data, k=2)
        print "The accuracy is {}".format(np.mean(Yte == test_label))


    if __name__ == "__main__":
        main()

    >>> python knn.py
    The accuracy is 0.976744186047
    # The dataset is small and the images are single-channel and tiny, so the classification result is quite good.
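    As a sanity check on the broadcasting trick in compute_distances_no_loops, the three distance routines should produce the same matrix. A minimal sketch with made-up random data, assuming the class above is in scope:

    import numpy as np

    knn = KNearestNeighbor()
    knn.train(np.random.rand(50, 196), np.random.randint(0, 3, 50).astype(float))
    X_test = np.random.rand(10, 196)

    d0 = knn.compute_distances_no_loops(X_test)
    d1 = knn.compute_distances_one_loop(X_test)
    d2 = knn.compute_distances_two_loops(X_test)
    print(np.allclose(d0, d1) and np.allclose(d0, d2))  # expected: True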