
Implementing kNN from scratch (Python)

Preface:

kNN has two hyperparameters (choices about the algorithm that we set rather than learn; they are very problem-dependent, so you must try them all out and see what works best): first, which distance metric to use, and second, how to choose k.
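That "try them all out" advice is easy to act on. Here is a minimal sketch of a search over k (my addition, using scikit-learn's built-in KNeighborsClassifier rather than the hand-rolled classifier below), scoring a few candidate values with 5-fold cross-validation:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
for k in (1, 3, 5, 7, 9):
    # mean accuracy over 5 folds for this choice of k
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             iris.data, iris.target, cv=5)
    print('k =', k, ', mean accuracy =', scores.mean())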

Two distance metrics are covered here: the Manhattan metric (the sum of absolute coordinate differences), also called L1, and the Euclidean metric (the square root of the sum of squared differences), i.e. L2.

Each has its own areas of application; I am still learning this myself and am not entirely clear on the details, but as the instructor put it, L1 is coordinate-dependent. For example, if you have a vector describing an employee, whose elements capture different attributes such as salary, age, and gender, then the individual coordinates carry meaning, and the L1 distance is worth considering.
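To make the two metrics concrete, a quick sketch on a single pair of vectors (numpy assumed; the employee numbers are made up):

import numpy as np

x = np.array([35000., 42., 1.])  # hypothetical employee vector: salary, age, gender
y = np.array([38000., 35., 0.])

l1 = np.abs(x - y).sum()             # Manhattan/L1: sum of absolute coordinate differences
l2 = np.sqrt(((x - y) ** 2).sum())   # Euclidean/L2: straight-line distance

print(l1, l2)  # 3008.0 vs. ~3000.008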

 

Code implementation:

    Steps:

                1. Load the dataset

                2. Split the data into training and test sets

                3. Compute the distance between each test instance and the training set

                4. Select the k nearest neighbours by distance

                5. Let the k neighbours vote on the label (a plain majority vote; this handles the three-class iris data just as well as binary kNN)

    Implementation:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from collections import Counter
from operator import itemgetter
import numpy as np
import math


# 1) given two data points, calculate the euclidean distance between them
def get_distance(data1, data2):
    points = zip(data1, data2)
    diffs_squared_distance = [pow(a - b, 2) for (a, b) in points]
    return math.sqrt(sum(diffs_squared_distance))
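# Aside (added, not in the original post): swapping in the Manhattan (L1)
# metric from the preface only changes the accumulation, e.g.:
#
#     def get_manhattan_distance(data1, data2):
#         return sum(abs(a - b) for a, b in zip(data1, data2))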

 
# 2) given a training set and a test instance, use getDistance to calculate all pairwise distances
def get_neighbours(training_set, test_instance, k):
    distances = [_get_tuple_distance(training_instance, test_instance) for training_instance in training_set]
 
    # index 1 is the calculated distance between training_instance and test_instance
    sorted_distances = sorted(distances, key=itemgetter(1))
 
    # extract only training instances without distance
    sorted_training_instances = [pair[0] for pair in sorted_distances]
 
    # select first k elements
    return sorted_training_instances[:k]

def _get_tuple_distance(training_instance, test_instance):
    # pair each training instance with its distance to the test instance;
    # training_instance[0] is the feature vector, training_instance[1] the label
    return (training_instance, get_distance(test_instance, training_instance[0]))



def get_majority_vote(neighbours):
    # index 1 is the class
    classes = [neighbour[1] for neighbour in neighbours]
    count = Counter(classes)
    return count.most_common(1)[0][0]
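# Note (added): most_common() orders equal counts arbitrarily (by first
# insertion), so a tie among the k neighbours is broken silently; choosing
# an odd k avoids ties in the two-class case.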


def main():
 
    # load the data and create the training and test sets;
    # random_state=1 seeds the split so the same train/test partition
    # can be reproduced in later experiments
    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=1)
 
    # reformat the train/test datasets for convenience: each element is a
    # (feature_vector, label) pair; plain lists avoid numpy's restrictions
    # on ragged object arrays
    train = list(zip(X_train, y_train))
    test = list(zip(X_test, y_test))

    # train and test now have the following structure:
    """
        [(array([5.8, 2.8, 5.1, 2.4]), 2),
         (array([6. , 2.2, 4. , 1. ]), 1),
         (array([5.5, 4.2, 1.4, 0.2]), 0), ...]
    """

 
    # generate predictions
    predictions = []
 
    # arbitrarily set k to 5, meaning that the 5 nearest neighbours
    # vote on the predicted class of each new instance
    k = 5
 
    # for each instance in the test set, get nearest neighbours and majority vote on predicted class
    for x in range(len(X_test)):
        print('Classifying test instance number', x, ':')
        neighbours = get_neighbours(training_set=train, test_instance=test[x][0], k=k)
        majority_vote = get_majority_vote(neighbours)
        predictions.append(majority_vote)
        print('Predicted label =', majority_vote, ', actual label =', test[x][1])
            
    # summarize performance of the classification
    # (accuracy_score expects 1-d label arrays, so no reshape is needed)
    print('The overall accuracy of the model is:', accuracy_score(y_test, predictions))
    report = classification_report(y_test, predictions, target_names=iris.target_names)
    print('A detailed classification report:\n\n', report)
 
if __name__ == "__main__":
    main()
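As a follow-up, here is a sketch of a fully vectorized predictor (my addition; it assumes the raw X_train/y_train/X_test arrays from train_test_split rather than the zipped lists above). It computes all test-to-train distances in one broadcast and should reproduce the loop-based predictions on the same random_state=1 split:

import numpy as np

def predict_vectorized(X_train, y_train, X_test, k=5):
    # (n_test, n_train) matrix of Euclidean distances via broadcasting
    dists = np.sqrt(((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2))
    # indices of the k nearest training points for every test point
    nearest = np.argsort(dists, axis=1)[:, :k]
    # majority vote over the neighbours' labels, row by row
    return np.array([np.bincount(y_train[row]).argmax() for row in nearest])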