Implementing kNN from scratch (Python)
阿新 • Published: 2018-11-19
Preface:
kNN has two hyperparameters (choices about the algorithm that we set rather than learn; they are very problem-dependent, so we must try them all out and see what works best): first, which distance metric to use, and second, how to choose k (a simple way to pick k is sketched right below).
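Here is a minimal sketch of how one might pick k: hold out a validation split and try several candidate values. For brevity it uses scikit-learn's KNeighborsClassifier rather than the hand-rolled code further down, and the candidate values of k are chosen arbitrarily for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
# hold out a validation set: hyperparameters should never be tuned on the test set
X_train, X_val, y_train, y_val = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=1)

for k in (1, 3, 5, 7, 9):  # candidate values of k, purely illustrative
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print('k =', k, 'validation accuracy =', clf.score(X_val, y_val))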
Here I'll mention two distance metrics: the Manhattan metric, also called L1, and the Euclidean metric, i.e. L2.
Each of these two distances has its own areas of application. I'm still learning this myself and am not entirely clear on the details, but as the instructor put it, L1 is coordinate-dependent. For example, if you have a vector describing an employee whose elements capture different aspects such as salary, age, and gender, then the individual coordinates carry meaning on their own, and in that case the L1 distance is worth considering (see the sketch below).
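To make the two metrics concrete, here is a minimal sketch computing L1(x, y) = Σ|x_i - y_i| and L2(x, y) = sqrt(Σ(x_i - y_i)²) for two points; the helper names l1_distance and l2_distance are just illustrative:

import numpy as np

def l1_distance(x, y):
    # Manhattan / L1: sum of absolute coordinate-wise differences
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)))

def l2_distance(x, y):
    # Euclidean / L2: square root of the sum of squared differences
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

x = [5.1, 3.5, 1.4, 0.2]
y = [6.2, 2.9, 4.3, 1.3]
print(l1_distance(x, y))  # 5.7
print(l2_distance(x, y))  # about 3.35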
Code implementation:
Steps:
1. Load the dataset
2. Split the dataset into training and test sets
3. Compute the distance between each test instance and every instance in the training set
4. Sort by distance and pick the k nearest neighbours
5. Let the k neighbours vote on the class by simple majority (iris has three classes, and majority voting handles the multiclass case just as well as the binary one)
Implementation:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from collections import Counter
from operator import itemgetter
import math

# 1) given two data points, calculate the euclidean distance between them
def get_distance(data1, data2):
    points = zip(data1, data2)
    diffs_squared_distance = [pow(a - b, 2) for (a, b) in points]
    return math.sqrt(sum(diffs_squared_distance))

# 2) given a training set and a test instance, use get_distance to calculate all pairwise distances
def get_neighbours(training_set, test_instance, k):
    distances = [_get_tuple_distance(training_instance, test_instance)
                 for training_instance in training_set]
    # index 1 is the calculated distance between training_instance and test_instance
    sorted_distances = sorted(distances, key=itemgetter(1))
    # extract only the training instances, dropping the distances
    sorted_training_instances = [pair[0] for pair in sorted_distances]
    # select the first k elements
    return sorted_training_instances[:k]

def _get_tuple_distance(training_instance, test_instance):
    # training_instance is a (features, label) pair; index 0 is the feature vector
    return (training_instance, get_distance(test_instance, training_instance[0]))

def get_majority_vote(neighbours):
    # index 1 is the class label
    classes = [neighbour[1] for neighbour in neighbours]
    count = Counter(classes)
    return count.most_common(1)[0][0]

def main():
    # load the data and create the training and test sets
    # random_state=1 is just a seed to permit reproducibility of the train/test split,
    # i.e. fixing the seed lets the same random split be reproduced in later experiments
    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.4, random_state=1)

    # reformat the train/test datasets as (features, label) pairs for convenience
    train = list(zip(X_train, y_train))
    test = list(zip(X_test, y_test))
    # train and test now have the following structure:
    # [(array([5.8, 2.8, 5.1, 2.4]), 2),
    #  (array([6. , 2.2, 4. , 1. ]), 1),
    #  (array([5.5, 4.2, 1.4, 0.2]), 0), ...]

    # generate predictions
    predictions = []

    # let's arbitrarily set k equal to 5, meaning that to predict the class of new instances,
    k = 5

    # for each instance in the test set, get the nearest neighbours and take a majority vote on the predicted class
    for x in range(len(X_test)):
        print('Classifying test instance number', str(x), ':')
        neighbours = get_neighbours(training_set=train, test_instance=test[x][0], k=k)
        majority_vote = get_majority_vote(neighbours)
        predictions.append(majority_vote)
        print('Predicted label =', str(majority_vote), ', Actual label =', str(test[x][1]))

    # summarize the performance of the classification
    print('The overall accuracy of the model is:', accuracy_score(y_test, predictions))
    report = classification_report(y_test, predictions, target_names=iris.target_names)
    print('A detailed classification report: \n\n', report)

if __name__ == "__main__":
    main()
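As a sanity check, the hand-rolled classifier above can be compared against scikit-learn's built-in KNeighborsClassifier on the same split; with the same k = 5 and the default Euclidean metric the two should closely agree (tie-breaking among equidistant neighbours may differ):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=1)

# n_neighbors=5 matches k = 5 above; the default metric is Euclidean (minkowski with p=2)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print('sklearn accuracy:', accuracy_score(y_test, clf.predict(X_test)))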