Implementing kNN from scratch for image classification (Python)
This is an assignment from Stanford's CS231n course. Following the hints for assignment1 in the course outline, download the dataset the exercise requires.
Preface:
1. A standard preprocessing step for kNN is to normalize the data. For images, you can think of each pixel as one feature; since pixel values are homogeneous (they all lie in the same range) and do not have widely different distributions, data normalization is not needed here.
2. In practical uses of kNN, the second preprocessing step is dimensionality reduction. kNN votes based on a distance metric, and distances become counter-intuitive in high dimensions; there are also mathematical reasons why they work poorly there (see Section 6 of the referenced article). This exercise is only for learning kNN; in practice kNN is never used for image classification, for the reason given in point 3.
3. kNN's strengths are simplicity and intuitiveness. Its weakness is that while it spends almost no time on training (training merely caches the training set, as shown below), it spends a great deal of time on prediction. There are extended algorithms that trade some accuracy for faster prediction, such as Approximate Nearest Neighbor (ANN).
4. You may want to start with a very simple kNN example first; a minimal sketch follows below.
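Here is a minimal sketch of the core idea on a hypothetical toy dataset (all names and values are illustrative, not part of the assignment):

import numpy as np

# Toy training set: 4 points in 2-D with two classes.
X_train = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
y_train = np.array([0, 0, 1, 1])

def knn_predict(x, k=3):
    """Predict the label of a single point x by majority vote of its k nearest neighbors."""
    dists = np.sqrt(np.sum((X_train - x)**2, axis=1))  # L2 distance to every training point
    nearest = np.argsort(dists)[:k]                    # indices of the k closest points
    return np.argmax(np.bincount(y_train[nearest]))    # most common label among them

print(knn_predict(np.array([0., 0.5])))  # 0: the two class-0 points are closest
print(knn_predict(np.array([5., 5.5])))  # 1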
Main code:
Part 1: Loading and preprocessing the data
# Run some setup code for this notebook.
from __future__ import print_function

import random
import numpy as np
from cs231n.data_utils import load_CIFAR10
import matplotlib.pyplot as plt

# This is a bit of magic to make matplotlib figures appear inline in the notebook
# rather than in a new window.
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0)  # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# Some more magic so that the notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

# Load the raw CIFAR-10 data.
cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'

# Cleaning up variables to prevent loading data multiple times (which may cause memory issues)
try:
    del X_train, y_train
    del X_test, y_test
    print('Clear previously loaded data.')
except:
    pass

# load_CIFAR10 from cs231n.data_utils is implemented in section 3.1 below.
X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)

# As a sanity check, we print out the size of the training and test data.
print('Training data shape: ', X_train.shape)    # >> (50000, 32, 32, 3)
print('Training labels shape: ', y_train.shape)  # >> (50000,)
print('Test data shape: ', X_test.shape)         # >> (10000, 32, 32, 3)
print('Test labels shape: ', y_test.shape)       # >> (10000,)

# Subsample the data for more efficient code execution in this exercise.
num_training = 5000
mask = range(num_training)
# Either of the following two lines selects the first 5000 rows. This also
# works for higher-dimensional data: if the data has shape (10000, 32, 32, 3),
# the result after this indexing has shape (5000, 32, 32, 3).
# X_train = X_train[mask]
X_train = X_train[np.arange(num_training)]
y_train = y_train[mask]
print(X_train.shape)  # >> (5000, 32, 32, 3)

num_test = 500
mask = list(range(num_test))
X_test = X_test[mask]
y_test = y_test[mask]

# Reshape the image data into rows.
X_train = np.reshape(X_train, (X_train.shape[0], -1))
X_test = np.reshape(X_test, (X_test.shape[0], -1))
print(X_train.shape, X_test.shape)  # >> (5000, 3072) (500, 3072)
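A quick illustration of what the -1 argument in np.reshape does in the last step (toy shapes, just for intuition):

import numpy as np

a = np.zeros((5, 32, 32, 3))             # 5 RGB images of size 32x32
rows = np.reshape(a, (a.shape[0], -1))   # -1 lets numpy infer 32*32*3 = 3072
print(rows.shape)                         # (5, 3072): one image per row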
Part 2: kNN training and prediction
from cs231n.classifiers import KNearestNeighbor
# the Classifier simply remembers the data and does no further processing
# its implementation is shown in section 3 below
classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
# Test your implementation:
dists = classifier.compute_distances_two_loops(X_test)
print(dists.shape)
# We can visualize the distance matrix: each row is a single test example and
# its distances to training examples
#plt.imshow(dists, interpolation='none')
#plt.show()
# Now implement the function predict_labels and run the code below:
# We use k = 1 (which is Nearest Neighbor).
y_test_pred = classifier.predict_labels(dists, k=1)
# Compute and print the fraction of correctly predicted examples
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))
y_test_pred = classifier.predict_labels(dists, k=5)
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))
"""
The rest of this part computes dists in different ways and compares them; feel free to skip it.
"""
# Now lets speed up distance matrix computation by using partial vectorization
# with one loop. Implement the function compute_distances_one_loop and run the
# code below:
dists_one = classifier.compute_distances_one_loop(X_test)
# To ensure that our vectorized implementation is correct, we make sure that it
# agrees with the naive implementation. There are many ways to decide whether
# two matrices are similar; one of the simplest is the Frobenius norm. In case
# you haven't seen it before, the Frobenius norm of two matrices is the square
# root of the squared sum of differences of all elements; in other words, reshape
# the matrices into vectors and compute the Euclidean distance between them.
# The Frobenius norm of a matrix A (ord='fro') is the square root of the sum
# of the squared absolute values of its entries.
difference = np.linalg.norm(dists - dists_one, ord='fro')
print('Difference was: %f' % (difference, ))
if difference < 0.001:
    print('Good! The distance matrices are the same')
else:
    print('Uh-oh! The distance matrices are different')
# Now implement the fully vectorized version inside compute_distances_no_loops
# and run the code
dists_two = classifier.compute_distances_no_loops(X_test)
# check that the distance matrix agrees with the one we computed before:
difference = np.linalg.norm(dists - dists_two, ord='fro')
print('Difference was: %f' % (difference, ))
if difference < 0.001:
    print('Good! The distance matrices are the same')
else:
    print('Uh-oh! The distance matrices are different')
# Let's compare how fast the implementations are
def time_function(f, *args):
    """
    Call a function f with args and return the time (in seconds) that it took to execute.
    """
    import time
    tic = time.time()
    f(*args)
    toc = time.time()
    return toc - tic
two_loop_time = time_function(classifier.compute_distances_two_loops, X_test)
print('Two loop version took %f seconds' % two_loop_time)
one_loop_time = time_function(classifier.compute_distances_one_loop, X_test)
print('One loop version took %f seconds' % one_loop_time)
no_loop_time = time_function(classifier.compute_distances_no_loops, X_test)
print('No loop version took %f seconds' % no_loop_time)
# you should see significantly faster performance with the fully vectorized implementation
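As a side note on the correctness check above: for a matrix argument, np.linalg.norm with ord='fro' gives the same value as flattening the matrix and taking the ordinary Euclidean norm, which is exactly the equivalence the assignment comment describes. A tiny sanity check on toy values (not from the assignment):

import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[1., 2.], [3., 5.]])
print(np.linalg.norm(A - B, ord='fro'))  # 1.0
print(np.linalg.norm((A - B).ravel()))   # 1.0: same value on the flattened matrix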
Implementation
Computing the distance matrix nominally requires a double loop, but NumPy's broadcasting lets us cut it down to a single loop, or even compute the whole matrix at once.
3.1 Data loading
from __future__ import print_function
from six.moves import cPickle as pickle
import numpy as np
import os
import platform
def load_pickle(f):
    version = platform.python_version_tuple()
    if version[0] == '2':
        return pickle.load(f)
    elif version[0] == '3':
        return pickle.load(f, encoding='latin1')
    raise ValueError("invalid python version: {}".format(version))

def load_CIFAR_batch(filename):
    """ load single batch of cifar """
    with open(filename, 'rb') as f:
        datadict = load_pickle(f)
        X = datadict['data']
        Y = datadict['labels']
        X = X.reshape(10000, 3, 32, 32).transpose(0, 2, 3, 1).astype("float")
        Y = np.array(Y)
        return X, Y

def load_CIFAR10(ROOT):
    """ load all of cifar """
    xs = []
    ys = []
    for b in range(1, 6):
        f = os.path.join(ROOT, 'data_batch_%d' % (b, ))
        X, Y = load_CIFAR_batch(f)
        xs.append(X)
        ys.append(Y)
    Xtr = np.concatenate(xs)
    Ytr = np.concatenate(ys)
    del X, Y
    Xte, Yte = load_CIFAR_batch(os.path.join(ROOT, 'test_batch'))
    return Xtr, Ytr, Xte, Yte

def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000,
                     subtract_mean=True):
    """
    Load the CIFAR-10 dataset from disk and perform preprocessing to prepare
    it for classifiers. These are the same steps as we used for the SVM, but
    condensed to a single function.
    """
    # Load the raw CIFAR-10 data
    cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'
    X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)

    # Subsample the data
    mask = list(range(num_training, num_training + num_validation))
    X_val = X_train[mask]
    y_val = y_train[mask]
    mask = list(range(num_training))
    X_train = X_train[mask]
    y_train = y_train[mask]
    mask = list(range(num_test))
    X_test = X_test[mask]
    y_test = y_test[mask]

    # Normalize the data: subtract the mean image
    if subtract_mean:
        mean_image = np.mean(X_train, axis=0)
        X_train -= mean_image
        X_val -= mean_image
        X_test -= mean_image

    # Transpose so that channels come first
    X_train = X_train.transpose(0, 3, 1, 2).copy()
    X_val = X_val.transpose(0, 3, 1, 2).copy()
    X_test = X_test.transpose(0, 3, 1, 2).copy()

    # Package data into a dictionary
    return {
        'X_train': X_train, 'y_train': y_train,
        'X_val': X_val, 'y_val': y_val,
        'X_test': X_test, 'y_test': y_test,
    }
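A small illustration of how the mean-image subtraction above broadcasts (hypothetical toy shapes, purely for intuition):

import numpy as np

X = np.random.rand(4, 32, 32, 3)   # 4 images
mean_image = np.mean(X, axis=0)    # per-pixel mean over images, shape (32, 32, 3)
X -= mean_image                    # broadcasts across the first (image) axis
print(mean_image.shape, X.shape)   # (32, 32, 3) (4, 32, 32, 3)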
3.2 Computing dists (three implementations, from easiest to hardest, all based on the same principle)
Two loops (the simplest and most direct implementation)
"""test for two loops
x=np.array([[1,2],[3,1]])
y=np.array([2,2]).reshape(1,2)
a=np.zeros((y.shape[0],x.shape[0]))
for i in range(y.shape[0]):
for j in range(x.shape[0]):
a[i,j]=np.sum((y[i,:]-x[j,:])**2)
"""
def compute_distances_two_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a nested loop over both the training data and the
    test data.

    Inputs:
    - X: A numpy array of shape (num_test, D) containing test data.

    Returns:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      is the Euclidean distance between the ith test point and the jth training
      point.
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        for j in range(num_train):
            #####################################################################
            # TODO:                                                             #
            # Compute the l2 distance between the ith test point and the jth   #
            # training point, and store the result in dists[i, j]. You should  #
            # not use a loop over dimension.                                    #
            #####################################################################
            # L2 distance.
            dists[i, j] = np.sqrt(np.sum((X[i, :] - self.X_train[j, :])**2))
            #####################################################################
            #                        END OF YOUR CODE                           #
            #####################################################################
    return dists
One loop
# test for one loop
# x = np.array([[1, 2], [3, 1]])  # training set
# y = np.array([[2, 2], [3, 3]])  # test set
# a = np.zeros((y.shape[0], x.shape[0]))
"""
# Row 0 of y is broadcast against every row of x:
y[0, :] - x
Out[19]:
array([[ 1,  0],
       [-1,  1]])
# axis=1 sums along each row; omitting axis would sum over the whole matrix:
np.sum((y[0, :] - x), axis=1)
Out[21]: array([1, 0])
"""
# for i in range(y.shape[0]):
#     a[i, :] = np.sum((y[i, :] - x)**2, axis=1)
def compute_distances_one_loop(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a single loop over the test data.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        #######################################################################
        # TODO:                                                               #
        # Compute the l2 distance between the ith test point and all training #
        # points, and store the result in dists[i, :].                        #
        #######################################################################
        # L2 distance, broadcasting X[i, :] against every training row.
        dists[i, :] = np.sqrt(np.sum((X[i, :] - self.X_train)**2, axis=1))
        #######################################################################
        #                         END OF YOUR CODE                            #
        #######################################################################
    return dists
No loops (uses the identity (x-y)^2 = x^2 + y^2 - 2xy; note that we are working with matrices here, so the formula needs a little adaptation)
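To see why this works, expand the squared L2 distance between a test point $x_i$ and a training point $y_j$; this is exactly what the code below computes, for all pairs at once:

$$\lVert x_i - y_j \rVert_2^2 = \lVert x_i \rVert_2^2 + \lVert y_j \rVert_2^2 - 2\, x_i \cdot y_j$$

Stacking all test points into X and all training points into self.X_train, the first term becomes a column of squared test norms, the second a row of squared training norms, and the third the matrix product X·X_trainᵀ; broadcasting assembles the full (num_test, num_train) matrix of squared distances in one step, and a final element-wise square root gives dists.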
def compute_distances_no_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using no explicit loops.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    #########################################################################
    # TODO:                                                                 #
    # Compute the l2 distance between all test points and all training     #
    # points without using any explicit loops, and store the result in     #
    # dists.                                                                #
    #                                                                       #
    # You should implement this function using only basic array operations; #
    # in particular you should not use functions from scipy.                #
    #                                                                       #
    # HINT: Try to formulate the l2 distance using matrix multiplication    #
    #       and two broadcast sums.                                         #
    #########################################################################
    """
    # using the y from the one-loop test above:
    np.sum(y**2, axis=1)
    Out[23]: array([ 8, 18])
    """
    # L2 distance, vectorized.
    X_squared = np.sum(X**2, axis=1)             # shape (num_test,)
    Y_squared = np.sum(self.X_train**2, axis=1)  # shape (num_train,)
    XY = np.dot(X, self.X_train.T)               # shape (num_test, num_train)
    # Expand the L2 distance formula: L2(X, Y) = sqrt((X - Y)^2) = sqrt(X^2 + Y^2 - 2XY)
    """
    X_squared[:, np.newaxis] adds an axis, e.g. shape (10,) becomes (10, 1);
    intuitively this acts like a transpose. Note that transposing a 1-D numpy
    array returns the array unchanged (a 1-D array has no second axis), so the
    .T operation cannot be used here.
    """
    dists = np.sqrt(X_squared[:, np.newaxis] + Y_squared - 2 * XY)
    # Also useful: https://medium.com/dataholiks-distillery/l2-distance-matrix-vectorization-trick-26aa3247ac6c
    #########################################################################
    #                         END OF YOUR CODE                              #
    #########################################################################
    return dists
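A quick self-check that the vectorized formula matches the direct definition, on random data (illustrative only, not part of the assignment):

import numpy as np

X = np.random.rand(3, 4)  # 3 "test" points
Y = np.random.rand(5, 4)  # 5 "training" points
direct = np.sqrt(((X[:, np.newaxis, :] - Y[np.newaxis, :, :])**2).sum(axis=2))
vec = np.sqrt((X**2).sum(1)[:, np.newaxis] + (Y**2).sum(1) - 2 * X.dot(Y.T))
print(np.allclose(direct, vec))  # True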
3.3 Prediction
def predict_labels(self, dists, k=1):
    """
    Given a matrix of distances between test points and training points,
    predict a label for each test point.

    Inputs:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      gives the distance between the ith test point and the jth training point.

    Returns:
    - y: A numpy array of shape (num_test,) containing predicted labels for the
      test data, where y[i] is the predicted label for the test point X[i].
    """
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)
    for i in range(num_test):
        # A list of length k storing the labels of the k nearest neighbors to
        # the ith test point.
        closest_y = []
        #########################################################################
        # TODO:                                                                 #
        # Use the distance matrix to find the k nearest neighbors of the ith   #
        # testing point, and use self.y_train to find the labels of these      #
        # neighbors. Store these labels in closest_y.                          #
        # Hint: Look up the function numpy.argsort.                            #
        #########################################################################
        # Select a test row.
        test_row = dists[i, :]
        # np.argsort returns the indices that would sort the input in
        # ascending order.
        sorted_row = np.argsort(test_row)
        # Get the labels of the k closest training points.
        closest_y = self.y_train[sorted_row[0:k]]
        #########################################################################
        # TODO:                                                                 #
        # Now that you have found the labels of the k nearest neighbors, you   #
        # need to find the most common label in the list closest_y of labels.  #
        # Store this label in y_pred[i]. Break ties by choosing the smaller    #
        # label.                                                                #
        #########################################################################
        # Find the most frequent label among the k closest neighbors.
        """
        np.bincount([2, 2, 2, 2, 1, 1, 1])
        Out[54]: array([0, 3, 4])  # 0 occurs 0 times, 1 three times, 2 four times
        """
        y_pred[i] = np.argmax(np.bincount(closest_y))
        #########################################################################
        #                           END OF YOUR CODE                            #
        #########################################################################
    return y_pred
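The tie-breaking requirement in the TODO comes for free here: np.argmax returns the first index at which the maximum occurs, so when two labels tie, the smaller label wins. A quick check on toy votes:

import numpy as np

votes = np.array([1, 1, 2, 2])        # labels 1 and 2 tied with two votes each
print(np.bincount(votes))             # [0 2 2]
print(np.argmax(np.bincount(votes)))  # 1: the smaller of the tied labels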
"""
或者寫成彙總的形式
"""
def predict(self, X, k=1, num_loops=0):
    """
    Predict labels for test data using this classifier.

    Inputs:
    - X: A numpy array of shape (num_test, D) containing test data consisting
      of num_test samples each of dimension D.
    - k: The number of nearest neighbors that vote for the predicted labels.
    - num_loops: Determines which implementation to use to compute distances
      between training points and testing points.

    Returns:
    - y: A numpy array of shape (num_test,) containing predicted labels for the
      test data, where y[i] is the predicted label for the test point X[i].
    """
    if num_loops == 0:
        dists = self.compute_distances_no_loops(X)
    elif num_loops == 1:
        dists = self.compute_distances_one_loop(X)
    elif num_loops == 2:
        dists = self.compute_distances_two_loops(X)
    else:
        raise ValueError('Invalid value %d for num_loops' % num_loops)
    return self.predict_labels(dists, k=k)
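A minimal end-to-end usage sketch of the assembled classifier on hypothetical toy data (assuming train() simply stores X_train and y_train, as in the cs231n version):

import numpy as np

knn = KNearestNeighbor()
# toy 2-class problem in 2-D, purely illustrative
knn.train(np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]]),
          np.array([0, 0, 1, 1]))
# num_loops=0 selects the fully vectorized distance computation
print(knn.predict(np.array([[0., 0.5], [5., 5.5]]), k=3, num_loops=0))  # [0. 1.]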
Choosing hyperparameters via cross-validation
Why do we need a validation set? Because there are hyperparameters to set, and we need some way to judge how good a given choice is; that is what the validation set (a portion carved out of the training set) is for. Keep in mind that the test set should be treated as if it belonged to someone else and were only used to grade your finished model: you may touch it only once, just before releasing the model, and the accuracy you measure then is the accuracy you report.

Cross-validation works as follows: split the training set into several folds, say 5. First use fold 1 as the validation set and the remaining folds as the training set to score the current choice of k; then use fold 2 as the validation set, and so on, for 5 rounds in total.

In the end, each value of k receives 5 scores measuring how well it performs.

The results are visualized below: each vertical bar shows the standard deviation of the 5 accuracies for one value of k, and the trend line connects their means.

In practice, cross-validation is often avoided because it is computationally expensive; but when the training set is small, it is a good way to tune hyperparameters.
In particular, we cannot use the test set for the purpose of tweaking hyperparameters.
Luckily, there is a correct way of tuning the hyperparameters and it does not touch the test set at all. The idea is to split our training set in two: a slightly smaller training set, and what we call a validation set.
Cross-validation. In cases where the size of your training data (and therefore also the validation data) might be small, people sometimes use a more sophisticated technique for hyperparameter tuning called cross-validation.
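Before the actual code, here is a toy illustration of the np.array_split hint used below (illustrative values only):

import numpy as np

data = np.arange(10)
folds = np.array_split(data, 5)  # 5 folds of 2 elements each
print(folds[0])                  # [0 1]
# training data for round n = everything except fold n:
n = 0
train = np.concatenate([f for i, f in enumerate(folds) if i != n])
print(train)                     # [2 3 4 5 6 7 8 9]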
num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]
X_train_folds = []
y_train_folds = []
################################################################################
# TODO: #
# Split up the training data into folds. After splitting, X_train_folds and #
# y_train_folds should each be lists of length num_folds, where #
# y_train_folds[i] is the label vector for the points in X_train_folds[i]. #
# Hint: Look up the numpy array_split function. #
################################################################################
X_train_folds = np.array_split(X_train,num_folds)
y_train_folds = np.array_split(y_train,num_folds)
################################################################################
# END OF YOUR CODE #
################################################################################
# A dictionary holding the accuracies for different values of k that we find
# when running cross-validation. After running cross-validation,
# k_to_accuracies[k] should be a list of length num_folds giving the different
# accuracy values that we found when using that value of k.
k_to_accuracies = {}
################################################################################
# TODO: #
# Perform k-fold cross validation to find the best value of k. For each #
# possible value of k, run the k-nearest-neighbor algorithm num_folds times, #
# where in each case you use all but one of the folds as training data and the #
# last fold as a validation set. Store the accuracies for all fold and all #
# values of k in the k_to_accuracies dictionary. #
################################################################################
for k in k_choices:
    for n in range(num_folds):
        # Concatenate all folds except the nth, which is held out, for training.
        current_train_fold_x = np.concatenate([X_train_folds[i] for i in range(num_folds) if i != n])
        current_train_fold_y = np.concatenate([y_train_folds[i] for i in range(num_folds) if i != n])
        # Select the held-out fold to be our validation data.
        current_test_fold_x = X_train_folds[n]
        current_test_fold_y = y_train_folds[n]
        classifier.train(current_train_fold_x, current_train_fold_y)
        # Perform prediction on the held-out fold; the default is the no-loop version.
        y_test_pred = classifier.predict(current_test_fold_x, k=k)
        # Evaluate and store in the k_to_accuracies dict.
        num_correct = np.sum(y_test_pred == current_test_fold_y)
        if k not in k_to_accuracies:
            k_to_accuracies[k] = [float(num_correct) / current_test_fold_x.shape[0]]
        else:
            k_to_accuracies[k].append(float(num_correct) / current_test_fold_x.shape[0])
################################################################################
# END OF YOUR CODE #
################################################################################
# Print out the computed accuracies
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))
"""
# output
k = 1, accuracy = 0.263000
k = 1, accuracy = 0.257000
k = 1, accuracy = 0.264000
k = 1, accuracy = 0.278000
k = 1, accuracy = 0.266000
k = 3, accuracy = 0.239000
k = 3, accuracy = 0.249000
k = 3, accuracy = 0.240000
k = 3, accuracy = 0.266000
k = 3, accuracy = 0.254000
k = 5, accuracy = 0.248000
k = 5, accuracy = 0.266000
k = 5, accuracy = 0.280000
k = 5, accuracy = 0.292000
k = 5, accuracy = 0.280000
k = 8, accuracy = 0.262000
k = 8, accuracy = 0.282000
k = 8, accuracy = 0.273000
k = 8, accuracy = 0.290000
k = 8, accuracy = 0.273000
k = 10, accuracy = 0.265000
k = 10, accuracy = 0.296000
k = 10, accuracy = 0.276000
k = 10, accuracy = 0.284000
k = 10, accuracy = 0.280000
k = 12, accuracy = 0.260000
k = 12, accuracy = 0.295000
k = 12, accuracy = 0.279000
k = 12, accuracy = 0.283000
k = 12, accuracy = 0.280000
k = 15, accuracy = 0.252000
k = 15, accuracy = 0.289000
k = 15, accuracy = 0.278000
k = 15, accuracy = 0.282000
k = 15, accuracy = 0.274000
k = 20, accuracy = 0.270000
k = 20, accuracy = 0.279000
k = 20, accuracy = 0.279000
k = 20, accuracy = 0.282000
k = 20, accuracy = 0.285000
k = 50, accuracy = 0.271000
k = 50, accuracy = 0.288000
k = 50, accuracy = 0.278000
k = 50, accuracy = 0.269000
k = 50, accuracy = 0.266000
k = 100, accuracy = 0.256000
k = 100, accuracy = 0.270000
k = 100, accuracy = 0.263000
k = 100, accuracy = 0.256000
k = 100, accuracy = 0.263000
"""
We can visualize the results:
# plot the raw observations
for k in k_choices:
    accuracies = k_to_accuracies[k]
    plt.scatter([k] * len(accuracies), accuracies)
# plot the trend line with error bars that correspond to standard deviation
accuracies_mean = np.array([np.mean(v) for k,v in sorted(k_to_accuracies.items())])
accuracies_std = np.array([np.std(v) for k,v in sorted(k_to_accuracies.items())])
plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
plt.title('Cross-validation on k')
plt.xlabel('k')
plt.ylabel('Cross-validation accuracy')
plt.show()
Based on the plot, choose the best k and measure the model's final accuracy:
# Based on the cross-validation results above, choose the best value for k,
# retrain the classifier using all the training data, and test it on the test
# data. You should be able to get above 28% accuracy on the test data.
best_k = 10
classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
y_test_pred = classifier.predict(X_test, k=best_k)
# Compute and display the accuracy
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))
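As an aside, instead of reading best_k off the plot, you could also pick the k with the highest mean cross-validation accuracy programmatically; a small sketch, assuming k_to_accuracies is filled as above:

best_k = max(k_to_accuracies, key=lambda k: np.mean(k_to_accuracies[k]))
print('best k by mean CV accuracy:', best_k)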