
Implementing kNN from Scratch for Image Classification (Python)

Dataset download

This is an assignment from the Stanford course (CS231n). Following the instructions for assignment1 in the course outline, download the dataset needed for the experiment (a minimal download sketch follows below).
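If you prefer not to use the course's helper script, here is a minimal Python sketch for fetching and extracting the archive into the directory the notebook expects (the URL is the standard CIFAR-10 "python version" archive; adjust paths to your setup):

# Minimal sketch: download and extract CIFAR-10 into the path used later in the notebook.
import os
import tarfile
import urllib.request

url = 'https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz'
dest_dir = 'cs231n/datasets'
os.makedirs(dest_dir, exist_ok=True)
archive = os.path.join(dest_dir, 'cifar-10-python.tar.gz')

if not os.path.exists(os.path.join(dest_dir, 'cifar-10-batches-py')):
    urllib.request.urlretrieve(url, archive)
    with tarfile.open(archive, 'r:gz') as tar:
        tar.extractall(path=dest_dir)  # the archive extracts to cifar-10-batches-py/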

Contents

Preface:

Main code:

Part 1: Data loading and preprocessing

Part 2: kNN training and prediction

Implementation details

3.1 Data loading

3.2 Computing dists (three implementations, from simple to advanced, same principle)

3.3 Prediction

Choosing hyperparameters via cross-validation


Preface:

1. Preprocessing for kNN normally involves normalizing the data; for images, each pixel can be regarded as one feature. Because pixel values are homogeneous (they all lie on the same scale) and do not have widely different distributions, no data normalization is needed here (a sketch of what such normalization would look like is given below for reference).
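For reference only, a simple per-feature standardization would look like the following; it is not used in this assignment, and the function name is just for illustration:

# Illustrative sketch: per-pixel standardization using training-set statistics.
# Not applied in this assignment, since raw pixel values already share one scale.
import numpy as np

def standardize(X_train, X_test, eps=1e-8):
    mean = X_train.mean(axis=0)   # per-feature mean from the training set
    std = X_train.std(axis=0)     # per-feature standard deviation
    return (X_train - mean) / (std + eps), (X_test - mean) / (std + eps)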

2. When kNN is used in practice, the second preprocessing step is dimensionality reduction. kNN votes according to a chosen distance metric, and distances become counter-intuitive in high dimensions; there is also a mathematical explanation of why they work poorly there (see section 6 of the referenced article). The experiment here is only for learning kNN; in practice kNN is never used for image classification, for the reasons given in point 3.

3. The advantages of kNN are simplicity and intuitiveness. Its drawbacks: first, it spends very little time on training (see below; training merely caches the training set) but a lot of time on prediction, which is the opposite of what is needed in practice (there are extended algorithms that reduce prediction time at the cost of accuracy, e.g. Approximate Nearest Neighbor (ANN); see FLANN); second, kNN does not work well on high-dimensional data.

4. It helps to start from a very simple example of kNN; a minimal sketch follows below.
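A minimal, self-contained toy version of the idea (hypothetical 2-D points, not part of the assignment code):

import numpy as np

# five 2-D training points with two class labels
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
y_train = np.array([0, 0, 0, 1, 1])
x_query = np.array([4.5, 5.2])          # point to classify
k = 3

dists = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))  # L2 distance to every training point
nearest = np.argsort(dists)[:k]                            # indices of the k closest points
pred = np.argmax(np.bincount(y_train[nearest]))            # majority vote among their labels
print(pred)  # -> 1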

Main code:

Part 1: Data loading and preprocessing

# __future__ imports must come before any other statement in the cell/module
from __future__ import print_function

import random
import numpy as np
from cs231n.data_utils import load_CIFAR10
import matplotlib.pyplot as plt

"""
This cell only performs notebook setup/configuration.
"""

# Run some setup code for this notebook.

# This is a bit of magic to make matplotlib figures appear inline in the notebook
# rather than in a new window.
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# Some more magic so that the notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2


# Load the raw CIFAR-10 data.
cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'

# Cleaning up variables to prevent loading data multiple times (which may cause memory issue)
try:
   del X_train, y_train
   del X_test, y_test
   print('Clear previously loaded data.')
except:
   pass

# The implementation of load_CIFAR10 from cs231n.data_utils is shown later (section 3.1)
X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)

# As a sanity check, we print out the size of the training and test data.
# print('Training data shape: ', X_train.shape) >> Training data shape:  (50000, 32, 32, 3)
# print('Training labels shape: ', y_train.shape) >> Training labels shape:  (50000,)
# print('Test data shape: ', X_test.shape) >> Test data shape:  (10000, 32, 32, 3)
# print('Test labels shape: ', y_test.shape) >> Test labels shape:  (10000,)


# Subsample the data for more efficient code execution in this exercise
num_training = 5000
mask = range(num_training)
# Either of the two lines below selects the first num_training rows; this also works
# for higher-dimensional data, e.g. shape (10000, 32, 32, 3) becomes (5000, 32, 32, 3)
#X_train = X_train[mask]
X_train = X_train[np.arange(num_training)]
y_train = y_train[mask]
#print(X_train.shape) >> (5000, 32, 32, 3)
num_test = 500
mask = list(range(num_test))
X_test = X_test[mask]
y_test = y_test[mask]


# Reshape the image data into rows
X_train = np.reshape(X_train, (X_train.shape[0], -1))
X_test = np.reshape(X_test, (X_test.shape[0], -1))
# print(X_train.shape, X_test.shape) >> (5000, 3072) (500, 3072)

Part 2: kNN training and prediction

from cs231n.classifiers import KNearestNeighbor


# the Classifier simply remembers the data and does no further processing 
# the implementation of KNearestNeighbor is shown in section 3 below
classifier = KNearestNeighbor()
classifier.train(X_train, y_train)

# Test your implementation:
dists = classifier.compute_distances_two_loops(X_test)
print(dists.shape)

# We can visualize the distance matrix: each row is a single test example and
# its distances to training examples
#plt.imshow(dists, interpolation='none')
#plt.show()

# Now implement the function predict_labels and run the code below:
# We use k = 1 (which is Nearest Neighbor).
y_test_pred = classifier.predict_labels(dists, k=1)

# Compute and print the fraction of correctly predicted examples
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))

y_test_pred = classifier.predict_labels(dists, k=5)
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))

"""
What follows computes dists in different ways and compares them; it can be skipped.
"""

# Now lets speed up distance matrix computation by using partial vectorization
# with one loop. Implement the function compute_distances_one_loop and run the
# code below:
dists_one = classifier.compute_distances_one_loop(X_test)

# To ensure that our vectorized implementation is correct, we make sure that it
# agrees with the naive implementation. There are many ways to decide whether
# two matrices are similar; one of the simplest is the Frobenius norm. In case
# you haven't seen it before, the Frobenius norm of two matrices is the square
# root of the squared sum of differences of all elements; in other words, reshape
# the matrices into vectors and compute the Euclidean distance between them.
# The Frobenius norm of a matrix A (ord='fro') is the square root of the sum of the squared absolute values of its entries
difference = np.linalg.norm(dists - dists_one, ord='fro')
print('Difference was: %f' % (difference, ))
if difference < 0.001:
    print('Good! The distance matrices are the same')
else:
    print('Uh-oh! The distance matrices are different')

# Now implement the fully vectorized version inside compute_distances_no_loops
# and run the code
dists_two = classifier.compute_distances_no_loops(X_test)

# check that the distance matrix agrees with the one we computed before:
difference = np.linalg.norm(dists - dists_two, ord='fro')
print('Difference was: %f' % (difference, ))
if difference < 0.001:
    print('Good! The distance matrices are the same')
else:
    print('Uh-oh! The distance matrices are different')



# Let's compare how fast the implementations are
def time_function(f, *args):
    """
    Call a function f with args and return the time (in seconds) that it took to execute.
    """
    import time
    tic = time.time()
    f(*args)
    toc = time.time()
    return toc - tic

two_loop_time = time_function(classifier.compute_distances_two_loops, X_test)
print('Two loop version took %f seconds' % two_loop_time)

one_loop_time = time_function(classifier.compute_distances_one_loop, X_test)
print('One loop version took %f seconds' % one_loop_time)

no_loop_time = time_function(classifier.compute_distances_no_loops, X_test)
print('No loop version took %f seconds' % no_loop_time)

# you should see significantly faster performance with the fully vectorized implementation

 

Implementation details

Computing the distance matrix is, in principle, a double loop, but by exploiting NumPy's array operations (broadcasting and matrix multiplication) the double loop can be reduced to a single loop, or even eliminated entirely, as shown in section 3.2.

3.1 Data loading

from __future__ import print_function

from six.moves import cPickle as pickle
import numpy as np
import os
from scipy.misc import imread  # unused here; note scipy.misc.imread was removed in newer SciPy versions
import platform

def load_pickle(f):
    version = platform.python_version_tuple()
    if version[0] == '2':
        return  pickle.load(f)
    elif version[0] == '3':
        return  pickle.load(f, encoding='latin1')
    raise ValueError("invalid python version: {}".format(version))

def load_CIFAR_batch(filename):
  """ load single batch of cifar """
  with open(filename, 'rb') as f:
    datadict = load_pickle(f)
    X = datadict['data']
    Y = datadict['labels']
    X = X.reshape(10000, 3, 32, 32).transpose(0,2,3,1).astype("float")
    Y = np.array(Y)
    return X, Y

def load_CIFAR10(ROOT):
  """ load all of cifar """
  xs = []
  ys = []
  for b in range(1,6):
    f = os.path.join(ROOT, 'data_batch_%d' % (b, ))
    X, Y = load_CIFAR_batch(f)
    xs.append(X)
    ys.append(Y)    
  Xtr = np.concatenate(xs)
  Ytr = np.concatenate(ys)
  del X, Y
  Xte, Yte = load_CIFAR_batch(os.path.join(ROOT, 'test_batch'))
  return Xtr, Ytr, Xte, Yte


def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000,
                     subtract_mean=True):
    """
    Load the CIFAR-10 dataset from disk and perform preprocessing to prepare
    it for classifiers. These are the same steps as we used for the SVM, but
    condensed to a single function.
    """
    # Load the raw CIFAR-10 data
    cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'
    X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)
        
    # Subsample the data
    mask = list(range(num_training, num_training + num_validation))
    X_val = X_train[mask]
    y_val = y_train[mask]
    mask = list(range(num_training))
    X_train = X_train[mask]
    y_train = y_train[mask]
    mask = list(range(num_test))
    X_test = X_test[mask]
    y_test = y_test[mask]

    # Normalize the data: subtract the mean image
    if subtract_mean:
      mean_image = np.mean(X_train, axis=0)
      X_train -= mean_image
      X_val -= mean_image
      X_test -= mean_image
    
    # Transpose so that channels come first
    X_train = X_train.transpose(0, 3, 1, 2).copy()
    X_val = X_val.transpose(0, 3, 1, 2).copy()
    X_test = X_test.transpose(0, 3, 1, 2).copy()

    # Package data into a dictionary
    return {
      'X_train': X_train, 'y_train': y_train,
      'X_val': X_val, 'y_val': y_val,
      'X_test': X_test, 'y_test': y_test,
    }
    

 

3.2 Computing dists (three implementations, from simple to advanced, same principle)

Double loop (the simplest, most intuitive implementation)

"""test for two loops
x=np.array([[1,2],[3,1]])
y=np.array([2,2]).reshape(1,2)
a=np.zeros((y.shape[0],x.shape[0]))
for i in range(y.shape[0]):
    for j in range(x.shape[0]):
        a[i,j]=np.sum((y[i,:]-x[j,:])**2)
"""        

def compute_distances_two_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a nested loop over both the training data and the 
    test data.

    Inputs:
    - X: A numpy array of shape (num_test, D) containing test data.

    Returns:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      is the Euclidean distance between the ith test point and the jth training
      point.
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
      for j in range(num_train):
        #####################################################################
        # TODO:                                                             #
        # Compute the l2 distance between the ith test point and the jth    #
        # training point, and store the result in dists[i, j]. You should   #
        # not use a loop over dimension.                                    #
        #####################################################################
        
        # L2 distance.
        dists[i,j] = np.sqrt(np.sum((X[i,:]-self.X_train[j,:])**2))
        
        #####################################################################
        #                       END OF YOUR CODE                            #
        #####################################################################
        
    return dists

Single loop

# test for one loop
#x=np.array([[1,2],[3,1]]) # training set
#y=np.array([[2,2],[3,3]]) # test set
#a=np.zeros((y.shape[0],x.shape[0]))
"""
# row 0 of y is broadcast against every row of the matrix x
y[0,:]-x
Out[19]: 
array([[ 1,  0],
       [-1,  1]])
# axis=1 sums along each row; without axis the whole matrix would be summed to a scalar
np.sum((y[0,:]-x),axis=1)
Out[21]: array([1, 0])
"""
#for i in range(y.shape[0]):
#        a[i,:]=np.sum((y[i,:]-x)**2,axis=1)


  def compute_distances_one_loop(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a single loop over the test data.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
      #######################################################################
      # TODO:                                                               #
      # Compute the l2 distance between the ith test point and all training #
      # points, and store the result in dists[i, :].                        #
      #######################################################################
      # L2 distance.
      dists[i,:] = np.sqrt(np.sum((X[i,:] - self.X_train)**2, axis = 1))
      #######################################################################
      #                         END OF YOUR CODE                            #
      #######################################################################
    return dists

No loops (uses the identity (x-y)^2 = x^2 + y^2 - 2xy; note that x and y are matrices here, so the formula needs some adaptation — a numerical check of the expansion follows after the function)

  def compute_distances_no_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using no explicit loops.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train)) 
    #########################################################################
    # TODO:                                                                 #
    # Compute the l2 distance between all test points and all training      #
    # points without using any explicit loops, and store the result in      #
    # dists.                                                                #
    #                                                                       #
    # You should implement this function using only basic array operations; #
    # in particular you should not use functions from scipy.                #
    #                                                                       #
    # HINT: Try to formulate the l2 distance using matrix multiplication    #
    #       and two broadcast sums.                                         #
    #########################################################################
    """
    np.sum(y**2,axis=1)
    Out[23]: array([ 8, 18])
    
    """
    # L2 distance vectorized.
    X_squared = np.sum(X**2,axis=1)
    Y_squared = np.sum(self.X_train**2,axis=1)
    XY = np.dot(X, self.X_train.T)

    # Expand L2 distance formula to get L2(X,Y) = sqrt((X-Y)^2) = sqrt(X^2 + Y^2 -2XY)
    """
    X_squared[:,np.newaxis]增加了一維,例如原本為(10,)現在變為(10,1),可以在直覺上理解為轉置操作
    但因為python對一維向量轉置還是它自己(因為實際上一維就是陣列,陣列的轉置就是它本身),所以不能直接
    使用T的操作
    """
    dists = np.sqrt(X_squared[:,np.newaxis] + Y_squared -2*XY)

    # Also useful https://medium.com/dataholiks-distillery/l2-distance-matrix-vectorization-trick-26aa3247ac6c
    #########################################################################
    #                         END OF YOUR CODE                              #
    #########################################################################
    return dists
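A quick, illustrative sanity check of the expansion used above, on tiny random matrices (toy shapes, not part of the assignment code):

import numpy as np

X = np.random.randn(4, 3)   # pretend test points
Y = np.random.randn(5, 3)   # pretend training points

# direct pairwise L2 distances via broadcasting over a third axis
direct = np.sqrt(((X[:, np.newaxis, :] - Y[np.newaxis, :, :]) ** 2).sum(axis=2))

# expanded form: ||x||^2 + ||y||^2 - 2 x.y, then the square root
expanded = np.sqrt(np.sum(X**2, axis=1)[:, np.newaxis] + np.sum(Y**2, axis=1) - 2 * X.dot(Y.T))

print(np.allclose(direct, expanded))  # expected: True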

3.3 Prediction

  def predict_labels(self, dists, k=1):
    """
    Given a matrix of distances between test points and training points,
    predict a label for each test point.

    Inputs:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      gives the distance between the ith test point and the jth training point.

    Returns:
    - y: A numpy array of shape (num_test,) containing predicted labels for the
      test data, where y[i] is the predicted label for the test point X[i].  
    """
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)
    for i in range(num_test):
      # A list of length k storing the labels of the k nearest neighbors to
      # the ith test point.
      closest_y = []
      #########################################################################
      # TODO:                                                                 #
      # Use the distance matrix to find the k nearest neighbors of the ith    #
      # testing point, and use self.y_train to find the labels of these       #
      # neighbors. Store these labels in closest_y.                           #
      # Hint: Look up the function numpy.argsort.                             #
      #########################################################################
      # Select a test row.
      test_row = dists[i,:]
      
      # np.argsort returns indices of sorted input.
      # i.e. it returns the indices that would sort the input in ascending order
      sorted_row = np.argsort(test_row)

      # Get the k closest indices.
      closest_y = self.y_train[sorted_row[0:k]]
      #########################################################################
      # TODO:                                                                 #
      # Now that you have found the labels of the k nearest neighbors, you    #
      # need to find the most common label in the list closest_y of labels.   #
      # Store this label in y_pred[i]. Break ties by choosing the smaller     #
      # label.                                                                #
      #########################################################################
      # Find the most frequently occurring label among the k closest neighbors.
      """
      np.bincount([2,2,2,2,1,1,1])
      Out[54]: array([0, 3, 4])  i.e. 0 appears 0 times, 1 appears 3 times, 2 appears 4 times
      """
      # np.argmax returns the first maximum, so ties are broken in favor of the smaller label,
      # as required by the TODO above.
      y_pred[i] = np.argmax(np.bincount(closest_y))
      #########################################################################
      #                           END OF YOUR CODE                            # 
      #########################################################################

    return y_pred


"""
Alternatively, the prediction can be written as a single wrapper method:
"""
  def predict(self, X, k=1, num_loops=0):
    """
    Predict labels for test data using this classifier.

    Inputs:
    - X: A numpy array of shape (num_test, D) containing test data consisting
         of num_test samples each of dimension D.
    - k: The number of nearest neighbors that vote for the predicted labels.
    - num_loops: Determines which implementation to use to compute distances
      between training points and testing points.

    Returns:
    - y: A numpy array of shape (num_test,) containing predicted labels for the
      test data, where y[i] is the predicted label for the test point X[i].  
    """
    if num_loops == 0:
      dists = self.compute_distances_no_loops(X)
    elif num_loops == 1:
      dists = self.compute_distances_one_loop(X)
    elif num_loops == 2:
      dists = self.compute_distances_two_loops(X)
    else:
      raise ValueError('Invalid value %d for num_loops' % num_loops)

    return self.predict_labels(dists, k=k)
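The train method used throughout is not reproduced in the excerpt above; as noted in Part 2, it simply caches the training data. A minimal sketch of what it presumably looks like (the actual cs231n/classifiers source may differ in detail):

class KNearestNeighbor(object):
    """ a kNN classifier with L2 distance (sketch) """

    def __init__(self):
        pass

    def train(self, X, y):
        """
        For k-nearest neighbors, training just memorizes the data.

        Inputs:
        - X: A numpy array of shape (num_train, D) containing the training data.
        - y: A numpy array of shape (num_train,) containing the training labels.
        """
        self.X_train = X
        self.y_train = y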

Choosing hyperparameters via cross-validation

Why do we need a validation set? Because there are hyperparameters to set, and to judge how good a given choice is we use a validation set (the validation set is carved out of the training set). It is essential to treat the test set as if it belonged to someone else and only served to evaluate your finished model; you must not use it yourself during development (or rather, you may touch it only once, right before releasing the model, and the accuracy measured then is the accuracy you report).

Cross-validation means splitting the training set into several folds, say 5 here: first take fold 1 as the validation set and the remaining folds as the training set to evaluate the current choice of k; then take fold 2 as the validation set and the rest as the training set; and so on, for 5 rounds in total.

In the end, for each value of k we obtain 5 scores measuring how well it performs.

The results are visualized below (see the plotting code later in this section): the vertical bars show the spread (standard deviation) of the 5 results for each k, and the trend line connects their means.

In general, cross-validation is avoided in practice because it is computationally expensive, but when the training set is small it is a good way to set the hyperparameters.

In particular, we cannot use the test set for the purpose of tweaking hyperparameters.

Luckily, there is a correct way of tuning the hyperparameters and it does not touch the test set at all. The idea is to split our training set in two: a slightly smaller training set, and what we call a validation set

Cross-validation. In cases where the size of your training data (and therefore also the validation data) might be small, people sometimes use a more sophisticated technique for hyperparameter tuning called cross-validation

 

num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]

X_train_folds = []
y_train_folds = []
################################################################################
# TODO:                                                                        #
# Split up the training data into folds. After splitting, X_train_folds and    #
# y_train_folds should each be lists of length num_folds, where                #
# y_train_folds[i] is the label vector for the points in X_train_folds[i].     #
# Hint: Look up the numpy array_split function.                                #
################################################################################

X_train_folds = np.array_split(X_train,num_folds)
y_train_folds = np.array_split(y_train,num_folds)

################################################################################
#                                 END OF YOUR CODE                             #
################################################################################

# A dictionary holding the accuracies for different values of k that we find
# when running cross-validation. After running cross-validation,
# k_to_accuracies[k] should be a list of length num_folds giving the different
# accuracy values that we found when using that value of k.
k_to_accuracies = {}


################################################################################
# TODO:                                                                        #
# Perform k-fold cross validation to find the best value of k. For each        #
# possible value of k, run the k-nearest-neighbor algorithm num_folds times,   #
# where in each case you use all but one of the folds as training data and the #
# last fold as a validation set. Store the accuracies for all fold and all     #
# values of k in the k_to_accuracies dictionary.                               #
################################################################################

for k in k_choices:
    for n in range(num_folds):
        # Concat all our folds together except for the nth fold for training.
        current_train_fold_x = np.concatenate(tuple([X_train_folds[i] for i in range(num_folds) if i!=n]))
        current_train_fold_y = np.concatenate(tuple([y_train_folds[i] for i in range(num_folds) if i!=n]))
        
        # Select the held out fold to be our test data.
        current_test_fold_x = X_train_folds[n]
        current_test_fold_y = y_train_folds[n]
        
        classifier.train(current_train_fold_x, current_train_fold_y)
        
        # Perform prediction on our test set, default is to use no loop version.
        y_test_pred = classifier.predict(current_test_fold_x, k=k)
        
        # Evaluate and store in k_to_accuracies dict.
        num_correct = np.sum(y_test_pred == current_test_fold_y)
        if k not in k_to_accuracies:
            k_to_accuracies[k] = [float(num_correct) / current_test_fold_x.shape[0]]
        else:
            k_to_accuracies[k].append(float(num_correct) / current_test_fold_x.shape[0])

################################################################################
#                                 END OF YOUR CODE                             #
################################################################################

# Print out the computed accuracies
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))

"""
# output
k = 1, accuracy = 0.263000
k = 1, accuracy = 0.257000
k = 1, accuracy = 0.264000
k = 1, accuracy = 0.278000
k = 1, accuracy = 0.266000
k = 3, accuracy = 0.239000
k = 3, accuracy = 0.249000
k = 3, accuracy = 0.240000
k = 3, accuracy = 0.266000
k = 3, accuracy = 0.254000
k = 5, accuracy = 0.248000
k = 5, accuracy = 0.266000
k = 5, accuracy = 0.280000
k = 5, accuracy = 0.292000
k = 5, accuracy = 0.280000
k = 8, accuracy = 0.262000
k = 8, accuracy = 0.282000
k = 8, accuracy = 0.273000
k = 8, accuracy = 0.290000
k = 8, accuracy = 0.273000
k = 10, accuracy = 0.265000
k = 10, accuracy = 0.296000
k = 10, accuracy = 0.276000
k = 10, accuracy = 0.284000
k = 10, accuracy = 0.280000
k = 12, accuracy = 0.260000
k = 12, accuracy = 0.295000
k = 12, accuracy = 0.279000
k = 12, accuracy = 0.283000
k = 12, accuracy = 0.280000
k = 15, accuracy = 0.252000
k = 15, accuracy = 0.289000
k = 15, accuracy = 0.278000
k = 15, accuracy = 0.282000
k = 15, accuracy = 0.274000
k = 20, accuracy = 0.270000
k = 20, accuracy = 0.279000
k = 20, accuracy = 0.279000
k = 20, accuracy = 0.282000
k = 20, accuracy = 0.285000
k = 50, accuracy = 0.271000
k = 50, accuracy = 0.288000
k = 50, accuracy = 0.278000
k = 50, accuracy = 0.269000
k = 50, accuracy = 0.266000
k = 100, accuracy = 0.256000
k = 100, accuracy = 0.270000
k = 100, accuracy = 0.263000
k = 100, accuracy = 0.256000
k = 100, accuracy = 0.263000
"""

The results can be visualized:

# plot the raw observations
for k in k_choices:
    accuracies = k_to_accuracies[k]
    plt.scatter([k] * len(accuracies), accuracies)

# plot the trend line with error bars that correspond to standard deviation
accuracies_mean = np.array([np.mean(v) for k,v in sorted(k_to_accuracies.items())])
accuracies_std = np.array([np.std(v) for k,v in sorted(k_to_accuracies.items())])
plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
plt.title('Cross-validation on k')
plt.xlabel('k')
plt.ylabel('Cross-validation accuracy')
plt.show()

Based on the plot, choose the best k and evaluate the model's final accuracy on the test set (the value can also be picked programmatically, as sketched below).
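A small sketch (assuming k_to_accuracies has been filled in as above) for selecting the k with the highest mean cross-validation accuracy:

# Pick the k whose mean cross-validation accuracy is highest.
best_k = max(k_to_accuracies, key=lambda k: np.mean(k_to_accuracies[k]))
print('best k =', best_k)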

# Based on the cross-validation results above, choose the best value for k,   
# retrain the classifier using all the training data, and test it on the test
# data. You should be able to get above 28% accuracy on the test data.
best_k = 10

classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
y_test_pred = classifier.predict(X_test, k=best_k)

# Compute and display the accuracy
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))