K-Nearest Neighbours for Handwritten Digit Recognition (a NumPy Implementation)
阿新 • Published: 2018-12-08
This is a Notebook on Kaggle written as a reference for my junior classmates... when it comes to code, you really have to write a lot of it yourself. I sketched out this rough implementation while listening to Yoga Lin's《想自由》. So what does the K-nearest-neighbours algorithm actually do? Suppose you go shopping for clothes and suddenly can't remember which size fits you (S/M/L/XL). You ask a few customers in the shop whose build is similar to yours, and it turns out most people built like you bought XL, so you tell the shopkeeper you'll take XL too. Clever, right?
Here are the key points for solving this problem:
- How many people with a similar build should you ask? (the number K)
- How do you judge how similar another person's build is to yours? (the distance metric)
- In the end you go with the most-purchased size. (after all, different people may suggest different things)
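The three bullet points above can be sketched end-to-end on the shirt-size analogy itself. All the heights, weights and sizes below are made up purely for illustration:

```python
from collections import Counter

# Hypothetical (height cm, weight kg) -> shirt size of other customers
customers = [((180, 85), 'XL'), ((178, 82), 'XL'), ((165, 55), 'M'),
             ((182, 88), 'XL'), ((160, 50), 'S'), ((175, 70), 'L')]
me = (179, 84)
k = 3

def dist(a, b):
    # Squared Euclidean distance is enough for ranking neighbours
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

# Pick the k customers closest to me, then take the majority vote
nearest = sorted(customers, key=lambda c: dist(c[0], me))[:k]
vote = Counter(size for _, size in nearest).most_common(1)[0][0]
print(vote)  # -> 'XL'
```

The three ingredients map directly onto the code: `k` is the number of people consulted, `dist` is the similarity judgment, and `Counter.most_common` is the vote.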
Loading the dataset
The MNIST dataset is available here: THE MNIST DATABASE of handwritten digits. You may be wondering why the dataset on Kaggle is in CSV format while the official site provides four compressed files. There is nothing surprising about that: just use a reading routine that matches each format; as long as the resulting arrays end up with the same dimensions, the rest of the code is unaffected.
Dataset already decompressed
import os
import numpy as np

def load_mnist(path, kind='train'):
    """Load MNIST data from `path`"""
    labels_path = os.path.join(path, '%s-labels-idx1-ubyte' % kind)
    images_path = os.path.join(path, '%s-images-idx3-ubyte' % kind)
    # IDX format: label files have an 8-byte header, image files a 16-byte header
    with open(labels_path, 'rb') as lbpath:
        labels = np.frombuffer(lbpath.read(), dtype=np.uint8, offset=8)
    with open(images_path, 'rb') as imgpath:
        images = np.frombuffer(imgpath.read(), dtype=np.uint8,
                               offset=16).reshape(len(labels), 784)
    return images, labels
Dataset still compressed (.gz)
import gzip

def load_mnist(path, kind='train'):
    """Load MNIST data from `path`"""
    labels_path = os.path.join(path, '%s-labels-idx1-ubyte.gz' % kind)
    images_path = os.path.join(path, '%s-images-idx3-ubyte.gz' % kind)
    with gzip.open(labels_path, 'rb') as lbpath:
        labels = np.frombuffer(lbpath.read(), dtype=np.uint8, offset=8)
    with gzip.open(images_path, 'rb') as imgpath:
        images = np.frombuffer(imgpath.read(), dtype=np.uint8,
                               offset=16).reshape(len(labels), 784)
    return images, labels
Dataset already converted to CSV, with no test-set labels (Kaggle)
# Takes about 1-2 minutes (depends on the CPU)
train_data = np.genfromtxt('../input/train.csv', delimiter=',',
                           skip_header=1).astype(np.dtype('uint8'))
X_train = train_data[:, 1:]
y_train = train_data[:, :1]   # shape (m, 1): the first CSV column is the label
X_test = np.genfromtxt('../input/test.csv', delimiter=',',
                       skip_header=1).astype(np.dtype('uint8'))
Check that the data was imported correctly
import matplotlib.pyplot as plt

m_train = X_train.shape[0]   # number of training examples
np.random.seed(0)
indices = list(np.random.randint(m_train, size=9))
for i in range(9):
    plt.subplot(3, 3, i + 1)
    plt.imshow(X_train[indices[i]].reshape(28, 28), cmap='gray', interpolation='none')
    plt.title("Index {} Class {}".format(indices[i], y_train[indices[i]][0]))
plt.tight_layout()
Defining the distance metric
def euclidean_distance(vector1, vector2):
    # L2 (Euclidean) distance; cast first, since uint8 subtraction wraps around
    v1, v2 = vector1.astype(np.float64), vector2.astype(np.float64)
    return np.sqrt(np.sum(np.power(v1 - v2, 2)))

def absolute_distance(vector1, vector2):
    # L1 (Manhattan) distance, with the same cast for safety
    v1, v2 = vector1.astype(np.float64), vector2.astype(np.float64)
    return np.sum(np.absolute(v1 - v2))
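One performance note: calling `euclidean_distance` once per training row is slow in pure Python. The same L2 distances can be computed for every row at once with NumPy broadcasting; here is a self-contained sketch of that idea (the helper name `all_euclidean_distances` is mine, not part of the original code):

```python
import numpy as np

def all_euclidean_distances(X_train, test_instance):
    # Broadcasting: (m, 784) - (784,) -> (m, 784); summing over axis 1 gives (m,)
    diff = X_train.astype(np.float64) - test_instance.astype(np.float64)
    return np.sqrt(np.sum(diff ** 2, axis=1))

# Tiny self-check on made-up data: a 3-4-5 triangle
X = np.array([[0, 0], [3, 4]], dtype=np.uint8)
d = all_euclidean_distances(X, np.array([0, 0], dtype=np.uint8))
print(d)  # [0. 5.]
```

The casts to float64 also avoid the uint8 wraparound that a naive `X_train - test_instance` would suffer.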
Finding the nearest neighbours (K Neighbours)
import operator

def get_neighbours(X_train, test_instance, k):
    distances = []
    neighbors = []
    for i in range(0, X_train.shape[0]):
        dist = euclidean_distance(X_train[i], test_instance)
        distances.append((i, dist))
    # Sort by distance (ascending) and keep the indices of the k closest rows
    distances.sort(key=operator.itemgetter(1))
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors
Taking the majority vote
def predictkNNClass(output, y_train):
    classVotes = {}
    for i in range(len(output)):
        label = y_train[output[i]][0]   # y_train has shape (m, 1), so take the scalar
        if label in classVotes:
            classVotes[label] += 1
        else:
            classVotes[label] = 1
    # The class with the most votes among the k neighbours wins
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]
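The hand-rolled vote dictionary above can also be written with `collections.Counter` from the standard library; an equivalent sketch (the name `predict_knn_class_counter` is hypothetical, not from the original code):

```python
from collections import Counter
import numpy as np

def predict_knn_class_counter(output, y_train):
    # y_train rows are length-1 arrays, so take element 0 as the label
    votes = Counter(y_train[i][0] for i in output)
    return votes.most_common(1)[0][0]

# Quick check with toy labels: neighbours 0, 1, 2 vote 7, 7, 3
y_toy = np.array([[7], [7], [3]])
print(predict_knn_class_counter([0, 1, 2], y_toy))  # -> 7
```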
Trying it on a test instance
instance_num = 666
k = 9
# Show the test instance itself
plt.imshow(X_test[instance_num].reshape(28, 28), cmap='gray', interpolation='none')
instance_neighbours = get_neighbours(X_train, X_test[instance_num], k)
indices = instance_neighbours
plt.figure()   # new figure, so the grid does not draw over the test instance
for i in range(k):
    plt.subplot(3, 3, i + 1)
    plt.imshow(X_train[indices[i]].reshape(28, 28), cmap='gray', interpolation='none')
    plt.title("Index {} Class {}".format(indices[i], y_train[indices[i]][0]))
plt.tight_layout()
predictkNNClass(instance_neighbours, y_train)
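Since the Kaggle test set ships without labels, one common way to estimate accuracy is to hold out part of the training data as a validation set. The sketch below reimplements the same neighbour search and voting inline with NumPy and checks it on easy synthetic 2-D clusters; on MNIST you would pass slices of X_train and y_train instead (the function name `evaluate_knn` is my own):

```python
import numpy as np

def evaluate_knn(X_tr, y_tr, X_val, y_val, k):
    # Same logic as get_neighbours + predictkNNClass, written inline
    correct = 0
    for i in range(X_val.shape[0]):
        d = np.sqrt(np.sum((X_tr.astype(float) - X_val[i].astype(float)) ** 2, axis=1))
        nearest = np.argsort(d)[:k]                       # indices of k closest rows
        labels, counts = np.unique(y_tr[nearest, 0], return_counts=True)
        if labels[np.argmax(counts)] == y_val[i][0]:      # majority vote
            correct += 1
    return correct / X_val.shape[0]

# Synthetic sanity check: two well-separated clusters
rng = np.random.default_rng(0)
X0 = rng.normal(0, 1, (50, 2))
X1 = rng.normal(10, 1, (50, 2))
X_tr = np.vstack([X0, X1])
y_tr = np.array([[0]] * 50 + [[1]] * 50)
acc = evaluate_knn(X_tr, y_tr, X_tr[:10], y_tr[:10], k=3)
print(acc)  # -> 1.0 on this easy synthetic split
```

On MNIST a call like `evaluate_knn(X_train[:5000], y_train[:5000], X_train[5000:5200], y_train[5000:5200], k=9)` would give a rough accuracy estimate, though it will be slow for large slices.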